Showing posts with label commander. Show all posts

Tuesday, May 14, 2019

Using Nightmare.js to Generate a Sitemap From Confluence

Introduction

In a recent project, I had a task to generate a list of documents in a particular Confluence space. I chose to explore my options using Nightmare.js. This Node.js module (I used Node.js version 10.11.0) allowed me to programmatically enter my credentials into Confluence, navigate to a specific document, and gather a list of documents (thanks to the target document using the Children Display macro, which listed all the documents under the parent page of the target space). I also wanted this script to take arguments (flags) such as the username, password, spacekey, output file, and a delay value so that the process can be automated.

Required skills and npm packages

This tutorial requires a number of skills and/or npm modules to complete everything mentioned herein:
  • Confluence (5.x): You should be comfortable with creating pages that utilize the Children Display macro
  • Nightmare (3.0.1): have some familiarity with the basics of this module
  • Commander (2.19.0): have some familiarity with the basics of this module
  • Cheerio (1.0.0-rc.2): have some familiarity with the basics of this module
  • CSS: basic knowledge of how to select elements
  • JavaScript: fair knowledge of how to use JavaScript

Setting up requirements

First, we require a number of modules:

const Nightmare = require("nightmare");
const cheerio = require('cheerio');
const program = require('commander');
const fs = require('fs');

....

Set up nightmare and flag options

The next two statements set up nightmare to display its progress as it goes through the steps we'll program it to navigate, and define a selector to find the content we're looking for in our target document. The confluenceSelector is the CSS selector that will be used to find the desired content in the main body of the Confluence document.

....
const nightmare = Nightmare({
    show: true
});
const confluenceSelector = '#main-content';

....

Note: if you don't want to see an Electron window pop up while nightmare does its stuff, set show to false.

Next, we set up the flags and their usage using commander's features:

...
program
  .version('0.0.1')
  .usage('-u <username> -p <password> -s <spacekey> -f <output.txt> -d <milliseconds>')
  .option('-u, --user', '*required* Username id')
  .option('-p, --password', '*required* User\'s password')
  .option('-s, --spacekey', '*required* Spacekey for the Confluence space')
  .option('-f, --file', 'Text file to be used for tracking Confluence document names. Can be set to either true (defaults to the spacekey naming scheme) or a file name.')
  .option('-d, --delay', 'Delay (in milliseconds) to wait for server response')
  .parse(process.argv);

...

With the flags set, we now need to parse them into an object that we'll use throughout the rest of the script. We loop through the program.rawArgs value provided by the commander module, looking for specific flags so we can pair each flag with the value that follows it.

...
var arguments = {};

for (var i = 0; i < program.rawArgs.length; i++) {
  if (program.rawArgs[i] == '--user' || program.rawArgs[i] == '-u') {
    arguments.user = program.rawArgs[i + 1];
  }
  if (program.rawArgs[i] == '--password' || program.rawArgs[i] == '-p') {
    arguments.pass = program.rawArgs[i + 1];
  }
  if (program.rawArgs[i] == '--spacekey' || program.rawArgs[i] == '-s') {
    arguments.spacekey = program.rawArgs[i + 1];
  }
  if (program.rawArgs[i] == '--delay' || program.rawArgs[i] == '-d') {
    arguments.delay = parseInt(program.rawArgs[i + 1], 10); // always parse as base 10
  }
  if (program.rawArgs[i] == '--file' || program.rawArgs[i] == '-f') {
    arguments.file = program.rawArgs[i + 1];
  }
}

...
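As a rough alternative to the chain of if statements above, the same pairing of flags and values can be sketched with a lookup table. The parseFlags helper and the flagMap object are hypothetical names made up for this illustration; they are not part of commander.

```javascript
// Hypothetical helper (not part of commander): maps each known flag,
// short or long form, to a key and stores the token that follows it.
function parseFlags(rawArgs, flagMap) {
  const parsed = {};
  for (let i = 0; i < rawArgs.length; i++) {
    const token = rawArgs[i];
    const isKnownFlag = Object.prototype.hasOwnProperty.call(flagMap, token);
    // Only record a value when the flag is known and a next token exists
    if (isKnownFlag && i + 1 < rawArgs.length) {
      parsed[flagMap[token]] = rawArgs[i + 1];
    }
  }
  return parsed;
}

const flagMap = {
  '-u': 'user', '--user': 'user',
  '-p': 'pass', '--password': 'pass',
  '-s': 'spacekey', '--spacekey': 'spacekey',
  '-d': 'delay', '--delay': 'delay',
  '-f': 'file', '--file': 'file'
};

// Sample input shaped like program.rawArgs (node path, script path, then flags)
const sampleArgs = ['node', 'confluenceSitemap.js', '-u', 'jdoe', '--spacekey', 'DOCS', '-d', '5000'];
const flags = parseFlags(sampleArgs, flagMap);
console.log(flags);
```

Note that the delay comes back as a string in this sketch, so it would still need to be converted to a number before use.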


Since the delay flag is optional, we should set up a fallback if the user doesn't supply one. In this case, we're setting the delay to 10 seconds, though you can adjust this value to however long you're comfortable waiting for your Confluence server to respond to a login page request.

...
if (!arguments.delay) {
  arguments.delay = 10000;
  console.log('Server response delay not set. Assuming ' + arguments.delay + ' millisecond delay.');
}

...
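If you'd like the fallback to also cover values that aren't usable numbers, here is a minimal sketch. The resolveDelay helper is a hypothetical name for this illustration, and unlike the check above it also rejects negative values.

```javascript
// Hypothetical helper: returns a usable delay in milliseconds.
function resolveDelay(rawValue, fallbackMs) {
  const parsed = parseInt(rawValue, 10);
  // Reject missing, non-numeric, zero, and negative values
  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallbackMs;
}

console.log(resolveDelay('2500', 10000));    // value supplied by the user
console.log(resolveDelay(undefined, 10000)); // flag not supplied
console.log(resolveDelay('soon', 10000));    // not a number
```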

Now we should set up the file path where we keep the site map information. If the user doesn't supply a file to output our data to, the script will use a fallback based on the submitted spacekey name.

...
var confluenceSiteMap;

if (arguments.file && arguments.file.length > 5) {
  confluenceSiteMap = arguments.file; // use the supplied file name
} else {
  // no file name given (or just "true"): fall back to the spacekey naming scheme
  confluenceSiteMap = arguments.spacekey + '-site_map.txt';
}

...

The next thing our script will need is the Confluence URL to the site map document. Using the Children Display macro in your target Confluence space, we can gather all the document links in a single space by scraping this one document. Note: you should set up this Confluence document accordingly before executing this script and ensure it's named Site Map. Otherwise, you'll need to change the values in arguments.confluence.

...

if (arguments.spacekey) {
  arguments.confluence = <base Confluence URL> + '/display/' + arguments.spacekey + '/Site+Map';
}

...
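The URL assembly above can be sketched as a small function. The buildPageUrl helper and the example.com base URL are placeholders invented for this illustration, and the sketch assumes a simple page title; swap in your own server's base URL.

```javascript
// Hypothetical helper: joins a base URL, a spacekey, and a page title
// into a /display/ style page URL like the one built above.
function buildPageUrl(baseUrl, spacekey, pageTitle) {
  const base = baseUrl.replace(/\/+$/, ''); // drop any trailing slashes
  // Encode the title, then write spaces as "+" to match the Site+Map form
  const title = encodeURIComponent(pageTitle).replace(/%20/g, '+');
  return base + '/display/' + encodeURIComponent(spacekey) + '/' + title;
}

// The base URL below is a placeholder, not a real server
const siteMapUrl = buildPageUrl('https://confluence.example.com/', 'DOCS', 'Site Map');
console.log(siteMapUrl);
```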

With the arguments parsed, we should check that the user supplied the required flags. If any of these flags weren't submitted, then the script should gracefully exit.

...
if (!arguments.user || !arguments.pass || !arguments.spacekey) {
  if (!arguments.user) { // user id is required
    console.log('Username is required.');
  }
  if (!arguments.pass) { // password is required
    console.log('Password is required.')
  }

  if (!arguments.spacekey) {
    console.log('Spacekey is required.')
  }

  process.exit(1);

...
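If the list of required flags grows, the same check can be sketched as a loop over a table of labels. The findMissing helper and the required object are hypothetical names used only for this illustration.

```javascript
// Hypothetical helper: returns one message per required value that is missing.
function findMissing(values, required) {
  return Object.keys(required)
    .filter(key => !values[key])
    .map(key => required[key] + ' is required.');
}

const required = { user: 'Username', pass: 'Password', spacekey: 'Spacekey' };

// Sample input where only the username was supplied
const problems = findMissing({ user: 'jdoe' }, required);
problems.forEach(message => console.log(message));
// A real script would call process.exit(1) here when problems.length > 0
```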

Pull content with nightmare

With the required flags set, we can now request a document from Confluence using your credentials. This chunk of code starts the nightmare.js process by navigating the Electron browser to the site map page in Confluence. The process below assumes that a login is required when the target page is loaded. It enters the user-supplied username and password in the appropriate fields (denoted by their element ids), clicks the login button (denoted by its element id), waits for a period of time (hopefully long enough for the server to respond), grabs the content from the predetermined CSS selector via the evaluate method, returns the data for parsing later, and closes the Electron browser.

...
} else {
  console.log('Getting document link list from ' + arguments.confluence);
  nightmare
    .goto(arguments.confluence)
    .type('#os_username', arguments.user)
    .type('#os_password', arguments.pass)
    .click('#loginButton')
    .wait(arguments.delay)
    .evaluate(confluenceSelector => {
      return {
        html: document.querySelector(confluenceSelector).innerHTML
      }
    }, confluenceSelector)
    .end()

...

Parse content with Cheerio

Now that nightmare.js has retrieved the document in question, we use the then method to load the HTML content into cheerio.js to generate a list of links. Generally speaking, the links listed in a Confluence document usually follow the li span a selector pattern inside the body of the document. Here, we use the output variable to hold the list of links found in the retrieved data.

...
.then(obj => {
  const $ = cheerio.load(obj.html.toString()); // declare $ locally instead of creating a global

  var output = '';

  $('li span a').each(function() {
    output += $(this).html() + '\n';
  });

...

Then, we write out the list of links we found in the Confluence document to our predetermined text file.

... 
  fs.writeFileSync(confluenceSiteMap, output, 'utf8');
})

...

Finally, we use the catch method to report back any errors.

...
  .catch(error => {
    console.error(error);
  });
}


Wrapping up

With the script complete, we should save it as something like confluenceSitemap.js. From there, we can execute this command to generate our text file with the list of links: node confluenceSitemap.js -u <username> -p <password> -s <spacekey> -f <links.txt>

Wednesday, January 16, 2019

Scraping a Web Document Using Nightmare.js

Introduction

I recently learned about another method to harvest content from a website using nightmare.js. Using other libraries such as request.js (with cheerio.js) works fine, but if one needs to get past a login or navigate to get at the content, these libraries won't work. Enter nightmare.js and Electron. This document walks through a basic setup of using nightmare.js to navigate to a site, log in, and grab content from a specific element.

Requirements

You should be fairly comfortable with JavaScript and CSS in general and have some working knowledge of how Node.js works prior to digging into this tutorial.

Required npm packages

In this tutorial, we'll need to ensure the following packages have been installed in your project directory: nightmare and commander (fs ships with Node.js).
Note: This tutorial was written with Node.js (version 10.11.0).

Scraping Content with Nightmare.js

Like any node.js app, let's start off by requiring the modules we need. In this case, we are using nightmare.js to navigate a site, fs to write out the content to disk, and commander to set up flags for the script's arguments.

const Nightmare = require("nightmare");
const fs = require('fs');
const program = require('commander');
const nightmare = Nightmare({ show: true });
const selector = '.content';
...


Note: if you don't want to see Electron "jumping through all the hoops" to get at the content, you can set show to false. I think the commander library makes this script easier to use, as the input isn't order dependent. Finally, the selector variable identifies where the target content is located. In this case, it will look for an element with the content class. This variable can hold any CSS selector that you would like to use to get at the desired content.

CLI setup

Next, we'll set up the flags for the script. In this case, we should only accept three required flags: user (id), (user) password, and the URL of the target document.

...
program
  .version('0.1.0')

  .usage('[required options] -u <username> -p <password> -url <url>')
  .option('-u, --user', 'Username id')
  .option('-p, --password', 'User\'s password')
  .option('-url, --url', 'URL for site')
  .parse(process.argv);
...


Setting the user credentials and URL

Now we should set up the values passed into the flags as variables to be used by the script. The arguments object will contain the user, password, and URL values. I mentioned how to set up a Node.js CLI earlier.

...
var arguments = {};

for (var i = 0; i < program.rawArgs.length; i++) {
  if (program.rawArgs[i] == '--user' || program.rawArgs[i] == '-u') {
    arguments.user = program.rawArgs[i + 1];
  }
  if (program.rawArgs[i] == '--password' || program.rawArgs[i] == '-p') {
    arguments.pass = program.rawArgs[i + 1];
  }
  if (program.rawArgs[i] == '--url' || program.rawArgs[i] == '-url') {
    arguments.url = program.rawArgs[i + 1];
  }
}
...


Note: some special characters (e.g. ', ", !, $, and so on) may not reach the script intact, typically because the shell interprets them before Node.js ever sees the arguments. We won't get into that here today. So, if your password uses any of these special characters, it may not be passed properly to the target server unless it is quoted or escaped for your shell.

Exiting if required parameters are missing

Next, we check that all three required values (user id, password, and URL) were supplied. If they were, the main routine runs. Otherwise, the script tells the user which values are missing and exits.

...
if (arguments.user && arguments.pass && arguments.url) {
  ...
  <main routine>
  ...
} else {
  if (!arguments.user) {
    console.log('Username is required.');
  }
  if (!arguments.pass) {
    console.log('Password is required.');
  }
  if (!arguments.url) {
    console.log('URL is required.');
  }
  process.exit(1);
}

Main routine

And now for the main (routine) attraction!
We'll use nightmare to navigate Electron to our desired document, click the login button, wait a bit (hopefully long enough for the server to respond), enter our credentials, submit said credentials, wait again, grab the content, write out the content, and announce any errors.

...
nightmare
  .goto(arguments.url)
  .click('#login')
  .wait(5000)
  .type('#usernameInput', arguments.user)
  .type('#passwordInput', arguments.pass)
  .click('#submit')
  .wait(10000)
  .evaluate(selector => {
    return {
      html: document.querySelector(selector).innerHTML,
      title: document.title
    };
  }, selector)
  .end()
  .then(obj => {
    console.log('Processed ' + obj.title);
    fs.writeFileSync('./downloads/' + obj.title + '.html', obj.html);
  })
  .catch(error => { // catch any errors
    console.error('Failed to obtain content from ' + arguments.url);
  });
...

  • The goto method allows nightmare to load up the desired document.
  • The click method clicks on an element. In this case, we click on a button with the id of login and (eventually) the button with the id of submit that sends the credentials.
  • The wait method simply pauses the routine for x number of milliseconds. This is often needed to wait for the server to respond to previously fired events.
  • The type method allows for text to be entered into fields. In this case, we are entering our user id and password into the document elements with the ids of usernameInput and passwordInput.
  • The evaluate method tells nightmare to look in the document for an element with the provided CSS selector. From there, we want to return two items to the script: the desired content and the title of the document, as we'll use it for the name of the file we'll write out later.
  • The end method closes the Electron browser.
  • After the script has retrieved the desired content, it's time to do some processing on the returned object using the then method. In this method, we let the user know the name of the file the script is writing out and then write out the file with the desired content. Note: in this step, one can "massage" the content to fit their needs using cheerio.js or any other preferred method.
  • Finally, the catch method is used to catch any errors. Here, the script is using it generically to inform the user that the gathering process failed.
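One thing to watch for in the then step: the document title is used directly as a file name, so a title containing a slash or a colon could produce a path that fails to write. Here is a minimal sketch of one way to guard against that. The titleToFileName helper is a hypothetical name for this illustration, and the set of replaced characters is an assumption you may want to adjust for your file system.

```javascript
const path = require('path');

// Hypothetical helper: turns a page title into a safer file name.
function titleToFileName(title, extension) {
  const cleaned = String(title)
    .replace(/[\\\/:*?"<>|]/g, '_') // characters that are invalid or risky in file names
    .replace(/\s+/g, ' ')           // collapse runs of whitespace
    .trim();
  // Fall back to a generic name when nothing usable is left
  return (cleaned || 'untitled') + extension;
}

const fileName = titleToFileName('Release Notes: v1/v2', '.html');
console.log(path.join('downloads', fileName));
```

The result could then be passed to fs.writeFileSync in place of the raw title. The downloads directory still has to exist before the script runs.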

Friday, July 14, 2017

Inserting A Date Stamp into HTML Documents

Introduction

In Using Node.js for Text Processing, I showed you a simple method for processing the content of HTML files with Node.js. In this tutorial, I'll show you another technique for adding content to HTML files using Node.js as a command line interface (CLI): we'll supply a script with arguments (flags), set up a basic help message, and append a date stamp to a user-supplied HTML file. If this is your tune, let's jam!

Requirements

You should be fairly comfortable with JavaScript in general and have some working knowledge of how Node.js works prior to digging into this tutorial.

Required npm packages

In this tutorial, we'll need to ensure the following packages have been installed in your project directory: cheerio and commander (fs ships with Node.js).
Note: This tutorial was written with Node.js (version  ~0.12.7).

Building the script

First off, let's create a new file called appendDate.js. As with any typical Node.js script, we start off with a few variables requiring our packages:

var fs = require('fs');
var cheerio = require('cheerio');
var program = require('commander');
...


In this snippet, we use fs to read in the target file, cheerio to manipulate the content with jQuery-like features, and commander for the CLI features of our script. Commander is a wonderful library that I recently discovered and works very well with my documentation and web developing needs.

Setting up the CLI

Like any well written CLI, we should start off with defining how it works. In the following snippet, we define the version of the CLI, how to use it (.usage), and any options (.option) available to the user. Commander is kind enough to provide a basic help output if the user types in <command> -h/--help as well as any usage and options you provide. Commander will display the default help message plus any info defined in the snippet below:

...
program
  .version('0.0.3')
  .usage('[required options] -t <path/file.html> -d <date>')
  .option('-t, --target [target]', '*required* target of HTML file (including the path and file name)')
  .option('-d, --date [date]', '*required* date to append to the HTML file')
  .parse(process.argv);
...

Main processing

Next, we check if both the target file and date flags have been supplied. From there, the script will load the target file and convert it to a string. The script will next check to see if a container element (div#content) exists and decide whether to create or update it. If the div#content element exists, the script will then look to either create or update the span.date element. With the elements in place, the script will now write out the updated HTML file.

...
if (program.target && program.date) {
  console.log('Updating ' + program.target);
  var $ = cheerio.load(fs.readFileSync(program.target).toString()); // load file and convert it to a string

  if ($('div#content').length <= 0) { // if div#content doesn't exist, create it
    console.log(program.target + ' does not have div#content element. Adding it and datestamp now.');
    $('body').append('\t<div id="content">\n\t\t\t<span class="date">Last updated: ' + program.date + '</span>\n\t\t</div>\n\t');
  } else { // if div#content exists, look for span.date
    console.log(program.target + ' has div#content element; looking for date stamp')
    if ($('span.date').length > 0) { // if the span.date exists, update it (a cheerio selection is always truthy, so check its length)
      console.log(program.target + ' has span.date element; updating it now');
      $('span.date').text('Last updated: ' + program.date);
    } else { // span.date doesn't exist; create it
      console.log(program.target + ' does not have span.date element; adding it now');
      $('div#content').append('\t<span class="date">Last updated: ' + program.date + '</span>\n\t\t'); // add date under the content div
    }
  }

  var updated = $.html(); // collect updated html content
  fs.writeFileSync(program.target,updated); // save out updated HTML file
...



Fail checks

Like any good CLI, our script should fail gracefully by informing the user that something was missing or failed in some way. To accomplish that, it should check whether the target and date flags were filled out, as they are required for the script to carry out its purpose. In the following code snippet, the first else if statement checks whether the target file wasn't supplied by the user. The second else if statement checks whether the date flag wasn't supplied. The final else statement is a catch-all in case something else goes wrong.

...
} else if (!program.target) {
  console.log('Target for the HTML file was not provided; quitting.');
  process.exit(1);
} else if (!program.date) {
  console.log('No date stamp was provided; quitting');
  process.exit(1);
} else {
  console.log('undefined error; quitting');
  process.exit(1);
}

This completes the script. At this point, we should save the file and call it appendDate.js.

Note: this script doesn't verify the date string you enter into the span.date element. You could add a check for the program.date content to conform to a specific format using Regex (like /(\d[0-9]){2}-(\d[0-9]{1})-(\d[0-9]{1})/g) but that is entirely up to you.
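If you do want such a check, here is a minimal sketch that assumes the YYYY-MM-DD format used in this tutorial's test command. The isValidDateStamp helper is a hypothetical name, and it uses a simpler pattern than the one mentioned above, anchored so the whole string has to match.

```javascript
// Hypothetical helper: checks for a real calendar date written as YYYY-MM-DD.
function isValidDateStamp(value) {
  const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(value);
  if (!match) {
    return false;
  }
  const year = Number(match[1]);
  const month = Number(match[2]);
  const day = Number(match[3]);
  // Date.UTC rolls impossible days over (Feb 30 becomes Mar 2),
  // so build the date and compare its parts back to the input
  const date = new Date(Date.UTC(year, month - 1, day));
  return date.getUTCFullYear() === year &&
    date.getUTCMonth() === month - 1 &&
    date.getUTCDate() === day;
}

console.log(isValidDateStamp('2017-03-30')); // well-formed, real date
console.log(isValidDateStamp('2017-02-30')); // well-formed, but not a real date
console.log(isValidDateStamp('30-03-2017')); // wrong format
```

In the script, this could run against program.date before the file is loaded, with the script quitting if the check fails.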

CLI in action

With the script completed, let's test it out on a simple HTML file. Copy the following code and save it as test.html in the same directory as your node script:

<!DOCTYPE html>
<html>
<head>
  <title>Date Stamp Test</title>
</head>
<body>
</body>
</html>


Let's test our script by executing the following command: node appendDate.js -t test.html -d 2017-03-30 (or whatever date string you wish to use). Examine test.html. You should see a new div#content element containing our custom date stamp appended to the body of the document.


Review

In ~40 lines of code, we created a script that takes arguments from the command line for a target file and a date string, loads the target file, modifies it with the date string, and writes out the target file with the updated HTML.

I hope this tutorial was useful and thank you for reading it.