Wednesday, January 16, 2019

Scraping a Web Document Using Nightmare.js

Introduction

I recently learned about another way to harvest content from a website: nightmare.js. Other libraries such as request.js (with cheerio.js) work fine, but if one needs to get past a login or has to navigate the site to reach the content, these libraries won't work. Enter nightmare.js and Electron. This document walks through a basic setup that uses nightmare.js to navigate to a site, log in, and grab content from a specific element.

Requirements

You should be fairly comfortable with JavaScript and CSS in general and have some working knowledge of how Node.js works prior to digging into this tutorial.

Required npm packages

In this tutorial, we'll need to ensure the following packages have been installed in your project directory: nightmare and commander (fs, which the script also uses, is built into Node.js).
Note: This tutorial was written with Node.js (version 10.11.0).

Scraping Content with Nightmare.js

Like any Node.js app, let's start with the basics and require the modules we need. In this case, we are using nightmare.js to navigate a site, fs to write the content to disk, and commander to set up flags for the script's arguments.

const Nightmare = require("nightmare");
const fs = require('fs');
const program = require('commander');
const nightmare = Nightmare({ show: true });
const selector = '.content';
...


Note: if you don't want to see Electron "jumping through all the hoops" to get at the content, you can set show to false. I think the commander library makes this script easier to use, as the input isn't order dependent. Finally, the selector variable identifies where the target content is located. In this case, it matches an element with the content class. You can use any CSS selector here to get at the desired content.

CLI setup

Next, we'll set up the flags for the script. In this case, we should only accept three required flags: user (id), (user) password, and the URL of the target document.

...
program
  .version('0.1.0')
  .usage('[required options] -u <username> -p <password> -url <url>')
  .option('-u, --user', 'Username id')
  .option('-p, --password', 'User\'s password')
  .option('-url, --url', 'URL for site')
  .parse(process.argv);
...


Setting the user credentials and URL

Now we should store the values passed to the flags in variables to be used by the script. The arguments object will contain the user, password, and URL values. I mentioned how to set up a Node.js CLI earlier.

...
var arguments = {};

for (var i = 0; i < program.rawArgs.length; i++) {
  if (program.rawArgs[i] === '--user' || program.rawArgs[i] === '-u') {
    arguments.user = program.rawArgs[i + 1];
  }
  if (program.rawArgs[i] === '--password' || program.rawArgs[i] === '-p') {
    arguments.pass = program.rawArgs[i + 1];
  }
  if (program.rawArgs[i] === '--url' || program.rawArgs[i] === '-url') {
    arguments.url = program.rawArgs[i + 1];
  }
}
...
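
The loop above can also be written as a small lookup-driven function. This is only a sketch of the same idea, not part of the original script: the names FLAG_KEYS and parseFlags are made up for illustration, and unlike the loop above it refuses to treat another known flag as a value.

```javascript
// Hypothetical lookup table: every accepted flag spelling and the key it fills.
const FLAG_KEYS = new Map([
  ['-u', 'user'], ['--user', 'user'],
  ['-p', 'pass'], ['--password', 'pass'],
  ['-url', 'url'], ['--url', 'url']
]);

// Walk the raw argument list and collect the value that follows each known flag.
function parseFlags(rawArgs) {
  const values = {};
  for (let i = 0; i < rawArgs.length; i++) {
    const key = FLAG_KEYS.get(rawArgs[i]);
    const next = rawArgs[i + 1];
    // Accept a value only if one exists and it is not itself a known flag.
    if (key && next !== undefined && !FLAG_KEYS.has(next)) {
      values[key] = next;
      i++; // skip the value that was just consumed
    }
  }
  return values;
}
```

With this in place, the script could build its values with something like parseFlags(program.rawArgs). A flag given without a value is simply left out of the result, which the check for required parameters can then report.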


Note: some special characters (e.g. ', ", !, $, and so on) may not reach the script intact, typically because the shell interprets them before commander ever sees them. We won't get into that here today. So, if your password uses any of these special characters, the string may not be passed properly to the target server unless it is quoted on the command line.

Exiting if required parameters are missing

Before running the main routine, the script checks that all three values (user id, password, and URL) were provided. If any of them is missing, it tells the user which ones and exits with a non-zero status.

...
if (arguments.user && arguments.pass && arguments.url) {
  ...
  <main routine>
  ...
} else {
  if (!arguments.user) {
    console.log('Username is required.');
  }
  if (!arguments.pass) {
    console.log('Password is required.');
  }
  if (!arguments.url) {
    console.log('URL is required.');
  }
  process.exit(1);
}
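
If the list of required values grows, the same check can be driven by a table instead of one if statement per value. The following is a minimal sketch under that assumption; REQUIRED, missingMessages, and validate are hypothetical names that do not appear in the original script.

```javascript
// Hypothetical table of required keys and the label shown to the user.
const REQUIRED = [
  ['user', 'Username'],
  ['pass', 'Password'],
  ['url', 'URL']
];

// Return one message per required value that is absent or empty.
function missingMessages(values) {
  return REQUIRED
    .filter(([key]) => !values[key])
    .map(([, label]) => label + ' is required.');
}

// Print every problem and report whether the script may continue.
function validate(values, log = console.error) {
  const problems = missingMessages(values);
  problems.forEach(message => log(message));
  return problems.length === 0;
}
```

The script could then call validate with its collected values and run process.exit(1) when it returns false, which keeps the main routine out of a large if/else block.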

Main routine

And now for the main (routine) attraction!
We'll use nightmare to navigate Electron to our desired document, click the login button, wait a bit (hopefully long enough for the server to respond), enter our credentials, submit said credentials, wait again, grab the content, write out the content, and announce any errors.

...
nightmare
  .goto(arguments.url)
  .click('#login')
  .wait(5000)
  .type('#usernameInput', arguments.user)
  .type('#passwordInput', arguments.pass)
  .click('#submit')
  .wait(10000)
  .evaluate(selector => {
    return {
      html: document.querySelector(selector).innerHTML,
      title: document.title
    };
  }, selector)
  .end()
  .then(obj => {
    console.log('Processed ' + obj.title);
    fs.writeFileSync('./downloads/' + obj.title + '.html', obj.html);
  })
  .catch(error => { // catch any errors
    console.error('Failed to obtain content from ' + arguments.url);
  });
...

  • The goto method allows nightmare to load the desired document.
  • The click method clicks on an element. In this case, we first click the button with the id of login and (eventually) the button with the id of submit.
  • The wait method simply pauses the routine for the given number of milliseconds. This is often needed to give the server time to respond to previously fired events.
  • The type method allows text to be entered into fields. In this case, we are entering our user id and password into the document elements with the ids of usernameInput and passwordInput.
  • The evaluate method runs a function inside the loaded page. Here, it looks in the document for an element matching the provided CSS selector and returns two items to the script: the desired content and the title of the document, which we'll use as the name of the file we write out later.
  • The end method closes the Electron browser.
  • After the script has retrieved the desired content, it's time to process the returned object using the then method. In this method, we let the user know which document was processed and then write out the file with the desired content. Note that in this step one can "massage" the content to fit one's needs using cheerio.js or any other preferred method.
  • Finally, the catch method is used to catch any errors. Here, the script is using it generically to inform the user that the gathering process failed.
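
The fixed pauses in the routine above are a guess at how long the server needs. Nightmare's wait method also accepts a CSS selector, in which case it continues as soon as a matching element is present. The sketch below shows the same steps with selector-based waits. The function name logInAndGrab and the credentials object are made up for illustration, and it assumes the target selector only appears once the login has succeeded.

```javascript
// Hypothetical wrapper around the login steps. `browser` is expected to be a
// Nightmare instance; the element ids are the ones used in this tutorial.
function logInAndGrab(browser, credentials, selector) {
  return browser
    .goto(credentials.url)
    .click('#login')
    .wait('#usernameInput') // continue once the login form is present
    .type('#usernameInput', credentials.user)
    .type('#passwordInput', credentials.pass)
    .click('#submit')
    .wait(selector) // continue once the target content is present
    .evaluate(sel => ({
      html: document.querySelector(sel).innerHTML,
      title: document.title
    }), selector)
    .end();
}
```

A call such as logInAndGrab(nightmare, { url: ..., user: ..., pass: ... }, selector) would then be followed by the same then and catch handlers as before. If an element never shows up, the wait should eventually time out and the failure ends up in the catch handler.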