Thursday, March 14, 2019

Using Cheerio and Request to Scrape

Introduction

I've been heavily involved in content migration over the last few months. As a result, I've had to look for ways to pull content from one site and push it into another. Often, the source site doesn't have an API to make my life easier. Enter the cheerio and request npm modules. This tutorial walks you through a basic routine of requesting a document and pulling content from a select set of elements.

Requirements

You should be fairly comfortable with JavaScript and CSS selectors in general and have some working knowledge of how Node.js works prior to digging into this tutorial.

Required npm packages

In this tutorial, we'll need to ensure the following packages have been installed in your project directory: cheerio and request.
Note: This tutorial was written with Node.js (version 10.11.0).

Setting up requirements

As mentioned earlier, this script uses cheerio to parse content with jQuery-like features and request to fetch the document. The script also needs to accept two arguments when executed: (1) the URL of a source document and (2) a selector specifying which elements to pull content from.

const cheerio = require('cheerio');
const request = require('request');
const url = process.argv[2];
const selector = process.argv[3];
....
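To make the argument indexes less mysterious, here is a minimal sketch of how Node.js lays out process.argv. The parseArgs helper and the sample paths are hypothetical and only serve the illustration; the script itself reads process.argv[2] and process.argv[3] directly.

```javascript
// Hypothetical helper, not part of the original script.
function parseArgs(argv) {
  // argv[0] is the path to the node binary and argv[1] is the path to the
  // script, so user-supplied arguments start at index 2.
  const [url, selector] = argv.slice(2);
  return { url, selector };
}

// Simulated argument list for: node request.js https://example.com .title
// (the two paths are made-up placeholders)
const sampleArgv = ['/usr/bin/node', '/tmp/request.js', 'https://example.com', '.title'];
console.log(parseArgs(sampleArgv));
```

In the real script you would pass process.argv instead of the sample array.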

Input error handling

If the user doesn't supply both a URL and a selector, the script should fail right away instead of attempting to extract something.

....
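if (!url || !selector) {
  // report the problem on stderr and exit with a non-zero status
  console.error('You need to supply both a URL and a selector.');
  process.exit(1);
} else {
  // main routine (request and process the document) goes here
}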
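The check above only confirms that both arguments are present. As an optional extra, a sketch like the following could also confirm that the first argument looks like a web address before any request is made, which helps catch swapped arguments. It assumes the URL class that Node.js exposes as a global; isHttpUrl is a hypothetical helper name, not part of the original script.

```javascript
// Hypothetical helper: returns true only for absolute http(s) URLs.
function isHttpUrl(value) {
  try {
    // The URL constructor throws a TypeError when the input cannot be parsed.
    const parsed = new URL(value);
    return parsed.protocol === 'http:' || parsed.protocol === 'https:';
  } catch (err) {
    return false;
  }
}

console.log(isHttpUrl('https://example.com/page.html'));
console.log(isHttpUrl('.post-title'));
```

If you adopt something like this, the existing condition could become if (!url || !selector || !isHttpUrl(url)).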

Requesting and processing the body

The main routine of this script requests a document and processes it with cheerio so we can get at select parts of the content. If the request succeeds and the status code is 200, we pass the body of the document to cheerio. From there, you can add whatever features you like to process the content.

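request(url, (err, resp, body) => {
  if (err) {
    // network or protocol failure: report it and stop
    console.error(err);
    return;
  }
  if (resp.statusCode !== 200) {
    // the server answered, but not with the document we asked for
    console.error('Unexpected status code: ' + resp.statusCode);
    return;
  }
  // load the document so it can be queried with jQuery-like selectors
  const $ = cheerio.load(body.toString());
  $(selector).each(function() {
    // do something with the content
    console.log($(this).html());
  });
});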
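As one example of "doing something with the content", extracted fragments often carry the source page's indentation and line breaks. A small helper along these lines could tidy each fragment before it is printed or pushed elsewhere. The name tidyFragment and the sample markup are hypothetical; this is a sketch of one possible post-processing step, not a required part of the script.

```javascript
// Hypothetical helper: collapses runs of whitespace in an extracted fragment.
function tidyFragment(html) {
  // cheerio's .html() can return null when nothing matched, so fall back to ''.
  return (html || '').replace(/\s+/g, ' ').trim();
}

// Sample fragment standing in for the value of $(this).html()
const sampleFragment = '\n  <a href="#">Post\n     title</a>\n';
console.log(tidyFragment(sampleFragment)); // prints: <a href="#">Post title</a>
```

Inside the each callback, this would be used as console.log(tidyFragment($(this).html())).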

Usage

With the script complete, follow these steps to use it to pull content from the web.
  1. Save this file as request.js.
  2. Open a terminal in the same directory as request.js.
  3. Execute node request.js <URL> <selector>, replacing <URL> with the address of the web document you'd like to pull content from and <selector> with the element id or class you want to extract. For example, try this one: node request.js https://crudthedocs.blogspot.com/2019/01/scraping-web-document-using-nightmarejs.html '.post-title.entry-title'
  4. Observe the output in the terminal.