Introduction
I've been heavily involved in content migration over the last few months. As a result, I've had to look for ways to pull content from one site and push it into another. Often, the source site doesn't have an API to make my life easier. Enter the cheerio and request npm modules. This tutorial walks you through a basic routine for requesting a document and pulling content from a selected set of elements.
Requirements
You should be fairly comfortable with JavaScript and CSS selectors in general and have some working knowledge of how Node.js works prior to digging into this tutorial.
Required npm packages
In this tutorial, we'll need to ensure the following packages have been installed in your project directory: cheerio and request.
Note: This tutorial was written with Node.js (version 10.11.0).
Setting up requirements
As mentioned earlier, this script uses cheerio to parse content with jQuery-like features and request to fetch the document. The script also needs to accept two arguments when it is executed: (1) a source document and (2) a selector that specifies which element to pull content from.
const cheerio = require('cheerio');
const request = require('request');
const url = process.argv[2];
const selector = process.argv[3];
....
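To make the argument indexes above less mysterious, here is a minimal sketch of how Node.js lays out process.argv. The readArguments helper and the sample paths and values are hypothetical and only used for illustration; they are not part of the original script.

```javascript
// process.argv[0] is the path to the node binary and process.argv[1] is the
// path of the script being run, so user-supplied arguments start at index 2.
// readArguments is a hypothetical helper, not part of the tutorial's script.
function readArguments(argv) {
  const [url, selector] = argv.slice(2);
  return { url, selector };
}

// Simulated argument list, as if the user ran:
//   node request.js https://example.com h1
const simulated = ['/usr/bin/node', '/tmp/request.js', 'https://example.com', 'h1'];
console.log(readArguments(simulated));
```

If either argument is left out, the corresponding property is simply undefined, which is what the next section checks for.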
Input error handling
If the user doesn't supply both a URL and a selector, the script should fail right away instead of attempting to extract something.
....
if (!url || !selector) {
  // Exit with a non-zero code so callers can detect the failure
  console.log('You need to supply both a URL and a selector.');
  process.exit(1);
} else {
  <main routine>
}
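The check above only confirms that both arguments are present. If you also want to reject malformed input before making a request, one possible extension is sketched below. It assumes Node's built-in URL class; the validateInputs helper and its messages are hypothetical additions, not something the original script includes.

```javascript
// Hypothetical helper: returns an error message, or null when the input looks usable.
function validateInputs(url, selector) {
  if (!url || !selector) {
    return 'You need to supply both a URL and a selector.';
  }
  try {
    // The URL constructor throws a TypeError when the string cannot be parsed
    const parsed = new URL(url);
    if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
      return `Unsupported protocol: ${parsed.protocol}`;
    }
  } catch (e) {
    return `"${url}" is not a valid URL.`;
  }
  return null;
}

console.log(validateInputs('https://example.com', 'h1'));
console.log(validateInputs('not a url', 'h1'));
```

In the script, you could print the returned message and call process.exit(1) whenever the result is not null.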
Requesting and processing the body
The main routine of this script requests a document and processes it with cheerio so we can get at selected parts of the content. If the request completes without an error and the response status is 200, we pass the body of the document to cheerio. From there, you can add whatever features you like to process the content.
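request(url, (err, resp, body) => {
  if (!err && resp.statusCode === 200) {
    // Load the document so it can be queried with jQuery-like selectors
    const $ = cheerio.load(body.toString());
    $(selector).each(function() {
      // do something with the content
      console.log($(this).html());
    });
  } else if (err) {
    console.log(err);
  } else {
    // The request completed, but the server returned a non-200 status
    console.log(`Request failed with status code ${resp.statusCode}.`);
  }
});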
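The decision made inside the callback (error, bad status, or usable body) can also be pulled out into a small function, which makes it easy to exercise without touching the network. This is only a sketch: interpretResponse and the shape of the object it returns are hypothetical, and the stand-in response objects mimic just the statusCode property used above.

```javascript
// Hypothetical helper that mirrors the branches of the request callback.
function interpretResponse(err, resp, body) {
  if (err) {
    return { ok: false, reason: err.message };
  }
  if (resp.statusCode !== 200) {
    return { ok: false, reason: `Unexpected status code ${resp.statusCode}` };
  }
  // body may be a string or a Buffer, so normalize it before parsing
  return { ok: true, html: body.toString() };
}

// Stand-in values instead of a real HTTP response
console.log(interpretResponse(null, { statusCode: 200 }, '<h1>Hello</h1>'));
console.log(interpretResponse(null, { statusCode: 404 }, ''));
console.log(interpretResponse(new Error('socket hang up'), undefined, undefined));

// Inside the real callback, this could be used roughly as:
//   const result = interpretResponse(err, resp, body);
//   if (result.ok) { const $ = cheerio.load(result.html); /* ... */ }
//   else { console.log(result.reason); }
```

Keeping this logic separate from the cheerio processing means the selector handling only ever runs on a body that was actually returned with a good status.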
Usage
With the script complete, follow these steps to use it to pull content from the web.
- Save this file as request.js.
- Open a terminal in the same directory as request.js.
- Execute node request.js <URL> <selector>, replacing <URL> with the web document you'd like to pull content from and <selector> with the id or class of the element you want to pull content from. For example, try this one: node request.js https://crudthedocs.blogspot.com/2019/01/scraping-web-document-using-nightmarejs.html '.post-title.entry-title'
- Observe the output in the terminal.