Thursday, December 14, 2017

Quick tip: Unwrap for Cheerio

Overview

I recently completed a request to parse, clean up, and remove select elements from hundreds of HTML documents. Not a major task, considering I have a node.js script in my toolbox that can handle this. At least, that's what I thought. When it came to removing select elements while leaving their child elements and content intact, I found a series of span elements wrapped around other elements. Apparently, the source documentation for these HTML files uses spans for all sorts of formatting features. Outside their original source, these span elements serve absolutely no purpose. So, I was tasked with removing them.

Like before, I figured I could use the cheerio module for its jQuery-like features (in particular, unwrap, to remove the unwanted elements) and be done with it. Unfortunately, the version of cheerio I was using (1.0.0) didn't have an unwrap method.

After some time researching this problem, I found the contents method in cheerio and figured I could use it to accomplish my task.

The code

Here is the entire code followed by the breakdown.

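var fs = require('fs');
var cheerio = require('cheerio');
var shell = require('shelljs');

// Process every HTML file in the documents directory.
shell.ls('documents/*.html').map(function(file) {
  console.log('Unwrapping elements from ' + file);
  // Declare $ locally so it does not leak into the global scope.
  var $ = cheerio.load(fs.readFileSync(file).toString());
  // Replace each matching span with its own child nodes.
  $('span[class^="unnecessary"]').each(function(i, elem) {
    var contents = $(this).contents();
    $(this).replaceWith(contents);
  });
  // Overwrite the original file with the cleaned-up markup.
  fs.writeFileSync(file, $.html());
});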


Like most node.js scripts, we start by requiring a few modules. Next, we use shell's ls method to find all HTML documents in the documents directory, then call the map method on the returned array to process each document found.

Like before, we use cheerio to load each HTML document into the script with jQuery-like features.
Now, here is the tricky part: find the element you want to remove, grab the contents of said element, and replace the selected element with its contents. Essentially, this part of the script loads the selected element's child nodes into a variable called contents and replaces the current element with the items in that variable.

From there, we write out the modified document and carry on with the next one in the returned array.
