Wednesday, December 14, 2016

Using Node.js for Text Processing

Intro

As a tech writer responsible for writing and publishing documentation in various formats, I've found a need to combine my hobby of toying around with JavaScript with document publication. In particular, I'm tasked with pulling information from an Atlassian Confluence site down into a static HTML file set. However, the method I use (Export EclipseHelp with a custom template) doesn't reliably generate clean or consistent HTML documents. Although this tutorial was originally intended for updating content extracted from Confluence, it can work on any HTML file.

I figure there are better ways of doing what I'm about to demonstrate, but my needs are rather particular (as in, this script needs to function as part of a bigger puzzle I employ for publication). If you have suggestions to improve it, I'd love to hear them.

This document doesn't cover how to export HTML from Confluence. What it does cover is a script I came up with that performs a find-and-replace operation on every HTML file in a particular directory.

Node.js requirements

This little script needs only three modules: fs to read and write files, shelljs to gather a file list, and cheerio for jQuery-like features:

var fs = require('fs');
var cheerio = require('cheerio');
var shell = require('shelljs');

Documents, meet array

Using a shell module, I gather all the HTML files in a particular directory:

var fileNames = shell.ls('documents/*.html');
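
If you'd rather not pull in a shell module for this step, a similar list can probably be built with Node's built-in fs and path modules. The following is only a minimal sketch and not what the script in this post uses; the helper name listHtmlFiles is made up for illustration, and it only looks at files directly inside the given directory.

```javascript
var fs = require('fs');
var path = require('path');

// Hypothetical helper (not part of the original script): return the paths of
// all regular files directly inside `dir` whose extension is .html.
function listHtmlFiles(dir) {
  return fs.readdirSync(dir)
    .map(function (name) {
      // Turn each bare file name into a path such as "documents/page.html".
      return path.join(dir, name);
    })
    .filter(function (fullPath) {
      // Keep .html entries (case-insensitive) that are real files, not folders.
      return path.extname(fullPath).toLowerCase() === '.html' &&
        fs.statSync(fullPath).isFile();
    })
    .sort();
}

// Usage, assuming a "documents" directory exists next to the script:
//   var fileNames = listHtmlFiles('documents');
```

Because the helper filters on the extension itself, the returned array contains only usable file paths.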

String it up

Read each document as a string if its file name has the .html extension:

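for (var i in fileNames) {
  // Skip any entry that does not look like an HTML file name
  if (fileNames[i].indexOf(".html") > -1) {
    // Read the file into a string and parse it with cheerio
    var $ = cheerio.load(fs.readFileSync(fileNames[i]).toString());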
    ...
  }
}


While it may seem a bit redundant to check every entry in the array for the HTML file type, the fileNames array can end with an empty element, which would cause our script to throw an error at the end.

Here we use the cheerio module to add jQuery-like features to our script so we can do things like select elements and modify them in a number of ways.
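
The extension check described above could also be done once, before the loop, so the loop body only ever sees usable names. This is a sketch under the assumption that the unwanted entries are empty strings or non-string values; the helper name onlyHtml and the sample array are hypothetical. Note that it matches .html only at the end of the name, which is slightly stricter than the indexOf check in the script.

```javascript
// Hypothetical helper: keep only string entries whose name ends in ".html".
function onlyHtml(names) {
  return Array.prototype.filter.call(names, function (name) {
    return typeof name === 'string' && /\.html$/i.test(name);
  });
}

// Made-up sample list, including a trailing empty element.
var sample = ['documents/a.html', 'documents/b.txt', ''];
var cleaned = onlyHtml(sample);

// A plain index-based loop then needs no extension check in its body.
for (var i = 0; i < cleaned.length; i++) {
  console.log('Would process ' + cleaned[i]);
}
```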

Process the string

Check whether the document contains a div element with the class footer. If it does, remove it.

if ($('div.footer').length > 0) {
  console.log("Removing footer from ../" + fileNames[i]);
  $('div.footer').remove();
} else {
  console.log(fileNames[i] + " has no div.footer element.");
}


At this step in the script, the currently loaded HTML document can be processed in a multitude of ways (e.g., updating elements in the header, injecting the Bootstrap grid system, swapping image locations, adding date stamps, and so on).
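
As one illustration of such extra processing, here is a rough sketch of a date stamp step. It assumes that $ is the object returned by cheerio.load(...) and uses only the selector, remove, and append calls; the helper name addDateStamp, the class name processed-stamp, and the stamp wording are all invented for this example.

```javascript
// Hypothetical helper: append a "processed on" stamp to a loaded document.
// `$` is assumed to be the object returned by cheerio.load(...).
function addDateStamp($, date) {
  // toISOString() is always UTC; the first 10 characters are YYYY-MM-DD.
  var isoDay = date.toISOString().slice(0, 10);
  // Remove a stamp left by an earlier run so repeated runs don't stack them.
  $('p.processed-stamp').remove();
  $('body').append('<p class="processed-stamp">Processed on ' + isoDay + '</p>');
  return $;
}

// Usage inside the loop, after the document has been loaded:
//   addDateStamp($, new Date());
```

Keeping each processing step in its own small function like this makes it easier to add or drop steps without touching the loop itself.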

Update and save

Serialize the modified document back to a string and save it over the original file.

var removed = $.html();
fs.writeFileSync(fileNames[i], removed);
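
Since writeFileSync overwrites the exported file in place, the original can't be recovered after a run. A possible variation, offered as an assumption rather than as part of my script, writes only when the content differs and keeps a copy of the previous content first. The helper name saveIfChanged and the .bak suffix are hypothetical.

```javascript
var fs = require('fs');

// Hypothetical helper: write `updated` to filePath only if it differs from
// what is on disk, keeping the previous content in filePath + '.bak'.
// Returns true when the file was rewritten, false when it was left alone.
function saveIfChanged(filePath, updated) {
  var original = fs.readFileSync(filePath, 'utf8');
  if (original === updated) {
    return false;
  }
  fs.writeFileSync(filePath + '.bak', original); // backup of the old content
  fs.writeFileSync(filePath, updated);
  return true;
}

// Usage inside the loop, in place of the plain writeFileSync call:
//   saveIfChanged(fileNames[i], $.html());
```

Keep in mind that the serialized output may differ from the source even when no element was removed (the markup can be normalized along the way), so a file may still count as changed.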

Full code:

var fs = require('fs');
var cheerio = require('cheerio');
var shell = require('shelljs');
var fileNames = shell.ls('documents/*.html');

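for (var i in fileNames) {
  // Skip any entry that does not look like an HTML file name
  if (fileNames[i].indexOf(".html") > -1) {
    // Read the file into a string and parse it with cheerio
    var $ = cheerio.load(fs.readFileSync(fileNames[i]).toString());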
    if ($('div.footer').length > 0) {
      console.log("Removing footer from ../" + fileNames[i]);
      $('div.footer').remove();
    } else {
      console.log(fileNames[i] + " has no div.footer element.");
    }

    var removed = $.html();
    fs.writeFileSync(fileNames[i], removed); // save out HTML file
  }
}