CRUD the Docs: processing

Monday, August 14, 2017

Update: Using Node.js for Text Processing

Last month, I gave a lightning talk on "Using Node.js for Text Processing" at the Monthly Front End PDX Meetup. In this month's post, I'd like to share my slides and an updated code sample.

For the most part, my presentation didn't change much from the original post. What did change is a few of the methods I recently started using, which made my code shorter and more efficient.

Requirements

The required modules section of the script stays the same:

var fs = require('fs');
var cheerio = require('cheerio');
var shell = require('shelljs');
...

Looping through the documents

For a little more efficiency, I no longer declare a variable to hold the list of HTML documents. Instead, I chain the result of the shell.ls method directly into a map function, which lets the script loop through each item in the documents directory that matches the .html extension:

...
shell.ls('documents/*.html').map(function(file) {

...
});
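
If you would rather not depend on shelljs for the file list, the same loop can be sketched with Node's built-in fs and path modules. This is a minimal sketch and not what my script does: listHtmlFiles is a hypothetical helper name, and forEach stands in for map because nothing is done with the return value.

```javascript
var fs = require('fs');
var path = require('path');

// Hypothetical helper: return the paths of all entries directly
// inside dir whose names end in .html.
function listHtmlFiles(dir) {
  return fs.readdirSync(dir)
    .filter(function(name) {
      return path.extname(name) === '.html';
    })
    .map(function(name) {
      return path.join(dir, name);
    });
}

// Usage: only runs if a documents directory exists.
if (fs.existsSync('documents')) {
  listHtmlFiles('documents').forEach(function(file) {
    console.log('Found ' + file);
  });
}
```

Unlike the glob pattern, this version does not look into subdirectories and does not check whether an entry is a regular file.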

Convert documents to a string

Then, load the document to a string with jQuery-like features (thanks to the cheerio module):

...
var $ = cheerio.load(fs.readFileSync(file).toString());
...
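
A small aside on the read step: fs.readFileSync returns a Buffer unless you pass an encoding, which is why the line above calls toString(). Passing 'utf8' returns a string directly. The sketch below uses only built-in modules and a throwaway temp file (sample.html is a made-up name) to show that both forms give the same string.

```javascript
var fs = require('fs');
var os = require('os');
var path = require('path');

// Create a throwaway file so the example is self-contained.
var dir = fs.mkdtempSync(path.join(os.tmpdir(), 'read-demo-'));
var file = path.join(dir, 'sample.html');
fs.writeFileSync(file, '<div class="footer">bye</div>');

var viaBuffer = fs.readFileSync(file).toString(); // Buffer, then converted
var viaEncoding = fs.readFileSync(file, 'utf8');  // string straight away

console.log(viaBuffer === viaEncoding); // both hold the same text
```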

Process content

Finally, we use an if statement to find the element we are looking for, remove it, and save the file:

...
if ($('div.footer').length > 0) {
  $('div.footer').remove();
  fs.writeFileSync(file,$.html());
}
...
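
The same "only save when something changed" idea can be written as a small wrapper that works for any transformation, not just the footer removal. This is a sketch under assumptions: processFile and its transform argument are hypothetical names, and the transform is a plain string function here so the example runs without cheerio.

```javascript
var fs = require('fs');

// Hypothetical helper: apply transform to the file's text and
// write the file back only if the text actually changed.
// Returns true when the file was rewritten.
function processFile(file, transform) {
  var before = fs.readFileSync(file, 'utf8');
  var after = transform(before);
  if (after === before) {
    return false;
  }
  fs.writeFileSync(file, after);
  return true;
}
```

With cheerio, the transform could load the string, remove div.footer, and return $.html(). Keep in mind that $.html() may serialize the markup slightly differently from the source, so a transform like that can report a change even when no footer was found.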

The complete script

Here is the complete revised script:

var fs = require('fs');
var cheerio = require('cheerio');
var shell = require('shelljs');

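shell.ls('documents/*.html').map(function(file) {
  // Load the document into a jQuery-like object
  var $ = cheerio.load(fs.readFileSync(file).toString());
  // Only rewrite the file if there is a footer to remove
  if ($('div.footer').length > 0) {
    $('div.footer').remove();
    fs.writeFileSync(file, $.html());
  }
});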

Slide deck

Here are the slides I presented: https://docs.google.com/presentation/d/1R0GALRoOzNgTz0gcHIzpf0YQVhM6pLhA22ybM1x6YiQ/edit?usp=sharing

Wednesday, December 14, 2016

Using Node.js for Text Processing

Intro

As a tech writer who is responsible for writing and publishing documentation in various formats, I've found a need to combine my hobby of toying around with JavaScript with document publication. In particular, I'm tasked with pulling information from an Atlassian Confluence site down into a static HTML file set. However, the method I use (Export EclipseHelp with a custom template) doesn't reliably generate clean or consistent HTML documents. While the original intent of this tutorial was to update content extracted from Confluence, it can work on any HTML file.

I figure there are better ways of doing what I'm about to demonstrate, but my needs are rather particular (as in, this script needs to function as part of a bigger puzzle I employ for publication). If you have suggestions to improve it, I'd love to hear them.

This document doesn't cover how to export HTML from Confluence. What it does cover is a script I came up with that performs a find-and-replace on all HTML files in a particular directory.

Node.js requirements

This little script needs only three modules to read and write files, gather a file list, and use jQuery-like features:


var fs = require('fs');
var cheerio = require('cheerio');
var shell = require('shelljs');

Documents, meet array

Using the shelljs module, I gather all the HTML files in a particular directory:


var fileNames = shell.ls('documents/*.html');

String it up

Read each document as a string if the document has the extension of .html:


for (var i in fileNames) {
  if (fileNames[i].indexOf(".html") > -1) {
    var $ = cheerio.load(fs.readFileSync(fileNames[i]).toString());

    ...
  }
}

While it may seem redundant to check each item for the .html extension again, the array stored in fileNames can end with an empty element, which would cause the script to throw an error on the last pass through the loop.
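
For what it's worth, a for...in loop visits every enumerable key on an object, not just the numeric indexes of an array, so any extra property attached to the array shows up in the loop too. I haven't confirmed that this is what trips up the script with the shelljs result, so treat the following as a general JavaScript illustration; the list array and its note property are made up.

```javascript
// A plain array with one extra, non-index property attached.
var list = ['a.html', 'b.html'];
list.note = 'not a file name';

var seenByForIn = [];
for (var key in list) {
  seenByForIn.push(key); // keys are strings, including 'note'
}

var seenByForEach = [];
list.forEach(function(item) {
  seenByForEach.push(item); // only the real elements
});

console.log(seenByForIn);   // [ '0', '1', 'note' ]
console.log(seenByForEach); // [ 'a.html', 'b.html' ]
```

A guard such as the indexOf check filters out the extra keys as long as their values are strings; forEach or a counted for loop avoids them altogether.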

Here we use the cheerio module to add jQuery-like features to our script so we can do things like select elements and modify them in a number of ways.

Process the string

Check whether a div element with the class footer exists. If it does, remove it.


if ($('div.footer').length > 0) {
  console.log("Removing footer from ../" + fileNames[i]);
  $('div.footer').remove();
} else {
  console.log(fileNames[i] + " has no div.footer element.");
}

At this step in the script, the currently loaded HTML document can be processed in any number of ways (e.g., updating elements in the header, injecting the Bootstrap grid system, swapping image locations, adding date stamps, and so on).

Update and save

Serialize the updated document back to a string and save it over the original file.


var removed = $.html();
fs.writeFileSync(fileNames[i], removed);
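
Because writeFileSync overwrites the original document, it can be worth keeping a copy while you are still testing your selectors. The sketch below is one way to do that with built-in modules only; saveWithBackup and the .bak suffix are my own hypothetical choices, not part of the script above.

```javascript
var fs = require('fs');

// Hypothetical helper: copy file to file + '.bak' (first run only),
// then overwrite file with the new HTML.
function saveWithBackup(file, html) {
  var backup = file + '.bak';
  if (!fs.existsSync(backup)) {
    // Keep the first original; later runs don't clobber it.
    fs.writeFileSync(backup, fs.readFileSync(file));
  }
  fs.writeFileSync(file, html);
}
```

In the loop, this would take the place of the writeFileSync call, for example saveWithBackup(fileNames[i], removed).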

Full code:


var fs = require('fs');
var cheerio = require('cheerio');
var shell = require('shelljs');

var fileNames = shell.ls('documents/*.html');

for (var i in fileNames) {
  if (fileNames[i].indexOf(".html") > -1) {
    var $ = cheerio.load(fs.readFileSync(fileNames[i]).toString());

    if ($('div.footer').length > 0) {
      console.log("Removing footer from ../" + fileNames[i]);
      $('div.footer').remove();
    } else {
      console.log(fileNames[i] + " has no div.footer element.");
    }

    var removed = $.html();
    fs.writeFileSync(fileNames[i], removed); // save out HTML file
  }
}