Thursday, November 14, 2019

Salary Calculator and Negotiating

If you're like me, you often get frustrated and/or confused about how much you should be paid as a technical writer. The Salary Calculator by Robert Half is a great tool that takes a lot of the guesswork out of salary requirements. For example, I plugged in the following:

  1. Area of Specialization: "Technology & IT"
  2. Job Category: "Software & Application Development"
  3. Job Title: "Technical Writer"
  4. State: "Oregon" (as I happen to live here currently)
  5. City: "Portland"
At the time of writing, the salary range was $63.5k - $107.7k with the median at $76k. There are a few things to keep in mind about this range:
  • Bonuses and benefits are not included
  • You should add another 5-10% to your salary if you possess certain skills and/or certifications
Now that I know what I'm worth, and since I feel very confident in my skills and work experience, I can ask for the aforementioned salary range with confidence when applying for a new job.


If you are already employed, I found the Robert Half article How to Negotiate Salary After You Get a Job Offer rather useful for its tips and advice on how to ask for more pay on the job.


Monday, October 14, 2019

Customizing Confluence: Last Modified Date

Introduction

When exporting content from Confluence, you may find yourself in a situation where a date stamp is required in the output files. By default, Confluence doesn't export time data (created or last updated) through its native means. If you have the budget, there are a number of plugins that can provide you with a macro to use in your documents, but if your budget is tight, you can use the following code to create your own user macro that outputs a document's last updated date.

Required skills

You should be comfortable and/or knowledgeable with the following:
  • HTML
  • jQuery
  • Creating user macros in Confluence
  • Modifying page layouts and space layouts in Confluence

Creating a user macro to display the last modified date

To create a user macro that displays the last modified date, you can copy/paste the following code into your instance of Confluence:

Macro name: last-modified
Macro title: Last Modified
Description: Displays last modified date of the document.
Categories: Reporting
Macro Body Processing: No macro body
## @noparams
<span id="lastModifiedDate" style="font-size: 0.7em">Last Modified: $action.dateFormatter.formatDateTime($content.lastModificationDate)</span>


This simple little macro grabs the last modified date from Confluence and displays the value wherever you insert the user macro in your document. The font size is purposely smaller so it won't be too noticeable on your document.

Modify layouts

If your export process includes the rendered HTML document, then instead of inserting the macro on every document, you can update the space layout with a little jQuery function that obtains the last updated date and inserts it into the body of the wiki content element.

Note: Modifying the Content Layout of a space can be dangerous. If you remove a line from the Page Layout, you could corrupt the space and possibly make Confluence unstable. Use this method with care.

If you are using the Default Theme, insert the following code in Page Layout (Space Tools > Look and Feel > Layout > Create Custom under the Content Layouts in the Page Layout section) after the page setters:

<!-- last updated insertion -->
<script>
$(document).ready(function() {
$('div#main-content').append('<span style="font-size: 0.7em;">Last updated: ' + $('a.last-modified').text() + '</span>');
});
</script>
<!-- end last updated insertion -->

If you are using the Documentation Theme, insert the following code in the footer of the page layout:

<!-- last updated insertion -->
<script>
$(document).ready(function() {
  $('div.wiki-content').append('<span style="font-size: 0.7em;">Last updated: ' + $('a.last-modified').text() + '</span>');
});
</script>
<!-- end last updated insertion -->


This method of inserting a last modified timestamp is, IMHO, the easiest and safest way to add a timestamp to every document in a space.

Friday, September 27, 2019

Managing Writers: Interview with Richard Hamilton (podcast)

I recently had a chance to listen to Tom Johnson's podcast entitled Managing Writers: Interview with Richard Hamilton and I found it to be very insightful. I totally agree that documentation metrics are difficult to nail down and pageviews aren't always the best metric (though a decent one).  I personally haven't found a good metric of productivity for tech writers. (If you have one, I'd love to hear about it.)

Documentation managers (writers or otherwise) are best served by simply staying aware of what their tech writers are doing and how heavily loaded they are on a regular basis. Having regular check-ins and one-on-ones is the best way to tell if a writer is overloaded or not.

Saturday, September 14, 2019

Migrating Content From One Confluence Instance to Another

Introduction

From time to time, as a Confluence administrator, you'll be called upon to migrate a space to a new Confluence instance. This guide provides some tried-and-true steps: generate a list of spaces in Confluence, gauge how active a space is (so you can decide whether it should be archived, migrated, or simply removed), find a space owner, add a warning message to a space, set a space to "read-only" mode, and delete a space.

Required skills

You should be comfortable and/or knowledgeable with the following:
  • HTML
  • Managing Confluence space themes
  • Basic Confluence administration
  • Exporting and importing Confluence spaces

Migrating Confluence content

  1. Generate a list of all the spaces in your old Confluence instance. This list will be used as a checklist for tracking all the spaces that have been migrated to the new instance.
    1. To get a list of spaces, go to Spaces > Space Directory. This page will show you a list of all the spaces in your instance.
  2. Start reviewing the spaces for level of activity. I believe it is safe to say that if a space hasn't had any visitors in 6+ months, then it should be marked as archived and prioritized for either removal or migration.
    1. To see the level of activity, go into the space and then Space Tools > Activity. In the Activity page, set the Period for months and review the previous six months of activity by clicking on the previous button for the month section.
    2. To archive a space, go to the space in Confluence and then Space Tools > Overview. In the Overview page, click on Edit space details button. From there, change the Status to Archived and click on the Save button.
  3. Identify and speak with space owner(s) and get their permission to archive and/or migrate a space. Also ask them if the space should have a higher or lower priority for migration.
    1. To see who the space owner is, open the space and go to Space Tools > Permissions. Under Individual Users, you should typically see someone who has all permissions to the space. Another way is to visit the space's home page, which shows who created it. That person should be the space's creator (i.e., the owner of the space), unless the Confluence administrator is tasked with creating spaces.
  4. Work with space owner(s) on active spaces so you don't interrupt their work and set a date for migration.
  5. Add a migration warning to the space stating that it will be migrated on a designated date.
    1. If the target space is using the Documentation Theme, follow these steps to add a warning label:
      1. Navigate to the space's Themes page (space > Space Admin > Themes).
      2. In the Messages section under Header, add the following code:
        {html}
        <div style="background-color: red; color: white; padding: 5px;">This space will be migrated on <span style="color: yellow"><designated_migration_date></span></div>
        {html}
      3. This will add a message on top of every page in the space and the unstylish colors will definitely grab the attention of everyone viewing the page.
      4. Note: it's a good idea to give everyone at least two weeks' notice about the migration.
  6. For unvisited spaces or spaces about to be migrated, put the space into "read-only" mode.
    1. To make a space "read-only", go to Space Tools > Permissions.
    2. In the Permissions page, you should see a list of space admins and groups. It's best to leave the permission scheme alone for the space admins, but if you have a group that all users fall under, change its permissions so that only All > View is checked and everything else is unchecked.
  7. Export spaces on designated dates. Start with higher priority spaces and work your way down to the low priority spaces. See Export and Import a Confluence Space for instructions on how to export a space.
  8. Import the exported space into the new instance. See Export and Import a Confluence Space for details on how to import a space.
  9. Go into the newly imported space and confirm that everything is in proper order (content and attachments are fine, macros are working as expected, permissions are good, and so on). Note: this step may take a bit of time. I recommend that you get help from the original space owner(s) to confirm that the migration went well.
  10. Change the migration notice to one stating the space has been migrated and is marked for removal. Update the warning in the exported space in the old instance to say it has been migrated (with a link to the new space) and add a removal date.
    1. To add a removal warning, repeat the sub-steps listed in step 5.
      1. Navigate to the space's Themes page (space > Space Admin > Themes).
      2. In the Messages section under Header, add the following code:
        {html}
        <div style="background-color: red; color: white; padding: 5px;">This space has been migrated to <a style="color: white; text-decoration: underline;" href="url_of_new_confluence_instance/display/<spacekey>">url_of_new_confluence_instance/display/<spacekey></a> and will be removed from this wiki on <designated_removal_date></div>
        {html}

      3. This will add a message on top of every page in the space and the unstylish colors will definitely grab the attention of everyone viewing the page.
      4. Note: it's a good idea to give everyone at least two weeks' notice about the removal.
  11. On the designated removal date, delete the old space in the old instance of Confluence. While this is a very dangerous step, keep in mind that you have the exported zip file and the newly created space in the new instance. If anything goes wrong, you can always re-import the space back into the original Confluence instance.
    • To delete a space, go to Space Tools > Overview. Click the Delete Space button. You may be prompted to enter your credentials so be ready to enter that information. Click the Ok button to start the deletion process. Depending on how big the space is, this could take a few seconds or several minutes. Check with the Time Remaining counter (I find it's mostly accurate 90% of the time but that will vary from server to server based on your server's configuration).
Tip: You may want to include a message at the top of the newly imported space in the new instance indicating where the old space can be found in the old instance. Just use the sub-steps mentioned in either step 5 or 10 and add the necessary info to point back to the old space.

Thursday, August 29, 2019

Should it be Capitalized?

Every once in a while I come across fun little tidbits of knowledge. Today, I found this knowledge-nugget, which explains how to handle capitalization in a title or headline.


Tuesday, August 27, 2019

Creating Authentic Human Connections Within A Remote Team

I recently read Creating Authentic Human Connections Within A Remote Team posted by Smashing Magazine and I really connected with this article. I have been working as a remote tech writer for three years now, and in my experience what Randy Tolentino wrote is very true. The "Reading emotions across the distance" section, especially, was spot on. However, I don't agree that using emojis is necessarily a good solution; I think their use greatly depends on the personality of the person on the other side of the screen. Personally, if I'm having a back-and-forth with someone over IM, I just ask if I can video conference with them for 5-10 minutes. That face-to-face time is much better at connecting with the other person and reinforcing that we are humans and not just resources (as Randy mentions in the article).

Wednesday, August 14, 2019

Modifying an Exported Space From Confluence

Introduction

During Confluence migrations (from one instance to another), you may find yourself in a situation where the new Confluence instance already has a space with the same spacekey as the old space you are attempting to import. This month's post will show you how to get around that issue.

Required skills

This document assumes you know how to export a Confluence space already and are comfortable with editing XML files.

Exporting and editing

To start, you will need to expand the zip file that came from exporting the space. In this expanded directory, edit the exportDescriptor.properties file: modify the spaceKey property to the desired spacekey and save your change.
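For example, after the edit the relevant line in exportDescriptor.properties would look something like this (NEWKEY standing in for your chosen spacekey):

```
spaceKey=NEWKEY
```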

In entities.xml, search for and replace the items listed below (replacing OLDKEY with the old spacekey and NEWKEY with a new, unused spacekey for the new Confluence instance):
  • [CDATA[OLDKEY] → [CDATA[NEWKEY]
  • OLDKEY → NEWKEY
  • spaceKey=OLDKEY → spaceKey=NEWKEY
  • [OLDKEY: → [NEWKEY:
  • key=OLDKEY] → key=NEWKEY]
  • <spaceKey>OLDKEY</spaceKey> → <spaceKey>NEWKEY</spaceKey>
  • ri:space-key="OLDKEY" → ri:space-key="NEWKEY"
  • ri:space-key=OLDKEY → ri:space-key=NEWKEY
  • <ac:parameter ac:name="spaces">OLDKEY</ac:parameter> → <ac:parameter ac:name="spaces">NEWKEY</ac:parameter>
  • <ac:parameter ac:name="spaceKey">OLDKEY</ac:parameter> → <ac:parameter ac:name="spaceKey">NEWKEY</ac:parameter>
  • <property name="lowerDestinationSpaceKey"><![CDATA[NEWKEY]]></property> → <property name="lowerDestinationSpaceKey"><![CDATA[newkey]]></property>
  • <property name="lowerKey"><![CDATA[NEWKEY]]></property> → <property name="lowerKey"><![CDATA[newkey]]></property>
  • spacekey=oldkey → spacekey=newkey

With those two files updated, re-zip all the content, rename the archive to the original zip file's name (you may want to remove or rename the old zip file first, just in case), and upload it to your new Confluence instance as you normally would to import a space.

Sunday, July 14, 2019

Generate a Path/File Report of HTML Documents

Introduction

While there are bazillions of ways to generate a text file listing all the directories and their respective files (Bash comes to mind), I wanted to explore Shell.js for this task so it could be chained into a bigger set of Node.js tools I've been cobbling together recently. I also wanted to make it simpler than recent scripts I've written by foregoing Commander.js and just having it accept one argument: the directory to generate the report from.

In theory, I should be able to execute this command and get back a text file listing all the files found there, along with all the nested files and directories: node directorySiteMap.js <directory>

Required skills and npm packages

You should be fairly comfortable with JavaScript and have some exposure to shell.js (0.8.3).

Required modules and variable setup

We'll need to require two modules (shell and fs), grab the user supplied directory (path), and set up a variable to hold the list of items found therein (output).

const shell = require('shelljs');
const fs = require('fs');
const path = process.argv[2];
var output = '';
...

Generate the report

If the path is supplied, we should inform the user that the script is generating the report, recursively gather all the contents of the target directory, and save the data to a text file called directorySiteMap.txt.

...
if (path) {
  console.log('Generating directorySiteMap.txt');
  shell.ls('-LR', path).forEach(function(file) {
    output += file + '\n';
  });

  fs.writeFileSync('directorySiteMap.txt', output);
...

Quit if path isn't supplied

If the path isn't supplied, the script should state as such to the user and gracefully quit.

...
} else {
  console.log('Directory argument is required. Quitting.');

  process.exit(1);
}

Wrapping up

Now we should save this script as directorySiteMap.js and execute it using this command: node directorySiteMap.js <directory>

Friday, June 14, 2019

Write The Docs 2019

The following is a summary of some of the presentations I attended and enjoyed.

Draw the Docs, presented by Alicja Raszkowska, was interesting; she advocated for using more graphics (particularly cartoons) in technical documentation. While I enjoy the notion, one must know one's audience before adding cartoons to illustrate a product and/or points. She is also working on a tool called mermaid, which creates visual content similar to Visio but from markdown-style text input with custom images.

Sarah Moir's presentation called "Just Add Data: Make it easier to prioritize your documentation" makes a good case for using analytics and other feedback to sort out prioritization of which documents should get the tech writer's attention.

Matt Reiner gave a very energetic presentation called "Show Me the Money: How to Get Your Docs the Love and Support They Deserve" which outlines how to make a business case for getting more resources for documentation. In Matt's presentation, he provides a good and detailed method for creating a business case and how to pitch it to management. I believe this is a good resource for all tech writers!

"How to edit other people's content without pissing them off" by Ingrid Towey was an interesting presentation on editing other people's content. The four principles are as follows: assure the content originator that we are all on the same side; when editing content, offer an edit, not an edict; explain why you're editing their content (preferably before you do it); and get help when things don't go smoothly. Good ideas if one isn't already applying them.

Kathleen Juell's "Writer? Editor? Teacher?" presentation drew parallels showing how tech writers can leverage teaching philosophy (particularly at the college level) in technical writing. The topics she covered were basic documentation layout, design, and goals; providing templates; peer editing/reviews; and writing as a teacher or an editor would (clarity, explanation, and goals). As a former college teacher myself, I see the lines between a teacher and a tech writer as very blurry.

Shannon Crabill provided some thoughts and guidelines on managing documentation for an open source project in her talk "Documenting for Open Source". Some tips: avoid assuming your readers' technical knowledge (include a requirements section in your guides so readers don't get frustrated later in the document when they discover they cannot complete it), README files are required, show users how to get started, provide yourself or your team with templates (to avoid issues like duplicate PRs), and always provide links to any and all resources.

Heather Stenson provided some thoughts on how to get non-writers to contribute to documentation in a presentation called "Any friend of the docs is a friend of mine: Cultivating a community of documentation advocates". She defined who "friends of docs" are (those who write but are not technical writers), the different levels of friends of docs, how to get people to contribute more, strategies to find, support, communicate, and provide feedback to these friends, how to overcome obstacles friends of docs may encounter, and how to continue building this doc-friendly culture.

Chris Bush gave a dry-humor-filled presentation called "SDK Reference Manuals: A flow-based approach". Overall it was dry, but it reassured me that the process for creating, maintaining, and updating SDK docs hasn't really changed all that much in years.

The conference also live-streamed and posted all the presentations on this YouTube playlist: https://www.youtube.com/playlist?list=PLZAeFn6dfHpmuHCu5qsIkmp9H5jFD-xq-

Tuesday, May 14, 2019

Using Nightmare.js to Generate a Sitemap From Confluence

Introduction

In a recent project, I had a task to generate a list of documents in a particular Confluence space, and I chose to explore my options using Nightmare.js. This Node.js (version 10.11.0) module allowed me to programmatically enter my credentials into Confluence, navigate to a specific document, and gather a list of documents (thanks to the target document using the Children Display macro, which listed all the documents under the parent page of the target space). I also wanted this script to take arguments (flags) such as the username, password, spacekey, output file, and a delay value so the process could be automated for a variety of reasons.

Required skills and npm packages

This tutorial requires a number of skills and/or npm modules to complete everything mentioned herein:
  • Confluence (5.x): You should be comfortable with creating pages that utilize the Children Display macro
  • Nightmare (3.0.1): have some familiarity with the basics of this module
  • Commander (2.19.0): have some familiarity with the basics of this module
  • Cheerio (1.0.0-rc.2): have some familiarity with the basics of this module
  • CSS: basic knowledge of how to select elements
  • JavaScript: fair knowledge of how to use JavaScript

Setting up requirements

First, we set off with requiring a number of modules:

const Nightmare = require('nightmare');
const cheerio = require('cheerio');
const program = require('commander');
const fs = require('fs');

....

Set up nightmare and flag options

The next two lines set up nightmare to display its process as it goes through the steps we'll program it to navigate, and a selector to find the content we're looking for in our target document. confluenceSelector is the CSS selector that will be used to find the desired content in the main body of the Confluence document.

....
const nightmare = Nightmare({
    show: true
});
const confluenceSelector = '#main-content';

....

Note: if you don't want to see an Electron window pop up while nightmare does its stuff, set show to false.

Next, we set up the flags and their usage using commander's features:

...
program
  .version('0.0.1')
  .usage('-u <username> -p <password> -s <spacekey> -f <output.txt> -d <milliseconds>')
  .option('-u, --user', '*required* Username id')
  .option('-p, --password', '*required* User\'s password')
  .option('-s, --spacekey', '*required* Spacekey for the Confluence space')
  .option('-f, --file', 'Text file to be used for tracking Confluence document names. Can be set to either true (defaults to the spacekey naming scheme) or a file name.')
  .option('-d, --delay', 'Delay (in milliseconds) to wait for server response')
  .parse(process.argv);

...

With the flags set, we now need to parse them into an object that we'll use throughout the rest of the script. We loop through the program.rawArgs value provided by the commander module, looking for specific flags so we can pair each flag with its value.

...
var arguments = {};

for (var i = 0; i < program.rawArgs.length; i++) {
  if (program.rawArgs[i] == '--user' || program.rawArgs[i] == '-u') {
    arguments.user = program.rawArgs[i + 1];
  }
  if (program.rawArgs[i] == '--password' || program.rawArgs[i] == '-p') {
    arguments.pass = program.rawArgs[i + 1];
  }
  if (program.rawArgs[i] == '--spacekey' || program.rawArgs[i] == '-s') {
    arguments.spacekey = program.rawArgs[i + 1];
  }
  if (program.rawArgs[i] == '--delay' || program.rawArgs[i] == '-d') {
    arguments.delay = parseInt(program.rawArgs[i + 1]);
  }
  if (program.rawArgs[i] == '--file' || program.rawArgs[i] == '-f') {
    arguments.file = program.rawArgs[i + 1];
  }
}

...


Since the delay flag is optional, we should set up a fallback if the user doesn't supply one. In this case, we're setting the delay to 10 seconds, though you can adjust this value to however long you're comfortable waiting for your Confluence server to respond to a login page request.

...
if (!arguments.delay) {
  arguments.delay = 10000;
  console.log('Server response delay not set. Assuming ' + arguments.delay + ' millisecond delay.');
}

...

Now we should set up the file path where we keep the site map information. If the user doesn't supply a file to output our data to, the script will use a fallback based on the submitted spacekey name.

...
if (arguments.file) {
  if (arguments.file.length > 5) {
    var confluenceSiteMap = arguments.file;
  } else {
    var confluenceSiteMap = arguments.spacekey + '-site_map.txt';
  }
} else {
  var confluenceSiteMap = 'confluenceSiteMap.txt';
}

...

The next thing our script will need is the Confluence URL of the site map document. Using the Children Display macro in your target Confluence space, we can gather all the document links in a single space by scraping this one document. Note: you should set up this Confluence document accordingly before executing this script and ensure it's named Site Map. Otherwise, you'll need to change the value of arguments.confluence.

...

if (arguments.spacekey) {
  arguments.confluence = '<base Confluence URL>' + '/display/' + arguments.spacekey + '/Site+Map';
}

...

With the arguments parsed, we should check that the user supplied the required flags. If any of these flags weren't submitted, then the script should gracefully exit.

...
if (!arguments.user || !arguments.pass || !arguments.spacekey) {
  if (!arguments.user) { // user id is required
    console.log('Username is required.');
  }
  if (!arguments.pass) { // password is required
    console.log('Password is required.')
  }

  if (!arguments.spacekey) {
    console.log('Spacekey is required.')
  }

  process.exit(1);

...

Pull content with nightmare

With the required flags set, we can now request a document from Confluence using the supplied credentials. This chunk of code starts the nightmare.js process by navigating the Electron browser to the site map page in Confluence. The process below assumes that a login is required when the target page loads. It enters the user-supplied username and password in the appropriate fields (denoted by their element ids), clicks the login button (denoted by its element id), waits for a period of time (hopefully long enough for the server to respond), grabs the content from the predetermined CSS selector via the evaluate method, returns the data for parsing later, and closes the Electron browser.

...
} else {
  console.log('Getting document link list from ' + arguments.confluence);
  nightmare
    .goto(arguments.confluence)
    .type('#os_username', arguments.user)
    .type('#os_password', arguments.pass)
    .click('#loginButton')
    .wait(arguments.delay)
    .evaluate(confluenceSelector => {
      return {
        html: document.querySelector(confluenceSelector).innerHTML
      }
    }, confluenceSelector)
    .end()

...

Parse content with Cheerio

Now that nightmare.js has retrieved the document in question, we use the then method to load the HTML content into cheerio.js and generate a list of links. Generally speaking, the links listed in a Confluence document follow the li span a selector pattern inside the body of the document. Here, we use the output variable to hold the list of links found in the retrieved data.

...
.then(obj => {
  $ = cheerio.load(obj.html.toString());

  var output = '';

  $('li span a').each(function() {
    output += $(this).html() + '\n';
  });

...

Then, we write out the list of links we found in the Confluence document to our predetermined text file.

... 
  fs.writeFileSync(confluenceSiteMap, output, 'utf8');
})

...

Finally, we use the catch method to report back any errors.

...
  .catch(error => {
    console.error(error);
  });
}


Wrapping up

With the script complete, we should save it as something like confluenceSitemap.js. From there, we can execute this command to generate our text file of links: node confluenceSitemap.js -u <username> -p <password> -s <spacekey> -f <links.txt>

Sunday, April 14, 2019

Comparing Published and Unpublished Documents in Confluence

Introduction

I recently had a challenge to upload over a thousand HTML documents to Confluence. I won't go into the details of what scripts I created using various Node.js modules, but I did want to share with you how I maintained a list of documents that were or were not published to Confluence.

Requirements

You should be comfortable with a terminal interface, managing documents in Confluence, and the Confluence CLI plugin.

Using the Confluence CLI

I wrote a script that generates a list of documents and media files from a specified directory (I'll share that script and its process another time). From there, I used the Confluence CLI plugin to report back a list of files that had already been uploaded to Confluence. The command was pretty simple:

confluence --action getPageList --id "<parent page id>" --descendents > uploaded_docs.txt

Note: the confluence command itself needs to be set up as an alias in your Bash profile. The instructions for setting up the Confluence CLI plugin mention how to do some of this. My Bash alias looks something like this:

alias confluence="<path to confluence script>./confluence.sh --server <base Confluence URL> --user <user> --password <password>"

With that alias set up and a little forward thinking about how the space was going to be structured, I saved myself some time by parenting all the documents under one ultimate parent document. (I wrote a script that handles that task as well, which I'll share another time.) With a single parent document, the CLI command reported back all the documents I needed to work with in a single execution. Otherwise, I would have had to identify each parent document, execute the command on each one, and tally up all the uploaded documents.

From here, with the two lists in hand, it was a simple matter of finding the differences. There are several options out there to accomplish this, but in the end I just used Excel's conditional formatting feature to highlight the duplicates; the entries that weren't highlighted were the ones that still needed to be uploaded.

Maybe in the future I'll write a script that does this automatically from the two lists and share that process as well.

Thursday, March 14, 2019

Using Cheerio and Request to Scrape

Introduction

I've been heavily involved in content migration over the last few months. As a result, I've had to look for solutions for pulling content from one site and pushing it into another. Oftentimes, the source site wouldn't have an API to make my life easier. Enter the cheerio and request npm modules. This tutorial will walk you through a basic routine of requesting a document and pulling content from a select set of elements.

Requirements

You should be fairly comfortable with JavaScript and CSS selectors in general and have some working knowledge of how Node.js works prior to digging into this tutorial.

Required npm packages

In this tutorial, we'll need to ensure the following packages have been installed in your project directory:
  • cheerio
  • request
Note: This tutorial was written with Node.js (version 10.11.0).

Setting up requirements

As mentioned earlier, this script will use cheerio to parse content with jQuery-like features and request to fetch content from a document. The script also needs to accept two arguments when executed: (1) a source document and (2) a selector specifying which element to pull content from.

const cheerio = require('cheerio');
const request = require('request');
const url = process.argv[2];
const selector = process.argv[3];
....

Input error handling

If the user doesn't supply a URL and a selector, the script should fail right away instead of attempting to extract something.

....
if (!url || !selector) {
  console.log('You need to supply both a URL and a selector.');
  process.exit(1);
} else {
  <main routine>
}
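The presence check above can be tightened with the WHATWG URL constructor, which is available as a global in recent Node.js versions. This is an optional extra, not part of the original script:

```javascript
// The URL constructor throws a TypeError on malformed input,
// so a try/catch doubles as a validity check.
function isValidUrl(candidate) {
  try {
    new URL(candidate);
    return true;
  } catch (err) {
    return false;
  }
}

console.log(isValidUrl('https://example.com/page')); // true
console.log(isValidUrl('not a url'));                // false
```

This catches typos like a missing protocol before request wastes a round trip on them.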

Requesting and processing the body

The main routine of this script requests a document and processes it with cheerio so we can get at select parts of the content. If there isn't any issue in requesting the document and the status is good, we pass the body of the document to cheerio. From there, you can add whatever features you like to process the content.

request(url, (err, resp, body) => {
  if (!err && resp.statusCode == 200) {
    const $ = cheerio.load(body);
    $(selector).each(function() {
      // do something with the content
      console.log($(this).html());
    });
  } else if (err) {
    console.log(err);
  } else {
    console.log('Request failed with status ' + resp.statusCode);
  }
});

Usage

With the script complete, follow these steps to pull content from the web.
  1. Save this file as request.js.
  2. Open a terminal in the same directory as request.js.
  3. Execute node request.js <URL> <selector>, replacing URL with the web document you'd like to pull content from and selector with the element id or class you wish to pull content from. For example, try this one: node request.js https://crudthedocs.blogspot.com/2019/01/scraping-web-document-using-nightmarejs.html '.post-title.entry-title'
  4. Observe the output in the terminal.

Thursday, February 14, 2019

Creating a CLI For a Node.js Script

Introduction

I've been noodling around with allowing my Node.js scripts to accept arguments and decided it was time to document some of the basics of using a library to give my scripts the flexibility of CLI flags.

The npm package commander allows any Node.js script to accept flags (unordered arguments) and display usage information or warnings. This tutorial will walk you through the basics of setting up a CLI, confirming that a required flag was submitted, and displaying any errors with the CLI arguments.

Requirements

You should be fairly comfortable with JavaScript in general and have some working knowledge of how Node.js and command line interfaces work prior to digging into this tutorial.

Required npm packages

In this tutorial, we'll need to ensure the following package has been installed in your project directory:
  • commander
Note: This tutorial was written with Node.js (version 10.11.0).

Setting Up a Node.js Script With a CLI

Let's start out by requiring the commander module:

const program = require('commander');

Set the flags

Next, we need to set up what options our CLI will have. In this case, we'll set the usage info along with foo and bar flags, making the foo flag required. The version method can be any number you wish and is entirely optional, but it's nice to let your users know how many iterations this script has had, right? The usage method tells users how to invoke this script as a CLI.

program
  .version('0.0.1')
  .usage('-f <foo> -b <bar>')
  .option('-f, --foo', '*Required* foo')
  .option('-b, --bar', 'bar')
  .parse(process.argv);

Getting the arguments

With the flags set up, the script needs to be able to get at the arguments. We'll use an arguments object to hold the values entered by the user in the terminal. The program object has a nested array called rawArgs that we can iterate through, looking for matches to the flags we want to associate with the process argument input. It should be noted that rawArgs escapes some special characters like single and double quotes, exclamation points, dollar signs, and so on.

var arguments = {};

for (var i = 0; i < program.rawArgs.length; i++) {
  if (program.rawArgs[i] == '--foo' || program.rawArgs[i] == '-f') {
    arguments.foo = program.rawArgs[i + 1];
  }
  if (program.rawArgs[i] == '--bar' || program.rawArgs[i] == '-b') {
    arguments.bar = program.rawArgs[i + 1];
  }
}
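As an aside, the rawArgs loop above can be generalized into a small helper so each new flag doesn't need its own if block. This is a plain-JavaScript sketch, not part of commander:

```javascript
// Map flag aliases to property names, then walk the argument list once.
function collectFlags(argv, aliases) {
  const result = {};
  for (let i = 0; i < argv.length; i++) {
    const key = aliases[argv[i]];
    if (key) {
      // The value is whatever token follows the flag.
      result[key] = argv[i + 1];
    }
  }
  return result;
}

// Example with the same flags as above:
const flags = collectFlags(
  ['node', 'script.js', '-f', 'hello', '--bar', 'world'],
  { '-f': 'foo', '--foo': 'foo', '-b': 'bar', '--bar': 'bar' }
);
console.log(flags); // { foo: 'hello', bar: 'world' }
```

Adding a new flag then only requires a new entry in the aliases object.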

Fail or success

With the arguments properly stored, we can now either fail or allow the script to continue with the main routine. Since we are only requiring the foo flag, we'll have the script fail if it is not supplied by the user. Otherwise, the script will continue on to the main routine.

if (!arguments.foo) {
  console.log('Foo is required.');
  process.exit(1);
} else {
  console.log(arguments.foo, arguments.bar);
  ...main routine...
}

Wednesday, January 16, 2019

Scraping a Web Document Using Nightmare.js

Introduction

I recently learned about another method to harvest content from a website using nightmare.js. Using other libraries such as request.js (with cheerio.js) works fine, but if one needs to get around a login or navigate to get at the content, these libraries won't work. Enter nightmare.js and Electron. This document walks one through a basic setup of using nightmare.js to navigate to a site, log in, and grab content from a specific element.

Requirements

You should be fairly comfortable with JavaScript and CSS in general and have some working knowledge of how Node.js works prior to digging into this tutorial.

Required npm packages

In this tutorial, we'll need to ensure the following packages have been installed in your project directory (fs is built into Node.js):
  • nightmare
  • commander
Note: This tutorial was written with Node.js (version 10.11.0).

Scraping Content with Nightmare.js

Like any Node.js app, let's start off with the basics by requiring various modules. In this case, we are using nightmare.js to navigate a site, fs to write out the content to disk, and commander to set up flags for the script's arguments.

const Nightmare = require("nightmare");
const fs = require('fs');
const program = require('commander');
const nightmare = Nightmare({ show: true });
const selector = '.content';
...


Note: if you don't want to see Electron "jumping through all the hoops" to get at the content, you can set show to false. Using the commander library makes this script easier to use, as the input isn't order dependent. Finally, the selector variable is where the target content is located. In this case, the variable will be looking for an element with the content class. This variable can use any CSS selector you'd like to get at the desired content.

CLI setup

Next, we'll set up the flags for the script. In this case, we should only accept three required flags: user (id), (user) password, and the URL of the target document.

...
program
  .version('0.1.0')
  .usage('[required options] -u <username> -p <password> -url <url>')
  .option('-u, --user', 'Username id')
  .option('-p, --password', 'User\'s password')
  .option('-url, --url', 'URL for site')
  .parse(process.argv);
...


Setting the user credentials and URL

Now we should set up the values passed into the flags as variables to be used by the script. The arguments object will contain the user, password, and URL values. I covered how to set up a Node.js CLI earlier.

...
var arguments = {};

for (var i = 0; i < program.rawArgs.length; i++) {
  if (program.rawArgs[i] == '--user' || program.rawArgs[i] == '-u') {
    arguments.user = program.rawArgs[i + 1];
  }
  if (program.rawArgs[i] == '--password' || program.rawArgs[i] == '-p') {
    arguments.pass = program.rawArgs[i + 1];
  }
  if (program.rawArgs[i] == '--url' || program.rawArgs[i] == '-url') {
    arguments.url = program.rawArgs[i + 1];
  }
}
...


Note: commander doesn't process some special characters (e.g. ', ", !, $, and so on) for a variety of reasons. We won't get into that here today. So, if your password uses any of these special characters, it may not pass the string properly to the target server.

Exiting if required parameters are missing

Next, we'll check that all three required arguments were supplied. If any are missing, the script tells the user which ones and then exits.

...
if (arguments.user && arguments.pass && arguments.url) {
  ...
  <main routine>
  ...
} else {
  if (!arguments.user) {
    console.log('Username is required.');
  }
  if (!arguments.pass) {
    console.log('Password is required.');
  }
  if (!arguments.url) {
    console.log('URL is required.');
  }
  process.exit(1);
}
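The cascade of checks above works, but it can also be flattened into a single pass over the required field names. A sketch, not from the original script:

```javascript
// Return the names of any required fields that are missing or empty.
function missingFields(args, required) {
  return required.filter(name => !args[name]);
}

// Example: only the user was supplied.
const problems = missingFields({ user: 'jdoe', pass: '', url: '' }, ['user', 'pass', 'url']);
problems.forEach(name => console.log(name + ' is required.'));
// Prints: "pass is required." and "url is required."
```

This keeps the error messages consistent and means adding a fourth required flag is a one-word change.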

Main routine

And now for the main (routine) attraction!
We'll use nightmare to navigate Electron to our desired document, click the login button, wait a bit (hopefully long enough for the server to respond), enter our credentials, submit them, wait again, grab the content, write it out, and announce any errors.

...
nightmare
  .goto(arguments.url)
  .click('#login')
  .wait(5000)
  .type('#usernameInput', arguments.user)
  .type('#passwordInput', arguments.pass)
  .click('#submit')
  .wait(10000)
  .evaluate(selector => {
    return {
      html: document.querySelector(selector).innerHTML,
      title: document.title
    };
  }, selector)
  .end()
  .then(obj => {
    console.log('Processed ' + obj.title);
    fs.writeFileSync('./downloads/' + obj.title + '.html', obj.html);
  })
  .catch(error => { // catch any errors
    console.error('Failed to obtain content from ' + arguments.url);
  });
...

  • The goto method allows nightmare to load up the desired document.
  • The click method clicks on an element. In this case, we're going to click on a button with the id of login and (eventually) the submit button.
  • The wait method simply pauses the routine x number of milliseconds. This is often needed to wait for the server to respond to previously fired events.
  • The type method allows for text to be entered into fields. In this case, we are submitting our user id and password into the document elements with the ids of usernameInput and passwordInput.
  • The evaluate method tells nightmare to look in the document for an element matching the provided CSS selector. From there, we return two items to the script: the desired content and the title of the document, which we'll use as the name of the file we write out later.
  • The end method closes the Electron browser.
  • After the script has retrieved the desired content, it's time to do some processing on the returned object using the then method. In this method, we let the user know the name of the file the script is writing out and then write out the file with the desired content. Note, in this step, one can "massage" the content to fit their needs using cheerio.js or any other preferred method.
  • Finally, the catch method is used to catch any errors. Here, the script uses it generically to inform the user that the gathering process failed.