Friday, April 14, 2017

Automating Content Extraction From Confluence Using Exporter Plugins and Bash

Intro

As a technical writer, one of the many tasks I handle on a regular basis is pulling content from multiple sources and compiling it into a source file for publication. Early on, my process was very hands-on: manually generate a compressed file from Confluence and other sources, manipulate the extracted content, bundle everything up, and publish the refined package.

This tutorial's goal is to get novice users familiar with one method of automating content extraction by using a simple Bash script file that pulls content from Confluence using an exporter scheme URL.

The process contains two components: an export scheme URL and a Bash script. I have written two tutorials on how to extract content using two of Scroll's exporter plugins. Refer to Part 2 and/or Part 3 in the Export Content from Confluence series for details on how to generate the export URL using either Scroll's EclipseHelp or HTML Exporter plugins. You will need to complete at least one of them before starting this tutorial, as you will need the export scheme URL. The Bash script we will create automates the manual part of extracting content from Confluence.

This tutorial is written for Mac users. The Bash features used in this document have not been tested on Linux, but if you know your way around wget, this tutorial should work just fine.

Prerequisites

Confluence Command Line Interface

Did you know that Confluence has a CLI? Check out and install the Confluence Command Line Interface (CLI); it gives you command-line access to Confluence, which can be handy for checking against extracted content. I'll leave it up to you to decide whether you want to use it.

For installation, please review Confluence CLI Installation and Use.

For getting started, reference, examples, and much more info, please review Confluence CLI User's Guide.

wget

Another component of automating the export process is a command-line network transfer tool such as wget. There are a few options out there for handling CLI transfers (such as curl), but I've found wget to be rather flexible, good at handling redirects, and stable for my documentation needs. If you have arguments for or against, I'd love to hear them in the comments.
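
Before going further, it can help to confirm that wget is actually on your PATH, since macOS does not ship with it. The following is a minimal sketch; require_tool is a hypothetical helper name, not part of any existing tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the named command is on the PATH.
require_tool() {
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "Missing required tool: $1" >&2
    return 1
  fi
}
```

Calling `require_tool wget || exit 1` at the top of an export script stops early with a readable message. If wget is missing on a Mac, a package manager such as Homebrew (`brew install wget`) is one way to get it, assuming you use Homebrew.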

Setting up a "docbot" account for export

Prior to automating the export process from Confluence, you will want to create a non-human account that has only view and export permissions. If you don't plan on sharing or automating content extraction from Confluence, you can use your personal account, but I've seen many tech writers get bitten by doing so.

In this case, I named this new account "docbot". If you don't have the proper credentials to create Atlassian accounts, please contact your Confluence administrator and request the account be created. Otherwise, create a new user account:
  1. Navigate to the Confluence Admin page.
  2. Click on Users under the Users & Security section.
  3. Click Add Users.
  4. Enter docbot in the Username field.
  5. Enter docbot in the Full Name field.
  6. If you have a group email address that is shared with the tech writers, I recommend using that for the Email field. Otherwise, enter your email address.
With the docbot account created, we need to provide it with the proper permissions.
  1. Navigate to your target space's Space Admin page and click on Permissions.
  2. Under the Individual Users section, click the Edit Permission button.
  3. Locate the docbot account and enable only the following permissions:
    • All > View
    • Space > Export
  4. Once those two permissions have been set, click Save all.
Your docbot account should now have the proper permissions to export content from Confluence.

Bash script

Once you have run through the process of creating and saving an export scheme from either one of the aforementioned plugins, you will need to plug the resulting REST URL into our Bash script.

The Bash script will contain up to four lines: up to three variables and one command. You may wish to add a few additional lines before and after the export process, such as commands for setting up an export archive directory and for post processing the downloaded file (renaming the .jar file to a .zip file, unzipping it, and so on).

...
USER='docbot'
PASS='<docbot password>'
URL='<exporter scheme URL>'

wget --content-disposition "$URL&os_username=$USER&os_password=$PASS"
...


Breakdown of this script

The first three lines are just variables we will pass into the wget command. The fourth line is the backbone of the whole operation. Let's examine each component of this command:
  • wget - network transfer command
  • --content-disposition - flag that tells wget to name the downloaded file from the server's Content-Disposition header, which preserves the export's file name. Note: This flag is still experimental, though I've never had any problems with it.
  • "$URL&os_username=$USER&os_password=$PASS" - string that gets passed into the wget command. When pulling content from Confluence using the exporter scheme URL, you need to specify the exporter URL, the user requesting the export, and that user's password; Confluence checks these credentials before serving the export. If everything checks out on Confluence's side, your command will pull down a compressed file based on the settings in your exporter scheme.
As mentioned earlier, one could use curl to pull content from Confluence, but I found that command to be unreliable at times on my Mac. Maybe it was my flag options or some other setting, but wget has worked very well for me for years using this configuration.
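
One thing to watch in the query string: if the password contains characters such as & or #, they will be read as URL syntax rather than as part of the password. A minimal sketch of percent-encoding the credentials before building the string follows. The function names urlencode and build_export_url are hypothetical, the encoder handles ASCII input only, and it assumes (as the script does) that the exporter URL already contains a query string, so the credentials can be appended with &.

```shell
#!/usr/bin/env bash
# Percent-encode a string for use as a URL query value (ASCII input only).
urlencode() {
  local s="$1" out='' c rest
  while [ -n "$s" ]; do
    rest=${s#?}        # everything after the first character
    c=${s%"$rest"}     # the first character itself
    case $c in
      [a-zA-Z0-9.~_-]) out=$out$c ;;            # unreserved: keep as-is
      *) out=$out$(printf '%%%02X' "'$c") ;;    # anything else: %XX
    esac
    s=$rest
  done
  printf '%s\n' "$out"
}

# Hypothetical helper: append encoded credentials to an exporter URL.
# Arguments: URL, username, password.
build_export_url() {
  printf '%s&os_username=%s&os_password=%s\n' \
    "$1" "$(urlencode "$2")" "$(urlencode "$3")"
}
```

With these in place, the wget line could become `wget --content-disposition "$(build_export_url "$URL" "$USER" "$PASS")"`. If your password is purely alphanumeric, the plain string from the script works the same way.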

Note: If your Bash script will be shared with other users or run in a hosted environment, you may want to keep the USER and PASS values out of the script itself and define them in your Bash profile instead.
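
A minimal sketch of that idea, assuming the credentials are exported from your Bash profile. The variable names DOCBOT_USER and DOCBOT_PASS and the helper check_credentials are hypothetical; distinct names are used here because USER is normally already set to your login name, and redefining it in a profile could affect other programs.

```shell
#!/usr/bin/env bash
# In ~/.bash_profile (hypothetical names, placeholder values):
#   export DOCBOT_USER='docbot'
#   export DOCBOT_PASS='<docbot password>'

# Return non-zero if either credential is missing from the environment.
check_credentials() {
  if [ -z "${DOCBOT_USER:-}" ] || [ -z "${DOCBOT_PASS:-}" ]; then
    echo "Set DOCBOT_USER and DOCBOT_PASS in your Bash profile first." >&2
    return 1
  fi
}
```

The export script can then call `check_credentials || exit 1` and pass `$DOCBOT_USER` and `$DOCBOT_PASS` to wget in place of the hard-coded values, so the shared copy of the script never contains the password.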

From here, you can add any steps you need to the Bash script: post processing, executing Node scripts to modify extracted content, or whatever else belongs in this automated process. And if you need to run this process on a regular basis, you can simply set up a cron or Jenkins job.
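
As a sketch of what such extra steps might look like, the helpers below create a dated archive directory and rename a downloaded .jar to .zip. The function names, the directory layout, and the file names are all hypothetical; adapt them to whatever your exporter scheme actually produces.

```shell
#!/usr/bin/env bash
# Create <base>/<YYYY-MM-DD> and print its path (hypothetical layout).
make_archive_dir() {
  local dir
  dir="$1/$(date +%Y-%m-%d)"
  mkdir -p "$dir" && printf '%s\n' "$dir"
}

# Rename an exported .jar to .zip and print the new path.
jar_to_zip() {
  local jar="$1"
  local zip="${jar%.jar}.zip"
  mv -- "$jar" "$zip" && printf '%s\n' "$zip"
}
```

In the export script, adding `-P "$(make_archive_dir "$HOME/exports")"` to the wget command would download into the dated directory (-P is wget's directory-prefix option), and the path printed by jar_to_zip can be handed to unzip. For scheduling, a crontab line such as `0 6 * * 1-5 /path/to/export.sh` (a hypothetical path and time) would run the script at 6:00 on weekdays.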

Happy automating!