It is often useful to have a portable document that contains the help pages for a product or platform and works both offline and online, on any device and any OS. Despite claims that PDF is dead or dying, it remains a predominant format for creating and reading technical documents.
I wanted to try converting the ever-growing volume of AppSheet help pages, which are published using Zendesk, to PDF. Ideally, I envisioned a fully automated process: a script would run on my Mac every night, collect all the new updates from Zendesk using wget, and generate a PDF. Little did I realize that working with Zendesk HTML pages is not always that simple. In this blog post, I will provide a short summary of my journey and what I was able to accomplish.
My first idea was to convert the HTML pages to Markdown. I chose Markdown because it is simple and does not involve a steep learning curve. I thought I would simply download all the pages using wget, run html2text, and get all the pages as Markdown. I did not yet know how I was going to convert the Markdown to PDF but, luckily, I was reminded about the DITA-OT Markdown plugin. Although this method kind of worked, I had little control over the Markdown files that html2text generated. Also, since this approach did not let me leverage many of the DITA capabilities required for content reuse, I abandoned it.
Having played with Beautiful Soup before, I thought I could use it to get the HTML, parse it, and convert it to Markdown using html2text. I found a Python script that did something close to what I was attempting, so I used it as a guide and wrote some Python code to fetch pages from Zendesk and convert them to Markdown. To transform the Markdown to PDF, my idea was again to use the DITA-OT Markdown plugin. Although my script worked, and I was able to use the DITA-OT to transform a few pages, differences in encoding caused several issues and I had to do a lot of manual editing. Since there were over 120 pages, I decided to shelve this approach and look for some other way.
I should note that while I was trying this method, I also looked at Pandoc,
I would have preferred to convert the HTML pages to DITA using one script, but that would have involved a lot more work. I opted instead to clean up the HTML, create simpler HTML pages, and run these pages through the XHTML to DITA transform that is available in my favorite XML editor, Oxygen.
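To give a rough idea of what "cleaning up the HTML" can look like, here is a minimal sketch that uses only the Python standard library. It is not the script from my repository: the tag and attribute whitelists, the `Simplifier` class, and the `simplify_html` function are all hypothetical names and choices made for illustration, and a real cleanup for Zendesk pages would likely need more rules.

```python
from html import escape
from html.parser import HTMLParser

# Assumed whitelists (illustrative only): tags and attributes worth keeping
# before handing the page to an XHTML-to-DITA transform.
KEEP = {"h1", "h2", "h3", "p", "ul", "ol", "li", "pre", "code", "br",
        "table", "tr", "th", "td", "a", "img", "b", "i", "em", "strong"}
KEEP_ATTRS = {"a": {"href"}, "img": {"src", "alt"}}
VOID = {"img", "br"}                            # tags with no closing tag
SKIP = {"script", "style", "nav", "footer"}     # dropped together with their contents


class Simplifier(HTMLParser):
    """Re-emit only whitelisted tags and attributes; drop everything else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a SKIP element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in KEEP:
            return  # the tag is dropped, but its text content is still kept
        allowed = KEEP_ATTRS.get(tag, set())
        kept = ""
        for name, value in attrs:
            if name in allowed:
                kept += ' {}="{}"'.format(name, escape(value or ""))
        # Self-close void elements so the output stays XHTML-friendly.
        self.out.append("<{}{}{}>".format(tag, kept, " /" if tag in VOID else ""))

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in KEEP or tag in VOID:
            return
        self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def simplify_html(source):
    """Return a simplified copy of an HTML fragment."""
    parser = Simplifier()
    parser.feed(source)
    parser.close()
    return "".join(parser.out)
```

Calling `simplify_html()` on a fragment strips wrapper elements such as `div`, removes `class`, `id`, and `style` attributes, and discards scripts entirely, which leaves far less for the DITA transform to choke on.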
I had to modify my Python code, fix validation errors in DITA, and rebuild the whole PDF document several times to fix numerous conversion errors, but I was finally able to accomplish most of what I set out to do. One thing I had to do was aggressively downsize, and change the DPI of, many of the graphics files, as they were too large to fit on a page. I used XnViewMP for this as it has excellent batch conversion capabilities. It is possible that some graphics became far too small in the process.
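I did the resizing in XnViewMP, but the underlying arithmetic is simple enough to sketch. The function below is a hypothetical illustration, not part of my workflow: the name `fit_to_page`, the 6.5 x 9 inch printable area, and the 150 DPI target are all assumptions you would adjust for your own page layout.

```python
def fit_to_page(width_px, height_px, max_width_in=6.5, max_height_in=9.0, dpi=150):
    """Return (width, height) in pixels so the image fits the printable area.

    The aspect ratio is preserved and images that already fit are left
    untouched (no upscaling). All default values are assumed, not measured.
    """
    max_w = max_width_in * dpi   # printable width in pixels at the target DPI
    max_h = max_height_in * dpi  # printable height in pixels at the target DPI
    scale = min(max_w / width_px, max_h / height_px, 1.0)
    return max(1, round(width_px * scale)), max(1, round(height_px * scale))
```

With these defaults, a 1950 x 1000 pixel screenshot would be scaled by half to 975 x 500, while a 400 x 300 image would be left alone. Capping the scale factor at 1.0 and scaling only as much as the page requires is one way to avoid shrinking graphics more than necessary.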
I still don’t have all the gotchas worked out to completely automate the process, but I have posted my code and the generated files to a GitHub repository. If anyone is interested in improving the scripts to benefit all AppSheet users, I encourage them to fork the repository and contribute whatever they can.