
Backstory: I started this site just after the World Wide Web was unleashed upon the public. At first, all was painstakingly hand-coded HTML. When WYSIWYG tools became popular – like Dreamweaver – there was a choice for the technically-minded: Userland Frontier. Frontier was an unbelievably powerful content management system (CMS), with features a decade ahead of the curve.

Frontier is dead. Userland is dead. TWENTY-FIVE years later there’s still no equal alternative. My site has lain fallow because of the time cost of maintaining an environment able to run Frontier.

Blogging, WordPress, Twitter, and Facebook have taken the focus away from personally-created and -curated websites. Twitter and Facebook seem too ephemeral for my liking, appropriate for sharing the news of finding a stone in my shoe, but not for travel musings and photos that I want to make available for the long term.

Part of me has been waiting for another empowering CMS to arrive, but that’s been lonely and unsatisfying.

After years of using WordPress to make pro bono websites for others, I’ve decided it’s worth breathing life into old content and making it easier to create new content than to collect notes and photos on my hard drive.

Being the dyed-in-the-wool geek that I am, I’m keeping my process notes here, where others may find hints for their own needs.

Starting point

My website is 13559 pictures wrapped in 1709 web pages. That’s too many pages to forgo automation; I’m going to use software to do as much of the migration work as possible.

Below are some command-line invocations on UNIX: the salient commands that worked for my situation. They should be enough to get you started finding, installing, and using the same tools.

How’d I figure that out?

% find . -name '*.html' | wc -l
    1709
% find . -name '*.gif' -o -name '*.jpg' -o -name '*.jpeg' | wc -l
   13559

Creation Time

The first thing that came to mind was to capture the creation time of each page, so I could note it visibly in the content. Sadly, POSIX doesn’t capture creation time, so that’s been lost. (Despair not; I put timestamps onto pages as it occurred to me, so there’s that.)
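For what it’s worth, you can see which timestamps a file does still carry. The sketch below uses GNU stat (macOS/BSD stat wants different flags); %y is the last modification and %w the birth time, printed as a dash when the filesystem never recorded one:

% stat -c 'modified: %y  born: %w' x.html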

I mention this early failure to remind us all that there’s no way to go but up :-)

HTML sucks

Don’t get me wrong: HTML certainly has its place, but it’s horrible to intermix the fresh user content with the HTML necessary to specify format and layout.

A much better solution would be to use Markdown, a very lightweight markup language (to identify the parts of the content: headings, paragraphs) plus CSS (to specify the styling of everything you see).
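To make the contrast concrete, here’s the same hypothetical snippet in each form (invented for illustration). In HTML:

<h2>Las Palmas</h2>
<p>We arrived at <em>dawn</em>.</p>

And in Markdown:

## Las Palmas
We arrived at *dawn*.

The words stay legible in the Markdown version, and CSS decides how they look.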

Using Markdown within WordPress requires installing a plug-in. I chose Parsedown for WordPress; then I chose JetPack.

But I have 1709 pages with HTML gobbledygook all over the place. How depressing. Nobody can crunch through so many pages, making detailed cuts and edits to tease out the content. What we need is software and hardware to do all this.

The open source software community

In the early days of computing, there was a zero-sum mindset mandating competition; software was proprietary, locked to a particular type of computer. This severely hobbled programmers, and the advent of a cooperative worldview was perhaps the most important factor in the ever-faster evolution of the tech sector.

Finally, programmers could bypass the “not invented here” mentality and benefit from and build upon the work of others. Apache, for example, an open-source free-for-everyone web server, has long powered the majority of the world’s websites. Freely sharing your passion and expertise, enabling others to create, is a very happy thing. It gives non-programmers abilities they’d otherwise never have, and it allows programmers to build better things without the time cost of maintaining software outside their core competency.

That’s a long aside to get back on our track of migrating a website. Where were we? Ah, converting old HTML pages into clean Markdown.

Imagine the time consumed by opening 1709 documents one by one, selecting a chunk of text, deleting it, and saving. Add making complex edits to each document. Factor in the inevitable errors a tired human would make, and the time needed to track down and correct those errors.

We need something to help us convert HTML to Markdown.

First things first

I started processing my HTML files before I realized that I needed to pre-process them first, removing all the common stuff that is duplicated in each of them (and which the new CMS will generate from now on).

The wrong kind of line-endings

tr and sed are UNIX programs to translate text.

After I pulled the website files down to my computer I realized that some of them had their line-endings as “carriage returns”, not “newlines” (and not “CRLF” either). The UNIX translator utility, tr, makes short work of this.
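A quick way to spot the affected files is the file utility, which calls out the odd line endings; the output below is typical rather than verbatim:

% file x.html
x.html: HTML document text, with CR line terminators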

NOTE: As we walk through this process we’ll be following a file I’ll call x.html.

This is the canonical use of the command:

tr '\r' '\n' < foobar.html > foobar.t

but as I want to use find to traverse the entire web page tree and do this in one fell swoop – and find and pipes don’t play well together, since the command -exec runs isn’t interpreted by a shell, so a pipe has no meaning there – what I actually did was to generate commands to transform one web page at a time and feed them to the shell:

find . -name '*.html' -exec echo "cat {} | tr '\r' '\n' | tee {}.t" \; | sh

Checkpoint: We’ve made x.html.t – complete with corrected line-endings – to sit alongside x.html and its bad line-endings.

Static headers and footers

Web pages, like book pages and business letters, have a header above and a footer below the main content. The new CMS will generate these for us, so it’ll make migration much easier if we can isolate just the page content.

Luckily for us, headers and footers often have text that’s not found in the content, so if we can identify something that demarcates the sections we can just delete from

  1. the top of the file to the end of the header
  2. the top of the footer to the end of the file

In this example web page you’ll see comment lines which surround the content. This is the best case for us.

A web page:
HEADER HEADER HEADER HEADER HEADER HEADER HEADER HEADER HEADER HEADER HEADER
<!-- post -->
CONTENT CONTENT CONTENT CONTENT CONTENT CONTENT CONTENT CONTENT CONTENT CONTENT
<!-- post navigation -->
FOOTER FOOTER FOOTER FOOTER FOOTER FOOTER FOOTER FOOTER FOOTER FOOTER FOOTER FOOTER

The UNIX stream editor sed easily chomps through a file like this.

sed '1,/<!-- post -->/d;/<!-- post navigation -->/,$d' file

So what I actually used to traverse the entire tree of HTML pages was:

find . -name '*.t' -exec echo "sed -e '1,/, Helvetica/d;/<!-- page footer -->/,\$d' -e '/local navigational elements/,\$d' -e 's+<p>+\n\n\n+g' {} | sed -e '1d;\$d' | tee {}.html" \; | sh

(In order: delete from the top of the file through the header’s “, Helvetica” font line; delete from the page-footer comment to the end; likewise from the local navigational elements onward; and turn each paragraph tag into blank lines so paragraph breaks survive. The second sed trims the stray first and last lines that remain.)

Now we’ve got a file full of textual content and HTML formatting instructions. We want to tease out the complicated HTML and replace it with lightweight Markdown mark-up commands.

An example of how error-prone and exhausting it’d be to do this manually: below is a sketch of the mark-up needed to describe the tabular way I put the arrow image and some text next to each other; the only part we actually care about is the text. Can you imagine skimming page after page of HTML, looking for the nuggets, and then excising the HTML? Ugh.
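Here’s a representative reconstruction of that table mark-up (the attributes and file names are invented):

<table>
  <tr>
    <td><img src="arrow-right.gif" alt="arrow"></td>
    <td>Onward to Las Palmas de Gran Canaria</td>
  </tr>
</table>

All that scaffolding exists to carry one short line of text.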

Protip: Unweave interwoven HTML into separate CSS & Markdown files!

So I hope it’s clear why time spent up-front in figuring out how to migrate this website is a much better time investment than undertaking the grueling, mind-numbing manual process.

pandoc

While I came across a few free, open-source ways to convert from HTML to Markdown, the recommendations for pandoc stood head and shoulders above the rest.

There are many free tools out there; my choices reflect only the best I could find at the moment I looked, with whatever time I had. You may find better. Let me know.

The manual page for pandoc is pretty straightforward. This is what I wound up using:

find . -name '*.t.html' -exec echo "pandoc -s --normalize --preserve-tabs --wrap=none --from html --to markdown_strict --output {}.md {}" \; | sh

Checkpoint: Welcome, x.html.t.html.md. Have you asked yourself yet why I’m breadcrumbing my progress? Because I’m working on a lot of files en masse, I never know when the process will take an unexpectedly-formatted input and mangle the output. With all of our steps preserved right here, we can manually recover from a small issue, or re-start the process mid-way. (Reading from left to right you can follow each transformation, captured in the filename.)
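One quick sanity check, assuming nothing else on disk produces .md files: count the outputs and compare against where we started.

% find . -name '*.t.html.md' | wc -l
    1709

If the count comes up short, the breadcrumbed filenames make it easy to see which pages stalled at which step.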

Next steps

These must be resolved before the migration can be considered complete.

Page References

Confirm that a WordPress permalink to a post can be made to mimic the existing structure.
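If it can, I expect the answer lives in Settings → Permalinks, using WordPress’s permalink structure tags; something like the following, though the exact structure is still to be determined:

/%category%/%postname%/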

Then convert Markdown links of the following format by removing the nested ../ portions, so that existing pages refer to other existing pages; a sed sketch follows the example.

Las Palmas de Gran Canaria
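Here’s a sketch of that conversion, using the same generate-commands-and-pipe-to-sh pattern as before. I’m assuming links that look like [Las Palmas de Gran Canaria](../../laspalmas.html) – that path is invented – so adjust the pattern to the real layout:

find . -name '*.md' -exec echo "sed -e 's+](\(\.\./\)*+](+g' {} | tee {}.fixed" \; | sh

The sed expression strips any run of leading ../ segments from each link target.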

Image References

Image references that survive as raw HTML (an <img> tag whose alt text reads “airport”, for example) have not been converted to Markdown. Figure out if (and how) bulk uploading of images can result in a permalink that’s the same as our existing structure.
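For reference, this is the Markdown image form these need to end up in, with whatever path the bulk upload produces; the path below is just the WordPress default upload location, used for illustration:

![airport](/wp-content/uploads/airport.jpg)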

SQL needed to insert posts, dates, images

While you’re in the neighborhood, Michael-of-tomorrow, figure out what SQL we need to set post-creation times, etc.
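A first guess at the shape of that SQL, run against the default wp_posts table (the database name, user, post name, and date below are all invented):

% mysql -u wpuser -p wordpress -e "UPDATE wp_posts SET post_date = '1999-06-01 12:00:00', post_date_gmt = '1999-06-01 12:00:00' WHERE post_name = 'las-palmas';"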

Linking between pages by title

Q: If the existing structure can be mimicked, then nothing more needs to be done for links between pages. Going forward, though, how do I link to a page by title?

A: SOLVED! See the [link] shortcode, written up in “WordPress shortcode link for Posts & Pages”.
