
Automatic Template Migration

The following notes cover a task undertaken in 2002 to batch convert hundreds of HTML files on a legacy website from one template style to another. The effort grew into more of a project in scale, but the lessons learned were of use for subsequent migrations of other sites into more manageable forms.

The reason for migrating this content from one form to another was to remove the hard-coded tables-based layout embedded in each page and pour the content (minus navigation, branding and other chrome) into a simpler template form that uses external SSI files to build the page layout and peripheral non-content components on the fly.

The plan for migration was to generate a list of files to be processed, identify the table layout used in each page, group pages by table-layout type, then hopefully establish rules for transformation of each of the common layout-groups into a new SSI template style for each subsite.

One major problem with HTML content on a larger web site with multiple authors is the diversity of text and WYSIWYG tools used to edit the content, and the varying interest on the part of authors in standards-conformance or ensuring the HTML is valid. Modern browsers are fairly tolerant of mark-up errors and a scripted solution to extract actual content from web pages would need to be similarly complex and tolerant.

The activities were broken down into the following areas:

  • Page Layout Identification
  • Content Extraction
  • Querying Extracted Content
  • Export To New Template/Layout
  • Putting it All Together

1. Page Layout Identification:

Table Signature Report

Wanted to get an idea of the number of variations in table layouts used across the site, so went through the following process:

  • Generated a directory listing of .shtml files, excluding unwanted files and directories, e.g.

    find ./ -name "*.shtml" | egrep -v "paths|we|dont|want" > dirlist.txt
    (found 322 files on 8/11/2001)
  • Wrote and ran a table extractor script over each file, generating a signature report database:
    showTables.pl < dirlist.txt > report.txt
    output format: "/filepath/filename:signature"
    sample: ./folder/file.shtml:<TABLE><TR><TD><TABLE><TR><TD>......</TR></TABLE></TD></TR></TABLE>
  • Sorted the report file and ran a script against it which tallied up the number of docs sharing the same sig:
    unique_tally.pl < report.txt > sigs.txt
  • Sorted the sigs file to see which sigs are the most common:
    sort -n +1 -t : sigs.txt > sigs2.txt
  • Ideally, where a sig is used by more than 'n' files, generate a translation rule to crunch those files from the old format to the new; where the doc count for a sig is below 'n', the page needs to be migrated manually.

Turns out there were 144 unique signatures in 322 individual files, so the idea of automating based on a simple extraction of the TABLE signature was not practicable. The top 20 sigs accounted for 179 of the 322 total files, and the counts quickly dropped off to the point where many sigs matched only one document.
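The signature-and-tally idea can be sketched in a few lines. This is a Python illustration rather than the original Perl (showTables.pl and unique_tally.pl are not reproduced here); the class and function names and the three sample pages are made up for the example, and the set of tracked tags is an assumption that can be widened to include headings.

```python
from collections import Counter
from html.parser import HTMLParser


class TableSignature(HTMLParser):
    """Record open/close tags for layout elements and ignore everything else."""

    def __init__(self, tracked=("table", "tr", "td")):
        super().__init__()
        self.tracked = set(tracked)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names, so mixed-case source is normalised.
        if tag in self.tracked:
            self.parts.append(f"<{tag.upper()}>")

    def handle_endtag(self, tag):
        if tag in self.tracked:
            self.parts.append(f"</{tag.upper()}>")


def signature(html_text, tracked=("table", "tr", "td")):
    """Return the layout signature of one page as a single string."""
    parser = TableSignature(tracked)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)


def tally(signatures):
    """Count how many documents share each signature, most common first."""
    return Counter(signatures).most_common()


# Hypothetical sample pages standing in for the files named in dirlist.txt.
pages = {
    "./a.shtml": "<table><tr><td><p>One</p></td></tr></table>",
    "./b.shtml": "<TABLE><TR><TD>Two</TD></TR></TABLE>",
    "./c.shtml": "<table><tr><td><table><tr><td>x</td></tr></table></td></tr></table>",
}
sigs = {path: signature(text) for path, text in pages.items()}
for sig, count in tally(sigs.values()):
    print(f"{count}:{sig}")
```

Passing a different `tracked` tuple (for example adding "h1") changes what counts as part of the layout, which is the knob the later re-runs turn.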

Removal of H2s

Decided to re-run the table extractor with <H2> tags removed from the sig file database, as their position in the page is probably not as significant as that of the <H1> tags, and if there is more than one <H2> (which is likely) we get a new sig instance when really the page is probably much the same layout.

  • The new report file was a little smaller without the markers for the <H2> tags.
  • When sorted and run through the unique tally script, the resulting sig frequency file showed 102 unique sigs, the top 20 of which now accounted for 227 of the 322 files. A slight improvement.
  • Decided to take a list of typical page layout variations as identified by the client, extract the filenames, and run the table extractor against these.
  • Took the sig results from that, sorted them, and ran the unique tally script against them, which resulted in 16 unique sigs across the 36 files.
  • Took each of these 16 and identified the start and end points of the content using my De Tabliser tool, then removed the nested tables between these points as they were user-authored content tables.

Web Fetch vs File-System

The above analysis was on pages fetched from the file system, where any content included by SSI doesn't appear. Some of the reverse navigation used on the site was wrapped in tables, so a file-system fetch generated a different sig from one done as a web request.

  • Repeated table extractor but this time using URLs which produced a different sig report and sig frequency report with 15 unique sigs across the 36 files.
  • Did the same for the full directory listing, created a new dirlist file with URLs instead of filesystem references.
  • Repeated table extractor but this time using URLs, sorted output and ran the unique tally script which produced a different sig report and sig frequency report with 100 unique sigs across the 322 files.

The top 20 now accounted for 228 of the 322 files.

http://webdev.co.nz/projects/templates/tablesigs_urls.gif

Outside-In vs Whole Signature

Clearly the whole signature identifier was not going to be overly useful. We could reasonably transform 2/3rds of the files, but would be left with around a hundred files which would need manual or carefully reviewed automagic transformation to the new style.

Another option would be to work from the outside-in towards the content window for all files instead of making a complete match on table structure to see if we can match file layout that way, thereby ignoring user created tables in the content window.

Identification of Layout Groupings

After eyeballing the sorted URL fetched sig report for H1 only signatures I split the file list into groups with identical start-end stems. The resulting file showed there were only four major start-end stem sets that account for all but 17 of the total 322 files.

With these groups identified it shouldn't be difficult to batch process the files in each group using some parsing rules.

Re-ran this eyeballing with an updated copy of the existing live client site and widened the groupings to 11 major template styles for a little more flexibility. Also added a new column with the template grouping identifier at the end of each line, saved in a new results file.

Created a new table extracting Perl script and added an array of signature stubs for each of the four major groupings and ran it over the file-set generating the usual table-sig but then doing a regexp comparing it to each of the start-end table sig stub pairs. There were numerous files which matched multiple sig stub pairs.
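The start-end stub comparison might look something like the following Python sketch (the original was a Perl script which isn't shown here). The stub strings and template names are hypothetical placeholders, and trying the longest start stub first is just one way to keep a signature from being claimed by more than one pair.

```python
import re

# Hypothetical start/end stub pairs; the real ones were read off the sig report.
STUBS = {
    "template3": ("<TABLE><TR><TD>", "</TD></TR></TABLE>"),
    "template2": (
        "<TABLE><TR><TD><TABLE><TR><TD>",
        "</TD></TR></TABLE></TD></TR></TABLE>",
    ),
}


def classify(sig, stubs=STUBS):
    """Return the first template whose start and end stubs bracket the signature."""
    # Longest start stub first, so the more specific layout wins.
    ordered = sorted(stubs.items(), key=lambda item: len(item[1][0]), reverse=True)
    for name, (start, end) in ordered:
        pattern = re.escape(start) + ".*" + re.escape(end)
        if re.fullmatch(pattern, sig, flags=re.DOTALL):
            return name
    return None


nested = "<TABLE><TR><TD><TABLE><TR><TD><H1></H1></TD></TR></TABLE></TD></TR></TABLE>"
flat = "<TABLE><TR><TD><H1></H1></TD></TR></TABLE>"
print(classify(nested), classify(flat), classify("<H1></H1>"))
```

Note that the nested signature also matches the shorter template3 pair, which is exactly the multiple-match situation; the ordering decides which one is reported.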

The challenge then was to re-order the sig-stubs so we don't get multiple matches. As template 2 has a longer start-stub, we can assume that if a match is made for it before the following shorter stub of template 3 then we can use template 2 for the conversion.

Storing Template Grouping Data

Took the new results file and converted it to an XML file dirlist.xml for use in the migration stage to follow. Then used an XSL stylesheet to transform that into an XHTML page which could be used later by the client for testing of the conversion results.
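As an illustration of the index file, here is a small Python sketch that turns (filepath, template group) rows into XML. The element and attribute names (`dirlist`, `file`, `path`, `template`) are guesses for the example only; the actual schema of dirlist.xml isn't recorded in these notes.

```python
import xml.etree.ElementTree as ET


def build_index(rows):
    """Build an XML index with one <file> element per (path, group) row."""
    root = ET.Element("dirlist")  # hypothetical root element name
    for path, group in rows:
        ET.SubElement(root, "file", {"path": path, "template": str(group)})
    return root


# Hypothetical rows standing in for the results file.
rows = [("./folder/file.shtml", 2), ("./folder/other.shtml", 3)]
index = build_index(rows)
xml_text = ET.tostring(index, encoding="unicode")
print(xml_text)

# Looking a file up again during the migration stage.
entry = index.find("file[@path='./folder/file.shtml']")
print(entry.get("template"))
```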

2. Content Extraction

I had chosen to write the transformation scripts in Perl, so had a good look through CPAN to see what modules were available for extraction of content from table-based layouts of HTML content. One key requirement was that the module be based on a parser rather than some read-line-and-regexp type hack which would soon grow out of control.

After a few emails to custodians of various Perl modules there appeared to be three feasible options worth looking at to handle extraction of the data from the structures identified above then dropping that data into a new template form.

  1. Use AnyData::HTMLTable to generate node-structure then query using SQL/DBI
  2. Use HTML::TreeBuilder with custom handler conversion in stream-mode
  3. Use HTML::TreeBuilder to generate XML node-structure then query using XPath

1. Any-Data and SQL Method

There was something I really liked about this suggestion, i.e. querying HTML using SQL, but seeing as the Any-Data modules were still largely immature version-wise I thought I'd tackle the Tree-Builder method first.

The following diagram shows Perl modules and dependencies and a process that could do what was required...

http://webdev.co.nz/projects/templates/anyData_modules.gif

2. Tree-Builder and stream-mode handlers

This would be the best means of having solid and low-level control during the parse of the source content, but I assumed the complexity of the handler logic to correctly process nested elements would quickly become unmanageable.

I therefore began leaning towards the conversion to XML and subsequent XPath query method that follows.

3. Tree-Builder and XPath Method

Tree-Builder is a part of the HTML-Tree distribution, was originally authored by Gisle Aas and is now maintained by Sean Burke. The bundle was first released in March 2001 and is currently at revision 3.11.

The following diagram lists the process that could do what's required. Perl modules and dependencies are shown...

http://webdev.co.nz/projects/templates/treeBuilder_modules.gif

After updating HTML::Parser so I could install the more recent copy of HTML::DOMbo I wrote a new Perl script to manage the parse-and-convert-to-XML stage. It writes the XML tree out to STDOUT which worked well when called from a shell script.

Next I needed to get the XPath query sorted against the in-memory DOM structure. Installed XML::XPath and knocked out a new script which took the XML source and ran an XPath query against the DOM that had been loaded (assuming the source was well-formed).

While attempting to parse the content I got non-well-formed error messages at the point in source files where a non-breaking space (&nbsp;) had been replaced by a sequence of real spaces. I would've kinda expected the method to_XML_DOM in DOMbo to have encoded this properly on the way out but it looks like it's not.

There were also acute a characters (&aacute; / á) which the script didn't like. Needed to find out why TreeBuilder->dump and DOMbo->as_XML_DOM were doing this, but TreeBuilder->as_HTML wasn't. Assumed it's gotta be something in HTML::Parser or HTML::Entities.

Tried using Dave Raggett's HTML-tidy utility, which also supports XHTML transformation, and it replaces the &nbsp; sequences with the numeric entity &#160;. XML predefines only &lt;, &gt;, &amp;, &quot; and &apos;, so a named entity like &nbsp; has to be either declared in a DTD or rewritten as a numeric reference for the output to be well-formed. DOMbo wasn't doing this replacement; it was just writing the high-ASCII (>127) value out as a raw character when it should have been written as &#160;.

After replacing all &nbsp; characters in the source file the html_XML process seemed OK. Looked like all but PCDATA would need replacing, but this seemed a bit flaky. Another option would be to create an additional DTD which included all the special entities that are contained in the files to be converted, and let the XML conversion process haul in their declarations. This also seemed a bit fragile however.

Changed html_XML.pl to call the encode_entities function, but its default encoding action is to encode everything, including the <, >, & and " characters, which is useless as the output is no longer XML; it's a sequence of escape codes.

Fortunately encode_entities() allows you to specify which characters are to be considered unsafe and in need of encoding, but as per the above paragraph I really didn't fancy determining the whole set of codes in advance. It seemed HTML-tidy would have to do, as it manages this translation to numeric entities very well.
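To make the translation concrete, here is a minimal Python sketch of the kind of rewrite being described: named HTML entities and raw characters above 127 become numeric references, while the markup characters and the XML predefined entities are left alone. This is not what HTML-tidy or encode_entities() do internally; the function name is made up, and the entity table is Python's standard `html.entities.name2codepoint`.

```python
import re
from html.entities import name2codepoint

# Entities that XML already understands without a DTD; leave these untouched.
XML_PREDEFINED = {"lt", "gt", "amp", "quot", "apos"}


def to_numeric_entities(text):
    """Rewrite named entities and non-ASCII characters as numeric references."""

    def replace_named(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # keep predefined or unknown entities as-is
        return f"&#{name2codepoint[name]};"

    text = re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace_named, text)
    # Raw high characters (e.g. a literal non-breaking space or accented letter).
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


print(to_numeric_entities("caf\u00e9&nbsp;&aacute; &amp; <b>bold</b>"))
```

Because only the entity names and high characters are touched, the tags themselves survive, which is the part the default encode_entities() call got wrong for this purpose.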

3. Querying The Extracted Content

Now that I'd committed to the XPath query method it was time to actually read something about XPath queries. Had been keeping a set of notes and typical queries over the last few months and started to build a common set of queries that I thought would be useful for HTML extraction.

Given the TABLE based layout used throughout the site to be converted a typical query might be something like...

/html/body/table/tr/td[2]/*

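Which returns the child elements of the 2nd table cell <td> in each table row <tr> of each top-level <table> in the page body; where the layout has a single outer table with one row, that is the 2nd cell of the 1st row of the 1st table (to pin it down explicitly, add positions as in table[1]/tr[1]). This extracted content fragment can then be poured into a new output page style/template.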
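A small runnable illustration of that query, using Python's standard xml.etree.ElementTree rather than the Perl XML::XPath module used on the project. ElementTree only supports a subset of XPath and its paths are relative to the root element, so the leading /html becomes "."; the sample page is made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed page: navigation in cell 1, content in cell 2.
PAGE = (
    "<html><head><title>t</title></head><body>"
    "<table><tr>"
    "<td><p>nav</p></td>"
    "<td><h1>Title</h1><p>Body text</p></td>"
    "</tr></table>"
    "</body></html>"
)

root = ET.fromstring(PAGE)  # root is the <html> element
fragments = root.findall("./body/table/tr/td[2]/*")

for element in fragments:
    print(ET.tostring(element, encoding="unicode"))
```

The `*` step selects element children only, so any bare text sitting directly inside the cell would need a separate text query.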

What's required is a large enough set of similar queries such that a group of them can be applied to a given page depending on the location of the significant content areas in that page. As I'd identified about 10 different page layout styles there would be 10 XPath query sets that would need to be assembled. Fortunately many of the queries were very similar.

4. Export To New Template/Layout

Now that I could extract content fragments I needed to then pour them into a container file or template that will make some external call to handle assembly of the objects that are peripheral to the actual body of the content window. Things like headers, footers, branding elements, common site navigation.

Also needed to be able to specify different page layouts for the various sub-sites and possibly provide print-friendly versions of content. Another added complication was that some sub-sites (and even folders within them) had files in the same folder which are of a different template layout so output template layout cannot be driven by path alone.

Server-Side Includes

The way we'd been handling this on other projects was to use Apache extended server-side includes (XSSI) by creating SSI variables that will contain all the objects not directly associated with content and echo them before and after the content.

Each of the echos was stored in the HTML file but apart from that the file was basically all content, plus opening/closing BODY and HTML container tags and the Meta data in the HEAD section.

A typical stand-alone content page would then look something like this...

http://webdev.co.nz/projects/templates/template.gif
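In text form, a page of this kind could be generated along the following lines. This is a hedged Python sketch: the include path and the SSI variable names PAGE_TOP and PAGE_BOTTOM are invented for the example, and only the `#include virtual` and `#echo var` directives themselves are standard Apache mod_include syntax.

```python
from string import Template

# Hypothetical container: one include that sets the layout variables,
# then an echo before and after the content.
PAGE = Template("""<html>
<head>
<title>$title</title>
</head>
<body>
<!--#include virtual="/includes/layout.shtml" -->
<!--#echo encoding="none" var="PAGE_TOP" -->
$content
<!--#echo encoding="none" var="PAGE_BOTTOM" -->
</body>
</html>
""")


def render(title, content):
    """Pour a title and a content fragment into the SSI container."""
    return PAGE.substitute(title=title, content=content)


print(render("About us", "<h1>About us</h1>\n<p>Content only.</p>"))
```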

The advantage of this sort of mechanism is that the file itself contains content only, plus the meta data associated with it. There are no messy HTML tables mixed in with the content. We can change the whole look of the page by changing the logic or contents of the main include file linked at the top and all other pages that link to it without any changes to the individual content pages themselves.

Plus we can easily extract the content from the file at a later date if we need to migrate it to another file format, to another platform or include mechanism, or into a content management system for example without having to dig through non-content markup in the file.

5. Putting it All Together

Given that the page layout identification has taken place already and that an XML index file has been created containing the filepath and template group of each file, the pseudo code for the migration script might be something like this...

for each file to be processed

  • parse file into HTML tree
  • convert HTML tree to XML nodelist
  • identify template group ID for file
  • fetch XPath queries for template group
  • execute XPath queries to extract content fragments
  • fetch output containers for template group
  • merge output containers and content fragments
  • output assembly as XHTML
  • convert XHTML to HTML
  • replace source file with new file
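The middle of that loop can be sketched as follows, in Python rather than the original Perl. It assumes the source has already been cleaned up into well-formed XHTML, skips the final file replacement, and uses made-up query sets, container strings and group IDs.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-group settings; the real ones would be driven by the XML
# index and the query sets assembled for each layout style.
QUERIES = {"2": ["./body/table/tr/td[2]/*"]}
CONTAINERS = {
    "2": (
        '<html>\n<body>\n<!--#echo encoding="none" var="PAGE_TOP" -->\n',
        '\n<!--#echo encoding="none" var="PAGE_BOTTOM" -->\n</body>\n</html>\n',
    )
}


def extract_fragments(root, group_id):
    """Run every query for the template group and collect the matches."""
    fragments = []
    for query in QUERIES[group_id]:
        fragments.extend(root.findall(query))
    return fragments


def migrate(xhtml_source, group_id):
    """Extract the content fragments and merge them into the output container."""
    root = ET.fromstring(xhtml_source)
    fragments = extract_fragments(root, group_id)
    # method="html" serialises the fragments as HTML rather than XML.
    content = "".join(
        ET.tostring(el, encoding="unicode", method="html") for el in fragments
    )
    top, bottom = CONTAINERS[group_id]
    return top + content + bottom


SOURCE = (
    "<html><head><title>t</title></head><body><table><tr>"
    "<td><p>nav</p></td><td><h1>Title</h1><p>Body</p></td>"
    "</tr></table></body></html>"
)
print(migrate(SOURCE, "2"))
```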

The First Single-File Dry Run

After running the XPath query and spitting the results out into the SSI file we get a new version of the original page that looks pretty good.

Batch Processing

Of course as soon as I started to batch the process with the final extractor script there were plenty of small but annoying inconsistencies that popped up between various documents and template layouts which required work-arounds.

  • there were so many special entities that HTML-Tidy was the only feasible option for conversion to XML
  • the CSS files linked in the head section of the old site pages were now redundant and interfered with rendering using the new stylesheets developed for 2002. Added an XPath query that selected all tags from the head section except those that had an attribute type='text/css'
  • for conversion back to HTML the embedded SSI in the <body> tag was rejected by HTML-Tidy, so this had to be removed from the output containers. This wasn't a problem as the markup that was in there can be set using CSS
  • had planned to do the HTML -> XHTML -> HTML conversion in-memory, but with the need to use the external HTML-Tidy utility had to export the XML to an external file
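The head-section filter from the second point above is easy to show in isolation. In full XPath 1.0 it could be written as /html/head/*[not(@type='text/css')] (the exact query used on the project isn't recorded here); Python's ElementTree doesn't support not(), so this sketch does the same selection with a list comprehension over a made-up head section.

```python
import xml.etree.ElementTree as ET

# Hypothetical head section from an old-style page.
HEAD = (
    "<head>"
    "<title>Page</title>"
    '<meta name="description" content="d" />'
    '<link rel="stylesheet" type="text/css" href="old.css" />'
    '<style type="text/css">p {}</style>'
    "</head>"
)


def keep_non_css(head):
    """Return the head's child elements, minus anything typed as text/css."""
    return [el for el in head if el.get("type") != "text/css"]


head = ET.fromstring(HEAD)
kept = keep_non_css(head)
print([el.tag for el in kept])
```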

Summary:

The process took a few twists and turns along the way but overall it was a successful outcome. It was always going to be challenging to turn unpredictable, loose and relatively unstructured markup into a structured and manageable form.

The lessons learned have been of benefit on other projects using the same modules and/or similar approach, and the experience was rewarding and satisfying.

