News, Events, Trends, Activities, Conferences and Notes to do with Web Culture, Development, New Media, Content Management, Mobile and PDA Access and Web Infrastructure
Automatic Template Migration
The following notes cover a task undertaken in 2002 to batch-convert hundreds of HTML files on a legacy website from one template style to another. The effort grew to the scale of a project, but the lessons learned were of use for subsequent migrations of other sites into more manageable forms.
The reason for migrating this content from one form to another was to remove the hard-coded, tables-based layout embedded in each page and pour the content (minus navigation, branding and other chrome) into a simpler template form that uses external SSI files to build the page layout and peripheral non-content components on the fly.
The plan for migration was to generate a list of files to be processed, identify the table layout used in each page, group pages by table-layout type, then hopefully establish rules for transforming each of the common layout groups into a new SSI template style for each subsite.
One major problem with HTML content on a larger web site with multiple authors is the diversity of text and WYSIWYG tools used to edit the content, and the varying interest on the part of authors in standards conformance or in ensuring the HTML is valid. Modern browsers are fairly tolerant of mark-up errors, so a scripted solution to extract actual content from web pages would need to be similarly tolerant, and correspondingly complex.
The activities were broken down into the following areas:
1. Page Layout Identification:
Table Signature Report
I wanted to get an idea of the number of variations in table layouts used across the site, so I went through the following process:
find ./ -name "*.shtml" | egrep -v "paths|we|dont|want" > dirlist.txt
(found 322 files on 8/11/2001)
It turned out there were 144 unique signatures in 322 individual files, so the idea of automating based on a simple extraction of the TABLE signature was not practicable. The top 20 sigs accounted for 179 of the 322 files, and the counts quickly dropped off to the point where many sigs matched only one document.
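The original extractor was a Perl script that is not reproduced in these notes. As a rough illustration only, the sketch below shows one way a table signature could be computed and tallied, written in Python rather than Perl. The set of tags that make up a signature and the function names (table_signature, signature_report) are assumptions made for the example, not the names used in the project.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that contribute to a layout signature. This set is an assumption;
# the notes only say that TABLE structure and H1/H2 headings were recorded.
SIGNATURE_TAGS = frozenset({"table", "tr", "td", "th", "h1", "h2"})


class SignatureParser(HTMLParser):
    """Collects the sequence of layout-relevant open and close tags."""

    def __init__(self, tags=SIGNATURE_TAGS):
        super().__init__()
        self.tags = tags
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names, so <TABLE> and <table> agree.
        if tag in self.tags:
            self.parts.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in self.tags:
            self.parts.append(f"</{tag}>")


def table_signature(html, tags=SIGNATURE_TAGS):
    """Return the page's signature: its layout tags with all content removed."""
    parser = SignatureParser(tags)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def signature_report(pages, tags=SIGNATURE_TAGS):
    """Map each distinct signature to the number of pages that share it."""
    return Counter(table_signature(html, tags) for html in pages.values())
```

Calling signature_report(pages).most_common(20) on a dict of path-to-HTML strings would give a "top 20" list of the kind quoted above. Passing a smaller tag set changes what counts as a distinct layout.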
Removal of H2s
I decided to re-run the table extractor with <H2> tags removed from the sig file database, as their position in the page is probably not as significant as that of the <H1> tags, and if there is more than one <H2> (which is likely) we get a new sig instance when really the page has much the same layout.
Web Fetch vs File-System
The above analysis was on pages fetched from the file system, where any content included by SSI doesn't appear. Some of the reverse navigation used on the site was wrapped in tables, so a file-system fetch generated a different sig from one done as a web request.
The top 20 now accounted for 228 of the 322 files.
Outside-In vs Whole Signature
Clearly the whole-signature identifier was not going to be overly useful. We could reasonably transform two-thirds of the files, but would be left with around a hundred files needing manual, or carefully reviewed automagic, transformation to the new style. Another option would be to work from the outside in towards the content window for all files, instead of making a complete match on table structure, to see whether file layouts could be matched that way, thereby ignoring user-created tables in the content window.
Identification of Layout Groupings
After eyeballing the sorted, URL-fetched sig report for H1-only signatures I split the file list into groups with identical start-end stems. The resulting file showed there were only four major start-end stem sets that accounted for all but 17 of the total 322 files. With these groups identified it shouldn't be difficult to batch process the files in each group using some parsing rules.
I re-ran this eyeballing with an updated copy of the existing live client site and widened the groupings to 11 major template styles for a little more flexibility. I also added a new column with the template grouping identifier at the end of each line saved in a new results file.
I created a new table-extracting Perl script, added an array of signature stubs for each of the four major groupings, and ran it over the file set, generating the usual table sig but then doing a regexp comparison against each of the start-end table sig stub pairs. There were numerous files which matched multiple sig stub pairs. The challenge then was to re-order the sig stubs so we don't get multiple matches. As template 2 has a longer start stub than template 3, testing it first means that a match on template 2 can safely be used for the conversion before the shorter stub of template 3 is ever tried.
Storing Template Grouping Data
I took the new results file and converted it to an XML file, dirlist.xml, for use in the migration stage to follow. I then used an XSL stylesheet to transform that into an XHTML page which could be used later by the client for testing the conversion results.
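The stub-ordering idea can be sketched as follows, in Python rather than the Perl actually used. The two stub pairs and the group names are invented for illustration; the real stubs are not recorded in these notes. The only point being shown is that trying the longest start stub first stops a page from being claimed by a shorter, more general stub.

```python
import re

# Hypothetical (start_stub, end_stub) pairs keyed by template group.
# "template2" wraps its content in a nested table; "template3" does not.
STUBS = {
    "template3": ("<table><tr><td>", "</td></tr></table>"),
    "template2": (
        "<table><tr><td><table><tr><td>",
        "</td></tr></table></td></tr></table>",
    ),
}


def ordered_stubs(stubs):
    """Sort stub pairs so the longest (most specific) start stub is tried first."""
    return sorted(stubs.items(), key=lambda item: len(item[1][0]), reverse=True)


def classify(signature, stubs=STUBS):
    """Return the first template group whose start and end stubs wrap the signature."""
    for group, (start, end) in ordered_stubs(stubs):
        # Outside-in match: fixed stubs at both ends, anything in between,
        # so author-created tables in the content window are ignored.
        pattern = re.escape(start) + ".*" + re.escape(end)
        if re.fullmatch(pattern, signature, flags=re.DOTALL):
            return group
    return None
```

A signature that fits both pairs is reported as template2 because that pair is tested first. A signature that fits neither returns None and can be set aside for manual review.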
2. Content Extraction
I had chosen to write the transformation scripts in Perl, so I had a good look through CPAN to see what modules were available for extracting content from table-based layouts of HTML content. One key requirement was that the module be based on a parser rather than some read-line-and-regexp type hack which would soon grow out of control. After a few emails to custodians of various Perl modules there appeared to be three feasible options worth looking at to handle extraction of the data from the structures identified above and then drop that data into a new template form.
1. Any-Data and SQL Method
There was something I really liked about this suggestion, i.e. querying HTML using SQL, but as the Any-Data modules were still largely immature version-wise I thought I'd tackle the Tree-Builder method first. The following diagram shows Perl modules and dependencies and a process that could do what was required...
2. Tree-Builder and stream-mode handlers
This would be the best means of having solid, low-level control during the parse of the source content, but I assumed the complexity of the handler logic needed to correctly process nested elements would quickly become unmanageable. I therefore began leaning towards the conversion-to-XML and XPath query method that follows.
3. Tree-Builder and XPath Method
Tree-Builder is a part of the HTML-Tree distribution, was originally authored by Gisle Aas and is now maintained by Sean Burke. The bundle was first released in March 2001 and is currently at revision 3.11. The following diagram lists the process that could do what's required. Perl modules and dependencies are shown...
After updating HTML::Parser so I could install the more recent copy of HTML::DOMbo, I wrote a new Perl script to manage the parse-and-convert-to-XML stage. It writes the XML tree out to STDOUT, which worked well when called from a shell script. Next I needed to get the XPath query sorted against the in-memory DOM structure. I installed XML::XPath and knocked out a new script which took the XML source and ran an XPath query against the DOM that had been loaded (assuming the source was well-formed).
While attempting to parse the content I got non-well-formed error messages at the point in source files where a non-breaking space (&nbsp;) had been replaced by a sequence of real spaces. I would have expected the to_XML_DOM method in DOMbo to have encoded this properly on the way out, but it looks like it does not.
There were also acute-a characters (&aacute; / á) which the script didn't like, and I needed to find out why. I tried Dave Raggett's HTML-tidy, which also supports XHTML transformation, and it replaces such sequences with the numeric entity &#160;. So this special-entity transformation is evidently needed for the output to parse as well-formed XML, but DOMbo wasn't doing the replacement: it was writing the high-ASCII (>127) value out as a raw character when it should have been written as &#160;.
After replacing all such characters in the source file the html_XML process seemed OK. It looked like all but PCDATA would need replacing, but this seemed a bit flaky. Another option would be to create an additional DTD which included all the special entities contained in the files to be converted, and let the XML conversion process haul in their declarations. This also seemed a bit fragile, however.
I changed html_XML.pl to call the encode_entities function, but its default action is to encode everything, including the <, >, & and " characters, which is useless as the output is no longer XML, just a sequence of escape codes. Fortunately encode_entities() allows you to specify which characters are to be considered unsafe and in need of encoding, but as per the above paragraph I really didn't fancy determining the whole set of codes in advance. It seemed HTML-tidy would have to do, as it manages this translation to numeric entities very well.
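The translation that HTML-tidy performed here can be approximated in a few lines. The following is a minimal sketch in Python, not the Perl used on the project, and it is not a substitute for a real tidy pass. It turns named HTML entities and raw non-ASCII characters into numeric character references while leaving markup and XML's five predefined entities alone. The function names are made up for the example.

```python
import re
from html.entities import name2codepoint

# Entities every XML parser already understands; these must be left as-is.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def named_to_numeric(markup):
    """Rewrite HTML named entities such as &nbsp; as numeric references."""

    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave unknown or predefined names alone
        return f"&#{name2codepoint[name]};"

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, markup)


def to_numeric_entities(markup):
    """Return markup that is pure ASCII, with numeric character references."""
    markup = named_to_numeric(markup)
    # Any character above 127 becomes &#NNN;. The markup characters
    # <, >, & and " are ASCII, so they pass through untouched.
    return markup.encode("ascii", "xmlcharrefreplace").decode("ascii")
```

Unlike a blanket entity encoder, this leaves the tags intact, so the result can still be handed to an XML parser, provided the markup itself is otherwise well-formed.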
3. Querying The Extracted Content
Now that I'd committed to the XPath query method it was time to actually read something about XPath queries. I had been keeping a set of notes and typical queries over the last few months and started to build a common set of queries that I thought would be useful for HTML extraction. Given the TABLE-based layout used throughout the site to be converted, a typical query might be something like...
/html/body/table/tr/td[2]/*
This returns every element inside the 2nd table cell <td> of each table row <tr> of each <table> sitting directly in the page body. On a page whose layout is a single one-row table, that is the content cell of the 1st row of the 1st table; positional predicates (table[1]/tr[1]) pin the match to that table and row on pages with more of either. The extracted content fragment can then be poured into a new output page style/template.
What's required is a large enough set of similar queries such that a group of them can be applied to a given page depending on the location of the significant content areas in that page. As I'd identified about 10 different page layout styles there would be 10 XPath query sets to assemble. Fortunately many of the queries were very similar.
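To make the query concrete, here is a small sketch using Python's standard xml.etree.ElementTree in place of the Perl XML::XPath module used on the project. ElementTree supports only a subset of XPath and evaluates paths relative to the root element, so the leading /html is written as "." in the example. The sample page and the function name extract_content are invented for illustration.

```python
import xml.etree.ElementTree as ET

# A tiny well-formed page in the tables-based style described above:
# navigation in the first cell, content in the second.
PAGE = """<html><body>
<table>
  <tr>
    <td><a href="/">Home</a></td>
    <td><h1>Title</h1><p>Body text.</p></td>
  </tr>
</table>
</body></html>"""


def extract_content(xhtml, query="./body/table/tr/td[2]/*"):
    """Return the elements selected by the query, in document order."""
    root = ET.fromstring(xhtml)  # root is the <html> element
    return root.findall(query)


# Serialise each selected element back to markup, ready to be poured
# into a new template.
fragments = [
    ET.tostring(element, encoding="unicode").strip()
    for element in extract_content(PAGE)
]
```

Here the query selects the h1 and p elements from the second cell and skips the navigation link in the first. Note that * matches child elements only, so any bare text sitting directly inside the cell would need a separate step to capture.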
4. Export To New Template/Layout
Now that I could extract content fragments, I needed to pour them into a container file or template that would make some external call to handle assembly of the objects peripheral to the actual body of the content window: things like headers, footers, branding elements and common site navigation. I also needed to be able to specify different page layouts for the various sub-sites and possibly provide print-friendly versions of content. A further complication was that some sub-sites (and even folders within them) had files in the same folder which were of a different template layout, so the output template layout could not be driven by path alone.
Server-Side Includes
The way we'd been handling this on other projects was to use Apache extended server-side includes (XSSI), creating SSI variables that contain all the objects not directly associated with content and echoing them before and after the content. Each of the echos was stored in the HTML file, but apart from that the file was basically all content, plus opening/closing BODY and HTML container tags and the meta data in the HEAD section. A typical stand-alone content page would then look something like this...
The advantage of this sort of mechanism is that the file itself contains content only, plus the meta data associated with it. There are no messy HTML tables mixed in with the content. We can change the whole look of the page by changing the logic or contents of the main include file linked at the top and all other pages that link to it without any changes to the individual content pages themselves. Plus we can easily extract the content from the file at a later date if we need to migrate it to another file format, to another platform or include mechanism, or into a content management system for example without having to dig through non-content markup in the file.
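The page skeleton described above might be generated along the following lines. This is a sketch under several assumptions, written in Python rather than Perl. The include path /includes/layout.shtml and the variable names PAGE_HEADER and PAGE_FOOTER are hypothetical, since the project's actual names are not recorded in these notes. The include, echo, var and encoding keywords are standard Apache mod_include syntax.

```python
from string import Template

# Skeleton of a content-only page. The include file is assumed to set the
# PAGE_HEADER and PAGE_FOOTER variables (hypothetical names), which are
# then echoed either side of the content.
PAGE_TEMPLATE = Template("""<html>
<head>
<title>$title</title>
</head>
<body>
<!--#include virtual="$layout_include" -->
<!--#echo encoding="none" var="PAGE_HEADER" -->
$content
<!--#echo encoding="none" var="PAGE_FOOTER" -->
</body>
</html>
""")


def render_page(title, content, layout_include="/includes/layout.shtml"):
    """Pour an extracted content fragment into the SSI page skeleton."""
    return PAGE_TEMPLATE.substitute(
        title=title, content=content, layout_include=layout_include
    )
```

Passing a different layout_include per sub-site, or for a print-friendly version, changes the surrounding layout without touching the content fragment itself.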
5. Putting it All Together
Given that the page layout identification has taken place already and that an XML index file has been created containing the filepath and template group of each file, the pseudo code for the migration script might be something like this...
for each file to be processed
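Only the first line of that pseudo code is given here, so the following Python sketch is a guess at its general shape rather than a reconstruction. It assumes an index format of file elements carrying path and group attributes, and a table of XPath query sets keyed by template group. The structure of dirlist.xml and the queries themselves are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical query sets, one list of XPath queries per template group.
QUERY_SETS = {
    "1": ["./body/table/tr/td[2]/*"],
    "2": ["./body/table/tr/td/table/tr/td[2]/*"],
}


def plan_migration(index_xml, query_sets=QUERY_SETS):
    """Pair each indexed file with the query set for its template group.

    Returns (jobs, skipped): jobs is a list of (path, queries) tuples and
    skipped lists the paths whose group has no query set yet.
    """
    jobs, skipped = [], []
    for entry in ET.fromstring(index_xml).iter("file"):
        path, group = entry.get("path"), entry.get("group")
        if group in query_sets:
            jobs.append((path, query_sets[group]))
        else:
            skipped.append(path)  # leave for manual conversion
    return jobs, skipped
```

Each job would then be run through the tidy, parse, query and template steps described in the earlier sections, with the skipped list reviewed by hand.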
The First Single-File Dry Run
After running the XPath query and spitting the results out into the SSI file we get a new version of the original page that looks pretty good.
Batch Processing
Of course, as soon as I started to batch the process with the final extractor script, plenty of small but annoying inconsistencies popped up between various documents and template layouts, and these required work-arounds.
Summary:
The process took a few twists and turns along the way, but overall it was a successful outcome. It was always going to be challenging to bring unpredictable, loose and relatively unstructured markup into a structured and manageable form. The lessons learned have been of benefit on other projects using the same modules and/or a similar approach, and the experience was rewarding and satisfying.