XML Publication: Web document publication through XML

by J.M. Vanel, Copyright © J.M. Vanel - 2001


What is XML Publication?

XML Publication is a set of tools for generating Web pages from (possibly large) desktop documents or other structured documents, for instance books with paragraphs, or tabular data. It cuts big documents into Web pages and creates a customizable multi-index. All this is done through a repeatable process in which data is kept separate from presentation and user settings.

It uses cutting-edge XML techniques, particularly XSLT. It is released under the GNU General Public License (GPL).

History

I used these techniques last spring for industrial catalogs at industrySuppliers.com, a (now defunct) marketplace. Then I got involved in Seed to Seed (www.seed2seed.net), a project to publish information collected worldwide about sustainable agriculture. With a bit of refactoring, XML Publication grew out of these projects. Someday XML Publication will also be applied to the Worldwide Botanical Knowledge Base.

XMLPublish data flow

Samples


Source: a table, paragraph-structured text, a poorly structured word processor file, etc.

Requirements

Static pages vs dynamic pages

Static pages

Dynamic pages

Information object model

If your browser supports SVG (like Amaya), you will see below a UML diagram of the XMLPublish information object model:

[UML diagram: XMLPublish object model. Classes: Document, Item, Rubric (String name), Keyword, xsd:element; two "realize" relations.]

Otherwise, use these links:

XMLPublish object model as SVG (if you have the Adobe SVG plugin)

XMLPublish object model as GIF

XML basic structure

A simple 2-level structure reflecting the "Information object model" above:

<root>
 <itemType1>
  <rubricType1>any markup ...</rubricType1>
  <rubricType2>any markup ...</rubricType2>
  <!-- ... etc. -->
 </itemType1>
 <!-- ... other item types -->
</root>
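
To make the 2-level layout concrete, here is a minimal Python sketch (not part of XML Publication, which works in XSLT) that reads such a file into items and rubrics. The element names plant, description and uses are invented for illustration, as is the helper read_items.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample following the 2-level layout; the element names
# (plant, description, uses) are invented for this illustration.
SAMPLE = """
<root>
 <plant>
  <description>A tall <b>annual</b> herb</description>
  <uses>Edible seeds</uses>
 </plant>
 <plant>
  <description>A creeping perennial</description>
 </plant>
</root>
"""

def read_items(xml_text):
    """Return a list of (item_type, {rubric_type: text}) pairs."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root:                 # level 1: items
        rubrics = {}
        for rubric in item:           # level 2: rubrics
            # itertext() flattens "any markup" inside the rubric to plain text
            rubrics[rubric.tag] = "".join(rubric.itertext()).strip()
        items.append((item.tag, rubrics))
    return items

items = read_items(SAMPLE)
for item_type, rubrics in items:
    print(item_type, rubrics)
```

The point of the sketch is only that any tool can walk the structure generically: the first level gives the items, the second level gives the rubrics, whatever their element names are.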

Historically, XML Publication had 2 concrete structures for the master.xml file, but now there is just one: documents with paragraphs and sub-paragraphs. Here is a sample:

<div class="h1" >
 <h1>Item name</h1>
 <div class="h2">
  <h2>Rubric name</h2>
   Rubric content
 </div>
 ...
</div>

Currently we use <p> tags for wrapping items and rubrics, but we will switch to <div> in a future version, because <p> as a wrapper is not valid with respect to the HTML DTD. This markup should pass intact through most HTML tools (testers welcome!).

Note 1: ideally the tidy operation for this feature should be idempotent (an operation x is said to be idempotent if applying it twice gives the same result as applying it once, i.e. xx = x).

Note 2: The ISO/IEC 15445:2000 standard for HyperText Markup Language (HTML) also specifies a way of wrapping paragraphs, but it does so through div1, div2 tags that are not in the W3C standard. See http://www.cs.tcd.ie/15445/UG.html.
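
The idempotence property from Note 1 can be checked mechanically. The Python sketch below is only an illustration: normalize is a stand-in for a tidy step (it merely collapses whitespace), and wrap is a deliberately non-idempotent counter-example. Neither function belongs to XML Publication.

```python
import re

def normalize(markup):
    """Stand-in for a tidy step: collapse whitespace runs and trim the ends."""
    return re.sub(r"\s+", " ", markup).strip()

def wrap(markup):
    """Counter-example: wrapping again nests the markup one level deeper."""
    return "<p>" + markup + "</p>"

def is_idempotent_on(operation, sample):
    """Check operation(operation(x)) == operation(x) for one input x."""
    once = operation(sample)
    return operation(once) == once

messy = "<h1>Item   name</h1>\n\n  <h2>Rubric name</h2> "
print(is_idempotent_on(normalize, messy))
print(is_idempotent_on(wrap, messy))
```

A check like this only covers the inputs it is given, so it can disprove idempotence but not prove it in general.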

Presentation settings

XSLT customization

Implementation

Automatic update and chaining are handled with Apache Jakarta Ant (or GNU make).

Moreover, Ant keeps the build portable across platforms and their file systems.

The build.xml files are designed so that data can come from URLs, not only from files (alas, Ant is not yet Web-oriented enough, but we are working on that too).

Algorithms

Build skeleton thesaurus.xml

This is done in publication-magician.xslt. The starting point is the work/master.xml file, which aggregates all document sources; it has the structure described above (XML basic structure).

// make a unique list of rubrics across document sources:
for each unique string in:
 $master.xml/p/p/h2
 store it in a new <rubric> element
// for each rubric make a list of occurring words
Loop on rubrics
 Concatenate words for this rubric from all items
 Tokenize to get each word wrapped in an XML element
 Sort the word list and remove duplicates
 Filter the word list (upper/lower case, numbers, etc.)
 Mark stopwords
 Write <rubric> with sub-element <keyword>
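
The steps above can be sketched in Python (the real implementation is the XSLT stylesheet, which this does not reproduce). Several things are assumptions made for the sketch: the <master> wrapper element, the name and stopword attributes, the tiny stopword list, and the choice to filter by lower-casing and dropping anything that is not a letter.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical master document: items are outer <p> elements, rubrics are
# inner <p> elements titled by <h2>. The <master> wrapper is an assumption.
MASTER = """
<master>
 <p><h1>Bean</h1>
  <p><h2>Uses</h2>The seeds are eaten. The pods are eaten too.</p>
  <p><h2>Origin</h2>Native of America</p>
 </p>
 <p><h1>Maize</h1>
  <p><h2>Uses</h2>Grain and 2 kinds of fodder</p>
 </p>
</master>
"""

STOPWORDS = {"the", "of", "and", "are", "too"}   # illustrative list only

def rubric_text(rubric, title):
    """Text content of a rubric, leaving out its <h2> title."""
    parts = [rubric.text or ""]
    for child in rubric:
        if child is not title:
            parts.extend(child.itertext())
        parts.append(child.tail or "")
    return "".join(parts)

def build_thesaurus(master_xml):
    master = ET.fromstring(master_xml)
    words_by_rubric = {}                      # rubric name -> set of words
    for rubric in master.findall("./p/p"):
        title = rubric.find("h2")
        if title is None:
            continue
        name = (title.text or "").strip()
        # Tokenize and filter: lower-case, letters only (numbers are dropped)
        words = re.findall(r"[^\W\d_]+", rubric_text(rubric, title).lower())
        # A set gives the unique rubric list and removes duplicate words
        words_by_rubric.setdefault(name, set()).update(words)
    thesaurus = ET.Element("thesaurus")
    for name in sorted(words_by_rubric):
        rubric_el = ET.SubElement(thesaurus, "rubric", name=name)
        for word in sorted(words_by_rubric[name]):
            keyword = ET.SubElement(rubric_el, "keyword")
            keyword.text = word
            if word in STOPWORDS:             # mark, do not remove
                keyword.set("stopword", "yes")
    return thesaurus

thesaurus = build_thesaurus(MASTER)
print(ET.tostring(thesaurus, encoding="unicode"))
```

Stopwords are marked rather than deleted, following the "Mark stopwords" step, so that a later stage can still decide what to do with them.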

Build an index for each rubric

This is done in make-index-by-rubrics-impl.xslt. Parts of it are similar to the preceding stylesheet and need refactoring. This is because, when the user chooses <rubric use-keywords="no">, it does the same thing as the preceding stylesheet, except that it keeps the <rubric> element in an <xsl:variable> instead of writing it to publication.xml.

Loop on rubrics
 Fill HTML document for this rubric
  Loop on words list
   Loop on items
    If word is found in this rubric of this item
     add hyperlink to HTML item file
Fill HTML document for the overall index.html
 Loop on rubrics
  add hyperlink to HTML document for this rubric
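
As an illustration of these loops, here is a minimal Python sketch working on plain dictionaries instead of XML. The input shape, the file naming scheme (bean.html, index-uses.html) and the generated HTML layout are all assumptions made for the example, not what the stylesheet produces.

```python
import re
from html import escape

# Hypothetical input: item name -> {rubric name -> rubric text}.
ITEMS = {
    "Bean":  {"Uses": "Edible seeds and pods", "Origin": "America"},
    "Maize": {"Uses": "Edible grain, fodder"},
}

def words_of(text):
    """Lower-cased set of the words (letters only) found in text."""
    return set(re.findall(r"[^\W\d_]+", text.lower()))

def item_file(item_name):
    # The file naming scheme is an assumption for this sketch.
    return item_name.lower() + ".html"

def build_indexes(items):
    """Return {file name: HTML text} for each rubric index plus index.html."""
    rubrics = sorted({r for rs in items.values() for r in rs})
    pages = {}
    for rubric in rubrics:                               # loop on rubrics
        words = set()
        for rs in items.values():
            words |= words_of(rs.get(rubric, ""))
        lines = ["<h1>%s</h1>" % escape(rubric), "<dl>"]
        for word in sorted(words):                       # loop on word list
            links = [
                '<a href="%s">%s</a>' % (escape(item_file(name)), escape(name))
                for name, rs in sorted(items.items())    # loop on items
                if word in words_of(rs.get(rubric, ""))
            ]
            lines.append("<dt>%s</dt><dd>%s</dd>"
                         % (escape(word), ", ".join(links)))
        lines.append("</dl>")
        pages["index-%s.html" % rubric.lower()] = "\n".join(lines)
    # The overall index links to the index of each rubric
    pages["index.html"] = "\n".join(
        '<a href="index-%s.html">%s</a>' % (escape(rubric.lower()), escape(rubric))
        for rubric in rubrics
    )
    return pages

pages = build_indexes(ITEMS)
print(pages["index-uses.html"])
```

The nested loops mirror the pseudocode directly; on large documents one would probably precompute the word sets per item and rubric instead of recomputing them inside the inner loop.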

Modularity

Each functionality lives in a separate XSLT transform.

The chaining of transforms (the data flow) is not hard-wired in the XSLT transforms; the Ant build.xml takes care of that.
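
The design idea, steps that know nothing about each other plus a separate description of their order, can be illustrated with a small Python sketch. The three steps are hypothetical placeholders that work on plain text; they are not the project's stylesheets, and the PIPELINE list merely plays the role that build.xml plays in XML Publication.

```python
from functools import reduce

# Each step is a stand-alone function from document text to document text.
# These three are hypothetical placeholders, not the project's stylesheets.
def strip_blank_lines(doc):
    return "\n".join(line for line in doc.splitlines() if line.strip())

def upper_case_headings(doc):
    return "\n".join(line.upper() if line.startswith("# ") else line
                     for line in doc.splitlines())

def number_lines(doc):
    return "\n".join("%d: %s" % (number, line)
                     for number, line in enumerate(doc.splitlines(), 1))

# The data flow lives here, outside the steps, like targets in a build file.
PIPELINE = [strip_blank_lines, upper_case_headings, number_lines]

def run(pipeline, doc):
    """Feed the output of each step into the next one."""
    return reduce(lambda current, step: step(current), pipeline, doc)

result = run(PIPELINE, "# title\n\nbody\n")
print(result)
```

Because the order is data, a step can be moved, dropped or inserted without editing any other step, which is the benefit the paragraph above attributes to keeping the chaining in the build file.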

Future

Detailed list of features and tasks
