XML Publication: Web document publication through XML


Download current version; User documentation ; en français

What is XML Publication?

XML Publication is a set of tools for generating Web pages from (possibly large) desktop documents or other structured documents, for instance books made of paragraphs, or tabular data. It splits big documents into Web pages and creates customizable multiple indexes. All this is done through a repeatable process in which data is kept separate from presentation and user settings.

It uses cutting-edge XML techniques, particularly XSLT. It is released under the GNU General Public License.

History

I used these techniques last spring for industrial catalogs at industrySuppliers.com, a (now defunct) marketplace. Then I got involved in Seed to Seed (www.seed2seed.net), a project to publish information collected worldwide about sustainable agriculture. With a bit of refactoring, XML Publication resulted from these projects. Someday XML Publication will also be applied to the Worldwide Botanical Knowledge Base.

XMLPublish data flow

Samples

Source: a table, a paragraph-structured text, a poorly structured word processor file, etc.

Requirements

Free software
Lightweight and easy to maintain: 1000 lines of XSLT and 100 lines of Makefile / Ant
Uses mature techniques: XSLT, Jakarta Ant, Makefile
Portable: needs only a JVM, a bash shell or Jakarta Ant, and the Saxon XSLT processor
Content and presentation separation through HTML templates and CSS
Works with poorly structured documents
Adapts to authors and their desktop tools
Flexibility:
- content organisation: tabular, paragraphs, classification, indexing
- presentation: XHTML template for site wrapper, CSS
- various data sources: Web, word processor, spreadsheet, relational, XML

Static pages vs dynamic pages

Static pages

faster HTTP response and fewer computer resources used
easier to deploy, no need for a web application servlet container (Tomcat, etc.)
easy indexing by conventional search engines (Google, ...)

Dynamic pages

allows instant user customization: page layout, etc.
more coding needed, but freeware available
less use of disk space
allows complex queries through relational or XML databases
allows instant update of catalog data
XSLT formatting from XML documents generated by XMLPublish can be reused in dynamic servers (Cocoon)
in any case, the "fill cart" functionality needs a dynamic server

Information object model

document
item
rubric
keyword
hierarchy (for HTML navigation)


If your browser supports SVG (like Amaya), you will see below a UML diagram of the XMLPublish information object model. Otherwise, use these links: XMLPublish object model as SVG (if you have the Adobe SVG plugin), XMLPublish object model as GIF.

XML basic structure

Simple 2-level structure reflecting the above "Information object model":

<root>
 <itemType1>
  <rubricType1>any markup ...</rubricType1>
  <rubricType2>any markup ...</rubricType2>
  ... etc
 </itemType1>
 ... other item types
</root>

Historically in XMLPublication there were 2 concrete structures for the master.xml file:

table with rows,
documents with paragraphs and sub-paragraphs,

but now there is just one: documents with paragraphs and sub-paragraphs. Here is a sample:

<div class="h1">
 <h1>Item name</h1>
 <div class="h2">
  <h2>Rubric name</h2>
  Rubric content
 </div>
 ...
</div>
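
To make the mapping between this markup and the information object model concrete, here is a minimal sketch in Python (not part of XML Publication, which does this work in XSLT). The sample content, the <root> wrapper and the function name read_items are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical master document following the <div>-based structure above.
MASTER_XML = """
<root>
  <div class="h1">
    <h1>Tomato</h1>
    <div class="h2">
      <h2>Sowing</h2>
      Sow indoors in early spring.
    </div>
    <div class="h2">
      <h2>Harvest</h2>
      Pick when fully coloured.
    </div>
  </div>
</root>
"""


def read_items(xml_text):
    """Return {item name: {rubric name: rubric text}} from a master document."""
    root = ET.fromstring(xml_text)
    items = {}
    for item in root.findall("div[@class='h1']"):
        name = item.findtext("h1", default="").strip()
        rubrics = {}
        for rubric in item.findall("div[@class='h2']"):
            title = rubric.find("h2")
            # The rubric content follows the <h2> heading, so it is the
            # heading's tail text. Nested markup is ignored in this sketch.
            rubrics[title.text.strip()] = (title.tail or "").strip()
        items[name] = rubrics
    return items


items = read_items(MASTER_XML)
print(items)
```

Each outer <div> becomes one item and each inner <div> one rubric, which is the 2-level structure described in "XML basic structure".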

Currently we use <p> tags for wrapping items and rubrics, but we will switch to <div> in a future version, because <p> as a wrapper is not valid with respect to the HTML DTD. This markup should pass intact through most HTML tools (testers welcome!).

Note 1: ideally the tidy operation for this feature should be idempotent (an operation x is said to be idempotent if applying it twice gives the same result as applying it once, that is, xx = x).
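
Idempotence is easy to check mechanically. The sketch below is only an illustration in Python: the tidy function is a hypothetical stand-in that collapses whitespace, not the actual tidy step, and is_idempotent is a name chosen here.

```python
import re


def tidy(markup):
    """Hypothetical stand-in for the tidy step: collapse runs of whitespace."""
    return re.sub(r"\s+", " ", markup).strip()


def is_idempotent(operation, sample):
    """True if applying the operation twice gives the same result as once."""
    once = operation(sample)
    return operation(once) == once


sample = "<div class=\"h1\">\n  <h1>Item   name</h1>\n</div>"
print(is_idempotent(tidy, sample))
```

A check on one sample does not prove idempotence in general, but it catches a tidy step that keeps changing the markup on every pass.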

Note 2: The ISO/IEC 15445:2000 norm about HyperText Markup Language (HTML) also has a spec for wrapping paragraphs, but it is through div1, div2 tags that are not in the W3C standard. See http://www.cs.tcd.ie/15445/UG.html.

Presentation settings

keywords & stopwords file
site specific XHTML wrapper
document-specific header

XSLT customization

framework calls user callbacks for file names, labels, etc
easy to add XSLT template rules to customize the formatting of items and rubrics; this is analogous to polymorphism in object-oriented programming: you just define a template with the same mode that matches a smaller subset of items or rubrics
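
The override mechanism can be pictured with a small Python analogy. This is not XSLT and not code from XML Publication: the rule table, the priorities and the "Price" rubric are all hypothetical, and XSLT resolves conflicts with its own priority rules. The sketch only shows the idea of a more specific rule taking precedence over the generic one.

```python
# Each rule is (priority, predicate, formatter). All names are hypothetical.
RULES = [
    # Generic rule: matches every rubric.
    (0, lambda rubric: True,
     lambda rubric: "<p>%s</p>" % rubric["text"]),
    # Specific rule: matches a smaller subset, so it gets a higher priority.
    (1, lambda rubric: rubric["name"] == "Price",
     lambda rubric: "<p class='price'>%s</p>" % rubric["text"]),
]


def format_rubric(rubric):
    """Apply the matching rule that has the highest priority."""
    matching = [rule for rule in RULES if rule[1](rubric)]
    _, _, formatter = max(matching, key=lambda rule: rule[0])
    return formatter(rubric)


print(format_rubric({"name": "Price", "text": "12"}))
print(format_rubric({"name": "Sowing", "text": "Sow indoors"}))
```

Adding a customization means adding one rule to the table; the generic rule stays untouched, just as the framework stylesheets stay untouched when a user template is added.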

Implementation

Automatic update and chaining with Jakarta Ant (or GNU make).

Moreover, Ant makes the build portable across platforms and their file systems.

The build.xml files are designed so that data can be URLs, not only files (alas, Ant is not yet Web-oriented enough, but we are working on that too).
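
The idea of treating files and URLs alike can be sketched as follows. This is a Python illustration only, not how the Ant build files do it; the function names to_url and read_source are assumptions.

```python
import pathlib
import urllib.parse
import urllib.request


def to_url(source):
    """Turn a local path into a file: URL; leave real URLs untouched."""
    if urllib.parse.urlparse(source).scheme in ("http", "https", "ftp", "file"):
        return source
    return pathlib.Path(source).resolve().as_uri()


def read_source(source):
    """Read a data source given either as a URL or as a file path."""
    with urllib.request.urlopen(to_url(source)) as response:
        return response.read()
```

With this, read_source("work/master.xml") and read_source with an http URL go through the same code path, so the rest of the chain does not need to know where the data came from.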

Algorithms

Build skeleton thesaurus.xml

This is done in publication-magician.xslt. The starting point is the work/master.xml file, which aggregates all document sources; it has the structure described above (XML basic structure).

// make a unique list of rubrics across document sources:
for each unique string in:
 $master.xml/p/p/h2
 store it in a new <rubric> element
// for each rubric make a list of occurring words
Loop on rubrics
 Concatenate words for this rubric from all items
 Tokenize to get each word wrapped in an XML element
 Sort the word list and remove duplicates
 Filter the word list (upper/lower case, numbers, etc)
 Mark stopwords
 Write <rubric> with sub-element <keyword>
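
The steps above can be sketched in Python as follows. This is a rough illustration, not a port of publication-magician.xslt: the sample data, the stopword list and the exact filter (fold to lower case, drop tokens containing digits) are assumptions.

```python
import re

# Hypothetical input: item name -> {rubric name: rubric text}.
ITEMS = {
    "Tomato": {"Sowing": "Sow the seeds indoors", "Harvest": "Pick the ripe fruit"},
    "Bean": {"Sowing": "Sow seeds in warm soil in 2 rows"},
}
# Hypothetical stopword list; the real one would come from the stopwords file.
STOPWORDS = {"the", "in"}


def build_thesaurus(items, stopwords):
    """Return {rubric: [(word, is_stopword), ...]} with sorted, unique words."""
    # Unique list of rubric names across all items, in first-seen order.
    rubric_names = []
    for rubrics in items.values():
        for name in rubrics:
            if name not in rubric_names:
                rubric_names.append(name)

    thesaurus = {}
    for name in rubric_names:
        # Concatenate the text of this rubric from all items.
        text = " ".join(rubrics.get(name, "") for rubrics in items.values())
        # Tokenize, fold case, and drop tokens that contain digits.
        words = {
            word.lower()
            for word in re.findall(r"\w+", text)
            if not any(char.isdigit() for char in word)
        }
        # Sort the unique words and mark the stopwords.
        thesaurus[name] = [(word, word in stopwords) for word in sorted(words)]
    return thesaurus


thesaurus = build_thesaurus(ITEMS, STOPWORDS)
for rubric, keywords in thesaurus.items():
    print(rubric, keywords)
```

Stopwords are marked rather than removed, as in the pseudocode, so that a later step can still decide what to do with them.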

Build an index for each rubric

This is done in make-index-by-rubrics-impl.xslt. Some parts are similar to the preceding stylesheet and need refactoring. This is because, when the user chooses <rubric use-keywords="no">, it does the same thing as the preceding stylesheet, except that it keeps the <rubric> element in an <xsl:variable> instead of writing it to publication.xml.

Loop on rubrics
 Fill HTML document for this rubric
  Loop on words list
   Loop on items
    If word is found in this rubric of this item
     add hyperlink to HTML item file
Fill HTML document for the overall index.html
 Loop on rubrics
  add hyperlink to HTML document for this rubric
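
The nested loops can be sketched in Python like this. Again this is an illustration under assumptions, not the stylesheet itself: the sample data, the word lists, the item_file naming rule and the HTML layout are all hypothetical.

```python
import re
from html import escape

# Hypothetical input: item name -> {rubric name: rubric text}.
ITEMS = {
    "Tomato": {"Sowing": "Sow the seeds indoors", "Harvest": "Pick the ripe fruit"},
    "Bean": {"Sowing": "Sow seeds in warm soil in 2 rows"},
}
# Hypothetical word list per rubric, for example taken from the thesaurus.
WORDS = {"Sowing": ["seeds", "soil"], "Harvest": ["fruit"]}


def item_file(item_name):
    """Hypothetical naming rule for the HTML file of an item."""
    return item_name.lower() + ".html"


def build_rubric_index(items, words_by_rubric):
    """Return {rubric: {word: [HTML files of items using the word there]}}."""
    index = {}
    for rubric, words in words_by_rubric.items():
        index[rubric] = {}
        for word in words:
            hits = []
            for name, rubrics in items.items():
                tokens = {t.lower() for t in re.findall(r"\w+", rubrics.get(rubric, ""))}
                if word in tokens:
                    hits.append(item_file(name))
            index[rubric][word] = hits
    return index


def render_rubric_page(rubric, entries):
    """Render the index of one rubric as a minimal HTML fragment."""
    lines = ["<h1>%s</h1>" % escape(rubric), "<ul>"]
    for word, files in entries.items():
        links = ", ".join('<a href="%s">%s</a>' % (escape(f), escape(f)) for f in files)
        lines.append("<li>%s: %s</li>" % (escape(word), links))
    lines.append("</ul>")
    return "\n".join(lines)


index = build_rubric_index(ITEMS, WORDS)
page = render_rubric_page("Sowing", index["Sowing"])
print(page)
```

The overall index.html would then be one more loop over the rubrics, adding a hyperlink to each rubric page.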

Modularity

Each functionality in a separate XSLT transform:

semantic markup
creating word lists by rubric
creating hyperlinked indices and (multi-level) TOC
XHTML formatting

The chaining of transforms (data flow) is not hard-wired in the XSLT transforms; the Ant build.xml takes care of that.
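
The design choice can be shown with a small Python sketch: each step knows nothing about the others, and the order is declared in one place outside the steps. The step bodies are hypothetical placeholders standing in for the XSLT transforms, and the names are chosen here for illustration.

```python
from functools import reduce


# Each function stands in for one XSLT transform; bodies are placeholders.
def semantic_markup(doc):
    return doc + ["semantic markup"]


def word_lists(doc):
    return doc + ["word lists"]


def indices_and_toc(doc):
    return doc + ["indices and TOC"]


def xhtml_formatting(doc):
    return doc + ["XHTML formatting"]


# The data flow lives here, outside the steps, as it does in build.xml.
PIPELINE = [semantic_markup, word_lists, indices_and_toc, xhtml_formatting]


def run(pipeline, doc):
    """Feed the output of each step into the next one."""
    return reduce(lambda data, step: step(data), pipeline, doc)


result = run(PIPELINE, [])
print(result)
```

Reordering or dropping a step means editing the PIPELINE list only, which is the property the Ant build file gives to the real transforms.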

Future

validation and assisted editing for correction of source documents (Schematron + XED or Emacs)
navigation through hierarchy of categories (commercial catalogs)
integration of an XML search engine (eXist)
dynamic stylesheets and multiple presentations: table, paragraphs, tree
WAR packaging for J2EE Web servers for dynamic server
Web interface and Web services authoring and publication
connectors for several document types: DocBook, OpenOffice, TEI, spreadsheets, relational databases, ...
authoring GUI tools for semantic markup, option selection, and publishing

Detailed list of features and tasks
