XML design
XML design principles
Copyright J.M. Vanel 2000 - under Open
Content licence
Data structures design
-
the usual principles of good data design apply, such as the normal forms of relational databases
-
attributes vs elements vs text content: in my view there are few reasons
to use attributes in pure XML. The exceptions are HTML/XHTML
and:
-
for ID, IDREF, and IDREFS types;
-
for information that is not normally displayed (otherwise you have to say
style='display:none' for each element); some old browsers will not accept
element content without displaying it, whereas an attribute is never displayed
-
for data that both cannot occur more than once and has no substructure
(e.g. a date)
Otherwise, this distinction inherited from SGML:
-
carries no clear semantics;
-
prevents any future extension of the schema (e.g. adding sub-elements)
Attributes exist:
-
for historical reasons, because HTML and SGML have them;
-
when to use IDs? See the annex "ID or identity ?" below.
-
ID/IDREF pairs are the equivalent of foreign keys in databases, or of
pointers or references in C++/Java
-
XML Schemas, RDF Schemas, DTDs: a data provider should always provide
a Schema of the data it offers
-
except for small and non-evolving projects, avoid DTDs
-
use XML Schemas for database-like data
-
use RDF Schemas for knowledge-oriented data
-
linguistic approach: reuse human vocabulary for nouns and adjectives, but
inside well-defined namespaces, using dictionaries (e.g. WordNet) to disambiguate
meanings. A freely available dictionary such as WordNet could be used to generate
a giant XML Schema where each noun and adjective is turned into
an element.
-
always re-use existing vocabularies and semantics, e.g. HTML and DocBook
are well-established Schemas; why re-invent the wheel?
-
There are two sorts of tree structure, the type-of tree and the part-of
tree. For the type-of tree you have two solutions: use XML Schema or RDF
Schema constructs and underlying semantics.
-
naming scheme for XML namespaces: possibly re-use an existing naming scheme,
such as the reverse-domain convention used for Java packages
-
Namespaces: it is mandatory to define a namespace for your vocabulary/Schema(s),
whether you use a DTD, an XML Schema or an RDF Schema
-
very easy to do
-
makes it possible to mix several vocabularies in a single instance document
-
acts as a prefix identifying the domain to which a tag belongs
-
can be a way to do versioning for your vocabulary/Schema
-
avoid naive design, e.g.:
-
<item name="myName">content</item>
-
instead, just put:
-
<myName>content</myName>
-
Notes:
-
The naive design is like saying at lunch: "Give me this thing that is called
water".
-
Expert readers might say that XML Schema itself has just this kind of "naive
design", but XML Schema is an exception because it serves to express types for
other documents, such as the preferred example above.
-
avoid mixed content
-
very frequent tags can be short, especially if the container has a more
readable name, e.g. in HTML <table><tr>
-
use the containment semantics of XML to mark a context, or to add some details
to a previous version of a schema
-
HTML example <span style="color:red;"><html:a href="#target">see
here</html:a></span>
-
this pattern is especially attractive when used with XSLT; see here about
the use of XSLT
-
design according to the XML parser/engine used:
-
SAX
-
DOM
-
XSLT
-
XSLT makes it very easy to data-mine a complex structured document, e.g.
an XHTML document where standard HTML text is mixed with XML data islands.
XSLT's query language, XPath, is in some respects more expressive than SQL
for navigating tree-structured data.
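A minimal sketch of this kind of data mining, written in Python with the standard-library ElementTree module and its limited XPath subset rather than XSLT. The flora namespace URI, the fl prefix and the sample document are assumptions made for illustration; only the XHTML namespace is a real one. It also shows how namespaces let two vocabularies share one instance document.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance document: XHTML text mixed with data islands
# taken from an invented "flora" vocabulary.
DOC = """\
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:fl="http://example.org/ns/flora">
  <body>
    <p>Violets seen this spring:</p>
    <fl:species><fl:genus>Viola</fl:genus><fl:name>riviniana</fl:name></fl:species>
    <p>and also:</p>
    <fl:species><fl:genus>Viola</fl:genus><fl:name>odorata</fl:name></fl:species>
  </body>
</html>
"""

# Prefix-to-namespace map used by the path queries below.
NS = {
    "h": "http://www.w3.org/1999/xhtml",
    "fl": "http://example.org/ns/flora",  # assumed URI, not a real vocabulary
}


def species_names(xml_text):
    """Return 'genus name' strings for every fl:species island."""
    root = ET.fromstring(xml_text)
    names = []
    for sp in root.iterfind(".//fl:species", NS):
        genus = sp.findtext("fl:genus", default="", namespaces=NS)
        name = sp.findtext("fl:name", default="", namespaces=NS)
        names.append(genus + " " + name)
    return names


def paragraphs(xml_text):
    """Return the text of every XHTML paragraph, ignoring the data islands."""
    root = ET.fromstring(xml_text)
    return [p.text for p in root.iterfind(".//h:p", NS)]


if __name__ == "__main__":
    print(species_names(DOC))
    print(paragraphs(DOC))
```

The two functions query the same tree but each sees only its own vocabulary, because every path step is qualified by a namespace.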
Application design
See XML inside.
See Extensible browsers.
-
compound documents similar to OLE or MIME documents are easily created
and analyzed in XML, thanks to XSLT and XML Schema
-
if you are familiar with Unified Modeling Language (UML), use the following
correspondence rules:
-
UML aggregation ==> element containment
-
UML generalization ==> rdfs:subClassOf, or in XML Schema a type derived by
extension: <xs:complexType name="myDerivedType"> containing
<xs:complexContent><xs:extension base="myAncestor">
-
UML simple association ==> use an XML attribute of type IDREF (pointing to an
ID) for a pointer inside a single document, or an RDF or XLink statement for
a pointer outside the current document
-
using XML at the boundaries between systems allows very flexible design,
e.g.:
-
define a domain vocabulary for messages: queries (including actions), answers
to queries;
-
then this vocabulary can be used either with HTTP requests, or by function
calls such as: void message(String in,String out);
or rather: void message(DOMNode in, DOMNode out);
-
use XML for clipboard and drag'n drop data.
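The message idea above can be sketched as follows, in Python with the standard-library ElementTree module. The query, answer, product, stock and quantity element names and the STOCK table are invented for illustration; the function returns the answer instead of filling an output parameter, which is the more usual Python style.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory data standing in for the system behind the boundary.
STOCK = {"water": 12, "bread": 0}


def message(request):
    """Handle one <query> element and return an <answer> element."""
    answer = ET.Element("answer")
    if request.tag != "query":
        # Unknown message type: report it inside the same vocabulary.
        ET.SubElement(answer, "error").text = "unsupported message: " + request.tag
        return answer
    for product in request.iterfind("product"):
        name = (product.text or "").strip()
        stock = ET.SubElement(answer, "stock")
        ET.SubElement(stock, "product").text = name
        ET.SubElement(stock, "quantity").text = str(STOCK.get(name, 0))
    return answer


def message_text(request_xml):
    """String-based variant, usable as the body of an HTTP request and response."""
    return ET.tostring(message(ET.fromstring(request_xml)), encoding="unicode")


if __name__ == "__main__":
    print(message_text("<query><product>water</product></query>"))
```

The tree-based function and the string-based one share the same vocabulary, so the same messages can travel over HTTP or through a direct function call.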
ANNEXES
ID or identity ?
This is a standard design issue for XML. Which criteria can we follow to
decide when to put an ID attribute in an element?
1st example: commercial orders
They have no identity of their own, because the same client can place
the same order twice on the same day. So it is good design to put an ID
attribute in <order> elements. This is equivalent to assigning a number
to each order, but from an XML point of view it has the advantage of making
the element reachable from inside or outside the document.
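As an illustration, here is a small Python sketch with two otherwise identical orders that only their ID tells apart. The sample data and the id attribute name are assumptions; ElementTree does not read a DTD or Schema, so the attribute is treated as an ID by convention only.

```python
import xml.etree.ElementTree as ET

# Hypothetical orders: same client, same product, same day.
ORDERS = """\
<orders>
  <order id="o-2000-0001"><client>Smith</client><product>water</product></order>
  <order id="o-2000-0002"><client>Smith</client><product>water</product></order>
</orders>
"""


def index_by_id(root):
    """Map each id attribute value to its element, rejecting duplicates."""
    index = {}
    for elem in root.iter():
        key = elem.get("id")
        if key is None:
            continue
        if key in index:
            raise ValueError("duplicate id: " + key)
        index[key] = elem
    return index


if __name__ == "__main__":
    orders = index_by_id(ET.fromstring(ORDERS))
    print(sorted(orders))
```

Once indexed, an order can be reached directly by its ID, which is what an IDREF or a fragment identifier relies on.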
2nd example: Plant species descriptions
A <species> element has 2 sub-elements:
<name> <genus>
that together refer to a unique species; this is a key. So an ID attribute
would be redundant. We can specify in XML Schema that several sub-elements
together form a key (in the database sense) for the containing element,
with the new <key> element.
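The key constraint itself would be enforced by a Schema validator. The Python sketch below is only a hand-written check of the same rule, not XML Schema validation; the duplicate_keys helper and its parameters are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_keys(root, element="species", fields=("genus", "name")):
    """Return the key tuples (one value per field) that occur more than once."""
    counts = Counter(
        tuple(sp.findtext(field, default="").strip() for field in fields)
        for sp in root.iter(element)
    )
    return sorted(key for key, n in counts.items() if n > 1)


if __name__ == "__main__":
    flora = ET.fromstring(
        "<flora>"
        "<species><genus>Viola</genus><name>riviniana</name></species>"
        "<species><genus>Viola</genus><name>odorata</name></species>"
        "</flora>"
    )
    print(duplicate_keys(flora))
```

An empty result means the two sub-elements together identify each species, so no separate ID attribute is needed.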
The species example raises a new issue:
How can we "href" from an exterior URL into an instance of <species>?
==> answer: using the XPointer W3C standard (http://www.w3.org/TR/WD-xptr),
we can specify in a <html:a> element:
href="flora_of_UK.xml#xptr(species[genus='Viola'][name='riviniana'])"
==> this raises yet another issue: which browser implements XPointer
(probably none), or which browser will soon implement it?
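Until a browser resolves such pointers, application code can make the same selection. A minimal sketch, assuming a flora document shaped like the example; the habitat element and the resolve helper are invented, and the predicate syntax is ElementTree's XPath subset, not XPointer itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical content for flora_of_UK.xml.
FLORA = """\
<flora>
  <species><genus>Viola</genus><name>riviniana</name><habitat>woodland</habitat></species>
  <species><genus>Viola</genus><name>odorata</name><habitat>hedgerow</habitat></species>
</flora>
"""


def resolve(root, genus, name):
    """Select the species children of root whose genus and name both match.

    The values are pasted into the path as-is, so they must not contain
    quote characters.
    """
    path = "species[genus='{}'][name='{}']".format(genus, name)
    return root.findall(path)


if __name__ == "__main__":
    for species in resolve(ET.fromstring(FLORA), "Viola", "riviniana"):
        print(species.findtext("habitat"))
```

The two chained predicates play the role of the composite key: together they select at most one species when the key constraint holds.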
Extensible browsers
My vision is the following: a multi-everything browser will mainly be an
empty shell able to call the appropriate processors whenever it sees certain
XML namespaces and/or Processing Instructions.
It will enable multi-domain documents.
It will manage drag'n drop and clipboard with an XML data model.
It might include an editor with the same multi-domain capabilities.
Its responsibility will of course also be to manage the display space
between processors (tiling, resize, ...).
One important responsibility can also be to manage the mapping between
the data XML and the displayed XML (HTML or plain XML with CSS).
Generic display skills are also desirable:
- collapsible tree/graph views for the document tree, the inheritance
graph, the ID/IDREF graph
- extended search/query
So I expect a general and modular tool for manipulating data, of the
3 main kinds: document-oriented (HTML & word processor), structure-oriented
(database type) and knowledge-oriented (semantic network, RDF, etc)
The next killer-app ...
A role for Mozilla ?