XML design

XML design principles

Data structures design

of course, the usual good design principles apply, such as the normal forms of relational databases
attributes vs elements vs text content: in my view there are few reasons to use attributes in pure XML. The exceptions are HTML/XHTML, and:

for ID, IDREF, and IDREFS types;
for information that is not normally displayed: element content would need style='display:none' on each element, and some old browsers display element content regardless, whereas an attribute is never displayed;
for data that can occur at most once and has no substructure (e.g. a date).

Otherwise, using an attribute:

carries no clear semantics;
prevents any future extension of the schema (e.g. adding sub-elements).

Attributes exist in XML largely for historical reasons, because HTML and SGML have them.

when to use IDs? See the annex on IDs below.

ID/IDREF attributes are the equivalent of keys and foreign keys in databases, or of pointers or references in languages such as C++ and Java
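
The foreign-key analogy can be made concrete with a small sketch in Python. The document, the element names and the "id" and "author" attribute names are all made up for illustration. Without a DTD or schema, ElementTree sees both attributes as plain strings, so the ID/IDREF typing is only a convention here.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: <author> carries an ID-like "id" attribute,
# <book> points to it through an IDREF-like "author" attribute.
DOC = """
<library>
  <author id="a1"><name>Jane Austen</name></author>
  <author id="a2"><name>Mark Twain</name></author>
  <book author="a1"><title>Emma</title></book>
  <book author="a2"><title>Roughing It</title></book>
</library>
"""

root = ET.fromstring(DOC)

# Index every element that has an id, much like a primary-key index.
by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}


def author_name(book):
    """Follow the IDREF-style pointer from a <book> to its <author>."""
    target = by_id[book.get("author")]
    return target.findtext("name")


titles_to_authors = {
    book.findtext("title"): author_name(book) for book in root.findall("book")
}
```

Looking up by_id plays the role of a join on a foreign key; a dangling reference shows up as a KeyError.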

XML Schemas, RDF Schemas, DTDs: a data provider should always provide a Schema of the data it offers

except for small and non-evolving projects, avoid DTDs
use XML Schemas for database-like data
use RDF Schemas for knowledge-oriented data

linguistic approach: reuse human vocabulary for nouns and adjectives, but inside well-defined namespaces, using dictionaries (e.g. WordNet) to disambiguate meanings. A freely available dictionary such as WordNet could be used to generate a giant XML Schema in which each noun and adjective is turned into an element.
always re-use existing vocabularies and semantics, e.g. HTML and DocBook are well-established Schemas; why re-invent the wheel?
There are two sorts of tree structure, the type-of tree and the part-of tree. For the type-of tree you have two solutions: use XML Schema or RDF Schema constructs and underlying semantics.
naming scheme for XML namespaces: possibly re-use existing naming schemes for Java
Namespaces: always define a namespace for your vocabulary/Schema(s), whether you use a DTD, an XML Schema or an RDF Schema. A namespace is:

very easy to do
makes it possible to mix several vocabularies in a single instance document
acts as a prefix identifying the domain to which a tag belongs
can be a way to version your vocabulary/Schema
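
As a rough illustration of mixing two vocabularies in one instance document, here is a Python sketch. The flora namespace URI and its element names are invented for the example; the XHTML namespace URI is the real one. ElementTree exposes each qualified tag as '{uri}local', which makes the "prefix identifying the domain" visible.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
FLORA_NS = "http://example.org/ns/flora"  # hypothetical vocabulary

DOC = f"""
<html:div xmlns:html="{XHTML_NS}" xmlns:fl="{FLORA_NS}">
  <html:p>A common violet:</html:p>
  <fl:species><fl:genus>Viola</fl:genus><fl:name>riviniana</fl:name></fl:species>
</html:div>
"""

root = ET.fromstring(DOC)


def namespace_of(element):
    """Return the namespace URI from ElementTree's '{uri}local' tag form."""
    tag = element.tag
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""


# Group local element names by the vocabulary (namespace) they belong to.
by_namespace = {}
for el in root.iter():
    local = el.tag.split("}", 1)[-1]
    by_namespace.setdefault(namespace_of(el), []).append(local)
```

The prefixes (html, fl) are only local shorthands; what identifies the vocabulary is the URI.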

avoid naive design, e.g.:

<item name="myName">content</item>
instead, just put:
<myName>content</myName>
Notes:

The naive design is like saying at lunch: "Give me this thing that is called water".
Expert readers might say that XML Schema itself has just this kind of "naive design", but XML Schema is an exception because it serves to express types for other documents, such as the preferred example above.
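
For existing documents written in the naive style, the conversion is mechanical. The following Python sketch uses a hypothetical helper name (denaive) and assumes the value of the name attribute is a valid XML element name.

```python
import xml.etree.ElementTree as ET


def denaive(item):
    """Turn <item name="X">...</item> into <X>...</X>.

    Assumes the 'name' attribute holds a valid XML element name.
    """
    new = ET.Element(item.get("name"))
    new.text = item.text
    new.extend(list(item))  # keep any sub-elements as they are
    return new


naive = ET.fromstring('<item name="myName">content</item>')
better = ET.tostring(denaive(naive), encoding="unicode")
```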

avoid mixed content
very frequent tags can be short, especially if the container has a more readable name: for example, in HTML, <table><tr>
use the containment semantics of XML to mark a context, or to add some details to a previous version of a schema

HTML example <span style="color:red;"><html:a href="#target">see here</html:a></span>
this pattern is especially attractive when used with XSLT; see the remarks on XSLT below

design according to the XML parser/engine used:

SAX
DOM
XSLT
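
To give a feel for how the choice of engine shapes the code, here is a small Python sketch that reads the same made-up <orders> document once with SAX and once with DOM, using only the standard library. XSLT is not shown because the Python standard library has no XSLT processor.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical document used for both approaches.
DOC = "<orders><order id='o1'/><order id='o2'/><order id='o3'/></orders>"


class OrderIds(xml.sax.ContentHandler):
    """SAX: react to events as they stream by; no tree is built."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def startElement(self, name, attrs):
        if name == "order":
            self.ids.append(attrs.getValue("id"))


handler = OrderIds()
xml.sax.parseString(DOC.encode("utf-8"), handler)
sax_ids = handler.ids

# DOM: load the whole tree in memory, then navigate it freely.
dom = minidom.parseString(DOC)
dom_ids = [el.getAttribute("id") for el in dom.getElementsByTagName("order")]
```

With SAX, a flat, streamable structure is easiest to handle; with DOM (or XSLT), deeper nesting and cross-references cost little extra code.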

XSLT makes it very easy to data-mine a complex structured document, e.g. an XHTML document where standard HTML text is mixed with XML data islands. Its query language, XPath, navigates nested and ordered structures directly, which in that respect makes it more expressive than SQL.
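
The data-island idea can be sketched in Python as follows. This is not XSLT: it uses the limited XPath subset of ElementTree as a stand-in, and the flora namespace and its elements are invented for the example.

```python
import xml.etree.ElementTree as ET

NS = {
    "h": "http://www.w3.org/1999/xhtml",
    "fl": "http://example.org/ns/flora",  # hypothetical vocabulary
}

# An XHTML page whose prose is interleaved with XML data islands.
DOC = """
<h:html xmlns:h="http://www.w3.org/1999/xhtml"
        xmlns:fl="http://example.org/ns/flora">
  <h:body>
    <h:p>Two violets were recorded:</h:p>
    <fl:species><fl:genus>Viola</fl:genus><fl:name>riviniana</fl:name></fl:species>
    <h:p>and, nearby,</h:p>
    <fl:species><fl:genus>Viola</fl:genus><fl:name>odorata</fl:name></fl:species>
  </h:body>
</h:html>
"""

root = ET.fromstring(DOC)

# One path expression pulls the data out, wherever it sits in the page.
names = [
    sp.findtext("fl:name", namespaces=NS)
    for sp in root.findall(".//fl:species", NS)
]
```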

Application design

See XML inside.
See Extensible browsers.

compound documents similar to OLE or MIME documents are easily created and analyzed in XML, thanks to XSLT and XML Schema
if you are familiar with the Unified Modeling Language (UML), use the following correspondence rules:

UML aggregation ==> element containment
UML generalization ==> rdfs:subClassOf, or a derived type in XML Schema: <xs:complexType name="myDerivedType"> containing <xs:extension base="myAncestor">
UML simple association ==> use an XML attribute of type IDREF (pointing to an attribute of type ID) for a pointer inside a single document, or an RDF or XLink statement for a pointer outside the current document

using XML at the boundaries between systems allows very flexible design, e.g.:

define a domain vocabulary for messages: queries (including actions), answers to queries;
then this vocabulary can be used either with HTTP requests, or by function calls such as: void message(String in,String out); or rather: void message(DOMNode in, DOMNode out);
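
A minimal Python sketch of such a message boundary is given below. The <query>, <answer> and <error> vocabulary, the price table and the function names are assumptions made for the example. The first function works on parsed elements, like the DOMNode variant above; the second wraps it for callers that only have text, such as an HTTP body.

```python
import xml.etree.ElementTree as ET

PRICES = {"tea": "3.50", "coffee": "2.80"}  # made-up data for the sketch


def message(request):
    """Handle one <query> element and return an <answer> or <error> element."""
    item = request.findtext("item")
    if request.tag != "query" or item not in PRICES:
        error = ET.Element("error")
        error.text = "unknown query"
        return error
    answer = ET.Element("answer")
    ET.SubElement(answer, "item").text = item
    ET.SubElement(answer, "price").text = PRICES[item]
    return answer


def message_text(text):
    """Same entry point for text-only transports, e.g. an HTTP request body."""
    return ET.tostring(message(ET.fromstring(text)), encoding="unicode")


reply = message_text("<query><item>tea</item></query>")
```

Because both entry points share one vocabulary, moving a component from an in-process call to an HTTP service changes the transport but not the messages.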

use XML for clipboard and drag'n drop data.

ANNEXES

ID or identity ?

This is a standard design issue for XML. Which criteria can we follow to decide when to put an ID attribute in an element?

1st example: commercial orders

They have no identity of their own, because the same client can place the same order twice on the same day. So it is good design to put an ID attribute in <order> elements. This is equivalent to assigning a number to each order, but from an XML point of view it has the advantage of making the element reachable from inside or outside the document.

2nd example: Plant species descriptions

A <species> element has 2 sub-elements:
<name> <genus>
that together refer to a unique species; this is a key. So an ID attribute would be redundant. In XML Schema we can declare that several sub-elements together form a key (in the database sense) for the containing element, using the <xs:key> element.
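
What such a composite key guarantees can be checked by hand. The Python sketch below does not validate against an XML Schema; it only imitates the uniqueness check on the (genus, name) pair. The <flora> wrapper element and the function name are assumptions.

```python
import xml.etree.ElementTree as ET

DOC = """
<flora>
  <species><genus>Viola</genus><name>riviniana</name></species>
  <species><genus>Viola</genus><name>odorata</name></species>
  <species><genus>Bellis</genus><name>perennis</name></species>
</flora>
"""


def duplicate_keys(root):
    """Return every (genus, name) pair that occurs more than once."""
    seen, duplicates = set(), []
    for sp in root.iter("species"):
        key = (sp.findtext("genus"), sp.findtext("name"))
        if key in seen:
            duplicates.append(key)
        seen.add(key)
    return duplicates


violations = duplicate_keys(ET.fromstring(DOC))
```

An empty result means the pair really behaves as a key, so no extra ID attribute is needed to tell species apart.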

The species example raises a new issue:

How can we "href" from an exterior URL into an instance of <species>?
==> answer: using the W3C XPointer standard (http://www.w3.org/TR/WD-xptr),
we can specify in a <html:a> element:
href="flora_of_UK.xml#xptr(species[genus='Viola'][name='riviniana'])"
==> this raises yet another issue: which browser implements XPointer (probably none), or which browser will soon implement it?
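
Until browsers resolve such links, an application can resolve them itself. The Python sketch below is not an XPointer implementation: it extracts the expression from the fragment and hands it to the XPath subset of ElementTree. The helper name, the inline document standing in for flora_of_UK.xml, and the choice to search the whole document (the './/' prefix) are assumptions.

```python
import xml.etree.ElementTree as ET

HREF = "flora_of_UK.xml#xptr(species[genus='Viola'][name='riviniana'])"

# Stand-in for the content of flora_of_UK.xml (hypothetical).
DOC = """
<flora>
  <species><genus>Viola</genus><name>riviniana</name></species>
  <species><genus>Viola</genus><name>odorata</name></species>
</flora>
"""


def xptr_expression(href):
    """Extract the path expression from a '#xptr(...)' fragment (sketch only)."""
    fragment = href.split("#", 1)[1]
    if not (fragment.startswith("xptr(") and fragment.endswith(")")):
        raise ValueError("not an xptr() fragment")
    return fragment[len("xptr("):-1]


root = ET.fromstring(DOC)
expression = xptr_expression(HREF)
# Assumption: the target may sit anywhere in the document.
matches = root.findall(".//" + expression)
```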

Extensible browsers

My vision is the following: a multi-everything browser will mainly be an empty shell able to call the appropriate processors whenever it sees certain XML namespaces and/or Processing Instructions.
It will enable multi-domain documents.
It will manage drag and drop and the clipboard with an XML data model.
It might include an editor with the same multi-domain capabilities.
Its responsibility will of course also be to manage the display space between processors (tiling, resize, ...).
Another important responsibility can be to manage the mapping between the data XML and the displayed XML (HTML or plain XML with CSS).
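
The "empty shell" idea can be sketched in a few lines of Python: a table maps namespace URIs to processors, and the shell only dispatches. The processors here are placeholders that return a label; the third namespace is invented to show what happens when no processor is registered.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
MATHML_NS = "http://www.w3.org/1998/Math/MathML"


def split_tag(element):
    """Split ElementTree's '{uri}local' tag into (uri, local)."""
    if element.tag.startswith("{"):
        uri, local = element.tag[1:].split("}", 1)
        return uri, local
    return "", element.tag


# Hypothetical processors: real ones would lay out and draw their subtree.
PROCESSORS = {
    XHTML_NS: lambda el: "text processor",
    MATHML_NS: lambda el: "math processor",
}


def dispatch(root):
    """Hand each child subtree to the processor registered for its namespace."""
    plan = []
    for child in root:
        uri, local = split_tag(child)
        processor = PROCESSORS.get(uri)
        plan.append((local, processor(child) if processor else "no processor"))
    return plan


DOC = f"""
<h:body xmlns:h="{XHTML_NS}" xmlns:m="{MATHML_NS}"
        xmlns:x="http://example.org/ns/unknown">
  <h:p>Some text</h:p>
  <m:math><m:mi>x</m:mi></m:math>
  <x:widget/>
</h:body>
"""

plan = dispatch(ET.fromstring(DOC))
```

Adding support for a new domain then means registering one more entry in the table, without touching the shell.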

Generic display skills are also desirable:
- collapsible tree/graph views for the document tree, the inheritance graph, the ID/IDREF graph
- extended search/query

So I expect a general and modular tool for manipulating data, of the 3 main kinds: document-oriented (HTML & word processor), structure-oriented (database type) and knowledge-oriented (semantic network, RDF, etc)

The next killer-app ...
A role for Mozilla ?