XML Publication - Enhancement, requirements, future

Last update: Thu 09 May 2002 15:55:36 CEST

erase corrupt or empty intermediate files (currently it is safer to run "ant clean" when something goes wrong)
for keyword search, take into account the plural forms of a word (probably use Lucene, a high-performance, full-featured text search engine written entirely in Java)
take into account multi-level structured headers in tables that use <td colspan="...">: document source
create hyperlinks in item pages:
- add a new rubric whose hyperlinks are generated by scanning a rubric (possibly in another XMLPublication directory) for words contained in a set of rubrics of this item; inside the <publication> tag in thesaurus.xml, add this:
  <add-hyperlink-rubric label="Peoples' knowledge"><from><keywords><from-rubric label="English name"/><from-rubric label="Name"/></from><to><XMLPublication directory="../Farmers"/><keywords><from-rubric label="Varieties of plants"/></to>
- turn existing words into hyperlinks by scanning a rubric for words contained in another rubric; inside the <rubric> tag in thesaurus.xml, add this:
  <add-hyperlinks-in-item><keywords><from-rubric label="Name" />
for search-engine generated pages, highlight (e.g. in yellow) the words present in the query
rewrite internal hyperlinks, as required by the splitting into small HTML files; still to do for table documents (done for documents with paragraphs and sub-paragraphs delimited by h1 and h2 titles)
add an Ant get task to retrieve the XSLT processor saxon.jar, so that XMLPub users don't have to install Saxon
<get src="http://saxon.sourceforge.net/saxon/???/saxon.jar" dest="saxon.jar" verbose="true" usetimestamp="true" ignoreerrors="yes" />
"expert system" to guess whether an (X)HTML file or URL is to be treated as a table or as paragraphs and sub-paragraphs delimited by h1 and h2 titles
make an example of Schematron validation (possibly use the Seed Savers example)
test a complete case from a Word file (check windows-1252 encoding conversion to iso-8859-1 or utf-8)
We need a way to convert windows-1252 encoding to a more standard encoding like utf-8 or iso-8859-1; use Amaya or Oracle Parser or Xerces parser
add a "distrib" target to the top-level build.xml, to replace the shell script
investigate the Cocoon2 architecture to look for possible convergence
remove all references to XMLPublish, replacing them with XMLPublication
make a GUI for an XMLPublication project (possibly use Xybrix)
better document the "helpers" target (see "make words-list-by-rubric.xml" in current user_documentation.html)
design the integration of an inheritance hierarchy of item types (for e-commerce catalog integration); the hierarchy of item types, specified by a simple XML file, will be transformed into a hierarchy of hyperlinked Web pages
put on the site a sample of the existing transforms for e-commerce catalogs
every input may be either a URL or a file, so that anyone who has Ant and Saxon installed can run everything without installing anything else ==) alas, this implies modifications in ANT
include a servlet modeled on M. Kay's (GEDCOM example)
package this servlet in a J2EE webapp (using ant)
DTD and schema for the thesaurus.xml file
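
The windows-1252 conversion item above could be prototyped outside the XML parsers it mentions. A minimal sketch in Python, assuming the bytes exported from Word really are windows-1252; the function name cp1252_to_utf8 is hypothetical and not part of XMLPublication:

```python
def cp1252_to_utf8(data: bytes) -> bytes:
    """Re-encode windows-1252 bytes as UTF-8.

    Raises UnicodeDecodeError if the input contains one of the few
    byte values that windows-1252 leaves undefined.
    """
    text = data.decode("cp1252")   # bytes -> Unicode text
    return text.encode("utf-8")    # Unicode text -> UTF-8 bytes


# 0xE9 is "é", 0x96 is an en dash and 0x80 is the euro sign in windows-1252.
sample = b"caf\xe9 \x96 \x80"
converted = cp1252_to_utf8(sample)
print(converted.decode("utf-8"))
```

utf-8 looks like the safer target of the two: iso-8859-1 has no code points for the windows-1252 characters in the 0x80-0x9F range (en dash, euro sign, curly quotes), so those would be lost or need substitution.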

For Seed2Seed:

integration of XPath query engine: eXist
- do the indexing in the servlet's init() or by hand?
- first stage: simplest integration: just type a word and eXist looks in all rubrics
- 2nd stage: type a word and choose a rubric; eXist looks in that rubric
- 3rd stage: have a combo box to choose a word and choose a rubric; eXist looks in that rubric
  - XML Publication can be used to get a list of words present by rubric
  - the HTML page will also need to have access to declared keywords for each rubric
  - need some JavaScript or Java to make a combo box
test or develop the multi-document aspect
- reconcile metadata across sources (rubric names that differ but are equivalent in each source document); can be done later
- merge several source documents; the problem is that some have a table structure and others have a text-with-paragraphs structure
take into account the new page layout for Seed2Seed multi-document: http://www.seed2seed.net/base/Allium_Cepa-base.html
letter case must absolutely be handled correctly in tables; that is, the entries "Aphids" and "aphids", which both contain: CHILIPEPPERS –, DERRIS –, GARLIC –, TOBACCO –, must not both appear in the page index-Target_Organisms.html
support multi-word expressions in indexes, for example 'proof of concept'; not only single words
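
The letter-case requirement above amounts to grouping index entries under a case-insensitive key. A minimal sketch in Python, for illustration only (the actual fix would presumably live in the XSLT stylesheets); merge_index_entries and the shape of its input are assumptions, not existing XMLPublication code:

```python
def merge_index_entries(entries):
    """Merge index entries whose keys differ only by letter case.

    entries: iterable of (key, list_of_targets) pairs.
    Returns a dict mapping the first-seen spelling of each key to its
    de-duplicated targets, in first-seen order.
    """
    merged = {}  # case-folded key -> (display key, collected targets)
    for key, targets in entries:
        folded = key.casefold()
        if folded not in merged:
            merged[folded] = (key, [])
        _, collected = merged[folded]
        for target in targets:
            if target not in collected:
                collected.append(target)
    return {display: collected for display, collected in merged.values()}


plants = ["CHILIPEPPERS", "DERRIS", "GARLIC", "TOBACCO"]
index = merge_index_entries([("Aphids", plants), ("aphids", plants)])
print(index)  # a single "Aphids" entry listing the four plants once each
```

Keeping the first-seen spelling as the display form is an arbitrary choice; a fixed capitalization rule would work just as well.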

GUI

A Swing or HTML form with:

URL of the (X)HTML source, specifying "extract tables" or "extract paragraphs with titles", or
URL of the .doc source, specifying ... ditto, or
URL of a directory from which XMLPublication will take everything it can: .htm, .html, .xsl, .doc, again specifying ... ditto
URL of thesaurus.xml: optional
URL of an XSLT stylesheet that will be included via <xsl:include>: optional
URL of presentation.html: with a default value: a URL with an "XMLPublication" site banner
URL of book-title.html: with a default value: a document containing the name of the source document
URL of the XMLPublication core: with a default value: the directory http://wwbota.free.fr/xslt/
a "generate skeleton thesaurus.xml" button: creates a thesaurus.xml file as a starting point for editing; it will specify that all words in all fields are indexed
a "generate statistics" button that walks through the source document to identify the elements, their nesting, and their counts ==) uses example2Schema.xslt
a "generate words list by rubric" button, which can be used as an aid when editing thesaurus.xml ==) uses make-words-list.xslt

Finally, the most important:

a "make Web site" button
a "Save" button that saves the whole form in a simple XML format
an "Open" button that opens a session with the whole form from a simple XML format
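
The "Save" and "Open" buttons above imply a round trip between the form and a simple XML file. A rough sketch in Python of what that round trip could look like; the element names xmlpublication-project and field, and the field names in the example, are purely hypothetical since no format has been designed yet:

```python
import xml.etree.ElementTree as ET


def save_form(fields: dict) -> str:
    """Serialize the form fields as a flat XML document (hypothetical format)."""
    root = ET.Element("xmlpublication-project")
    for name, value in fields.items():
        field = ET.SubElement(root, "field", name=name)
        field.text = value
    return ET.tostring(root, encoding="unicode")


def open_form(xml_text: str) -> dict:
    """Read the fields back; an empty element yields an empty string."""
    root = ET.fromstring(xml_text)
    return {f.get("name"): (f.text or "") for f in root.findall("field")}


form = {
    "source-url": "http://example.org/source.html",  # placeholder URL
    "extract": "tables",
    "thesaurus-url": "",                              # optional field left blank
}
saved = save_form(form)
assert open_form(saved) == form
```

A flat list of named fields keeps the file easy to edit by hand, which fits the "simple XML format" wording; a richer schema could come later with the DTD work.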

The next exercise will then be to redo the same thing on a Web server, with an author role that publishes to a temporary URL, a sub-site that will then be validated and published for good by the webmaster.
