What's new in the worldwide botanical knowledge project
J. M. Vanel
Properties inheritance for taxonomic and part-of hierarchies
To enhance the current search engine (http://jmvanel.free.fr/protea.html),
we need to add some reasoning/inference capabilities:
- rule 1: treat upper taxon properties as default values (e.g. for a
species, if the genus has the property "leaves evergreen" and the species
doesn't assert otherwise, the species has the same property)
- rule 2: treat properties of container organ as default value (e.g. if
"petal color = red", flower has the same property, at least partially)
- rule 3: combine the two rules above
Certainly, rule 2 will not hold for all combinations of
feature/property. To get an idea of which properties are in which
features, I can run a modified version of my parser with WordNet.
Then I have several possibilities to implement the above:
- choice 1: write some more or less ad-hoc code in XQuery to implement the
reasoning, possibly storing the inherited properties in extra files
- choice 2: switch to an AI engine, such as Protégé (based on
CLIPS and/or OWL), and write the logic of property inheritance in a
language (e.g. OWL) dedicated to this
The advantages of choice 1 are:
- the well-understood (at least for me, JMV!) computational model
offered by the eXist database and the XQuery language
- the format is very near the original Natural Language (NL) plant
descriptions, and has a good level of semantic markup
The advantages/drawbacks of choice 2 are:
- leveraging a regular AI engine makes it easy to add more rules
and facts, e.g. about geographical ranges, plant uses, etc.
- the performance is unknown
In choice 2, probably the best choice for the exchange format
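The default-value rules listed above can be sketched in a few lines. This is only an illustration in Python, not the XQuery or OWL implementation discussed here; the taxon names, the `TAXON_PARENT`, `ORGAN_PARTS` and `ASSERTED` tables and both function names are invented for the example. Following the "petal color = red" example, it assumes that a container organ (flower) takes the values asserted for its parts (petal) when it has none of its own.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy data, invented for this sketch (not taken from any flora).
TAXON_PARENT: Dict[str, str] = {"GenusX speciesY": "GenusX"}        # taxon -> upper taxon
ORGAN_PARTS: Dict[str, List[str]] = {"flower": ["petal", "sepal"]}  # container -> parts

# Explicitly asserted properties, keyed by (taxon, organ, feature).
ASSERTED: Dict[Tuple[str, str, str], str] = {
    ("GenusX", "leaf", "persistence"): "evergreen",
    ("GenusX speciesY", "petal", "color"): "red",
}


def taxon_default(taxon: str, organ: str, feature: str) -> Optional[str]:
    """Rule 1: use the nearest value asserted on the taxon or an upper taxon."""
    current: Optional[str] = taxon
    while current is not None:
        value = ASSERTED.get((current, organ, feature))
        if value is not None:
            return value
        current = TAXON_PARENT.get(current)
    return None


def inherited_values(taxon: str, organ: str, feature: str) -> List[str]:
    """Rules 2 and 3: fall back on the parts of the organ, each resolved with rule 1.

    A list is returned because a container may only partially share the
    values of its parts (e.g. red petals and green sepals).
    """
    own = taxon_default(taxon, organ, feature)
    if own is not None:
        return [own]
    values: List[str] = []
    for part in ORGAN_PARTS.get(organ, []):
        for value in inherited_values(taxon, part, feature):
            if value not in values:
                values.append(value)
    return values
```

With this toy data, `inherited_values("GenusX speciesY", "leaf", "persistence")` takes its value from the genus, and `inherited_values("GenusX speciesY", "flower", "color")` takes it from the petal.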
I found this library :
JWNL (Java WordNet
Library), written by John Didion,
has been released. It is a Java API for accessing WordNet, and provides
API-level access to WordNet data. It is pure Java (uses no native code).
There are also other (JNI-based) Java interfaces to WordNet. We can use this
to extract the feature/property ontology out of the XML database
directly, without having to touch our FloraParse Lex/Yacc C++ parser.
This botanical ontology can then be used in enhanced user forms for the
search engine, showing:
- relevant features at each containment level
- relevant properties for each feature
This botanical ontology can be easily connected to the general WordNet
ontology, which can be obtained in RDF (theoretically OWL compatible):
A Resource Description Framework (RDF)
representation of WordNet and an ontology defining the terms used to
represent the RDF version were developed by Sergey Melnik and Stefan
I met the leaders of two research groups from INRIA:
Nice demo of the French parser at http://graves.inria.fr:8200/perl/parser.pl,
with an output graph made with Graphviz.
Since February there has been a new
version of Link Grammar, the syntactic parser of English,
based on link grammar, an original theory of English
syntax. Alas, it still doesn't seem to work on sentences without a verb.
No sponsors yet :-(( .
I found an interesting project to
define an ontology for plants:
Consortium : http://www.plantontology.org
It is defined in a little-known language: a DAG data structure, with an
editor called DAG edit.
I downloaded the CVS for the ontology and the software from sf.net.
I opened this file:
with the Java+Graphviz GUI DAG edit. The tool has
a good look and feel. The search utility is all right. I couldn't find
how to see in the GUI the human-readable definitions that are in the
file. The ontology seems rather complete; see an example graph: all classes whose name contains
"leaf". However there are too many classes, as one can see in the
"leaf" example. Also, general properties like color, size, and shape
are not reused, so there is (in traits/traits.ontology) a petal color,
a sepal color, etc. Obviously this ontology wasn't built with specimen
identification in mind...
I have committed the floraParser
on Sourceforge; it is up to date with the latest WordNet 2.0 and gcc
3.3.2. To download all this:
cvs -d:pserver:email@example.com:/cvsroot/wwbota login
cvs -z3 -d:pserver:firstname.lastname@example.org:/cvsroot/wwbota co floraParse
The latest version of the search engine for the Flora of China is
currently on the CVS
of the eXist project, in directory webapp/xse . I have to update it
to the XQuery language (currently XSP).
More than 150 hits on the Flora of China search engine !
someone from the Flora of China project at Harvard. For now, my personal
500 MHz computer can cope with the load, so I give the URL.
New page on Data
preparation (Getting XML from various sources, Natural language
processing, Generate organ list).
Now that we have a working
application of the Flora of China, I declare open the hunting
period for SPONSORS $$$ !!!!
If you want to test the Flora Of China search engine, please mail me;
it is currently hosted on my personal 500 MHz computer and I don't want
it to be overloaded . By the way we need hosting for the application.
If you want to run the application on your computer, you must download
the project/ directory on
the CVS at Sourceforge.net (see sourceforge.net/projects/wwbota)
. Then you install it on top of the XML database eXist
. Finally you download the data:
It is not really complicated; however install documentation is needed,
and a release zip file of course.
The application and the project
were presented at the TDWG working group ( www.tdwg.org
) Structure of Descriptive Data (SDD) meeting
in Paris : http://18.104.22.168/Projects/TDWG-SDD/index.html
We discussed the design of an XML Schema for descriptive data
able to replace the old DELTA format. It will include some ideas of
WWBKB combined with the needs of professional taxonomists.
J.M. Vanel tried to convince the audience to use MathML as the constraint language for
the computed characters (e.g. "length of petiole is twice the
length of leaf blade"). But this obviously needs some testing. A
a boolean. One of the crucial points is how to name plain characters so
that they can be referred to in a MathML expression for a computed
character. The advantages of MathML are, first, that it is XML (it's easy to
generate programming language code), and second, its ability to express
set theoretical assertions like "for all individuals, leaves are
either red or green", or "for all individuals, all leaves have the same
I also advocated in favour of reusing the XML Schema typing system to
express as much as possible of the character semantics (a character is
an element of description reusable for many descriptions, e.g. "length
of petiole"). But this level of abstraction needs some pedagogy. At the
extreme the "terminology" part of the current Schema, i.e. the
definitions, could be a plain XML Schema in the "http://www.w3.org/2001/XMLSchema"
namespace, with in some places additional attributes in the "http://www.tdwg.org/2002/SDD"
namespace. This trick of adding attributes from another namespace in a
Schema is perfectly legal; this is also the design of XLink for
There was also an interesting discussion on "arranging characters in
arrays", which I would rather translate as
"complexType characters". Clearly, besides simpleType characters,
there is a place for non-character independent variables,
e.g. medium, temperature and time in the fungi example.
The example "growth diameter of fungal cultures on
Petri dishes" is shown below; the cultivation occurs on various media
(OA = Oat-Agar, MA = Malt-Agar, SNA = Synth. Nutrient-Poor Medium), at
different temperatures (15, 20, 25 °C) and over different time (7,
14, 21 days):
|| 8 mm
|| 10 mm
|| - mm
|| 18 mm
|| 21 mm
|| 6 mm
|| 22 mm
|| 40 mm
|| - mm
|| 21 mm
|| 40 mm
|| - mm
|| 39 mm
|| 80 mm
|| 38 mm
|| 60 mm
|| - mm
|| - mm
I have updated the XSLT transforms
library and floraParse to put it on Sourceforge CVS.
Now we have a search engine working on the Flora of China with eXist
and Cocoon. We need to find a servlet or, even better, Cocoon hosting.
This user interface is in the hands of Cyril.
Now I can go back to structure of data and metadata, and parsing. I
have a look again at link-grammar,
to see if it still needs a verb in the sentence or if there is a
workaround. Well, after some trials on their online "parse a sentence"
page, I saw that it's not enough; I will try to ping them. On freshmeat.net
I searched with the keyword "linguistic" and found this:
Yes, the project is more alive (and necessary) than ever! As a
side project, XMLPublication has been developed for another project.
The XSLT transforms library has
been expanded, taking in account the new possibilities of XML Schema.
The eXist XML database is
more efficient than ever, and the new XML:DB Java API is well
established. There is a new specification,
more detailed but less demanding, that will be realized before the
Taxonomic Database Working Group meeting in Paris on February 13-14th.
We welcome Cyril Vidal, a talented young
Java and XML developer, who will be in charge of displaying the query
results using Cocoon. To ease the group work, a Sourceforge
project, WWBKB, has been created.
As a parallel effort, work goes on about Flora text parsing and
XML-izing. Here is a sample
species with the new XML format. Thanks to WordNet, the nouns and
adjectives are marked up. Also numbers with units (dimensions) and pure
numbers will be marked up as such e.g. :
<t:f> generally many</t:f>
<t:dim> internodes 1--10 cm</t:dim>
<t:num> 1 - 8 - 12 per flower</t:num>
<t:f> or coalescent</t:f>
<t:f> forming syncarps</t:f>
<t:num> 1 per flower</t:num>
The words missing in WordNet are also marked up,
and the list will be given to the WordNet project.
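As a rough illustration of the number and dimension markup described above, here is a small Python sketch. It is not the FloraParse implementation (which is a Lex/Yacc C++ parser); the unit list, the regular expressions and the `mark_up` function are assumptions made for the example. Only the `t:dim` and `t:num` tag names come from the sample.

```python
import re
from xml.sax.saxutils import escape

# Assumed unit list; a real flora would need more (dm, µm, ...).
UNITS = ("mm", "cm", "m")

# A number, optionally followed by more numbers joined by "-" or "--"
# (e.g. "1--10" or "1 - 8 - 12").
RANGE = r"\d+(?:\.\d+)?(?:\s*-{1,2}\s*\d+(?:\.\d+)?)*"
DIM_RE = re.compile(rf"{RANGE}\s*(?:{'|'.join(UNITS)})\b")
NUM_RE = re.compile(RANGE)


def mark_up(phrase: str) -> str:
    """Wrap a phrase in <t:dim> if it holds a number with a unit,
    in <t:num> if it holds a pure number, and leave it alone otherwise."""
    text = escape(phrase.strip())
    if DIM_RE.search(text):
        return f"<t:dim>{text}</t:dim>"
    if NUM_RE.search(text):
        return f"<t:num>{text}</t:num>"
    return text
```

For instance, `mark_up("internodes 1--10 cm")` is wrapped as a dimension, while `mark_up("1 per flower")` is wrapped as a pure number.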
Being a developer, I couldn't resist the temptation to make a
side-kick, and so here is yet another Identification program and
framework in Java: the Open
Identification API . It is aimed mainly towards Delta-like data, so
it is complementary to other tools (FloraParse) for textual data.
During the last 6 months I was in charge of R&D and industrial
catalogs at IndustrySuppliers.com, a market place for industrial
equipment that recently went into bankruptcy. There I developed
techniques to manage e-catalogs before publication on Internet. I had
the opportunity to work on useful technologies : Perl, XSLT, Makefile,
Cygnus Cygwin bash, WinCVS, Apache Jakarta Tomcat, etc.
During this time lots of things happened outside the WWBKB project :
new versions of WordNet and link-grammar; XML
protocols (Soap, Universal Description, Discovery, and Integration
(UDDI) at http://www.uddi.org/ ,
Resource Directory Description Language (RDDL) at http://www.openhealth.org/RDDL/
, WSDL ); new frameworks for knowledge representation: Topic Maps http://www.topicmaps.org/ , The
DARPA Agent Markup Language (DAML) http://www.daml.org/
, the final version of W3C's XML
Schema; new Natural Language software: GROK, Open NL API, etc.
Inside the WWBKB project, work continues on the Flora of
China: use of WordNet 1.7, port to the latest gcc compiler, enhancement
of XML markup to include BOTH semantic markup and original
development of XSLT transforms to generate schemas, and above all, use
of eXist ( http://exist.sourceforge.net/
), a nice freeware XML database with textual indexation.
I modified the HTML client page (http://wwbota.free.fr/Generic/XMLClient.htm?URL=species.xml
) so that it has a minimal behavior with Mozilla and CSS2. But
sometimes Mozilla M17 crashes... Also there is a bug in Netscape 6
preventing it from working...
The servlet for the Flora of China (FOC) seems to be working
all right on my local machine, with requests such as this:
But for the whole subset of the FOC (10 000 species and 15 Mb) such
a query lasts 6 min 30 s. The current implementation uses a
combination of SAX and XSLT ( Saxon 5 from
So I will just put a small subset online, to try and perfect the site
and its ergonomics.
However this Java framework with SAX, DOM, and XSLT is very useful for
batch processing such as: restructuring, statistics, generating lists
of tags and Schema, adding information from other sources (JDBC ...)
Thanks to the Laboratoire Informatique et Systématique at
University of Paris Jussieu, and to its head Régine Vignes for
allowing me to use one of their servers.
What remains to do for the FOC site:
- install JDK and JSERV on the server
- install the FOC servlet on the server
- verify that it works with IE5
- add the plant names to the XML data, and change <tr>
- generate a skeleton XML dataset, or another way of generating
queries towards the server, and connect the servlet with an appropriate
version of the above mentioned HTML client page
- put the XML in small files on the server, for search engines
- try ozone
- try commercial XML databases: Tamino, Excelon
I tried GMD's ipsi XQL search engine for XML with contains()
function: it doesn't work! The ipsi XQL search engine is unable to
search for sub-strings in textual content.
I began a new job at IndustrySuppliers.com, a market place for
industrial equipment, in charge of R&D and industrial
catalogs. Happily it has strong technical synergy with the WWBKB
Cultivated plants, plant uses, varieties
I got involved in an ambitious worldwide project about plant uses, and
sharing knowledge, called Seed2seed , from the Ecoropa association in
Paris ; more details coming soon.
I try to make a complete formalization (manually) of a few descriptions
from the Flora of China. In parallel I enhance the abstract data
with new abstractions, like Context/Restriction. This is certainly the
best way to advance the design. Moreover it provides an accurate
introduction to formal plant description for several kinds of most
- experts in Natural Language (NL) analysis,
- experts in Artificial Intelligence (AI),
- experts in data-mining, OLAP, etc.
I come back from the TDWG 2000 meeting in Frankfurt (see http://www.bgbm.fu-berlin.de/TDWG/2000/).
Lots of contacts, and a fruitful discussion on Structure of Data Set
with R. Pankhurst, N. Lander, and G. Hagedorn.
Also fruitful discussions with Morris on XML protocols, and with J.
Lerenard about Computer-assisted Plant Identification.
Among others, interesting presentations about geographical services (alexandria.ucsb.edu/gazeeter),
and GEIN ( http://www.gein.de ), a
German environment portal federating other sites through XML.
Yesterday I was in Paris at a conference about Internet and Ecology,
where I saw again a presentation about GEIN. Several webmasters
were asking for advice about how to use XML in their sites.
New distribution of FloraParse, a parser for classical Floras
generating XML markup.
See Release Notes.
More than one year since start of project!
Many hopes created, and not much concrete yet !
However, even without money, even without technically competent
collaborations, it advances.
The immediate goal is still to put on a server the Flora of
China, with a relational database and requests like :
SELECT * FROM descriptions WHERE petal LIKE '%yellow%'
An alternative implementation is to use SAX (sax.org, Simple API for
XML) to make the query.
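A minimal sketch of the SAX alternative, written in Python with the standard `xml.sax` module rather than the Java SAX API the project would use. The element names (`flora`, `species`, `petal`), the `name` attribute and the sample data are assumptions for the example and do not reflect the real Flora of China markup.

```python
import xml.sax

# Hypothetical sample data; the real markup is richer.
SAMPLE = """<flora>
  <species name="Species A"><petal>yellow, obovate</petal></species>
  <species name="Species B"><petal>white</petal></species>
</flora>"""


class OrganFilter(xml.sax.ContentHandler):
    """Collect the names of species whose given organ text contains a sub-string."""

    def __init__(self, organ, needle):
        super().__init__()
        self.organ = organ
        self.needle = needle
        self.matches = []
        self._species = None    # name of the species being read
        self._hit = False       # True once the current species matches
        self._in_organ = False  # True while inside the organ element
        self._text = []         # text chunks of the current organ element

    def startElement(self, name, attrs):
        if name == "species":
            self._species = attrs.get("name")
            self._hit = False
        elif name == self.organ:
            self._in_organ = True
            self._text = []

    def characters(self, content):
        # SAX may deliver the text of one element in several chunks.
        if self._in_organ:
            self._text.append(content)

    def endElement(self, name):
        if name == self.organ:
            self._in_organ = False
            if self.needle in "".join(self._text):
                self._hit = True
        elif name == "species" and self._hit:
            self.matches.append(self._species)


def find_species(xml_text, organ, needle):
    handler = OrganFilter(organ, needle)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.matches
```

Since SAX streams the document, memory use stays flat whatever the size of the flora, at the price of reading the whole file for each query.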
In both cases, we need to generate column names (= XML element
names) out of the existing XML. While doing this, an XML Schema will
also be generated, and this XML Schema will be registered in XML
The Web page for queries will have a pull-down menu for plant organs
(e.g. petals) and an input field for the searched sub-string (e.g. yellow).
Probably there will be 2 pull-down menus, one for plant organs and one
for sub-organs. Then the server will respond with a page like the prototype Web
page already published, which allows further refinement queries on the
local data, without going back to the server.
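The relational variant can be sketched as follows, assuming an in-memory SQLite database in place of whatever server database ends up being used. Column names are derived from the XML element names, and the organ picked in the pull-down menu plus the typed sub-string become a `LIKE` query. The element names, the sample data and the `build_table` and `search` functions are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample data; the real markup is richer.
SAMPLE = """<flora>
  <species name="Species A"><leaf>ovate</leaf><petal>yellow, obovate</petal></species>
  <species name="Species B"><petal>white</petal></species>
</flora>"""


def build_table(xml_text):
    """Create a 'descriptions' table with one column per distinct organ element."""
    root = ET.fromstring(xml_text)
    columns = []
    for species in root.findall("species"):
        for child in species:
            if child.tag not in columns:
                columns.append(child.tag)
    conn = sqlite3.connect(":memory:")
    column_defs = ", ".join(f'"{c}" TEXT' for c in columns)
    conn.execute(f"CREATE TABLE descriptions (name TEXT, {column_defs})")
    placeholders = ", ".join("?" * (len(columns) + 1))
    for species in root.findall("species"):
        # findtext() returns None for a missing organ, stored as NULL.
        row = [species.get("name")] + [species.findtext(c) for c in columns]
        conn.execute(f"INSERT INTO descriptions VALUES ({placeholders})", row)
    return conn, columns


def search(conn, columns, organ, substring):
    """Names of the species whose 'organ' column contains 'substring'."""
    # Column names cannot be bound as SQL parameters, so the organ chosen
    # in the menu is checked against the generated column list first.
    if organ not in columns:
        raise ValueError(f"unknown organ: {organ}")
    cursor = conn.execute(
        f'SELECT name FROM descriptions WHERE "{organ}" LIKE ? ORDER BY name',
        (f"%{substring}%",),
    )
    return [row[0] for row in cursor]
```

The `%` wildcards on both sides of the search term are what turn `LIKE` into a sub-string search; without them it behaves like an equality test.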
News from J.M. Vanel
I am trying to find a consultancy job around XML technologies, that
will have a synergy with the WWBKB project. I will probably attend the
TDWG 2000 meeting in Frankfurt (see http://www.bgbm.fu-berlin.de/TDWG/2000/).
New on this site
XML local resources have been updated:
Work with Xalan and Xerces to enhance the structure of the Flora of
China XML database ( more details on the mailing list).
At last I finished parsing the 10 000 species of the Flora of China.
See more details on the mailing list: http://www.egroups.com/group/wwbota/
I hope that it can become a test case for the XML databases.
A week-end of work with Bryan Thompson at my house on knowledge
representation. Bryan is a specialist of the Shruti inference system (www.icsi.berkeley.edu/~shastri/shruti).
Minutes of the meeting are coming .
What I did this week:
- lots of mails to WWW9 Conference attendees that I met,
- exploratory work: readings about and downloading the link-grammar,
a natural language parser (thank you Philippe!) ,
- enhancements in the Web page displaying botanical
descriptions (WORKS ONLY with the IE5 patch compliant with the XSLT
and XPath W3C Recommendations,
downloadable at Microsoft)
- continue integrating the Wordnet library in FloraParse,
- a roadmap, going more in details
tasks, to foster cooperative design and implementation.
My (JMV) presentation at WWW9 in Amsterdam on May 19 was successful,
and got compliments from Jon Bosak, one of the creators of XML. The
of business cards is 15, I will tell you what comes out of these
THANK YOU to all who helped and encouraged in various ways:
Jacques Lolieux, Anthony Brach, Mary Clare HOGAN, Nick Fulton, family
Smalbrugge, Guillaume Rousse, Thomas Beale, Henning MUELLER, Dominique
Salmon, Olivier Tavignot, Franck Yvetot, Ivodor Atanassov, Pierre
Deransart, Rita Lemaire-Smith, Don Kirkup, Nick Lander, Laurent
Kiryenko, and I certainly forget several...
Thanks to Reuters for paying my
expenses for the 4 days WWW9 Conference.
The FloraParse has been adapted to the Flora of China data, and
thanks to Olivier Nouguier will be soon put on the Web using a standard
database, probably PostgreSQL.
I ran into small problems in exporting well-formed HTML out of MS
Access files, because:
- the HTML (family descriptions) is not well-formed in the
- a standard export or "save as HTML" in Access treats the HTML as
ordinary text, and consequently transforms, e.g., <UL> in
I tried to use W3C's tidy program on the 18 Mb file; it ran all night
without result! I work on integrating the Wordnet library in
Among other technologies, I look currently at :
- link-grammar (Natural Language parser)
- AI engines:
I downloaded the Flora of China data, with special permission
(from the Flora of China
Project), in two big MDB (MS Access) files. It's an emotion to have
access to the accumulated work of hundreds of botanists over centuries.
I installed the WordNet
environment on my machine.
First attempt to write an XML Schema for XML
All-purpose Protocol, see XAP.xsd .
For 2 or 3 weeks there has been a first specification for a
client-server application doing specimen
identification; I feel that it can give a good idea of how the
pieces of the puzzle fit together: data, metadata, protocol, GUI.
First version of an XML Schema
derived from the Botanical Glossary of the Flora of Australia; I used
XSLT transform to generate it from the original file; this is still
in progress. With WordNet I'll be able to distinguish nouns,
relational adjectives, and among nouns those which are organs,
properties, etc. So the metadata described in the Abstract Data Model will be
implemented. Also a RDF Schema and an XML DTD will be generated from
reading 5 papers about WordNet :
Wordnet also indicates Relational
Adjectives, so it has all we need !
Descriptive adjectives are what one usually thinks of when adjectives are
mentioned. A descriptive adjective is one that ascribes a value of an
attribute to a noun.
That is to say, x is Adj presupposes that there is an attribute A such
that A(x) = Adj. To say The package is heavy presupposes that there is an attribute WEIGHT
such that WEIGHT(package) = heavy. Similarly, low and high are values for the attribute HEIGHT.
WordNet contains pointers between descriptive adjectives and the noun
synsets that refer
to the appropriate attributes.
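To make the adjective-to-attribute idea concrete, here is a tiny Python sketch. The `ADJECTIVE_ATTRIBUTE` table is hand-written and hypothetical: it only stands in for the pointers that WordNet provides between descriptive adjectives and attribute nouns, and `formalize` is an invented helper.

```python
from typing import Optional

# Hypothetical stand-in for WordNet's adjective -> attribute pointers.
ADJECTIVE_ATTRIBUTE = {
    "heavy": "weight",
    "light": "weight",
    "low": "height",
    "high": "height",
    "yellow": "color",
}


def formalize(noun: str, adjective: str) -> Optional[str]:
    """Turn 'noun is adjective' into ATTRIBUTE(noun) = adjective, if known."""
    attribute = ADJECTIVE_ATTRIBUTE.get(adjective)
    if attribute is None:
        # Not a known descriptive adjective (it may be a relational one).
        return None
    return f"{attribute.upper()}({noun}) = {adjective}"
```

Applied to a flora, the same lookup would turn "petals yellow" into `COLOR(petal) = yellow`, which is the feature/property form needed by the search engine.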
- Paolo Nesi, a specialist of image processing, who is currently
leading a European project about delivering musical scores on the Internet,
a project having similarities with ours.
- I (JMV) started a thread on xml-dev about "XML All-purpose Protocol", proposed as a
synthesis between SOAP and Corba;
- Soon another thread on xml-dev about natural vocabulary and
linguistic approach to XML, and turning Wordnet into a big XML Schema,
or maybe several, one with the part-of relation enforced and one
- Related to this, work will start on botanical glossary; most of
the definitions of nouns start with an article (for the rest, maybe use
- some following for the 3D geometry with MathML proposal on
"email@example.com" , etc.
- The XML Cover Pages, SGML and XML News, by Robin
Cover, wrote about our project on 2000-03-16: http://www.oasis-open.org/cover/xml.html
( the authoritative overview of the XML technology, new entries
- the discussion mailing list actually started today on egroups
- J.M. Vanel will talk at WWW9 in Amsterdam at Developer's Day on
- Brave Georg has written about WWBOTA in his monthly Free Software
chronicle: Brave GNU World <firstname.lastname@example.org> the monthly GNU forum
in English, German, French, Spanish and Japanese. Check it out at http://www.gnu.org/brave-gnu-world/
- a discussion mailing list will be started today on egroups
- reflections and first attempt to find sponsors
- Abstract Data Model for Taxonomy
- FloraParse sent to D. Kirkup, Kew Gardens
- evaluation of VRML/X3D as language for 3D geometry: not
- Studying XML-Schema, testing RDF software (DATAX), writing XSLT
transforms, readings about Cognitive Science and Semantic Networks, etc.
Discovered the Flora of North America (www.fna.org)
by searching with Altavista sites having the same keywords as this one.
The worldwide botanical database project has moved to
http://wwbota.free.fr. My personal pages are now at http://jmvanel.free.fr .
I subscribed to email@example.com.
Downloaded lots of things for 2D and 3D images:
- fds for SVG from Univ. of Tsukuba, a vectorizer for bitmap
(I'll test it on Fairchild Tropical Garden http://www.ftg.fiu.edu/ herbarium
- Geometra, a stereoscopic software generating VRML files; I have
no images right now to test it; there is another free download called
Camora :-) or something.
I provoked a discussion on firstname.lastname@example.org
about "XML for 3D geometry and objects".
See archive: http://www.lists.ic.ac.uk/hypermail/xml-dev/
Conclusion: we have to evaluate VML
and SVG from the W3C, and X3D of the Web3D
CONSORTIUM, the designated successor of the ISO Standard VRML97.
We also need nice editors for botanical 3D shapes: leaves, flowers,
fruits, twigs. Note that the 3D structure of a fruit, and its time
evolution from a flower has much in common with CAD (Computer Aided
Design) models in mechanical engineering with montage sequence.
But the privileged source of 3D data remains of course real specimens,
or herbarium specimens, which brings the necessity to have stereoscopic
Having well attacked the parsing, I am now investigating AI and 3D
representation subjects. I'm reading "Semantic networks" by Lokendra
Shastri, which has an interesting knowledge model, and proven
to implement the searches. The model allows for probabilities, e.g.
flower 70% rose, 30% white.
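A small Python sketch of how such probabilities could be stored and used to rank candidate taxa against an observed specimen. This is a plain product of probabilities, not Shastri's algorithm; the `KNOWLEDGE` table, the taxon names and both functions are invented for the example.

```python
# Hypothetical knowledge base: (organ, feature) -> probability of each value.
KNOWLEDGE = {
    "Taxon A": {("flower", "color"): {"rose": 0.7, "white": 0.3}},
    "Taxon B": {("flower", "color"): {"white": 1.0}},
}


def likelihood(taxon, observations):
    """Probability of the observed values for one taxon, assuming the
    observed features are independent of each other."""
    probability = 1.0
    for key, value in observations.items():
        distribution = KNOWLEDGE[taxon].get(key)
        if distribution is None:
            continue  # nothing known about this feature: stay neutral
        probability *= distribution.get(value, 0.0)
    return probability


def rank(observations):
    """Candidate taxa, most likely first."""
    scored = [(taxon, likelihood(taxon, observations)) for taxon in KNOWLEDGE]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

A specimen with white flowers ranks Taxon B first, but Taxon A stays a candidate, which a yes/no property model would not allow.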
I finally found a name for the concept of an adjective defining no
property, but narrowing the subject, like "inferior" in "inferior
leaves". It's called a relational adjective, or pseudo-adjective.
I'm investigating syntaxes and formats for CAD (Computer Aided Design)
data: IGES, STEP, etc. Apparently no XML thing exists in this field.
Maybe use MathML vocabulary to express Bezier and NURBS algebra,
together with a vocabulary borrowed from STEP, and of course botanical
I'm also investigating algorithms to obtain a CAD description from a
2D volumic mesh.
My XML outfit is growing and includes:
- browsers: IE5, Mozilla M11 (soon M12 will display XML)
- editors: XED, XML Notepad
- transform engines: J. Clark's XT, IBM's LotusXSL (it works
as an applet inside IE5 to enable a proper XSLT-compliant transform)
- RDF engines: DataX, SiRPAC
- the neighbouring INRIA at Rocquencourt (French Institut National
de Recherche en Informatique et Automatique)
- Brave Gnu World, the famous monthly Free Software Foundation
Lots of new items at http://jmvanel.free.fr/ , including a few
Diospyros descriptions, actually parsed by FloraParse from the Flora of
China, with an enhanced sample User Interface for queries.
The Lex-Yacc-C++ parser for classical Floras is ready for release. Mail me to get the sources.
At last I wrote a list of tasks. See
if something is within your competences.
The Lex-Yacc-C++ parser for classical Floras begins to breathe; it will
be released for Christmas under the GNU General Public License. I looked at the
I'll ask Bosak for advice.
- Free Software Foundation (www.gnu.org)
has sent an encouraging mail
- The Filters project is writing conversion routines from
proprietary formats to XML-based file formats ==>TODO: ask for
information about this project
- France's Museum National d'Histoire Naturelle, at the Arboretum
- Don Kirkup of Kew Gardens is working in the same direction
(markup of floristic texts)
- lots of contributions on the TDWG - Structure of Descriptive
<TDWG-SDD@USOBI.ORG> mailing list; an interesting discussion
since Mon, 22 Nov 1999 on the
Taxonomic Databases Working Group's new discussion list. There are 5-7
messages a day.
To subscribe: send an email to
LISTSERV@USOBI.ORG with the
message "SUBSCRIBE TDWG-SDD"
web archive (http://usobi.org/archives/tdwg-sdd.html)
I (JMV) got a password for the very interesting West Australian
Florabase database: http://www.calm.wa.gov.au/science/florabase.html
- lots of activity on the W3C's mailing lists:email@example.com and
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
> To unsubscribe, mailto:firstname.lastname@example.org the following message;
> unsubscribe xml-dev
> To subscribe to the digests, mailto:email@example.com the
> subscribe xml-dev-digest
- looked at dictionary projects: very
promising in that they offer unambiguous semantics for common vocabulary:
- looked at Xdelta (http://www.bath.ac.uk/~ccslrd/delta/index.html)
, a direct translation of Delta format in XML; comments: probably
better to declare character numbers as ID; characters could be turned
A first version of a Lex-Yacc-C++ parser is almost ready; it will be
applied on the floras of China and Australia.
First prototype of a Web page displaying botanical
descriptions with several stylesheets. Sorry, it works only
with Microsoft Internet Explorer 5, which is freely downloadable.
Queries with XSL/XPath are in preparation, also possibility to use any
browser, using an applet to transform the XML.
Thinking about new pages:
Contacts with a botanist at The Royal Botanic Gardens Kew, interested
in the automated handling of descriptive botanical information.
- botany for computer scientists
- computer science for botanists
Contacts with specialists in image generation (L-systems), and image
Posts presenting the project have been made in:
comp.ai, comp.ai.nat-lang, comp.databases, comp.databases.object,
At present we work more with the computer people than with the
botanists, because the requirements appear clear enough, and because we
must come up with some convincing prototype. The botanists will not
believe in the feasibility of the project until they see something
However contacts have been taken with the IOPI (International Organization for
Plant Information ).
I studied the feasibility of an automated translation of a standard
flora into an XML file. The example was from the Flora of China at http://flora.harvard.edu/china/search/search.html
; the Guidelines for Contributors is a very interesting document
that the Flora of China is a very well-defined document, but parsing
with standard tools like Lex and Yacc is not applicable. Here is a hand-made parsing of an example species
taken from the Guidelines for Contributors. It can be seen that the
required tool must have a basic knowledge of nouns and articles, to be
able to translate expressions like "leaf blade" and "floral tube". A
post was made in comp.ai.nat-lang,
to ask for the relevant techniques.
A prototype Web page for a query
The overall architecture will be inspired by the book of S. Mohr
(Building Distributed Applications with XML, LDAP, and IE5, at Wrox
Press), who envisions servers on Internet, offering their knowledge via
HTTP requests, and advertising their services via Metadata.
- XML as an exchange format;
- URI (Uniform Resource Identifiers) will be established to point
- using these URI and the RDF (Resource Description Framework)
syntax, any other document will be able, by referring to any plant
species, to add information to the basic taxonomic data provided;
- several regional floras servers can cooperate for a single
request; the result XML files will be concatenated, either on the
browser, or on a front server;
- on the server side:
- possibly also XML with DOM (Document Object Model) and XSL as
a query processor on the server side; this solution would need an
efficient multi-threaded DOM implementation, capable of managing 600
Mbytes (300 000 species x 2 kbytes) and tens of requests simultaneously;
- or an OODBMS;
- or an indexing technology similar to the internet search
- or a batch file-to-file sequential processor that will be able to handle
not-too-complex queries on 64 Mbytes computers;
- on the client side, a browser will be enough to manage regional
floras or several large families; XSL as a query processor and a few
stylesheets and scripts will provide a state of the art User Interface;
- of course, replicas on CD's (with only the most representative
images) or DVD are possible.
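The cooperation of several regional flora servers for a single request, with the result files concatenated on a front server, could look like the following Python sketch. The `results` and `species` element names, the `server` attribute and the two sample responses are assumptions; in a real deployment the documents would come from HTTP requests instead of string constants.

```python
import xml.etree.ElementTree as ET
from typing import Iterable

# Hypothetical responses from two regional flora servers.
CHINA = '<results server="china"><species name="Species A"/></results>'
AUSTRALIA = (
    '<results server="australia">'
    '<species name="Species B"/><species name="Species C"/>'
    "</results>"
)


def merge_results(documents: Iterable[str]) -> ET.Element:
    """Concatenate the <species> elements of several result documents
    under a single <results> root, in the order the documents are given."""
    merged = ET.Element("results")
    for text in documents:
        root = ET.fromstring(text)
        merged.extend(root.findall("species"))
    return merged


merged = merge_results([CHINA, AUSTRALIA])
names = [species.get("name") for species in merged]
```

The same merge could run in the browser instead of on a front server; only the place where the documents are fetched changes.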