What's new in the worldwide botanical knowledge project

J. M. Vanel - See also my diary about computer science

2003-08-16

Properties inheritance for taxonomic and part-of hierarchies

To enhance the current search engine (http://jmvanel.free.fr/protea.html), we need to add some reasoning/inference capabilities:
  1. treat the properties of an upper taxon as default values (e.g. if a genus has the property "leaves evergreen" and a species does not assert otherwise, the species has the same property)
  2. treat the properties of an organ as default values for the organ that contains it (e.g. if "petal color = red", the flower has the same property, at least partially)
  3. combine the two rules above
Rule 2 will certainly not hold for all combinations of feature and property. To get an idea of which properties occur in which features, I can run a modified version of my parser with WordNet.
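As a rough illustration of rules 1 and 2, here is a minimal Python sketch; it is not the project's implementation. The dictionary layout, the function names and the taxon and organ names (Genus_A, species_a1, ...) are all hypothetical. Rule 2 returns a set of values, because a container shares the property of a part only "at least partially".

```python
# Hypothetical sample data: names and properties are illustrative only.
TAXA = {
    "Genus_A": {"parent": None, "props": {"leaves": "evergreen"}},
    "species_a1": {"parent": "Genus_A", "props": {}},
    "species_a2": {"parent": "Genus_A", "props": {"leaves": "deciduous"}},
}

ORGANS = {
    "flower": {"parts": ["petal", "sepal"], "props": {}},
    "petal": {"parts": [], "props": {"color": "red"}},
    "sepal": {"parts": [], "props": {"color": "green"}},
}


def taxon_property(taxa, name, key):
    """Rule 1: the nearest taxon asserting `key` wins; upper taxa act as defaults."""
    while name is not None:
        taxon = taxa[name]
        if key in taxon["props"]:
            return taxon["props"][key]
        name = taxon["parent"]
    return None


def organ_property(organs, name, key):
    """Rule 2: an organ's own value if asserted, else the values of its parts."""
    organ = organs[name]
    if key in organ["props"]:
        return {organ["props"][key]}
    values = set()
    for part in organ["parts"]:
        values |= organ_property(organs, part, key)
    return values
```

With this data, species_a1 inherits "evergreen" from Genus_A while species_a2 keeps its own "deciduous", and the flower gets the colors of both its petal and its sepal. Rule 3 would apply the organ lookup to each taxon along the parent chain.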


I then have several possibilities for implementing these reasoning/inference capabilities:
  1. write some more or less ad hoc code in XQuery to implement the reasoning, possibly storing the inherited properties in extra files
  2. switch to an AI engine, such as Protégé (based on CLIPS and/or OWL), and write the logic of property inheritance in a language dedicated to this (e.g. OWL)
The advantages of choice 1 are:
The advantages/drawbacks of choice 2 are:
With choice 2, the best exchange format is probably OWL.

WordNet resources

I found this library:
JWNL (Java WordNet Library), written by John Didion, has been released. It is a pure Java API (no native code) giving API-level access to WordNet data.

There are also other (JNI-based) Java interfaces to WordNet. We can use this to extract the feature/property ontology directly out of the XML database, without having to touch our FloraParse Lex/Yacc C++ parser.

This botanical ontology can then be used in enhanced user forms for the search engine, showing:
This botanical ontology can easily be connected to the general WordNet ontology, which can be obtained in RDF (theoretically OWL compatible):

A Resource Description Framework (RDF) representation of WordNet, and an ontology defining the terms used in the RDF version, were developed by Sergey Melnik and Stefan Decker.

2003-08-12

I met the leaders of two research groups from INRIA:
Nice demo of the French parser at http://graves.inria.fr:8200/perl/parser.pl, with an output graph made with Graphviz.

Since February there has been a new version of Link Grammar, the syntactic parser of English based on link grammar, an original theory of English syntax. Alas, it still doesn't seem to work on sentences without a verb.

2003-08-05

No sponsors yet :-(( .

I found an interesting project to define an ontology for plants:
Plant Ontology Consortium : http://www.plantontology.org

It is defined in a little-known format, a DAG data structure, with an editor called DAG-Edit. I downloaded the ontology and the software from CVS on sf.net.

I opened this file:
anatomy/anatomy.ontology
with DAG-Edit, a Java+Graphviz GUI. The tool has a good look and feel, and the search utility is all right, but I couldn't find how to display in the GUI the human-readable definitions that are in the file. The ontology seems rather complete; see an example graph: all classes whose name contains "leaf". However, there are too many classes, as the "leaf" example shows. Also, general properties like color, size and shape are not reused, so there is (in traits/traits.ontology) a petal color, a sepal color, etc. Obviously this ontology wasn't built with specimen identification in mind...


I have committed FloraParse to Sourceforge; it is up to date with the latest WordNet 2.0 and gcc 3.3.2. To download all this:
cvs -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/wwbota login
cvs -z3 -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/wwbota co floraParse

The latest version of the search engine for the Flora of China is currently in the CVS repository of the eXist project, in the directory webapp/xse. I still have to port it from XSP to the XQuery language.

2003-03-06

More than 150 hits on the Flora of China search engine! Including someone from the Flora of China project at Harvard. For now, my personal 500 MHz computer can cope with the load, so I give the URL.

New page on Data preparation (Getting XML from various sources, Natural language processing, Generate organ list).

2003-02-16

Now that we have a working application of the Flora of China, I declare open the hunting period for SPONSORS $$$ !!!!

If you want to test the Flora of China search engine, please mail me; it is currently hosted on my personal 500 MHz computer and I don't want it to be overloaded. By the way, we need hosting for the application.

If you want to run the application on your computer, you must download the project/ directory from the CVS at Sourceforge.net (see sourceforge.net/projects/wwbota). Then you install it on top of the XML database eXist. Finally you download the data:
http://wwbota.free.fr/project/data/flora_text.xml.zip

It is not really complicated; however, installation documentation is needed, and a release zip file of course.

The application and the project were presented at the meeting of the TDWG (www.tdwg.org) working group Structure of Descriptive Data (SDD) in Paris: http://160.45.63.11/Projects/TDWG-SDD/index.html .

We discussed the design of an XML Schema for descriptive data able to replace the old DELTA format. It will include some ideas of WWBKB combined with the needs of professional taxonomists.

<Gregor_copy_from_here status="not finished">
J.M. Vanel tried to convince the audience to use MathML as the constraint language for computed characters (e.g. "length of petiole is twice the length of leaf blade"). But this obviously needs some testing. A dialect of Prolog could also be used, or a subset of C/Java/JavaScript returning a boolean. One of the crucial points is how to name plain characters so that they can be referred to in a MathML expression for a computed character. The advantages of MathML are, first, that it is XML (so it is easy to generate programming-language code from it) and, second, its ability to express set-theoretical assertions like "for all individuals, leaves are either red or green", or "for all individuals, all leaves have the same color".

I also advocated reusing the XML Schema typing system to express as much as possible of the character semantics (a character is an element of description reusable for many descriptions, e.g. "length of petiole"). But this level of abstraction needs some pedagogy. At the extreme, the "terminology" part of the current Schema, i.e. the character definitions, could be a plain XML Schema in the "http://www.w3.org/2001/XMLSchema" namespace, with, in some places, additional attributes in the "http://www.tdwg.org/2002/SDD" namespace. This trick of adding attributes from another namespace to a Schema is perfectly legal; this is also the design of XLink, for instance.

There was also an interesting discussion on "arranging characters in arrays", which I would rather translate as "complexType characters". Clearly, besides simpleType characters, there is a place for non-character independent variables, e.g. medium, temperature and time in the fungi example.
</Gregor_copy_from_here>
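To make the "programming language subset returning a boolean" option a little more concrete, here is a small sketch of the two kinds of assertion discussed above. It is written in Python rather than C/Java/JavaScript, and the character names (petiole_length, leaf_blade_length), the data layout and the 5% tolerance are assumptions made only for this example.

```python
import math


def petiole_twice_blade(description, rel_tol=0.05):
    """Computed character: petiole length is about twice the leaf blade length."""
    return math.isclose(
        description["petiole_length"],
        2 * description["leaf_blade_length"],
        rel_tol=rel_tol,
    )


def all_leaves_same_color(individuals):
    """Quantified assertion: for all individuals, all leaves have the same color."""
    return all(
        len({leaf["color"] for leaf in individual["leaves"]}) <= 1
        for individual in individuals
    )
```

Both functions refer to plain characters by name, which is exactly the naming question raised in the discussion: whatever the constraint language, the computed character needs a stable identifier for each plain character it uses.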

The example "growth diameter of fungal cultures on Petri dishes" is shown below; the cultivation occurs on various media (OA = Oat-Agar, MA = Malt-Agar, SNA = Synth. Nutrient-Poor Medium), at different temperatures (15, 20, 25 °C) and over different times (7, 14, 21 days):

15°C:   OA      MA      SNA
7d      8 mm    10 mm   -
14d     18 mm   21 mm   6 mm
21d     22 mm   40 mm   -

20°C:   OA      MA      SNA
7d      21 mm   40 mm   -
14d     39 mm   80 mm   38 mm
21d     60 mm   -       -
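One possible way to hold such a "complexType character" in memory is a mapping keyed by the three independent variables. The sketch below is only an illustration in Python: the names GROWTH_MM and growth_series are hypothetical, and treating the "-" cells as missing values (None) is an assumption.

```python
# Growth diameter in mm, keyed by (temperature in °C, medium, days).
# None marks the cells shown as "-" in the table (assumed: no value given).
GROWTH_MM = {
    (15, "OA", 7): 8, (15, "MA", 7): 10, (15, "SNA", 7): None,
    (15, "OA", 14): 18, (15, "MA", 14): 21, (15, "SNA", 14): 6,
    (15, "OA", 21): 22, (15, "MA", 21): 40, (15, "SNA", 21): None,
    (20, "OA", 7): 21, (20, "MA", 7): 40, (20, "SNA", 7): None,
    (20, "OA", 14): 39, (20, "MA", 14): 80, (20, "SNA", 14): 38,
    (20, "OA", 21): 60, (20, "MA", 21): None, (20, "SNA", 21): None,
}


def growth_series(data, temperature, medium):
    """Return [(days, diameter)] for one temperature and medium, ordered by days."""
    return sorted(
        (days, mm)
        for (temp, med, days), mm in data.items()
        if temp == temperature and med == medium and mm is not None
    )
```

The measured character (diameter) stays a simple value, while medium, temperature and time act as the coordinates of the array.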

2003-02-02

I have updated the XSLT transforms library and floraParse to put them on Sourceforge CVS.

2003-02-01

Now we have a search engine working on the Flora of China with eXist and Cocoon. We need to find a servlet or, even better, Cocoon hosting. This user interface is in the hands of Cyril.

Now I can go back to the structure of data and metadata, and to parsing. I had another look at link-grammar, to see whether it still needs a verb in the sentence or whether there is a workaround. After some trials on their online "parse a sentence" page, I saw that it is still not enough; I will try to contact them. On freshmeat.net I searched with the keyword "linguistic" and found this:

2003-01-15

Yes, the project is more alive (and necessary) than ever! As a side project, XMLPublication has been developed for another project. The XSLT transforms library has been expanded, taking into account the new possibilities of XML Schema. The eXist XML database is more efficient than ever, and the new XML:DB Java API is well established. There is a new specification, more detailed but less demanding, that will be implemented before the Taxonomic Database Working Group meeting in Paris on February 13-14th. We welcome Cyril Vidal, a talented young Java and XML developer, who will be in charge of displaying the query results using Cocoon. To ease the group work, a Sourceforge project, WWBKB, has been created.
As a parallel effort, work goes on with Flora text parsing and XML-izing. Here is a sample species with the new XML format. Thanks to WordNet, the nouns and adjectives are marked up. Numbers with units (dimensions) and pure numbers will also be marked up as such, e.g.:
<branch>
<t:f> generally many</t:f>
 <t:dim> internodes 1--10 cm</t:dim>
</branch>
<fruit>
 <t:f> berries</t:f>
 <t:f> distinct</t:f>
 <t:num> 1 - 8 - 12 per flower</t:num>
 <t:f> or coalescent</t:f>
 <t:f> forming syncarps</t:f>
 <t:num> 1 per flower</t:num>
</fruit>
The words missing in WordNet are also marked up:
<wn missing="stipules"/>
and the list will be given to the WordNet project.
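The distinction between the t:dim, t:num and t:f markup shown above can be sketched with two regular expressions. This is only an illustration in Python, not the logic of the Lex/Yacc parser: the unit list, the patterns and the function name markup are assumptions.

```python
import re
from xml.sax.saxutils import escape

# Assumed patterns: a number or a range ("1--10", "1 - 8 - 12"),
# optionally followed by a metric length unit.
UNITS = r"(?:mm|cm|dm|m)"
NUMBER = r"\d+(?:\.\d+)?"
RANGE = rf"{NUMBER}(?:\s*-+\s*{NUMBER})*"
DIM_RE = re.compile(rf"\b{RANGE}\s*{UNITS}\b")
NUM_RE = re.compile(rf"\b{RANGE}\b")


def markup(phrase):
    """Wrap a phrase in t:dim (number with unit), t:num (pure number) or t:f."""
    if DIM_RE.search(phrase):
        tag = "t:dim"
    elif NUM_RE.search(phrase):
        tag = "t:num"
    else:
        tag = "t:f"
    return f"<{tag}> {escape(phrase)}</{tag}>"
```

For instance, markup("internodes 1--10 cm") gives a t:dim element and markup("1 - 8 - 12 per flower") a t:num element, while a phrase without digits such as "berries" falls back to t:f.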

2001-08-09

Being a developer, I couldn't resist the temptation to start a side project, and so here is yet another identification program and framework in Java: the Open Identification API. It is aimed mainly at Delta-like data, so it is complementary to the other tools (FloraParse) for textual data.

2001-07-13

During the last 6 months I was in charge of R&D and industrial catalogs at IndustrySuppliers.com, a marketplace for industrial equipment that recently went bankrupt. There I developed techniques to manage e-catalogs before publication on the Internet. I had the opportunity to work with useful technologies: Perl, XSLT, Makefile, Cygnus Cygwin bash, WinCVS, Apache Jakarta Tomcat, etc.

During this time lots of things happened outside the WWBKB project: new versions of WordNet and link-grammar; XML protocols (SOAP, Universal Description, Discovery, and Integration (UDDI) at http://www.uddi.org/ , XML Resource Directory Description Language (RDDL) at http://www.openhealth.org/RDDL/ , WSDL); new frameworks for knowledge representation: Topic Maps http://www.topicmaps.org/ , the DARPA Agent Markup Language (DAML) http://www.daml.org/ , the final version of W3C's XML Schema; new natural language software: GROK, Open NL API, etc.

Inside the WWBKB project, work continues on the Flora of China: use of WordNet 1.7, port to the latest gcc compiler, enhancement of the XML markup to include BOTH semantic markup and original presentation, development of XSLT transforms to generate schemas, and above all, use of eXist ( http://exist.sourceforge.net/ ), a nice free XML database with text indexing.

2000-12-23

I modified the HTML client page (http://wwbota.free.fr/Generic/XMLClient.htm?URL=species.xml ) so that it has a minimal behavior with Mozilla and CSS2. But sometimes Mozilla M17 crashes... Also there is a bug in Netscape 6 preventing it from working...

At last!
The servlet for the Flora of China (FOC) seems to be working all right on my local machine, with requests such as this:
file:///windowsD/jmv/wwbota.free.fr/Generic/XMLClient.htm?URL=http://localhost/servlets/FOC?xpath=//td[.//stigma]

But for the whole subset of the FOC (10 000 species and 15 MB) such a query takes 6 min 30 s. The current implementation uses a combination of SAX and XSLT (Saxon 5 from http://users.iclway.co.uk/mhkay/saxon/ ).
So I will just put a small subset online, to try and perfect the site and its ergonomics.
However, this Java framework with SAX, DOM and XSLT is very useful for batch processing such as restructuring, statistics, generating lists of tags and schemas, adding information from other sources (JDBC ...), etc.
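The XPath request shown above, //td[.//stigma], selects every td element that has a stigma element somewhere below it. The following Python sketch reproduces that selection with the standard library only; it is not the servlet's Java/Saxon code, and the sample document and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: two table cells, only one of which describes a stigma.
SAMPLE = """<table>
  <tr><td><flower><stigma>capitate</stigma></flower></td></tr>
  <tr><td><leaf>ovate</leaf></td></tr>
</table>"""


def cells_with_descendant(root, cell_tag, wanted_tag):
    """Emulate //cell_tag[.//wanted_tag]: cells having a wanted_tag descendant."""
    return [
        cell
        for cell in root.iter(cell_tag)
        if cell.find(f".//{wanted_tag}") is not None
    ]


root = ET.fromstring(SAMPLE)
hits = cells_with_descendant(root, "td", "stigma")
```

The filter is written by hand because ElementTree implements only a subset of XPath; a full XPath engine would accept the original expression directly.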

What remains to do for the FOC site:

Thanks to the Laboratoire Informatique et Systématique at University of Paris Jussieu, and to its head Régine Vignes for allowing me to use one of their servers.

I tried GMD's ipsi XQL search engine for XML with the contains() function: it doesn't work! The ipsi XQL search engine is unable to search for sub-strings in textual content.
 

2000-12-18

I began a new job at IndustrySuppliers.com, a marketplace for industrial equipment, where I am in charge of R&D and industrial catalogs. Happily, it has strong technical synergy with the WWBKB project.
 

Cultivated plants, plant uses, varieties

I got involved in an ambitious worldwide project about plant uses and knowledge sharing, called Seed2seed, from the Ecoropa association in Paris; more details coming soon.

2000-11-19

Design

I am trying to make a complete (manual) formalization of a few descriptions from the Flora of China. In parallel I am enhancing the abstract data model with new abstractions, like Context/Restriction. This is certainly the best way to advance the design. Moreover, it provides an accurate introduction to formal plant description for several kinds of most interesting people:

Site

2000-11-15

I come back from the TDWG 2000 meeting in Frankfurt (see http://www.bgbm.fu-berlin.de/TDWG/2000/). Lots of contacts, and a fruitful discussion on Structure of Data Set with R. Pankhurst, N. Lander, and G. Hagedorn.
Also fruitful discussions with Morris on XML protocols, and with J. Lerenard about Computer-assisted Plant Identification.
Among others, interesting presentations about geographical services (alexandria.ucsb.edu/gazeeter), and GEIN ( http://www.gein.de ), a German environment portal federating other sites through XML.

Yesterday I was in Paris at a conference about Internet and Ecology ( http://www.multimania.com/mgiran/ ), where I saw a presentation about GEIN again. Several webmasters were asking for advice about how to use XML in their sites.
 

2000-11-06

New distribution of FloraParse, a parser for classical Floras generating XML markup.
See Release Notes.
download FloraParse

2000-11-03

More than one year since start of project!
Many hopes created, and not much concrete yet !
However, even without money, even without technically competent collaborations, it advances.

The immediate goal is still to put on a server the Flora of China, with a relational database and requests like :

SELECT * FROM descriptions WHERE petal LIKE '%yellow%'
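A sub-string search of this kind needs % wildcards around the searched term, otherwise LIKE only matches the whole field. Here is a minimal sketch using Python's sqlite3 module with an in-memory database; the table layout, the sample rows and the function name find_species are hypothetical, and the real database is not yet chosen.

```python
import sqlite3

# Hypothetical table: one column per plant organ, as proposed above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE descriptions (species TEXT, petal TEXT)")
conn.executemany(
    "INSERT INTO descriptions VALUES (?, ?)",
    [
        ("species_a", "pale yellow, obovate"),
        ("species_b", "white"),
        ("species_c", "yellow"),
    ],
)


def find_species(connection, substring):
    """Return the species whose petal description contains `substring`."""
    pattern = f"%{substring}%"  # wildcards turn LIKE into a sub-string search
    rows = connection.execute(
        "SELECT species FROM descriptions WHERE petal LIKE ? ORDER BY species",
        (pattern,),
    )
    return [species for (species,) in rows]
```

Passing the pattern as a bound parameter avoids quoting problems; note that % or _ typed by the user would still act as wildcards.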

An alternative implementation is to use SAX (sax.org, Simple API for XML) to make the query.
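The SAX alternative could look roughly like the following sketch, which streams through the document and keeps the text of each petal element containing the searched sub-string. It uses Python's xml.sax module only to illustrate the idea; the element names, the sample data and the class name OrganSearch are assumptions.

```python
import xml.sax

# Hypothetical sample: element names stand in for the generated organ markup.
SAMPLE = b"""<descriptions>
  <species name="species_a"><petal>pale yellow, obovate</petal></species>
  <species name="species_b"><petal>white</petal></species>
</descriptions>"""


class OrganSearch(xml.sax.ContentHandler):
    """Collect the text of every `organ` element that contains `needle`."""

    def __init__(self, organ, needle):
        super().__init__()
        self.organ = organ
        self.needle = needle
        self.depth = 0      # > 0 while inside an `organ` element
        self.buffer = []    # text chunks of the current `organ` element
        self.hits = []

    def startElement(self, name, attrs):
        if name == self.organ:
            if self.depth == 0:
                self.buffer = []
            self.depth += 1

    def characters(self, content):
        if self.depth:
            self.buffer.append(content)

    def endElement(self, name):
        if name == self.organ:
            self.depth -= 1
            if self.depth == 0:
                text = "".join(self.buffer).strip()
                if self.needle in text:
                    self.hits.append(text)


handler = OrganSearch("petal", "yellow")
xml.sax.parseString(SAMPLE, handler)
```

Because SAX never builds the whole tree, memory use stays flat, but every query rereads the complete file.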

In both cases, we need to generate column names (= XML element names) out of the existing XML. While doing this, an XML Schema will also be generated, and this XML Schema will be registered in XML repositories:

GUI
The Web page for queries will have a pull-down menu for plant organs (e.g. petals) and an input field for the searched sub-string (e.g. yellow). Probably there will be 2 pull-down menus, one for plant organs and one for sub-organs. Then the server will respond with a page like the prototype Web page already published, which allows further refinement queries on the local data, without going back to the server.
 

News from J.M. Vanel
I am trying to find a consultancy job around XML technologies, that will have a synergy with the WWBKB project. I will probably attend the TDWG 2000 meeting in Frankfurt (see http://www.bgbm.fu-berlin.de/TDWG/2000/).

New on this site
Local XML resources have been updated:

2000-07-14

Work with Xalan and Xerces to enhance the structure of the Flora of China XML database ( more details on the mailing list).

2000-06-27

At last I have finished parsing the 10 000 species of the Flora of China:
http://jmvanel.free.fr/pub/data/
See more details on the mailing list:  http://www.egroups.com/group/wwbota/
I hope that it can become a test case for the XML databases.
 

2000-06-11

A weekend of work with Bryan Thompson at my house on knowledge representation. Bryan is a specialist of the Shruti inference system (www.icsi.berkeley.edu/~shastri/shruti). Minutes of the meeting are coming.

2000-05-27

What I did this week:

2000-05-24

My (JMV) presentation at WWW9 in Amsterdam on May 19 was successful, and got compliments from Jon Bosak, one of the creators of XML. I collected 15 business cards; I will tell you what comes out of these contacts.

THANK YOU to all who helped and encouraged in various ways:
Jacques Lolieux, Anthony Brach, Mary Clare HOGAN, Nick Fulton, family Smalbrugge, Guillaume Rousse, Thomas Beale, Henning MUELLER, Dominique Salmon, Olivier Tavignot, Franck Yvetot, Ivodor Atanassov, Pierre Deransart, Rita Lemaire-Smith, Don Kirkup, Nick Lander, Laurent Kiryenko, and I certainly forget several...
Thanks to Reuters for paying my expenses for the 4 days WWW9 Conference.

FloraParse has been adapted to the Flora of China data and, thanks to Olivier Nouguier, will soon be put on the Web using a standard database, probably PostgreSQL.
I ran into small problems in exporting well-formed HTML out of MS Access files, because:

  1. the HTML (family descriptions) is not well-formed in the original file,
  2. a standard export or "save as HTML" in Access treats the HTML as ordinary text, and consequently transforms, e.g., <UL> into &lt;UL&gt;
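The second problem can in principle be undone by unescaping the exported text. The sketch below uses Python's html module only to show the idea; the function name restore_markup is hypothetical, and it does nothing about the first problem (markup that was not well-formed to begin with).

```python
import html


def restore_markup(exported):
    """Turn escaped markup such as &lt;UL&gt; back into <UL> (one pass only)."""
    return html.unescape(exported)
```

A single pass is deliberate: text that was already escaped in the original data (e.g. a literal &amp;lt;) comes back as &lt; instead of being turned into a tag.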
I tried to use W3C's tidy program on the 18 MB file; it ran all night without result! I am working on integrating the WordNet library in FloraParse.
Among other technologies, I look currently at :

2000-04-19

I downloaded the Flora of China data, with special permission (from the Flora of China Project), in two big MDB (MS Access) files. It is moving to have access to the accumulated work of hundreds of botanists over centuries.

I installed the WordNet environment on my machine.

First attempt to write an XML Schema for the XML All-purpose Protocol, see XAP.xsd .

For the past 2 or 3 weeks there has been a first specification for a client-server application doing specimen identification; I feel that it can give a good idea of how the pieces of the puzzle fit together: data, metadata, protocol, GUI.

2000-04-10

Technique:
  • first version of an XML Schema derived from the Botanical Glossary of the Flora of Australia; I used the HTMLGlossary2XMLSchema.xslt XSLT transform to generate it from the original file; this is still work in progress. With WordNet I'll be able to distinguish nouns, adjectives, relational adjectives, and among nouns those which are organs, properties, etc. So the metadata described in the Abstract Data Model will be implemented. Also an RDF Schema and an XML DTD will be generated from the XML Schema.
  • reading 5 papers about WordNet :

  • Descriptive Adjectives
    Descriptive adjectives are what one usually thinks of when adjectives are
    mentioned. A descriptive adjective is one that ascribes a value of an attribute to a noun.
    That is to say, x is Adj presupposes that there is an attribute A such that A(x) = Adj. To
    say The package is heavy presupposes that there is an attribute WEIGHT such that
    WEIGHT(package)=heavy. Similarly, low and high are values for the attribute HEIGHT.
    WordNet contains pointers between descriptive adjectives and the noun synsets that refer
    to the appropriate attributes.
    WordNet also indicates Relational Adjectives, so it has all we need!

    Contacts

    2000-03-27

    Technique: Communication:

    2000-03-18

    Communication:

    2000-03-09

    Communication: Technique:

    2000-02-24

    2000-01-26

    Discovered the Flora of North America (www.fna.org) by searching AltaVista for sites having the same keywords as this one. It's magnificent!

    The worldwide botanical database project has moved to http://wwbota.free.fr. My personal pages are now at http://jmvanel.free.fr .

    I subscribed to  x3d-contributors@web3d.org.

    2000-01-23

    Downloaded lots of things for 2D and 3D images:

    2000-01-22

    I provoked a discussion on xml-dev@ic.ac.uk about "XML for 3D geometry and objects".
    See archive: http://www.lists.ic.ac.uk/hypermail/xml-dev/
    Conclusion: we have to evaluate VML and SVG from the W3C, and X3D from the Web3D Consortium, the designated successor of the ISO standard VRML97.
    We also need nice editors for botanical 3D shapes: leaves, flowers, fruits, twigs. Note that the 3D structure of a fruit, and its evolution over time from a flower, has much in common with CAD (Computer Aided Design) models in mechanical engineering with an assembly sequence.
    But the privileged source of 3D data remains of course real specimens, or herbarium specimens, which makes stereoscopic software necessary.

    2000-01-20

    Having made a good start on the parsing, I am now investigating AI and 3D representation subjects. I'm reading "Semantic networks" by Lokendra Shastri, which has an interesting knowledge model, and proven techniques to implement the searches. The model allows for probabilities, e.g. flower 70% pink, 30% white.

    2000-01-15

    Semantics

    I finally found a name for the concept of an adjective defining no property, but narrowing the subject, like "inferior" in "inferior leaves". It's called a relational adjective, or pseudo-adjective.

    3D geometry

    I'm investigating syntaxes and formats for CAD (Computer Aided Design) data: IGES, STEP, etc. Apparently nothing XML-based exists in this field. One option might be to use the MathML vocabulary to express Bezier and NURBS algebra, together with a vocabulary borrowed from STEP, and of course botanical markup.

    I'm also investigating algorithms to obtain a CAD description from a 2D volumetric mesh.
     

    XML

    My XML outfit is growing and includes:

    Contacts

    2000-01-05

    Lots of new items at http://jmvanel.free.fr/ , including a few Diospyros descriptions, actually parsed by FloraParse from the Flora of China, with an enhanced sample User Interface for queries.

    1999-12-29

    The Lex-Yacc-C++ parser for classical Floras is ready for release. Mail me to get the sources.
    At last I wrote a list of tasks. See if something is within your skills.

    1999-12-18

    The Lex-Yacc-C++ parser for classical Floras is coming to life; it will be released for Christmas under the GNU General Public License. I looked at the Flora Europaea.
    I'll ask Bosak for advice.

    Contacts

    1999-11-13

    A first version of a Lex-Yacc-C++ parser is almost ready; it will be applied on the floras of China and Australia.

    1999-10-19

    First prototype of a Web page displaying botanical descriptions with several stylesheets. Sorry, it works only with Microsoft Internet Explorer 5, which is freely downloadable. Queries with XSL/XPath are in preparation, as is the possibility of using any browser, with an applet to transform the XML.

    Thinking about new pages:

    Contacts with a botanist at The Royal Botanic Gardens Kew, interested in the automated handling of descriptive botanical information.

    1999-10-10

    Contacts with specialists in image generation (L-systems), and image database.
    Posts presenting the project have been made in:
    comp.ai, comp.ai.nat-lang, comp.databases, comp.databases.object, comp.lang.xml
     

    1999-10-04

    At present we work more with the computer people than with the botanists, because the requirements appear clear enough, and because we must come up with some convincing prototype. The botanists will not believe in the feasibility of the project until they see something concrete.

    However, contact has been made with the IOPI (International Organization for Plant Information).

    I studied the feasibility of an automated translation of a standard flora into an XML file. The example was from the Flora of China at http://flora.harvard.edu/china/search/search.html ; the Guidelines for Contributors is a very interesting document showing that the Flora of China is a very well-defined document, but parsing with standard tools like Lex and Yacc is not applicable. Here is a hand-made parsing of an example species taken from the Guidelines for Contributors. It can be seen that the required tool must have a basic knowledge of nouns and articles, to be able to translate expressions like "leaf blade" and "floral tube". A post was made in comp.ai.nat-lang, to ask for the relevant techniques.

    A prototype Web page for a query was made.

    The overall architecture will be inspired by S. Mohr's book (Building Distributed Applications with XML, LDAP, and IE5, Wrox Press), which envisions servers on the Internet offering their knowledge via HTTP requests and advertising their services via metadata.