Short-term plan for a WWW botanical database

last update: Jan. 15, 2000

List of Tasks

Following is the List of Tasks, ordered from the most advanced to the least:
XML Schema, Query and GUI, Parsing, Webmaster, Server, Communication and public relations, Images.

I (JMV) can go on currently with the first 3 tasks (XML Schema, Query and GUI, Parsing), and the overall coordination, but help is needed on the rest.

XML Schema

[study (see preliminary schemas) and specification stage]
- try RDF sample
- write up description vs qualifier
- general XML vocabulary for biological descriptions (...)
- convergence Delta-descriptions
- write XSLT transforms between flat database, DELTA, etc.
- links with other disciplines (zoology, ecology)

Query and GUI

[study, prototyping and specification stage]
- choose a browser (IE or Mozilla), for the short term and the long term
- use of applets for XSLT
- use of metadata (containment, generalization, general and specialized dictionaries)
- use of logical computing (inference engines, etc) to enhance query
- use of natural language techniques to formalize further the descriptions

Parsing

[study, implementation (see marked up files) and specification stage]
- Flora Europaea
- Flora of China
- Flora of Australia
- find out about other floras
- copyright problems

Webmaster

[existing site]
- architecture of the site
- collect contributions
- upload site content
- answer or forward mail about the site
- find suitable hosts
- manage mailing lists
- manage links
- manage referencing (see that lots of other sites reference ours)
- manage referencing by search engines

Server

[study and specification stage]
- validate choice of XML processor (XT, Xerces, or search engine technique/indexing, or maybe database)
- multi-server
- implement download and query page
- find server machines and/or institutions to host databases

Communication and public relations

[exploratory stage]
- find sponsors
- find developers and biologists
- make our project well known among:
   - scientists in general
   - political world
   - economic world
   - organizations defending nature
   - international organizations

Images

[study and specification stage]
- 3D recognition from 2 or 3 pictures from different angles
- 3D representation (with Bezier mappings, VRML or whatever)
- pattern recognition, vectorization
- L-Systems
- static or growth simulation images
- how to display 3D on the browser
- collect images

Update: Sept. 23, 1999

Data

We want to build complex databases for botanical data, aimed primarily at computer-aided identification of plant samples. Building on the existing databases of plant names and synonyms, the first data to include will be:

description of the species, including pictures,
geographical distribution.

But we must specify an extensible software architecture, in order to support:

many types of data:

living collections, herbaria, Index Seminum
chromosomal studies
zoological: pollinating, disseminators, parasites, herbivores
biochemical
paleontological
ecological, phytosociological, pedological, climatic
agronomic, plant uses, ethnobotany
books and publications
lists of biologists and computer scientists

many types of processing:

traditional queries (SQL, OQL),
assisted identification of plant samples,
correlation studies, e.g. between molecules and taxonomy, or between plant and animal distributions,

different access types:

local replicates,
multi-tier architectures and cooperative processing,
facilities for publishing, permissioning and validating data

Existing software

DELTA is a suite of taxonomic programs, connected by a common data file format. It is available for download. It is very valuable software, but:

there are not enough databases (there are about 250 000 plant species!)
if we want to gather all plant species in a single framework (cf. the Species Plantarum project of the IOPI), we need either:

some kind of collaboration between separate databases,
or a common set of characters

the characters are not explicitly associated with an organ (like leaf, flower, etc.)
the data file format is "proprietary"; a standard like XML would help

Gathering plant descriptions

Maybe it would be easier, as a first step towards this world flora, to have a few fields in natural language (English and/or Latin), like this:

Genus
Species
Family
Flower
Leaf
Area
Other Information

The legacy of existing floras on paper could be put into this schema, but this would require some natural language processing. Afterwards we could:

correct the names with the help of existing synonym databases (like the Global Plant Checklist);
use a natural language processor to extract a formal description in terms of a normalized set of characters (to be defined)
add images
connect herbaria databases

Genus and Family would each be either a string or a foreign key to another table with the fields Flower, Leaf, Area and Other Information; the Species inherits the characters of its Genus and can redefine them.
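The inheritance rule can be sketched in a few lines of Python. This is only an illustration under assumptions: the class names, the `character` method and the sample values are hypothetical and not part of any existing schema.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Field names taken from the proposed schema; everything else is hypothetical.
FIELDS = ("Flower", "Leaf", "Area", "Other Information")


@dataclass
class Genus:
    name: str
    family: str
    characters: Dict[str, str] = field(default_factory=dict)


@dataclass
class Species:
    epithet: str
    genus: Genus
    characters: Dict[str, str] = field(default_factory=dict)

    def character(self, field_name: str) -> Optional[str]:
        """Return the species' own value if it redefines the field,
        otherwise fall back to the value stored on the genus."""
        if field_name in self.characters:
            return self.characters[field_name]
        return self.genus.characters.get(field_name)


# Illustrative values only, not curated botanical data.
quercus = Genus(
    "Quercus",
    "Fagaceae",
    {"Leaf": "alternate, simple", "Flower": "unisexual, in catkins"},
)
robur = Species("robur", quercus, {"Leaf": "lobed, with short petiole"})

print(robur.character("Leaf"))    # value redefined by the species
print(robur.character("Flower"))  # value inherited from the genus
```

In a relational layout the same fallback would be a lookup in the Species row first, then in the Genus row reached through the foreign key.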

Specifying vocabularies

While gathering as many natural language plant descriptions as possible, we must define vocabularies. Whatever technology we choose for our database (object, deductive, document-based, etc.), we will need a common naming for all our data. Here we have 2 possibilities:

use Latin terms for botanical description. Why? Besides being a tradition in botany (it is mandatory for new species publication), it has the advantage of not colliding with other meanings, especially in Web searches. If you search for an occurrence of the word or tag "leaf", most of the results are non-botanical, whereas searching for a "folium" XML tag will return only botanical data, and searching for an occurrence of "folium" will return formal species publications;
use English words.
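The difference between the two options can be tried on a toy document. The sketch below uses Python's standard `xml.etree.ElementTree`; the element names (`flora`, `species`, `folium`, `flos`, `note`) and the `nomen` attribute are assumptions made for the example, not a proposed vocabulary.

```python
import xml.etree.ElementTree as ET

# Toy document: one botanical record with Latin tags, plus unrelated text
# that happens to contain the English word "leaf".
DOC = """
<flora>
  <species nomen="Quercus robur">
    <folium>lobed, with short petiole</folium>
    <flos>unisexual, in catkins</flos>
  </species>
  <note>Turn over a new leaf in the catalogue.</note>
</flora>
"""

root = ET.fromstring(DOC)

# Searching by the Latin tag returns only botanical descriptions.
folium_hits = [el.text for el in root.iter("folium")]

# Searching for the English word as plain text also matches unrelated content.
leaf_hits = [el.tag for el in root.iter() if el.text and "leaf" in el.text.lower()]

print(folium_hits)
print(leaf_hits)
```

Here the tag search finds the leaf description, while the free-text search for "leaf" only hits the unrelated note.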

Searching for state-of-the-art software

A third parallel activity will be the search for, and evaluation of, all kinds of software and techniques.

Image processing

For assisted identification of plant samples, we need image vectorization and pattern recognition. The set of reference patterns (species in the database) can be very large, but the search can be narrowed with the available non-graphical information, like geographical area. For leaves and their veins, a 2D treatment will be enough, but for flower and twig architecture (phyllotaxy), we need 3D treatment with stereoscopic analysis.
The use of plant growth and plant shape models can help both in pattern recognition and in reducing storage space.
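As a rough sketch of how non-graphical information can narrow the search before any pattern matching, the Python example below filters the reference set by geographical area and only then compares shapes. The `Reference` class, the two-number "shape" vectors and the Euclidean distance are all placeholders chosen for illustration; a real system would use proper vectorized outlines and a real matching method.

```python
import math
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Reference:
    name: str
    areas: FrozenSet[str]
    shape: Tuple[float, ...]  # hypothetical feature vector from a vectorized leaf


def candidates(refs: Sequence[Reference], area: str) -> List[Reference]:
    """Keep only the reference species recorded in the given area."""
    return [r for r in refs if area in r.areas]


def best_match(
    refs: Sequence[Reference], area: str, sample_shape: Tuple[float, ...]
) -> Optional[Reference]:
    """Compare the sample only against species that occur in the area."""
    pool = candidates(refs, area)
    return min(pool, key=lambda r: math.dist(r.shape, sample_shape), default=None)


# Made-up reference set.
refs = [
    Reference("species A", frozenset({"Europe"}), (0.20, 0.80)),
    Reference("species B", frozenset({"Asia"}), (0.21, 0.79)),
    Reference("species C", frozenset({"Europe", "Asia"}), (0.90, 0.10)),
]

match = best_match(refs, "Europe", (0.22, 0.78))
print(match.name if match else "no candidate in this area")
```

Note that "species B" has the closest shape but is never compared, because it is not recorded in the sample's area; this is the saving the geographical filter is meant to give.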

Database, network

distributed OO Database,
how to reconcile groups working on families and genera with groups working on geographical areas
replication

Processing of natural language

As said before, a natural language processor could extract a formal description from a natural language description. A formal description is easy to translate to several national languages.
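A very small Python sketch of both steps follows. It is not a natural language processor, only a keyword lookup: the `LEXICON` and `LABELS` tables, the (organ, character, state) triples and the function names are assumptions made up for this example.

```python
import re
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (organ, character, state)

# Hypothetical controlled vocabulary: English surface word -> formal triple.
LEXICON = {
    "alternate": ("folium", "arrangement", "alternate"),
    "opposite": ("folium", "arrangement", "opposite"),
    "lobed": ("folium", "margin", "lobed"),
}

# Hypothetical per-language labels for every formal term.
LABELS = {
    "en": {"folium": "leaf", "arrangement": "arrangement", "margin": "margin",
           "alternate": "alternate", "opposite": "opposite", "lobed": "lobed"},
    "fr": {"folium": "feuille", "arrangement": "disposition", "margin": "marge",
           "alternate": "alterne", "opposite": "opposée", "lobed": "lobée"},
}


def extract(text: str) -> List[Triple]:
    """Map known words of a description to formal triples, ignoring the rest."""
    words = re.findall(r"[a-z]+", text.lower())
    return [LEXICON[w] for w in words if w in LEXICON]


def render(formal: List[Triple], lang: str) -> str:
    """Render a formal description with the labels of the chosen language."""
    labels = LABELS[lang]
    return "; ".join(f"{labels[o]}: {labels[c]} {labels[s]}" for o, c, s in formal)


formal = extract("Leaves alternate, deeply lobed.")
print(render(formal, "en"))
print(render(formal, "fr"))
```

Once the description is held as triples, adding a language only means adding one more table of labels.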

Expert systems

Help is needed on that matter...
We should also check the knowledge representation (KR) concepts.

First data model

Botany alone is a huge project (about 250 000 species of flowering plants), and we want these botanical data to be easily usable by anyone for any kind of use.
To this end, it is not wise to be tied to a particular database system, or file format, or network protocol.

Even an object model in the sense of UML or ODL is hard to achieve, because there are at least 2 different views:

a morphological view, where the emphasis is on organs (flower, leaf, etc.)
a taxonomical view, where the emphasis is on relations between species, and the main type of object is the character; this is currently covered by the DELTA series of software

So I think the best way to advance now is:

define XML vocabularies to be able to exchange data,
define ways of issuing queries (SQL/OQL or XML/XPath) using Internet protocols, maybe in the spirit of Stephen Mohr's book "Building distributed applications using XML, LDAP"
convince the institutions having natural language data and pictures (floras, scientific journals) to let their data be imported into the database
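The first two points can be tried out with Python's standard library alone. The sketch below serializes a few records to XML and queries them with the XPath subset that `xml.etree.ElementTree` supports. The element and attribute names (`flora`, `species`, `nomen`, `familia`, `area`) and the sample records are assumptions for illustration, and no network transport is shown.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Made-up sample records standing in for a flat database.
RECORDS = [
    {"name": "Quercus robur", "family": "Fagaceae", "area": "Europe"},
    {"name": "Rosa canina", "family": "Rosaceae", "area": "Europe"},
    {"name": "Fagus sylvatica", "family": "Fagaceae", "area": "Europe"},
]


def to_xml(records: List[Dict[str, str]]) -> str:
    """Serialize flat records to an XML exchange document."""
    root = ET.Element("flora")
    for rec in records:
        sp = ET.SubElement(root, "species", nomen=rec["name"], familia=rec["family"])
        ET.SubElement(sp, "area").text = rec["area"]
    return ET.tostring(root, encoding="unicode")


def query(xml_text: str, path: str) -> List[str]:
    """Run an XPath-style query and return the names of the matching species."""
    root = ET.fromstring(xml_text)
    return [el.get("nomen") for el in root.findall(path)]


exchange = to_xml(RECORDS)
print(query(exchange, ".//species[@familia='Fagaceae']"))
print(query(exchange, ".//species[area='Europe']"))
```

The receiving side needs only the XML text and the shared vocabulary, not the sender's database system, which is the point of defining the vocabularies first.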