Short-term plan for a WWW botanical database
last update: Jan. 15, 2000
List of Tasks
The tasks are listed below, ordered from the most advanced to the least:
XML Schema, Query and GUI, Parsing, Webmaster, Server, Communication
and public relations, Images.
I (JMV) can currently carry on with the first 3 tasks (XML Schema, Query
and GUI, Parsing) and with the overall coordination, but help is needed on
the rest.
XML Schema
[study (see preliminary schemas) and specification stage]
- try RDF sample
- write on description vs qualifier
- general XML vocabulary for biological descriptions (...)
- convergence Delta-descriptions
- write XSLT transforms between flat databases, DELTA, etc.
- links with other disciplines (zoology, ecology)
Query and GUI
[study, prototyping and specification stage]
- choose a browser (IE or Mozilla), for the short term and the long term
- use of applets for XSLT
- use of metadata (containment, generalization, general and specialized
dictionaries)
- use of logical computing (inference engines, etc.) to enhance queries
- use of natural language techniques to further formalize the descriptions
Parsing
[study, implementation (see marked up files) and specification stage]
- Flora Europaea
- Flora of China
- Flora of Australia
- find out about other floras
- copyright problems
Webmaster
[existing site]
- architecture of the site
- collect contributions
- upload site content
- answer or forward mail about the site
- find suitable hosts
- manage mailing lists
- manage links
- manage referencing (make sure that many other sites link to ours)
- manage referencing by search engines
Server
[study and specification stage]
- validate the choice of XML processor (XT, Xerces, a search engine/indexing
technique, or maybe a database)
- multi-server
- implement download and query page
- find server machines and/or institutions to host databases
Communication and public relations
[exploratory stage]
- find sponsors
- find developers and biologists
- make our project well known among:
  - scientists in general
  - the political world
  - the economic world
  - organizations defending nature
  - international organizations
Images
[study and specification stage]
- 3D recognition from 2 or 3 pictures from different angles
- 3D representation (with Bezier mappings, VRML or whatever)
- pattern recognition, vectorization
- L-Systems
- static or growth simulation images
- how to display 3D in the browser
- collect images
Update: Sept. 23, 1999
Data
We want to build complex databases for botanical data, aimed primarily
at computer-aided identification of plant samples. Building on the
existing databases of plant name synonyms, the first data to include will
be:
- descriptions of the species, including pictures,
- geographical distribution.
But we must specify an extensible software architecture, in order to support:
- many types of data:
  - living collections, herbaria, Index Seminum
  - chromosomal studies
  - zoological: pollinators, dispersers, parasites, herbivores
  - biochemical
  - paleontological
  - ecological, phytosociological, pedological, climatic
  - agronomic, plant uses, ethnobotany
  - books and publications
  - lists of biologists and computer scientists
- many types of processing:
  - traditional queries (SQL, OQL),
  - assisted identification of plant samples,
  - correlation studies, e.g. between molecules and taxonomy, or between
    plant and animal distributions,
- different access types:
  - local replicas,
  - multi-tier architectures and cooperative processing,
- facilities for publishing, permissioning and validating data
Existing software
DELTA is a suite of taxonomic programs connected by a common data file
format. It is available for download.
This is very valuable software, but:
- there are not enough databases (there are about 250 000 plant species!)
- if we want to gather all plant species in a single framework (cf. the
  Species Plantarum project of the IOPI), we need either:
  - some kind of collaboration between separate databases,
  - or a common set of characters
- the characters are not explicitly associated with an organ (like leaf,
  flower, etc.)
- the data file format is "proprietary"; a standard like XML would help
Gathering plant descriptions
Maybe it would be easier as a first step towards this world flora to have
a few fields in natural language (English and/or Latin), like this:
- Genus
- Species
- Family
- Flower
- Leaf
- Area
- Other Information
The legacy of existing floras on paper could be put into this schema, but
this would need some natural language processing. Afterwards we
could:
- correct the names with the help of existing synonym databases (like the
  Global Plant Checklist);
- use a natural language processor to extract a formal description in terms
  of a normalized set of characters (to be defined)
- add images
- connect herbaria databases
Genus and Family would each be either a string or a foreign key to another
table with the same fields (Flower, Leaf, Area, Other Information); the
Species inherits the characters of its Genus, and can redefine them.
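The inheritance rule above can be sketched as follows. This is only a minimal illustration, not a proposed schema: the class names, the `character` method and the sample taxon "Exemplum album" are all made up for the example, and the field values are placeholders rather than real botanical data.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Genus:
    """A genus record holding free-text fields such as Flower or Leaf."""
    name: str
    family: str
    characters: Dict[str, str] = field(default_factory=dict)


@dataclass
class Species:
    """A species record; it stores only the fields it redefines."""
    epithet: str
    genus: Genus
    characters: Dict[str, str] = field(default_factory=dict)

    def character(self, field_name: str) -> Optional[str]:
        """Return the species' own text if present, else the genus text."""
        if field_name in self.characters:
            return self.characters[field_name]
        return self.genus.characters.get(field_name)

    def full_description(self) -> Dict[str, str]:
        """Genus characters overlaid with the species' redefinitions."""
        merged = dict(self.genus.characters)
        merged.update(self.characters)
        return merged


# Placeholder data (hypothetical taxon, not a real plant).
exemplum = Genus("Exemplum", "Exemplaceae",
                 {"Flower": "petals 5, white", "Leaf": "opposite"})
album = Species("album", exemplum, {"Leaf": "alternate"})

inherited_flower = album.character("Flower")   # taken from the genus
redefined_leaf = album.character("Leaf")       # redefined by the species
```

Storing only the redefined fields on the species keeps the genus text in one place, at the cost of a lookup through the genus on every read.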
Specifying vocabularies
While gathering as much as possible of natural language plant descriptions,
we must define vocabularies. Whatever technology we choose for our database
(object, deductive, document-based, etc), we will need a common naming
for all our data. Here we have 2 possibilities:
- use Latin terms for botanical description. Why? Besides being a tradition
  in botany (it is mandatory for new species publication), it has the
  advantage of not interfering with other meanings, especially for Web
  searches. If you search for an occurrence or a tag "leaf", you get a
  majority of non-botanical data, whereas searching for a "folium" XML tag
  will bring only botanical data, and searching for an occurrence of
  "folium" will bring formal species publications;
- use English words.
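To make the first possibility concrete, here is a small sketch of a description marked up with Latin element names, then searched by tag. Only "folium" comes from the text above; the other names ("flora", "planta", "flos"), the `describe` helper and the sample values are assumptions made for the illustration, not an agreed vocabulary.

```python
import xml.etree.ElementTree as ET


def describe(genus: str, species: str, folium: str, flos: str) -> ET.Element:
    """Build one plant description using Latin element names."""
    planta = ET.Element("planta", {"genus": genus, "species": species})
    ET.SubElement(planta, "folium").text = folium   # leaf
    ET.SubElement(planta, "flos").text = flos       # flower
    return planta


# Placeholder taxa and values, for illustration only.
flora = ET.Element("flora")
flora.append(describe("Exemplum", "album", "alternate, entire", "white, 5 petals"))
flora.append(describe("Exemplum", "rubrum", "opposite, toothed", "red, 4 petals"))

xml_text = ET.tostring(flora, encoding="unicode")

# A search on the tag name finds leaf descriptions and nothing else.
leaves = [el.text for el in ET.fromstring(xml_text).iter("folium")]
```

The same search on an English "leaf" tag would work equally well inside this one file; the argument for Latin only matters once the data sits on the Web next to unrelated documents.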
Searching state-of-the-art software
A third parallel activity will be the search and evaluation of all kinds
of software and techniques.
Image processing
For assisted identification of plant samples, we need image
vectorization and pattern recognition.
The set of reference patterns (species in the database) can be very large,
but the search can use the available non-graphical information, like geographical
area. For leaves and their veins, a 2D treatment will be enough, but for
flowers and twig architecture (phyllotaxy), we need 3D treatment with
stereoscopic analysis.
The use of plant growth and plant shape models can help both in pattern
recognition and in reducing storage space.
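The idea of narrowing the search with non-graphical information can be sketched like this. It is a toy illustration under strong assumptions: the `Reference` record, the numeric "shape" vectors and the use of Euclidean distance as a similarity measure are all hypothetical stand-ins for whatever vectorization and pattern recognition technique is eventually chosen.

```python
from dataclasses import dataclass
from math import dist
from typing import FrozenSet, List, Sequence, Tuple


@dataclass(frozen=True)
class Reference:
    """One reference pattern: a species with its areas and a shape vector."""
    name: str
    areas: FrozenSet[str]
    shape: Tuple[float, ...]   # hypothetical leaf-shape feature vector


def identify(sample_shape: Sequence[float], sample_area: str,
             references: List[Reference], top: int = 3) -> List[str]:
    """Filter by geographical area first, then rank by shape distance."""
    candidates = [r for r in references if sample_area in r.areas]
    ranked = sorted(candidates, key=lambda r: dist(r.shape, sample_shape))
    return [r.name for r in ranked[:top]]


# Placeholder reference set (made-up species names and numbers).
refs = [
    Reference("A", frozenset({"Europe"}), (1.0, 0.2)),
    Reference("B", frozenset({"Asia"}), (1.0, 0.2)),
    Reference("C", frozenset({"Europe", "Asia"}), (0.5, 0.9)),
]
matches = identify((1.0, 0.25), "Europe", refs)
```

Species "B" has the same shape vector as "A" but is never compared, because the cheap area filter removes it before any image comparison takes place.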
Database, network
- distributed OO database,
- how to reconcile groups working on families and genera with groups
  working on geographical areas
- replication
Processing of natural language
As mentioned above, a natural language processor could extract a formal
description from a natural language description. A formal description is
easy to translate into several national languages.
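The last point can be illustrated with a small sketch: once a description is a list of formal (organ, character, state) triples, rendering it in another language is a table lookup. The triple layout, the `TERMS` tables and the `render` function are assumptions made for this example; a real system would need proper dictionaries and grammar handling.

```python
from typing import Dict, List, Tuple

# A formal description as (organ, character, state) triples.
Description = List[Tuple[str, str, str]]

# Tiny illustrative term tables; real dictionaries would be far larger.
TERMS: Dict[str, Dict[str, str]] = {
    "en": {"folium": "leaf", "flos": "flower", "margin": "margin",
           "colour": "colour", "entire": "entire", "white": "white"},
    "fr": {"folium": "feuille", "flos": "fleur", "margin": "marge",
           "colour": "couleur", "entire": "entière", "white": "blanche"},
}


def render(description: Description, lang: str) -> List[str]:
    """Render each triple as 'organ: character state' in the given language."""
    terms = TERMS[lang]
    return [f"{terms[organ]}: {terms[char]} {terms[state]}"
            for organ, char, state in description]


formal: Description = [("folium", "margin", "entire"),
                       ("flos", "colour", "white")]
english = render(formal, "en")
french = render(formal, "fr")
```

The French adjectives happen to agree here because both nouns are feminine; handling agreement in general is exactly the kind of detail a lookup table cannot cover.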
Expert systems
Help is needed on that matter...
We should also check the knowledge representation (KR) concepts.
First data Model
Botany alone is a huge project (about 250 000 species of flowering
plants), and we want these botanical data to be easily usable by anyone
for any kind of use.
To this end, it is not wise to be tied to a particular database system,
or file format, or network protocol.
Even an object model in the sense of UML or ODL is hard to achieve,
because there are at least 2 different views:
- a morphological view, where the accent is on organs (flower, leaf, etc.)
- a taxonomical view, where the accent is on relations between species, and
  the main type of object is the character; this is currently covered by the
  DELTA series of software
So I think the best way to advance now is:
- define XML vocabularies to be able to exchange data,
- define ways of issuing queries (SQL/OQL or XML/XPath) using Internet
  protocols, maybe in the spirit of Stephen Mohr's book "Building distributed
  applications using XML, LDAP"
- convince the institutions having natural language data and pictures
  (floras, scientific journals) to let their data be imported into the
  database
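As an illustration of the XML/XPath option, the sketch below runs a path query against a small XML document using the limited XPath subset of Python's standard library. The element names ("flora", "planta", "folium", "area"), the sample records and the `query` helper are assumptions for the example; how the query string would travel over an Internet protocol is left open.

```python
import xml.etree.ElementTree as ET
from typing import List

# Placeholder data in a hypothetical exchange vocabulary.
SAMPLE = """
<flora>
  <planta genus="Exemplum" species="album">
    <folium>alternate</folium>
    <area>Europe</area>
  </planta>
  <planta genus="Exemplum" species="rubrum">
    <folium>opposite</folium>
    <area>Asia</area>
  </planta>
</flora>
"""


def query(xml_text: str, path: str) -> List[str]:
    """Return 'Genus species' for every planta element matching the path."""
    root = ET.fromstring(xml_text.strip())
    return [f"{p.get('genus')} {p.get('species')}" for p in root.findall(path)]


# Select plants by the text of a child element, then by an attribute.
european = query(SAMPLE, ".//planta[area='Europe']")
rubrum = query(SAMPLE, ".//planta[@species='rubrum']")
```

Because the query is a plain string, a server could accept it as a request parameter and return the matching XML fragment, which keeps clients independent of the storage technology behind it.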