Machine understanding of botanical descriptions

Gathering plant descriptions from existing floras:
The legacy of existing paper floras could be put into a formal schema, but this requires some natural-language processing. Even if we do not require a formal description in terms of a normalized set of characters, simple parsing is not enough (see the example below). At least a basic recognition of nouns and adjectives is needed.

So the QUESTION was: What tools (free software preferably) could we use?
It seems that classical parsing with lex & yacc can be used as a first step, since the grammar is simple.

Example of botanical description

Epilobium fangii C. J. Chen, Hoch, & Raven, Syst. Bot. Monogr. 34: 152. 1992.
Herbs perennial, erect. Stems 15--40 cm tall, strigillose with scattered glandular hairs, with 2 lines of hairs decurrent from margins of petiole. Leaves opposite, apically alternate; petiole 1--4(--6) mm; leaf blade elliptic to elliptic-oblong, 1--4 X 0.5--1.5 cm, leathery, subglabrous except for strigillose veins and margins, base cuneate to broadly cuneate, margin obscurely serrulate with 5--18 teeth on each side, apex subobtuse to sometimes acute. Inflorescences erect; bracts ca. 1/2 as long as ovaries. Pedicel 3--7 mm. Flowers erect; floral tube 0.6--1.1 mm. Sepals oblong-lanceolate, 4--5 X 0.8--1.2 mm, keeled. Petals pink to rose-purple, narrowly obcordate, 6--7.5 X 2.8--3.5 mm, apical notch 1--1.5 mm. Filaments of longer set 2.8--3.5 mm but those of shorter set 2.4--2.6 mm; anthers 0.7--1 mm. Ovary 1.5--3 cm, glandular. Style 3--4 mm, glabrous or sparsely hairy near base. Stigma capitate, 0.8--1.1 mm. Capsule 3--7 cm, 1--1.8 mm in diam. Seeds narrowly obovoid, 1.1--1.4 X 0.3--0.5 mm, finely papillose; coma 6--7 mm. Fl. May-Aug, fr. Jun-Oct. 2n = 36@.

* Open places along stream banks, at bases of rock walls or scree slopes; (1100--)1700--3500 m. Sichuan, Yunnan.

This example was taken from:
"Flora of China, Guidelines for Contributors. November 1995" at http://flora.harvard.edu/china/

Other examples of botanical descriptions:
Eucalyptus rudderi (http://www.environment.gov.au/life/species/flora/euc-descriptions/euc-rudderi.html) from the Flora of Australia; see also: www.conifers.org.

Formal description corresponding to the Example

This is an example of the kind of output we want from the botanical description above. It is well-formed XML. If you use Microsoft Internet Explorer 5, or another XML-capable browser, it will be displayed nicely.
species_example.xml

Here is a schema diagram generated by XML Authority from Extensibility using the same XML data file above (see the corresponding schema files). Of course we should process a representative sample of species to get a complete set of organs and characters. Here is a Web page displaying this botanical description with several stylesheets.

Problems encountered when parsing botanical descriptions

I applied a simple parse by hand, using the delimiters comma, semicolon, and period (, ; .).
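The hand parse can be approximated by the following minimal sketch. It is only an illustration, not the code of the actual tool: the function names are hypothetical, and it assumes that a period ends a sentence only when it is the last character or is followed by a space and a capital letter, so that "0.6--1.1 mm" and "ca. 1/2" are left intact.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Append `piece` to `out` after stripping surrounding spaces; skip empty pieces.
void push_trimmed(std::vector<std::string>& out, const std::string& piece) {
    const std::size_t first = piece.find_first_not_of(' ');
    if (first == std::string::npos) return;
    const std::size_t last = piece.find_last_not_of(' ');
    out.push_back(piece.substr(first, last - first + 1));
}

// Split on one delimiter character (';' for subparts, ',' for components).
std::vector<std::string> split_on(const std::string& text, char delim) {
    std::vector<std::string> parts;
    std::string current;
    for (char c : text) {
        if (c == delim) {
            push_trimmed(parts, current);
            current.clear();
        } else {
            current += c;
        }
    }
    push_trimmed(parts, current);
    return parts;
}

// Split a description into sentences (assumption: see the rule stated above).
std::vector<std::string> split_sentences(const std::string& text) {
    std::vector<std::string> sentences;
    std::string current;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const bool at_end = (i + 1 == text.size());
        const bool before_capital =
            i + 2 < text.size() && text[i + 1] == ' ' &&
            std::isupper(static_cast<unsigned char>(text[i + 2]));
        if (text[i] == '.' && (at_end || before_capital)) {
            push_trimmed(sentences, current);
            current.clear();
        } else {
            current += text[i];
        }
    }
    push_trimmed(sentences, current);
    return sentences;
}
```

Calling split_sentences() on a description and then split_on() with ';' and ',' gives the three levels of the hand parse. The heuristic still fails on abbreviations followed by a capital letter, such as "Fl. May-Aug" or the author initials "C. J. Chen", which is one reason why delimiters alone are not enough.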

Solutions for parsing botanical descriptions

To prepare a botanical vocabulary in the XML/RDF sense, we urgently need a formal vocabulary in the common sense, something like:
lexicon_sample.xml

These definitions are taken from the Flora of Australia Glossary, compiled by A. McCusker. I have invented the hazelnut entry to show synonym markup. We can parse this or another glossary to obtain such a formal vocabulary.

Then we will use this vocabulary to help parse a floristic text (see the example above). At the beginning of a nominal group (the part between two commas), the principal noun can be found as follows. Let f be the first word and s the second.

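// f and s are the first and second words of the nominal group;
// type() looks a word up in the formal vocabulary.
if (type(f) == noun && type(s) != noun)
  principal = f;            // single noun, e.g. "Stems 15--40 cm tall"
else
  principal = f + "_" + s;  // compound name, e.g. "leaf blade", "floral tube"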
So we can treat expressions like "leaf blade" or "floral tube" in the example above.

Using the formal botanical vocabulary, we can also eliminate the non-preferred synonyms.
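As a rough sketch of how the vocabulary could drive both steps, the code below keeps each word's type and preferred synonym in a small in-memory table. Everything here is an assumption made for illustration: the names Lexicon, type_of, principal_noun and preferred_term are hypothetical, and the sample entries (including the "hairless" to "glabrous" synonym) are invented, not taken from the glossary.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <sstream>
#include <string>

enum class WordType { Noun, Adjective, Unknown };

// One lexicon entry: the word type plus the preferred synonym
// (empty when the word itself is the preferred term).
struct LexiconEntry {
    WordType type;
    std::string preferred;
};

using Lexicon = std::map<std::string, LexiconEntry>;

// Illustrative entries only; a real lexicon would be parsed from a glossary.
Lexicon sample_lexicon() {
    return {
        {"stems", {WordType::Noun, ""}},
        {"leaf", {WordType::Noun, ""}},
        {"blade", {WordType::Noun, ""}},
        {"tube", {WordType::Noun, ""}},
        {"floral", {WordType::Adjective, ""}},
        {"glabrous", {WordType::Adjective, ""}},
        {"hairless", {WordType::Adjective, "glabrous"}},  // invented synonym entry
    };
}

std::string lower(std::string word) {
    for (char& c : word) c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return word;
}

WordType type_of(const Lexicon& lex, const std::string& word) {
    const auto it = lex.find(lower(word));
    return it == lex.end() ? WordType::Unknown : it->second.type;
}

// Apply the rule above to the first two words of a nominal group.
std::string principal_noun(const Lexicon& lex, const std::string& group) {
    std::istringstream words(group);
    std::string f, s;
    words >> f >> s;          // s stays empty for a one-word group
    if (s.empty()) return f;  // assumption: a lone word is returned unchanged
    if (type_of(lex, f) == WordType::Noun && type_of(lex, s) != WordType::Noun) return f;
    return f + "_" + s;
}

// Return the preferred synonym of a word, or the word itself if none is recorded.
std::string preferred_term(const Lexicon& lex, const std::string& word) {
    const auto it = lex.find(lower(word));
    if (it == lex.end() || it->second.preferred.empty()) return word;
    return it->second.preferred;
}
```

With this sample table, "leaf blade elliptic to elliptic-oblong" gives leaf_blade and "floral tube 0.6--1.1 mm" gives floral_tube. A word missing from the lexicon is treated as a non-noun, so the quality of the result depends directly on how complete the vocabulary is.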

Various parsing issues

Parser for floras

Implementation

This is the tool that translates the Flora of China and the Flora of Australia into XML. It is written in C++ with LEX and YACC. The software is released under the GNU General Public License (see www.gnu.org). Mail me to get the sources.

A species description in the Flora of China is made up of 4 parts (see below), the main one being the botanical description. As of today (1999-12-29), only the description part is handled. The key to sub-species will be done later. The 2 other parts can be isolated by HTML parsing, so the next step will be to use an XML parser (maybe IBM's XML4C) to separate the 4 parts. Abbreviations (e.g. "ca." = circa, "&") are handled by lex.
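For illustration only, here is what the abbreviation step amounts to, written as plain C++ rather than as lex rules. This is not the tool's source: the function name is hypothetical, and expanding "&" to "and" is an assumption, since the text only says that "&" is handled.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Replace whole whitespace-separated tokens that match a known abbreviation.
// Runs of whitespace are collapsed to single spaces as a side effect.
std::string expand_abbreviations(const std::string& text) {
    static const std::map<std::string, std::string> table = {
        {"ca.", "circa"},
        {"&", "and"},  // assumed expansion
    };
    std::istringstream in(text);
    std::string token, out;
    while (in >> token) {
        const auto it = table.find(token);
        if (!out.empty()) out += ' ';
        out += (it != table.end()) ? it->second : token;
    }
    return out;
}
```

Expanding abbreviations before any splitting on periods has the side benefit that "ca." no longer looks like the end of a sentence.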

Issues

Compound element tags (e.g. "floral tube" or "upper leaves") are not treated consistently; I'm not sure how to handle them (but see above). Names like "Petals" are left in the plural when they become XML element names.
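One possible way to make element names consistent is sketched below, purely as a suggestion and not as what the tool does today. It assumes that names are lowercased, that words are joined with an underscore, and that singular forms come from a lookup table (suffix stripping would turn "leaves" into "leave"). The function name and the table contents are hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <sstream>
#include <string>

// Build an XML element name from an organ name such as "Petals" or "upper leaves".
std::string element_name(const std::string& organ) {
    // Illustrative plural-to-singular table; a real one would come from the lexicon.
    static const std::map<std::string, std::string> singular = {
        {"petals", "petal"}, {"sepals", "sepal"}, {"stems", "stem"},
        {"seeds", "seed"},   {"leaves", "leaf"},
    };
    std::istringstream in(organ);
    std::string word, name;
    while (in >> word) {
        for (char& c : word) c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        const auto it = singular.find(word);
        if (it != singular.end()) word = it->second;
        if (!name.empty()) name += '_';
        name += word;
    }
    return name;
}
```

Words that are not in the table are kept as they are, so an unknown plural simply stays plural rather than being damaged.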

Structure of the Flora

A species paragraph in the Flora of China is made up of 4 parts:

1. A carriage-return terminated paragraph containing the species identification
(genus, specific epithet, and publication reference), e.g.:

Epilobium fangii C. J. Chen, Hoch, & Raven, Syst. Bot. Monogr. 34: 152. 1992.

2. The description, e.g.:

Herbs perennial, erect. Stems 15--40 cm tall, strigillose with scattered
glandular hairs,

3. An <html:table> with property-value pairs (mainly habitat and ecological
information), e.g.:

Habitat                       Sunny, grassy places or sparse, mixed forests, often disturbed places,
                              long cultivated in S China but now only sporadically distributed in the wild
Elevation                     400-1100 m
Provinces                     Guangdong E Guangxi SW Hunan SE Yunnan
Countries                     ?Vietnam
Regions
Comment1
Comment2
Flora of China volume:page    4: 5

Or possibly just a line beginning with "* ", e.g.:

* Open places along stream banks, at bases of rock walls or scree slopes;
(1100--)1700--3500 m. Sichuan, Yunnan.

4. Possibly a key to sub-species, e.g. (Pinus tabuliformis):

1a. Seed cones ovoid-globose, 2.5-5 cm; apophyses slightly swollen; needles
slender, 7-12 cm × ca. 1 mm, pliant; 1st-year branchlets usually glaucous
........ 9d. var. henryi
1b. Seed cones ovoid, 4-9 cm; apophyses obviously swollen; needles stout, 6-15
cm × 1.2-1.5 mm, stiff; 1st-year branchlets not glaucous or glaucous only when
very young.
2a. Trunk monopodial only toward base, branched in middle part; crown
flabellate ........ 9c. var. umbraculifera
...
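Separating these parts could start from something like the sketch below, which labels already-extracted text paragraphs. It rests on several assumptions: the first paragraph is the identification, a habitat paragraph starts with "* ", and a key lead starts with digits, one lowercase letter, and a period (as in "1a."). The <html:table> variant of part 3 is not handled, and all names are hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

enum class Part { Identification, Description, Habitat, Key };

// True for key leads such as "1a. Seed cones ..." or "2b. ...".
bool is_key_lead(const std::string& p) {
    std::size_t i = 0;
    while (i < p.size() && std::isdigit(static_cast<unsigned char>(p[i]))) ++i;
    return i > 0 && i + 1 < p.size() &&
           std::islower(static_cast<unsigned char>(p[i])) && p[i + 1] == '.';
}

// Label each paragraph of one species entry, in order.
std::vector<Part> classify(const std::vector<std::string>& paragraphs) {
    std::vector<Part> kinds;
    for (std::size_t i = 0; i < paragraphs.size(); ++i) {
        const std::string& p = paragraphs[i];
        if (i == 0) {
            kinds.push_back(Part::Identification);
        } else if (p.compare(0, 2, "* ") == 0) {
            kinds.push_back(Part::Habitat);
        } else if (is_key_lead(p)) {
            kinds.push_back(Part::Key);
        } else {
            kinds.push_back(Part::Description);
        }
    }
    return kinds;
}
```

A description sentence that begins with a measurement, such as "1--4 X 0.5--1.5 cm", is not mistaken for a key lead, because the digits are not followed by a letter and a period.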

Rules followed by human writers

The following is an excerpt from the rules followed by human writers, taken from:
"Flora of China Guidelines for Contributors" downloadable from http://flora.harvard.edu/china/

12.5 Descriptions are to be in botanical English, which is mostly composed of nouns, adjectives, and conjunctions. For the Flora of China, descriptive botanical English does not contain verbs and has few articles (e.g. "the" is not used and "a" is used only when necessary).
12.6 Descriptions will follow the conventional order (i.e., habit, duration, sex, roots, stems, leaves, inflorescences, flowers, fruit, seeds). Each major part of a description will be in a separate sentence with semicolons used to separate subparts. At the beginning of each sentence and after each semicolon there must be a noun, and all the description until the end of the sentence or until a semicolon must refer back to that noun. Commas are used to separate the various components within the sentence. Note that a series with the use of "and" or "or" is treated as a single component (see paragraphs 12.4 and 12.12).
12.7 If two alternate states of a structure exist, they will be separated by the word "or," and when several alternate states exist, each state will be separated by a comma with the final state preceded by a comma followed by "or" (e.g., "petals white or pink" and "petals white, pink, or blue").
12.8 If a range of shapes is found in a structure the word "to" will be used (e.g., "leaf blade oblong to ovate"). If a structure is meant to be described as intermediate in shape rather than a range between two extremes, a dash "-" is used (e.g., "leaf blade lanceolate-ovate").
12.9 When characters are given in series, a comma will separate each component of the series and before the final "and" (e.g., "branchlets, petioles, and peduncles tomentose").
12.10 The general order that a structure should be described is as follows: color, shape, dimensions, texture, surface characteristics, base, margin, apex.
12.11 The following is the general order for describing specific structures:
Below ground parts: roots, underground stems
Stems: primary stems, trunks, bark, wood, branches, branchlets
Leaves: general arrangement, stipules, petiole, leaf blade, lobes, compound leaf axes, leaflets (segments in ferns), modified leaflets
Inflorescences: general, position, type, branches (i.e., description of axes), peduncle, bracts
Flowers: general features, pedicel, receptacle and hypanthium, calyx, corolla, corona, androecium (flowering), glands or disk, gynoecium (flowering)
Fruit: general, aggregation of or division within fruit, fruit or mericarp structure, accessory structures, multiple fruit structure
Seeds: external structures, germination, abortion, endosperm, megagametophytes, embryo
12.14 Terms such as above, back, below, beneath, bottom, front, lower, top, and upper should in general not be used because they are often ambiguous (but see the following two paragraphs). Rather, the terms adaxial, abaxial, apical, basal, proximal, or distal (or their adverbial forms adaxially, abaxially, apically, basally, proximally, or distally) should be used.
12.15 For zygomorphic flowers the terms upper and lower are to be used in describing the calyx and corolla lips (e.g., "upper lip" and "lower lip").
12.16 The terms apical or basal or upper and lower rather than proximal and distal should be used when describing structures on the main stem of a plant because proximal and distal are meaningless in this context. However, proximal and distal can be used when describing structures along a side branch because in this context distal means farther from the stem and proximal means closer to the stem.
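Rules 12.7 and 12.8 are regular enough to be turned into code. The sketch below is only one possible reading of them, with hypothetical names: it takes the state part of a component (for example "white, pink, or blue", after the organ name has been removed) and reports whether it is a single state, a list of alternatives, or a range. It assumes that " to " always marks a range and does not handle a range nested inside a list of alternatives.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class StateKind { Single, Alternatives, Range };

struct States {
    StateKind kind;
    std::vector<std::string> values;
};

// Replace every occurrence of `from` in `text` with `to`.
std::string replace_all(std::string text, const std::string& from, const std::string& to) {
    std::size_t pos = 0;
    while ((pos = text.find(from, pos)) != std::string::npos) {
        text.replace(pos, from.size(), to);
        pos += to.size();
    }
    return text;
}

States parse_states(const std::string& text) {
    // Rule 12.8: "to" separates the two extremes of a range.
    const std::size_t to_pos = text.find(" to ");
    if (to_pos != std::string::npos) {
        return {StateKind::Range, {text.substr(0, to_pos), text.substr(to_pos + 4)}};
    }
    // Rule 12.7: turn ", or " and " or " into plain commas, then split on commas.
    const std::string list = replace_all(replace_all(text, ", or ", ","), " or ", ",");
    std::vector<std::string> values;
    std::size_t start = 0;
    while (true) {
        const std::size_t comma = list.find(',', start);
        std::string item = list.substr(
            start, comma == std::string::npos ? std::string::npos : comma - start);
        if (!item.empty() && item.front() == ' ') item.erase(0, 1);  // space left by ", "
        values.push_back(item);
        if (comma == std::string::npos) break;
        start = comma + 1;
    }
    return {values.size() > 1 ? StateKind::Alternatives : StateKind::Single, values};
}
```

An intermediate shape such as "lanceolate-ovate" comes back as a single state, which matches the distinction rule 12.8 makes between a dash and "to". Note that this conflicts with a naive split of the sentence on commas, since rule 12.6 treats a series joined by "or" as one component.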