So the QUESTION was: What tools (free software preferably) could we
use?
It seems that classical parsing with lex & yacc can be used as
a first step, since the grammar is simple.
* Open places along stream banks, at bases of rock walls or scree slopes; (1100--)1700--3500 m. Sichuan, Yunnan.
This example was taken from:
"Flora of China, Guidelines for Contributors. November 1995" at http://flora.harvard.edu/china/
Other examples of botanical descriptions: Eucalyptus rudderi (http://www.environment.gov.au/life/species/flora/euc-descriptions/euc-rudderi.html) from the Flora of Australia; see also www.conifers.org.
Here is a schema diagram generated by Extensibility's XML Authority
using the same XML data file as above (see the corresponding schema
files). Of course, we should process a representative sample of species
to obtain a complete set of organs and characters. Here is a Web page
displaying this botanical description with several stylesheets.
These definitions are taken from the Flora of Australia GLOSSARY, compiled by A. McCusker. I have invented the hazelnut entry to show the synonym markup. We can parse this or another glossary to obtain such a formal vocabulary.
Then we will use this vocabulary to help parse a floristic text (see an example above). At the beginning of a nominal group (the part between 2 commas), the principal noun can be found as follows. Let f be the first word and s the second.

    if (type(f) == noun) {
        if (type(s) != noun) principal = f;
        else principal = f + "_" + s;
    } else principal = f + "_" + s;

So we can treat expressions like "leaf blade" or "floral tube" in the example above.
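The rule above can be sketched in Python as follows. This is only an illustration: the WORD_TYPES table and the function names word_type and principal_noun are assumptions made for the example; in practice the word types would come from the parsed glossary.

```python
# Hypothetical mini-vocabulary; the real one would be built from the glossary.
WORD_TYPES = {
    "leaf": "noun",
    "blade": "noun",
    "tube": "noun",
    "stems": "noun",
    "floral": "adjective",
    "erect": "adjective",
}


def word_type(word):
    """Return the vocabulary type of a word, or 'unknown' if it is not listed."""
    return WORD_TYPES.get(word.lower(), "unknown")


def principal_noun(group):
    """Return the principal noun of a nominal group, following the rule in the text."""
    words = group.split()
    if not words:
        return ""
    f = words[0]
    if len(words) == 1:
        return f
    s = words[1]
    # A noun followed by a non-noun stands alone; every other case joins f and s.
    if word_type(f) == "noun" and word_type(s) != "noun":
        return f
    return f + "_" + s
```

For instance, principal_noun("leaf blade oblong") yields "leaf_blade", while principal_noun("Stems erect") yields "Stems".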
Using the formal botanical vocabulary, we can also eliminate the non-preferred synonyms.
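A minimal sketch of this synonym elimination, assuming the glossary has already been turned into a table mapping each non-preferred term to its preferred one. The table content below (including the "filbert" entry) is invented for the illustration, as are the names PREFERRED and normalize_terms.

```python
# Hypothetical synonym table: non-preferred term -> preferred term.
# These pairs are invented for the example, not taken from a real glossary.
PREFERRED = {
    "filbert": "hazelnut",
    "foliage_leaf": "leaf",
}


def normalize_terms(terms):
    """Replace each non-preferred synonym by its preferred term; keep other terms as-is."""
    return [PREFERRED.get(term.lower(), term) for term in terms]
```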
A species description in the Flora of China is made of 4 parts (see below), the main one being the botanical description. As of today (1999-12-29), only the description part is handled. The key to sub-species will be done later. The 2 other parts can be isolated by HTML parsing, so the next step will be to use an XML parser (maybe IBM's XML4C) to separate the 4 parts. The abbreviations (e.g. "ca."=circa, &) are handled by lex.
1. A carriage-return terminated paragraph containing the species identification
(genus, species epithet, and publication reference), e.g.:
Epilobium fangii C. J. Chen, Hoch, & Raven, Syst. Bot. Monogr. 34: 152. 1992.
2. The description, e.g.:
Herbs perennial, erect. Stems 15--40 cm tall, strigillose with scattered glandular hairs,
3. An <html:table> with property-value pairs (mainly habitat and
ecological information), e.g.:
Habitat                       Sunny, grassy places or sparse, mixed forests, often disturbed places, long cultivated in S China but now only sporadically distributed in the wild
Elevation                     400-1100 m
Provinces                     Guangdong E Guangxi SW Hunan SE Yunnan
Countries                     ?Vietnam
Regions
Comment1
Comment2
Flora of China volume:page    4: 5
Or possibly just a line beginning with "* ", e.g.:
* Open places along stream banks, at bases of rock walls or scree slopes;
(1100--)1700--3500 m. Sichuan, Yunnan.
4. Possibly a key to sub-species, e.g. (Pinus tabuliformis):
1a. Seed cones ovoid-globose, 2.5-5 cm; apophyses slightly swollen; needles slender, 7-12 cm × ca. 1 mm, pliant; 1st-year branchlets usually glaucous ........ 9d. var. henryi
1b. Seed cones ovoid, 4-9 cm; apophyses obviously swollen; needles stout, 6-15 cm × 1.2-1.5 mm, stiff; 1st-year branchlets not glaucous or glaucous only when very young.
  2a. Trunk monopodial only toward base, branched in middle part; crown flabellate ........ 9c. var. umbraculifera
...
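As an illustration of how part 3 could be processed when it is just a line beginning with "* ", here is a rough Python sketch. It assumes the line always has the shape "habitat; elevation m. provinces." seen in the single example above; real data may well deviate from it. The names HABITAT_LINE and parse_habitat_line are hypothetical.

```python
import re

# Assumed layout: "* <habitat>; <elevation> m. <province>, <province>."
HABITAT_LINE = re.compile(
    r"^\*\s*(?P<habitat>.+);\s*(?P<elevation>[^;]+?\sm)\.\s*(?P<provinces>.+?)\.?\s*$"
)


def parse_habitat_line(line):
    """Split a '* ' line into habitat, elevation and provinces, or return None."""
    # Re-join a line that was wrapped over several physical lines.
    text = " ".join(line.split())
    match = HABITAT_LINE.match(text)
    if match is None:
        return None
    return {
        "habitat": match.group("habitat").strip(),
        "elevation": match.group("elevation").strip(),
        "provinces": [p.strip() for p in match.group("provinces").split(",")],
    }
```

The elevation is kept as a raw string such as "(1100--)1700--3500 m"; interpreting the parenthesized extreme value is left out of this sketch.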
12.5 Descriptions are to be in botanical English, which is mostly composed
of nouns, adjectives, and conjunctions. For the Flora of China, descriptive
botanical English does not contain verbs and has few articles (e.g. "the"
is not used and "a" is used only when necessary).
12.6 Descriptions will follow the conventional order (i.e., habit,
duration, sex, roots, stems, leaves, inflorescences, flowers, fruit, seeds).
Each major part of a description will be in a separate sentence with semicolons
used to separate subparts. At the beginning of each sentence and after
each semicolon there must be a noun, and all the description until the
end of the sentence or until a semicolon must refer back to that noun.
Commas are used to separate the various components within the sentence.
Note that a series with the use of "and" or "or" is treated as a single
component (see paragraphs 12.4 and 12.12).
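The sentence structure described in paragraph 12.6 suggests a three-level split: sentences, then semicolon-separated subparts, then comma-separated components. The Python sketch below is a naive illustration with a hypothetical function name. It assumes a sentence ends with a period followed by whitespace and a capital letter, and it does not treat a series with "and" or "or" as a single component, so "petals white, pink, or blue" would be wrongly cut into three components.

```python
import re


def split_description(text):
    """Split a description into sentences -> subparts -> components (nested lists)."""
    text = " ".join(text.split())
    # A period followed by whitespace and a capital letter is taken as a sentence end,
    # so that "ca. 1 mm" or "1.2-1.5 mm" do not cut a sentence.
    sentences = re.split(r"(?<=\.)\s+(?=[A-Z])", text)
    result = []
    for sentence in sentences:
        sentence = sentence.rstrip(".").strip()
        if not sentence:
            continue
        subparts = []
        for subpart in sentence.split(";"):
            components = [c.strip() for c in subpart.split(",") if c.strip()]
            if components:
                subparts.append(components)
        result.append(subparts)
    return result
```

The first word or words of each subpart can then be examined to find the noun to which the whole subpart refers.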
12.7 If two alternate states of a structure exist, they will be separated
by the word "or," and when several alternate states exist, each state will
be separated by a comma with the final state preceded by a comma followed
by "or" (e.g., "petals white or pink" and "petals white, pink, or blue").
12.8 If a range of shapes is found in a structure the word "to" will
be used (e.g., "leaf blade oblong to ovate"). If a structure is meant to
be described as intermediate in shape rather than a range between two extremes,
a dash "-" is used (e.g., "leaf blade lanceolate-ovate").
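The conventions of paragraphs 12.7 and 12.8 can be recognized mechanically once the noun has been removed from a component. The Python sketch below is only an illustration with a hypothetical function name; it assumes its input is a state expression made of shape or color words (a dimension such as "2.5-5 cm" would be misread as an intermediate state).

```python
import re


def classify_states(expr):
    """Return (kind, states) for a character-state expression."""
    expr = expr.strip()
    if re.search(r"\bor\b", expr):
        # "white or pink" and "white, pink, or blue" are alternate states.
        parts = re.split(r",\s*or\s+|\s+or\s+|,\s*", expr)
        return ("alternatives", [p for p in parts if p])
    if " to " in expr:
        # "oblong to ovate" is a range between two extremes.
        low, high = expr.split(" to ", 1)
        return ("range", [low.strip(), high.strip()])
    if "-" in expr:
        # "lanceolate-ovate" is an intermediate state.
        return ("intermediate", expr.split("-"))
    return ("single", [expr])
```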
12.9 When characters are given in series, a comma will separate each
component of the series and before the final "and" (e.g., "branchlets,
petioles, and peduncles tomentose").
12.10 The general order that a structure should be described is as
follows: color, shape, dimensions, texture, surface characteristics, base,
margin, apex.
12.11 The following is the general order for describing specific structures:
Below ground parts: roots, underground stems
Stems: primary stems, trunks, bark, wood, branches, branchlets
Leaves: general arrangement, stipules, petiole, leaf blade, lobes,
compound leaf axes, leaflets (segments in ferns), modified leaflets
Inflorescences: general, position, type, branches (i.e., description
of axes), peduncle, bracts
Flowers: general features, pedicel, receptacle and hypanthium, calyx,
corolla, corona, androecium (flowering), glands or disk, gynoecium (flowering)
Fruit: general, aggregation of or division within fruit, fruit or mericarp
structure, accessory structures, multiple fruit structure
Seeds: external structures, germination, abortion, endosperm, megagametophytes,
embryo
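The conventional order of the major parts (paragraphs 12.6 and 12.11) lends itself to a simple consistency check on the leading nouns of the sentences of a description. The sketch below is an assumption-laden illustration: the MAJOR_ORDER list only contains the organ names from paragraph 12.6, nouns outside that list are ignored, and the function name is hypothetical.

```python
# Organ names in the conventional order given in paragraph 12.6.
MAJOR_ORDER = ["roots", "stems", "leaves", "inflorescences", "flowers", "fruit", "seeds"]


def is_in_conventional_order(leading_nouns):
    """Check that the known leading nouns appear in the conventional order."""
    ranks = [
        MAJOR_ORDER.index(noun.lower())
        for noun in leading_nouns
        if noun.lower() in MAJOR_ORDER
    ]
    return ranks == sorted(ranks)
```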
12.14 Terms such as above, back, below, beneath, bottom, front, lower,
top, and upper should in general not be used because they are often ambiguous
(but see the following two paragraphs). Rather, the terms adaxial, abaxial,
apical, basal, proximal, or distal (or their adverbial forms adaxially,
abaxially, apically, basally, proximally, or distally) should be used.
12.15 For zygomorphic flowers the terms upper and lower are to be used
in describing the calyx and corolla lips (e.g., "upper lip" and "lower
lip").
12.16 The terms apical or basal or upper and lower rather than proximal
and distal should be used when describing structures on the main stem of
a plant because proximal and distal are meaningless in this context. However,
proximal and distal can be used when describing structures along a side
branch because in this context distal means farther from the stem and proximal
means closer to the stem.
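Paragraphs 12.14 to 12.16 can be turned into a small checking aid that lists the discouraged positional terms found in a description. This Python sketch uses hypothetical names and only knows the "upper lip" and "lower lip" exception of paragraph 12.15; it cannot tell whether a term describes a structure on the main stem (paragraph 12.16), so its output is a list of candidates for human review rather than a list of errors.

```python
import re

# Terms that paragraph 12.14 says should in general not be used.
AMBIGUOUS = {"above", "back", "below", "beneath", "bottom", "front", "lower", "top", "upper"}
# Phrases explicitly allowed by paragraph 12.15.
ALLOWED_PHRASES = ("upper lip", "lower lip")


def find_ambiguous_terms(text):
    """Return the discouraged positional terms found in text, in order of appearance."""
    lowered = text.lower()
    for phrase in ALLOWED_PHRASES:
        lowered = lowered.replace(phrase, " ")
    return [word for word in re.findall(r"[a-z]+", lowered) if word in AMBIGUOUS]
```

On the invented phrase "Leaves green above, paler beneath; upper lip 3-lobed", the function reports "above" and "beneath" but not "upper".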