While working on the new XML markup, I use WordNet to classify the nouns that are used as XML tags. A few results surprised me, like this one:
1 sense of perianth
Sense 1
perianth, floral envelope
   => covering, natural covering, cover
       => natural object
           => object, physical object
               => entity
I would have expected perianth to be a hyponym of "plant organ", like petal and most similar terms. So I want to ask the WordNet team: why is there no "multiple inheritance" in WordNet? Perianth would be BOTH a covering and a plant organ. Bract, perianth and indusium have exactly the same hypernym chain. I will use the "natural object" category for them.
There is another category, including corona, pith, stele, lobe and vein, that WordNet considers "body part", distinct from "plant organ". I'm not competent to discuss this nuance, but again I suggest using multiple inheritance.
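To make the "multiple inheritance" suggestion concrete, here is a minimal Python sketch of a hypernym graph in which a word may have several direct hypernyms. The graph is a hand-written toy, not a dump of WordNet: the link from perianth to "plant organ" is the hypothetical addition discussed above, and the names HYPERNYMS, ancestors and is_a are invented for this illustration.

```python
# Toy hypernym graph: each word maps to a LIST of direct hypernyms,
# so a word can have more than one parent (a DAG instead of a tree).
# Simplified illustration only; the perianth -> "plant organ" link is
# the hypothetical extra parent suggested in the text.
HYPERNYMS = {
    "perianth": ["covering", "plant organ"],
    "covering": ["natural object"],
    "plant organ": ["plant part"],
    "plant part": ["natural object"],
    "natural object": ["object"],
    "object": ["entity"],
    "entity": [],
}


def ancestors(word, graph=HYPERNYMS):
    """Return the set of all hypernyms reachable from word."""
    seen = set()
    stack = list(graph.get(word, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen


def is_a(word, category, graph=HYPERNYMS):
    """True if category is a direct or indirect hypernym of word."""
    return category in ancestors(word, graph)


print(is_a("perianth", "covering"))
print(is_a("perianth", "plant organ"))
```

With such a structure, a tagger could ask is_a(word, "plant organ") without caring which of the parents leads there.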
Pollen is not even recorded as something alive:
Sense 1
pollen
   => powder
       => solid
           => substance, matter
               => object, physical object
                   => entity
This is enough for this release of FloraParse. Next time I might also categorize words inside phrases (and not just at the beginning).
Now I have to put plant names into the resulting XML, which will be done by an XSLT transform.
The meeting from the 13th to the 15th (Thursday to Friday)
TDWG working group: <http://www.tdwg.org>
Structure of Descriptive Data (SDD)
http://160.45.63.11/Projects/TDWG-SDD/index.html
is very important for WWBKB.
Ideally, by tomorrow evening (so that the foreign visitors can see a bit of a demo before leaving for Paris), we should have the following features:
The current XML markup, e.g.:
<leaf>
  <t:f>alternate</t:f>
  <t:f>simple</t:f>
  <t:f>without stipules</t:f>
  <t:f>petiolate</t:f>
</leaf>
doesn't keep track of the exact word form of the plant organ (here, "Leaves").
So I propose this instead:
<leaf> <t:n>Leaves</t:n> <t:f>
alternate</t:f> ... </leaf>
This will also make XSLT stylesheets easier to write.
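As a rough sketch of how the proposed markup could be generated, the Python snippet below builds the organ element with its original word form in t:n followed by the t:f features. The namespace URI bound to the t: prefix and the helper name organ_element are assumptions made for this example; the text does not say what the prefix is bound to.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI for the t: prefix (not given in the text).
T_NS = "http://example.org/floraparse/t"
ET.register_namespace("t", T_NS)


def organ_element(tag, word_form, features):
    """Build <tag><t:n>word_form</t:n><t:f>...</t:f>...</tag>."""
    organ = ET.Element(tag)
    noun = ET.SubElement(organ, "{%s}n" % T_NS)
    noun.text = word_form          # exact word form found in the text
    for feature in features:
        feat = ET.SubElement(organ, "{%s}f" % T_NS)
        feat.text = feature
    return organ


leaf = organ_element(
    "leaf", "Leaves",
    ["alternate", "simple", "without stipules", "petiolate"],
)
xml_text = ET.tostring(leaf, encoding="unicode")
print(xml_text)
```

Because the word form always sits in the first child, a stylesheet can print it back with a single rule instead of guessing it from the tag name.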
Nouns that are hyponyms ("kinds of") of "taxon" must not be tagged as ordinary features. The same applies to "geographical area", or rather the more general "region". Hyponyms of "plant part" are the main nouns that have to become tags. Are there others? So I propose these new tags:
<t:t>Lycopodium digitatum</t:t>
<t:g>Mongolia</t:g>
Words that are both adjectives and nouns, e.g. bisexual in "bisexual flower", must be treated as adjectives.
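The tagging rules above can be summed up in a small decision function. This is only a sketch under assumptions: the hypernym chain is passed in as a plain list of category names, the adjective test is a boolean supplied by the caller, words treated as adjectives are assumed to end up as ordinary t:f features, and the function name and the "organ" return value are invented for the illustration.

```python
def choose_markup(hypernym_chain, also_adjective=False):
    """Pick the kind of markup for a noun from its hypernym chain.

    Returns "t:t" for taxa, "t:g" for regions, "organ" when the noun
    itself should become a tag (hyponym of "plant part"), and "t:f"
    (ordinary feature) otherwise.
    """
    if also_adjective:
        # e.g. bisexual in "bisexual flower": treat as an adjective,
        # assumed here to be marked up as an ordinary feature.
        return "t:f"
    if "taxon" in hypernym_chain:
        return "t:t"
    if "region" in hypernym_chain:
        return "t:g"
    if "plant part" in hypernym_chain:
        return "organ"
    return "t:f"


# Illustrative chains only, not actual WordNet output.
print(choose_markup(["plant organ", "plant part", "natural object"]))
print(choose_markup(["country", "region", "location"]))
print(choose_markup(["person", "organism"], also_adjective=True))
```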
To achieve this I have to use WordNet's Synset structure and the findtheinfo_ds(word, NOUN, HYPERPTR, ALLSENSES) function. I wrote a small C++ program to explore the WordNet Synset C structure. To see the same information easily, use WordNet online. Here is a sample output of the program under the gdb debugger. Here is the output of a more elaborate test program for WordNet showing how to navigate through the hypernym hierarchy:
./a.out
explore "leaf" senses
defn=(the main organ of photosynthesis and transpiration in higher plants)
wcount=3
words[i]=leaf
words[i]=leafage
words[i]=foliage
defn=(a sheet of any written or printed material (especially in a manuscript or book))
wcount=2
words[i]=leaf
words[i]=folio
defn=(hinged or detachable flat section (as of a table or door))
wcount=1
words[i]=leaf
I prepared the list of words unknown to WordNet:
grep '<wn' flora_text.xml | \
sed -e 's/^.*<wn missing="//; s/".*$//' | \
sort -u > wordnet-unknown-words.txt
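For comparison, the same extraction can be sketched in Python. It assumes, as the sed expression does, that unknown words appear as the value of a missing attribute on a wn element; the sample lines and the placeholder words in them are made up. Unlike the sed version, this sketch collects every match on a line, not only the last one.

```python
import re

# Matches the value of the missing attribute of a <wn ...> element.
MISSING_RE = re.compile(r'<wn missing="([^"]*)"')


def unknown_words(lines):
    """Return the sorted, de-duplicated words flagged as missing."""
    words = set()
    for line in lines:
        words.update(MISSING_RE.findall(line))
    return sorted(words)


# Made-up sample lines with placeholder words.
sample = [
    '<leaf> <wn missing="wordb"/> <t:f>simple</t:f> </leaf>',
    '<stem> <wn missing="worda"/> <wn missing="wordb"/> </stem>',
    '<flower> <t:f>bisexual</t:f> </flower>',
]
print("\n".join(unknown_words(sample)))
```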
Then I re-verified the word list against WordNet with this Perl script:
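# test-wn-word-absent.pl
# Prints "1 word" when the word looks absent from WordNet, "0 word" otherwise.
use strict;
use warnings;

my $file = $ARGV[0];
open(my $fh, '<', $file) or die "Cannot open $file: $!";
while ( my $word = <$fh> ) {
    chomp $word;
    my $absent = absent($word);
    print "$absent $word\n";
}
close $fh;

# The script treats an 8-line `wn` output as meaning the word is unknown.
sub absent {
    my ($word) = @_;
    my $res = `wn "$word" | wc`;    # wc prints the line count first
    return ( $res =~ m/^\s*8\s/ ) ? 1 : 0;
}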
Then I re-read everything to get this list of 118 botanical words unknown to WordNet. While doing this, I discovered many misspellings in the original Flora Of China data, and I wrote a sed command file to correct all the Flora Of China data.
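The correction step can be pictured as a table of whole-word replacements, which is roughly what a sed command file of substitutions does. The Python sketch below illustrates the idea; the two misspellings in the table are invented examples, not entries from the actual command file.

```python
import re

# Invented examples of misspelling -> correction pairs.
CORRECTIONS = {
    "petoliate": "petiolate",
    "stipuls": "stipules",
}


def apply_corrections(text, corrections=CORRECTIONS):
    """Replace each listed misspelling, matching whole words only."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in corrections) + r")\b"
    )
    return pattern.sub(lambda m: corrections[m.group(1)], text)


print(apply_corrections("Leaves petoliate, without stipuls"))
```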