2003-03-06

More than 150 hits on the Flora of China search engine, including someone from the Flora of China project at Harvard.

2003-02-16

I made Query.java quite generic by adding a setXPathPrefix(String s) method (maybe this belongs in the Database class). At least the XPath prefix is no longer hard-coded in the Java; the sprocess.xsp page calls setXPathPrefix(String s).
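As a rough illustration of the idea (written in Python rather than the project's Java), a query object with a configurable XPath prefix might look like the sketch below. Only the method name mirrors setXPathPrefix; the class internals, the find method, the default prefix and the sample data are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


class Query:
    """Toy query object whose XPath prefix is set by the caller."""

    def __init__(self, root):
        self._root = root
        self._xpath_prefix = "."  # assumed default: search from the root element

    def set_xpath_prefix(self, prefix):
        # The caller (a page such as sprocess.xsp in the real project)
        # supplies the prefix, so nothing dataset-specific lives here.
        self._xpath_prefix = prefix.rstrip("/") or "."

    def find(self, relative_path):
        # Hypothetical helper: evaluate a path relative to the prefix.
        return self._root.findall(f"{self._xpath_prefix}/{relative_path}")


# Made-up sample data, not the real FOC dataset.
SAMPLE = "<flora><species><leaf>alternate</leaf></species></flora>"

query = Query(ET.fromstring(SAMPLE))
before = query.find("leaf")        # nothing directly under <flora>
query.set_xpath_prefix("./species")
after = query.find("leaf")         # now resolved under <species>
print(len(before), len(after))
```

Switching to another dataset layout then only requires a different prefix, not a change in the query code.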

TASKS TO DO:

I have to publish the Open Identification API on sourceforge.net, together with some UML diagrams.

Other "genericity" tasks:
The next version of the FOC dataset:
http://wwbota.free.fr/project/data/flora_text2.xml.zip
has to be put into the WWBKB search & identification engine. This entails modifying the species_classic.xslt stylesheet.

2003-02-11

While I do the new XML markup, I use WordNet to classify noun categories that are used as XML tags. I'm surprised by a few results, like this one:

1 sense of perianth
Sense 1
perianth, floral envelope
   => covering, natural covering, cover
      => natural object
         => object, physical object
            => entity

I would have expected perianth to be a hyponym of "plant organ", like petal and most of the others. So I want to ask the WordNet team: why is there no "multiple inheritance" in WordNet? Perianth would be BOTH a covering and a plant organ. Bract, perianth and indusium have exactly the same hypernym chain. I will use the "natural object" category for them.

There is another category, including corona, pith, stele, lobe and vein, that WordNet considers to be "body part", distinct from "plant organ". I'm not competent to discuss this nuance, but again I suggest using multiple inheritance.
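To make the "multiple inheritance" suggestion concrete, here is a minimal sketch in which a term may have several hypernyms, so the hierarchy is a graph rather than a single chain. The table is toy data written for this example (the second parent of "perianth" is exactly the hypothetical addition discussed above), not an extract from WordNet, and the function names are invented.

```python
# Toy hypernym table: each term maps to a list of direct hypernyms.
# Assumption: "plant organ" as a second parent of "perianth" is the
# proposed addition, not something WordNet actually records.
TOY_HYPERNYMS = {
    "perianth": ["covering", "plant organ"],
    "covering": ["natural object"],
    "plant organ": ["natural object"],
    "natural object": ["object"],
    "object": ["entity"],
    "entity": [],
}


def hypernym_chains(term, graph):
    """Return every chain from term up to a root, one list per path."""
    parents = graph.get(term, [])
    if not parents:
        return [[term]]
    chains = []
    for parent in parents:
        for chain in hypernym_chains(parent, graph):
            chains.append([term] + chain)
    return chains


def is_a(term, category, graph):
    """True if category appears above term on at least one chain."""
    return any(category in chain[1:] for chain in hypernym_chains(term, graph))


chains = hypernym_chains("perianth", TOY_HYPERNYMS)
for chain in chains:
    print(" => ".join(chain))
```

With two parents, perianth is reachable both as a covering and as a plant organ, so a tagger can test either category without choosing one chain over the other.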

Pollen is not even recorded as something alive:

Sense 1
pollen
   => powder
      => solid
         => substance, matter
            => object, physical object
               => entity

It is enough for this release of FloraParse. Next time I might also categorize words inside phrases (and not just at the beginning).

Now I have to put plant names into the resulting XML, which will be done by an XSLT transform.

2003-02-10

The meeting from the 13th to the 15th (Thursday to Friday)

 TDWG working group: <http://www.tdwg.org>
 Structure of Descriptive Data (SDD)

http://160.45.63.11/Projects/TDWG-SDD/index.html

is very important for WWBKB.

Ideally, the following features should be ready by tomorrow evening (so that the foreign visitors can see a bit of a demo before leaving for Paris):

2003-02-09

New XML markup

The current XML markup, e.g.:

<leaf> <t:f> alternate</t:f> <t:f> simple</t:f> <t:f> without stipules</t:f> <t:f> petiolate</t:f> </leaf>

doesn't keep track of the exact word form of the plant organ (e.g. here "Leaves").

So I propose this instead:

<leaf> <t:n>Leaves</t:n> <t:f> alternate</t:f> ... </leaf>

This will also make XSLT stylesheets easier to write.
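As a hedged illustration of why the explicit noun helps (in Python rather than XSLT), the sketch below renders an organ element back to text using the recorded word form, and falls back to the tag name when no noun is present. The namespace URI bound to the "t" prefix, the wrapping description element and the function name are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Assumption: this namespace URI is invented; the real one is not given here.
T_NS = "http://example.org/t"

SAMPLE = f"""
<description xmlns:t="{T_NS}">
  <leaf> <t:n>Leaves</t:n> <t:f> alternate</t:f> <t:f> simple</t:f> </leaf>
</description>
"""


def render_organ(organ):
    """Return text such as 'Leaves: alternate, simple' for one organ element."""
    # Use the exact word form if present, otherwise the element name.
    noun = organ.findtext(f"{{{T_NS}}}n", default=organ.tag)
    features = [(f.text or "").strip() for f in organ.findall(f"{{{T_NS}}}f")]
    return f"{noun.strip()}: {', '.join(features)}"


root = ET.fromstring(SAMPLE)
lines = [render_organ(organ) for organ in root]
print(lines)
```

Without the noun element, the renderer could only print the tag name ("leaf"), losing the plural form used in the original description.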

Nouns that are hyponyms ("kinds of") of "taxon" must not be tagged as ordinary features. The same applies to "geographical area", or rather the more general "region". Hyponyms of "plant part" are the main nouns that have to become tags. Are there others? So I propose these new tags:

<t:t>Lycopodium digitatum</t:t>

<t:g>Mongolia</t:g>

Adjectives that are also nouns, e.g. "bisexual" in "bisexual flower", must be treated as adjectives.
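The tagging rules above can be summarized in a small decision function. This is only a sketch: the function name, the "organ" return value and the example hypernym sets are invented for illustration, and in practice the hypernyms would come from a WordNet lookup rather than from hand-written sets.

```python
def choose_tag(hypernyms, used_as_modifier=False):
    """Pick the markup for a word from the hypernyms of its noun senses.

    hypernyms: set of category names found above the word (assumed input).
    used_as_modifier: True when the word modifies a noun, as "bisexual"
    does in "bisexual flower".
    """
    if used_as_modifier:
        return "t:f"    # adjective use wins over any noun sense
    if "taxon" in hypernyms:
        return "t:t"    # e.g. <t:t>Lycopodium digitatum</t:t>
    if "region" in hypernyms:
        return "t:g"    # e.g. <t:g>Mongolia</t:g>
    if "plant part" in hypernyms:
        return "organ"  # the noun becomes its own element, e.g. <leaf>
    return "t:f"        # everything else stays an ordinary feature


# Hypothetical hypernym sets, written by hand for the example.
print(choose_tag({"taxon"}))
print(choose_tag({"region", "location"}))
print(choose_tag({"plant organ", "plant part"}))
print(choose_tag({"person"}, used_as_modifier=True))
```

Checking the modifier case first is a design choice of this sketch: it guarantees that a word with both adjective and noun readings is never promoted to a tag when it is used as an adjective.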

To achieve this I have to use WordNet's Synset structure and the findtheinfo_ds(word, NOUN, HYPERPTR, ALLSENSES) function. I wrote a small C++ program to explore the WordNet Synset C structure. To see the same information easily, use WordNet online. Here is a sample output of the program under the gdb debugger. Here is the output of a more elaborate test program for WordNet, showing how to navigate through the hypernym hierarchy:

./a.out

explore "leaf" senses

defn=(the main organ of photosynthesis and transpiration in higher plants)
wcount=3
words[i]=leaf
words[i]=leafage
words[i]=foliage
defn=(a sheet of any written or printed material (especially in a manuscript or book))
wcount=2
words[i]=leaf
words[i]=folio
defn=(hinged or detachable flat section (as of a table or door))
wcount=1
words[i]=leaf


Unknown words in WordNet

I prepared the list of words unknown to WordNet:

grep '<wn' flora_text.xml | \
sed -e 's/^.*<wn missing="//; s/".*$//' | sort -u > wordnet-unknown-words.txt
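The pipeline above extracts at most one word per input line. As an alternative sketch, the same list can be built in Python with a regular expression that picks up every missing="..." attribute, even when several appear on one line. The function name and the sample markup (in particular the self-closing form of the wn element) are assumptions; only the `<wn missing="` prefix comes from the data.

```python
import re

# Matches the attribute value in <wn missing="...">.
MISSING = re.compile(r'<wn missing="([^"]*)"')


def unknown_words(xml_text):
    """Return the sorted, de-duplicated words flagged as missing."""
    return sorted(set(MISSING.findall(xml_text)))


# Made-up sample, not taken from flora_text.xml.
SAMPLE = (
    '<leaf><wn missing="petiolate"/> <wn missing="stipules"/></leaf>\n'
    '<stem><wn missing="petiolate"/></stem>\n'
)

words = unknown_words(SAMPLE)
print("\n".join(words))
```

To run it on the real file, read flora_text.xml into a string and write the returned words to wordnet-unknown-words.txt, one per line.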

Then I re-verified the word list against WordNet with this Perl script:

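# test-wn-word-absent.pl
# Usage: perl test-wn-word-absent.pl wordnet-unknown-words.txt
# Prints "1 word" if WordNet does not know the word, "0 word" otherwise.
use strict;
use warnings;

my $file = $ARGV[0];
open my $fh, '<', $file or die "Cannot open $file: $!";
while ( my $word = <$fh> ) {
    chomp $word;
    next if $word eq '';
    my $absent = absent($word);
    print "$absent $word\n";
}
close $fh;

# Returns 1 if "wn" prints exactly 8 lines for the word, which this script
# takes to mean that WordNet has no entry for it; returns 0 otherwise.
sub absent {
    my ($word) = @_;
    my $res = `wn "$word" | wc`;
    return ( $res =~ m/^\s*8\s/ ) ? 1 : 0;
}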

Then I re-read everything to get this list of 118 botanical words unknown to WordNet. While doing this, I discovered many misspellings in the original Flora of China data, and I wrote a sed command file to fix all the Flora of China data.