WWBKB data preparation

Data preparation
Worldwide Botanical Knowledge Base project
J. M. Vanel

The data

The data provided by the Flora of China project came as 2 large MDB files: flora.mdb and general.mdb. Here are the SQL data descriptions: general.sql, flora.sql. The "content" field of the flora_text table in flora.mdb contains the taxon descriptions. The table holds other kinds of content for each taxon as well, depending on the value of the category_id field. The category_id field, which appears in many different tables, is described by the fields name and table of the category table in general.mdb. Here is an XML version of the category table: category.xml.

Data in the flora_text table in file flora.mdb:

XPath	Field name	Description
/table/tr/td[1]	taxon_id	key for the taxon
/table/tr/td[2]	publication_id	key for a publication (article, book, ...)
/table/tr/td[3]	category_id	specifies the type of content, 12021 for botanical description
/table/tr/td[4]	content	the botanical description that gets XML-ized
	note
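
To make the role of category_id concrete, here is a minimal sketch that pulls the botanical descriptions (category 12021) out of an XML export of flora_text. It assumes that each row is a <record> element with one child element per line, named after the fields above; that layout and the function name botanical_descriptions are assumptions, not something fixed by the database.

```shell
# botanical_descriptions FILE: print "taxon_id<TAB>content" for every record
# whose category_id is 12021. (Hypothetical helper; text-based, assumes one
# element per line and strips any markup found inside the content.)
botanical_descriptions() {
    awk '
        function text(line) {
            gsub(/<[^>]*>/, "", line)           # drop the tags
            gsub(/^[ \t]+|[ \t]+$/, "", line)   # trim surrounding whitespace
            return line
        }
        /<record>/      { id = ""; category = ""; body = "" }
        /<taxon_id>/    { id = text($0) }
        /<category_id>/ { category = text($0) }
        /<content>/     { body = text($0) }
        /<\/record>/    { if (category == "12021") print id "\t" body }
    ' "$1"
}

# Usage: botanical_descriptions flora_text2.xml > descriptions.tsv
```

A real XML tool would be safer for descriptions that span several lines; the awk version is only meant to show which fields matter.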

For a given taxon, not all the categories are populated. In fact, only the following categories are used anywhere in the data:
                <category_id>12004</category_id>
                <category_id>12006</category_id>
                <category_id>12007</category_id>
                <category_id>12008</category_id>
                <category_id>12011</category_id>
                <category_id>12021</category_id>
                <category_id>12023</category_id>
                <category_id>12024</category_id>
                <category_id>12025</category_id>
                <category_id>12027</category_id>
                <category_id>12041</category_id>
                <category_id>12043</category_id>
                <category_id>12051</category_id>
                <category_id>12053</category_id>
                <category_id>12055</category_id>
                <category_id>12057</category_id>
                <category_id>12061</category_id>
                <category_id>12081</category_id>
                <category_id>12091</category_id>
                <category_id>12093</category_id>
                <category_id>12094</category_id>

I got this list with the shell command:

grep "<category_id>" flora_text2.xml | sort | uniq
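
A small variation on this command also reports how many records use each category. This is only a sketch: it assumes category_id values are numeric, and the function name count_categories is illustrative.

```shell
# count_categories FILE: print each distinct category_id followed by its
# number of occurrences, most frequent first. (Hypothetical helper name.)
count_categories() {
    grep -o '<category_id>[0-9]*</category_id>' "$1" |
        sed 's/<[^>]*>//g' |          # keep only the number
        sort | uniq -c | sort -rn |   # count, then order by count
        awk '{ print $2, $1 }'        # print "category count"
}

# Usage: count_categories flora_text2.xml
```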

Download the latest XML file: flora_text2.xml.zip.

Getting XML from various sources

I first used a small Java JDBC program, JDBC2XML.java, to get clean XHTML out of the Flora of China Access database.

Now I use mdb-export, from MDB Tools, to get the database in CSV format:

mdb-export general.mdb category > category.csv

followed by the Perl module XML::CSV.
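
To export every table rather than just category, the mdb-export call above can be wrapped in a loop over the output of mdb-tables (also part of MDB Tools). A minimal sketch, in which the function name export_all_tables and the output directory are assumptions:

```shell
# export_all_tables DB OUTDIR: write one CSV file per table of the Access
# database DB into OUTDIR. (Hypothetical helper name.)
export_all_tables() {
    db=$1
    outdir=$2
    mkdir -p "$outdir"
    # mdb-tables -1 prints one table name per line.
    mdb-tables -1 "$db" | while IFS= read -r table; do
        [ -n "$table" ] || continue
        mdb-export "$db" "$table" > "$outdir/$table.csv"
    done
}

# Usage: export_all_tables general.mdb csv
```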

Here is an example of a script using Perl module XML::CSV:

# csv2xml.pl: convert flora_text2.csv into flora_text2.xml
use strict;
use warnings;
use Text::CSV_XS;
use XML::CSV;

# Configure the CSV parser:
my $csv_xs = Text::CSV_XS->new({
    quote_char  => '"',
    escape_char => '"',
    sep_char    => ',',
    binary      => 1,    # accept non-ASCII characters and embedded newlines
});
my $csv_obj = XML::CSV->new({ csv_xs => $csv_xs });

my $basename = "flora_text2";
# headings => 1: the first CSV row supplies the XML tag names
$csv_obj->parse_doc("$basename.csv", { headings => 1 });
$csv_obj->print_xml("$basename.xml");

Natural language processing

FloraParse is our natural language parser. It parses classical Floras and generates XML markup. It integrates the WordNet C library. WordNet is a semantic dictionary from Princeton University, accessible through a Web interface, a GUI, and a C library. FloraParse is written in C++ with Lex and Yacc. It transforms natural language descriptions into an XML format in which pieces of information are marked as organ, sub-organ, geography, etc.
Sample of the current output of FloraParse - Here is a small XML sample with its stylesheet displaying it as colored text.

FloraParse records specific WordNet hypernyms: "plant organ" (petal and most plant organs), "natural object", and a further category, including corona, pith, stele, lobe and vein, that WordNet classifies as "body part".
FloraParse could be generalized into a quite generic XMLizer for punctuation-delimited texts.

General information about the WWBKB Natural Language processing (see also WWBKB diary of 2003-02-11)
The FloraParse source code is in CVS at SourceForge.net: CVS web interface at SourceForge. Link to the project page on SourceForge:

The latest version of the XML markup of the Flora of China is in this directory.

Generate organ list

I do this tag list processing outside the XML database, because it is easier. If I had several XML files, I would concatenate all of them (or just some typical XML instances) into one file and put an opening and a closing tag around the result. It could be done more elegantly with XML external parsed entities. Here is a way of doing the concatenation in a Unix shell (use Cygwin if on Windows: http://sources.redhat.com/cygwin/):

echo '<root>' > ../data.xml
list=`find . ! -type d -name '*.xml'`
cat $list >> ../data.xml
echo '</root>' >> ../data.xml

# obtaining data.xml in parent directory, and then do:

cd ..
saxon7 data.xml example2Schema.xslt > schema.xml
saxon7 schema.xml XMLSchema2query.xslt > XPath-helper.html

after having downloaded example2Schema.xslt, XMLSchema2query.xslt, unique.xslt, XPath-helper.js from:
http://wwbota.free.fr/XSLT_models/
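
One caveat about the concatenation step at the top of the listing above: if a source file starts with an XML declaration (<?xml ... ?>), the concatenated file is no longer well-formed, because the declaration is only allowed at the very beginning of a document. The following sketch strips it while concatenating and also copes with file names containing spaces. The function name concat_xml is an assumption, and a DOCTYPE line, if present, would still need to be removed by hand.

```shell
# concat_xml DIR OUT: wrap every *.xml file found under DIR in a single
# <root> element written to OUT. (Hypothetical helper; OUT should live
# outside DIR so that find does not pick it up.)
concat_xml() {
    dir=$1
    out=$2
    echo '<root>' > "$out"
    # sed runs once per file (-exec ... \;) so that "line 1" means the first
    # line of each file; it deletes an XML declaration found there.
    find "$dir" -type f -name '*.xml' \
        -exec sed '1s/^<?xml[[:space:]][^>]*?>//' {} \; >> "$out"
    echo '</root>' >> "$out"
}

# Usage, from the directory holding the XML files:
#   concat_xml . ../data.xml
```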

saxon7 is the script I use for XSLT transformations:
java -Xmx110m -classpath /home/jmv/install/saxon7/saxon7.jar net.sf.saxon.Transform "$@"

SAXON XSLT processor is here: http://saxon.sourceforge.net/
SAXON is 100% Java, efficient, and always in step with new W3C specifications.

XPath-helper.html now contains an alphabetical list of all tags in all the original XML files, whatever the XML nesting level.
Then I manually edit XPath-helper.html to remove irrelevant or rare tags, and the duplicated root tag. Finally I manually insert the relevant part into the squery.xsp page in the Cocoon search engine application.
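
As a quick cross-check of the tag list, a rough approximation can be obtained without XSLT. This is a sketch, not the method used above: it treats the file as plain text, so tags appearing inside comments or CDATA sections would be listed too, and the function name list_tags is illustrative.

```shell
# list_tags FILE: print the distinct element names used in FILE, sorted
# alphabetically. (Hypothetical helper; text-based, not a real XML parser.)
list_tags() {
    # Opening tags start with "<" followed by a name character; closing
    # tags ("</...") and processing instructions ("<?...") do not match.
    grep -o '<[A-Za-z_][A-Za-z0-9_.:-]*' "$1" | tr -d '<' | sort -u
}

# Usage: list_tags data.xml
```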

Adding rubrics from other tables

scientific names, ???, and upper taxon links
add-rubrics.xslt

In table hu_card we have this:
<record>
        <card_no>44</card_no>
        <family_id>10241</family_id>
        <genus_id>108817</genus_id>
        <card_taxon>Cycas rumphii</card_taxon>

In table taxon we have this:
<record>
        <taxon_id>200005231</taxon_id>
        <name>Cycas rumphii</name>
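
The two excerpts share the scientific name, which suggests that hu_card and taxon could be linked on card_taxon = name. That link is an assumption drawn from the excerpts only, and it is not necessarily what add-rubrics.xslt does. A minimal sketch of such a join, with hypothetical function names and assuming one element per line inside complete <record> elements:

```shell
# xml_pairs FILE KEYTAG VALTAG: for each <record> in FILE, print
# "key<TAB>value" taken from the two named child elements.
xml_pairs() {
    awk -v k="<$2>" -v v="<$3>" '
        function text(line) {
            gsub(/<[^>]*>/, "", line)           # drop the tags
            gsub(/^[ \t]+|[ \t]+$/, "", line)   # trim surrounding whitespace
            return line
        }
        /<record>/   { key = ""; val = "" }
        index($0, k) { key = text($0) }
        index($0, v) { val = text($0) }
        /<\/record>/ { if (key != "") print key "\t" val }
    ' "$1"
}

# link_cards HU_CARD_XML TAXON_XML: print "name<TAB>card_no<TAB>taxon_id"
# for every scientific name present in both files.
link_cards() {
    tab=$(printf '\t')
    cards=$(mktemp)
    taxa=$(mktemp)
    # join needs both inputs sorted on the key, with the same collation.
    xml_pairs "$1" card_taxon card_no | LC_ALL=C sort -t "$tab" -k 1,1 > "$cards"
    xml_pairs "$2" name taxon_id      | LC_ALL=C sort -t "$tab" -k 1,1 > "$taxa"
    LC_ALL=C join -t "$tab" "$cards" "$taxa"
    rm -f "$cards" "$taxa"
}

# Usage (file names are placeholders): link_cards hu_card.xml taxon.xml
```

Matching on the name string is fragile (spelling variants, author citations), so the result would need checking before being used to add rubrics.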

??? to be continued