Data preparation
Worldwide Botanical Knowledge Base project
J. M. Vanel
The data
The data provided by the Flora of China project came as two large MDB files: flora.mdb and
general.mdb. Here are the SQL data descriptions: general.sql, flora.sql. The "content" field
of the flora_text table in flora.mdb contains the taxon descriptions. The same table holds
many other fields for each taxon, depending on the value of the category_id field. The
category_id field, which appears in many different tables, is described by the name and
table fields of the category table in general.mdb. Here is an XML version of the category
table: category.xml.
Data in the flora_text table in file flora.mdb:
XPath | Field name | Description
/table/tr/td[1] | taxon_id | key for the taxon
/table/tr/td[2] | publication_id | key for a publication (article, book, ...)
/table/tr/td[3] | category_id | specifies the type of content, 12021 for botanical description
/table/tr/td[4] | content | the botanical description that gets XML-ized
 | note |
For a given taxon, not all the categories are populated. In fact, only
the following categories are filled anywhere in the table:
<category_id>12004</category_id>
<category_id>12006</category_id>
<category_id>12007</category_id>
<category_id>12008</category_id>
<category_id>12011</category_id>
<category_id>12021</category_id>
<category_id>12023</category_id>
<category_id>12024</category_id>
<category_id>12025</category_id>
<category_id>12027</category_id>
<category_id>12041</category_id>
<category_id>12043</category_id>
<category_id>12051</category_id>
<category_id>12053</category_id>
<category_id>12055</category_id>
<category_id>12057</category_id>
<category_id>12061</category_id>
<category_id>12081</category_id>
<category_id>12091</category_id>
<category_id>12093</category_id>
<category_id>12094</category_id>
I got this list with the shell command:
grep "<category_id>" flora_text2.xml | sort | uniq
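A small variation on this command can also show how many records each category holds. This is only a sketch: it assumes every category_id element is written on a single line, as in the listing above, and the function name count_categories is made up for illustration.

```shell
#!/bin/sh
# count_categories FILE: hypothetical helper, not part of the project.
# Prints "count category_id" pairs, most frequent category first.
count_categories() {
    # -o keeps only the matching element, so other text on the line is ignored
    grep -o '<category_id>[0-9]*</category_id>' "$1" |
        sed 's/<[^>]*>//g' |      # strip the tags, keep the number
        sort | uniq -c |          # count identical ids
        sort -k1,1nr -k2,2n |     # biggest count first, then by id
        awk '{ print $1, $2 }'    # drop the padding added by uniq -c
}

# Usage: count_categories flora_text2.xml
```

Categories with very few records stand out immediately in the output, which helps to decide which ones are worth processing.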
Download the latest XML file: flora_text2.xml.zip.
Getting XML from various sources
I first used a small Java JDBC program, JDBC2XML.java, to
get clean XHTML out of the Flora of China Access database.
Now I use mdb-export (part of MDB Tools)
to get the database in CSV format:
mdb-export general.mdb category > category.csv
followed by the Perl module XML::CSV.
Here is an example of a script using the Perl module XML::CSV:
# csv2xml.pl
use strict;
use warnings;
use XML::CSV;
use Text::CSV_XS;
# Configure the CSV parser:
my $csv_xs = Text::CSV_XS->new({
  'quote_char' => '"',
  'escape_char' => '"',
  'sep_char' => ',',
  'binary' => 1    # Perl has no TRUE constant; 1 allows newlines and non-ASCII in fields
});
# Hand the configured parser to XML::CSV
my $csv_obj = XML::CSV->new({csv_xs => $csv_xs});
my $basename = "flora_text2";
# headings => 1: the first CSV row holds the column headings
$csv_obj->parse_doc( $basename.".csv", {headings => 1});
$csv_obj->print_xml( $basename.".xml");
Natural language processing
FloraParse is our natural language parser. It parses classical
Floras and generates XML markup. It integrates the WordNet C
library. WordNet is a semantic dictionary from Princeton
University, accessible through a Web interface, a GUI, and a
C library. FloraParse is written in C++ with Lex and Yacc.
FloraParse transforms natural language descriptions into an XML format
in which pieces of information are marked as organ, sub-organ, geography,
etc.
Sample of the current output of FloraParse. Here is a small XML sample
with its stylesheet, which displays it as colored text.
FloraParse records specific WordNet hypernyms: "plant organ"
(petal and most plant organs), "natural object",
and a further category, covering corona, pith, stele, lobe and vein, which
WordNet classifies as "body part".
FloraParse could be generalized into a quite generic XMLizer for
punctuation-delimited texts.
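To give a rough idea of what such a generic XMLizer might look like, here is a minimal sketch. It is not the FloraParse algorithm and does no WordNet lookup: it only assumes that ";" separates statements and "," separates phrases. The function name xmlize_punctuation and the element names description, statement and phrase are invented for this illustration.

```shell
#!/bin/sh
# xmlize_punctuation: hypothetical sketch, NOT the FloraParse algorithm.
# Reads one description per line on stdin, splits it into statements at ";"
# and into phrases at ",", and drops one final "." if present.
xmlize_punctuation() {
    awk '
    # remove leading and trailing blanks
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    # escape the characters that are special in XML text
    function esc(s)  { gsub(/&/, "\\&amp;", s); gsub(/</, "\\&lt;", s); gsub(/>/, "\\&gt;", s); return s }
    {
        line = trim($0)
        sub(/\.$/, "", line)
        if (line == "") next
        print "<description>"
        ns = split(line, st, ";")
        for (i = 1; i <= ns; i++) {
            if (trim(st[i]) == "") continue
            print "  <statement>"
            np = split(st[i], ph, ",")
            for (j = 1; j <= np; j++) {
                p = trim(ph[j])
                if (p != "") print "    <phrase>" esc(p) "</phrase>"
            }
            print "  </statement>"
        }
        print "</description>"
    }'
}

# Usage: echo "Leaves ovate, 3-5 cm; petals white." | xmlize_punctuation
```

For the usage line above, the sketch produces one description with two statements: the first holds the phrases "Leaves ovate" and "3-5 cm", the second the phrase "petals white". Deciding what each phrase is about (organ, sub-organ, geography) is the part that needs a real parser and a dictionary.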
General information about the
WWBKB natural language processing (see also the WWBKB diary of 2003-02-11).
The FloraParse source code is in CVS at SourceForge.net: CVS web
interface at SourceForge.
Link to the project
page on SourceForge:
The latest version of the XML markup of the Flora of China
is in this directory.
Generate organ list
I do this tag list processing outside the XML database, because it is
easier. If I had several XML files, I would concatenate all of them (or
just some typical XML instances) into one file and put an opening and a closing
tag around the result. It could be done more elegantly with XML external
parsed entities. Here is a way of doing the concatenation in a Unix shell
(use Cygwin if on Windows: http://sources.redhat.com/cygwin/):
# wrap all the XML files found below the current directory in one <root> element
echo '<root>' > ../data.xml
# -exec cat {} + also copes with file names that contain spaces
find . ! -type d -name '*.xml' -exec cat {} + >> ../data.xml
echo '</root>' >> ../data.xml
# data.xml is now in the parent directory; then do:
cd ..
saxon7 data.xml example2Schema.xslt > schema.xml
saxon7 schema.xml XMLSchema2query.xslt > XPath-helper.html
after having downloaded example2Schema.xslt, XMLSchema2query.xslt,
unique.xslt, XPath-helper.js from:
http://wwbota.free.fr/XSLT_models/
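The external parsed entity variant mentioned above could be sketched as follows. Instead of copying the files, the script writes a small wrapper document that declares one entity per file and references each of them inside the root element. The function name make_entity_wrapper and the entity names f1, f2, ... are invented here. The sketch also assumes file names without quotes or line breaks, and XML files that carry no DOCTYPE of their own.

```shell
#!/bin/sh
# make_entity_wrapper DIR: hypothetical helper. Prints on stdout a wrapper
# document that pulls in every *.xml file under DIR as an external parsed entity.
make_entity_wrapper() {
    find "$1" ! -type d -name '*.xml' | sort |
    awk '
        { file[NR] = $0 }                 # remember every file name
        END {
            print "<?xml version=\"1.0\"?>"
            print "<!DOCTYPE root ["
            for (i = 1; i <= NR; i++)     # one entity declaration per file
                printf "  <!ENTITY f%d SYSTEM \"%s\">\n", i, file[i]
            print "]>"
            print "<root>"
            for (i = 1; i <= NR; i++)     # one entity reference per file
                printf "  &f%d;\n", i
            print "</root>"
        }'
}

# Usage (xml_dir is a placeholder for the directory holding the XML files):
#   make_entity_wrapper xml_dir > data.xml
```

Relative system identifiers are resolved against the location of the wrapper document, so the wrapper should be written in the directory from which the function was called, and outside the directory being scanned. Whether the entities are actually expanded depends on the XML parser behind the XSLT processor, since non-validating parsers are not obliged to read external entities.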
saxon7 is the script I use for XSLT transformations:
java -Xmx110m -classpath /home/jmv/install/saxon7/saxon7.jar net.sf.saxon.Transform "$@"
The SAXON XSLT processor is here: http://saxon.sourceforge.net/
SAXON is 100% Java, efficient, and always kept in sync with new W3C specs.
XPath-helper.html now contains an alphabetical list of all tags in all
the original XML files, whatever the XML nesting level.
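For a quick check of that list without running the two stylesheets, the distinct tag names can also be approximated from the shell. This is a rough sketch and not a replacement for the XSLT route: it finds opening tags with a regular expression, so it assumes there is no "<" followed by a letter inside comments or CDATA sections. The function name list_tags is invented.

```shell
#!/bin/sh
# list_tags FILE...: hypothetical helper printing each distinct element name once.
list_tags() {
    # "<name" matches opening and empty tags; "</name", "<?" and "<!" do not match.
    # -o prints only the match, -h leaves out the file names.
    grep -oh '<[A-Za-z_][A-Za-z0-9_.:-]*' "$@" |
        sed 's/^<//' |
        LC_ALL=C sort -u      # byte order, so uppercase names come first
}

# Usage: list_tags data.xml
```

Comparing this output with the list in XPath-helper.html is a cheap way to notice a tag that the stylesheets missed.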
Then I manually edit XPath-helper.html to eliminate some irrelevant
or rare tags, and the duplicated root tag. Finally I manually insert
the relevant part into the squery.xsp page in the Cocoon search engine
application.
Adding rubrics from other tables
scientific names, ???, and upper taxon links
add-rubrics.xslt
In table hu_card we have this:
<record>
<card_no>44</card_no>
<family_id>10241</family_id>
<genus_id>108817</genus_id>
<card_taxon>Cycas rumphii</card_taxon>
In table taxon we have this:
<record>
<taxon_id>200005231</taxon_id>
<name>Cycas rumphii</name>
??? to be continued