The Next Step: TaxonX marked-up publications
The real value of systematics publications lies in their extremely rich descriptive content: descriptions, diagnoses and distribution data. Like specimen data sitting in collections, published data is not accessible unless an effort is made to transfer the printed record to some sort of electronic medium. But scanned publications are only the very first step toward opening up the more than 80,000 pages of printed record containing the systematics of the >11,800 ant species.
While it is a significant achievement to provide access to PDF versions of scanned documents, more work remains to be done in order to expose the information contained in them for more accurate searching and navigation, as well as more powerful computer analysis and processing. First, the text of the publications needs to be converted to machine-readable form, ideally in an open, non-proprietary data format such as XML. Next, the XML texts should be encoded so as to make explicit the structures (e.g., nomenclature, distribution and diagnosis sections) and features (scientific names, localities, characters, etc.) characteristic of taxonomic treatments.
In a collaborative project with the American Museum of Natural History, Ohio State University, University of Massachusetts and the University of Karlsruhe (Germany), supported by a bi-national research award from the US NSF and the German DFG, ants are used as the primary pilot group to develop an XML schema (TaxonX) for encoding the logical structures of treatments in taxonomic publications, to develop tools that automate the markup, and finally to build tools that demonstrate how to extract, use and mine the data in those publications.
The elegant result will be that anybody with some basic skills can write programs to extract the information or build viewers for the documents or parts of them.
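To give a sense of how little code such an extraction program might need, here is a minimal sketch in Python. The element and attribute names (treatment, nomenclature, name, div, locality) are simplified assumptions made for illustration and are not taken from the actual TaxonX schema; the sample record is placeholder data, not an excerpt from any of the marked-up publications.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup. Element and attribute names are
# illustrative assumptions, NOT the actual TaxonX schema, and the
# text content is placeholder data.
SAMPLE = """\
<treatment>
  <nomenclature>
    <name rank="species">Formica rufa</name>
    <author>Linnaeus</author>
    <year>1758</year>
  </nomenclature>
  <div type="description">
    <p>Placeholder description text.</p>
  </div>
  <div type="distribution">
    <p>Placeholder text mentioning <locality>Europe</locality>.</p>
  </div>
</treatment>
"""


def extract_treatment(xml_text):
    """Pull the name, the typed sections and the tagged localities
    out of one treatment."""
    root = ET.fromstring(xml_text)
    nomen = root.find("nomenclature")

    sections = {}
    for div in root.findall("div"):
        # itertext() also collects text inside nested feature elements;
        # split/join collapses the indentation whitespace.
        text = " ".join("".join(div.itertext()).split())
        sections[div.get("type")] = text

    return {
        "name": nomen.findtext("name"),
        "author": nomen.findtext("author"),
        "year": nomen.findtext("year"),
        "sections": sections,
        "localities": [loc.text for loc in root.iter("locality")],
    }


record = extract_treatment(SAMPLE)
print(record["name"], record["author"], record["year"])
print(record["sections"]["distribution"])
print(record["localities"])
```

The sketch relies on the distinction drawn above: structures (sections) are addressed by their type, while features (here, localities) can be collected wherever they occur in the text. A real TaxonX document would most likely also require namespace handling, which is omitted here.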
Two examples are given using a simple viewer to display the first description of an ant by Linnaeus, 1758,
and a more recent description by Fisher, describing Strumigenys ants from Madagascar.
Here is a list of test documents related to our digital library project.
For those interested in editing their own systematics papers, based on OCR-ed documents, we now offer GoldenGATE, a dedicated editor tuned to produce valid TaxonX documents for ants, including Life Science Identifiers (LSIDs) for ant taxa. Our own test base, which will eventually include 125 publications covering all descriptions of Malagasy ants, is available here.
For more information, please Contact Us