Wednesday, July 4, 2007

Digital Species Descriptions & the new GBIF portal

The Biodiversity Information Standards (previously known as TDWG, the Taxonomic Databases Working Group) has recently rolled out a new subgroup called "Species Profile Model" led by Éamonn Ó Tuama (GBIF). Thirteen people attended a workshop April 16-18, 2007 in Copenhagen, DK, shortly after the Encyclopedia of Life informatics workshop in Woods Hole, MA. The point of this Copenhagen workshop was to hash out a specification to support the retrieval and integration of data with the lofty goal of "reaching consensus and avoiding fragmentation" of existing species-level initiatives. I'm all for this, but I wonder whether it will work. I believe "consensus" as it's described here is meant to be a common way of presenting the data rather than a true taxonomic, ecological, or political consensus. A specification does not preclude the possibility of several variants of a species profile served from multiple providers (or even the same provider). These could of course have conflicting or dated information and ultimately result in misleading COSEWIC-type recommendations. So, what about consensus as we usually define it? Or, is that beyond the responsibility of this subgroup?

A standard for specimen data (Darwin Core, ABCD, etc.) is obvious, but I'm not convinced that a standard for species descriptions is wise unless such a standard were developed and solely hosted by the nomenclators and sanctioned by the various Codes. A standard for species descriptions without ties to the nomenclators and the authors who conducted the original species description or revision merely democratizes fluff. Before a standard Species Profile Model is put into practise, such RDF representations have to at least explicitly incorporate peer review, authorship, and a date stamp.
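To make that last requirement concrete, here is a minimal sketch in Python of the kind of check an aggregator might run before accepting a species profile. The class and field names are hypothetical and are not drawn from the Species Profile Model vocabulary; the sketch only assumes that a profile carries authorship, a date stamp, and a stated peer-review status.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class ProfileProvenance:
    """Hypothetical provenance attached to one species profile."""
    authors: Tuple[str, ...] = ()
    date_stamp: Optional[date] = None
    peer_reviewed: Optional[bool] = None  # None means "not stated"


def missing_provenance(p: ProfileProvenance) -> List[str]:
    """Return the names of the provenance items the profile fails to state."""
    missing = []
    if not p.authors:
        missing.append("authorship")
    if p.date_stamp is None:
        missing.append("date stamp")
    if p.peer_reviewed is None:
        missing.append("peer review status")
    return missing


# A profile that names an author (placeholder) but nothing else.
partial = ProfileProvenance(authors=("A. Author",))
print(missing_provenance(partial))
```

An aggregator could refuse, or at least flag, any profile for which this list is not empty.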

I also noticed that the Species Profile Model is attempting to integrate citations to scientific literature. I suggest the team take a good close look at OpenURL, which lends itself to useful functionality when building lists of references in front-end applications (see Rod Page's post in iPhylo on this very subject and several posts in this blog). The OpenURL format will influence how the elements in the proposed Species Profile Model ought to be constructed.
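As a rough illustration of why the element names matter, here is a sketch in Python that turns a citation into an OpenURL 1.0 (Z39.88-2004) key/value query for a journal article. The resolver address and the citation values are placeholders I made up for the example, and the function name is my own; only the `rft.*` keys and the format identifier come from the OpenURL journal format.

```python
from urllib.parse import urlencode

# Hypothetical resolver address; substitute a real link resolver.
RESOLVER_BASE = "https://resolver.example.org/openurl"

# Citation fields that map directly onto OpenURL journal keys (rft.<name>).
JOURNAL_KEYS = ("atitle", "jtitle", "aulast", "aufirst", "date",
                "volume", "issue", "spage", "epage", "issn")


def journal_openurl(citation, base=RESOLVER_BASE):
    """Build an OpenURL 1.0 key/value query string for a journal article."""
    params = [
        ("url_ver", "Z39.88-2004"),
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
        ("rft.genre", "article"),
    ]
    # Emit only the fields the citation actually carries.
    for key in JOURNAL_KEYS:
        value = citation.get(key)
        if value:
            params.append(("rft." + key, str(value)))
    return base + "?" + urlencode(params)


# Placeholder citation, not a real reference.
example = {
    "atitle": "An example article title",
    "jtitle": "Journal of Examples",
    "aulast": "Author",
    "date": "2007",
    "volume": 1,
    "spage": 10,
}
print(journal_openurl(example))
```

If a literature vocabulary keeps article title, journal title, volume, start page, and so on as separate elements, a mapping like this is trivial. If it lumps them together or names them idiosyncratically, it is not.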

On other fronts, GBIF just rolled out their new portal: http://data.gbif.org. It looks as if the whole index and back-end were reconstructed and there remain some missing provider data tables. In time, these will probably blink on as they were presented via the old portal. What I appreciate seeing for the first time is a concerted effort to give providers some auto-magic feedback about what is being served from their boxes. Vetting data is a very important part of federation and I hope providers sit up and take notice. GBIF calls these "event logs", which is too obscure. I'd like to see this called "Questionable Data Served from this Provider", "Problem Records", "The Crap You're Serving the Scientific Community", or something similar. "Event logs" is easily dismissed and overlooked. For example, here are the event logs for the University of Alaska Museum of the North Mollusc Collection: http://data.gbif.org/datasets/resource/967/logs/. GBIF also has a flashy new logo & plenty of easy-to-use web services.

4 comments:

Anonymous said...

"The Crap You're Serving the Scientific Community" - This made me laugh ;o)

The event logs also show some very basic (at the moment) usage reports, with the aim of letting providers see how their data is being discovered (e.g. search parameters) and how many records from the data set are returned. It is primitive at the moment, but the aim is for better reporting on usage. It does also of course log the issues found in the data as they were harvested. Tim (GBIF)

David Shorthouse said...

Tim,
Usage reports are a great idea because this helps providers argue for more local funding to keep their boxes alive and healthy. What I'd also like to see is a local app for providers (bundled with TAPIR or DiGIR) to create their own usage reports. A usage report at GBIF's end is only half the picture because it doesn't capture non-GBIF access to the data providers serve. For example, data pulled from the University of Kansas, ZipCodeZoo, etc. can at times be just as intensive as data pulled from GBIF.

Éamonn said...

Hi David,

The 'consensus' that the workshop was attempting to reach was an agreed/harmonised list of top level elements/terms that would be used to tag descriptions relating to species (or, more generally, taxa), terms like 'distribution', 'life cycle', 'migration', 'reproduction', 'growth', etc., drawn from the various schemas (e.g. Plinian Core, FishBase, NatureServe, GISIN). See here: http://wiki.gbif.org/dadiwiki/images/speciesmodel/speciesModelElements.xls.

These terms constitute the TDWG Species Profile Model InfoItems Ontology (http://rs.tdwg.org/ontology/voc/SPMInfoItems.rdf) that are used by the Species Profile Model defined here: http://rs.tdwg.org/ontology/voc/SpeciesProfileModel.rdf.

So, of course, you can have as many variants of a species profile as you want. The whole idea is to be able to tag descriptions (infoItems) and relate them to a particular taxon concept. Likewise, literature citations can be integrated with the infoItems. They, like other elements (taxon names, concepts, rank, citation, etc.) are expressed using the TDWG LSID vocabularies (http://wiki.tdwg.org/twiki/bin/view/TAG/LsidVocs)

Éamonn (GBIF)

David Shorthouse said...

Éamonn,

Thanks for stopping by and contributing a response.
First, I think federating species descriptions is definitely something worth pursuing & I'm happy to see GBIF and TDWG taking the lead on this. Will there be a listserv or forum for the Species Profile Model?
Second, I think it a good idea to qualify the SPM somehow. For example, should the representation be from the original description, that ought to be 'tagged'. If the representation comes from the result of a peer-reviewed document (online or otherwise), that too should be 'tagged' with clear authorship. In this manner, aggregators or users of multiple SPM representations have something upon which to make their decisions when selecting content for other purposes.
I had a peek at the literature LSID vocabulary & I cannot understand why existing vocabularies/schema are not being used. OpenURL is the logical choice here. The current LSID vocabulary for literature cannot be reconstructed or repurposed into OpenURL because the selected elements seem proprietary or created de novo.