Monday, May 12, 2008

Life Science Identifiers (LSIDs) - Why?

The Catalogue of Life (CoLP) recently released its 2008 checklist and has now implemented Life Science Identifiers (LSIDs). In the past, the Catalogue of Life changed its identifiers with every new version, thus forcing database owners who made use of CoLP names and identifiers to reconstruct their databases if they wished to maintain some sort of external linking to an authoritative source.

If you're not familiar with LSIDs, here is the description from the SourceForge LSID resolution project:

Life Science Identifiers (LSIDs) are persistent, location-independent, resource identifiers for uniquely naming biologically significant resources including species names, concepts, occurrences, genes or proteins, or data objects that encode information about them. To put it simply, LSIDs are a way to identify and locate pieces of biological information on the web.
This is how LSIDs are constructed:
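
In text form, an LSID is a colon-separated URN: the fixed prefix urn:lsid, then an authority (a domain name), a namespace, an object identifier, and optionally a revision. Here is a minimal Python sketch that splits an LSID into those parts. The names LSID and parse_lsid are mine, chosen for illustration, and the validation is deliberately loose; it is not a full implementation of the LSID specification.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    """The parts of an LSID (illustrative type, not from any library)."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split 'urn:lsid:authority:namespace:object[:revision]' into parts."""
    parts = text.strip().split(":")
    # Expect 5 fields, or 6 when a revision is present.
    if len(parts) not in (5, 6):
        raise ValueError(f"expected 5 or 6 colon-separated fields: {text!r}")
    # The 'urn' and 'lsid' prefixes are matched case-insensitively.
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID (missing urn:lsid prefix): {text!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"empty authority, namespace or object: {text!r}")
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return LSID(authority, namespace, object_id, revision)


if __name__ == "__main__":
    print(parse_lsid("urn:lsid:ubio.org:namebank:2072956"))
```

Note that the object identifier is just a string; nothing in the syntax promises that it is an integer.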


So, what can one do with an LSID? Well, given an LSID, one can get some metadata for that data object. This assumes, of course, that the authority at the other end is alive and ready to serve the metadata. There is no central authority, as there is with the Digital Object Identifiers (DOIs) used by the publishing industry.

For starters, one can resolve LSIDs using various online resources. Examples:
  1. Biodiversity Information Standards (TDWG): LSID resolver

  2. Rod Page: LSID tester


Because of the distributed nature of LSID authorities (it's ultimately based on DNS), there is of course nothing preventing the same taxon name from having multiple identifiers, or one authority from serving multiple LSIDs for the same taxon name. For example, the namestring for the fishing spider Dolomedes tenebrosus Hentz, 1844 has no fewer than three LSIDs, each pointing to a different authority:

uBio: urn:lsid:ubio.org:namebank:2072956
Catalogue of Life 2008: urn:lsid:catalogueoflife.org:taxon:f3b7cf14-29c1-102b-9a4a-00304854f820:ac2008 (ugh!)
The World Spider Catalog: urn:lsid:amnh.org:spidersp:019664

The uBio and the Catalogue of Life LSIDs for this spider resolve, but the AMNH LSID is nothing more than a pointer at this stage because, at the time of writing, the AMNH does not yet have a functioning resolution service.
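
To make the DNS dependence concrete: as I understand the LSID resolution scheme, a client finds an authority's resolution service by looking up a DNS SRV record named _lsid._tcp.<authority>. The Python sketch below only builds that query name for each of the three LSIDs above; it does no network lookup, and the function names are hypothetical. Whether a record actually exists at that name is exactly what separates an LSID that resolves from one that is only a pointer.

```python
def lsid_authority(lsid: str) -> str:
    """Return the authority (domain) field of an LSID, lower-cased."""
    parts = lsid.strip().split(":")
    if len(parts) < 5 or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {lsid!r}")
    return parts[2].lower()


def srv_query_name(lsid: str) -> str:
    """Build the DNS SRV name a client would query for this LSID.

    Assumes the '_lsid._tcp.<authority>' convention; no lookup is performed.
    """
    return f"_lsid._tcp.{lsid_authority(lsid)}"


# The three LSIDs quoted in the post for Dolomedes tenebrosus Hentz, 1844.
LSIDS = [
    "urn:lsid:ubio.org:namebank:2072956",
    "urn:lsid:catalogueoflife.org:taxon:f3b7cf14-29c1-102b-9a4a-00304854f820:ac2008",
    "urn:lsid:amnh.org:spidersp:019664",
]

if __name__ == "__main__":
    for lsid in LSIDS:
        print(srv_query_name(lsid), "<-", lsid)
```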

Which LSID is a database owner supposed to use? Are LSIDs meant to be currencies that either crumble or persist under Darwinian market pressures? What I want to do is store an LSID in my relational database so that I can more confidently link names with other sources of information, such as type specimens, gene sequences, synonyms, specimens, etc. The uBio LSID above is nice and compact, but no one but me and uBio would use it. Norm Platnick wasn't aware that uBio had LSIDs for spider names! The World Spider Catalog LSID above is also nice and compact, but it doesn't resolve. The Catalogue of Life LSID is downright awful because I can't merely use the object identification as a stand-alone integer.
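
One way to hedge in a relational database is to keep the namestring as the key and hang any number of LSIDs off it as cross-references. Below is a rough sketch using Python's built-in sqlite3 module. The table and column names (taxon_name, name_lsid) and the helper functions are invented for this example and do not come from any of the databases mentioned here.

```python
import sqlite3

# Hypothetical two-table layout: one row per namestring,
# and zero or more LSIDs pointing at each name.
SCHEMA = """
CREATE TABLE taxon_name (
    id         INTEGER PRIMARY KEY,
    namestring TEXT NOT NULL UNIQUE
);
CREATE TABLE name_lsid (
    name_id INTEGER NOT NULL REFERENCES taxon_name(id),
    lsid    TEXT NOT NULL UNIQUE
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database holding the example from this post."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    cur = conn.execute(
        "INSERT INTO taxon_name (namestring) VALUES (?)",
        ("Dolomedes tenebrosus Hentz, 1844",),
    )
    name_id = cur.lastrowid
    lsids = [
        "urn:lsid:ubio.org:namebank:2072956",
        "urn:lsid:catalogueoflife.org:taxon:f3b7cf14-29c1-102b-9a4a-00304854f820:ac2008",
        "urn:lsid:amnh.org:spidersp:019664",
    ]
    conn.executemany(
        "INSERT INTO name_lsid (name_id, lsid) VALUES (?, ?)",
        [(name_id, lsid) for lsid in lsids],
    )
    conn.commit()
    return conn


def lsids_for(conn: sqlite3.Connection, namestring: str) -> list:
    """Return every LSID recorded for a namestring, sorted alphabetically."""
    rows = conn.execute(
        "SELECT l.lsid FROM name_lsid AS l "
        "JOIN taxon_name AS n ON n.id = l.name_id "
        "WHERE n.namestring = ? ORDER BY l.lsid",
        (namestring,),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    db = build_demo_db()
    for lsid in lsids_for(db, "Dolomedes tenebrosus Hentz, 1844"):
        print(lsid)
```

This does not answer the question of which LSID to trust; it only avoids having to pick one as the primary key.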

So, I'll continue to use "Dolomedes tenebrosus Hentz, 1844", thank you very much. A decentralized identifier system is failing me.

1 comment:

Ken-ichi said...

Speaking from a position of extreme inexpertise on the topic, I think LSIDs are generally meant to identify digital resources, not necessarily the physical realities those resources describe. So the 3 LSIDs you mentioned above all identify records in taxonomic databases, not the actual taxa they describe. I don't think the intent of LSIDs is to resolve synonymies or provide linkages between authorities.

As to which one you should use, I'd say use the LSID from the authority you trust the most, or that provides what you need. Then again, if your goal is to have a better key to join data across diverse data sources, you're right, the old school binomen will probably serve the best.