Tuesday, June 19, 2007

DiGIR for Collectors

The Global Biodiversity Information Facility (GBIF) has done an excellent job of designing the infrastructure to support the federation of specimen and observation data. The majority of contributing institutions use Distributed Generic Information Retrieval (DiGIR), an open-source PHP-based package that nicely translates columns of data in one's dedicated database (e.g. MySQL, SQL Server, Oracle) into Darwin Core fields. So, even if your data columns don't match the Darwin Core schema, you can use the DiGIR configurator to match your columns to what's needed at GBIF's end. Europeans tend to prefer Access to Biological Collection Data (ABCD) as their transport mechanism. The functionality of these will soon be rolled into the TDWG Access Protocol for Information Retrieval (TAPIR). To the uninitiated like me, this is a jumbled, confusing alphabet soup and at first I couldn't navigate this stuff.
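To make the column-matching idea concrete, here is a minimal sketch in Python of what such a mapping boils down to. It is not DiGIR's configurator or its file format. The local column names on the left are hypothetical, the record is made up for illustration, and the names on the right follow current Darwin Core term spellings, which may differ from the version a given DiGIR install expects.

```python
# Hypothetical local column names (left) mapped to Darwin Core terms (right).
# This only illustrates the idea; it is not DiGIR's configuration format.
COLUMN_TO_DWC = {
    "species": "scientificName",
    "spec_id": "catalogNumber",
    "lat": "decimalLatitude",
    "lon": "decimalLongitude",
    "state": "stateProvince",
    "collector": "recordedBy",
}


def to_darwin_core(row):
    """Rename the keys of one database row to Darwin Core terms.

    Columns with no entry in COLUMN_TO_DWC are dropped.
    """
    return {
        COLUMN_TO_DWC[column]: value
        for column, value in row.items()
        if column in COLUMN_TO_DWC
    }


# A made-up record; "shelf" has no Darwin Core counterpart in the mapping.
row = {
    "species": "Pardosa moesta",
    "spec_id": "NSD-0001",
    "lat": 53.5,
    "lon": -113.5,
    "shelf": "B2",
}
print(to_darwin_core(row))
```

The point is simply that your own column names never have to change; only the lookup table does.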

Suffice it to say, the documentation isn't particularly great on either the TDWG or GBIF web sites. To the TDWG folks: a screencast with a step-by-step install for both Windows & Linux would go a long way! I don't mean a flashy Encyclopedia of Life webcast, I mean a basic TDWG for Dummies. If you have a dedicated database and a web server that can serve PHP-based pages, it's actually pretty straightforward once you get going. It's really just a matter of jumping through a few simple hoops. Click here; do this; match this; click there - not much more difficult than managing an Excel datasheet. The downloads & step-by-steps for DiGIR can be found HERE. The caveat: you need a dedicated database, a dedicated web server, and you need your resource to be recognized by a GBIF affiliate before it's registered for access. That's unfortunately how all this stuff works.

So what about the casual or semi-professional collector who may have a much larger collection than what can be found in museums? It's not terribly likely that the countless hard-working people like these have the patience to fuss with dedicated databases (we're not talking Excel here) or web servers. Must they wait to donate their specimens to a museum before these extremely valuable data are made accessible? In many cases, a large donation of specimens to a museum sits in the corner and never gets accessioned because there simply isn't the human power at the receiving end to manage it all. Heck, some of the pre-eminent collections in the world don't even have staff to receive donations of any size! This is a travesty.

An attractive solution to this is to complement DiGIR/ABCD/TAPIR with a fully online solution akin to Google Spreadsheets. For the server on the other end, this means a beefy pipe and a hefty set of machines to cope with this AJAX-style, rapid & continuous access. But, for small taxa-centric communities, this isn't a problem. In fact, I developed such a Google Spreadsheet-like function in The Nearctic Spider Database for collectors wanting to manage their spider data.

Turn Up Volume!

Watch the video above or HERE (better resolution). Everything is hosted on one machine on a residential Internet connection & I have had up to 5 concurrent users + all the usual 2,500 - 3,500 unique visitors a day with no appreciable drop in performance. Granted, things are a little slower in these instances, but the alternative is no aggregation of data at all. To help users, each of whom has their own table in the database, I designed some easy and useful tools. For example, they may query their records for nomenclatural issues, do some real-time reverse geocoding to verify that their records are actually in the State or Province they specified, and check for duplicates, along with a few other goodies like mapping everything as clickable pushpins in Google Maps. Of course, one can export as Excel or tab-delimited text files at any time. The other advantage to such a system is that upon receiving user feedback and requests, I can quickly add new functions & these are immediately available to all users. I don't have to stamp and mail out new CDs, urge them to download an update, or maintain versions of various releases. If you're curious about doing the same sort of thing for your interest group, check out Rico LiveGrid Plus, the code base upon which I built the application.
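For the curious, two of those tools (the duplicate check and the tab-delimited export) are simple enough to sketch. What follows is a minimal illustration in Python, not the code running in the Nearctic Spider Database. The function names, the field names and the choice of which fields define a "duplicate" are all assumptions made for the example.

```python
import csv
import io
from collections import defaultdict


def find_duplicates(records, key_fields):
    """Return lists of record indexes that share the same key fields.

    Values are compared after trimming whitespace and lowercasing, so
    "Pardosa moesta" and "pardosa moesta " count as the same value.
    """
    groups = defaultdict(list)
    for index, record in enumerate(records):
        key = tuple(str(record.get(field, "")).strip().lower() for field in key_fields)
        groups[key].append(index)
    return [indexes for indexes in groups.values() if len(indexes) > 1]


def to_tab_delimited(records, fields):
    """Serialize records as tab-delimited text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=fields,
        delimiter="\t",
        extrasaction="ignore",  # skip any columns not listed in fields
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Made-up records; the field names are assumptions for illustration.
FIELDS = ["scientificName", "locality", "eventDate"]
records = [
    {"scientificName": "Pardosa moesta", "locality": "Edmonton", "eventDate": "2006-06-01"},
    {"scientificName": "pardosa moesta ", "locality": "Edmonton", "eventDate": "2006-06-01"},
    {"scientificName": "Pardosa fuscula", "locality": "Edmonton", "eventDate": "2006-06-01"},
]

print(find_duplicates(records, FIELDS))
print(to_tab_delimited(records, FIELDS))
```

In a real application the duplicate check would more likely be a GROUP BY query in the database itself; doing it in memory just keeps the sketch self-contained.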

What would be really cool is if this sort of thing could be made into a Drupal-like module & bundled into Ubuntu Server Edition. A taxon focal group with a community of say 20-30 contributors could collectively work on their collection data in such a manner & never have to think about the tough techno stuff. They'd buy a cheap little machine for $400, slide the CD into the drive to install everything & away they go.

The real advantage to the on-line data management application in the Nearctic Spider Database is the quick access to the nomenclatural data. So, the Catalog of Life Partnership & other major pools of names ought to think about simple web services upon which such a plug-and-go system can draw its names. It's certainly valuable to have a list of vetted names such as what ITIS and Species2000 provide, but to really sell them to funding agencies they no doubt have to demonstrate how the names are being used. Web services bundled with a little plug-and-go CD would allow small interest groups to hit the ground running. Such a tool would give real-world weight to this business of collecting names and would go a long way toward avoiding the shell games these organizations probably have to play. I suspect these countless small interest groups would pay a reasonable annual subscription fee to keep the names pipes open. Agencies already exist to help monetize web services using such a subscription system. Perhaps it's worth thinking like Amazon Web Services (AWS), where users pay for what they use. Unlike AWS, however, incoming monies would only support the Catalog of Life Partnership's wages and infrastructure to take some weight off chasing grants.


leebel said...

I just found this posting of yours. I'm slow. Anyway, your comments about TDWG, DiGIR, TAPIR, DwC are spot on David. This is probably the biggest current issue for TDWG. My job over the past two years (with Roger Hyam, Ricardo Pereira and Donald Hobern) has been to provide TDWG with a more effective infrastructure. This is now done, but this has not bridged the gap between TDWG 'standards' and shrink wrapped products. In another year, we probably could have got there. We have supported considerable TAPIR and TAPIRLink developments including an implementation guide and a 2-page 'TAPIR for dummies'.

But should TDWG? Most Internet-related standards bodies such as TDWG do not deliver readily deployable packaged solutions or high levels of end-user support. That is usually left up to commercial service providers. The gap really does have to be bridged, but how best to do it?

David Shorthouse said...

Happy to hear there will be such a document. But, it does of course need wide advertisement such that it will be adopted and deployed. Indeed, it is a shame that Internet standards groups do not devote equal efforts and energies to end-user packages. Without this half of the practice, standards development is merely an academic exercise under a perpetual, dark cloud of imminent failure. Too often, standards are dismissed and eventually forgotten in favor of easily implemented solutions. I look for inspiration at widely used front-end solutions and APIs like the Google Map API that have extremely active followings and community-based support. In these sorts of environments, people can see the reason why standards are created. In addition, standards groups get a sense of end-user acceptance thus being adaptive for positive, future growth.

phil said...

This is really interesting. I came across it while looking for a way to do distributed content replication for BHL. After reading more of your posts I found out you're involved with EoL too, then a mention of Chris F. tied it all together. I'm Phil from Missouri Botanical Garden. I'm working on bringing Fedora-Commons into the Mobot mix, and will be working with EoL/BHL on a distributed replication system for other organizations worldwide. I'll be in Woods Hole April 20-23 for Nomina 2; if you're not attending the workshop, we should meet up and discuss current goings-on. Either way you can reach me at phil (at) cryer.us