Tuesday, October 30, 2007

Taxonomic Consensus as Software Creation

It occurred to me today that reaching taxonomic consensus, or developing a master database of vetted names like that undertaken by The Catalogue of Life Partnership (CoLP), is not unlike software development, which necessarily requires some sort of framework to manage versioning. However, taxonomic activities and checklist-building do not currently have a development framework. We have a set of rules and guidelines, but infighting and bickering no doubt fragment interest groups, which ultimately leads to the stagnation, abandonment, and eventual distrust of big projects like CoLP. We have organizations like the International Commission on Zoological Nomenclature to manage the act of naming animals, but there is nothing concrete out the other end to actually organize the names. Publications are merely the plums in a massive bowl of pudding, and it is equally frustrating to actually find these publications. One way to approach a solution is to treat systematics as perpetual software development, where subgroups manage branches of the code and occasionally perform commits to (temporarily) stabilize it. As with software development, groups of files (i.e. branches on the tree of life) and the files themselves (i.e. publications, images, genomic data, etc.) ought to be tracked with unique identifiers and time-stamps. This would be a massively complex shift in how taxonomic business is conducted, but what other solution is there?

Without really understanding distributed environments in software development (it's too geeky for me), I spent a few moments watching a Google TechTalk presentation on Git, a project spearheaded by Linus Torvalds, delivered by Randal Schwartz at Google on October 12, 2007: http://video.google.com/videoplay?docid=-1019966410726538802 (sorry, embedding has apparently been disabled by request).

There are some really interesting parallels between distributed software development environments like Git and what we ought to be working toward in systematics, especially as we move toward using Life Sciences Identifiers (LSIDs). Here are a few summarized points from Randal's presentation:

  • Git manages changes to a tree of files over time
  • Optimized for large file sets and merges
  • Encourages speculation with the construction of trial branches
  • Anyone can clone the tree, make & test local changes
  • Uses "Universal Public Identifiers"
  • Has multi-protocol transport, e.g. HTTP and SSH
  • One can navigate back in time to view older trees of file sets via universal public identifiers
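
Git's core ideas map surprisingly well onto checklist management. Below is a minimal Python sketch of the content-addressing idea behind those bullet points: every snapshot of a tree of records gets an identifier derived from its own content, so anyone holding a copy can reference it or walk back through its history. All names here are illustrative; this is the gist of the technique, not how Git is actually implemented.

```python
import hashlib
import json
import time

# A stand-in for Git's object database: identifier -> snapshot.
store = {}

def commit(tree, message, parent=None):
    """Store a snapshot of a tree of records and return its unique ID."""
    snapshot = {
        "tree": tree,          # e.g. {"Aves/Parus major": "Linnaeus 1758"}
        "message": message,
        "parent": parent,      # previous commit's ID, forming a history
        "timestamp": time.time(),
    }
    # The ID is derived from the content itself (timestamp included),
    # so identical trees committed at different times remain distinct.
    digest = hashlib.sha1(
        json.dumps(snapshot, sort_keys=True).encode()
    ).hexdigest()
    store[digest] = snapshot
    return digest

first = commit({"Aves/Parus major": "Linnaeus 1758"}, "initial checklist")
second = commit({"Aves/Parus major": "Linnaeus 1758",
                 "Aves/Parus minor": "speculative split"},
                "trial branch", parent=first)

# Navigating back in time: follow parent pointers from any commit ID.
assert store[second]["parent"] == first
```

A subgroup could "clone" the store, commit trial branches locally, and only merge back once the community accepts the change, exactly the speculative workflow Git encourages.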

With a cross-platform solution and a facile user interface, perhaps thinking in these terms will help engage taxonomists and will ultimately lead to a ZooBank global registry of new taxon names.

Thursday, October 25, 2007

Buying & Selling DOIs...and the same for specimens

A previous post of mine described the business model for digital object identifiers in admittedly simplistic terms. But perhaps I should back up a second. Just what the heck is a DOI, and why should the average systematist care? [Later in this post, I'll describe an interesting business model for biodiversity informatics]

Rod Page recently wrote a post in iPhylo that does a great job of selling the concept. Permit me to summarize and to add my own bits:

  1. DOIs are strings of numbers & letters that uniquely identify something in the digital realm. In the case of published works, they uniquely identify that work.
  2. DOIs are resolvable and can be made actionable. i.e. you can put http://dx.doi.org/ in front of a DOI and, through the magic of HTTP, you get redirected to the publisher's offering or the PDF or HTML version of the paper
  3. DOIs have metadata. If you have for example a citation to a reference, you can obtain the DOI. Conversely, if you have a DOI, you can get the metadata
  4. DOIs are a business model. Persistent URLs (championed by many) are not a business model because there is no transfer of funds & confidences
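
Points 2 and 3 can be sketched in a few lines of Python. The resolver prefix is real (http://dx.doi.org/), but the metadata lookup below is a stand-in dictionary rather than an actual CrossRef query:

```python
# What makes a DOI "actionable": prefixing it with a resolver address
# yields a URL that, via HTTP redirection, lands on the publisher's copy.
RESOLVER = "http://dx.doi.org/"

def actionable_url(doi):
    """Turn a bare DOI string into a resolvable URL."""
    return RESOLVER + doi

# Hypothetical metadata record illustrating point 3: given a DOI you can
# get the citation metadata, and vice versa. (A real lookup would go
# through CrossRef, not a local dictionary.)
metadata = {
    "10.1016/S0169-5347(02)00040-X": {
        "author": "E. O. Wilson",
        "journal": "Trends in Ecology & Evolution",
        "year": 2003,
    }
}

doi = "10.1016/S0169-5347(02)00040-X"
print(actionable_url(doi))
# http://dx.doi.org/10.1016/S0169-5347(02)00040-X
print(metadata[doi]["author"])
# E. O. Wilson
```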

Systematists have lamented that their works delineating and describing species don't get cited in the primary literature. If they published in journals that stamped DOIs on their works, or if they helped journals get DOIs for back issues and future publications, then outfits like the Biodiversity Heritage Library would have an easier time mapping taxon names to published works. For example, searching the Biodiversity Heritage Library (prototype HERE) not for a publication but for a taxon name would not only provide a list of works in BHL that used the name somewhere in their text, it could also provide a forward-linking gadget from CrossRef. The end user would then have an opportunity to do his or her own cognitive searching.

There is nothing stopping an outfit like the Biodiversity Heritage Library from using Handles or some other globally unique identifier. But doing so cuts off the possibility of injecting old works back into contemporary use, because they will not be embedded in a widely used cross-linking framework.

MOIs for Sale

The Global Biodiversity Information Facility and The Encyclopedia of Life must also be active participants in the adoption of globally unique identifiers. But again, there must be a business model. So, here's a business model in relation to museum specimens:
  1. A registry sells a "MOI" - Museum Object Identifier (my creation of course) at 1 cent per labelled specimen.
  2. The price will go up to 2 cents a specimen after 2020, the usual year given for various National Biodiversity Strategies. Translation: get your act together because it'll cost more later.
  3. All MOIs must have DarwinCore metadata.
  4. The registry sets up a resolver identical in functionality to DOIs.
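
As a hypothetical sketch of how such a registry might behave (every name, field, and function here is invented for illustration, with Darwin Core reduced to three required keys):

```python
# Hypothetical MOI registry. Prices follow the proposal above: 1 cent
# per labelled specimen, rising to 2 cents after 2020.
PRICE_THROUGH_2020 = 0.01
PRICE_AFTER_2020 = 0.02

# Darwin Core reduced, for illustration, to a minimal required subset.
REQUIRED_DARWINCORE_FIELDS = {"institutionCode", "catalogNumber",
                              "scientificName"}

registry = {}

def sell_moi(moi, metadata, year):
    """Register a Museum Object Identifier if its metadata is complete."""
    missing = REQUIRED_DARWINCORE_FIELDS - metadata.keys()
    if missing:
        raise ValueError("incomplete DarwinCore metadata: %s" % missing)
    price = PRICE_THROUGH_2020 if year <= 2020 else PRICE_AFTER_2020
    registry[moi] = metadata
    return price

cost = sell_moi("MOI:ROM:12345",
                {"institutionCode": "ROM",
                 "catalogNumber": "12345",
                 "scientificName": "Parus major"},
                year=2008)
print(cost)  # 0.01
```

Refusing to mint an identifier without complete metadata is the point: the fee buys persistence, and the metadata requirement makes the identifier resolvable and indexable.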

Now, before all the curators out there scream bloody murder, let's stop and think about this and put a creative, financial spin on the possibilities. Craig Newmark, the founder of the ever-popular Craigslist, was recently interviewed on Stephen Colbert's Colbert Report, where he mentioned DonorsChoose (see interview). If you're not familiar with that new service, here's the slogan: "Teachers ask. You choose. Students learn."
DonorsChoose.org is a simple way to provide students in need with resources that our public schools often lack. At this not-for-profit web site, teachers submit project proposals for materials or experiences their students need to learn. These ideas become classroom reality when concerned individuals, whom we call Citizen Philanthropists, choose projects to fund.

There's a lot of interest in The Encyclopedia of Life and the Biodiversity Heritage Library now. Let's set up a global "DonorsChoose" clone called something like "Biodiversity Knowledge Fund" (though that's not catchy enough), to be locally administered by daughter organizations to EOL and the BHL in countries throughout the world. Funds are then transferred to institutions of the donor's choosing. Museums accept the funds donated to them and turn around and buy "MOIs". What would prevent a museum from taking the money specifically donated to them and spending it on things other than MOIs? Nothing. But then their specimens aren't indexed. Are you a philanthropist, or have 20 dollars (or francs, rubles, pounds, pesos, dinar, lira, etc.) you'd like to donate? Want to fund biodiversity but don't know how? Here's an answer. But is such a "Biodiversity Knowledge Fund" sustainable? No, but it's a start.

Wednesday, October 24, 2007

Biodiversity Informatics Needs a Business Model

Publishers and (most) librarians understand that digital object identifiers (DOIs) associated with published works are more than just persistent codes that uniquely identify items. They are built into the social fabric of the publishing industry. Because monies are transferred for the application and maintenance of a DOI, the identifier is persistent. It's really because of this "feature" that tools like cross-linking and forward-linking can be built, and that these new tools will themselves persist. The nascent biodiversity informatics community (myself included) is attempting to do all the fun stuff, like building taxonomic indices and gadgetry to associate names and concepts with other things like literature, images, and specimens, without first establishing a long-term solution for how the persistence of all these new tools will be secured. Let me break it down another way:

Publishers buy DOIs and pay an annual subscription. In turn, the extra fee for the DOI is passed down the chain to the journal and its society. The society then passes the extra fees on either to authors in the way of page fees or to the subscribers of the journal. Since the majority of subscribers are institutions and authors receive research grants from federal agencies, the fractions of pennies that combine to pay for a single DOI ultimately come from taxpayers' wallets and purses. So DOIs fit nicely into the fabric of society and really do serve a greater purpose than merely uniquely identifying a published object. Then, and only then, can the nifty tools CrossRef provides be made available, and third parties may use those tools with confidence.

Not surprisingly, the biodiversity informatics community has latched on to the nifty things one can do with globally unique identifiers, because everybody wants to "do things" by connecting one another's resources. Some very important and extremely interesting answers to tough questions can only be obtained by doing this work. Also not surprisingly, there is now a mess of various kinds of supposed globally unique identifiers (GUIDs), because big players want to be the clearinghouse, much as CrossRef is the clearinghouse for DOIs. But they have all missed the point.

So, how do we instill confidence in the use of LSIDs, ITIS TSNs, the various NCBI database IDs, etc. without a heap of silos and occasional casualties? Get rid of them, or at least clearly associate what kind of object gets what kind of identifier, along with a business model where there will be a persistent, demonstrable transfer of funds. The use of Semantic Web tools is merely a band-aid for a gushing wound. When I say persistent transfer of funds, I don't mean assurances that monies will come from federal grants or wealthy foundations to maintain those identifiers. I mean an identifier that is woven into the fabric and workflow of the scientific community. This may be easier said than done, because the scientific community (especially systematists and biologists) isn't in the business of producing anything tangible except publications, and CrossRef has that angle very well covered. So, what else do scientists (the systematics community is what I'm most interested in) produce that can be monetized? Specimens, gene sequences, and perhaps a few other objects. We need several non-profits like CrossRef with the guts to demand monies for the assignment of persistent identifiers. Either we adopt this as a business model, or we monetize some services (e.g. something like Amazon Web Services, as previously discussed) that directly, clearly, and unequivocally feed into the maintenance of all the shiny new GUIDs.

Tuesday, October 16, 2007

PygmyBrowse Classification Tree API

Yay, a new toy! This one ought to be useful for lots of biodiversity/taxonomic web sites. First, I'll let you play with it (click the image):

Seems I always pick up where Rod Page leaves off. Not sure if this is a good thing or not. However, we do have some worthwhile synergies. Rod has cleaned up and simplified his old (Sept. 2006) version of PygmyBrowse. Earlier this week, he made an iframe version and put it on his iPhylo blog. Like Rod, I dislike a lot of the classification trees you come across on biodiversity/taxonomic web sites because these ever-expanding monstrosities eventually fill the screen and are a complete mess. When you click a hyperlinked node, you often have to wait while the page reloads and the tree re-roots itself...not pretty. Trees are supposed to simplify navigation and give a sense of just how diverse life on earth really is. The Yahoo YUI TreeView is OK because it's dynamic, but it desperately needs to handle overflow for exceptionally large branches as is the case with classification trees in biology. What did I do that's different from Rod's creation?

I convinced Dave Martin (GBIF) to duplicate the XML structure Rod used to fill the branches in his PygmyBrowse and to also do the same with JSON outputs. This is the beta ClassificationSearchAPI, which will soon be available from the main GBIF web services offerings. When the service is out of beta, I'll just adjust one quick line in my code.

I jumped at the chance to preserve the functionality Rod has in his newly improved, traditional XMLHTTP-based PygmyBrowse and to write an object-oriented JavaScript/JSON-based version. My goal is to have a very simple API for developers and end users who wish to have a remotely obtained, customizable classification tree on their websites. Plus, I want this API to accept an XML containing taxon name and URL elements (e.g. a Google sitemap) such that the API will parse it and adjust the behaviour of the links in the tree. In other words, just as you can point the Google Maps API to an XML file containing geocoded points for pop-ups, I wanted to author this API to grab an XML and magically insert little, clickable icons next to nodes or leaves that have corresponding web pages on my server. Think of this as a hotplugged, ready-made classification navigator. This is something you cannot do with an iframe version, because it's stuck on the server and you can't stick your fingers in it and play with it. Sorry, Rod.
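
To make the sitemap idea concrete, here is a sketch of the parsing step in Python (the real implementation would live in the JavaScript API itself). The element names (taxon, name, url) and function names are assumptions for illustration, not a published schema:

```python
import xml.etree.ElementTree as ET

# A hypothetical sitemap pairing taxon names with pages on my server.
SITEMAP = """
<sitemap>
  <taxon><name>Aves</name><url>http://example.org/aves</url></taxon>
  <taxon><name>Parus major</name><url>http://example.org/parus-major</url></taxon>
</sitemap>
"""

def parse_sitemap(xml_text):
    """Map taxon names to the pages that should be linked from the tree."""
    root = ET.fromstring(xml_text)
    return {t.findtext("name"): t.findtext("url")
            for t in root.findall("taxon")}

def decorate(nodes, links):
    """Attach a link (or None) to any tree node with a corresponding page."""
    return [{"name": n, "link": links.get(n)} for n in nodes]

links = parse_sitemap(SITEMAP)
tree_nodes = ["Aves", "Parus major", "Parus minor"]
print(decorate(tree_nodes, links))
```

Nodes whose names appear in the sitemap get a clickable icon; the rest render as plain tree entries.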

The ability to feed an XML to the tree isn't yet complete, but the guts are all in place in the JavaScript. You can specify a starting node (homonym issues haven't yet been dealt with but I'll do that at some point), the size of the tree, the classification system to use (e.g. Catalogue of Life: 2007 Annual Checklist or Index Fungorum, among others), and you can have as many of these trees on one page as you wish. You just have to pray GBIF servers don't collapse under the strain. So, you could use this API as a very simple way to eyeball 2+ simultaneous classifications. The caveat of course is that GBIF must make these available in the API. So, hats off to GBIF and Dave Martin. These are very useful and important APIs.

Last month, I proposed that the Biodiversity Informatics community develop a programmableweb.com clone called programmablebiodiversity.org. There are more and more biodiversity-related APIs available, many of which produce JSON in addition to the usual XML documents via REST. Surely people more clever than me could produce presentational and analytical gadgets if there were a one-stop shop for all the APIs and a showcase for what people are doing with these data services. The response from TDWG was lukewarm. I think there's a time and place for development outside the busy work of standards creation. But there were a few very enthusiastic responses from Tim Robertson, Donald Hobern, Lee Belbin, Vince Smith and a few others. It turns out that Markus Döring and the EDIT team in Berlin have been creating something approaching my vision called BD (Biodiversity) Tracker at http://www.bdtracker.net. I just hope they clean it up and extend it to approximate the geekery in programmableweb.com, with some clean-cut recipes for people to dive into using APIs like this. [Aside: Is it just me, or are all the Drupal templates starting to look a little canned and dreary?]

There's plenty more I want to do with this JSON-based PygmyBrowse so if you have ideas or suggestions, by all means drop a comment. Rod wants to contribute this code to an open-source repository & I'll be sure to contribute this as a subproject.

Wednesday, October 3, 2007

The Open Library

I stumbled on an amazing new project led by Aaron Swartz called the Open Library - not to be confused with this Open Library, though there appears to be some resemblance. What strikes me about Aaron's project is that it is so relevant to The Encyclopedia of Life it scares me that I haven't yet heard of it. According to their "About the technology" page:

Building Open Library, we faced a difficult new technical problem. We wanted a database that could hold tens of millions of records, that would allow random users to modify its entries and keep a full history of their changes, and that would hold arbitrary semi-structured data as users added it. Each of these problems had been solved on its own, but nobody had yet built a technology that solved all three together.
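
The three requirements in that passage (tens of millions of semi-structured records, open editing, full history) can be sketched minimally. The class and method names below are illustrative only, not Open Library's actual design:

```python
import copy

class VersionedStore:
    """A toy record store that keeps every past version of every record."""

    def __init__(self):
        self.history = {}  # record id -> list of (editor, snapshot) pairs

    def edit(self, record_id, editor, fields):
        """Apply an edit and append the new state to the record's history."""
        versions = self.history.setdefault(record_id, [])
        current = copy.deepcopy(versions[-1][1]) if versions else {}
        current.update(fields)  # arbitrary keys: semi-structured data
        versions.append((editor, current))

    def latest(self, record_id):
        return self.history[record_id][-1][1]

store = VersionedStore()
store.edit("book:moby-dick", "alice", {"title": "Moby Dick"})
store.edit("book:moby-dick", "bob", {"year": 1851})

print(store.latest("book:moby-dick"))
# {'title': 'Moby Dick', 'year': 1851}
print(len(store.history["book:moby-dick"]))  # 2
```

Because every edit appends rather than overwrites, any past state can be recovered and attributed to its editor, which is exactly the property that makes open editing trustworthy.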

The consequence of all this is that there is a front-facing page for every book, with the option to edit the metadata. All versions and users are tracked. The content of the "About Us" page sounds eerily like E. O. Wilson's proclamations in his 2003 opinion piece in TREE (doi:10.1016/S0169-5347(02)00040-X). For those of you who don't recognize the name, Aaron Swartz is the whiz behind a lot of important functionality we see on the web today. It's also worth reading his multi-part thoughts on the spirit of Wikipedia and why it may soon hit a wall.