Tuesday, October 30, 2007

Taxonomic Consensus as Software Creation

It occurred to me today that reaching taxonomic consensus, or developing a master database of vetted names like that undertaken by the Catalogue of Life Partnership (CoLP), is not unlike software development, which necessarily requires some sort of framework to manage versioning. However, taxonomic activities and checklist building do not currently have such a development framework. We likely have a set of rules and guidelines, but infighting and bickering no doubt fragment interest groups, which ultimately leads to the stagnation, abandonment, and eventual distrust of big projects like CoLP. We have organizations like the International Commission on Zoological Nomenclature to manage the act of naming animals, but there is nothing concrete at the other end to actually organize the names. Publications are merely the plums in a massive bowl of pudding, and it is equally frustrating to actually find them. One way to approach a solution is to equate systematics with perpetual software development, where subgroups manage branches of the code and occasionally perform commits to (temporarily) lock the code. As with software development, groups of files (i.e. branches on the tree of life) and the files themselves (i.e. publications, images, genomic data, etc.) ought to be tracked with unique identifiers and time-stamps. This would be a massively complex shift in how taxonomic business is conducted, but what other solution is there?

Without really understanding distributed environments in software development...it's too geeky for me...I spent a few moments watching a Google TechTalk presentation by Randal Schwartz, delivered at Google on October 12, 2007, about Git, a project spearheaded by Linus Torvalds: http://video.google.com/videoplay?docid=-1019966410726538802 (sorry, embedding has apparently been disabled by request).

There are some really interesting parallels between distributed software development environments like Git and what we ought to be working toward in systematics, especially as we move toward using Life Sciences Identifiers (LSIDs). Here are a few summarized points from Randal's presentation:

  • Git manages changes to a tree of files over time
  • Optimized for large file sets and merges
  • Encourages speculation with the construction of trial branches
  • Anyone can clone the tree, make & test local changes
  • Uses "Universal Public Identifiers"
  • Has multi-protocol transport like HTTP and SSH
  • One can navigate back in time to view older trees of file sets via universal public identifiers
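
To make the parallel a little more concrete, here is a rough sketch in Python of how a checklist could be versioned in the spirit of the points above. It is only an illustration under several assumptions: the `Store` class, its method names, the `doc:hypothetical-…` identifiers, and the placeholder names "Aus bus" and "Aus cus" are all invented for this example, and the identifiers are SHA-1 digests of JSON text, which is not the same encoding Git uses for its own objects. The idea it borrows is simply that every taxon and every commit is named by a hash of its content, so any change to a leaf yields a new identifier for every node above it, while the older identifiers still point to the older tree.

```python
import hashlib
import json


def object_id(payload: dict) -> str:
    """Return a SHA-1 hex digest of a canonical JSON encoding of the payload."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


class Store:
    """A toy content-addressed store: each object is keyed by the hash of its content."""

    def __init__(self):
        self.objects = {}

    def put(self, payload: dict) -> str:
        oid = object_id(payload)
        self.objects[oid] = payload
        return oid

    def put_taxon(self, name, children=(), documents=()):
        # Children are referenced by identifier, so a parent's identifier
        # changes whenever anything beneath it changes.
        return self.put({
            "name": name,
            "children": sorted(children),
            "documents": sorted(documents),
        })

    def commit(self, root, parent, author, timestamp, message):
        # A commit pins one root taxon, a time-stamp, and the previous commit.
        return self.put({
            "root": root,
            "parent": parent,
            "author": author,
            "timestamp": timestamp,
            "message": message,
        })

    def history(self, commit_id):
        # Walk back in time from a commit to the first one.
        while commit_id is not None:
            commit = self.objects[commit_id]
            yield commit_id, commit
            commit_id = commit["parent"]


store = Store()

# First version of a (placeholder) genus with one species.
sp1 = store.put_taxon("Aus bus", documents=["doc:hypothetical-1"])
genus_v1 = store.put_taxon("Aus", children=[sp1])
c1 = store.commit(genus_v1, None, "group-a", "2007-10-30T12:00:00Z", "Initial checklist")

# A second species is added; the genus gets a new identifier, the old one remains.
sp2 = store.put_taxon("Aus cus", documents=["doc:hypothetical-2"])
genus_v2 = store.put_taxon("Aus", children=[sp1, sp2])
c2 = store.commit(genus_v2, c1, "group-a", "2007-11-15T09:30:00Z", "Add Aus cus")

for commit_id, commit in store.history(c2):
    print(commit_id[:8], commit["timestamp"], commit["message"])
```

Because identifiers are derived from content, two groups that independently record the same taxon with the same children and documents would arrive at the same identifier, and looking up `genus_v1` after the second commit still returns the one-species version of the genus.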

With a cross-platform solution and a facile user interface, perhaps thinking in these terms will help engage taxonomists and will ultimately lead to a ZooBank global registry of new taxon names.


kehan said...

An interesting take, I must say. I know that IPNI does take a CVS approach to their names. What you're talking about goes many steps further and tries to take into account changes throughout the whole classification. I know that Git has a reputation of being extremely complex - Bazaar (bzr) also follows the distributed paradigm and has a lot of support from Mark Shuttleworth of Ubuntu fame. Further reading on distributed vs central version control includes "The Cathedral and the Bazaar", which is also available as a book from O'Reilly Media.

Randal L. Schwartz said...

Thanks for referencing my presentation on "git" at Google.

David Shorthouse said...

Kehan - You're absolutely right. What I propose (and I'll be the first to admit that I have my head where the sun don't shine) would be frighteningly complex. But, I think a lot of that complexity is a legacy of our own making. We never had anything like git. The conceptual, working model was and still is the publication & the blind faith that someday, someone will organize it. It's shameful that we still work in a largely insular environment like this.
Randal - thanks for stopping by! The work you guys are doing blows me away. I have never used any of it...and I hope you can see me blush as I write this :)~ But, I'd love to get you in a room with cybertaxonomists to hash out a simple, workable plan that could put a fire under current global checklist or name harvesting efforts.

Christopher Taylor said...

I've put a link to this post up at Linnaeus' Legacy.