Tuesday, July 24, 2007

The Next Scientific Revolution?

With the advent of Science Commons (see the popsci.com article) and cool new tools available at Nature like its Precedings, opportunities to share research outside for-profit business models (i.e. journals) are starting to gain momentum. Systems like Neurocommons allow for a redrawing of money transfer trajectories around and within publishing firms such that everyone can play nice in the sandbox. Medical research sells no matter how you redraw the lines or rebuild the castle. But I wonder if the discussions around these "revolutions" have attempted to include pure research, whose business model is based entirely on public funds and on funding institutions that receive their monies from government pots. How do you sell Science Commons to mathematicians, theoretical physicists, systematists, ecologists, and other scientists whose research is not closely tied to big bucks pharmaceutical companies and a Gates Foundation? Repositories like iBridge are useful for niche markets and interests, but I'd hardly call them revolutionary for all of science. If proponents of the Semantic Web want to sell their ideas, I'd like to see buy-in by one tenured ecologist & then some demonstrable evidence for how this will accelerate his/her research.

Thursday, July 19, 2007

Biodiversity on the Desktop

In earlier posts, I described my vision of an Encyclopedia of Life (EoL) where the web and your desktop environment are blurred or mashed together. I envisioned a tool where users and contributors to EoL could maintain either a public or private working space to mix and mash data from multiple sources with those on their own hard drives.

One of the grandiose dreams I had was the ability to create a private, content management-like community within EoL in which co-authors of a proposed manuscript like a taxonomic revision could merge their datasets, grab data from third party sources if useful (e.g. GenBank, GBIF, etc.), and collectively work on the manuscript. Upon completion of the manuscript, there may be some elements of use to EoL that the authors could later push out. Doing so would in no way diminish the value of their publication or cause editors to frown and reject the paper, but it would offer immediate value to the public at large. Granted, I'm not totally clear on how this will work, but I really don't think it would be completely unrealistic. What I can't stress enough is that EoL cannot be like WikiSpecies, where contributors have to sit and author content solely for use in WikiSpecies. Rather, it absolutely must have a system that can somehow slip into the workflow of biologists. I'm now convinced that such a vision is not science fiction.

I watched a few Adobe AIR (Adobe Integrated Runtime, code named Apollo) videos this morning and got thoroughly fired-up about the possibilities. Google has been working on an offline API, which looks pretty good, but AIR blows all this out of the water because it allows a developer to maintain an application's presence on the desktop. Further, it can be integrated into the operating system. So, imagine working on a manuscript with a handful of colleagues and being notified with a desktop toast pop-up when one of them has completed his/her revisions or sections of the manuscript, or when a dataset in EoL suddenly becomes available while the manuscript is being prepared.

The really cool thing from a web developer's perspective is that AIR uses existing web programming tools, i.e. the learning curve to create these cross-platform applications is quite shallow because code can be repurposed for either the web or the desktop...at least that's the impression I get from watching a few of these videos. Here's one such video where Christian Cantrell demonstrates a few simple applications built on AIR...just mentally substitute Amazon for EoL and you get my drift.

Now, all this stuff is really cool and there is indeed the potential for EoL or the totally unknown "MyEoL" to truly transform the science of biology because it can & should slip into biologists' workflows. Of course, there's no reason why multiple flavours of these AIR desktop applications couldn't be built for children, amateur naturalists, scientists, or however EoL sees its user base. In fact, what I'd really like to see is a BioBlitz class of user where observational data can be merged across interest groups at any geographic scale in their respective desktop applications. An Encyclopedia is great, but we ultimately want people to use the encyclopedia and go outside with fellow human beings to look at, count, touch, or otherwise experience life on our planet. Each flavour of the desktop application can be customized to do different chores or expose different aspects of the EoL system to various classes of end users. The first step then is to nail the needs and design these applications around them. So, here are a few questions to kick-start this process, at least for biologists/systematists who are excited about EoL but just don't believe there will ever be time for them to contribute content:

1. When conducting a taxonomic revision, what are the significant organizational and communications impediments that have slowed down the process?
2. What data sets are vital to conducting an effective revision?
3. What desktop software applications are critical when conducting a revision? [remember, because AIR is on the desktop, there may be an opportunity here to automate file type transformations produced by one provider/database/spreadsheet to the applications required to actually analyze the data]

#2 above will be a challenge and will require that data providers produce similarly structured APIs (e.g. DiGIR, TAPIR, OpenSearch, and the like). This, I believe, is where EoL has to take firm leadership. What it needs now is a demo application like that produced by Adobe and Christian Cantrell so we can all visualize this dream. Currently, the EoL dream is very fuzzy and has led to a lot of miscommunication and confusion (e.g. isn't this just WikiSpecies?). THEN, EoL absolutely must write some recipes for content providers to start sharing their data, much like what Google does with its Google Maps API. The selling point will be iconographic and link-back attribution for content providers, which could already be constructed if all content providers used a system like OpenSearch, MediaRSS, GeoRSS, and various other RSS flavours to open their doors.
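To make the attribution idea a little more concrete, here is a minimal sketch in Python of how an aggregator might pull a link-back, an icon, and a location out of a provider's RSS feed. The feed, the provider name, and the URLs are all made up for illustration, and the function name `parse_items` is hypothetical. Only the GeoRSS and MediaRSS namespace URIs and element names (`georss:point`, `media:thumbnail`) come from those specifications.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by GeoRSS-Simple and MediaRSS.
NS = {
    "georss": "http://www.georss.org/georss",
    "media": "http://search.yahoo.com/mrss/",
}

# Hypothetical provider feed; every name and URL in it is invented.
SAMPLE_FEED = """<rss version="2.0"
    xmlns:georss="http://www.georss.org/georss"
    xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <title>Example Museum Specimens</title>
    <link>http://museum.example.org/</link>
    <item>
      <title>Aus bus, specimen 123</title>
      <link>http://museum.example.org/specimens/123</link>
      <georss:point>45.42 -75.70</georss:point>
      <media:thumbnail url="http://museum.example.org/icon.png"/>
    </item>
  </channel>
</rss>"""


def parse_items(feed_xml):
    """Return one dict per feed item with the fields needed for attribution."""
    root = ET.fromstring(feed_xml)
    channel = root.find("channel")
    provider = channel.findtext("title")
    records = []
    for item in channel.findall("item"):
        # georss:point holds "latitude longitude" separated by whitespace.
        point = item.findtext("georss:point", default="", namespaces=NS).split()
        thumb = item.find("media:thumbnail", NS)
        records.append({
            "provider": provider,
            "title": item.findtext("title"),
            "link_back": item.findtext("link"),
            "lat_lon": tuple(float(v) for v in point) if len(point) == 2 else None,
            "icon": thumb.get("url") if thumb is not None else None,
        })
    return records


records = parse_items(SAMPLE_FEED)
```

Each record carries the provider's name, a link back to the original page, and an icon URL, which is all a front end would need to display the attribution described above.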

Wednesday, July 4, 2007

Digital Species Descriptions & the new GBIF portal

The Biodiversity Information Standards (previously known as TDWG, the Taxonomic Databases Working Group) has recently rolled out a new subgroup called "Species Profile Model" led by Éamonn Ó Tuama (GBIF). Thirteen people attended a workshop April 16-18, 2007 in Copenhagen, DK shortly after the Encyclopedia of Life informatics workshop in Woods Hole, MA. The point of this Copenhagen workshop was to hash out a specification to support the retrieval and integration of data with the lofty goal of "reaching consensus and avoiding fragmentation" of existing species-level initiatives. I'm all for this, but I wonder if it will work. I believe "consensus" as it's described here is meant to be a common way of presenting the data rather than a true taxonomic, ecological, or political consensus. A specification does not preclude the possibility of several variants of a species profile being served from multiple providers (or even the same one). These could of course have conflicting or dated information and ultimately result in misleading COSEWIC-type recommendations. So, what about consensus as we usually define it? Or is that beyond the responsibility of this subgroup?

A standard for specimen data (Darwin Core, ABCD, etc.) is obvious, but I'm not convinced that a standard for species descriptions is wise unless such a standard were developed and solely hosted by the nomenclators and sanctioned by the various Codes. A standard for species descriptions without ties to the nomenclators and the authors who conducted the original species description or revision merely democratizes fluff. Before a standard Species Profile Model is put into practise, such RDF representations have to at least explicitly incorporate peer review, authorship, and a date stamp.
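As a rough illustration of that last requirement, the sketch below models a species profile that cannot be treated as trustworthy until it carries authorship, peer-review status, and a date stamp. This is plain Python rather than RDF, and the class and field names (`SpeciesProfile`, `provenance_problems`, and so on) are hypothetical. They are not taken from the actual Species Profile Model.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class SpeciesProfile:
    """Hypothetical species profile record with explicit provenance fields."""
    scientific_name: str
    description: str
    authors: Tuple[str, ...]      # who wrote the description or revision
    peer_reviewed: bool           # has the text passed peer review?
    date_stamp: Optional[date]    # when the description was published


def provenance_problems(profile: SpeciesProfile) -> List[str]:
    """List the provenance requirements that a profile fails to meet."""
    problems = []
    if not profile.authors:
        problems.append("no authorship")
    if not profile.peer_reviewed:
        problems.append("no peer review")
    if profile.date_stamp is None:
        problems.append("no date stamp")
    return problems


# A complete record and an anonymous, undated one (both invented).
vetted = SpeciesProfile("Aus bus", "Example description.",
                        ("A. Author",), True, date(2007, 7, 4))
fluff = SpeciesProfile("Aus bus", "Example description.", (), False, None)
```

A consumer could refuse to display, or at least flag, any profile for which `provenance_problems` returns a non-empty list.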

I also noticed that the Species Profile Model is attempting to integrate citations to scientific literature. I suggest the team take a good close look at OpenURL, which lends itself to useful functionality when building lists of references in front-end applications (see Rod Page's post in iPhylo on this very subject and several posts in this blog). The OpenURL format will influence how the elements in the proposed Species Profile Model ought to be constructed.
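For readers who have not met OpenURL, here is a minimal sketch of how a journal citation could be turned into an OpenURL 1.0 (Z39.88-2004) link using the key/encoded-value journal format. The `rft.*` keys are the standard ones. The resolver address, the citation itself, and the input field names (`article_title`, `journal`, etc.) are assumptions made up for this example.

```python
from urllib.parse import urlencode

# Assumed citation field names (left) mapped to standard KEV journal keys (right).
FIELD_TO_KEY = {
    "article_title": "rft.atitle",
    "journal": "rft.jtitle",
    "author_last": "rft.aulast",
    "author_first": "rft.aufirst",
    "year": "rft.date",
    "volume": "rft.volume",
    "issue": "rft.issue",
    "start_page": "rft.spage",
    "end_page": "rft.epage",
    "issn": "rft.issn",
}


def build_openurl(resolver_base, citation):
    """Build an OpenURL 1.0 link for a journal article from a citation dict."""
    params = [
        ("url_ver", "Z39.88-2004"),
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
        ("rft.genre", "article"),
    ]
    # Only emit keys for the fields this citation actually has.
    for field, key in FIELD_TO_KEY.items():
        value = citation.get(field)
        if value:
            params.append((key, str(value)))
    return resolver_base + "?" + urlencode(params)


# Hypothetical resolver and citation, for illustration only.
example_url = build_openurl(
    "https://resolver.example.org/openurl",
    {
        "article_title": "A revision of the genus Aus",
        "journal": "Journal of Examples",
        "author_last": "Author",
        "year": 2007,
        "volume": 12,
        "start_page": 101,
    },
)
```

The point for the Species Profile Model is that a citation stored as separate elements (title, journal, volume, pages) maps straight onto these keys, whereas a citation stored as a single free-text string does not.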

On other fronts, GBIF just rolled out their new portal: http://data.gbif.org. It looks as if the whole index and back-end were reconstructed, and some provider data tables are still missing. In time, these will probably blink on as they were presented via the old portal. What I appreciate seeing for the first time is a concerted effort to give providers some auto-magic feedback about what is being served from their boxes. Vetting data is a very important part of federation and I hope providers sit up and take notice. GBIF calls these "event logs", which is too obscure. I'd like to see this called "Questionable Data Served from this Provider", "Problem Records", "The Crap You're Serving the Scientific Community", or something similar. "Event logs" is easily dismissed and overlooked. For example, here are the event logs for the University of Alaska Museum of the North Mollusc Collection: http://data.gbif.org/datasets/resource/967/logs/. GBIF also has a flashy new logo & plenty of easy to use web services.