Sunday, June 14, 2009

The Community is Dead

This may not be much of a revelation to many, but it is a notion that is sinking home more deeply for me of late. By "Community", I mean the taxonomic community. I don't necessarily mean the online community, though there are hints of that as well when you think of the MySpace->Facebook->Twitter progression from all-out friend fest to ever more insular & individualistic directions.

I lead the LifeDesk application of the Encyclopedia of Life and have been trying to sell the notion of a taxa-centric "community" of taxonomists who have a desire to get their content online in a human and machine readable format. Banding together means the workload can be shared: you gather the images, I'll get the text, she'll get the names in order, he'll get the bibliography, and so on. This is similar to the approach behind the Scratchpad philosophy. [Aside: there are apparently some who think Scratchpads and LifeDesks are duplicating efforts, but nothing could be further from the truth. Having both means choice and that is a good thing because it strengthens both our directions and is a clear signal to taxonomists that there is something behind this.] While the Scratchpad/LifeDesk community-driven focus may work in a number of situations, it is by no means the rule. Rather, the chances are much greater that taxonomists don't have a taxa-centric community of colleagues to share the workload because, in fact, each may be the only one in the world working on a chosen taxon. As a result, the majority of Scratchpads and LifeDesks will be "communities" of single individuals. So, I have been thinking a little more deeply about the Scratchpad/LifeDesk direction and think I see a way forward.

The clear signal from the Scratchpad/LifeDesk projects is that folks are doing primarily two things: 1) getting a biblio online, and 2) getting taxonomic names in order. These two activities are largely divorced from one another because the workflow leaves a lot to be desired. Both are thankless tasks to begin with, regardless of the Scratchpad/LifeDesk environment, and the disjointed workflow only adds insult to injury. Why should these activities be so independent from one another? Here's what the workflow ought to be:

1. Upload PDF reprints
2. Look for a DOI & get the metadata from CrossRef. If none found, prompt with citation form (first check for existing paper in db to cut down on duplicates)
3. Scan the PDFs using TaxonFinder
4. Present flat lists of names found in individual PDFs
5. Drag these into jsTree-based classification manager while retaining the name-reprint link in the background

This is the workflow that makes sense because when building a classification, one necessarily starts with publications, not some mythical list of names.
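Step 2 of the list above can be sketched in a few lines. This is only an illustration: the DOI pattern is a loose heuristic rather than the official syntax, the function names are made up, and the duplicate check is keyed on DOI alone, which is a simplification of "check for existing paper in db". It assumes the text has already been pulled out of the PDF by some other tool.

```python
import re

# Loose, heuristic pattern for DOI-like strings: "10.", a registrant
# code, a slash, then a suffix. Not the full official DOI syntax.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')


def find_doi(text):
    """Return the first DOI-like string in the text, or None."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Trailing punctuation usually belongs to the sentence, not the DOI.
    return match.group(0).rstrip(".,;)")


def next_step(text, known_dois):
    """Decide what to do with one reprint (hypothetical step names).

    known_dois is a set of lowercased DOIs already in the database.
    """
    doi = find_doi(text)
    if doi is None:
        return ("prompt_for_citation", None)
    if doi.lower() in known_dois:
        return ("already_in_db", doi)
    return ("lookup_metadata", doi)
```

A reprint with no DOI falls through to the citation form, and one whose DOI is already stored is flagged before any metadata lookup is attempted.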

But...

Does the above make sense in a LifeDesk or a Scratchpad? It could certainly be a cool tool to help lower the bar of entry, but I seriously doubt it would get the traction in the taxonomic "community" that the tool would deserve. Rather, the application is best placed on the desktop as a rich, cross-platform app in Adobe AIR or a similarly easy environment to develop in. Roll in some BitTorrent capabilities (egads!) and you have the start of a mechanism whereby reprints, names AND classifications may be shared and one could walk among the three in various ways. It would work because taxonomists need reprints and names AND there are plenty of residual names in any one reprint (i.e. of use to someone else). If cleverly constructed, reconciliation of names is an insular exercise that happens on the desktop (as it always has) but the sharing of these reconciliation groups / biblio metadata acts to enhance the findability of reprints.

Here's the challenge then. Build a service that accepts a PDF reprint, finds the DOI (if present) & spits back the citation metadata for the article AND all the names (dedup'd and cleaned) it contains. I don't need any more taxonomic intelligence than that. Give it to me in JSON and I can whip up the jsTree-based interface to help individuals build their own reconciliation groups...all linked to reprints in their store.
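To make the request concrete, here is one possible shape for what such a service might hand back. Everything here is an assumption for illustration: the field names ("doi", "citation", "names"), the function names, and the idea that "cleaned" means nothing more than collapsing whitespace and dropping case-insensitive duplicates.

```python
import json


def clean_names(raw_names):
    """Collapse whitespace and drop duplicates, keeping first-seen order."""
    seen = set()
    cleaned = []
    for raw in raw_names:
        name = " ".join(raw.split())
        key = name.lower()  # treat names differing only in case as one
        if name and key not in seen:
            seen.add(key)
            cleaned.append(name)
    return cleaned


def build_response(doi, citation, raw_names):
    """Assemble a JSON payload for one reprint (hypothetical field names)."""
    payload = {
        "doi": doi,            # None when no DOI was found
        "citation": citation,  # whatever metadata was retrieved or entered
        "names": clean_names(raw_names),
    }
    return json.dumps(payload, indent=2)
```

A flat "names" list like this is all a tree widget would need as its starting pool of draggable items; the link back to the reprint rides along in the same payload.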

Thursday, December 18, 2008

Cooliris on Eight Legs


For well over a year, I have been serving MediaRSS feeds from the Nearctic Spider Database (before Yahoo and Flickr!) and I am overjoyed to see that all the big guys are jumping on this extension to RSS 2.0. One in particular that blows me away is Cooliris, a plug-in for all modern browsers that allows one to navigate MediaRSS feeds in 3D. So, if you haven't yet downloaded and installed Cooliris, you may do so HERE. Then, you're welcome to see the feed of spiders HERE.

Maybe I better take a second look at RSSBus...

Monday, November 3, 2008

Little E's

Because I work for the Encyclopedia of Life (EOL) and because I can tinker on the Nearctic Spider Database, I have the opportunity to try out various approaches to help mobilize data. One thing that concerns me about the current relationship between EOL and its content partners is their near 1:1 relationship. In other words, content partners that come onboard are encouraged to represent their data in one potentially massive XML document much like a Google Sitemap. More information on what EOL would like to see future content partners produce can be found HERE. A potential outside consumer of these data will have no idea where to retrieve this XML document. Thus, the relationship between EOL and its content partners is closed. That is, until EOL releases some web services.

So, in an effort to help expose the data structure EOL is looking for, I made a link on every one of the species pages in the Nearctic Spider Database. Upon clicking these "little e's", you can catch a glimpse of what EOL is hoping its content partners will produce. These "little e's" don't really help the relationship between EOL and its array of content partners, nor do they ease the effort on the part of content partners to make these documents, nor do they help us at EOL. So what's the point? What they do is share what I produced for EOL. If you can parse the data behind the "little e's", you can parse the big XML "sitemap" document I made for EOL as well.

The problem with sitemaps is that no one but the harvester knows where these sitemaps can be found. A Google sitemap for instance can be found in any folder on a website that shares a sitemap (but is usually in the root folder and is accessed as /sitemap.xml or /sitemap.gz). This is the same situation for EOL and its content partners; the "sitemap" can be found anywhere.

To finish off the "little e" approach, each page should have a link to the EOL content partner sitemap document, which in turn links to all pages with "little e's". This would be somewhat similar to an OpenSearch document, which holds instructions on how to make use of the search feed(s) available on a site. And of course, there should be a JSON option as a lighter-weight alternative to XML.

But, to make this of any use at all, we need a desktop reader like an RSS reader...something with the ability to shunt the data into the correct spot within a rich GUI-based classification (with some degree of certainty), thus forcing us to eventually develop far better online tree browsers. With all the bits described above, you'd come across a species page, click a button like an RSS feed button, download a sitemap containing a list of all species pages on the site you landed on, then browse through the content the way you want it organized.

Saturday, October 18, 2008

Google Charts...Wow

Kevin Pfeiffer, an avid participant in the Nearctic Arachnologists' Forum, finally got me to do something about the Flash-based charts on the species pages in the Nearctic Spider Database. While these older charts were great at the time, they've had their day. So, in light of the sparklines that Rod Page integrated into a "Biodiversity Service Status" pinger, I thought I'd take a closer look at Google Charts. Wow. The added plus for this service is the truly stellar documentation.

Rather than using a terribly long URL to get the PNG for the chart, I used a proxy. This way, I can pass the identifier to a local script that then grabs the image and dumps it on the page. And, I can give the chart a file name of my choosing.
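The proxy idea boils down to keeping the long chart URL on the server side. A rough sketch follows, with the caveat that the endpoint and the `cht`, `chs` and `chd` parameters are written from memory of the Google Chart API of the time (a service that has since been deprecated), and the function names and file-naming scheme are invented for illustration. The actual fetching of the image is left out.

```python
from urllib.parse import urlencode

# Endpoint as remembered from the Google Chart API; treat as an assumption.
CHART_ENDPOINT = "http://chart.apis.google.com/chart"


def chart_url(values, width=300, height=120):
    """Build the long chart URL that the proxy hides from the page."""
    params = {
        "cht": "lc",                                     # chart type: line chart
        "chs": f"{width}x{height}",                      # size in pixels
        "chd": "t:" + ",".join(str(v) for v in values),  # text-encoded data
    }
    return CHART_ENDPOINT + "?" + urlencode(params)


def local_chart_name(identifier):
    """File name of my choosing for the proxied image (hypothetical scheme)."""
    return f"chart-{identifier}.png"
```

The page then only ever references something short like a local script plus an identifier, and the script resolves the identifier to values, builds the URL above, and streams the PNG back under its own name.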

Sunday, October 12, 2008

Long Tail of Biodiversity

At last count on the World Spider Catalog, there are 4345 species in the spider family Linyphiidae. This is second only to the jumping spiders. The latter are primarily tropical and subtropical, but linyphiids are predominantly found in the northern hemisphere, where, coincidentally, most of the world's arachnid systematists are also found. And, of course, there's very little accessible information on most of these species either in print or on the web. A few notable exceptions are Tanasevitch's Linyphiid Spiders of the World, which contains flat lists of names organized in various ways, and the ever popular BugGuide gallery (few of whose images are identified to species). There is a smattering of other resources out there, but they are all hard to find. Both the Tree of Life and the Encyclopedia of Life have the equivalent of stub pages, so neither of these is particularly helpful.


Nina Sandlin, an Associate of Zoology at the Field Museum in Chicago, has recently begun unlocking these hidden gems. She has been building LinEpig, a photo gallery of linyphiid epigyna on Picasa Web Albums. Like most other online work on arachnids, LinEpig is built with love for the organisms and no budget (correct me if I'm wrong Nina!). While taking images of the epigyna, Nina graciously shared the habitus images with the Nearctic Spider Database. When I was in Chicago recently, I chatted with Nina about Picasa. It comes close to what she wanted, but it falls short in a number of areas. The most important in my opinion is findability. Sure, she can tag her images with names, but her gallery is poorly exposed on Google and other search engines. However, there are some features in Picasa that make it attractive. It is relatively easy to upload, manage, and geotag - though the latter could evidently use text boxes if one already has coordinates on hand. Most importantly, the interface is clean, responsive and uncluttered.

Now the long tail...

Prior to Nina's efforts, there was very little (if any) linyphiid imagery on the web, especially the specialized images of the epigyna, which are a lot more useful than the habitus images. If you've seen one linyphiid, you've pretty much seen them all (a few exceptions of course). They are remarkably similar in shape & size, but their sexual characters, especially the male's, are dramatically different. The big biodiversity aggregators like the Encyclopedia of Life have positioned themselves to present low hanging fruit. That is, show the furry charismatic megafauna (or fish) because there are many resources serving this sort of content. But, why? Wouldn't it make sense to instead provide better and more useful tools for folks like Nina to create and organize content for which there is either nothing or very little available elsewhere? Let's hope that in time, LifeDesk will provide a ladder for consumers of content generated there to reach out to the furthest branches and leaves where are found all the curiosities. But first, it'll have to contain tools and functionality useful for folks like Nina and for others to jump in and give her a hand.

Friday, July 25, 2008

Show Me...Crab Spiders on Bark



One of the DarwinCore elements for specimen and observation data is "habitat". To my knowledge, not a lot has been done with these data. Either there are actually few records cached at GBIF that have this field filled or the data are in such a mess as to be (mostly) unusable. I certainly hope it's not the latter. No matter how messy, there is still a wealth of information here if one takes the time to sift through it. The data are not unlike folksonomies and someone with more patience than me could probably develop a natural classification of these terms.

Faceted search is a first crack at making these data useful, because there are certainly more trajectories into the data with them than without. For a first cut at this, I pulled 30 random contributed specimen records in the Nearctic Spider Database for each species and merely display the full contents on the species pages. Then, I index the pages as always using my trusty Zoom Search. Voila, a quick way to do some faceted searches. It's not perfect, but it's better than nothing. Where "crab spider bark" or "wolf spider beach" once produced no search results, there are now 5 and 17 results returned, respectively. Incidentally, Flickr produced 13 and 18 results, respectively, but many images are useless.
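The gist of that trick can be sketched without Zoom Search at all. The following is a toy stand-in, not what runs on the site: the data layout, the function names, and the naive "every term must appear somewhere in the page text" matching are all assumptions, and the sample records are made up.

```python
import random


def build_index(pages, per_species=30, seed=None):
    """Append up to `per_species` random habitat strings to each page's text.

    `pages` maps a species name to (page_text, list_of_habitat_strings).
    """
    rng = random.Random(seed)
    index = {}
    for species, (page_text, habitats) in pages.items():
        sample = rng.sample(habitats, min(per_species, len(habitats)))
        index[species] = (page_text + " " + " ".join(sample)).lower()
    return index


def search(index, query):
    """Return species whose indexed text contains every query term."""
    terms = query.lower().split()
    return sorted(
        species
        for species, text in index.items()
        if all(term in text for term in terms)
    )


# Made-up records, purely for illustration.
pages = {
    "Xysticus sp.": ("crab spider", ["under bark of dead pine", "leaf litter"]),
    "Pardosa sp.": ("wolf spider", ["sandy beach", "cobble shoreline"]),
}
index = build_index(pages)
print(search(index, "crab spider bark"))
```

Substring matching like this is crude (a search for "bark" would also hit "embark"), which is one more reason to hand the indexing to a proper search engine once the habitat text is on the page.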

Sunday, July 20, 2008

Green Porno

I couldn't resist sharing these. Pure genius. Kudos to Isabella Rossellini.