Sunday, June 14, 2009

The Community is Dead

This may not be much of a revelation to many, but it is a notion that is sinking home more deeply for me of late. By "Community", I don't necessarily mean the online community, though there are hints of that as well when you think of the MySpace->Facebook->Twitter progression from all-out friend fest to ever more insular & individualistic directions. I mean the taxonomic community.

I lead the LifeDesk application of the Encyclopedia of Life and have been trying to sell the notion of a taxa-centric "community" of taxonomists who want to get their content online in a human- and machine-readable format. Banding together means the workload can be shared, i.e. you gather the images, I'll get the text, she'll get the names in order, he'll get the bibliography, etc. A similar approach lies behind the Scratchpad philosophy. [Aside: there are apparently some who think Scratchpads and LifeDesks are duplicating efforts, but nothing could be further from the truth. Having both means choice and that is a good thing because it strengthens both our directions and is a clear signal to taxonomists that there is something behind this.] While the Scratchpad/LifeDesk community-driven focus may work in a number of situations, it is by no means the rule. Rather, the chances are much greater that taxonomists don't have a taxa-centric community of colleagues to share the workload because, in fact, they may be the only ones in the world working on their chosen taxa. As a result, the majority of Scratchpads and LifeDesks will be "communities" of single individuals. So, I have been thinking a little more deeply about the Scratchpad/LifeDesk direction and think I see a way forward.

The clear signal from the Scratchpad/LifeDesks projects is that folks are doing primarily two things: 1) getting a biblio online, and 2) getting taxonomic names in order. These two activities are largely divorced from one another because the workflow leaves a lot to be desired. Both activities are thankless tasks to begin with regardless of the LifeDesks/Scratchpad environment, which adds further insult to the workflow. Why should these activities be so independent from one another? Here's what the workflow ought to be:

1. Upload PDF reprints
2. Look for a DOI & get the metadata from CrossRef. If none found, prompt with citation form (first check for existing paper in db to cut down on duplicates)
3. Scan the PDFs using TaxonFinder
4. Present flat lists of names found in individual PDFs
5. Drag these into jsTree-based classification manager while retaining the name-reprint link in the background

This is the workflow that makes sense because when building a classification, one necessarily starts with publications, not some mythical list of names.
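Steps 4 and 5 hinge on keeping the name-reprint link alive while names get dragged around. A minimal sketch of that bookkeeping, assuming the name-finding pass hands back a plain list of names per PDF (the file names and the `scan_results` structure are made up for illustration, not anything TaxonFinder or jsTree prescribes):

```python
from collections import defaultdict


def flat_name_lists(scan_results):
    """Step 4: one sorted, de-duplicated list of names per reprint."""
    return {reprint: sorted(set(names)) for reprint, names in scan_results.items()}


def name_to_reprints(scan_results):
    """Background link for step 5: which reprints mention each name."""
    index = defaultdict(set)
    for reprint, names in scan_results.items():
        for name in names:
            index[name].add(reprint)
    return {name: sorted(reprints) for name, reprints in index.items()}


# Hypothetical output of a name-finding pass over two uploaded reprints.
scan_results = {
    "smith_1998.pdf": ["Pardosa moesta", "Pardosa", "Pardosa moesta"],
    "jones_2004.pdf": ["Pardosa", "Lycosidae"],
}

flat = flat_name_lists(scan_results)    # what the user sees per PDF
links = name_to_reprints(scan_results)  # what travels with each dragged name
```

Wherever a name ends up in the classification, `links` can still answer "which reprints did this come from?".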

But...

Does the above make sense in a LifeDesk or a Scratchpad? It could certainly be a cool tool to help lower the bar of entry, but I seriously doubt it would get the traction in the taxonomic "community" that the tool would deserve. Rather, the application is best placed on the desktop as a rich, cross-platform app built in Adobe AIR or a similarly easy development environment. Roll in some BitTorrent capabilities (egads!) and you have the start of a mechanism whereby reprints, names AND classifications may be shared and one could walk among the three in various ways. It would work because taxonomists need reprints and names AND there are plenty of residual names in any one reprint (i.e. of use to someone else). If cleverly constructed, reconciliation of names is an insular exercise that happens on the desktop (as it always has), but the sharing of these reconciliation groups / biblio metadata acts to enhance the findability of reprints.

Here's the challenge then. Build a service that accepts PDF reprints, finds the DOI (if present) & spits back the citation metadata for the article AND all the names (dedup'd and cleaned) they contain. I don't need any more taxonomic intelligence than that. Give it to me in JSON and I can whip up the jsTree-based interface to help individuals build their own reconciliation groups...all linked to reprints in their store.
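To make the ask concrete, here is a rough sketch of what such a response could look like. The field names (`doi`, `citation`, `names`) are my own placeholders, and "cleaned" is assumed to mean nothing more than tidying whitespace left over from PDF line breaks:

```python
import json


def clean_name(raw):
    """Collapse runs of whitespace (e.g. from PDF line breaks) and trim."""
    return " ".join(raw.split())


def build_response(doi, citation, raw_names):
    """Assemble the JSON payload; all field names here are hypothetical."""
    seen = set()
    names = []
    for raw in raw_names:
        name = clean_name(raw)
        # Keep the first occurrence of each name, in document order.
        if name and name not in seen:
            seen.add(name)
            names.append(name)
    return json.dumps({"doi": doi, "citation": citation, "names": names})


payload = build_response(
    doi=None,  # None when no DOI could be found in the PDF
    citation={"title": "Example reprint"},
    raw_names=["Pardosa  moesta", "Pardosa\nmoesta", "Lycosidae", " "],
)
```

Names are left exactly as written apart from whitespace, so any grouping or reconciliation stays with the person building the classification.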

6 comments:

kehan said...

I was going through my reading list and this post was immediately followed by this one, which details an application that does automated entity recognition from scientific articles (albeit in XML format rather than PDFs). The coincidence was strange.

Rod Page said...

Great post! It seems clear that the reality is that we need tools that provide tangible benefits for users, without asking them to modify (much) what they already do.

I am continually baffled as to why major name databases are divorced from the scientific literature. Surely we want the names linked to their publication?

The service you describe would be pretty easy to build. I wonder whether there are parallels with Mendeley, which features automatic extraction of metadata (including bibliographic references) from PDFs, and is aiming to be a Last.fm for research papers.

Markus Döring said...

on the spot, David.
We are currently building a scientific name index for publications here at ECAT/GBIF. Using Lucene and the Apache Tika project we can index various formats while retrieving some embedded document metadata. Unfortunately most PDFs I came across do not contain rich metadata like author, date, DOI, etc. as PDF metadata, so getting hold of the DOI is key. Finding the relevant DOI within the publication could be tricky (it may reference lots of others), but that's a challenge to try for sure.

Once we have that (prototypes are already running), services that expose name checklists in JSON, based on (sets of) user-selected publications, will be yours. We also try to make use of TaxonMatch to lexically group names if you want to remove some dirty names.


One of the tools you might be interested in is Taxon Tagger, developed by Mike Giddens during our last Nomina meeting. It uses a service like uBio's TaxonFinder or our Lucene indexer that marks up names in documents and then allows you to add or remove found names manually. The list of names can then be exported as a CSV file. There is an early version running here using TaxonFinder (make sure to use Firefox; Safari breaks):
http://names.gbif.org/ws/taxontagger/index.html

ECAT is driving Taxon Tagger's development forward to allow the tool to organize the found names in a tree hierarchy instead of a flat list. You can also mark leaf nodes in the trees as being synonyms. And finally you will be able to retrieve the resulting taxonomic tree as a simple flat Darwin Core file.

I am waiting for a new development machine and will update the above installation with new features hopefully this week.

David Shorthouse said...

Markus -

Indeed, PDFs almost never contain any metadata. It'll be a long time before publishers wake up to the many tools one can use to embed metadata into PDFs. In the interim, wouldn't the first mention of "doi:" on p.1 qualify as the DOI for the PDF? Seems the standard flag of honour for publishers is to splash the DOI near the top margin of the PDF on p.1.
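A minimal sketch of that "first doi: on p.1" heuristic, assuming the text of page 1 has already been extracted by some other tool. The pattern only approximates the common "10.<registrant>/<suffix>" shape, and the DOIs in the example are invented:

```python
import re

# "doi:" flag followed by something shaped like 10.<digits>/<suffix>.
DOI_FLAG = re.compile(r"doi:\s*(10\.\d{4,9}/\S+)", re.IGNORECASE)


def first_doi(page_one_text):
    """Return the first flagged DOI on page 1, or None if there is none."""
    match = DOI_FLAG.search(page_one_text)
    if match is None:
        return None
    # Drop punctuation that often trails a DOI printed in running text.
    return match.group(1).rstrip(".,;)")


# Invented page-1 text: the article's own DOI appears before a cited one.
page_one = (
    "Journal of Examples 12: 1-10\n"
    "doi:10.1234/example.5678\n"
    "... as shown previously (doi:10.9999/other.1)."
)
found = first_doi(page_one)
```

It will pick the wrong DOI whenever a cited DOI is printed before the article's own, so treat a hit as a candidate to confirm rather than a certainty.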

As for lexical groups and beefy, "you do the thinking for me" services, I'm afraid I'm more for simplicity. I just want the names as written in the document with minimal massaging. Classifications, as I have come to appreciate them, are highly personal and of little value to outsiders.

Vince said...

I would not write off the community just yet. The reason why community sites struggle is that we have technically failed to deliver enough personal benefit to the individual contributors to justify their individual efforts within the community. The "community of one" Scratchpads and LifeDesks succeed because the single author receives all the credit. I even have some users of single-author Scratchpads who have removed the login from their front page because the authors say "others have the audacity to try and login and contribute"! But there are many Scratchpads that genuinely are community built. These usually have a more specific focus or goal that other contributors buy into (e.g. the society sites). We are embarking on a sociological study of Scratchpad maintainers to understand more about the dynamics of these sites.

bob said...

Rod's comment "tools that provide tangible benefits to users" is spot on, and it's echoed by Vince's comment about the need to "deliver enough personal benefit to the individual contributors to justify their individual efforts within the community."

I think about these two topics quite a lot. The concept of "value" starts with the user of the product, not the developer of the product. Practicing "outside-in" development, as you suggest, should result in better outcomes for everyone involved. Thanks for the article.