Tuesday, October 30, 2007

Taxonomic Consensus as Software Creation

It occurred to me today that the process of reaching taxonomic consensus, or of developing a master database of vetted names like that undertaken by The Catalogue of Life Partnership (CoLP), is not unlike software development, which necessarily requires some sort of framework to manage versioning. Taxonomic activities and checklist-building, however, currently have no such development framework. We likely have a set of rules and guidelines, but infighting and bickering no doubt fragment interest groups, which ultimately leads to the stagnation, abandonment, and eventual distrust of big projects like CoLP. We have organizations like the International Commission on Zoological Nomenclature to manage the act of naming animals, but there is nothing concrete out the other end to actually organize the names. Publications are merely the plums in a massive bowl of pudding. And it is equally frustrating to actually find these publications. One way to approach a solution is to equate systematics with perpetual software development, where subgroups manage branches of the code and occasionally perform commits to (temporarily) lock the code. As with software development, groups of files (i.e. branches on the tree of life) and the files themselves (i.e. publications, images, genomic data, etc.) ought to be tracked with unique identifiers and time-stamps. This would be a massively complex shift in how taxonomic business is conducted, but what other solution is there?

Without really understanding distributed environments in software development...it's too geeky for me...I spent a few moments watching a Google TechTalk presentation by Randal Schwartz delivered at Google October 12, 2007 about Git, a project spearheaded by Linus Torvalds: http://video.google.com/videoplay?docid=-1019966410726538802 (sorry, embedding has apparently been disabled by request).

There are some really interesting parallels between distributed software development environments like Git and what we ought to be working toward in systematics, especially as we move toward using Life Sciences Identifiers (LSIDs). Here are a few summarized points from Randal's presentation:

  • Git manages changes to a tree of files over time
  • Optimized for large file sets and merges
  • Encourages speculation with the construction of trial branches
  • Anyone can clone the tree, make & test local changes
  • Uses "Universal Public Identifiers"
  • Has multi-protocol transport like HTTP and SSH
  • One can navigate back in time to view older trees of file sets via universal public identifiers


With a cross-platform solution and a facile user interface, perhaps thinking in these terms will help engage taxonomists and will ultimately lead to a ZooBank global registry of new taxon names.

Thursday, October 25, 2007

Buying & Selling DOIs...and the same for specimens

A previous post of mine described the business model for digital object identifiers in admittedly simplistic terms. But perhaps I should back up a second. Just what the heck is a DOI and why should the average systematist care? [Later in this post, I'll describe an interesting business model for biodiversity informatics]

Rod Page recently wrote a post in iPhylo that does a great job of selling the concept. Permit me to summarize and to add my own bits:

  1. DOIs are strings of numbers & letters that uniquely identify something in the digital realm. In the case of published works, they uniquely identify that work.
  2. DOIs are resolvable and can be made actionable. i.e. you can put http://dx.doi.org/ in front of a DOI and, through the magic of HTTP, you get redirected to the publisher's offering or the PDF or HTML version of the paper
  3. DOIs have metadata. If you have for example a citation to a reference, you can obtain the DOI. Conversely, if you have a DOI, you can get the metadata
  4. DOIs are a business model. Persistent URLs (championed by many) are not a business model because there is no transfer of funds & confidences


Systematists have lamented that their works on delineating & describing species don't get cited in the primary literature. If they published in journals that stamped DOIs on their works, or if they participated in helping journals get DOIs for back-issues or future publications, then outfits like the Biodiversity Heritage Library would have an easier time mapping taxon names to published works. For example, searching not for a publication but for a taxon name in the Biodiversity Heritage Library (prototype HERE) would not only provide a list of works in BHL that used the name somewhere in their text, it could also provide a forward-linking gadget from CrossRef. The end user would have an opportunity to do his or her own cognitive searching:

There is nothing stopping an outfit like the Biodiversity Heritage Library from using Handles or some other globally unique identifier. But doing so cuts off the possibility of injecting old works back into contemporary use because they will not be embedded in a widely used cross-linking framework.

MOIs for Sale


The Global Biodiversity Information Facility and The Encyclopedia of Life must also be active participants in the adoption of globally unique identifiers. But again, there must be a business model. So, here's a business model in relation to museum specimens:
  1. A registry sells a "MOI" - Museum Object Identifier (my creation of course) at 1 cent per labelled specimen.
  2. The price will go up to 2 cents a specimen after 2020, the usual year given for various National Biodiversity Strategies. Translation: get your act together because it'll cost more later.
  3. All MOIs must have DarwinCore metadata
  4. The registry sets up a resolver identical in functionality to DOIs


Now, before all the curators out there scream bloody murder, let's stop and think about this and put a creative, financial spin on the possibilities. Craig Newmark, the founder of the ever popular Craig's List, was recently interviewed on Stephen Colbert's Colbert Report where he mentioned Donor's Choose (see interview). If you're not familiar with that new service, here's the slogan: "Teachers ask. You choose. Students learn."
DonorsChoose.org is a simple way to provide students in need with resources that our public schools often lack. At this not-for-profit web site, teachers submit project proposals for materials or experiences their students need to learn. These ideas become classroom reality when concerned individuals, whom we call Citizen Philanthropists, choose projects to fund.

There's a lot of interest in The Encyclopedia of Life and the Biodiversity Heritage Library now. Let's set up a global "Donor's Choose" clone called something like "Biodiversity Knowledge Fund" (though that's not catchy enough) to be locally administered by daughter organizations to EOL and the BHL in countries throughout the world. Funds are then transferred to institutions of the donor's choosing. Museums accept the funds donated to them and turn around and buy "MOIs". What would prevent a museum from taking the money specifically donated to them and spending it on things other than MOIs? Nothing. But then their specimens aren't indexed. Are you a philanthropist or have 20 dollars (or francs, rubles, pounds, pesos, dinar, lira, etc.) you'd like to donate? Want to fund biodiversity but don't know how? Here's an answer. But is such a "Biodiversity Knowledge Fund" sustainable? No, but it's a start.

Wednesday, October 24, 2007

Biodiversity Informatics Needs a Business Model

Publishers and (most) librarians understand that digital object identifiers (DOIs) associated with published works are more than just persistent codes that uniquely identify items. They are built into the social fabric of the publishing industry. Because monies are transferred for the application and maintenance of a DOI, the identifier is persistent. It's really because of this "feature" that tools like cross-linking and forward-linking can be built and that these new tools will themselves persist. The nascent biodiversity informatics community is attempting to do all the fun stuff (myself included) like building taxonomic indices and gadgetry to associate names and concepts with other things like literature, images, and specimens, without first establishing a long-term solution for how all these new tools will persist. Let me break it down another way:

Publishers buy DOIs and pay an annual subscription. In turn, the extra fee for the DOI is passed down the chain to the journal & its society. The society then passes the extra fees on to either an author in the way of page fees or to the subscribers of the journal. Since the majority of subscribers are institutions and authors receive research grants from federal agencies, ultimately, the fractions of pennies that merge to pay for a single DOI come from taxpayers' wallets and purses. So, DOIs fit nicely into the fabric of society and really do serve a greater purpose than merely uniquely identifying a published object. Then, and only then, can the nifty tools CrossRef provides be made available. Then, third parties may use these tools with confidence.

Not surprisingly, the biodiversity informatics community has latched on to the nifty things one can do with globally unique identifiers because everybody wants to "do things" by connecting one another's resources. Some very important and extremely interesting answers to tough questions can only be obtained by doing this work. Also not surprisingly, there is now a mess of various kinds of supposed globally unique identifiers (GUIDs) because big players want to be the clearinghouse, much as CrossRef is the clearinghouse for DOIs. But they have all missed the point.

So, how do we instill confidence in the use of LSIDs, ITIS TSNs, the various NCBI database IDs, etc. without a heap of silos with occasional casualties? Get rid of them, or at least clearly associate what kind of object gets what kind of identifier along with a business model where there will be persistent, demonstrable transfer of funds. The use of Semantic Web tools is merely a band-aid for a gushing wound. When I say persistent transfer of funds, I don't mean assurances that monies will come from federal grants or wealthy foundations in order to maintain those identifiers. I mean an identifier that is woven into the fabric and workflow of the scientific community. This may be easier said than done because, other than publications, the scientific community (especially systematists and biologists) isn't in the business of producing anything tangible, and CrossRef has the publication angle very well covered. So, what else do scientists (the systematics community is what I'm most interested in) produce that can be monetized? Specimens, gene sequences, and perhaps a few other objects. We need several non-profits like CrossRef with the guts to demand monies for the assignment of persistent identifiers. Either we adopt this as a business model or we monetize some services (e.g. something like Amazon Web Services as previously discussed) that directly, clearly, and unequivocally feed into the maintenance of all the shiny new GUIDs.

Tuesday, October 16, 2007

PygmyBrowse Classification Tree API

Yay, a new toy! This one ought to be useful for lots of biodiversity/taxonomic web sites. First, I'll let you play with it (click the image):

Seems I always pick up where Rod Page leaves off. Not sure if this is a good thing or not. However, we do have some worthwhile synergies. Rod has cleaned up and simplified his old (Sept. 2006) version of PygmyBrowse. Earlier this week, he made an iframe version and put it on his iPhylo blog. Like Rod, I dislike a lot of the classification trees you come across on biodiversity/taxonomic web sites because these ever-expanding monstrosities eventually fill the screen and are a complete mess. When you click a hyperlinked node, you often have to wait while the page reloads and the tree re-roots itself...not pretty. Trees are supposed to simplify navigation and give a sense of just how diverse life on earth really is. The Yahoo YUI TreeView is OK because it's dynamic, but it desperately needs to handle overflow for exceptionally large branches as is the case with classification trees in biology. What did I do that's different from Rod's creation?

I convinced Dave Martin (GBIF) to duplicate the XML structure Rod used to fill the branches in his PygmyBrowse and to also do the same with JSON outputs. This is the beta ClassificationSearchAPI, which will soon be available from the main GBIF web services offerings. When the service is out of beta, I'll just adjust one quick line in my code.

I jumped at the chance to preserve the functionality Rod has in his newly improved, traditional XMLHTTP-based PygmyBrowse and to write one as an object-oriented JavaScript/JSON-based version. My goal is to have a very simple API for developers and end users who wish to have a remotely obtained, customizable classification tree on their websites. Plus, I want this API to accept an XML file containing taxon name and URL elements (e.g. a Google sitemap) such that the API will parse it and adjust the behaviour of the links in the tree. In other words, just like you can point the Google Maps API to an XML file containing geocoded points for pop-ups, I want this API to grab an XML file and magically insert little, clickable icons next to nodes or leaves that have corresponding web pages on my server. Think of this as a hotplugged, ready-made classification navigator. This is something you cannot do with an iframe version because it's stuck on the server and you can't stick your fingers in it and play with it. Sorry, Rod.

The ability to feed an XML to the tree isn't yet complete, but the guts are all in place in the JavaScript. You can specify a starting node (homonym issues haven't yet been dealt with but I'll do that at some point), the size of the tree, the classification system to use (e.g. Catalogue of Life: 2007 Annual Checklist or Index Fungorum, among others), and you can have as many of these trees on one page as you wish. You just have to pray GBIF servers don't collapse under the strain. So, you could use this API as a very simple way to eyeball 2+ simultaneous classifications. The caveat of course is that GBIF must make these available in the API. So, hats off to GBIF and Dave Martin. These are very useful and important APIs.
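
For the curious, here's roughly how I imagine a page author wiring up one of these trees. This is a hypothetical sketch only; the constructor and option names are mine for illustration, not the actual API.

// Hypothetical usage sketch -- object and option names are illustrative,
// not the real PygmyBrowse API.
var spiderTree = new PygmyBrowse({
  container: "treeDiv",                       // id of the element that will hold the tree
  provider: "Catalogue of Life: 2007 Annual Checklist",
  startNode: "Araneae",                       // starting taxon (homonyms not yet handled)
  width: 300,
  height: 400,
  sitemap: "/taxon-sitemap.xml"               // optional: taxon name/URL pairs used to decorate nodes
});

// A second, independent tree on the same page using a different classification
var fungusTree = new PygmyBrowse({
  container: "treeDiv2",
  provider: "Index Fungorum",
  startNode: "Agaricales"
});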

Last month, I proposed that the Biodiversity Informatics community develop a programmableweb.com clone called programmablebiodiversity.org. There are more and more biodiversity-related APIs available, many of which produce JSON in addition to the usual XML documents via REST. Surely people more clever than me could produce presentational & analytical gadgets if there were a one-stop shop for all the APIs and a showcase for what people are doing with these data services. The response from TDWG was lukewarm. I think there's a time and place for development outside the busy work of standards creation. But there were a few very enthusiastic responses from Tim Robertson, Donald Hobern, Lee Belbin, Vince Smith and a few others. It turns out that Markus Döring and the EDIT team in Berlin have been creating something approaching my vision called BD (Biodiversity) Tracker at http://www.bdtracker.net. I just hope they clean it up and extend it to approximate the geekery of programmableweb.com with some clean-cut recipes for people to dive into using APIs like this. [Aside: Is it just me or are all the Drupal templates starting to look a little canned and dreary?]

There's plenty more I want to do with this JSON-based PygmyBrowse so if you have ideas or suggestions, by all means drop a comment. Rod wants to contribute this code to an open-source repository & I'll be sure to contribute this as a subproject.

Wednesday, October 3, 2007

The Open Library


I stumbled on an amazing new project led by Aaron Swartz called the Open Library - not to be confused with this Open Library, though there appears to be some resemblance. What strikes me about Aaron's project is that it is so relevant to The Encyclopedia of Life it scares me that I haven't yet heard of it. According to their "About the technology" page:

Building Open Library, we faced a difficult new technical problem. We wanted a database that could hold tens of millions of records, that would allow random users to modify its entries and keep a full history of their changes, and that would hold arbitrary semi-structured data as users added it. Each of these problems had been solved on its own, but nobody had yet built a technology that solved all three together.

The consequence of all this is that there is a front-facing page for every book with the option to edit the metadata. All versioning and users are tracked. The content of the "About Us" page sounds eerily like E. O. Wilson's proclamations in his 2003 opinion piece in TREE (doi:10.1016/S0169-5347(02)00040-X). For those of you who don't recognize the name Aaron Swartz, he's the whiz behind a lot of important functionality on the web we see today. It's also worth reading his multi-part thoughts on the spirit of Wikipedia and why it may soon hit a wall.

Sunday, September 30, 2007

DOIs in the References Cited Section

I just read a post by Ed Pentz (Executive Director of CrossRef) who alerted his readers to some recent changes to the recommended American Medical Association and American Psychological Association style guides. Ed also provides two examples. Essentially, a string like "doi:10.1016/j.ssresearch.2006.09.002" is to be tagged at the end of each reference (if a DOI exists) in the literature cited section of journals that use AMA or APA styles. Cool! It would be a snap to write a JavaScript to recognize "doi:" on a page and magically add the "http://dx.doi.org/", and I have no doubt publishers can do something similar prior to producing PDF reprints. Ed seemed puzzled by the exclusion of "http://dx.doi.org/", but this is a really smart move by the AMA and the APA. DOIs, after all, are URNs, so it's best to avoid any confusion. If paper publishers want to make these actionable, then they can do so. If web publishers want to do the same, then a simple little JavaScript can do it.
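
To make that concrete, here's a minimal sketch of the sort of script I have in mind: it scans a container's HTML for strings beginning with "doi:" and wraps them in links to the dx.doi.org resolver. The regular expression is deliberately loose and the function name is mine; treat it as an illustration rather than production code.

// Minimal sketch: turn plain "doi:10.xxxx/yyyy" strings into links to dx.doi.org.
// The DOI pattern here is deliberately simple and is an assumption on my part.
function linkifyDOIs(containerId) {
  var doiPattern = /doi:(10\.\d{4,}\/[^\s<]+)/g;
  var el = document.getElementById(containerId);
  el.innerHTML = el.innerHTML.replace(
    doiPattern,
    '<a href="http://dx.doi.org/$1">doi:$1</a>'
  );
}

// e.g. run over the references section once the page has loaded
// linkifyDOIs("references");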

So, what we need now are style editors to pick up this recommendation across the board in all journals in all disciplines. This simple addition would do a world of good for human discoverability and for machine consumption/repurposing. It shouldn't just be a recommendation, it should be mandatory.

Hopefully, this will step up the drive for including XMP metadata within PDF reprints. It may be a pain for authors to track down DOIs if they're not stamped on the covering page (usually hovering around the abstract or the very top of the page) and consequently, adoption of this new recommendation may be rather slow. If the DOI were embedded in the XMP, then reference managers like EndNote and RefMan would naturally read these metadata. In other words, building your reference collection would be as simple as dropping a PDF in a watched file folder and letting your reference manager do the rest. This would also open the door to zero local copies of PDFs or an intelligent online storage system. EndNote and RefMan need only have the DOI and they can pull the rest using CrossRef's DOI look-up services.

Thursday, September 27, 2007

Would You Cite a Web Page?

Species pages in The Nearctic Spider Database are peer-reviewed in the very traditional sense. But, instead of doling out pages that need to be reviewed, I leave it up to authors to anonymously review each others' works. Not just anyone can author a species page; you at least need to show me that you have worked on spiders in some capacity. Once three reviews for any page have been received and the author has made the necessary changes (I can read who wrote what and when), I flick the switch and the textual contributions by the author are locked. There is still dynamically created content on these species pages like maps, phenological charts, State/Province listings, etc. However, at the end of the day, these are still just web pages, though you can download a PDF if you really want to.

Google Scholar allows you to set your preferences for downloading an import file for BibTex, EndNote, RefMan, RefWorks, and WenXianWang so I thought I would duplicate this functionality for species pages in The Nearctic Spider Database, though limited to BibTex and EndNote. I'm not at all familiar with the last three reference managers and I suspect they are not as popular as EndNote and BibTex. Incidentally, Thomson puts out both EndNote and RefMan and recently, they released EndNoteWeb. As cool as EndNoteWeb looks, Thomson has cut it off at the knees by limiting the number of references you can store in an online account to 10,000. Anyone know anything about WenXianWang? I couldn't find a web site for that application anywhere. So, here's how it works:

First, it's probably a good idea to set the MIME types on the server though this is likely unnecessary because these EndNote and BibTex files are merely text files:

EndNote: application/x-endnote-refer (extension .enw)
BibTex: ?? (extension .bib)

Second, we need to learn the contents of these files:

EndNote: called "tagged import format" where fields are designated with a two-character code that starts with a % symbol (e.g. %A). After the field code, there is a space, and then the contents. It was a pain in the neck to find all these but at least the University of Leicester put out a Word document HERE. Here's an example of the file for a species page in The Nearctic Spider Database:


%0 Web Page
%T Taxonomic and natural history description of FAM: THOMISIDAE, Ozyptila conspurcata Thorell, 1877.
%A Hancock, John
%E Shorthouse, David P.
%D 2006
%W http://www.canadianarachnology.org/data/canada_spiders/
%N 9/27/2007 10:40:40 PM
%U http://www.canadianarachnology.org/data/spiders/30843
%~ The Nearctic Spider Database
%> http://www.canadianarachnology.org/data/spiderspdf/30843/Ozyptila%20conspurcata
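
Generating the file itself is little more than string concatenation. Here is a rough sketch of the sort of function that could produce the tagged record above; the field mapping follows the example, and the metadata object and its property names are my own invention for illustration.

// Rough sketch: build an EndNote "tagged import format" record for a species page.
// The "page" object and its property names are hypothetical stand-ins for database fields.
function buildEndNoteRecord(page) {
  var lines = [
    "%0 Web Page",
    "%T " + page.title,
    "%A " + page.author,        // "Lastname, Firstname"
    "%E " + page.editor,
    "%D " + page.year,
    "%W " + page.databaseUrl,
    "%N " + page.accessedDate,
    "%U " + page.pageUrl,
    "%~ " + page.databaseName,
    "%> " + page.pdfUrl
  ];
  return lines.join("\r\n");
}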

BibTex: Thankfully, the BibTex developers recognize the importance of good, simple documentation and have a page devoted to the format. But, the examples for reference types are rather limited. Again, I had to go on a hunt for more documentation. What was of great help was the documentation for the apacite package, which outlines the rules in use for the American Psychological Association. In particular, pp. 15-26 of that PDF were what I needed. However, where's the web page reference type? Most undergraduate institutions in NA still enforce a no-web-page-citation policy on submitted term papers, theses, etc., so it really wasn't a surprise to see no consideration for web page citations. So, what is the Encyclopedia of Life to do? The best format I could match to EndNote's native handling of web pages was the following:

@misc{hancock30843,
author = {Hancock, John},
title = {Taxonomic and natural history description of FAM: THOMISIDAE, Ozyptila conspurcata Thorell, 1877.},
editor = {Shorthouse, David P.},
howpublished = {World Wide Web electronic publication},
type = {web page},
url = {http://www.canadianarachnology.org/data/spiders/30843},
publisher = {The Nearctic Spider Database},
year = {2006}
}

Now, BibTex is quite flexible in its structure so there could very well be a proper way to do this. But, the structure must be recognized by the rule-writing templates like APA otherwise it is simply ignored.

The EndNote download is available at the bottom of every authored species page in the database's website via a click on the EndNote icon (example: http://www.canadianarachnology.org/data/spiders/30843). I have no idea if the BibTex format above is appropriate so I welcome some feedback before I enable that download.

But, all this raises a question...

Would you import a reference to a peer-reviewed web page into your reference managing programs and, if you are an educator, should you consider allowing undergraduates an opportunity to cite such web pages? Would you yourself cite such pages? Do we need a generic, globally recognized badge that exclaims "peer-reviewed" on these kinds of pages? Open access does not mean content is not peer-reviewed or any less scientific. Check out some myth-busting HERE. What if peer-reviewed web pages had DOIs, thus taking a great leap away from URL rot and closer toward what Google does with its index - calculations of page popularity? Citation rates (i.e. popularity) are but one outcome of the DOI model for scientific papers. If I anticipated a wide, far-reaching audience for a publication, I wouldn't care two hoots if it was freely available online as flat HTML, a PDF, or as MS Word, or if the journal (traditional or non-traditional) has a high impact factor as mysteriously calculated by, you guessed it, Thomson ISI. If DOIs are the death knell for journal impact factors, are web pages the death knell for paper-only publications?

Wednesday, September 26, 2007

IE6 is Crap By Design

I give up. I have had it. Internet Explorer 6 absolutely sucks.

Since I first started toying with JavaScript and using .innerHTML to dynamically add, change, or remove images on a web page in response to a user's actions or at page load (e.g. search icons tagged at the end of scientific references to coordinate user click-to-check existence of DOI, Handle, or PDF via Rod Page's JSON reference parsing web service at bioGUID.info), I have struggled with the way this version of the browser refuses to cache images.

If multiple, identical images are inserted via .innerHTML, IE6 makes a call to the server for every single copy of the image. Earlier and later versions of IE do not have this problem; these versions happily use the cache as it's meant to be used and do not call the server for yet another copy of an identical image. I tried everything I could think of, from preloading the image(s), using the DOM (i.e. .appendChild()), server-side tricks, etc., but nothing works. Page loads are slow for these users and the appended images via .innerHTML may or may not appear, especially if there are a lot of successive .innerHTML calls. Missing images may or may not appear with a page refresh and images that were once present may suddenly disappear with a successive page refresh. Because AJAX is becoming more and more popular and, as a consequence, there are more instances of .innerHTML = '<img src="...">' in loops, don't these people think something is not right with their browser? I doubt it. Instead, I suspect they think something is wrong with the web page and likely won't revisit.

Here's some example code that causes the problem for an IE6 user:

...
for (var i = 0; i < foo; i++) {
  // every identical image appended this way triggers another GET in IE6
  bar.innerHTML += '<img src="..." />';
}
...


Web server log lines like the following (abridged to protect the visitor to The Nearctic Spider Database) fill 10-20% of my daily log files:

GET /bioGUID/magnifier.png - 80 - HTTP/1.1 Mozilla/4.0+(compatible;+MSIE+6.0;...etc...)
GET /bioGUID/magnifier.png - 80 - HTTP/1.1 Mozilla/4.0+(compatible;+MSIE+6.0;...etc...)
GET /bioGUID/magnifier.png - 80 - HTTP/1.1 Mozilla/4.0+(compatible;+MSIE+6.0;...etc...)
GET /bioGUID/magnifier.png - 80 - HTTP/1.1 Mozilla/4.0+(compatible;+MSIE+6.0;...etc...)
GET /bioGUID/magnifier.png - 80 - HTTP/1.1 Mozilla/4.0+(compatible;+MSIE+6.0;...etc...)
GET /bioGUID/magnifier.png - 80 - HTTP/1.1 Mozilla/4.0+(compatible;+MSIE+6.0;...etc...)
GET /bioGUID/magnifier.png - 80 - HTTP/1.1 Mozilla/4.0+(compatible;+MSIE+6.0;...etc...)


So, I wonder how much bandwidth on the Internet superhighway is consumed by this terrible design flaw? Microsoft is aware of the issue, but weakly tried to convince us that this behaviour is "by design" as can be read in their Knowledge Base Article Q319546. Give me a break. The solution in that article is pathetic and doesn't work.

Depending on what you do in your JavaScript, you can either force an HTTP status code of 200 OK (i.e. image is re-downloaded, thus consuming unnecessary bandwidth) or an HTTP status code of 304 Not Modified ("you already got it dummy, I'm not sending it again" - but still some bandwidth). Though I haven't yet investigated before/after in my web logs, others have reported that the following code at the start of some JavaScript will force a 304:

try {
  // IE-only command intended to stop IE re-requesting (background) images it already has
  document.execCommand("BackgroundImageCache", false, true);
} catch(err) {
  // other browsers throw on the unrecognized command; ignore them
}


I'm not convinced that this will work.

After pulling my hair out for 2+ years trying to come up with a work-around, I have concluded that the only way this problem can be prevented is if users adjust how their own cache works. If you are an IE6 user, see the note by Ivan Kurmanov on how to do this (incidentally, Ivan's server-side tricks also don't work). In a nutshell, if you use IE6, DO NOT choose the "Every visit to the page" option in your browser's cache settings. Unfortunately, there is no way for a developer to detect a user's cache settings. Like pre-IE6 and post-IE6 users, people with IE6 whose cache is set to check for newer content "Automatically", "Never" or "Every time you start Internet Explorer" also do not have this problem.

Developers (like Mishoo) have suggested that this "by design" flaw is a means for Microsoft to artificially inflate its popularity in web server log analyses. But, I doubt analysis software confuses multiple downloads from the same IP address with browser popularity. So how can this possibly be by design if there is no way for a developer to circumvent it? Who knows.

The only real solution I have been able to dream up, though I haven't implemented it, is to perform a JavaScript browser version check. If people choose to use IE6 and don't upgrade to IE7, then they won't get the added goodies. I certainly don't want to degrade their anticipated experience, but I also have to think about cutting back on pushing useless bandwidth, which is not free. If you insist on using IE, please upgrade to IE7. This version of the browser was released a year ago. Better yet, make the switch to Firefox.
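
If I ever implement that check, it would amount to something like the sketch below: sniff the user agent for IE6 and simply skip appending the icons for those visitors. User-agent sniffing is crude and the helper names are mine, so treat this as a rough illustration rather than a robust browser test.

// Rough sketch: skip the dynamically inserted icons for IE6 visitors.
// User-agent sniffing is crude; this is an illustration, not a robust test.
function isIE6() {
  var ua = navigator.userAgent;
  return ua.indexOf("MSIE 6.") !== -1 && ua.indexOf("Opera") === -1;
}

if (!isIE6()) {
  // only non-IE6 browsers get the innerHTML-inserted search icons
  // addSearchIcons();  // hypothetical function that appends the icons
}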

Tuesday, September 25, 2007

Google Search JSON API

Google is obviously producing JSON outputs from most (if not all) of its search offerings because they have an API called "Google AJAX Search API". Signing up for a key allows a developer to pull search results for use on web pages in a manner identical to my iSpecies Clone. However, it is an absolute pain in the neck to use this API. It is nowhere near as useful as Yahoo's web services. Google goes to great lengths to obfuscate the construction of its JSON outputs and these are produced from RESTful URLs, buried deep in its JavaScript API. So, I went on a hunt...

Within a JavaScript file called "uds_compiled.js" that gets embedded in the page via the API, there are several functions, which are (somewhat) obvious:
GwebSearch(), GadSenseSearch(uds), GsaSearch(uds) [aside: Is this Google Scholar? There is no documentation in the API how-to], GnewsSearch(), GimageSearch(), GlocalSearch(), GblogSearch(), and finally GbookSearch().
Since all these functions are called from JSON with an on-the-fly callback, why make this so difficult to use? If I had some time on my hands, I'd deobfuscate the API to see how the URLs are constructed such that I could reproduce Yahoo's API. Incidentally, I just now noticed that Yahoo has an optional parameter in its API called "site" whereby one can limit the search results to a particular domain (e.g. site=wikipedia.org).

So Google, if you are now producing an API like AJAX Search, why make it so difficult and force the output into a Google-based UI? Developers just want the data. You could just as easily force a rate limit as does Yahoo for its API: 5,000 queries per IP per day per API. Problem solved.
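
For comparison, here's roughly what the Yahoo pattern looks like: a plain RESTful URL, a caller-chosen callback, and a dynamically inserted script element. The endpoint, parameter names, and response fields below are quoted from memory, so double-check them against Yahoo's own documentation before relying on this sketch.

// Roughly the Yahoo pattern (endpoint and parameter names from memory -- check Yahoo's docs):
// a plain RESTful URL, a caller-chosen callback, injected as a <script> element.
function yahooSearch(term, callbackName) {
  var url = "http://search.yahooapis.com/WebSearchService/V1/webSearch"
          + "?appid=YOUR_APP_ID"
          + "&query=" + encodeURIComponent(term)
          + "&site=wikipedia.org"        // optional: restrict results to one domain
          + "&output=json"
          + "&callback=" + callbackName;
  var script = document.createElement("script");
  script.src = url;
  document.getElementsByTagName("head")[0].appendChild(script);
}

function showResults(data) {
  // field names from memory; do something more useful with the hits here
  alert(data.ResultSet.totalResultsAvailable + " results");
}

// yahooSearch("Latrodectus mactans", "showResults");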

iSpecies Clone

For kicks, I created an iSpecies clone that uses nothing but JSON. Consequently, there is no server-side scripting and the entire "engine", if you will, is within one 12KB external JavaScript file. The actual page is flat HTML. What this means is that you can embed the engine on any web page, including <<ahem>> Blogger. You can try it out yourself here: iSpecies Clone. Or live:

iSpecies


Rod Page was the first to create a semi-taxonomically-based search engine called iSpecies (see the iSpecies blog) that uses web services. He recently gave it a facelift using JSON data sources. This has significantly improved the response time for iSpecies because it is now simple and asynchronous. Rod could continue to pile on web services to his heart's content.

Dave Martin is producing JSON web services for GBIF and recently added a common name and scientific name search. It occurred to me that iSpecies could initially connect to GBIF to produce arrays of scientific and GBIF-matched common names prior to sending off a search to Yahoo or Flickr. Most tags in these image repositories would have common names. Likewise, if the Yahoo News API is of interest, then of course it would be useful to obtain common names prior to making a call to that web service. That's how the iSpecies clone above works. Oh, and scientific names are also searched when a common name is found and recognized by GBIF.

This clone is naturally missing material compared to the results obtained when conducting a real iSpecies search (e.g. genomics, Google Scholar - though that is simply disabled here - and what looks to be some scripting for name recognition above the level of species). If big-player data providers like NCBI, uBio, CBOL, etc. produced JSON instead of, or in addition to, XML, then it would be incredibly easy to make custom search engines like this that can be embedded in a gadget.

After having tinkered a lot with JSON lately, it is now abundantly clear to me that The Encyclopedia of Life (EOL) species pages absolutely must have DOIs and plenty of web services to repurpose the data it will index. If we really want EOL to succeed in the mass media, then these species page DOIs should also be integrated into Adobe's XMP metadata along with some quick and easy ways to individually- and batch-embed them.

Thursday, September 20, 2007

GBIF web services on the rise

Look out EOL, Dave Martin has been very quickly creating some superb web services for GBIF data. Check out the GBIF portal wiki or the various name search APIs that produce text, XML, JSON, or simple deep-linking: HERE. As an example of the kinds of things Dave and Tim Robertson have been producing, here is a map gadget showing all the records in a 1-degree cell shared to GBIF from The Nearctic Spider Database:

Wednesday, September 12, 2007

EOL "WorkBench" Ideas Loosely Joined

It will be interesting to see just how quickly The Encyclopedia of Life's "WorkBench" environment can be assembled. For those of you unfamiliar with this critical aspect of the initiative, it will be the environment in which users will access and manipulate content from web services and other data providers, EOL-indexed content, and their own local hard drives AND simultaneously contribute (if desired; I expect this to be optional) to public-facing species pages. As you can imagine, a suite of things has to fall into place for all the pieces to play nicely in a simple graphical user interface. Expectations are high for this application to be THE savior, which I hope will differentiate EOL from WikiSpecies and other similar projects/initiatives.

I envision WorkBench as a Semantic Web browser of sorts, capable of pulling dozens of data types from hundreds of sources into a drag-drop whiteboard something like a mindmap. Coincidentally, I stumbled across MindRaider. Though I'd much rather see a Flex-based solution (such that Adobe AIR can be used), MindRaider (Java-based) developed by Martin Dvorak looks to be a very interesting way to organize concepts as interconnected resources and also permits a user to annotate components of the mindmap. Sharing such Semantic mindmaps is also a critical piece of the puzzle as is making interconnections to content in a user's local hard drive.

What then do we need for EOL's WorkBench? 1) web services, 2) web services, and 3) web services, AND 4) commonly structured web services such that resources acquired from hundreds of data providers do not require customized connectors.

Somehow, I'd like to see data providers ALL using OpenSearch (with MediaRSS or FOAF extensions) for fulltext, federated search and/or TAPIR for the eventual Species Profile Model's structured mark-up. Then, I'd like to see RSSBus on EOL servers. Lastly, I'd like to see a melange of MindRaider, Mindomo, and a Drupal-like solution to permit self-organization of interest groups of the kind Vince Smith champions with Scratchpads. Vince and others no doubt argue that there is great value in centralized hosting. But the advantage there is 99% the provider's. End users don't give two hoots about this so long as the application is intuitively obvious and permits a certain degree of import, export, configuration and customization.

So, the pieces of the puzzle:

Providers -> RSS -> RSSBus -> MindRaider/Mindomo/Scratchpad (in Adobe AIR) <- local hard drive

Sounds easy doesn't it? Yeah, right.

Friday, September 7, 2007

Forward Thinking


I previously criticized CrossRef for the implementation of new restrictive rules for use of its OpenURL service, but Ed Pentz, Executive Director of CrossRef, stopped by and reassured us that CrossRef exists to fill the gaps. The most restrictive rule has now been relaxed. Well done, Ed.

While browsing around new publications in Biodiversity and Conservation, I caught something called "Referenced by" out of the corner of my eye. This may be old hat to most of you and I now feel ashamed that I have not yet discovered it. Perhaps I have subconsciously dismissed boxes on web sites because Google AdWare panels have constrained my eyeball movements. Anyhow, CrossRef have used the power of DOIs to provide a hyperlinked list of more recent publications that have referenced the work you are currently examining. Ed Pentz has blogged about this new feature. Now, this is cool and is the stuff dreams are made of. For example, a paper by Matt Greenstone in '83 entitled "Site-specificity and site tenacity in a wolf spider: A serological dietary analysis" (doi:10.1007/BF00378220) is referenced by at least 6 more recent works as exemplified in that panel, including several by Matt himself. Besides the obvious way that this permits someone to peruse your life's work (provided you reference yourself and publish in journals that have bought into CrossRef), this is a slick way to keep abreast of current thinking. If your initial introduction to subject matter is via pre-1990 publications, you can quickly examine how, and by whom, previous works have been used, regardless of the journal in which the citing article appeared. Hats off, CrossRef!

Now, what we need are publishing firms still mired in the dark ages to wake up to the power of DOIs. If you participate in the editorial procedures for a scientific society and your publisher has not yet stepped up by providing you with DOIs, get on the phone and jump all over them! You would be doing your readers, authors, and society a disservice if you accepted anything less than full and rapid cooperation from your chosen publisher.

So Ed, will "Forward Linking" be a web service we can tap into?

Wednesday, September 5, 2007

CrossRef Takes a Step Back

UPDATE Sept. 8/2007: Please read the response to this post by Edward Pentz, Executive Director of CrossRef in the comments below.



Mission statement: "CrossRef is a not-for-profit membership association whose mission is to enable easy identification and use of trustworthy electronic content by promoting the cooperative development and application of a sustainable infrastructure."

Not-for-profit, huh? Money-grabbing in the professional publishing industry has once again proven to be more important than making scientific works readily accessible. As of September 7, 2007, CrossRef will roll out new rules for its OpenURL and DOI lookups. Unless you become a card-carrying CrossRef "affiliate", there will be a daily cap of 100 lookups using their OpenURL service, which will require a username/password. If >100 lookups are performed, CrossRef reserves the right to cancel the account and will force you to buy into their senseless pay-for-use system. Here are the new rules as described at http://www.crossref.org/04intermediaries/60affiliate_rules.html:
  1. Affiliates must sign and abide by the term of the CrossRef Affiliate Terms of Use
  2. Affiliates must pay the fees listed in the CrossRef Schedule of Fees
  3. The Annual Administrative Fee is based on the number of new records added to the Affiliate's service(s) and/or product(s) available online
  4. There are no per-DOI retrieval fees. There are no fees based on the number of links created with the Digital Identifiers.
  5. Affiliates may "cache" retrieved DOIs (i.e. store them in their local systems)
  6. The copyright owner of a journal has the sole authority to designate, or authorize an agent to designate, the depositor and resolution URL for articles that appear in that journal
  7. A primary journal (whether it is hosted by the publisher or included in an aggregator or database service) must be deposited in the CrossRef system before a CrossRef Member or Affiliate can retrieve DOIs for references in that article. For example, an Affiliate that hosts full text articles can only lookup DOIs for references in an article if that journal's publisher is a PILA Member and is depositing metadata for that journal in the CrossRef System
  8. Real-time DOI look-up by affiliates is not permitted (that is, submitting queries to retrieve DOIs on-the-fly, at the time a user clicks a link). The system is designed for DOIs to be retrieved in batch mode.

So what's the big deal?
The issue has to do with scientific society back-issues like the kind served by JSTOR. Without some sort of real-time DOI look-up, it is nearly impossible to learn of newly scanned and hosted PDF reprints for older works. After September 7, the only solution available to developers and bioinformaticians is to periodically "batch upload" lookups. CrossRef sees Rod Page's bioGUID service and my simple, real-time gadget as threats to their steady flow of income even though these clearly fit within their general purpose "...to promote the development and cooperative use of new and innovative technologies to speed and facilitate scholarly research."

Saturday, September 1, 2007

Giant Texas Spider Web

This past week, there have been numerous stories about the Giant Texas Spider Web in Lake Tawakoni State Park such as this compiled CNN video re-posted on YouTube:





There hasn't yet been a definitive identification of the species involved (stay tuned for more), but from the videos I have seen, the primary culprit looks to be a tetragnathid (long-jawed orbweaver) and not the assumed social spiders like Anelosimus spp. (doi:10.1111/j.1096-3642.2006.00213.x). This behaviour is rather unusual for a tetragnathid and reminds me of what was thought to be a mass dispersal event gone awry near McBride, British Columbia several years ago. In that case, the species involved were (in order of numerical dominance): Collinsia ksenia (Crosby & Bishop, 1928), Erigone aletris Crosby & Bishop, 1928, a Walckenaeria sp., and Araniella displicata (Hentz, 1847). See Robin Leech et al.'s article in The Canadian Arachnologist (PDF, 180kb). In the case of this massive Texas webbing, there also appear to be several other species present in the vicinity as evidenced by the nice clip of Argiope aurantia Lucas, 1833 in the YouTube video above.

Update:
Mike Quinn, who compiles "Texas Entomology", has a great page on the possible identity of the species involved in the giant web. The candidate in the running now is Tetragnatha guatemalensis O. P.-Cambridge, 1889, which has been collected from Wisconsin to Nova Scotia, south to Baja California, Florida, Panama, and the West Indies. The common habitat, as is the case for most tetragnathids, is streamside or lakeside shrubs and tall herbs.

Another potential candidate (if these are indeed tetragnathids) is Tetragnatha elongata Walckenaer, 1842. I suspect the tetragnathids and A. aurantia are incidentals and not the primary culprits who made the giant mess of webbing. Since Robb Bennett and Ingi Agnarsson both have suspicions that the architect is Anelosimus studiosus (Hentz, 1850), and since it is highly unlikely it is a tetragnathid, I have my bets on erigonine linyphiids, much like what happened in McBride, BC.

Wednesday, August 22, 2007

JSON is Kewl


While messing around with the newfangled reference parser script that connects to bioGUID to get the goods, it occurred to me that this slick, simple technique, which requires next to no mark-up, can be applied to all sorts of nifty things. Yahoo produces JSON for its search results and you can specify your own callback function. So, for kicks, I adjusted my JavaScript file a bit to use Yahoo instead of Rod Page's bioGUID reference parser and also added some cool DHTML tooltips developed by Walter Zorn. So, hover your mouse over a few animal and plant names that I list here with no particular relevance: Latrodectus mactans, Blue Whale, blue fescue, and, Donald Hobern's favourite, Puma concolor. Incidentally, I may as well try it with Donald Hobern himself (Disclaimer: I take no responsibility for what may pop up in the tooltip). Now that I have been messing with this JavaScript for pulling JSON with a callback, this stuff is quite exciting. You have to remember that there is next to NO mark-up or any additional effort for someone to take advantage of this. I only have a few JavaScript files in the <body> section of this page and I mark up the stuff I want to have a tooltip with <span class="lifeform">name here</span>. This is pretty cool if I do say so myself.
I initially tried this technique with Flickr, but they don't permit square brackets in a callback function. So, I wrote the developers and alerted them to this cool new toy. Hopefully, they'll open the gates a little more and not be so restrictive.
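
For anyone wondering what "next to no mark-up" amounts to in practice, the general shape of the technique is sketched below: walk the page for spans with the right class and wire up a mouseover that kicks off the JSON search. The class name mirrors what I described above, but the function bodies are illustrative stand-ins, not the actual script running on this page.

// General shape of the technique: find <span class="lifeform"> elements and
// attach a mouseover that triggers the JSON search. Function bodies are
// illustrative stand-ins, not the actual script.
function wireLifeformSpans() {
  var spans = document.getElementsByTagName("span");
  for (var i = 0; i < spans.length; i++) {
    if (spans[i].className === "lifeform") {
      spans[i].onmouseover = function () {
        var name = this.innerHTML;
        fetchSearchResults(name);   // hypothetical: injects the JSONP <script> tag
      };
    }
  }
}

function fetchSearchResults(name) {
  // build the search URL with a callback parameter, append it as a
  // <script> element, and let the callback fill the DHTML tooltip
}

// window.onload = wireLifeformSpans;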

Forgive me...I just can't help myself:
Carabus nemoralis
Argiope aurantia
Culex quinquefasciatus
duck-billed platypus
slime mould
...How many more million to go?...

Sunday, August 19, 2007

Gimme That Scientific Paper Part III


Update Sep 28, 2007: Internet Explorer 6 refuses to cache images properly so I have an alternate version of the script that disables the functionality described below for these users. You may see it in action HERE. Also, the use of spans (see below) may be too restrictive for you to implement so I developed a "spanless" version of the script HERE. This version only requires the following mark-up for each cited reference and you can of course change a line in the script if you're not pleased with the class name and want to use something else:
<p class="article">Full reference and HTML formatting allowed</p>

Those who have followed along in this blog will recall that I dislike seeing references to scientific papers on web pages when there are no links to download the reprint. And, even when the page author makes a bit of effort, the links are often broken. One solution to this in the library community is to use COinS. But, this spec absolutely sucks for a page author because there is quite a bit of additional mark-up that has to be inserted in a very specific way. [Thankfully, there is at least one COinS generator you can use.] I was determined to find a better solution than this.
You may also recall that I came up with an AJAX solution together with Rod Page. However, that solution used Flash as the XMLHTTP parser, which meant that a crossdomain.xml file had to be put on Rod's server, i.e. this really wasn't a cross-domain solution unless Rod were to open up his server to all domains. Yahoo does this, but it really wasn't practical for Rod. As a recap, this is what I did in earlier renditions:
The JavaScript automatically read all the references on a page (as long as they were sequentially numbered) and auto-magically added little search icons next to each one; when clicking these, the references were searched via Rod Page's bioGUID reference parsing service. If a DOI or a handle was found, the icon changed to a link icon; if a PDF was found, it changed to a PDF icon; if neither a PDF nor a link via DOI or handle was found, it changed to an icon whereby you could search for the title on Google Scholar; and finally, if the reference was not successfully parsed by bioGUID, the icon changed to an "un"-clickable one. If you wanted to take advantage of this new toy on your web pages, you had to either contact Rod and ask that your domain be added to his crossdomain.xml file or you'd have to set up a PHP/ASP/etc. proxy. But, Rod has now been very generous...

Rod now spits out JSON with a callback function. What this means is that there are no longer any problems with cross-domain issues as is typically the case with XMLHTTP programming. To make a long story short, if you are a web page author and include a number of scientific references on your page(s), all you need do is grab the JavaScript file HERE, grab the images above, adjust the contents of the JavaScript to point to your images, then wrap up each of your references in span elements as follows:

<p><span class="article">This is one full reference.</span></p>
<p><span class="article">This is another reference.</span></p>
etc.
How easy is that?!
To see this in action, have a peek at the references section of The Nearctic Spider Database.

Or, you can try it yourself here:

Buckle, D. J. 1973. A new Philodromus (Araneae: Thomisidae) from Arizona. J. Arachnol. 1: 142-143.



For the mildly curious and for those who have played with JSON outputs with a callback function, I ran into a snag that caused no end of grief. When one appends a JSON callback script to the page header, a function call is dynamically inserted. This works great when there is only need for one instance of that function at a time. However, in this case, a user may want to fire several searches in rapid succession before any previous call has finished. As a consequence, the appended callback functions may pile up on each other and steal each others' scope. The solution was to dump the callback functions into an array, which was mighty tricky to handle.
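
For those curious what the array trick looks like, here is a stripped-down sketch of the idea: each request registers its own numbered callback so that rapid-fire searches can't clobber one another. The names, helper function, and query parameter are illustrative; the real script differs in the details.

// Stripped-down sketch of the callback-array trick: each request gets its own
// numbered callback so rapid-fire searches can't steal each other's scope.
var jsonCallbacks = [];

// serviceUrl is the JSONP endpoint (e.g. Rod's bioGUID parser); its exact form
// and the "q" parameter are placeholders, not the real service signature.
function parseReference(serviceUrl, refText, iconElement) {
  var index = jsonCallbacks.length;
  jsonCallbacks[index] = function (data) {
    // each closure keeps its own icon, so responses may arrive in any order
    updateIcon(iconElement, data);   // hypothetical helper that swaps the icon
  };
  var script = document.createElement("script");
  script.src = serviceUrl + "?q=" + encodeURIComponent(refText)
             + "&callback=jsonCallbacks[" + index + "]";
  document.getElementsByTagName("head")[0].appendChild(script);
}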

Wednesday, August 15, 2007

Mind Dumps...er...Maps/Graphs/Trees


Since Adobe has been driving across the US, selling some AIR, I thought I'd take a closer look at Flex/Flash applications that might fit the bill for some tough ideas I have been wrestling with. In a somewhat similar GUI struggle, Rod Page has been feverishly playing with Supertrees, trying to find a web-based, non-Java solution. So, I did a bit of digging into some showcased Adobe AIR applications - tutorial and demo sites are cropping up all over the place - and a Flex application that will soon be transformed into AIR caught my eye: Mindomo. Now, if this mindmapping application had RDF tying the bits together, delivered via distributed web services from GBIF, GenBank, CrossRef, etc., we'd have a real winner here. Mindomo is exactly the application I have been dreaming about for The Encyclopedia of Life (EOL)'s WorkBench; it just has to be pinned down into the biological, semantic web world. Since an AIR application can wrap all this up in a web/desktop hybrid application, I am convinced this is what EOL absolutely must produce.

Monday, August 13, 2007

Google Finds Spiders in Your Backyard

The Google Maps API team have added a new DragZoomControl to the list of available functions. This feature has been bandied about for quite some time and various people have hacked together approximations for it using other JavaScript functions. My interest in this isn't so much the zoom function, as cool as that is, but the ability to query resources bound by the drawn box.

Kludging DragZoomControl to perform a spatial query isn't particularly practical or useful, so I used a Yahoo YUI "Drag & Drop - Resizable Panel" to fix up what I was once using. What I used in the past, which didn't perform well for Safari users, was some scripting from Cross-Browser.com called the X Library. Now, with Yahoo's improvement on this, the function works as expected. Because it's very easy to add things that stay positioned within such a draggable box, the Yahoo YUI component is a much better solution. So, just like you can zoom in / zoom out with the Google DragZoomControl, so too can you put these functions within a draggable, resizable box. I'll also add that the resizing function in Yahoo's component is much smoother than Google's own DragZoomControl. Now the fun part...






Two little icons within the draggable, resizable box allow you to search for spider images or produce a spider species list, which are based on collections records submitted to The Nearctic Spider Database. Click HERE to try your hand at it and search for spiders in your back yard.

The advantage of such a simple function is that one need not have a spatial database like PostgreSQL, but can make use of any enterprise back-end. The query run is the typical minX, maxX, minY, maxY to define the four corner coordinates. With a ton of records in the backend however, the query can take a long time to complete so an index on the latitude and longitude columns may be required as explained in the Google API Group. If you want to see what you can do with a spatial database however, have a look at what programmers for the Netherlands Biodiversity Information Facility have put together.
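
The guts of the operation are nothing more exotic than reading the four corners of the box and handing them to the server. Here is a simplified sketch that uses the Google Maps API's bounds object rather than the YUI panel; the query URL and its parameter names are placeholders, not the actual service.

// Simplified sketch: read the four corners of the visible box and hand them to
// the server. The endpoint and parameter names are placeholders, not the actual service.
function searchBoundingBox(map) {
  var bounds = map.getBounds();             // Google Maps API v2 GLatLngBounds
  var sw = bounds.getSouthWest();
  var ne = bounds.getNorthEast();
  var url = "/spiders/search"               // placeholder endpoint
          + "?minY=" + sw.lat() + "&maxY=" + ne.lat()
          + "&minX=" + sw.lng() + "&maxX=" + ne.lng();
  // the server then runs the usual "latitude BETWEEN minY AND maxY
  // AND longitude BETWEEN minX AND maxX" query against indexed columns
  window.location.href = url;
}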

Happy spider hunting...

Tuesday, August 7, 2007

Dare to Dream Big

This post will be quite off-topic, but I just had to share some recent stuff in the works that caught my eye.

First up is a spin-off from research at MIT, led by Sanjit Biswas who temporarily left his Ph.D. program (are you sure, Sanjit?) to lead a company called Meraki. The cheap, little router/repeaters permit the creation of "smart", distributed networks such that a single DSL connection can feed dozens of end-points. The firmware in each little gizmo permits a network admin to monetize these ad-hoc connections. Consequently, getting connected to the 'net could be as cheap as $1 a month once a user buys the attractive Meraki mini. The company also recently announced a Meraki Solar kit. Now that's forward thinking. There are dozens of testimonials on the Meraki web site, including one from the town of Salinas, Ecuador, where a network of schools is now connected even though there are no phone lines!

Distributed, ad hoc connections like this reminded me of an email I recently received from Rod Page who alerted me to FUSE, which stands for "File System in Userspace". This is a Linux-based, Sourceforge project that allows a user to create & mount virtual drives that contain or represent a vast array of file types. For example: 1) Fuse::DBI file system mounts some data from relational databases as files, 2) BloggerFS "is a filesystem that allow Blogger users to manipulate posts on their blogs via a file interface.", and 3) Yacufs "is a virtual file system that is able to convert your files on-the-fly. It allows you to access various file types as a single file type. For instance you can access your music library containing .ogg, .flac and .mp3 files, but see them all as if being .mp3 files." This all sounds very geeky, but I draw your attention to MacFUSE (sadly, there is not yet a WindowsFUSE, though it appears this functionality has not gone unnoticed):



So what? Isn't this just like some sort of peer-2-peer system? Absolutely not. This is more like a distributed content management system and, coupled with a highly intelligent Yacufs-like extension, it means that file types (e.g. MS Word, OpenOffice, etc.) can be converted on-the-fly to whatever file format you want or need. To step this thinking up a bit, in case you have no idea why this is relevant to ecology or systematics, have a look at the cool things Cynthia Parr and her colleagues are doing to visualize distributed data sets: doi:10.1016/j.ecoinf.2007.03.005. FUSE means the work Cynthia & others are doing (e.g. SEEK) doesn't need a GUI. Rather, we just need a way to organize the gazoodles of files that would/could be present in an ecologically- or taxonomically-relevant filesystem. Maybe I should coin these EcoFS and TaxonFS :)~

Tuesday, July 24, 2007

The Next Scientific Revolution?


With the advent of Science Commons (see the popsci.com article) and cool new tools available at Nature like its Precedings, opportunities to share research outside for-profit business models (i.e. journals) are starting to gain momentum. Systems like Neurocommons allow for a redrawing of money transfer trajectories around and within publishing firms such that everyone can play nice in the sandbox. Medical research sells no matter how you redraw the lines or rebuild the castle. But I wonder if the discussions around these "revolutions" have attempted to include pure research whose business model is based entirely on public funds and funding institutions that receive their monies from government pots. How do you sell Science Commons to mathematicians, theoretical physicists, systematists, ecologists, and other scientists whose research is not closely tied to big-bucks pharmaceutical companies and a Gates Foundation? Repositories like iBridge are useful for niche markets and interests, but I'd hardly call them revolutionary for all of science. If proponents of the Semantic Web want to sell their ideas, I'd like to see buy-in by one tenured ecologist & then some demonstrable evidence for how this will accelerate his/her research.

Thursday, July 19, 2007

Biodiversity on the Desktop


In earlier posts, I described my vision of an Encyclopedia of Life (EoL) where the web and your desktop environment are blurred or mashed together. In such a manner, I envisioned a tool where users and contributors to EoL could maintain either a public or private working space to mix and mash data from multiple sources with those in their own hard drives.

One of the grandiose dreams I had was the ability to create a private, content management-like community within EoL in which co-authors of a proposed manuscript, like a taxonomic revision, could merge their datasets, grab data from third-party sources if useful (e.g. GenBank, GBIF, etc.), and collectively work on the manuscript. Upon completion, there may be some elements of use to EoL that the authors could later push out; this in no way diminishes the value of their publication or would cause editors to frown and reject the paper, but it offers immediate value to the public at large. Granted, I'm not totally clear on how this will work, but I really don't think it would be completely unrealistic. What I can't stress enough, though, is that EoL cannot be like WikiSpecies, where contributors have to sit and author content solely for use in WikiSpecies. Rather, it absolutely must have a system that can somehow slip into the workflow of biologists. I'm now convinced that such a vision is not science fiction.

I watched a few Adobe AIR (Adobe Integrated Runtime, code-named Apollo) videos this morning and got thoroughly fired up about the possibilities. Google has been working on an offline API, which looks pretty good, but AIR blows all this out of the water because it allows a developer to maintain an application's presence on the desktop. Further, it can be integrated into the operating system. So, imagine working on a manuscript with a handful of colleagues and being notified with a desktop toast pop-up when a colleague has completed his/her revisions or sections of the manuscript, or when a dataset in EoL suddenly becomes available while the manuscript is being prepared.

The really cool thing from a web developer's perspective is that AIR uses existing web programming tools, i.e. the learning curve to create these cross-platform applications is quite shallow because code can be repurposed for either the web or the desktop...at least that's the impression I get from watching a few of these videos. Here's one such video where Christian Cantrell demonstrates a few simple applications built on AIR...just mentally substitute Amazon for EoL and you get my drift.

Now, all this stuff is really cool and there is indeed the potential for EoL, or the totally unknown "MyEoL", to truly transform the science of biology because it can & should slip into biologists' workflows. Of course, there's no reason why multiple flavours of these AIR desktop applications couldn't be built for children, amateur naturalists, scientists, or however EoL sees its user base. In fact, what I'd really like to see is a BioBlitz class of user where observational data can be merged across interest groups at any geographic scale in their respective desktop applications. An encyclopedia is great, but we ultimately want people to use the encyclopedia and go outside with fellow human beings to look at, count, touch, or otherwise experience life on our planet. Each flavour of the desktop application can be customized to do different chores or expose different aspects of the EoL system to various classes of end users. The first step, then, is to nail down the needs and design these applications around them. So, here are a few questions to kick-start this process, at least for biologists/systematists who are excited about EoL but just don't believe there will ever be time for them to contribute content:

1. When conducting a taxonomic revision, what are the significant organizational and communications impediments that have slowed down the process?
2. What data sets are vital to conducting an effective revision?
3. What desktop software applications are critical when conducting a revision? [remember, because AIR is on the desktop, there may be an opportunity here to automate file type transformations produced by one provider/database/spreadsheet to the applications required to actually analyze the data]

#2 above will be a challenge and will require that data providers produce similarly structured APIs (e.g. DiGIR, TAPIR, OpenSearch, and the like). This, I believe, is where EoL has to take firm leadership. What it needs now is a demo application like that produced by Adobe and Christian Cantrell so we can all visualize this dream. Currently, the EoL dream is very fuzzy and has led to a lot of miscommunication and confusion (e.g. isn't this just WikiSpecies?). THEN, EoL absolutely must write some recipes for content providers to start sharing their data in a manner like what Google does with their Google Maps API. The selling point will be iconographic and link-back attribution for content providers, which can be constructed already if all content providers used a system like OpenSearch, MediaRSS, GeoRSS, and various other RSS flavours to open their doors.
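To illustrate the sort of recipe I mean, here is a small sketch of consuming a provider's hypothetical OpenSearch-style RSS feed with the Python feedparser library. The endpoint URL and query parameter are made up; the point is that link-back attribution falls out of the feed entries almost for free.

```python
# Consume a hypothetical OpenSearch-style RSS endpoint and collect the
# (title, link) pairs that give a provider its link-back attribution.
import feedparser

def search_provider(base_url, term):
    """Query a hypothetical OpenSearch endpoint and return (title, link) pairs."""
    feed = feedparser.parse(f"{base_url}?searchTerms={term}&format=rss")
    return [(entry.title, entry.link) for entry in feed.entries]

if __name__ == "__main__":
    for title, link in search_provider(
            "http://example.org/opensearch", "Araneus diadematus"):
        print(f"{title} -> {link}")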

Wednesday, July 4, 2007

Digital Species Descriptions & the new GBIF portal

The Biodiversity Information Standards organization (previously known as TDWG, the Taxonomic Databases Working Group) has recently rolled out a new subgroup called the "Species Profile Model", led by Éamonn Ó Tuama (GBIF). Thirteen people attended a workshop April 16-18, 2007 in Copenhagen, DK, shortly after the Encyclopedia of Life informatics workshop in Woods Hole, MA. The point of this Copenhagen workshop was to hash out a specification to support the retrieval and integration of data, with the lofty goal of "reaching consensus and avoiding fragmentation" of existing species-level initiatives. I'm all for this, but I wonder if it will work? I believe "consensus" as it's described here is meant to be a common way of presenting the data rather than a true taxonomic, ecological, or political consensus. A specification does not preclude the possibility of several variants of a species profile served from multiple (or even the same) providers. These could of course have conflicting or dated information and ultimately result in misleading COSEWIC-type recommendations. So, what about consensus as we usually define it? Or is that beyond the responsibility of this subgroup?

A standard for specimen data (Darwin Core, ABCD, etc.) is obvious, but I'm not convinced that a standard for species descriptions is wise unless such a standard were developed and solely hosted by the nomenclators and sanctioned by the various Codes. A standard for species descriptions without ties to the nomenclators and to the authors who conducted the original species description or revision merely democratizes fluff. Before a standard Species Profile Model is put into practice, such RDF representations have to at least explicitly incorporate peer review, authorship, and a date stamp.
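Here is a rough sketch, using Python's rdflib, of the minimum metadata I'd want attached to any species profile record before trusting it. The Dublin Core creator and date terms are real; the reviewedBy and reviewStatus terms are placeholders invented for illustration, not anything from the Species Profile Model itself.

```python
# Attach authorship, a date stamp, and an explicit review flag to a profile.
# EX.reviewedBy and EX.reviewStatus are hypothetical placeholder terms.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DC, RDF

EX = Namespace("http://example.org/spm-sketch#")

g = Graph()
profile = URIRef("http://example.org/profiles/araneus-diadematus")

g.add((profile, RDF.type, EX.SpeciesProfile))
g.add((profile, DC.creator, Literal("A. Taxonomist")))
g.add((profile, DC.date, Literal("2007-07-04")))
g.add((profile, EX.reviewedBy, Literal("B. Reviewer")))
g.add((profile, EX.reviewStatus, Literal("peer-reviewed")))

print(g.serialize(format="turtle"))
```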

I also noticed that the Species Profile Model is attempting to integrate citations to scientific literature. I suggest the team take a good, close look at OpenURL, which lends itself to useful functionality when building lists of references in front-end applications (see Rod Page's post in iPhylo on this very subject and several posts in this blog). The OpenURL format will influence how the elements in the proposed Species Profile Model ought to be constructed.
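For the curious, here is a sketch of what assembling an OpenURL 1.0 (KEV) link for a journal article looks like. The resolver base URL is hypothetical (every institution runs its own), the citation is a made-up example, and only the common journal-format keys are shown.

```python
# Build an OpenURL 1.0 key/value (KEV) query string for a journal citation.
from urllib.parse import urlencode

def build_openurl(resolver, citation):
    params = {
        "ctx_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
        "rft.aulast": citation["author_last"],
        "rft.atitle": citation["article_title"],
        "rft.jtitle": citation["journal"],
        "rft.date": citation["year"],
        "rft.volume": citation["volume"],
        "rft.spage": citation["first_page"],
    }
    return f"{resolver}?{urlencode(params)}"

# A made-up citation against a hypothetical institutional resolver.
print(build_openurl("http://resolver.example.edu/openurl", {
    "author_last": "Smith",
    "article_title": "A hypothetical spider paper",
    "journal": "Journal of Arachnology",
    "year": "2007",
    "volume": "35",
    "first_page": "1",
}))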

On other fronts, GBIF just rolled out their new portal: http://data.gbif.org. It looks as if the whole index and back-end were reconstructed, and there remain some missing provider data tables. In time, these will probably blink on as they were presented via the old portal. What I appreciate seeing for the first time is a concerted effort to give providers some auto-magic feedback about what is being served from their boxes. Vetting data is a very important part of federation and I hope providers sit up and take notice. GBIF calls these "event logs", which is too opaque. I'd like to see this called "Questionable Data Served from this Provider", "Problem Records", or "The Crap You're Serving the Scientific Community", or something similar. "Event logs" is easily dismissed and overlooked. For example, here are the event logs for the University of Alaska Museum of the North Mollusc Collection: http://data.gbif.org/datasets/resource/967/logs/. GBIF also has a flashy new logo & plenty of easy-to-use web services.

Tuesday, June 19, 2007

DiGIR for Collectors


The Global Biodiversity Information Facility (GBIF) has done an excellent job of designing the infrastructure to support the federation of specimen and observation data. The majority of contributing institutions use Distributed Generic Information Retrieval (DiGIR), an open-source, PHP-based package that nicely translates columns of data in one's dedicated database (e.g. MySQL, SQL Server, Oracle, etc.) into Darwin Core fields. So, even if your data columns don't match the Darwin Core schema, you can use the DiGIR configurator to match your columns to what's needed at GBIF's end. Europeans tend to prefer Access to Biological Collection Data (ABCD) as their transport mechanism. The functionality of these will soon be rolled into the TDWG Access Protocol for Information Retrieval (TAPIR). To the uninitiated like me, this is a jumbled, confusing alphabet soup, and at first I couldn't navigate this stuff.

Suffice it to say, the documentation isn't particularly great on either the TDWG or GBIF web sites. To the TDWG folks: a screencast with a step-by-step install for both Windows & Linux would go a long way! I don't mean a flashy Encyclopedia of Life webcast, I mean a basic TDWG for Dummies. If you have a dedicated database and a web server that can push PHP-based pages, it's actually pretty straightforward once you get going. It's really just a matter of jumping through a few simple hoops: click here; do this; match this; click there - not much more difficult than managing an Excel datasheet. The downloads & step-by-step instructions for DiGIR can be found HERE. The caveats: you need a dedicated database, a dedicated web server, and you need your resource to be recognized by a GBIF affiliate before it's registered for access. That's unfortunately how all this stuff works.
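Conceptually, the configurator step boils down to matching your local column names to Darwin Core terms. The toy sketch below is not DiGIR's actual configuration format (that is XML generated through its web interface); the local column names are hypothetical and the Darwin Core term names are approximate.

```python
# Toy illustration of the column-matching step: local database columns are
# mapped onto (approximate) Darwin Core field names before records are served.
LOCAL_TO_DARWIN_CORE = {
    "genus_name": "Genus",
    "species_epithet": "SpecificEpithet",
    "collector": "Collector",
    "date_collected": "EarliestDateCollected",
    "lat": "DecimalLatitude",
    "lon": "DecimalLongitude",
}

def to_darwin_core(record):
    """Translate one local database row (a dict) into Darwin Core fields."""
    return {dwc: record[local] for local, dwc in LOCAL_TO_DARWIN_CORE.items()
            if local in record}

print(to_darwin_core({
    "genus_name": "Araneus",
    "species_epithet": "diadematus",
    "collector": "D. Collector",
    "lat": 53.5,
    "lon": -113.5,
}))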

So what about the casual or semi-professional collector who may have a much larger collection than can be found in museums? It's not terribly likely that countless hard-working people like these have the patience to fuss with dedicated databases (we're not talking Excel here) or web servers. Must they wait to donate their specimens to a museum before these extremely valuable data are made accessible? In many cases, a large donation of specimens to a museum sits in the corner and never gets accessioned because there simply isn't the human power at the receiving end to manage it all. Heck, some of the pre-eminent collections in the world don't even have staff to receive donations of any size! This is a travesty.

An attractive approach is to complement DiGIR/ABCD/TAPIR with a fully online solution akin to Google Spreadsheets. For the server on the other end, this means a beefy pipe and a hefty set of machines to cope with this AJAX-style, rapid & continuous access. But for small, taxon-centric communities, this isn't a problem. In fact, I developed such a Google Spreadsheet-like function in The Nearctic Spider Database for collectors wanting to manage their spider data.


Turn Up the Volume!


Watch the video above or HERE (better resolution). Everything is hosted on one machine on a residential Internet connection, & I have had up to 5 concurrent users on top of the usual 2,500 - 3,500 unique visitors a day with no appreciable drop in performance. Granted, things are a little slower in these instances, but the alternative is no aggregation of data at all. To help users, each of whom has their own table in the database, I designed some easy and useful tools. For example, they may query their records for nomenclatural issues, do some real-time reverse geocoding to verify that their records are actually in the State or Province they specified, and check for duplicates, among a few other goodies like mapping everything as clickable pushpins in Google Maps. Of course, one can export as Excel or tab-delimited text files at any time. The other advantage to such a system is that upon receiving user feedback and requests, I can quickly add new functions & these are immediately available to all users. I don't have to stamp and mail out new CDs, urge users to download an update, or maintain versions of various releases. If you're curious about doing the same sort of thing for your interest group, check out Rico LiveGrid Plus, the code base upon which I built the application.
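For a flavour of how simple some of these checks are under the hood, here is a sketch of duplicate flagging and a State/Province sanity check. This is not the Nearctic Spider Database's actual code; the record fields are hypothetical and reverse_geocode() is a stand-in for whatever geocoding service you have handy.

```python
# Two of the validation "goodies": flag duplicate records, and check that a
# record's stated province matches what a reverse geocoder says about its
# coordinates. Record fields and reverse_geocode() are hypothetical.
from collections import Counter

def find_duplicates(records):
    """Flag records that share species, coordinates, and collection date."""
    key = lambda r: (r["species"], r["lat"], r["lon"], r["date"])
    dupes = {k for k, n in Counter(key(r) for r in records).items() if n > 1}
    return [r for r in records if key(r) in dupes]

def check_province(record, reverse_geocode):
    """Return True if the stated province matches the reverse-geocoded one."""
    return reverse_geocode(record["lat"], record["lon"]) == record["province"]
```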

What would be really cool is if this sort of thing could be made into a Drupal-like module & bundled into Ubuntu Server Edition. A taxon focal group with a community of say 20-30 contributors could collectively work on their collection data in such a manner & never have to think about the tough techno stuff. They'd buy a cheap little machine for $400, slide the CD into the drive to install everything & away they go.

The real advantage to the online data management application in the Nearctic Spider Database is the quick access to the nomenclatural data. So, the Catalogue of Life Partnership & other major pools of names ought to think about simple web services from which such a plug-and-go system can draw names. It's certainly valuable to have a list of vetted names such as what ITIS and Species2000 provide, but to really sell them to funding agencies these organizations no doubt have to demonstrate how the names are being used. Web services bundled with a little plug-and-go CD would allow small interest groups to hit the ground running. Such a tool would give real-world weight to this business of collecting names and would go a long way toward avoiding the shell games these organizations probably have to play. I suspect these countless small interest groups would pay a reasonable annual subscription fee to keep the names pipes open. Agencies already exist to help monetize web services using such a subscription system. Perhaps it's worth thinking like Amazon Web Services (AWS), where users pay for what they use. Unlike AWS, however, incoming monies would only support Catalogue of Life Partnership wages and infrastructure, taking some weight off chasing grants.
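From the client side, a names pipe needn't be anything fancier than this sketch: one HTTP call with a subscription key, one JSON response with the accepted name. The endpoint, parameters, and response fields are all invented for illustration.

```python
# What a plug-and-go install might do behind the scenes: look up a name
# against a hypothetical vetted-names web service using a subscription key.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def lookup_name(base_url, api_key, name):
    query = urlencode({"key": api_key, "name": name, "format": "json"})
    with urlopen(f"{base_url}?{query}") as response:
        # e.g. {"accepted_name": "Araneus diadematus Clerck, 1757", "id": "..."}
        return json.load(response)

# lookup_name("http://names.example.org/api", "MY-SUBSCRIPTION-KEY",
#             "Araneus diadematus")
```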

Monday, June 18, 2007

Impossibility of Discovery

For the past couple of years, I have scoured the Internet for spider-related imagery and resources. I think I have a pretty good handle on it. But there are some gems out there that are nearly impossible to find. The discoverability issue has a lot to do with poor web design, which means little to absolutely no consideration for how search engine bots work. While it's commendable to put all that content on a website, it's equally important to ensure the work can be discovered. Many of the authors below should look at some of the offerings in Web Pages That Suck and pay close attention to the list of web design mistakes. Without good design, what's the point? Let it be clear that I'm not knocking the content; these are extremely valuable and obviously very time-consuming works. However, consideration must be given to the end user. Why not just get these works ready for a book & let a typesetter and layout editor handle the esthetics? A few of these examples are (in no particular order):


  1. Nachweiskarten der Spinnentiere Deutschlands

  2. Linyphiid Spiders of the World

  3. Arachnodata

  4. Aracnis

  5. Central European Spiders - Determination Key



Three words: kill the frames. If you have to use a frameset, give the user the option to turn it on or off.

Other sites have been improving dramatically, like Jørgen Lissner's Spiders of Europe. But it's worth thinking about a search function and also about hiding the back-end technology by creating clean URIs (i.e. .aspx pages might reflect your preferred platform today, but what if you decide one day to switch to Apache and PHP?). A bit of server-side URL re-writing can go a long way to ensure long-term access to your content. If you switch to Apache, MySQL, and serve content via PHP, you can make use of Apache's mod_rewrite...none of your incoming links break.
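For what it's worth, mod_rewrite handles this in a couple of configuration lines; the little Python/WSGI sketch below just illustrates the same idea, namely catching legacy .aspx URLs and redirecting them to clean, extension-free URIs so old links keep working. The paths are hypothetical.

```python
# A minimal sketch of server-side URL rewriting: redirect legacy ".aspx"
# links to clean URIs so incoming links survive a change of platform.
from wsgiref.simple_server import make_server

def rewrite_app(environ, start_response):
    path = environ.get("PATH_INFO", "")
    if path.endswith(".aspx"):
        # e.g. /species/Araneus_diadematus.aspx -> /species/Araneus_diadematus
        clean = path[: -len(".aspx")]
        start_response("301 Moved Permanently", [("Location", clean)])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Content lives at clean URIs now."]

if __name__ == "__main__":
    make_server("", 8000, rewrite_app).serve_forever()
```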

Some pointers:

If you're going to use drop-down menus, please, please make them useful & hierarchical by using some simple AJAX to submit a form and adjust the options. Nothing is more frustrating than scrolling through an endless list of species only to find the one you're looking for is not there, or selecting a species only to find no content. A list of taxonomic references is at least some content, even if that may seem rather thin. If Google and other search engines are having a rough time indexing your content, it is equally rough on end users. Another point is to lose the mindset that you're working with paper - the web is a highly interactive place and visitors have short attention spans. Limit the content to the most important bits. Use a pale background and dark-coloured text. Not only is printing web sites that use the reverse a pain, you are also saying, "I haven't thought about people with less than perfect vision." I could go on and on, but I'll leave it at that.

If you want a web site with hundreds of arachnid-related links, visit the Arachnology Homepage. Herman Vanuytven puts a lot of time into trying to make sense of all the arachnid content out there.