Tuesday, June 19, 2007

DiGIR for Collectors

The Global Biodiversity Information Facility (GBIF) has done an excellent job of designing the infrastructure to support the federation of specimen and observation data. The majority of contributing institutions use Distributed Generic Information Retrieval (DiGIR), an open-source PHP-based package that translates columns of data in one's dedicated database (e.g. MySQL, SQL Server, Oracle) into Darwin Core fields. So, even if your data columns don't match the Darwin Core schema, you can use the DiGIR configurator to match your columns to what's needed at GBIF's end. Europeans tend to prefer Access to Biological Collection Data (ABCD) as their transport mechanism. The functionality of these will soon be rolled into the TDWG Access Protocol for Information Retrieval (TAPIR). To the uninitiated like me, this is a confusing alphabet soup, and at first I couldn't navigate this stuff.
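To make the column-matching idea concrete, here is a minimal JavaScript sketch. It is not DiGIR's code (DiGIR is PHP and has its own configurator); the local column names and the `columnMap` object are hypothetical, and the Darwin Core-style field names are only illustrative, so check them against the schema version your provider actually uses.

```javascript
// Hypothetical mapping from local database columns to Darwin Core-style fields.
// The left-hand column names are invented for illustration.
const columnMap = {
  sp_name: "ScientificName",
  coll_by: "Collector",
  lat_dd: "DecimalLatitude",
  lon_dd: "DecimalLongitude",
};

// Translate one local record into a record keyed by the mapped field names.
// Columns without a mapping are dropped rather than guessed at.
function toDarwinCore(row, map) {
  const out = {};
  for (const [column, value] of Object.entries(row)) {
    if (Object.prototype.hasOwnProperty.call(map, column)) {
      out[map[column]] = value;
    }
  }
  return out;
}

const record = toDarwinCore(
  { sp_name: "Salticus scenicus", coll_by: "A. Collector", lat_dd: 53.5, notes: "kitchen" },
  columnMap
);
console.log(record);
```

The point is simply that the database keeps its own column names and the mapping lives in one place, which is roughly what the configurator lets you set up by clicking.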

Suffice it to say, the documentation isn't particularly great on either the TDWG or GBIF web sites. To the TDWG folks: a screencast with a step-by-step install for both Windows & Linux would go a long way! I don't mean a flashy Encyclopedia of Life webcast, I mean a basic TDWG for Dummies. If you have a dedicated database and a web server that can push PHP-based pages, it's actually pretty straightforward once you get going. It's really just a matter of jumping through a few simple hoops. Click here; do this; match this; click there - not much more difficult than managing an Excel datasheet. The downloads & step-by-step instructions for DiGIR can be found HERE. The caveat: you need a dedicated database, a dedicated web server, and you need your resource to be recognized by a GBIF affiliate before it's registered for access. That's unfortunately how all this stuff works.

So what about the casual or semi-professional collector who may have much larger collections than what can be found in museums? It's not terribly likely that countless, hard-working people like these have the patience to fuss with dedicated databases (we're not talking Excel here) or web servers. Must they wait to donate their specimens to a museum before these extremely valuable data are made accessible? In many cases, a large donation of specimens to a museum sits in the corner and never gets accessioned because there simply isn't the human power at the receiving end to manage it all. Heck, some of the pre-eminent collections in the world don't even have staff to receive donations of any size! This is a travesty.

An attractive solution to this is to complement DiGIR/ABCD/TAPIR with a fully online solution akin to Google Spreadsheets. For the server on the other end, this means a beefy pipe and a hefty set of machines to cope with this AJAX-style, rapid & continuous access. But, for small taxa-centric communities, this isn't a problem. In fact, I developed such a Google Spreadsheet-like function in The Nearctic Spider Database for collectors wanting to manage their spider data.

Turn Up Volume!

Watch the video above or HERE (better resolution). Everything is hosted on one machine on a residential Internet connection & I have had up to 5 concurrent users + all the usual 2,500 - 3,500 unique visitors a day with no appreciable drop in performance. Granted, things are a little slower in these instances, but the alternative is no aggregation of data at all. To help users, who each have their own table in the database, I designed some easy and useful tools. For example, they may query their records for nomenclatural issues, do some real-time reverse geocoding to verify that their records are actually in the State or Province they specified, and check for duplicates, among a few other goodies like mapping everything as clickable pushpins in Google Maps. Of course, one can export as Excel or tab-delimited text files at any time. The other advantage to such a system is that upon receiving user feedback and requests, I can quickly add new functions & these are immediately available to all users. I don't have to stamp and mail out new CDs, urge them to download an update, or maintain versions of various releases. If you're curious about doing the same sort of thing for your interest group, check out Rico LiveGrid Plus, the code base upon which I built the application.

What would be really cool is if this sort of thing could be made into a Drupal-like module & bundled into Ubuntu Server Edition. A taxon focal group with a community of say 20-30 contributors could collectively work on their collection data in such a manner & never have to think about the tough techno stuff. They'd buy a cheap little machine for $400, slide the CD into the drive to install everything & away they go.

The real advantage to the on-line data management application in the Nearctic Spider Database is the quick access to the nomenclatural data. So, the Catalog of Life Partnership & other major pools of names ought to think about simple web services upon which such a plug-and-go system can draw its names. It's certainly valuable to have a list of vetted names such as what ITIS and Species2000 provide, but to really sell them to funding agencies they no doubt have to demonstrate how the names are being used. Web services bundled with a little plug-and-go CD would allow small interest groups to hit the ground running. Such a tool would give real-world weight to this business of collecting names and would go a long way toward avoiding the shell games these organizations probably have to play. I suspect these countless small interest groups would pay a reasonable, annual subscription fee to keep the names pipes open. Agencies already exist to help monetize web services using such a subscription system. Perhaps it's worth thinking like an Amazon Web Service (AWS) where users pay for what they use. Unlike AWS however, incoming monies would only support the Catalog of Life Partnership wages and infrastructure to take some weight off chasing grants.

Monday, June 18, 2007

Impossibility of Discovery

For the past couple of years, I have scoured the Internet for spider-related imagery and resources. I think I have a pretty good handle on it. But, there are some gems out there that are nearly impossible to find. The discoverability issue has a lot to do with poor web design, and that means little to absolutely no consideration for how search engine bots work. While it's commendable to put all that content on a website, it's equally important to ensure the work can be discovered. Many of the authors below should look at some of the offerings in Web Pages That Suck and pay close attention to the list of web design mistakes. Without good design, what's the point? Let it be clear that I'm not knocking the content; these are extremely valuable and obviously very time-consuming works. However, consideration must be given to the end user. Why not just get these works ready for a book & let a typesetter and layout editor handle the esthetics? A few of these examples are (in no particular order):

  1. Nachweiskarten der Spinnentiere Deutschlands

  2. Linyphiid Spiders of the World

  3. Arachnodata

  4. Aracnis

  5. Central European Spiders - Determination Key

Three words: kill the frames. If you have to use a frameset, give the user the option to turn it on or off.

Other sites have been improving dramatically, like Jørgen Lissner's Spiders of Europe. But, it's worth thinking about a search function and also hiding the backend technology by creating technology-neutral URIs (e.g. ASP.NET with its .aspx pages might be your preferred platform today, but what if you decide one day to switch to Apache and PHP?). A bit of server-side URL rewriting can go a long way to ensure long-term access to your content. If you switch to Apache, MySQL, and serve content via PHP, you can make use of Apache's mod_rewrite so that none of your incoming links break.
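mod_rewrite itself is configured in Apache, not written in JavaScript, but the idea behind it is easy to sketch. In the rough example below, the public paths (`/species/42`) and the backend scripts (`species.php`, `family.php`) are hypothetical names of my own choosing; the only point is that visitors and search engines see the left-hand side, and you are free to change the right-hand side later.

```javascript
// Hypothetical rewrite rules: a public, technology-neutral path pattern and
// the backend script that currently serves it.
const rewriteRules = [
  { pattern: /^\/species\/(\d+)\/?$/, target: "/species.php?id=$1" },
  { pattern: /^\/family\/([A-Za-z]+)\/?$/, target: "/family.php?name=$1" },
];

// Return the internal URL for a public path, or null if no rule matches.
function rewrite(path, rules) {
  for (const { pattern, target } of rules) {
    if (pattern.test(path)) {
      // $1 in the target is filled in with the captured id or name.
      return path.replace(pattern, target);
    }
  }
  return null;
}

console.log(rewrite("/species/42", rewriteRules));
console.log(rewrite("/family/Salticidae/", rewriteRules));
```

If the site later moves from one server technology to another, only the `target` strings change; the links other people have made to `/species/42` keep working.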

Some pointers:

If you're going to use drop-down menus, please, please make them useful & hierarchical by using some simple AJAX to submit a form and adjust the options. Nothing is more frustrating than scrolling through an endless list of species only to find the one you're looking for is not there or to select a species only to find no content. A list of taxonomic references is at least some content even if that may seem rather thin. If Google and other search engines are having a rough time indexing your content, it is equally rough on end users. Another point is to lose the mindset that you're working with paper - the web is a highly interactive place and visitors have short attention spans. Limit the content to the most important bits. Use a pale background and dark-coloured text. Not only is printing web sites that use the reverse a pain, you are also saying, "I haven't thought about people with less than perfect vision." I could go on and on, but I'll leave it at that.
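The hierarchical drop-down idea can be sketched in a few lines. This is only an illustration under assumptions: the `checklist` array and its `hasContent` flag are hypothetical, and a real site would fetch the list from the server with AJAX when the genus changes rather than hard-code it.

```javascript
// Hypothetical checklist; hasContent marks whether a species page has anything on it.
const checklist = [
  { genus: "Salticus", species: "scenicus", hasContent: true },
  { genus: "Salticus", species: "peckhamae", hasContent: false },
  { genus: "Araneus", species: "diadematus", hasContent: true },
];

// Genera worth offering in the first menu: those with at least one species page
// that actually has content.
function genusOptions(list) {
  const genera = list.filter((t) => t.hasContent).map((t) => t.genus);
  return [...new Set(genera)].sort();
}

// Species to offer once a genus is chosen; entries with no content are left out
// so that nobody selects a species only to land on an empty page.
function speciesOptions(list, genus) {
  return list
    .filter((t) => t.genus === genus && t.hasContent)
    .map((t) => t.species)
    .sort();
}

console.log(genusOptions(checklist));
console.log(speciesOptions(checklist, "Salticus"));
```

The second menu stays short because it is rebuilt from the first choice, and empty pages never appear as options.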

If you want a web site with hundreds of arachnid-related links, visit the Arachnology Homepage. Herman Vanuytven puts a lot of time into trying to make sense of all the arachnid content out there.

Sunday, June 17, 2007

Sociology & Grabbing Web Images

Admittedly, I'm a Flickr noob and only recently made an account for myself. I'm not entirely pleased with the interface because frankly, it's too busy and inconsistent for my liking. However, I just stumbled across Picnik. From their FAQ:

What is Picnik?
Picnik is photo editing awesomeness, online, in your browser. It's the easiest way on the Web to fix underexposed photos, remove red-eye, or apply effects to your photos.

Not only that, it is well hooked into Flickr, Facebook and a number of other sites. What struck me among their somewhat hidden tools are the Firefox and Internet Explorer extensions. Now, extensions in Firefox are relatively easy to construct, but Internet Explorer extensions are a bit of a pain. So, I was curious to see what they did. It is a registry hack that contains the following:

Windows Registry Editor Version 5.00

; See http://msdn.microsoft.com/workshop/browser/ext/tutorials/context.asp for details
[HKEY_CURRENT_USER\Software\Microsoft\Internet Explorer\MenuExt\Edit in &Picnik]
@="http://www.picnik.com/extensions/ie-import.html"

That hack installs an option in the right-click context menu within Internet Explorer. When the hack is installed, you may right-click any image on a web page and choose "Edit in Picnik". I tried it and of course, the URL to the image is seamlessly grabbed and the image is immediately available in Picnik for editing using its rich array of very easy-to-use tools.

Being naturally curious, I pointed my browser to the URL in the registry hack above, http://www.picnik.com/extensions/ie-import.html, and was presented with a web page alert. The page source was the following:

// See http://msdn.microsoft.com/workshop/browser/ext/tutorials/context.asp for details
// (page title: Edit in Picnik for Internet Explorer)
try {
  // Window in which the context menu was opened
  var wndParent = external.menuArguments;
  // URL of the image that was right-clicked
  var strImportURL = wndParent.event.srcElement.src;
  wndParent.location = "http://www.picnik.com/?import=" + strImportURL;
} catch (ex) {
  alert("Unable to import image. Let us know at feedback@picnik.com");
}

So, it seems the registry hack merely calls the URL to an image and feeds it as a querystring parameter into Picnik. Pretty slick. But there's more...
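Stripped of the Internet Explorer plumbing, the trick is just string building. Here's a minimal sketch of it as a standalone function. One difference is my own addition: it percent-encodes the image URL with `encodeURIComponent`, a precaution so that any "&" or "?" inside the image address isn't mistaken for part of Picnik's own querystring. Whether Picnik expects the value encoded is an assumption on my part, and the example image address is made up.

```javascript
// Build an import URL of the kind the quoted script produces.
// Encoding the image URL is a precaution added here; the quoted script
// concatenates the raw address instead.
function buildImportUrl(imageUrl) {
  return "http://www.picnik.com/?import=" + encodeURIComponent(imageUrl);
}

// Hypothetical image address containing its own querystring.
console.log(buildImportUrl("http://example.org/images/spider.jpg?size=large&v=2"));
```

Without the encoding, the `&v=2` in the example would be read as a second parameter to Picnik rather than as part of the image address.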

You can grab any image like this, edit it in Picnik, then seamlessly send it to your Flickr account using the behind-the-scenes Flickr API. Technically, you never have to interface with the busy Flickr site. Almost everything can be done within Picnik: grab an image from your hard drive, a web site, etc.; edit it using its fully online, smooth and easy-to-use image editing tools; then push the result out to Flickr or Facebook, send it via email, or save it back to your hard drive, plus a number of other export options. Wow.

Grabbing images like this off a web page & then sending them off to Flickr certainly raises some serious copyright issues and speaks volumes about the sociology of web culture. These are exactly the kinds of content ownership issues that seem not to faze other emerging application developers like those behind Zude or ZCubes. Where's the justice? Do Creative Commons licenses make a whole hill of beans worth of difference when there are tools like these? I tried to get this notion across to the participants & contributors in/to BugGuide (see discussion) as encouragement to think about an API to propagate their great work in other resources like the Encyclopedia of Life. There are lots of synergistic reasons for building a simple API. BugGuide is a rich, longstanding, community-driven resource for people to post their images of NA arthropods, author guide pages, and discuss issues in a forum, among other really useful functions. Getting nomenclatural data from an Encyclopedia of Life API would be one simple example of potential information flow back to BugGuide.

One of the shortcomings of Picnik, Zude, and ZCubes is that there is no way to retain any accreditation other than the host domain from which the image was "ripped". It also means that MediaRSS extensions for RSS 2.0 or the FOAF and Dublin Core vocabularies for RSS 1.0 are entirely useless in this context because they aren't being used even if the data were available. What I think Picnik, Zude, and ZCubes (& Flickr too for that matter) ought to consider is an embedded metadata reader/writer for images. If this is so commonly done with MP3 music files, why is it taking so long for image files?

Wednesday, June 13, 2007

Salticus scenicus (Clerck, 1757)

To date, I haven't posted anything about spiders. This blog is at its heart about araneids after all, so I may as well get started on an interesting tidbit.

Here's a shot of a male Salticus scenicus (Clerck, 1757). I found this guy in my kitchen, busy terrorizing my 75-pound black Labrador Retriever. I rushed to get the camera, took several shots, and tried my best to maintain a steady hand. While on knees and elbows, a large, wet tongue repeatedly sought appeasement in my left ear. Later, I submitted the image, locality, and a few comments to Spider WebWatch because it's one of the nine species featured in that citizen science initiative. With prodding from a few folks, I designed the backend and layout for Spider WebWatch. It's a bit like a forum or a blog where participants can quickly click a spot on a Google Map, pick a date, type a free-form observation, and upload an image. Other participants can submit comments on individual observations, and everything I could think of has an RSS 2.0 feed with GeoRSS and MediaRSS extensions. In other words, if you're so inclined, you can grab these feeds and images and maintain textual accreditation for these contributions. I also have a dynamically created Google Earth download; the locales and observations are fed from the server to your machine when called, such that you don't have to download a large Google Earth file...it's a bit like the Google Earth Santa tracker in that regard.
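For anyone curious what an RSS 2.0 item with GeoRSS and MediaRSS extensions looks like, here is a rough sketch that builds one as a string. It is not the Spider WebWatch code; the `observationItem` function, the shape of the `obs` object, and the example values are all hypothetical. The element names (`georss:point`, `media:content`, `media:credit`) are the standard ones from those two extensions.

```javascript
// Escape the five characters that are special in XML text and attributes.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Build one RSS 2.0 <item> carrying a GeoRSS point and a MediaRSS image.
// The enclosing <rss> element must declare the two namespaces:
//   xmlns:georss="http://www.georss.org/georss"
//   xmlns:media="http://search.yahoo.com/mrss/"
function observationItem(obs) {
  return [
    "<item>",
    `  <title>${escapeXml(obs.title)}</title>`,
    `  <link>${escapeXml(obs.link)}</link>`,
    `  <description>${escapeXml(obs.note)}</description>`,
    // GeoRSS Simple: latitude then longitude, separated by a space
    `  <georss:point>${obs.lat} ${obs.lon}</georss:point>`,
    `  <media:content url="${escapeXml(obs.imageUrl)}" medium="image" />`,
    // The credit element is what lets a consumer keep the attribution
    `  <media:credit role="photographer">${escapeXml(obs.contributor)}</media:credit>`,
    "</item>",
  ].join("\n");
}

// Hypothetical observation, for illustration only.
console.log(
  observationItem({
    title: "Salticus scenicus",
    link: "http://example.org/observations/1",
    note: "Found in a kitchen & photographed",
    lat: 53.5,
    lon: -113.5,
    imageUrl: "http://example.org/images/1.jpg",
    contributor: "A. Contributor",
  })
);
```

Anyone consuming a feed like this gets the location, the image and the name of the person to credit in one place.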

Anyhow, on to more interesting matters...

Astute adherents to the International Code of Zoological Nomenclature will notice the date Salticus scenicus was described: 1757. Directly from the ICZN is the following:

Article 3. Starting point. The date 1 January 1758 is arbitrarily fixed in this Code as the date of the starting point of zoological nomenclature.

3.1. Works and names published in 1758. Two works are deemed to have been published on 1 January 1758:
- Linnaeus's Systema Naturae, 10th Edition;
- Clerck's Aranei Svecici.
Names in the latter have precedence over names in the former, but names in any other work published in 1758 are deemed to have been published after the 10th Edition of Systema Naturae.

3.2. Names, acts and information published before 1758. No name or nomenclatural act published before 1 January 1758 enters zoological nomenclature, but information (such as descriptions or illustrations) published before that date may be used. (See Article 8.7.1 for the status of names, acts and information in works published after 1757 which have been suppressed for nomenclatural purposes by the Commission).

There apparently has been plenty of bickering about Clerck's 1757 Aranei Svecici, which of course was published before Linnaeus' Systema Naturae. The full reference is:

Clerck, C. 1757. Svenska spindlar, uti sina hufvud-slågter indelte samt under några och sextio särskildte arter beskrefne och med illuminerade figurer uplyste. Stockholmiae, 154 pp.

According to Article 3.1 of the ICZN, the date of authorship for Salticus scenicus ought to be 1758, yet arachnid systematists (not naming names) have fought tooth and nail to preserve full recognition/respect for Clerck's work. Clerck originally described this species as Araneus scenicus; the genus Araneus was a veritable trash bin for a lot of spiders. Linnaeus redescribed the species as Aranea scenica in 1758, also a trash bin. So who's the authority? In case you're interested, the spiders in Linnaeus' tome are:

Linnaeus, C. 1758. Systema naturae per regna tria naturae, secundum classes, ordines, genera, species cum characteribus differentiis, synonymis, locis. Editio decima, reformata. Holmiae, 821 pp. (Araneae, pp. 619-624).

Sunday, June 10, 2007

Gimme That Scientific Paper! Part II

If you're wanting to make textual lists of online references more useful for visitors to your page(s) such that you can turn references like this:

Work, Timothy T., David P. Shorthouse, John R. Spence, W. Jan A. Volney, David Langor. 2004. Stand composition and structure of the boreal mixedwood and epigaeic arthropods of the Ecosystem Management Emulating Natural Disturbance (EMEND) landbase in northwestern Alberta. Can. J. For. Res. 34(2): 417–430.

Into this:

Work, Timothy T., David P. Shorthouse, John R. Spence, W. Jan A. Volney, David Langor. 2004. Stand composition and structure of the boreal mixedwood and epigaeic arthropods of the Ecosystem Management Emulating Natural Disturbance (EMEND) landbase in northwestern Alberta. Can. J. For. Res. 34(2): 417–430. [Search! icon]

[Doesn't work here because of Blogger constraints]

Where the little magnifying glass allows visitors to your page to search for the paper without you having to maintain the links, all you need to do is download this:

And follow the brief instructions in the JavaScript file. There's next to no additional mark-up required for the lists of papers (see Here in the comments section of a previous post). The script makes use of some cross-domain Flash, which requires that your domain be added to the list of domains allowed by Rod Page's bioGUID reference parser. However, I included some simple PHP and ASP examples to step around that constraint, and also a link to an online file storage service where you can get all the images I used. They are directly accessible here: http://www.box.net/shared/685i4nyxxj#1:7066241

Tuesday, June 5, 2007

Fun With RSS

Seems Google has finally jumped on the MediaRSS bandwagon. For some time now, I have been producing MediaRSS as well as simple GeoRSS feeds from The Nearctic Spider Database and Spider WebWatch. Of note, you can paste a GeoRSS feed URL into Google Maps to get an instant mash-up. Since Google just released an AJAX Feed API, I thought I'd give it a shot in this blog. Here's a feed from the 10 most recent Spider WebWatch posts where contributors uploaded images with their observations:

What I'd really like to see now is for OpenSearch to adopt these extensions such that 3rd party, client-run search engines like ZoomSearch start incorporating this stuff.

Friday, June 1, 2007

Gimme That Scientific Paper!

What irks me about references cited on web pages is that you can't directly get the PDF or at least immediately search for it unless the page author has explicitly put a link to that paper. When a page author has taken the time to construct these links, they often point to a 404 (page doesn't exist) because the link is no longer working. In the digital age, surely this sort of thing can be done more effectively. Well, thanks to Rod Page who has developed a reference parsing tool in his bioGUID suite of applications, this functionality along with some nifty Flash-based, cross-domain AJAX that I used, is now possible. For a taste of this, have a peek at the references page in The Nearctic Spider Database.

Now, what the heck is going on here? Glad you asked.

The chain of events I think is very cool.

First, I simply wrap each reference in a span element with a unique, sequentially numbered id and put a "holding" span with a similarly numbered id right after it:

<p><span id="bioGUIDref_1">This is the first full reference.</span> <span id="bioGUIDres_1"></span></p>

That's pretty easy for anyone with a rudimentary knowledge of HTML.

Second, I put a reference to a JavaScript in the page header whose functions initialize when the page finishes loading. That script counts all the references with the simple mark-up shown above and puts little search icons in the holding spans. Of course this script can be hosted elsewhere and anyone can put it in their page headers.
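What such a script does can be sketched as follows. This is a simplified illustration and not the actual script: the function name `addSearchIcons`, the icon file `search.png` and the `onSearch` callback are hypothetical. Only the `bioGUIDref_N` / `bioGUIDres_N` id convention comes from the mark-up shown earlier.

```javascript
// Walk bioGUIDref_1, bioGUIDref_2, ... until an id is missing, and drop a
// search icon into each matching bioGUIDres_N holding span.
// Returns the number of references found.
function addSearchIcons(doc, onSearch) {
  let count = 0;
  for (let n = 1; ; n++) {
    const ref = doc.getElementById("bioGUIDref_" + n);
    const holder = doc.getElementById("bioGUIDres_" + n);
    if (!ref || !holder) break; // numbering is sequential, so stop at the first gap
    const icon = doc.createElement("img");
    icon.src = "search.png"; // hypothetical icon file
    icon.alt = "Search!";
    // Clicking hands the citation text and its holding span to the search routine
    icon.onclick = () => onSearch(ref.textContent, holder);
    holder.appendChild(icon);
    count++;
  }
  return count;
}

// In a browser, run once the page has finished loading.
if (typeof window !== "undefined") {
  window.addEventListener("load", () => {
    addSearchIcons(document, (citation) => console.log("search for:", citation));
  });
}
```

Passing the document in as a parameter is just a convenience so the function can be exercised outside a browser; on a real page you would hand it `document`.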

Third, I put this mark-up at the bottom of the page that initializes a Flash item, which coordinates some cross-domain search functions via Rod's reference parsing API (more on that below):

<script type="text/javascript">FlashHelper.writeFlash();</script>

So, for the end user seeing a list of references with these little search icons stuck at the end of each of them as such:

Agnarsson, I. 2004. Morphological phylogeny of cobweb spiders and their relatives (Araneae, Araneoidea, Theridiidae). Zool. J. Linnean Soc. 141: 447-626.

...it's a simple matter of clicking each in turn to perform a real-time search for individual papers of interest [Disclaimer: of course the above example doesn't work here in this blog post]. If the paper is found somewhere in the ether, the icon changes. There are four possible outcomes, each with its own icon: a freely available PDF was found (yay!); the paper can be found via other means (subscription may be required); the reference was successfully parsed and searched but nothing was found; or the reference was not successfully parsed and consequently a search couldn't effectively be constructed.
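The four outcomes boil down to a small decision function. In this sketch the shape of the `result` object (`parsed`, `pdfUrl`, `otherUrl`) and the state names are hypothetical; the real response from the parser is XML and may be organized quite differently.

```javascript
// Decide which icon to show for a lookup result.
// Hypothetical result shape: { parsed: boolean, pdfUrl: string|null, otherUrl: string|null }
function iconState(result) {
  if (!result.parsed) return "not-parsed"; // citation could not be split up
  if (result.pdfUrl) return "free-pdf"; // freely available PDF
  if (result.otherUrl) return "found-elsewhere"; // subscription may be required
  return "not-found"; // parsed and searched, nothing turned up
}

console.log(iconState({ parsed: true, pdfUrl: "http://example.org/paper.pdf", otherUrl: null }));
console.log(iconState({ parsed: false, pdfUrl: null, otherUrl: null }));
```

Checking the parse failure first matters: without a parsed citation there is no search to report on.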

The really awesome part of this whole system is that it is laughably easy for anyone with a basic knowledge of HTML (no complex coding required!) to duplicate these functions on their authored web pages. But let's first have some background on how this works.

This cross-domain AJAX querying system uses Flash. Julien Couvreur worked with Jason Levitt (from Yahoo) to create an XMLHTTP transport that uses Flash. You can read about this in Julien's blog, Curiosity is Bliss, where he also has a nice demo that produces search results from Yahoo's ImageSearch API using this technique. Rod first had to get his reference parsing script to produce XML, and then had to create a simple crossdomain.xml document and put it in the root folder for his domain. Julien points out a potential security issue with these Flash-based cross-domain search queries, so Rod at the moment only has The Canadian Arachnologists' domain in his crossdomain.xml document.

An end user clicking an icon initiates a cross-domain request to Rod's machine. The reference is parsed in Rod's Perl script (i.e. split up into Author, Year, Title, Publication, Pages, etc. as required for OpenURL) then sent off to CrossRef and elsewhere to obtain search results. This system works fantastically well for modern publications that have bought into CrossRef's DOI system (note: handles are also working in Rod's Perl scripting), but what about all those scientific societies that produce online PDFs but haven't bought into DOIs?
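To give a feel for the "split up, then build an OpenURL" step, here is a minimal sketch that starts from already-parsed fields. It is not Rod's Perl code. The key names (`genre`, `aulast`, `date`, `atitle`, `title`, `volume`, `spage`, `epage`) follow common OpenURL 0.1 usage, while the resolver address and the property names on the `ref` object are placeholders of mine.

```javascript
// Turn already-parsed citation fields into an OpenURL-style query.
// baseUrl is a placeholder for whatever resolver receives the request.
function toOpenUrl(baseUrl, ref) {
  const params = new URLSearchParams();
  params.set("genre", "article");
  params.set("aulast", ref.authorLastName); // first author's surname
  params.set("date", ref.year);
  params.set("atitle", ref.articleTitle); // article title
  params.set("title", ref.journal); // journal title
  params.set("volume", ref.volume);
  params.set("spage", ref.startPage);
  params.set("epage", ref.endPage);
  return baseUrl + "?" + params.toString();
}

// Fields taken from the Agnarsson (2004) reference used as an example in this post.
console.log(
  toOpenUrl("http://resolver.example.org/openurl", {
    authorLastName: "Agnarsson",
    year: 2004,
    articleTitle:
      "Morphological phylogeny of cobweb spiders and their relatives (Araneae, Araneoidea, Theridiidae)",
    journal: "Zool. J. Linnean Soc.",
    volume: 141,
    startPage: 447,
    endPage: 626,
  })
);
```

The hard part is of course the parsing that comes before this, which is exactly where a sloppily written citation falls over.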

For smaller societies and publications like the Journal of Arachnology, Rod unfortunately must scrape the URLs to their digital reprints. [Aside: JoA does have DOIs, but these are issued from BioOne and an end-user accessing JoA articles via BioOne would of course be presented with a pay-per-view screen - sucky] In these cases then, the XMLHTTP system I have that sends citations to Rod's machine might return an erroneous link to a PDF if the source URL was changed & Rod hadn't yet updated his listings. But, as long as societies agree not to mess with their URL structure, the conduit to their PDFs remains viable. This is most certainly something The Encyclopedia of Life can coordinate.

Here then is a very slick little system that is easy for web page authors to implement and intuitively obvious for end-users. A potential pitfall worth mentioning is poorly constructed citations. Rod's algorithms that split a citation into its constituent bits are only as good as what goes in. In other words, at my end for example, if a "not parsed" icon is returned to the end-user, a digital version of the paper might exist somewhere - I just didn't construct my citation well enough for Rod's algorithm to split the bits into an OpenURL format. So, I am contemplating adding another icon to sit alongside that icon such that an end-user who knows the paper can be found online can send me a quick note/poke to tell me that I need to re-write the citation.

If you want some background on what Rod did at his end, head over to his blog where he wrote about OpenURL Here & Here.