iSpiders: May 2007

Friday, May 25, 2007

DOI + EzProxy

I spent a few hours yesterday learning how to do a few rudimentary things in Perl. My goal was to create something useful with all the references in The Nearctic Spider Database. It's nice to have a list of papers on the species pages I host, but is this really useful to anyone? I'd much rather have a direct link to download the paper than see a reference list, which is fundamentally useless to me. Because DOIs and resolving services like CrossRef have become popular now, I thought I'd at least add "PDF" icons for the relatively recent articles so that you can download the paper. This is why I have been trying to learn Perl, which has a few really cool modules to parse references into their constituent OpenURL fields.

Here's what you can do with DOIs & why they're cool:

Ingi Agnarsson published a massive and thorough paper in the Zoological Journal of the Linnean Society entitled, "A revision of the New World eximius lineage of Anelosimus (Araneae, Theridiidae) and a phylogenetic analysis using worldwide exemplars". The full reference is:

Agnarsson, I. 2006. A revision of the New World eximius lineage of Anelosimus (Araneae, Theridiidae) and a phylogenetic analysis using worldwide exemplars. Zool. J. Linn. Soc. 146: 453-593.

That paper has the DOI 10.1111/j.1096-3642.2006.00213.x and I store this value in the Nearctic Spider Database's references table exactly like that because it is persistent. This DOI can be slapped behind http://dx.doi.org/ to give you http://dx.doi.org/10.1111/j.1096-3642.2006.00213.x. Now there's a ready-made, direct link to Ingi's paper right from the Nearctic Spider Database and I never have to worry about a dead link. But, there's a catch. It's a copyrighted paper so you have to log on to Blackwell Synergy's web site (somehow) to actually retrieve the full PDF and not just look at the nice title and abstract. If you happen to be on your institution's network, your library system is fully integrated into your internal network, and your library subscribes to the resource, then you can directly access the PDF. But what if you're using a computer at home or your institution hasn't fully integrated its library systems (requiring authentication)? Are we any better off with DOIs? For the end user, not really.
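As a minimal sketch of the "slap it behind http://dx.doi.org/" step, here it is in Python rather than Perl. The function name `doi_to_url` is my own invention for illustration, and the percent-encoding is an added precaution for DOIs that contain characters unsafe in a URL; the DOI in this post doesn't need it.

```python
from urllib.parse import quote

# Resolver prefix quoted in the post.
DOI_RESOLVER = "http://dx.doi.org/"


def doi_to_url(doi: str) -> str:
    """Build a resolver link from a bare DOI as stored in a references table.

    Hypothetical helper: "/" is kept readable, anything else that is unsafe
    in a URL path is percent-encoded.
    """
    return DOI_RESOLVER + quote(doi.strip(), safe="/")


if __name__ == "__main__":
    print(doi_to_url("10.1111/j.1096-3642.2006.00213.x"))
```

Storing only the bare DOI and building the link at display time means the resolver prefix lives in one place if it ever needs to change.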

However, in an earlier post where I tried to think about how The Encyclopedia of Life might generate a steady flow of income, I mentioned the use of EzProxy. Now my ideas have finally gelled.

To make the PDF links in The Nearctic Spider Database really work for remote users (e.g. working at home) who belong to an institution whose library subscribes to the Zoological Journal of the Linnean Society, I'd like to be able to dynamically rewrite http://dx.doi.org/10.1111/j.1096-3642.2006.00213.x to http://login.ezproxy.library.ualberta.ca/login?url=http://dx.doi.org/10.1111/j.1096-3642.2006.00213.x as is the case for the University of Alberta. Clicking that link will of course take you to the U of A library login screen prior to redirecting you to the publisher's page. Via that one minor login hiccup on your institution's library page, you can then download Ingi's paper. So how do you rewrite the URL on-the-fly? A cookie! Recipe: read member's cookie...find institution's EzProxy base URL stored there...change the DOI URL. Simple as that. Alternatively, the EzProxy base URL could be stored in a table in the server's database in the event an institution changes its EzProxy settings. You probably wouldn't want frustrated users trying to get things to work with a half-baked cookie in their cache. But (there's always a but), it would be quite unreasonable for me to store all these EzProxy base URLs in The Nearctic Spider Database. If there were a web service I could use that maintained such lists, then I would most certainly use it. I have no money, but The Encyclopedia of Life does! Folks leading EoL are searching for a way to encourage systematists to vet species pages, so here's a really nice way to give something in return: direct links to PDF downloads. The subscription fee charged to institutions would be to maintain working EzProxy base URLs (or similar base proxy URLs to coordinate this) in the EoL database. Now that would be cool.
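Here is a rough sketch of that recipe in Python. It blends the two options above: the cookie only carries an institution key, and the EzProxy base URL comes from a server-side table so that stale cookies can't break links. The cookie name `institution`, the key `ualberta`, and both function names are assumptions made up for illustration; only the U of A login prefix comes from the example above.

```python
from http.cookies import SimpleCookie
from typing import Optional

# Hypothetical server-side table: institution key -> EzProxy login prefix.
EZPROXY_PREFIXES = {
    "ualberta": "http://login.ezproxy.library.ualberta.ca/login?url=",
}


def institution_from_cookie(cookie_header: str) -> Optional[str]:
    """Read the (assumed) 'institution' cookie from a raw Cookie header."""
    jar = SimpleCookie()
    jar.load(cookie_header)
    morsel = jar.get("institution")
    return morsel.value if morsel is not None else None


def rewrite_for_member(doi_url: str, institution: Optional[str]) -> str:
    """Prefix the DOI link with the member's EzProxy login URL, if known."""
    prefix = EZPROXY_PREFIXES.get(institution or "")
    # Unknown or missing institution: fall back to the plain resolver link.
    return prefix + doi_url if prefix else doi_url


if __name__ == "__main__":
    link = "http://dx.doi.org/10.1111/j.1096-3642.2006.00213.x"
    print(rewrite_for_member(link, institution_from_cookie("institution=ualberta")))
```

Visitors without a recognized cookie simply get the ordinary DOI link, so nothing breaks for anyone who hasn't told the site where they work.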

Post Update

Apparently, I haven't been the only one thinking about these sorts of remote authentication issues. There's a blog called "The Distant Librarian" where these exact same ideas have been kicked around, though with use in Google Scholar (1,2)...same principle. So, it appears that The Encyclopedia of Life can be added to institutions' EzProxy configs to allow direct links to the PDF downloads. Also of interest is WAG the Dog PHP localizer (also available on SourceForge). Wouldn't you know it, Peter Binkley was one of the developers of WAG. He's a librarian at my own institution! I'll have to go have a coffee with him.

I also discovered a Firefox extension under development by a number of libraries called LibX, which does magic, client-side URL rewriting for DOIs, has Google Scholar integration, & works with EzProxy. Now all we need is buy-in by all libraries for this & for IE/Safari folks to use FF. If you absolutely can't wait for librarians in your institution to make a LibX FF extension, you can probably kludge together an approximation of its functionality using Greasemonkey and Jesse Ruderman's autolink script, which you can adjust (if you're familiar with regexp) to suit the way DOIs are typically presented on web sites.
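To give a feel for the kind of regexp involved, here is a loose sketch in Python. It is not the autolink script's own code (that is JavaScript), and the pattern is my assumption about how DOIs usually show up in running text: an optional "doi:" label, then "10.", a registrant number, a slash and a suffix. Real DOIs can contain almost any character, so treat it as a starting point.

```python
import re

# Assumed pattern: optional "doi:" label, then 10.<registrant>/<suffix>.
DOI_PATTERN = re.compile(r'\b(?:doi:\s*)?(10\.\d{4,9}/[^\s"<>]+)', re.IGNORECASE)


def find_dois(text: str) -> list:
    """Return the bare DOIs found in text, in order of appearance."""
    # Trailing punctuation usually belongs to the sentence, not the DOI.
    # A few legitimate DOIs do end in ")", so this trimming is a compromise.
    return [m.group(1).rstrip(".,;)") for m in DOI_PATTERN.finditer(text)]


def doi_links(text: str) -> list:
    """Pair each DOI found in text with its resolver URL."""
    return [(doi, "http://dx.doi.org/" + doi) for doi in find_dois(text)]


if __name__ == "__main__":
    sample = "Zool. J. Linn. Soc. 146: 453-593. doi:10.1111/j.1096-3642.2006.00213.x."
    for doi, url in doi_links(sample):
        print(doi, "->", url)
```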

Wednesday, May 23, 2007

How to Present Your First Paper at a Scientific Conference

Still chicken?

The Headless, Household Server

Many of you are aware that The Nearctic Spider Database and all the other goodies I have been fussing with are hosted off a machine in my basement. I presented its design and capabilities at the last ECN meeting in Indianapolis, IN. The student with a server in his basement drew a few chides and chuckles, but I suspect it made many stop & think. A few in attendance were wary of such a home-grown project. Back-ups? Theft? Damage? Flood? These same issues are faced by web hosting companies. As long as you have a reasonable solution to all these (e.g. scheduled and off-site storage of back-ups), and you periodically have a look at your web logs to assess traffic and bandwidth such that your Internet service provider doesn't pull the plug on you, what's the big deal? Databases and websites are portable. It would take perhaps an hour to remotely transfer the whole she-bang to another machine. What I also hope transpired from that meeting is an understanding that this stuff doesn't require rocket science and a massive team of database and web engineers. Just a bit of time and patience. So, this post is an introduction to a do-it-yourself, headless (no monitor), household server. If you have a ton of images and data that you haven't figured out what to do with, what are you waiting for?

Hardware: Any old PC will do so long as it has a reasonable amount of oomph and you can jam it full of memory. There are lots of local Mom & Pop computer stores that will sell you a brand new PC for less than $500. Remember, you don't need an Operating System (more below) and you don't need a monitor...you'll just need to borrow one for the initial install. What you do need though are: 1) a good, name-brand power supply unit, 2) a bare-bones, read-only CD-ROM, 3) a quiet case with good ventilation, 4) a minimum of 2GB RAM, 5) a 2.4GHz processor or more (too much more is just a waste of energy), 6) a motherboard with onboard network connection & video, and 7) a couple of hard drives (size depends on your needs).

Software: Ubuntu Server Edition gives you a fully-functioning LAMP install and is a free download that you can burn to CD. LAMP = Linux, Apache, MySQL database, and PHP for programmatic delivery of web pages. The Ubuntu community is very active and can help troubleshoot issues you may have with installation. With a reasonably well configured machine (cutting edge hardware is best avoided and unnecessary), you ought to be able to get a bare-bones LAMP install, ready for data import, in an hour or so.

What you also need is a hardware router and a home Internet service provider that is reasonably lax when it comes to hosting stuff on their pipes. Some purposefully block TCP port 80, the channel web traffic travels on, but many others recognize the stiff competition out there for customers and consequently turn a blind eye. Since your home Internet protocol address might very well be dynamically assigned, you can make use of free services like DynDNS & configure your router to automatically send an update to this service should your provider assign you a new IP.

There's more to it than this of course (i.e. database development, web page design and delivery & remote access from another PC on your home network), but these are the basics. If you want a visual step-by-step, Falko Timme has a nice article on howtoforge.com.

Friday, May 18, 2007

Remixing the Web II

I wondered when Microsoft would enter the visual mashup IDE world. With Yahoo Pipes, Zude and similar projects already out there, Microsoft has jumped in and is making great use of its cross-platform / cross-browser Silverlight plugin within a visual mashup interface they are calling Popfly. Google, where are you on this front? I had my reservations at first, but I am now thoroughly convinced that a MyEoL within The Encyclopedia of Life is well within reach. What remains is intelligent and simple use of these technologies while maintaining accreditation for contributions.

Thursday, May 17, 2007

Monetizing the Encyclopedia of Life

First, let me preface this post by saying I know next to nothing of library informatics or politics. However, I have been chewing on an idea that may help monetize The Encyclopedia of Life in an acceptable fashion, thus building a long term flow of income to help build this great resource.

While remotely using my institution's library to hunt for PDF reprints, it occurred to me that EoL ought to negotiate using uBioRSS as a database similar in functionality to the large-scale, educational databases like BioOne. My University uses EzProxy to coordinate logon by students to access library resources, which essentially means that remote sessions appear local. So, accessing a PDF from a publisher once authenticated through my library's system is quick and easy. In very simplistic (and admittedly naive) terms, "login.ezproxy.library.ualberta.ca" for example gets appended as a suffix to the publisher's domain to coordinate this pass-through authentication. Couldn't the uBioRSS service be rolled into EoL along with a reasonable subscription fee for institutions such that students and employees can directly access PDF reprints right off the species pages in EoL? The majority of the content on species pages in EoL would of course still be fully open access; it's merely the direct links to copyrighted PDF downloads that would require prior authentication from within an institution. It would be absurd to charge institutions the typical BioOne-type subscription fees, but why not a reasonable fee that helps offset the EoL staffing and infrastructure costs?
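Taken at face value, that simplistic description of the pass-through looks something like the Python sketch below. The function name `proxied_url` and the publisher address are placeholders invented for illustration, and how a given library actually configures EzProxy may well differ from this naive picture.

```python
from urllib.parse import urlsplit, urlunsplit


def proxied_url(url: str, proxy_host: str) -> str:
    """Append the proxy host as a suffix to the publisher's domain.

    Naive illustration only: any port number in the original URL is dropped.
    """
    parts = urlsplit(url)
    return urlunsplit(parts._replace(netloc=f"{parts.hostname}.{proxy_host}"))


if __name__ == "__main__":
    # publisher.example.org is a placeholder, not a real publisher.
    print(proxied_url("http://publisher.example.org/article.pdf",
                      "login.ezproxy.library.ualberta.ca"))
```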

Remixing the Web

There's a new social networking application on the horizon that quite frankly, scares me. If you thought Facebook, MySpace, and other social networking applications or systems were pervasive & viral, wait 'til Zude hits the scene. If you want a preview of the capabilities and have ~15-20 minutes to spare, I urge you to check out the ZDNet preview video: http://zdnet.com.com/1606-2_2-6176625.html. David Berlind hinted that a beta of Zude would be available May 1st, but this hasn't yet happened. Marketing ploy to generate more interest? Not yet ready? No matter. When this appears, it'll completely change the landscape on the Internet and we'll collectively have to think very seriously about copyright and content ownership. Regardless of what happens on those fronts, it sounds as if a third party can license the drag-drop functionality in Zude. This has direct relevance to the MyEoL environment in The Encyclopedia of Life.

Friday, May 11, 2007

The Living Encyclopedia - MyEoL

The greatest challenge engineers and architects for the Encyclopedia of Life (EoL) will face is the proposed MyEoL, part of which will be the workbench for authors to contribute content. Without this, EoL degrades to a search engine with a spider (aka EoLbot) with a bit of Catalogue of Life name-smarts. In MyEoL, material must be made accessible in some form of drag-drop interface along with some form of textarea WYSIWYG box to write content. Images and snazzery (my new word-of-the-day) aside, it's ultimately the textual content that will drive discoverability. After all, this is the basis for any local or remote search engine index because image and video metadata are terrible. So how do you get contributors to sit down and type content? Do you first create a politically messy granting scheme by getting public funding agencies on-board to fund such manual efforts? Or, do you create something beyond the catch-phrase, Web 2.0?

Like Rod Page of iSpecies fame, I have been following the progress on mash-up technologies like Yahoo's Pipes, Dapper, OpenKapow and similar emerging tools. Nick Gonzalez has a nice overview of these. The one that stands out from all these in my mind is RSSBus. There are two reasons it hasn't quite caught on like Yahoo's Pipes and the others: 1. there is no slick user interface, and 2. it is not yet cross-platform. However, don't sell it short because it is far more powerful than most have given it credit for. What really attracts me to RSSBus is its server to desktop and back (push-pull) architecture with the ability to use or create any sort of connector. One can pull XML data from an on-line resource, mix it with local Excel data or other data objects, then churn it out as an RSS feed if so desired. Here then is a superb opportunity for the systematics community - heck any biological community - to leverage this great work. Coincidentally, Donald Hobern (GBIF) has already coined a Biodiversity Data Bus for EoL's server-server communications.

But why stop at the server environment? What would truly change the way we conduct biological research, thus building the EoL dream, is if this data bus were extended to the desktop. Wouldn't you like to mix your local data with that pulled from external resources? I sure would. Better yet, imagine creating a Facebook-like community of colleagues when preparing data for a manuscript. Each co-author contributes his/her data via their desktop RSSBus, leverages the great work on names management undertaken by the Catalogue of Life and uBio thus making a great first crack at merging data sets, then the co-authors in this little invite-only community can collectively work on analyses and presentation for the manuscript. What we typically have today with co-authored manuscripts is one or a few more individuals responsible for the grunt work of merging data sets and making sense of it. At the end of a much more simplified, RSSBus-like communal data merging effort, I would then be very much inclined to click a button and push such a creation or parts of this creation back out to EoL.

What also hasn't been effectively discussed is how EoL will acquire content to feed its pages. Will it be an EoLbot like what was hinted at in a few press announcements or will it be something like DiGIR with canned or modularized Darwin Core-like elements? That may work for existing species page providers who serve their pages from a backend, but what about all that great, flat HTML content out there for which only traditional search engines like Google, Yahoo, MSN and other big players have been scouring? Does EoL intend to use Google, Yahoo, and others and scrape their results to feed the initial EoL species pages? Yikes. That scares me because these engines may and often do produce erroneous results...they haven't got biological intelligence. Google Images is particularly bad at handling nomenclature and image associations as I discovered with some of the indexed content from The Nearctic Spider Database (e.g. HERE). Here's also where we might be creative with RSSBus if it could be married with something like OpenSearch and a client-run spidering and indexing tool for their served content. One such example is the inexpensive Zoom Search that has lots of great plug-ins to read image metadata and ultimately produces an index through its spidering algorithms for use in a template-driven search portal. This sort of system with a UDDI registry would be really cool because attribution is then possible without any great deal of effort. Stripping canned results from Google or Yahoo to build the initial content does not come bundled with attribution for the source. EoL can essentially create a search engine for content providers, freely hand it out and content providers can pretty-up their search portal and spider their content as they want. This is great value because as with Zoom Search, content providers can log search queries and get a sense for what people are actually searching for on their pages. 
Behind the scenes, EoL pulls content via OpenSearch to feed the RSSBus scaffolding in MyEoL.

Though this video doesn't really give RSSBus its deserved credit and it's tough to see the relevance in biology, it nonetheless provides a glimpse of what I have been talking about with the "Living Encyclopedia" as opposed to merely an "Encyclopedia of Life".

Thursday, May 10, 2007

Encyclopedia of Life

The Encyclopedia of Life was officially launched yesterday and I have been reading various public postings to gauge response.

Of all these I have tried to scan, Slashdot is perhaps the most active and most informative. So, here is my best attempt at summarizing public reception:

1. There's a notable lack of understanding for how this will be different from WikiSpecies. There is little to no appreciation for the challenge of names management of the kind spearheaded by uBio. This is a critical piece of the puzzle that cannot be done in a 2-dimensional wiki environment.

2. Most people understand that content will be developed by first "harvesting" material scattered on the Internet then cleaned-up by "scientists". But there has been little to no discussion on how that will be accomplished or how accreditation will be maintained.

3. There hasn't yet been much discussion on the TAXACOM or ENTOMO-L listservs about the encyclopedia. A few suggested that monies would be better put into pure systematics rather than into a "bean-counting" exercise. Others recognize that the content will necessarily be created by systematists, but see that there is as yet no incentive to do so. Millions of $$ are dumped into scanning materials, wages, etc. but how does that filter down to individual contributors?

4. The "MyEoL" vs. the canonical "EoL" content is not well appreciated.

5. Timeframe for "completion" has been grossly misunderstood. Many believe they have to wait for 10 years to see anything come out of EoL.

The EoL Workbench: "The Living Encyclopedia"

While I do think an Encyclopedia of Life will be a most amazing resource, there is a very large & critically missing piece of the puzzle as it relates to #3 in my list above. The EoL promotional video grandly expresses that such an encyclopedia will transform the science of biology. How? What we see to date is a digital version of a paper encyclopedia with a bit of gadgetry to encourage public participation. If this is to transform biology, it must simplify communications or the work flow in day-to-day biological pursuits. This is where my "Living Encyclopedia" idea comes in.

First, the communications required to conduct revisionary work need to be very well understood. For starters, ALL type specimens must first be made available & directly tied to channels of communication with the curators charged with maintaining those holdings. The folks at GBIF have taken great first steps, but digital representations of ALL type specimens accessible via DiGIR, TAPIR or other means are nowhere near complete. Systematists must still scour old literature to learn where type specimens are held, write letters, etc. to acquire specimens BEFORE any serious revision can be started.

Second, there is currently no effective link between publishers of taxonomic literature & the nomenclators. Until that happens, an encyclopedia will be woefully dated.

So here's the idea...

Integrated into "MyEoL" ought to be a blank slate - an organizational workbench if you will. Here, systematists run very simple web service queries to GBIF to create a visual "cloud" of specimens of interest. This would be much like Yahoo's Pipes. Through drag-and-drop, communication with curators is coordinated for loans. A systematist would continue to use this tool to add/remove specimens according to the concept they are working with, connect pieces of the cloud to other resources like those in GenBank, insert references, etc. all via drag-drop. Upon completion of the work, the circumscribed specimens and the entire visual representation of the workflow are "locked" and a permanent URL is issued, which can and ought to be present in the eventual publication. When the publication is accepted, the systematist then returns to the workbench to insert the publication's DOI. In such a manner, all the digital bits are in place. This permanent URL that describes the circumscription of specimens is then in the public domain such that other systematists may examine the "guts" behind the publication. Once all is said and done, this then becomes the scaffolding for a species page in EoL.
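To make the add/remove, lock, then insert-the-DOI life cycle a little more concrete, here is a bare-bones Python sketch. Everything in it is hypothetical: the `Workbench` class, the placeholder address eol.example.org, and in particular the choice to derive the permanent URL from a hash of the locked specimen list, which is just one way such a URL could be minted.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional, Set

# Placeholder base address, not a real EoL endpoint.
BASE_URL = "http://eol.example.org/workbench/"


@dataclass
class Workbench:
    """A circumscription of specimens that can be frozen and then cited."""
    specimens: Set[str] = field(default_factory=set)
    locked: bool = False
    doi: Optional[str] = None

    def add(self, specimen_id: str) -> None:
        self._require_unlocked()
        self.specimens.add(specimen_id)

    def remove(self, specimen_id: str) -> None:
        self._require_unlocked()
        self.specimens.discard(specimen_id)

    def lock(self) -> str:
        """Freeze the specimen set and return its permanent URL."""
        self.locked = True
        return self.permanent_url()

    def permanent_url(self) -> str:
        if not self.locked:
            raise ValueError("lock the workbench before citing it")
        # Sorting makes the URL independent of the order specimens were added.
        listing = "\n".join(sorted(self.specimens)).encode("utf-8")
        return BASE_URL + hashlib.sha256(listing).hexdigest()[:16]

    def attach_doi(self, doi: str) -> None:
        """Record the publication's DOI once the paper is accepted."""
        if not self.locked:
            raise ValueError("attach the DOI only after locking")
        self.doi = doi

    def _require_unlocked(self) -> None:
        if self.locked:
            raise ValueError("workbench is locked")


if __name__ == "__main__":
    bench = Workbench()
    bench.add("SPECIMEN-0001")  # made-up specimen identifiers
    bench.add("SPECIMEN-0002")
    print(bench.lock())
    bench.attach_doi("10.1111/j.1096-3642.2006.00213.x")
```

Because the URL is computed from the locked contents, anyone examining the "guts" later can check that the specimen list hasn't changed since publication.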

This is vastly different from a purely post-hoc encyclopedia because the incentive is a simplified & accelerated workflow - a much lower-level entry point. So, the amount of teeth-pulling required to build an effective Encyclopedia of Life with content written by the scientific community is vastly reduced. The former model doesn't require accreditation for resources or material but the present model is a mess of very difficult "mash-ups". Sure, a first crack at the encyclopedia can be harvested content, but to be sustainable, it has to either adopt a distasteful monetization scheme (e.g. supported by click-through advertising) or create a low-level, organizational workbench of immense value to the scientific community, very easily expressed to government and public funding agencies like NSF, NSERC, etc.