Friday, May 25, 2007

DOI + EzProxy

I spent a few hours yesterday learning how to do a few rudimentary things in Perl. My goal was to create something useful with all the references in The Nearctic Spider Database. It's nice to have a list of papers on the species pages I host, but is this really useful to anyone? I'd much rather have a direct link to download the paper than see a reference list, which is fundamentally useless to me. Because DOIs and resolving services like CrossRef have become popular, I thought I'd at least add "PDF" icons for the relatively recent articles so that you can download the paper. This is why I have been trying to learn Perl, which has a few really cool modules to parse references into their constituent OpenURL structure.

Here's what you can do with DOIs & why they're cool:

Ingi Agnarsson published a massive and thorough paper in the Zoological Journal of the Linnean Society entitled, "A revision of the New World eximius lineage of Anelosimus (Araneae, Theridiidae) and a phylogenetic analysis using worldwide exemplars". The full reference is:

Agnarsson, I. 2006. A revision of the New World eximius lineage of Anelosimus (Araneae, Theridiidae) and a phylogenetic analysis using worldwide exemplars. Zool. J. Linn. Soc. 146: 453-593.

That paper has the DOI 10.1111/j.1096-3642.2006.00213.x and I store this value in the Nearctic Spider Database's references table exactly like that because it is persistent. This DOI can be slapped behind http://dx.doi.org/ to give you http://dx.doi.org/10.1111/j.1096-3642.2006.00213.x. Now there's a ready-made, direct link to Ingi's paper right from the Nearctic Spider Database and I never have to worry about a dead link. But, there's a catch. It's a copyrighted paper so you have to log on to Blackwell Synergy's web site (somehow) to actually retrieve the full PDF and not just look at the nice title and abstract. If you happen to be on your institution's network, your library system is fully integrated into your internal network, and your library subscribes to the resource, then you can directly access the PDF. But what if you're using a computer at home, or your institution hasn't fully integrated its library systems (requiring authentication)? Are we any better off with DOIs? For the end user, not really.
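For the record, the "slapping behind" step is about as simple as it sounds. Here's a minimal Python sketch (I'm using Python rather than Perl purely for illustration; the function name is made up, and the percent-encoding is a precaution I'm assuming is sensible for DOIs containing odd characters):

```python
from urllib.parse import quote

# Resolver prefix as given in the post.
DOI_RESOLVER = "http://dx.doi.org/"

def doi_to_url(doi):
    """Turn a bare DOI, stored as-is in the references table, into a resolver link."""
    # Keep "/" unescaped; percent-encode anything else that is unsafe in a URL path.
    return DOI_RESOLVER + quote(doi.strip(), safe="/")

print(doi_to_url("10.1111/j.1096-3642.2006.00213.x"))
```

Storing only the bare DOI and building the link at display time means the resolver prefix can be changed in one place if it ever needs to be.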

However, in an earlier post where I tried to think about how The Encyclopedia of Life might generate a steady flow of income, I mentioned the use of EzProxy. Now my ideas have finally gelled.

To make the PDF links in The Nearctic Spider Database really work for remote users (i.e. working at home) who belong to an institution whose library subscribes to the Zoological Journal of the Linnean Society, I'd like to be able to dynamically rewrite http://dx.doi.org/10.1111/j.1096-3642.2006.00213.x to http://login.ezproxy.library.ualberta.ca/login?url=http://dx.doi.org/10.1111/j.1096-3642.2006.00213.x as is the case for the University of Alberta. Clicking that link will of course take you to the U of A library login screen prior to redirecting you to the publisher's page. Via that one minor login hiccup on your institution's library page, you can then download Ingi's paper.

So how do you rewrite the URL on-the-fly? A cookie! Recipe: read member's cookie...find institution's EzProxy base URL stored there...change the DOI URL. Simple as that. Alternatively, the EzProxy base URL could be stored in a table in the server's database in the event an institution changes its EzProxy settings. You probably wouldn't want frustrated users trying to get things to work with a half-baked cookie in their cache.

But (there's always a but), it would be quite unreasonable for me to store all these EzProxy base URLs in The Nearctic Spider Database. If there were a web service I could use that maintained such lists, then I would most certainly use it. I have no money, but The Encyclopedia of Life does! Folks leading EoL are searching for a way to encourage systematists to vet species pages, so here's a really nice way to give something in return: direct links to PDF downloads. The subscription fee charged to institutions would be to maintain working EzProxy base URLs (or similar base proxy URLs to coordinate this) in the EoL database. Now that would be cool.
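The rewriting recipe above might look something like this in Python. This is only a sketch under assumptions: the lookup table, the "ualberta" key, and the function name are all hypothetical stand-ins for whatever the cookie or database table would actually hold. Only the U of A base URL and the /login?url= pattern come from the example above.

```python
# Hypothetical table: institution key (e.g. read from a member's cookie)
# mapped to that institution's EzProxy base URL.
# Only the University of Alberta entry is taken from the post.
EZPROXY_BASES = {
    "ualberta": "http://login.ezproxy.library.ualberta.ca",
}

DOI_RESOLVER = "http://dx.doi.org/"

def pdf_link(doi, institution=None):
    """Return the DOI link, routed through the member's EzProxy if one is known."""
    target = DOI_RESOLVER + doi
    base = EZPROXY_BASES.get(institution)
    if base is None:
        # Unknown or missing institution: fall back to the plain DOI link.
        return target
    return base + "/login?url=" + target

print(pdf_link("10.1111/j.1096-3642.2006.00213.x", "ualberta"))
print(pdf_link("10.1111/j.1096-3642.2006.00213.x"))
```

Keeping the base URLs in a server-side table and storing only the institution key in the cookie would mean a change to an institution's EzProxy settings needs fixing in one place only.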

Post Update

Apparently, I haven't been the only one thinking about these sorts of remote authentication issues. There's a blog called "The Distant Librarian" where these exact same ideas have been kicked around, though with use in Google Scholar (1,2)...same principle. So, it appears that The Encyclopedia of Life can be added to institutions' EzProxy configs to allow direct links to the PDF downloads. Also of interest is WAG the Dog PHP localizer (also available on SourceForge). Wouldn't you know it, Peter Binkley was one of the developers of WAG. He's a librarian at my own institution! I'll have to go have a coffee with him.

I also discovered a Firefox extension under development by a number of libraries called LibX, which does magic, client-side URL rewriting for DOIs, has Google Scholar integration, & works with EzProxy. Now all we need is buy-in by all libraries for this & for IE/Safari folks to use FF. If you absolutely can't wait for librarians in your institution to make a LibX FF extension, you can probably kludge together an approximation of its functionality using Greasemonkey and Jesse Ruderman's autolink script, which you can adjust (if you're familiar with regexp) to suit the way DOIs are typically presented on web sites.
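To give a rough idea of the sort of regexp involved, here's a small Python sketch that picks DOIs out of a chunk of page text. The pattern is my own loose approximation, not anything taken from LibX or the autolink script, and it will certainly miss or mangle some DOIs in the wild:

```python
import re

# Loose pattern: "10.", a 4-9 digit registrant prefix, "/", then a run of
# characters that are not whitespace, quotes, or angle brackets.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')

def find_dois(text):
    """Return DOI-like strings found in text, trimming trailing punctuation."""
    # Trimming is a heuristic: a DOI that really ends in "." or ")" gets cut short.
    return [m.group(0).rstrip(".,;)") for m in DOI_PATTERN.finditer(text)]

sample = "Agnarsson, I. 2006. Zool. J. Linn. Soc. 146: 453-593. doi:10.1111/j.1096-3642.2006.00213.x."
print(find_dois(sample))
```

Each match could then be given the EzProxy treatment described earlier before being turned into a link.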

6 comments:

David Shorthouse said...

Here's a novelty - a comment on my own post!

Rod Page beat me to the punch and is rolling a reference parsing tool into his http://bioguid.info store of applications. Rod is just testing it at the moment, running through parsing mishaps, etc., but I'm sure it'll be accessible from a link on that page.

Roderic Page said...

The parsing tool is still a toy. Regarding the main point of your post, I don't see this as a money earner. Academic users will be able to use their own local resources, and indeed might expect EoL to play ball with their library's OpenURL resolver, which means that you'd want to provide OpenURLs, not specifically EzProxy links. For example, it might be better to embed COinS in the HTML. Google Scholar maintains lists of institutional holdings for free (http://scholar.google.com/intl/en/scholar/libraries.html), and I wonder about the wisdom of a subscription model for universities. Perhaps a better way to think of this is to ask who would benefit from increased traffic to publishers' web sites...?

David Shorthouse said...

Good point, Rod. The main thesis of my post stems from a frustration over how publishers and societies (big or small) represent links to their subscription-based or free materials. There's still such a pervasive expectation that if you put these resources on the Internet, they'll somehow be magically found. When they're not found, then there is widespread lamentation that getting materials online was wasted effort. What I hoped for from such a model was a clear illustration of how links to PDF reprints etc. ought to be constructed for efficient and intelligent aggregation that doesn't involve inherently brittle screen-scraping functions. But, your last question is interesting. Ought big organizations like EoL be chasing these sorts of subscription monies from academic publishers?

Anonymous said...

David, do take Peter for coffee - he'll be able to offer all sorts of assistance in this area!

Anonymous said...

Some of this work has been done for you: you should have a look at OCLC's OpenURL Resolver Registry (http://www.oclc.org/productworks/urlresolver.htm) and the DOI Cookie Pusher: http://www.doi.org/doi_proxy/appropriate_copy.html .

Unknown said...

The LibX edition builder can build a custom edition for you in just a few minutes, so you don't have to wait until your library has its own toolbar.

It uses OCLC's OpenURL Resolver Registry, allowing users to import OpenURL settings from there.

In addition, note that LibX supports EZProxy in a special way: it contacts the EZProxy before doing the proxying to learn whether a given URL could be proxied, i.e., is allowed by the specific proxy server. This way, you don't have to waste time typing your password if it's not.