Sunday, August 19, 2007

Gimme That Scientific Paper Part III

Update Sep 28, 2007: Internet Explorer 6 refuses to cache images properly, so I have an alternate version of the script that disables the functionality described below for those users. You may see it in action HERE. Also, the use of spans (see below) may be too restrictive for you to implement, so I developed a "spanless" version of the script HERE. This version only requires the following mark-up for each cited reference, and you can of course change a line in the script if you're not pleased with the class name and want to use something else:
<p class="article">Full reference and HTML formatting allowed</p>
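A rough sketch of how a spanless script can collect those paragraphs (the function name is mine, not the actual script's, and a manual class scan is used because getElementsByClassName was not yet universal in 2007-era browsers):

```javascript
// Collect every <p> whose class attribute includes the given class
// name. `doc` is the document (or any object exposing
// getElementsByTagName), which keeps the function easy to test.
function findArticleParagraphs(doc, className) {
  var hits = [];
  var ps = doc.getElementsByTagName("p");
  for (var i = 0; i < ps.length; i++) {
    // Pad with spaces so "article" doesn't match e.g. "articles"
    if ((" " + ps[i].className + " ").indexOf(" " + className + " ") !== -1) {
      hits.push(ps[i]);
    }
  }
  return hits;
}
```

The script would then attach a search icon to each paragraph it finds.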

Those who have followed along in this blog will recall that I dislike seeing references to scientific papers on web pages when there are no links to download the reprint. And, even when the page author makes a bit of effort, the links are often broken. One solution to this in the library community is to use COinS. But, this spec absolutely sucks for a page author because there is quite a bit of additional mark-up that has to be inserted in a very specific way. [Thankfully, there is at least one COinS generator you can use.] I was determined to find a better solution than this.
You may also recall that I came up with an AJAX solution together with Rod Page. However, that solution used Flash as the XMLHTTP parser, which meant that a crossdomain.xml file had to be put on Rod's server, i.e. this really wasn't a cross-domain solution unless Rod were to open up his server to all domains. Yahoo does this, but it really wasn't practical for Rod. As a recap, this is what I did in earlier renditions:
The JavaScript automatically read all the references on a page (as long as they were sequentially numbered) and auto-magically added a little search icon beside each one. When clicked, the reference was searched via Rod Page's bioGUID reference parsing service. If a DOI or a handle was found, the icon changed to a link to it; if a PDF was found, the icon changed to a link to the reprint; if neither a PDF nor a link via DOI or handle was found, the icon changed to one that let you search for the title on Google Scholar; and finally, if the reference was not successfully parsed by bioGUID, the icon changed to an "un"-clickable one. If you wanted to take advantage of this new toy on your web pages, you had to either contact Rod and ask that your domain be added to his crossdomain.xml file, or set up a PHP/ASP/etc. proxy. But Rod has now been very generous...

Rod now spits out JSON wrapped in a callback function. What this means is that the cross-domain restrictions that typically plague XMLHTTP programming no longer apply. To make a long story short, if you are a web page author and include a number of scientific references on your page(s), all you need do is grab the JavaScript file HERE, grab the images above, adjust the contents of the JavaScript to point to your images, then wrap each of your references in a span element as follows:

<p><span class="article">This is one full reference.</span></p>
<p><span class="article">This is another reference.</span></p>
How easy is that?!
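For the curious, the mechanism behind this is script-tag injection: the page adds a &lt;script&gt; element whose src points at Rod's service, and the service replies with JavaScript that calls a named function with the parsed reference. A minimal sketch, with an illustrative endpoint URL and parameter names rather than bioGUID's actual API:

```javascript
// Build the request URL for the service (endpoint and parameter
// names are illustrative guesses, not bioGUID's real API).
function buildSearchUrl(reference, callbackName) {
  return "http://bioguid.example.org/openurl?text=" +
         encodeURIComponent(reference) +
         "&callback=" + callbackName;
}

// Fire one search: define a global callback, then append a <script>
// tag so the browser fetches the cross-domain response.
function search(reference) {
  window.myCallback = function (result) {
    // result holds the parsed reference: DOI, handle, PDF URL, etc.
    console.log(result);
  };
  var s = document.createElement("script");
  s.src = buildSearchUrl(reference, "myCallback");
  document.getElementsByTagName("head")[0].appendChild(s);
}
```

Because &lt;script&gt; elements are exempt from the same-origin restrictions that bind XMLHTTP, no crossdomain.xml file or server-side proxy is needed.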
To see this in action, have a peek at the references section of The Nearctic Spider Database.

Or, you can try it yourself here:

Buckle, D. J. 1973. A new Philodromus (Araneae: Thomisidae) from Arizona. J. Arachnol. 1: 142-143.

For the mildly curious, and for those who have played with JSON outputs wrapped in a callback function, I ran into a snag that caused no end of grief. When one appends such a callback script to the page head, a function call is dynamically inserted. This works great when only one instance of that function is needed at a time. In this case, however, a user may fire several searches in rapid succession before any previous call has finished. As a consequence, the appended callback functions may pile up on each other and steal each other's scope. The solution was to dump the callbacks into an array, which was mighty tricky to handle.
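A minimal sketch of that array trick (the names are mine, not the actual script's): each search registers its own handler in a numbered slot and hands the server that exact slot name, so concurrent responses can no longer clobber one another.

```javascript
// One slot per in-flight request; responses land in their own slot.
var callbacks = [];

// Register a handler and return the globally addressable name the
// server should call, e.g. "callbacks[0]", "callbacks[1]", ...
function registerCallback(handler) {
  var id = callbacks.length;
  callbacks.push(handler);
  return "callbacks[" + id + "]";
}
```

The returned name is what gets appended to the JSONP URL as the callback parameter, so even if a second search starts before the first finishes, each response invokes only its own handler.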


ioverka said...

Hi David,

I really like your idea of an on-the-fly reference link parser and find the implementation nice & fancy. Anyway, the main question is: how reliable will the algorithm that parses references for bioGUID be? I tested it a bit on the reference list you pointed to and found that the parsing fails for many citations. You can probably improve the success rate, but it would still only work with one of the hundreds of available citation styles.

You stated:
One solution to this in the library community is to use COinS. But, this spec absolutely sucks for a page author because there is quite a bit of additional mark-up that has to be inserted in a very specific way.

Agreed: putting COinS into references manually sucks! But whenever structured information is available, it should be exported in a structured way, because this enables users and machines to re-use the data without applying any parsing logic to turn references into metadata.

Good luck for your further work,
inga (librarian... I need to confess ;)

David Shorthouse said...

Inga: Rod's reference parser uses a ParaTools Perl module; you can read more about it here. Apparently, Rod has a number of reference templates commented out while he tests the service, so parsing will indeed improve in time. I expect it will become just as all-inclusive as CrossRef's own batch parser, but it will obviously cover many journals that have not bought into CrossRef yet do serve their PDF reprints from their society web pages. The utility of what Rod did, and of how I have tapped into it, lies in future imaging of older reprints. While an on-the-fly lookup may fail today, it may magically work at some point once a back-issued reprint becomes available. This means a web page author need not periodically check for availability of DOIs, handles, SICIs, URLs, etc., because Rod is doing all the heavy lifting.

PeterA said...

So is this parser exclusively for use with scientific papers or could it be adapted for use in other areas?