Sunday, September 30, 2007

DOIs in the References Cited Section

I just read a recent post by Ed Pentz (Executive Director of CrossRef), who alerted his readers to some recent changes to the recommended American Medical Association and American Psychological Association style guides. Ed also provides two examples. Essentially, a string like "doi:10.1016/j.ssresearch.2006.09.002" is to be tagged at the end of each reference (if a DOI exists) in the literature cited section of journals that use AMA or APA styles. Cool! It would be a snap to write a little JavaScript to recognize "doi:" on a page and magically prepend "http://dx.doi.org/", and I have no doubt publishers can do something similar prior to producing PDF reprints. Ed seemed puzzled by the exclusion of "http://dx.doi.org/", but this is a really smart move by the AMA and the APA. DOIs, after all, are URNs, so it's best to avoid any confusion. If paper publishers want to make these actionable, they can do so. If web publishers want to do the same, a simple little JavaScript can do it.
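
For what it's worth, here's a minimal sketch of such a script. The function name and regular expression are mine, not anything published by CrossRef, so treat it as illustration only:

// A minimal sketch: walk the text nodes under a given element, find strings
// that start with "doi:", and wrap them in links to the dx.doi.org resolver.
function linkifyDOIs(node) {
  var doiPattern = /\bdoi:(10\.\d{4,}\/[^\s"<>]+)/g;
  var children = node.childNodes;
  for (var i = 0; i < children.length; i++) {
    var child = children[i];
    if (child.nodeType === 3 && child.nodeValue.indexOf("doi:10.") !== -1) {
      // replace the text node with a span containing the linked DOI(s)
      var span = document.createElement("span");
      span.innerHTML = child.nodeValue.replace(doiPattern,
        '<a href="http://dx.doi.org/$1">doi:$1</a>');
      node.replaceChild(span, child);
    } else if (child.nodeType === 1 && child.nodeName !== "A" && child.nodeName !== "SCRIPT") {
      linkifyDOIs(child); // recurse into elements, skipping existing links and scripts
    }
  }
}

// e.g., run it over the references section once the page has loaded:
// linkifyDOIs(document.getElementById("references"));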

So, what we need now are style editors to pick up this recommendation across the board, in all journals in all disciplines. This simple addition would do a world of good for human discoverability and for machine consumption/repurposing. It shouldn't just be a recommendation; it should be mandatory.

Hopefully, this will step up the drive for including XMP metadata within PDF reprints. It may be a pain for authors to track down DOIs if they're not stamped on the cover page (usually hovering around the abstract or at the very top of the page), so adoption of this new recommendation may be rather slow. If the DOI were embedded in the XMP, then reference managers like EndNote and RefMan could naturally read these metadata. In other words, building your reference collection would be as simple as dropping a PDF in a watched folder and letting your reference manager do the rest. This would also open the door to zero local copies of PDFs, or to an intelligent online storage system. EndNote and RefMan need only have the DOI; they can pull the rest using CrossRef's DOI look-up services.
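
Something like the following is what I have in mind; the OpenURL parameter names are from my memory of CrossRef's documentation, so verify them before building on this:

// Hedged sketch: given a DOI, a reference manager (or a server-side script)
// could pull the rest of the metadata from CrossRef's OpenURL interface.
// The parameter names (pid, id, noredirect) are as I recall them from
// CrossRef's documentation -- verify them against the current docs.
var doi = "10.1016/j.ssresearch.2006.09.002";
var lookupURL = "http://www.crossref.org/openurl/"
  + "?pid=" + encodeURIComponent("you@example.org") // your registered query account (placeholder)
  + "&id=" + encodeURIComponent("doi:" + doi)
  + "&noredirect=true"; // return the metadata record rather than redirecting to the publisher
// Fetch lookupURL and parse the returned XML into author, title, journal,
// year, etc. -- everything needed to populate a reference library entry.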

Thursday, September 27, 2007

Would You Cite a Web Page?

Species pages in The Nearctic Spider Database are peer-reviewed in the very traditional sense. But, instead of doling out pages that need to be reviewed, I leave it up to authors to anonymously review each other's work. Not just anyone can author a species page; you at least need to show me that you have worked on spiders in some capacity. Once three reviews for any page have been received and the author has made the necessary changes (I can read who wrote what and when), I flick the switch and the textual contributions by the author are locked. There is still dynamically created content on these species pages, like maps, phenological charts, State/Province listings, etc. However, at the end of the day, these are still just web pages, though you can download a PDF if you really want to.

Google Scholar allows you to set your preferences to download an import file for BibTeX, EndNote, RefMan, RefWorks, or WenXianWang, so I thought I would duplicate this functionality for species pages in The Nearctic Spider Database, though limited to BibTeX and EndNote. I'm not at all familiar with the last three reference managers and I suspect they are not as popular as EndNote and BibTeX. Incidentally, Thomson puts out both EndNote and RefMan and recently released EndNote Web. As cool as EndNote Web looks, Thomson has cut it off at the knees by limiting the number of references you can store in an online account to 10,000. Anyone know anything about WenXianWang? I couldn't find a web site for that application anywhere. So, here's how it works:

First, it's probably a good idea to set the MIME types on the server, though this is likely unnecessary because these EndNote and BibTeX files are merely text files:

EndNote: application/x-endnote-refer (extension .enw)
BibTeX: ?? (extension .bib)

Second, we need to learn the contents of these files:

EndNote: called "tagged import format", where fields are designated with a two-character code that starts with a % symbol (e.g. %A). After the field code, there is a space, and then the contents. It was a pain in the neck to find all of these, but at least the University of Leicester put out a Word document HERE. Here's an example of the file for a species page in The Nearctic Spider Database:


%0 Web Page
%T Taxonomic and natural history description of FAM: THOMISIDAE, Ozyptila conspurcata Thorell, 1877.
%A Hancock, John
%E Shorthouse, David P.
%D 2006
%W http://www.canadianarachnology.org/data/canada_spiders/
%N 9/27/2007 10:40:40 PM
%U http://www.canadianarachnology.org/data/spiders/30843
%~ The Nearctic Spider Database
%> http://www.canadianarachnology.org/data/spiderspdf/30843/Ozyptila%20conspurcata
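
Assembling the file itself is trivial. Here's a rough sketch of the idea in JavaScript; the record object, field order, and function name are purely illustrative, and the database actually builds this server-side:

// Rough sketch of how the file above gets assembled -- the field codes are
// the ones shown in the example; the object shape is illustrative only.
function buildEndNoteRecord(page) {
  var lines = [
    "%0 Web Page",
    "%T " + page.title,
    "%A " + page.author,
    "%E " + page.editor,
    "%D " + page.year,
    "%W " + page.archiveLocation,
    "%N " + page.accessedDate,
    "%U " + page.url,
    "%~ " + page.databaseName,
    "%> " + page.pdfLink
  ];
  return lines.join("\r\n");
}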

BibTeX: Thankfully, the BibTeX developers recognize the importance of good, simple documentation and have a page devoted to the format. But the examples for reference types are rather limited. Again, I had to go on a hunt for more documentation. What was of great help was the documentation for the apacite package, which outlines the rules in use for the American Psychological Association. In particular, pp. 15-26 of that PDF were what I needed. However, where's the web page reference type? Most undergraduate institutions in North America still enforce a no-web-page-citation policy for submitted term papers, theses, etc., so it really wasn't a surprise to see no consideration for web page citations. So, what is the Encyclopedia of Life to do? The best match I could come up with for EndNote's native handling of web pages was the following:

@misc{hancock30843,
author = {Hancock, John},
title = {Taxonomic and natural history description of FAM: THOMISIDAE, Ozyptila conspurcata Thorell, 1877.},
editor = {Shorthouse, David P.},
howpublished = {World Wide Web electronic publication},
type = {web page},
url = {http://www.canadianarachnology.org/data/spiders/30843},
publisher = {The Nearctic Spider Database},
year = {2006}
}

Now, BibTeX is quite flexible in its structure, so there could very well be a proper way to do this. But the structure must be recognized by rule-writing templates like APA's; otherwise, the extra fields are simply ignored.

The EndNote download is available at the bottom of every authored species page on the database's web site via a click on the EndNote icon (example: http://www.canadianarachnology.org/data/spiders/30843). I have no idea if the BibTeX format above is appropriate, so I welcome some feedback before I enable that download.

But, all this raises a question...

Would you import a reference to a peer-reviewed web page into your reference managing programs and, if you are an educator, should you consider allowing undergraduates an opportunity to cite such web pages? Would you yourself cite such pages? Do we need a generic, globally recognized badge that exclaims "peer-reviewed" on these kinds of pages? Open access does not mean content is not peer-reviewed or any less scientific. Check out some myth-busting HERE. What if peer-reviewed web pages had DOIs, taking a great leap away from URL rot and closer toward what Google does with its index - calculations of page popularity? Citation rates (i.e. popularity) are but one outcome of the DOI model for scientific papers. If I anticipated a wide, far-reaching audience for a publication, I wouldn't care two hoots whether it was freely available online as flat HTML, a PDF, or an MS Word document, or whether the journal (traditional or non-traditional) has a high impact factor as mysteriously calculated by, you guessed it, Thomson ISI. If DOIs are the death knell for journal impact factors, are web pages the death knell for paper-only publications?

Wednesday, September 26, 2007

IE6 is Crap By Design

I give up. I have had it. Internet Explorer 6 absolutely sucks.

Since I first started toying with JavaScript and using .innerHTML to dynamically add, change, or remove images on a web page in response to a user's actions or at page load (e.g. search icons tagged at the end of scientific references that let users click to check for the existence of a DOI, Handle, or PDF via Rod Page's JSON reference parsing web service at bioGUID.info), I have struggled with the way this version of the browser refuses to cache images.

If multiple, identical images are inserted via .innerHTML, IE6 makes a call to the server for every single copy of the image. Earlier and later versions of IE do not have this problem; those versions happily use the cache as it's meant to be used and do not call the server for yet another copy of an identical image. I have tried everything I could think of, from preloading the image(s), to using the DOM (i.e. .appendChild()), to server-side tricks, but nothing works. Page loads are slow for these users, and the images appended via .innerHTML may or may not appear, especially if there are a lot of successive .innerHTML calls. Missing images may or may not appear with a page refresh, and images that were once present may suddenly disappear with a subsequent refresh. Because AJAX is becoming more and more popular and, as a consequence, there are more instances of .innerHTML = '<img src="...">' in loops, don't these users think something is not right with their browser? I doubt it. Instead, I suspect they think something is wrong with the web page and likely won't revisit.

Here's some example code that causes the problem for an IE6 user:

...
// inserting multiple copies of the same image via innerHTML
for (var i = 0; i < foo; i++) {
  bar.innerHTML += '<img src="...">';
}
...


Web server log lines like the following (abridged to protect the visitor to The Nearctic Spider Database) fill 10-20% of the daily log files:

GET /bioGUID/magnifier.png - 80 - HTTP/1.1 Mozilla/4.0+(compatible;+MSIE+6.0;...etc...)
GET /bioGUID/magnifier.png - 80 - HTTP/1.1 Mozilla/4.0+(compatible;+MSIE+6.0;...etc...)
GET /bioGUID/magnifier.png - 80 - HTTP/1.1 Mozilla/4.0+(compatible;+MSIE+6.0;...etc...)
GET /bioGUID/magnifier.png - 80 - HTTP/1.1 Mozilla/4.0+(compatible;+MSIE+6.0;...etc...)
GET /bioGUID/magnifier.png - 80 - HTTP/1.1 Mozilla/4.0+(compatible;+MSIE+6.0;...etc...)
GET /bioGUID/magnifier.png - 80 - HTTP/1.1 Mozilla/4.0+(compatible;+MSIE+6.0;...etc...)
GET /bioGUID/magnifier.png - 80 - HTTP/1.1 Mozilla/4.0+(compatible;+MSIE+6.0;...etc...)

So, I wonder how much bandwidth on the Internet superhighway is consumed by this terrible design flaw? Microsoft is aware of the issue but weakly tried to convince us that this behaviour is "by design", as can be read in their Knowledge Base Article Q319546. Give me a break. The solution in that article is pathetic and doesn't work.

Depending on what you do in your JavaScript, you can either force an HTTP status code of 200 OK (i.e. the image is re-downloaded, consuming unnecessary bandwidth) or an HTTP status code of 304 Not Modified ("you already got it, dummy, I'm not sending it again" - but still some bandwidth). Though I haven't yet compared before/after in my web logs, others have reported that the following code at the start of some JavaScript will force a 304:

try {
  document.execCommand("BackgroundImageCache", false, true);
} catch(err) {}


I'm not convinced that this will work.

After pulling my hair out for 2+ years trying to come up with a work-around, I have concluded that the only way this problem can be prevented is if users adjust how their own cache works. If you are an IE6 user, see the note by Ivan Kurmanov on how to do this (incidentally, Ivan's server-side tricks also don't work). In a nutshell, if you use IE6, DO NOT choose the "Every visit to the page" option in your browser's cache settings. Unfortunately, there is no way for a developer to detect a user's cache settings. Like pre-IE6 and post-IE6 users, people with IE6 whose cache is set to check for newer content "Automatically", "Never", or "Every time you start Internet Explorer" do not have this problem.

Developers (like Mishoo) have suggested that this "by design" flaw is a means for Microsoft to artificially inflate its popularity in web server log analyses. But I doubt analysis software confuses multiple downloads from the same IP address with browser popularity. So how can this possibly be by design if there is no way for a developer to circumvent it? Who knows.

The only real solution I have been able to dream up, though I haven't implemented it, is to perform a JavaScript browser version check. If people choose to use IE6 and don't upgrade to IE7, then they won't get the added goodies. I certainly don't want to degrade their anticipated experience, but I also have to think about cutting back on pushing useless bandwidth, which is not free. If you insist on using IE, please upgrade to IE7; that version of the browser was released a year ago. Better yet, make the switch to Firefox.
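
For the record, here's roughly what I have in mind -- a crude user-agent sniff (conditional comments would arguably be cleaner), with addSearchIcons() standing in as a placeholder for whatever does the image injection:

// Crude sketch of the version check: skip the innerHTML image goodies for
// IE6 so the browser doesn't hammer the server for the same icon over and
// over. addSearchIcons() is a placeholder, not a real function on my pages.
function isIE6() {
  var ua = navigator.userAgent;
  return ua.indexOf("MSIE 6") !== -1 && ua.indexOf("Opera") === -1;
}

if (!isIE6()) {
  addSearchIcons(); // only non-IE6 visitors get the dynamically added icons
}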

Tuesday, September 25, 2007

Google Search JSON API

Google is obviously producing JSON outputs from most (if not all) of its search offerings because they have an API called the "Google AJAX Search API". Signing up for a key allows a developer to pull search results for use on web pages in a manner identical to my iSpecies Clone. However, it is an absolute pain in the neck to use this API; it is nowhere near as useful as Yahoo's web services. Google goes to great lengths to obfuscate the construction of its JSON outputs, which are produced from RESTful URLs buried deep in its JavaScript API. So, I went on a hunt...

Within a JavaScript file called "uds_compiled.js" that gets embedded in the page via the API, there are several functions whose names are (somewhat) obvious:
GwebSearch(), GadSenseSearch(uds), GsaSearch(uds) [aside: is this Google Scholar? There is no documentation in the API how-to], GnewsSearch(), GimageSearch(), GlocalSearch(), GblogSearch(), and finally GbookSearch().
Since all of these functions are fed by JSON with an on-the-fly callback, why make this so difficult? If I had some time on my hands, I'd deobfuscate the API to see how the URLs are constructed so that I could reproduce Yahoo's API. Incidentally, I just now noticed that Yahoo has an optional parameter in its API called "site", whereby one can limit the search results to a particular domain (e.g. site=wikipedia.org).
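
For comparison, here's roughly all it takes to use Yahoo's service with the site parameter: dynamically insert a script tag pointing at their web search endpoint with a callback. The endpoint and parameter names below are as I recall them from Yahoo's documentation, so double-check them and substitute your own application ID:

// Hedged sketch of a JSON-P call to Yahoo's web search service, limited to
// one domain via the "site" parameter.
function yahooResults(data) {
  // Yahoo wraps its hits in a ResultSet object
  alert(data.ResultSet.totalResultsAvailable + " results");
}

var url = "http://search.yahooapis.com/WebSearchService/V1/webSearch"
  + "?appid=YOUR_APP_ID"
  + "&query=" + encodeURIComponent("Ozyptila conspurcata")
  + "&site=wikipedia.org"
  + "&output=json&callback=yahooResults";

var script = document.createElement("script");
script.src = url;
document.getElementsByTagName("head")[0].appendChild(script);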

So Google, if you are now producing an API like AJAX Search, why make it so difficult and force the output into a Google-based UI? Developers just want the data. You could just as easily enforce a rate limit as Yahoo does for its API: 5,000 queries per IP per day per API. Problem solved.

iSpecies Clone

For kicks, I created an iSpecies clone that uses nothing but JSON. Consequently, there is no server-side scripting, and the entire "engine", if you will, is contained in one 12KB external JavaScript file. The actual page is flat HTML. What this means is that you can embed the engine on any web page, including <<ahem>> Blogger. You can try it out yourself here: iSpecies Clone. Or live:

iSpecies


Rod Page was the first to create a (semi) taxonomically-based search engine that uses web services, called iSpecies (see the iSpecies blog). He recently gave it a facelift using JSON data sources. This has significantly improved iSpecies' response time because it is now simple and asynchronous. Rod could continue to pile on web services to his heart's content.

Dave Martin is producing JSON web services for GBIF and recently added a common name and scientific name search. It occurred to me that iSpecies could first connect to GBIF to produce arrays of scientific names and GBIF-matched common names prior to sending off a search to Yahoo or Flickr; most tags in these image repositories would use common names. Likewise, if the Yahoo News API is of interest, it would of course be useful to obtain common names prior to making a call to that web service. That's how the iSpecies clone above works. Oh, and scientific names are also searched when a common name is found and recognized by GBIF.
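
The basic pattern of those chained calls looks something like this (a sketch only -- the URLs, callback names, and field names are placeholders, not the real GBIF or image-search endpoints):

// Sketch of the two-step flow: resolve names via GBIF first, then fire the
// image/news searches with the expanded term list. Everything here is
// illustrative; substitute the real endpoints and JSON field names.
function loadJSON(url) {
  var script = document.createElement("script");
  script.src = url;
  document.getElementsByTagName("head")[0].appendChild(script);
}

function startSearch(term) {
  loadJSON("http://example.org/gbif-name-search?q="
    + encodeURIComponent(term) + "&callback=gotNames");
}

function gotNames(data) {
  // assume data.names holds the matched scientific and common names
  for (var i = 0; i < data.names.length; i++) {
    loadJSON("http://example.org/image-search?q="
      + encodeURIComponent(data.names[i]) + "&callback=gotImages");
  }
}

function gotImages(data) {
  // append thumbnails, links, etc. to the page here
}

// e.g. startSearch("wolf spider");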

This clone is naturally missing material compared to the results obtained when conducting a real iSpecies search (e.g. genomics, Google Scholar - though that one is simply disabled here - and what looks to be some scripting for name recognition above the species level). If big-player data providers like NCBI, uBio, CBOL, etc. produced JSON instead of, or in addition to, XML, then it would be incredibly easy to make custom search engines like this that can be embedded in a gadget.

After having tinkered a lot with JSON lately, it is now abundantly clear to me that The Encyclopedia of Life (EOL) species pages absolutely must have DOIs and plenty of web services to repurpose the data it will index. If we really want EOL to succeed in the mass media, then these species page DOIs should also be integrated into Adobe's XMP metadata, along with some quick and easy ways to embed them individually and in batches.

Thursday, September 20, 2007

GBIF web services on the rise

Look out, EOL: Dave Martin has very quickly been creating some superb web services for GBIF data. Check out the GBIF portal wiki or the various name search APIs that produce text, XML, JSON, or simple deep links: HERE. As an example of the kinds of things Dave and Tim Robertson have been producing, here is a map gadget showing all the records in a 1-degree cell shared with GBIF from The Nearctic Spider Database:

Wednesday, September 12, 2007

EOL "WorkBench" Ideas Loosely Joined

It will be interesting to see just how quickly The Encyclopedia of Life's "WorkBench" environment can be assembled. For those of you unfamiliar with this critical aspect of the initiative, it will be the environment in which users will access and manipulate content from web services and other data providers, EOL-indexed content, and a user's local hard drive, AND simultaneously contribute (if desired; I expect this to be optional) to public-facing species pages. As you can imagine, a suite of things has to fall into place for all the pieces to play nicely in a simple graphical user interface. Expectations are high for this application to be THE savior that differentiates EOL from WikiSpecies and other similar projects/initiatives.

I envision WorkBench as a Semantic Web browser of sorts, capable of pulling dozens of data types from hundreds of sources into a drag-and-drop whiteboard, something like a mind map. Coincidentally, I stumbled across MindRaider. Though I'd much rather see a Flex-based solution (such that Adobe AIR can be used), MindRaider (Java-based), developed by Martin Dvorak, looks to be a very interesting way to organize concepts as interconnected resources; it also permits a user to annotate components of the mind map. Sharing such semantic mind maps is a critical piece of the puzzle, as is making interconnections to content on a user's local hard drive.

What then do we need for EOL's WorkBench? 1) web services, 2) web services, 3) web services, AND 4) commonly structured web services, such that resources acquired from hundreds of data providers do not require customized connectors.

Somehow, I'd like to see data providers ALL using OpenSearch (with MediaRSS or FOAF extensions) for full-text, federated search and/or TAPIR for the eventual Species Profile Model's structured mark-up. Then, I'd like to see RSSBus on EOL servers. Lastly, I'd like to see a melange of MindRaider, Mindomo, and a Drupal-like solution to permit self-organization of interest groups of the kind Vince Smith champions with Scratchpads. Vince and others would no doubt argue that there is great value in centralized hosting, but the advantage is 99% the provider's. End users don't give two hoots about this so long as the application is intuitively obvious and permits a certain degree of import, export, configuration, and customization.

So, the pieces of the puzzle:

Providers -> RSS -> RSSBus -> MindRaider/Mindomo/Scratchpad (in Adobe AIR) <- local hard drive

Sounds easy doesn't it? Yeah, right.

Friday, September 7, 2007

Forward Thinking


I previously criticized CrossRef for the implementation of new restrictive rules for use of its OpenURL service, but Ed Pentz, Executive Director of CrossRef, stopped by and reassured us that CrossRef exists to fill the gaps. The most restrictive rule has now been relaxed. Well done, Ed.

While browsing around new publications in Biodiversity and Conservation, I caught something called "Referenced by" out of the corner of my eye. This may be old hat to most of you, and I now feel ashamed that I had not yet discovered it. Perhaps I have subconsciously dismissed boxes on web sites because Google AdWare panels have constrained my eyeball movements. Anyhow, CrossRef has used the power of DOIs to provide a hyperlinked list of more recent publications that reference the work you are currently examining. Ed Pentz has blogged about this new feature. Now, this is cool and is the stuff dreams are made of. For example, a paper by Matt Greenstone in '83 entitled "Site-specificity and site tenacity in a wolf spider: A serological dietary analysis" (doi:10.1007/BF00378220) is referenced by at least 6 more recent works listed in that panel, including several by Matt himself. Besides the obvious way this permits someone to peruse your life's work (provided you reference yourself and publish in journals that have bought into CrossRef), it is a slick way to keep abreast of current thinking. If your initial introduction to a subject is via pre-1990 publications, you can quickly examine how and by whom previous works have been used, regardless of the journal in which the citing article appeared. Hats off, CrossRef!

Now, what we need is for publishing firms still mired in the dark ages to wake up to the power of DOIs. If you participate in the editorial procedures for a scientific society and your publisher has not yet stepped up by providing you with DOIs, get on the phone and jump all over them! You would be doing your readers, authors, and society a disservice if you accepted anything less than full and rapid cooperation from your chosen publisher.

So Ed, will "Forward Linking" be a web service we can tap into?

Wednesday, September 5, 2007

CrossRef Takes a Step Back

UPDATE Sept. 8/2007: Please read the response to this post by Edward Pentz, Executive Director of CrossRef in the comments below.



Mission statement: "CrossRef is a not-for-profit membership association whose mission is to enable easy identification and use of trustworthy electronic content by promoting the cooperative development and application of a sustainable infrastructure."

Not-for-profit, huh? Money-grabbing in the professional publishing industry has once again proven to be more important than making scientific works readily accessible. As of September 7, 2007, CrossRef will roll out new rules for its OpenURL and DOI lookups. Unless you become a card-carrying CrossRef "affiliate", there will be a daily cap of 100 lookups using their OpenURL service, which will require a username/password. If more than 100 lookups are performed, CrossRef reserves the right to cancel the account and force you to buy into their senseless pay-for-use system. Here are the new rules as described at http://www.crossref.org/04intermediaries/60affiliate_rules.html:
  1. Affiliates must sign and abide by the term of the CrossRef Affiliate Terms of Use
  2. Affiliates must pay the fees listed in the CrossRef Schedule of Fees
  3. The Annual Administrative Fee is based on the number of new records added to the Affiliates service(s) and/or product(s) available online
  4. There are no per-DOI retrieval fees. There are no fees based on the number of links created with the Digital Identifiers.
  5. Affiliates may "cache" retrieved DOIs (i.e. store them in their local systems)
  6. The copyright owner of a journal has the sole authority to designate, or authorize an agent to designate, the depositor and resolution URL for articles that appear in that journal
  7. A primary journal (whether it is hosted by the publisher or included in an aggregator or database service) must be deposited in the CrossRef system before a CrossRef Member or Affiliate can retrieve DOIs for references in that article. For example, an Affiliate that hosts full text articles can only lookup DOIs for references in an article if that journal's publisher is a PILA Member and is depositing metadata for that journal in the CrossRef System
  8. Real-time DOI look-up by affiliates is not permitted (that is, submitting queries to retrieve DOIs on-the-fly, at the time a user clicks a link). The system is designed for DOIs to be retrieved in batch mode.

So what's the big deal?
The issue has to do with scientific society back-issues like the kind served by JSTOR. Without some sort of real-time DOI look-up, it is nearly impossible to learn of newly scanned and hosted PDF reprints for older works. After September 7, the only option available to developers and bioinformaticians is to periodically "batch upload" lookups. CrossRef sees Rod Page's bioGUID service and my simple, real-time gadget as a threat to its steady flow of income, even though this use clearly fits within its general purpose "...to promote the development and cooperative use of new and innovative technologies to speed and facilitate scholarly research."

Saturday, September 1, 2007

Giant Texas Spider Web

This past week, there have been numerous stories about the Giant Texas Spider Web in Lake Tawakoni State Park such as this compiled CNN video re-posted on YouTube:




[Embedded Google Map of Lake Tawakoni State Park]

There hasn't yet been a definitive identification of the species involved (stay tuned for more), but from the videos I have seen, the primary culprit looks to be a tetragnathid (long-jawed orbweaver) and not the assumed social spiders like Anelosimus spp. (doi:10.1111/j.1096-3642.2006.00213.x). This behaviour is rather unusual for a tetragnathid and reminds me of what was thought to be a mass dispersal event gone awry near McBride, British Columbia several years ago. In that case, the species involved were (in order of numerical dominance): Collinsia ksenia (Crosby & Bishop, 1928), Erigone aletris Crosby & Bishop, 1928, a Walckenaeria sp., and Araniella displicata (Hentz, 1847). See Robin Leech et al.'s article in The Canadian Arachnologist (PDF, 180kb). In the case of this massive Texas webbing, there also appear to be several other species present in the vicinity as evidenced by the nice clip of Argiope aurantia Lucas, 1833 in the YouTube video above.

Update:
Mike Quinn, who compiles "Texas Entomology", has a great page on the possible identity of the species involved in the giant web. The candidate in the running now is Tetragnatha guatemalensis O. P.-Cambridge, 1889, which has been collected from Wisconsin to Nova Scotia, south to Baja California, Florida, Panama, and the West Indies. The common habitat, as is the case for most tetragnathids, is streamside or lakeside shrubs and tall herbs.

Another potential candidate (if these are indeed tetragnathids) is Tetragnatha elongata Walckenaer, 1842. I suspect the tetragnathids and A. aurantia are incidentals and not the primary culprits behind the giant mess of webbing. Since Robb Bennett and Ingi Agnarsson both suspect the architect is Anelosimus studiosus (Hentz, 1850), and since it is highly unlikely to be a tetragnathid, I have my bets on erigonine linyphiids, much like what happened in McBride, BC.