I tinkered a bit more with the LibraryLookup project yesterday. First, I noticed that the Build your own bookmarklet feature was broken in Mozilla. It turns out that any undeclared variable in the JavaScript will break it. Some kind of security feature, perhaps? Anyway, fixed. While I was at it, I added a feature that previews the link that will be embedded in the bookmarklet, so you can test it first. It's the same principle as the ASP.NET test page.
The bookmarklet generator also now emits a streamlined script. The original version, I'm embarrassed to say, went like so:
var re=/[\/-](\d{9,9}[\dX])|isbn=(\d{9,9}[\dX])/i; if ( re.test ( location.href ) == true ) { var isbn=RegExp.$1; if ( isbn.length == 0 ) { isbn = RegExp.$2 }; ...
Of course, all that was really necessary was:
var re= /([\/-]|isbn=)(\d{9,9}[\dX])/i; if ( re.test ( location.href ) == true ) { var isbn = RegExp.$2 ...
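For readers who want to try the consolidated pattern outside a bookmarklet, here is a minimal sketch. The helper name findIsbn and the sample URLs are made up for illustration. It reads the capture group from the result of exec() instead of the static RegExp.$2 property, and writes \d{9} where the original has the equivalent \d{9,9}.

```javascript
// Consolidated pattern from the post: a "/", "-", or "isbn=" prefix,
// followed by nine digits and a final digit or X.
var isbnRe = /([\/-]|isbn=)(\d{9}[\dX])/i;

// Hypothetical helper (not part of LibraryLookup): returns the ISBN
// found in a URL string, or null when the pattern does not match.
function findIsbn(url) {
  var m = isbnRe.exec(url);
  // Group 1 is the prefix; group 2 is the ISBN itself.
  return m ? m[2] : null;
}

// Example calls, using invented URLs:
var fromPath = findIsbn('http://www.example.com/exec/obidos/ASIN/0596000480/');
var fromQuery = findIsbn('http://example.com/book?isbn=043935806X');
var noMatch = findIsbn('http://example.com/');
```

Because the prefix alternatives share one group, the ISBN always lands in group 2 and no fallback check is needed.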
How did this happen? The usual way: when I expanded the original pattern to include the "isbn=" case, I didn't refactor. An instinctive programmer would have refactored on the fly. I'm not one, so I didn't see this until later. The problem with seeing it later is that you run smack into Don's Amazing Puzzle. It's far too easy to see a written text in terms of what we think it should say, rather than what it actually says.
(Here, by the way, are two tips for Radio UserLand folks who want to include JavaScript in items and stories. First, remove all blank lines from your script, because the Radio formatter will turn these into <p> tags that will break the script. Second, backslash-escape all instances of \// -- which, if it occurs nowhere else, will be found before the closing end-comment tag. Radio's not-very-discriminating URL auto-activator is triggered by an unescaped \// -- like this one: //.)
Next, I took another look at the service lists. The first one came from Innovative's customer page, since withdrawn. The others I found by Googling for URL signatures. But I had been meaning to dig into the Libdex lists that a Palo Alto librarian, Martha Walters, referred me to. That turned out to be a fairly straightforward text-mining exercise which yielded, for Innovative and Voyager libraries in particular, greatly expanded lists with much more descriptive library names -- and international coverage. Some of the many newly-added libraries:
Hong Kong - Kowloon - City University of Hong Kong
Scotland - St Andrews - University of St Andrews
Wales - Bangor - University of Wales Bangor and North East Wales Institute
Finland - Helsinki - Helsinki University
Puerto Rico - Gurabo - Universidad del Turabo
Scotland - Edinburgh - Edinburgh University
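To give a flavor of this kind of text mining, here is a rough sketch. The sample markup, the example.* URLs, and the function name extractServices are all invented for illustration. The real Libdex HTML is not reproduced here and may look quite different. The sketch only assumes that each library appears as a link whose text follows the "Country - City - Name" shape of the entries above.

```javascript
// Hypothetical sample markup; NOT the actual Libdex page format.
var sampleHtml =
  '<li><a href="http://catalog.example.fi/">Finland - Helsinki - Helsinki University</a></li>\n' +
  '<li><a href="http://catalog.example.pr/">Puerto Rico - Gurabo - Universidad del Turabo</a></li>\n';

// Pull (url, country, city, name) records out of regularly formatted links.
function extractServices(html) {
  var linkRe = /<a\s+href="([^"]+)"[^>]*>([^<]+)<\/a>/gi;
  var services = [];
  var m;
  // With the g flag, exec() resumes after the previous match on each call.
  while ((m = linkRe.exec(html)) !== null) {
    var parts = m[2].split(' - ');
    services.push({
      url: m[1],
      country: parts[0],
      city: parts[1],
      // Rejoin the remainder in case the library name itself contains " - ".
      name: parts.slice(2).join(' - ')
    });
  }
  return services;
}

var services = extractServices(sampleHtml);
```

A regular expression is enough only because the input is assumed to be highly regular. Messier HTML would call for a real parser.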
Because the Libdex catalog uses an extremely regular HTML format, it was not hard to reinterpret the HTML as a directory of services. But it wasn't as easy as it could have been, either. On the Backweave blog, Jeff Chan wonders whether Mark Pilgrim's use of the CITE tag is really an improvement over raw text mining. And Jeff mentions my report on Sergey Brin's talk at the InfoWorld conference, where I quote him as saying:
Look, putting angle brackets around things is not a technology, by itself. I'd rather make progress by having computers understand what humans write, than by forcing humans to write in ways computers can understand.
This isn't an either/or proposition. Like Mark, I strongly recommend exploiting to the hilt every scrap of latent semantic potential that exists within HTML. Like Jeff, I strongly recommend sharpening your text-mining skills because semantic markup, in whatever form, will never capture the totality of what can be usefully repurposed.
I guess I'm an extreme anti-extremist.
Former URL: http://weblog.infoworld.com/udell/2002/12/28.html#a556