Search engine, find engine

I've experimented with a few different ways of searching this blog. The structured search feature exploits microformats, and can find things like Ward Cunningham quotes and XSLT snippets. The infoworld explorer works in a purely navigational mode, relying on tag discovery. And the infoworld power search repackages the output of InfoWorld's Ultraseek engine into a slightly more palatable format, classifying articles by type.

Each of these experiments is interesting in its own way, but the bottom line was that I was still having trouble finding things in my own archive. So I took another run at the problem.

The new service I've wired up to the search box in my template seems like it'll do the trick. It's one of those delightful Do The Simplest Thing That Could Possibly Work kind of deals. I began by trying to improve on my Ultraseek post-processor, but that didn't work out. The worst problem was that Ultraseek doesn't know the difference between the real text of my items and all the other text -- blogroll names, recent item titles -- wrapped around the real stuff. I've always thought there should be a "noindex" tag that would hide template content from search engines, but there isn't.

So back to the drawing board. As it happens, I have all of my recent stuff in a single XML file. My structured search service slurps that file into memory and runs XPath queries over it. You can use that service for regular text queries too, but don't: it's dog slow.
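For illustration only, here is a rough sketch of what running XPath queries over a memory-resident XML file can look like. It is not the actual service code. The archive layout (an `archive` root, `item` elements with a `url` attribute, a `class="xslt"` marker on code snippets) and the `structured_search` helper are assumptions made up for this example, and it uses the limited XPath subset in Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Hypothetical archive layout: the post does not give the real element names.
ARCHIVE_XML = """
<archive>
  <item url="http://example.com/a1">
    <p>An XSLT snippet: <code class="xslt">xsl:value-of</code></p>
  </item>
  <item url="http://example.com/a2">
    <blockquote>Something quotable.</blockquote>
  </item>
</archive>
""".strip()

# Parse once and keep the whole tree in memory.
root = ET.fromstring(ARCHIVE_XML)

def structured_search(root, xpath):
    """Return the URLs of items that contain at least one match for xpath."""
    return [
        item.get("url")
        for item in root.findall("item")
        if item.find(xpath) is not None
    ]

print(structured_search(root, ".//code[@class='xslt']"))
```

Walking a parsed tree like this is handy for structural questions, but it is a heavyweight way to answer a plain text query.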

Why not just run a more efficient kind of text search over that file? That's just what I did. We don't need no stinking index. Everything I've written here since April 2003, which amounts to about a third of a million words, comes to less than three megabytes. I'm tempted to call this new service a find engine rather than a search engine because, at its core, it uses Python's find and rfind functions to pinpoint the location of your search term in a memory-resident copy of that file.
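A minimal sketch of the find/rfind idea, under stated assumptions rather than as the service's real code: `find` walks forward through the text from hit to hit, and `rfind` looks backward from each hit for the start of the enclosing item. The `ITEM_OPEN` marker, the sample archive string, and the `find_hits` function are hypothetical, and matching here is case-sensitive.

```python
from collections import Counter

# Hypothetical marker for the start of an item; the real file's markup is not shown in the post.
ITEM_OPEN = '<item url="'

ARCHIVE = (
    '<item url="http://example.com/a1"><p>XPath and more XPath</p></item>'
    '<item url="http://example.com/a2"><p>Python find and rfind</p></item>'
)

def find_hits(text, term):
    """Count literal occurrences of term per item URL, using only find and rfind."""
    counts = Counter()
    pos = text.find(term)
    while pos != -1:
        # Look backward from the hit for the nearest item start.
        start = text.rfind(ITEM_OPEN, 0, pos)
        if start != -1:
            url_start = start + len(ITEM_OPEN)
            url_end = text.find('"', url_start)
            counts[text[url_start:url_end]] += 1
        # Resume the forward scan just past this hit.
        pos = text.find(term, pos + len(term))
    return counts

print(find_hits(ARCHIVE, "XPath"))
```

Because the scan is purely literal, a term that also appears in the markup (an element or attribute name, say) is counted like any other hit.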

I say search term, rather than terms, because I haven't even bothered to implement any fancy Boolean operators. Maybe I'll get around to that, but I've got a hunch I won't need to. Most queries on the web use single terms. And in a small corpus like this one, multiple terms will likely be more useful as literal phrases than as complex queries. In any event, for now I'm doing nothing at all with multiple terms. What you type is what you get. Since I also haven't bothered to exclude XHTML element and attribute names, searching for those will give confusing results, but in general it seems to work like a champ.

The results view describes the found URLs along three axes: date, frequency, and tags. You can sort by date or frequency. Since this is my own little search engine, I avoid describing frequency as relevance, a word that's always annoyed me in this context. Relevant to whom, and for what? Absent some real human evaluation, like the linking that drives Google's PageRank, I figure that frequency is just frequency; take it for what it's worth.
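As a sketch of the two sort orders, assuming each result is a small record with a date, a hit count, and a list of tags (the field names, sample values, and `sort_results` helper are invented for this example):

```python
from operator import itemgetter

# Hypothetical result records; ISO-style dates sort correctly as plain strings.
RESULTS = [
    {"url": "http://example.com/a1", "date": "2005-11-02", "frequency": 7, "tags": ["search"]},
    {"url": "http://example.com/a2", "date": "2006-02-09", "frequency": 2, "tags": ["python"]},
    {"url": "http://example.com/a3", "date": "2004-06-15", "frequency": 4, "tags": ["xml"]},
]

def sort_results(results, key="date"):
    """Return results ordered newest-first or most-frequent-first."""
    if key not in ("date", "frequency"):
        raise ValueError("sort key must be 'date' or 'frequency'")
    return sorted(results, key=itemgetter(key), reverse=True)

for row in sort_results(RESULTS, "frequency"):
    print(row["frequency"], row["date"], row["url"], ", ".join(row["tags"]))
```

Frequency here is nothing more than a raw count of hits, which is the point of not calling it relevance.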

The tags, of course, are a product of real human evaluation -- mine, in fact. Surfacing them provides lots of useful cues. Although they link to their corresponding del.icio.us pages, a next step would be to weave tag surfing directly into this interface, a la the infoworld explorer.

Update: There is nothing quite like launching a new search service on the Internet and then watching for the first query to come in from the outside world. In this case, the sought phrase was ... wait for it ... Adrian Barbeau. Sadly nothing was found. But try again tomorrow!

Further update: Oops. For an obscure reason, it was displaying nothing at all in MSIE. Fixed now. Gotta remember to check that.

Further update: See also: Information architectures: print versus online.


Former URL: http://weblog.infoworld.com/udell/2006/02/09.html#a1385