Yesterday I said I'd try using Mark Logic's Content Interaction Server for structured search of blog content. A first version of that is now running here. If you've followed my adventures in this area, you'll know that I've used a similar service to do the same thing for about a year -- against my own blog content, exploiting some special XHTML coding conventions I use. The Mark Logic server is one of a series of engines I've used to extend a more general kind of structured search to all the blogs I read.
At the moment, the database contains about 3600 blog items -- basically, everything in my inbound RSS feeds that was convertible (using HTML Tidy) to XHTML. From a standing start, it loads and indexes that amount of stuff in just under 10 seconds on a dual-processor Compaq DL360. And as you can see, the queries are nice and snappy.
I've yet to drill into, and expose, the more advanced features of the product. As an XQuery engine, it can go way beyond the restricted XPath-only mode I'm using here. And as a fulltext search engine, it can do a whole other set of tricks: relevance, stemming, wildcarding, and so on. Before I push those envelopes, though, I think I'll burn in this first implementation for a while and let it accumulate more data.
Update: It's humming along nicely so far, with almost 4000 items. I've added a bunch more sample queries, including this one, which makes a nice little collection of items about the DEMO conference. It works by finding items that link to demo.com, or that contain 'DEMO' or 'Demo@15'. The last time I played around with this idea I managed to excite the XML geeks but not many others. Maybe this time, with a fresh start and a new engine under the hood, I'll find a way to make structured search more compelling to a wider audience.
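For the curious, here's a rough sketch of the logic behind that DEMO query. It isn't the query that runs on the server. It's a plain Python stand-in using the standard library's ElementTree, and the `<items>`/`<item>` wrapper, the `id` attributes, and the sample text are all made up for illustration. I'm also assuming the text match is case-sensitive, so 'DEMO' won't match 'demonstration'.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: three blog items as XHTML-ish fragments.
# The <items>/<item> wrapper and the id attributes are assumptions,
# not the actual schema of the database.
SAMPLE = """
<items>
  <item id="1"><p>Heading to <a href="http://www.demo.com/">the show</a> soon.</p></item>
  <item id="2"><p>Notes from Demo@15.</p></item>
  <item id="3"><p>A demonstration of XPath.</p></item>
</items>
"""

# Roughly the same test written as a single XPath 1.0 expression
# (element names hypothetical):
#   //item[.//a[contains(@href, 'demo.com')]
#          or contains(., 'DEMO') or contains(., 'Demo@15')]


def is_demo_item(item):
    """True if the item links to demo.com or mentions DEMO / Demo@15."""
    # Structural test: any <a> whose href contains demo.com.
    links_to_demo = any(
        "demo.com" in a.get("href", "") for a in item.iter("a")
    )
    # Text test: case-sensitive match on the item's full text content.
    text = "".join(item.itertext())
    return links_to_demo or "DEMO" in text or "Demo@15" in text


def demo_items(xml_text):
    """Return the ids of all items that pass the DEMO test."""
    root = ET.fromstring(xml_text)
    return [item.get("id") for item in root.iter("item") if is_demo_item(item)]


print(demo_items(SAMPLE))
```

The first item qualifies because of its link, the second because of its text, and the third is left out. The link test is the structured part: it looks at an attribute of a specific element rather than just scanning the words.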
Former URL: http://weblog.infoworld.com/udell/2005/02/16.html#a1178