Structured search, phase two

The next phase of my structured search project is coming to life. For the new version I'm parsing all 200+ of the RSS feeds to which I subscribe, XHTML-izing the content, storing it in Berkeley DB XML, and exposing it to the same kinds of searches I've been applying to my own content. Here's a taste of the kinds of queries that are now possible:

The paint's not dry on this thing yet. I have yet to normalize the dates, and I'm still getting the hang of DB XML, but here are some things that become immediately obvious:

Feeds that deliver only partial content are at a disadvantage.
HTML Tidy is able to coerce a surprisingly large number of my subscribed feeds from HTML to XHTML.
Once coerced, they're addressable in terms of the elements you find in HTML: links, images, tables, quotes.
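
To make that last point concrete, here is a minimal Python sketch of what "addressable" means once an item is well-formed XHTML. It is not the Berkeley DB XML setup described above: it uses only the standard library's ElementTree, and the sample item, the `summarize_item` function, and the example URLs are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# XHTML elements live in this namespace, so queries must be namespace-aware.
NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical feed item, assumed to be already coerced to well-formed XHTML.
SAMPLE_ITEM = """\
<div xmlns="http://www.w3.org/1999/xhtml">
  <p>See <a href="http://example.com/spec">the spec</a> and
     <a href="http://example.org/notes">my notes</a>.</p>
  <blockquote><p>Structure makes search better.</p></blockquote>
  <img src="chart.png" alt="chart" />
</div>
"""

def summarize_item(xhtml_text):
    """Pull out the HTML-level structures an item exposes once it parses as XML."""
    root = ET.fromstring(xhtml_text)
    return {
        # Only anchors that actually carry an href attribute.
        "links": [a.get("href") for a in root.iterfind(".//x:a[@href]", NS)],
        "images": [img.get("src") for img in root.iterfind(".//x:img", NS)],
        # Flatten each blockquote to its text content.
        "quotes": ["".join(q.itertext()).strip()
                   for q in root.iterfind(".//x:blockquote", NS)],
        "tables": len(root.findall(".//x:table", NS)),
    }

summary = summarize_item(SAMPLE_ITEM)
print(summary)
```

The same path expressions would fail outright on tag soup, which is why the Tidy step matters: the parse either succeeds and everything is queryable, or it doesn't and nothing is.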

Until now, I've thought the major roadblock standing in the way of more richly structured content was the lack of easy-to-use XML writing tools. But maybe I've been wrong about that. If it's going to be practical to XHTML-ize what current HTML writing tools produce, maybe we can make a whole lot more progress than I thought by working toward CSS styling standards that will also provide hooks for more powerful searching.
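
As a rough illustration of the "hooks" idea, the sketch below treats a CSS class attribute as a search key. The class names `summary` and `pullquote`, the sample markup, and the `find_by_class` helper are all hypothetical; no such convention exists yet, which is rather the point.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML in which the writer's styling classes double as labels.
SAMPLE = """\
<div xmlns="http://www.w3.org/1999/xhtml">
  <p class="summary">Feeds with full content search better.</p>
  <p>Ordinary paragraph.</p>
  <p class="pullquote summary">Tidy does most of the work.</p>
</div>
"""

def find_by_class(root, class_name):
    """Return every element whose class attribute includes class_name."""
    # A class attribute holds space-separated tokens, so split before comparing.
    return [el for el in root.iter()
            if class_name in (el.get("class") or "").split()]

root = ET.fromstring(SAMPLE)
summaries = ["".join(el.itertext()).strip()
             for el in find_by_class(root, "summary")]
print(summaries)
```

If writers already apply such classes for styling, a search service gets the same labels for free, without anyone adopting a new XML authoring tool.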

At the very least, this will be a nice laboratory in which to experiment with a growing pool of XML content, using a variety of XML-capable databases. My hope, of course, is to offer a service that's as useful to you -- the writers of the blogs I'm reading, aggregating and searching -- as it is to me. And ideally, useful to you in ways that invite you to think about how to make what you write even more useful to all of us. We'll see how it goes.

Former URL: http://weblog.infoworld.com/udell/2004/01/29.html#a901