Semantic screenscraping

I'm in San Francisco for a couple of days, meeting with my fellow InfoWorlders. When I looked at san.francisco.eventguide.com to see what's going on, I expected to find an accompanying Google Maps mashup but didn't. So I cooked up a quick and dirty one, mostly because I've never actually tried the official Google Maps API and wanted to see what it's like. (My Google Maps walking tour of Keene, NH, done almost a year ago, predated the API.) As I expected, it's almost trivial to create a map, inject points into it, and attach documentation to the points.

The hard part is coming up with the data. Many if not most Google Maps mashups crawl websites to get it, and my eventguide.com solution follows in their footsteps. It's customary to apologize for web screenscraping, and to decry the ongoing failure of websites to offer data-oriented interfaces. But lately I'm less embarrassed about screenscraping than I used to be.

Regular expressions once dominated my screenscraping code. Now XPath expressions do. Screenscraping is becoming more declarative, more query-like. There's no single breakthrough I can point to, just an accumulation of trends:
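To make the shift concrete, here's a minimal sketch contrasting the two styles. The markup is hypothetical, and I'm using Python's stdlib `xml.etree`, which supports a limited but useful subset of XPath:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, already well-formed fragment of an event listing
page = """<table>
  <tr><td><a href="/e/1">Jazz at the Fillmore</a></td></tr>
  <tr><td><a href="/e/2">Gallery opening</a></td></tr>
</table>"""

# The regex approach: brittle, tied to exact markup details
titles_re = re.findall(r'<a href="[^"]*">([^<]+)</a>', page)

# The declarative approach: a path query over the parsed tree
tree = ET.fromstring(page)
titles_xp = [a.text for a in tree.findall('.//tr/td/a')]

print(titles_xp)  # ['Jazz at the Fillmore', 'Gallery opening']
```

The regex breaks the moment an attribute is added to the anchor tag; the path query keeps working so long as the document's shape holds.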

HTML is readily convertible to XHTML. When I was working with MarkLogic's XQuery engine last year, I was pretty surprised to find that virtually all of the RSS feeds I collected were convertible to XHTML using HTML Tidy.
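HTML Tidy itself is a command-line tool (`tidy -asxhtml` does the conversion). As a sketch of the core job it performs, here's a stdlib-only Python toy that supplies the end tags a sloppy page omits. It handles only the easy cases and is no substitute for Tidy:

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take closing tags in HTML
VOID = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta"}
# Elements whose closing tags HTML lets authors omit
IMPLIED = {"li", "p", "td", "th", "tr", "option"}

class MiniTidy(HTMLParser):
    """Toy sketch of HTML Tidy's core job: re-emit tag soup as
    well-formed XHTML by closing the tags the author left open."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.stack = []

    def handle_starttag(self, tag, attrs):
        # An unclosed <li>, <p>, etc. is implicitly ended by its sibling
        if self.stack and self.stack[-1] == tag and tag in IMPLIED:
            self.out.append(f"</{self.stack.pop()}>")
        attr_s = "".join(f' {k}="{escape(v or "", quote=True)}"' for k, v in attrs)
        if tag in VOID:
            self.out.append(f"<{tag}{attr_s}/>")
        else:
            self.out.append(f"<{tag}{attr_s}>")
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Close any still-open children first, then the tag itself
        while self.stack and self.stack[-1] != tag:
            self.out.append(f"</{self.stack.pop()}>")
        if self.stack:
            self.stack.pop()
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data))

def to_xhtml(soup):
    t = MiniTidy()
    t.feed(soup)
    t.close()
    while t.stack:  # close anything left open at end of input
        t.out.append(f"</{t.stack.pop()}>")
    return "".join(t.out)

soup = "<ul><li>Jazz at 8<li>Reading at 7</ul><p>More soon"
print(to_xhtml(soup))
# <ul><li>Jazz at 8</li><li>Reading at 7</li></ul><p>More soon</p>
```

The output, unlike the input, will survive an XML parser — which is the whole point.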

The resulting XHTML is semantically richer thanks to CSS. It's been a long slog, but CSS is now close to pervasive. That creates all kinds of hooks for structured query.
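Here's what I mean by hooks, using hypothetical markup. A class attribute added for styling doubles as a semantic label you can query on:

```python
import xml.etree.ElementTree as ET

# Hypothetical CSS-styled markup: class names double as semantic labels
page = """<div>
  <h2 class="venue">Great American Music Hall</h2>
  <p class="address">859 O'Farrell St</p>
  <h2 class="venue">The Fillmore</h2>
  <p class="address">1805 Geary Blvd</p>
</div>"""

tree = ET.fromstring(page)
# The class attribute, added for presentation, becomes a query hook
venues = [e.text for e in tree.findall(".//*[@class='venue']")]
print(venues)  # ['Great American Music Hall', 'The Fillmore']
```

No regex ever has to know whether the venue name lives in an h2, a span, or a td; the query keys on the author's own semantic labeling.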

There are more and better structured query technologies. XPath implementations have matured and are widely available. XQuery implementations are maturing rapidly too. And here's a key point that's often overlooked: even when an HTML page resists conversion to XML by all other means, browsers can still make sense of it. And having made sense of it, they can use structured query to analyze it.

The upshot is that pulling data out of eventguide.com felt very different from comparable tasks in days of yore. It felt like I was navigating a linked web of data, to borrow Adam Bosworth's phrase again.

In fact, that's just what I was doing. The data web required for this exercise includes event and venue pages from eventguide.com, which I am "just" screenscraping, and location pages from geocoder.us, which I am fetching as XML fragments (using REST calls in preference to the XML-RPC and SOAP options). Could eventguide.com change its interface and break my code? Sure. So could geocoder.us. But in practice both are likely to remain rather stable over their lifetimes. And there's no guarantee that the one will be more stable than the other.
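The geocoder.us side of the exercise looks roughly like this sketch. The sample response below is illustrative — made-up coordinates, and a shape assumed from the RDF/XML the service returned at the time (W3C wgs84 geo vocabulary); the live HTTP fetch is elided:

```python
import xml.etree.ElementTree as ET

GEO = {'geo': 'http://www.w3.org/2003/01/geo/wgs84_pos#'}

def extract_point(xml_text):
    """Pull the first (lat, long) pair from a geocoder.us-style
    RDF/XML fragment using the W3C wgs84 geo vocabulary."""
    root = ET.fromstring(xml_text)
    lat = root.find('.//geo:lat', GEO)
    lng = root.find('.//geo:long', GEO)
    if lat is None or lng is None:
        return None
    return float(lat.text), float(lng.text)

# In real use, xml_text would come from a GET against the service's
# REST endpoint. This sample mimics the response shape:
sample = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <geo:Point>
    <dc:description>859 O'Farrell St, San Francisco CA</dc:description>
    <geo:lat>37.784826</geo:lat>
    <geo:long>-122.418718</geo:long>
  </geo:Point>
</rdf:RDF>"""

print(extract_point(sample))  # (37.784826, -122.418718)
```

Whether the XML arrives as a deliberate data interface or gets tidied out of tag soup, the extraction code on my end looks the same — which is exactly why the two feel equally stable in practice.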

Web services at one end of the tolerance continuum, and microformats at the other, are helping to create the data web. But it's nice to see that the plain old web is getting more data-friendly too.


Former URL: http://weblog.infoworld.com/udell/2006/02/02.html#a1380