Confession time

It's time for a confession. I've been acting as though all this cool XPath search stuff I've been demonstrating for the past few weeks were based on plain vanilla XHTML. Well, it's not (quite) true. In general my point has been to illustrate two things:

That the XHTML equivalent of ordinary HTML content includes metadata (links, tables, images) that can be usefully exposed as XML.
That legal ways of enlarging the namespace used within HTML -- in particular, CSS class attributes -- can enhance this approach.

But in truth, as some have noticed, I've been cheating on XHTML a bit. Here's one cheat: in order to support this query -- Python snippets -- I've been writing HTML like this:

<pre class="code" lang="python">
...
</pre>

The class="code" bit is OK, but there is no lang attribute defined for the <pre> element; I just made it up to support queries of this form. So far nobody has noticed or complained, but it's not right.
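For readers who haven't seen the kind of query this markup enables, here's a minimal sketch in Python. It is not the search engine behind this site: it uses the standard library's ElementTree, which supports only a subset of XPath, and the sample document and its contents are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML fragment; the wrapper <div> and the snippet
# contents are made up for this example.
DOC = """<div>
  <p>Some prose.</p>
  <pre class="code" lang="python">print("hello")</pre>
  <pre class="code" lang="perl">print "hello";</pre>
  <pre class="code">no language given</pre>
</div>"""

root = ET.fromstring(DOC)

# Chained attribute predicates: <pre> elements that are marked as code
# and also carry the made-up lang="python" attribute.
python_snippets = root.findall(".//pre[@class='code'][@lang='python']")

# Collect the text of each matching snippet.
snippets = [pre.text for pre in python_snippets]
```

The query only works because the lang attribute carries a programming-language name, which is exactly the cheat described above.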

Here's another cheat. In order to support this query -- quotes from Stefano -- I've been writing HTML like this:

<blockquote cite="Stefano Mazzocchi">
...
</blockquote>

This isn't right either. The value of the cite attribute is supposed to be a URI, not somebody's name. In this case, a few people have noticed and complained. I'm willing to switch to the correct usage of cite, and since my content is in an XML database I can fix it backwards as well as forwards. But here's the thing: I still want to be able to search quotes by person, not by URI. And I'd like there to be a standard way for other people to write quotes that they, or I, can search by person.

More generally, there are zillions of such use cases, and I don't think we can know them before we discover them. So I can't imagine proposing any specific extensions to XHTML that would accommodate such discovery. I can think of two general approaches, though. One might go like this:

<blockquote 
  cite="http://www.betaversion.org/~stefano/linotype/news/35/" 
  X-who="Stefano Mazzocchi">
...
</blockquote>

In other words, agree to allow a class of experimental attributes in a manner analogous to the experimental X- headers of Internet mail.

Another might go like this:

<blockquote xmlns:exp="http://XHTML-Experimental"
  cite="http://www.betaversion.org/~stefano/linotype/news/35/" 
  exp:who="Stefano Mazzocchi">
...
</blockquote>

In other words, use another namespace for attributes carrying extra data intended to facilitate search and reuse.
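To make the second approach concrete, here's a rough sketch of a search by person against namespaced attributes, again using Python's standard ElementTree. The function name quotes_by, the sample quotes, the example.org URLs, and the choice to declare the namespace once on a wrapper element are all assumptions for illustration; only the namespace URI and the exp:who attribute come from the markup above.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the example markup; everything else here
# (wrapper element, extra quotes, example.org URLs) is hypothetical.
EXP_NS = "http://XHTML-Experimental"

DOC = """<div xmlns:exp="http://XHTML-Experimental">
  <blockquote cite="http://www.betaversion.org/~stefano/linotype/news/35/"
              exp:who="Stefano Mazzocchi">First quote.</blockquote>
  <blockquote cite="http://example.org/other"
              exp:who="Somebody Else">Second quote.</blockquote>
  <blockquote cite="http://example.org/plain">No attribution.</blockquote>
</div>"""

root = ET.fromstring(DOC)


def quotes_by(tree_root, who):
    """Return blockquote elements whose exp:who attribute equals `who`."""
    # ElementTree exposes namespaced attributes under "{uri}localname".
    key = "{%s}who" % EXP_NS
    return [bq for bq in tree_root.iter("blockquote") if bq.get(key) == who]


stefano_quotes = quotes_by(root, "Stefano Mazzocchi")
```

The cite attribute stays a proper URI, and the person's name lives in an attribute that a plain XHTML consumer can ignore.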

I've hesitated to even raise this issue because, in my experience, it's the kind of thing that can just get bogged down in endless discussion and debate. So I've gone ahead and cheated a bit on XHTML in the service of what I think is a proper ambition: to find some workable middle ground between the unstructured real web that exists all around us and the structured Semantic Web that exists only in our imagination. Or rather, to suggest how the latter can emerge from the former. But to those of you who've wondered: yes, I do feel guilty about cheating, and I'd like to come clean. Are there ways to enlarge the carrying capacity of the HTML namespace without doing violence to the spec? And without inventing mechanisms too complex for writers of ordinary everyday documents, or too far removed from existing writing tools?

Former URL: http://weblog.infoworld.com/udell/2004/02/03.html#a908