How (and why) to include an xhtml:body in a Radio UserLand RSS feed

Sam Ruby and Don Box have both demonstrated valid RSS 2.0 feeds (Sam, Don) that include a <body> element, properly namespaced as XHTML. Quietly, last week, I joined the party. My primary feed now includes:

A brief <description>.
The full text of the item, as an HTML-escaped string, in a <content:encoded> element.¹
The full text of the item, as XHTML, in an <xhtml:body> element.

Although it enlarged my RSS feed, the XHTML body shouldn't have affected -- and indeed seems not to have affected -- any existing RSS-aware software. So, what's it good for? In an upcoming O'Reilly Network column, I lay out the case. I want to aggregate feeds into XML-aware databases, and be able to run precise XPath queries against those databases. Given these capabilities, it makes sense to invest in more and better descriptive markup. That, in my view, is how we bootstrap from the existing blogosphere to the semantic web. Not by defining a cosmic ontology. But rather from the bottom up, by building consensus around incremental enrichment of the stuff we write every day.

At the moment, of course, that's just one more theory. So I won't speculate further now. I do, however, want to suggest a way to do the experiment. That boils down to some nitty-gritty implementation details. Here, I'll discuss how to simplify XHTML authoring for Radio UserLand, because that's the blog software I use. I hope users of Movable Type and other platforms will offer similar tutorials.

For me, writing XHTML isn't a big deal. Having abandoned Radio's embedded MS DHTML edit control -- because I'm often working on a Mac nowadays, and also because it sucks -- I just write my blog entries in simple, clean HTML. As a result, they're already very close to XHTML. If you remember to quote all your attributes, and close all your tags (for example, <img ... />, <br />, and <p>...</p>), you're almost there. Of course, I don't always remember that, so I need some help. There's also the nasty problem of bare ampersands and HTML entities, which need to be escaped or altered for XML transmission. Life's too short to deal with this kind of thing; clearly you want some tool support.
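To make the ampersand and entity problem concrete, here is a minimal sketch in Python (not Radio's UserTalk, and not what HTML Tidy does internally). The function name xml_safe is hypothetical. It escapes bare ampersands and rewrites named HTML entities as numeric character references, which any XML parser accepts without a DTD.

```python
import re
from html.entities import name2codepoint

# An ampersand plus, optionally, the entity or character reference it starts.
_AMP = re.compile(r"&(#[0-9]+;|#[xX][0-9a-fA-F]+;|[A-Za-z][A-Za-z0-9]*;)?")

# The five entities that XML itself predefines; these pass through unchanged.
_XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

def xml_safe(text):
    """Make ampersands and HTML entities safe to embed in an XML document."""
    def fix(match):
        ref = match.group(1)
        if ref is None:
            return "&amp;"                 # bare ampersand
        if ref.startswith("#"):
            return match.group(0)          # already a numeric reference
        name = ref[:-1]
        if name in _XML_PREDEFINED:
            return match.group(0)
        if name in name2codepoint:
            # Named HTML entity, e.g. &nbsp; becomes &#160;
            return "&#%d;" % name2codepoint[name]
        return "&amp;" + ref               # unknown name: treat & as literal
    return _AMP.sub(fix, text)

print(xml_safe("AT&T&nbsp;&copy; 2003"))
```

This only covers the entity half of the job; quoting attributes and closing tags still needs a real HTML-aware tool.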

I knew, of course, that I wanted to use HTML Tidy, which can not only clean up the worst of the mess that the DHTML edit control makes, but can also be used to XHTML-ify your content. The question was how to integrate it with Radio UserLand's publishing process. I'm happy to report that David Carter-Tod has done the heavy lifting. His Tidy Tool wraps HTML Tidy in a Radio script. (More generally, his tool shows how to spawn any command-line executable from Radio.) To use it, you need a local copy of the HTML Tidy program. Since the instance of Radio that I publish from runs on Windows, I acquired the Win32 version of HTML Tidy, tidy.exe, and put it in Radio's Tools directory along with David's files: tidy.root and tidyconfig.txt.
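For readers on other platforms, the general idea of spawning a command-line Tidy from a script might look like the following Python sketch. It is an analogue, not a description of how David's Tidy Tool is written. The helper names tidy_command and run_tidy are hypothetical, and the default paths are assumptions; the sketch relies only on Tidy's -config and -quiet options and on its reading standard input when no file is named.

```python
import subprocess

def tidy_command(tidy_path, config_path):
    """Build the argument list for a Tidy run that reads stdin, writes stdout."""
    return [tidy_path, "-quiet", "-config", config_path]

def run_tidy(html, tidy_path="tidy", config_path="tidyconfig.txt"):
    """Pipe html through Tidy and return whatever it writes to stdout."""
    result = subprocess.run(
        tidy_command(tidy_path, config_path),
        input=html,
        capture_output=True,
        text=True,
    )
    # Tidy exits with 1 when it only has warnings and 2 when it hit errors,
    # so only 2 or above is treated as a failure here.
    if result.returncode >= 2:
        raise RuntimeError(result.stderr)
    return result.stdout
```

Calling run_tidy requires a Tidy executable at tidy_path and a config file at config_path; neither is bundled with Python.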

After some standalone experimentation with HTML Tidy's XML/XHTML output mode, I settled on these tidyconfig.txt settings:

output-xml: true
numeric-entities: true
markup: true

It seemed to me that only the first of these should have been necessary. But without numeric-entities set to true, my HTML entities weren't escaped as they need to be. And oddly, the same thing happened when the markup setting (which pretty-prints the XML output) wasn't set to true. Perhaps an HTML Tidy expert can explain why the first setting alone wasn't sufficient, but in any case, what I show here is working for me.

To initialize David's Tidy tool, I typed CTRL-; to launch Radio's QuickScript editor, entered tidySuite.init(), and clicked Run. This launches a file browser so you can identify the location of tidy.exe, which in my case was c:\radio\tools\tidy.exe.

Next, I tested against some sample postings. In the QuickScript editor, I ran scratchpad.s = tidySuite.clean( "..." ), substituting item texts for "...", and inspected radio.root.scratchpad.s in Radio's database editor. A minor annoyance with HTML Tidy is that it returns complete HTML files, adding <HTML> and <HEAD> and <TITLE> tags if you omit them. In this case, though, only the content of the <BODY> tag is needed, and happily, that's exactly what tidySuite.clean returns by default.
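If your own wrapper hands back the whole document instead, pulling out just the body content is a small step. The Python sketch below is one assumed way to do it with a regular expression; it makes no claim about how tidySuite.clean does the job, and the name body_content is hypothetical.

```python
import re

# Non-greedy match of everything between the opening and closing body tags.
_BODY = re.compile(r"<body\b[^>]*>(.*?)</body\s*>", re.IGNORECASE | re.DOTALL)

def body_content(document):
    """Return the markup inside <body>...</body>, or the input if none is found."""
    match = _BODY.search(document)
    return match.group(1).strip() if match else document

page = "<html><head><title></title></head>\n<body>\n<p>Hello</p>\n</body>\n</html>"
print(body_content(page))
```

A regular expression is good enough here because Tidy's output has exactly one body element; for arbitrary input an XML parser would be the safer choice.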

So far so good. Now, how to stuff this XML-ified item into an RSS feed? The new extensibility hooks were almost, but not quite, sufficient to the task. You can do three useful things with these hooks: add namespace declarations to your feed, add channel elements, and add item elements. But when you add an item element, it will automatically be escaped for HTML transmission. That's not what I want here. I really do want to send XML. So, for now, I'm continuing to use the original extensibility hook, which has enabled me to completely replace Radio's RSS writer with a modified version. In that version, here's how I'm adding the XHTML body:

add ("<body xmlns=\"http://www.w3.org/1999/xhtml\">");
add (tidySuite.clean( string ( adrpost^.text) ) );
add ("</body>")
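The distinction between sending escaped HTML and sending real XML can be illustrated outside Radio. The Python sketch below is an illustration only, not the modified RSS writer; build_item is a hypothetical helper. It puts the same text into an item as an escaped string in content:encoded and as parsed XHTML in a namespaced body element.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"

# Ask the serializer to use the conventional prefixes.
ET.register_namespace("xhtml", XHTML_NS)
ET.register_namespace("content", CONTENT_NS)

def build_item(description, html_text, xhtml_fragment):
    """Build an <item> that carries the text three ways (hypothetical helper)."""
    item = ET.Element("item")
    ET.SubElement(item, "description").text = description
    # Plain string: the serializer escapes <, > and & when writing it out.
    ET.SubElement(item, "{%s}encoded" % CONTENT_NS).text = html_text
    # Real XML: parsing raises ET.ParseError unless the fragment is well-formed.
    wrapped = '<body xmlns="%s">%s</body>' % (XHTML_NS, xhtml_fragment)
    item.append(ET.fromstring(wrapped))
    return item

item = build_item("Hi", "<p>Hi &amp; bye</p>", "<p>Hi &amp; bye</p>")
xml_text = ET.tostring(item, encoding="unicode")
print(xml_text)
```

Because the fragment is parsed before it is appended, markup that is not well-formed fails at build time rather than producing a broken feed, which is one reason to run the text through a cleanup tool first.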

This works. It's a shame, though, not to use the newer, less invasive, more elegant extensibility hooks. If there were a way for item-level callbacks to indicate when angle-bracket-escaping and entity-encoding are not wanted, then you could use this better approach. Of course, for the vast majority of users, a Pref ("Emit XHTML bodies in RSS feed") would be ideal. So would a version of HTML Tidy that's included as a DLL, just as regular expression support is included in regex.dll. But first things first. If some of us try this bootstrap approach, and if it produces real value, then we can make a case for a more general solution. For now, my screen flashes five or six times, as each item on my homepage is processed in a command window -- but there's no real delay, and that isn't a problem.

My next step will be to start aggregating XHTML-aware feeds -- mine, Sam's, Don's, others -- into a database. I want to get a feel for what it's like to search that database with element-level specificity. And then I want to start injecting descriptive markup into my own stuff that will enable me (or you) to search more precisely.
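As a small taste of element-level searching, here is a Python sketch that uses the limited XPath support in the standard library's ElementTree rather than an XML-aware database. The feed, the item titles, and the function name links_in_feed are all invented for illustration.

```python
import xml.etree.ElementTree as ET

XHTML = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical two-item feed, invented for illustration only.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <item>
      <title>First</title>
      <body xmlns="http://www.w3.org/1999/xhtml">
        <p>See <a href="http://example.org/one">this page</a>.</p>
      </body>
    </item>
    <item>
      <title>Second</title>
      <body xmlns="http://www.w3.org/1999/xhtml"><p>No links here.</p></body>
    </item>
  </channel>
</rss>"""

def links_in_feed(rss_xml):
    """Return (item title, href) for every XHTML link inside an item's body."""
    root = ET.fromstring(rss_xml)
    found = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        # Any <a> with an href attribute, at any depth below the XHTML body.
        for anchor in item.findall("x:body//x:a[@href]", XHTML):
            found.append((title, anchor.get("href")))
    return found

for title, href in links_in_feed(SAMPLE_FEED):
    print(title, href)
```

The same query would get nowhere against an HTML-escaped description, since the parser sees that content as one opaque string.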

¹ The <content:encoded> element is only used by my alternate feed, which transforms it into <description> for those who prefer reading complete items in RSS newsreaders, rather than the first-paragraph-only truncated <description> I send in my primary feed.

Former URL: http://weblog.infoworld.com/udell/2003/04/14.html#a666