Closing the loop on XHTML blog content

James Farmer asks about the difference between WYSIWYG XML and HTML editing:

Micah says we need a WYSIWYG XML editing tool, um (me being naive here), what's wrong with WYSIWYG HTML tool? [James Farmer]

Here was Micah Alpern's response:

The bottom line is we need a way to edit/create structure while editing formatting. The high priests of XML want us to think about structure directly (this section should be H1 or H2), when most humans think in formatting (18 point font vs 9 point). WYSIWYG HTML editors have gotten us some of the way there, but the structure behind HTML is limited and co-mingles the presentation.

While as users we often want to manipulate the structure and the presentation at the same time it's important that within the underlying representation (the code behind the WYSIWYG editor) these two layers remain separate. [Micah Alpern]

I noticed these remarks because I've recently upgraded to Firebird 0.61 (from 0.6) in order to try out Jake Savin's Midas implementation for Radio UserLand. What I'd forgotten about Midas is that, unlike Mozile, it doesn't produce XHTML. I can probably coerce its output to XHTML using Tidy, and may do so because Midas is the more powerful of the two as an editor.

Meanwhile, though, I got to thinking about why I'm writing XHTML content in the first place. I laid out the case in this article, but although I've been steadily accumulating well-formed content since then, I hadn't gotten around to mining it.

Today I took the plunge, starting with this UserTalk fragment:

for i = sizeof(weblogdata.posts) - 112 to sizeof(weblogdata.posts) 
  s = s + table.tableToXml(@weblogdata.posts[i]);
file.writeWholeFile( "export.xml", s ) 

In other words, 112 postings ago I began requiring myself to post well-formed XHTML. This snippet exports those postings as XML.

Of course Murphy struck immediately. Radio was not expecting me to store well-formed XHTML, so it escaped all my entries. I was able to undo that escaping with this transformation:

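<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="xml" indent="yes" encoding="us-ascii"/>
<xsl:template match="node() | @*">
  <xsl:copy>
    <xsl:apply-templates select="@* | node()"/>
  </xsl:copy>
</xsl:template>
<xsl:template match="body">
  <!-- Template content did not survive; assumed here to copy the element through. -->
  <xsl:copy><xsl:apply-templates select="@* | node()"/></xsl:copy>
</xsl:template>
<xsl:template match="text()">
  <xsl:value-of disable-output-escaping="yes" select="."/>
</xsl:template>
</xsl:stylesheet>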

But not completely. The problem is that when I write an entry, I distinguish between markup tags, such as <p>, and non-markup, such as &lt;description>. Radio's escaping eliminated that distinction. So after recovering my XML, I had to do a combination of scripted and manual fixup to restore it. Ugh. Going forward, I'll either have to convince Radio not to escape my stuff, or else maintain new items in the XML file I've extracted from Radio.
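
To see why the distinction can't be recovered mechanically, here is a minimal Python sketch. It assumes, purely for illustration, an escaper that rewrites '<' but leaves existing '&' characters alone; I haven't confirmed that this is what Radio actually does, and lossy_escape and lossy_unescape are hypothetical names. The standard library's xml.sax.saxutils.escape, which escapes '&' as well, is shown for contrast.

```python
from xml.sax.saxutils import escape, unescape


def lossy_escape(text: str) -> str:
    """Hypothetical escaper: rewrites '<' but leaves existing '&' untouched."""
    return text.replace("<", "&lt;")


def lossy_unescape(text: str) -> str:
    """Reverse of lossy_escape; it cannot tell which '&lt;' was once a real tag."""
    return text.replace("&lt;", "<")


# An entry mixing real markup (<p>) with a literal, already-escaped tag name.
entry = "<p>Use the &lt;description> element.</p>"

# Lossy round trip: the literal &lt;description> comes back as a tag.
lossy_round_trip = lossy_unescape(lossy_escape(entry))

# Escaping '&' as well (as saxutils.escape does) keeps the two cases apart.
safe_round_trip = unescape(escape(entry))

print("lossy round trip preserved entry:", lossy_round_trip == entry)
print("safe round trip preserved entry: ", safe_round_trip == entry)
```

Under that assumption, the lossy round trip turns the literal text into a real <description> tag, which is the kind of damage that needs scripted and manual fixup.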

Anyway, the point of all this is to be able to blend style tags that make sense to ordinary users with structural cues that can facilitate intelligent search and recombination of content. Here's a search example. It's similar to some others I've done recently, and relies on the ability of MSIE or Mozilla to suck in XML and dynamically restructure it based on XPath search. It's not ideal for client-side use over the Web, since the first search hauls in 0.5MB of XML. A server-side implementation would be easy to do as well, if needed.

So this closes the loop for me. Now when I add a CSS class attribute to an element -- like 'quotation' or 'minireview' -- I can think about it in two ways. As a writer, I'll assign some appropriate style to it. As a reader, I'll be able to filter my whole blog down to just elements of that class. Or to subsets of the class. For example, the XSLT and UserTalk fragments in this item are found by the corresponding canned XPath queries in the example.
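
The filtering idea can be sketched outside the browser too. This is a rough Python illustration, not the MSIE/Mozilla implementation behind the example: the sample XML, its element names, and the elements_of_class helper are all made up for the purpose, and the class names are the ones mentioned above.

```python
import xml.etree.ElementTree as ET

# Hypothetical export; the structure is invented for illustration only.
SAMPLE = """\
<posts>
  <post id="1">
    <body>
      <p>Ordinary paragraph.</p>
      <p class="minireview">A short review of a book.</p>
    </body>
  </post>
  <post id="2">
    <body>
      <blockquote class="quotation">Something someone said.</blockquote>
      <p class="minireview">A short review of a tool.</p>
    </body>
  </post>
</posts>
"""


def elements_of_class(root: ET.Element, class_name: str) -> list[ET.Element]:
    """Return every descendant whose class attribute equals class_name exactly."""
    return root.findall(f".//*[@class='{class_name}']")


root = ET.fromstring(SAMPLE)
minireviews = elements_of_class(root, "minireview")
quotations = elements_of_class(root, "quotation")

for element in minireviews:
    print(element.tag, "->", element.text)
```

Note that the predicate is an exact match on the attribute value, so an element carrying several space-separated classes would need a different test.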

This is not, in itself, very interesting to other people, though it's incredibly helpful to me as the author of my own blog. As it stands, of course, the cost/benefit ratio is way out of whack for most people. I'm willing to jump through hoops to make this happen, because I can and because I see the value of it. What I envision, though, is that a Midas-like thingy (tweaked to save as XHTML, and to integrate its CSS awareness with that of the host blog tool) could be used by lots of folks to enrich their blogs with named styles. If those blogs then flow their XHTML content out through RSS, we have a way to close the loop on a grander scale. Should people decide that a 'minireview' is a cool kind of blog element, they can use CSS styling to distinguish minireviews visually. Meanwhile, as a secondary benefit, aggregators can collect and recombine these elements.

There are too many moving parts here, I realize, and it's going to be hard to get this whole concept over the activation threshold. But I'm ever hopeful!
