Tangled in the Threads

Jon Udell, April 5, 2000

Microsoft XML: The cup half full

Word 2000's almost-but-not-quite-XML is a step in the right direction

Last week, Randy Switt posted an example of Word 2000's 'Save as HTML' output. He noted that this example looks far cleaner than the sorry excuse for HTML that Office 97 coughs up. He also noted the strong XML flavor that permeates the text:

<html xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns="http://www.w3.org/TR/REC-html40">

<head>
<meta http-equiv=Content-Type content="text/html;
charset=windows-1252">
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 9">
<meta name=Originator content="Microsoft Word 9">
<link rel=File-List href="./DSL_files/filelist.xml">
<link rel=Edit-Time-Data
href="./DSL_files/editdata.mso">
<!--[if !mso]>
<style>
v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
w\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}
</style>
<![endif]-->
<title>DSL: Heaven or Hell</title>
<!--[if gte mso 9]><xml>
 <o:DocumentProperties>
  <o:Author>Randal S. Switt</o:Author>
  <o:LastAuthor>Randal S. Switt</o:LastAuthor>
  <o:Revision>2</o:Revision>
  <o:TotalTime>933</o:TotalTime>
  <o:LastPrinted>1999-12-28T18:50:00Z</o:LastPrinted>
  <o:Created>2000-02-12T00:10:00Z</o:Created>
  <o:LastSaved>2000-02-12T00:10:00Z</o:LastSaved>
  <o:Pages>9</o:Pages>
  <o:Words>4761</o:Words>
  <o:Characters>27138</o:Characters>
  <o:Company>CNE Services</o:Company>
  <o:Lines>226</o:Lines>
  <o:Paragraphs>54</o:Paragraphs>
  <o:CharactersWithSpaces>33327</o:CharactersWithSpaces>
  <o:Version>9.2720</o:Version>
 </o:DocumentProperties>
</xml><![endif]--><!--[if gte mso 9]><xml>
 <w:WordDocument>
  <w:ActiveWritingStyle Lang="EN-US" VendorID="8"
DLLVersion="513" NLCheck="0">1</w:ActiveWritingStyle>
  <w:DisplayHorizontalDrawingGridEvery>0</w:DisplayHorizontalDrawingGridEvery>
  <w:DisplayVerticalDrawingGridEvery>0</w:DisplayVerticalDrawingGridEvery>
  <w:UseMarginsForDrawingGridOrigin/>
  <w:Compatibility>
   <w:FootnoteLayoutLikeWW8/>
   <w:ShapeLayoutLikeWW8/>
   <w:AlignTablesRowByRow/>
   <w:ForgetLastTabAlignment/>
   <w:LayoutRawTableWidth/>
   <w:LayoutTableRowsApart/>
  </w:Compatibility>
 </w:WordDocument>
</xml><![endif]-->
<style>

Some people find this interesting mixture of HTML and XML to be yet another example of Microsoft trying to hijack the Internet. Angus Glashier cited a story by David Strom, on XML.com, which takes a dim view of Microsoft's XML intentions. Said Strom:

The real news is how MS-XML is designed from the start to be the common file interchange format for all Microsoft Office 2000 applications. In doing this, Microsoft has taken to extreme its time-honored practice of embracing and extending an ongoing standards effort. This time, MS-XML has something other than XML in mind. Microsoft is trying to move people away from ordinary HTML 3.0 documents and make Office 2000 the standard tool for Web authoring. And while earlier efforts, FrontPage most memorably, haven't really caught on, I think this time Office 2000 has a solid chance.

But Angus doesn't buy that argument, nor do I. "At least," he says, "the new XML format makes it feasible to parse Word documents with another tool." Not as easy as it could be, true, but not as hard as it used to be.

Not quite XML

In fact, Randy's example isn't XML. An XML parser will reject it. Consider, for example, this line:

<meta name=Generator content="Microsoft Word 9">

This fails to parse as XML for two reasons. First, the value of the name attribute is unquoted (although the value of the content attribute is quoted). Second, the <meta> tag is unclosed.

Here's one legal XML representation of the fragment:

<meta name="Generator" content="Microsoft Word 9"/>

Here's another:

<meta name="Generator" content="Microsoft Word 9"></meta>

Nowadays I tend to write all HTML in this XHTML-ish way, just for the convenience of being able to parse it as XML when I need to. Why shouldn't Word 2000 take the minimal steps required to emit XHTML -- that is, HTML which will parse in an XML parser? The benefits of doing this are enormous. XHTML is a wonderful transitional technology. You can simultaneously regard content as XML, for purposes of automated processing, and as HTML, for purposes of display. No XML-to-HTML transformation is needed to make the content viewable in a browser. So why doesn't Office 2000 (at least optionally) do this?
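
To make the difference concrete, here is a minimal sketch that feeds Word's line and the two well-formed rewrites to an XML parser. It assumes Python and its standard xml.etree.ElementTree module; the helper name parses_as_xml is mine, invented for illustration, and nothing here comes from Word or from the discussion thread.

```python
import xml.etree.ElementTree as ET

# The line as Word 2000 emits it: unquoted attribute value, unclosed tag.
word_output = '<meta name=Generator content="Microsoft Word 9">'

# The two well-formed rewrites shown above.
well_formed = [
    '<meta name="Generator" content="Microsoft Word 9"/>',
    '<meta name="Generator" content="Microsoft Word 9"></meta>',
]

def parses_as_xml(fragment):
    """Return True if the fragment is well-formed XML, False otherwise."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False

print(parses_as_xml(word_output))      # False
for fragment in well_formed:
    print(parses_as_xml(fragment))     # True, then True

# To an XML parser the two rewrites are the same element.
first, second = (ET.fromstring(f) for f in well_formed)
print(first.attrib == second.attrib)   # True
```

A browser accepts all three spellings, which is the whole appeal of the XHTML-ish style: nothing is lost on the display side, and the parser stops complaining.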

Phil Hunt:

It's from Microsoft, wadda you think? (Or maybe I'm being too cynical today.)

Dominic Amann:

Hardly, unless you count the reasoning of US District Judge Thomas Penfield Jackson as "cynical." In the light of the judgement, this corruption of XML is one more link in a long chain of anti-competitive practices by MS, where they clearly see middleware to be a serious threat to their monopoly, and employ every means at their disposal to sabotage all efforts at developing successful middleware platforms (Netscape, Java, and now XML).

But wait a minute. Randy's example was produced by Word's "Save as HTML" feature, not by "Save as XML." In fact, as Peter Hess notes, Word 2000 doesn't offer a "Save as XML" feature and thus doesn't claim to be creating XML output. He reports:

Here's an interesting experiment I did. I saved a Word document as HTML. Then I made a copy of the HTML file and changed the file extension to XML. Opening the HTML file in IE5 worked fine. Opening the XML file caused the exact error you noted in your post regarding the missing quotes. Even Microsoft's XML parser can't parse their sorta-kinda XML files.

True, but that's because the output isn't XML, nor does it claim to be. As Peter later points out, the format embeds "islands" of XML inside HTML comments, like this:

<!--[if gte mso 9]><xml>

lotsa xml-like stuff here

</xml><![endif]-->

He adds:

It's probably more accurate to look at the Office HTML file format as being HTML/CSS with embedded chunks of XML to carry non-display data.

And he cites this explanation from the MS Office 2000 Product Enhancements Guide:

To prepare Office documents for the Web, HTML serves as the base technology, Cascading Style Sheets (CSS) serves as the mechanism for specifying layout and formatting of the file, and Extensible Markup Language (XML) is the method of storing nonviewing data in the document. Office 2000 uses CSS extensively, both to ensure high-quality output in the browser and to add information to the file that helps Office applications preserve more information in the file than can be rendered by the browser.
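
The islands Peter describes can be pulled out mechanically. What follows is a rough sketch, not a documented Microsoft interface: it assumes Python's standard re and xml.etree.ElementTree modules, it assumes the islands themselves are well-formed (true of the ones in Randy's example, though I can't vouch for every Office document), and the names ISLAND and xml_islands are hypothetical. The one wrinkle is that the o: and w: prefixes are declared on the enclosing <html> tag, so they have to be redeclared before an island will parse on its own.

```python
import re
import xml.etree.ElementTree as ET

# Namespace prefixes declared on Word's <html> root element.
NAMESPACES = {
    "o": "urn:schemas-microsoft-com:office:office",
    "w": "urn:schemas-microsoft-com:office:word",
}

# An island sits between "<!--[if gte mso 9]>" and "<![endif]-->".
ISLAND = re.compile(
    r"<!--\[if gte mso 9\]>(<xml>.*?</xml>)<!\[endif\]-->", re.DOTALL
)

def xml_islands(html_text):
    """Yield each <xml> island in html_text as a parsed element."""
    declarations = " ".join(
        f'xmlns:{prefix}="{uri}"' for prefix, uri in NAMESPACES.items()
    )
    for match in ISLAND.finditer(html_text):
        # Redeclare the prefixes on the <xml> wrapper so the island
        # can be parsed apart from the surrounding HTML.
        island = match.group(1).replace("<xml>", f"<xml {declarations}>", 1)
        yield ET.fromstring(island)

# A shortened copy of the document properties from Randy's example.
sample = """<title>DSL: Heaven or Hell</title>
<!--[if gte mso 9]><xml>
 <o:DocumentProperties>
  <o:Author>Randal S. Switt</o:Author>
  <o:Pages>9</o:Pages>
  <o:Words>4761</o:Words>
 </o:DocumentProperties>
</xml><![endif]-->"""

for island in xml_islands(sample):
    for prop in island.iterfind("o:DocumentProperties/*", NAMESPACES):
        # Tags come back as "{namespace-uri}LocalName"; keep the local name.
        print(prop.tag.split("}")[1], "=", prop.text)
```

Run against the sample, this prints the author, page count, and word count, which is exactly the kind of nonviewing data the Product Enhancements Guide says the XML is there to carry.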

Towards XML-aware writing tools

It's certainly true that Microsoft (like other companies) has used proprietary file formats as a competitive weapon. And certainly, there is a proprietary flavor to the XML-ish HTML seen in Word 2000. But it's an important step forward, and will make a lot of content more accessible than it formerly was.

Angus Glashier:

So what if MS Office doesn't use official XML?

Take a "cup is half full" view of the situation. Before, it was effectively impossible to read Word documents outside of Word. You had to parse Microsoft's proprietary binary format, which changed with every version. Now it's possible to read Word documents using a text editor and have a good chance of decoding them.

I've always liked using MS Word, but I hated that all my documents were stored in a format that I couldn't read outside of Word. With MS Office 2000 I can work entirely in Microsoft's version of HTML/XML, which retains all the Word-specific settings. Others can view my documents using a web browser and if I need to I can convert the document into another format with a minimum of fuss. If the XML is invalid, I can use Python or Perl to fix it.

I think that's a good perspective.
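
Angus's remark about fixing things up with Python is easy to try. Here is a minimal sketch of one way to do it, offered as an illustration rather than as anything Angus posted. It assumes the html.parser and xml.sax.saxutils modules from the current Python standard library; the class name XmlFixer and the short VOID_TAGS list are my own inventions, and the list is deliberately incomplete. Note that html.parser lowercases tag and attribute names, so this approach suits the HTML portions of the file, not the mixed-case islands.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

# HTML elements that never take a closing tag (a partial list, for illustration).
VOID_TAGS = {"meta", "link", "br", "hr", "img", "input"}

class XmlFixer(HTMLParser):
    """Re-emit lenient HTML with quoted attributes and closed empty tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # A valueless attribute such as "checked" becomes checked="checked".
        pairs = "".join(
            f" {name}={quoteattr(name if value is None else value)}"
            for name, value in attrs
        )
        close = "/" if tag in VOID_TAGS else ""
        self.out.append(f"<{tag}{pairs}{close}>")

    def handle_endtag(self, tag):
        # Void tags were already closed in handle_starttag.
        if tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text arrives unescaped, so escape &, < and > again.
        self.out.append(escape(data))

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

sample = """<head>
<meta name=Generator content="Microsoft Word 9">
<link rel=File-List href="./DSL_files/filelist.xml">
<title>DSL: Heaven or Hell</title>
</head>"""

fixer = XmlFixer()
fixer.feed(sample)
fixer.close()
fixed = "".join(fixer.out)
print(fixed)

# Raises ParseError if the result is still not well-formed.
ET.fromstring(fixed)
```

The repaired fragment carries quoted attribute values and self-closed <meta> and <link> tags, and the final line confirms that an XML parser now accepts it. A real converter would need to cope with much more (comments containing double hyphens, for one), but the point stands: a text format can be repaired with a page of script, where the old binary format could not.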

Still, I'm impatient to move ahead to a world in which well-formed and valid content is routinely produced by authoring tools, and understood by editing and viewing tools. I recognize the reality of the installed base, and its inertia. Nevertheless I'm saddened that MS Word, which so dominates the realm of textual data entry, is not yet able to support well-formed, valid writing, even though the infrastructure (e.g., XML parsing, CSS-driven XML viewing) is widely deployed and (on Windows) almost universally deployable. But as Angus says, the new Word format is more tractable than the old, and doesn't -- after all -- claim to be XML.

That said, I'd use Word 2000 in a heartbeat if it did what SoftQuad's XMetaL does, namely: support the writing of valid and well-formed XML (or XHTML).

Phil Hunt agrees, and says he's working on an open-source product along these lines. The need is tremendous. Everybody wants the results that flow from well-structured documents. Nobody wants to write documents that way, though. The obstacles are twofold:

XMetaL attacks the second point, with good results. But the 800-pound gorilla is MS Word. Not until it supports well-formed and valid writing, so that ordinary documents can easily conform to prescribed structure, will the majority of users get a chance to discover the benefits of this approach. Maybe in Word 2002?


Jon Udell (http://udell.roninhouse.com/) was BYTE Magazine's executive editor for new media, the architect of the original www.byte.com, and author of BYTE's Web Project column. He's now an independent Web/Internet consultant, and is the author of Practical Internet Groupware, from O'Reilly and Associates. His recent BYTE.com columns are archived at http://www.byte.com/index/threads

This work is licensed under a Creative Commons License.