Tangled in the Threads

Jon Udell, May 9, 2001

Document Engineering

Jon finds he is not alone in one of his obsessions

Documents hold most of the world's data. We need to engineer them accordingly, and the disciplines to do that are emerging.

Last week, Raymond Yee mentioned in my newsgroup that there's a name for one of my enduring interests: document engineering. He also pointed to the website for an upcoming symposium:

ACM Symposium on Document Engineering 2001

Call For Papers
Atlanta, Georgia, USA November 9-10, 2001
in cooperation with ACM SIGCHI and ACM SIGWEB

Computer-based systems for creating, distributing and analysing
documents are one of the centerpieces of the new "Information
Society."  Documents are no longer static, physical entities.  New
document technology allows us to create globally interconnected
systems that store information drawn from many media and deliver that
information as active documents that adapt to the needs of their
users.  Furthermore, document technologies like XML are having a
profound impact on data modeling in general because of the way they
bridge and integrate a variety of paradigms (database,
object-oriented, and structured document).

Document engineering is an emerging discipline within computer
science that investigates systems for documents in any form and in
all media.  Like software engineering, document engineering is
concerned with principles, tools and processes that improve our
ability to create, manage and maintain documents.

Glad to hear it! Documents, after all, contain most of the data that we create, exchange, and consume. They are the currency of the networked, knowledge-based economy. They're as common as dirt, but scratch the surface and you'll find deep mysteries lurking within.

Consider the humble text file. We like to imagine that it's a stream of characters (bytes), delimited by separators (newlines). But of course the characters and the separators can be represented by a variety of single- or multi-byte sequences. When we happen to collaborate with others who use the same computer systems, and the same encodings, we tend to wish these differences away. But every now and again, we're reminded that things aren't so simple. It happened to me twice last week.

First, I ran into a separator problem. I take care of a magazine website that occasionally publishes code listings. Like all magazines, this one is produced by designers who run Quark XPress on the Mac. Normally, listings come from the original authors, but last week I received one that had been processed by the designers. So of course, it all ran together for me until I converted the Mac's CR separators into Unix-style (and Windows-compatible) LF separators.
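
Here's a minimal sketch of that kind of conversion in Python (the file names are just illustrations):

  # turn old-Mac CR separators into Unix-style LF separators
  # (file names are hypothetical)
  with open("listing-mac.txt", "rb") as infile:
      data = infile.read()
  with open("listing-unix.txt", "wb") as outfile:
      outfile.write(data.replace(b"\r", b"\n"))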

Character encoding issues

Next, I ran into an encoding problem when a subscription website I work on refused to admit a new subscriber. It turned out that this was the first subscriber whose name contained a character not representable in 7-bit ASCII. The character is the one which I can type in my emacs text editor (using its insert-ascii function) as the integer 232 (hex E8), thusly: è. What you will see, in your browser, depends on the encoding that it's using. For many of us, that encoding will be ISO-8859-1, and you will see the character whose Unicode number is 00E8, and whose name is LATIN SMALL LETTER E WITH GRAVE.

But if your encoding is set to ISO-8859-2, you will instead see the character whose Unicode number is 010D, and whose name is LATIN SMALL LETTER C WITH CARON.
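
A couple of lines of Python make the ambiguity concrete: the same 0xE8 byte decodes to a different character depending on which ISO-8859 table you apply.

  # one byte, two readings
  b = bytes([0xE8])
  print(b.decode("iso-8859-1"))   # è  (U+00E8, LATIN SMALL LETTER E WITH GRAVE)
  print(b.decode("iso-8859-2"))   # č  (U+010D, LATIN SMALL LETTER C WITH CARON)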

The web form that accepted this 0xE8 character relayed it to a backend business system which happily stored it. But that backend system also communicated the character, by way of XML-RPC, to another system. And that system -- specifically, its XML parser -- choked on the character. It did so because the parser, MSXML, defaults to UTF-8. This, by the way, is one of those infuriating industry acronyms that is often used but rarely spelled out, and which must also be recursively expanded. Thus, UTF-8 stands for UCS Transformation Format 8, and UCS in turn stands for Universal Multiple-Octet Coded Character Set.

I found an excellent description of the properties of UTF-8 in the UTF-8 and Unicode FAQ for Unix/Linux:

  • UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
  • All UCS characters >U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no ASCII byte (0x00-0x7F) can appear as part of any other character.
  • The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD and it indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 0x80 to 0xBF. This allows easy resynchronization and makes the encoding stateless and robust against missing bytes.
  • All possible 2^31 UCS codes can be encoded.
  • UTF-8 encoded characters may theoretically be up to six bytes long, however 16-bit BMP characters are only up to three bytes long.
  • The sorting order of Bigendian UCS-4 byte strings is preserved.
  • The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.

So, MSXML saw the 0xE8 as the first byte of a multibyte sequence, expected the next byte in the sequence to be in the range 0x80 to 0xBF, and choked when that wasn't the case.
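
The failure is easy to reproduce in Python (the name here is made up); a decoder that assumes UTF-8 trips over the byte for the same reason MSXML did:

  # 0xE8 is a complete character in ISO-8859-1, but in UTF-8 it announces a
  # multibyte sequence whose next byte must fall in the range 0x80-0xBF
  name_bytes = b"Jo\xe8l"                  # hypothetical subscriber name
  print(name_bytes.decode("iso-8859-1"))   # 'Joèl'
  try:
      name_bytes.decode("utf-8")
  except UnicodeDecodeError as err:
      print(err)                           # invalid continuation byte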

There are a couple of ways to "fix" this problem. One would be to use only 7-bit ASCII characters, which mean the same thing in UTF-8. To do this, you'd represent 0xE8 as the entity &#232;. Note that to ensure you see what I intend in the last sentence, I actually had to write &amp;#232; -- that is, "escape" the ampersand. Then, of course, to show you how I did that, I actually had to write &amp;amp;#232;. And to show you how I did that, I had to write...oh, never mind, just view the source in your browser.
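
In Python, the 7-bit-only approach amounts to something like this (the value is hypothetical):

  # rewrite anything outside 7-bit ASCII as a numeric character reference
  name = "Jo\u00e8l"                                      # hypothetical value containing U+00E8
  ascii_only = name.encode("ascii", "xmlcharrefreplace")
  print(ascii_only)                                       # b'Jo&#232;l'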

Another "fix," which we in fact adopted, was to tell the XML-RPC module that sent the packet of data containing the 00E8 character to use the ISO-8859-1 encoding. That meant that instead of beginning like this:

<?xml version="1.0"?>

it instead begins like this:

<?xml version="1.0" encoding="ISO-8859-1"?>

As my quotation marks around "fix" suggest, neither of these approaches is a complete solution. When I mentioned this issue in the newsgroup, a longtime correspondent -- whose name includes a Unicode 00E1, LATIN SMALL LETTER A WITH ACUTE -- responded:

Ricardo L. A. Bánffy:

You have to use an encoding that suits the parser that will read the XML *AND* preserves the information you are trying to convey

If I am writing XML that will be read by a UTF-only parser, I will have no choice but to encode it so, even if I (as a Portuguese-speaking Brazilian) somewhat prefer ISO-8859-1 or Windows-1252 to anything else (and wonder why ASCII had to be 7 bits wide). That's sad, but true - XML is not that much portable.

There are parsers that don't care about encoding, but won't be able to change the encoding. If they get UTF-8 they will spit UTF-8 out.

I wouldn't encourage using numeric entities as they depend on the encoding (a "Ç" may have the same numeric code under ISO-8859-1 as "¥" under ISO-8859-12 - of course, I made this up) and this information is sure to be lost somewhere. I find "&Ccedil;" preserves the meaning much better. However, it is useless if you intend to put it inside an e-mail message or print it on a POS printer or sort it on a relational database.

Welcome to the wonders of XML.

Of course it wasn't really XML's charter to solve this problem. That's what Unicode is for, and XML can do no better than to follow the evolving Unicode saga.

Document namespace issues

It's hard enough to figure out how to represent and exchange characters. So it's not surprising that we're also facing tough issues when we string characters together. Lines (in text editors) and paragraphs (in word processors) are the primitives we all know and use. But in an increasingly hypertextual world, we want to be able to name, refer to, and even version these things.

Raymond Yee:

I've been wondering for a while whether there are any generalizations of this concept: what I'd really be interested in is an operating system in which every document (and parts of documents) can be addressed. Kinda like URLs for everything on a machine. I've been wanting a way to refer to anything on my own machine (whether it's a cell in an Excel spreadsheet, a specific entry in my BibTeX database, a specific bookmark in a PDF file, any part of an HTML document -- whether or not an anchor has been tacked onto it.)

What systems are available to provide such fine-grained naming of documents and their parts?

I responded with a few examples I'm aware of. In Zope, when you parse an XML document into the object database, every single element is URL-addressable. This is also true in eXcelon.

When you fully generalize this, you end up with Xanadu -- a non-erasable storage system that remembers, and versions, everything. But a practical UI for dealing with such a thing seems almost impossibly elusive. In practice, I'd happily settle for the kind of granularity that gets you, in the case of documents, things like tables, paragraphs, subheads, and links -- the major features of the landscape -- but not every table cell, or word.

It would be very helpful for these features to carry natural names -- e.g. a leading fragment of the paragraph, or the text of a title, or the label of a link -- rather than a parser-generated name like Zope's http://my.zope/doc/memo/e3454.

It's hard to overstate the importance, and the difficulty, of naming. When I write at any length, nowadays, I tend to write in XHTML. And I tend to create an invisible namespace, within the doc, that supports references to chunks from both inside and outside the doc. An authoring tool that prompted with candidate names for these chunks, while allowing me to override if needed, might be very useful.

Alan Shutko:

Correction: _is_ very useful.

AUC-TeX in Emacs has this for TeX and BibTeX files. I haven't seen anything of that level for XML yet, but I'll agree that it's extremely useful.

Whenever you create a section/subsection/etc in AUC-TeX, it will generate a label of the form sec:title_here. Same thing for figures (fig:) and tables (tbl:). You use the labels whenever you need to refer to a table/figure/section in the text. In print, it'll give the table/figure/section number. In PDF/HTML, it'll give a link.

When editing bibliographic citations, it'll create a default label based on the author and title, used when you cite a work in the text.

When you need to use the reference, Emacs also has tools which will show the structure of the document to pick the reference (showing all the tables in the document, according to which section they're in).

The same thing could be done for XHTML or other XML DTDs, of course. I don't know of any cases where it has, though.
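
As a rough sketch of what that might look like for XHTML (this describes no existing tool), the label-proposing idea could be as simple as deriving a candidate name from a chunk's own text, which the author could accept or override:

  import re

  # propose an AUC-TeX-style label (sec:, fig:, tbl:) from a chunk's text
  def propose_label(kind, text):
      slug = re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")
      return "%s:%s" % (kind, slug)

  print(propose_label("sec", "Document namespace issues"))
  # -> 'sec:document_namespace_issues'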

Raymond then referred us to the work of Tom Phelps, and in particular to his notion of robust hyperlinks (that's a trick link, by the way). In a paper on the subject, Phelps and Robert Wilensky propose augmenting URLs with lexical signatures. These are short lists of words, mechanically extracted from a page, which can then be used with existing search engine infrastructure to pinpoint the page no matter what its current address. In another paper on robust locations, these authors outline a strategy for defining document locations using parallel methods: unique ID, path, context.
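
A toy sketch in the spirit of Phelps and Wilensky, though not their actual code, would favor terms that are frequent in the page but rare on the web at large (the document frequencies would come from a search engine; here they're a placeholder argument):

  from collections import Counter

  # pick a handful of terms that are common in this page but rare elsewhere
  def lexical_signature(page_text, doc_freq, n_terms=5):
      words = [w.strip(".,;:()\"'").lower() for w in page_text.split()]
      tf = Counter(w for w in words if w.isalpha())
      def rarity_weighted(w):
          return tf[w] / float(doc_freq.get(w, 1))
      return sorted(tf, key=rarity_weighted, reverse=True)[:n_terms]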

Fascinating stuff! We should, of course, be concerned about the expressive qualities of the names of these locations. When I raised that issue, Peter Thoeny came up with a great example:

Peter Thoeny:

I strongly agree. Short and descriptive URLs are easy to type, don't wrap around in an e-mail, *and* are easy to parse by a script. An example gone bad is Whatis.com. They used to have descriptive URLs, like <http://whatis.com/term/XML.html> (I made up this example, can't remember the URL). Now they have <http://whatis.techtarget.com/definition/0,289893,sid9_gci213404,00.html> which is totally unusable to human minds.

A great point. I've always believed that a document is an engineered artifact, and that part of the engineering job is to ensure usability. External and internal namespaces are, in fact, the primary APIs of documents. They need to be designed so that people and machines can effectively make use of them.

Maybe I should attend that ACM symposium in November. It would be fun to meet a bunch of folks who, like me, can't help but think about these kinds of things.


Jon Udell (http://udell.roninhouse.com/) was BYTE Magazine's executive editor for new media, the architect of the original www.byte.com, and author of BYTE's Web Project column. He is the author of Practical Internet Groupware, from O'Reilly and Associates. Jon now works as an independent Web/Internet consultant. His recent BYTE.com columns are archived at http://www.byte.com/index/threads

This work is licensed under a Creative Commons License.