Mastering XML namespaces

Modular namespaces are a familiar concept in many realms. Area codes disambiguate phone numbers; domain names qualify URLs; package names scope identifiers in programs. Partitioning XML vocabularies in the same way seems like a natural thing to do, and it is. But for a variety of reasons explained in Ronald Bourret's Namespace Myths Exploded -- an essay written way back in 2000 that still resonates today -- XML namespaces cause a lot of confusion.

Recently, for example, I needed to process some RSS 1.0 feeds. An RSS 1.0 feed is actually rooted in the RDF (Resource Description Framework) namespace, though its items live in the RSS 1.0 namespace. Such feeds typically also weave in elements from other namespaces -- for example, Dublin Core metadata. My task was simple: parse the feed, use XPath queries to carve out items, and unpack the elements contained within those items.
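A minimal sketch of that task, assuming Python's standard xml.etree.ElementTree module rather than any of the toolkits discussed in this column. The sample feed, the example.org URLs, and the extract_items helper are invented for illustration. The namespace URIs are the published ones for RDF, RSS 1.0, and Dublin Core.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map used only by the queries below. These prefixes are local
# choices; they do not have to match the prefixes used inside the feed.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rss": "http://purl.org/rss/1.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical RSS 1.0 feed: rooted in RDF, items in the RSS 1.0 namespace
# (declared as the default), with Dublin Core metadata woven in.
FEED = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel rdf:about="http://example.org/feed">
    <title>Example feed</title>
    <link>http://example.org/</link>
    <description>Invented sample data</description>
  </channel>
  <item rdf:about="http://example.org/post/1">
    <title>First post</title>
    <link>http://example.org/post/1</link>
    <dc:creator>A. Writer</dc:creator>
    <dc:date>2005-07-18</dc:date>
  </item>
</rdf:RDF>
"""


def extract_items(feed_xml):
    """Carve out each item and unpack the elements it contains."""
    root = ET.fromstring(feed_xml)
    items = []
    # Every step of the path must be namespace-qualified.
    for item in root.findall("rss:item", NS):
        items.append({
            "about": item.get("{%s}about" % NS["rdf"]),
            "title": item.findtext("rss:title", namespaces=NS),
            "link": item.findtext("rss:link", namespaces=NS),
            "creator": item.findtext("dc:creator", namespaces=NS),
            "date": item.findtext("dc:date", namespaces=NS),
        })
    return items


if __name__ == "__main__":
    for entry in extract_items(FEED):
        print(entry)
```

Note that an unqualified query such as root.findall("item") finds nothing here, because the item elements live in the RSS 1.0 namespace even though they carry no visible prefix in the feed.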

This proved surprisingly hard to do with my regular XML parser and toolkit, libxml2, which deals strictly with namespaces. I then repeated the exercise using three other toolkits -- Python's minidom module, E4X (ECMAScript for XML) implemented using Rhino, and Mark Logic's XQuery-based Content Interaction Server. Each made the task simpler, though some would argue not laudably so in the case of minidom and E4X, neither of which requires namespace prefixes to resolve to Uniform Resource Identifiers. But what's most striking when you point a variety of XML toolkits at documents that use namespaces is how differently each of them approaches the problem.
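To illustrate the difference between a lenient, prefix-based lookup and a strict, URI-based one, here is a small sketch using Python's xml.dom.minidom. The two sample documents and the helper names are invented, and this is only one way to show the contrast, not a reconstruction of the code I actually ran. Both documents contain the same Dublin Core element but bind the namespace to different prefixes.

```python
from xml.dom import minidom

DC_URI = "http://purl.org/dc/elements/1.1/"

# Hypothetical fragments: identical in namespace terms, different prefixes.
DOC_A = (
    '<item xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:creator>A. Writer</dc:creator></item>"
)
DOC_B = (
    '<item xmlns:dublin="http://purl.org/dc/elements/1.1/">'
    "<dublin:creator>A. Writer</dublin:creator></item>"
)


def by_qualified_name(xml_text):
    """Lenient lookup: match the literal tag name, prefix and all."""
    doc = minidom.parseString(xml_text)
    return [n.firstChild.data for n in doc.getElementsByTagName("dc:creator")]


def by_namespace(xml_text):
    """Strict lookup: match the namespace URI plus the local name."""
    doc = minidom.parseString(xml_text)
    nodes = doc.getElementsByTagNameNS(DC_URI, "creator")
    return [n.firstChild.data for n in nodes]


if __name__ == "__main__":
    print(by_qualified_name(DOC_A), by_qualified_name(DOC_B))
    print(by_namespace(DOC_A), by_namespace(DOC_B))
```

The prefix-based lookup is simpler to write, but it finds the element only when the publisher happens to use the prefix you guessed. The URI-based lookup finds it in both documents.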

That's understandable, given that namespaces were always -- and still are -- optional. But thanks to Microsoft and Apple, what was the exception may soon become the rule. That's good news in the long run. We'll increasingly want to mix and remix XML data, and to do so we'll need to master namespaces. In the short run, though, I expect more of the turbulence we ran into this week when Sam Ruby and Mark Pilgrim, co-developers of the RSS/Atom Feed Validator and contributors to the Atom specification, found problems with Apple's specification of an iTunes namespace, and with Apple's -- and other podcast publishers' -- use of that namespace. These folks should have known better. But they weren't the first to be bitten by the quirkiness of XML namespaces, and they won't be the last. [Full story at InfoWorld.com]

Although I did not intend this column as an attack on XML namespaces, some readers took it that way, notably Clemens Vasters who wrote (quoted with permission):

Fact of the matter is that namespaces are required for composing any document that uses elements defined by more than one person, project or company -- furthermore, they are one of the pillars of interoperability and give the necessary clues to understanding someone else's language. They don't carry the semantics, but they can potentially point you at the right dictionary -- and often do.

He also questioned the RSS example I gave in the column, arguing that RSS's use of the default namespace was the root of my problem. In response I pointed out that the RSS flavor in the example I cited was in fact the fully namespace-qualified RSS 1.0. More broadly, I agreed wholeheartedly with his endorsement of XML namespaces.

Here's a concise restatement of the column:

All that said, I want to give Clemens the last word here:

If namespaces confuse anybody then it's likely because people don't read even the simplest of specifications these days and books don't seem to do a good job of summarizing them. That's why we still have a bit to go in our field of engineering to catch up with the folks concerned with hardware. If they neglect physics, people die and nobody blames nature. If we neglect specs, people cry and blame the spec authors.


Former URL: http://weblog.infoworld.com/udell/2005/07/18.html#a1269