A conversation with Jonathan Robie about XQuery

Back in 1998 there was no consensus that anyone would need a full-fledged XML query language. Today, XQuery is being implemented by all the major relational databases, by middleware vendors, in content management systems, and by open source projects. It's even becoming part of the SQL standard. "You've got to consider that success for a language," says Jonathan Robie, one of the prime movers in the development of XQuery.
...
As XML becomes the lingua franca for everything from XHTML Web pages to Word documents, the value of a general-purpose XML query language becomes ever clearer. [Full story in InfoWorld.com]

In his talk at the MySQL Users Conference, Adam Bosworth said of XQuery's long gestation: "Anything that takes four years is probably not worth doing." While I wholeheartedly endorse the alternative vision spelled out in that talk -- which is the best statement of how and why simple item-oriented queries over RSS and Atom data sources will enable the data web to scale out -- I don't buy the critique of XQuery. Lord knows I'm a huge proponent of rapid iteration, but some things do take a while -- SQL, for example.

When I began using XQuery implementations, I found the language to be an efficient, natural, and indeed easy way to process XML. The web is a big place. There's room for smart XQuery processors that do powerful things with complex centralized data, and dumb RSS/Atom processors that do useful things with simple federated data, and I plan on using both technologies side-by-side.

For more background on XQuery's development, here's the full text of my interview with Jonathan Robie:

JU: What are the roots of your XML career?

JR: I started out doing database stuff that was hard to do with database technologies of the time. Back in '85 I was doing GIS stuff involving traffic accident data. Then I moved to Germany and worked on image databases and prepress. When the wall fell I went to work for an object database called POET. I was there from about 1990 to 1996, and because I was the only native English speaker there I moved from development to product and program management.

JU: That's interesting, because I've always felt there was a deep connection between object databases and what a lot of us have been trying to do with XML.

JR: There are some similar goals. I think that object databases had some disadvantages. First, the input problem: you had to write code in C++ or Java in order to load your data into the database, and that was a royal pain. Whereas everybody has floods of XML being shot at them...

JU: ...well, now they do...

JR: Yeah, now. And then on the output side, again you had these rigid objects, it wasn't easy to get stuff out in any format you want.

JU: Although you could argue that in an alternate universe, OQL (object query language) could have succeeded.

JR: OQL is a nice language. The grammar of XQuery has benefitted a lot from it, and some of the implementation has too. Of course, it wasn't commercially relevant for the object database vendors to implement OQL in any kind of complete way.

JU: Because?

JR: Every object database had a different market, and a different model, and people weren't looking for interoperability between ObjectStore and POET, because why would you use them for the same thing?

JU: So at what point did you realize that what you were really after was a general query language for XML?

JR: I started working on an SGML repository for POET. FA Davis, a medical publisher, called us, and when we started talking about it, the notion of using SGML as a database format made all kinds of sense to me, because I'd been struggling all my life trying to get data into my database so I could query it.

Now it turns out that Frank Tompa, maybe in 1988 [ed: 1987], wrote a paper called Mind your grammar that I was totally ignorant of, but I spent the next two years reinventing concepts he had put into that paper, which I never discovered until 1998.

JU: What were those concepts?

JR: He was talking about XML as an abstract datatype, and the operators on it, and being able to use the schema as the schema for a query language.

So by the time the product came out, it was an SGML/XML repository -- I think it was sold as the POET CMS.

JU: So you could use SGML as the input format to the object database, and then you needed a way to query it?

JR: Actually it was stored as a persistent DOM. You used OQL both to create it and to query it. Doing that is a lot like data dictionary-level programming, everything's always one level removed.

JU: Huh?

JR: In XQuery or XPath there's one primitive that means "elements named this" whereas in OQL you were looking for elements where one of the properties was the name, and the name was this.

JU: OK.

JR: Another thing was that because object databases are networks, not hierarchies, you couldn't say "this contains that somewhere in the hierarchy."

JU: So no guaranteed closure on a double slash.

JR: Right.
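To make that contrast concrete, here is a minimal sketch in Python rather than in XQuery or OQL. The sample document and the `walk` helper are hypothetical, and the standard library's ElementTree supports only a subset of XPath (it spells the descendant step `.//`), so treat this as an illustration of the idea, not of either language as discussed in the interview.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document with <title> elements at three depths.
DOC = ET.fromstring(
    "<book><title>Data on the Web</title>"
    "<chapter><title>Intro</title>"
    "<section><title>Scope</title></section></chapter></book>"
)

# Path style: a single primitive means "elements named title", and the
# './/' step finds them anywhere below the starting element.
by_path = [t.text for t in DOC.findall(".//title")]


def walk(node):
    """Yield node and all of its descendants in document order."""
    yield node
    for child in node:
        yield from walk(child)


# Generic-object style: visit every node and compare a name property,
# which is one level removed from the thing you actually care about.
by_property = [n.text for n in walk(DOC) if n.tag == "title"]

print(by_path)
print(by_property)
```

Both lists hold the same titles in document order. The path version gets there because the hierarchy is built into the data model, which is the guarantee a general object network cannot give.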

JU: When did the notion of general purpose query for XML really start to take hold?

JR: At the WWW conference that year in Brisbane, there was a workshop on query languages for XML. This was the result of our [ed: Lapp's and Robie's] effort to push XQL [XML Query Language, an XPath precursor referenced in the XPath specification]. The conclusion was that querying XML was different in different environments, so what was really needed was different little languages for each context. I vehemently disagreed but I didn't have anybody on my side [laughs].

Later in 1998, XML-QL came along. Now, XQL had basically said, let's take a full fidelity look at the XML. It was a filtering language, it didn't allow you to reshape things, it didn't have a SELECT FROM WHERE. XML-QL, on the other hand, was lower-fidelity with respect to the XML...

JU: ...meaning...

JR: ...well, it didn't pay attention to document order, it didn't understand the texture of XML as well. Coming from an academic database background, they managed to provoke an XML query workshop. There were, I believe, 66 papers presented there, and they went in a bunch of different directions. A lot of people thought the real issue was fulltext. XML is all about documents, fulltext search is where it's at. There was another group that thought it was all about semantic query because, after all, XML isn't really semantic, so what you need is some kind of RDF query language to get at the real information in the document.

JU: And then there was the SQL-oriented group.

JR: Yes, the XML-QL folks. We didn't really understand each other very well to start with, but Mary Fernandez is a very good networker, a very good listener, and Mary and I started talking very deeply at that point. I think that was pretty important to what finally came out. There was a lot of stuff I was ignorant about, and I learned a lot.

JU: What were the key takeaways?

JR: First, that we simply had many concepts of what a query language should do. What wound up being one of the most influential papers was one by Dave Maier called Database Desiderata for an XML Query Language. It basically said we need the things that were in XQL, and the things that were in XML-QL.

JU: It's interesting how XPath has had such a productive life as a sublanguage, in the same way that regular expressions are a sublanguage now available in so many contexts. And just as many people still don't fully appreciate what regex can do, many also don't fully appreciate what you can do by just running XPath queries over documents.

JR: I think that's true.

There are really three fundamental capabilities of XQuery that are XML-specific. First, the ability to find anything in an XML document. Second, the ability to create any XML structure. And third is the thing that FLWOR expressions are used for, which is the ability to correlate things to create new results. XPath is, to my mind, about a third of that power.
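As a rough illustration of those three capabilities working together, here is a sketch in Python, not XQuery. The two sample documents and the `rated-books` and `rated` element names are invented for the example. The nested loops only imitate the shape of a for/where/return query; a real XQuery processor would express this declaratively and could optimize it.

```python
import xml.etree.ElementTree as ET

# Two hypothetical input documents that share book titles.
books = ET.fromstring(
    "<books>"
    "<book><title>XML Basics</title><price>30</price></book>"
    "<book><title>Query Design</title><price>45</price></book>"
    "</books>"
)
reviews = ET.fromstring(
    "<reviews>"
    "<review><title>Query Design</title><rating>5</rating></review>"
    "<review><title>XML Basics</title><rating>3</rating></review>"
    "</reviews>"
)

# Create: build a brand-new XML structure for the result.
result = ET.Element("rated-books")

# Find: path expressions select the books and the reviews.
for b in books.findall("book"):
    for r in reviews.findall("review"):
        # Correlate: keep only the pairs whose titles match.
        if b.findtext("title") == r.findtext("title"):
            item = ET.SubElement(result, "rated", title=b.findtext("title"))
            item.text = r.findtext("rating")

output = ET.tostring(result, encoding="unicode")
print(output)
```

The result follows the order of the books document, not the reviews document, because the outer loop drives the iteration.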

JU: The third that's finding stuff...

JR: Yes.

JU: And in combination with XSLT, maybe two-thirds.

JR: Yes. The problems there are that, first, nobody's figured out how to optimize it for large datastores, and second, that a lot of people just can't get over the syntax.

Now when you talked about XPath as a sublanguage, XML syntax is also a sublanguage that's used in various places: scripting languages, XSLT, XQuery. So if you take XPath, which is like 80% of the grammar of XQuery, and then you take the syntax of XML, what's left beyond these two sublanguages is not that big. And it's a very loosely coupled syntax. We followed the loose syntactic framework of OQL, that was Daniela Florescu's contribution.

JU: Where are we today, in terms of finding common ground among the various approaches to querying XML?

JR: What happened was that we came up with a set of use cases. We had seven different languages all vying to become the query language. Once we agreed on the use cases, we were able to take the seven languages and see which they could solve and which they couldn't. None of them could solve them all, and that's where Quilt came from. The name Quilt meant we were taking snippets from different languages and sewing them together and imposing some kind of design on the result.

JU: What was Quilt, exactly?

JR: A language designed to solve the XQuery use cases, and to be a starting point for XQuery. Don Chamberlin and Daniela Florescu and I worked on it, in 2000. We were competing against a SQL variant, and that fall the working group decided to take Quilt as the basis.

JU: How do you feel the use cases have held up over time?

JR: Pretty well, I think. They've done a good job of explaining to people what XQuery is, and they're often the first thing people read.

JU: Including me. I wish more things were organized that way, because it's absolutely clear that people learn best by example.

JR: Also when things are new, you want to be able to show people concretely what problem you're solving, so they can tell you whether that problem needs to be solved. I would never do anything of this size without doing use cases. When you write down requirements, they're so abstract that you can get it wrong and just never notice. The use cases wind up being the requirements from the perspective of the user.

At the same time, in the W3C, there were activities going on that I could look at and have no idea, concretely, what was happening there. I wish there were more use cases for the semantic web, for instance.

JU: It seems that in the last six or eight months, XQuery implementations are really starting to hit their stride.

JR: And we're getting interesting implementations in very different environments. If you think about the environment for Mark Logic, and for our product [DataDirect's ], and for Saxon, the implementations have almost nothing in common. And of course Oracle and DB2 and SQL Server are all adding XQuery. And in fact, did you know that XQuery is now in the SQL Standard? You've got to consider that success for a language!

JU: Really? How does that work?

JR: There's a pseudo-function that calls an XQuery, I believe, but you should really check the facts on that. [Ed: See Jim Melton's Advancements in SQL/XML]

JU: So where do we stand today?

JR: If every major relational database is implementing it, plus middleware vendors including BEA and DataDirect, plus fulltext vendors like Mark Logic -- we've certainly got mindshare.

Even so, at this point I think that most people don't know about it, except the ones who already care a lot about XML and databases. Now, we know that it's useful for all kinds of things, whether you're creating HTML pages or Web messages. Because Web messages are complex hierarchies that don't look like dumps of two-dimensional tables, and because nobody in their right mind would put a bunch of tables up on a website and ask somebody to join them in their brain, we have these complex hierarchical XML reporting needs.

JU: And we have all kinds of transformational needs, if you buy the big vision of Web services as fundamentally about intermediation -- adding value between services, and using XML transformation to do that.

From my perspective, this is almost a generational thing. I mean, how long did it take SQL to really sink in? Well, forever. Then, last year, an Oracle VP told me he can imagine a time when the primary abstraction for dealing with data will be XML. I thought that was an astonishing statement, but it also made me realize just how long this transition will take.

On the other hand, because it is componentized, there are a whole bunch of people who have experience with the sublanguage, with XPath, in a variety of contexts, and from that perspective when they do get to XQuery they already know a whole lot. Maybe they don't even know that they know it, but they do, and when they get there, it'll seem very familiar.

JR: Exactly.

JU: So what's the current status of the spec?

JR: We hope soon to go to candidate recommendation status. Optimistically, we should be done in the first half of next year.

By the way, I want to make real clear that with all these things, I've been just one of the people working on them, and there were others besides the people I've mentioned here.

JU: Thanks!

Former URL: http://weblog.infoworld.com/udell/2005/08/03.html#a1281