A conversation with Michael Rys about SQL Server "Yukon"

The other day I had a demo/discussion with Michael Rys on the XML features in SQL Server 2005, aka Yukon. I meant to capture the demo and use the highlights in a screencast, but I was late to the call and didn't manage to set up the capture. As an aside, the conferencing service we used provides audio retrieval (playback only, no download), but no video. Offering video playback of screen activity, ideally with a download option, should probably be a checklist feature for conferencing services.

In any event, here are some of the key points I took away from the discussion:

on schema-governed shredding: Yukon can associate XML schemas with its XML datatype, and will exploit the schema information to optimize query execution. But it does not use schemas to control the automatic shredding of XML content into relational tables, a la Oracle. Rys sees two problems with that approach. First, it freezes your data and associated schema into structures that can't easily evolve. Second, it produces structures that are in principle logical, but are in practice physical and therefore exposed to tampering. Rys says he thinks Oracle has acknowledged and is addressing these issues; it's a topic I'd like to learn more about.
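
To make that concrete, here's a rough T-SQL sketch of a schema-governed XML column; the schema, table, and column names are my own illustration, not anything from the demo:

    -- Register an XML schema collection (a deliberately tiny example schema)
    CREATE XML SCHEMA COLLECTION InvoiceSchemas AS N'
    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <xsd:element name="invoice">
        <xsd:complexType>
          <xsd:sequence>
            <xsd:element name="customer" type="xsd:string"/>
            <xsd:element name="total"    type="xsd:decimal"/>
          </xsd:sequence>
        </xsd:complexType>
      </xsd:element>
    </xsd:schema>';
    GO

    -- The column is typed against the collection, and the engine can exploit
    -- that type information -- but the XML stays XML; it is not shredded
    -- into separate relational tables.
    CREATE TABLE dbo.Invoices (
        InvoiceID  int IDENTITY PRIMARY KEY,
        InvoiceDoc xml(InvoiceSchemas) NOT NULL
    );
    GO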

on collections: Yukon's XQuery implementation is tied to SQL queries and scoped to a single column containing instances of the XML datatype. So if each row in that column holds an XML document, an XML query will return one or more of those documents. But if you want to query across multiple XML-typed columns, you will -- for now -- have to aggregate them first using the FOR XML mechanism.
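
A minimal sketch of that scoping, continuing the hypothetical invoice table from above:

    -- An XQuery runs against one XML column, row by row
    SELECT InvoiceID,
           InvoiceDoc.query('/invoice/customer') AS Customer
    FROM dbo.Invoices;

    -- To query across several columns (or rows), assemble them into a
    -- single XML value first; the TYPE directive makes the FOR XML result
    -- an xml instance that can itself be queried
    SELECT InvoiceID, InvoiceDoc
    FROM dbo.Invoices
    FOR XML PATH('row'), ROOT('invoices'), TYPE;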

on static typing: The XQuery implementation is statically typed, which means that explicit typecasting will sometimes be required. Performance is one obvious reason for static typing; error detection is another. The example given was an XPath query with a mistyped element name. In many implementations such a query returns an empty result; here it returned an error, because no element of that name was defined in the schema.
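
Continuing the same hypothetical example, the pattern looks roughly like this:

    -- value() wants a path that is statically known to be a singleton --
    -- hence the [1] -- and an explicit SQL type for the result
    SELECT InvoiceDoc.value('(/invoice/total)[1]', 'decimal(10,2)') AS Total
    FROM dbo.Invoices;

    -- Against the typed column, a misspelled element name such as
    -- /invoice/totall is rejected at compile time rather than
    -- silently returning an empty result.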

on updating: Since the advent of XML databases, vendors have wrestled with how far to push update capability in the absence of a standard. With Yukon, Rys says, the idea was to do the basics, not get too far ahead of the still-emerging update language, and "back it up on the physical level so you get partial update behavior" -- i.e., not rewriting whole documents when changes occur. As with query, there is no schema requirement for update. However, if a schema is present, an update statement must produce a valid result.
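
For the flavor of it, here's a rough sketch of a basic update against the hypothetical typed column above, using the XML datatype's modify() method; the specifics are mine, not from the conversation:

    -- XML DML supports insert, delete, and "replace value of"
    UPDATE dbo.Invoices
    SET InvoiceDoc.modify('
          replace value of (/invoice/total)[1] with 99.50')
    WHERE InvoiceID = 1;
    -- Because the column is typed against InvoiceSchemas, the updated
    -- document must still validate against that schema.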

on concurrency: In the first iteration of Yukon, concurrency, access control, and logging are all done at the relational level. That means you will be locking by row, not by XML node. Rys would like to bring this stuff down to the node level, as Oracle (thanks to its automatic shredding) already does, but there's no commitment to that yet.

on fulltext search: Yukon uses an improved version of the SQL Server 2000 fulltext search engine. If you define a fulltext index on an XML column, you can run fulltext search on the documents stored in it. However, the search isn't yet structurally aware. The results are documents, not XML subtrees, so a typical strategy will be to use a fulltext search that yields a subset of documents in combination with an XPath query that isolates fragments. (Rys points out that the optimizer will figure out in which order to perform these steps.) Why no contextualized search? The fulltext aspects of XQuery aren't baked yet, Rys says. And SQL Server customers today are mainly focused on relational queries that may also pull additional information from documents. When the game is pure content storage and retrieval, it becomes more important to lead the standard, as MarkLogic has done.
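
A sketch of that combination, again using the hypothetical invoice table, and assuming a fulltext index has been defined on the XML column:

    -- CONTAINS narrows the set to matching documents; exist() and query()
    -- then isolate the structure of interest within them
    SELECT InvoiceID,
           InvoiceDoc.query('/invoice/customer') AS Customer
    FROM dbo.Invoices
    WHERE CONTAINS(InvoiceDoc, 'acme')
      AND InvoiceDoc.exist('/invoice[total > 1000]') = 1;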

on uses of XML: The bread-and-butter use cases remain what they have been: acquiring relational data from XML documents, transporting relational data as XML, and publishing relational data as XML. But Rys highlighted two emerging use cases that say interesting things about the way in which XML is bringing more fluidity to data management. The first example is what I'll call "expanding entities" -- that is, entities that are primarily modeled in SQL, but that have fluctuating sets of properties which are more conveniently handled in XML. The second example, Rys says, was a surprise to him. People are serializing CLR object data and storing it in XML columns. Given that Yukon directly supports CLR objects, why do that? Two reasons. First, natively stored CLR objects are limited to 8K -- that's the boundary, in the database, between normal and large objects. But perhaps more importantly, you have to register your .NET assemblies in the database; it's hard to evolve your objects there; and you can't mix types in a single column. With XML, Rys says, you might register an initial schema, but you can then add more schemas over time, and you can mix document types in a single column.
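
Here's a rough sketch of the "expanding entities" pattern, with table, column, and data names of my own invention:

    -- Core, stable attributes stay relational; the fluctuating property set
    -- rides along in an XML column (untyped here, though it could be typed
    -- against one or more schemas added over time)
    CREATE TABLE dbo.Products (
        ProductID  int IDENTITY PRIMARY KEY,
        Name       nvarchar(100) NOT NULL,
        ListPrice  money NOT NULL,
        Properties xml NULL
    );

    INSERT INTO dbo.Products (Name, ListPrice, Properties)
    VALUES (N'Road Frame', 350.00,
            N'<props><color>red</color><sizeCm>58</sizeCm></props>');

    -- Relational and XML predicates combine in a single query
    SELECT ProductID, Name
    FROM dbo.Products
    WHERE ListPrice < 500
      AND Properties.exist('/props[color = "red"]') = 1;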

I find this last point -- on XML's role in the evolution of data -- to be especially interesting. Content management and data management have been converging for some time, but there hasn't been common ground on which practitioners of these disciplines could meet. Now that enterprise databases are providing that common ground, we should start seeing all kinds of fruitful collaborations.


Former URL: http://weblog.infoworld.com/udell/2005/04/26.html#a1222