The Longhorn cover story ran this week. It includes a main story, an interview with Quentin Clark, and an interview with Miguel de Icaza and Brendan Eich. Here are some outtakes from my interview with Quentin Clark, director of program management for WinFS.
On XML datatypes
We see being able to store structured data (like contacts), semi-structured data (XML), and unstructured data. In the case of a Word document, the XML isn't described by the doc type in WinFS, but the WinFS type defines an XML datatype, you can stick in the XML there, and we can reason over that. A JPEG, when you pull off the excess headers, is just a series of 1s and 0s you feed into an algorithm, that will never be structured. But even within a WinFS type, like a doctype, we allow for all three components to be within an instance of that type. A photo is a good example. We have a picture/photo type, things like what camera model, where the picture was taken, plus the unstructured bitstream. With respect to metadata handling -- and property promotion is only part of that -- we talk about picking the author out of a Word doc, putting it into a WinFS property. We also have property demotion, so coming through the APIs you can reprogram the title of some item, and it finds its way back into the filestream.
Just ignore WinFS types for a second, let's say I don't need no stinking types, I'm gonna build my own. You can make it from scalars, the XML datatype, and a binary field. We've defined the Windows doctype to have the ability to have a filestream as well as an XML datatype. That gives you a lot of power. You can walk up to WinFS, create a scope -- all documents, the whole store -- and then issue XPath queries into items that have XML datatypes, then we can go reason over those things.
On sharing
We want to use synch as a way to enable people to share stuff with each other, and to enable offline experiences. It's the Outlook 11 idea that I'm working always locally -- we're bringing that model to all data. You'll have ability to use any scoping mechanisms -- querying, or explicit wrangling where you drop 16 things into a list -- and say, hey, I want to share this with whatever, another machine or with another person.
Relationships that exist within the scoping are no problem, we know how to rehydrate them on the other end. For relationships outside the boundary, there's a couple of different mechanisms. So if you got the document from me, but I didn't give you contacts, then if you do have that contact, Sarah Wiley, on your end, we'll reconstitute the relationship. If it's not a thing we can positively identify, then it would dangle. But after the PDC we changed the data model to make dangling references not really dangle any more. There was a point where we tried to work through the user experience of finding and showing danglers where we realized hmmm, that sucks, people won't know why are they even doing this. Can we change how we store and model the data so that's not an issue? So you'll have a document, and an author, and the system knows nothing about that author because you don't.
On things being in more than one place
Consider three scenarios. First, I want to keep a list of stuff I need to do today -- a piece of mail, the notes preparing me for this call. Let's assume there's no query-based way to do this, it requires explicit wrangling.
Second: in Outlook 11 I have search folders -- stuff from direct staff, stuff where I'm on the To line, the Cc line -- I use these every day as part of my reasoning over my life. A lot of the reasons you want to do things in that query way imply that you don't want to manually intervene. You wouldn't want to inform the system manually about the To line, although you tell the system that my staff is the following people.
The third case is about where you want things to live, physically. Where they are contained. In Outlook, I have a PST and the OST, which is the reflection of my online Exchange mailbox. No big surprise, the 200MB Microsoft gives me is not big enough to contain all my mail, so I have PSTs to keep stuff. That's a containment thing, where do I want it to actually live? I have a removable hard drive at home, at some point I decide this photo will live there.
So those are the three axes. The limitation of Outlook 11 is that it doesn't allow you to put an item in more than one user lassooing. We want to allow multiple lists, or folders, where you can put the same thing in both. We're removing that Outlook limitation.
We encountered significant design challenges around user experience and expectations, and also problems around the DAG (directed acyclic graph). Consider security. I take an item, it lives in a bunch of folders, what is the security on that thing? Folder 2 has it too, then moves to folder 3. All the way back on folder 1, does the owner have any way to know what's happened? Then there's naming. If I have a doc, call it "jon's doc," created in a single folder, then I want to have it appear twice in that same folder, what is it called? If it's in a second folder, and I delete it from folder 1, then at some point I rename folders and put the doc back, calculating namespaces becomes complex.
On the object/relational/XML "trinity"
Why do you need all three? I take it that it's obvious why you need objects: you program to them. The CLR has given us some language independence, and we've done a fairly good job building an object universe better than we had before, we're strong believers in that. As for XML, there's no argument there either. The big thing isn't turning out to be industry schemas, but the fact that you have this self-describing thing, this is what I can learn about it, and I can reason on it in a programmatic way by pulling it up into an object.
Then there's relational, this is harder to describe. I will observe that nobody has built an XML store that has the level of scale, performance, or capabilities of today's relational stores. It's just true that the relational model has a set of design characteristics that give it performance characteristics that are are just inherent. Doing things in the XML store doesn't give you the same benefit -- and that's not even accounting for the fact that there's so much data in relational stores today.
But we're taking the start of the Yukon XML work, and bringing it into WinFS. Our vision is a marriage of these worlds, this is why they did so much work around SQL/CLR in Yukon. And we've done more since then in the WinFS part of the code.
Yukon doesn't reach the end of the journey, WinFS the client does not, but that's where they're headed. In terms of a data platform, that's what we want. This couples with the discussion of structured, unstructured, and semi-structured data. XPath doesn't make sense with the JPEG bitstream. Having that object/relational/XML trinity over a breadth of datatypes, that's the holy grail, that's completeness.
On WinFS benefits
First, having datatypes in Windows, so you can do things like program around a contact. The shared data is a huge benefit, though admittedly it's tricky to get it right. If you are Eudora or Act, how do we make sure you can plug in and own the contacts that are yours, while ensuring that Amazon can still query into a contact and pull out an address?
Second, the Outlook 11 experience of being always local. So an ISV builds a specialized app for architectural firms. Thanks to WinFS synch, the user can go offline and online. Using rules, if meeting notes come in that talk about changes in plans, he gets an action item.
Third, customized experience. If I drill into docs by Jon about WinFS, I can name that query, reuse it, write rules about it.
Fourth, finding things. How many times do you get the call: "I created this doc, help me find it." You'll never get that call again. If their field of view is too broad, they can narrow it easily. Fulltext search is there. Life will become a lot easier for end users. We had to turn off indexing in XP by default, it was too slow and chatty. When you're chasing the truth, it's hard to do. Being the truth is easier to do. When people come in and make changes, we know about it, we're built on a database.
Former URL: http://weblog.infoworld.com/udell/2004/07/20.html#a1044