Questions about Longhorn, part 2: WinFS and semantics

In the first installment of this series of questions about Longhorn, I concluded that the compelling benefit of WinFS must lie in the realm of "organizing stuff" rather than just "finding stuff" -- else why not just leverage existing and well-understood relational, free-text, and XML search methods? And I posited that the signature feature of WinFS -- "relationships" -- must be powerful enough to justify the creation of a proprietary new storage model that will enable (but also require) new applications and developer skills. Admittedly my "finding versus organizing" distinction was a bit of a cheat, since finding depends sensitively on prior organization. Except when it doesn't: brute-force free-text search routinely trumps navigation and structured search. But OK, we've all got to hope that better organization, someday, will level the playing field.

Today's personal information systems are organized hierarchically. WinFS proposes that they be organized semantically. A number of observers have noted a family resemblance between RDF (Resource Description Framework) "triples" and WinFS relationships. An RDF triple, in geek-speak, is a subject-predicate-object relation. Sets of RDF triples can be (and Semantic Web people say must be) used to represent and organize knowledge. Microsoft blogger Joshua Allen explicitly connects the dots between RDF/SemWeb and WinFS:

WinFS is going to enable numerous application scenarios that simply are not practical to implement with today's technology. WinFS is not based on RDF, of course, but they both share similar data models. And, while the scope of WinFS is local and "Semantic Web" is global, the scenarios are not that different. When you start to imagine what it would be like to extend WinFS stores to publish and synchronize data with one another, or alternately imagine a "personal semantic web," you can begin to see that the visions have some serious overlap. [Joshua Allen]

Although this stuff can get dangerously abstract, it's easy to state the practical benefit. If my personal information store contains items of types Person, Organization, Project, and Document, and if it knows about relationship types like Employment and Authorship, then I can easily answer questions like "Which Project X documents were written by Doug?" or "Which Project Y documents were written by employees of organization Z?"
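To make that concrete, here is a minimal sketch of those two questions asked against a handful of subject-predicate-object triples. It uses neither RDF tooling nor the WinFS API. The item names (`spec.doc`, `Ann`, `Organization W`), the predicate names (`type`, `partOf`, `authoredBy`, `employedBy`), and the helper functions are all hypothetical, invented purely for illustration.

```python
# A toy triple store: each entry is (subject, predicate, object).
# All item and predicate names below are made up for this example.
triples = {
    ("spec.doc", "type", "Document"),
    ("spec.doc", "partOf", "Project X"),
    ("spec.doc", "authoredBy", "Doug"),
    ("plan.doc", "type", "Document"),
    ("plan.doc", "partOf", "Project Y"),
    ("plan.doc", "authoredBy", "Doug"),
    ("budget.doc", "type", "Document"),
    ("budget.doc", "partOf", "Project Y"),
    ("budget.doc", "authoredBy", "Ann"),
    ("Doug", "employedBy", "Organization Z"),
    ("Ann", "employedBy", "Organization W"),
}


def subjects(predicate, obj):
    """Return every subject that has the given predicate pointing at obj."""
    return {s for s, p, o in triples if p == predicate and o == obj}


def documents_by_author(project, author):
    """Which documents in `project` were written by `author`?"""
    return (
        subjects("type", "Document")
        & subjects("partOf", project)
        & subjects("authoredBy", author)
    )


def documents_by_employer(project, organization):
    """Which documents in `project` were written by employees of `organization`?"""
    employees = subjects("employedBy", organization)
    project_docs = subjects("type", "Document") & subjects("partOf", project)
    return {
        doc
        for doc in project_docs
        if any((doc, "authoredBy", person) in triples for person in employees)
    }


print(documents_by_author("Project X", "Doug"))
print(documents_by_employer("Project Y", "Organization Z"))
```

The second query is the interesting one: it hops across two relationship types (Authorship, then Employment), which is exactly the kind of question a folder hierarchy can't answer without somebody having filed things that way in advance.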

Not everybody buys into the triples-oriented data model. Among the skeptics is another Microsoft blogger, Dare Obasanjo, who writes:

It seems that the point being argued is that with RDF you can get more understanding of the information in the document than with just XML. Being that one could consider RDF as just a logical model layered on top of an XML document (e.g. RDF/XML) I find it hard to understand how viewing some XML document through RDF colored glasses buys one so much more understanding of the data. [Dare Obasanjo]

Dare aims this critique at RDF/SemWeb, not WinFS, but I'll take the liberty of extending it to both. And I'll argue that in theory, an information system based on explicit knowledge representation -- using triples, or relationships, or whatever flavor of item-linking you prefer -- is way more powerful than a system in which the same knowledge is available only implicitly. But in practice, I wonder if anybody, whether it's Tim Berners-Lee or the Longhorn architects, can mandate such an approach given the chaotic messiness of reality. My favorite Joshua Allen quote, for example, is this one -- which I also used in my XML 2003 keynote:

The lesson, of course, is that real-world information is chaotic. In any but the smallest "proof of concept" systems, the best that one can hope for is to be able to recognize small pockets of structure within a sea of otherwise unstructured information. [Joshua Allen]

Maybe it depends how you construe "small pockets of structure." I've been getting decent mileage using nothing fancier than unschematized XML fragments. Microsoft, meanwhile, has taken a great leap forward in Office 2003 with support for schematized XML documents. The first glimmer of that XML support came almost two years ago, and it shipped last fall. If asked to paraphrase the Office XML strategy then, I'd have put it this way:

Let's get schematized information out into the open, where any XML-aware tool can see it and touch it and work with it -- locally and globally, on Windows or any platform -- and then let's see what happens. If we play our cards right we'll broadly legitimize schematization, and we'll be able to use Windows to layer semantic value on top of it.

If asked to paraphrase the WinFS strategy now, I'd put it this way:

Let's put schematized information into Windows, where any CLR-aware Windows application can see it and touch it and work with it.

The first strategy envisions a plurality of schemas arising from the grassroots. You won't often hear support for this strategy from Microsoft, but I heard it last fall at the Enterprise Architect Summit from Jean Paoli, who appeared (with Sun's Jon Bosak) on my panel, "Schemas in the wild."

The second strategy envisions a canonical set of schemas woven tightly into Longhorn. Years from now Longhorn will ship. Years after that, once it has reached critical mass and developers have mastered its APIs, schema-aware Windows apps could start to make a "semantic" way of organizing and finding information real for lots of people.

Why wait? Microsoft is telling us to disregard the grassroots Office XML strategy, which is here now and doesn't lock us in, in favor of the ivory-platform WinFS strategy, which is years away and does lock us in. If a compelling argument can be made for the second approach, I haven't seen it yet.

Former URL: http://weblog.infoworld.com/udell/2004/06/07.html#a1017