Markup and emergence, yin and yang

Sam Ruby points to Stefano Mazzocchi's comment:

I'm more and more heading myself into the concept of 'data emergence' where you don't go around bothering people to markup their data as *you* like it, but *you* make an effort to collect their data and make a sense out of it.

As Sam notes, Stefano's term for this -- the 'pedantic Web' -- is unfortunate. Ironically, Stefano nailed the correct name in passing. 'Data emergence' gets it exactly right.

We have to face up to the fact, though, that this tussle between markup and emergence is the knotty yin and yang of semi-structured data management. The struggle has deep roots, and neither can nor should produce a victor. The best and only possible outcome is dynamic equilibrium.

That all sounds way too new age, but in practice it boils down to really basic and simple stuff. In a brief incisive essay called The Cornucopia of the Commons, which is chapter four of Peer-to-Peer: Harnessing the Power of Disruptive Technologies, Dan Bricklin drills down to the essence of the Napster phenomenon:

using that simple, desirable user interface, you are also adding to the value of the database without doing any extra work. I'd like to suggest that one can predict the success of a particular system for building a shared database by how much the database is aided through normal, selfish use.

With reference to the tragedy of the commons, Bricklin says:

In the case of certain ingeniously planned services, we find a contrasting cornucopia of the commons: use brings overflowing abundance...concentrate on whatever you can get from users, and use whatever protocol can maximize their voluntary contributions.

For many years I've been preoccupied with how to empower people to work together more productively. It really does come down to making virtues of laziness and selfishness. We see that happening all around us in blogspace, in ways that I hope (and believe) can transfer to the business enterprise.

Every corporate retreat begins and ends with the theme of communication. You've been there, done that. "We don't talk to each other." "I didn't know you were working on that project." "It's not my department." The missing ingredient is shared awareness. Blogspace, of course, is a laboratory in which new modes of shared awareness are being invented every day. Let's look at two examples.

When I began this weblog a little over a year ago, I debated whether to maintain a blogroll. There are selfish reasons to do so. By advertising your interests more precisely and selectively than a Google-related query can do, you can manufacture serendipity. But I was too lazy to maintain that list. Happily, there was no need. My RSS aggregator is a database that I was already maintaining "through normal, selfish use". It was only necessary to export it, which I (and now many others) do by means of a script 1. Alternatively -- and in retrospect better -- I could have fed the equivalent mySubscriptions.opml data through an XSLT transformation to achieve the same result. Either way, the point is that in this case, selfish routine use of a tool permits data emergence. Markup of one kind or another enables that emergence, but it's completely hidden by the tool used to interact with the database. That's a best-case scenario.
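
To make the blogroll case concrete, here is a minimal sketch of the idea in Python. It is not the script mentioned above, and it swaps in a plain XML parse where the XSLT transformation would go. It assumes the subscription list is OPML whose outline elements carry title, htmlUrl, and xmlUrl attributes; the sample data, the example.org URLs, and the function name are all made up for illustration. Because feeds with empty titles are a known nuisance, the sketch falls back to the URL when a title is blank.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical stand-in for a mySubscriptions.opml file. The attribute
# names (title, htmlUrl, xmlUrl) are an assumption about the export format.
SAMPLE_OPML = """<?xml version="1.0"?>
<opml version="1.1">
  <head><title>mySubscriptions</title></head>
  <body>
    <outline title="Sam Ruby" htmlUrl="http://example.org/sam/"
             xmlUrl="http://example.org/sam/index.rss"/>
    <outline title="" htmlUrl="http://example.org/untitled/"
             xmlUrl="http://example.org/untitled/rss.xml"/>
  </body>
</opml>"""


def blogroll_from_opml(opml_text):
    """Return an HTML list with one link per subscribed feed."""
    root = ET.fromstring(opml_text)
    items = []
    for outline in root.iter("outline"):
        # Prefer the human-readable site; fall back to the feed itself.
        url = outline.get("htmlUrl") or outline.get("xmlUrl")
        if not url:
            continue  # a grouping outline with nothing to link to
        # An empty or missing title falls back to the URL.
        title = (outline.get("title") or outline.get("text") or "").strip() or url
        items.append(
            '<li><a href="%s">%s</a></li>' % (escape(url, quote=True), escape(title))
        )
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


if __name__ == "__main__":
    print(blogroll_from_opml(SAMPLE_OPML))
```

The point of the sketch is how little there is to it: the aggregator already holds the data, so the "markup" step reduces to reading attributes the tool wrote on its own.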

Now let's switch gears and consider another example where we've yet to hide the markup as effectively. Another common weblog feature is a list of interesting books. We maintain these for the same selfish reasons we maintain blogrolls: to manufacture serendipity. And now a lazy solution has emerged here as well. I don't need to keep a list of the books that have drawn my attention, because All Consuming does it for me. The markup, in this case, takes the form of ISBN-bearing URLs -- to Amazon, All Consuming, or elsewhere. It works remarkably well, for those of us inclined to a) mention books in our weblogs, and b) write, copy/paste, or otherwise clumsily transmit the necessary URLs. The activation threshold for doing these things is dramatically lower than it used to be. But it's also dramatically higher than it needs to be.
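
Here is a rough sketch of what the collecting side of that arrangement might look like. I don't know how All Consuming actually does it; this is only one plausible approach, in Python, under the assumption that an ISBN shows up as a standalone 10-character run somewhere in a link. The URL shapes, the sample ISBN, and the function names are invented for illustration. The standard ISBN-10 check digit is used to weed out ten-digit numbers that merely look like ISBNs.

```python
import re

# Anything that looks like a link in the post text or its HTML.
URL = re.compile(r"https?://[^\s\"'<>]+")
# A standalone run of nine digits plus a final digit or X.
ISBN10 = re.compile(r"(?<![0-9A-Za-z])(\d{9}[\dXx])(?![0-9A-Za-z])")


def is_valid_isbn10(code):
    """Check the ISBN-10 check digit: weights 10 down to 1, sum divisible by 11."""
    total = 0
    for position, char in enumerate(code.upper()):
        value = 10 if char == "X" else int(char)
        total += (10 - position) * value
    return total % 11 == 0


def isbns_in_post(text):
    """Return the distinct valid ISBN-10s found inside URLs, in order of appearance."""
    found = []
    for url in URL.findall(text):
        for candidate in ISBN10.findall(url):
            code = candidate.upper()
            if is_valid_isbn10(code) and code not in found:
                found.append(code)
    return found


if __name__ == "__main__":
    # Hypothetical post text; the links are placeholders, not real book pages.
    post = (
        'Reading <a href="http://example.com/exec/obidos/ASIN/0306406152/">this</a> '
        "and also http://example.org/item/1234567890 today."
    )
    print(isbns_in_post(post))
```

Notice where the effort falls. The harvesting code is short, but it only works if the writer has already gone to the trouble of pasting an ISBN-bearing URL into the post.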

Businesses need to work out something like scenario one in a general way. When teams form and work together, the "markup" that enables and documents team formation, and that represents shared work product, needs to arise naturally and invisibly as a consequence of tool use. The reality today is more like scenario two. Protocols are available -- "Use this tag in your email Subject: header" or "Send the updated schedule to this list" -- but they're awkward and crufty.

There are two ways to help data to emerge. We can scour the available sources and put them to better use. And we can improve the quality of the available sources. Both strategies need to be pursued in parallel. Both require smarter tools.

1 If you're using the original version, you may run into problems when your RSS reader encounters feeds with empty channel titles. The new version solves that problem.
