Managing unstructured data: the virtuous cycle

Phil Windley has been thinking about the vast quantities of unstructured data that governments store but cannot effectively index or search:

Got a question? Somewhere, on some government computer, the information you need is probably available. Information you paid for and the government would gladly share with you -- if only they could find it. There are thousands and thousands of documents stored on thousands and thousands of hard drives just in the State of Utah. Throw in city governments, county governments, school districts, universities, water districts, and other special use districts and the problem is staggering. Multiply that by fifty states and add in the federal government and it's mind boggling. With all of the technology available to index, catalog, and store this data, what's wrong? [Windley's Enterprise Computing Weblog]

Citing our InfoWorld story on this subject, he agrees that "soft ROI" can't justify the expense and effort of structuring this data so it can be more effectively mined.

Elsewhere, Phil proposes to run a bakeoff between the Google appliance and Autonomy's auto-categorizer. For what it's worth, I'd love to see some of my tax dollars used to pay for that experiment.

Meanwhile, I'd like to suggest a low-tech approach to the problem. I've written at length in my book and elsewhere about strategies for organizing search results. Quite extraordinary benefits can be realized by paying a bit of attention to the design of two complementary namespaces. One is the set of HTML doctitles in your Web pages. The other is the set of pathnames forming the URLs of those pages. Both are virtual repositories of metadata that can be, and I argue should be, managed with a view toward categorization of search results.
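To make that concrete, here's a minimal sketch, in Python, of how those two namespaces could feed search-result categorization. The pipe-delimited "Agency | Topic | Title" doctitle format and the URL shown are hypothetical conventions I've invented for illustration, not anything mandated by HTML or HTTP:

```python
# Sketch: derive category facets from the two namespaces.
# The "Agency | Topic | Title" doctitle format and the URL below
# are hypothetical conventions, invented for illustration.

from urllib.parse import urlparse

def facets_from_doctitle(doctitle):
    """Split a pipe-delimited doctitle into category facets."""
    return [part.strip() for part in doctitle.split("|")]

def facets_from_url(url):
    """Treat each directory segment of the URL path as a facet."""
    path = urlparse(url).path
    return [seg for seg in path.split("/") if seg and "." not in seg]

title = "Utah DEQ | Water Quality | 2002 Monitoring Report"
url = "http://deq.utah.gov/water-quality/reports/2002-monitoring.html"

print(facets_from_doctitle(title))
# ['Utah DEQ', 'Water Quality', '2002 Monitoring Report']
print(facets_from_url(url))
# ['water-quality', 'reports']
```

A search engine that knows nothing else about these pages can still group results by such facets, which is the whole point of treating the two namespaces as metadata.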

I'm talking about nothing fancier than common-sense naming conventions. I well understand that content management systems tend to use unhelpful and incompatible conventions. This is more than an annoyance. It can cripple the ability of organizations to design useful local namespaces, and to federate across namespaces. Brent's Law of CMS URLs -- "the more expensive the CMS, the crappier the URLs" -- is funny, but in a tragic way. A CMS vendor who wins a government contract ought to have to prove the ability to manage namespaces in flexible and useful ways.
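To show how little machinery a common-sense convention requires, here's one hypothetical convention stated as a pattern. The /agency/topic/year/slug.html shape is my invention for illustration; any agency would define its own:

```python
# A hypothetical URL convention, stated as a regular expression:
# /agency/topic/yyyy/slug.html, lowercase, hyphen-separated.

import re

CONVENTION = re.compile(
    r"^/[a-z0-9-]+"        # agency
    r"/[a-z0-9-]+"         # topic
    r"/\d{4}"              # four-digit year
    r"/[a-z0-9-]+\.html$"  # document slug
)

# A URL that follows the convention, and a typical CMS-generated one.
print(bool(CONVENTION.match("/deq/water-quality/2002/monitoring-report.html")))  # True
print(bool(CONVENTION.match("/cms/viewdoc.asp?id=8472&rev=3")))                  # False
```

A contract requirement could be stated in roughly these terms: the CMS must be able to emit URLs matching a published pattern.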

I also suspect, though, that a huge amount of that unstructured data is still produced, and posted to the Web, by hand -- without any explicit guidelines for naming. Why not provide such guidance? It's true that you can't cheaply provide authoring tools to enforce naming conventions. But you can cheaply spider your content after the fact. A public validator could work like the RSS validator. It could be cool to have your content validate. A virtuous cycle could, perhaps, be created without ruinous capital investment.
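Here's a minimal sketch of what such an after-the-fact validator might look like, assuming the hypothetical doctitle and URL conventions above; the example.gov address is a placeholder:

```python
# Sketch of a public, after-the-fact validator in the spirit of
# the RSS validator: fetch a page, check its URL path and doctitle
# against published (here: hypothetical) naming conventions.

import re
import urllib.request
from urllib.parse import urlparse

URL_CONVENTION = re.compile(r"^(/[a-z0-9-]+)+\.html$")
TITLE_CONVENTION = re.compile(r"^[^|]+ \| [^|]+ \| [^|]+$")

def validate(url):
    """Return a list of problems; an empty list means the page validates."""
    problems = []
    if not URL_CONVENTION.match(urlparse(url).path):
        problems.append("URL path does not follow the naming convention")
    with urllib.request.urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    match = re.search(r"<title>(.*?)</title>", html, re.I | re.S)
    title = match.group(1).strip() if match else ""
    if not TITLE_CONVENTION.match(title):
        problems.append("doctitle does not follow 'Agency | Topic | Title'")
    return problems

# Placeholder URL; a real validator would spider a site's pages.
for url in ["http://example.gov/deq/water-quality/2002/monitoring-report.html"]:
    for problem in validate(url):
        print(url, "-", problem)
```

Publish the report, let sites link to a "valid namespace" badge, and the same social dynamic that makes it cool for RSS feeds to validate could start working on government Web content too.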


Former URL: http://weblog.infoworld.com/udell/2002/11/06.html#a500