Analyzing blog content

Suppose that we bloggers, collectively, wanted to migrate toward HTML coding and CSS styling conventions that would make our content more interoperable. Since none of us is starting from a clean slate, we'd need to analyze current practice. Well, now we can. Here, for example, is a concordance of use cases for HTML elements with class attributes, drawn from the database I'm building:

<a class="Troll">

  1. OLDaily: Theory in Chaos

<a class="listLinkLrg">

  1. Kingsley Idehen's Blog: Enterprise Databases get a grip on XML

<a class="nodelink">

  1. Erik Benson: Pat Coa

<a class="offlink">

  1. Erik Benson: Pat Coa

<a class="regularArticleU">

  1. Jeroen Bekkers' Groove Weblog: Groove and Weblogs
  2. Kingsley Idehen's Blog: Enterprise Databases get a grip on XML

<a class="weblogItemTitle">

  1. Seb's Open Research: Mario dans Le Devoir

<blockquote class="posts">

  1. McGee's Musings: Russell Ackoff resources on systems thinking

<div class="Section1">

  1. Clemens Vasters: Indigo'ed: Back to Business

<div class="active1">

  1. s l a m: Countering The Bush Doctrine

<div class="blogtitle">

  1. McGee's Musings: Russell Ackoff resources on systems thinking

<div class="caption">

  1. Joi Ito's Web: With bloggers inside, Davos secrets are out - IHT article
  2. Windley's Enterprise Computing Weblog: Toysight

<div class="comment">

  1. Organic BPEL: Avalon is NOT representing the convergence between the Web and GUI!

<div class="date">

  1. Comments for Jon's Radio: None

<div class="inlineimage">

  1. Joi Ito's Web: With bloggers inside, Davos secrets are out - IHT article
  2. Windley's Enterprise Computing Weblog: Toysight

<div class="node">

  1. s l a m: Countering The Bush Doctrine

<div class="personquote">

  1. Joi Ito's Web: With bloggers inside, Davos secrets are out - IHT article

<div class="posts">

  1. McGee's Musings: Russell Ackoff resources on systems thinking

<li class="MsoNormal">

  1. Hillel Cooperman: None
  2. Rob Howard's Blog: Continued...
  3. cbrumme's WebLog: Memory Model

<p class="ArticleBody">

  1. Telematique, water and fire.: Server vendors launch management initiative

<p class="MsoNormal">

  1. Luann Udell / Durable Goods: Myth #3 about Artists
  2. Clemens Vasters: Indigo'ed: Back to Business
  3. Rob Howard's Blog: Last post on the topic -- at least for now!
  4. cbrumme's WebLog: Memory Model

<p class="blogtitle">

  1. McGee's Musings: Russell Ackoff resources on systems thinking

<p class="code">

  1. Duncan Wilcox's weblog: Tag Soup

<p class="editorial">

  1. MobileWhack: Z600 Accessories, Accessories, Accessories

<p class="imagelink">

  1. Kevin Lynch: Intel Centrino

<p class="posts">

  1. McGee's Musings: Russell Ackoff resources on systems thinking

<p class="q">

  1. Duncan Wilcox's weblog: Trusting Corporations

<p class="text">

  1. Hillel Cooperman: None

<p class="times">

  1. Telematique, water and fire.: Metro AG and their RFID Plan

<span class="artText">

  1. Kingsley Idehen's Blog: Enterprise Databases get a grip on XML

<span class="bodytext">

  1. Seb's Open Research: Kottke: Guidelines for learning

<span class="byline">

  1. McGee's Musings: Russell Ackoff resources on systems thinking

<span class="closed">

  1. s l a m: Countering The Bush Doctrine

<span class="imagelink">

  1. Kevin Lynch: Adam Bosworth on Service Architecture

<span class="nxml-attribute-local-name">

  1. darcusblog: Names (again)

<span class="nxml-attribute-value">

  1. darcusblog: Names (again)

<span class="nxml-attribute-value-delimiter">

  1. darcusblog: Names (again)

<span class="nxml-element-local-name">

  1. darcusblog: Names (again)

<span class="nxml-tag-delimiter">

  1. darcusblog: Names (again)

<span class="nxml-tag-slash">

  1. darcusblog: Names (again)

<span class="nxml-text">

  1. darcusblog: Names (again)

<span class="o">

  1. ongoing: Genx

<span class="ofp">

  1. Seb's Open Research: None

<span class="rss:item">

  1. Blogging Alone: None

<span class="storyHead">

  1. Jeroen Bekkers' Groove Weblog: Disruptive in no small measure

<span class="text">

  1. s l a m: Countering The Bush Doctrine

<span class="title">

  1. Blogging Alone: None

<span class="topstoryhead">

  1. Dive into BC4J: BC4J Mentioned in the Latest Article in the OTN Architecture Series

<ul class="noindent">

  1. Corante: Social Software: Friendster notes
  2. Web Voice: And now for something different
  3. Dan Gillmor's eJournal: Electronic Voting: An Insecure Mess, but Full Speed Ahead

With only a few days' worth of accumulated content, I wouldn't dare to venture any recommendations about these use cases. But as the picture develops over time, we might start to see opportunities for convergence.
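For anyone who wants to try this kind of analysis on their own content, here is a minimal Python sketch of how a concordance like the one above could be assembled. It is an illustration only, not the code behind the database described here. It assumes each post body has already been converted to well-formed XML, and the sample data in `POSTS` and the names `build_concordance` and `format_concordance` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample data: (blog name, post title, well-formed XHTML body).
POSTS = [
    ("Blog A", "First post",
     '<body><p class="MsoNormal">Hi</p><div class="caption">x</div></body>'),
    ("Blog B", "Second post",
     '<body><p class="MsoNormal">Yo</p><p class="MsoNormal">again</p></body>'),
]


def build_concordance(posts):
    """Map each '<tag class="value">' use case to the posts that contain it."""
    concordance = defaultdict(list)
    for blog, title, body in posts:
        root = ET.fromstring(body)
        label = f"{blog}: {title}"
        # './/*[@class]' selects every descendant of the root element
        # that carries a class attribute (the root itself is not matched).
        for elem in root.findall(".//*[@class]"):
            key = f'<{elem.tag} class="{elem.get("class")}">'
            # List each post only once per use case.
            if label not in concordance[key]:
                concordance[key].append(label)
    return concordance


def format_concordance(concordance):
    """Render the concordance as use-case headings with numbered posts."""
    lines = []
    for key in sorted(concordance):
        lines.append(key)
        for i, label in enumerate(concordance[key], 1):
            lines.append(f"  {i}. {label}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_concordance(build_concordance(POSTS)))
```

The XPath expression does the real work; everything else is bookkeeping. Real-world blog HTML would need a tidying step before it parses as XML, which this sketch leaves out.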

Update: I've been hoping for some external validation of this approach, and Giulio Piancastelli provides it today. As part of a much longer posting with lots of detailed technical analysis of RDF-oriented techniques, he writes:

What Jon is searching for, I think, is a good balance between the cost of providing metadata and the benefits gained by working on the provided metadata, while trying not to entirely move away from the web world as we know it. In fact, this is probably the most important characteristic of Jon's experiment: he is working with what he is able to find right now, that is lots of HTML documents, which can be converted to be well-formed XML quite easily, and then searched by means of XPath. While these are ubiquitous technologies, it's difficult to find RDF files spreaded around as such: proving that the RDF world is query-enabled, stating that the right place where to put metadata are RDF files because you can probably get higher quality and more complete results is useless if there are little or no data to query.

From my personal perspective, I see those two worlds, one working with XML and XPath, the other messing around with RDF and RDQL, still very far from each other. Jon's experiment is helping us to become conscious of the fact we already are on a metadata path as far as web content is concerned: XML and XPath are probably the first steps in this journey, leading us to a more semantic web augmented with technologies which nowadays seems not to be successful, but that will hopefully prove to be useful when more complex needs arise. We can only hope the virtuous cycle will start to spin soon.

[Through the blogging-glass]

Amen. Thanks, Giulio!


Former URL: http://weblog.infoworld.com/udell/2004/01/31.html#a903