Working with Bayesian categorizers

There's been some discussion in the blog world about using a Bayesian categorizer to enable a person to discriminate along various interest/non-interest axes. I took a run at this recently and, although my experiments haven't been wildly successful, I want to report them because I think the idea may have merit. [Full story: O'Reilly Network: Working with Bayesian Categorizers]
This month's O'Reilly Network column was a struggle because categorization itself is a struggle. I remain convinced that the automated classifiers that are doing such a good job beating back the tide of spam will also turn out to be more generally useful. But finding the right synergy between an automated assistant and a human overseer is a subtle and tricky thing.

Update: Interesting comment from Larry O'Brien:

Jon appears to be doing something dangerously more ambitious, which is creating a Bayesian categorizer that assigns Jon-meaningful categories (email, collaboration, family, etc.) to items. I say "dangerously more ambitious" because Jon's approach would seem to require a lot of supervision, while the genius of Bayesian spam-filtering is that pressing a button marked "Delete as spam" is no more onerous than deleting the spam in the first place. Similarly, a Bayesian RSS aggregator that just attempted to categorize "Will this item be read, will this item be clicked-through, will this item be deleted without pause?" requires no more supervision than what is natural to the task of RSS browsing. [Knowing .NET]
Agreed, this is speculative at best. For what it's worth, there's a twofold notion at work here. First, from the perspective of a blog author who already categorizes content (as many do), the question is: can effort that's already being invested pay more dividends? An automated review of things that have already been categorized can help you sharpen your sense of the structure you are building. A prediction about how to categorize a newly-written item can be interesting and helpful too. As I worked through the exercise, I could (at times) imagine the software acting like a person you'd bounce an idea off of. "I can see why you chose that category," we can imagine it saying, "but for what it's worth, it has a lot in common with these items in this other category."
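To make the "second opinion" idea concrete, here is a minimal sketch of a word-frequency (naive Bayes) categorizer that ranks categories for a newly written item. It is not the code from the column and does not use either of the toolkits discussed there; the function names, the add-one smoothing, and the tiny training set are all assumptions invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9_']+", text.lower())

def train(labeled_items):
    """Build per-category word counts from (category, text) pairs."""
    word_counts = defaultdict(Counter)
    item_counts = Counter()
    for category, text in labeled_items:
        word_counts[category].update(tokenize(text))
        item_counts[category] += 1
    return word_counts, item_counts

def rank_categories(model, text):
    """Return (category, log score) pairs, best first, with add-one smoothing."""
    word_counts, item_counts = model
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    total_items = sum(item_counts.values())
    scores = []
    for category, counts in word_counts.items():
        # Start from the prior: how often this category was used at all.
        score = math.log(item_counts[category] / total_items)
        denominator = sum(counts.values()) + len(vocabulary)
        for word in tokenize(text):
            score += math.log((counts[word] + 1) / denominator)
        scores.append((category, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical, already-categorized items (made up for this sketch).
items = [
    ("web_services", "soap wsdl endpoint xml web services"),
    ("web_services", "rest xml http web services api"),
    ("collaboration", "email groupware shared calendar team"),
    ("collaboration", "team weblog email discussion thread"),
]
model = train(items)
ranking = rank_categories(model, "a new xml api for web services")
print("best guess:", ranking[0][0], "/ runner-up:", ranking[1][0])
```

Returning the whole ranking rather than a single label is deliberate: the runner-up is what would let the software say "it has a lot in common with these items in this other category."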

The second and even more speculative idea would be to create subscribable filters. Consider the set of items that I write myself, and categorize under, say, web_services. Some other set of items out there in the blogosphere, written by other folks, will tend to cluster with mine. Could we say that those other items have some affinity for "Jon's take on Web services"? And if so, by subscribing to my text-frequency database for that category, could you use it to create one view of your own inbound feeds, or to suggest feeds you're not reading? This part of the experiment failed badly, I'll freely admit. When I used my database to categorize items drawn from elsewhere, the results weren't promising. However, the sample size for the experiment was very small. It's conceivable to me that something could come of this approach, though I wouldn't bet money on it.
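As a rough sketch of what a subscribable filter might look like mechanically: one side publishes a category's word counts as JSON, and the other side scores its own inbound items against them. Everything here is an assumption made for illustration, including the JSON layout, the function names, and the scoring rule (a mean per-word log-likelihood ratio against the subscriber's own inbound text). It is not the method used in the experiment, and it says nothing about whether the idea works on real feeds.

```python
import json
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9_']+", text.lower())

def export_filter(category, texts):
    """Serialize one category's word counts so that others could subscribe to it."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return json.dumps({"category": category, "counts": counts})

def affinity(filter_json, inbound_texts, item_text):
    """Mean per-word log-likelihood ratio: published category vs. inbound background."""
    published = Counter(json.loads(filter_json)["counts"])
    background = Counter()
    for text in inbound_texts:
        background.update(tokenize(text))
    vocabulary_size = len(set(published) | set(background))
    published_total = sum(published.values()) + vocabulary_size
    background_total = sum(background.values()) + vocabulary_size
    words = tokenize(item_text)
    if not words:
        return 0.0
    ratio = sum(
        math.log((published[word] + 1) / published_total)
        - math.log((background[word] + 1) / background_total)
        for word in words
    )
    return ratio / len(words)

# Hypothetical publisher side: items filed under web_services.
published_filter = export_filter(
    "web_services",
    ["soap wsdl xml web services", "rest http xml web services"],
)

# Hypothetical subscriber side: a handful of inbound feed items.
inbound = [
    "new camera review lens",
    "soap and rest web services compared",
    "football scores tonight",
]
ranked = sorted(
    inbound,
    key=lambda text: affinity(published_filter, inbound, text),
    reverse=True,
)
print(ranked[0])
```

Sorting by affinity gives the "one view of your own inbound feeds" described above. With a sample this small the ordering proves nothing, which is consistent with the caution about sample size.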

A final note: Patrick Phalen wrote to remind me of another toolkit: the Python-based Natural Language Toolkit (NLTK). In fact I did try it. NLTK is much more sophisticated than the other two kits I wrote about -- it's evidently used as a foundation for all kinds of natural language research -- but for some reason I didn't get a quick ramp-up with it. That said, for what I was attempting the toolkits weren't really the bottleneck. The time-consuming thing was setting up an environment in which items could be identified by descriptive titles, viewed in HTML, and shuffled around easily in drag-and-drop fashion. The category-per-directory approach is probably about the best you can do in that regard, and you could easily adapt NLTK or another kit to that approach.
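The category-per-directory layout is simple enough to sketch with the Python standard library alone. In this illustration each subdirectory name is a category and each file inside it is one item, so dragging a file into another folder is the same thing as recategorizing it. The directory names, file names, and helper functions are hypothetical, and the loader reads plain text rather than HTML to keep the example short.

```python
import tempfile
from pathlib import Path

def load_categories(root):
    """Map each subdirectory name (the category) to the texts of its files."""
    categories = {}
    for directory in sorted(Path(root).iterdir()):
        if directory.is_dir():
            categories[directory.name] = [
                path.read_text(encoding="utf-8")
                for path in sorted(directory.iterdir())
                if path.is_file()
            ]
    return categories

def recategorize(root, filename, old_category, new_category):
    """Move an item's file to another category directory."""
    target_dir = Path(root) / new_category
    target_dir.mkdir(exist_ok=True)
    (Path(root) / old_category / filename).rename(target_dir / filename)

# Build a throwaway tree so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as root:
    for category, filename, text in [
        ("web_services", "soap-notes.txt", "soap wsdl xml"),
        ("collaboration", "team-mail.txt", "email groupware team"),
        ("collaboration", "rest-intro.txt", "rest http xml"),
    ]:
        folder = Path(root) / category
        folder.mkdir(exist_ok=True)
        (folder / filename).write_text(text, encoding="utf-8")

    before = {name: len(texts) for name, texts in load_categories(root).items()}
    # The equivalent of dragging one file into a different folder.
    recategorize(root, "rest-intro.txt", "collaboration", "web_services")
    after = {name: len(texts) for name, texts in load_categories(root).items()}

print(before)
print(after)
```

The dictionary returned by load_categories is the kind of labeled input a categorizer toolkit would need for training, so the file system doubles as both the user interface and the training set.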


Former URL: http://weblog.infoworld.com/udell/2003/11/20.html#a851