SpamBayes futures

pork futures
Spam being the visceral topic that it is, yesterday's item provoked a number of responses. An email correspondent asks whether SpamBayes can deal with tricks like image-only messages, or words obfuscated with interposed characters. So far, not a problem. Part of the reason seems to be that the email headers contribute to the analysis. As Paul Graham notes, spammers:

would have to change (and keep changing) their whole infrastructure, because otherwise the headers would look as bad to the Bayesian filters as ever, no matter what they did to the message body. [A Plan for Spam]
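To make the point about headers and obfuscated words concrete, here is a minimal sketch of a tokenizer that feeds both into a classifier. It is not SpamBayes's actual tokenizer; the function name `tokens` and the `header:word` prefix scheme are assumptions chosen for illustration. The idea is that an obfuscated word simply becomes another token the filter can learn, and each header word is tagged with its field name so it counts as separate evidence.

```python
import email
import re


def tokens(raw_message):
    """Yield classifier tokens from the headers and body of a raw message.

    Hypothetical sketch: header words are prefixed with the header name,
    body words are split on whitespace and left otherwise untouched.
    """
    msg = email.message_from_string(raw_message)

    # Header tokens, e.g. "from:deals@example.com" or "subject:hello".
    for name, value in msg.items():
        for word in re.findall(r"[^\s<>;,()\"]+", value.lower()):
            yield f"{name.lower()}:{word}"

    # Body tokens (single-part text messages only in this sketch).
    # "V.1.a.g.r.a" is kept whole, so it can be learned as a clue itself.
    body = msg.get_payload()
    if isinstance(body, str):
        for word in re.findall(r"\S+", body.lower()):
            yield word


raw = "From: deals@example.com\nSubject: Hello\n\nV.1.a.g.r.a now\n"
found = list(tokens(raw))
```

An image-only message would yield few or no body tokens here, but its header tokens would still be available to the classifier.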

Sam Ruby asks:

What if we could marry Chandler and SpamBayes (both in Python)... [Intertwingly]

Yup, that's a natural. Though the Python-ness of both may not be directly relevant. In Mark Hammond's Outlook implementation, the SpamBayes engine could as easily be a COM component or a local Web service as a Python module. It's important to SpamBayes that it's written in a flexible, dynamically-typed language. Likewise for the Outlook add-in. But Python isn't necessarily, and shouldn't necessarily be, the glue between them.

Several bloggers have advanced a line of thinking that I too find fascinating, and that points to implications far beyond the world of spam:

Matt Griffith:

My problem is information overload. I'm much more interested in seeing the same thing for RSS. Instead of blocking stuff I don't want I want it to highlight the stuff I might want. [matt.griffith]

Ditto. Using a Bayesian approach, or some other form of machine learning, as applied to my aggregator and my viewing patterns is something I've been wanting for awhile now. [0xDECAFBAD]

In fact, a kind of RSS-Bayes is already available to users of NewsGator, since you could process its messages through SpamBayes along with your email. I wouldn't, though, unless it were possible to use multiple instantiations of SpamBayes, because the ham/spam distinction in email is very different from the read/skip distinction in RSS.

The multiple-instantiation idea is potentially huge, I think. Consider just your email. I can imagine many dimensions of classification beyond spam/ham. For example: family/not-family, projectX/not-projectX. I actually go to the trouble of creating filters for some of these kinds of things, but it's arguably more trouble than it's worth. A multidimensional classifier that could notice these patterns emerging, offer to set up the foldering and filtering for me, and then reinforce the classification by observing my behavior over time -- wow, isn't that what computers were supposed to be for?
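One way to picture multiple instantiation is a set of independent yes/no classifiers, one per dimension, all trained on the same messages. The sketch below uses a plain naive Bayes model with add-one smoothing, which is a simplification and not the scoring SpamBayes itself uses. The class names `BinaryBayes` and `MultiClassifier` and the dimension names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a stand-in for a real email tokenizer."""
    return re.findall(r"[a-z0-9']+", text.lower())


class BinaryBayes:
    """One yes/no dimension, e.g. spam/ham or family/not-family."""

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}
        self.messages = {True: 0, False: 0}

    def train(self, text, label):
        # Count each token once per message.
        self.messages[label] += 1
        self.counts[label].update(set(tokenize(text)))

    def score(self, text):
        """Return an estimate of P(label is True | text)."""
        yes, no = self.messages[True], self.messages[False]
        log_odds = math.log((yes + 1) / (no + 1))
        for token in set(tokenize(text)):
            p_yes = (self.counts[True][token] + 1) / (yes + 2)
            p_no = (self.counts[False][token] + 1) / (no + 2)
            log_odds += math.log(p_yes / p_no)
        return 1 / (1 + math.exp(-log_odds))


class MultiClassifier:
    """A bundle of independent classifiers, one per named dimension."""

    def __init__(self, dimensions):
        self.models = {name: BinaryBayes() for name in dimensions}

    def train(self, text, **labels):
        for name, label in labels.items():
            self.models[name].train(text, label)

    def classify(self, text):
        return {name: m.score(text) for name, m in self.models.items()}


mc = MultiClassifier(["spam", "family"])
mc.train("cheap pills buy now", spam=True, family=False)
mc.train("dinner at mom's on sunday", spam=False, family=True)
mc.train("project status meeting notes", spam=False, family=False)
scores = mc.classify("sunday dinner with mom")
```

Because each dimension keeps its own counts, training the read/skip or family/not-family model never disturbs the spam/ham one. Filing a message into a folder could serve as the `train` call for that folder's dimension.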

One other thought, prompted by my conversation with a PR person yesterday about mail gateways. It's true that even if I decide SpamBayes is a total success, my email administrator has a bigger problem. He'd like to keep that stuff off his disks and off his wires. And I think I see how that can happen. I'm almost, but not quite, ready to tell Outlook to delete what lands in my Spam folder, sight unseen. If I do make that choice, why not replicate my SpamBayes database up to the server? Since my local database is in constant flux -- as my disposition of messages refines it -- this would require an ongoing message flow between client and server. Sounds like a job for Web services!


Former URL: http://weblog.infoworld.com/udell/2003/05/09.html#a685