Bayesian vs. latent semantic analysis

By way of Michael Alderete's blog, I found this fascinating item by Tim Oren, a venture capitalist whose eight-year stint at Apple included advanced research on the use of latent semantic analysis for document categorization. Although he can't say for sure, Oren strongly suspects that OS X Mail, widely thought to use Bayesian techniques, in fact uses latent semantic analysis:

So what's Apple doing with latent semantics to catch spam? Not sure. The simplest approach is to use a related factor analysis technique to find the best fit to predicting spam/not-spam in a training sample; it's not a full PCA but I suppose you could call it latent semantics. It would be more interesting if they are using the full deal, maybe computing separate models for spam/not. Because, you see, latent semantics naturally lends itself to automatic sorting and organization of the document space over which the model was computed. And a few years back, the Apple group that I and then Dan Rose managed did an automatic e-mail organization project rather unfelicitously called piles, that included a user interface for just such a thing. (IP alert: Some of it was patented.) Hmmm.... [Due Diligence]
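
For readers who want to see what latent semantic sorting can look like in practice, here is a minimal sketch in Python with NumPy. It is emphatically not Apple's method (Oren himself isn't sure what Apple does). It assumes a toy six-message training set, plain word counts, a truncated SVD with two latent dimensions, and classification by cosine similarity to a per-class centroid. The function names (fit_lsa, classify, and so on) and the sample messages are invented for illustration.

```python
import numpy as np

def build_index(docs):
    """Map each distinct word in the training docs to a row number."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    return {w: i for i, w in enumerate(vocab)}

def term_vector(doc, index):
    """Count known words in a doc; words outside the vocabulary are ignored."""
    v = np.zeros(len(index))
    for w in doc.lower().split():
        if w in index:
            v[index[w]] += 1.0
    return v

def fit_lsa(docs, k):
    """Build a terms-by-docs count matrix and keep its top-k left singular vectors."""
    index = build_index(docs)
    A = np.column_stack([term_vector(d, index) for d in docs])
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return index, U[:, :k]

def project(doc, index, basis):
    """Fold a document into the k-dimensional latent space."""
    return basis.T @ term_vector(doc, index)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def classify(doc, index, basis, centroids):
    """Return the label whose latent-space centroid is most similar to the doc."""
    z = project(doc, index, basis)
    return max(centroids, key=lambda label: cosine(z, centroids[label]))

# Toy training sample (hypothetical messages, labeled by hand).
training = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches now", "spam"),
    ("cheap pills cheap watches", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project meeting notes", "ham"),
    ("monday project agenda", "ham"),
]

docs = [text for text, _ in training]
index, basis = fit_lsa(docs, k=2)

# One centroid per class: the mean of that class's projected training docs.
centroids = {
    label: np.mean(
        [project(text, index, basis) for text, lab in training if lab == label],
        axis=0,
    )
    for label in ("spam", "ham")
}

print(classify("buy cheap pills", index, basis, centroids))
print(classify("project agenda", index, basis, centroids))
```

A single shared model with one centroid per class is about the simplest way to turn the latent space into a sorter. The separate spam/not-spam models Oren speculates about would instead fit one decomposition per class and compare how well each reconstructs a new message.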

I wonder if Apple will clarify? Meanwhile, I just picked up a weekend's worth of mail. Here's the score:

     rightly sent to Spam folder:  237
     wrongly sent to Spam folder:    0
rightly left in original folders:   14

In other words, a perfect performance. How long can this last? One of the good messages, from an old acquaintance, pointed out that the SpamBayes glow will inevitably fade as spammers regroup to attack it. I'm sure that's true. In the long run a multi-pronged approach seems best. Server-based gateways to keep the worst of the junk off your network, digital identity, filtering, blacklisting, whitelisting, digital postage -- all of these strategies will play important roles.

To reiterate, though, there's more going on here than spam prevention. Bringing advanced computational methods for document categorization to the desktop will create a host of new opportunities. Our personal data stores are about to become the laboratory for some really fascinating experimentation.

Former URL: http://weblog.infoworld.com/udell/2003/05/12.html#a686