SpamBayes rocks

In an upcoming InfoWorld article, which will post next Friday and appear in print the following week, I review the SpamBayes filtering engine and Mark Hammond's brilliant Outlook addin. Thanks to this remarkable open source duo, I am ready to declare victory on spam.

I can't post the whole review yet, but neither can I resist reporting here what I think are remarkable results. I've been a skeptic when it comes to content-filtering solutions. I thought about driving down to Cambridge in January for the Spam Conference, for example, but at the time felt that its narrow focus on filtering -- to the exclusion of many alternatives, including strong identity management -- was short-sighted. Now I think I'm the one who was short-sighted.

Here are the first two paragraphs from a particularly interesting spam:

I came across your web site- 'http://jonudell.net/', the official website of "Jon Udell". I found your website to be very impressive and I went through the contents of your website which were quite interesting. Your article on "Distributed HTTP" has been great! I thoroughly enjoyed browsing through various web pages via the interactive links (InfoWorld) given in your site. The link to 'Analysis | XML alone won't cure Web security ills' has been very educative. I give you all the credit for creating such an incredible site and wish you all the best. I believe that you can enrich your prospects by having a better hosting platform for your website.

I represent xxxx.com a professional web hosting and a web design provider currently servicing over 75000 customers world wide and we are currently promoting a trial offer. I want to offer you 1 Year of web hosting absolutely FREE OF CHARGE. This is our attempt to project an important juncture or probability for you to move on to a better web host.

What's interesting is that this spammer has spidered my personal home page in order to gather vocabulary ("xml," "distributed," "platform") typical of legitimate mail to me. This is precisely the kind of tactic that you'd think might fool a Bayesian filter, which weighs both positive and negative evidence. It did, in fact, fool SpamAssassin.

For SpamBayes, in this case, the "hammy" words did help counteract the "spammy" words, yielding a score of only 80%. That was enough uncertainty to land the message in my MaybeSpam folder. After I declared it as spam, it rescored to 99%.
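
To make the tug-of-war between hammy and spammy words concrete, here is a minimal Python sketch of chi-squared combining, the general technique the SpamBayes project describes for merging per-word probabilities into a single score. It is an illustration under assumptions, not the addin's actual code: the function names, the 15% and 90% cutoffs, and the "Inbox" and "Spam" folder names are all hypothetical.

```python
import math

def chi2q(x2, dof):
    """Chi-squared survival function, valid for an even number of degrees of freedom."""
    assert dof % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combine(probs):
    """Merge per-word spam probabilities (each strictly between 0 and 1) into one score."""
    n = len(probs)
    if n == 0:
        return 0.5  # no evidence either way
    # How strongly the words argue for spam, and separately for ham.
    spam_evidence = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    ham_evidence = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    # Strong evidence on both sides cancels out toward 0.5.
    return (spam_evidence - ham_evidence + 1.0) / 2.0

def folder_for(score, ham_cutoff=0.15, spam_cutoff=0.90):
    """Map a score to a folder; cutoffs and folder names are assumed, not the addin's."""
    if score >= spam_cutoff:
        return "Spam"
    if score <= ham_cutoff:
        return "Inbox"
    return "MaybeSpam"
```

With these assumed cutoffs, a score of 0.80 lands in the unsure band, which matches the behavior described above: enough hammy words to create doubt, not enough to win.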

Paul Graham:

I think it's possible to stop spam, and that content-based filters are the way to do it. The Achilles heel of the spammers is their message. They can circumvent any other barrier you set up. They have so far, at least. But they have to deliver their message, whatever it is. If we can write software that recognizes their messages, there is no way they can get around that. [Paul Graham: A Plan for Spam]

It's hard, at first, to see how SpamBayes can possibly work. When you look at a message from SpamBayes' point of view, you see a different and far more granular approach than SpamAssassin's, which reports things like:

PENIS_ENLARGE (2.2 points) BODY: Information on getting a larger penis
SUB_FREE_OFFER (0.3 points) Subject starts with "Free"
US_DOLLARS (2.0 points) BODY: Nigerian scam key phrase (million dollars)

SpamBayes doesn't know any of these rules. It just knows what I want to see, and what I don't want to see. It knows because I show it a bunch of positive and negative examples up front, and then refine its understanding of my wishes continuously as I process my (surprisingly few) MaybeSpam messages.
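
Training by example can be pictured as simple bookkeeping. The sketch below is a hypothetical illustration, not SpamBayes code: the class name, its methods, and the whitespace tokenizer are assumptions (the real tokenizer also produces clues from headers and URLs). It records, for each token, how many ham messages and how many spam messages contained it, counting a token once per message.

```python
from collections import Counter

class ExampleTrainer:
    """Hypothetical store of per-token ham and spam message counts."""

    def __init__(self):
        self.ham_counts = Counter()
        self.spam_counts = Counter()
        self.nham = 0   # total ham messages seen
        self.nspam = 0  # total spam messages seen

    @staticmethod
    def tokenize(text):
        # Simplified: lowercase, split on whitespace, keep punctuation attached.
        # A set means each token counts once per message.
        return set(text.lower().split())

    def learn(self, text, is_spam):
        """Record one message as a positive (ham) or negative (spam) example."""
        tokens = self.tokenize(text)
        if is_spam:
            self.spam_counts.update(tokens)
            self.nspam += 1
        else:
            self.ham_counts.update(tokens)
            self.nham += 1

# Up-front training, then one later correction from the unsure folder.
trainer = ExampleTrainer()
trainer.learn("Your article on XML was interesting.", is_spam=False)
trainer.learn("FREE trial offer, call toll free 24/7", is_spam=True)
trainer.learn("We offer 1 year of web hosting free of charge.", is_spam=True)
```

Every message filed from the unsure folder becomes one more call to `learn`, which is why the filter keeps adapting to one person's mail rather than to a fixed rule set.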

When I sent the review to my InfoWorld colleagues, I sent it twice:

Subject: Penis enlargement
Subject: SpamBayes review

SpamAssassin jumped all over the first message. But SpamBayes knew that neither was anything to worry about. The "spammy" clues were strong but the "hammy" evidence completely overwhelmed them -- in ways that are specific to my own unique patterns of communication.

You get some of this effect with Mac OS X's Mail app, but it doesn't feel like a complete solution to me. SpamBayes, as implemented for Outlook by Mark Hammond, does. I asked Mark if I could send him a PayPal contribution. He said: "No, it would be inappropriate for this project, as so many people smarter than I worked on the back end." Fair enough. Thanks to all of them for a job well done!

Update: I just got a phone call from a PR representative wanting to tell me about IronPort's messaging gateway, SenderBase service, and bonded sender program. It's interesting stuff. We talked some about client versus server solutions, and finally she asked: "So, how did my email message to you score?" I went back and looked: 64%. Slightly spammy, but not over the threshold. Here were the spammiest clues:

word                         spamprob         #ham  #spam
'truste'                     0.955709            0      5
'affiliate'                  0.963994            3     93
'spam'                       0.965056           14    424
'7000'                       0.974385            0      9
'forever!'                   0.99203             0     30

She took notes. We were both surprised to see that the word TrustE has so far shown up in 5 spams and no hams (until this one).
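
The clue columns are a probability followed by ham and spam counts, and a rough Python sketch shows how counts like these can turn into a probability. It uses the Robinson-style adjustment that the SpamBayes project describes, which pulls rarely seen words toward a neutral 0.5. Treat it as an assumption-laden illustration: the function name, the `strength` and `unknown_prob` values, and the corpus totals `nham` and `nspam` are not taken from this post, so it will not reproduce the table's figures exactly.

```python
def spamprob(hamcount, spamcount, nham, nspam, strength=0.45, unknown_prob=0.5):
    """Estimate how spammy a word is from the number of hams and spams containing it.

    nham and nspam are the total numbers of trained ham and spam messages.
    strength and unknown_prob are assumed tuning values.
    """
    if nham <= 0 or nspam <= 0:
        raise ValueError("need at least one trained ham and one trained spam")
    n = hamcount + spamcount
    if n == 0:
        return unknown_prob  # never-seen word: no opinion
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    raw = spamratio / (hamratio + spamratio)
    # Blend the raw ratio with the neutral prior; more sightings mean less blending.
    return (strength * unknown_prob + n * raw) / (strength + n)

# Hypothetical corpus of 1000 hams and 1000 spams.
rare_spam_word = spamprob(0, 5, nham=1000, nspam=1000)
common_spam_word = spamprob(0, 30, nham=1000, nspam=1000)
```

Under these assumptions a word seen in 5 spams and no hams scores well above 0.9 but short of 1.0, and a word seen in 30 spams scores higher still: a handful of sightings is strong evidence, not proof.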


An ingenious approach foiled:
Spam Score: 0.993749
word                            spamprob         #ham  #spam
'url:roninhouse'                0.0461277          81      4
'jon'                           0.0506434        1722     99
'thanks!'                       0.0882482         136     14
'xml'                           0.120668          278     41
'interesting.'                  0.127463           33      5
'to:addr:judell'                0.188539          949    238
'great!'                        0.203616           26      7
'lead'                          0.20497           173     48
'appreciate'                    0.206009           72     20
'also,'                         0.20929           172     49
'to:addr:mv.com'                0.214304          900    265
'contents'                      0.218866          103     31
'pages'                         0.273282          153     62
'article'                       0.273369          180     73
'header:Received:3'             0.283642          330    141
'platform'                      0.308406          102     49
'user'                          0.312891          169     83
'around'                        0.313592          288    142
'etc.'                          0.316824          136     68
'best.'                         0.320998           12      6
'process,'                      0.322662           43     22
'know'                          0.326371          797    417
'enjoyed'                       0.331886           17      9
'web'                           0.332325          720    387
'mention'                       0.333422           65     35
'point'                         0.333817          233    126
'"distributed'                  0.340883            2      1
'hosting'                       0.344502           46     26
'udell,'                        0.345753           51     29
'there'                         0.346756          778    446
'quite'                         0.346958          103     59
'were'                          0.361252          357    218
'which'                         0.363948          801    495
'time.'                         0.365849          151     94
'represent'                     0.368909           35     22
'having'                        0.377575          226    148
'hear'                          0.379814          112     74
'noheader:reply-to'             0.381912         3028   2021
'reply-to:none'                 0.381912         3028   2021
'cure'                          0.62087             5      9
'to:no real name:2**0'          0.624296         1508   2707
'probability'                   0.625744            6     11
'alone'                         0.626741           17     31
'such'                          0.627115          301    547
'ensure'                        0.631623           35     65
'accounts'                      0.635156           35     66
'url:com'                       0.639288         1312   2512
'prospects'                     0.644734            5     10
'send'                          0.646326          355    701
'number,'                       0.64719            11     22
'online'                        0.650194          228    458
'experience'                    0.650383           83    167
'proto:http'                    0.651725         1483   2998
'link'                          0.656753          221    457
'currently'                     0.658234          109    227
'header:Return-Path:1'          0.665302         1635   3511
'official'                      0.670506           29     64
'skip:1 10'                     0.671471           95    210
'best'                          0.672719          262    582
'pleasure'                      0.675451            7     16
'further'                       0.678206           93    212
'please'                        0.682766          747   1737
'witness'                       0.685674            2      5
'immediate'                     0.695489           40     99
'name'                          0.700578          170    430
'attempt'                       0.705123           20     52
'website'                       0.707746           66    173
'simply'                        0.71687            92    252
'subject:Jon'                   0.717693           33     91
'account'                       0.720615           86    240
'incredible'                    0.723674           14     40
'thoroughly'                    0.729777            5     15
'includes'                      0.729946           80    234
'professional'                  0.741174           38    118
'interest'                      0.742465          101    315
'absolutely'                    0.744416           31     98
'header:Mime-Version:1'         0.755206          230    767
'url:udell'                     0.756494          302   1014
'mr.'                           0.758107           22     75
'trial'                         0.774293           19     71
'email'                         0.788011          345   1386
'here'                          0.790386          387   1577
'contact'                       0.792903          179    741
'dear'                          0.79496            76    319
'toll'                          0.800125            7     31
'card'                          0.803112           40    177
'offer'                         0.807203           98    444
'free'                          0.80753           224   1016
'subject:About'                 0.809947            2     10
'educative.'                    0.83645             0      1
'unaccounted'                   0.83645             0      1
'visiting,'                     0.83645             0      1
'x-mailer:ximian evolution 1.0. 0.83645             0      1
'satisfied'                     0.841567            4     24
'header:Message-Id:1'           0.844496          315   1849
'offer.'                        0.851727            9     57
'maintenance,'                  0.861673            1      8
'ordering'                      0.864587            3     22
'skip:h 30'                     0.894696            1     11
'obtained'                      0.898357            2     21
'credit'                        0.902204           29    291
'"pay'                          0.902236            0      2
'supplemented'                  0.902236            0      2
'wish'                          0.905889           50    522
'assisting'                     0.909155            1     13
'check"'                        0.93028             0      3
'cancel'                        0.938587            3     53
'incase'                        0.945821            0      4
'juncture'                      0.945821            0      4
'servicing'                     0.945821            0      4
'24/7'                          0.981978            0     13
'charge.'                       0.991468            0     28

Former URL: http://weblog.infoworld.com/udell/2003/05/08.html#a684