I can't post the whole review yet, but neither can I resist reporting here what I think are remarkable results. I've been a skeptic when it comes to content-filtering solutions. I thought about driving down to Cambridge in January for the Spam Conference, for example, but at the time felt that its narrow focus on filtering -- to the exclusion of many alternatives, including strong identity management -- was short-sighted. Now I think I'm the one who was short-sighted.
Here are the first two paragraphs from a particularly interesting spam:
I came across your web site- 'http://jonudell.net/', the official website of "Jon Udell". I found your website to be very impressive and I went through the contents of your website which were quite interesting. Your article on "Distributed HTTP" has been great! I thoroughly enjoyed browsing through various web pages via the interactive links (InfoWorld) given in your site. The link to 'Analysis | XML alone won't cure Web security ills' has been very educative. I give you all the credit for creating such an incredible site and wish you all the best. I believe that you can enrich your prospects by having a better hosting platform for your website.
I represent xxxx.com a professional web hosting and a web design provider currently servicing over 75000 customers world wide and we are currently promoting a trial offer. I want to offer you 1 Year of web hosting absolutely FREE OF CHARGE. This is our attempt to project an important juncture or probability for you to move on to a better web host.
What's interesting is that this spammer has spidered my personal home page in order to gather vocabulary ("xml," "distributed," "platform") typical of legitimate mail to me. This is precisely the kind of tactic that you'd think might fool a Bayesian filter, which looks for positive as well as negative evidence. It did, in fact, fool SpamAssassin.
For SpamBayes, in this case, the "hammy" words did help counteract the "spammy" words, yielding a score of only 80%. That was enough uncertainty to land the message in my MaybeSpam folder. After I declared it as spam, it rescored to 99%.
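The three-way outcome described above (ham, unsure, spam) can be sketched as a pair of cutoffs. This is only an illustration: the function name `route`, the folder names other than MaybeSpam, and the cutoff values 0.20 and 0.90 are assumptions, not settings taken from my configuration.

```python
def route(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Pick a folder for a message given its spam score in [0, 1].

    The cutoffs are assumed values for illustration only.
    """
    if score >= spam_cutoff:
        return "Spam"
    if score <= ham_cutoff:
        return "Inbox"
    # Anything in between is uncertain and left for a human to judge.
    return "MaybeSpam"


if __name__ == "__main__":
    for s in (0.80, 0.99, 0.05):
        print(s, route(s))
```

Under these assumed cutoffs, a score of 0.80 lands in MaybeSpam and a rescored 0.99 goes to Spam. The middle band is what keeps the number of messages needing manual review small.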
Paul Graham:
I think it's possible to stop spam, and that content-based filters are the way to do it. The Achilles heel of the spammers is their message. They can circumvent any other barrier you set up. They have so far, at least. But they have to deliver their message, whatever it is. If we can write software that recognizes their messages, there is no way they can get around that. [Paul Graham: A Plan for Spam]
It's hard, at first, to see how SpamBayes can possibly work. When you look at a message from SpamBayes' point of view, you see a different and far more granular approach than SpamAssassin's, which reports things like:
PENIS_ENLARGE (2.2 points) BODY: Information on getting a larger penis
SpamBayes doesn't know any of these rules. It just knows what I want to see, and what I don't want to see. It knows because I show it a bunch of positive and negative examples up front, and then refine its understanding of my wishes continuously as I process my (surprisingly few) MaybeSpam messages.
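To make "show it examples" concrete, here is a minimal sketch of training by counting tokens. It is not SpamBayes' actual code: the class name `TokenCounts`, the whitespace tokenizer, and the choice to count each token at most once per message are all assumptions made for illustration.

```python
from collections import Counter


class TokenCounts:
    """Per-token ham and spam message counts (a hypothetical helper)."""

    def __init__(self):
        self.ham = Counter()    # token -> number of ham messages containing it
        self.spam = Counter()   # token -> number of spam messages containing it
        self.nham = 0           # total ham messages trained
        self.nspam = 0          # total spam messages trained

    def train(self, text, is_spam):
        # Naive tokenizer (assumption): lowercase, split on whitespace,
        # and count each distinct token once per message.
        tokens = set(text.lower().split())
        if is_spam:
            self.spam.update(tokens)
            self.nspam += 1
        else:
            self.ham.update(tokens)
            self.nham += 1

    def counts(self, token):
        """Return (#ham, #spam) for a token."""
        return self.ham[token], self.spam[token]


if __name__ == "__main__":
    tc = TokenCounts()
    tc.train("xml platform article xml", is_spam=False)
    tc.train("free hosting offer platform", is_spam=True)
    print(tc.counts("platform"), tc.counts("xml"), tc.counts("free"))
```

Filing a MaybeSpam message as spam or ham amounts to one more call to `train`, which is why the filter keeps adapting as mail is processed.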
When I sent the review to my InfoWorld colleagues, I sent it twice:
SpamAssassin jumped all over the first message. But SpamBayes knew that neither was anything to worry about. The "spammy" clues were strong but the "hammy" evidence completely overwhelmed them -- in ways that are specific to my own unique patterns of communication.
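One way to see how hammy evidence can outweigh strong spammy clues is the combining rule from Paul Graham's "A Plan for Spam". This is Graham's formula, not the combining scheme SpamBayes itself uses, and the probabilities below are made-up illustrative values rather than the real clues for either message.

```python
from math import prod


def combine(probs):
    """Graham-style combination of per-token spam probabilities.

    Each p must lie strictly between 0 and 1. For long lists a real
    implementation would work in log space to avoid underflow.
    """
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)


if __name__ == "__main__":
    spammy = [0.99, 0.98]               # two strong spam clues (illustrative)
    hammy = [0.05, 0.05, 0.12, 0.19]    # four personal ham clues (illustrative)
    print(combine(spammy))
    print(combine(spammy + hammy))
```

With only the spammy clues the combined score sits very close to 1. Adding a handful of tokens that are strongly associated with my legitimate mail pulls it below 0.5.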
You get some of this effect with Mac OS X's Mail app, but it doesn't feel like a complete solution to me. SpamBayes, as implemented for Outlook by Mark Hammond, does. I asked Mark if I could send him a PayPal contribution. He said: "No, it would be inappropriate for this project, as so many people smarter than I worked on the back end." Fair enough. Thanks to all of them for a job well done!
Update: I just got a phone call from a PR representative wanting to tell me about IronPort's messaging gateway, SenderBase service, and bonded sender program. It's interesting stuff. We talked some about client versus server solutions, and finally she asked: "So, how did my email message to you score?" I went back and looked: 64%. Slightly spammy, but not over the threshold. Here were the spammiest clues:
word spamprob #ham #spam
'truste' 0.955709 0 5
'affiliate' 0.963994 3 93
'spam' 0.965056 14 424
'7000' 0.974385 0 9
'forever!' 0.99203 0 30
She took notes. We were both surprised to see that the word TrustE has so far shown up in 5 spams and no hams (until this one).
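A rough sketch of how a per-word spamprob could be derived from #ham and #spam counts, using Gary Robinson's smoothing formula. Everything specific here is an assumption: the function name `spamprob`, the strength `s=0.45`, the prior `x=0.5`, and the corpus totals. The result is not expected to reproduce the reported figures exactly, since the real totals and any further adjustments are not known from this post.

```python
def spamprob(hamcount, spamcount, nham, nspam, s=0.45, x=0.5):
    """Smoothed spam probability for one token.

    hamcount/spamcount: messages of each kind containing the token.
    nham/nspam: total trained messages of each kind (both must be > 0).
    s, x: assumed prior strength and prior probability for rare tokens.
    """
    n = hamcount + spamcount
    if n == 0:
        return x  # never-seen token: fall back to the prior
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    p = spamratio / (hamratio + spamratio)
    # Pull the raw ratio toward the prior; the pull weakens as n grows.
    return (s * x + n * p) / (s + n)


if __name__ == "__main__":
    # A token seen in 5 spams and 0 hams, with assumed equal corpus sizes.
    print(spamprob(0, 5, nham=1000, nspam=1000))
```

The smoothing is why a word seen in five spams and no hams scores high but not 1.0: a few sightings are strong evidence, not proof.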
Spam Score: 0.993749
word spamprob #ham #spam
'url:roninhouse' 0.0461277 81 4
'jon' 0.0506434 1722 99
'thanks!' 0.0882482 136 14
'xml' 0.120668 278 41
'interesting.' 0.127463 33 5
'to:addr:judell' 0.188539 949 238
'great!' 0.203616 26 7
'lead' 0.20497 173 48
'appreciate' 0.206009 72 20
'also,' 0.20929 172 49
'to:addr:mv.com' 0.214304 900 265
'contents' 0.218866 103 31
'pages' 0.273282 153 62
'article' 0.273369 180 73
'header:Received:3' 0.283642 330 141
'platform' 0.308406 102 49
'user' 0.312891 169 83
'around' 0.313592 288 142
'etc.' 0.316824 136 68
'best.' 0.320998 12 6
'process,' 0.322662 43 22
'know' 0.326371 797 417
'enjoyed' 0.331886 17 9
'web' 0.332325 720 387
'mention' 0.333422 65 35
'point' 0.333817 233 126
'"distributed' 0.340883 2 1
'hosting' 0.344502 46 26
'udell,' 0.345753 51 29
'there' 0.346756 778 446
'quite' 0.346958 103 59
'were' 0.361252 357 218
'which' 0.363948 801 495
'time.' 0.365849 151 94
'represent' 0.368909 35 22
'having' 0.377575 226 148
'hear' 0.379814 112 74
'noheader:reply-to' 0.381912 3028 2021
'reply-to:none' 0.381912 3028 2021
'cure' 0.62087 5 9
'to:no real name:2**0' 0.624296 1508 2707
'probability' 0.625744 6 11
'alone' 0.626741 17 31
'such' 0.627115 301 547
'ensure' 0.631623 35 65
'accounts' 0.635156 35 66
'url:com' 0.639288 1312 2512
'prospects' 0.644734 5 10
'send' 0.646326 355 701
'number,' 0.64719 11 22
'online' 0.650194 228 458
'experience' 0.650383 83 167
'proto:http' 0.651725 1483 2998
'link' 0.656753 221 457
'currently' 0.658234 109 227
'header:Return-Path:1' 0.665302 1635 3511
'official' 0.670506 29 64
'skip:1 10' 0.671471 95 210
'best' 0.672719 262 582
'pleasure' 0.675451 7 16
'further' 0.678206 93 212
'please' 0.682766 747 1737
'witness' 0.685674 2 5
'immediate' 0.695489 40 99
'name' 0.700578 170 430
'attempt' 0.705123 20 52
'website' 0.707746 66 173
'simply' 0.71687 92 252
'subject:Jon' 0.717693 33 91
'account' 0.720615 86 240
'incredible' 0.723674 14 40
'thoroughly' 0.729777 5 15
'includes' 0.729946 80 234
'professional' 0.741174 38 118
'interest' 0.742465 101 315
'absolutely' 0.744416 31 98
'header:Mime-Version:1' 0.755206 230 767
'url:udell' 0.756494 302 1014
'mr.' 0.758107 22 75
'trial' 0.774293 19 71
'email' 0.788011 345 1386
'here' 0.790386 387 1577
'contact' 0.792903 179 741
'dear' 0.79496 76 319
'toll' 0.800125 7 31
'card' 0.803112 40 177
'offer' 0.807203 98 444
'free' 0.80753 224 1016
'subject:About' 0.809947 2 10
'educative.' 0.83645 0 1
'unaccounted' 0.83645 0 1
'visiting,' 0.83645 0 1
'x-mailer:ximian evolution 1.0. 0.83645 0 1
'satisfied' 0.841567 4 24
'header:Message-Id:1' 0.844496 315 1849
'offer.' 0.851727 9 57
'maintenance,' 0.861673 1 8
'ordering' 0.864587 3 22
'skip:h 30' 0.894696 1 11
'obtained' 0.898357 2 21
'credit' 0.902204 29 291
'"pay' 0.902236 0 2
'supplemented' 0.902236 0 2
'wish' 0.905889 50 522
'assisting' 0.909155 1 13
'check"' 0.93028 0 3
'cancel' 0.938587 3 53
'incase' 0.945821 0 4
'juncture' 0.945821 0 4
'servicing' 0.945821 0 4
'24/7' 0.981978 0 13
'charge.' 0.991468 0 28
Former URL: http://weblog.infoworld.com/udell/2003/05/08.html#a684