SpamBayes update

I've been comparing notes with Tom Yager, who has noticed lately that spammers' use of nonsense words, especially in Subject: headers, seems to be effective against the Bayesian filter in OS X's Mail. I checked, and SpamBayes is (so far) unaffected by this ploy. One of the cool things about SpamBayes is its ability to reveal how it analyzes messages. See below for its take on a message that has the Subject: line "Jon nezinyunyane inflechies" and a bunch of angle-bracketed garbage in the text.

Evidently SpamBayes ignores all the garbage tags, but since these serve as word delimiters, it winds up seeing a bunch of word fragments -- like 'innov' and 'ative' -- which it finds suspicious. And as the spam counts indicate, it has seen these fragments before, so over time their discriminatory power should only grow, not diminish.
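To make the fragment effect concrete, here is a rough sketch of the behavior described above. It is not SpamBayes' actual tokenizer: the regular expression, the lowercasing, and the name rough_tokens are assumptions chosen for illustration. It treats anything in angle brackets as a delimiter and splits what is left on whitespace.

```python
import re

# Assumption: anything between angle brackets, real tag or garbage,
# is discarded and acts as a word break.
TAG = re.compile(r"<[^>]*>")

def rough_tokens(html_line):
    """Replace each tag with a space, lowercase, and split on whitespace."""
    return TAG.sub(" ", html_line).lower().split()

# One line taken from the message stream quoted at the end of this post.
line = "<KvynRHwW>innov<FkEhUrwj>ative product in Ame<SatnmBdYiP>rica today. Work"
print(rough_tokens(line))
```

The fragments this produces ('innov', 'ative', 'ame', 'rica') are the same ones that show up near the bottom of the clue list, each with zero ham sightings and dozens of spam sightings.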

It's also fascinating to look at the handling of the giveaway phrase "Multi-Trillion Dollar Market." "Multi-Trillion" does not appear on SpamBayes' list of interesting tokens, though "dollar" and "market" do. That SpamBayes makes no effort to correlate these adjacent words seems like an obvious limitation, and yet it is (so far) continuing to perform spectacularly well for me despite that.
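A sketch of what treating tokens independently looks like in practice: each token contributes one probability, and the probabilities are pooled with no notion of which words sat next to each other. The code below uses chi-squared combining, which is my understanding of how SpamBayes arrives at the '*H*' and '*S*' rows in the dump; take it as an approximation under that assumption, not as the project's code. The function names chi2q and combine are mine, and the five probabilities are a handful copied from the clue list, not the full set.

```python
import math

def chi2q(x2, v):
    """Upper-tail probability of a chi-squared variable with even v degrees of freedom."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combine(probs):
    """Pool per-token spam probabilities; returns (ham evidence, spam evidence, score)."""
    n = len(probs)
    ln_not_spam = sum(math.log(1.0 - p) for p in probs)
    ln_spam = sum(math.log(p) for p in probs)
    s = 1.0 - chi2q(-2.0 * ln_not_spam, 2 * n)  # near 1 when many tokens look spammy
    h = 1.0 - chi2q(-2.0 * ln_spam, 2 * n)      # near 1 when many tokens look hammy
    return h, s, (s - h + 1.0) / 2.0

# 'ative', 'innov', 'dollar', 'market.', 'there' from the clue list
probs = [0.994938, 0.994938, 0.917399, 0.736942, 0.357648]
h, s, score = combine(probs)
print(h, s, score)
```

Because the inputs are only summed, shuffling the list gives the same score, which is exactly the limitation noted above: "dollar" next to "market" counts for no more than "dollar" and "market" a paragraph apart.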

Update: As I keep forgetting for some reason, and as Giorgio Valoti reminds me, OS X's Mail uses latent semantic analysis, not the Bayesian technique.

Spam Score: 1

word                                spamprob         #ham  #spam
'*H*'                               0                   -      -
'*S*'                               1                   -      -
'jon-'                              0.0313807          38      1
'from:addr:jonathan'                0.06584             3      0
'noheader:mime-version'             0.267816         3682   1332
'there'                             0.357648         1865   1027
'web'                               0.359379         1678    931
'noheader:reply-to'                 0.398404         8311   5444
'reply-to:none'                     0.398404         8311   5444
'your'                              0.607781         3493   5354
'now'                               0.609287         1198   1848
'header:Date:1'                     0.614892         5565   8789
'header:From:1'                     0.616075         5536   8787
'live'                              0.617098          227    362
'subject:Jon'                       0.628519          123    206
"you've"                            0.635875          294    508
'potential'                         0.637791          171    298
'header:Received:6'                 0.639839          738   1297
'url:com'                           0.643368         3651   6515
'must'                              0.651722          330    611
'year.'                             0.654359          135    253
'proto:http'                        0.657505         4086   7759
'area'                              0.657698          183    348
'break'                             0.668753           75    150
'skip:m 10'                         0.671444          690   1395
'header:Return-Path:1'              0.698156         3807   8710
'join'                              0.72811           152    403
'market.'                           0.736942           76    211
'serious'                           0.779225           64    224
'header:Message-Id:1'               0.782263         1220   4336
'sell'                              0.789965           88    328
'life'                              0.850081          110    618
'walls'                             0.86821            15     99
'url:htm'                           0.870883          144    962
'to:addr:jon'                       0.871895           87    587
'url:index'                         0.877064          152   1074
'dollar'                            0.917399           19    211
'url:173'                           0.934783            0      3
'unique,'                           0.949503            5     97
'subject:\\xe9'                      0.95032             1     23
'pac'                               0.96723             3     94
'independence'                      0.969427            3    101
'earn'                              0.969434           11    352
'000'                               0.971088            3    107
'url:61'                            0.974407            1     46
'url:133'                           0.97619             0      9
'margin'                            0.977151            2     94
'$100,'                             0.977616            2     96
'url:41'                            0.981928            0     12
'ing'                               0.982844            2    126
'infor'                             0.987666            1     97
'ailability.'                       0.994822            0     43
'rofit'                             0.994822            0     43
'ative'                             0.994938            0     44
'innov'                             0.994938            0     44
'aking'                             0.99505             0     45
'busin'                             0.99505             0     45
'ited'                              0.99505             0     45
'pture'                             0.99505             0     45
'azing.'                            0.995156            0     46
'lim'                               0.995156            0     46
'ame'                               0.99545             0     49
'che'                               0.99545             0     49
''                                  0.995627            0     51
''                                  0.99579             0     53
'rica'                              0.996014            0     56
'ess'                               0.996937            0     73
'amearn'                            0.997592            0     93
'amed'                              0.997592            0     93
'fina'                              0.997592            0     93
'kage.'                             0.997592            0     93
'ncial'                             0.997592            0     93
'ney!'                              0.997592            0     93
'dre'                               0.997667            0     96
'mation'                            0.997738            0     99
'to:name:jon'                       0.998132            0    120
''                                  0.998351            0    136
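
The spamprob column for tokens never seen in ham can be reproduced with a small smoothing formula. The sketch below assumes a prior strength of 0.45 and a prior probability of 0.5, which I believe are SpamBayes' defaults for unknown words; the function name adjusted_spamprob is mine. I have checked it only against the zero-ham rows, where the raw spam ratio is 1.0 and no overall ham and spam message totals are needed.

```python
def adjusted_spamprob(raw_prob, n, s=0.45, x=0.5):
    """Blend the observed spam ratio with a prior.

    raw_prob: observed spam ratio for the token (1.0 if never seen in ham)
    n:        total sightings of the token (#ham + #spam)
    s, x:     assumed prior strength and prior probability
    """
    return (s * x + n * raw_prob) / (s + n)

# Rows from the table above, all with #ham == 0
for token, spam_count in [("url:173", 3), ("rofit", 43), ("ative", 44), ("to:name:jon", 120)]:
    print(token, round(adjusted_spamprob(1.0, spam_count), 6))
```

The prior matters less with each new sighting, so a fragment seen 3 times in spam sits near 0.93 while one seen 120 times sits above 0.998.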

Message Stream:

Date: Thu, 10 Jul 2003 21:08:32 -0800
Subject: Jon nezinyunyane inflechies
X-Sender: Cole Nickol <>
<p><font face="Trebuchet MS">Jon-</font></P>
<p> <nenahospodari><interesses></p>
<p><font face="Trebuchet MS">Ca<NLsQcKSV>pture Your Dre<bSVhiD>amEarn
Fina<interesses>ncial Independence</font></p>
<p><font face="Trebuchet MS">You can now for the first
<BJxQNXNYyQ>own a busin<Pqktbw>ess in your area with the most unique,
<KvynRHwW>innov<FkEhUrwj>ative product in Ame<SatnmBdYiP>rica today. Work
le<mQvuxWo>ss a week with the potential to earn
$100,<interesses>000 a year. There is no sell<QgBARjeQ>ing and not
ML<gtdyhs>M. Join a Multi-Trillion Dollar Market.</font></p>
<p><font face="Trebuchet MS">The p<KChHi>rofit margin is
<p><font face="Trebuchet MS"><nenahospodari>Break down the walls and live
this life you've only dre<interesses>amed about.</font></p>
<p><font face="Trebuchet MS">Lim<QkWkUgv>ited av<UPeOWb>ailability.
for Your Fr<interesses>ee infor<RCbUkNPVg>mation
<p><font face="Trebuchet MS"><oCgNBti><a
<p><font face="Trebuchet MS">Y<rcMWMN>ou must che<bkvKm>ck this out if you
are serious about m<TQYVM>aking mo<interesses>ney!</font></p>
<p><font face="Trebuchet MS">O<interesses>pt o<yxXafJRXuQ>ut at
