I've been a faithful SpamAssassin user for a long time. Never thought much about how it worked, just been happy. But a lot of spam has been leaking through over the last month, so I went looking for a tuneup.

Vipul's Razor
Collaborative spam database. On Debian just do apt-get install razor and you're razoring. More work required if you want to report spam.
Pyzor
Another collaborative spam database. apt-get install pyzor.
Bayesian filter
Bayesian spam filtering implementation for SpamAssassin. Requires 1000+ training messages. Train it with sa-learn --ham --mbox archive.mbox
I feel nervous mixing all these methods; I just hope SpamAssassin sorts them out. This spam detection rate discussion is vaguely interesting.

I tried my new spamassassin setup on 594 emails my old spamassassin setup said were not spam. The new setup correctly identified 342 as spam and 247 as non-spam. It identified 5 messages as non-spam when in fact they were spam, and a reassuring 0 messages as spam which were not spam. This is all excellent!
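To put those counts in the usual terms, here is a minimal sketch that turns them into rates. The function name and dictionary keys are made up for illustration; only the four counts come from the test above.

```python
def filter_metrics(spam_caught, ham_passed, spam_missed, ham_flagged):
    """Summarize a spam filter test from its four outcome counts."""
    total = spam_caught + ham_passed + spam_missed + ham_flagged
    actual_spam = spam_caught + spam_missed
    actual_ham = ham_passed + ham_flagged
    return {
        "total": total,
        # Share of real spam that the filter caught.
        "spam_caught_rate": spam_caught / actual_spam,
        # Share of real non-spam that was wrongly flagged as spam.
        "false_positive_rate": ham_flagged / actual_ham,
        # Share of all messages classified correctly.
        "accuracy": (spam_caught + ham_passed) / total,
    }

# Counts from the 594-message test: 342 caught, 247 passed, 5 missed, 0 wrongly flagged.
metrics = filter_metrics(342, 247, 5, 0)
for name, value in metrics.items():
    print(name, round(value, 4))
```

The false positive rate is the number to watch: a missed spam costs a keystroke, a flagged real message can cost a conversation.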

Of the 342 newly-found spam, 303 were caught by the Bayesian filter, 172 by Pyzor, and 160+ by Razor.
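A quick bit of arithmetic on those numbers, as a sketch: the per-method counts add up to more than 342, so plenty of messages were flagged by more than one method. The variable names are mine, and Razor's count is treated as a lower bound since it was "160+".

```python
newly_found_spam = 342
caught_by = {"Bayesian": 303, "Pyzor": 172, "Razor": 160}  # Razor is at least 160

# Fraction of the newly found spam that each method flagged.
shares = {name: count / newly_found_spam for name, count in caught_by.items()}

# Hits beyond one per message, which must come from messages flagged by several methods.
overlapping_hits = sum(caught_by.values()) - newly_found_spam

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print("hits shared between methods: at least", overlapping_hits)
```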

Update 2004-01-16: I made a boneheaded mistake in this evaluation. I trained the Bayesian filter on some data, then tested it on the same data. D'oh. The reality is the Bayesian filter is still much better than without, but not quite as stellar.
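One way to avoid that mistake is to hold some mail back before training. Below is a rough sketch using Python's standard mailbox module; the function name, file names, and the 20% default are my own assumptions, not part of any SpamAssassin tooling.

```python
import mailbox
import random

def split_mbox(src_path, train_path, test_path, test_fraction=0.2, seed=0):
    """Copy messages from one mbox into separate training and test mboxes."""
    rng = random.Random(seed)
    src = mailbox.mbox(src_path)
    train = mailbox.mbox(train_path)   # created if it does not exist
    test = mailbox.mbox(test_path)
    try:
        keys = list(src.keys())
        shuffled = keys[:]
        rng.shuffle(shuffled)
        n_test = int(len(keys) * test_fraction)
        test_keys = set(shuffled[:n_test])
        for key in keys:
            # Each message goes to exactly one of the two output files.
            target = test if key in test_keys else train
            target.add(src[key])
        train.flush()
        test.flush()
    finally:
        src.close()
        train.close()
        test.close()
    return len(keys) - n_test, n_test
```

With something like split_mbox("archive.mbox", "train.mbox", "test.mbox"), the filter would be trained on train.mbox only (sa-learn --ham --mbox train.mbox) and then scored on test.mbox, which it has never seen.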

  2003-12-31 00:38 Z