June 22, 2003

Trying Out SpamBayes

Word to the wise... never post an email address that you value on the web or Usenet. I think Usenet is especially toxic. My black-coffee2002@karavshin.org account receives one spam per hour on average. It's driving me nuts.

For a long time I ignored it.

Then I turned on server-side SpamAssassin. But it doesn't identify many spams, and furthermore, nothing filters out the ones it does identify. So they just plop into my inbox anyway.

Six months ago I read a really fascinating article on using Bayesian statistical methods to filter spam.

Apparently a lot of other people did, too, and in short order tons of programs were written to implement this kind of filtering.

I run Outlook 2002 on Windows XP. Many of the interesting prototype projects were written for Unix-based platforms. I'm not dumping Outlook for a command-line based mail tool. [I would like to find a better replacement for Outlook.] So all of these were basically useless to me.

Then one day 31die pointed out: SpamBayes: Bayesian anti-spam classifier written in Python. The exciting thing is that it's designed to work inside Outlook!

On my first attempt, I didn't see the Windows binaries. So I ended up installing Python so that I could build it myself. The fun of that wore off quickly, and it languished on my PC for several weeks. Just this afternoon I decided to see if I could find a binary for this program, and skip all the compilation and makefiles and rubbish.

Sure enough, I did.

Although warned that the codebase was somewhat out-of-date, I installed it anyway.

Installation took almost no time. The directions were very clear. And after fifteen minutes of tweaking, re-testing, and categorizing, SpamBayes has a set of rules that, at least, spotted the first message I received this hour.

One interesting thing you can do is pick an arbitrary message and ask it "why are you giving it this Spam Score?" Then it gives you a list of all the tokens it found in that message and, for each token, the probability of a message being spam if that token is found.

Curiously, a lot of significant tokens are found in the mail header itself, not the body that you read. For example: 'noheader:return-path' scores a 0.9907.
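To get a feel for where a per-token number like that could come from, here is a rough sketch in Python. It is not SpamBayes's actual code: the function name and the toy messages are made up for illustration, and I'm assuming the simplest possible estimate (how often a token shows up in spam versus legitimate mail, with no smoothing for rare tokens).

```python
from collections import Counter

def token_spam_probabilities(spam_msgs, ham_msgs):
    """Estimate, per token, the chance that a message containing it is spam.

    Hypothetical helper for illustration only. Each message is a list of
    tokens; a token is counted at most once per message.
    """
    if not spam_msgs or not ham_msgs:
        raise ValueError("need at least one spam and one legitimate message")

    spam_counts = Counter()
    ham_counts = Counter()
    for tokens in spam_msgs:
        spam_counts.update(set(tokens))
    for tokens in ham_msgs:
        ham_counts.update(set(tokens))

    probs = {}
    for token in set(spam_counts) | set(ham_counts):
        # Fraction of spam / legitimate messages that contain the token.
        spam_rate = spam_counts[token] / len(spam_msgs)
        ham_rate = ham_counts[token] / len(ham_msgs)
        probs[token] = spam_rate / (spam_rate + ham_rate)
    return probs

# Toy corpus (invented data): two spams, two legitimate messages.
spam = [["this", "offer"], ["this", "lot"]]
ham = [["hello", "this"], ["hello", "lunch"]]
probs = token_spam_probabilities(spam, ham)
```

A token seen only in spam ends up at 1.0 and one seen only in legitimate mail at 0.0, which is why a real filter would presumably temper these estimates for tokens it has seen only a few times.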

For the time being, I am keeping my 'Corpus of Spam.' This allows me to re-run the rule base easily. This will be helpful if I change systems, upgrade, etc.

Now I am sitting here like a little kid waiting for a spam to arrive so that I can see how it's dealt with.

In the meantime I've been scanning over old messages examining the tokens' significance.

One amusing thing, for example, is the 'fucking' token -- it only scores a 0.03 spam weight. Meaning... it's found much more often in my legitimate emails than spam! However, words like "this", "lot", and "more" are some of the highest-scoring spam identifier words. Strange. But that's the beauty of Bayesian Spam Filtering -- it identifies real patterns that you'd never recognize or even understand.
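For the curious, here is a minimal sketch of how a handful of token scores could be rolled up into one score for the whole message. This is the plain "naive Bayes" combination, which treats tokens as independent; I am not claiming it is the formula SpamBayes itself uses. The function name and the clamp value are my own assumptions.

```python
import math

def combine_token_scores(token_probs, clamp=0.001):
    """Combine per-token spam probabilities into one message score.

    Hypothetical illustration: assumes tokens are independent and spam and
    legitimate mail are equally likely beforehand. Works in log space so
    long messages don't underflow.
    """
    log_odds = 0.0
    for p in token_probs:
        # Keep probabilities away from exactly 0 or 1 so log() is defined.
        p = min(max(p, clamp), 1.0 - clamp)
        log_odds += math.log(p) - math.log(1.0 - p)

    # Numerically stable logistic function of the summed log-odds.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)

# The two token scores quoted in this post: a strong spam clue from the
# header and a strong legitimate-mail clue from the body.
score = combine_token_scores([0.9907, 0.03])
```

With this scheme a strongly legitimate token pulls the total down and a strongly spammy one pushes it up, so a single curse word won't condemn a message on its own.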

I did notice that Outlook acted a bit weird after I installed SpamBayes (namely, the auto-fill feature was displaying strange text). A restart cleared that up. Outlook seems to start a lot more slowly, too.


Posted by Nils Blutig at June 22, 2003 04:58 PM