I’ve been using bogofilter against my spam for about a month now, and the results are looking good.
It’s catching a much higher percentage of my spam than SpamAssassin was, and I’ve only had one false positive. Although any number of false positives is a major problem, this one doesn’t concern me for two reasons.
Firstly, SpamAssassin was giving me at least one false positive every couple of days. I get a lot of solicited commercial email, including quite a lot of finance-related news. SpamAssassin had a nasty habit of assuming that anything that talked about mortgages etc. was spam. I trained bogofilter specifically against archives of this mail, and so far it hasn’t marked any of it as spam.
Secondly, the one false positive was rather an odd case. About a year ago I released a Perl module, Games::Boggle, that finds words on a Boggle board.
Recently I received email from a user who was having difficulty getting a script using it to work. With this script he included the entire dictionary file he was running the script against!
According to the theory of pseudo-Bayesian spam filtering, the spam detector should only pay attention to the 10 (or whatever) most significantly ham or spam words in your message (which is why the new wave of “include random phrases” or “include a chapter of a book” emails aren’t really causing me any difficulty). However, if there are more than 10 “definitely spam” and more than 10 “definitely ham” words, I’m guessing such filters don’t cope very well…
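To make that guess concrete, here is a rough Python sketch of the “10 most significant words” idea. It is only an illustration and not bogofilter’s actual code: the function names, the per-word probabilities and the cutoff of 10 are all made up for the example, and the combining rule is the simple naive-Bayes product that these filters are usually described with, which may not be what bogofilter really does.

```python
from math import prod

def most_significant(word_probs, n=10):
    """Pick the n spam probabilities that lie farthest from the neutral 0.5.

    sorted() is stable, so words that are equally extreme keep their
    original order in the message.
    """
    return sorted(word_probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]

def spam_score(word_probs, n=10):
    """Combine the n most significant probabilities into a single score."""
    top = most_significant(word_probs, n)
    spam = prod(top)                   # evidence that the message is spam
    ham = prod(1.0 - p for p in top)   # evidence that the message is ham
    return spam / (spam + ham)

# Hypothetical per-word spam probabilities, invented for illustration.
STRONG_SPAM, STRONG_HAM, NEUTRAL = 0.875, 0.125, 0.4

# Spam padded with a chapter of a book: the filler never makes the top 10.
padded_spam = [STRONG_SPAM] * 12 + [NEUTRAL] * 500

# A whole dictionary: plenty of strong words on both sides, equally extreme.
spam_words_first = [STRONG_SPAM] * 200 + [STRONG_HAM] * 200
ham_words_first = [STRONG_HAM] * 200 + [STRONG_SPAM] * 200

print(spam_score(padded_spam))
print(spam_score(spam_words_first))
print(spam_score(ham_words_first))
```

In this toy version the padded spam still scores as near-certain spam, because the filler words are never among the 10 that count. The two dictionary messages contain exactly the same words, yet one scores as near-certain spam and the other as near-certain ham, purely because of which equally extreme words happen to be picked first. That is the sort of arbitrary result I suspect happened with the dictionary file.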