
March 10, 2005

tip on thunderbird junk filtering

Here is a tip for using Thunderbird's junk-mail filtering. I have no technical explanation for why this works -- and it may not work at all for others, but I've been testing the junk-mail feature since it was born and this has definitely worked for me.

Do not overtrain. This is perhaps the most important tip about using junk-mail controls that I can offer. In the early days, I thought that more training was better and so I'd mark everything as spam, including multiple identical emails. I figured that I'd really get it good by using this aggressive technique -- even keeping a massive and growing spam corpus in a special folder so I could retrain new accounts against it. I was re-flagging as junk any that the filters were catching, doing whatever I could to train it, and train it good!

Well, what I found was that with this massive training I got up to 80-90% spam recognition almost immediately -- in a day or two -- but I wasn't ever able to get my junk-mail controls to flag more than about 90% of the junk I was receiving. Even months in, with thousands of messages marked as junk, I couldn't break 90%, so I tried a new approach: minimal training.

I started with a clean slate, and only marked a few very distinct emails as junk on the first day, trying to avoid marking duplicates or even similar spams. Each morning I'd look at my inbox and compare with my junk folder to avoid any overlap, marking a few more distinct spams as junk. It took me about a week to get up to where I'd gotten with the massive training approach in just a couple of days -- but! in another week of following the minimal training technique I was up over 90% recognition and in two weeks I was bordering on 95%. I'm about 3 months into this, manually flagging a mail about once a week, and in the last week, I calculate that my junk-mail controls have nabbed 97% of the spam that's hitting my inbox.
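To make the "only mark distinct spams" idea concrete, here is a rough sketch in Python. It is not how Thunderbird works internally, and it is not something I actually ran; it just illustrates the selection rule described above. The function names, the word-overlap (Jaccard) similarity measure, the 0.5 threshold and the limit of three messages per day are all assumptions made up for illustration.

```python
import re


def tokens(text):
    """Return the set of lowercased words in a message body."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pick_distinct(missed_spam, already_trained, threshold=0.5, limit=3):
    """Pick up to `limit` missed spams that are not near-duplicates.

    A message is skipped when its word overlap with any message that was
    already trained on (or already picked in this pass) reaches
    `threshold`. Both `threshold` and `limit` are hypothetical knobs.
    """
    seen = [tokens(m) for m in already_trained]
    picked = []
    for msg in missed_spam:
        t = tokens(msg)
        if all(jaccard(t, s) < threshold for s in seen):
            picked.append(msg)
            seen.append(t)
            if len(picked) == limit:
                break
    return picked


if __name__ == "__main__":
    trained = ["cheap pills online buy now"]
    missed = [
        "cheap pills online buy now today",      # near-duplicate of trained
        "you won the lottery claim your prize",  # new kind of spam
        "you won the lottery claim your prize",  # exact repeat
    ]
    for msg in pick_distinct(missed, trained):
        print("mark as junk:", msg)
```

In this toy run only the lottery message would be marked once; the pill spam and the repeat are left alone, which is the whole point of the minimal approach.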

You may not think that it's worth the careful management to make the improvement from 90% to above 95% -- and that may be the case for some users, but for me, that means a long-term time savings that will be quite dramatic.

If you have tips for managing email in Thunderbird, I'd love to hear them. Next week I'll try to do another Thunderbird tip installment where I talk about using virtual folders.

Posted by asa at March 10, 2005 12:36 PM
Comments

As a major receiver of spam, and supporter of both Thunderbird, and open source in general, I would like to know if a marriage between two open source projects would be possible.

I use Thunderbird daily. I love it, and I'm not ready to trade it for anything. However, I don't touch the junk mail controls at all. For junk mail, I use Popfile (http://popfile.sourceforge.net), quite simply because I've found it to be better.

The junk mail controls in Thunderbird are good for the general public, and much more usable / readily accessible than those of the web-based UI that Popfile sports. Still, Thunderbird seems to lack where it counts.

You mention 90 - 97% recognition, the last being an estimate. Having trained Popfile for a couple of months now, I'm getting an accuracy of 99.11%. That is based on 13,350 classified emails (classified meaning emails Popfile has labelled junk or not junk). 118 of these classified emails were errors. Out of these 118, 16 were false positives that had mistakenly been classified as spam. These statistics were generated by the Popfile interface.
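For anyone who wants to check the arithmetic, the figures above can be reproduced with a few lines of Python. This is only a sketch of the calculation, assuming accuracy means correctly classified emails divided by all classified emails; the function name is made up, and the Popfile interface may round or truncate slightly differently.

```python
def accuracy_pct(classified, errors):
    """Percentage of classified emails that were classified correctly."""
    return 100.0 * (classified - errors) / classified


classified = 13350       # emails Popfile labelled junk or not junk
errors = 118             # total misclassifications
false_positives = 16     # good mail wrongly labelled as spam

# Whatever is not a false positive must be spam that slipped through.
false_negatives = errors - false_positives

print(f"accuracy: {accuracy_pct(classified, errors):.2f}%")
print(f"false negatives: {false_negatives}")
print(f"false positive rate: {100.0 * false_positives / classified:.2f}%")
```

The result lands at roughly 99.1%, in line with the number reported, with the remaining errors being spam that was missed rather than good mail that was lost.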

While the above statistics do not take into account that I might have missed a few pieces of junk / not junk emails, it's still far more impressive, and that is based on a LOT of training.

What I'd really love would be a marriage of Popfile and Thunderbird. The strength of Popfile and the flexibility / ease of use from Thunderbird.

Possible?

Posted by: Joen on March 10, 2005 01:21 PM

So, do you mean you don't correct it for all junk it misses? Say if it lets three spams into your inbox - do you only correct it on the one you consider most different from previous e-mails?

Posted by: David Naylor on March 10, 2005 01:22 PM

Asa, that's such counterintuitive behaviour I think you should file a bug on that one. I seem to recall problems with the spam filtering after marking an email erroneously tagged as spam as non-spam. The hit rate seemed to plummet thereafter.

Posted by: Phil Randal on March 10, 2005 01:35 PM

This would be a bug in the Junk Filter if this is true...
The Junk Filter should internally optimize the training.dat, and such things shouldn't happen....

Posted by: Matti on March 10, 2005 01:53 PM

I'd have to agree with Asa's first findings. I've used Thunderbird for about a year and just recently switched to having my mail routed through Gmail first. Thunderbird just wasn't doing well at sorting out the junk anymore. I thought maybe I hit a limit of how much it could handle. I also was frustrated at how it handled all the spam when I had to connect via dial up. Poor Thunderbird just couldn't handle that!!

I may try your technique and see how well it works out.

Posted by: Eliot on March 10, 2005 02:22 PM

Personally, I wish we could either hook up with the SpamBayes folks and exactly duplicate the method they are using (including having an unknown/maybe option rather than just spam or ham) *OR* work with SpamBayes to get a plugin for Thunderbird similar to the one for Outlook. When I've used SpamBayes, I've routinely seen well over 99% accuracy with no false positives. I think the Thunderbird Junk Mail filter falls short, myself.

Posted by: John T. Haller on March 10, 2005 02:25 PM

Same here -- I've long given up on Thunderbird's spam filter, because I found that K9 (a locally installed mail proxy) works much better. 99.37%, about 800 mails, 5 false negatives, 0 false positives. (I started with a pre-sorted set of junk/non-junk mails after rebuilding my mail profile; that's why there were no false positives.) And I am flagging every single mail that the filter didn't catch.

I was curious if there indeed _was_ a way to improve Thunderbird's junk filter, as I would like to use it without any additional filtering software. But listening to your technique, Asa, made me wonder why anyone would make such an effort. I have no time to parent my junk filter. File it as a bug.

Posted by: mardoen on March 10, 2005 02:36 PM

I also think this is a bug ...
When I was using Eudora, the spam filtering was more efficient ...

Posted by: jimich on March 11, 2005 02:10 AM

I have an interesting situation where I have two computers receiving email from the same account. One has been receiving email for a long time and I've marked all spam as such. The other setup is much younger. I started noticing several months ago that, though I've been religious about marking spam on both of them, the younger setup was catching more of it while the older setup was letting more slip by.
This seems to give credence to the idea that overtraining does, in fact, lead to less accuracy.

Posted by: Mark on March 11, 2005 10:29 PM

Another "me too" -- I uninstalled Thunderbird from my wife's desktop, and put Mozilla in its place, precisely because Thunderbird was doing a bad job catching spam. I had indeed overtrained it, and I'd been wondering if perhaps I exceeded some limit, since I had (what seemed to me to be) a lot of spams -- perhaps 4,000.

Posted by: Eric on March 12, 2005 03:53 PM

When will someone, Please!, just do a Copy & Paste of, for example, the SpamAssassin (http://spamassassin.apache.org/) code, "The Powerful #1 Open-Source Spam Filter", into Thunderbird and Seamonkey and be done with it, and happy with it! Good grief...

Thank you,
Eddie Maddox

Posted by: Eddie Maddox on March 13, 2005 05:15 AM
