TNPI - SpamAssassin tuning

SpamAssassin is giving me low Bayesian scores, what can I do?

A: It looks like SpamAssassin's Bayesian filtering isn't autolearning all of the spam that maildrop discards, even though autolearn is enabled.

SpamAssassin will not autolearn a message as spam unless it scores 12.0 or above *and* at least 3.0 of those points come from body rules and at least another 3.0 from header rules. Usually, a message that scores 12 also satisfies the body/header requirement. If the spam triggers custom SA rules of your own, their scores do not count towards the autolearn threshold, only towards maildrop's discard threshold.
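The interaction between the two thresholds is easier to see with numbers. The following is a minimal sketch of the rule as described above, not SpamAssassin's actual code: the function names, the `custom_points` parameter and the split of a score into header, body and custom points are assumptions made for illustration only.

```python
# Simplified model of the autolearn rule described in the text.
# All names here are hypothetical; this is not how SpamAssassin is implemented.

AUTOLEARN_SPAM_THRESHOLD = 12.0  # score needed before a message is autolearned as spam
MIN_HEADER_POINTS = 3.0          # minimum points from header rules
MIN_BODY_POINTS = 3.0            # minimum points from body rules
DISCARD_THRESHOLD = 12.0         # maildrop discards at this total score


def would_be_discarded(header_points, body_points, custom_points=0.0):
    """maildrop sees the full score, custom rules included."""
    return header_points + body_points + custom_points >= DISCARD_THRESHOLD


def would_autolearn_as_spam(header_points, body_points, custom_points=0.0):
    """Autolearn ignores custom_points (per the text) and needs both minimums."""
    learn_score = header_points + body_points  # custom_points deliberately left out
    return (
        learn_score >= AUTOLEARN_SPAM_THRESHOLD
        and header_points >= MIN_HEADER_POINTS
        and body_points >= MIN_BODY_POINTS
    )


if __name__ == "__main__":
    # A spam carried over the discard threshold mostly by a custom rule:
    # it is discarded, but the Bayes database never learns from it.
    print(would_be_discarded(4.0, 5.0, custom_points=6.0))
    print(would_autolearn_as_spam(4.0, 5.0, custom_points=6.0))
```

In this model a message with 4 header points, 5 body points and 6 points from a custom rule totals 15 and is discarded, yet only 9 points count towards autolearning, so it is never learned. That is the gap the rest of this page addresses.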

Because of this, if you have high-scoring custom rules in SA and your toaster handles a relatively small amount of mail (say, just a few hundred accounts), chances are that your Bayesian token database will be too small or not relevant enough to give spam the high Bayes score it deserves. When SA starts expiring tokens from the Bayes db, the situation might become even worse.

Possible fix: If you don't mind the extra processing overhead, you can have maildrop pipe messages through SpamAssassin's sa-learn before they are deleted. This ensures that every message scoring 12 or above is learned as spam. Messages already learned by SA's autolearn will not be learned again.

Edit the site-wide maildrop filter in /usr/local/etc/mail/mailfilter.

The code responsible for discarding spam should look like this:

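if ( $MATCH2 >= 12 ) # from Adam Seniuk post to mail-toasters
{
    #log "$TIMESTAMP Discarding mail for $EXT@$HOST: spam score $MATCH2"
    # Pipe a copy to sa-learn so the message is learned as spam before it is dropped.
    cc "|sa-learn --spam"
    # Exit with status 0 without delivering, which discards the message.
    EXITCODE=0
    exit
}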

Note: if your toaster is very busy or short on resources or both, you should definitely *not* do this. Just starting sa-learn eats quite a lot of resources, since it's a Perl script. Needless to say, if something ain't broken, don't fix it. The likelihood of this marginal scenario materializing on a server handling a few thousand mails or more per day is practically zero.

by Tor Willy Austerslått

Last modified on 4/13/05.