Questions about SpamAssassin Training

Hello all, forgive me if this is not the correct place for this but I'd rather contribute content to this forum and not sign up for another. I have some questions for anybody experienced with SA Bayesian Filter training.

I have two mail servers: one in production and one in development. Soon I will swap them, so the dev server will take over the hostname, domain, and IP of the production server. I work under a policy that prohibits working on production servers as much as possible, so I want to train SpamAssassin before going to production. I am worried that the filter may be skewed because I am taking mail from another server, although the mail stream is the same.

I am thinking of having sa-learn ignore most of the headers (maybe all headers, now that I look at them) to avoid any of that data affecting the filter, since the headers in my spam and ham are similar.

Also, both servers are set up to add ******SPAM****** to the beginning of the subject line. I read that sa-learn accounts for and ignores this tag, but I don't know whether that still holds when the mail comes from another server, even though SA is set up the same way on both.

So if I ignore headers and sa-learn accounts for the ******SPAM****** tag, and I take mail from my production server, have sa-learn go through it on the dev server, and then put the dev server into production, will I have any problems with the Bayesian filter?

One final concern: if I train the filter with ~500 spam and ~500 ham, disable auto-learning, and then never really re-train it, will Bayes still be effective? I know it won't get good at identifying new spam, but it should still do okay with the spam it has already learned, right?

Thanks for any help in advance.
 
I did the spam training, so I thought I'd post my findings so far regarding my questions. The server is still not in production, so I do not know exactly how it will handle the production mail stream, but I have been forwarding ham and spam to it. The Bayesian filter has been accurately adding ~2 points to my spam scores and subtracting ~2 points from my ham scores. The mail I fed to sa-learn did have the "******SPAM******" tags in the subject. I've fed it about 260 spam and 260 ham messages. So far I am pretty happy with it, and I am hoping it will get better when I feed it more mail. :)

The headers that I ignored are as follows:
Code:
 bayes_ignore_header X-Bogosity
 bayes_ignore_header X-Spam-Flag
 bayes_ignore_header X-Spam-Status
 bayes_ignore_header X-Virus-Scanned
 bayes_ignore_header X-Spam-Level
 bayes_ignore_header X-Spam-Score
 bayes_ignore_header Message-ID
 bayes_ignore_header Content-Type
 bayes_ignore_header Date
 bayes_ignore_header X-OriginalArrivalTime
 bayes_ignore_header Content-Transfer-Encoding
 bayes_ignore_header X-Amavis-Alert
 bayes_ignore_header X-Quarantine-ID
 bayes_ignore_header X-IronPort-Anti-Spam-Filtered
 bayes_ignore_header X-IronPort-Anti-Spam-Result
 bayes_ignore_header X-IronPort-AVE
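
For anyone repeating this, the training step itself can be sketched roughly as below. It assumes the spam and ham have been saved one message per file. SPAM_DIR, HAM_DIR and the train_bayes function name are placeholders made up for the example, and the commands should be run as whichever user SpamAssassin uses for its Bayes database.

```shell
#!/bin/sh
# Sketch: train Bayes from two directories of saved messages.
# SPAM_DIR and HAM_DIR are hypothetical paths; point them at your own mail.
SPAM_DIR="${SPAM_DIR:-/var/tmp/training/spam}"
HAM_DIR="${HAM_DIR:-/var/tmp/training/ham}"

train_bayes() {
    # Learn every message in each directory (one message per file).
    sa-learn --spam "$SPAM_DIR"
    sa-learn --ham "$HAM_DIR"
    # Show the nspam/nham counters to see how much has been learned.
    sa-learn --dump magic
}
```

Calling train_bayes runs the three commands in order. By default SpamAssassin does not use Bayes at all until it has learned at least 200 spam and 200 ham, so the nspam and nham counters in the --dump magic output are worth checking.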
 
tay9000 said:
I did the spam training

You don't have to train again; just export and reimport the Bayes database.
See sa-learn:
Code:
--backup
    Performs a dump of the Bayes database in machine/human readable format. 

    The dump will include token and seen data. It is suitable for input back into the --restore command.

--restore=filename
    Performs a restore of the Bayes database defined by filename.
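
Put together, the move could look something like this sketch. BACKUP_FILE and the two function names are made up for illustration, and it assumes sa-learn is run as the same user SpamAssassin uses for Bayes on each server.

```shell
#!/bin/sh
# Sketch: move a trained Bayes database from one server to another.
# BACKUP_FILE is a hypothetical file name chosen for this example.
BACKUP_FILE="${BACKUP_FILE:-bayes-backup.txt}"

# Run on the server that holds the trained database.
export_bayes() {
    # --backup writes the dump to standard output, so redirect it to a file.
    sa-learn --backup > "$BACKUP_FILE"
}

# Run on the new server after copying the file across.
import_bayes() {
    sa-learn --restore="$BACKUP_FILE"
    # Print the token and message counters to confirm the import worked.
    sa-learn --dump magic
}
```

Run export_bayes on the old server, copy the file over, then run import_bayes on the new one.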

Regards.
 