how to handle Rspamd results

cbrace · Aug 29, 2023

As many of you will know, mail/dspam hasn't been supported for years, and it's been on my ToDo list to get mail/rspamd running on my mailserver. I followed the tutorial in the Howtos and FAQs section and now have rspamd running with postfix. According to /var/log/rspamd/rspamd.log, incoming messages are now being actively scanned.

One quick question: it's not immediately clear to me how to use the results. dspam added a header line indicating whether a given message was considered spam (or innocent), allowing one to set up either a client-side filter or a server-based sieve filter, but rspamd doesn't appear to do that. What am I missing here?

I found another page with instructions for how to set up retraining messages, which I'll do tomorrow:

Once you've verified rspamd and postfix are working together, all that's left is to configure Dovecot to train rspamd when you move messages in and out of your Junk folder. The steps for achieving this were derived from this Dovecot guide. Add a new file, 90-imapsieve.conf, under Dovecot's conf.d directory with the following contents:

Source: How To Run Your Own Mail Server (A guide to self-hosting your email on FreeBSD using Postfix, Dovecot, Rspamd, and LDAP)

hardworkingnewbie · Aug 30, 2023

Rspamd normally runs as a milter with Postfix. The usual configuration rejects high-scoring spam outright, while other spam gets an added header, which can then be filtered with Sieve.

Of course this can be configured otherwise, and fine-tuned as well.

cbrace · Aug 30, 2023

Turns out that you have to deliberately switch on header info. I created local.d/milter_headers.conf containing:

Code:

use = ["x-spamd-bar", "x-spam-level", "x-spam-status", "authentication-results", "remove-headers"];
authenticated_headers = ["authentication-results"];
extended_spam_headers = true;
routines {
  remove-headers {
    headers {
      "X-Spam" = 0;
      "X-Spamd-Bar" = 0;
      "X-Spam-Level" = 0;
      "X-Spam-Status" = 0;
      "X-Spam-Flag" = 0;
    }
  }
}

Source: Rspamd header
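
To check that the new headers actually show up in delivered mail, something like the following sh sketch may help. It assumes one message per file (as in a Maildir) and that the X-Spam-Status value starts with "Yes" for flagged mail, which is my understanding of what the x-spam-status routine writes; look at a real message before relying on it. The function names spam_status and is_flagged_spam are made up for this example.

```shell
#!/bin/sh
# Sketch only: spam_status and is_flagged_spam are made-up helper names.

# Print the value of the first X-Spam-Status header in one message file.
# Only the header section (everything before the first blank line) is
# searched; folded continuation lines are not joined.
spam_status() {
    awk '
        /^\r?$/ { exit }                     # blank line: end of headers
        tolower($0) ~ /^x-spam-status:/ {
            sub(/^[^:]*:[ \t]*/, "")         # drop the header name
            sub(/\r$/, "")                   # drop a trailing CR, if any
            print
            exit
        }
    ' "$1"
}

# Exit status 0 if the header value starts with "Yes", 1 otherwise
# (including when the header is missing).
is_flagged_spam() {
    case "$(spam_status "$1")" in
        [Yy]es*) return 0 ;;
        *)       return 1 ;;
    esac
}

# Example (the Maildir path is an assumption):
#   for f in "$HOME"/Maildir/cur/*; do
#       is_flagged_spam "$f" && echo "flagged: $f"
#   done
```

The same test (header value begins with "Yes") is what a client-side or Sieve filter would key on.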

cbrace · Aug 31, 2023

This is probably documented somewhere on the official rspamd website, but I found these third-party instructions for training the Bayesian filters on an existing message corpus very helpful:

Training from your existing ham and spam emails
Have you been running a mail server with mailboxes in a Maildir structure before but without rspamd? Then you probably have a good amount of ham and spam emails. Let’s use those to train rspamd. It is important to train both ham and spam emails. The rspamc command will allow you to feed entire directories/folders of emails to the learning process. An example to train spam:

# rspamc learn_spam /var/vmail/example.org/john/Maildir/.INBOX.Junk/cur

And this would be an example to train ham from John’s inbox:

# rspamc learn_ham /var/vmail/example.org/john/Maildir/cur

Apparently rspamd won't start using Bayesian filtering until it has learned at least 200 mails (the min_learns setting, which as far as I can tell applies to spam and ham separately), so it is worth doing to get up to speed.

Source: Filtering out spam with rspamd.
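
For more than one mailbox, the two rspamc commands above can be wrapped in a small sh sketch like this one. It only relies on rspamc learn_spam and learn_ham accepting a directory, as in the quoted instructions. The names count_mails, train_folder, MIN_LEARNS and RSPAMC are made up for this example and are not rspamd settings; the warning is just a reminder of the 200-message minimum mentioned above.

```shell
#!/bin/sh
# Sketch only: count_mails, train_folder, MIN_LEARNS and RSPAMC are
# made-up names for this example, not rspamd settings.

: "${MIN_LEARNS:=200}"      # warn below this many messages

# Count the regular files (one per message) directly inside a folder.
count_mails() {
    find "$1" -maxdepth 1 -type f | wc -l | tr -d ' '
}

# train_folder spam|ham DIR
# Runs "rspamc learn_spam DIR" or "rspamc learn_ham DIR".
# Set RSPAMC=echo to print the arguments instead of calling rspamc.
train_folder() {
    case "$1" in
        spam|ham) ;;
        *) echo "usage: train_folder spam|ham DIR" >&2; return 2 ;;
    esac
    if [ ! -d "$2" ]; then
        echo "no such folder: $2" >&2
        return 1
    fi
    n=$(count_mails "$2")
    if [ "$n" -lt "$MIN_LEARNS" ]; then
        echo "warning: only $n messages in $2" >&2
    fi
    "${RSPAMC:-rspamc}" "learn_$1" "$2"
}

# Example, using the paths quoted above:
#   train_folder spam /var/vmail/example.org/john/Maildir/.INBOX.Junk/cur
#   train_folder ham  /var/vmail/example.org/john/Maildir/cur
```

Running it once with RSPAMC=echo is a cheap way to see which folders would be fed to the learner before touching the statistics.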

cbrace · Sep 4, 2023

I'm gradually getting up to speed with rspamd; it's a complex system which is kinda daunting for part-time sysadmins like myself. The other day, it incorrectly flagged an innocent message as spam, with the header added by rspamd indicating the following:

LEAKED_PASSWORD_SCAM(7.00)

Since I set the threshold to 7, this symbol alone ensures the mail gets flagged as spam. But when I tried to retrain the system by uploading the message via the web interface, rspamd responded as follows:

all learn conditions denied learning ham in default classifier

I took this to mean that the message could not be re-trained. Since the same issue had been flagged in the previous message (a bi-weekly newsletter) from the same source, this left me wondering: how do I ensure that mails from this source don't get flagged again, since this particular criterion seems to be consistent? I thought that surely there would be a way to whitelist a sender on the web interface, but initially that did not appear to be the case. However, I eventually figured out how to configure this, using the multimap facility.

First you have to create the file /usr/local/etc/rspamd/local.d/multimap.conf with entries like these:

Code:

sender_from_whitelist {
  type = "from";
  filter = "email";
  map = "file://${DBDIR}/from_whitelist";
  symbol = "SENDER_FROM_WHITELIST";
  action = "accept";
}

sender_from_whitelist_domain {
  type = "from";
  filter = "email:domain";
  map = "file://${DBDIR}/from_domain_whitelist";
  symbol = "SENDER_FROM_WHITELIST_DOMAIN";
  action = "accept";
}

When you restart the daemon, entries then appear on the Configuration page of the web interface, where you can whitelist specific senders or sender domains.
The creator of this page also has tables for whitelisting recipients and recipient domains, as well as "authusers", the usefulness of which is not immediately obvious to me.
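
The map files can presumably also be edited from the shell instead of the web interface. Below is a minimal sh sketch for that, assuming the usual map format of one entry per line. whitelist_add is a made-up helper name, and the path in the usage comment assumes ${DBDIR} expands to /var/db/rspamd, which you should verify on your own install.

```shell
#!/bin/sh
# Sketch only: whitelist_add is a made-up helper name.

# whitelist_add MAPFILE ENTRY
# Append ENTRY (an address or a domain) to a one-entry-per-line map
# file unless an identical line is already there.
whitelist_add() {
    case "$2" in
        ''|*[[:space:]]*|'#'*)
            echo "invalid entry: '$2'" >&2
            return 2 ;;
    esac
    touch "$1" || return 1
    if grep -qxF -e "$2" "$1"; then
        return 0                    # already listed, nothing to do
    fi
    printf '%s\n' "$2" >> "$1"
}

# Example (path and address are assumptions):
#   whitelist_add /var/db/rspamd/from_whitelist newsletter@example.org
```

Checking for an existing line first keeps the file free of duplicates when the same sender is whitelisted twice.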

This info is most likely all to be found in the official rspamd documentation, but that is pretty dense and I find that third parties can do a better job of explaining certain features.

Source: How to create the whitelists

cyclaero · Sep 4, 2023

From https://rspamd.com/features.html

Statistical approach includes many useful spam recognition techniques that can learn dynamically from messages being scanned. Rspamd provides different tools that could be learned either manually or automatically and adopt for the actual mail flow.

Bayes classifier is a tool to classify spam and ham messages. Rspamd uses an advanced algorithm of statistical tokens generation that might achieve better results than the mostly used naive Bayes method.

The advanced algorithm link points to a page introducing the then-new OSBF algorithm by Fidelis Assis, which is based on the OSB algorithm in the CRM114 project by William Yerazunis. Bill added OSBF to CRM114 as well, and in the Notes on Classifiers in his book on CRM114 he lets us know:

Another classifier is the OSBF (OSB with Fidelis mods) filter. The good news is it's even more accurate, faster, and needs fewer buckets than any of the other filters. The bad news is that it's voodoo; some parts of it seem to make little mathematical sense. But the reality is that it works very, Very, VERY well. It's incompatible with any of the other filters (uses .cfc files).

We also have some notes by Fidelis himself on the importance of the training method, telling us that the best training method is "Train On or Near Error" (TONE). That said, Fidelis Assis, William Yerazunis, Christian Siefkes and Shalendra Chhabra are the scientists who contributed to all these efforts and who have written various scientific papers (besides the code), for example this one: CRM114 versus Mr. X. The bottom line is that implementers of the algorithm may have missed some points.

I cannot be of much help with rspamd. I suggest looking into the source code to see how OSBF is implemented and what threshold is used for the TONE method. Another important point on training is that base64 and other encoded MIME parts should be decoded, and non-text MIME attachments removed altogether, since the classifier's databases become spoiled when you feed them e.g. large JPEGs and/or other binary files. I hope the OSBF filter module in rspamd normalizes MIME content, perhaps using the Anomy sanitizer; if not, this could also explain your findings.

I myself stayed with the original CRM114 code for After Queue content filtering with Postfix. I had this running from 2005 to 2012 on various mail servers of small companies with very good results. At the end of last year I set up a new mail server for a small company with about 50 users. My new content filter implementation consists of a handful of shell scripts and a classifier tool which I have written in C, using libcrm114 for the filter algorithms and a customized version of normalizemime.cc for normalizing encoded content and removing non-text attachments before classification/learning. This has been running for more than a month now and misclassification is about 0.5 %. I use the Markovian filter with microgroom and TONE with a threshold of pR = ±10.
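
To make the TONE rule with a threshold of pR = ±10 concrete, here is a minimal sh sketch of the decision only: train when the classifier got the message wrong, or got it right but with a score inside the margin. It assumes the convention that a positive pR means ham and a negative pR means spam, and it is not taken from CRM114 or from my scripts; tone_should_train is a made-up name, and treating a score of exactly ±10 as "confident enough" is an arbitrary choice.

```shell
#!/bin/sh
# Sketch only: tone_should_train is a made-up helper, not CRM114 code.

# tone_should_train CLASS PR [THRESHOLD]
# CLASS is the true class ("ham" or "spam"), PR the score the
# classifier reported, THRESHOLD the TONE margin (default 10).
# Exit status: 0 = train on this message, 1 = skip it, 2 = bad CLASS.
tone_should_train() {
    awk -v cls="$1" -v pr="$2" -v thr="${3:-10}" 'BEGIN {
        if (cls != "ham" && cls != "spam") exit 2
        # Error: the sign of pR disagrees with the true class.
        wrong = (cls == "ham" && pr < 0) || (cls == "spam" && pr >= 0)
        # Near error: right or wrong, the score is inside the margin.
        near = (pr > -thr && pr < thr)
        exit ((wrong || near) ? 0 : 1)
    }'
}

# Example:
#   tone_should_train ham 4.5 && echo "train as ham"
```

The point of the margin is that confidently correct messages are never fed back, which keeps the databases small and avoids overtraining.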

I posted my setup on the German BSD forums:

I guess there is nothing wrong with rspamd's implementation, except that it is more of a black box and it is perhaps harder to find out how things fit together.

Personally, I appreciate the work of Fidelis on OSBF and acknowledge Bill's comment that it works very, Very, VERY well. However, I don't like "voodoo which seems to make little mathematical sense"; someday it might turn against you. The very well understood Markov algorithm is good enough for me. For Before Queue filtering I simply stay with the built-in facilities of Postfix.
