Solved Mailserver stops working after a few days (after 12-RELENG upgrade)

I've been running this mail server for years and occasionally had to upgrade config files whenever there was a deprecated config in Dovecot, Postfix, ClamAV, ....

Now, after updating to FreeBSD 12-RELENG, I have noticed after a few days I won't receive any new emails. I investigated and when I send an email from a gmail account I have the following log:

Code:
May  3 15:12:55 mailhost postfix/smtpd[74828]: connect from mail-ed1-f42.google.com[209.85.208.42]
May  3 15:12:56 mailhost postfix/smtpd[74828]: 97E8A7A81BB: client=mail-ed1-f42.google.com[209.85.208.42]
May  3 15:12:56 mailhost postfix/cleanup[66972]: 97E8A7A81BB: message-id=<CAB+8qgA0NazvVronv_p7fGz_0qnW38D+=tCqtWx9sHqF8g+2JQ@mail.gmail.com>
May  3 15:12:56 mailhost opendkim[16684]: 97E8A7A81BB: mail-ed1-f42.google.com [209.85.208.42] not internal
May  3 15:12:56 mailhost opendkim[16684]: 97E8A7A81BB: not authenticated
May  3 15:12:56 mailhost opendkim[16684]: 97E8A7A81BB: DKIM verification successful
May  3 15:12:56 mailhost opendkim[16684]: 97E8A7A81BB: s=20161025 d=gmail.com SSL 
May  3 15:12:56 mailhost postfix/qmgr[81214]: 97E8A7A81BB: from=<agmailinbox@gmail.com>, size=5852, nrcpt=1 (queue active)
May  3 15:12:56 mailhost postfix/smtpd[74828]: disconnect from mail-ed1-f42.google.com[209.85.208.42] ehlo=2 starttls=1 mail=1 rcpt=1 bdat=1 quit=1 commands=7
May  3 15:12:56 mailhost spamd[69691]: spamd: got connection over /var/run/SpamAssassin.socket 
May  3 15:12:57 mailhost spamd[69691]: spamd: setuid to mailnull succeeded 
May  3 15:12:57 mailhost spamd[69691]: spamd: processing message <CAB+8qgA0NazvVronv_p7fGz_0qnW38D+=tCqtWx9sHqF8g+2JQ@mail.gmail.com> for mailnull:26 
May  3 15:12:58 mailhost spamd[69691]: spamd: clean message (0.0/5.0) for mailnull:26 in 1.1 seconds, 5845 bytes. 
May  3 15:12:58 mailhost spamd[69691]: spamd: result: . 0 - BAYES_40,FREEMAIL_FROM,HTML_MESSAGE,RCVD_IN_MSPIKE_H2 scantime=1.1,size=5845,user=mailnull,uid=26,required_score=5.0,rhost=localhost,raddr=127.0.0.1,rport=/var/run/SpamAssassin.socket,mid=<CAB+8qgA0NazvVronv_p7fGz_0qnW38D+=tCqtWx9sHqF8g+2JQ@mail.gmail.com>,bayes=0.349810,autolearn=ham autolearn_force=no 
May  3 15:12:58 mailhost postfix/pickup[3169]: 1C8597A81C7: uid=65534 from=<agmailinbox@gmail.com>
May  3 15:12:58 mailhost postfix/pipe[3541]: 97E8A7A81BB: to=<someinbox@babaei.net>, relay=spamassassin, delay=1.6, delays=0.39/0.01/0/1.2, dsn=2.0.0, status=sent (delivered via spamassassin service)
May  3 15:12:58 mailhost postfix/qmgr[81214]: 97E8A7A81BB: removed
May  3 15:12:58 mailhost postfix/cleanup[60259]: 1C8597A81C7: message-id=<CAB+8qgA0NazvVronv_p7fGz_0qnW38D+=tCqtWx9sHqF8g+2JQ@mail.gmail.com>
May  3 15:12:58 mailhost opendkim[16684]: 1C8597A81C7: no signing table match for 'agmailinbox@gmail.com'
May  3 15:12:58 mailhost opendkim[16684]: 1C8597A81C7: DKIM verification successful
May  3 15:12:58 mailhost opendkim[16684]: 1C8597A81C7: s=20161025 d=gmail.com SSL 
May  3 15:12:58 mailhost postfix/qmgr[81214]: 1C8597A81C7: from=<agmailinbox@gmail.com>, size=6379, nrcpt=1 (queue active)
May  3 15:12:58 mailhost clamsmtpd[34590]: 1005AE: accepted connection from: XXX.XXX.XXX.XXX
May  3 15:12:58 mailhost postfix/smtpd[47876]: connect from mailhost.babaei.net[XXX.XXX.XXX.XXX]
May  3 15:12:58 mailhost spamd[43906]: prefork: child states: II 
May  3 15:12:58 mailhost postfix/smtpd[47876]: 51AB77A8274: client=mailhost.babaei.net[XXX.XXX.XXX.XXX]
May  3 15:12:58 mailhost postfix/cleanup[66972]: 51AB77A8274: message-id=<CAB+8qgA0NazvVronv_p7fGz_0qnW38D+=tCqtWx9sHqF8g+2JQ@mail.gmail.com>
May  3 15:12:58 mailhost opendkim[16684]: 51AB77A8274: no signing table match for 'agmailinbox@gmail.com'
May  3 15:12:58 mailhost opendkim[16684]: 51AB77A8274: DKIM verification successful
May  3 15:12:58 mailhost opendkim[16684]: 51AB77A8274: s=20161025 d=gmail.com SSL 
May  3 15:12:58 mailhost postfix/qmgr[81214]: 51AB77A8274: from=<agmailinbox@gmail.com>, size=6609, nrcpt=1 (queue active)
May  3 15:12:58 mailhost clamsmtpd[34590]: 1005AE: from=agmailinbox@gmail.com, to=someinbox@babaei.net, status=CLEAN
May  3 15:12:58 mailhost postfix/smtp[41736]: 1C8597A81C7: to=<someinbox@babaei.net>, relay=XXX.XXX.XXX.XXX[XXX.XXX.XXX.XXX]:10025, delay=0.59, delays=0.12/0/0.1/0.37, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as 51AB77A8274)
May  3 15:12:58 mailhost postfix/smtpd[47876]: disconnect from mailhost.babaei.net[XXX.XXX.XXX.XXX] ehlo=1 xforward=1 mail=1 rcpt=1 data=1 quit=1 commands=6
May  3 15:12:58 mailhost postfix/qmgr[81214]: 1C8597A81C7: removed
May  3 15:12:58 mailhost dovecot[24205]: lmtp(70784): Connect from local
May  3 15:12:58 mailhost dovecot[24205]: lmtp(someinbox@babaei.net)<70784><8gIiKnpazFyAFAEAiv64TQ>: sieve: msgid=<CAB+8qgA0NazvVronv_p7fGz_0qnW38D+=tCqtWx9sHqF8g+2JQ@mail.gmail.com>: stored mail into mailbox 'INBOX'
May  3 15:12:58 mailhost postfix/lmtp[63859]: 51AB77A8274: to=<someinbox@babaei.net>, relay=mailhost.babaei.net[private/dovecot-lmtp], delay=0.41, delays=0.37/0/0/0.04, dsn=2.0.0, status=sent (250 2.0.0 <someinbox@babaei.net> 8gIiKnpazFyAFAEAiv64TQ Saved)
May  3 15:12:58 mailhost dovecot[24205]: lmtp(70784): Disconnect from local: Client has quit the connection (state=READY)
May  3 15:12:58 mailhost postfix/qmgr[81214]: 51AB77A8274: removed

I've noticed the clamav-clamd service crashed and I have to start it manually. And, then I have to start Dovecot manually. Afterwards, I start recieving emails again. Any ideas why it happens on 12.0?
 
Yes, I did upgrade from 11.2-RELENG and this my usual upgrade procedure is:

Code:
$ cd /usr/src/ && make clean
$ cd /usr/obj/ && rm -rf *
$ cd /usr/src/ && make buildworld -j3 && make buildkernel -j3 KERNCONF=CUSTOM
$ make installkernel KERNCONF=CUSTOM
$ nextboot -o "-s" -k kernel
$ shutdown -r now

$ mount -u /
$ mount -a -t ufs
$ adjkerntz -i
$ mergemaster -p
$ cd /usr/src/
$ make installworld
$ mergemaster
$ reboot

// looking for possible errors/warnings
$ tail /var/log/messages

$ cd /usr/src
$ make check-old
$ make -DBATCH_DELETE_OLD_FILES delete-old
$ make -DBATCH_DELETE_OLD_FILES delete-old-libs
$ cd /usr/src/ && make clean
$ cd /usr/obj/ && rm -rf *

$ pkg-static install -f pkg
$ portupgrade -afkC

I even did another portupgrade -afkC after I've found out about the mail server issue. The result is the same, unfortunately.
 
If you're sticking to -RELEASE versions, why didn't you use freebsd-update(8)?
The same can be said for packages, why don't you use packages?

I can't find a reason in your logs why it would have crashed. I do know it crashes from time to time on my server too. But that's still running 11.2-RELEASE.
 
SirDice the reason is I am building a custom kernel since I need two features (QUOTA, and TCP FastOpen which is just turned on by default in 12, so now only QUOTA). For the packages, it's the same. I turn on/off various features in ports (it actually is more than a mail server, web server, database, etc. A personal production server for small things).

VladiBG I guess I have to do it in the next crash since I couldn't find anything relevant.

Of course, there is one workaround though, writing a script and a Cron job to restart the postfix, dovecot, clam, .... every a few hours. But, I would like to find the root cause.
 
OK now it happened again and I am checking the clamav-clamd logs after the crash and nothing is there:

Code:
SelfCheck: Database status OK.
SelfCheck: Database status OK.
SelfCheck: Database status OK.
SelfCheck: Database status OK.
SelfCheck: Database status OK.
SelfCheck: Database status OK.
SelfCheck: Database status OK.
SelfCheck: Database status OK.
SelfCheck: Database status OK.
SelfCheck: Database status OK.

And when I start it manually:

Code:
+++ Started at Sat May  4 21:11:09 2019
Received 0 file descriptor(s) from systemd.
clamd daemon 0.101.2 (OS: freebsd12.0, ARCH: amd64, CPU: amd64)
Log file size limited to 1048576 bytes.
Reading databases from /var/db/clamav
Not loading PUA signatures.
Bytecode: Security mode set to "TrustSigned".
Loaded 6119763 signatures.
LOCAL: Removing stale socket file /var/run/clamav/clamd.sock
LOCAL: Unix socket file /var/run/clamav/clamd.sock
LOCAL: Setting connection queue length to 200
Limits: Global size limit set to 104857600 bytes.
Limits: File size limit set to 26214400 bytes.
Limits: Recursion level limit set to 16.
Limits: Files limit set to 10000.
Limits: MaxEmbeddedPE limit set to 10485760 bytes.
Limits: MaxHTMLNormalize limit set to 10485760 bytes.
Limits: MaxHTMLNoTags limit set to 2097152 bytes.
Limits: MaxScriptNormalize limit set to 5242880 bytes.
Limits: MaxZipTypeRcg limit set to 1048576 bytes.
Limits: MaxPartitions limit set to 50.
Limits: MaxIconsPE limit set to 100.
Limits: MaxRecHWP3 limit set to 16.
Limits: PCREMatchLimit limit set to 100000.
Limits: PCRERecMatchLimit limit set to 2000.
Limits: PCREMaxFileSize limit set to 26214400.
Archive support enabled.
AlertExceedsMax heuristic detection disabled.
Heuristic alerts enabled.
Portable Executable support enabled.
ELF support enabled.
Mail files support enabled.
OLE2 support enabled.
PDF support enabled.
SWF support enabled.
HTML support enabled.
XMLDOCS support enabled.
HWP3 support enabled.
Self checking every 600 seconds.
Set stacksize to 2162688

So basically no ideas what happens :/
 
I have just taken another look at htop and noticed clamd is constantly using 30% to 50% of my VPS memory (1.3 GB memory + 4.0 GB swap) that's insane because all other services including Postgres, Sogo, Nginx, Gitea, .... use around 2 or 3 percent of the memory in idle mode, at most.

I tried lowering MaxConnectionQueueLength and MaxThreads:

Code:
# Maximum length the queue of pending connections may grow to.
# Default: 200
MaxConnectionQueueLength 10

# Maximum number of threads running at the same time.
# Default: 10
MaxThreads 2

Now the memory usage is a bit better (around 10%). There are other options such as

Code:
##
## Limits
##

# The options below protect your system against Denial of Service attacks
# using archive bombs.

# This option sets the maximum amount of data to be scanned for each input
# file.
# Archives and other containers are recursively extracted and scanned up to
# this value.
# Value of 0 disables the limit
# Note: disabling this limit or setting it too high may result in severe damage
# to the system.
# Default: 100M
#MaxScanSize 150M

# Files larger than this limit won't be scanned. Affects the input file itself
# as well as files contained inside it (when the input file is an archive, a
# document or some other kind of container).
# Value of 0 disables the limit.
# Note: disabling this limit or setting it too high may result in severe damage
# to the system.
# Default: 25M
#MaxFileSize 30M

# Nested archives are scanned recursively, e.g. if a Zip archive contains a RAR
# file, all files within it will also be scanned. This options specifies how
# deeply the process should be continued.
# Note: setting this limit too high may result in severe damage to the system.
# Default: 16
#MaxRecursion 10

# Number of files to be scanned within an archive, a document, or any other
# container file.
# Value of 0 disables the limit.
# Note: disabling this limit or setting it too high may result in severe damage
# to the system.
# Default: 10000
#MaxFiles 15000

# Maximum size of a file to check for embedded PE. Files larger than this value
# will skip the additional analysis step.
# Note: disabling this limit or setting it too high may result in severe damage
# to the system.
# Default: 10M
#MaxEmbeddedPE 10M

# Maximum size of a HTML file to normalize. HTML files larger than this value
# will not be normalized or scanned.
# Note: disabling this limit or setting it too high may result in severe damage
# to the system.
# Default: 10M
#MaxHTMLNormalize 10M

# Maximum size of a normalized HTML file to scan. HTML files larger than this
# value after normalization will not be scanned.
# Note: disabling this limit or setting it too high may result in severe damage
# to the system.
# Default: 2M
#MaxHTMLNoTags 2M

# Maximum size of a script file to normalize. Script content larger than this
# value will not be normalized or scanned.
# Note: disabling this limit or setting it too high may result in severe damage
# to the system.
# Default: 5M
#MaxScriptNormalize 5M

# Maximum size of a ZIP file to reanalyze type recognition. ZIP files larger
# than this value will skip the step to potentially reanalyze as PE.
# Note: disabling this limit or setting it too high may result in severe damage
# to the system.
# Default: 1M
#MaxZipTypeRcg 1M

# This option sets the maximum number of partitions of a raw disk image to be
# scanned.
# Raw disk images with more partitions than this value will have up to
# the value number partitions scanned. Negative values are not allowed.
# Note: setting this limit too high may result in severe damage or impact
# performance.
# Default: 50
#MaxPartitions 128

# This option sets the maximum number of icons within a PE to be scanned.
# PE files with more icons than this value will have up to the value number
# icons scanned.
# Negative values are not allowed.
# WARNING: setting this limit too high may result in severe damage or impact
# performance.
# Default: 100
#MaxIconsPE 200

# This option sets the maximum recursive calls for HWP3 parsing during
# scanning. HWP3 files using more than this limit will be terminated and
# alert the user.
# Scans will be unable to scan any HWP3 attachments if the recursive limit
# is reached.
# Negative values are not allowed.
# WARNING: setting this limit too high may result in severe damage or impact
# performance.
# Default: 16
#MaxRecHWP3 16

# This option sets the maximum calls to the PCRE match function during
# an instance of regex matching.
# Instances using more than this limit will be terminated and alert the user
# but the scan will continue.
# For more information on match_limit, see the PCRE documentation.
# Negative values are not allowed.
# WARNING: setting this limit too high may severely impact performance.
# Default: 100000
#PCREMatchLimit 20000

# This option sets the maximum recursive calls to the PCRE match function
# during an instance of regex matching.
# Instances using more than this limit will be terminated and alert the user
# but the scan will continue.
# For more information on match_limit_recursion, see the PCRE documentation.
# Negative values are not allowed and values > PCREMatchLimit are superfluous.
# WARNING: setting this limit too high may severely impact performance.
# Default: 2000
#PCRERecMatchLimit 10000

# This option sets the maximum filesize for which PCRE subsigs will be
# executed. Files exceeding this limit will not have PCRE subsigs executed
# unless a subsig is encompassed to a smaller buffer.
# Negative values are not allowed.
# Setting this value to zero disables the limit.
# WARNING: setting this limit too high or disabling it may severely impact
# performance.
# Default: 25M
#PCREMaxFileSize 100M

# When AlertExceedsMax is set, files exceeding the MaxFileSize, MaxScanSize, or
# MaxRecursion limit will be flagged with the virus
# "Heuristics.Limits.Exceeded".
# Default: no
#AlertExceedsMax yes

But I am not sure if it is a good idea to mess with these.

I wonder if there is any way to run clamav on demand (when an email arrives)? Of course, it will be slower since it is loading from the disk each time, but I have only one inbox and the amount email I send/receive is not much. Better than the OS killing it due to much memory usage and forcing me to manually starting it.
 
I think there's a way to setup a stand-alone clamav remote scanner/server and submit files to it for av testing.
 
rootbert I think one could write a script to scan for a working clamd process using ps aux | grep clamd and then start the server if it cannot be found. More practical in my opinion. If I find enough time, I'll write one and post it here.
 
I'm using amavisd-new instead of clamsmtp. You can find example configuration on purplehat website.

ps.
Try to remove the Database of the clamav then to clean and rebuild the clamd port or manually fetch the new database with freshclam and then enable verbose logging to see what's going on.
 
VladiBG Thanks for the suggestion. I found a similar thing on iRedMail forums to run clamscan on demand. I will definitely look into amasvisd-new when I have a weekend to spare since it seems it requires a bit of configuration. Haven't heard of amavisd-new before since I set up this mail server years ago and it just worked until recently. So, for education purpose, besides allowing me to run clamscan on demand, what are any other advantages of running amasvisd-new over clamsmtp?

P.S. from the linked article:
NOTE ABOUT MEMORY
Clamav eats up a lot of memory. You won’t be able to run it on a server with less than 2GB RAM. If you feel comfortable, you can disable it.
Or, I'll guess running it on demand is a real solution on a low traffic cheap VPS.
 
it's not running on demand. It's using clamd instead of clamscan. There's listening socket.

amavisd-new scan for viruses+spam , clamsmtp is for virus scan only. If your clamd crash because of out of memory it will be logged. There will be no difference if the process crash during the scan initiated by amavisd-new or clamsmtp as both act on the same way.
 
Did you determine for sure that it was FreeBSD killing the process when it consumed all memory? There should be a log message in /var/log/messages if that is the case (very familiar with "out of memory" issues on an RPi3 :)

Using a script workaround seems more like treating the symptoms rather than the root cause which often means you'll get bitten again when you least expect it.
 
Well, as I mentioned in the article, it's just a poor man's guess. I did investigate the logs without finding any clue. Though, I agree with you that the script is just a workaround.
 
Interestingly, I have stopped getting clamav running round the clock. M/Monit has also given up. I know it is memory hog and do have suffice memory >=4gb. Yet, not all of it runs.

Clamavd_rc and clamavd_bin are running and being monitored. But clamd is no longer monitored. Perhaps, it's still running but not monitored by Monit again. Hopefully, it will come back being monitored sometime in the future. I worry less since most of our machines are *n*x.
 
Back
Top