Listen queue overflow

Hello All,

I'm posting this after reading a number of posts both here and elsewhere. Unfortunately, I have not been able to catch the process in question as described in https://blog.tyk.nu/blog/fun-with-freebsd-listen-queue-overflow/, so I'm trying to figure out what happened based on the logs.

The server is FreeBSD 12.0-STABLE r345381, a virtual instance with 4 CPUs, 4 GB RAM, 4GB swap, single-disk ZFS, lightly/moderately trafficked mail server, Cyrus/Postfix/Amavisd/ClamAV/Spamassassin + Unbound, all ports recently updated.

These are the log entries in chronological order:

/var/log/amavisd.log:
Code:
Jun  3 09:04:06 [server] /usr/local/sbin/amavisd[41944]: (41944-05) ClamAV-clamd: Connecting to socket  /var/run/clamav/clamd.sock
Jun  3 09:04:07 [server] /usr/local/sbin/amavisd[41944]: (41944-05) ClamAV-clamd: Connecting to socket  /var/run/clamav/clamd.sock, retry #1
Jun  3 09:04:13 [server] /usr/local/sbin/amavisd[41944]: (41944-05) ClamAV-clamd: Connecting to socket  /var/run/clamav/clamd.sock, retry #2
Jun  3 09:04:30 [server] /usr/local/sbin/amavisd[41829]: (41829-06) ClamAV-clamd: Connecting to socket  /var/run/clamav/clamd.sock
Jun  3 09:04:31 [server] /usr/local/sbin/amavisd[41829]: (41829-06) ClamAV-clamd: Connecting to socket  /var/run/clamav/clamd.sock, retry #1
Jun  3 09:04:37 [server] /usr/local/sbin/amavisd[41829]: (41829-06) ClamAV-clamd: Connecting to socket  /var/run/clamav/clamd.sock, retry #2
[...]
Jun  3 10:13:25 [server] /usr/local/sbin/amavisd[41829]: (41829-16) ClamAV-clamd: Connecting to socket  /var/run/clamav/clamd.sock
Jun  3 10:13:26 [server] /usr/local/sbin/amavisd[41829]: (41829-16) ClamAV-clamd: Connecting to socket  /var/run/clamav/clamd.sock, retry #1
Jun  3 10:13:32 [server] /usr/local/sbin/amavisd[41829]: (41829-16) ClamAV-clamd: Connecting to socket  /var/run/clamav/clamd.sock, retry #2

/var/log/messages:
Code:
Jun  3 09:08:05 [server] kernel: swap_pager_getswapspace(32): failed
Jun  3 09:08:05 [server] syslogd: last message repeated 103 times
Jun  3 09:08:05 [server] kernel: swap_pager_getswapspace(31): failed
Jun  3 09:08:05 [server] kernel: swap_pager_getswapspace(32): failed
Jun  3 09:08:05 [server] syslogd: last message repeated 3 times
Jun  3 09:08:05 [server] kernel: swap_pager_getswapspace(31): failed
Jun  3 09:08:05 [server] syslogd: last message repeated 1 times
[...]
Jun  3 09:09:03 [server] kernel: pid 42283 (clamscan), jid 0, uid 110, was killed: out of swap space
Jun  3 09:18:49 [server] kernel: pid 42392 (clamscan), jid 0, uid 110, was killed: out of swap space
Jun  3 09:20:10 [server] kernel: pid 42438 (clamscan), jid 0, uid 110, was killed: out of swap space
Jun  3 09:20:32 [server] kernel: pid 42443 (clamscan), jid 0, uid 110, was killed: out of swap space
Jun  3 09:33:43 [server] kernel: pid 42620 (clamscan), jid 0, uid 110, was killed: out of swap space
Jun  3 09:39:16 [server] kernel: pid 42707 (clamscan), jid 0, uid 110, was killed: out of swap space
Jun  3 09:40:02 [server] kernel: pid 42710 (clamscan), jid 0, uid 110, was killed: out of swap space
Jun  3 09:40:02 [server] kernel: pid 42711 (clamscan), jid 0, uid 110, was killed: out of swap space
Jun  3 09:57:49 [server] kernel: pid 42923 (clamscan), jid 0, uid 110, was killed: out of swap space
Jun  3 10:18:58 [server] kernel: pid 43117 (clamscan), jid 0, uid 110, was killed: out of swap space
Jun  3 10:20:24 [server] kernel: pid 43135 (clamscan), jid 0, uid 110, was killed: out of swap space
Jun  3 10:24:42 [server] kernel: pid 43158 (clamscan), jid 0, uid 110, was killed: out of swap space
Jun  3 10:36:02 [server] kernel: pid 43323 (clamscan), jid 0, uid 110, was killed: out of swap space

/var/log/local6.log (Cyrus):
Code:
Jun  3 10:14:44 [server] pop3[43092]: login: [client host] [ip.add.re.ss] [client] APOP User logged in SESSIONID=<[server]-43092-1559571284-1-7423282183343598449>
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@

/var/log/messages:
Code:
Jun  3 10:26:18 [server] kernel: sonewconn: pcb 0xfffff800639f9988: Listen queue overflow: 49 already in queue awaiting acceptance (1 occurrences)
Jun  3 10:27:20 [server] kernel: sonewconn: pcb 0xfffff800639f9988: Listen queue overflow: 49 already in queue awaiting acceptance (22 occurrences)
Jun  3 10:28:45 [server] kernel: sonewconn: pcb 0xfffff800639f9988: Listen queue overflow: 49 already in queue awaiting acceptance (8 occurrences)
Jun  3 10:30:03 [server] kernel: sonewconn: pcb 0xfffff800639f9988: Listen queue overflow: 49 already in queue awaiting acceptance (12 occurrences)
Jun  3 10:31:03 [server] kernel: sonewconn: pcb 0xfffff800639f9988: Listen queue overflow: 49 already in queue awaiting acceptance (28 occurrences)
Jun  3 10:32:03 [server] kernel: sonewconn: pcb 0xfffff800639f9988: Listen queue overflow: 49 already in queue awaiting acceptance (22 occurrences)
Jun  3 10:33:06 [server] kernel: sonewconn: pcb 0xfffff800639f95b8: Listen queue overflow: 49 already in queue awaiting acceptance (29 occurrences)
Jun  3 10:34:11 [server] kernel: sonewconn: pcb 0xfffff800639f95b8: Listen queue overflow: 49 already in queue awaiting acceptance (61 occurrences)
Jun  3 10:35:16 [server] kernel: sonewconn: pcb 0xfffff800639f95b8: Listen queue overflow: 49 already in queue awaiting acceptance (45 occurrences)

/var/log/maillog:
Code:
Jun  3 10:33:44 [server] postfix/pipe[43113]: warning: pipe_command_write: write time limit exceeded
Jun  3 10:35:59 [server] postfix/pipe[43142]: warning: pipe_command_write: write time limit exceeded

I also looked at message processing times by Amavisd/ClamAV/Spamassassin as reported in amavisd.log (TIMING in milliseconds, TIMINGs are usually under 1,000 ms, may be 3-4,000, but not 300,000 ms):
Code:
Jun 3 09:03:59 size: 16781, TIMING 290
Jun 3 09:04:02 size: 5544, TIMING 430
Jun 3 09:05:28 size: 38149, TIMING 3303
Jun 3 09:05:33 size: 22735, TIMING 4628
Jun 3 09:05:35 size: 22725, TIMING 1614
Jun 3 09:06:32 size: 73412, TIMING 2831
Jun 3 09:09:23 size: 40542, TIMING 317690
Jun 3 09:09:30 size: 8307, TIMING 299901
Jun 3 09:10:51 size: 17161, TIMING 329098
Jun 3 09:11:23 size: 18160, TIMING 302487
Jun 3 09:11:23 size: 75761, TIMING 308312
Jun 3 09:11:24 size: 22750, TIMING 709
Jun 3 09:11:42 size: 22356, TIMING 299189
Jun 3 09:11:42 size: 18059, TIMING 300968
Jun 3 09:11:56 size: 1405, TIMING 1230
Jun 3 09:12:16 size: 126818, TIMING 301309
Jun 3 09:13:13 size: 542117, TIMING 3533
Jun 3 09:14:24 size: 15877, TIMING 300708
Jun 3 09:14:28 size: 64391, TIMING 297870
Jun 3 09:15:59 size: 27882, TIMING 307546
Jun 3 09:16:34 size: 45929, TIMING 307134
Jun 3 09:16:35 size: 13796, TIMING 2308
Jun 3 09:17:27 size: 281934, TIMING 14667
Jun 3 09:19:12 size: 51672, TIMING 305754
Jun 3 09:19:29 size: 18476, TIMING 305202
Jun 3 09:19:48 size: 29764, TIMING 302856
Jun 3 09:19:49 size: 29767, TIMING 303809
Jun 3 09:19:52 size: 281896, TIMING 2427
Jun 3 09:19:53 size: 281929, TIMING 780
Jun 3 09:20:59 size: 30336, TIMING 340438
Jun 3 09:20:59 size: 10377, TIMING 296543
Jun 3 09:21:00 size: 281897, TIMING 945
Jun 3 09:21:01 size: 281940, TIMING 543
Jun 3 09:21:01 size: 281936, TIMING 253
Jun 3 09:22:11 size: 28265, TIMING 309174
Jun 3 09:22:48 size: 18476, TIMING 307903

Am I correct to conclude that:

1: ClamAV-clamd runs out of available Unix sockets
2: Falls back on clamscan
3: clamscan is much slower than ClamAV-clamd but eats up all memory and swap
4: Once in a while a socket is freed and ClamAV-clamd can connect, this would explain the wildly varying message processing TIMINGs
5: FreeBSD tries to keep alive by killing clamscan
6: But starved of RAM and swap, sockets cannot be freed (?)
7: Cyrus gives up and writes ^@^@^@^@^@ in the log
8: Postfix cannot deliver to Cyrus for lack of available sockets

At this point, I can't stop/restart Cyrus, Postfix and other daemons; may be I could if I waited long enough, but unresponsive mail service irritates people... so I had to resort to hard-rebooting the server.

I have also increased
Code:
kern.ipc.somaxconn
from the default 128 to 512, and it didn't help (what I describe here is the second occurrence of the same problem).

If this is a plausible explanation would increasing the number of amavisd servers? It is currently set to 8.

Hope to have your insightful comments...
 
You should set kern.ipc.soacceptqueue, although I do believe kern.ipc.somaxconn refers to the same thing. Some applications use their own settings, so even if you increase soacceptqueue they would still use 128 or even lower. Also, after changing soacceptqueue make sure to restart the daemon.

 
SirDice, you are right, kern.ipc.somaxconn and kern.ipc.soacceptqueue seem to be the same:

Code:
$ sysctl -d kern.ipc.somaxconn
kern.ipc.somaxconn: Maximum listen socket pending connection accept queue size (compat)
$ sysctl -d kern.ipc.soacceptqueue
kern.ipc.soacceptqueue: Maximum listen socket pending connection accept queue size

I'll try
Code:
$max_servers = 16;
in /usr/local/etc/amavisd.conf.
 
Clamd with default rules need at least 1GB of RAM when processing the e-mails. I think you need more RAM for your VPS or disable some anti-virus rules.
 
In Debian the signatures that come with clamd installation are:

/var/lib/clamav/daily.cld
/var/lib/clamav/main.cvd

I kept the main and stop using daily.

But I don't remember how I did it.
 
Thanks again - I dug up my old notes, and found that in a similar situation I changed some values from defaults in /usr/local/etc/clamd.conf:

Code:
MaxScanSize
MaxFileSize
MaxRecursion
MaxFiles
MaxHTMLNormalize
MaxHTMLNoTags
MaxScriptNormalize
 
The same problem happened again:

Code:
Jun  8 14:10:25 [server] kernel: sonewconn: pcb 0xfffff8003a6e47a0: Listen queue overflow: 49 already in queue awaiting acceptance (1 occurrences)
Jun  8 14:11:30 [server] kernel: sonewconn: pcb 0xfffff8003a6e47a0: Listen queue overflow: 49 already in queue awaiting acceptance (16 occurrences)

Also:

Code:
# netstat -aL
Current listen queue sizes (qlen/incqlen/maxqlen)
Proto Listen                           Local Address         
tcp4  0/0/100                          localhost.10025       
tcp4  0/0/100                          localhost.submission   
tcp4  0/0/100                          [server]..submi
tcp4  0/0/100                          localhost.smtp         
tcp4  0/0/100                          [server]..smtp 
tcp4  49/0/32                          *.pop3                 
tcp4  49/0/32                          *.imap                 
tcp6  0/0/128                          localhost.10026       
tcp4  0/0/128                          localhost.10026       
tcp6  0/0/128                          localhost.10024       
tcp4  0/0/128                          localhost.10024       
tcp4  0/0/128                          *.ssh                 
tcp6  0/0/128                          *.ssh                 
tcp4  0/0/256                          localhost.8953         
tcp4  0/0/256                          [server]..domai
tcp4  0/0/256                          localhost.domain       
unix  0/0/100                          private/amavisd-new
unix  0/0/100                          private/cyrus
unix  0/0/100                          private/scache
unix  0/0/100                          private/anvil
unix  0/0/100                          private/lmtp
unix  0/0/100                          private/virtual
unix  0/0/100                          private/local
unix  0/0/100                          private/discard
unix  0/0/100                          private/retry
unix  0/0/100                          private/error
unix  0/0/100                          public/showq
unix  0/0/100                          private/relay
unix  0/0/100                          private/smtp
unix  0/0/100                          private/proxywrite
unix  0/0/100                          private/proxymap
unix  0/0/100                          public/flush
unix  0/0/100                          private/verify
unix  0/0/100                          private/trace
unix  0/0/100                          private/defer
unix  0/0/100                          private/bounce
unix  0/0/100                          private/rewrite
unix  0/0/100                          private/tlsmgr
unix  0/0/100                          public/qmgr
unix  0/0/100                          public/cleanup
unix  0/0/100                          public/pickup
unix  0/0/100                          private/tlsproxy
unix  0/0/100                          private/dnsblog
unix  0/0/100                          private/smtpd
unix  0/0/200                          /var/run/clamav/clamd.sock
unix  0/0/128                          /var/amavis/amavisd.sock
unix  0/0/32                           /var/run/saslauthd/mux
unix  0/0/4                            /var/run/devd.pipe
unix  0/0/4                            /var/run/devd.seqpacket.pipe

It is then the Cyrus pop3 and imap tcp queues which are overflowing.

Where is the
Code:
maxqlen
value of 32 as reported by netstat(1) set?

While the server did not crash, I was unable to restart Cyrus with the rc.d script (waited a few minutes). Then, tried reboot; the server did not reboot while I again was waiting a few minutes.

Would you please suggest how can I further debug and fix this problem?
 
Back
Top