Server crashes

Hello,
I have some serious problems with my FreeBSD server and I am all out of ideas on how to narrow the problem down, so I thought maybe someone can help me out.

I have a system with ZFS on root on an SSD mirror and a ZFS mirror of two HDDs for data.
The system handles my private websites and my private mail server. It is a FreeBSD 12.0-RELEASE host with 3 jails managed via iocage:

host: sanoid/syncoid, prometheus node exporter
jail 1: mail: nginx, mariadb, sogo, dovecot, postfix, rspam, unbound
jail 2: www: nginx, php
jail 3: mariadb: mariadb

All jails have the ports tree mounted via nullfs from the host:
Code:
/usr/ports             /iocage/jails/www/root/usr/ports            nullfs  ro      0       0    # mount ports dir as readonly
/usr/ports/distfiles   /iocage/jails/www/root/var/ports/distfiles  nullfs  rw      0       0    # mount distfiles readwrite
/usr/ports/packages    /iocage/jails/www/root/var/ports/packages   nullfs  rw      0       0    # mount packages readwrite

The mariadb jail has some ZFS datasets from the SSD pool mounted in directly, with a special recordsize for the database:
Code:
zroot/databases/innodb                    77.0M   178G  2.81M  /iocage/jails/mariadb/root/var/db/innodb
zroot/databases/innodb-logs                358M   178G  42.0M  /iocage/jails/mariadb/root/var/db/innodb-logs

NAME                           PROPERTY    VALUE    SOURCE
zroot/databases/innodb         recordsize  16K      local
zroot/databases/innodb-logs    recordsize  128K     default

The data for web and mail is also mounted directly into the respective jails:
Code:
NAME                                  USED  AVAIL  REFER  MOUNTPOINT
hddpool                              12.5G  2.62T    23K  none
hddpool/mailboxes                    5.96G  2.62T    24K  /iocage/jails/mail/root/var/vmail/mailboxes
hddpool/mailboxes/domain1            2.96M  2.62T  1.07M  /iocage/jails/mail/root/var/vmail/mailboxes/domain1
hddpool/mailboxes/domain2            4.09G  2.62T  4.01G  /iocage/jails/mail/root/var/vmail/mailboxes/domain2
hddpool/mailboxes/domain3            1.87G  2.62T  1.67G  /iocage/jails/mail/root/var/vmail/mailboxes/domain3
hddpool/webroot                      6.38G  2.62T  4.06G  /iocage/jails/www/root/usr/local/www
hddpool/webroot/domain1               101M  2.62T  56.5M  /iocage/jails/www/root/usr/local/www/domain1
hddpool/webroot/domain2               131M  2.62T  94.7M  /iocage/jails/www/root/usr/local/www/domain2
hddpool/webroot/domain3              2.03G  2.62T  1.98G  /iocage/jails/www/root/usr/local/www/domain3

The server crashes at irregular intervals, anywhere from 5 hours to ~4 weeks apart.

First the mariadb jail becomes unresponsive. Sometimes I can still access it, but as soon as I try to access /usr/ports or /var/db/innodb* it locks up completely.
sanoid on the host then starts failing to take its snapshots, and since it is called every minute it racks up thousands of stuck processes; after a few hours memory and swap are full and I have to restart the server.

The debug log has only this:
Code:
kernel: sonewconn: pcb 0xfffff80030c4fc00: Listen queue overflow: 193 already in queue awaiting acceptance (20 occurrences)
spammed all over

smartctl reports no errors.
A zfs scrub finishes without errors.
After a restart mariadb complains:
Code:
2019-08-07 12:15:06 0 [ERROR] InnoDB: Page [page id: space=102, page number=12] log sequence number 3106109799 is in the future! Current system log sequence number 3106109556.
2019-08-07 12:15:06 0 [ERROR] InnoDB: Your database may be corrupt or you may have copied the InnoDB tablespace but not the InnoDB log files. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.
but the data is fine.

Here is a screenshot of two crashes in short succession; you can see the network traffic drop and the memory use rise until the system is power-cycled.


What would be the best way to find out what is causing this?
 
The first thing I would check is how MariaDB is configured, specifically how much memory it's set to use for caches, buffers, etc. Setting these incorrectly is often the reason why it suddenly tries to eat up all available memory; I've seen configurations with a maximum memory usage of several TB while the server only had a few GB of RAM. Never configure this for more than 3/4 of the amount of RAM the machine has, especially if you have other things running on it too.
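
As a rough illustration of the knobs I mean (the values below are only placeholders, not a recommendation for your hardware), the relevant section of my.cnf would look something like:
Code:
[mysqld]
# innodb_buffer_pool_size is usually the biggest single consumer;
# keep it well below 3/4 of RAM when other services share the machine
innodb_buffer_pool_size = 4G
# per-connection buffers are multiplied by max_connections, so cap that as well
max_connections         = 50
tmp_table_size          = 64M
max_heap_table_size     = 64M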

Because you use ZFS too, I would also limit the size of the ARC. I've had lockups because the ARC, MariaDB and a few other applications were all fighting over the same bit of memory.
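
On FreeBSD the ARC can be capped with the vfs.zfs.arc_max loader tunable; a minimal sketch (the 8 GB value is just an example, pick whatever leaves enough room for MariaDB and the jails):
Code:
# /boot/loader.conf -- cap the ZFS ARC; value is in bytes (here 8 GB)
vfs.zfs.arc_max="8589934592"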

You can probably ignore the 'listen queue overflow' messages, those look like they're a symptom of the lockup, not the cause of it.
 
I did not configure mariadb, I was just using the standard settings since the server's workload is minimal. I have now explicitly set sane limits on the most important settings (limiting RAM use to 8G of the total 32G, although my working set is only about 200M anyway) and enabled some more verbose logging to see if anything comes up.
I will take a look at the ARC, as sanoid is configured to take a lot of snapshots and I am also deduping the SSD pool.
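
For reference, this is roughly what I plan to check for ARC and dedup memory use (pool name zroot from the listing above):
Code:
# current ARC size and configured maximum, in bytes
sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.c_max
# dedup table (DDT) histogram for the SSD pool -- the DDT also lives in RAM
zpool status -D zroot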
 