Hello
I have some serious problems with my FreeBSD server and I am out of ideas on how to narrow them down, so I thought maybe someone can help me out.
I have a system with ZFS on root on an SSD mirror and a ZFS mirror of two HDDs for data.
The system handles my private websites and my private mail server. It is a FreeBSD 12.0-RELEASE host with 3 jails managed via iocage:
host: sanoid/syncoid, prometheus node exporter
jail 1: mail: nginx, mariadb, sogo, dovecot, postfix, rspamd, unbound
jail 2: www: nginx, php
jail 3: mariadb: mariadb
All jails have the ports tree mounted via nullfs from the host:
Code:
/usr/ports /iocage/jails/www/root/usr/ports nullfs ro 0 0 # mount ports dir as readonly
/usr/ports/distfiles /iocage/jails/www/root/var/ports/distfiles nullfs rw 0 0 # mount distfiles readwrite
/usr/ports/packages /iocage/jails/www/root/var/ports/packages nullfs rw 0 0 # mount packages readwrite
The mariadb jail gets some ZFS datasets directly from the SSD pool, with a special recordsize for the db:
Code:
zroot/databases/innodb       77.0M   178G  2.81M  /iocage/jails/mariadb/root/var/db/innodb
zroot/databases/innodb-logs   358M   178G  42.0M  /iocage/jails/mariadb/root/var/db/innodb-logs

NAME                         PROPERTY    VALUE  SOURCE
zroot/databases/innodb       recordsize  16K    local
zroot/databases/innodb-logs  recordsize  128K   default
The data for web and mail is also mounted directly into the respective jail:
Code:
NAME                       USED   AVAIL  REFER  MOUNTPOINT
hddpool                    12.5G  2.62T    23K  none
hddpool/mailboxes          5.96G  2.62T    24K  /iocage/jails/mail/root/var/vmail/mailboxes
hddpool/mailboxes/domain1  2.96M  2.62T  1.07M  /iocage/jails/mail/root/var/vmail/mailboxes/domain1
hddpool/mailboxes/domain2  4.09G  2.62T  4.01G  /iocage/jails/mail/root/var/vmail/mailboxes/domain2
hddpool/mailboxes/domain3  1.87G  2.62T  1.67G  /iocage/jails/mail/root/var/vmail/mailboxes/domain3
hddpool/webroot            6.38G  2.62T  4.06G  /iocage/jails/www/root/usr/local/www
hddpool/webroot/domain1     101M  2.62T  56.5M  /iocage/jails/www/root/usr/local/www/domain1
hddpool/webroot/domain2     131M  2.62T  94.7M  /iocage/jails/www/root/usr/local/www/domain2
hddpool/webroot/domain3    2.03G  2.62T  1.98G  /iocage/jails/www/root/usr/local/www/domain3
The server crashes at irregular intervals, anywhere from 5 hours to ~4 weeks.
First the mariadb jail becomes unresponsive. Sometimes I can still get into it, but when I try to access /usr/ports or /var/db/innodb* it locks up completely.
Then sanoid on the host starts failing to take its snapshots, and since it is called every minute it racks up thousands of stuck processes. After a few hours memory and swap are full and I have to restart the server.
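Until the root cause is found, one thing that should at least stop the pile-up is to let a snapshot run skip itself while an earlier one is still stuck. A minimal sketch in plain sh, using mkdir as an atomic lock. The function name run_exclusive, the exit code 75 and the lock path in the usage comment are made up for illustration, and I have not run this in production yet:

```shell
#!/bin/sh
# run_exclusive LOCKDIR COMMAND [ARGS...]
# Runs COMMAND only if LOCKDIR can be created (mkdir either succeeds or fails
# atomically). Returns 75 without running anything if the lock already exists.
run_exclusive() {
    lockdir=$1
    shift
    if ! mkdir "$lockdir" 2>/dev/null; then
        echo "skipped: $lockdir exists, previous run still active" >&2
        return 75
    fi
    "$@"
    cmd_rc=$?
    rmdir "$lockdir"
    return "$cmd_rc"
}

# Hypothetical cron usage (lock path and sanoid path are assumptions):
#   run_exclusive /var/run/sanoid.lock /usr/local/bin/sanoid --cron
```

A run that hangs in the kernel keeps the lock, which is the point here: later runs exit immediately instead of stacking up. The downside is that a killed run leaves a stale lock directory that has to be removed by hand. lockf(1) from the FreeBSD base system could probably do the same job without that problem.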
The debug log has only this, spammed all over:
Code:
kernel: sonewconn: pcb 0xfffff80030c4fc00: Listen queue overflow: 193 already in queue awaiting acceptance (20 occurrences)
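As far as I understand it, this message means the accept queue of a listening socket is full, so whatever process owns that socket has stopped accepting connections. To see whether it is always the same socket, the messages can be tallied per pcb address. A rough sketch; tally_overflows is just a name I picked, and the log path in the usage comment is an assumption:

```shell
#!/bin/sh
# Tally "Listen queue overflow" messages per socket (pcb address).
# Reads syslog lines on stdin and prints one line per socket:
#   <pcb address> <number of log lines> <largest queue length seen>
tally_overflows() {
    awk '/sonewconn: pcb .*Listen queue overflow/ {
        for (i = 1; i <= NF; i++) {
            if ($i == "pcb")       { pcb = $(i + 1); sub(/:$/, "", pcb) }
            if ($i == "overflow:") { qlen = $(i + 1) + 0 }
        }
        count[pcb]++
        if (qlen > max[pcb]) max[pcb] = qlen
    }
    END { for (p in count) print p, count[p], max[p] }'
}

# Assumed usage:
#   tally_overflows < /var/log/debug.log
```

While the problem is happening, `netstat -Lan` on the host should show which listening address currently has the full queue.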
smartctl reports no errors.
A zfs scrub finishes without errors.
After a restart, mariadb complains:
Code:
2019-08-07 12:15:06 0 [ERROR] InnoDB: Page [page id: space=102, page number=12] log sequence number 3106109799 is in the future! Current system log sequence number 3106109556.
2019-08-07 12:15:06 0 [ERROR] InnoDB: Your database may be corrupt or you may have copied the InnoDB tablespace but not the InnoDB log files. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.
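To put a number on "in the future": the gap between the page LSN and the current system LSN can be pulled straight out of that message. A small sketch (lsn_gap is a made-up name), relying on the LSN being a byte offset into the redo log:

```shell
#!/bin/sh
# For each "is in the future" line on stdin, print how far the page LSN is
# ahead of the current system LSN (page LSN minus system LSN).
lsn_gap() {
    sed -n 's/.*log sequence number \([0-9][0-9]*\) is in the future! Current system log sequence number \([0-9][0-9]*\).*/\1 \2/p' |
    while read -r page_lsn system_lsn; do
        echo $((page_lsn - system_lsn))
    done
}
```

For the line above this gives 3106109799 - 3106109556 = 243, so the page is only a few hundred bytes ahead of the log. To me that looks more like the last log writes not reaching the disk before the power-cycle than like widespread corruption, but that is only my guess.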
Here is a screenshot of two crashes within a short time; you can see the network traffic dropping and the memory use rising until the system is power-cycled.
What would be the best way to find out what is causing this?
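The only idea I have so far is to catch the kernel stacks of the stuck processes the next time it happens, to see what they are all waiting on. A sketch of what I would run on the host as root; the function names are made up, and I am assuming that the hung processes show up in state D (uninterruptible sleep):

```shell
#!/bin/sh
# Reduce "ps -axo pid,state,wchan,command" output to processes whose state
# starts with D, printing "<pid> <wait channel>" for each.
blocked_procs() {
    awk 'NR > 1 && $2 ~ /^D/ { print $1, $3 }'
}

# Dump the kernel stack of every blocked process (FreeBSD only, needs root).
dump_blocked_stacks() {
    ps -axo pid,state,wchan,command | blocked_procs |
    while read -r pid wchan; do
        echo "== pid $pid waiting on $wchan"
        procstat -kk "$pid"
    done
}
```

Kernel threads are normally in state D as well, so the output will be noisy; the interesting entries should be the zfs, sanoid and mariadb processes. If someone knows a better approach, I am happy to try it.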