I recently upgraded a small server from a Qotom i5 mini-PC (16GB RAM) to a Topton N100 mini-PC (32GB RAM). I run several light-duty Debian virtual machines on it, and the setup had been rock solid for the past several years. But since the upgrade I have been puzzled by a weird file system corruption problem in the Debian bhyve VMs: FreeBSD's zpool reports no issues whatsoever, yet the VMs keep reporting file system corruption (inode problems, checksum mismatches, etc.).
Here is the start script for one of the VMs, in this case a Pi-hole. The VM uses two ZFS block datasets (zvols): one for the root file system (5GB) and the other for the swap partition (2GB).
Code:
nohup bhyve -c 1 -m 1024M -w -H \
-s 0,hostbridge \
-s 4,virtio-blk,/dev/zvol/work/vm/pihole53 \
-s 5,virtio-blk,/dev/zvol/work/vm/pihole53_swap \
-s 6,virtio-net,tap53 \
-s 29,fbuf,tcp=0.0.0.0:5900,w=1024,h=768,wait -s 30,xhci,tablet \
-s 31,lpc -l bootrom,/usr/local/share/uefi-firmware/BHYVE_UEFI.fd pihole53 &
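For completeness, the two zvols backing this VM were created along these lines (reconstructed from memory, so the exact options are an approximation; I left volblocksize at its default):

Code:
# 5GB root zvol and 2GB swap zvol for the pihole53 VM
# (reconstructed commands; the actual options may have differed)
zfs create -V 5G work/vm/pihole53
zfs create -V 2G work/vm/pihole53_swap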
Sometimes a VM cannot start and gets stuck at the Debian initramfs prompt (see the image below), where Debian complains about the file system and asks for an fsck. The VM may then boot normally after fsck (which fixes many inode problems), but it may also end up in a kernel panic that cannot be recovered from (the VM has to be rebuilt). Even when a VM does boot, there are often many problems with the root file system, and in some cases it gets remounted read-only.
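When it gets stuck, the recovery at the (initramfs) prompt looks roughly like this (the device name depends on the emulated controller: vda2 with virtio-blk, sda2 with ahci-hd):

Code:
# force a full check and answer yes to all repairs
(initramfs) fsck -yf /dev/vda2
# resume booting once fsck completes
(initramfs) exit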
On the FreeBSD host, zpool scrub shows that the zpool and the ZFS datasets are perfect while all of this is happening.
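Concretely, this is what I run on the host (the pool name, work, is taken from the zvol paths in the script above), and it always comes back clean:

Code:
zpool scrub work
zpool status -v work    # no known data errors, all vdevs ONLINE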
The only substantial difference between the old Qotom i5 and the new Topton N100 machine is the boot device: on the former, FreeBSD runs from a USB enclosure (with an mSATA SSD inside); on the latter, it runs from a SATA enclosure (with an M.2 B-Key SSD inside). The FreeBSD version is the same, 14, patched to the latest. The ZFS version is zfs-2.2.0-FreeBSD_g95785196f with zfs-kmod-2.2.0-FreeBSD_g95785196f.
This is quite a headache; it is like a time bomb. ZFS is supposed to be exceptionally reliable, and it has been for the past several years, so I suspect faulty hardware, although the mini-PC boots up just fine. I have tried recreating the Debian virtual machines and changing the disk emulation from virtio-blk to nvme or ahci-hd. A snapshot rollback sometimes restores a working VM, but not always. It is as if each VM has a mind of its own and decides when to go crazy.
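For reference, these are the device lines I substituted for virtio-blk in the start script, plus the rollback I use (the snapshot name here is just illustrative):

Code:
# alternative disk emulations tried in slot 4 (same change for the swap zvol in slot 5)
-s 4,nvme,/dev/zvol/work/vm/pihole53
-s 4,ahci-hd,/dev/zvol/work/vm/pihole53

# rolling back to a known-good snapshot (illustrative snapshot name)
zfs rollback work/vm/pihole53@known-good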
Any ideas? Thanks much!
Debian VM initramfs screen. fsck fixes many issues.
Information provided by dmesg on a booted-up Debian VM.
Code:
[ 8.569264] EXT4-fs error (device sda2): ext4_find_extent:936: inode #52349: comm pihole-FTL: pblk 87225 bad header/extent: extent tree corrupted - magic f30a, entries 9, max 340(340), depth 0(0)
[ 8.569280] Aborting journal on device sda2-8.
[ 8.571911] EXT4-fs error (device sda2): ext4_journal_check_start:83: comm s6-rc: Detected aborted journal
[ 8.572125] EXT4-fs (sda2): Remounting filesystem read-only
[ 1.967922] FAT-fs (vda1): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.
[ 2.007173] EXT4-fs error (device vda2): ext4_lookup:1855: inode #140433: comm apparmor.system: iget: checksum invalid
[ 2.007183] Aborting journal on device vda2-8.
[ 2.007478] EXT4-fs error (device vda2): ext4_journal_check_start:83: comm systemd-journal: Detected aborted journal
[ 2.007879] EXT4-fs error (device vda2): ext4_journal_check_start:83: comm systemd-tmpfile: Detected aborted journal
[ 2.008909] EXT4-fs (vda2): Remounting filesystem read-only
[ 8.283215] EXT4-fs warning (device vda2): ext4_dirblock_csum_verify:405: inode #131491: comm s6-rmrf: No space for directory leaf checksum. Please run e2fsck -D.
[ 8.283222] EXT4-fs error (device vda2): htree_dirblock_to_tree:1082: inode #131491: comm s6-rmrf: Directory block failed checksum
[ 8.283230] Aborting journal on device vda2-8.
[ 8.284508] EXT4-fs error (device vda2): ext4_journal_check_start:83: comm dockerd: Detected aborted journal
[ 8.284682] EXT4-fs (vda2): Remounting filesystem read-only
[ 8.416508] EXT4-fs warning (device vda2): ext4_dirblock_csum_verify:405: inode #131491: comm dockerd: No space for directory leaf checksum. Please run e2fsck -D.
[ 8.416515] EXT4-fs error (device vda2): htree_dirblock_to_tree:1082: inode #131491: comm dockerd: Directory block failed checksum
[ 4.862167] EXT4-fs error (device vda2): ext4_validate_block_bitmap:420: comm ext4lazyinit: bg 29: bad block bitmap checksum
[ 4.862180] Aborting journal on device vda2-8.
[ 4.864966] EXT4-fs error (device vda2): ext4_journal_check_start:83: comm systemd-journal: Detected aborted journal
[ 5.102975] EXT4-fs (vda2): Remounting filesystem read-only