Debian bhyve VM file system corruption (ZVOL backend)

I recently upgraded a small server from a Qotom i5 mini-PC (16 GB RAM) to a Topton N100 mini-PC (32 GB RAM). I run several light-duty Debian virtual machines on it, and the setup has been rock solid for the past several years. But since the upgrade, I am puzzled by a weird Debian bhyve VM file system corruption problem: FreeBSD's zpool reports no issues whatsoever, but the VMs keep reporting file system corruption (inode problems, checksum mismatches, etc.).

Here is the start script for one of the VMs, in this case a Pi-hole. The VM uses two ZFS block datasets (ZVOLs): one for the root file system (5 GB) and one for the swap partition (2 GB).

Code:
nohup bhyve -c 1 -m 1024M -w -H \
-s 0,hostbridge \
-s 4,virtio-blk,/dev/zvol/work/vm/pihole53 \
-s 5,virtio-blk,/dev/zvol/work/vm/pihole53_swap \
-s 6,virtio-net,tap53  \
-s 29,fbuf,tcp=0.0.0.0:5900,w=1024,h=768,wait -s 30,xhci,tablet \
-s 31,lpc -l bootrom,/usr/local/share/uefi-firmware/BHYVE_UEFI.fd pihole53 &

Sometimes a VM cannot start and gets stuck at the Debian initramfs prompt (see the image below). Debian complains about the file system and asks for an fsck. The VM may then boot up normally after fsck (which fixes many inode problems), but it may also end up in a kernel panic that cannot be recovered (the VM had to be rebuilt). Even when a VM does boot up, there are often many problems with the root file system; in some cases the root file system was remounted read-only.

On the FreeBSD host, a zpool scrub shows that the zpool and the ZFS datasets are perfectly healthy while all of this is happening.

The only substantial difference between the old Qotom i5 and the new Topton N100 machine is that FreeBSD runs from a USB enclosure (an mSATA SSD inside) on the former and from a SATA enclosure (an M.2 B-key SSD inside) on the latter. The FreeBSD version is the same, 14, patched to the latest level. The ZFS version is zfs-2.2.0-FreeBSD_g95785196f with zfs-kmod-2.2.0-FreeBSD_g95785196f.

This is quite a headache; it is like a time bomb. ZFS is supposed to be exceptionally reliable, and it has been for the past several years. I suspect the problem is faulty hardware, even though the mini-PC boots up just fine. I have tried recreating the Debian virtual machines and changing the virtio-blk device to nvme or ahci-hd. A snapshot rollback sometimes restores a working VM, but not always. It is as if the VM has a mind of its own and decides when to go crazy.

Any ideas? Thanks much!

Debian VM initramfs screen. fsck fixes many issues.



Information provided by dmesg on a booted-up Debian VM.

Code:
[    8.569264] EXT4-fs error (device sda2): ext4_find_extent:936: inode #52349: comm pihole-FTL: pblk 87225 bad header/extent: extent tree corrupted - magic f30a, entries 9, max 340(340), depth 0(0)
[    8.569280] Aborting journal on device sda2-8.
[    8.571911] EXT4-fs error (device sda2): ext4_journal_check_start:83: comm s6-rc: Detected aborted journal
[    8.572125] EXT4-fs (sda2): Remounting filesystem read-only



[    1.967922] FAT-fs (vda1): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.
[    2.007173] EXT4-fs error (device vda2): ext4_lookup:1855: inode #140433: comm apparmor.system: iget: checksum invalid
[    2.007183] Aborting journal on device vda2-8.
[    2.007478] EXT4-fs error (device vda2): ext4_journal_check_start:83: comm systemd-journal: Detected aborted journal
[    2.007879] EXT4-fs error (device vda2): ext4_journal_check_start:83: comm systemd-tmpfile: Detected aborted journal
[    2.008909] EXT4-fs (vda2): Remounting filesystem read-only



[    8.283215] EXT4-fs warning (device vda2): ext4_dirblock_csum_verify:405: inode #131491: comm s6-rmrf: No space for directory leaf checksum. Please run e2fsck -D.
[    8.283222] EXT4-fs error (device vda2): htree_dirblock_to_tree:1082: inode #131491: comm s6-rmrf: Directory block failed checksum
[    8.283230] Aborting journal on device vda2-8.
[    8.284508] EXT4-fs error (device vda2): ext4_journal_check_start:83: comm dockerd: Detected aborted journal
[    8.284682] EXT4-fs (vda2): Remounting filesystem read-only
[    8.416508] EXT4-fs warning (device vda2): ext4_dirblock_csum_verify:405: inode #131491: comm dockerd: No space for directory leaf checksum. Please run e2fsck -D.
[    8.416515] EXT4-fs error (device vda2): htree_dirblock_to_tree:1082: inode #131491: comm dockerd: Directory block failed checksum



[    4.862167] EXT4-fs error (device vda2): ext4_validate_block_bitmap:420: comm ext4lazyinit: bg 29: bad block bitmap checksum
[    4.862180] Aborting journal on device vda2-8.
[    4.864966] EXT4-fs error (device vda2): ext4_journal_check_start:83: comm systemd-journal: Detected aborted journal
[    5.102975] EXT4-fs (vda2): Remounting filesystem read-only
 
Here is the report.
Code:
root@Home1:/vm # smartctl -t long /dev/ada0
smartctl 7.4 2023-08-01 r5530 [FreeBSD 14.0-RELEASE-p6 amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 30 minutes for test to complete.
Test will complete after Sun Jul 21 13:28:56 2024 EDT
Use smartctl -X to abort test.




root@Home1:/vm # smartctl -x /dev/ada0
smartctl 7.4 2023-08-01 r5530 [FreeBSD 14.0-RELEASE-p6 amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     SPCC M.2 SSD
Serial Number:    AA2023120700089
Firmware Version: HAFEA2.0
User Capacity:    128,035,676,160 bytes [128 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      M.2
TRIM Command:     Available
Device is:        Not in smartctl database 7.3/5528
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Jul 21 14:01:19 2024 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Unavailable

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (65535) seconds.
Offline data collection
capabilities:                    (0x79) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  30) minutes.
Conveyance self-test routine
recommended polling time:        (   6) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     PO-R--   100   100   050    -    0
  9 Power_On_Hours          -O--C-   100   100   000    -    118
 12 Power_Cycle_Count       -O--C-   100   100   000    -    68
161 Unknown_Attribute       -O--C-   100   100   000    -    269
163 Unknown_Attribute       PO----   100   100   050    -    4
165 Unknown_Attribute       ------   100   100   000    -    75
166 Unknown_Attribute       ------   100   100   000    -    0
167 Unknown_Attribute       ------   100   100   000    -    47
172 Unknown_Attribute       -O--C-   100   100   000    -    0
173 Unknown_Attribute       -O---K   100   100   000    -    0
192 Power-Off_Retract_Count -O--C-   100   100   000    -    45
194 Temperature_Celsius     PO---K   062   062   000    -    38 (Min/Max 33/38)
195 Hardware_ECC_Recovered  PO-R--   100   100   050    -    0
198 Offline_Uncorrectable   -O--C-   100   100   000    -    0
241 Total_LBAs_Written      -O--C-   100   100   000    -    519
242 Total_LBAs_Read         -O--C-   100   100   000    -    236
249 Unknown_Attribute       -O--C-   100   100   000    -    5176
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O     51  Comprehensive SMART error log
0x03       GPL     R/O     64  Ext. Comprehensive SMART error log
0x04       GPL,SL  R/O      8  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log

SMART Extended Comprehensive Error Log Version: 1 (64 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       118         -

SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Commands not supported

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 1) ==
0x01  0x008  4              68  ---  Lifetime Power-On Resets
0x01  0x010  4             118  ---  Power-on Hours
0x01  0x018  6      1088433044  ---  Logical Sectors Written
0x01  0x028  6       496386807  ---  Logical Sectors Read
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              38  ---  Current Temperature
0x05  0x020  1              38  ---  Highest Temperature
0x05  0x028  1              33  ---  Lowest Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4             117  ---  Number of Hardware Resets
0x06  0x018  4               3  ---  Number of Interface CRC Errors
0x07  =====  =               =  ===  == Solid State Device Statistics (rev 1) ==
0x07  0x008  1               4  ---  Percentage Used Endurance Indicator
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  4           46  Transition from drive PhyRdy to drive PhyNRdy
0x000a  4           47  Device-to-host register FISes sent due to a COMRESET
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0010  2            0  R_ERR response for host-to-device data FIS, non-CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x0013  2            0  R_ERR response for host-to-device non-data FIS, non-CRC
 
smartctl seems limited on SSDs for what it can observe but nothing jumped out as bad to me for the SMART log. You could try running /usr/local/sbin/update-smart-drivedb (I recall it not working for me) or just manually downloading https://raw.githubusercontent.com/smartmontools/smartmontools/master/smartmontools/drivedb.h to /usr/local/share/smartmontools/drivedb.h to see if any more values get filled in with useful information. If it is an I/O error between the OS and the drive, SMART data may not be aware of it but the host OS should output errors in such a case.
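For the manual route, the whole operation is a one-liner on FreeBSD (URL and destination path as given above; fetch is in the base system):

```shell
# Manually refresh the smartmontools drive database (run as root):
fetch -o /usr/local/share/smartmontools/drivedb.h \
  https://raw.githubusercontent.com/smartmontools/smartmontools/master/smartmontools/drivedb.h
# Then re-run smartctl -x /dev/ada0 and see if more attributes get named.
```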
 
I would run general stability tests.
Could you please provide some detailed instructions? Like what packages and what tests to run? Thanks.

smartctl seems limited on SSDs for what it can observe but nothing jumped out as bad to me for the SMART log. [...]
Will try. Thanks.
 
I did many tests using multiple mini-PCs, and I think I have pinpointed the issue as a software problem: a bhyve Debian virtual machine using a ZVOL as backend storage can run into file system corruption under certain conditions (though I am not sure what those conditions are).

First, on the same Topton N100 mini-PC, the problem does not occur if the backend of the bhyve Debian virtual machine is switched from a ZVOL to image files. Granted, I may not have tested long enough. But with the ZVOL backend, the file system corruption usually surfaces within 15-30 minutes; with the image file backend, no problem occurred after a whole day.

Second, the problem does not occur on three other mini-PCs (with ZVOL as the backend).

Therefore, something on this Topton N100 triggers the problem. I lean more toward the software side because (1) with image files as the backend there is no problem, so hardware-wise the mini-PC itself is working fine. In addition, I ran a memtest and the result was "Pass," and many smartctl tests found nothing abnormal. (2) The problem always occurs with the ZVOL backend, regardless of how FreeBSD is booted: SATA / NVMe / USB (different SSDs were used in each case).

Here is how the tests are done:

Four mini-PCs: Azulle Byte 3, Qotom i5, Topton N5105, and Topton N100, each installed with FreeBSD 14 and updated to the latest version. The test repeatedly boots up a fresh installation of a Debian virtual machine and checks whether file system corruption happens. I used a simple script:

Code:
while true
do
        /vm/debian12
        sleep 15

        echo Test if the EXT4 file system is OK.
        ssh -i OpenPrivateKey testuser@192.168.10.40 -t "sudo dmesg | grep EXT4-fs"
        sleep 2
done

/vm/debian12 is another script that boots up the virtual machine (destroying it first before booting). The virtual machine has a fixed IP of 192.168.10.40.

Code:
bhyvectl --vm=debian12 --destroy
sleep 2

nohup bhyve -c 1 -m 1024M -w -H \
-s 0,hostbridge \
-s 4,nvme,/dev/zvol/vm/debian12 \
-s 5,nvme,/dev/zvol/vm/debian12_swap \
-s 6,virtio-net,tap12  \
-s 31,lpc -l bootrom,/usr/local/share/uefi-firmware/BHYVE_UEFI.fd debian12 &

When there is nothing wrong, the following message is repeatedly displayed:
Code:
Test if the EXT4 file system is OK.
[    1.574342] EXT4-fs (nvme0n1p2): mounted filesystem with ordered data mode. Quota mode: none.
[    2.068328] EXT4-fs (nvme0n1p2): re-mounted. Quota mode: none.
Connection to 192.168.10.40 closed.

When the file system corruption problem happens, something like the following is displayed:
Code:
[    1.235683] EXT4-fs (nvme0n1p2): mounted filesystem with ordered data mode. Quota mode: none.
[    1.635141] EXT4-fs (nvme0n1p2): re-mounted. Quota mode: none.
[    2.642559] EXT4-fs warning (device nvme0n1p2): ext4_dirblock_csum_verify:405: inode #128363: comm systemd-tmpfile: No space for directory leaf checksum. Please run e2fsck -D.
[    2.642565] EXT4-fs error (device nvme0n1p2): htree_dirblock_to_tree:1082: inode #128363: comm systemd-tmpfile: Directory block failed checksum
[    2.642669] EXT4-fs (nvme0n1p2): Remounting filesystem read-only
[    2.643131] EXT4-fs warning (device nvme0n1p2): ext4_dirblock_csum_verify:405: inode #128363: comm systemd-tmpfile: No space for directory leaf checksum. Please run e2fsck -D.
[    2.643135] EXT4-fs error (device nvme0n1p2): htree_dirblock_to_tree:1082: inode #128363: comm systemd-tmpfile: Directory block failed checksum

Errors were only observed on the Topton N100 mini-PC. The other three mini-PCs reported no errors after a whole day of testing (> 1,000 VM boot-ups).

On the Topton N100, when the backend is switched to image files, miraculously, the error no longer appears.
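For anyone wanting to reproduce the comparison, the file-backed variant can be sketched roughly as below. IMGDIR is a placeholder (point it at your real VM directory), and the dd migration step assumes the image sizes match the ZVOLs above.

```shell
#!/bin/sh
# Sketch only: file-backed disks in place of the ZVOLs above.
# IMGDIR is a placeholder path; adjust to your setup.
IMGDIR=${IMGDIR:-/tmp/vm-img}
mkdir -p "$IMGDIR"

# Sparse image files sized to match the ZVOLs (5G root, 2G swap):
truncate -s 5G "$IMGDIR/debian12.img"
truncate -s 2G "$IMGDIR/debian12_swap.img"

# To migrate an existing guest, copy the ZVOL contents in first, e.g.:
#   dd if=/dev/zvol/vm/debian12 of="$IMGDIR/debian12.img" bs=1M conv=sparse
# then change the bhyve disk lines to:
#   -s 4,nvme,$IMGDIR/debian12.img \
#   -s 5,nvme,$IMGDIR/debian12_swap.img \
```

The files are sparse, so they only consume space as the guest writes (which also means the pool can fill up later; keep an eye on free space).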

For what it's worth, the Topton N100 uses DDR5 RAM.
 
You should open a bug report describing what you found. I don't think there is enough information to easily reproduce the bug, but other people may have experienced similar problems and can help pin down the hardware component that seems to be decisive.

Luckily, I chose never to use ZVOLs, for another reason.
 
Could you please provide some detailed instructions? Like what packages and what tests to run? Thanks.

math/mprime, but I use the older version without built-in multiprocessing.

SuperPi. There's no port; I use the Linux binary.

memtest86+
 
wxppro, this may be completely off the mark, but how is that ZFS volume provisioned?
Is it sparse, or is its space fully reserved?
How much free space does the ZFS pool have?
 
These two commands were used. Note that I also tested with volmode=dev; same result.
Code:
zfs create -V 5G -o volmode=default    vm/debian12
zfs create -V 2G -o volmode=default    vm/debian12_swap

Report by the FreeBSD host machine:
Code:
root@Host1:/vm # zfs list
NAME                            USED  AVAIL  REFER  MOUNTPOINT

vm/debian12                    1.63G  27.6G  1.63G  -
vm/debian12_swap               57.2M  27.6G  57.2M  -

Free space reported inside the Debian VM:
Code:
root@debian12:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            457M     0  457M   0% /dev
tmpfs            96M  388K   96M   1% /run
/dev/nvme0n1p2  4.4G  1.8G  2.4G  42% /
tmpfs           479M     0  479M   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/nvme0n1p1  511M  9.9M  502M   2% /boot/efi
tmpfs            96M     0   96M   0% /run/user/1001

root@debian12:~# swapon
NAME           TYPE      SIZE USED PRIO
/dev/nvme1n1p1 partition   2G   0B   -2
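One cheap thing to compare across the four mini-PCs is the ZVOL's block size and what the host exposes for its device node; a mismatch with the guest's sector size shouldn't corrupt data by itself, but it is easy to rule out. A sketch (dataset name from above; run on each FreeBSD host):

```shell
# ZVOL properties that affect I/O behavior:
zfs get -H -o property,value volblocksize,volmode,sync vm/debian12

# Sector size and media size the host exposes for the zvol device node:
diskinfo -v /dev/zvol/vm/debian12
```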
 
Code:
  9 Power_On_Hours          -O--C-   100   100   000    -    118
 12 Power_Cycle_Count       -O--C-   100   100   000    -    68

Does this drive have any weird power-saving features that might cause I/O timeouts? 68 power cycles at only 118 power-on hours seems *very* excessive...
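A few host-side checks would answer that (device name from the thread; exact output varies by drive): whether APM is actually in effect, and whether the host has logged any bus resets or command timeouts around the corruption events.

```shell
# Current APM level, if the drive supports it
# (the smartctl -x output above already said "APM feature is: Unavailable"):
smartctl -g apm /dev/ada0

# Full identify data; look for "advanced power management" in the feature list:
camcontrol identify ada0

# AHCI/CAM resets, timeouts, or retries logged by the host:
dmesg | grep -Ei 'ahcich|timeout|retrying|CAM status'
```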
 