FreeBSD 13.1 crashing regularly

I'm running into an issue with FreeBSD crashing roughly once a week. The system runs as a KVM guest with 6 passed-through hard drives (a zroot mirror and one ZFS raidz1 pool with a cache drive), a passed-through Realtek gigabit LAN card, and 64GB of RAM. The host uses an AMD Ryzen 7 2800 8-core/16-thread CPU, with 12 threads allocated to this VM.

The crashes seem almost random; the only thing that has reduced them is unloading the re_ko_mod driver for the Ethernet card. Each crash also seems to blame a different cause:

Code:
Feb  9 07:51:41 fileserver kernel:
Feb  9 07:51:41 fileserver syslogd: last message repeated 1 times
Feb  9 07:51:41 fileserver kernel: Fatal trap 12: page fault while in kernel mode
Feb  9 07:51:41 fileserver kernel: cpuid = 10; apic id = 0a
Feb  9 07:51:41 fileserver kernel: fault virtual address        = 0x7b21b
Feb  9 07:51:41 fileserver kernel: fault code           = supervisor read data, page not present
Feb  9 07:51:41 fileserver kernel: instruction pointer  = 0x20:0xffffffff8220c1e6
Feb  9 07:51:41 fileserver kernel: stack pointer                = 0x0:0xfffffe019d5f3cf0
Feb  9 07:51:41 fileserver kernel: frame pointer                = 0x0:0xfffffe019d5f3d50
Feb  9 07:51:41 fileserver kernel: code segment         = base 0x0, limit 0xfffff, type 0x1b
Feb  9 07:51:41 fileserver kernel:                      = DPL 0, pres 1, long 1, def32 0, gran 1
Feb  9 07:51:41 fileserver kernel: processor eflags     = interrupt enabled, resume, IOPL = 0
Feb  9 07:51:41 fileserver kernel: current process              = 5 (dp_sync_taskq_1)
Feb  9 07:51:41 fileserver kernel: trap number          = 12
Feb  9 07:51:41 fileserver kernel: panic: page fault
Feb  9 07:51:41 fileserver kernel: cpuid = 10
Feb  9 07:51:41 fileserver kernel: time = 1675936880
Feb  9 07:51:41 fileserver kernel: KDB: stack backtrace:
Feb  9 07:51:41 fileserver kernel: #0 0xffffffff80c694c5 at kdb_backtrace+0x65
Feb  9 07:51:41 fileserver kernel: #1 0xffffffff80c1bb7f at vpanic+0x17f
Feb  9 07:51:41 fileserver kernel: #2 0xffffffff80c1b9f3 at panic+0x43
Feb  9 07:51:41 fileserver kernel: #3 0xffffffff810afdf5 at trap_fatal+0x385
Feb  9 07:51:41 fileserver kernel: #4 0xffffffff810afe4f at trap_pfault+0x4f
Feb  9 07:51:41 fileserver kernel: #5 0xffffffff810875d8 at calltrap+0x8
Feb  9 07:51:41 fileserver kernel: #6 0xffffffff82224770 at dnode_sync+0x110
Feb  9 07:51:41 fileserver kernel: #7 0xffffffff8220b5c9 at sync_dnodes_task+0x89
Feb  9 07:51:41 fileserver kernel: #8 0xffffffff821a29ef at taskq_run+0x1f
Feb  9 07:51:41 fileserver kernel: #9 0xffffffff80c7daa1 at taskqueue_run_locked+0x181
Feb  9 07:51:41 fileserver kernel: #10 0xffffffff80c7edb2 at taskqueue_thread_loop+0xc2
Feb  9 07:51:41 fileserver kernel: #11 0xffffffff80bd8abe at fork_exit+0x7e
Feb  9 07:51:41 fileserver kernel: #12 0xffffffff8108864e at fork_trampoline+0xe
Feb  9 07:51:41 fileserver kernel: Uptime: 14h23m50s
Feb  9 07:51:41 fileserver kernel: (ada2:ahcich2:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Feb  9 07:51:41 fileserver kernel: (ada2:ahcich2:0:0:0): CAM status: Command timeout
Feb  9 07:51:41 fileserver kernel: (ada2:ahcich2:0:0:0): Error 5, Retries exhausted
Feb  9 07:51:41 fileserver kernel: (ada2:ahcich2:0:0:0): Synchronize cache failed
Feb  9 07:51:41 fileserver kernel: (ada5:ahcich6:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Feb  9 07:51:41 fileserver kernel: (ada5:ahcich6:0:0:0): CAM status: Command timeout
Feb  9 07:51:41 fileserver kernel: (ada5:ahcich6:0:0:0): Error 5, Retries exhausted
Feb  9 07:51:41 fileserver kernel: (ada5:ahcich6:0:0:0): Synchronize cache failed
Feb  9 07:51:41 fileserver kernel: Automatic reboot in 15 seconds - press a key on the console to abort
Feb  9 07:51:41 fileserver kernel: Rebooting...

Code:
Feb  1 07:40:55 fileserver kernel: panic: bad pte va 8004d2000 pte 1a0527404
Feb  1 07:40:55 fileserver kernel: cpuid = 11
Feb  1 07:40:55 fileserver kernel: time = 1675214102
Feb  1 07:40:55 fileserver kernel: KDB: stack backtrace:
Feb  1 07:40:55 fileserver kernel: #0 0xffffffff80c694a5 at kdb_backtrace+0x65
Feb  1 07:40:55 fileserver kernel: #1 0xffffffff80c1bb5f at vpanic+0x17f
Feb  1 07:40:55 fileserver kernel: #2 0xffffffff80c1b9d3 at panic+0x43
Feb  1 07:40:55 fileserver kernel: #3 0xffffffff810a0a6f at pmap_remove_pages+0x92f
Feb  1 07:40:55 fileserver kernel: #4 0xffffffff80bd0523 at exec_new_vmspace+0x223
Feb  1 07:40:55 fileserver kernel: #5 0xffffffff80ba2d46 at exec_elf64_imgact+0xb16
Feb  1 07:40:55 fileserver kernel: #6 0xffffffff80bcee2d at kern_execve+0x77d
Feb  1 07:40:55 fileserver kernel: #7 0xffffffff80bce35a at sys_execve+0x5a
Feb  1 07:40:55 fileserver kernel: #8 0xffffffff810b06ec at amd64_syscall+0x10c
Feb  1 07:40:55 fileserver kernel: #9 0xffffffff81087ecb at fast_syscall_common+0xf8
Feb  1 07:40:55 fileserver kernel: Uptime: 5d8h19m50s
Feb  1 07:40:55 fileserver kernel: Automatic reboot in 15 seconds - press a key on the console to abort
Feb  1 07:40:55 fileserver kernel: Rebooting...

Code:
Jan 22 06:04:28 fileserver kernel: Fatal trap 12: page fault while in kernel mode
Jan 22 06:04:28 fileserver kernel: cpuid = 9; apic id = 09
Jan 22 06:04:28 fileserver kernel: fault virtual address    = 0x440
Jan 22 06:04:28 fileserver kernel: fault code        = supervisor read data, page not present
Jan 22 06:04:28 fileserver kernel: instruction pointer    = 0x20:0xffffffff80c269ce
Jan 22 06:04:28 fileserver kernel: stack pointer            = 0x28:0xfffffe0114fa1d20
Jan 22 06:04:28 fileserver kernel: frame pointer            = 0x28:0xfffffe0114fa1dc0
Jan 22 06:04:28 fileserver kernel: code segment        = base 0x0, limit 0xfffff, type 0x1b
Jan 22 06:04:28 fileserver kernel:             = DPL 0, pres 1, long 1, def32 0, gran 1
Jan 22 06:04:28 fileserver kernel: processor eflags    = interrupt enabled, resume, IOPL = 0
Jan 22 06:04:28 fileserver kernel: current process        = 5 (dbu_evict)
Jan 22 06:04:28 fileserver kernel: trap number        = 12
Jan 22 06:04:28 fileserver kernel: panic: page fault
Jan 22 06:04:28 fileserver kernel: cpuid = 9
Jan 22 06:04:28 fileserver kernel: time = 1674381882
Jan 22 06:04:28 fileserver kernel: KDB: stack backtrace:
Jan 22 06:04:28 fileserver kernel: #0 0xffffffff80c694a5 at kdb_backtrace+0x65
Jan 22 06:04:28 fileserver kernel: #1 0xffffffff80c1bb5f at vpanic+0x17f
Jan 22 06:04:28 fileserver kernel: #2 0xffffffff80c1b9d3 at panic+0x43
Jan 22 06:04:28 fileserver kernel: #3 0xffffffff810afdf5 at trap_fatal+0x385
Jan 22 06:04:28 fileserver kernel: #4 0xffffffff810afe4f at trap_pfault+0x4f
Jan 22 06:04:28 fileserver kernel: #5 0xffffffff810875b8 at calltrap+0x8
Jan 22 06:04:28 fileserver kernel: #6 0xffffffff821f4b56 at dnode_destroy+0x256
Jan 22 06:04:28 fileserver kernel: #7 0xffffffff821f5a32 at dnode_buf_evict_async+0x92
Jan 22 06:04:28 fileserver kernel: #8 0xffffffff80c7da81 at taskqueue_run_locked+0x181
Jan 22 06:04:28 fileserver kernel: #9 0xffffffff80c7ed92 at taskqueue_thread_loop+0xc2
Jan 22 06:04:28 fileserver kernel: #10 0xffffffff80bd8a9e at fork_exit+0x7e
Jan 22 06:04:28 fileserver kernel: #11 0xffffffff8108862e at fork_trampoline+0xe
Jan 22 06:04:28 fileserver kernel: Uptime: 1d12h7m45s
Jan 22 06:04:28 fileserver kernel: Automatic reboot in 15 seconds - press a key on the console to abort
Jan 22 06:04:28 fileserver kernel: Rebooting...

Many of them seem to be signalling a page fault, but the RAM is fairly new and returns 0 errors when tested. Any help would be appreciated.
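To compare the panics side by side, something like the following might help. It is only a minimal sketch: the function name summarize_panic and the list of "panic path" frames are my own choices, not anything from FreeBSD tooling. It converts the kernel's "time =" value (seconds since the epoch) to UTC so it can be checked against the local syslog timestamps, and picks the first backtrace frame that is not part of the panic/trap machinery.

```python
import re
from datetime import datetime, timezone

# Matches backtrace lines such as "#6 0xffffffff82224770 at dnode_sync+0x110".
FRAME_RE = re.compile(r"#(\d+) 0x[0-9a-f]+ at (\w+)\+0x[0-9a-f]+")
# Matches the epoch value printed with the panic, e.g. "time = 1675936880".
TIME_RE = re.compile(r"\btime = (\d+)")
# Frames belonging to the panic/trap handling itself (hand-picked from the logs above).
PANIC_PATH = {"kdb_backtrace", "vpanic", "panic",
              "trap_fatal", "trap_pfault", "calltrap"}


def summarize_panic(log_text):
    """Return (panic time in UTC or None, first frame outside the panic path or None)."""
    when = None
    match = TIME_RE.search(log_text)
    if match:
        when = datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)

    culprit = None
    for _, func in FRAME_RE.findall(log_text):
        if func not in PANIC_PATH:
            culprit = func
            break
    return when, culprit


if __name__ == "__main__":
    sample = (
        "kernel: panic: page fault\n"
        "kernel: time = 1675936880\n"
        "kernel: #3 0xffffffff810afdf5 at trap_fatal+0x385\n"
        "kernel: #5 0xffffffff810875d8 at calltrap+0x8\n"
        "kernel: #6 0xffffffff82224770 at dnode_sync+0x110\n"
    )
    when, culprit = summarize_panic(sample)
    print(when.isoformat(), culprit)  # 2023-02-09T10:01:20+00:00 dnode_sync
```

Run against the three logs above, this would report dnode_sync, pmap_remove_pages and dnode_destroy as the first frames outside the panic path.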
 
I would strongly advise having a really good look at the SMART data of your ada2 and ada5. What are these discs and how are they connected? I recently had such problems when using bad/substandard cables. Check smartctl -a /dev/ada2 and smartctl -a /dev/ada5, and even if the drives look fine, run some long tests. I had discs fail without them knowing; only the messages from ZFS told me that something was up.
 
Both are part of the raidz1 pool. ada2 is a Western Digital Blue 2TB; ada5 is the cache drive, a PNY 500GB SSD. The PNY is less than a month old, the Western Digital is about a year old.

Code:
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Blue (SMR)
Device Model:     WDC WD20EZAZ-00L9GB0
Serial Number:    WD-WXF2A31E38FY
LU WWN Device Id: 5 0014ee 2bec08d87
Firmware Version: 80.00A80
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database 7.3/5319
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Feb  9 09:03:19 2023 MST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (  17) The self-test routine was aborted by
                                        the host.
Total time to complete Offline
data collection:                (31004) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 243) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x3031) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   170   166   021    Pre-fail  Always       -       2475
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       58
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   082   082   000    Old_age   Always       -       13514
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       56
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       25
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       1519
194 Temperature_Celsius     0x0022   115   096   000    Old_age   Always       -       28
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

Code:
=== START OF INFORMATION SECTION ===
Device Model:     PNY CS900 500GB SSD
Serial Number:    PNY22442211040100C57
LU WWN Device Id: 5 f8db4c 224400c57
Firmware Version: CS900615
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available
Device is:        Not in smartctl database 7.3/5319
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Feb  9 09:04:10 2023 MST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (65535) seconds.
Offline data collection
capabilities:                    (0x79) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  30) minutes.
Conveyance self-test routine
recommended polling time:        (   6) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       474
 12 Power_Cycle_Count       0x0012   100   100   000    Old_age   Always       -       5
168 Unknown_Attribute       0x0012   100   100   000    Old_age   Always       -       0
170 Unknown_Attribute       0x0003   100   100   000    Pre-fail  Always       -       136
173 Unknown_Attribute       0x0012   100   100   000    Old_age   Always       -       1
192 Power-Off_Retract_Count 0x0012   100   100   000    Old_age   Always       -       3
194 Temperature_Celsius     0x0023   067   067   000    Pre-fail  Always       -       33 (Min/Max 33/33)
218 Unknown_Attribute       0x000b   100   100   050    Pre-fail  Always       -       0
231 Unknown_SSD_Attribute   0x0013   100   100   000    Pre-fail  Always       -       100
241 Total_LBAs_Written      0x0012   100   100   000    Old_age   Always       -       327

SMART Error Log Version: 1
No Errors Logged

I'd guess ada2 is the most likely culprit. I'll run a long test and see if that turns anything up.
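For comparing the attribute tables before and after the long test, here is a rough sketch that scans pasted smartctl -a output and reports watched attributes with a non-zero raw value. The function name flag_smart_attributes and the set of watched IDs (5, 197, 198, 199) are my own illustrative choices, and it assumes the ten-column table layout shown above.

```python
# Attribute IDs picked for illustration: reallocated, pending and
# uncorrectable sectors, plus UDMA CRC errors (often cable related).
WATCHED_IDS = frozenset({5, 197, 198, 199})


def flag_smart_attributes(smartctl_output, watched=WATCHED_IDS):
    """Return {attribute_name: raw_value} for watched attributes with raw value != 0."""
    flagged = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in watched and raw.isdigit() and int(raw) != 0:
            flagged[name] = int(raw)
    return flagged


if __name__ == "__main__":
    rows = (
        "  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0\n"
        "199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0\n"
    )
    print(flag_smart_attributes(rows))  # {}
```

With the two tables posted above it returns an empty dict, which matches what the raw values already show: nothing flagged yet on either drive.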
 
A long test is a good idea. I have two discs in my backup server that failed in the last few weeks with no bad blocks recorded at all; only after the long test did they also log SMART errors. Others had the same issues you mention, but a set of good cables and cleaning of the contacts helped there. Also, maybe the power supply is getting old. The capacitors age, and the voltage gets a bit uneven over time.

But I see the system is running in a VM, which is an extra source of problems. What keeps you from running it on bare metal?
 
Right now my issue with bare metal is GPU passthrough. I use that for games/AI scripts on both Windows VMs and Linux VMs. If hardware prices drop a bit more I'd love to rebuild the GPU system as a standalone machine, but I can't get a new computer at the moment. If support for that gets added to FreeBSD-14, I might try to redo the system as a FreeBSD host then.
 