T480 drm-kmod crash (13.1)

Hello,

I installed FreeBSD 13.1 on my ThinkPad T480, but it crashes randomly (dumps posted below).

On the drm-kmod Git page there is a thread describing a similar issue.
They say it's not a drm bug, and one user even mentioned it's ZFS related.

This is confusing to me. Sadly, I did not find a solution.
I do have other FreeBSD machines running X11, and no such problem has ever occurred on them.
I hope it's not a hardware failure.

Here is the full dump

My sysctl.conf, loader.conf, rc.conf

These are the last messages during the crash:
WARNING !drm_modeset_is_locked(&plane->mutex) failed at /wrkdirs/usr/ports/graphics/drm-fbsd13-kmod/work/drm-kmod-drm_v5.4.144_6/drivers/gpu/drm/drm_atomic_helper.c:871
#0 0xffffffff80e5e253 at linux_dump_stack+0x23
#1 0xffffffff83858620 at drm_atomic_helper_check_planes+0xb0
#2 0xffffffff83750dfa at intel_atomic_check+0x124a
#3 0xffffffff83856360 at drm_atomic_check_only+0x400
#4 0xffffffff83856793 at drm_atomic_commit+0x13
#5 0xffffffff838633b8 at drm_client_modeset_commit_atomic+0x148
#6 0xffffffff83863119 at drm_client_modeset_commit_force+0x69
#7 0xffffffff838a30ba at drm_fb_helper_restore_fbdev_mode_unlocked+0x7a
#8 0xffffffff8389d057 at vt_kms_postswitch+0x167
#9 0xffffffff80a709f9 at vt_window_switch+0x2d9
#10 0xffffffff80a6db5f at vtterm_cngrab+0x4f
#11 0xffffffff80bb3916 at cngrab+0x26
#12 0xffffffff80c1b614 at kern_reboot+0x354
#13 0xffffffff80c1bb8e at vpanic+0x1ee
#14 0xffffffff80c1b993 at panic+0x43
#15 0xffffffff810afdf5 at trap_fatal+0x385
#16 0xffffffff810afe4f at trap_pfault+0x4f
#17 0xffffffff81087528 at calltrap+0x8
WARN_ON(!mutex_is_locked(&dev->struct_mutex))WARN_ON(!mutex_is_locked(&fbc->lock))WARN_ON(!mutex_is_locked(&dev->struct_mutex))WARN_ON(!mutex_is_locked(&fbc->lock))

WARN_ON(!mutex_is_locked(&fbc->lock))

WARN_ON(!mutex_is_locked(&fbc->lock))WARN_ON(!mutex_is_locked(&fbc->lock))WARN_ON(!mutex_is_locked(&fbc->lock))
Dumping 1324 out of 32613 MB:..2%..11%..21%..31%..42%..51%..61%..71%..81%..91%

Here is the trace
__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55 /usr/src/sys/amd64/include/pcpu_aux.h: No such file or directory.
(kgdb) #0 __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1 doadump (textdump=<optimized out>)
at /usr/src/sys/kern/kern_shutdown.c:399
#2 0xffffffff80c1b71c in kern_reboot (howto=260)
at /usr/src/sys/kern/kern_shutdown.c:487
#3 0xffffffff80c1bb8e in vpanic (fmt=0xffffffff811b4fb9 "%s",
ap=<optimized out>) at /usr/src/sys/kern/kern_shutdown.c:920
#4 0xffffffff80c1b993 in panic (fmt=<unavailable>)
at /usr/src/sys/kern/kern_shutdown.c:844
#5 0xffffffff810afdf5 in trap_fatal (frame=0xfffffe014f0e5ad0,
eva=274877908032) at /usr/src/sys/amd64/amd64/trap.c:944
#6 0xffffffff810afe4f in trap_pfault (frame=0xfffffe014f0e5ad0,
usermode=false, signo=<optimized out>, ucode=<optimized out>)
at /usr/src/sys/amd64/amd64/trap.c:763
#7 <signal handler called>
#8 __mtx_lock_sleep (c=0xfffff80011c6f778, v=<optimized out>)
at /usr/src/sys/kern/kern_mutex.c:594
#9 0xffffffff80cbf1ec in sopoll_generic (so=0xfffff80011c6f760, events=3,
active_cred=<optimized out>, td=0xfffffe014ef5a900)
at /usr/src/sys/kern/uipc_socket.c:3551
#10 0xffffffff80c8b0dc in fo_poll (fp=0xfffff80011c6f778, events=0,
active_cred=0xfffff802865cf500, td=0xfffffe014ef5a900)
at /usr/src/sys/sys/file.h:369
#11 pollscan (td=0xfffffe014ef5a900, fds=0xfffffe014f0e5d00, nfd=21)
at /usr/src/sys/kern/sys_generic.c:1651
#12 kern_poll (td=0xfffffe014ef5a900, ufds=0x7fffdff7b890,
nfds=<optimized out>, tsp=<optimized out>, uset=<optimized out>,
uset@entry=0x0) at /usr/src/sys/kern/sys_generic.c:1492
#13 0xffffffff80c8ac50 in sys_poll (td=0xfffff80011c6f778,
uap=<optimized out>) at /usr/src/sys/kern/sys_generic.c:1417
#14 0xffffffff810b06ec in syscallenter (td=0xfffffe014ef5a900)
at /usr/src/sys/amd64/amd64/../../kern/subr_syscall.c:189
#15 amd64_syscall (td=0xfffffe014ef5a900, traced=0)
at /usr/src/sys/amd64/amd64/trap.c:1185
#16 <signal handler called>
#17 0x00000008013c963a in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffdff7b7f8
(kgdb)
 
A little bit of additional information.
I'm getting these errors in /var/log/messages during runtime:
Jun 5 16:45:56 ltop ZFS[27]: pool I/O failure, zpool=data1 error=97
Jun 5 16:45:56 ltop ZFS[837]: checksum mismatch, zpool=data1 path=/dev/da1 offset=255102586880 size=131072
Jun 5 16:45:56 ltop ZFS[2423]: pool I/O failure, zpool=data1 error=97
Jun 5 16:45:56 ltop ZFS[3763]: checksum mismatch, zpool=data1 path=/dev/da1 offset=255103766528 size=131072
Jun 5 16:45:56 ltop ZFS[5232]: pool I/O failure, zpool=data1 error=97
Jun 5 16:45:56 ltop ZFS[6255]: checksum mismatch, zpool=data1 path=/dev/da1 offset=255104159744 size=131072
Jun 5 16:45:56 ltop ZFS[7487]: pool I/O failure, zpool=data1 error=97
Jun 5 16:45:56 ltop ZFS[8426]: checksum mismatch, zpool=data1 path=/dev/da1 offset=255104552960 size=131072
Jun 5 16:45:56 ltop ZFS[9551]: pool I/O failure, zpool=data1 error=97
Jun 5 16:45:56 ltop ZFS[10935]: checksum mismatch, zpool=data1 path=/dev/da1 offset=255104946176 size=131072
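
To see whether the mismatches cluster on one device or offset range, the messages above can be tallied with a few lines of Python. This is only a minimal sketch: the function name summarize and both regular expressions are my own assumptions, derived purely from the line format in this log, and other ZFS versions may word these messages differently. As far as I can tell, errno 97 on FreeBSD 13 is EINTEGRITY ("Integrity check failed"), which would fit the checksum messages, but treat that as an assumption too.

```python
import re
from collections import Counter

# Patterns modelled on the log lines quoted in this post (an assumption,
# not an official ZFS message format).
CKSUM_RE = re.compile(
    r"ZFS\[\d+\]: checksum mismatch, zpool=(?P<pool>\S+) "
    r"path=(?P<path>\S+) offset=(?P<offset>\d+) size=(?P<size>\d+)"
)
IOFAIL_RE = re.compile(
    r"ZFS\[\d+\]: pool I/O failure, zpool=(?P<pool>\S+) error=(?P<error>\d+)"
)


def summarize(lines):
    """Return (mismatches, io_failures) collected from syslog lines.

    mismatches  -- list of (pool, device path, offset, size) tuples
    io_failures -- Counter keyed by (pool, errno)
    """
    mismatches = []
    io_failures = Counter()
    for line in lines:
        m = CKSUM_RE.search(line)
        if m:
            mismatches.append(
                (m["pool"], m["path"], int(m["offset"]), int(m["size"]))
            )
            continue
        m = IOFAIL_RE.search(line)
        if m:
            io_failures[(m["pool"], int(m["error"]))] += 1
    return mismatches, io_failures


# Two lines copied from the log above, used as a small self-check.
SAMPLE = [
    "Jun 5 16:45:56 ltop ZFS[27]: pool I/O failure, zpool=data1 error=97",
    "Jun 5 16:45:56 ltop ZFS[837]: checksum mismatch, zpool=data1 "
    "path=/dev/da1 offset=255102586880 size=131072",
]

mismatches, io_failures = summarize(SAMPLE)
for pool, path, offset, size in mismatches:
    print(f"{pool} {path} offset={offset} size={size}")
for (pool, err), count in io_failures.items():
    print(f"{pool} error={err} seen {count} time(s)")
```

To run it against the real log, pass an open file instead of SAMPLE, for example summarize(open("/var/log/messages", errors="replace")).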

My other FreeBSD machines running on ZFS have never reported such an error.

The drive seems alright
smartctl -a /dev/nvme0
smartctl 7.3 2022-02-28 r5338 [FreeBSD 13.1-RELEASE amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number: Force MP510
Serial Number: 19478236000128863341
Firmware Version: ECFM22.5
PCI Vendor/Subsystem ID: 0x1987
IEEE OUI Identifier: 0x6479a7
Total NVM Capacity: 240,057,409,536 [240 GB]
Unallocated NVM Capacity: 0
Controller ID: 1
NVMe Version: 1.3
Number of Namespaces: 1
Namespace 1 Size/Capacity: 240,057,409,536 [240 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 6479a7 2b01563431
Local Time is: Sun Jun 5 17:11:19 2022 CEST
Firmware Updates (0x12): 1 Slot, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005d): Comp DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x08): Telmtry_Lg
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 75 Celsius
Critical Comp. Temp. Threshold: 80 Celsius

Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 6.77W - - 0 0 0 0 0 0
1 + 5.71W - - 1 1 1 1 0 0
2 + 5.19W - - 2 2 2 2 0 0
3 - 0.0490W - - 3 3 3 3 2000 2000
4 - 0.0018W - - 4 4 4 4 25000 25000

Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 2
1 - 4096 0 1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 40 Celsius
Available Spare: 100%
Available Spare Threshold: 5%
Percentage Used: 0%
Data Units Read: 2,178,308 [1.11 TB]
Data Units Written: 1,075,572 [550 GB]
Host Read Commands: 10,585,230
Host Write Commands: 5,694,829
Controller Busy Time: 29
Power Cycles: 195
Power On Hours: 3,677
Unsafe Shutdowns: 112
Media and Data Integrity Errors: 0
Error Information Log Entries: 464
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0

Error Information (NVMe Log 0x01, 16 of 63 entries)
No Errors Logged

But zpool status reports permanent errors that a scrub can't repair. Could this be somehow related to filesystem corruption?
 
Good thing I did not add the "Solved" prefix to this thread yet.

I've submitted a more detailed report to Bugzilla.

It seems that error=97 might not always be a hardware malfunction error, as some threads on other forums reported.
I managed to borrow spare hardware (motherboard, SSD and adapter) for the T480. The I/O problem and data corruption still persist.
Neither memtest86+, the ThinkPad BIOS diagnostic tools, nor smartctl reports any issues with the hardware.
Manual checksumming on Windows 10 did not report any failures.
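
For reference, the kind of manual checksumming I mean can be sketched in Python roughly like this. It is a rough illustration only, not the exact tool I used on Windows: the helper names sha256_of, hash_tree and changed_files are made up for this example. The idea is to hash every file under a directory twice and compare the two results, so that a file whose contents read back differently (or fail to read at all) shows up.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_tree(root):
    """Map each file's path (relative to root) to its digest.

    If a file cannot be read, the error text is stored instead of a digest,
    so read failures are visible in the comparison as well.
    """
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        name = str(path.relative_to(root))
        try:
            manifest[name] = sha256_of(path)
        except OSError as exc:
            manifest[name] = f"read error: {exc}"
    return manifest


def changed_files(before, after):
    """Return the sorted names whose entry differs between two manifests."""
    names = before.keys() | after.keys()
    return sorted(n for n in names if before.get(n) != after.get(n))
```

Typical use would be first = hash_tree("/some/dir"), then the same call again later, then changed_files(first, second). A second pass run right away may be served from cache rather than from the disk, so it is probably more telling after a reboot.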
 