This problem seems to be more severe than I originally thought...
Up to now, I have had 3 drives fail in the same manner in short succession, but *exclusively* on 13.2-RELEASE hosts:
The first occurrence was on one of the freshly built systems with 2x Micron 7450, where one drive was unresponsive right from the first boot, throwing "ABORTED - BY REQUEST..." errors in dmesg.
The second, *identical* system is still running perfectly fine, as is the second drive in that first system.
That dead/unresponsive 7450 has since been RMA'd; the replacement was installed 2 weeks ago and is still working.
There are 2 more nodes with the same hardware (Supermicro dual-node systems with X10DRT-PIBF boards) and Micron 7400 drives; one of those drives also dropped out last week with the same errors, after ~3 months of operation. (Both nodes are also running 13.2-RELEASE.)
Yesterday I replaced that failed 7400 Pro with a spare WD Blue I still had lying around, but that one dropped out just ~3 hours after installation:
Code:
nvme1: RECOVERY_START 26721188760532 vs 26720218109182
nvme1: RECOVERY_START 26721403511829 vs 26720218109182
nvme1: RECOVERY_START 26721588194602 vs 26720218109182
nvme1: RECOVERY_START 26721940381197 vs 26720218109182
nvme1: RECOVERY_START 26721940381197 vs 26720218109182
nvme1: RECOVERY_START 26722189491089 vs 26720218109182
nvme1: Controller in fatal status, resetting
nvme1: Resetting controller due to a timeout and possible hot unplug.
nvme1: RECOVERY_WAITING
nvme1: resetting controller
nvme1: waiting
nvme1: waiting
nvme1: failing outstanding i/o
nvme1: READ sqid:2 cid:123 nsid:1 lba:249674706 len:3
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 sqid:2 cid:123 cdw0:0
nvme1: failing outstanding i/o
(nda1:nvme1:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=ee1bbd2 0 2 0 0 0
nvme1: READ sqid:4 cid:123 nsid:1 lba:249674645 len:1
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 sqid:4 cid:123 cdw0:0
nvme1: failing outstanding i/o
nvme1: READ sqid:4 cid:122 nsid:1 lba:249674646 len:1
nvme1: waiting
nvme1: waiting
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 sqid:4 cid:122 cdw0:0
nvme1: failing outstanding i/o
nvme1: READ sqid:8 cid:120 nsid:1 lba:249669131 len:17
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 sqid:8 cid:120 cdw0:0
nvme1: failing outstanding i/o
(nda1:nvme1:0:0:1): CAM status: CCB request completed with an error
(nda1:nvme1:0:0:1): Error 5, Retries exhausted
nvme1: waiting
(nda1:nvme1:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=ee1bb95 0 0 0 0 0
nvme1: READ sqid:9 cid:123 nsid:1 lba:249674649 len:1
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 sqid:9 cid:123 cdw0:0
(nda1:nvme1:0:0:1): CAM status: CCB request completed with an error
(nda1:nvme1:0:0:1): Error 5, Retries exhausted
(nda1:nvme1:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=ee1bb96 0 0 0 0 0
(nda1:nvme1:0:0:1): CAM status: CCB request completed with an error
(nda1:nvme1:0:0:1): Error 5, Retries exhausted
(nda1:nvme1:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=ee1a60b 0 10 0 0 0
(nda1:nvme1:0:0:1): CAM status: CCB request completed with an error
nvme1: failing outstanding i/o
nvme1: READ sqid:9 cid:120 nsid:1 lba:249674505 len:1
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 sqid:9 cid:120 cdw0:0
nvme1: failing outstanding i/o
nvme1: READ sqid:9 cid:118 nsid:1 lba:249674650 len:1
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 sqid:9 cid:118 cdw0:0
nvme1: failing outstanding i/o
nvme1: READ sqid:9 cid:125 nsid:1 lba:249674651 len:1
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 sqid:9 cid:125 cdw0:0
(nda1:nvme1:0:0:1): Error 5, Retries exhausted
(nda1:nvme1:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=ee1bb99 0 0 0 0 0
nvme1: failing outstanding i/o
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 sqid:17 cid:124 cdw0:0
nvme1: failing outstanding i/o
nvme1: WRITE sqid:17 cid:125 nsid:1 lba:268442604 len:32
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 sqid:17 cid:125 cdw0:0
(nda1:nvme1:0:0:1): CAM status: CCB request completed with an error
(nda1:nvme1:0:0:1): Error 5, Retries exhausted
(nda1:nvme1:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=ee1bb09 0 0 0 0 0
nvme1: failing outstanding i/o
nvme1: WRITE sqid:19 cid:118 nsid:1 lba:352410219 len:1
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 sqid:19 cid:118 cdw0:0
nvme1: failing outstanding i/o
nvme1: WRITE sqid:19 cid:120 nsid:1 lba:352410218 len:1
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 sqid:19 cid:120 cdw0:0
nvme1: failing outstanding i/o
nvme1: READ sqid:25 cid:125 nsid:1 lba:249674652 len:2
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 sqid:25 cid:125 cdw0:0
nvme1: waiting
(nda1:nvme1:0:0:1): CAM status: CCB request completed with an error
(nda1:nvme1:0:0:1): Error 5, Retries exhausted
(nda1:nvme1:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=ee1bb9a 0 0 0 0 0
(nda1:nvme1:0:0:1): CAM status: CCB request completed with an error
(nda1:nvme1:0:0:1): Error 5, Retries exhausted
(nda1:nvme1:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=ee1bb9b 0 0 0 0 0
nvme1: failing outstanding i/o
nvme1: READ sqid:28 cid:121 nsid:1 lba:249674525 len:1
nvme1: waiting
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 sqid:28 cid:121 cdw0:0
nvme1: waiting
(nda1:nvme1:0:0:1): CAM status: CCB request completed with an error
(nda1:nvme1:0:0:1): Error 5, Retries exhausted
(nda1:nvme1:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0 cdw=10001c0c 0 5 0 0 0
(nda1:nvme1:0:0:1): CAM status: CCB request completed with an error
(nda1:nvme1:0:0:1): Error 5, Retries exhausted
(nda1:nvme1:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0 cdw=10001bec 0 1f 0 0 0
(nda1:nvme1:0:0:1): CAM status: CCB request completed with an error
(nda1:nvme1:0:0:1): Error 5, Retries exhausted
(nda1:nvme1:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0 cdw=15015a6b 0 0 0 0 0
(nda1:nvme1:0:0:1): CAM status: CCB request completed with an error
nvme1: waiting
(nda1:nvme1:0:0:1): Error 5, Retries exhausted
(nda1:nvme1:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0 cdw=15015a6a 0 0 0 0 0
(nda1:nvme1:0:0:1): CAM status: CCB request completed with an error
(nda1:nvme1:0:0:1): Error 5, Retries exhausted
nda1 at nvme1 bus 0 scbus9 target 0 lun 1
nda1: <WD Blue SN570 2TB 234200WD 23071H801858>
s/n 23071H801858 detached
(nda1:nvme1:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=ee1bb9c 0 1 0 0 0
(nda1:nvme1:0:0:1): CAM status: CCB request completed with an error
(nda1:nvme1:0:0:1): Error 5, Periph was invalidated
nvme1: waiting
(nda1:nvme1:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=ee1bb1d 0 0 0 0 0
(nda1:nvme1:0:0:1): CAM status: CCB request completed with an error
(nda1:nvme1:0:0:1): Error 5, Periph was invalidated
nvme1: waiting
(nda1:nvme1:0:0:1): Periph destroyed
nvme1: waiting
nvme1: waiting
nvme1: waiting
nvme1: IDENTIFY (06) sqid:0 cid:0 nsid:0 cdw10:00000001 cdw11:00000000
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:0 sqid:0 cid:0 cdw0:0
nvme1: IDENTIFY (06) sqid:0 cid:0 nsid:0 cdw10:00000001 cdw11:00000000
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:0 sqid:0 cid:0 cdw0:0
nvme1: IDENTIFY (06) sqid:0 cid:0 nsid:0 cdw10:00000001 cdw11:00000000
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:0 sqid:0 cid:0 cdw0:0
nvme1: IDENTIFY (06) sqid:0 cid:0 nsid:0 cdw10:00000001 cdw11:00000000
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:0 sqid:0 cid:0 cdw0:0
An identical drive (from the same order) has been running fine since February in my 12.4-RELEASE workstation.
Except for one host, I pretty much skipped 13.0/13.1, so the systems run either 12.4-RELEASE or 13.2-RELEASE, and I don't have any data on 13.0 or 13.1 with NVMe drives.
Other NVMe drives that died in the past would just show increasing media/data-integrity errors or vanish completely; they never 'locked up' like these drives.
I wasn't able to revive any of those drives in other systems, and they can't be accessed or probed by other FreeBSD versions, illumos, or Linux; in every case only a completely unresponsive, generic NVMe device shows up. (Firmware lockup?)
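For reference, by "probed" I mean roughly the following on the FreeBSD side, using the usual nvmecontrol(8) subcommands (none of them get a response out of the locked-up drives; the nvme1 device name is just an example):
Code:
# list NVMe controllers/namespaces the kernel detected
nvmecontrol devlist
# request the controller identify data
nvmecontrol identify nvme1
# read the SMART / health information log page (page 0x02)
nvmecontrol logpage -p 2 nvme1
# last resort: reset the controller
nvmecontrol reset nvme1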
Given that these *identical* failures occurred with 3 different drives from 2 different vendors within ~4 weeks, but *exclusively* on 13.2-RELEASE hosts, while several 12.4-RELEASE systems (installed with 12.x or 11.x), some even with the same drive models, have been running perfectly fine, I'm no longer convinced this is just a bad coincidence or a hardware/firmware problem... especially after that last drive failed with *exactly* the same errors after just a few hours.
There are 2 bug reports that might be related (https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=270409 and https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=262969), but since neither describes exactly the same problem, I'm not sure whether I should add to one of those or open a new bug report...
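In the meantime I want the next failure to produce more than the terse aborts above, so whatever report I end up filing has better data. A minimal sketch of what I'd add on the remaining 13.2 hosts, assuming the hw.nvme.verbose_cmd_dump tunable is present in 13.2 (I'd verify with `sysctl -d hw.nvme` first):
Code:
# /boot/loader.conf -- assumed diagnostic tunable, verify with sysctl -d
hw.nvme.verbose_cmd_dump="1"   # dump full nvme command contents on failures
If I remember right, there is also a per-controller dev.nvme.X.timeout_period sysctl that could be raised to rule out plain timeout flakiness, though that wouldn't explain why the drives stay dead across reboots and other OSes.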