I have an issue with my system that started a few days ago. I have two NVMe drives connected to a SuperMicro AOC-SLG3-2M2 PCIe card that I had been using for months (though only in production for a couple of months, after a progressive ramp-up). Both drives started generating errors on FreeBSD a few days ago, with one of them generating more errors than the other:
nvme0: Resetting controller due to a timeout and possible hot unplug.
nvme0: resetting controller
nvme0: Resetting controller due to a timeout and possible hot unplug.
nvme0: Resetting controller due to a timeout and possible hot unplug.
nvme0: Resetting controller due to a timeout and possible hot unplug.
nvme0: Resetting controller due to a timeout and possible hot unplug.
nvme0: failing outstanding i/o
nvme0: READ sqid:4 cid:124 nsid:1 lba:459285472 len:16
nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:4 cid:124 cdw0:0
nvme0: failing outstanding i/o
(nda0:nvme0:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=1b6023e0 0 f 0 0 0
nvme0: READ sqid:9 cid:126 nsid:1 lba:459344056 len:8
(nda0:nvme0:0:0:1): CAM status: Unknown (0x420)
nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:9 cid:126 cdw0:0
(nda0:nvme0:0:0:1): Error 5, Retries exhausted
nvme0: failing outstanding i/o
(nda0:nvme0:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=1b6108b8 0 7 0 0 0
nvme0: WRITE sqid:11 cid:126 nsid:1 lba:1244153000 len:256
(nda0:nvme0:0:0:1): CAM status: Unknown (0x420)
nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:11 cid:126 cdw0:0
nvme0: failing outstanding i/o
(nda0:nvme0:0:0:1): Error 5, Retries exhausted
nvme0: READ sqid:16 cid:125 nsid:1 lba:459264640 len:8
(nda0:nvme0:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0 cdw=4a2844a8 0 ff 0 0 0
nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:16 cid:125 cdw0:0
(nda0:nvme0:0:0:1): CAM status: Unknown (0x420)
(nda0:nvme0:0:0:1): Error 5, Retries exhausted
nda0 at nvme0 bus 0 scbus4 target 0 lun 1
nda0: <WD_BLACK SN770 2TB 731100WD 23160F800275> s/n 23160F800275 detached
(nda0:nvme0:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=1b5fd280 0 7 0 0 0
GEOM_MIRROR: Device boot: provider nda0p2 disconnected.
(nda0:nvme0:0:0:1): CAM status: Unknown (0x420)
(nda0:nvme0:0:0:1): Error 6, Periph was invalidated
(nda0:nvme0:0:0:1): Periph destroyed
nvme0: IDENTIFY (06) sqid:0 cid:0 nsid:0 cdw10:00000001 cdw11:00000000
nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:0 p:0 sqid:0 cid:0 cdw0:0
nvme0: IDENTIFY (06) sqid:0 cid:0 nsid:0 cdw10:00000001 cdw11:00000000
nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:0 p:0 sqid:0 cid:0 cdw0:0
nvme1: Resetting controller due to a timeout and possible hot unplug.
nvme1: resetting controller
nvme1: Waiting for reset to complete
nvme1: Waiting for reset to complete
nvme1: Waiting for reset to complete
nvme1: Waiting for reset to complete
nvme1: Waiting for reset to complete
nvme1: Resetting controller due to a timeout and possible hot unplug.
nvme1: Waiting for reset to complete
nvme1: Waiting for reset to complete
nvme1: Resetting controller due to a timeout and possible hot unplug.
nvme1: failing queued i/o
nvme1: WRITE sqid:3 cid:0 nsid:1 lba:1244155304 len:256
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:0 p:0 sqid:3 cid:0 cdw0:0
nvme1: failing outstanding i/o
(nda1:nvme1:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0 cdw=4a284da8 0 ff 0 0 0
nvme1: READ sqid:5 cid:126 nsid:1 lba:3227128240 len:256
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:5 cid:126 cdw0:0
(nda1:nvme1:0:0:1): CAM status: Unknown (0x420)
nvme1: failing outstanding i/o
nvme1: READ sqid:13 cid:127 nsid:1 lba:3227127984 len:256
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:13 cid:127 cdw0:0
(nda1:nvme1:0:0:1): Error 5, Retries exhausted
(nda1:nvme1:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=c05a11b0 0 ff 0 0 0
nvme1: failing outstanding i/o
nvme1: READ sqid:16 cid:127 nsid:1 lba:3227127728 len:256
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:16 cid:127 cdw0:0
(nda1:nvme1:0:0:1): CAM status: Unknown (0x420)
(nda1:nvme1:0:0:1): Error 5, Retries exhausted
nda1 at nvme1 bus 0 scbus5 target 0 lun 1
nda1: <WD_BLACK SN770 2TB 731100WD 23160F800262> s/n 23160F800262 detached
(nda1:nvme1:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=c05a10b0 0 ff 0 0 0
GEOM_MIRROR: Device boot: provider nda1p2 disconnected.
(nda1:nvme1:0:0:1): CAM status: Unknown (0x420)
GEOM_MIRROR: Device boot: provider destroyed.
(nda1:nvme1:0:0:1): Error 6, Periph was invalidated
(nda1:nvme1:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=c05a0fb0 0 ff 0 0 0
GEOM_MIRROR: Device boot destroyed.
(nda1:nvme1:0:0:1): CAM status: Unknown (0x420)
(nda1:nvme1:0:0:1): Error 6, Periph was invalidated
Solaris: WARNING: Pool 'tank' has encountered an uncorrectable I/O failure and has been suspended.
I can make them fail pretty systematically by running a zpool scrub. Running I/O-intensive services on the machine also makes them fail consistently, whereas a smartctl -t self-test does not reliably trigger the failure. Initially I suspected the PCIe card, but moving one of the drives to the motherboard's M.2 slot did not solve the problem. Both drives appear to be from the same batch, judging by the serial numbers, and the machine uses ECC RAM.
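For reference, this is roughly what I run to trigger the problem and to check the damage afterwards (the pool and device names are from my setup, and the short self-test is just an illustration of what I meant by smartctl -t):

zpool scrub tank               # triggers the controller resets fairly reliably
zpool status -v tank           # the pool ends up suspended, as in the log above
gmirror status                 # the boot mirror drops the affected provider
smartctl -t short /dev/nvme0   # a self-test alone does not reliably trip it
smartctl -t short /dev/nvme1

SMART looks clean on both drives; the full health output for each is below: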
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 42 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 1,663,470 [851 GB]
Data Units Written: 12,013,897 [6.15 TB]
Host Read Commands: 15,559,653
Host Write Commands: 196,433,852
Controller Busy Time: 113
Power Cycles: 137
Power On Hours: 2,330
Unsafe Shutdowns: 134
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 63 Celsius
Temperature Sensor 2: 37 Celsius
Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 27 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 5,585,758 [2.85 TB]
Data Units Written: 11,214,987 [5.74 TB]
Host Read Commands: 32,190,918
Host Write Commands: 184,413,733
Controller Busy Time: 120
Power Cycles: 135
Power On Hours: 2,333
Unsafe Shutdowns: 132
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 47 Celsius
Temperature Sensor 2: 25 Celsius
Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged
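For completeness, the health output above is the SMART/Health Information log page (0x02) for each drive; commands along these lines reproduce it (device names as they appear on my machine, and the nvmecontrol line is, as far as I know, the FreeBSD-native equivalent):

smartctl -a /dev/nvme0
smartctl -a /dev/nvme1
nvmecontrol logpage -p 2 nvme0   # log page 0x02: SMART / Health Information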
What could be going on?