Other NVMe M.2 drive unresponsive (ABORTED - BY REQUEST (00/07))

I'm in the process of setting up 2 new servers with identical hardware, including a Supermicro AOC-SLG3-2M2 M.2 carrier holding 2 Micron 7450 drives in each of them. 3 of those drives (and a few more 7450, 7400 and 7300 in other servers) work as expected, but one drive doesn't initialize properly.
The device node (nvme1) is created, but during boot, or whenever I try to issue an 'identify' command via nvmecontrol (or just run nvmecontrol devlist), the following errors from the device are logged in dmesg:

Code:
nvme1: IDENTIFY (06) sqid:0 cid:0 nsid:0 cdw10:00000001 cdw11:00000000
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:0 sqid:0 cid:0 cdw0:0

It shows up normally in pciconf -l and the device node is created (minus the namespace), but no GEOM provider appears, nor does it show up in e.g. nvmecontrol devlist:
Code:
# pciconf -l | grep nvme
nvme0@pci0:132:0:0:     class=0x010802 rev=0x01 hdr=0x00 vendor=0x1344 device=0x51c3 subvendor=0x1344 subdevice=0x2100
nvme1@pci0:133:0:0:     class=0x010802 rev=0x01 hdr=0x00 vendor=0x1344 device=0x51c3 subvendor=0x1344 subdevice=0x2100

# ls /dev | grep nvme
nvme0
nvme0ns1
nvme1

# ls /dev | grep nda
nda0

# nvmecontrol devlist
 nvme0: Micron_7450_MTFDKBA960TFR
    nvme0ns1 (915715MB)

Trying to poke the device and get some more information:
Code:
# devctl rescan pci0:133:0:0
devctl: Failed to rescan pci0:133:0:0: Operation not supported by device

# devctl attach pci0:133:0:0
devctl: Failed to attach pci0:133:0:0: Device busy

# smartctl -i /dev/nvme1
smartctl 7.3 2022-02-28 r5338 [FreeBSD 13.2-RELEASE amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

Read NVMe Identify Controller failed: NVMe Status 0x01


Anything else I could try or is that one just dead?
 
Just to be clear, these are new drives? You have had a pair working on the card, so it's not a bifurcation issue?

Yes, a set of 4 brand-new drives.
Bifurcation settings are correct and identical on those 2 servers, and the carrier is installed in the same slot, of course. In fact, this is one of 4 nodes in 2 dual-node systems that share the same hardware configuration and hence also have identical EFI/BIOS configuration. Bifurcation was also my first guess when I didn't see the second drive during setup, but I confirmed the settings are correct (the x16 slot is set to 4x4x4x4) and the system picks the drive up correctly (-> nvme1 is created), but the drive doesn't respond to anything...
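For what it's worth, the negotiated link can also be checked from the OS side; something like this (using the selector from the pciconf output in my first post) should show the link width/speed in the PCI-Express capability:
Code:
# list the PCI capabilities of the unresponsive controller; the
# PCI-Express capability line reports the negotiated link (should be x4)
pciconf -lc pci0:133:0:0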

Did you already try to swap 2 drives to isolate whether a drive or a slot is bad?

The system sits in the rack at work and I'm more or less bound to home office this week. Swapping the drives would have been the first thing to try once I'm back at work. I haven't dealt much with NVMe troubleshooting, so I was wondering if there is anything I could try to get it running, or at least gather some more intel on what is wrong, or whether I should request an RMA right away.
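In case it helps, these are roughly the things I'd try remotely (they may well fail with the same ABORTED status given the controller's state):
Code:
# reset the controller, then retry identify and dump the error/health log pages
nvmecontrol reset nvme1
nvmecontrol identify nvme1
nvmecontrol logpage -p 1 nvme1    # error information log
nvmecontrol logpage -p 2 nvme1    # SMART / health information log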
 
This problem seems to be more severe than I originally thought....


Up until now I have had 3 drives fail in the same manner in short succession, but *exclusively* in 13.2-RELEASE hosts:

The first occurrence was on one of the freshly built systems with 2x Micron 7450, where one drive was unresponsive right from the first boot, throwing "ABORTED - BY REQUEST..." errors in dmesg.
The second, *identical* system still runs perfectly fine though, as well as the second drive in that first system.
That dead/unresponsive 7450 drive has since been RMA'd; the replacement was installed 2 weeks ago and is still working.

There are 2 more nodes with the same hardware (Supermicro dual-node systems with X10DRT-PIBF) and Micron 7400 drives, where one drive also dropped out last week with those errors after ~3 months of operation (both nodes also running 13.2-RELEASE).
Yesterday I replaced that failed 7400 Pro with a spare WD Blue that I still had lying around, but that one dropped out just ~3 hours after installation:

Code:
nvme1: RECOVERY_START 26721188760532 vs 26720218109182
nvme1: RECOVERY_START 26721403511829 vs 26720218109182
nvme1: RECOVERY_START 26721588194602 vs 26720218109182
nvme1: RECOVERY_START 26721940381197 vs 26720218109182
nvme1: RECOVERY_START 26721940381197 vs 26720218109182
nvme1: RECOVERY_START 26722189491089 vs 26720218109182
nvme1: Controller in fatal status, resetting
nvme1: Resetting controller due to a timeout and possible hot unplug.
nvme1: RECOVERY_WAITING
nvme1: resetting controller
nvme1: waiting
nvme1: waiting
nvme1: failing outstanding i/o
nvme1: READ sqid:2 cid:123 nsid:1 lba:249674706 len:3
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 sqid:2 cid:123 cdw0:0
nvme1: failing outstanding i/o
(nda1:nvme1:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=ee1bbd2 0 2 0 0 0
nvme1: READ sqid:4 cid:123 nsid:1 lba:249674645 len:1
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 sqid:4 cid:123 cdw0:0
nvme1: failing outstanding i/o
nvme1: READ sqid:4 cid:122 nsid:1 lba:249674646 len:1
nvme1: waiting
nvme1: waiting
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 sqid:4 cid:122 cdw0:0
nvme1: failing outstanding i/o
nvme1: READ sqid:8 cid:120 nsid:1 lba:249669131 len:17
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 sqid:8 cid:120 cdw0:0
nvme1: failing outstanding i/o
(nda1:nvme1:0:0:1): CAM status: CCB request completed with an error
(nda1:nvme1:0:0:1): Error 5, Retries exhausted
nvme1: waiting
(nda1:nvme1:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=ee1bb95 0 0 0 0 0
nvme1: READ sqid:9 cid:123 nsid:1 lba:249674649 len:1
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 sqid:9 cid:123 cdw0:0
(nda1:nvme1:0:0:1): CAM status: CCB request completed with an error
(nda1:nvme1:0:0:1): Error 5, Retries exhausted
(nda1:nvme1:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=ee1bb96 0 0 0 0 0
(nda1:nvme1:0:0:1): CAM status: CCB request completed with an error
(nda1:nvme1:0:0:1): Error 5, Retries exhausted
(nda1:nvme1:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=ee1a60b 0 10 0 0 0
(nda1:nvme1:0:0:1): CAM status: CCB request completed with an error
nvme1: failing outstanding i/o
nvme1: READ sqid:9 cid:120 nsid:1 lba:249674505 len:1
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 sqid:9 cid:120 cdw0:0
nvme1: failing outstanding i/o
nvme1: READ sqid:9 cid:118 nsid:1 lba:249674650 len:1
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 sqid:9 cid:118 cdw0:0
nvme1: failing outstanding i/o
nvme1: READ sqid:9 cid:125 nsid:1 lba:249674651 len:1
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 sqid:9 cid:125 cdw0:0
(nda1:nvme1:0:0:1): Error 5, Retries exhausted
(nda1:nvme1:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=ee1bb99 0 0 0 0 0
nvme1: failing outstanding i/o
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 sqid:17 cid:124 cdw0:0
nvme1: failing outstanding i/o
nvme1: WRITE sqid:17 cid:125 nsid:1 lba:268442604 len:32
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 sqid:17 cid:125 cdw0:0
(nda1:nvme1:0:0:1): CAM status: CCB request completed with an error
(nda1:nvme1:0:0:1): Error 5, Retries exhausted
(nda1:nvme1:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=ee1bb09 0 0 0 0 0
nvme1: failing outstanding i/o
nvme1: WRITE sqid:19 cid:118 nsid:1 lba:352410219 len:1
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 sqid:19 cid:118 cdw0:0
nvme1: failing outstanding i/o
nvme1: WRITE sqid:19 cid:120 nsid:1 lba:352410218 len:1
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 sqid:19 cid:120 cdw0:0
nvme1: failing outstanding i/o
nvme1: READ sqid:25 cid:125 nsid:1 lba:249674652 len:2
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 sqid:25 cid:125 cdw0:0
nvme1: waiting
(nda1:nvme1:0:0:1): CAM status: CCB request completed with an error
(nda1:nvme1:0:0:1): Error 5, Retries exhausted
(nda1:nvme1:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=ee1bb9a 0 0 0 0 0
(nda1:nvme1:0:0:1): CAM status: CCB request completed with an error
(nda1:nvme1:0:0:1): Error 5, Retries exhausted
(nda1:nvme1:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=ee1bb9b 0 0 0 0 0
nvme1: failing outstanding i/o
nvme1: READ sqid:28 cid:121 nsid:1 lba:249674525 len:1
nvme1: waiting
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 sqid:28 cid:121 cdw0:0
nvme1: waiting
(nda1:nvme1:0:0:1): CAM status: CCB request completed with an error
(nda1:nvme1:0:0:1): Error 5, Retries exhausted
(nda1:nvme1:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0 cdw=10001c0c 0 5 0 0 0
(nda1:nvme1:0:0:1): CAM status: CCB request completed with an error
(nda1:nvme1:0:0:1): Error 5, Retries exhausted
(nda1:nvme1:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0 cdw=10001bec 0 1f 0 0 0
(nda1:nvme1:0:0:1): CAM status: CCB request completed with an error
(nda1:nvme1:0:0:1): Error 5, Retries exhausted
(nda1:nvme1:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0 cdw=15015a6b 0 0 0 0 0
(nda1:nvme1:0:0:1): CAM status: CCB request completed with an error
nvme1: waiting
(nda1:nvme1:0:0:1): Error 5, Retries exhausted
(nda1:nvme1:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0 cdw=15015a6a 0 0 0 0 0
(nda1:nvme1:0:0:1): CAM status: CCB request completed with an error
(nda1:nvme1:0:0:1): Error 5, Retries exhausted
nda1 at nvme1 bus 0 scbus9 target 0 lun 1
nda1: <WD Blue SN570 2TB 234200WD 23071H801858>
 s/n 23071H801858 detached
(nda1:nvme1:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=ee1bb9c 0 1 0 0 0
(nda1:nvme1:0:0:1): CAM status: CCB request completed with an error
(nda1:nvme1:0:0:1): Error 5, Periph was invalidated
nvme1: waiting
(nda1:nvme1:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=ee1bb1d 0 0 0 0 0
(nda1:nvme1:0:0:1): CAM status: CCB request completed with an error
(nda1:nvme1:0:0:1): Error 5, Periph was invalidated
nvme1: waiting
(nda1:nvme1:0:0:1): Periph destroyed
nvme1: waiting
nvme1: waiting
nvme1: waiting
nvme1: IDENTIFY (06) sqid:0 cid:0 nsid:0 cdw10:00000001 cdw11:00000000
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:0 sqid:0 cid:0 cdw0:0
nvme1: IDENTIFY (06) sqid:0 cid:0 nsid:0 cdw10:00000001 cdw11:00000000
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:0 sqid:0 cid:0 cdw0:0
nvme1: IDENTIFY (06) sqid:0 cid:0 nsid:0 cdw10:00000001 cdw11:00000000
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:0 sqid:0 cid:0 cdw0:0
nvme1: IDENTIFY (06) sqid:0 cid:0 nsid:0 cdw10:00000001 cdw11:00000000
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:0 sqid:0 cid:0 cdw0:0

An identical drive (from the same order) has been running fine since February in my 12.4-RELEASE workstation.
Except for one host I pretty much skipped 13.0/13.1, so the systems run either 12.4-RELEASE or 13.2-RELEASE, and I don't have any data on 13.0 or 13.1 with NVMe drives.
Other dead nvme drives in the past would just show increasing media/data integrity errors or vanish completely, but not 'lock up' like those drives.
I wasn't able to revive any of those drives in other systems, and they can't be accessed/probed by other FreeBSD versions, illumos or Linux either. There's always only a completely unresponsive generic NVMe device showing up (firmware lockup?).

Given that those *identical* failures occurred with 3 different drives from 2 different vendors within ~4 weeks, but *exclusively* on 13.2-RELEASE hosts, while several 12.4-RELEASE systems (installed with 12.x or 11.x), some even with the same drive models, have been running perfectly fine, I'm no longer convinced this is just a bad coincidence or a hardware/firmware problem... Especially after that last drive failed with *exactly* the same errors after just a few hours.


There are 2 bug reports that might be related (https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=270409 and https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=262969), but I'm not sure whether I should add to one of those (although they don't describe exactly the same problem?) or open a new bug report...
 
This doesn’t sound good.

I’ve got a Micron 7450 in a Supermicro machine that I’m preparing for production. 13.2 and no issues so far, but it has only been idling for a couple of weeks. Had a Samsung in there before, but given Samsung’s bad run recently I thought I’d go to Micron. No issues with the Samsung, but it was only used as a lightly loaded development machine with 12.X then 13.X.

Also have Samsung NVMe in another Supermicro that has been running for a couple of years 12.X to 13.0, 13.1, and 13.2 since it came out.

A couple of other Supermicro machines with NVMe (Samsung IIRC) running for years but on OpenBSD.
 
I don't think this is a problem with Micron - the drive that 'failed' after 3 hours was a WD, and I have ~10 other Micron 7300, 7400 and 7450 drives (and several desktop variants) in use that have been perfectly fine for years.
Another set of 7400s on the same Supermicro carrier has been running with SmartOS for ~1.5 years, and another pair has been running in my home server (again, the same AOC) with 12.3-RELEASE for ~6 months now (together with 4 Transcend and 2 WD M.2 NVMes which are up to 5 years old by now). Some 7300s are running in 2 or 3 hosts with 12.3...

NVMe drives (especially Microns) and adapter cards have been very reliable for me, so those failures in short succession are very conspicuous, especially because they occurred with the exact same failure pattern (one I hadn't seen before - previously drives were either 'completely dead' or slowly dying and throwing errors) on new or almost-new drives of different series/brands, but *exclusively* on hosts that have just been set up with 13.2.


I already considered moving the 2 other hosts back to 12.4, but thanks to incompatible features of ZFS 2.x this is not possible without restoring everything (VMs, jails, userdata...) from backups. That's also the reason why we are still running everything else on 12.4, so we still have ZFS compatibility with smartOS/illumos.
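For context, this is roughly how the incompatibility shows up; the pool name is just an example. Any feature@ property that is 'active' but unknown to the older ZFS blocks importing the pool there:
Code:
# list the pool's feature flags and their state; features in state
# "active" that 12.4 / illumos don't support prevent the import
zpool get all zroot | grep feature@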
 
I’ve set up another Supermicro with 13.2 and a WD NVMe and will see how that goes (it was on 13.1 but I upgraded today). Gave it a bit of a thrashing building ports.

Also got another Supermicro with 13.2 and NVMe that I am bringing online next week.

You are obviously seeing something.
 
Do you monitor drive temperature?
If it is a firmware design bug (I suspect the firmware for the storage controller chips comes bundled from the chip manufacturer to the drive manufacturer, so it may be the same across brands), it may be triggered by some kind of write-buffer overflow: there is a write log, and no space is available because trimming is running inside the NVMe. Which brings me to the question of whether you keep a partition on the drive empty to have spare space for reorganization.
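Something along these lines would leave part of the drive unallocated as extra spare area; device name and size are only an example:
Code:
# create a GPT and a ZFS partition that deliberately leaves a chunk of
# the drive unallocated, so the controller always has free blocks to work with
gpart create -s gpt nda1
gpart add -t freebsd-zfs -a 1m -s 1700G -l zfs-spare-op nda1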

I think the drive does not respond to SMART any more, does it?
Edit: There may be a (temporary) cure for this.
 
Do you monitor drive temperature?
Drive temperatures even under load are way below any warn/critical thresholds (~30-40°C). The server rack is located in a (cold) basement and fans are usually set to 'optimal' speed, so there is plenty of cooling...
The drives in my home server are far more likely to hit higher temperatures (especially during poudriere builds, which run on the pair of Micron 7400s), but I have no problems there despite the drives seeing temps up to ~60°C under load.
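For reference, this is the kind of check I mean (nvme0 just as an example) - the composite temperature and warning counters are in the SMART/health log page:
Code:
# SMART / health information log (includes the composite temperature)
nvmecontrol logpage -p 2 nvme0
# or via smartmontools
smartctl -a /dev/nvme0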


If it is a firmware design bug (I suspect the firmware for the storage controller chips comes bundled from the chip manufacturer to the drive manufacturer, so it may be the same across brands), it may be triggered by some kind of write-buffer overflow: there is a write log, and no space is available because trimming is running inside the NVMe. Which brings me to the question of whether you keep a partition on the drive empty to have spare space for reorganization.
The Microns are/were fully utilized by the single namespace and fully used by partitions/ZFS. The spare WD (2TB) I wanted to use as a temporary mirror had a smaller partition for ZFS to match the size of the 1.92TB Micron. AFAIK those 1.92TB Microns already come with plenty of factory over-provisioning as spare/buffer space - otherwise they wouldn't achieve those high sustained bandwidths, I think.
The pools on those systems/drives were also far from full - the pool on the pair of 1.92TB 7400s holds ~800GB, and the 7450s were below 100GB of used space... So there wasn't really any high load involved on those drives (apart from one bhyve VM on the 7400 pool running a mildly utilized MSSQL server).
The 7400s in my home server get abused by weekly poudriere builds, so those see plenty of IOPS during builds and have higher TBW figures.
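Roughly what I look at to judge fill level and whether TRIM is being issued automatically; the pool name is just a placeholder:
Code:
# pool fill level
zpool list -o name,size,alloc,free,cap nvmepool
# whether ZFS trims freed blocks automatically
zpool get autotrim nvmepool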


I think the drive does not respond to SMART any more, does it?
Nope, completely unresponsive. It shows up as an "nvmeN" device, but any initialization/probing beyond that point fails with the IDENTIFY/ABORTED errors shown in #1.


Edit: There may be a (temporary) cure for this.
I've followed that bug report and also tried to revive those drives in other systems and with other OSes, but to no avail. They are always recognized as 'generic NVMe' devices, but the firmware seems to be unresponsive to anything beyond that point.
The 7450 actually showed up with the correct vendor/model description at POST once after a few reboots, but went dark again before the OS had booted and was then again only shown as a "generic NVMe" device.
 
My longest running FreeBSD 13.x Supermicro:
Code:
% dmesg | grep nvd
nvd0: <Samsung SSD 970 PRO 1TB> NVMe namespace
% uptime
 8:39AM  up 32 days, 17:39, 1 user, load averages: 0.05, 0.06, 0.06
Was on 13.0, upgraded to 13.1 on 16th Feb, upgraded to 13.2 on April 13th and running 13.2 ever since. This one machine is geli-encrypted.

Obviously only one data point compared to your several, but I've got other Supermicro + NVMe + FreeBSD 13.2 machines that I've just set up and started using - I'll watch those closely.

All machines are currently lightly loaded and I'm using the NVMe SSDs as boot/OS devices (UFS) with non-NVMe SSDs for ZFS.
 
I've dusted off an Intel NUC with a NVMe - this was on FreeBSD 13.1 - just upgraded to 13.2 and I'll leave it running to see how it fares.
Code:
% dmesg | grep nvd
nvd0: <Samsung SSD 970 PRO 1TB> NVMe namespace
No-one else out there using NVMe and FreeBSD 13.2?
 
I'm currently waiting for 2 spares and the RMA'd drives to arrive before I touch anything on those hosts again. Snapshots and backups (i.e. sending the snapshots off to the backup/storage hosts and NAS) have been cranked up to */15 to minimize the fallout for the 2 VMs on the nvme pool if that single remaining drive should fail.

I'm also trying to switch the drives around on the carrier after resilvering, just to verify this isn't an issue with the secondary slot on those carriers, because the failures on both hosts only happened on slot #2. Then again - I have at least 8 or 9 of those carriers running without issues, and one of the ones now seeing disk failures previously ran for ~2 years with SmartOS...
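Roughly the plan for the swap; pool and partition names are placeholders, and this assumes the mirror member keeps its device name after the swap:
Code:
zpool offline nvmepool nda1p2   # take the slot #2 member offline
# shut down, swap the two M.2 drives between the carrier slots, boot again
zpool online nvmepool nda1p2    # reattach; ZFS resilvers the delta
zpool status nvmepool           # verify the resilver completes cleanly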


If I only had a clue by now what might trigger those errors - I have an old Supermicro X9 system and some no-name M.2 carriers and old drives lying around that I could abuse and test with...
 
Obviously something is going on, but as I’m about to go into production with a couple of Supermicro machines with NVMe on FreeBSD 13.2, I’m keen to find out what it is.

It‘s not definite (I don’t think?) that any of those three are a factor but it’s what you are seeing so far.
 
I'm also trying to switch the drives around on the carrier after resilvering, just to verify this isn't an issue with the secondary slot on those carriers, because the failures on both hosts only happened on slot #2
Any particular make and model?

I‘ve got some Supermicro and Startech carriers I can try to see if I can reproduce this.
 
Any particular make and model?

I‘ve got some Supermicro and Startech carriers I can try to see if I can reproduce this.

Those servers are all running AOC-SLG3-2M2 carriers.
At home I also use them, as well as some quad-port carriers from AliExpress, and they are running fine.
In essence those carriers are just port adapters with some power-distribution circuitry - absolutely nothing sophisticated and no ICs or even controllers that could act up.
 