NVMe drive issues...

I have an issue with my system that started a few days ago. I have two NVMe drives connected to a Supermicro AOC-SLG3-2M2 PCIe card that I had been using for months (though only in production for a couple of months, through a progressive ramp-up). A few days ago they both started generating errors on FreeBSD, with one of them generating more errors than the other:

nvme0: Resetting controller due to a timeout and possible hot unplug.
nvme0: resetting controller
nvme0: Resetting controller due to a timeout and possible hot unplug.
nvme0: Resetting controller due to a timeout and possible hot unplug.
nvme0: Resetting controller due to a timeout and possible hot unplug.
nvme0: Resetting controller due to a timeout and possible hot unplug.
nvme0: failing outstanding i/o
nvme0: READ sqid:4 cid:124 nsid:1 lba:459285472 len:16
nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:4 cid:124 cdw0:0
nvme0: failing outstanding i/o
(nda0:nvme0:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=1b6023e0 0 f 0 0 0
nvme0: READ sqid:9 cid:126 nsid:1 lba:459344056 len:8
(nda0:nvme0:0:0:1): CAM status: Unknown (0x420)
nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:9 cid:126 cdw0:0
(nda0:nvme0:0:0:1): Error 5, Retries exhausted
nvme0: failing outstanding i/o
(nda0:nvme0:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=1b6108b8 0 7 0 0 0
nvme0: WRITE sqid:11 cid:126 nsid:1 lba:1244153000 len:256
(nda0:nvme0:0:0:1): CAM status: Unknown (0x420)
nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:11 cid:126 cdw0:0
nvme0: failing outstanding i/o
(nda0:nvme0:0:0:1): Error 5, Retries exhausted
nvme0: READ sqid:16 cid:125 nsid:1 lba:459264640 len:8
(nda0:nvme0:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0 cdw=4a2844a8 0 ff 0 0 0
nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:16 cid:125 cdw0:0
(nda0:nvme0:0:0:1): CAM status: Unknown (0x420)
(nda0:nvme0:0:0:1): Error 5, Retries exhausted
nda0 at nvme0 bus 0 scbus4 target 0 lun 1
nda0: <WD_BLACK SN770 2TB 731100WD 23160F800275> s/n 23160F800275 detached
(nda0:nvme0:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=1b5fd280 0 7 0 0 0
GEOM_MIRROR: Device boot: provider nda0p2 disconnected.
(nda0:nvme0:0:0:1): CAM status: Unknown (0x420)
(nda0:nvme0:0:0:1): Error 6, Periph was invalidated
(nda0:nvme0:0:0:1): Periph destroyed
nvme0: IDENTIFY (06) sqid:0 cid:0 nsid:0 cdw10:00000001 cdw11:00000000
nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:0 p:0 sqid:0 cid:0 cdw0:0
nvme0: IDENTIFY (06) sqid:0 cid:0 nsid:0 cdw10:00000001 cdw11:00000000
nvme0: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:0 p:0 sqid:0 cid:0 cdw0:0
nvme1: Resetting controller due to a timeout and possible hot unplug.
nvme1: resetting controller
nvme1: Waiting for reset to complete
nvme1: Waiting for reset to complete
nvme1: Waiting for reset to complete
nvme1: Waiting for reset to complete
nvme1: Waiting for reset to complete
nvme1: Resetting controller due to a timeout and possible hot unplug.
nvme1: Waiting for reset to complete
nvme1: Waiting for reset to complete
nvme1: Resetting controller due to a timeout and possible hot unplug.
nvme1: failing queued i/o
nvme1: WRITE sqid:3 cid:0 nsid:1 lba:1244155304 len:256
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:0 p:0 sqid:3 cid:0 cdw0:0
nvme1: failing outstanding i/o
(nda1:nvme1:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0 cdw=4a284da8 0 ff 0 0 0
nvme1: READ sqid:5 cid:126 nsid:1 lba:3227128240 len:256
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:5 cid:126 cdw0:0
(nda1:nvme1:0:0:1): CAM status: Unknown (0x420)
nvme1: failing outstanding i/o
nvme1: READ sqid:13 cid:127 nsid:1 lba:3227127984 len:256
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:13 cid:127 cdw0:0
(nda1:nvme1:0:0:1): Error 5, Retries exhausted
(nda1:nvme1:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=c05a11b0 0 ff 0 0 0
nvme1: failing outstanding i/o
nvme1: READ sqid:16 cid:127 nsid:1 lba:3227127728 len:256
nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:16 cid:127 cdw0:0
(nda1:nvme1:0:0:1): CAM status: Unknown (0x420)
(nda1:nvme1:0:0:1): Error 5, Retries exhausted
nda1 at nvme1 bus 0 scbus5 target 0 lun 1
nda1: <WD_BLACK SN770 2TB 731100WD 23160F800262> s/n 23160F800262 detached
(nda1:nvme1:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=c05a10b0 0 ff 0 0 0
GEOM_MIRROR: Device boot: provider nda1p2 disconnected.
(nda1:nvme1:0:0:1): CAM status: Unknown (0x420)
GEOM_MIRROR: Device boot: provider destroyed.
(nda1:nvme1:0:0:1): Error 6, Periph was invalidated
(nda1:nvme1:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=c05a0fb0 0 ff 0 0 0
GEOM_MIRROR: Device boot destroyed.
(nda1:nvme1:0:0:1): CAM status: Unknown (0x420)
(nda1:nvme1:0:0:1): Error 6, Periph was invalidated
Solaris: WARNING: Pool 'tank' has encountered an uncorrectable I/O failure and has been suspended.

I can make them fail pretty systematically by running a zpool scrub. Running I/O-intensive services on the machine also makes them fail consistently. Running smartctl -t does not seem to reliably make them fail. Initially I thought the PCIe card was at fault, but I moved one of the drives to the motherboard's M.2 slot and it did not solve the problem. Both drives seem to be from the same batch based on the serial numbers. I am using ECC RAM.
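
For reference, this is roughly how I trigger the failure and how I pulled the health data below (pool and device names are the ones from the log above):

# trigger the failures fairly reliably
zpool scrub tank

# a device self-test does not necessarily trigger them
smartctl -t short /dev/nvme0

# health/error logs for both controllers (output below)
smartctl -a /dev/nvme0
smartctl -a /dev/nvme1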

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 42 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 1,663,470 [851 GB]
Data Units Written: 12,013,897 [6.15 TB]
Host Read Commands: 15,559,653
Host Write Commands: 196,433,852
Controller Busy Time: 113
Power Cycles: 137
Power On Hours: 2,330
Unsafe Shutdowns: 134
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 63 Celsius
Temperature Sensor 2: 37 Celsius

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 27 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 5,585,758 [2.85 TB]
Data Units Written: 11,214,987 [5.74 TB]
Host Read Commands: 32,190,918
Host Write Commands: 184,413,733
Controller Busy Time: 120
Power Cycles: 135
Power On Hours: 2,333
Unsafe Shutdowns: 132
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 47 Celsius
Temperature Sensor 2: 25 Celsius

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged

What could be going on?
 
I tried upgrading the drives to the latest firmware (731120WD). It did not fix the issue. I also measured the motherboard voltage at 3.30 V with a multimeter. Should I look for better drives? Are there other things I should try?
 
I've not had issues with FreeBSD and NVMe SSDs (several servers over the last couple of years, FreeBSD 13.x), and that seems to be the case for a lot of people, but as you've seen there are definitely reports of people encountering issues.

I've not been using ZFS (but the messages seem to be coming from lower levels?), and most of my installs are on M.2 slots on Supermicro motherboards rather than on the PCIe daughterboard.

So it seems like there might be something going on in some circumstances but I don’t think anyone has completely figured out what.

Which isn’t any help to you but means it’s hard to find and fix.
 
I am using a SolidRun HoneyComb LX2 arm64 motherboard. I experience the issue both through a Supermicro PCIe daughterboard (using bifurcation) and through the motherboard's M.2 slot. I am running 14.0-RELEASE-p5. Prior to this issue appearing, the system had been doing quite a bit of I/O for a month or so (mostly reads; swap, atime and auto-trim are disabled). The drive temperature does not seem to go above 70°C. I am using heatsinks with heatpipes. The issue appeared on both drives at the same time, but it is always the same drive that gets dropped first. I am using ZFS encryption. It looks like people experience similar issues on Linux, but not on Windows. This system is critical for me and I need to fix it. Currently I don't know whether I can expect to fix the issue by buying different drives, or whether the issue will reappear after two months no matter which drives I purchase.
 
Micron drives are probably better than WD, but as you've seen on the other thread, sko encountered the same issue with Micron.


I'm using Micron 7450s as well but haven't seen what sko saw, though I've got a different setup and I'm not doing intensive I/O; I'm usually using the NVMe drives as boot/OS drives.

Hopefully sko will chip in soon - hopefully he found a way round the issue.
 
I saw reports from people having issues with pretty much every major manufacturer, except maybe Intel? Some people say it is very model- or even batch-specific. It could also be application-specific instead. It looks like I have ruled out the daughterboard as the source of the issue on my system, but I am not sure whether the issue is with the OSes or with any SSD I could possibly buy. I have an SN735 SSD in my (Linux) laptop with ZFS encryption and I experience no issue of this kind, but the application is very different...
 
I read something somewhere that some issues may be heat related and the addition of a heat sink on the device might help alleviate some problems with intensive read/write operations... YMMV....
 
Yes, I have read that as well. I already have the largest (passive) heatsinks I could find that fit in my enclosure. I don't see them going above 70°C when I monitor them with smartctl... Short of using active cooling on the SSD heatsinks themselves, or improving thermal conductivity with custom copper pads and thermal grease instead of thermal silicone pads, I am not sure how I could improve this. The enclosure gets plenty of air and the CPU runs cool, around 45°C. I have an EVGA SuperNova PSU with plenty of power for this system, 64 GB of ECC RAM, no graphics card, double surge protection, and a UPS with voltage correction. I don't think I have cut corners with the hardware as far as I know.
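
In case it is useful, this is roughly how I watch the temperatures under load (just a simple polling loop; the 30-second interval is arbitrary):

# poll both drives' temperature readings every 30 seconds
while true; do
    date
    smartctl -a /dev/nvme0 | grep -i temperature
    smartctl -a /dev/nvme1 | grep -i temperature
    sleep 30
done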

It seems that either a bunch of NVMe drives fail when put under a kind of load that does not occur on Windows, or both the FreeBSD and Linux kernel NVMe drivers fail to behave the way these drives expect under some specific conditions.
 
Is there anything I can do to try isolating the issue? I can trigger the problem quite consistently on my system...
Does it only affect NVMe drives, or also USB and SATA SSDs?
 
Currently I am using nda. Should I try nvd and see if it makes a difference?

Edit: When I try nvd, zfs scrub successfully completes. The sector size now appears as 512 instead of 4096 though, and speed is slower.
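
(For reference, I switched drivers with the loader tunable below and rebooted; on 14.0, nda(4) is the default:)

# /boot/loader.conf
hw.nvme.use_nvd="1"    # attach NVMe namespaces via the legacy nvd(4) driver instead of nda(4)

Removing the line (or setting it to "0") and rebooting goes back to nda.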
 
Is there anything I can do to try isolating the issue? I can trigger the problem quite consistently on my system...
Does it only affect NVMe drives, or also USB and SATA SSDs?

I can't reproduce it and as you've seen sko has replied on the other thread and it just seemed to stop happening there.

I've only seen it reported on NVMe SSDs. I've definitely got lots of SATA SSDs running for years on Supermicro/Dell servers with no issues (so far!)

It seems to happen to a few people, some of the time.

It's been around for a while, though - that other link about Intel drives was from 5 years ago. I think most people using SSDs never see it, but it's obviously there.

The sector size now appears as 512 instead of 4096 though, and speed is slower.
Anything here?


Hopefully nvd does the trick, but it probably wouldn't hurt to try something other than WD if the issue comes back.
 
Thanks. I am wondering if there are some nda tunables I should set, another block size I should select, or whether the trim settings or the device nodes I use to build the ZFS pool could make it stable. Could there be some kind of conflict in my config between ZFS and nvme trim settings or block sizes? For example, currently I have ashift=12 and autotrim=off for my pool, but gpart reports a 512-byte sector size under nvd, and there are some trim sysctl variables that appear to be enabled.
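
These are the settings I checked so far (the last grep is just to see which trim knobs are exposed):

zpool get ashift,autotrim tank          # ashift=12, autotrim=off
gpart list nvd0 | grep -i sectorsize    # reports 512 under nvd
sysctl -a | grep -i trim                # several trim-related knobs appear enabled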

nvmecontrol identify nvd0
Size: 3907029168 blocks
Capacity: 3907029168 blocks
Utilization: 3907029168 blocks
Thin Provisioning: Not Supported
Number of LBA Formats: 2
Current LBA Format: LBA Format #00
Metadata Capabilities
Extended: Not Supported
Separate: Not Supported
Data Protection Caps: Not Supported
Data Protection Settings: Not Enabled
Multi-Path I/O Capabilities: Not Supported
Reservation Capabilities: Not Supported
Format Progress Indicator: 0% remains
Deallocate Logical Block: Read 00h, Write Zero
Optimal I/O Boundary: 0 blocks
NVM Capacity: 2000398934016 bytes
Globally Unique Identifier: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
IEEE EUI64: xxxxxxxxxxxxxxxxxxxxxx
LBA Format #00: Data Size: 512 Metadata Size: 0 Performance: Good
LBA Format #01: Data Size: 4096 Metadata Size: 0 Performance: Better

Could the fact that LBA Format #00 (512-byte sectors) is selected while the pool uses ashift=12 cause an issue with nda?

I came across this, which seems helpful regarding changing the LBA format: https://www.truenas.com/community/threads/intel-p4510-nvme-format-4k.105233/
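
For the record, it looks like on FreeBSD this can be done directly with nvmecontrol(8) instead of the vendor tool in that link. It erases the namespace, so only on a drive that is already out of the pool; the format index comes from the identify output above:

nvmecontrol format -f 1 nvme0ns1    # switch to LBA Format #01 (4096-byte data size); destroys all data on the namespace

After that the drive has to be re-partitioned and re-added to the mirrors.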
 
I think the problem is no-one knows - if they did, that would help resolve this issue.

It's sporadic and seems to affect a few users.

If anyone had found the answer(s) then they'd have fixed the code or firmware or be able to answer your questions.

It doesn't sound like sko changed anything (other than using new drives), and the same machines with the same PCIe daughterboard started working.
 
Not very helpful answers from me but I'm just not seeing this (like most people).

Not had a problem with SATA SSDs or NVMe SSDs on any operating system - Linux, OpenBSD, FreeBSD, Windows, MacOS. (Sure one day I will, but not so far!)

These are the worst sorts of problems - there's obviously something that happens for a portion of users, but no-one knows what the cause(s) are. And sko's experience is that the problem disappeared with 90% of the same hardware, so maybe it was just a couple of flaky Micron NVMes? Firmware? Or maybe some sporadic issue with the daughterboard that just hasn't reared its head again (and I sincerely hope it doesn't!) And a lot of complex parts involved - hardware, firmware, OS, file systems, drivers, etc., most of them changing fairly frequently.
 
I just finished rebuilding the arrays after changing the LBA format to 4096-byte sectors instead of 512. Both drives failed as soon as I started a ZFS scrub after re-enabling nda... Switched back to nvd. So far so good. In my case I know it is not the daughterboard at least, since the problem also occurs when I use the motherboard's M.2 slot... I can go buy an SN850X drive and see how it goes. 2 TB Micron drives do not seem to be available, and it seems Samsung drives are more problematic than WD's...
 
Keeping my fingers, legs, eyes & ears crossed.

I've got a LOT of Samsung SSDs (SATA & NVMe) but I seemed to miss the bad spell they had (got to be jinxing myself with all these declarations of "not happening to me"!) Not sure if they have recovered yet.
 
High temperature causes the drive to throttle.... no good can come from this.... I'm not generally interested in fixing our over temperature behavior: it's impossible for me to test and there is no real standard for what is supposed to happen.... that sounds like what is happening here...
 
I got an SN850X drive today. I replaced one of the SN770 drives with it. As soon as I added it to the ZFS pool, the SN770 drive generated errors and got dropped. I am unable to reboot the system, as the error occurs again before I even have the chance to enter the passphrase to decrypt the volume. So it looks like combining an SN770 drive with a different drive does not solve the issue on my system... In order to test the SN850X, I guess I will try booting the system with only the SN770, remove the second drive from the pool, then reboot. I will then snapshot the whole tank, and send and receive the snapshot with some throttling involved so the SN770 does not crash.
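
In commands, the plan would be roughly this (the destination pool name is just a placeholder; pv from sysutils/pv provides the throttling):

zfs snapshot -r tank@migrate                                  # snapshot everything on the SN770 pool
zfs send -R tank@migrate | pv -L 50m | zfs recv -F newpool    # replicate at ~50 MB/s onto the pool holding the SN850X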
 
High temperature causes the drive to throttle.... no good can come from this.... I'm not generally interested in fixing our over temperature behavior: it's impossible for me to test and there is no real standard for what is supposed to happen.... that sounds like what is happening here...
The temperature does not go above 70°C, and sometimes the drives crash a fraction of a second after launching an operation when they have been idling at 30°C, so that does not seem to be the issue...
 
sko's experience was also in a thermally-controlled environment.
And that was on 13.2, so before nda became the default driver.
 
Is using the ZFS readlimit parameter the best way to throttle my SN770 so I can successfully rebuild the mirror using my new SN850X drive?
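
(For reference, these are the knobs I have been looking at so far; I am not sure they are the right ones, and the names are just how they show up on my 14.0 box:)

sysctl -d vfs.zfs | grep -Ei 'resilver|scrub|scan'   # list scrub/resilver-related tunables with their descriptions
sysctl vfs.zfs.scan_vdev_limit                       # in-flight scan I/O limit per top-level vdev
sysctl vfs.zfs.resilver_min_time_ms                  # minimum time spent resilvering per txg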
 