ZFS NVMe / controller trouble

Hello everybody,

I've set up a storage server with the following ZFS pool configuration:

Code:
NAME        SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
storage    *****  *****  *****        -         -     7%    16%  1.00x    ONLINE  -
  raidz1   *****  *****  *****        -         -     7%  16.2%      -  ONLINE
    ada0       -      -      -        -         -      -      -      -  ONLINE
    ada1       -      -      -        -         -      -      -      -  ONLINE
    ada2       -      -      -        -         -      -      -      -  ONLINE
    ada3       -      -      -        -         -      -      -      -  ONLINE
cache          -      -      -        -         -      -      -      -  -
  nvd0     *****   ****   ****        -         -     0%  47.9%      -  ONLINE
zroot       ****  *****   ****        -         -    24%    11%  1.00x    ONLINE  -
  nvd1p4    ****  *****   ****        -         -    24%  12.0%      -  ONLINE


Within the last few months I've had multiple system halts with the following system errors:

Code:
Oct  7 03:53:22 <kern.crit> host kernel: nvme0: Resetting controller due to a timeout.
Oct  7 03:53:22 <kern.crit> host kernel: nvme0: resetting controller
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: controller ready did not become 0 within 120500 ms
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: failing queued i/o
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: READ sqid:1 cid:0 nsid:1 lba:1894353032 len:8
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:1 cid:0 cdw0:0
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: READ sqid:1 cid:121 nsid:1 lba:1895271664 len:8
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:1 cid:121 cdw0:0
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: READ sqid:1 cid:122 nsid:1 lba:1778004008 len:8
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:1 cid:122 cdw0:0
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: failing queued i/o
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: READ sqid:2 cid:0 nsid:1 lba:1799745920 len:8
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:2 cid:0 cdw0:0
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: READ sqid:2 cid:122 nsid:1 lba:1345005208 len:8
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:2 cid:122 cdw0:0
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: failing queued i/o
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: READ sqid:3 cid:0 nsid:1 lba:1894189112 len:8
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:3 cid:0 cdw0:0
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: READ sqid:3 cid:126 nsid:1 lba:1751740216 len:8
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:3 cid:126 cdw0:0
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: READ sqid:4 cid:121 nsid:1 lba:1784656952 len:8
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:121 cdw0:0
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: failing queued i/o
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: READ sqid:5 cid:0 nsid:1 lba:1780993000 len:8
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:5 cid:0 cdw0:0
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: READ sqid:5 cid:124 nsid:1 lba:1664862024 len:8
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:5 cid:124 cdw0:0
Oct  7 03:55:23 <kern.crit> host kernel: nvd0: detached


Code:
Oct 12 11:26:08 <kern.crit> host kernel: nvme0: Resetting controller due to a timeout.
Oct 12 11:26:08 <kern.crit> host kernel: nvme0: resetting controller
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: controller ready did not become 0 within 120500 ms
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: READ sqid:1 cid:123 nsid:1 lba:1893018288 len:8
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:1 cid:123 cdw0:0
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: READ sqid:3 cid:124 nsid:1 lba:1898380624 len:16
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:3 cid:124 cdw0:0
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: READ sqid:3 cid:121 nsid:1 lba:1126530744 len:16
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:3 cid:121 cdw0:0
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: failing queued i/o
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: READ sqid:4 cid:0 nsid:1 lba:1117794448 len:24
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:0 cdw0:0
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: READ sqid:4 cid:127 nsid:1 lba:1773658808 len:8
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:127 cdw0:0
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: READ sqid:5 cid:127 nsid:1 lba:1204999600 len:8
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:5 cid:127 cdw0:0
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: READ sqid:6 cid:123 nsid:1 lba:1645712752 len:24
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:6 cid:123 cdw0:0
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: READ sqid:6 cid:124 nsid:1 lba:1726342736 len:8
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:6 cid:124 cdw0:0
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: READ sqid:6 cid:125 nsid:1 lba:1761830928 len:8
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:6 cid:125 cdw0:0
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: READ sqid:6 cid:127 nsid:1 lba:1779467808 len:8
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:6 cid:127 cdw0:0
Oct 12 11:28:09 <kern.crit> host kernel: nvd0: detached

This results in a completely unresponsive system (I can type, but I cannot log in).
I did a soft reset (reset button) but got an error on boot:
Code:
No suitable dump device was found.
swapon: /dev/nvd1p3: No such file or directory
...
Can't open '/dev/nvd1p1'

Only after a hard power-off and power-on does everything work fine again.
It seems the entire controller hangs after this error and has to be power-cycled before it works again.

Kernel:
13.0-RELEASE

NVMe Information:
Code:
nvd0: <ADATA SX6000PNP> NVMe namespace
nvd1: <GIGABYTE GP-ASM2NE2512GTTDR> NVMe namespace
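(Those two lines are from dmesg, by the way; the controller-to-namespace mapping can also be listed with nvmecontrol from the base system:)
Code:
nvmecontrol devlist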
Motherboard:
Supermicro X11SCW-F

Has anybody had similar errors, or any idea whether this is a hardware problem (BIOS) or a FreeBSD problem?
I don't think the NVMe drive itself is defective; otherwise it would not halt the entire system...

Thank you for any ideas!
 
FreeBSD 13.0 has been end-of-life since August 2022 and is no longer supported.

Topics about unsupported FreeBSD versions


If this suddenly happened without updates or changes to the OS, then it's likely a hardware issue.
It started happening after about 10-12 months of uptime, if I remember correctly, and it is happening more and more often.
If I search the log files for the 'resetting controller' message (see the sketch below the list), I get the following dates:
11.04.2022
01.05.2022
12.08.2022
10.09.2022
07.10.2022
12.10.2022
18.10.2022
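
For reference, the search is essentially the following (assuming the default /var/log/messages location; rotated log files may be compressed and would then need bzgrep or zgrep instead):
Code:
grep -h 'resetting controller' /var/log/messages*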

No hardware or software was changed.

Could the cache NVMe be defective? But then why does the whole system hang, and why is even the system NVMe no longer reachable?
 
Could the cache NVMe be defective? But then why does the whole system hang, and why is even the system NVMe no longer reachable?
It basically gets the rug pulled out from under it. The OS isn't going to like that. Any action that requires disk access is going to stall waiting for ZFS to deal with the issue.
 
But that should only apply to the pool with the cache device, right? The system pool is on a different NVMe drive, which also stops working - it seems everything NVMe stops working in this case.

If I check both NVMe drives with smartctl I do not see a single error - apart from the unsafe shutdowns.
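For reference, roughly like this (smartmontools from ports; the controller device nodes on this system are assumed to be /dev/nvme0 and /dev/nvme1):
Code:
smartctl -a /dev/nvme0
smartctl -a /dev/nvme1
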
Cache pool NVMe:
Code:
Unsafe Shutdowns:                   21
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 8 of 8 entries)
No Errors Logged

Root pool NVMe:
Code:
Unsafe Shutdowns:                   20
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 63 entries)
No Errors Logged
 
nvme0 is the controller. The controller itself appears to have problems, so everything attached to it is going to fail too. I'm not sure how this works with NVMe, but on multiple occasions I've had one bad disk hang up the entire bus on a SAS controller, effectively taking the whole pool of disks out. I could only regain functionality by physically removing the bad disk.
 
I think it is. I watched a video the other day of an SSD where the controller had died and they were bypassing that to extract the data directly from the flash memory (if that’s the correct terminology).

Of course now I can’t find that video or story anywhere!

If you do a search you should find similar information.
 
Do you think it is the controller on the NVMe itself or something on the motherboard?
I really do not need to extract any data from this NVMe - it's just a cache disk. If it is defective, I should be able to remove the cache device and attach a new one to the pool, as far as I understand.
 
Do you think it is the controller on the NVMe itself or something on the motherboard?
That would be the controller on the mainboard. You attach the NVMe drives to that.

If it is defective, I should be able to remove the cache device and attach a new one to the pool, as far as I understand.
You could try removing it anyway; as you say, it's just a cache disk. If only to check whether you get more or fewer errors on the controller.
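
A sketch of the commands involved (the device name is taken from your zpool list output, so verify it first):
Code:
# drop the L2ARC device from the pool
zpool remove storage nvd0
# later, to add a (new) cache device again:
zpool add storage cache nvd0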
 
Urgh, so the controller is defective and not the NVMe drive itself?
Is that possible?
It's more complicated. As SirDice already hinted, sometimes a bad drive (SATA, SAS or NVMe) can make the whole controller stop operating. This is surprisingly common. In theory, with perfect firmware and software, it should not happen: we should have perfect fault isolation, where the effect of one failing component is limited to that component alone. But in the (ugly and messy) real world, we often have side effects. This is particularly true with inexpensive consumer-grade hardware; if you were using an enterprise-grade motherboard, an enterprise-grade HBA, and firmware/driver versions that are carefully vetted and tuned, you might get luckier.

And to make matters worse, you're using an outdated FreeBSD version. I don't know whether newer releases have fixed bugs in FreeBSD's NVMe driver, but error handling (the OS dealing with hardware problems) is commonly the area that sees the biggest improvements.
 
Not saying it's relevant here, but there *are* controllers on NVMe M.2 drives - is that right?

e.g.

"The Sabrent Rocket 4 Plus bundles Micron 176-layer TLC NAND with the Phison PS5018-E18 controller."

Just asking because SirDice mentioned "the controller" and I was wondering whether that referred to the controller on the drive or the one on the motherboard - it seems we are talking about the motherboard controller here (but there are controllers on the drives too, unless I'm mixed up?)
 
That would be the controller on the mainboard. You attach the NVMe drives to that.


You could try removing it anyway; as you say, it's just a cache disk. If only to check whether you get more or fewer errors on the controller.
Looking at the X11SCW-F and its manual (p. 17, "C246 System Block Diagram"), that would be the Intel C246 chip, i.e. the PCIe controller on the motherboard. I agree that just removing the L2ARC cache device is a way to get more information, but my hunch is that it might not be a C246 failure; then again, I'm speaking from a position of much less server hardware experience...

Perhaps it is also worth a thought in the direction of problems like PR-211713 (closed), especially from the indicated comment onwards, and the "follow-up" PR-262969 (still open).

I think you'll get further guidance by adding to PR-262969; I expect you'll be asked to upgrade to 13.1-RELEASE first, though.
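
If you do end up upgrading, it's the standard binary upgrade via freebsd-update - a sketch of the usual sequence:
Code:
freebsd-update -r 13.1-RELEASE upgrade
freebsd-update install
shutdown -r now
# after the reboot:
freebsd-update install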
 
Update:
Had another crash yesterday (same as usual).
Tried zpool remove storage nvd0 - it hung the system.

I then tried removing the cache NVMe from the system while it was powered off, but then the boot disk's /dev path changes (from nvd1 to nvd0) and the loader no longer finds the root NVMe (argh).
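
As an aside: one way to make the swap/dump entries survive that kind of renumbering is to reference a GPT label instead of the raw device name - a sketch, assuming the swap partition is index 3 on the root drive and picking an arbitrary label name:
Code:
swapoff /dev/nvd1p3
gpart modify -i 3 -l swap0 nvd1
# then reference the label in /etc/fstab:
# /dev/gpt/swap0   none   swap   sw   0   0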

Upgrading is currently not an option, but I will try replacing the NVMe with a new one (in case its hardware is faulty) and will recheck the server's behaviour afterwards.
 
Found this:
But I cannot find any ZPOOL_CACHE option in any config file.
 
I've just replaced the NVMe with a new one - the zpool shows it as 'FAULTED' (corrupted data).
I removed it with the zpool remove storage nvd0 command from above, and this time it worked.
Let's see if it will hang again without the cache NVMe.
 