ZFS NVMe / controller trouble

0922Drg · Oct 19, 2022

Hello everybody,

I've set up a storage server with the following ZFS pool configuration:

Code:

NAME        SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
storage    *****  *****  *****        -         -     7%    16%  1.00x    ONLINE  -
  raidz1   *****  *****  *****        -         -     7%  16.2%      -  ONLINE
    ada0       -      -      -        -         -      -      -      -  ONLINE
    ada1       -      -      -        -         -      -      -      -  ONLINE
    ada2       -      -      -        -         -      -      -      -  ONLINE
    ada3       -      -      -        -         -      -      -      -  ONLINE
cache          -      -      -        -         -      -      -      -  -
  nvd0     *****   ****   ****        -         -     0%  47.9%      -  ONLINE
zroot       ****  *****   ****        -         -    24%    11%  1.00x    ONLINE  -
  nvd1p4    ****  *****   ****        -         -    24%  12.0%      -  ONLINE

Within the last few months I've had multiple system halts with the following system errors:

Code:

Oct  7 03:53:22 <kern.crit> host kernel: nvme0: Resetting controller due to a timeout.
Oct  7 03:53:22 <kern.crit> host kernel: nvme0: resetting controller
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: controller ready did not become 0 within 120500 ms
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: failing queued i/o
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: READ sqid:1 cid:0 nsid:1 lba:1894353032 len:8
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:1 cid:0 cdw0:0
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: READ sqid:1 cid:121 nsid:1 lba:1895271664 len:8
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:1 cid:121 cdw0:0
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: READ sqid:1 cid:122 nsid:1 lba:1778004008 len:8
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:1 cid:122 cdw0:0
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: failing queued i/o
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: READ sqid:2 cid:0 nsid:1 lba:1799745920 len:8
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:2 cid:0 cdw0:0
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: READ sqid:2 cid:122 nsid:1 lba:1345005208 len:8
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:2 cid:122 cdw0:0
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: failing queued i/o
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: READ sqid:3 cid:0 nsid:1 lba:1894189112 len:8
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:3 cid:0 cdw0:0
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: READ sqid:3 cid:126 nsid:1 lba:1751740216 len:8
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:3 cid:126 cdw0:0
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: READ sqid:4 cid:121 nsid:1 lba:1784656952 len:8
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:121 cdw0:0
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: failing queued i/o
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: READ sqid:5 cid:0 nsid:1 lba:1780993000 len:8
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:5 cid:0 cdw0:0
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: READ sqid:5 cid:124 nsid:1 lba:1664862024 len:8
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:5 cid:124 cdw0:0
Oct  7 03:55:23 <kern.crit> host kernel: nvd0: detached

Code:

Oct 12 11:26:08 <kern.crit> host kernel: nvme0: Resetting controller due to a timeout.
Oct 12 11:26:08 <kern.crit> host kernel: nvme0: resetting controller
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: controller ready did not become 0 within 120500 ms
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: READ sqid:1 cid:123 nsid:1 lba:1893018288 len:8
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:1 cid:123 cdw0:0
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: READ sqid:3 cid:124 nsid:1 lba:1898380624 len:16
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:3 cid:124 cdw0:0
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: READ sqid:3 cid:121 nsid:1 lba:1126530744 len:16
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:3 cid:121 cdw0:0
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: failing queued i/o
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: READ sqid:4 cid:0 nsid:1 lba:1117794448 len:24
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:0 cdw0:0
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: READ sqid:4 cid:127 nsid:1 lba:1773658808 len:8
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:127 cdw0:0
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: READ sqid:5 cid:127 nsid:1 lba:1204999600 len:8
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:5 cid:127 cdw0:0
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: READ sqid:6 cid:123 nsid:1 lba:1645712752 len:24
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:6 cid:123 cdw0:0
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: READ sqid:6 cid:124 nsid:1 lba:1726342736 len:8
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:6 cid:124 cdw0:0
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: READ sqid:6 cid:125 nsid:1 lba:1761830928 len:8
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:6 cid:125 cdw0:0
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: READ sqid:6 cid:127 nsid:1 lba:1779467808 len:8
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:6 cid:127 cdw0:0
Oct 12 11:28:09 <kern.crit> host kernel: nvd0: detached

This results in a completely unresponsive system (I can type, but I cannot log in).
I did a soft reset (reset button) but got an error on boot:

Code:

No suitable dump device was found.
swapon: /dev/nvd1p3: No such file or directory
...
Can't open '/dev/nvd1p1'

Only after a hard power-off followed by a power-on does everything work again.
It seems the whole controller hangs after this error and has to be power cycled to work again.
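
To make the two log excerpts above easier to compare, here is a minimal sketch that summarises such an excerpt: which device each aborted command was logged against, which disk device detached, and how long the kernel waited between starting the reset and giving up. It assumes the syslog line layout shown above and that both timestamps fall on the same day; the function name `summarize` and the embedded `SAMPLE` lines are only illustrative (the sample is a short subset of the Oct 7 excerpt).

```python
import re
from collections import Counter
from datetime import datetime

# A few lines copied from the Oct 7 excerpt; a real run would read the
# full log file instead (its path is not given in the thread).
SAMPLE = """\
Oct  7 03:53:22 <kern.crit> host kernel: nvme0: Resetting controller due to a timeout.
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: controller ready did not become 0 within 120500 ms
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:1 cid:0 cdw0:0
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:1 cid:121 cdw0:0
Oct  7 03:55:23 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:2 cid:0 cdw0:0
Oct  7 03:55:23 <kern.crit> host kernel: nvd0: detached
"""

# timestamp, <facility.level>, hostname, "kernel:", device name, message
LINE = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) <[^>]+> \S+ kernel: (\w+): (.*)$"
)


def summarize(log_text):
    """Summarise one reset incident from kernel log lines."""
    aborted = Counter()   # (device, submission queue id) -> aborted commands
    detached = []         # disk devices reported as detached
    reset_at = failed_at = None
    for line in log_text.splitlines():
        m = LINE.match(line)
        if not m:
            continue
        stamp, dev, msg = m.groups()
        # Only the clock time is parsed; same-day timestamps are assumed.
        clock = datetime.strptime(stamp.split()[-1], "%H:%M:%S")
        if msg.startswith("Resetting controller"):
            reset_at = clock
        elif msg.startswith("controller ready did not become"):
            failed_at = clock
        elif msg.startswith("ABORTED"):
            sqid = int(re.search(r"sqid:(\d+)", msg).group(1))
            aborted[(dev, sqid)] += 1
        elif msg == "detached":
            detached.append(dev)
    wait = None
    if reset_at and failed_at:
        wait = (failed_at - reset_at).total_seconds()
    return {"aborted": aborted, "detached": detached, "reset_wait_s": wait}


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

On the sample lines, every aborted command is counted against nvme0 and the wait between the reset and the failure is 121 seconds, in line with the 120500 ms timeout the kernel reports.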

Kernel:
13.0-RELEASE

NVMe Information:

Code:

nvd0: <ADATA SX6000PNP> NVMe namespace
nvd1: <GIGABYTE GP-ASM2NE2512GTTDR> NVMe namespace

Motherboard:
Supermicro X11SCW-F

Has anybody seen similar errors, or does anyone have an idea whether this is a hardware problem (BIOS) or a FreeBSD problem?
I don't think that the NVMe itself is defective, otherwise it would not halt the complete system...

Thank you for any ideas!

SirDice · Oct 19, 2022

0922Drg said:
13.0-RELEASE

FreeBSD 13.0 has been end-of-life since August 2022 and is no longer supported.

Topics about unsupported FreeBSD versions

0922Drg said:
Within the last few months I've had multiple system halts with the following system errors

If this suddenly happened without updating or changes made to the OS, then it's likely a hardware issue.

0922Drg · Oct 19, 2022

SirDice said:
FreeBSD 13.0 has been end-of-life since August 2022 and is no longer supported.

Topics about unsupported FreeBSD versions

If this suddenly happened without updating or changes made to the OS, then it's likely a hardware issue.

It started after about 10-12 months of operation, if I remember correctly, and it is happening more often.
If I search for the 'resetting controller' message in the logfiles I get the following dates:
11.04.2022
01.05.2022
12.08.2022
10.09.2022
07.10.2022
12.10.2022
18.10.2022

No hardware or software was changed.

Is it possible that the cache NVMe is defective? But then why does the complete system hang, and why is even the system NVMe no longer reachable?
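
As a quick check of the "happening more often" observation, the incident dates listed above can be turned into gaps in days. This is only a sketch: it assumes the dates are in DD.MM.YYYY form (which matches the Oct 7 and Oct 12 log timestamps), and the helper names are illustrative.

```python
from datetime import date

# Incident dates as listed in the post, assumed to be DD.MM.YYYY.
INCIDENTS = [
    "11.04.2022", "01.05.2022", "12.08.2022", "10.09.2022",
    "07.10.2022", "12.10.2022", "18.10.2022",
]


def parse(text):
    """Parse a DD.MM.YYYY string into a date."""
    day, month, year = map(int, text.split("."))
    return date(year, month, day)


def gaps_in_days(dates):
    """Return the number of days between consecutive incidents."""
    parsed = sorted(parse(d) for d in dates)
    return [(later - earlier).days for earlier, later in zip(parsed, parsed[1:])]


if __name__ == "__main__":
    print(gaps_in_days(INCIDENTS))  # [20, 103, 29, 27, 5, 6]
```

The last two gaps are under a week, compared with roughly a month or more earlier in the year.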

SirDice · Oct 19, 2022

0922Drg said:
Is it possible that the cache NVMe is defective? But then why does the complete system hang, and why is even the system NVMe no longer reachable?

It basically gets the rug pulled out from under it. The OS isn't going to like that. Any action that requires disk access is going to stall waiting for ZFS to deal with the issue.

0922Drg · Oct 19, 2022

But that should only apply to the pool with the cache device, right? The system pool is on another NVMe, which also stops working. It seems that everything NVMe stops working in this case.

If I check both NVMes with smartctl I do not see one single error - except the unsafe shutdowns.
Cache pool NVMe:

Code:

Unsafe Shutdowns:                   21
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 8 of 8 entries)
No Errors Logged

Root pool NVMe:

Code:

Unsafe Shutdowns:                   20
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 63 entries)
No Errors Logged
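
For comparing the two drives, the counter lines in the smartctl excerpts above can be read into a dictionary and filtered for non-zero values. This is a rough sketch that only understands plain "Name: integer" lines like the ones quoted; real smartctl output contains other value formats that it simply skips, and the function names are illustrative.

```python
# Counter lines copied from the cache NVMe excerpt above.
SAMPLE = """\
Unsafe Shutdowns:                   21
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
"""


def parse_counters(text):
    """Return {counter name: value} for lines of the form 'Name: 123'."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            # Collapse repeated spaces inside the name.
            counters[" ".join(name.split())] = int(value)
    return counters


def nonzero(counters):
    """Keep only counters with a non-zero value."""
    return {name: value for name, value in counters.items() if value != 0}


if __name__ == "__main__":
    print(nonzero(parse_counters(SAMPLE)))
```

On the sample, the only non-zero counter is the unsafe shutdown count, which is what the hard power cycles would be expected to produce.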

SirDice · Oct 19, 2022

nvme0 is the controller. The controller itself appears to have problems, so everything attached to it is going to fail too. Not sure how this works with NVMe but on multiple occasions I've had one bad disk just hang up the entire bus on a SAS controller, effectively taking the entire pool of disks out. I could only regain functionality by physically removing that bad disk.

0922Drg · Oct 19, 2022

Urgh, so the Controller is defective and not the NVMe itself?
Is that possible?

richardtoohey2 · Oct 19, 2022

I think it is. I watched a video the other day of an SSD where the controller had died and they were bypassing that to extract the data directly from the flash memory (if that’s the correct terminology).

Of course now I can’t find that video or story anywhere!

If you do a search you should find similar information.

0922Drg · Oct 19, 2022

Do you think it is the controller on the NVMe itself or something on the motherboard?
I really do not need to extract any data from this NVMe - it's just a cache disc. If defective I should be able to remove cache and reattach a cache disc to the pool as far as I understand.

SirDice · Oct 19, 2022

0922Drg said:
Do you think it is the controller on the NVMe itself or something on the motherboard?

That would be the controller on the mainboard. You attach the NVMe drives to that.

0922Drg said:
If defective I should be able to remove cache and reattach a cache disc to the pool as far as I understand.

You could try to remove it anyway; as you say, it's just a cache disk. If only to check whether you get more or fewer errors on the controller.

ralphbsz · Oct 19, 2022

0922Drg said:
Urgh, so the Controller is defective and not the NVMe itself?
Is that possible?

It's more complicated. As SirDice already hinted, sometimes a bad drive (SATA, SAS or NVMe) can stop the whole controller from operating. This is surprisingly common. In theory, with perfect firmware and software, it should not happen; in theory, we should have perfect fault isolation, where the effect of one failing component is limited to just that component. But in the (ugly and messy) real world, we often have side effects. This is particularly true with inexpensive consumer-grade hardware; if you were using an enterprise-grade motherboard, enterprise-grade HBA and firmware / driver versions that are carefully vetted and tuned, you might get luckier.

And to make matters worse, you're using an outdated FreeBSD version. I don't know whether newer releases have fixed bugs in the NVMe drivers in FreeBSD, but commonly error handling (the OS dealing with hardware problems) is the biggest area that gets improved.

richardtoohey2 · Oct 19, 2022

Not saying it's relevant here, but there *are* controllers on NVMe M.2 drives - is that right?

e.g.

Sabrent Rocket 4 Plus 1TB M.2 NVMe SSD Review This is FAST

In our Sabrent Rocket 4 Plus 1TB M.2 NVMe SSD review we find a drive that performs extremely well and at a discount to its nearest competitor

www.servethehome.com

"The Sabrent Rocket 4 Plus bundles Micron 176-layer TLC NAND with the Phison PS5018-E18 controller."

Just asking because SirDice mentioned "the controller" and was wondering if that was referring to the controller on the drive or the motherboard - and it seems we are talking about the motherboard controller here (but there are controllers on the drive unless I'm mixed up?)

Erichans · Oct 20, 2022

SirDice said:
That would be the controller on the mainboard. You attach the NVMe drives to that.

You could try to remove it anyway; as you say, it's just a cache disk. If only to check whether you get more or fewer errors on the controller.

Looking at the X11SCW-F and its manual (p. 17 @ "C246 System Block Diagram"), that would be the Intel C246 chip, i.e. the PCIe controller chip on the motherboard. I agree that just removing the L2ARC cache device is a way to get more information, but my hunch is that it might not be related to a C246 failure. However, I'm speaking from a position of a lot less server hardware experience ...

It may also be worth thinking in the direction of problems like PR-211713 (closed), especially from the indicated comment onwards, and the "follow-up" PR-262969 (still open).

I think you'll get further guidance when adding to PR-262969; I expect you'll be asked to upgrade to 13.1-RELEASE though.

0922Drg · Oct 26, 2022

Update:
I had another crash yesterday (same as usual).
I tried zpool remove storage nvd0, which hung the system.

I tried to remove the cache NVMe from the system while it was powered off, but then the boot disk's /dev path changes (from nvd1 to nvd0) and the bootloader no longer finds the root NVMe (argh).

Upgrading is currently not an option, but I will try to replace the NVMe with a new one (in case its hardware is faulty) and will recheck the server's behaviour afterwards.

0922Drg · Oct 26, 2022

Found this:

[SOLVED] - How to remove Cache Device from ZFS Pool

Hi, I've got a pool with an cache device. It's an NVME and it's almost end of life SMART sais: "Percentage Used: 190% " I don't need this cache, so I'd like to remove the device. root@vhost:~# zpool status pool: rpool state: ONLINE status: Some supported features are not...

forum.proxmox.com

But I cannot find any ZPOOL_CACHE option in any config file.

SirDice · Oct 26, 2022

0922Drg said:
But I cannot find any ZPOOL_CACHE option in any config file.

It's a configuration file specific to Proxmox.

0922Drg · Nov 10, 2022

I've just replaced the NVMe with a new one - the pool shows it as 'FAULTED' (corrupted data).
I've removed it with the zpool remove storage nvd0 command like above and it worked.
Let's see if it will hang again without the cache NVMe.
