Hello everybody,
I've set up a storage server with the following ZFS pool configuration:
Code:
NAME         SIZE   ALLOC  FREE   CKPOINT  EXPANDSZ  FRAG  CAP    DEDUP  HEALTH  ALTROOT
storage      *****  *****  *****  -        -         7%    16%    1.00x  ONLINE  -
  raidz1     *****  *****  *****  -        -         7%    16.2%  -      ONLINE
    ada0     -      -      -      -        -         -     -      -      ONLINE
    ada1     -      -      -      -        -         -     -      -      ONLINE
    ada2     -      -      -      -        -         -     -      -      ONLINE
    ada3     -      -      -      -        -         -     -      -      ONLINE
cache        -      -      -      -        -         -     -      -      -
  nvd0       *****  ****   ****   -        -         0%    47.9%  -      ONLINE
zroot        ****   *****  ****   -        -         24%   11%    1.00x  ONLINE  -
  nvd1p4     ****   *****  ****   -        -         24%   12.0%  -      ONLINE
Within the last few months I've had multiple system halts with the following system errors:
Code:
Oct 7 03:53:22 <kern.crit> host kernel: nvme0: Resetting controller due to a timeout.
Oct 7 03:53:22 <kern.crit> host kernel: nvme0: resetting controller
Oct 7 03:55:23 <kern.crit> host kernel: nvme0: controller ready did not become 0 within 120500 ms
Oct 7 03:55:23 <kern.crit> host kernel: nvme0: failing queued i/o
Oct 7 03:55:23 <kern.crit> host kernel: nvme0: READ sqid:1 cid:0 nsid:1 lba:1894353032 len:8
Oct 7 03:55:23 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:1 cid:0 cdw0:0
Oct 7 03:55:23 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct 7 03:55:23 <kern.crit> host kernel: nvme0: READ sqid:1 cid:121 nsid:1 lba:1895271664 len:8
Oct 7 03:55:23 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:1 cid:121 cdw0:0
Oct 7 03:55:23 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct 7 03:55:23 <kern.crit> host kernel: nvme0: READ sqid:1 cid:122 nsid:1 lba:1778004008 len:8
Oct 7 03:55:23 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:1 cid:122 cdw0:0
Oct 7 03:55:23 <kern.crit> host kernel: nvme0: failing queued i/o
Oct 7 03:55:23 <kern.crit> host kernel: nvme0: READ sqid:2 cid:0 nsid:1 lba:1799745920 len:8
Oct 7 03:55:23 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:2 cid:0 cdw0:0
Oct 7 03:55:23 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct 7 03:55:23 <kern.crit> host kernel: nvme0: READ sqid:2 cid:122 nsid:1 lba:1345005208 len:8
Oct 7 03:55:23 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:2 cid:122 cdw0:0
Oct 7 03:55:23 <kern.crit> host kernel: nvme0: failing queued i/o
Oct 7 03:55:23 <kern.crit> host kernel: nvme0: READ sqid:3 cid:0 nsid:1 lba:1894189112 len:8
Oct 7 03:55:23 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:3 cid:0 cdw0:0
Oct 7 03:55:23 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct 7 03:55:23 <kern.crit> host kernel: nvme0: READ sqid:3 cid:126 nsid:1 lba:1751740216 len:8
Oct 7 03:55:23 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:3 cid:126 cdw0:0
Oct 7 03:55:23 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct 7 03:55:23 <kern.crit> host kernel: nvme0: READ sqid:4 cid:121 nsid:1 lba:1784656952 len:8
Oct 7 03:55:23 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:121 cdw0:0
Oct 7 03:55:23 <kern.crit> host kernel: nvme0: failing queued i/o
Oct 7 03:55:23 <kern.crit> host kernel: nvme0: READ sqid:5 cid:0 nsid:1 lba:1780993000 len:8
Oct 7 03:55:23 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:5 cid:0 cdw0:0
Oct 7 03:55:23 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct 7 03:55:23 <kern.crit> host kernel: nvme0: READ sqid:5 cid:124 nsid:1 lba:1664862024 len:8
Oct 7 03:55:23 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:5 cid:124 cdw0:0
Oct 7 03:55:23 <kern.crit> host kernel: nvd0: detached
Code:
Oct 12 11:26:08 <kern.crit> host kernel: nvme0: Resetting controller due to a timeout.
Oct 12 11:26:08 <kern.crit> host kernel: nvme0: resetting controller
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: controller ready did not become 0 within 120500 ms
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: READ sqid:1 cid:123 nsid:1 lba:1893018288 len:8
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:1 cid:123 cdw0:0
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: READ sqid:3 cid:124 nsid:1 lba:1898380624 len:16
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:3 cid:124 cdw0:0
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: READ sqid:3 cid:121 nsid:1 lba:1126530744 len:16
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:3 cid:121 cdw0:0
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: failing queued i/o
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: READ sqid:4 cid:0 nsid:1 lba:1117794448 len:24
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:0 cdw0:0
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: READ sqid:4 cid:127 nsid:1 lba:1773658808 len:8
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:4 cid:127 cdw0:0
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: READ sqid:5 cid:127 nsid:1 lba:1204999600 len:8
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:5 cid:127 cdw0:0
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: READ sqid:6 cid:123 nsid:1 lba:1645712752 len:24
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:6 cid:123 cdw0:0
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: READ sqid:6 cid:124 nsid:1 lba:1726342736 len:8
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:6 cid:124 cdw0:0
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: READ sqid:6 cid:125 nsid:1 lba:1761830928 len:8
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:6 cid:125 cdw0:0
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: failing outstanding i/o
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: READ sqid:6 cid:127 nsid:1 lba:1779467808 len:8
Oct 12 11:28:09 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:6 cid:127 cdw0:0
Oct 12 11:28:09 <kern.crit> host kernel: nvd0: detached
This results in a completely unresponsive system (I can type, but I cannot log in).
I did a soft reset (reset button) but got an error on boot:
Code:
No suitable dump device was found.
swapon: /dev/nvd1p3: No such file or directory
...
Can't open '/dev/nvd1p1'
Only after a hard power-off followed by a power-on does everything work fine again.
It seems the whole controller hangs after this error and has to be power cycled before it works again.
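For anyone who wants to compare incidents like these, here is a minimal sketch for tallying the failed commands per controller and submission queue from such a log. It assumes the syslog line format shown above; the function name `count_failed_by_queue` and the regular expression are only illustrative, and the sample lines are copied from the first log excerpt.

```python
import re
from collections import Counter

# Matches the command lines that the driver prints before each
# "ABORTED - BY REQUEST" entry, e.g.
#   nvme0: READ sqid:1 cid:0 nsid:1 lba:1894353032 len:8
# Assumption: only READ/WRITE commands are of interest here.
CMD_RE = re.compile(
    r"(nvme\d+): (READ|WRITE) sqid:(\d+) cid:(\d+) nsid:(\d+) lba:(\d+) len:(\d+)"
)


def count_failed_by_queue(lines):
    """Count failed commands per (controller, submission queue id)."""
    counts = Counter()
    for line in lines:
        match = CMD_RE.search(line)
        if match:
            controller = match.group(1)
            sqid = int(match.group(3))
            counts[(controller, sqid)] += 1
    return counts


sample = [
    "Oct 7 03:55:23 <kern.crit> host kernel: nvme0: failing queued i/o",
    "Oct 7 03:55:23 <kern.crit> host kernel: nvme0: READ sqid:1 cid:0 nsid:1 lba:1894353032 len:8",
    "Oct 7 03:55:23 <kern.crit> host kernel: nvme0: ABORTED - BY REQUEST (00/07) sqid:1 cid:0 cdw0:0",
    "Oct 7 03:55:23 <kern.crit> host kernel: nvme0: READ sqid:2 cid:0 nsid:1 lba:1799745920 len:8",
]

for (controller, sqid), n in sorted(count_failed_by_queue(sample).items()):
    print(f"{controller} sqid {sqid}: {n} failed command(s)")
```

Feeding it the relevant part of /var/log/messages instead of `sample` would show whether the failures are spread over all queues, as they appear to be in both excerpts above.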
Kernel:
13.0-RELEASE
NVMe Information:
Code:
nvd0: <ADATA SX6000PNP> NVMe namespace
nvd1: <GIGABYTE GP-ASM2NE2512GTTDR> NVMe namespace
Motherboard:
Supermicro X11SCW-F
Has anybody seen similar errors, or does anyone have an idea whether this is a hardware problem (BIOS) or a FreeBSD problem?
I don't think the NVMe drive itself is defective; otherwise it would not halt the whole system...
Thank you for any ideas!