NVMe controller timeout

Hi,

I am using FreeBSD 13.0-RELEASE. A couple of months ago I installed a Samsung EVO Plus NVMe M.2 drive in my system and installed the operating system on it. All went fine until the system suddenly froze. On the main console there were error messages about the NVMe controller timing out. From then on the system froze regularly after moderately heavy disk traffic.

I would appreciate any ideas as to what is more probable:
  • Is the controller on the main board broken?
  • Is the NVMe broken (went too hot, for example)?
  • Does the FreeBSD-driver have an issue?
Best,
Holger
 
I've not had any issues with Samsung NVME SSDs on OpenBSD or FreeBSD.

Opinions on the internet about whether heatsinks are needed vary. Have you got one? I've not used one myself. I have an Intel NUC that trips a thermal overload, but I have not tracked down what causes that (it might be the heat from the SSD).

Does nvmecontrol show anything of interest? I have not used it myself, but there might be ways to get the temperature or logs and to check the firmware version.

I don't know, but the thermal angle seems like a good thing to rule out first.
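A minimal sketch of the nvmecontrol route, for anyone who wants to try it. The subcommands (devlist, identify, logpage) are part of nvmecontrol in the FreeBSD base system; the device name nvme0 and the helper name nvme_health_fields are assumptions for illustration. The exact labels in the output may differ between releases, so the filter is deliberately loose.

```shell
#!/bin/sh
# Hypothetical helper: keep only the lines of a health log that mention
# temperature, spare capacity, warnings or errors. It reads text on stdin,
# so it does not depend on the exact layout a given release prints.
nvme_health_fields() {
    grep -i -E 'temperature|spare|warning|error'
}

# Typical use on FreeBSD (as root; "nvme0" is an assumed device name):
#   nvmecontrol devlist                                   # controllers and namespaces
#   nvmecontrol identify nvme0                            # model, serial, firmware revision
#   nvmecontrol logpage -p 2 nvme0 | nvme_health_fields   # SMART/health log page
#   nvmecontrol logpage -p 1 nvme0                        # error information log page
```

Log page 2 is the same SMART/health page that other tools read, so it is a reasonable first stop when checking the thermal angle.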
 
I have been using two Samsung 960 EVOs for about four years now and occasionally have similar issues. I first ran FreeBSD 12.0 through 12.2-RELEASE, then installed Gentoo on them, but the NVMe drives still disappear every now and then.
I have not gotten to the bottom of it, but it looks like a hardware failure.

My guess is that it happens due to overheating. My AMD X399 board has two M.2 slots. One of them has a heat sink but the other one doesn't. And because I run my NVMe drives as a stripe, when one of them fails the whole system loses "/".

If possible, cool your NVMe drive as well as you can. They run quite hot by design, and every bit of extra cooling helps.

I now regret investing in NVMe drives. Two SATA SSDs would have been just as fast for my daily use, and I would have avoided the overheating problem. In benchmarks NVMe drives win big time, but for normal daily usage it does not make a bit of difference. Furthermore, I could have up to six SATA drives in my machine, which could be striped and outperform the two NVMe drives.
 
Is there a way on FreeBSD to monitor the temperature of the NVMe?
In theory, NVMe (the protocol) does support SMART. I don't know which specific devices do, nor do I know whether the FreeBSD NVMe stack passes it through. Try installing smartmontools (it's a package), and run "smartctl -a /dev/xxx", with the correct file name instead of xxx.
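A small sketch of how that could look in practice. It assumes the drive shows up as /dev/nvme0 and that smartctl prints a line of the form "Temperature: 42 Celsius" for NVMe drives; the helper name smart_temp_c is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print the composite temperature in degrees Celsius
# from "smartctl -a" output supplied on stdin.
# Only the line that starts with "Temperature:" is used; per-sensor lines
# such as "Temperature Sensor 1:" are ignored.
smart_temp_c() {
    awk '/^Temperature:/ { print $2; exit }'
}

# Typical use (as root; /dev/nvme0 is an assumed device name):
#   pkg install smartmontools
#   smartctl -a /dev/nvme0 | smart_temp_c
```

If nothing is printed, either the drive does not report a temperature or the label differs, in which case the full smartctl output is the thing to look at.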
 
Yes, SMART on NVMe works well.
Munin has a plugin for monitoring. There is also this in the pkg-message:
To include drive health information in your daily status reports,
add a line like the following to /etc/periodic.conf:
daily_status_smart_devices="/dev/ad0 /dev/da0"
substituting the appropriate device names for your SMART-capable disks.

To enable drive monitoring, you can use /usr/local/sbin/smartd.
A sample configuration file has been installed as
/usr/local/etc/smartd.conf.sample
Copy this file to /usr/local/etc/smartd.conf and edit appropriately

To have smartd start at boot
echo 'smartd_enable="YES"' >> /etc/rc.conf
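The device names in that pkg-message are for ATA and SCSI disks. For an NVMe drive the line would presumably look like the fragment below, assuming the controller appears as /dev/nvme0 (the same name passed to smartctl); adjust to whatever your system shows.

```shell
# /etc/periodic.conf fragment (plain sh variable syntax).
# Assumption: the NVMe controller is /dev/nvme0 on this machine.
daily_status_smart_devices="/dev/nvme0"
```

The drive health summary should then appear in the daily status mail alongside the other periodic output.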
 
I tried that on one of mine, and yes, smartmontools does report the temperature. I couldn't see anything useful in nvmecontrol, but I might have missed something.

Code:
# /usr/local/sbin/smartctl -a /dev/nvme0
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.0-RELEASE-p6 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 970 PRO 512GB
Serial Number:                      S5JYNS0N703016Y
Firmware Version:                   1B2QEXP7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 512,110,190,592 [512 GB]
Unallocated NVM Capacity:           0
Controller ID:                      4
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          512,110,190,592 [512 GB]
Namespace 1 Utilization:            492,017,659,904 [492 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 5701400bc8
Local Time is:                      Fri Feb  4 13:57:30 2022 NZDT
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0037):   Security Format Frmw_DL Self_Test Directvs
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     81 Celsius
Critical Comp. Temp. Threshold:     81 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.20W       -        -    0  0  0  0        0       0
 1 +     4.30W       -        -    1  1  1  1        0       0
 2 +     2.10W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        42 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    492,760 [252 GB]
Data Units Written:                 1,490,853 [763 GB]
Host Read Commands:                 4,089,305
Host Write Commands:                8,456,591
Controller Busy Time:               9
Power Cycles:                       17
Power On Hours:                     25
Unsafe Shutdowns:                   3
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               42 Celsius
Temperature Sensor 2:               59 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged
So this is in a Dell R220 not doing anything in a non-AC room in NZ summer. Room temp is about 27 deg. C. No heat sink on the SSD.

Near the end there are two quite different sensor readings, 42 deg. C and 59 deg. C. I don't know why they differ so much.

About halfway down, it lists 81 deg. C as both the warning and the critical temperature threshold.
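To see how much headroom a drive has, the hottest reading can be compared against the warning threshold. The sketch below assumes the field labels shown in the smartctl 7.2 output above; the helper name smart_temp_margin is hypothetical. As far as I know, the NVMe health log allows several sensors at vendor-defined positions, so two sensors showing different values is not unusual in itself.

```shell
#!/bin/sh
# Hypothetical helper: read "smartctl -a" output on stdin and print the
# hottest reported temperature, the warning threshold, and the margin
# between them. Fields are split on ":" and the number is taken from the
# start of the value ("   81 Celsius" becomes 81).
smart_temp_margin() {
    awk -F: '
        /^Warning +Comp\. Temp\. Threshold/ { warn = $2 + 0 }
        /^Temperature( Sensor [0-9]+)?:/    { t = $2 + 0; if (t > max) max = t }
        END {
            if (warn == 0) { print "no warning threshold reported"; exit 1 }
            printf "hottest=%dC warn=%dC margin=%dC\n", max, warn, warn - max
        }'
}

# Typical use (as root; /dev/nvme0 is an assumed device name):
#   smartctl -a /dev/nvme0 | smart_temp_margin
```

With the readings above (42 and 59 deg. C against an 81 deg. C threshold) this reports a margin of 22 deg. C on an idle drive.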
 
I do have a heatsink installed.

Is there a way on FreeBSD to monitor the temperature of the NVMe?
You also need a temperature sensor in the NVMe hardware. For example, on my motherboard only one of my two NVMe drives has a sensor.
If available, it should be visible when you run sysctl -a | grep temperature, as shown here. Kernel modules may need to be loaded first, as explained at the latter link.
 
Back to basics for a moment, have you stress-tested the drive with something other than FreeBSD?

For stress purposes: writes.

… regularly freeze after some more or less intense hard disc traffic. …

I had this a few months ago with a new drive that was bad. Replaced, under warranty.
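A rough sketch of a write-only stress run, in case it helps. Everything here is an assumption rather than a recommended procedure: the helper name write_stress is made up, and the directory, size and number of passes are placeholders. It uses /dev/urandom so that filesystem compression (ZFS lz4, for example) cannot shrink the data before it reaches the drive. Note that this wears the drive a little and needs enough free space for one file at a time.

```shell
#!/bin/sh
# Hypothetical helper: write PASSES files of SIZE_MIB mebibytes each into DIR,
# flush them to disk, and remove them again. Stops at the first failed write.
write_stress() {
    dir=$1; size_mib=$2; passes=$3
    i=1
    while [ "$i" -le "$passes" ]; do
        f="$dir/stress.$$.$i"
        # bs is given in bytes so the same line works with BSD and GNU dd
        if ! dd if=/dev/urandom of="$f" bs=1048576 count="$size_mib" 2>/dev/null; then
            echo "pass $i: write failed" >&2
            rm -f "$f"
            return 1
        fi
        sync                      # push the data out of the cache
        rm -f "$f"
        echo "pass $i of $passes: wrote ${size_mib} MiB"
        i=$((i + 1))
    done
}

# Example (placeholder values): twenty passes of 1 GiB each
#   write_stress /var/tmp 1024 20
```

Running it from a live system booted off other media keeps the test usable even if the drive under test drops out.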
 
I did a little stress test for the NVMe, simply checking out the ports tree, while watching the temperature. The NVMe crashed as expected, but the temperature was never above 28°C.
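For watching the temperature during such a run, a timestamped log is handy, because the last line before a freeze shows what the drive reported at that moment. This is only a sketch: the helper name log_reading is hypothetical, and the smartctl pipeline in the usage comment assumes the drive is /dev/nvme0.

```shell
#!/bin/sh
# Hypothetical helper: run a command COUNT times, INTERVAL seconds apart,
# and print each result on one line prefixed with a timestamp.
log_reading() {
    interval=$1; count=$2; shift 2
    n=0
    while [ "$n" -lt "$count" ]; do
        # newlines in the command output are folded into spaces
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$("$@" 2>&1 | tr '\n' ' ')"
        n=$((n + 1))
        if [ "$n" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}

# Example (as root): one reading every 5 seconds, 120 readings
#   log_reading 5 120 sh -c 'smartctl -a /dev/nvme0 | grep "^Temperature"'
```

If the system boots from the drive under test, it is probably safer to watch the output over ssh or on the console than to write the log to that same drive.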

I tried to install Ubuntu on the NVMe to test it further, but the installation also crashed. Afterwards the BIOS would not even recognize the NVMe anymore.

So I think the NVMe has gone bad and I will return it (it should still be under warranty).

Hopefully it's not the controller on the motherboard, though ... We'll see.
 
👍 For reference only, with added emphasis:

… SSD was terribly wrong (near the foot of the page, 2021/07/16 01:03:08):

<https://lists.freebsd.org/archives/freebsd-current/att-0339/2021-07-16_00.53_typescript.txt>

– with a power on time of less than a hundred hours <https://bsd-hardware.info/?probe=7138e2a9e7&log=smartctl#nvme0> and no wrongness revealed by HP diagnostics (comparable to an extended S.M.A.R.T. self-test).

S.M.A.R.T. readings from a drive cannot be a substitute for a sustained write test.

tl;dr NVMe, a new computer with Windows 10. Failures not apparent when (repeatedly) installing Windows, but from what the end user described I suspected a storage issue. Results from sustained write tests (use of FreeBSD-CURRENT was incidental) strengthened the suspicion. Windows was problem-free following replacement of the drive.
 