Solved One disk slow

I have a mirrored zpool on NVMe drives, and one of the disks is slow. It shows only 534 ops/s (89% busy), while this disk model can do more than 10000 ops/s. Do you think this is a hardware issue?

Code:
dT: 1.002s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0    534    529  16384    1.7      5    275    1.1   89.0| nvd0
    0      0      0      0    0.0      0      0    0.0    0.0| nvd0p1
    0      0      0      0    0.0      0      0    0.0    0.0| nvd0p2
    0    534    529  16384    1.7      5    275    1.1   89.0| nvd0p3
    0    538    533  17203    0.1      5    275    0.3    4.4| nvd1
    0      0      0      0    0.0      0      0    0.0    0.0| nvd1p1
    0      0      0      0    0.0      0      0    0.0    0.0| nvd1p2
    0    538    533  17203    0.1      5    275    0.3    4.4| nvd1p3
    0      0      0      0    0.0      0      0    0.0    0.0| mirror/swap
 
SMART shows no issues:

Code:
smartctl 7.3 2022-02-28 r5338 [FreeBSD 13.1-RELEASE-p1 amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SAMSUNG MZQL23T8HCLS-00A07
Serial Number:                      S64HNE0R513628
Firmware Version:                   GDC5602Q
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 3,840,755,982,336 [3.84 TB]
Unallocated NVM Capacity:           0
Controller ID:                      6
NVMe Version:                       1.4
Number of Namespaces:               32
Local Time is:                      Thu Sep  1 12:13:00 2022 EEST
Firmware Updates (0x17):            3 Slots, Slot 1 R/O, no Reset required
Optional Admin Commands (0x005f):   Security Format Frmw_DL NS_Mngmt Self_Test MI_Snd/Rec
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x0e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     80 Celsius
Critical Comp. Temp. Threshold:     83 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +    25.00W   14.00W       -    0  0  0  0       70      70
 1 +     8.00W    8.00W       -    1  1  1  1       70      70

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        35 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    2%
Data Units Read:                    261,134,698 [133 TB]
Data Units Written:                 800,181,967 [409 TB]
Host Read Commands:                 484,338,966
Host Write Commands:                1,643,218,472
Controller Busy Time:               8,947
Power Cycles:                       11
Power On Hours:                     9,043
Unsafe Shutdowns:                   0
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               35 Celsius
Temperature Sensor 2:               44 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged
 
The "only 534 ops per second" may not be a problem at all. It may indicate that the disk cannot do more ops than this (which would be bad), or that there is no need for the disk to do any more ops (which would be good). How do we distinguish these two cases? We can look at two things: (a) is the disk busy, and (b) is there a queue of IO requests waiting to be done.

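Start with item (b): The only queue measurement you have is the L(q) column, which reads 0 for every device. That is a value sampled at a single instant rather than an average over the interval, so it tells us very little. So forget that.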

How busy are the two disks? Well, we have a measurement of that, and we can recalculate it from the other measurements. Let's start with disk nvd1: It claims to be 4.4% busy (meaning 4.4% of the time it has at least one IO request either being worked on, or in the queue waiting to be worked on; the other 95.6% of the time it has no work to do). It also claims to be doing 533 reads per second (each read takes 0.1 ms on average), and 5 writes per second (each of those takes 0.3 ms). If we multiply those out, we find that the disk should be 5.48% busy on average (just multiply the number of IOs by the latency of each; the rate is low enough that queueing is unlikely to be an effect, so a linear extrapolation is justified). But that calculation has large error bars, mostly because the "average 0.1 ms per IO" measurement has a large error: due to rounding it could be anywhere between 0.05 and 0.149 ms, which puts the estimate anywhere between roughly 3% and 8%. So for this disk we find the 4.4% measurement of "busy" to be justified.

Now go to disk nvd0: It claims to be 89.0% busy. That's in and of itself already surprising, given that it is technically the same disk model as nvd1, and it has roughly the same workload intensity. Let's do the same check: 529 reads at 1.7 ms each + 5 writes at 1.1 ms each mean it should be 90.5% busy. Again, that is in reasonable agreement with the 89.0%. The fact that the estimate is higher than the measurement could be caused by parallelism: at ~90% busy you expect a significant fraction of IOs to get started while another IO is already being worked on. So this all makes sense.
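For anyone who wants to redo the arithmetic, here is a minimal Python sketch of the busy estimate for both disks. The function name `busy_percent` is just something made up for illustration; the numbers are copied from the gstat output above, and the +/- 0.05 ms bracket assumes gstat rounds latencies to the nearest 0.1 ms.

```python
def busy_percent(reads_per_s, ms_per_read, writes_per_s, ms_per_write):
    """Service time accumulated per second of wall time, as a percentage.

    Assumes IOs do not overlap, so this is a linear estimate only.
    """
    busy_ms = reads_per_s * ms_per_read + writes_per_s * ms_per_write
    return busy_ms / 1000.0 * 100.0


# Values copied from the gstat output: r/s, ms/r, w/s, ms/w.
nvd0_est = busy_percent(529, 1.7, 5, 1.1)
nvd1_est = busy_percent(533, 0.1, 5, 0.3)

# Assumption: latencies are rounded to 0.1 ms, so bracket nvd1 by +/- 0.05 ms.
nvd1_low = busy_percent(533, 0.05, 5, 0.25)
nvd1_high = busy_percent(533, 0.15, 5, 0.35)

print(f"nvd0: estimated {nvd0_est:.1f}% busy, gstat reports 89.0%")
print(f"nvd1: estimated {nvd1_est:.2f}% busy, gstat reports 4.4%")
print(f"nvd1 rounding bracket: {nvd1_low:.1f}% to {nvd1_high:.1f}%")
```

The reported 4.4% for nvd1 falls inside the rounding bracket, and the nvd0 estimate lands within a couple of points of the reported 89.0%.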

The real problem seems to be this: Disk nvd0 is a heck of a lot slower than disk nvd1. On reads, it needs on average 1.7 ms per IO, where the good disk needs 0.1 ms (which, allowing for rounding, means it is between 11x and 35x slower). On writes, it needs 1.1 ms per IO compared to 0.3 ms (meaning it is between 3x and 5x slower). Clearly this indicates that nvd0 is not a healthy disk.
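The slowdown ranges come from taking the worst-case rounding on both latencies. A small sketch of that, assuming each gstat latency is rounded to the nearest 0.1 ms (the helper names are hypothetical):

```python
def rounding_interval(value_ms, step_ms=0.1):
    """Range of true values that would display as value_ms after rounding."""
    half = step_ms / 2
    return value_ms - half, value_ms + half


def slowdown_range(slow_ms, fast_ms):
    """Smallest and largest possible slow/fast latency ratio."""
    slow_lo, slow_hi = rounding_interval(slow_ms)
    fast_lo, fast_hi = rounding_interval(fast_ms)
    return slow_lo / fast_hi, slow_hi / fast_lo


# ms/r and ms/w for nvd0 (slow) versus nvd1 (fast), from the gstat output.
read_lo, read_hi = slowdown_range(1.7, 0.1)
write_lo, write_hi = slowdown_range(1.1, 0.3)

print(f"reads:  {read_lo:.1f}x to {read_hi:.1f}x slower")
print(f"writes: {write_lo:.1f}x to {write_hi:.1f}x slower")
```

Even at the most favorable end of the rounding range, nvd0 is still an order of magnitude slower on reads.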

Here is a question, perhaps to help debugging it: The SMART output says that nvd0 has written 409 TB over its lifetime, with a capacity of 3.84 TB. That means each flash cell has been overwritten on average ~100 times. Modern consumer-grade flash disks have very low write endurance. Could it be that you have reached a write limit, where data on the media is highly fragmented, and internal log cleaning has ceased to preserve lifetime? Try finding some information on the expected or guaranteed write lifetime of this drive, and compare the total written between the two drives.
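The overwrite count can be checked directly from the raw SMART counters. This sketch relies on the NVMe convention that one "data unit" is 1000 blocks of 512 bytes, which matches the 409 TB that smartctl prints in brackets; the average is over the whole drive and says nothing about how evenly individual cells were worn.

```python
# One NVMe SMART "data unit" is 1000 x 512-byte blocks.
DATA_UNIT_BYTES = 512 * 1000

# Copied from the smartctl output above.
data_units_written = 800_181_967
capacity_bytes = 3_840_755_982_336

bytes_written = data_units_written * DATA_UNIT_BYTES
full_overwrites = bytes_written / capacity_bytes

print(f"written: {bytes_written / 1e12:.1f} TB")
print(f"average full-drive overwrites: {full_overwrites:.1f}")
```

That comes out a little above 100 full overwrites, in line with the rough figure above.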
 
The "good" disk has the same 409 TB of writes over its lifetime and no speed issues. These are "enterprise" disks, and they are guaranteed for writing the full size of the disk once per day. I will ask the datacenter to replace the cable first, and if that doesn't fix the issue I will ask them to replace the disk.
 
For enterprise flash, one full overwrite per day is compatible with a 5-year expected lifespan. Consumer grade, not so much.
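To put numbers on that, here is a rough sketch comparing the actual write rate with a one-overwrite-per-day budget. The 1 drive-write-per-day rating and the 5-year lifespan are the figures mentioned in this thread, not values taken from a datasheet, and using power-on hours as the time in service is also an assumption.

```python
DATA_UNIT_BYTES = 512 * 1000  # one NVMe SMART data unit

# Copied from the smartctl output above.
bytes_written = 800_181_967 * DATA_UNIT_BYTES
capacity_bytes = 3_840_755_982_336
power_on_hours = 9_043

# Assumption: the drive was in service for as long as it was powered on.
days_in_service = power_on_hours / 24
dwpd_actual = bytes_written / capacity_bytes / days_in_service

# Assumptions from the thread, not from a datasheet.
ASSUMED_DWPD_RATING = 1.0
ASSUMED_YEARS = 5
rated_total_bytes = ASSUMED_DWPD_RATING * capacity_bytes * 365 * ASSUMED_YEARS
fraction_used = bytes_written / rated_total_bytes

print(f"actual write rate: {dwpd_actual:.2f} drive writes per day")
print(f"share of assumed write budget used: {fraction_used:.1%}")
```

Under those assumptions the drive has been written at well under a third of the quoted rate, so write wear looks like an unlikely explanation for the slowdown.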

You're probably right that if all other things are equal, the cabling is a likely problem. Good luck.
 