Fastest device in a ZFS mirror

Hello All,

The purpose of this thread is to find a method to identify the fastest disk device in a ZFS pool configured as a mirror, using data the system already provides, without directly measuring the disks. This is meant as a generic question, but I have a concrete example here with two different NVMe drives in a mirror.

The mirror in this case has two devices:
Code:
# zpool iostat -lv
              capacity     operations     bandwidth    total_wait     disk_wait    syncq_wait    asyncq_wait  scrub   trim  rebuild
pool        alloc   free   read  write   read  write   read  write   read  write   read  write   read  write   wait   wait   wait
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
ssd_sys      106G   354G     12      7   216K   188K  135us    2ms  129us  372us  658ns    3ms   39us    1ms  141us    5ms      -
  mirror-0   106G   354G     12      7   216K   188K  135us    2ms  129us  372us  658ns    3ms   39us    1ms  141us    5ms      -
    nda0p3      -      -      6      3   108K  94.2K  152us    4ms  145us  776us  819ns    4ms   49us    4ms  138us    1ms      -
    nda1p3      -      -      6      4   109K  94.2K  118us  320us  114us   49us  496ns    1ms   30us  168us  144us   13ms      -
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----

I can see that nda1p3 seems to have lower disk read/write latencies than nda0p3. Does that mean it is the faster device?
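For a scriptable comparison, `zpool iostat` can emit exact values: the `-H` (scripted, tab-separated) and `-p` (parseable numbers, latencies in nanoseconds) flags combined with `-l` should allow ranking the leaf vdevs directly. A rough Python sketch; the column positions are assumed from the `-lv` layout above, and the sample data is illustrative, not real `-Hpl` output:

```python
# A rough sketch of ranking leaf vdevs by average read disk_wait, assuming
# `zpool iostat -Hpl` output: tab-separated rows, latencies in nanoseconds,
# '-' for fields that do not apply. Column positions follow the -l layout
# shown above; verify them against your OpenZFS version.

def rank_by_disk_wait(iostat_text):
    """Return leaf vdev names sorted by ascending read disk_wait (ns)."""
    results = []
    for line in iostat_text.strip().splitlines():
        fields = line.split("\t")
        name = fields[0]
        # Leaf vdevs report '-' for alloc/free in the per-vdev listing,
        # so use that to skip the pool and mirror summary rows.
        if fields[1] != "-":
            continue
        disk_wait_read = fields[9]  # disk_wait, read column
        if disk_wait_read == "-":
            continue
        results.append((int(disk_wait_read), name))
    return [name for _, name in sorted(results)]

# Illustrative sample modeled on the table above (values in ns/bytes).
sample = (
    "ssd_sys\t113816633344\t380113649664\t12\t7\t221184\t192512\t135000\t2000000\t129000\t372000\t658\t3000000\t39000\t1000000\t141000\t5000000\t-\n"
    "mirror-0\t113816633344\t380113649664\t12\t7\t221184\t192512\t135000\t2000000\t129000\t372000\t658\t3000000\t39000\t1000000\t141000\t5000000\t-\n"
    "nda0p3\t-\t-\t6\t3\t110592\t96461\t152000\t4000000\t145000\t776000\t819\t4000000\t49000\t4000000\t138000\t1000000\t-\n"
    "nda1p3\t-\t-\t6\t4\t111616\t96461\t118000\t320000\t114000\t49000\t496\t1000000\t30000\t168000\t144000\t13000000\t-\n"
)

print(rank_by_disk_wait(sample))  # nda1p3 first: lower average read disk_wait
```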

These drives have different controllers:
Code:
# pciconf -lv nvme0; pciconf -lv nvme1
nvme0@pci0:2:0:0:       class=0x010802 rev=0x01 hdr=0x00 vendor=0x10ec device=0x5772 subvendor=0x10ec subdevice=0x5772
    vendor     = 'Realtek Semiconductor Co., Ltd.'
    device     = 'RTS5772DL NVMe SSD Controller (DRAM-less)'
    class      = mass storage
    subclass   = NVM
nvme1@pci0:3:0:0:       class=0x010802 rev=0x01 hdr=0x00 vendor=0x1dbe device=0x5220 subvendor=0x1dbe subdevice=0x5220
    vendor     = 'INNOGRIT Corporation'
    device     = 'NVMe SSD Controller IG5220 (DRAM-less)'
    class      = mass storage
    subclass   = NVM

Does this suggest that the device with the IG5220 controller is faster in this case? If so, why does the zpool trim operation on nvme1 take so much longer?

Code:
# date; zpool trim ssd_sys
Tue Sep 17 19:47:04 EEST 2024

# zpool status -st
  pool: ssd_sys
 state: ONLINE
  scan: scrub repaired 0B in 00:02:34 with 0 errors on Fri Sep 13 22:41:31 2024
config:

        NAME        STATE     READ WRITE CKSUM  SLOW
        ssd_sys     ONLINE       0     0     0     -
          mirror-0  ONLINE       0     0     0     -
            nda0p3  ONLINE       0     0     0     0  (100% trimmed, completed at Tue Sep 17 19:49:41 2024)
            nda1p3  ONLINE       0     0     0     0  (100% trimmed, completed at Tue Sep 17 20:02:52 2024)

While a scrub is running:
Code:
# zpool iostat -lv
              capacity     operations     bandwidth    total_wait     disk_wait    syncq_wait    asyncq_wait  scrub   trim  rebuild
pool        alloc   free   read  write   read  write   read  write   read  write   read  write   read  write   wait   wait   wait
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
ssd_sys      106G   354G     29      7  1.19M   189K    8ms    2ms  166us  379us  658ns    3ms   39us    1ms   14ms    7ms      -
  mirror-0   106G   354G     29      7  1.19M   189K    8ms    2ms  166us  379us  658ns    3ms   39us    1ms   14ms    7ms      -
    nda0p3      -      -     13      3   611K  94.7K   13ms    4ms  209us  792us  819ns    4ms   49us    4ms   23ms    1ms      -
    nda1p3      -      -     15      4   611K  94.7K    4ms  313us  130us   48us  496ns    1ms   30us  164us    7ms   13ms      -
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----

And some time later, the latency histograms:

Code:
nda0p3       total_wait     disk_wait    syncq_wait    asyncq_wait
latency      read  write   read  write   read  write   read  write  scrub   trim  rebuild
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
1ns             0      0      0      0      0      0      0      0      0      0      0
3ns             0      0      0      0      0      0      0      0      0      0      0
7ns             0      0      0      0      0      0      0      0      0      0      0
15ns            0      0      0      0      0      0      0      0      0      0      0
31ns            0      0      0      0      0      0      0      0      0      0      0
63ns            0      0      0      0      0      0      0      0      0      0      0
127ns           0      0      0      0  9.27K  1.92K  22.2K    594     86      1      0
255ns           0      0      0      0   548K  11.1K   103K  5.41K    459     61      0
511ns           0      0      0      0   212K  10.7K  5.81K  27.2K  1.21K     86      0
1us             0      0      0      0   138K  14.0K  1.36K  46.7K  3.70K    207      0
2us             0      0      0      0  16.1K  4.16K    205  46.8K  2.55K    105      0
4us             0      0      0      0  1.10K    816     18  12.7K    706      0      0
8us             0      0      0      0    142     36     26  16.0K  1.40K      0      0
16us          778  8.43K    808  31.2K    303     29     70  33.5K  2.78K      0      0
32us        1.07K   110K  1.13K   293K    272     59    259  59.0K  6.61K      0      0
65us         240K   140K   282K   130K    186     68    615  58.7K  9.94K      0      0
131us        638K   104K   872K  66.6K    153    114  2.39K  66.3K  9.05K    154      0
262us        182K  75.5K   919K  45.4K    139    146  4.63K  51.4K  6.09K     46      0
524us       47.1K  28.2K   531K  5.36K     86    190  3.64K  15.7K  9.23K    207      0
1ms         23.5K  16.6K  42.2K  7.84K     61    163  2.13K  12.9K  16.0K    564      0
2ms         30.6K  14.3K  2.14K    427     31    483    326  12.3K  28.0K   456K      0
4ms         43.9K  15.5K  1.38K    799     13    824     73  13.8K  41.9K  53.8K      0
8ms         64.5K  16.5K  2.35K    263      4  1.69K     83  14.2K  62.7K  2.42K      0
16ms         581K  13.4K    686  2.60K     15    963    115  9.58K   605K     95      0
33ms         574K  2.93K     47     22      0  1.28K     12  1.55K   550K     12      0
67ms         203K  45.1K    223  7.33K      0  3.52K      0  34.1K   199K     24      0
134ms       23.8K    238      3     11      0      0      0    229  23.5K      5      0
268ms         551    181      1      3      0      0      0    173    545      0      0
536ms         608    139      1     51      0      0      0     88    607      0      0
1s            337      3      2      3      0      0      0      0    332      0      0
2s              0      0      0      0      0      0      0      0      0      0      0
4s              0      0      0      0      0      0      0      0      0      0      0
8s              0      0      0      0      0      0      0      0      0      0      0
17s             0      0      0      0      0      0      0      0      0      0      0
34s             0      0      0      0      0      0      0      0      0      0      0
68s             0      0      0      0      0      0      0      0      0      0      0
137s            0      0      0      0      0      0      0      0      0      0      0
---------------------------------------------------------------------------------------

nda1p3       total_wait     disk_wait    syncq_wait    asyncq_wait
latency      read  write   read  write   read  write   read  write  scrub   trim  rebuild
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
1ns             0      0      0      0      0      0      0      0      0      0      0
3ns             0      0      0      0      0      0      0      0      0      0      0
7ns             0      0      0      0      0      0      0      0      0      0      0
15ns            0      0      0      0      0      0      0      0      0      0      0
31ns            0      0      0      0      0      0      0      0      0      0      0
63ns            0      0      0      0      0      0      0      0      0      0      0
127ns           0      0      0      0  8.83K  3.20K  23.3K  1.70K  3.21K      0      0
255ns           0      0      0      0   555K  16.8K   104K  19.6K  52.2K      8      0
511ns           0      0      0      0   210K  13.3K  5.92K  73.6K   295K     44      0
1us             0      0      0      0   133K  12.1K  1.60K  84.4K   481K    123      0
2us             0      0      0      0  15.3K    590    213  44.0K  93.2K    279      0
4us             0      0      0      0  1.05K     57     60  17.2K  23.8K      5      0
8us             0      0      0      0    132     13    123  32.9K  36.2K      0      0
16us        34.0K  23.1K  47.2K  65.7K    309     16    295  62.6K  43.3K      0      0
32us         290K   212K   312K   420K    235     38    484  87.8K  40.9K      1      0
65us         396K   209K   412K   147K    246     36    868  83.3K  29.7K      0      0
131us       1.19M   145K  1.53M  79.7K    221     62  3.14K  80.4K  38.2K      0      0
262us        208K  84.2K   541K  18.3K    223    103  4.62K  51.8K  53.2K      0      0
524us        130K  26.3K   255K  1.23K    103    157  3.21K  17.8K  57.7K      0      0
1ms         96.6K  14.9K  23.8K  6.12K     19    335  1.50K  12.4K  73.1K      0      0
2ms          106K  9.49K    878     14      1    375    210  7.90K  98.2K      0      0
4ms          130K  5.73K     66     26      0    718    116  4.75K   126K    460      0
8ms          144K  4.87K      9    116      0  1.10K     17  3.42K   140K  3.71K      0
16ms         155K  2.08K      0      7      0  1.41K      0    626   152K   466K      0
33ms         150K  2.13K      0      6      0  1.87K      0    173   149K  42.6K      0
67ms        77.2K    404      0      0      0    164      0    202  76.0K     75      0
134ms       25.2K      0      0      0      0      0      0      0  25.1K      0      0
268ms         692      0      0      0      0      0      0      0    665      0      0
536ms           0      0      0      0      0      0      0      0      0      0      0
1s              0      0      0      0      0      0      0      0      0      0      0
2s              0      0      0      0      0      0      0      0      0      0      0
4s              0      0      0      0      0      0      0      0      0      0      0
8s              0      0      0      0      0      0      0      0      0      0      0
17s             0      0      0      0      0      0      0      0      0      0      0
34s             0      0      0      0      0      0      0      0      0      0      0
68s             0      0      0      0      0      0      0      0      0      0      0
137s            0      0      0      0      0      0      0      0      0      0      0
---------------------------------------------------------------------------------------
 
I didn't calculate mean and variance from the histograms, nor did I turn them into graphs, so my conclusions below may be inaccurate.
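For what it's worth, an approximate mean and standard deviation can be recovered from those power-of-two buckets by treating each count as sitting at the midpoint of its bin. A minimal Python sketch; the assumption that a bucket label marks roughly the upper edge of its bin, and the midpoint choice, are mine:

```python
import math

# Approximate mean and standard deviation of a latency distribution from
# `zpool iostat -w`-style power-of-two histogram buckets. Each bucket label
# is taken as the upper edge of its bin; counts are placed at the bin midpoint.

UNITS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0}

def parse_latency(label):
    """Convert a bucket label like '131us' to seconds."""
    for suffix, scale in UNITS.items():
        if label.endswith(suffix):
            return float(label[:-len(suffix)]) * scale
    raise ValueError(label)

def histogram_stats(buckets):
    """buckets: list of (label, count) pairs in ascending order.
    Returns (mean, stddev) in seconds."""
    total = 0
    mean_acc = 0.0
    sq_acc = 0.0
    prev_edge = 0.0
    for label, count in buckets:
        edge = parse_latency(label)
        mid = (prev_edge + edge) / 2  # midpoint of the bin
        total += count
        mean_acc += count * mid
        sq_acc += count * mid * mid
        prev_edge = edge
    mean = mean_acc / total
    var = sq_acc / total - mean * mean
    return mean, math.sqrt(max(var, 0.0))
```

Feeding in, say, the nda0p3 disk_wait read column would put an actual number on the eyeballed average instead of a guess.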

Yes, from the data it is obvious that disk 0 is slower on reads (significantly, and surprisingly, reads taking dozens of ms on an SSD is odd), and about even with disk 1 on writes. The part that I don't understand is the significant difference between total_... and disk_... numbers. The former include queuing time (per the ZFS documentation), but I don't know what "queue" means in this context. Is this a ZFS internal queue, before it sends the IO to the kernel for execution, or is ZFS peeking inside the block device driver's queue and measuring its length? For both disks (and particularly for disk 0), the read numbers are significantly slower when including queueing. This makes me suspect that queuing is handled badly for disk 0. If we ignore queueing, both disks are pretty fast, with IO times significantly sub millisecond, average roughly 1/10 ms, and writes faster than reads (as it should be for a good SSD).

I could now give a 2-hour lecture on the importance of queue depth and queue management on disk performance. Let's not, because (a) I don't have 2 hours, and (b) most of my knowledge is relevant to spinning rust on SATA and SCSI interfaces, not to leaky capacitors on NVMe. But getting this optimized would require understanding how ZFS queues IOs, how that interacts with the OS's queues, and how these particular SSDs manage multiple queued IOs internally (if at all). One of the fascinating things about SSDs is that (unlike spinning disks) they can work on multiple IOs simultaneously with parallel channels, and that is even visible on their host interfaces. Getting the best performance out of that requires dealing with the disk's firmware and configuration, and the OS queue configuration, and the HBA in between (and I'm not familiar with NVMe at all).

The other fascinating difference is, as you pointed out, the trim operation. Here disk 0 is roughly 6x faster (about 2.5 minutes versus about 16). And honestly, I'm not terribly surprised by that. Trim is actually quite a complex thing to implement, comparable to a write, but of unlimited size: it changes the disk's internal metadata (which blocks are allocated or not), including the write overhead of hardening that metadata. If the SSD's data structures are not built with trim in mind, the operation can be difficult to implement, with atomicity and consistency problems that can require slow workarounds.

But trim is a relatively recent addition (it has only been common for the last 5-10 years). Part of the difference is likely due to the sociology of these two companies: Realtek is quite old, grew out of low-end sound chips in 1980s and 1990s PCs, and has been making SSD controllers for a long time. Innogrit is a relatively recent company, founded by refugees from Marvell and using inexpensive Chinese engineering. I bet someone familiar with the FTL industry could see the two companies' organization and strengths reflected in those performance numbers. As I like to say: culture eats strategy for breakfast.
 
I have read the zpool-iostat(8) manual page, but it is not very specific about how these numbers are calculated. It just says: "total_wait: Average total I/O time (queuing + disk I/O time)." The most relevant figure here seems to be "disk_wait: Average disk I/O time (time reading/writing the disk)."
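Given those definitions, the average queuing component per disk should simply be total_wait minus disk_wait. A quick illustration using the read latencies from the scrub-time output above (values in microseconds; the subtraction is my reading of the man page definition, not a statistic ZFS reports directly):

```python
# The man page says total_wait = queuing + disk I/O time, so the average
# queuing component per disk is just the difference. Values below are the
# read latencies from the scrub-time iostat output, in microseconds.

def queue_wait(total_us, disk_us):
    """Approximate average queuing time as total_wait - disk_wait."""
    return total_us - disk_us

# nda0p3 during the scrub: 13 ms total read wait vs 209 us on the device
print(queue_wait(13000, 209))  # 12791 us: ~12.8 ms spent queued, not on the disk
# nda1p3 during the scrub: 4 ms total vs 130 us on the device
print(queue_wait(4000, 130))   # 3870 us: ~3.9 ms queued
```

Which would mean that during a scrub almost all of the per-read wait on both disks is queuing, not the device itself.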

It would be interesting to see these results from somebody who has a really good high-performance drive. What are the limits of disk_wait times?
 
"Really good high-performance drive" is not me: at home I use ZFS on spinning-rust disks, and on a 15-year-old Intel SSD connected via SATA.

Here's the data (without the capacity and trim columns; my SSD doesn't support trim):
Code:
                       operations     bandwidth    total_wait     disk_wait    syncq_wait    asyncq_wait  scrub
pool                   read  write   read  write   read  write   read  write   read  write   read  write   wait
--------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
home                      0      0    554  1.05K    7ms    2ms    6ms  489us  559us    3us   10ms    2ms   28ms
  mirror-0                0      0    554  1.05K    7ms    2ms    6ms  489us  559us    3us   10ms    2ms   28ms
    gpt/hd14_home         0      0    277    535    7ms    2ms    7ms  567us  516us    3us   12ms    2ms   35ms
    gpt/hd16_home         0      0    277    535    6ms    2ms    5ms  415us  599us    3us    8ms    1ms   23ms
--------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
zroot                     0      4  5.18K  48.0K  389us  751us  304us  246us   28us    4us  778us  539us  965us
  ada0p3                  0      4  5.18K  48.0K  389us  751us  304us  246us   28us    4us  778us  539us  965us
Note that my workload is usually light, so there should be few queueing effects.
 
Here is my other desktop system with two spinning disks and SSD cache:
Code:
              capacity     operations     bandwidth    total_wait     disk_wait    syncq_wait    asyncq_wait  scrub   trim  rebuild
pool        alloc   free   read  write   read  write   read  write   read  write   read  write   read  write   wait   wait   wait
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
kelder       681G  1.14T      2     13  49.1K   625K   26ms   28ms   15ms    4ms  637us    9ms    1ms   28ms  118ms      -      -
  mirror-0   681G  1.14T      2     13  49.1K   625K   26ms   28ms   15ms    4ms  637us    9ms    1ms   28ms  118ms      -      -
    ada0p3      -      -      1      6  23.0K   312K   29ms   12ms   16ms    3ms  965us   10ms    1ms    8ms  194ms      -      -
    ada1p3      -      -      1      6  26.1K   312K   23ms   44ms   15ms    5ms  290us    8ms    1ms   50ms   75ms      -      -
cache           -      -      -      -      -      -      -      -      -      -      -      -      -      -      -      -      -
  ada2p3     205G  2.20G     63      2  1.06M   135K  546us   20ms  481us    3ms   35us  768ns  194us   17ms      -      -      -
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
ada0 is a Seagate, ada1 is an older Toshiba, and ada2 is a cheap SATA SSD.
 