ZFS: Poor performance on unrelated single drives on the same HBA as a zpool during resilvering

I'm running FreeBSD (14.3) on old server hardware (SuperMicro 847 chassis) with the following (copied from dmesg):
Code:
CPU: Intel(R) Xeon(R) CPU E5-2667 v2 @ 3.30GHz (3300.22-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x306e4  Family=0x6  Model=0x3e  Stepping=4
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x7fbee3ff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND>
  AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
  AMD Features2=0x1<LAHF>
  Structured Extended Features=0x281<FSGSBASE,SMEP,ERMS>
  Structured Extended Features3=0x9c000000<IBPB,STIBP,L1DFL,SSBD>
  XSAVE Features=0x1<XSAVEOPT>
  VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID,VID,PostIntr
  TSC: P-state invariant, performance statistics
real memory  = 277025390592 (264192 MB)
avail memory = 267661074432 (255261 MB)
Event timer "LAPIC" quality 600
ACPI APIC Table: <ALASKA A M I>
FreeBSD/SMP: Multiprocessor System Detected: 32 CPUs
FreeBSD/SMP: 2 package(s) x 8 core(s) x 2 hardware threads

The drives involved are 35 Seagate Exos X16 16TB and 1 Seagate Exos X18 18TB (one of the ZFS spares; it was a warranty replacement). 24 drives are on the front backplane (including the X18); the rear backplane has 12 (all X16). The zpool configuration is 3 x 11-wide RAIDZ3 with 3 hot spares, with two vdevs entirely on the front backplane and the remaining one entirely on the rear backplane (along with two spares on the front and one on the rear).

I'm considering GELI encrypting the drives, should the performance be sufficient, so I set up a test by removing the three hot spares from the array and testing them as individual drives. In this setup performance seemed pretty reasonable--when zeroing out the drives, dd reported about 220MB/s per drive, down from 270MB/s for the raw unencrypted drives. The performance was the same writing to three drives at a time as it was to one.
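
For reference, the zeroing runs were along these lines (the device name and block size here are illustrative, not the exact invocation):
Code:
# raw drive (~270MB/s on these drives):
dd if=/dev/zero of=/dev/da22 bs=1M status=progress
# same drive behind GELI (~220MB/s):
dd if=/dev/zero of=/dev/da22.eli bs=1M status=progress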

So far so good; I added the da*.eli devices as spares to my ZFS pool and then offlined three drives (one per vdev). The resilvering performance is hard to judge; I think it's taking a little longer than normal, but I can't remember the last time I resilvered three drives at once. So I'm putting that aside for the moment, except to note that it is going on in the background. I did check CPU usage during the resilver, and no CPU reported much more than 20% utilization.

However, while the resilver is happening, the dd processes zeroing out the next set of three drives report an absolutely abysmal slowdown--from 220MB/s per drive down to 80MB/s per drive. It doesn't seem to matter whether one drive is being zeroed or three. Rechecking CPU usage shows a couple of CPUs at 22-23%, the rest still in the mid to high teens. If there's a CPU bottleneck, I can't see it. The processor supports AES-NI (it shows up in the dmesg); all of the GELI devices report `Crypto: accelerated software`. I did a `geli list`, and all of the drives look the same, so I'm only showing the output for one drive:

Code:
Geom name: da0.eli
State: ACTIVE
EncryptionAlgorithm: AES-XTS
KeyLength: 256
Crypto: accelerated software
Version: 7
UsedKey: 0
Flags: AUTORESIZE
KeysAllocated: 3726
KeysTotal: 3726
Providers:
1. Name: da0.eli
   Mediasize: 16000900657152 (15T)
   Sectorsize: 4096
   Mode: r1w1e0
Consumers:
1. Name: da0
   Mediasize: 16000900661248 (15T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e1

Anyone have any ideas where else I can look for the cause of the slowdown? Writing all zeroes to a drive shouldn't really be a difficult workload (and in any event I'm comparing the same workload both times). I'm concerned I might get half of my pool encrypted only to find out that performance has completely tanked. I already set `kern.geom.eli.threads` to 1, and while that has limited the threads to one per drive, I see very little difference in performance.
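
For the record, the relevant knobs (the sysctl is the one documented in geli(8); the grep is just a quick way to eyeball the Crypto line on every provider):
Code:
# limit GELI to one worker thread per provider:
sysctl kern.geom.eli.threads=1
# confirm AES-XTS is accelerated on every provider:
geli list | grep -E 'Geom name|Crypto'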

Or is it just normal for unrelated I/O to completely tank on a machine that is resilvering a zfs pool?
 
I remember seeing problem reports about geli's performance in FreeBSD's bug database. You could search there for some insights.

Have you tested the performance during resilvering when writing without geli (i.e. directly to disk, without any encryption)?

Also note that writing zeroes to a disk might be optimized in some way so you could be getting misleading performance numbers. I'd try writing random data instead.
 
I'm sure the absolute numbers themselves are meaningless. I'm not trying to gauge actual performance here, just the relative performance loss. The resilver finished today around 5pm. I restarted the dd processes (I had them suspended) and performance picked back up. Without the resilver it's back to averaging 220-230MB/s, no other changes.

I haven't tested writing to the raw disk while resilvering is happening because I don't have any spare bays to try it at the moment (the spares are in use, and the three unused drives are already encrypted and being zeroed). Once the drives are done zeroing out (likely by some time tomorrow), I can offline a second set of three prior to starting the resilver, and try writing to the offlined drives before encrypting them.

I looked through the Bugzilla and don't see any open bugs regarding GELI and performance, except for one related to `kern.geom.eli.threads` (solution: set it to 1, which I already did) and one related to some sort of add-on cryptography card that I don't have. But that's only searching open/new/in-progress bugs; I can't see how to search closed/resolved ones to see if there's any help there.
 
Performance writing to the unencrypted devices (da22/da23/da24) while a resilver is happening is 63, 66, and 87 MB/s (respectively); with the GELI encrypted devices (da22.eli/da23.eli/da24.eli) it is 63, 64, and 81 MB/s (respectively). Here's the output of `gstat`. At the time this was taken, da0.eli, da2.eli, and da5.eli are part of the zpool and being resilvered to; da6.eli, da25.eli, and da35.eli are in the pool and are being read from; and da22.eli, da23.eli, and da24.eli are not part of the zpool at all and are being zeroed out with dd. The SATA disk drives are da0-da35; ada0 and ada1 are a boot mirror of SSDs and uninvolved, while da36 is a USB key with the FreeBSD 14.3 installer on it.

Code:
dT: 1.003s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0      0      0      0  0.000      0      0  0.000    0.0| ada0
    0      0      0      0  0.000      0      0  0.000    0.0| ada1
    2     99      0      0  0.000     97  59796   14.2   75.1| da0
    3     78     57  52097   30.0     19    195  0.552   63.3| da1
    2     86      0      0  0.000     84  49809   17.8   80.8| da2
    3     88     67  61168   27.5     19    199  0.559   64.5| da3
    3     74     54  47548   32.6     18    195  0.606   65.9| da4
    2     84      0      0  0.000     82  51527   16.5   73.0| da5
    3     86     64  51312   27.1     20    195  0.510   63.6| da6
    3     73     53  46524   32.9     18    195  0.568   64.2| da7
    3     73     52  46217   33.5     19    203  0.575   63.8| da8
    3     78     55  49586   32.9     21    203  0.544   64.8| da9
    3     86     63  59266   30.3     21    207  0.445   66.7| da10
    3     72     52  45507   33.3     18    207  0.669   63.8| da11
    3     75     53  46524   33.2     20    211  0.650   64.3| da12
    3     74     54  48565   33.3     18    183  0.614   64.0| da13
    3     80     59  53002   29.4     19    199  0.763   63.3| da14
    3     73     52  45507   33.9     19    203  0.574   63.9| da15
    3     76     53  47233   32.8     21    195  0.523   64.0| da16
    3     76     53  48441   34.9     21    211  0.553   66.2| da17
    3     71     51  46217   34.5     18    203  0.661   64.9| da18
    3     82     60  54019   29.5     20    199  0.653   65.2| da19
    3     74     52  47229   33.8     20    187  0.593   65.0| da20
    3     74     51  48258   35.6     21    211  0.568   65.5| da21
    1    119      0      0  0.000    119 121458    7.0   82.7| da22
    1    114      0      0  0.000    114 116355    7.2   81.7| da23
    1    135      0      0  0.000    135 137789    5.8   78.6| da24
    3     68     48  41297   34.9     18    191  0.657   64.1| da25
    3     78     58  53102   30.1     18    203  0.721   63.3| da26
    3     71     49  46209   37.4     20    207  0.631   65.6| da27
    3     81     61  56973   30.9     18    195  0.607   67.6| da28
    3     72     48  45188   37.2     22    211  0.543   65.6| da29
    3     72     47  43151   37.2     23    207  0.469   64.7| da30
    3     74     52  46528   33.2     20    207  0.608   63.7| da31
    3     84     61  56854   29.7     21    207  0.704   65.8| da32
    3     81     59  54398   29.9     20    199  0.475   64.0| da33
    3     82     59  52456   32.0     21    191  0.515   67.1| da34
    3     87     61  58050   29.9     24    211  0.423   65.3| da35
    0      0      0      0  0.000      0      0  0.000    0.0| da36
    2     99      0      0  0.000     97  59796   15.1   75.2| da0.eli
    2     86      0      0  0.000     84  49809   18.8   80.9| da2.eli
    0      0      0      0  0.000      0      0  0.000    0.0| ada1p1
    0      0      0      0  0.000      0      0  0.000    0.0| ada1p2
    0      0      0      0  0.000      0      0  0.000    0.0| ada1p3
    2     84      0      0  0.000     82  51527   17.6   73.2| da5.eli
    0      0      0      0  0.000      0      0  0.000    0.0| ada0p1
    0      0      0      0  0.000      0      0  0.000    0.0| ada0p2
    0      0      0      0  0.000      0      0  0.000    0.0| ada0p3
    3     86     64  51312   27.8     20    195  0.527   64.0| da6.eli
    3     68     48  41297   35.6     18    191  0.675   64.5| da25.eli
    3     87     61  58050   30.9     24    211  0.440   65.6| da35.eli
    0      0      0      0  0.000      0      0  0.000    0.0| mirror/swap.eli
    0      0      0      0  0.000      0      0  0.000    0.0| da36s1
    0      0      0      0  0.000      0      0  0.000    0.0| da36s2
    0      0      0      0  0.000      0      0  0.000    0.0| gpt/gptboot1
    0      0      0      0  0.000      0      0  0.000    0.0| gptid/4434e3e6-ebd0-11ec-94d8-002590c118c4
    0      0      0      0  0.000      0      0  0.000    0.0| gpt/gptboot0
    0      0      0      0  0.000      0      0  0.000    0.0| gptid/441dbd6d-ebd0-11ec-94d8-002590c118c4
    0      0      0      0  0.000      0      0  0.000    0.0| mirror/swap
    0      0      0      0  0.000      0      0  0.000    0.0| msdosfs/EFISYS
    0      0      0      0  0.000      0      0  0.000    0.0| da36s2a
    0      0      0      0  0.000      0      0  0.000    0.0| ufsid/6842c0e04852554a
    0      0      0      0  0.000      0      0  0.000    0.0| ufs/FreeBSD_Install
    1    119      0      0  0.000    119 121458    8.2   97.1| da22.eli
    1    114      0      0  0.000    114 116355    8.4   95.4| da23.eli
    1    135      0      0  0.000    135 137789    7.0   94.5| da24.eli
Here's the output of `top` while this is happening, if that's helpful at all:
Code:
last pid: 83754;  load averages:  7.41,  7.35,  6.60                                         up 1+23:45:44  10:17:50
60 processes:  1 running, 59 sleeping
CPU 0:   0.4% user,  0.0% nice, 18.9% system,  0.0% interrupt, 80.7% idle
CPU 1:   0.4% user,  0.0% nice, 19.3% system,  0.0% interrupt, 80.3% idle
CPU 2:   0.0% user,  0.0% nice, 18.5% system,  0.0% interrupt, 81.5% idle
CPU 3:   0.8% user,  0.0% nice, 16.1% system,  0.0% interrupt, 83.1% idle
CPU 4:   2.4% user,  0.0% nice, 17.7% system,  0.0% interrupt, 79.9% idle
CPU 5:   0.0% user,  0.0% nice, 14.6% system,  0.0% interrupt, 85.4% idle
CPU 6:   0.0% user,  0.0% nice, 17.3% system,  0.0% interrupt, 82.7% idle
CPU 7:   1.6% user,  0.0% nice, 19.3% system,  0.0% interrupt, 79.1% idle
CPU 8:   0.0% user,  0.0% nice, 23.2% system,  0.0% interrupt, 76.8% idle
CPU 9:   0.4% user,  0.0% nice, 23.2% system,  0.0% interrupt, 76.4% idle
CPU 10:  0.0% user,  0.0% nice, 20.5% system,  0.0% interrupt, 79.5% idle
CPU 11:  0.4% user,  0.0% nice, 20.9% system,  0.0% interrupt, 78.7% idle
CPU 12:  0.4% user,  0.0% nice, 25.2% system,  0.0% interrupt, 74.4% idle
CPU 13:  0.0% user,  0.0% nice, 16.5% system,  0.0% interrupt, 83.5% idle
CPU 14:  0.4% user,  0.0% nice, 20.5% system,  0.0% interrupt, 79.1% idle
CPU 15:  1.2% user,  0.0% nice, 20.5% system,  0.0% interrupt, 78.3% idle
CPU 16:  0.4% user,  0.0% nice, 23.6% system,  1.6% interrupt, 74.4% idle
CPU 17:  0.0% user,  0.0% nice, 16.1% system,  1.2% interrupt, 82.7% idle
CPU 18:  0.0% user,  0.0% nice, 20.5% system,  0.0% interrupt, 79.5% idle
CPU 19:  0.0% user,  0.0% nice, 23.2% system,  0.0% interrupt, 76.8% idle
CPU 20:  0.4% user,  0.0% nice, 19.3% system,  0.0% interrupt, 80.3% idle
CPU 21:  1.6% user,  0.0% nice, 21.3% system,  0.0% interrupt, 77.2% idle
CPU 22:  1.2% user,  0.0% nice, 20.9% system,  0.0% interrupt, 78.0% idle
CPU 23:  0.4% user,  0.0% nice, 18.5% system,  0.0% interrupt, 81.1% idle
CPU 24:  0.4% user,  0.0% nice, 16.9% system,  0.4% interrupt, 82.3% idle
CPU 25:  0.8% user,  0.0% nice, 26.8% system,  0.0% interrupt, 72.4% idle
CPU 26:  0.4% user,  0.0% nice, 16.9% system,  0.4% interrupt, 82.3% idle
CPU 27:  0.8% user,  0.0% nice, 22.4% system,  0.4% interrupt, 76.4% idle
CPU 28:  0.4% user,  0.0% nice, 26.0% system,  0.0% interrupt, 73.6% idle
CPU 29:  0.4% user,  0.0% nice, 19.7% system,  0.0% interrupt, 79.9% idle
CPU 30:  2.0% user,  0.0% nice, 20.1% system,  0.4% interrupt, 77.6% idle
CPU 31:  0.8% user,  0.0% nice, 20.1% system,  0.4% interrupt, 78.7% idle
Mem: 27M Active, 312M Inact, 105G Wired, 143G Free
ARC: 54G Total, 22G MFU, 16G MRU, 9876K Anon, 1325M Header, 15G Other
     31G Compressed, 247G Uncompressed, 7.99:1 Ratio
Swap: 64G Total, 64G Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
56579 root          5  68    0   160M    41M usem    14  92:44   8.02% python3.11
10380 root          1  23    0    16M  3456K physwr  22   0:32   4.50% dd
80417 root          1  21    0    16M  3456K physwr  21   0:30   3.06% dd
  386 root          1  21    0    16M  3460K physwr   3   0:24   2.71% dd
82175 root          1  20    0    15M  3740K CPU19   19   0:00   0.26% top
22972 root          1  20    0    29M    15M select   3   0:13   0.05% sshd-session
69925 root          1  20    0    14M  2628K kqread  27   0:40   0.03% tail
72417 root          1  20    0    14M  2512K select  28   0:36   0.03% tail
18259 root          1  20    0    26M    11M select   5   0:12   0.02% sshd-session
45662 nut           1  20    0    14M  3224K select  10   0:43   0.02% usbhid-ups
53276 root          1  20    0    14M  2284K select   4   0:40   0.02% powerd
79749 root          1  20    0    29M    15M select  15   0:25   0.02% sshd-session
52895 ntpd          1  20    0    24M  7868K select  14   0:19   0.02% ntpd
45718 nut           1  20    0   269M  4812K select  16   0:21   0.01% upsd
47168 root          1  20    0    23M    11M select  28   0:01   0.01% sshd-session
51282 root          1  20    0   161M   203M select  30  29:10   0.00% smbd
79814 root          1  20    0   155M   205M select   7   2:26   0.00% smbd
66742 root          1  20    0    82M    96M select   7   0:08   0.00% nmbd
68162 root          1  21    0   135M   186M select  10   0:08   0.00% smbd
46588 nut           1  20    0    19M  7236K nanslp  22   0:08   0.00% upsmon
 4218 root          1  20    0    24M    10M select  27   0:04   0.00% zfsd
 3439 root          1  20    0    15M  4960K select   7   0:02   0.00% devd
91742 root          2  20    0   136M   184M select  27   0:02   0.00% smbd
19594 root          1  20    0    19M  7036K nanslp  29   0:01   0.00% smartd
76800 root          1  20    0    14M  2596K nanslp  21   0:01   0.00% cron
38915 root          1  20    0    14M  2824K select  17   0:01   0.00% syslogd
93738 root          1  20    0   132M   183M select  16   0:00   0.00% smbd
70251 root          2  20    0    14M  3028K piperd  24   0:00   0.00% sshg-blocker
18488 root          1  20    0    16M  5124K pause    6   0:00   0.00% zsh
 
I have heard about ZFS resilver "disturbing" unrelated disk activity before. I don't remember the outcome.

Obviously it shouldn't do that. At an extreme it could be something like a badly placed kernel lock. Unlikely of course, but I don't see likely explanations for this.
 
Also of note here: writes to the ZFS pool itself don't seem to be too badly affected while this resilver is going on (I'll get non-resilvering numbers tomorrow, when the resilver is complete):

Code:
# dd if=/dev/zero of=/workspc/test_file bs=1M count=102400 conv=sync status=progress
  107156078592 bytes (107 GB, 100 GiB) transferred 122.004s, 878 MB/s
102400+0 records in
102400+0 records out
107374182400 bytes transferred in 122.864198 secs (873925719 bytes/sec)
And /dev/random:
Code:
# dd if=/dev/random of=/workspc/test_file bs=1M count=102400 conv=sync status=progress
  107272470528 bytes (107 GB, 100 GiB) transferred 446.002s, 241 MB/s
102400+0 records in
102400+0 records out
107374182400 bytes transferred in 446.353908 secs (240558401 bytes/sec)
And copying the previously created random file from one dataset to another (because I've always found /dev/random to be a bottleneck):
Code:
# dd if=/workspc/test_file of=/data/test_file bs=1M conv=sync status=progress
  107107844096 bytes (107 GB, 100 GiB) transferred 241.002s, 444 MB/s
102400+0 records in
102400+0 records out
107374182400 bytes transferred in 241.336992 secs (444913901 bytes/sec)
These are Exos X16 16TB drives, so the official spec is sustained transfer of 249MiB/s.
 
I have heard about ZFS resilver "disturbing" unrelated disk activity before. I don't remember the outcome.

Obviously it shouldn't do that. At an extreme it could be something like a badly placed kernel lock. Unlikely of course, but I don't see likely explanations for this.

I don't suppose you recall whether a bug was ever filed on this behavior (whether it was closed/fixed or not)?
 
Looking at the gstat output, could it be that the device is partitioned in a way that makes block 0 of the data area misaligned? These devices show 7 ms per write, and that latency is below the software stack. Maybe the drive needs to read two tracks, patch in some data in the middle, and rewrite them. The software stack seems not to be at fault here.
 
Call me ignorant, but to me the case is pretty clear:
The drives we're talking about here are all daX, right? Which to me means (external) USB drives.
All USB devices are attached to a single bus. (You may have a USB2 bus, which is separate from the USB3 one... but you get the point.)
To me it seems the USB bus is simply at its capacity limit.
The resilvering uses up so much traffic that any other drive on the same bus simply has to wait for its data packets to be delivered.
What also speaks for my thesis is that the adaX drives, which I believe are SATA drives, are not affected, because they are completely independent, on a totally different bus.
 
Call me ignorant, but to me the case is pretty clear:
The drives we're talking about here are all daX, right? Which to me means (external) USB drives.
All USB devices are attached to a single bus. (You may have a USB2 bus, which is separate from the USB3 one... but you get the point.)
To me it seems the USB bus is simply at its capacity limit.
The resilvering uses up so much traffic that any other drive on the same bus simply has to wait for its data packets to be delivered.
What also speaks for my thesis is that the adaX drives, which I believe are SATA drives, are not affected, because they are completely independent, on a totally different bus.
daN can be SCSI, SAS, or even ATA connected to some RAID/JBOD-capable HBA.
Also, the bandwidth-exhaustion theory does not explain the same speed with 1 or 3 drives running concurrently:
It doesn't seem to matter whether one drive is being zeroed or three.
 
Looking at the gstat output, could it be that the device is partitioned in a way that makes block 0 of the data area misaligned? These devices show 7 ms per write, and that latency is below the software stack. Maybe the drive needs to read two tracks, patch in some data in the middle, and rewrite them. The software stack seems not to be at fault here.

There are no partitions anywhere in my data pool. ZFS was originally given raw disks (daX). When moving to GELI encrypted disks, they were created by doing `geli init -e AES-XTS -l 256 -P -s 4096 -K {keyfile} {device}`, and then the "raw" GELI device (da35.eli, for example) is added to the zpool without partitioning it either. If there's a misalignment, it's due to GELI itself. However, I'm only seeing modest performance degradation when writing to the GELI devices vs the unencrypted bare drives, so it seems likely that GELI isn't involved. Unless the fact that the zpool now has some GELI encrypted drives in it matters--but then it seems like (absent CPU/RAM exhaustion) that should only affect the zpool, not the individual drives. Instead it seems to be the opposite--the zpool is fine, even when resilvering, but performance on non-pool drives tanks.
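
Spelled out, the per-drive sequence is essentially the following (the keyfile path and pool name are placeholders; the attach flags are just the counterparts of the init flags above, and the .eli providers went in as spares as described earlier):
Code:
# create the GELI layer: no passphrase, key from keyfile, 4K sectors
geli init -e AES-XTS -l 256 -P -s 4096 -K /path/to/keyfile da35
geli attach -p -k /path/to/keyfile da35
# hand the bare .eli provider to ZFS as a hot spare ("tank" is a placeholder pool name)
zpool add tank spare da35.eli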

Since I posted this, the original resilver completed (20.5TB), and it took about 30 hours. I wouldn't call that speedy, but RAIDZ resilvers have never been particularly fast in my experience.

Call me ignorant, but to me the case is pretty clear:
The drives we're talking about here are all daX, right? Which to me means (external) USB drives.
All USB devices are attached to a single bus. (You may have a USB2 bus, which is separate from the USB3 one... but you get the point.)
To me it seems the USB bus is simply at its capacity limit.
The resilvering uses up so much traffic that any other drive on the same bus simply has to wait for its data packets to be delivered.
What also speaks for my thesis is that the adaX drives, which I believe are SATA drives, are not affected, because they are completely independent, on a totally different bus.

Part of me wants to ask who in their right mind sets up a zpool of 36 externally attached USB drives, and how many USB hubs and cables you'd actually need to do that. But the other part of me has read r/datahoarder, so...

Anyway, the da drives (well, da0-da35, the drives in the pool) are not USB drives. They are 3.5" 7200RPM enterprise SATA drives (Seagate Exos X16 16TB) in the hot swap bays of a SuperMicro 847 chassis, which attaches them to one of two SAS2 backplanes that are connected via an LSI 9211-8i (mps1) in IT mode (one cable to the front backplane, one to the rear). There's a second LSI adapter at mps0 (whose model number I can't quite remember) that shows up in the dmesg, which is connected via cables to the JBOD shelf underneath, but that shelf is not currently powered on, so there are no devices attached to it. There's also no multipath, in case that matters.

The two ada drives are a pair of SSDs (Samsung SSD 870 EVO 500GB) and really don't do much once the system is booted. Including them in any kind of performance calculation is likely to be a waste, since they aren't on the same bus (they're connected to motherboard SATA ports) and aren't spinning disks. The da36 drive is the exception--it's a USB boot key, but it's not part of the pool and not even mounted at the moment, so unless merely having it plugged in is an issue, I don't think it's relevant?
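
(For anyone who wants to double-check the topology themselves, the controller and enclosure assignments are visible with stock tools--roughly:)
Code:
# which controller/scbus each disk hangs off
camcontrol devlist -v
# which enclosure and slot each disk sits in
sesutil map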

daN can be SCSI, SAS, or even ATA connected to some RAID/JBOD-capable HBA.
Also, the bandwidth-exhaustion theory does not explain the same speed with 1 or 3 drives running concurrently.

They are SATA drives plugged into SAS backplanes (see above). And yeah, I don't understand why I get the exact same performance writing to 1, 2, or 3 drives that are on the same SAS backplanes--fast when the resilver is not happening, slow when it is. But the same across all three drives.
 
Part of me wants to ask who in their right mind sets up a zpool of 36 externally attached USB drives, and how many USB hubs and cables you'd actually need to do that.
Me, too. But there are people doing such things. Well, not right minded ones :cool:
Since covacat reminded me that da drives are not limited to USB only, my post was clearly useless to this topic.
Sorry for that.
And I'd like to thank you guys for being the bigger men and not grinding me for a stupid post.
I sometimes just like to add 'exotic' ideas to a discussion about an issue that isn't solved yet.
Sometimes when problem solving gets stuck, it's because one does not see the forest for the trees, as we say in German. Even if an idea is silly, it can spark something that leads to the right ones.
I don't think it's relevant?
Of course not. I did read that in your post. I never thought it could be relevant.
I just had the image of someone having attached a jungle of external USB drives with dozens of hubs, wondering why the whole thing is so slow 😅

So, sorry, and again thank you.
 
The best storage experts are not very active on the forums. I'd look for a suitable mailing list or alternatively create a problem report in FreeBSD's bugzilla.
 
Me, too. But there are people doing such things. Well, not right minded ones :cool:

Yes, there are. I have seen things.

And I'd like to thank you guys for being the bigger men and not grinding me for a stupid post.
I sometimes just like to add 'exotic' ideas to a discussion about an issue that isn't solved yet.
Sometimes when problem solving gets stuck, it's because one does not see the forest for the trees, as we say in German. Even if an idea is silly, it can spark something that leads to the right ones.

It never hurts, as long as it's presented respectfully. If someone flames you for that, that's on them, not you.

The best storage experts are not very active on the forums. I'd look for a suitable mailing list or alternatively create a problem report in FreeBSD's bugzilla.

I unfortunately do not have time for mailing lists, but I would be happy to file a report in the bugzilla. That's why I asked cracauer@, whose tag says he is a developer, if he was aware of any existing bugs that might be related (specifically resolved/closed) because I don't see any open bugs related to what I'm seeing.
 
Is the problem occurring only if you dd to a GELI drive, or also if you dd to a raw drive? (while resilvering)

Leaving out the makeup of the zpool, the problem occurs whether I dd to a raw drive (da0, for example) or to the same drive, but geli encrypted (da0.eli). When there is no resilver happening, dd'ing /dev/zero to da0 gives about 270MB/s and to da0.eli gives about 220MB/s. I get the same performance whether I dd to one raw drive or three. At a guess, I'm hitting the limitations of how fast a single thread can encrypt/decrypt, which is... fine. I'd prefer to see less performance loss, but the system will be fine if 220MB/s per drive is all it can manage.

However, with resilver running, dd'ing /dev/zero to da0 gives about 70-80MB/s, and to da0.eli gives about 60-70MB/s (while the overall numbers are abysmal in comparison, the performance of the encrypted vs unencrypted drives is much closer).

That does bring us to the zpool itself, which currently has GELI drives in it (so they are being hit during the entirety of the resilver). If it were absolutely necessary for data collection, I could remove the encrypted drives from the ZFS pool, but that would be many hours of work (each resilver takes 30 hours, so I'd need to do that twice, after the current resilver finishes in the next 5 hours). But it could be done if we want to eliminate GELI entirely from the picture before moving forward with any possible bug report. In any event, I'm not going to swap any more GELI-encrypted drives into the zpool until that can be decided.
 
We have around 120 MB/s writing to da22/23/24. That is about half of what sequential writing does, yes?

Maybe we really are seeing bus congestion in the HBA? Half for reading, half for writing. Using /dev/zero as the source does not consume bandwidth in the HBA. Just my late-evening thinking.
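
A quick back-of-the-envelope on that, assuming SAS2 at 6 Gb/s per lane with 8b/10b encoding and one x4 cable per backplane as described above:
Code:
# lanes * Gb/s per lane * 8/10 encoding, converted to GB/s
echo "4 * 6 * 8 / 10 / 8" | bc -l    # ~2.4 GB/s of raw line rate per x4 cable, shared by every drive behind that expander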
 
So, new data, since Crivens above mentioned the HBA. I fired up my disk shelf, which is connected to a different HBA (mps0), and started doing the dd from /dev/zero on those drives; they're completely unaffected by the ongoing resilver. I got up to 5 running simultaneously at ~200MB/s each (they're 5400 RPM drives, so they're not going to touch the 270MB/s that the Exos drives can do).

By the numbers, that means mps0, across external cables, can do 1000MB/s writing from /dev/zero, while mps1, which is internal, is topping out around 600MB/s while the resilver is going on. Is it possible that the HBA is being exhausted not by raw throughput but by IOPS or queue length or something similar? I need to test again when the resilver is done, so I can export the pool and check performance on the same drives but on different HBAs.
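
One way to poke at that, I think (sysctl names as documented in mps(4), and camcontrol for the per-device queue--treat this as a sketch):
Code:
# outstanding / high-water command counts on each controller
sysctl dev.mps.0.io_cmds_active dev.mps.0.io_cmds_highwater
sysctl dev.mps.1.io_cmds_active dev.mps.1.io_cmds_highwater
# negotiated tagged-queue depth on an individual drive
camcontrol tags da0 -v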

This is what I see from dmesg for both HBAs. mps0 is the HBA for the disk shelf and mps1 is the HBA for the server itself:
Code:
mps0: <Avago Technologies (LSI) SAS2308> port 0x7000-0x70ff mem 0xdfa40000-0xdfa4ffff,0xdfa00000-0xdfa3ffff irq 32 at device 0.0 numa-domain 0 on pci3
mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities: 5a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc>
mps1: <Avago Technologies (LSI) SAS2008> port 0xf000-0xf0ff mem 0xfbe00000-0xfbe03fff,0xfbd80000-0xfbdbffff irq 64 at device 0.0 numa-domain 1 on pci12
mps1: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd

I was given the impression by the vast majority of the internet that SAS2008 vs SAS2308 doesn't matter for spinning disks and only makes a big difference for SSDs. Do I just have enough HDDs that that advice doesn't hold? Because, you know, if that's the case I can just go buy a 9207-8i to match my 9207-8e and stop dealing with this issue.
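
(If anyone wants to rule out the PCIe side, the trained link width/speed of each mps adapter shows up in pciconf's capability listing--roughly:)
Code:
# look for the "link xN(xN) speed ..." line under the mps0/mps1 entries
pciconf -lvbc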
 
I can eliminate it as a possibility. My disk shelf has 45 bays, so I can just move all 36 drives down to it and see if I hit the same problem. It's a bit loud, though, so it will have to wait until I'm done working for the day and don't have to sit right next to it. It's actually less work than removing the GELI drives from the pool.
 
So here are some gstat results under various conditions/loads. Maybe this data will help pinpoint where the bottleneck is. The enclosure mapping was added via sesutil, and I marked which drives are in the zpool and which ones aren't:

ses0: On-motherboard AHCI
ses1: Front backplane, chassis (11 random drives, 4/8/12TB)
ses2: Rear backplane, chassis (empty for these tests)
ses3: Front backplane, disk shelf (24 zpool drives)
ses4: Rear backplane, disk shelf (12 zpool drives, 9 random drives 8/12TB)

TL;DR: SAS2008 vs SAS2308 has very little effect, but the problem does follow the HBA. Unrelated drives on the HBA with the zpool see their performance tank, while drives on the HBA without the zpool are largely unaffected. This result is the same whether the zpool is in the chassis (SAS2008) or in the shelf (SAS2308).
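
(For reproducibility: the filtered snapshots below were captured with gstat's regex filter, along the lines of the command below; the Enc and Zpool columns were added by hand afterwards.)
Code:
gstat -f '^da[0-9]+$'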

Analysis:
The random drives in ses4 (disk shelf, rear) dropped from 174-203MB/s down to 48-52MB/s once the resilver really got going. This is comparable to, if not a little worse than, the previous setup, but we do have 45 drives on the HBA with the zpool now instead of the previous 36, so that might account for it. Also, all of the random drives here (or most, at least) are slower 5400RPM drives which didn't come close to hitting the 270MB/s that the Exos drives in the pool do; rather, they topped out at 180-200MB/s, which could probably account for all of the additional performance loss.

The 11 random drives in ses1 (chassis, front) were affected slightly, going from 140-176MB/s down to 128-160MB/s. This does point to a slight bottleneck in the chassis (about 10% performance loss) somewhere in this configuration compared to the previous one (where the zpool was in the chassis and the extra drives were all in the disk shelf). It's possible this might be because of the SAS2008 chip in the chassis HBA vs the SAS2308 chip for the shelf HBA.

Raw Data:

DD /dev/zero to 11 random spare drives, in the main chassis, on the LSI 9211-8i. No other activity. 1,761 w/s and 1761 MB/s overall.
Code:
dT: 1.028s  w: 1.000s  filter: ^da[0-9]+$
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name  Enc   Zpool
    1    141      0      0  0.000    141 144430    6.9   97.3| da2   ses1  N
    1    140      0      0  0.000    140 143434    7.0   97.5| da3   ses1  N
    1    158      0      0  0.000    158 161363    6.1   96.9| da4   ses1  N
    1    176      0      0  0.000    176 180288    5.5   96.5| da5   ses1  N
    1    162      0      0  0.000    162 166343    6.0   97.2| da7   ses1  N
    1    165      0      0  0.000    165 169332    5.9   96.9| da8   ses1  N
    1    160      0      0  0.000    160 164351    6.1   97.1| da9   ses1  N
    1    175      0      0  0.000    175 179292    5.5   97.0| da10  ses1  N
    1    172      0      0  0.000    172 176304    5.6   96.7| da11  ses1  N
    1    156      0      0  0.000    156 159371    6.2   96.9| da12  ses1  N
    1    156      0      0  0.000    156 159371    6.2   97.2| da13  ses1  N
Add dd /dev/zero to 9 random spare drives in the disk shelf, on the LSI 9207-8e, while the first 11 continue zeroing. No other activity. 3,449 w/s and 3,455 MB/s overall.
Code:
dT: 1.004s  w: 1.000s  filter: ^da[0-9]+$
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name  Enc   Zpool
    1    142      0      0  0.000    142 145853    6.8   97.0| da2   ses1  N
    1    141      0      0  0.000    141 144833    6.8   96.7| da3   ses1  N
    1    159      0      0  0.000    159 163193    6.1   96.8| da4   ses1  N
    1    172      0      0  0.000    172 176452    5.6   96.7| da5   ses1  N
    1    161      0      0  0.000    161 165233    6.0   97.6| da7   ses1  N
    1    158      0      0  0.000    158 162173    6.1   97.0| da8   ses1  N
    1    164      0      0  0.000    164 168292    5.9   96.3| da9   ses1  N
    1    169      0      0  0.000    169 173392    5.7   96.3| da10  ses1  N
    1    172      0      0  0.000    172 176452    5.6   95.7| da11  ses1  N
    1    153      0      0  0.000    153 157073    6.3   96.5| da12  ses1  N
    1    156      0      0  0.000    156 160133    6.2   96.7| da13  ses1  N
    1    201      0      0  0.000    201 206031    4.8   96.1| da0   ses4  N
    1    174      0      0  0.000    174 178492    5.5   96.6| da1   ses4  N
    1    203      0      0  0.000    203 208071    4.7   95.6| da6   ses4  N
    1    185      0      0  0.000    185 189711    5.2   96.9| da14  ses4  N
    0    193      0      0  0.000    193 197871    5.0   96.5| da15  ses4  N
    1    188      0      0  0.000    188 192771    5.1   95.8| da16  ses4  N
    0    185      0      0  0.000    185 189711    5.2   96.6| da17  ses4  N
    0    193      0      0  0.000    193 197871    5.0   96.4| da18  ses4  N
    1    180      0      0  0.000    180 184612    5.3   96.2| da19  ses4  N
Start the resilver on the zpool (all on the LSI 9207-8e, now) while all drives continue zeroing.
Code:
dT: 1.008s  w: 1.000s  filter: ^da[0-9]+$
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name  Enc   Zpool
    1    128      0      0  0.000    128 131088    6.4   82.0| da2   ses1  N
    1    129      0      0  0.000    129 132105    6.3   81.7| da3   ses1  N
    1    161      0      0  0.000    161 164623    5.6   90.6| da4   ses1  N
    1    150      0      0  0.000    150 153444    5.3   79.4| da5   ses1  N
    1    148      0      0  0.000    148 151412    5.4   80.1| da7   ses1  N
    1    145      0      0  0.000    145 148364    5.7   83.0| da8   ses1  N
    1    152      0      0  0.000    152 155477    5.5   83.6| da9   ses1  N
    1    154      0      0  0.000    154 157509    5.4   82.5| da10  ses1  N
    1    161      0      0  0.000    161 164623    5.0   80.9| da11  ses1  N
    1    144      0      0  0.000    144 147347    5.7   81.7| da12  ses1  N
    1    161      0      0  0.000    161 164623    5.7   91.0| da13  ses1  N
    1     51      0      0  0.000     51  51826   18.1   91.7| da0   ses4  N
    1     52      0      0  0.000     52  52842   18.3   94.2| da1   ses4  N
    1     53      0      0  0.000     53  53858   17.9   94.0| da6   ses4  N
    0     53      0      0  0.000     53  53858   18.4   96.6| da14  ses4  N
    1     51      0      0  0.000     51  51826   18.3   92.8| da15  ses4  N
    1     50      0      0  0.000     50  50809   18.9   93.9| da16  ses4  N
    1     49      0      0  0.000     49  49793   19.0   92.4| da17  ses4  N
    1     50      0      0  0.000     50  50809   19.0   94.1| da18  ses4  N
    0     51      0      0  0.000     51  51826   18.4   93.2| da19  ses4  N
    1     70      0      0  0.000     70  70399   25.3  100.3| da23  ses4  raidz3-0 (REPLACING, ONLINE)
    3     68     68  68009   42.1      0      0  0.000  100.2| da26  ses3  raidz3-0 (ONLINE)
    3     68     68  68402   44.0      0      0  0.000  100.5| da30  ses3  raidz3-0 (ONLINE)
    3     68     68  67418   44.0      0      0  0.000  100.3| da31  ses3  raidz3-0 (ONLINE)
    3     68     68  68386   43.8      0      0  0.000   99.7| da34  ses3  raidz3-0 (ONLINE)
    3     67     67  66402   44.8      0      0  0.000  101.1| da35  ses3  raidz3-0 (ONLINE)
    3     66     66  65338   45.1      0      0  0.000   99.9| da46  ses3  raidz3-0 (ONLINE)
    3     66     66  65401   44.6      0      0  0.000   99.0| da51  ses3  raidz3-0 (ONLINE)
    3     69     69  69418   43.4      0      0  0.000  100.3| da52  ses3  raidz3-0 (ONLINE)
    3     67     67  66386   45.0      0      0  0.000  101.2| da53  ses3  raidz3-0 (ONLINE)
    3     70     70  69323   42.6      0      0  0.000  100.5| da54  ses3  raidz3-0 (ONLINE)
    3     67     67  58296   43.7      0      0  0.000   99.4| da27  ses3  raidz3-1 (ONLINE)
    3     69     69  58419   43.1      0      0  0.000  100.7| da28  ses3  raidz3-1 (ONLINE)
    3     68     68  58419   41.8      0      0  0.000  100.6| da29  ses3  raidz3-1 (ONLINE)
    2     59      0      0  0.000     59  57796   28.7   98.6| da32  ses3  raidz3-1 (REPLACING, ONLINE)
    3     68     68  58796   43.4      0      0  0.000  100.0| da33  ses3  raidz3-1 (ONLINE)
    3     68     68  58423   44.2      0      0  0.000  101.6| da36  ses3  raidz3-1 (ONLINE)
    3     68     68  58415   43.3      0      0  0.000   99.5| da37  ses3  raidz3-1 (ONLINE)
    3     69     69  58419   43.0      0      0  0.000  100.7| da42  ses3  raidz3-1 (ONLINE)
    3     69     69  58423   42.9      0      0  0.000  100.3| da47  ses3  raidz3-1 (ONLINE)
    3     68     68  58415   43.7      0      0  0.000  100.3| da49  ses3  raidz3-1 (ONLINE)
    3     68     68  57915   43.3      0      0  0.000   99.3| da50  ses3  raidz3-1 (ONLINE)
    1     57      0      0  0.000     57  48729   31.4   99.9| da20  ses3  raidz3-2 (REPLACING, ONLINE)
    2     71     71  53060   40.0      0      0  0.000   97.6| da21  ses4  raidz3-2 (ONLINE)
    2     72     72  53064   39.6      0      0  0.000   98.9| da22  ses4  raidz3-2 (ONLINE)
    2     71     71  51067   39.7      0      0  0.000   97.2| da24  ses4  raidz3-2 (ONLINE)
    3     70     70  52048   40.4      0      0  0.000   97.1| da25  ses4  raidz3-2 (ONLINE)
    3     70     70  54973   40.3      0      0  0.000   97.1| da38  ses4  raidz3-2 (ONLINE)
    2     71     71  53064   39.9      0      0  0.000   98.3| da39  ses4  raidz3-2 (ONLINE)
    2     71     71  55096   40.2      0      0  0.000   97.7| da40  ses4  raidz3-2 (ONLINE)
    2     71     71  52052   40.1      0      0  0.000   97.4| da41  ses4  raidz3-2 (ONLINE)
    2     71     71  53060   40.3      0      0  0.000   98.2| da43  ses4  raidz3-2 (ONLINE)
    3     70     70  54084   38.8      0      0  0.000   96.8| da45  ses4  raidz3-2 (ONLINE)
    3     74     74  73519   38.8      0      0  0.000  101.0| da44  ses4  spare (INUSE, ONLINE)
    3     68     68  60070   42.0      0      0  0.000   99.3| da48  ses3  spare (INUSE, ONLINE)
    3     67     67  52929   42.1      0      0  0.000  100.0| da55  ses3  spare (INUSE, ONLINE)
Output of `top -P` while the resilver and drive zeroing were happening:
Code:
last pid: 11051;  load averages: 14.19, 13.24, 12.51                                         up 4+03:48:56  14:21:02
131 processes: 1 running, 130 sleeping
CPU 0:   5.1% user,  0.0% nice, 29.4% system,  0.4% interrupt, 65.1% idle
CPU 1:   6.7% user,  0.0% nice, 31.0% system,  0.0% interrupt, 62.4% idle
CPU 2:   9.0% user,  0.0% nice, 25.1% system,  0.0% interrupt, 65.9% idle
CPU 3:   3.5% user,  0.0% nice, 29.8% system,  0.0% interrupt, 66.7% idle
CPU 4:   6.3% user,  0.0% nice, 31.0% system,  0.4% interrupt, 62.4% idle
CPU 5:   9.0% user,  0.0% nice, 24.7% system,  0.4% interrupt, 65.9% idle
CPU 6:   8.2% user,  0.0% nice, 27.8% system,  0.4% interrupt, 63.5% idle
CPU 7:   7.1% user,  0.0% nice, 30.6% system,  0.0% interrupt, 62.4% idle
CPU 8:   7.8% user,  0.0% nice, 29.8% system,  0.0% interrupt, 62.4% idle
CPU 9:   7.1% user,  0.0% nice, 27.8% system,  0.4% interrupt, 64.7% idle
CPU 10:  7.5% user,  0.0% nice, 26.3% system,  8.2% interrupt, 58.0% idle
CPU 11:  9.4% user,  0.0% nice, 30.2% system,  2.4% interrupt, 58.0% idle
CPU 12:  9.0% user,  0.0% nice, 29.8% system,  0.0% interrupt, 61.2% idle
CPU 13:  6.7% user,  0.0% nice, 28.2% system,  0.0% interrupt, 65.1% idle
CPU 14:  8.6% user,  0.0% nice, 26.3% system,  0.0% interrupt, 65.1% idle
CPU 15:  6.7% user,  0.0% nice, 36.5% system,  0.4% interrupt, 56.5% idle
CPU 16:  5.9% user,  0.0% nice, 26.7% system,  0.4% interrupt, 67.1% idle
CPU 17:  5.5% user,  0.0% nice, 29.8% system,  0.0% interrupt, 64.7% idle
CPU 18:  4.7% user,  0.0% nice, 31.8% system,  0.0% interrupt, 63.5% idle
CPU 19:  3.9% user,  0.0% nice, 31.4% system,  0.0% interrupt, 64.7% idle
CPU 20:  5.9% user,  0.0% nice, 30.2% system,  0.0% interrupt, 63.9% idle
CPU 21:  4.7% user,  0.0% nice, 28.2% system,  0.0% interrupt, 67.1% idle
CPU 22:  5.9% user,  0.0% nice, 31.4% system,  0.0% interrupt, 62.7% idle
CPU 23:  5.5% user,  0.0% nice, 25.1% system,  0.0% interrupt, 69.4% idle
CPU 24:  5.5% user,  0.0% nice, 29.0% system,  0.0% interrupt, 65.5% idle
CPU 25:  2.7% user,  0.0% nice, 28.6% system,  0.0% interrupt, 68.6% idle
CPU 26:  4.3% user,  0.0% nice, 30.2% system,  0.4% interrupt, 65.1% idle
CPU 27:  3.9% user,  0.0% nice, 29.4% system,  0.0% interrupt, 66.7% idle
CPU 28:  5.5% user,  0.0% nice, 23.5% system,  0.0% interrupt, 71.0% idle
CPU 29:  7.5% user,  0.0% nice, 26.7% system,  0.4% interrupt, 65.5% idle
CPU 30:  5.5% user,  0.0% nice, 27.1% system,  0.4% interrupt, 67.1% idle
CPU 31:  3.5% user,  0.0% nice, 30.2% system,  0.0% interrupt, 66.3% idle
Mem: 109M Active, 232M Inact, 196K Laundry, 68G Wired, 181G Free
ARC: 14G Total, 6362M MFU, 7210M MRU, 1580K Anon, 439M Header, 306M Other
     13G Compressed, 131G Uncompressed, 10.25:1 Ratio
Swap: 64G Total, 21M Used, 64G Free
 
I asked a person about the incident I remembered. That was a case of HBA overload, specifically the single-cable connection to the backplane.
 