ZFS pool blocks reads for multiple seconds when flushing writes to 4-way mirror

I'm having trouble with a storage pool blocking (almost) all reads while it flushes dirty writes to the HDDs. This happens when the server gets busy and starts doing a lot of updates.

Boot: 2 x 500GB SATA3 SSDs
Storage pool: 4 x 12TB SATA3 HDDs (4-way mirror)
SLOG: 16GB NVMe SSD
L2ARC: 512GB NVMe SSD
FreeBSD: 12.2-RELEASE-p6
CPU: Xeon E5-2628L v2 (8 cores, 16 threads)
RAM: 128GB DDR3 ECC
Swap: 32GB (swapinfo reports 0 used)

Used for: MySQL, so synchronous writes hit the SSD SLOG first, followed by more relaxed flushes of dirty data to the HDDs
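For reference, this is roughly how I've been watching the SLOG versus the HDD mirror while it flushes (a sketch; the gpt labels match the pool layout shown further down in the thread):

Code:
# per-vdev view: separates SLOG (gpt/slog_db) and L2ARC traffic from the HDD mirror
zpool iostat -v db 1

# per-provider view with %busy, to see whether any single disk saturates
gstat -p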

Here's an extreme example, showing only 4 read operations getting through over a period of 13 seconds, when there should have been more like 400 operations during that time...

Output of zpool iostat db 1

Code:
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
db          5.99T  4.89T     32    274  14.9M  24.6M
db          5.99T  4.89T     34    275  16.5M  23.7M
db          5.99T  4.89T     25    106  12.7M  30.3M
db          5.99T  4.89T      0     71  35.1K  36.6M
db          5.99T  4.89T      0    294   152K  51.8M
db          5.99T  4.89T      0    130      0  47.7M
db          5.99T  4.89T      4    133  3.16M  51.4M
db          5.99T  4.89T      0    134      0  52.0M
db          5.99T  4.89T      0    146      0  53.8M
db          5.99T  4.89T      0    127      0  60.4M
db          5.99T  4.89T      0    130      0  60.8M
db          5.99T  4.89T      0    107      0  55.0M
db          5.99T  4.89T      0     81      0  59.8M
db          5.99T  4.89T      0     83      0  61.2M
db          5.99T  4.89T      0     80      0  58.7M
db          5.99T  4.89T      0    291  40.0K  66.2M
db          5.99T  4.89T     25  1.19K  12.0M  41.3M
db          5.99T  4.89T     40    177  18.8M  16.5M

Things I've considered:

* On this system I have set sysctl vfs.zfs.txg.timeout=30 (versus the default of 5). Changing it back to 5 improves the problem but does not eliminate it (the exact commands are sketched after this list)

* CPU load, but it happens even when load is low, like 2 or 3 (this CPU has 8 physical cores)

* One file system on this pool uses gzip-7 for compression, which is much more CPU intensive than lz4, but see the CPU load point above

* Drive load, but gstat doesn't show the HDDs (or SLOG/L2ARC SSDs) going anywhere near 100% util, even when flushing writes

* A faulty HDD, but iostat -x -w 1 | grep -v pass shows read and write times, plus operation counts, to be similar for all 4 drives, so it's not like one is lagging

* Some kind of quirk or bug when using 4 drives in a ZFS mirror, which is twice as wide as a typical mirror
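For reference, the commands behind the txg.timeout and per-drive checks above (a sketch; ada0 through ada3 are placeholders for the four mirror members):

Code:
# check and revert the transaction group timeout (runtime sysctl)
sysctl vfs.zfs.txg.timeout       # currently 30 on this box
sysctl vfs.zfs.txg.timeout=5     # back to the default

# per-drive latency and operation counts, to rule out a single lagging disk
# (ada0..ada3 are placeholders for the four 12TB drives)
iostat -x -w 1 ada0 ada1 ada2 ada3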

Any ideas? Thanks.
 
Sorry to update so quickly, but it looks like it may be related to CPU load after all. This is when it's not flushing:

Output of top

Code:
  PID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
   11 root        155 ki31     0B   256K CPU3     3 946.3H  98.75% idle{idle: cpu3}
 3284 mysql        96    0    13G    13G CPU9     9  33.2H  98.14% mysqld{mysqld}
   11 root        155 ki31     0B   256K CPU15   15 1091.0  98.12% idle{idle: cpu15}
   11 root        155 ki31     0B   256K CPU7     7 943.8H  94.63% idle{idle: cpu7}
   11 root        155 ki31     0B   256K CPU11   11 941.0H  93.67% idle{idle: cpu11}
   11 root        155 ki31     0B   256K CPU13   13 941.7H  92.47% idle{idle: cpu13}
   11 root        155 ki31     0B   256K CPU4     4 1130.7  91.27% idle{idle: cpu4}
   11 root        155 ki31     0B   256K RUN     10 956.2H  87.92% idle{idle: cpu10}
...<other idle CPUs snipped>...

Then when it's flushing, 12 write threads each take up 100% of a CPU, which on an 8-core/16-thread part probably means it's overloaded.

Code:
  PID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
   60 root        -12    -     0B  5264K CPU14   14 281.5H 100.00% zpool-db{zio_write_issue_11}
   60 root        -12    -     0B  5264K CPU1     1 281.5H 100.00% zpool-db{zio_write_issue_1}
   60 root        -12    -     0B  5264K CPU3     3 281.5H 100.00% zpool-db{zio_write_issue_2}
   60 root        -12    -     0B  5264K CPU13   13 281.5H 100.00% zpool-db{zio_write_issue_0}
   60 root        -12    -     0B  5264K CPU12   12 281.5H 100.00% zpool-db{zio_write_issue_6}
   60 root        -12    -     0B  5264K CPU6     6 281.5H 100.00% zpool-db{zio_write_issue_4}
   60 root        -12    -     0B  5264K CPU5     5 281.5H 100.00% zpool-db{zio_write_issue_5}
   60 root        -12    -     0B  5264K CPU9     9 281.5H 100.00% zpool-db{zio_write_issue_8}
   60 root        -12    -     0B  5264K CPU7     7 281.5H 100.00% zpool-db{zio_write_issue_7}
   60 root        -12    -     0B  5264K CPU0     0 281.5H 100.00% zpool-db{zio_write_issue_3}
   60 root        -12    -     0B  5264K CPU8     8 281.5H 100.00% zpool-db{zio_write_issue_9}
   60 root        -12    -     0B  5264K CPU2     2 281.5H 100.00% zpool-db{zio_write_issue_10}
   11 root        155 ki31     0B   256K CPU11   11 941.0H  99.53% idle{idle: cpu11}
   11 root        155 ki31     0B   256K RUN     10 956.3H  98.97% idle{idle: cpu10}
   11 root        155 ki31     0B   256K CPU4     4 1130.7  97.62% idle{idle: cpu4}
   11 root        155 ki31     0B   256K CPU15   15 1091.0  71.02% idle{idle: cpu15}
   11 root        155 ki31     0B   256K RUN     14 920.7H  29.11% idle{idle: cpu14}

Is there some way to either:

1) Tell ZFS to flush more lazily, rather than all at once and as fast as possible, or
2) Limit the number (or perhaps the CPU priority) of the zio_write_issue threads?

UPDATE: Setting sysctl vfs.zfs.txg.timeout=... to a value of 1 or 2 stops (or minimises) the stalling, but it increases CPU and disk load, and the latter is a concern. Since the pool has an SSD SLOG to commit data synchronously, I'd much prefer the subsequent asynchronous writes to the HDDs to be a lot more relaxed, especially since reads come from the HDDs.
 
Tuning suggestions (a sketch of the commands follows the list):
Set recordsize=16K on InnoDB's data files to avoid expensive partial-record writes.
Leave recordsize=128K on the log files.
Set logbias=throughput on the data to stop the ZIL from writing the data twice.

Set skip-innodb_doublewrite or innodb_doublewrite=0 in my.cnf to prevent InnoDB from writing twice.
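For example, something along these lines (the dataset names are placeholders for however the pool is laid out, and recordsize only affects files written after the change):

Code:
# dataset names below are placeholders
zfs set recordsize=16K  db/mysql/data   # match InnoDB's 16K page size
zfs set recordsize=128K db/mysql/log    # sequential log writes
zfs set logbias=throughput db/mysql/data

# and in my.cnf, under [mysqld]:
#   innodb_doublewrite = 0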

I would leave vfs.zfs.txg.timeout to its default value unless there is a good reason to change.

Can you show the output of "zpool list -v"?
 

zpool list -v db

Code:
NAME                             SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
db                              10.9T  6.01T  4.86T        -         -    38%    55%  1.00x  ONLINE  -
  mirror                        10.9T  6.01T  4.86T        -         -    38%  55.3%
    diskid/DISK-8HHHEEXHp1          -      -      -        -         -      -      -
    diskid/DISK-8HJ3A23Hp1          -      -      -        -         -      -      -
    diskid/DISK-8CJVUS0Ep1          -      -      -        -         -      -      -
    diskid/DISK-51Q0A1NMF95Gp1      -      -      -        -         -      -      -
logs                                -      -      -         -      -      -
  gpt/slog_db                   11.5G   207M  11.3G        -         -     0%  1.76%
cache                               -      -      -         -      -      -
  gpt/l2arc_db                   477G   386G  90.7G        -         -     0%  81.0%

I haven't changed logbias before, and from a quick read it seems that setting it to 'throughput' will effectively disable the SLOG? I did a quick test, setting logbias=throughput on a busy file system, and noted that SLOG writes went from tens of megabytes per second to several hundred kilobytes per second (logbias=latency is still set on the other file systems in the pool). When changing it back, I monitored one of the HDDs in the main pool and noted that its activity decreased once logbias was set back to latency. Would I be correct in assuming that logbias=throughput flushes a synchronous write to the HDDs immediately, rather than writing it to the SLOG and then caching the data for a later flush? That's really what I'm trying to avoid.
 
Tuning ZFS for database operations, fun stuff. If you don't have a copy, pick up "FreeBSD Mastery: Advanced ZFS" by Michael W. Lucas and Allan Jude. Lots of easily understandable stuff in there.
If you haven't yet, install the package zfs-stats. It can pull out some good info.
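For example (assuming the package is still named zfs-stats):

Code:
pkg install zfs-stats
zfs-stats -a    # full report (ARC, L2ARC, tunables, etc.)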

Writes to a mirror only complete when all devices have acknowledged them, so perhaps one disk is a bit slower. You may be able to get some info out of zfs-stats on the I/O to the physical devices.

ZFS writes take precedence over reads, so yes, when writing huge chunks to disk, reads are slow.

I've not done anything like your use case, so the following are my guesses on what you can look at:

There are a bunch of other sysctls that may apply; grab the output of sysctl -a | grep vfs\.zfs
My understanding of vfs.zfs.txg.timeout is "if nothing else triggers a write of a transaction group to disk, write it this often". The value is in seconds. The default is to trigger the writes every 5 seconds; you set it to 30. That means a trigger every 30 seconds, but you wind up with a bigger chunk of data in each write. So: either a little more often with smaller writes, or a lot less often with bigger ones.
One that may be of interest is:
vfs.zfs.dirty_data_max_percent
This represents how big a txg can grow before it gets committed. The value is a percentage of total RAM; the default is 10, so in your case 12.8GB. I'm not sure what you should do with this; if you change the value, you need to reboot for it to actually take effect.
There are a bunch of sysctls that affect how ZFS schedules I/O and async read/write bias. There are lots of relationships between them, so use caution when tweaking (a rough starting point is sketched after the list).
vfs.zfs.vdev.async_write_min_active
vfs.zfs.vdev.async_write_max_active
vfs.zfs.vdev.async_read_min_active
vfs.zfs.vdev.async_read_max_active
vfs.zfs.vdev.async_write_active_min_dirty_percent
vfs.zfs.vdev.async_write_active_max_dirty_percent
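As a rough sketch only (the values are illustrative starting points, not recommendations; as noted above, the dirty-data percentage needs a reboot):

Code:
# /boot/loader.conf -- needs a reboot, as noted above
vfs.zfs.dirty_data_max_percent="5"      # default 10; smaller txgs, flushed more often

# /etc/sysctl.conf -- can also be set at runtime with sysctl(8)
vfs.zfs.vdev.async_write_max_active=5   # default 10; fewer concurrent async writes
                                        # per vdev leaves more room for reads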
 