I'm having trouble with a storage pool that blocks (almost) all reads while it's flushing dirty writes to HDD. This happens when the server gets busy and starts doing a lot of updates.
Boot: 2 x 500GB SATA3 SSDs
Storage pool: 4 x 12TB SATA3 HDDs (4-way mirror)
SLOG: 16GB NVMe SSD
L2ARC: 512GB NVMe SSD
FreeBSD: 12.2-RELEASE-p6
CPU: Xeon E5-2628L v2 (8 cores, 16 threads)
RAM: 128GB DDR3 ECC
Swap: 32GB (swapinfo reports 0 used)
Used for: MySQL, so sync writes go to the SSD SLOG, followed by more relaxed dirty-cache flushes to the HDDs
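For anyone who wants to sanity-check that split, these are the dataset properties that govern it; a quick look (db/mysql is a placeholder for the actual MySQL dataset name):
Code:
# db/mysql is a placeholder; substitute the real dataset name.
# sync and logbias control whether writes go through the SLOG;
# recordsize on MySQL datasets is commonly matched to the 16K
# InnoDB page size.
zfs get sync,logbias,recordsize,primarycache db/mysql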
Here's an extreme example, showing only 4 read operations getting through over a period of 13 seconds, when there should have been more like 400 operations during that time...
Output of zpool iostat db 1:
Code:
             capacity  operations   bandwidth
pool       alloc  free  read write  read write
---------- ----- ----- ----- ----- ----- -----
db         5.99T 4.89T    32   274 14.9M 24.6M
db         5.99T 4.89T    34   275 16.5M 23.7M
db         5.99T 4.89T    25   106 12.7M 30.3M
db         5.99T 4.89T     0    71 35.1K 36.6M
db         5.99T 4.89T     0   294  152K 51.8M
db         5.99T 4.89T     0   130     0 47.7M
db         5.99T 4.89T     4   133 3.16M 51.4M
db         5.99T 4.89T     0   134     0 52.0M
db         5.99T 4.89T     0   146     0 53.8M
db         5.99T 4.89T     0   127     0 60.4M
db         5.99T 4.89T     0   130     0 60.8M
db         5.99T 4.89T     0   107     0 55.0M
db         5.99T 4.89T     0    81     0 59.8M
db         5.99T 4.89T     0    83     0 61.2M
db         5.99T 4.89T     0    80     0 58.7M
db         5.99T 4.89T     0   291 40.0K 66.2M
db         5.99T 4.89T    25 1.19K 12.0M 41.3M
db         5.99T 4.89T    40   177 18.8M 16.5M
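The drive-load observations further down come from watching the physical disks while this is happening; the invocation I use:
Code:
# -p: physical providers only, -I 1s: refresh every second;
# the %busy and ms/w columns are what I watch during a flush.
gstat -p -I 1s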
Things I've considered:
* On this system I have set sysctl vfs.zfs.txg.timeout=30 (versus the default of 5). Changing it back to 5 improves the problem but does not eliminate it (the related tunables are sketched after this list).
* CPU load, but it happens even when load is low, like 2 or 3 (this CPU has 8 physical cores).
* One file system on this pool uses gzip-7 for compression, which is much more CPU intensive than lz4, but see the previous entry re CPU load.
* Drive load, but gstat doesn't show the HDDs (or the SLOG/L2ARC SSDs) going anywhere near 100% utilization, even when flushing writes.
* A faulty HDD, but iostat -x -w 1 | grep -v pass shows read and write times, plus operation counts, to be similar across all 4 drives, so it's not as if one drive is lagging.
* Some kind of quirk or bug when using 4 drives in a ZFS mirror, which is about twice as many as a typical mirror.
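In case it's useful, here's how I inspect the write-throttle tunables mentioned above. This is a sketch against the ZFS in FreeBSD 12.x; verify names and descriptions with sysctl -d before changing anything:
Code:
# Seconds between forced txg syncs (I raised this from 5 to 30).
sysctl vfs.zfs.txg.timeout

# Cap on dirty data buffered in RAM before writers get throttled.
sysctl vfs.zfs.dirty_data_max

# Dirty-data percentages between which async write I/O ramps up;
# lowering the max should start flushes earlier and more gently.
sysctl vfs.zfs.vdev.async_write_active_min_dirty_percent
sysctl vfs.zfs.vdev.async_write_active_max_dirty_percent

# Per-vdev concurrent async write limit; lowering the max leaves
# more queue room for reads while a flush is in progress.
sysctl vfs.zfs.vdev.async_write_max_active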
Any ideas? Thanks.