Bump up dirty data for ZFS on FreeBSD 13.1-STABLE

I have a FreeBSD 13.1-STABLE storage server in an HPC cluster with 3 storage pools, each with over 100 disks. The server has 6 cores and 128 GB of RAM, and the ARC size is set to 110 GB. Here is what I am trying to accomplish. An MPI climate model pumps out ~30 GB of data every 10 minutes and takes ~4 minutes to finish writing. Storage is mounted over gigabit Ethernet, MPI traffic uses InfiniBand, the storage server itself is on 10G Ethernet, and the model runs 6 parallel writers and 6 parallel readers. The writes start out at ~350 MB/s (I use systat -ifstat to watch traffic coming into the storage server) and after ~10 GB or so they drop to ~110 MB/s. What I want is to keep all 30 GB as dirty data and let the storage server flush it to disk at its own pace, so the computation can proceed to the next step without waiting. So far I have tried the following:

vfs.zfs.dirty_data_max_percent="75"
vfs.zfs.dirty_data_max_max_percent="85"
vfs.zfs.dirty_data_max_max=42949672960
vfs.nfsd.async=1
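
To double-check what actually took effect, the operative ceiling can be read back at runtime (vfs.zfs.dirty_data_max is the value the throttle actually works against, clamped by vfs.zfs.dirty_data_max_max):

sysctl vfs.zfs.dirty_data_max vfs.zfs.dirty_data_max_max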

But I am still hitting a ceiling somewhere around 10 GB. I was wondering if there are any more knobs to turn to allow more dirty data to stay in RAM before it gets flushed to disk. The NFS clients run Linux and use NFS 4.2 (128K rsize and wsize). I watched it with zpool iostat, and the flushes to disk look decent:

              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
pool3        478T   482T    770  12.3K  3.70M   550M

Testing the pool locally (it has lz4 compression enabled, but /dev/urandom data is effectively incompressible, so compression shouldn't skew the numbers):

dd if=/dev/urandom of=test bs=1M count=20000
20000+0 records in
20000+0 records out
20971520000 bytes transferred in 60.705531 secs (345463086 bytes/sec)

I still see flushes happening before it gets to 20 GB:

pool3 478T 482T 0 8.85K 0 1.64G
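
For the next test I can also sample the pool's transaction-group history while the write is running; assuming the build exposes the per-pool txg kstat via sysctl (pool3 is my pool), the ndirty column shows how much dirty data each txg carried when it synced:

sysctl -n kstat.zfs.pool3.txgs | tail -5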

Thank you for any suggestions on what I could try next.
 
Look at significantly bumping vfs.zfs.txg.timeout, and set sync=disabled on the dataset. That should get you closer, but with a greater risk of data loss on power failure.
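
Something like this; a runtime sysctl plus a per-dataset property (the dataset name is just an example):

sysctl vfs.zfs.txg.timeout=120
zfs set sync=disabled pool3/model-output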
 
Thanks. I had already disabled sync and pushed vfs.zfs.txg.timeout to 4 minutes. Here are a few other things I tried:

vfs.zfs.delay_scale
vfs.zfs.vdev.async_write_active_max_dirty_percent
vfs.zfs.vdev.async_write_active_min_dirty_percent
vfs.zfs.delay_min_dirty_percent
vfs.zfs.dirty_data_sync_percent

It also turned out that the latest OS release supports bumping vfs.nfsd.srvmaxio to 1M, so the NFS clients can mount with 1M rsize and wsize. So far no luck; some throttle still gets applied after ~10 GB or so.
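
For reference, the relevant pieces were roughly the following; srvmaxio has to be in place before nfsd starts, and the export path here is just a placeholder:

vfs.nfsd.srvmaxio=1048576    (on the FreeBSD server, e.g. in /etc/sysctl.conf)
mount -t nfs -o vers=4.2,rsize=1048576,wsize=1048576 server:/pool3 /mnt    (on a Linux client)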
 
Have a look at the following blog article, especially the chapter "Write delay":
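
The short version of the write-delay mechanism it covers: once dirty data exceeds vfs.zfs.delay_min_dirty_percent of vfs.zfs.dirty_data_max, each transaction is delayed by roughly

delay = vfs.zfs.delay_scale * (dirty - min_dirty) / (dirty_data_max - dirty)

so the delay grows without bound as dirty data approaches dirty_data_max, and once vfs.zfs.dirty_data_sync_percent is crossed a txg sync is forced regardless of vfs.zfs.txg.timeout.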

Thanks. I read that document and the one from OpenZFS yesterday, and bumped up the following:

vfs.zfs.txg.timeout=75
vfs.zfs.delay_min_dirty_percent=90
vfs.zfs.dirty_data_sync_percent=95
vfs.zfs.vdev.def_queue_depth=128
vfs.zfs.vdev.write_gap_limit=0
vfs.zfs.vdev.aggregation_limit=104857600
vfs.zfs.delay_scale=100
vfs.zfs.vdev.max_active=100000
vfs.zfs.vdev.queue_depth_pct=5000
vfs.zfs.zio.dva_throttle_enabled=0
vfs.zfs.arc.lotsfree_percent=0
vfs.zfs.per_txg_dirty_frees_percent=0
vfs.zfs.vdev.async_write_active_min_dirty_percent=30
vfs.zfs.vdev.async_write_active_max_dirty_percent=90

No luck so far.
 
Try making the ARC smaller, say 20 GB.
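
For example, vfs.zfs.arc_max="20G" in /boot/loader.conf, or at runtime:

sysctl vfs.zfs.arc_max=21474836480
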
I changed the ARC setting to 32 GB and changed the following to allow a much bigger dirty buffer:

vfs.zfs.dirty_data_max_percent="95"
vfs.zfs.dirty_data_max_max_percent="98"
vfs.zfs.dirty_data_max_max=96636764160

But no improvement so far. The throttle is still kicking in for some reason.
 
I removed all the tunable settings and started fresh like this:

vfs.zfs.dirty_data_max_percent="95"
vfs.zfs.dirty_data_max_max_percent="98"
vfs.zfs.dirty_data_max_max=96636764160
vfs.zfs.arc_max="32G"

vfs.zfs.txg.timeout=300
vfs.zfs.delay_min_dirty_percent=90
vfs.zfs.dirty_data_sync_percent=95
vfs.zfs.vdev.async_write_active_min_dirty_percent=30
vfs.zfs.vdev.async_write_active_max_dirty_percent=90

Is there anything I'm missing here that could still trigger the write throttle? I retested the model, and at ~10 GB of writes the network throughput gets scaled back to ~120 MB/s and stays there. 90 GB of dirty buffer space should comfortably hold ~30 GB of writes, I assume. Thanks.
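
During the next run I will also watch per-disk utilization to rule out the vdevs themselves being saturated; gstat shows %busy per physical provider:

gstat -p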
 
Have you looked at the data with sysutils/zfs-stats, especially while the throttling is in effect?
The -A flag gives you ARC information, and -E gives ARC efficiency. The data may give a clue as to "what next".
 
I thought ARC was more of a read cache; would that affect writes in any way? I see some flushes to disk, but not the whole 30 GB of data, so the dirty data settings may be at least partially working.

pool3 478T 482T 0 5.86K 0 1.08G
Sat Apr 23 09:33:14 2022
pool3 478T 482T 0 15.5K 0 1.27G
pool3 478T 482T 0 3.58K 0 1.04G
Sat Apr 23 09:35:11 2022
pool3 478T 482T 0 8.27K 0 1.68G
Sat Apr 23 09:35:16 2022
pool3 478T 482T 0 7.36K 801 224M
 
My understanding is that ARC is mostly about reads, but the tool gives more information than just ARC. If the data being written involves file updates, there may be a certain amount of reading involved.
In your first post, the clients doing the writing are going over NFS to the storage device?
I see that in the first post you tested locally using dd, so the implication is that NFS shouldn't be the limiting factor.
You wrote: "The writes start out at ~350 MB/s (I use systat -ifstat to watch traffic coming into the storage server) and after ~10 GB or so they drop to ~110 MB/s."
Is this saying that the writes happen at 350 MB/s until you've seen 10 GB come in on the network, and then they drop to 110 MB/s? I'm trying to correlate this with the local dd result.

I'm also trying to understand whether that 10 GB is a trigger or a limit for something else. What happens if you write 30 GB of data 10 GB at a time: write 10 GB, sleep 1 s, write 10 GB, sleep 1 s, write 10 GB? Does each chunk get written at the max rate?
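
Something like this, run locally on the server (file names arbitrary):

for i in 1 2 3; do dd if=/dev/urandom of=test.$i bs=1M count=10000; sleep 1; done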
 
The dd was a local test on the storage server to establish baseline performance. Unfortunately I don't have much control over the model's writes; it writes out ~30 GB for every simulated hour. Multiple NFS clients write to the NFS storage using HDF5 parallel I/O. I didn't see all of the data flushed to the zfs pool while the writes were going on, so the dirty data settings seem to be working, but some stalling/throttling is taking place past 10 GB. Thanks.
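
I will also run nfsstat on one of the Linux clients during a write burst, to see whether COMMIT traffic from the clients lines up with the slowdown:

nfsstat -c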
 