I have a FreeBSD 13.1-STABLE storage server in an HPC cluster with 3 storage pools, each with over 100 disks. The server has 6 cores and 128 GB of RAM, with the ARC size capped at 110 GB.

Here is what I am trying to accomplish. An MPI climate model pumps out ~30 GB of data every 10 minutes and takes ~4 minutes to finish writing. The clients mount the storage over gigabit ethernet, MPI traffic uses InfiniBand, and the storage server itself is on 10G ethernet; the model has 6 parallel writers and 6 parallel readers. The writes start out at ~350 MB/s (watching traffic arrive at the storage server with systat -ifstat) and after ~10 GB or so they drop to ~110 MB/s. What I want is to keep the whole 30 GB as dirty data and let the storage server flush it to disk at its own pace, so the computation can proceed to the next step without waiting. So far, I have tried the following -
vfs.zfs.dirty_data_max_percent="75"
vfs.zfs.dirty_data_max_max_percent="85"
vfs.zfs.dirty_data_max_max=42949672960
vfs.nfsd.async=1
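Since dirty_data_max itself is derived from the percent tunables and capped by dirty_data_max_max, I also double-check the effective values at runtime with something like the sketch below. The sysctl names are what I understand the stock OpenZFS tunables on FreeBSD 13 to be called; delay_min_dirty_percent and dirty_data_sync_percent are, as I understand it, the write-throttle thresholds expressed as a percentage of dirty_data_max.

# effective dirty-data limit and its hard cap
sysctl vfs.zfs.dirty_data_max vfs.zfs.dirty_data_max_max
# the percent tunables the limit is derived from
sysctl vfs.zfs.dirty_data_max_percent vfs.zfs.dirty_data_max_max_percent
# thresholds at which ZFS starts delaying writes / forcing a txg sync
sysctl vfs.zfs.delay_min_dirty_percent vfs.zfs.dirty_data_sync_percent
# txg sync interval in seconds
sysctl vfs.zfs.txg.timeout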
But I am still hitting a ceiling somewhere around 10 GB. Are there any more knobs I can turn to let more dirty data sit in RAM before it gets flushed to disk? The NFS clients run Linux and mount with NFS 4.2 (128K rsize and wsize; a rough sketch of the mount options is below, after the iostat snippet). Watching with zpool iostat, the flushes to disk look decent -
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
pool3        478T   482T    770  12.3K  3.70M   550M
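For completeness, this is roughly how the clients mount the export. The hostname and paths here are placeholders, and the options simply restate what we already run (NFS 4.2, 128K rsize/wsize), not a recommendation:

# on a Linux compute node (placeholder hostname and paths)
mount -t nfs -o vers=4.2,rsize=131072,wsize=131072 storage10g:/pool3/output /mnt/model_output
# confirm the negotiated version and rsize/wsize on the client
nfsstat -m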
Testing the pool locally (compression is set to lz4) -
dd if=/dev/urandom of=test bs=1M count=20000
20000+0 records in
20000+0 records out
20971520000 bytes transferred in 60.705531 secs (345463086 bytes/sec)
I still see flushes happening before it gets to 20 GB -
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
pool3        478T   482T      0  8.85K      0  1.64G
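While the dd runs, I watch the flush rate and how much dirty data actually builds up in RAM. My understanding is that dirty (not yet synced) buffers show up as anonymous ARC, so I use something along these lines; the kstat name is assumed from the stock arcstats sysctl tree and the one-second interval is arbitrary:

# one-second samples of pool write bandwidth while dd is running
zpool iostat pool3 1
# anonymous ARC buffers roughly track outstanding dirty data (assumed kstat name)
sysctl kstat.zfs.misc.arcstats.anon_size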
Thank you for any suggestions on what to try next.