ZFS Limit Delete Priority

The problem:

When I start deleting a large amount of data (5-10 GB), ZFS prioritizes the deletion and the system becomes VERY unresponsive.

Which ZFS parameter should I change to give file deletion a much lower priority? ... or how do I prioritize reads with ZFS?
 
@ VladiBG

Thank you for trying, but it's not that. It can be ONE large file of 5-10 GB (a movie or a VM disk) and the behaviour is the same: ZFS tries to 'deallocate' all used blocks as fast as possible, reads are dead during that time, and the system literally freezes, which is unacceptable.

I think I have seen a discussion somewhere about sysctl(8) values that can be set to slow this down, but I cannot find it anymore.
 
I *think* what you are looking for are the various sysctls for dirty data, namely vfs.zfs.dirty* and vfs.zfs.vdev.async_write_active_[min|max]_dirty_percent.
I suspect that deleting a large amount of data fills up the maximum amount of dirty data ("in flight" writes not yet committed to disk) that is allowed, so ZFS is constantly trying to get the TXGs committed to disk.
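To see the current settings before changing anything, these can all be read with sysctl(8); the dirty-data maximum is derived from the machine's RAM size, so the values you get will differ per system:

Code:
% sysctl vfs.zfs.dirty_data_max
% sysctl vfs.zfs.dirty_data_max_percent
% sysctl vfs.zfs.vdev.async_write_active_min_dirty_percent
% sysctl vfs.zfs.vdev.async_write_active_max_dirty_percent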

The default values usually work at least "good enough" except for some really extreme edge cases, and they should never cause the problems you are seeing. So I highly suspect there is another root cause - e.g. a dying disk that degrades pool performance, or heavy memory pressure on the system.
I write and delete large disk images of well over 100GB (full disk backups of some clients) on our storage server and have never seen anything like the behaviour you describe. Even my desktop machine, with much less RAM and fewer disks, never became unresponsive when I hammered its single ZFS pool with similar tasks.

If you still want/need to adjust some ZFS knobs, there is no "master recipe" for tuning these (or other ZFS-related) sysctls - you have to carefully monitor the system behaviour under load to understand where the bottleneck is. DTrace is your very best friend for this. I can *highly* recommend reading the sections on "Performance" and "Tuning" in "FreeBSD Mastery: Advanced ZFS" by Michael W. Lucas and Allan Jude. They provide a structured method for identifying performance bottlenecks as well as some example DTrace scripts that can be adapted to your needs. The dtrace-toolkit (available from ports and packages) also contains a lot of ZFS-, disk- and I/O-related scripts that can help narrow down the exact bottleneck.
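If you want a quick first look without writing a script, a classic DTrace one-liner counts I/O requests by the name of the process (or kernel thread) issuing them - during a delete-induced stall it shows whether the kernel's ZFS threads are the ones flooding the disk:

Code:
# dtrace -n 'io:::start { @[execname] = count(); }'

Press Ctrl-C to stop it and print the aggregated counts.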
That being said, I still suspect there is a much easier solution to your problem - so you should start from a high level and narrow down the true root cause.

What layout does the pool on which you are seeing this behaviour have? Does the pool's ashift fit the drives' blocksize?
Any errors reported by zpool status? Memory throttle counts reported by zfs-stats -A?
During deletion of large files, try monitoring the pool with zpool iostat -v 1 - do the "operations" and "bandwidth" numbers look plausible, and are they relatively evenly distributed across all vdevs and providers? As I said, a single dying or misbehaving drive can send the performance of the whole pool into the abyss. SSDs that have reached their maximum wear level (or minimum wearout indicator, for Intel) are notorious for this because they tend to throttle back to sub-1MiB/s throughput levels. The commands for these checks are collected below.
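Assuming a pool named "tank" (a placeholder - substitute your own), the checks above boil down to the following; zfs-stats comes from the sysutils/zfs-stats port:

Code:
% zpool status -v tank
% zfs-stats -A
% zpool iostat -v tank 1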
 
@ sko

Thanks I will look into it.

This is a single SSD on GELI (aligned to 4k), and the ZFS pool is also aligned to 4k (ashift=12). The disk is not dying. I will check the sysctl(8)s you mentioned. Thanks.
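For anyone wanting to verify a similar setup, the ashift and the GELI sector size can be checked directly; the pool and provider names here are placeholders:

Code:
% zdb -C zroot | grep ashift
% geli list ada0p3.eli | grep Sectorsize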
 
zfs_free_max_blocks and zfs_free_min_time_ms

Code:
% sysctl zfs_free_max_blocks
sysctl: unknown oid 'zfs_free_max_blocks'

% sysctl zfs_free_min_time_ms
sysctl: unknown oid 'zfs_free_min_time_ms'

% uname -spr
FreeBSD 11.2-RELEASE amd64

% sysctl -a | grep zfs_free_min_time_ms
(none)

% sysctl -a | grep zfs_free_max_blocks 
(none)
 
Actually, they are vfs.zfs.free.max.blocks and vfs.zfs.free.min.time.ms.
And you can get a list of the ZFS sysctl variables, each with a one-line description, by using
sysctl -d vfs.zfs
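For example, to find the free-related tunables without guessing the exact spelling (which differs between releases):

Code:
% sysctl -d vfs.zfs | grep -i free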
 
The essence of the problem is not the slowness per se, but that the machine (FreeBSD 11.3) frequently becomes literally unresponsive for up to a minute while it's busy removing stuff - especially when ZFS dedup is enabled.
 
In my case it was the slow TRIM function on a cheap SSD drive.

In my case, this line in /etc/sysctl.conf helped:

vfs.zfs.vdev.trim_max_active=1
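To confirm beforehand that TRIM is really the culprit, gstat(8)'s -d flag adds columns for delete (BIO_DELETE) operations, so you can watch them pile up while a large file is being removed:

Code:
% gstat -d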

Regards.
 
Thanks. No, not my case. The machine is a DigitalOcean VPS, so SSD TRIM doesn't even apply to it. It's simply overloaded with reads & writes - up to 95-100% load according to gstat - which probably shouldn't make a machine completely unresponsive for up to a minute. Lowering vfs.zfs.per_txg_dirty_frees_percent from 30 to 5 hasn't helped at all. It's the sheer volume of writes causing all this.
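For completeness, that tunable can be changed at runtime as root, and reverted by setting it back to 30 (the default mentioned above):

Code:
# sysctl vfs.zfs.per_txg_dirty_frees_percent=5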
 
Try asking Allan - http://www.allanjude.com/ - https://twitter.com/allanjude - about this issue.

He helped me with the TRIM thing.
 
allan@ hasn't replied to me yet, but I think I've finally googled up the most probable cause.
zpool status -D (oddly enough, -D isn't documented in FreeBSD's zpool(8)) showed the current DDT status:

Code:
dedup: DDT entries 1995465, size 502 on disk, 365 in core

502 & 365 are in bytes, so simply multiplying them by the number of entries gives a little under 1GB on disk and a little over 700MB in memory (see the check below). It turns out ZFS silently decided to spill the DDT to disk even with a couple of gigabytes of free RAM available. I expected the system to exhibit RAM scarcity by swapping to disk or something. Is there a way to configure the DDT in-core size limit? Maybe it's vfs.zfs.arc_max?
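Sanity-checking the multiplication with bc(1):

Code:
% echo '1995465 * 502' | bc
1001723430
% echo '1995465 * 365' | bc
728344725

i.e. roughly 1.0 GB on disk and about 728 MB in core, matching the estimate above.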

It's no surprise data deletion is so slow: every freed block needs its DDT entry looked up and its reference count updated, and once the table has spilled to disk each of those lookups can mean a random read.
 