FreeBSD 13.0 zfs+dedup terrible performance

FreeBSD 13.0 release notes say that it now ships the OpenZFS implementation. Compared to 11.4 on the same hardware (a VPS), performance (with dedup enabled; I haven't tested without it) is absolutely terrible. Writes take much longer to complete, and deletes simply halt other writes for a very long time, up to 20-40 seconds, which makes interactivity (logins/logouts, which tend to cause some subtle disk i/o for the usual accounting) a nightmare. Other than setting arc.max and a couple of other knobs, no tweaks have been made here or in its previous incarnation, 11.4.

/boot/loader.conf:

Code:
virtio_balloon_load="YES"
virtio_blk_load="YES"
virtio_load="YES"
virtio_pci_load="YES"
if_vtnet_load="YES"

hw.usb.no_shutdown_wait=1
loader_logo=none
zfs_load="YES"

vfs.root.mountfrom="zfs:zroot"
vfs.zfs.arc.max="8G"
vfs.zfs.vdev.cache_size="16M"

console="vidconsole,comconsole"
autoboot_delay="5"


/etc/sysctl.conf:

Code:
kern.maxvnodes=800000
vfs.zfs.arc.meta_limit=3221225472

The last one defaults to 25% of arc_max iirc and it was increased in the hopes of speeding up dedup back in 10.3/11.4.
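
For what it's worth, the values actually in effect can be double-checked at runtime, something like:

Bash:
# confirm the ARC ceilings currently in effect (13.0 sysctl names as used above)
sysctl vfs.zfs.arc.max
sysctl vfs.zfs.arc.meta_limit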


Running without dedup isn't an option, obviously ))
Code:
$ zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
zroot   640G   159G   481G        -         -    83%    24%  12.00x    ONLINE  -

Can someone please point me in the right direction in the Expert Tweaking Land, I'm new here :)
 
I think dedup can be useful when you have 64GB of memory. With 8GB I would disable it unless there is a very good reason to enable it, but that would be a "special" application. The ZFS options you enable depend on the specific kind of end application.
 
All tutorials on ZFS I have read say one should be very careful about deduplication. Do you need it for a very specific reason?
Normally it suffices to enable compression (gzip for textual data, or lz4 for binary).
Sorry, I am not aware of any specifics related to 13.0 deduplication yet.
If you really need it, try to limit it only to the data which is relevant - for example database files, or an FTP server directory. Don't dedup the system files.
Oh yeah, as Alain De Vos mentioned, having more RAM helps.
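
Something along these lines, with made-up dataset names, just to illustrate keeping dedup per-dataset:

Bash:
# compression everywhere is cheap; dedup only where duplicates are expected
zfs set compression=lz4 zroot/data
zfs set dedup=on zroot/data/pgsql
zfs get compression,dedup zroot/data/pgsql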
 
Please note the zpool list output above, at the end of my post )
The 12x ratio means that without deduplication the data would take roughly 2TB of disk space, not the current 159G. And there's only 640G of space; upgrading the VPS isn't currently an option.

And the question is more about the performance regression between FreeBSD 11.4 and 13.0 ZFS. AFAIK the former is Illumos-based, and the latter is implemented by OpenZFS.
Perhaps there are some tweaks to bring OpenZFS on par with 11.4 speed-wise?
 
According to the ZFS tuning guide, a minimum of 5GB of memory per 1TB of storage is recommended.

But the more important recommendation is:

"If you are going to use deduplication and your machine is underspec'ed, you must set vfs.zfs.arc_max to a sane value or ZFS will wire down as much available memory as possible, which can create memory starvation scenarios."

So I would think that if your memory is, say, 8GB, setting vfs.zfs.arc_max to 8GB will just hog the system.
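
If you want a rough idea of how much RAM the dedup table itself needs, something like this should print the DDT statistics (the commonly quoted rule of thumb is a few hundred bytes of RAM per unique block):

Bash:
# dedup table histogram and totals for the pool from the zpool list output above
zdb -DD zroot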

In case you missed it, below is the link to the ZFS tuning guide section on dedup:

 
Good point about the 12x dedup ratio. Never mind my comment.
 
By the way, it's probably a good idea to watch some I/O stats while you are experiencing the bad performance. This might give you some idea of what the bottleneck is and how to optimize.
These are a couple of monitoring commands I have written down so far:
Bash:
# Memory, resource info:
vmstat 3          # virtual memory and CPU stats every 3 seconds
iostat -w3        # device throughput every 3 seconds
systat -iostat    # full-screen I/O view
gstat             # per-GEOM-provider I/O load and latency
procstat -av      # virtual memory mappings of all processes
ps vax            # process list with memory/paging columns
top -w            # shows swap usage per process
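
For ZFS specifically, zpool iostat is also worth watching, e.g.:

Bash:
# per-vdev read/write ops and bandwidth, refreshed every 3 seconds
zpool iostat -v 3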
 
I have found that during heavy i/o this sysctl:
Code:
kstat.zfs.misc.dmu_tx.dmu_tx_dirty_throttle: 53480
keeps increasing in small bursts. These artificial halts hinder interactivity. I'm trying to find a way to eliminate them completely or to considerably raise their activation threshold.
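
A crude loop like this makes the bursts easy to spot while the heavy i/o is running:

Bash:
# print the throttle counter every 5 seconds and watch for jumps
while :; do sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_throttle; sleep 5; done
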
Here are my current modifications, with the defaults shown in the comment preceding each one:

Code:
#100000
vfs.zfs.max_async_dedup_frees=9000

#60
vfs.zfs.delay_min_dirty_percent=75

#60
vfs.zfs.vdev.async_write_active_max_dirty_percent=75

#0
vfs.zfs.arc.min=2147483648

Some fine manual for the middle two: https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/ZFS%20Transaction%20Delay.html
 
Have you seen any improvement in the server response after changes?
A tiny bit, maybe. Still tweaking; the 75s above are today's changes and I need to see how they feel. Along with interactivity I'm trying to bring back sheer throughput, because the daily task that took 5 hours on 11.4 now takes 6+ hours on 13.0, which is a 20%+ speed penalty.
 
Sure. It's a staging environment with a bunch of initially identical PostgreSQL databases. Space-wise, ZFS dedup suits it perfectly.
Why don't you work by cloning ZFS datasets? When your data diverges, simply make a new clone and change its data. ZFS records only the diffs.
First create a single database in a dataset, then clone it a number of times, and then make the changes everywhere. Deduplication would not be necessary, unless you plan to make a lot of changes in each database and end up with almost the same data everywhere.
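
Roughly like this, with dataset names invented for the example; each clone is writable and only the blocks you change afterwards consume extra space:

Bash:
# one "golden" dataset, snapshotted once, cloned per staging database
zfs create -p zroot/pg/base                    # load the initial database here
zfs snapshot zroot/pg/base@initial
zfs clone zroot/pg/base@initial zroot/pg/staging1
zfs clone zroot/pg/base@initial zroot/pg/staging2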

I think it's an application issue and you are trying to solve it at the storage level.
 
It's a general issue, I guess. Not only PG: Elasticsearch also benefits N-fold from ZFS dedup, which starts working instantly and transparently to any app simply by turning the knob.

Also, the manuals state that clones can only be created from snapshots, and snapshots are read-only. Not sure how to arrange for PG & ES to be aware of all that, or whether they should )

Moreover, some googling shows there are enough people complaining about zfs killing interactivity during heavy write activity, even from people not using dedup.
 
If you are lucky, the issues will be fixed. But optimizations are usually a tradeoff: either you save space, or you save time. Having it both ways is usually hard or impossible.
 
What I'm trying to battle here are absolutely intolerable interactivity blocks of up to 20-60 seconds or so, not performance in the first place (although overall throughput also suffers compared to 11.4, as mentioned previously). During problem times, 20-60 sec is how long I need to wait for ssh logins or logouts to happen, or for vim to open any file.

Generally speaking: if the vdisks are only this fast, there's little you can do at the OS/app level. You would just notice that disk i/o operations like writing a file or logging in/out complete a bit later than they do during idle times, yet _predictably_ later. This is what I'm trying to achieve. But if the delays and all this throttling happen artificially so as not to overload the disk, i.e. some preset "max op" value is reached and no further i/o operations are attempted until the value drops back below that threshold, then this is another story, and it should be tweaked/fixed to match the current hardware's capabilities.

Currently I've undone all the customizations that were attempted in bulk. The idea now is to make just one modification and wait at least 1 day to see if it changes anything (heavy disk activity takes place every morning).
I've noticed that, compared to 11.4, this is 2 in 13.0, not 1:
vfs.zfs.vdev.async_write_min_active: 2

According to the OpenZFS docs, this is the minimum number of operations issued when a txg is committed to storage (every 5 seconds by default). Let's see if changing it to 1 helps the overall i/o load and, as a result, mitigates all these freezes.
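
For the record, this is a minimal way to try it, assuming the OID is writable at runtime (otherwise it has to go into /etc/sysctl.conf and wait for a reboot):

Bash:
# apply immediately, then persist across reboots
sysctl vfs.zfs.vdev.async_write_min_active=1
echo 'vfs.zfs.vdev.async_write_min_active=1' >> /etc/sysctl.conf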
 
Wow, this value in 11.4 defaults to 1, and to 0 in 13.0:

Code:
vfs.zfs.dedup.prefetch: Enable prefetching dedup-ed blks


Maybe this is it. Ok, so just these two values with their defaults after the comment sign:

Code:
vfs.zfs.vdev.async_write_min_active=1 #2
vfs.zfs.dedup.prefetch=1 #0
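
Side note: sysctl -d prints the description string for a knob, which is handy when comparing the two versions:

Bash:
# show the built-in descriptions for the two tunables above
sysctl -d vfs.zfs.dedup.prefetch
sysctl -d vfs.zfs.vdev.async_write_min_active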

Waiting until tomorrow.
 
Nope, turning on dedup prefetch (like it was in 11.4) decreased performance further. Now I only have this change:
vfs.zfs.vdev.async_write_min_active=1

I also reverted vfs.zfs.arc.min from 2GB back to its default of 0.
Waiting until tomorrow (c).
 