Reading on ZFS extremely slow

… OpenZFS now uses a connecting "dot" instead of a connecting underscore: …

Loosely speaking (it may be that some cases are not yet fixed):
  • . (dot) for consistency
  • plus _ (underscore) for legacy compatibility.
For example: vfs.zfs.arc_max is now vfs.zfs.arc.max.

Side note: FreeBSD bug 218538 – tuning(7) should either be removed or strictly maintained.
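A minimal sketch of that underscore-to-dot mapping for the ARC maximum, assuming the simple arc_ → arc. substitution (which is illustrative only and not a general rule; some knobs were renamed entirely):

```shell
# Illustrative only: the underscore-to-dot rename for the ARC maximum
# (vfs.zfs.arc_max -> vfs.zfs.arc.max). The substitution below is NOT a
# general rule; other knobs changed name in different ways.
legacy="vfs.zfs.arc_max"
new=$(printf '%s\n' "$legacy" | sed 's/^vfs\.zfs\.arc_/vfs.zfs.arc./')
echo "$new"   # vfs.zfs.arc.max
```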
 
Sorry, I could have been clearer.
No problem. Documentation could indeed do with some updates, thanks for reporting that.

If the Handbook decides to document such tunables, a mention of what "0" means would be nice, especially since it is also used as an output value of the tunable. It would also be nice if it were documented somewhere that the ZFS ARC maximum is exposed through the (kernel) parameter zfs_arc_max and through the tunable vfs.zfs.arc.max, especially since it used to be vfs.zfs.arc_max.

We have lost vfs.zfs.arc_free_target as a tunable. It seems to have gone underground: no tunable for arc_free_target (arc_os.c, lines 71-74). The Advanced ZFS book writes about it in its section on ARC tuning, where it is described as an alternative to tuning with arc_max (arc_min). By its description, I see it as a better tuning starting point than arc_max because:
  • the enforcing mechanism is different
  • it can be adjusted at run time
vfs.zfs.arc_max & vfs.zfs.arc_min are reported in the Advanced ZFS book not to be tunable on the fly. That aspect seems to have changed somewhat.
 
Last edited:
See also: <https://forums.freebsd.org/posts/551519>



… We have lost vfs.zfs.arc_free_target as a tunable. It seems to have gone underground: …

Code:
% zfs version
zfs-2.1.99-FreeBSD_g17b2ae0b2
zfs-kmod-2.1.99-FreeBSD_g17b2ae0b2
% sysctl vfs.zfs.arc.sys_free
vfs.zfs.arc.sys_free: 0
% sysctl vfs.zfs.arc_free_target
vfs.zfs.arc_free_target: 86267
% sudo sysctl vfs.zfs.arc.sys_free=100000
grahamperrin's password:
vfs.zfs.arc.sys_free: 0 -> 100000
% sudo sysctl vfs.zfs.arc.sys_free=0
vfs.zfs.arc.sys_free: 100000 -> 0
% sudo sysctl vfs.zfs.arc_free_target=256000
vfs.zfs.arc_free_target: 86267 -> 256000
% sudo sysctl vfs.zfs.arc_free_target=86267
vfs.zfs.arc_free_target: 256000 -> 86267
% uname -aKU
FreeBSD mowa219-gjp4-8570p-freebsd 14.0-CURRENT FreeBSD 14.0-CURRENT #5 main-n253627-25375b1415f-dirty: Sat Mar  5 14:21:40 GMT 2022     root@mowa219-gjp4-8570p-freebsd:/usr/obj/usr/src/amd64.amd64/sys/GENERIC-NODEBUG amd64 1400053 1400053
%
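Since vfs.zfs.arc_free_target turned out to be writable at run time (per the transcript above), a persistent setting would belong in /etc/sysctl.conf rather than /boot/loader.conf. A sketch, reusing the illustrative value from the transcript:

```shell
# /etc/sysctl.conf -- applied by rc(8) after boot; the right place for
# run-time-writable sysctls. 256000 is the illustrative value tried above.
vfs.zfs.arc_free_target=256000
```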

Cross-reference {link removed, link provider grahamperrin is dead, the killers can celebrate}:

<https://github.com/openzfs/zfs/comm...89dd6d116ebd50071673f6e5146f1ac290882R71-R101> (2020-04-15) began:

Code:
/*
 * We don't have a tunable for arc_free_target due to the dependency on
 * pagedaemon initialisation.
 */

Allan or anyone: please, is that comment redundant?

<https://forums.freebsd.org/posts/558971> if I'm not mistaken, there's tuning.

Postscript

Allan Jude helped me to understand that what I tuned was not a tunable. In the context of the code comment, tunable is a noun; "… basically a special type of sysctl that gets its initial value from the kernel environment (set by loader)".
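That distinction can be seen on a live FreeBSD system (a sketch; output varies per machine): kenv(1) shows the kernel environment that loader(8) populated from /boot/loader.conf, which is where a tunable in this strict sense gets its initial value, while sysctl(8) shows the live value.

```shell
# Sketch (FreeBSD): a tunable takes its initial value from the kernel
# environment staged by loader(8); kenv(1) displays that environment.
kenv | grep '^vfs\.zfs'        # values set via /boot/loader.conf, if any
sysctl vfs.zfs.arc.max         # the live sysctl value after boot
```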
 
Thanks for looking into this and Allan Jude's explanation.

The documentation in the Handbook could do with some updating, especially when introducing vfs.zfs.arc_max (& min) and then not describing that "0" represents the default value. Not mentioning that the tunable has changed name (because of the changed internal source code tree structure) from vfs.zfs.arc_max to vfs.zfs.arc.max doesn't help either in clarifying things.

Great to know that vfs.zfs.arc_free_target is still usable as a knob to turn. I wasn't able to fully appreciate the comment I was quoting in message #53; I hadn't tried to look at the source code. The kernel parameter arc_free_target doesn't have a tunable; it is itself derived from other tunables, if I understand correctly. As such, the sysctl vfs.zfs.arc_free_target cannot be set by the loader (i.e. it cannot be set in /boot/loader.conf).
 
… cannot be set by the loader (=it cannot be set in /boot/loader.conf).

I stumbled across a bookmarked tutorial, from 2019, that led indirectly to this:

Is there any way to confirm whether they are kernel tunables or sysctl variables?

Yes, it is a flag question; you could install sysutils/nsysctl (>= 1.1) [1]:

Code:
% nsysctl -aNG | grep elantech

You can read the comments in sys/sysctl.h for a description of the flags (if you like a GUI, deskutils/sysctlview [2] has a window for the flags, and Help->Flags for a description).

[1] nsysctl tutorial
[2] sysctlview screenshots

nsysctl(8)

So, for example, vfs.zfs.arc_free_target near the head of this list:

Code:
% nsysctl -NG vfs.zfs | grep -v \ TUN | sort
vfs.zfs.anon_data_esize:  RD MPSAFE
vfs.zfs.anon_metadata_esize:  RD MPSAFE
vfs.zfs.anon_size:  RD MPSAFE
vfs.zfs.arc_free_target:  RD WR RW MPSAFE
vfs.zfs.crypt_sessions:  RD MPSAFE
vfs.zfs.l2arc_feed_again:  RD WR RW MPSAFE
vfs.zfs.l2arc_feed_min_ms:  RD WR RW MPSAFE
vfs.zfs.l2arc_feed_secs:  RD WR RW MPSAFE
vfs.zfs.l2arc_headroom:  RD WR RW MPSAFE
vfs.zfs.l2arc_noprefetch:  RD WR RW MPSAFE
vfs.zfs.l2arc_norw:  RD WR RW MPSAFE
vfs.zfs.l2arc_write_boost:  RD WR RW MPSAFE
vfs.zfs.l2arc_write_max:  RD WR RW MPSAFE
vfs.zfs.l2c_only_size:  RD MPSAFE
vfs.zfs.mfu_data_esize:  RD MPSAFE
vfs.zfs.mfu_ghost_data_esize:  RD MPSAFE
vfs.zfs.mfu_ghost_metadata_esize:  RD MPSAFE
vfs.zfs.mfu_ghost_size:  RD MPSAFE
vfs.zfs.mfu_metadata_esize:  RD MPSAFE
vfs.zfs.mfu_size:  RD MPSAFE
vfs.zfs.mru_data_esize:  RD MPSAFE
vfs.zfs.mru_ghost_data_esize:  RD MPSAFE
vfs.zfs.mru_ghost_metadata_esize:  RD MPSAFE
vfs.zfs.mru_ghost_size:  RD MPSAFE
vfs.zfs.mru_metadata_esize:  RD MPSAFE
vfs.zfs.mru_size:  RD MPSAFE
vfs.zfs.super_owner:  RD WR RW MPSAFE
vfs.zfs.vdev.cache:  RD WR RW
vfs.zfs.version.acl:  RD MPSAFE
vfs.zfs.version.ioctl:  RD MPSAFE
vfs.zfs.version.module:  RD MPSAFE
vfs.zfs.version.spa:  RD MPSAFE
vfs.zfs.version.zpl:  RD MPSAFE
%
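As an aside, base sysctl(8) can answer the same flag question without nsysctl, if I remember its options correctly: -T restricts output to loader-settable tunables (CTLFLAG_TUN), and -W to writable variables. A sketch:

```shell
# Sketch (FreeBSD): filtering by flag with plain sysctl(8).
sysctl -aTN | grep '^vfs\.zfs\.arc'   # loader tunables only (CTLFLAG_TUN)
sysctl -aWN | grep '^vfs\.zfs\.arc'   # writable variables only
```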
 
Interesting that it's been reported performance recovers after recreating the pool; a likely explanation for that is better fragmentation.
 
I vacated my 10 TB tank, which was nearly 10 years old, by sending a snapshot to external media and then sending it back. I did re-configure and re-initialise the tank while the data were away -- it got an extra spindle in the RAID-Z1 set.

The scrub time came down to 5 hours. It was, to the best of my recollection, up around 12 hours.

So fragmentation probably matters...
 
… The scrub time came down to 5 hours. It was, to the best of my recollection, up around 12 hours.

So fragmentation probably matters…

I should not expect fragmentation of files, alone, to have so extreme an effect on scrub (of pool metadata and blocks).

From zpool-scrub.8 — OpenZFS documentation:

… A scrub is split into two parts: metadata scanning and block scrubbing. The metadata scanning sorts blocks into large sequential ranges which can then be read much more efficiently from disk when issuing the scrub I/O. …
 
… Interesting that it's been reported performance recovers after recreating the pool; a likely explanation for that is better fragmentation …
No, I had similar problems on the new pool. This is why I dug into other configuration options and came up with restricting the default arc size.
 
OK, there is another issue. Even though our machine has rather low write activity "on average", we sometimes have high write activity, creating many tens of thousands of rather small files.

This might create some kind of fragmentation due to the ZIL activities being on the same device, as we don't have additional disks available; see https://thomas.gouverneur.name/2011/06/20110609zfs-fragmentation-issue-examining-the-zil/
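If a spare device ever becomes available, the usual mitigation for ZIL-induced fragmentation is a separate log (SLOG) vdev, so synchronous log blocks stop landing between data blocks on the main disks. A sketch, where the pool and device names are placeholders:

```shell
# Sketch: attach a dedicated log device so ZIL blocks no longer intermix
# with ordinary data blocks on the pool's main vdevs.
# "mypool" and "ada2" are placeholder names.
zpool add mypool log ada2
```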

This gives a hint at what happened, as a) the 14 TB partition was running low on disk space and b) very many files were created at one time, when we were very low on space. So maybe here we had problems with massive fragmentation.
 
I should not expect fragmentation of files, alone, to have so extreme an effect on scrub (of pool metadata and blocks).
Nor would I. However, the original tank was in continuous service for 8 years. It had 6 spindles in RAID-Z1 configuration. This is sub-optimal, and 7 spindles (which is what I went to) is technically better. But the general advice is that when you turn on compression (and I did), the spindle count advantage is diminished.

In any event, I can't go back, and have recently evolved again to use a 4-way stripe of 2-spindle mirrors with a separate ZIL.
 
Sorry if you consider my post off-topic.
I decided to write because, although I am using Proxmox version 8.2.4, I have the exact same problem.

Write IOPS on one of the arrays (raid5sas, on an LSI SAS2008 controller) degrade by some 2000%, dropping to about 20-30 MB/s, after the ARC buffer fills.
Reading from the same array is fine, >400 MB/s (I did not measure it exactly).
ARC cache set to:
Code:
c        4    67483668480
c_min    4    4217729280
c_max    4    67483668480
128 GB RAM.

I do not think that ARC settings have anything to do with it because the second SATA array on the same hardware works correctly for both write and read.

Code:
root@pve2:/raid5sas/test/temp# zpool status
  pool: raid5low
 state: ONLINE
  scan: scrub repaired 0B in 02:56:37 with 0 errors on Sun Jul 14 03:20:39 2024
config:

	NAME        STATE     READ WRITE CKSUM
	raid5low    ONLINE       0     0     0
	  sdg       ONLINE       0     0     0
	  sdi       ONLINE       0     0     0
	  sdk       ONLINE       0     0     0
	  sdh       ONLINE       0     0     0
	  sdj       ONLINE       0     0     0
	  sdf       ONLINE       0     0     0

errors: No known data errors

  pool: raid5sas
 state: ONLINE
  scan: scrub repaired 0B in 01:52:42 with 0 errors on Sun Aug 4 07:11:04 2024
config:

	NAME                        STATE     READ WRITE CKSUM
	raid5sas                    ONLINE       0     0     0
	  raidz1-0                  ONLINE       0     0     0
	    wwn-0x5000c500587b5f5f  ONLINE       0     0     0
	    wwn-0x5000c5008591cf77  ONLINE       0     0     0
	    wwn-0x5000c5008591a4cb  ONLINE       0     0     0
	    wwn-0x5000c50085880b2f  ONLINE       0     0     0

errors: No known data errors
root@pve2:/raid5sas/test/temp#

That's more or less how it looks for me.
The history of this raid5sas array is that it was created under Rocky Linux 8 (≈ RHEL 8).
Then the controller and disks were moved to newer hardware for the Proxmox system.
And here, after the first power-up, it did not detect the pool, so I imported it.

During import, it shouted that /dev/sdf /dev/sdg /dev/sdh /dev/sdi was busy (the names of the disks in the system had changed), so according to the guide, I gave the -f option indicating which disks to import, i.e. /dev/sda /dev/sdb /dev/sdc /dev/sdd.

After importing, it wrote that the array was OK and the disks were displayed using WWN numbering.

Apart from that, there was nothing unusual.

Scrub did not detect anything.

Did mack3457 manage to solve the problem? How?
 
In /boot/loader.conf I added:
Code:
# Max. 4 GB ARC size:
vfs.zfs.arc_max="4294967296"
I think that resolved the issue, but it was a long time ago; I don't remember the details anymore.

The problem was that the arc cache didn't restrict itself as expected and took more memory than desired. See post #39 in this thread.
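For reference, the 4294967296 in that loader.conf line is simply 4 GiB expressed in bytes:

```shell
# 4 GiB in bytes, matching the vfs.zfs.arc_max value quoted above.
echo $((4 * 1024 * 1024 * 1024))   # 4294967296
```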
 