ZFS Reading on ZFS extremely slow

Cath O'Deray · Feb 15, 2022

mack3457 said:
the system doesn't boot.

mack3457 said:
this other problem of the system not booting.

FreeBSD bug {link removed} – there's a moderated how-to.

mack3457 · Feb 16, 2022

So, we do give up this server. We already rented a second one and are copying everything from the old server and the backup to the new one, which btw. gives us completely different io values, e.g. about 300 to 700 kB/s per IO instead of 16 kB/s etc..

The most important dataset is still accessible (about 14 TB), so beyond being offline for a couple of days and much work, it looks "basically" fine. So it's probably not the fault of 12.2 to 13.0 migration but just a defective system.

It's unfortunate that zpool status or similar didn't report any problems, so we didn't notice it in time.

diizzy · Feb 16, 2022

zfs won't notify you until something (a device) actually breaks, you can however use smartd to monitor disk status using like netdata (I like it becasue its easy to setup) or another system monitoring software.

gpw928 · Feb 16, 2022

diizzy said:
zfs won't notify you until something (a device) actually breaks, you can however use smartd to monitor disk status using like netdata (I like it becasue its easy to setup) or another system monitoring software.

That's good advice. smartd(8) is a must for any system you care about. It's going to tell you about all sorts of developing disk problems, and is dead easy to configure. One line is usually enough in /usr/local/etc/smart.conf. Here's mine:

Code:

DEVICESCAN -a -o on -S on -n standby,q -s (S/../.././02|L/../../6/03) -W 4,50,55 -m alerts@mailhost

You may care to review the "-W 4,50,55" as 55 degrees C would generally be considered "hot" (no A/C in my office). See smartd.conf(5).

I use email to post the alerts. You do have to read your email ("alerts@mailhost" get forwarded to me). That's enough for my situation. More serious circumstances would warrant an active monitoring and alert system, perhaps capable of waking you up at night.

You should also thoughtfully configure the "zfs" options enumerated in /etc/defaults/periodic.conf. Then read root's email, daily. It's my first task every morning -- after coffee.

[My tank is 12TB. I scrub it every 35 days (automated by the periodic(8) configuration). It takes about 5 hours, and everything runs slow for the duration (so timing may be important), but it provides real confidence in the integrity of the system.]

Peter Eriksson · Feb 16, 2022

gpw928 said:
Our posts crossed.

My ZFS zroots are all small, and on mirrors, with boot blocks on each disk. So I have no direct experience of your situation.

Knowing the status of the zroot pool is important. You need to know if it's seriously damaged.

You also need to verify that your service provider actually replaced ada0.

I'd boot from the rescue media. Have a look at zpool status zroot.

Is ada0 "OFFLINE"? Are all the other disks "ONLINE"? If not, stop, and re-consider.

Assuming ada0 has been replaced, you are going to need to put a gpt label on it. Assuming that they all disks all the same:

Code:

gpart show # check the partitions, especially the freebsd-boot partition number -- assuming 1 below) gpart destroy -F ada0 gpart backup ada1 | gpart restore -F ada0 gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0

If it still won't boot from ada0, try switching to ada1 in the BIOS (ada1 might need bootcode installed).

If you do manage to boot, pause and prepare a plan for the re-silvering.

There's plenty of expertise on this list. They just need a chance to respond.

Perhaps the server is using UEFI-boot? Then a different bootblock should be used...

I've written a shell script that checks (and optionally reinstalls) the bootblocks that you might find useful:

http://www.grebo.net/~peter/freebsd/check-bootcode

(By default it uses "gpart" to look for "efi" and "freebsd-boot" partitions)

- Peter

gpw928 · Feb 16, 2022

Peter Eriksson said:
Perhaps the server is using UEFI-boot? Then a different bootblock should be used...

I've written a shell script that checks (and optionally reinstalls) the bootblocks that you might find useful:

Good stuff. I was trying to avoid too much complication. And I did say:

Code:

gpart show    # check the partitions, especially the freebsd-boot partition number -- assuming 1 below

My assumption was that with freebsd-zfs zroot in partition 3 (see initial post), UEFI was not in use.

Cath O'Deray · Feb 17, 2022

Peter Eriksson said:
http://www.grebo.net/~peter/freebsd/check-bootcode

Thanks, I see this line:

gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i $P "$DEV"

From <{link removed}>, with added emphasis:

… To update old ESP partitions, users should stop using the gpart(8) utility. Instead, …

If I'm taking that line of the script out of context: apologies.

mack3457 · Feb 17, 2022

Just one other question: when I boot the old system into the rescue system, I get everything mounted properly now.

After importing with

Code:

zpool import -R /mnt zroot

it looks like this:

Code:

[root@rescue ~]# zfs list
NAME                 USED  AVAIL     REFER  MOUNTPOINT
zroot               14.5T   750G      128K  none
zroot/ROOT           215G   750G      128K  none
zroot/ROOT/default   215G   750G      215G  /mnt
zroot/files         14.2T   750G     14.2T  /mnt/files
zroot/tmp           12.7G   750G     12.7G  /mnt/tmp
zroot/usr           13.1G   750G      151K  /mnt/usr
zroot/usr/home      10.7G   750G     10.7G  /mnt/usr/home
zroot/usr/ports     2.36G   750G     2.36G  /mnt/usr/ports
zroot/usr/src        128K   750G      128K  /mnt/usr/src
zroot/var           34.2G   750G      128K  /mnt/var
zroot/var/audit     10.1G   750G     10.1G  /mnt/var/audit
zroot/var/crash     21.0G   750G     21.0G  /mnt/var/crash
zroot/var/log       1.57G   750G     1.57G  /mnt/var/log
zroot/var/mail      1.53G   750G     1.53G  /mnt/var/mail
zroot/var/tmp       15.3M   750G     15.3M  /mnt/var/tmp

Now, I would like to boot from this, but it does not come up. Is there any idea, what might be wrong?

gpw928 · Feb 17, 2022

What does gpart show say?

What is the boot order set to in the BIOS?

mack3457 · Feb 17, 2022

Code:

[root@rescue ~]# gpart show
=>         40  11721045088  ada0  GPT  (5.5T)
           40         1024     1  freebsd-boot  (512K)
         1064          984        - free -  (492K)
         2048     83886080     2  freebsd-swap  (40G)
     83888128  11637155840     3  freebsd-zfs  (5.4T)
  11721043968         1160        - free -  (580K)

=>         40  11721045088  diskid/DISK-Y7QOK01AFTTB  GPT  (5.5T)
           40         1024                         1  freebsd-boot  (512K)
         1064          984                            - free -  (492K)
         2048     83886080                         2  freebsd-swap  (40G)
     83888128  11637155840                         3  freebsd-zfs  (5.4T)
  11721043968         1160                            - free -  (580K)

=>         40  11721045088  ada1  GPT  (5.5T)
           40         1024     1  freebsd-boot  (512K)
         1064          984        - free -  (492K)
         2048     83886080     2  freebsd-swap  (40G)
     83888128  11637155840     3  freebsd-zfs  (5.4T)
  11721043968         1160        - free -  (580K)

=>         40  11721045088  ada2  GPT  (5.5T)
           40         1024     1  freebsd-boot  (512K)
         1064          984        - free -  (492K)
         2048     83886080     2  freebsd-swap  (40G)
     83888128  11637155840     3  freebsd-zfs  (5.4T)
  11721043968         1160        - free -  (580K)

=>         40  11721045088  ada3  GPT  (5.5T)
           40         1024     1  freebsd-boot  (512K)
         1064          984        - free -  (492K)
         2048     83886080     2  freebsd-swap  (40G)
     83888128  11637155840     3  freebsd-zfs  (5.4T)
  11721043968         1160        - free -  (580K)

[root@rescue ~]#

Boot order is network boot first, then ada1 or ada2, don't remember currently.

gpw928 · Feb 17, 2022

May we assume that "diskid/DISK-Y7QOK01AFTTB" is a new disk, added to replace the failing one?

What does zpool status say? Also ls -lad /dev/ada? /dev/diskid/DISK-Y7QOK01AFTTB.

Assuming that the status shows ada0 through ada3 as the members of the RAID set, you can add the required boot blocks to each drive with:

Code:

for disk in 0 1 2 3
do
    gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 /dev/ada${disk}
done

Change the BIOS boot order to place working disks first, and try to reboot.

[I am not a fan of giant root file systems of any kind. Two small enterprise class SSDs (e.g. Intel D3 or Optane) would make a good root mirror, with fast swap, and allow you the option to add a separate ZFS intent log (SLOG) and a level 2 ARC (L2ARC). But that's probably best set aside until the next major upgrade.]

mack3457 · Feb 17, 2022

No, there are only four disks, ada0 to ada3.

Code:

[root@rescue ~]# zpool status
  pool: zroot
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 0B in 1 days 11:34:35 with 0 errors on Wed Feb 16 22:24:04 2022
config:

        NAME        STATE     READ WRITE CKSUM
        zroot       DEGRADED     0     0     0
          raidz1-0  DEGRADED     0     0     0
            ada0p3  OFFLINE      0     0     0
            ada1p3  ONLINE       0     0     0
            ada2p3  ONLINE       0     0     0
            ada3p3  ONLINE       0     0     0

errors: No known data errors
[root@rescue ~]#

This DISK-... is ada0:

Code:

[root@rescue ~]# dmesg | grep Y7QOK01
ada0: Serial Number Y7QOK01AFTTB
ada0: Serial Number Y7QOK01AFTTB
[root@rescue ~]# ls -lad /dev/ada? /dev/diskid/DISK-Y7QOK01AFTTB
crw-r-----  1 root  operator  0x70 Feb 17 16:47 /dev/ada0
crw-r-----  1 root  operator  0x7a Feb 17 16:47 /dev/ada1
crw-r-----  1 root  operator  0x7c Feb 17 16:47 /dev/ada2
crw-r-----  1 root  operator  0x7e Feb 17 16:47 /dev/ada3
crw-r-----  1 root  operator  0x78 Feb 17 16:47 /dev/diskid/DISK-Y7QOK01AFTTB
[root@rescue ~]#

Giant root file system: no, I never do it like this, when I have the chance. This server costs us about 82 € / month. If we want more disks of that size, the price will increase by minimum 60%. So, that's the reason for this layout.

I won't have time tomorrow, I think. So no chance for other tests or responses then. Anyway, the new server is up and live almost completely, so it's just a kind of curiosity to know, why it doesn't boot. We almost don't need it anymore.

gpw928 · Feb 17, 2022

Hmmm, I suspect that you have been here, but logic suggests that it would boot if you put boot blocks on ada1, and changed the boot order to ada1 first. As a hedge, put the boot blocks on ada2 and ada3 as well. Then reboot after you rotate the "first" boot disk in the BIOS. Do it a couple of times to be sure of not selecting ada0.

mack3457 · Mar 3, 2022

One result on the slow ZFS problem: the arc size was set to 0 and this probably was a problem.

I had it the same way on the new machine and sometimes experienced very slow disk access as well. I now limited it to 4 GB (out of 32 GB RAM on the machine) and it seems to be MUCH better.

The default seems to be "all RAM less 1 GB", which doesn't seem to free memory fast enough to avoid swapping.

We are sometimes reading large files of some GB in size, which probably filled up the arc cache way too much and without any practical use.

Maybe I test with L2ARC on ram disk for the database, as the db has very low volume write access (without critical data) and high read access requirements.

Erichans · Mar 4, 2022

I cannot imagine where that setting (="the arc size was set to 0") would have an actual use case. The ARC is designed to gobble up RAM quite greedily and to free it speedily when required; a large amount of RAM is essential for a swift working ZFS. That speed, although fast, is not equal to the speed of allocating a chunk of memory out of freely available memory: your use case seems to hit that particular "vulnerability". Taking action to limit the greediness of the ARC seems apropriate.

The ARC tries to keep the most likely needed info in memory; it uses an MFU (Most Frequently Used list) and an MRU (Most Recently Used list); you'll see them after the "ARC:" when using top(1). Besides the max value you brought down from the default setting there are other tunables relating to the ARC. Alltough you seem to have alleviated the most stringent limiting factor, you are limiting the default behaviour by quite a lot; your currrent setting may or may not be optimal. When your ARC and your DB requirements can be met by the 32GB RAM, I think there is little to be gained by adding an (expensive) L2ARC, ~~deleted~~**

For further statistical information I suggest you have a look at sysutils/zfs-stats; zfs-stats -E will give relevant data about a.o. the hit & miss ratio of the MFU & MRU lists. I suspect that when you go with the default setting ("all RAM less 1 GB") you'll see the relevant ratio's deterioating. This might give some enlightening info about the functioning of ZFS under your workload when changing tunables.

The following might also be useful:

Using zpool iostat to monitor pool performance and health by Klara Systems - October 29, 2020
Tuning OpenZFS by Allan Jude and Michael W. Lucas - www.usenix.org: SYSADMIN, WINTER 2016 VOL. 41 , NO. 4
Advanced ZFS (book & ebook); specifically Chapter 7: Caches - see FreeBSD Development: Books, Papers, Slides

Special use cases, like yours apparently, require and warrent changing ZFS tunables, but*:

Understand how the ARC behaves before you start fiddling with it, however.

___
* Advanced ZFS, p. 123 - in No. 3 in the above list.
** Edit: deleted the remark about write issues; if you have "to much writing", you should look into adding a SLOG, if adding more RAM will not address the problem.

gpw928 · Mar 4, 2022

I'm not an expert on the subject of ARCs but have been reading a bit, lately.
L2ARC works really well with a random read-heavy workload.
However, there's a significant overhead in the ARC for pointers to the L2ARC.
So creating a huge L2ARC will diminish the usable entries in your ARC.

As Erichans suggests, do the reading. Here's another from Klara Systems.

Cath O'Deray · Mar 4, 2022

Erichans said:
I cannot imagine where that setting (="the arc size was set to 0") …

I read it colloquially, as meaning vfs.zfs.arc.max: 0 (the default, if I'm not mistaken).

Code:

% sysctl vfs.zfs.arc.
vfs.zfs.arc.prune_task_threads: 1
vfs.zfs.arc.evict_batch_limit: 10
vfs.zfs.arc.eviction_pct: 200
vfs.zfs.arc.dnode_reduce_percent: 10
vfs.zfs.arc.dnode_limit_percent: 10
vfs.zfs.arc.dnode_limit: 0
vfs.zfs.arc.sys_free: 0
vfs.zfs.arc.lotsfree_percent: 10
vfs.zfs.arc.min_prescient_prefetch_ms: 0
vfs.zfs.arc.min_prefetch_ms: 0
vfs.zfs.arc.average_blocksize: 8192
vfs.zfs.arc.p_min_shift: 0
vfs.zfs.arc.pc_percent: 0
vfs.zfs.arc.shrink_shift: 0
vfs.zfs.arc.p_dampener_disable: 1
vfs.zfs.arc.grow_retry: 0
vfs.zfs.arc.meta_strategy: 1
vfs.zfs.arc.meta_adjust_restarts: 4096
vfs.zfs.arc.meta_prune: 10000
vfs.zfs.arc.meta_min: 0
vfs.zfs.arc.meta_limit_percent: 75
vfs.zfs.arc.meta_limit: 0
vfs.zfs.arc.max: 0
vfs.zfs.arc.min: 0
%

mack3457 · Mar 4, 2022

Yes, arc_max = arc_min = 0 was the default in vfs.zfs.arc.*.

We sometimes had swap use of 15 GB and more, even though I couldn't see processes using so much memory additionaly to the 32 GB RAM, so I suspect ZFS to not have freed memory fast enough.

Thanks for the links, looks interesting.

Erichans · Mar 4, 2022

mack3457 said:
So, no change in ARC size (min and max = 0).

grahamperrin said:
I read it colloquially, as meaning vfs.zfs.arc.max: 0 (the default, if I'm not mistaken).

Well, I read mack3457's statement rather literally,* as for example in /boot/loader.conf:

Code:

vfs.zfs.arc_max=”0”
vfs.zfs.arc_min=”0”

So, just like mack3457 mentions:

mack3457 said:
Yes, arc_max = arc_min = 0 was the default in vfs.zfs.arc.*.

That "default" baffles me.

In 20.6.1. Tuning of the Handbook it mentions about the default value for vfs.zfs.arc_max:

vfs.zfs.arc_max - Upper size of the ARC. The default is all RAM but 1 GB, or 5/8 of all RAM, whichever is more. Use a lower value if the system runs any other daemons or processes that may require memory. Adjust this value at runtime with sysctl(8) and set it in /boot/loader.conf or /etc/sysctl.conf.

An example mentioned* for a 32GB system (Advanced ZFS, Restricting the ARC size, p. 129):

Here we set an upper limit of 20 GB in /boot/loader.conf
vfs.zfs.arc_max=”21474836480”

grahamperrin said:
Code:

% sysctl vfs.zfs.arc. <snip> vfs.zfs.arc.max: 0 vfs.zfs.arc.min: 0

If I read this correctly, my imagination has failed me.

___
* I think OpenZFS now uses a connecting "dot" instead of a connecting underscore:

Code:

vfs.zfs.arc.max=”0”
vfs.zfs.arc.min=”0”

mack3457 · Mar 4, 2022

I found "0" to be documented as being the default in Linux and Oracle documentations == 50% or 75% or 1 GB less than RAM.

Erichans · Mar 4, 2022

Thanks. In relation with FreeBSD ZFS I did not find any mention of that "0" is the default or would somehow be an appropriate setting. I'd like to know if I have been missing some peculiarity. I can't really speak to any Linux (originally ZFS there was labelled as ZOL) default values but my guess would be that since OpenZFS, as a functioning centralized repository and a guiding initiative, has become the reference, instead of (once Sun's OS, Solaris, originally and later) illumos, ~~ZOL~~ OpenZFS on Linux would be in line with OpenZFS; as (ZFS in) FreeBSD is.

As for Oracle, I wouldn't rely on anything as specific as tunables for guidance for current best ZFS practices on FreeBSD*. Even for principles and as general information I would only rely on Oracle's ZFS documentation, after checking that with the (Open)ZFS documentation applicable to FreeBSD, and FreeBSD's own documentation. It has been long since Oracle ZFS has taken its own closed source path.

Besides the information mentioned, especially the ZFS advanced book, I would also have a look at the OpenZFS documentation:

System Administration
(and perhaps even Developer resources)

Following System Administration --> 3.1.7 Performance --> Home » Performance and Tuning » Workload Tuning I find no mention of either vfs.zfs.arc_max or vfs.zfs.arc.max; making tuning for that a higly specific niche.

___
* Edit: From Wikipedia-ZFS:

According to Matt Ahrens, one of the main architects of ZFS, over 50% of the original OpenSolaris ZFS code has been replaced in OpenZFS with community contributions as of 2019, making “Oracle ZFS” and “OpenZFS” politically and technologically incompatible.

Cath O'Deray · Mar 5, 2022

grahamperrin said:
FreeBSD bug {link removed}

Erichans said:
… I think OpenZFS now uses a connecting "dot" instead of a connecting underscore: …

Code:

% sysctl vfs.zfs.arc_max
vfs.zfs.arc_max: 0
% sysctl vfs.zfs.arc.max
vfs.zfs.arc.max: 0
%

– compare with the spoiler (above) from the same computer.

mack3457 said:
I found "0" to be documented as being the default in Linux and Oracle documentations == 50% or 75% or 1 GB less than RAM.

From <{link removed}>:

… Under Linux, half of system memory will be used as the limit. …

– then there's the statement about FreeBSD.

Erichans · Mar 5, 2022

I find those parts from the Handbook & the Advanced ZFS book versus the OpenZFS webpage with zfs_arc_max=0B (ulong) not very clear.

It could be that what is mentioned as zfs_arc_max=0B (ulong) (with combining underscores) is the source code setting which you can only set when compiling. That could be different from the tunable:

Code:

% sysctl vfs.zfs.arc.max

So that could be two different entry points to the same property.

grahamperrin, (I don't have a ZFS install ready) could you check with an OpenZFS install (I presume you are on 13.0 or later) after commenting out any entries:

Code:

vfs.zfs.arc.max=
vfs.zfs.arc.min=

in /boot/loader.conf or in /etc/sysctl.conf

After a reboot, what do you get from # sysctl vfs.zfs.arc.

Erichans · Mar 5, 2022

OpenZFS - Home » Performance and Tuning » Module Parameters zfs_arc_max:

zfs_arc_max
Maximum size of ARC in bytes. If set to 0 then the maximum ARC size is set to 1/2 of system RAM.

Click to expand...
OpenZFS - Home » Man Pages » Devices and Special Files (4) » zfs.4 zfs_arc_max:

zfs_arc_max=0B (ulong)
Max size of ARC in bytes. If 0, then the max size of ARC is determined by the amount of system memory installed. Under Linux, half of system memory will be used as the limit. Under FreeBSD, the larger of all_system_memory - 1GB and 5/8 × all_system_memory will be used as the limit. This value must be at least 67108864B (64MB).

Click to expand...

Same website, same parameter (character-by-character): which is it?

Cath O'Deray · Mar 5, 2022

Sorry, I could have been clearer.

Erichans said:
… after commenting out any entries: …

The values above were obtained with no entries in effect.