Solved: Disk reads extremely slow after some uptime, scrub makes system unresponsive

Hi everyone, I just registered on the forum to ask for help with my FreeBSD NAS.
Hardware is an HP ProLiant Gen8 MicroServer with two 3 TB WD Red drives.
The system runs FreeBSD 12.0-RELEASE-p4 with the default kernel.

gpart show output:
Code:
gpart show
=>        40  5860533088  ada0  GPT  (2.7T)
          40        2008        - free -  (1.0M)
        2048        1024     1  freebsd-boot  (512K)
        3072        1024        - free -  (512K)
        4096     4194304     2  freebsd-swap  (2.0G)
     4198400    31457280     3  freebsd-zfs  (15G)
    35655680  5824671744     4  freebsd-zfs  (2.7T)
  5860327424      205704        - free -  (100M)

=>        40  5860533088  ada1  GPT  (2.7T)
          40        2008        - free -  (1.0M)
        2048        1024     1  freebsd-boot  (512K)
        3072        1024        - free -  (512K)
        4096     4194304     2  freebsd-swap  (2.0G)
     4198400    31457280     3  freebsd-zfs  (15G)
    35655680  5824671744     4  freebsd-zfs  (2.7T)
  5860327424      205704        - free -  (100M)

zpool list output:
Code:
zpool list
NAME     SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
zroot   14.9G  2.01G  12.9G        -         -    29%    13%  1.00x  ONLINE  -
zstore  2.70T  1.33T  1.38T        -         -    18%    49%  1.00x  ONLINE  -

I have configured periodic scrubs on the two pools, starting every Tuesday morning when there is no activity on the NAS. A scrub typically takes around 2-4 minutes for zroot and 3-4 hours for zstore. Everything had worked fine for almost a couple of years since the first install.
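(For anyone wanting to replicate the schedule: a weekly scrub can be launched from root's crontab with entries along these lines; this is just a sketch, and the day/time and exact mechanism are whatever suits your setup.)
Code:
# root's crontab (crontab -e as root): scrub both pools every Tuesday at 10:00
0  10  *  *  2  /sbin/zpool scrub zroot
0  10  *  *  2  /sbin/zpool scrub zstore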

Starting some time after the upgrade to 12.0, every now and then (not necessarily the first time it runs) a scrub will "fail": it performs at extremely slow speeds and almost completely hangs the system while running.
For example, SSH access via PuTTY takes around 20-30 seconds between entering the password and getting a prompt, and accessing the Samba share from a Windows machine takes 2-3 minutes.

gstat during these scrubs will look like this:
[Attached screenshot: Annotazione 2019-06-04 221555.png]


When this happens, stopping the scrub will instantly make the system responsive again, and for example will let me transfer a file via SMB at full gigabit network speed, both from and to the NAS. Rebooting and restarting the scrub will make it go at the "normal" speed.

I honestly don't have enough knowledge of FreeBSD and ZFS to investigate this further myself, and I would like to know how I could rule out a hardware issue (maybe the controller, or the cables). I tried switching the disks to the unused bays, but to no avail. SMART info looks fine.
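(For reference, SMART data can be read with sysutils/smartmontools, e.g. on the two drives shown in the gpart output above:)
Code:
smartctl -a /dev/ada0
smartctl -a /dev/ada1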

Any help would be greatly appreciated. Thanks!
 
As you seem to have both your pools on the same disks, you are actually running two scrubs in parallel.
Even though a scrub normally "pauses" for a moment when you access, e.g., your zstore pool, I guess the scrub on zroot will continue to keep the same drives busy.
I would:
  1. start scrubbing the pools on different days (or at different times of day)
  2. or at least adjust vfs.zfs.scrub_delay and/or vfs.zfs.scan_idle (info here; see the sketch just below)
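Something along these lines, purely as an illustration (check the current values on your system first; the numbers below are examples, not recommendations):
Code:
# show the current values and their descriptions
sysctl -d vfs.zfs.scrub_delay vfs.zfs.scan_idle
sysctl vfs.zfs.scrub_delay vfs.zfs.scan_idle

# example only: make the scrub back off harder when other I/O shows up
sysctl vfs.zfs.scrub_delay=8
sysctl vfs.zfs.scan_idle=100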
Personally I've never had a need to tweak scrubbing, but I'd never put two pools on the same disks.
 
Yeah I think having two pools on the same disks and scrubbing them both at the same time is the main cause. I'd scrub them consecutively, waiting for one to finish before starting the other.
 
While I agree with the theory put forth by k.jacker and SirDice, and while that theory even makes intuitive sense (long seeks caused by two operation streams, each with high locality but in different areas of the disk), the data in the gstat printout above does not agree with it. It shows that the only activity is on ada[01]p4, which is the second pool. So at the time of the slowdown, there seems to be only one scrub active as far as the disk is concerned.

Still, not starting two scrubs at once seems like a good starting point for investigation and repair.
 
Hi there, thanks to you both for the input, but I honestly don't think that is the issue. The zroot pool is very small (2 GB used out of a total of 15), so it takes just a couple of minutes to scrub.

In a normal situation, the scrub of the larger pool starts out slower and then gets up to full speed (usually around 100 MB/s, IIRC) once the smaller pool finishes scrubbing. The system stays responsive (albeit a little slower during transfers, of course) the whole time.

When I have the problem, both scrubs start a lot slower, and even when the smaller one is done (after maybe 20 minutes instead of the usual 2) the bigger pool does not speed up at all, but proceeds at speeds often under 10 KB/s; if I forget to check, I can find it still scrubbing after two days instead of the usual 3-4 hours. If I stop the scrub and relaunch it alone, without scrubbing the other pool, it is still slow, and the system still unresponsive, until a reboot.
 
That confirms what ralphbsz pointed out.

Could it be that you are low on RAM? Maybe the system starts to swap while scrubbing both pools simultaneously.
In that case, I'd limit the ARC size for further testing by setting vfs.zfs.arc_max to 3/4 of the RAM installed in your system, expressed in bytes (GiB of RAM multiplied by 1073741824).
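Applying that formula to a 2 GiB machine, purely as arithmetic:
Code:
# 2 GiB * 3/4 = 1.5 GiB = 1.5 * 1073741824 = 1610612736 bytes
vfs.zfs.arc_max=1610612736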
 
Hi k.jacker, last evening while experiencing the issue I checked with htop and had around 600 MB of RAM in use out of 2 GB installed.
It's true that some swap was also in use, around 50 MB if I remember correctly.

However, I have already tried turning swap off with swapoff -a when the issue appears, but with no visible difference.
 
That is VERY little RAM, and disabling swap isn't the right way to solve your problem.

Have you tried to reduce arc size?
If you don't know what that means, it's not a shame to ask for more details.

Anyway, I'd try limiting the ARC to 1 GiB and disabling prefetch.
To do that, add the following to /etc/sysctl.conf and reboot.
Code:
vfs.zfs.arc_max=1073741824
vfs.zfs.prefetch_disable=1
 
Anyway, I'd try limiting the ARC to 1 GiB and disabling prefetch.
Prefetching is disabled by default on systems with less than 4 GB, so with that little memory it should already be off.
/var/run/dmesg.boot said:
ZFS NOTICE: Prefetch is disabled by default if less than 4GB of RAM is present;
to enable, add "vfs.zfs.prefetch_disable=0" to /boot/loader.conf.
 
Hi there, I tried adding both lines to /etc/sysctl.conf and rebooted the system.
I will also try changing the scheduling of the two scrubs so they fire on different days of the week.

The issue is quite erratic, so it may take a couple of weeks or more before it happens again, but I'll update this thread when it does.

Just for info, today I launched a scrub with the NAS in "nominal" conditions, and the larger pool completed at an average of 111 MB/s in a couple of hours. I will also look into how to read the output of zfs-stats -a (I'm attaching the output).
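(For anyone following along: zfs-stats comes from the sysutils/zfs-stats port/package, e.g.:)
Code:
pkg install zfs-stats
zfs-stats -a > zfs-stats.txt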

Thanks for the input for now!
 

Attachments

  • zfs-stats.txt (11.2 KB)
Those should be set in /boot/loader.conf, not /etc/sysctl.conf. They need to be set before the kernel and ZFS modules are loaded.
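In other words, something like this (same values as suggested above, just in the right file):
Code:
# /boot/loader.conf -- loader tunables, read before the kernel and the ZFS module load
vfs.zfs.arc_max=1073741824
vfs.zfs.prefetch_disable=1
After a reboot, the effective value can be checked with sysctl vfs.zfs.arc_max.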
 
So, this happened (obviously) again this morning.
Of course, I totally forgot to change the scheduling so the two scrubs would not start together, so they did. They started this morning at 10 AM; I came home at 6 PM to find that the smaller pool had completed its scrub in 23 minutes while the larger one was still under 5%. The system was unresponsive (it took around 30 seconds to get a shell prompt via PuTTY).

So, I stopped the ongoing scrub: zpool scrub -s zstore
The system became responsive almost immediately. gstat output:
[Attached screenshot: Annotazione 2019-06-11 190847.png]


Then I relaunched the scrub on the small pool: zpool scrub zroot, keeping an eye on gstat:
[Attached screenshot: Annotazione 2019-06-11 191042.png]


The scrub completed in 45 seconds, as per the zpool status output:

Code:
  pool: zroot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:45 with 0 errors on Tue Jun 11 19:10:45 2019
config:

        NAME            STATE     READ WRITE CKSUM
        zroot           ONLINE       0     0     0
          mirror-0      ONLINE       0     0     0
            gpt/zroot0  ONLINE       0     0     0
            gpt/zroot1  ONLINE       0     0     0

errors: No known data errors

Then, after the smaller one ended, I relaunched the scrub on the larger pool: zpool scrub zstore
After almost 10 minutes, this is an average gstat screen:
[Attached screenshot: Annotazione 2019-06-11 192058.png]


Notice the difference in the kBps column: barely 1% of the speed seen on the smaller pool.
Current zpool status output for the larger pool (the 63.1G figure has not changed in the last 5-6 minutes, and the 103M/s figure keeps decreasing):
Code:
  pool: zstore
 state: ONLINE
  scan: scrub in progress since Tue Jun 11 19:13:29 2019
        63.1G scanned at 103M/s, 516K issued at 838/s, 1.35T total
        0 repaired, 0.00% done, no estimated completion time
config:

        NAME             STATE     READ WRITE CKSUM
        zstore           ONLINE       0     0     0
          mirror-0       ONLINE       0     0     0
            gpt/zstore0  ONLINE       0     0     0
            gpt/zstore1  ONLINE       0     0     0

errors: No known data errors

Finally, I am also attaching the full output of zfs-stats -a while the larger scrub is still going, and a screenshot from htop showing RAM and swap usage:
[Attached screenshot: Annotazione 2019-06-11 193522.png]


If I now stopped the scrub, rebooted the system, and relaunched both scrubs one after the other, I am confident both would run at full speed.

I know this is a lot of stuff, so I am very thankful for even taking the time to read through this :)
 

Attachments

  • zfsstats.txt (11.2 KB)
ZFS requires a lot of RAM. The copy-on-write architecture needs a large memory cache in order to perform reasonably.
Also, the bigger the file systems you have, the more RAM you need. I have seen 4 GB recommended as a bare minimum. Your system has 2.7 TB of storage, so I assume you need even more than 4 GB.

One more thing: make sure you are not using deduplication. It is a very memory-hungry feature, and you only have 2 GB of memory.
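(For the record, dedup status is quick to check; the DEDUP column in the zpool list output above already shows 1.00x, i.e. not in use:)
Code:
zpool get dedupratio zroot zstore
zfs get -r dedup zstore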

The recommendation for your system would probably be to use UFS (and you could still use mirroring for redundancy via gmirror).
Adding more memory would also be a good option.
 
I'm running ZFS with about 5 or so TB of disk space, on a 3GB 32-bit machine, with none of these problems. While we always hear that ZFS can use a lot of memory, I wouldn't jump to the conclusion that this much memory is strictly necessary.

And I don't think the arguments based on copy-on-write and dedup apply here: This system is currently not doing much file creation IO, so little copying or looking up duplicates is required; scrubbing is sort of a "read-only" workload.

The interesting thing here is the performance data from gstat for the scrub of the large pool. Each of the two disk drives is doing roughly 200 IOs per second. That makes sense; that's roughly the seek rate for a modern disk drive. Yet they are only doing about 1500 to 1900 kB per second. That means each IO is on average only 7.5 to 9.5 kB in size. That's amazingly small. Contrast that with the gstat output from the scrub of the small pool: again roughly 200 IOps (actually a few more than on the large pool, which can be explained by short-stroking), but the IO bandwidth is about 100 MB/s, meaning an average IO size of about 500 kB. That makes more sense.
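The back-of-the-envelope arithmetic, using the rounded figures from the gstat screenshots:
Code:
# average IO size = bandwidth / IOPS
# large pool: 1500-1900 kB/s / ~200 IOPS  ->  ~7.5-9.5 kB per IO
# small pool:  ~100000 kB/s  / ~200 IOPS  ->  ~500 kB per IO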

Can anyone come up with an idea of why scrubbing the large pool is being done with such small IOs? I don't know how ZFS scrub works (whether it proceeds at the block or file level, and how it orders operations). One interesting bit of input would be to look at the file size distribution on the large pool. Perhaps it has a very large number of tiny files, which might explain the abundance of seeks and the tiny IO size.
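If someone wants to gather that, here is a rough sketch (assuming the pool is mounted at /zstore; the path and the power-of-two bucketing are just an illustration, not a polished tool):
Code:
# count files per power-of-two size bucket
find /zstore -xdev -type f -print0 | xargs -0 stat -f '%z' | \
    awk '{ b = 1; while (b < $1) b *= 2; count[b]++ }
         END { for (b in count) print b, count[b] }' | sort -n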
 
Hey everyone, I think I found a possible cause starting from here and here: it seems that a new sequential scrub algorithm was introduced in recent versions of ZFS.
Quoting from the second link:
The metadata scan reads through the structure of the pool and gathers an in-memory queue of I/Os, sorted by size and offset on disk. The issuing phase will then issue the scrub I/Os as sequentially as possible, greatly improving performance.
so... what if the in-memory queue of I/Os, in some cases, does not find enough RAM available?

Short answer: starting from the situation described in yesterday's post and noted by ralphbsz in his last reply, I stopped the slooow scrub on the larger pool once again, disabled the new scrubbing algorithm, relaunched the scrub and guess what? Same numbers as the smaller pool.
Then I tried with the new algorithm but with scrub prefetch disabled: still high speed.
The problem reappeared as soon as I re-enabled both scrub prefetch and the new algorithm together.

So, it seems that either
Code:
        vfs.zfs.zfs_scan_legacy                 0
        vfs.zfs.no_scrub_prefetch               1
or
Code:
        vfs.zfs.zfs_scan_legacy                 1
        vfs.zfs.no_scrub_prefetch               0
will do the trick.
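For anyone who wants to try the same thing: these are runtime sysctls on 12.0, so they can be flipped on the fly and then persisted in /etc/sysctl.conf (names may differ on other ZFS versions):
Code:
# option A: keep the new sequential algorithm, but disable scrub prefetch
sysctl vfs.zfs.zfs_scan_legacy=0
sysctl vfs.zfs.no_scrub_prefetch=1

# option B: fall back to the legacy scan algorithm
#sysctl vfs.zfs.zfs_scan_legacy=1
#sysctl vfs.zfs.no_scrub_prefetch=0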

Of course, I will keep the thread updated over the next few weeks while I keep an eye out for any possible misbehaviour.
For now, thanks everyone for the answers and tips, and if you have any further insight to share, you're welcome to.

edited to correct the parameters
 
Wow ... something actually works and makes sense. I hope some ZFS developers are reading this thread: they should really make the two settings you describe into the default.

By the way, after ~25 years of working on storage, if I had to summarize all my knowledge in one sentence, it would be: large and sequential IO good, small and random IO bad.
 
After a couple of weeks and a few scheduled scrubs that went very well, I'm marking this as solved. I also made a small edit to the tunables in my last post.
Thanks again everyone, hope this is useful to someone else in the future :)
 
As you seem to have both your pools on the same disks, you are actually running two scrubs in parallel.

Scrubbing the pools in parallel is not the problem. You can do that; it will be ugly slow (due to the necessary disk seeks), but with a very small second pool, as in this case, it should do no harm.

The problem is a different one: a scrub usually runs with high priority, slowing other access to the disks. Therefore, when there is normal disk access, ZFS will pause the scrub for a few clock ticks (but it seems only write accesses are recognized, because the scrub itself creates read accesses).
The problem here is: when the system wants to access its root disks (and it needs to access the root disks to be responsive), ZFS will not recognize these accesses as going to the same disks that are currently being scrubbed, because they belong to a different pool.

So, in such a configuration it should be expected that normal OS operation might get slow during scrubs, and careful tuning is necessary to work around this. (Alternative: put the OS onto some further disk, or onto a small SSD that could then also be used as L2ARC, reducing the demand on core memory.)
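(Purely as an illustration of that alternative, with a hypothetical extra SSD showing up as ada2:)
Code:
# hypothetical: dedicate a GPT-labelled partition on an added SSD (ada2) as L2ARC for zstore
gpart create -s gpt ada2
gpart add -t freebsd-zfs -l l2arc0 ada2
zpool add zstore cache gpt/l2arc0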
If we look at the gstat in posting #1, that looks quite nice: both disks are pulling 140 MB per second. But the queues (column 1) are very full, and the OS will have a hard time pushing its read requests (for /usr/bin or whatever) in between.

This is more or less the background of why one should not put multiple pools on one disk (or should do it only with caution); it is independent of the more severe problem which the OP seems to have now solved.
 
Code:
        vfs.zfs.zfs_scan_legacy                 0
        vfs.zfs.no_scrub_prefetch               1

Just to report that, after over 2 years of running without a hitch, my FreeBSD 11-based NAS server started showing the exact same symptoms as vemp's. Thanks to his great work chasing down and documenting the solution, I was able to solve my issue in exactly the manner described above.

I agree with ralphbsz, this should really be the default.
 
Hmm, curious: would the above settings possibly affect resilver operations too? I don't see a "no_resilver_prefetch" sysctl tunable, but perhaps it would help to set "zfs_scan_legacy" to 1 in that case...?
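One way to see what the tunables themselves claim to cover is to query their descriptions from the running kernel (just the standard sysctl -d; no resilver-specific knob is assumed):
Code:
sysctl -d vfs.zfs.zfs_scan_legacy
sysctl -d vfs.zfs.no_scrub_prefetch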

We had some severe problems this winter with a resilver causing NFS I/O to grind to a virtual halt (long hiccups) for clients while resilvering a ~120 TB pool (consisting of 12 10TB drives configured as two RAIDZ2(4+2) vdevs). It seemed to cause metadata in the ARC to get evicted to make room for the resilvered data, and then when the NFS clients wanted to access metadata it had to go to the spinning disks a lot, so we spent a lot of effort trying to tune it to minimize the effects, but it was never really solved (the rebuild took 10 non-fun days)...

(FreeBSD 11.3; the machine had 256 GB of RAM, but that was apparently not enough. It now has 512 GB of RAM and we'll see how things behave next time.)

In ZoL there is a tunable that lets you set a minimum limit on the metadata in the ARC, which might have made this behave better, but that one isn't in FreeBSD's ZFS implementation.
 