ZFS disk read extremely slow after some uptime, scrub makes system unresponsive

vemp


Hi everyone, I just registered on the forum to ask for help with my FreeBSD NAS.
The hardware is an HP ProLiant Gen8 MicroServer with two 3 TB WD Red drives.
The system is FreeBSD 12.0-RELEASE-p4 with the default kernel.

gpart show output:
Code:
gpart show
=>        40  5860533088  ada0  GPT  (2.7T)
          40        2008        - free -  (1.0M)
        2048        1024     1  freebsd-boot  (512K)
        3072        1024        - free -  (512K)
        4096     4194304     2  freebsd-swap  (2.0G)
     4198400    31457280     3  freebsd-zfs  (15G)
    35655680  5824671744     4  freebsd-zfs  (2.7T)
  5860327424      205704        - free -  (100M)

=>        40  5860533088  ada1  GPT  (2.7T)
          40        2008        - free -  (1.0M)
        2048        1024     1  freebsd-boot  (512K)
        3072        1024        - free -  (512K)
        4096     4194304     2  freebsd-swap  (2.0G)
     4198400    31457280     3  freebsd-zfs  (15G)
    35655680  5824671744     4  freebsd-zfs  (2.7T)
  5860327424      205704        - free -  (100M)
zpool list output:
Code:
zpool list
NAME     SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
zroot   14.9G  2.01G  12.9G        -         -    29%    13%  1.00x  ONLINE  -
zstore  2.70T  1.33T  1.38T        -         -    18%    49%  1.00x  ONLINE  -
I have configured periodic scrubs of the two pools, starting every Tuesday morning during a period of no activity on the NAS. A scrub typically takes around 2-4 minutes for zroot and 3-4 hours for zstore. Everything worked fine for almost two years after the initial install.

Beginning some time after the upgrade to 12.0, every now and then (not necessarily on the first run) a scrub will "fail": it performs at extremely slow speeds and almost completely hangs the system while running.
For example, SSH access via PuTTY takes around 20-30 seconds between entering the password and getting a prompt, and accessing the Samba share from a Windows machine takes 2-3 minutes.

gstat during these scrubs will look like this:
[Screenshot: gstat output during a slow scrub (Annotazione 2019-06-04 221555.png)]


When this happens, stopping the scrub instantly makes the system responsive again; for example, I can then transfer a file via SMB at full gigabit network speed, both to and from the NAS. Rebooting and restarting the scrub makes it run at the "normal" speed.

I honestly lack the FreeBSD and ZFS knowledge required to investigate this further, and would like to know how to rule out a hardware issue (perhaps with the controller or the cables). I tried moving the disks to the unused bays, to no avail. The SMART info looks fine.

Any help would be greatly appreciated. Thanks!
 

k.jacker


As you have both your pools on the same disks, you are actually running two scrubs in parallel.
Even though a scrub normally "pauses" for a moment when you access, say, your zstore pool, I suspect the scrub on zroot will keep the same drives busy.
I would
  1. start scrubbing the pools on different days (or at different times of day)
  2. or at least adjust vfs.zfs.scrub_delay and/or vfs.zfs.scan_idle
Personally I've never had a need to tweak scrubbing, but I'd never put two pools on the same disks.
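For option 1, staggered cron entries are one way to do it. A sketch for /etc/crontab; the schedule (days and times) is hypothetical and up to you:

```shell
# Hypothetical /etc/crontab entries: scrub the small pool early on
# Tuesday and the large pool early on Wednesday so they never overlap.
# minute hour mday month wday who  command
0        3    *    *     2    root /sbin/zpool scrub zroot
0        3    *    *     3    root /sbin/zpool scrub zstore
```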
 

SirDice

Administrator

Yeah, I think having two pools on the same disks and scrubbing them both at the same time is the main cause. I'd scrub them consecutively, waiting for one to finish before starting the other.
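A minimal sketch of running them back to back, assuming `zpool status` prints "scrub in progress" while a scrub runs (the poll interval is arbitrary):

```shell
#!/bin/sh
# Scrub the small pool first, then wait for it to finish before
# starting the scrub of the large pool.
zpool scrub zroot
while zpool status zroot | grep -q "scrub in progress"; do
    sleep 60  # check once a minute
done
zpool scrub zstore
```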
 

ralphbsz


While I agree with the theory put forth by k.jacker and SirDice, and while it even makes intuitive sense (long seeks caused by two operation streams, each with high locality but in different areas of the disk), the data in the gstat printout above does not agree with it. It shows that the only activity is on ada[01]p4, which is the second pool. So at the time of the slowdown there seems to be only one scrub active, as far as the disks are concerned.

Still, not starting two scrubs at once seems like a good starting point for investigation and repair.
 

vemp


Hi there, thanks both for the input, but I honestly don't think that is the issue. The zroot pool is very small (2 GB used out of a total of 15), so it takes just a couple of minutes to scrub.

In a normal situation, the scrub of the larger pool starts slower and then gets up to full speed (usually around 100 MB/s, IIRC) once the smaller pool finishes scrubbing. The system stays responsive (albeit a little slower during transfers, of course) the whole time.

When I have the problem, both scrubs start much slower, and even when the smaller one is done (after maybe 20 minutes instead of the usual 2) the bigger pool does not speed up at all, proceeding at speeds often under 10 KB/s. If I forget to check, I may find it still scrubbing after two days instead of the usual 3-4 hours. If I stop the scrub and relaunch it alone, without scrubbing the other pool, the speed is still slow, and the system unresponsive, until a reboot.
 

k.jacker


That confirms what ralphbsz pointed out.

Could it be that you are low on RAM? Maybe the system starts to swap while scrubbing both pools simultaneously.
In that case, I'd limit the ARC size for further testing by setting vfs.zfs.arc_max to 3/4 of the RAM installed in your system, in GiB, multiplied by 1073741824 (bytes per GiB).
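Spelled out for this machine (a quick sanity check of the arithmetic, not an official sizing rule):

```python
GIB = 1073741824  # bytes per GiB

def suggested_arc_max(ram_gib):
    """k.jacker's rule of thumb: cap the ARC at 3/4 of installed RAM."""
    return ram_gib * 3 * GIB // 4

# For the 2 GiB machine in this thread:
print(suggested_arc_max(2))  # 1610612736 bytes, i.e. 1.5 GiB
```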
 

vemp


Hi k.jacker, last evening while experiencing the issue I checked with htop and had around 600 MB of RAM in use out of the 2 GB installed.
It's true that I also had some swap in use, around 50 MB if I remember correctly.

However, I already tried disabling swap when the issue appears, using swapoff -a, with no visible difference.
 

k.jacker


That is VERY little RAM, and disabling swap isn't the right way to solve your problem.

Have you tried reducing the ARC size?
If you don't know what that means, it's no shame to ask for more details.

Anyway, I'd try limiting the ARC to 1 GiB and disabling prefetch.
To do that, add the following to /etc/sysctl.conf and reboot.
Code:
vfs.zfs.arc_max=1073741824
vfs.zfs.prefetch_disable=1
 

Maxnix


k.jacker said:
Anyway, I'd try limiting arc to 1GiB and disable prefetch.
Prefetching is disabled by default on systems with less than 4 GB of RAM, so with so little memory it should already be off.
/var/run/dmesg.boot said:
ZFS NOTICE: Prefetch is disabled by default if less than 4GB of RAM is present;
to enable, add "vfs.zfs.prefetch_disable=0" to /boot/loader.conf.
 

vemp


Hi there, I tried adding both lines to /etc/sysctl.conf and rebooted the system.
I will also try changing the scheduling of the two scrubs so they fire on different days of the week.

The issue is quite erratic, so it may take a couple of weeks or more before it happens again, but I'll update this thread when it does.

Just for info: today I launched a scrub with the NAS in "nominal" conditions, and the larger pool completed at an average of 111 MB/s in a couple of hours. I will also look into how to read the output of zfs-stats -a (I'm attaching the output).

Thanks for the input for now!
 

phoenix

Administrator

Those should be set in /boot/loader.conf, not /etc/sysctl.conf. They need to be set before the kernel and ZFS modules are loaded.
 

vemp


So, this happened (obviously) again this morning.
Of course, I totally forgot to change the scheduled jobs so the two scrubs wouldn't start together, so they did. They started this morning at 10 AM; I came home at 6 PM to find that the smaller pool had completed its scrub in 23 minutes while the larger one was still under 5%. The system was unresponsive (it took around 30 seconds to get a working shell via PuTTY).

So, I stopped the ongoing scrub: zpool scrub -s zstore
The system became responsive almost immediately. gstat output:
[Screenshot: gstat output after stopping the scrub (Annotazione 2019-06-11 190847.png)]


Then I relaunched the scrub on the small pool, zpool scrub zroot, keeping an eye on gstat:
[Screenshot: gstat output during the zroot scrub (Annotazione 2019-06-11 191042.png)]


The scrub completed in 45 seconds, as per the zpool status output:

Code:
  pool: zroot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:45 with 0 errors on Tue Jun 11 19:10:45 2019
config:

        NAME            STATE     READ WRITE CKSUM
        zroot           ONLINE       0     0     0
          mirror-0      ONLINE       0     0     0
            gpt/zroot0  ONLINE       0     0     0
            gpt/zroot1  ONLINE       0     0     0

errors: No known data errors
Then, after the smaller one had finished, I relaunched the scrub on the larger pool: zpool scrub zstore
After almost 10 minutes, this is an average gstat screen:
[Screenshot: gstat output during the zstore scrub (Annotazione 2019-06-11 192058.png)]


Notice the difference in the kBps column: almost 1% of the speed seen on the smaller pool.
Current zpool status output for the larger pool (the 63.1G figure has not changed in the last 5-6 minutes, and the 103M/s figure is decreasing):
Code:
  pool: zstore
 state: ONLINE
  scan: scrub in progress since Tue Jun 11 19:13:29 2019
        63.1G scanned at 103M/s, 516K issued at 838/s, 1.35T total
        0 repaired, 0.00% done, no estimated completion time
config:

        NAME             STATE     READ WRITE CKSUM
        zstore           ONLINE       0     0     0
          mirror-0       ONLINE       0     0     0
            gpt/zstore0  ONLINE       0     0     0
            gpt/zstore1  ONLINE       0     0     0

errors: No known data errors
Finally, I am also attaching the full output of zfs-stats -a taken while the larger scrub is still going, and a screenshot from htop showing RAM and swap usage:
[Screenshot: htop showing RAM and swap usage (Annotazione 2019-06-11 193522.png)]


If I now stopped the scrub, rebooted the system, and relaunched both scrubs one after the other, I am confident both would run at full speed.

I know this is a lot of stuff, so I am very thankful for even taking the time to read through this :)
 


roccobaroccoSC


ZFS requires a lot of RAM. The copy-on-write architecture needs a large memory cache to perform reasonably.
Also, the bigger the file systems, the more RAM you need. I have seen 4 GB recommended as a bare minimum; your system has 2.7 TB of storage, so I assume you need even more than that.

One more thing: make sure you are not using deduplication. It is a very memory-hungry feature, and you have only 2 GB of memory.
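Checking that is a one-liner (standard zpool property query; dedup is off unless it was explicitly enabled):

```shell
# Show the dedup setting on both pools; "off" means no dedup
# tables are being kept in RAM.
zpool get dedup zroot zstore
```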

The recommendation for your system would probably be to use UFS (you could still mirror for redundancy via gmirror).
Adding more memory would also be a good option.
 

ralphbsz


I'm running ZFS with about 5 TB of disk space on a 3 GB, 32-bit machine, with none of these problems. While we always hear that ZFS can use a lot of memory, I wouldn't jump to the conclusion that this much memory is strictly necessary.

And I don't think the copy-on-write and dedup arguments apply here: this system is currently not doing much file-creation I/O, so little copying or duplicate lookup is required; scrubbing is essentially a read-only workload.

The interesting thing here is the gstat performance data for the scrub of the large pool. Each of the two disk drives is doing roughly 200 I/Os per second. That makes sense; it's roughly the seek rate of a modern disk drive. Yet they are only moving about 1500 to 1900 kB per second, which means each I/O averages only 7.5 to 9.5 kB. That's amazingly small. Contrast that with the gstat output from the scrub of the small pool: again roughly 200 IOPS (actually a few more than on the large pool, which short-stroking can explain), but the bandwidth is about 100 MB/s, for an average I/O size of about 500 kB. That makes more sense.
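The arithmetic above, spelled out (bandwidth and IOPS figures read off the gstat screenshots, so approximate):

```python
def avg_io_kb(bandwidth_kbps, iops):
    """Average I/O size (kB) implied by a bandwidth/IOPS pair."""
    return bandwidth_kbps / iops

# Large pool: ~1500-1900 kB/s at ~200 IOPS -> tiny 7.5-9.5 kB I/Os.
print(avg_io_kb(1500, 200), avg_io_kb(1900, 200))

# Small pool: ~100 MB/s at ~200 IOPS -> healthy ~500 kB I/Os.
print(avg_io_kb(100_000, 200))
```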

Can anyone come up with an idea of why scrubbing the large pool is being done with such small IOs? I don't know how ZFS scrub works (whether it proceeds at the block or file level, and how it orders operations). One interesting bit of input would be to look at the file size distribution on the large pool. Perhaps it has a very large number of tiny files, which might explain the abundance of seeks and the tiny IO size.
 

vemp


Hey everyone, I think I found a possible cause starting from here and here: it seems that a new sequential scrub algorithm was introduced in recent versions of ZFS.
Quoting from the second link:
The metadata scan reads through the structure of the pool and gathers an in-memory queue of I/Os, sorted by size and offset on disk. The issuing phase will then issue the scrub I/Os as sequentially as possible, greatly improving performance.
So... what if the in-memory queue of I/Os, in some cases, does not find enough RAM available?
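A toy model of that issuing phase (not the actual ZFS code, just the reordering idea): the metadata scan queues block pointers in memory, and the issuing phase sorts them by on-disk offset so the reads become near-sequential. If the queue cannot grow large enough, the sorting happens over tiny batches and the benefit disappears.

```python
import random

def issue_order(queued):
    """Issuing phase (toy model): sort queued I/Os by on-disk offset
    so that the reads become near-sequential."""
    return sorted(queued, key=lambda io: io[0])

# Block pointers as the metadata scan finds them: (offset, size) pairs,
# in effectively random on-disk order.
scan_queue = [(random.randrange(10**9), 128 * 1024) for _ in range(8)]

for offset, size in issue_order(scan_queue):
    pass  # a real scrub would issue the read for (offset, size) here
```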

Short answer: starting from the situation described in yesterday's post and noted by ralphbsz in his last reply, I stopped the slooow scrub on the larger pool once again, disabled the new scrubbing algorithm, relaunched the scrub and, guess what? Same numbers as the smaller pool.
Then I tried with the new algorithm enabled but scrub prefetch disabled: still high speed.
The problem reappeared as soon as I re-enabled both scrub prefetch and the new algorithm.

So, it seems that
Code:
    vfs.zfs.no_scrub_prefetch               1
    vfs.zfs.no_scrub_io                     0
will do the trick.
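The change can be made at runtime and persisted across reboots (assuming it is a runtime-tunable sysctl on 12.0, which the live experiment above suggests; the no_scrub_io value shown is already the default, so the prefetch knob is the one doing the work):

```shell
# Apply immediately:
sysctl vfs.zfs.no_scrub_prefetch=1

# Persist across reboots:
echo 'vfs.zfs.no_scrub_prefetch=1' >> /etc/sysctl.conf
```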

Of course I will keep the thread updated over the next few weeks while I keep an eye out for any further misbehavior.
For now, thanks everyone for the answers and tips, and if you have any further insight to share you're welcome.
 

ralphbsz


Wow ... something that actually works and makes sense. I hope some ZFS developers are reading this thread; they should really make the two settings you describe the default.

By the way, after ~25 years of working on storage, if I had to summarize all my knowledge in one sentence, it would be: large and sequential I/O good, small and random I/O bad.
 