ZFS disk activity every 2 seconds on an idle system

Hi,

Recently I've noticed that my file server makes about half a second of spinning disk access noise about every 2 seconds. Crunch...1....2....crunch....1....2.... etc.

I inspected top and iotop, turned services on and off, etc., trying to figure out what is causing this. Eventually I figured out that a kernel process called `zfskern` is the cause, and it seems to do this on an idle system all day and night.

The system has two mirrored zpools. Both have been scrubbed recently, but neither is currently scrubbing. There are no errors on the zpools.

Are these accesses normal and just what zfs routinely does? I don't recall hearing these noises until recently, but who knows. I did upgrade the zpools recently.

Thanks
 
That’s odd. The typical zfs_txg_timeout setting is 5s.

Do you have a database or some other task that is calling sync() of some kind?
And that really should only apply to writes (a "txg" is a transaction group), so if nothing is writing (and a sync can trigger a write), it should not affect anything.
But I have seen this pattern on occasion, I think related to a periodic job running. So once a week/month I hear/see it for a while, but it eventually stops.
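
If you want to see what your system is actually using, the knob is exposed as a sysctl on FreeBSD:

Code:
# current transaction group commit interval, in seconds (default 5)
sysctl vfs.zfs.txg.timeout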
 
Do you have a database or some other task that is calling sync() of some kind?

Not that I'm aware of. Like I said, I turned off as much as I could to see if I could narrow it down to one specific process, but no.

Using `top -S -m io -o total` I see `zfskern` jump to the top of the process list every couple of seconds, but everything else sits idle.

Code:
last pid: 58797;  load averages:  0.01,  0.05,  0.06                                                         up 5+23:33:31  22:24:05
55 processes:  2 running, 51 sleeping, 2 waiting
CPU:  0.0% user,  0.0% nice,  0.6% system,  0.0% interrupt, 99.4% idle
Mem: 1184M Active, 3397M Inact, 7822M Wired, 88K Buf, 3396M Free
ARC: 5739M Total, 2363M MFU, 2816M MRU, 400K Anon, 60M Header, 494M Other
     4692M Compressed, 6640M Uncompressed, 1.42:1 Ratio
Swap: 2048M Total, 2048M Free

  PID USERNAME     VCSW  IVCSW   READ  WRITE  FAULT  TOTAL PERCENT COMMAND
    6 root          87      0      1     86      0     87 100.00% zfskern
40704 root           0      0      0      0      0      0   0.00% nfsd
    1 root           0      0      0      0      0      0   0.00% init
    2 root         395      0      0      0      0      0   0.00% clock
    3 root           0      0      0      0      0      0   0.00% crypto
...

Running `sync` manually doesn't seem to interfere with the cycle.

This is interesting:
Code:
$ zpool iostat -v 1
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
sea         2.72T   922G      0      0      0      0
  mirror-0  2.72T   922G      0      0      0      0
    ada2        -      -      0      0      0      0
    ada3        -      -      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
zroot       57.5G   172G      0      0      0      0
  mirror-0  57.5G   172G      0      0      0      0
    ada0p4      -      -      0      0      0      0
    ada1p4      -      -      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
sea         2.72T   922G      0      0      0      0
  mirror-0  2.72T   922G      0      0      0      0
    ada2        -      -      0      0      0      0
    ada3        -      -      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
zroot       57.5G   172G      0      0      0      0
  mirror-0  57.5G   172G      0      0      0      0
    ada0p4      -      -      0      0      0      0
    ada1p4      -      -      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
sea         2.72T   922G      0      0      0      0
  mirror-0  2.72T   922G      0      0      0      0
    ada2        -      -      0      0      0      0
    ada3        -      -      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
zroot       57.5G   172G      0      0      0      0
  mirror-0  57.5G   172G      0      0      0      0
    ada0p4      -      -      0      0      0      0
    ada1p4      -      -      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
sea         2.72T   922G      0      0      0      0
  mirror-0  2.72T   922G      0      0      0      0
    ada2        -      -      0      0      0      0
    ada3        -      -      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
zroot       57.5G   172G      0      0      0      0
  mirror-0  57.5G   172G      0      0      0      0
    ada0p4      -      -      0      0      0      0
    ada1p4      -      -      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
sea         2.72T   922G      0     93      0  1.52M
  mirror-0  2.72T   922G      0     93      0  1.52M
    ada2        -      -      0     46      0   780K
    ada3        -      -      0     46      0   780K
----------  -----  -----  -----  -----  -----  -----
zroot       57.5G   172G      0      0      0      0
  mirror-0  57.5G   172G      0      0      0      0
    ada0p4      -      -      0      0      0      0
    ada1p4      -      -      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
sea         2.72T   922G      0      0      0      0
  mirror-0  2.72T   922G      0      0      0      0
    ada2        -      -      0      0      0      0
    ada3        -      -      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
zroot       57.5G   172G      0      0      0      0
  mirror-0  57.5G   172G      0      0      0      0
    ada0p4      -      -      0      0      0      0
    ada1p4      -      -      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
sea         2.72T   922G      0      0      0      0
  mirror-0  2.72T   922G      0      0      0      0
    ada2        -      -      0      0      0      0
    ada3        -      -      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
zroot       57.5G   172G      0      0      0      0
  mirror-0  57.5G   172G      0      0      0      0
    ada0p4      -      -      0      0      0      0
    ada1p4      -      -      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
sea         2.72T   922G      0      0      0      0
  mirror-0  2.72T   922G      0      0      0      0
    ada2        -      -      0      0      0      0
    ada3        -      -      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
zroot       57.5G   172G      0      0      0      0
  mirror-0  57.5G   172G      0      0      0      0
    ada0p4      -      -      0      0      0      0
    ada1p4      -      -      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
sea         2.72T   922G      0      0      0      0
  mirror-0  2.72T   922G      0      0      0      0
    ada2        -      -      0      0      0      0
    ada3        -      -      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
zroot       57.5G   172G      0      0      0      0
  mirror-0  57.5G   172G      0      0      0      0
    ada0p4      -      -      0      0      0      0
    ada1p4      -      -      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
sea         2.72T   922G      0    100      0  1.37M
  mirror-0  2.72T   922G      0    100      0  1.37M
    ada2        -      -      0     49      0   703K
    ada3        -      -      0     51      0   703K
----------  -----  -----  -----  -----  -----  -----
zroot       57.5G   172G      0      0      0      0
  mirror-0  57.5G   172G      0      0      0      0
    ada0p4      -      -      0      0      0      0
    ada1p4      -      -      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
sea         2.72T   922G      0      0      0      0
  mirror-0  2.72T   922G      0      0      0      0
    ada2        -      -      0      0      0      0
    ada3        -      -      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
zroot       57.5G   172G      0      0      0      0
  mirror-0  57.5G   172G      0      0      0      0
    ada0p4      -      -      0      0      0      0
    ada1p4      -      -      0      0      0      0
----------  -----  -----  -----  -----  -----  -----

The `sea` zpool is being written to every few seconds. And it does seem to be more like every 4 or 5 seconds, not 2.

Any other tools/commands I could use to get insight?

So once a week/month I hear/see it for a while but it eventually stops.

This has been going on for about a week now. Starting to wonder if it was one of the things `zpool upgrade` brought in...
 
I've tried the two suggestions here: https://www.reddit.com/r/zfs/comments/fpfezj/comment/flkmbz4/

Namely:
- Seeing which datasets are written to over time with `zfs list -o name,written`
- Taking snapshots, waiting a while and using `zfs diff`.

Neither shows any filesystem-level-visible changes, so I can only assume this is ZFS doing something internally, or some process is doing something idempotent, like setting file metadata to values it already has.
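
For reference, this is roughly what I ran (the snapshot name `probe` is just a placeholder; I only diffed a couple of datasets):

Code:
# mark a point in time across the whole pool
zfs snapshot -r sea@probe
# ...wait through a few disk crunches...
# per-dataset bytes written since the snapshot
zfs list -r -o name,written sea
# file-level changes in one dataset since its snapshot
zfs diff sea/media/music@probe
# clean up
zfs destroy -r sea@probe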

Changing `sysctl vfs.zfs.txg.timeout` to 10 certainly makes the disk crunches less frequent.
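
For the record, the exact commands (a runtime sysctl write; putting it in /etc/sysctl.conf should make it stick across reboots, though I haven't verified that yet):

Code:
# stretch the txg commit interval from the default 5s to 10s
sysctl vfs.zfs.txg.timeout=10
# optionally persist it (assuming the usual sysctl.conf handling)
echo 'vfs.zfs.txg.timeout=10' >> /etc/sysctl.conf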
 
Last insight before I go to bed.

Using `top -S -m io -o write` to sort the process list by write count, I notice that the column is zero all the way down until `zfskern` kicks in, and then that row is the only entry with a non-zero write count.

I can only conclude from all of these tests that no userspace process is writing to the disk and this is just zfs writing some metadata or something.

The disk sound is annoying enough that I want to bump `sysctl vfs.zfs.txg.timeout` to a larger number :-/
 
I've tried the two suggestions here: https://www.reddit.com/r/zfs/comments/fpfezj/comment/flkmbz4/

Namely:
- Seeing which datasets are written to over time with `zfs list -o name,written`
- Taking snapshots, waiting a while and using `zfs diff`.

Neither shows any filesystem-level-visible changes, so I can only assume this is ZFS doing something internally, or some process is doing something idempotent, like setting file metadata to values it already has.

Changing `sysctl vfs.zfs.txg.timeout` to 10 certainly makes the disk crunches less frequent.

So zfs list -o name,written is showing identically zero say sixty seconds (or more) after taking a recursive snapshot of the whole pool?
 
So zfs list -o name,written is showing identically zero say sixty seconds (or more) after taking a recursive snapshot of the whole pool?

- `zfs list -o name,written` is showing no writes to those datasets after a couple of disk crunches.
- taking a snapshot and diffing the live datasets to the snapshots after a few disk crunches shows no filesystem differences.
 
vfs.zfs.txg.timeout only applies if nothing else triggers committing a txg to disk - i.e. the various vfs.zfs.dirty_data_* thresholds. Especially vfs.zfs.dirty_data_max_percent, which defaults to 10, i.e. 10% of memory, and might be the culprit if this system has very little memory.

Also: do any of the pools have a cache or L2ARC device? Those will also cause frequent commits to disk.
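
A quick way to see the relevant knobs and whether a pool has such devices (nothing exotic, just sysctl and zpool status):

Code:
# commit interval and dirty-data thresholds
sysctl vfs.zfs.txg.timeout vfs.zfs.dirty_data_max vfs.zfs.dirty_data_max_percent
# any 'logs' or 'cache' sections in the output indicate log/cache vdevs
zpool status -v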
 
The `top` output above shows some info about the system memory and ARC setup. Does anything strike you as fishy there? Looks like we have plenty of memory.
 
The `top` output above shows some info about the system memory and ARC setup. Does anything strike you as fishy there? Looks like we have plenty of memory.

So this system only has 16GB of RAM? For a file server this isn't really "plenty" of memory. Check the ARC statistics with filesystems/zfs-stats (output from -A and -E).

Other points to check:
- how are nfs shares mounted on the clients? Usually it defaults to 'sync'
- do any datasets have their 'sync' property set to 'always'? Also (for testing!) you can try to set this value to 'disabled' for datasets which might receive sync writes (e.g. nfs shares). If this helps, check the mount options on all clients.
- on which datasets is atime enabled? IIRC some metadata is always synced to disk ASAP, bypassing the txg/dirty_data thresholds, not sure about atime though. Usually one can safely set atime=off for all datasets used for a fileserver (and the whole zroot, except for /var/mail); see the example commands after this list.
- does the "Anon" value in the ARC line of the top output change rather quickly in size? This value represents metadata held in the cache. By default metadata can use up to 1/4 of the ARC max size - for fileservers with lots of small files this can be a bottleneck, especially if the ARC (and metadata cache) is constantly being purged due to memory pressure. (zfs-stats -E can identify such problems)
 
Maybe related but a bit off-topic: this thread made me recall the early "NTFS sound (noise)". WinNT is / was a closed-source OS and I'm not in a position to read its source code, but what I've heard is that WinNT (at least from the first introduction of NTFS until, maybe, 4.x) periodically attempted to flush write caches and re-calibrate and retract the heads of all drives used for NTFS partitions (maybe, IIRC, every 2 secs or so).
 
So this system only has 16GB of RAM? For a file server this isn't really "plenty" of memory.

Yeah, I thought that'd be enough for a couple of VMs and an NFS server with a couple of clients... This isn't a system with an enterprise workload. It's just my personal file server in my home.

Check the ARC statistics with filesystems/zfs-stats (output from -A and -E)

Code:
# zfs-stats -A

------------------------------------------------------------------------
ZFS Subsystem Report                            Mon Mar 24 22:41:57 2025
------------------------------------------------------------------------

ARC Summary: (HEALTHY)
        Memory Throttle Count:                  0

ARC Misc:
        Deleted:                                20
        Mutex Misses:                           0
        Evict Skips:                            93

ARC Size:                               46.30%  2.78    GiB
        Target Size: (Adaptive)         48.45%  2.91    GiB
        Min Size (Hard Limit):          8.27%   508.04  MiB
        Max Size (High Water):          12:1    6.00    GiB
        Compressed Data Size:                   2.00    GiB
        Decompressed Data Size:                 4.09    GiB
        Compression Factor:                     2.05

ARC Size Breakdown:
        Recently Used Cache Size:       62.38%  1.81    GiB
        Frequently Used Cache Size:     37.62%  1.09    GiB

ARC Hash Breakdown:
        Elements Max:                           171.32  k
        Elements Current:               100.00% 171.32  k
        Collisions:                             27.63   k
        Chain Max:                              4
        Chains:                                 6.76    k

------------------------------------------------------------------------

# zfs-stats -E

------------------------------------------------------------------------
ZFS Subsystem Report                            Mon Mar 24 22:42:04 2025
------------------------------------------------------------------------

ARC Efficiency:                                 7.01    m
        Cache Hit Ratio:                97.61%  6.85    m
        Cache Miss Ratio:               2.39%   167.49  k
        Actual Hit Ratio:               97.61%  6.85    m

        Data Demand Efficiency:         87.17%  855.42  k
        Data Prefetch Efficiency:       4.94%   16.07   k

        CACHE HITS BY CACHE LIST:
          Most Recently Used:           26.88%  1.84    m
          Most Frequently Used:         73.12%  5.01    m
          Most Recently Used Ghost:     0.00%   0
          Most Frequently Used Ghost:   0.00%   0

        CACHE HITS BY DATA TYPE:
          Demand Data:                  10.89%  745.64  k
          Prefetch Data:                0.01%   793
          Demand Metadata:              87.52%  5.99    m
          Prefetch Metadata:            1.58%   108.16  k

        CACHE MISSES BY DATA TYPE:
          Demand Data:                  65.54%  109.78  k
          Prefetch Data:                9.12%   15.28   k
          Demand Metadata:              12.33%  20.65   k
          Prefetch Metadata:            13.01%  21.79   k

------------------------------------------------------------------------

- how are nfs shares mounted on the clients? Usually it defaults to 'sync'
- do any datasets have their 'sync' property set to 'always'? Also (for testing!) you can try to set this value to 'disabled' for datasets which might receive sync writes (e.g. nfs shares). If this helps, check the mount options on all clients.

I don't think it's to do with NFS. I've just turned off the systems that are clients, and the disk accesses persist every 5 seconds.

on which datasets is atime enabled

It looks like atime is enabled for all datasets on that zpool. But for that to be a problem, wouldn't something (in userspace) have to actually be reading the filesystem? `zpool iostat -v 1` doesn't show any read operations, only write operations.

does the "Anon" value in the ARC line of the top output change rather quickly in size?

Anon is stable at `1716K` at the moment.

What I don't understand is that this system has two zpools, but only one has this strange disk access pattern.
 
Sorry to be pedantic; when you run this:

zfs snapshot -r sea@testing; sleep 60; zfs list -ro name,written sea

All of the “written” results show “0”?

Be sure to destroy the snapshot when done.
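
For clarity, cleanup would just be removing the recursive snapshot again:

Code:
zfs destroy -r sea@testing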

Of note, NFS writes, atime updates, etc. should all be visible in this written metric.
 
Hi Eric, I saw your message, but was writing one at the same time. I'll respond in a moment.

Using the vfssnoop dtrace script, I see a lot of lines like:

Code:
# ./vfssnoop.d
TIMESTAMP           UID    PID PROCESS          CALL             SIZE PATH/FILE
...
85474152145096        0     18 syncer           vop_fsync           - /sea/media/music/<unknown>
...
85803975785222        0     18 syncer           vop_fsync           - /var/log/<unknown>
...
85798975750803        0     18 syncer           vop_fsync           - /sea/unsorted/<unknown>
...

I see about a line a second on varying paths, both on the problem pool and elsewhere. Almost looks like something is walking over the filesystems calling sync. Odd.

Still not sure what exactly is requiring flushing to the disk though.
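
Next I might try a couple of DTrace one-liners to attribute the activity (assuming the io provider and an fbt probe on zil_commit are usable on this kernel - I haven't run these yet):

Code:
# count disk I/O initiations by thread/process name; Ctrl-C after a few crunches
dtrace -n 'io:::start { @[execname] = count(); }'
# count zil_commit() calls (the sync-write path) by process name; Ctrl-C to print
dtrace -n 'fbt::zil_commit:entry { @[execname] = count(); }'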
 
zfs snapshot -r sea@testing; sleep 60; zfs list -ro name,written sea

OK, eliding some of the output for privacy, but:

Code:
# zfs snapshot -r sea@testing; sleep 60; zfs list -ro name,written sea
NAME                      WRITTEN
sea                             0
sea/media                       0
sea/media/music              136K
...

I think that's a fluke tbh, because repeating `zfs list -ro name,written sea` after another minute, the write count for the music dataset remains 136K.
 
Last post before bed.

Using `zfstxgsyncbytes.d` I get one line every 5 seconds:

Code:
# ./zfstxgsyncbytes.d sea
dtrace: script './zfstxgsyncbytes.d' matched 2 probes
CPU     ID                    FUNCTION:NAME
  1  80826                 none:txg-syncing    0MB of    0MB used
  0  80826                 none:txg-syncing    0MB of    0MB used
  2  80826                 none:txg-syncing    0MB of    0MB used
  4  80826                 none:txg-syncing    0MB of    0MB used
  0  80826                 none:txg-syncing    0MB of    0MB used
  0  80826                 none:txg-syncing    0MB of    0MB used
  1  80826                 none:txg-syncing    0MB of    0MB used
  0  80826                 none:txg-syncing    0MB of    0MB used

Not sure if "0MB of 0MB used" is a clue.
 
OK, eliding some of the output for privacy, but:

Code:
# zfs snapshot -r sea@testing; sleep 60; zfs list -ro name,written sea
NAME                      WRITTEN
sea                             0
sea/media                       0
sea/media/music              136K
...

I think that's a fluke tbh, because repeating `zfs list -ro name,written sea` after another minute, the write count for the music dataset remains 136K.

If it is always updating the same 136K for some reason, it won't grow.
 
Are you sure this is from ZFS? Reboot, stop at the boot loader menu, then wait and see whether the same sounds occur. I have a Windows laptop with similar effects and think it is HDD calibration.
 