Looking for the reason for sudden, very high CPU utilization on a network/ZFS-related process

First a bit of background:

This is a relatively older machine running FreeBSD 14: CPU: Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz (2593.86-MHz K8-class CPU), a 4-port Gbit Ethernet interface (bge1: <HP Ethernet 1Gb 4-port 331FLR Adapter>), and 384GB of RAM.
A few ZFS pools are set up on 4x 8TB SAS drives in a raidz vdev, hooked to an LSI controller (mps0: <Avago Technologies (LSI) SAS2008>) not running any hardware RAID, in direct JBOD mode.

I run the client for the Storj network within a jail. The jail has one of the interfaces dedicated to it. Gateway mode is not enabled, so there is no routing between interfaces.

On that same machine, a Zabbix server runs in its own jail, on its own dedicated interface, with a MySQL instance.
A Zabbix agent also runs on the host. That agent runs on a pool with a mirror of two 300GB drives, hanging off an HP P420i HBA controller set to JBOD mode (no hardware RAID; I describe the process here).

Yes, I know some things are not optimal; I should run the Zabbix server on a separate machine. At least right now it uses the zroot pool, so it should not interfere with the storagenode process.

How it was running until a week ago:

Very low CPU use: top would consistently show 2-6% WCPU on the Storj process (called storagenode), with CPU ~1-2%, user ~1-2%, mostly idle, even when the interface showed 150Mb/s of incoming bandwidth and ZFS showed about 15MB/s of writes. Perfectly normal. This went on for weeks without problems.
I could copy files locally with rsync, hit 130MB/s of writes on the (tank) ZFS pool, and see no performance hit at all.

How it runs now:

Spikes of CPU load lasting 3 hours or more, during which the storagenode process goes to 600%-1200% WCPU (according to top), overall load going up a lot as reported in Zabbix, and overall sluggish behavior (terminal interaction over SSH is slow; man pages take a couple of seconds to show any text).
[attached image 1719482194853.png: Zabbix graph of the CPU load spike]


During these high-load periods, network bandwidth actually goes down (I attribute that to the higher CPU load making storagenode too slow to upload/download data in time, so it aborts many transfers, as seen in its logs); disk activity also goes down accordingly.


[attached image 1719482319600.png: Zabbix graph of network and disk activity dropping during the spike]


Of course, it could be badly designed or unoptimized software; however, what puzzles me and makes me think it might be system related is this output from top while hitting those high CPU loads:

CPU:  1.2% user,  0.0% nice, 76.9% system,  0.0% interrupt, 21.9% idle
Mem: 1324M Active, 301G Inact, 1395M Laundry, 62G Wired, 225M Buf, 8219M Free
ARC: 36G Total, 13G MFU, 9566M MRU, 37M Anon, 526M Header, 14G Other
     8871M Compressed, 46G Uncompressed, 5.27:1 Ratio
Swap: 4096M Total, 206M Used, 3890M Free, 5% Inuse

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME     WCPU COMMAND
 7390 storagenod  615  68    0  2975M  1138M uwait    9 131.4H 1331.80% storagenode
63355 88           79  20    0  3673M   956M select  11  63.8H  104.52% mysqld
73892 root          1 124    0    51M    35M CPU9     9  42.4H   99.43% find
12195 storagenod   26  99    9  1270M    80M uwait   22  71:46   94.05% storagenode
92235 storagenod   27  98    9  1272M    31M uwait    7 720:18   72.07% storagenode
25940 hostd        35  68    0    54G  3007M uwait    8  34.1H   47.85% hostd
12947 zabbix        1  20    0   132M    17M kqread   7  29:11    1.28% zabbix_server
66512 bruno         1  28    0    14M  3692K CPU0     0   0:01    0.27% top
 7389 root          1  20    0    13M  1656K kqread  22   4:42    0.17% daemon
12993 zabbix        2  20    0   192M    13M kqread  22   3:44    0.04% zabbix_server
12952 zabbix        4  20    0   131M    18M kqread  22   9:04    0.03% zabbix_server
12965 zabbix        1  20    0   128M    20M sbwait  14   2:01    0.02% zabbix_server
11722 zabbix        1   4    0    30M  5460K RUN     22   1:07    0.02% zabbix_agentd
26509 zabbix        1  20    0    27M  6232K nanslp  18   2:18    0.01% zabbix_agentd
11720 zabbix        1  20    0    27M  5472K nanslp  10   1:54    0.01% zabbix_agentd

76.9% system? So that means whatever is using up all the CPU is actually waiting on some system process?
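(As a next step I suppose I could run top with system processes and threads shown to see which kernel threads that system time lands on; a minimal sketch of what I mean, not captured yet:)

```sh
# Show system/kernel processes (-S) and individual threads (-H) so the
# ~77% system time can be attributed to specific kernel threads
top -SH
```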
Here is top in I/O mode (-m io):
last pid: 90993;  load averages: 25.82, 20.82, 18.69    up 12+07:54:16  17:56:26
195 processes: 2 running, 193 sleeping
CPU:  2.8% user,  0.0% nice, 61.9% system,  0.0% interrupt, 35.3% idle
Mem: 1378M Active, 304G Inact, 1423M Laundry, 59G Wired, 229M Buf, 8304M Free
ARC: 33G Total, 13G MFU, 6385M MRU, 26M Anon, 396M Header, 13G Other
     5425M Compressed, 23G Uncompressed, 4.26:1 Ratio
Swap: 4096M Total, 206M Used, 3890M Free, 5% Inuse

  PID USERNAME      VCSW  IVCSW   READ  WRITE  FAULT  TOTAL PERCENT COMMAND
 7390 storagenod   11479   1648     76    364      0    440  42.07% storagenode
63355 88         3790138    142      4     98      0    102   9.75% mysqld
73892 root           198    333    103      0      0    103   9.85% find
92235 storagenod     393    175      5      0      0      5   0.48% storagenode
25940 hostd         1255     73      2    243      0    245  23.42% hostd
12952 zabbix          62     12      0      0      0      0   0.00% zabbix_server
12965 zabbix          21      0      0      0      0      0   0.00% zabbix_server
 7389 root           145      0      0    150      0    150  14.34% daemon
88576 bruno            2      1      0      0      0      0   0.00% top
11729 zabbix           6      0      0      0      0      0   0.00% zabbix_agentd
12992 zabbix          16      0      0      0      0      0   0.00% zabbix_server
12993 zabbix          12      0      0      0      0      0   0.00% zabbix_server
12962 zabbix          12      0      0      0      0      0   0.00% zabbix_server
12954 zabbix          31      0      0      0      0      0   0.00% zabbix_server
12944 zabbix          19      0      0      0      0      0   0.00% zabbix_server
12963 zabbix          22      0      0      0      0      0   0.00% zabbix_server
12983 zabbix          12      0      0      0      0      0   0.00% zabbix_server

The ioztat output does not show anything abnormal to me:

                             operations    throughput      opsize
dataset                      read  write   read  write   read  write
---------------------------  ----- -----  ----- -----  ----- -----
frontpool                        0     0      0     0      0     0
  data                           0     0      0     0      0     0
tank                             0     0      0     0      0     0
  hostd                          0     0      0     0      0     0
    data                         0     0      0  2.67M     0     4M
    lib                        268    28  1.05M  57.7K    4K  2.01K
  storj                          1   105  19.8K  3.89M  11.9K  38.0K
    databases                    0     0      0     0      0     0
zroot                            0     0      0     0      0     0
  ROOT
    default                      0     0      0     0      0     0
  appjail                        0     0      0     0      0     0
    components                   0     0      0     0      0     0
      amd64                      0     0      0     0      0     0
        14.0-RELEASE             0     0      0     0      0     0
          default                0     0      0     0      0     0
    jails                        0     0      0     0      0     0
      hostd                      0     0      0     0      0     0
        jail                     0     0      0     0      0     0
      storj                      0     0      0     0      0     0
        jail                     0    49      0  14.5K     0    302
      zabbix                     0     0      0     0      0     0
        jail                   107    90  1.12M  2.02M  10.6K  22.9K
    logs                         0     0      0     0      0     0
      jails                      0     0      0     0      0     0
        hostd                    0     0      0     0      0     0
          console                0     0      0     0      0     0
        storj                    0     0      0     0      0     0
          console                0     0      0     0      0     0
          startup-start          0     0      0     0      0     0
          startup-stop           0     0      0     0      0     0
        testjail                 0     0      0     0      0     0
          console                0     0      0     0      0     0
        zabbix                   0     0      0     0      0     0
          console                0     0      0     0      0     0
    networks                     0     0      0     0      0     0
      p5jails                    0     0      0     0      0     0
    releases                     0     0      0     0      0     0
      amd64                      0     0      0     0      0     0
        14.0-RELEASE             0     0      0     0      0     0
          default                0     0      0     0      0     0
            release              0     0      0     0      0     0
  home                           0     0      0     0      0     0
  tmp                            0     0      0     0      0     0
  usr
    ports                        0     0      0     0      0     0
    src                          0     0      0     0      0     0
  var
    audit                        0     0      0     0      0     0
    crash                        0     0      0     0      0     0
    log                          0     0      0     0      0     0
    mail                         0     0      0     0      0     0
    tmp                          0     0      0     0      0     0

So, before I pull out the profiler and instrument the storagenode process to find out what it is doing internally, I'd like to make sure it's not something I could change on the system, which sounds probable to me.

Any hints or advice on how to track this down? What should I look for, and what ideas could I test?

Thanks for any help
 
You can also trace the syscalls that `storagenode` is making with `truss -p <pid>`.
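A minimal sketch of what that could look like, assuming the PID 7390 from the top output above; `-c` summarizes syscall counts and `-f` follows children:

```sh
# Attach to the running storagenode, follow forked children (-f), and print a
# per-syscall count summary (-c) when you detach with Ctrl-C
truss -c -f -p 7390
```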

Overall I am looking at those `find` jobs. Are they present and active when the load is normal?
 
Overall I am looking at those `find` jobs. Are they present and active when the load is normal?
Exactly. find can be great at bogging down I/O (and thereby impacting many things) if it is trawling a large directory tree. There are some fairly standard periodic scripts that use it, as I recall.

`ps -Ad` can be useful to see process trees, if you need to look at “who launched that”.
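For example (a sketch; the column keywords are standard ps(1) format specifiers, and the PID is the find process from the top output above):

```sh
# Whole process tree, indented by descendancy (-d), with parent PIDs visible
ps -Ad -o pid,ppid,user,%cpu,command | less
# Or look up one suspect process and its parent directly
ps -o pid,ppid,command -p 73892
```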
 
thanks for the suggestions, I am still trying to track this down.
I noticed that in top, the storagenode process (when the CPU is pegged) always shows a state of "uwait".

I tried truss, but it locked up the process and I had to restart it. Then it ran for 12 hours at low CPU, as it used to, then slowly climbed back up for 3 hours, until it just exploded to a high CPU load again.

I found this and was going to use dtruss, but it seems to use a probe that does not exist in FreeBSD 14 anymore (probe description syscall::lstat*:return does not match any probes). I am thinking of changing it, but `dtrace -l | grep lstat` shows only compat*.lstat probes... what would be the equivalent for FreeBSD 14?
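In the meantime, a generic per-syscall count that does not depend on the lstat probes should presumably look something like this (PID taken from the earlier top output):

```sh
# Aggregate every syscall made by PID 7390; the summary prints on Ctrl-C
dtrace -n 'syscall:::entry /pid == 7390/ { @[probefunc] = count(); }'
```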
BTW, it is too bad the DTrace toolkit is not maintained anymore; there are lots of good tools in there.

Overall I am looking at those `find` jobs. Are they present and active when the load is normal?
Not sure; I did not pay attention to it before.
That find command is rather strange. This is what `ps -l -p 73892 -ww` outputs:

UID   PID  PPID  C PRI NI   VSZ   RSS MWCHAN   STAT TT        TIME COMMAND
  0 73892 73887 15  68  0 52692 32772 zio->io_ D    -   4241:22.69 / /usr/local/appjail /frontpool /var/mail /usr/src /home /mnt/data1 /zroot /usr/local/appjail/cache/components /usr/local/appjail/releases /usr/local/appjail/networks /usr/local/appjail/jails /var/log/appjail /usr/local/appjail/networks/p5jails /usr/local/appjail/releases/amd64 /usr/local/appjail/cache/components/amd64 /var/log/appjail/jails /usr/local/appjail/cache/components/amd64/14.0-RELEASE /usr/local/appjail/jails/storj /var/log/appjail/jails/storj /usr/local/appjail/cache/components/amd64/14.0-RELEASE/default /var/log/appjail/jails/testjail /usr/local/appjail/releases/amd64/14.0-RELEASE /var/log/appjail/jails/storj/console /var/log/appjail/jails/storj/startup-stop /usr/local/appjail/jails/storj/jail /usr/local/appjail/releases/amd64/14.0-RELEASE/default /var/log/appjail/jails/testjail/console /usr/local/appjail/releases/amd64/14.0-RELEASE/default/release /var/log/appjail/jails/storj/startup-start /tank /tank/storj /tank/hostd /usr/local/appjail/jails/hostd /usr/local/appjail/jails/hostd/jail /var/log/appjail/jails/hostd /var/log/appjail/jails/hostd/console /tank/hostd/data /tank/hostd/lib /var/log/appjail/jails/zabbix /var/log/appjail/jails/zabbix/console /usr/local/appjail/jails/zabbix /usr/local/appjail/jails/zabbix/jail /dev/null (find)

I see in the process tree that it is coming from cron, as part of the periodic daily security scripts. It looks like those scripts traverse all pools and all files??
(/bin/sh - /etc/periodic/security/100.chksetuid)
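Something like this should list which of those security scripts call find (a crude check; I have not gone through them all yet):

```sh
# Which of the periodic security scripts do a filesystem walk with find?
grep -l 'find ' /etc/periodic/security/*
```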
 
OK, so that find was what triggered the high CPU load.
The daily periodic security check traverses all filesystems and pegs them. That specific instance had been running for 2 days (for a daily job, hahaha).
Killing that process reverts CPU usage to normal.

With that said, I do not view this as the root cause. I think the storagenode process has a bug, because it seems to busy-wait when starved of I/O.
I'll dig a bit more into that and report it.

Thanks everyone for the hints and tips!
 
Check your zpool iostat and look at how many I/O operations per second you are doing. For a business-critical 3.5" 7200 rpm SAS disk, the upper limit is around 140-150 IOPS (at 8k per command with a queue depth of 4); anything above that and the disk just can't keep up. For mission-critical 2.5" SAS disks, the IOPS limit is around 200-250 for 10k rpm and about 300 for 15k rpm.
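For example, something like this (pool name taken from this thread; 5-second samples) shows per-disk operations to compare against those numbers:

```sh
# Per-vdev/per-disk operations and bandwidth, refreshed every 5 seconds
zpool iostat -v tank 5
```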
 
OK, well, no victory lap yet.
Although disabling those security scripts had an immediate positive effect, it turns out the storagenode process also spawns a process that does exactly what the scripts did: traverse the whole filesystem to do some cleanup. This can't be disabled. Clearly the root cause should be tackled. At least I am pretty confident I have identified the real problem, and it can probably be fixed.
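For reference, these are the kinds of periodic.conf knobs I mean for the security scripts; the variable names are what I see in /etc/defaults/periodic.conf, so double-check them on your own system before copying:

```sh
# /etc/periodic.conf -- disable the find-based daily security scans
# (names assumed from /etc/defaults/periodic.conf; verify locally)
security_status_chksetuid_enable="NO"     # /etc/periodic/security/100.chksetuid
security_status_neggrpperm_enable="NO"    # /etc/periodic/security/110.neggrpperm
```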
 
For the periodic scans, at least some of them skip mounts with, for example, noexec or nosuid flags. Consider setting these on large “data” filesystems when appropriate to cut down the load.
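On ZFS, that could look something like this (dataset name taken from the ioztat output earlier; only do it on datasets that never need to execute or run setuid binaries, and verify that the script in question actually honours these mount flags):

```sh
# Mount the Storj data dataset nosuid/noexec so suid scans that skip such
# mounts leave it alone (tank/storj is assumed from the earlier output)
zfs set setuid=off tank/storj
zfs set exec=off tank/storj
```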
 
Check your zpool iostat and look at how many I/O operations per second you are doing. For a business-critical 3.5" 7200 rpm SAS disk, the upper limit is around 140-150 IOPS (at 8k per command with a queue depth of 4); anything above that and the disk just can't keep up. For mission-critical 2.5" SAS disks, the IOPS limit is around 200-250 for 10k rpm and about 300 for 15k rpm.

They are 3.5" 7200 rpm SAS drives.

bruno@proliant5:~ $ zpool iostat -lvy
                        capacity     operations     bandwidth    total_wait     disk_wait    syncq_wait    asyncq_wait  scrub   trim  rebuild
pool                  alloc   free   read  write   read  write   read  write   read  write   read  write   read  write   wait   wait   wait
--------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
tank                  20.7T  8.46T    527      0  8.25M      0   18ms      -   18ms      -    3us      -      -      -      -      -      -
  raidz1-0            20.7T  8.46T    792      0  9.29M      0   16ms      -   16ms      -    2us      -      -      -      -      -      -
    gpt/slot0-7SGV590G    -      -    264      0  1.03M      0   12ms      -   12ms      -  768ns      -      -      -      -      -      -
    gpt/slot2-7SG1A8RR    -      -    265      0  1.04M      0   25ms      -   25ms      -    6us      -      -      -      -      -      -
    gpt/slot4-7SG1APWR    -      -    265      0  7.26M      0   12ms      -   12ms      -  384ns      -      -      -      -      -      -
    gpt/slot6-7SG19LRR    -      -      0      0      0      0      -      -      -      -      -      -      -      -      -      -      -
--------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----

I checked before: storagenode does a lot of 16KB operations.
 
Try `gstat -pdo`. This will show disk-level stats, including busy %.
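For example (the -f filter is an assumption that your SAS disks show up as da* devices):

```sh
# Disk-level view: -p physical providers only, -d include BIO_DELETE,
# -o include other ops (e.g. BIO_FLUSH); filter to da* disks with -f
gstat -pdo -f '^da'
```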

Traversing filesystems for cleanup sounds like an occasional task. There may be a tunable for how aggressively it does the cleanup; otherwise, if it just tries to go “as fast as possible”, it should be no surprise that it impacts other tasks.
 