Overnight periodic report(?) triggering bottlenecked processes on jail mounts

Hi, posting here because this started after the recent 15.0 upgrade, but I'm not convinced it is due to the upgrade.

The problem: at 3am each day, something is causing the processes related to the jail mounts to become a bottleneck. We have not had this problem before, and we have been running this jail setup for about a year now.


Description of our environment: we are currently running Host -> vnet Jail -> 4 child jails on a smaller machine (8 cores, 16 GB of memory). One child jail is a reverse proxy, another is a production web app, a third is a development version of the web app, and the fourth is a work in progress.

ZFS datasets have been created as needed, and our approach for mounting datasets into the jail was already following the recommended approach described for the recent dataset-mounting feature.

When the vnet jail is started, the Host's /usr/local/jails is mounted at the vnet jail's /usr/local/jails. We then mount a templated base for each child jail, followed by the respective userland, etc.
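
Roughly, each child jail's fstab does nullfs mounts along these lines (a sketch with simplified paths, not our exact layout):
Code:
# hypothetical sketch of a child jail fstab, e.g. /usr/local/jails/nginx.fstab
# read-only templated base first, then the jail's own writable userland
/usr/local/jails/templates/base   /usr/local/jails/nginx            nullfs  ro  0  0
/usr/local/jails/children/nginx   /usr/local/jails/nginx/usr/local  nullfs  rw  0  0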


The symptom: when running htop on the Host, sorted by CPU from high to low, the top two processes are ones showing the filesystem mounts. These don't normally appear on the list when things are "working". Although the system still responds to some commands, the jails cannot be shut down with the normal commands from the vnet jail; a reboot has been required, which returns the load to normal. Whatever the process is, it is pegged at the equivalent of one CPU core. I haven't found a way to trigger the problem, and I am not sure how to diagnose which process may be causing it.


Attempts to solve: since the high CPU always starts at 3am, I thought there might be a conflict between the periodic reports being run, so I disabled the ones in the child jails related to filesystem changes, and for a night that seemed to fix it. Interestingly, I cannot find any errors in any logs, and the jail processes are running fine when checked, but the resources are being hogged by the Host or something else. When running htop from a child jail, it's almost all red across all cores.

But yesterday it started showing high CPU again during the day - however, the processes showing the filesystem mounts were not on the list. It seemed to resolve when I experimented by shutting down the development jail, so I figured it might be related to a process in it. I chose that jail because our production jail can call the development jail, but the 4 jails are otherwise not able to talk to each other. However, last night something got triggered again and the CPU was high with the filesystem mounts, and this was with the development jail off.


I'm looking for some ideas of what to try or test, thanks.

Host jail.conf which starts the vnet jail:
Code:
mount.devfs;                               # Mount devfs inside the jail
exec.clean;                                # prevent use of host environment in commands
exec.start = "/bin/sh /etc/rc";            # Start command
exec.stop = "/bin/sh /etc/rc.shutdown";    # Stop command
exec.consolelog = "/var/log/jail_console_${name}.log";
host.hostname = "redacted";

jail_parent {
    path = "/usr/local/jails/containers/${name}";
    vnet;
    vnet.interface="e0b_jail_parent";
    devfs_ruleset=70;
    exec.created+="zfs jail $name zroot/poudriere";
    exec.created+="zfs jail $name zroot/jails/children";
    exec.created+="zfs jail $name zroot/jails/root_templates";
    exec.created+="zfs jail $name zroot/jails/databases";
    exec.created+="/usr/local/jails/scripts/zfs_set_jailed_on";
    exec.start+="service jail onestart nginx";
    exec.start+="service jail onestart beta";
    exec.start+="service jail onestart marketing";
    exec.start+="service jail onestart develop";
    exec.start+="wg-quick up wg_client";
    allow.raw_sockets;
    allow.mlock;
    allow.mount;
    allow.mount.devfs;
    allow.mount.fdescfs;
    allow.mount.fusefs;
    allow.mount.linprocfs;
    allow.mount.linsysfs;
    allow.mount.nullfs;
    allow.mount.procfs;
    allow.mount.tmpfs;
    allow.mount.zfs;
    enforce_statfs = 1;
    children.max=20;
    exec.prestop+="/sbin/ifconfig e0b_jail_parent -vnet $name";
    exec.stop+="service jail onestop develop";
    exec.stop+="service jail onestop marketing";
    exec.stop+="service jail onestop beta";
    exec.stop+="service jail onestop nginx";
    exec.stop+="wg-quick down wg_client";
    exec.poststop+="zfs unjail $name zroot/poudriere";
    exec.poststop+="zfs unjail $name zroot/jails/children";
    exec.poststop+="zfs unjail $name zroot/jails/root_templates";
    exec.poststop+="zfs unjail $name zroot/jails/databases";
    exec.poststop+="/usr/local/jails/scripts/zfs_set_jailed_off";
}

Vnet jail.conf which starts the child jails:
Code:
mount.devfs;                               # Mount devfs inside the jail
devfs_ruleset = "0";
ip6="disable";
exec.clean;
exec.start = "/bin/sh /etc/rc";            # Start command
exec.stop = "/bin/sh /etc/rc.shutdown";    # Stop command
exec.consolelog = "/var/log/jail_console_${name}.log";
path = "/usr/local/jails/$name";

nginx {
    # hostname
    host.hostname = "redacted";

    # network
    ip4.addr = "192.168.1.1";

    # permissions
    enforce_statfs = 1;
    allow.raw_sockets;
    allow.mount;
    allow.mount.fusefs;

    mount.fstab = "/usr/local/jails/${name}.fstab";
    exec.release += "/usr/local/jails/unmount_nginx_fstab";
}

marketing {
    # hostname
    host.hostname = "redacted";

    # network
    ip4.addr = "192.168.2.1";

    # permissions
    enforce_statfs = 1;
    allow.raw_sockets;
    allow.mlock;
    allow.mount;
    allow.mount.devfs;
    allow.mount.fdescfs;
    allow.mount.fusefs;
    allow.mount.linprocfs;
    allow.mount.linsysfs;
    allow.mount.nullfs;
    allow.mount.procfs;
    allow.mount.tmpfs;

    mount.fstab = "/usr/local/jails/${name}.fstab";
    exec.start += "/var/www/sourcefiles/mountsourcefiles";
    exec.start += "/var/www/sourcefiles/start_linux_service";
    exec.stop += "/var/www/sourcefiles/stop_linux_service";
    exec.stop += "/var/www/sourcefiles/unmountsourcefiles";
    exec.release += "/usr/local/jails/unmount_marketing_fstab";
}

beta {
    # hostname
    host.hostname = "redacted";

    # network
    ip4.addr = "192.168.3.1";

    # permissions
    enforce_statfs = 1;
    allow.raw_sockets;
    allow.mlock;
    allow.mount;
    allow.mount.fusefs;
    allow.mount.nullfs;

    mount.fstab = "/usr/local/jails/${name}.fstab";
    exec.prestart += "sudo -u rsync_user /home/rsync_user/mount_rsync_net_beta_remote_storage";
    exec.start += "/var/www/sourcefiles/mountsourcefiles";
    exec.prestop += "umount -f /usr/local/jails/beta/s/usr/local/mnt/rsync_remote_storage";
    exec.stop += "/var/www/sourcefiles/unmountsourcefiles";
    exec.release += "/usr/local/jails/unmount_beta_fstab";
}

develop {
    # hostname
    host.hostname = "redacted";

    # network
    ip4.addr = "192.168.4.1";

    # permissions
    enforce_statfs = 1;
    allow.raw_sockets;
    allow.mlock;
    allow.mount;
    allow.mount.fusefs;
    allow.mount.nullfs;

    mount.fstab = "/usr/local/jails/${name}.fstab";
    exec.stop += "/var/www/sourcefiles/mountsourcefiles";
    exec.stop += "/var/www/sourcefiles/unmountsourcefiles";
    exec.release += "/usr/local/jails/unmount_develop_fstab";
}
 
Spread these a bit on each system/jail.

/etc/crontab
Code:
# Perform daily/weekly/monthly maintenance.
1       3       *       *       *       root    periodic daily
15      4       *       *       6       root    periodic weekly
30      5       1       *       *       root    periodic monthly
So they don't fire all at exactly the same time.
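
For example, give each jail its own offset in its /etc/crontab (the times here are just an illustration):
Code:
# nginx jail /etc/crontab
1       2       *       *       *       root    periodic daily
# marketing jail /etc/crontab
31      2       *       *       *       root    periodic daily
# beta jail /etc/crontab
1       4       *       *       *       root    periodic daily
# develop jail /etc/crontab
31      4       *       *       *       root    periodic daily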
 
Ok, I think we are on the right track. I could actually see the processes for the periodic cron jobs, but the system was still pretty loaded up. I think I may have cruft and/or incorrectly disabled jobs:

The system was initially version 13 (way back). I have two questions.

1. Regarding older items like these (none are custom) - is it best practice to remove any that do not also appear in /etc/periodic/*? (A quick check of where they came from is sketched further down.)
Code:
# ls -R /usr/local/etc/periodic
daily    security weekly

/usr/local/etc/periodic/daily:
411.pkg-backup         490.status-pkg-changes

/usr/local/etc/periodic/security:
405.pkg-base-audit 410.pkg-audit      460.pkg-checksum

/usr/local/etc/periodic/weekly:
400.status-pkg

2. periodic.conf(5) says periodic.conf.local can be used in /etc - can it also be used in /usr/local/etc, or does it need to be named periodic.conf? I had tried to disable 110.neggrpperm with security_status_neggrpperm_enable="NO" in /etc/periodic.conf.local for each of the 4 child jails, but I saw 4 instances running, which makes me think it is not working. Maybe it only gets disabled with security_status_neggrpperm_period="NO"?

Or maybe the host and the vnet jail (the parent jail of the 4 children) were each running neggrpperm, and what I saw for each was a command process plus the script process. I don't know.
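
(The quick check of where those question-1 scripts came from: ask which package, if any, installed them. If pkg which names an owning package, the script isn't leftover cruft.)
Code:
pkg which /usr/local/etc/periodic/daily/411.pkg-backup
pkg which /usr/local/etc/periodic/security/410.pkg-audit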

Thanks, I appreciate the help a lot.
 
When you disabled security_status_neggrpperm_enable in /etc/periodic.conf.local, it sounds like you concluded /etc/periodic.conf.local didn't work as advertised, because you still saw some nightly find processes.

I believe the confusion is due to not having disabled all of the nightly tasks that do a find /.

To be clear, I see 2 periodic daily security tasks and 1 periodic weekly task that each do a find / (at least in 14.3).

You could put the following lines in /etc/periodic.conf on your main system and in each jail (where/if cron is enabled) to disable all periodic find traversals:

Code:
security_status_chksetuid_enable="NO"
security_status_neggrpperm_enable="NO"
weekly_locate_enable="NO"

If you reconfigure this, all periodic filesystem traversals should cease. The trade-off is that the nightly security jobs will no longer scan for and report changes in the list of setuid programs or files with negative group permissions, and the database for the locate(1) command won't be updated, so please make sure you understand the trade-offs if you decide to disable these security/weekly jobs.
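
To confirm a setting is actually being picked up inside a given jail, you can source the defaults the same way the periodic scripts do. This assumes the stock /etc/defaults/periodic.conf, which by default reads /etc/periodic.conf and /etc/periodic.conf.local, so a .local file should be honored:
Code:
# show which conf files periodic reads and the effective value of the knob
sh -c '. /etc/defaults/periodic.conf; source_periodic_confs
  echo "conf files: ${periodic_conf_files}"
  echo "neggrpperm: ${security_status_neggrpperm_enable}"'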

IMHO, the recommendation by SirDice to "stagger" the times of the nightly cron jobs for each jail is the easiest and most standard way sysadmins reduce parallel contention on the filesystems.

You might find it an acceptable trade-off if you only disable the above periodic tasks in your jail(s), because the parent system's periodic scans should traverse into jails.
 
You might find it an acceptable trade-off if you only disable the above periodic tasks in your jail(s), because the parent system's periodic scans should traverse into jails.

I agree - I think the conflict between the system host and the vnet jail (parent jail) may be the issue at this point. No jobs for the vnet jail have been disabled, so I will attempt that next.
 
After another night...

Per the crontab settings, the host's periodic schedule runs first. Surprisingly, it appears that all of the jail reports (vnet parent and child jails) finish before the host's periodic run is even done with 100.chksetuid.

It had been running for 3.5 hours, so I killed the host's 100.chksetuid and, as expected, the host's 110.neggrpperm then began. It seems there is something causing 100.chksetuid on the host system to be blocked from starting (or finishing?). I got an email report from it and there was no indication of the termination.

I'm going to have to manually run it to troubleshoot some more. I am still seeing no errors in logs.
 
After another night...
[...]
I'm going to have to manually run it to troubleshoot some more. I am still seeing no errors in logs.

Wow, 3.5 hours does suggest your find is probably hanging or going really slowly on some sub-portion of your hierarchy. I don't recall ever seeing a full local-filesystem find / take that long on any of my systems, even on a huge RAID array.

I recommend you run it in the foreground as root, via:

/etc/periodic/security/100.chksetuid

Then periodically type ^T (CTRL-T) while that job is in the foreground -- find will report the current path it's on ("Scanning: /path") and how many inodes it has scanned so far ("Scanned: NNNN"), so you should be able to tell if/when it "stalls".

It makes sense that the host's nightly security jobs would take longer than the jails', because the host always scans the entire system, whereas each jail only scans from its own root downward.

FYI, I just ran 100.chksetuid manually on one of my 14.3 systems: it took approximately 3 minutes to scan the complete hierarchy with about 13 million inodes in use (UFS2 on SSD).
 
It definitely is stalling out - but it's not clear on what:
Code:
load: 1.81  cmd: find 83254 [running] 123.72r 0.08u 123.11s 99% 5300k
Scanning: /usr/share/man/man9/ALQ.9.gz
Scanned: 75025

load: 1.83  cmd: find 83254 [running] 129.01r 0.08u 128.39s 99% 5300k
Scanning: /usr/share/openssl/man/man3/ACCESS_DESCRIPTION_free.3.gz
Scanned: 78029

load: 1.85  cmd: find 83254 [running] 138.50r 0.10u 137.82s 99% 5528k
Scanning: /usr/tests/lib/libc/sys/Kyuafile
Scanned: 87315

load: 1.85  cmd: find 83254 [running] 141.94r 0.10u 141.21s 98% 5528k
Scanning: /usr/tests/sys/cddl/zfs/tests/clean_mirror/Kyuafile
Scanned: 88515

load: 1.87  cmd: find 83254 [running] 145.33r 0.10u 144.59s 99% 5528k
Scanning: /usr/tests/sys/netinet/Kyuafile
Scanned: 90229

load: 1.88  cmd: find 83254 [running] 148.82r 0.10u 148.07s 99% 5528k
Scanning: /usr/tests/usr.bin/hexdump/Kyuafile
Scanned: 92006

load: 1.88  cmd: find 83254 [running] 152.13r 0.10u 151.36s 99% 5528k
Scanning: /var/cache/pkg/atop-2.0.2.b3_4.pkg
Scanned: 93397

load: 1.89  cmd: find 83254 [running] 156.02r 0.10u 155.24s 99% 5528k
load: 1.94  cmd: find 83254 [running] 191.41r 0.10u 190.48s 99% 5528k
load: 2.03  cmd: find 83254 [running] 253.77r 0.10u 252.58s 99% 9820k

Each time I run it the last file is different, but it is hanging within the first 5 minutes.

Here's another run - interesting that it ends on a rollback.
Code:
load: 1.93  cmd: find 83321 [running] 144.77r 0.11u 144.05s 99% 5496k
Scanning: /usr/tests/sys/cddl/zfs/tests/cli_root/zpool_scrub/Kyuafile
Scanned: 89046

load: 1.93  cmd: find 83321 [running] 146.73r 0.11u 146.00s 99% 5496k
Scanning: /usr/tests/sys/geom/class/uzip/1_endian_big.img.uzip.uue
Scanned: 90065

load: 1.94  cmd: find 83321 [running] 148.72r 0.11u 147.98s 99% 5496k
Scanning: /usr/tests/usr.bin/bmake/archives/fmt_44bsd/Kyuafile
Scanned: 90869

load: 1.94  cmd: find 83321 [running] 149.79r 0.11u 149.04s 99% 5496k
Scanning: /usr/tests/usr.bin/comm/Kyuafile
Scanned: 91308

load: 1.94  cmd: find 83321 [running] 150.95r 0.11u 150.20s 99% 5496k
Scanning: /usr/tests/usr.bin/jot/Kyuafile
Scanned: 92109

load: 1.94  cmd: find 83321 [running] 152.41r 0.11u 151.66s 99% 5496k
Scanning: /usr/tests/usr.bin/xargs/Kyuafile
Scanned: 92849

load: 1.94  cmd: find 83321 [running] 153.70r 0.11u 152.94s 99% 5496k
Scanning: /var/cache/pkg/atop-2.0.2.b3_4.pkg
Scanned: 93397

load: 1.94  cmd: find 83321 [running] 155.85r 0.11u 155.08s 99% 5496k
Scanning: /var/db/etcupdate/current/etc/rc.d/DAEMON
Scanned: 93876

load: 1.94  cmd: find 83321 [running] 157.14r 0.11u 156.37s 99% 5496k
Scanning: /var/db/freebsd-update_bak/1b8a271a986060153e13bcf5acfc5cc6f602c99a73122b00cd2cda3b56c123df-rollback
Scanned: 94544

load: 1.95  cmd: find 83321 [running] 158.62r 0.11u 157.83s 99% 5500k
load: 2.03  cmd: find 83321 [running] 165.71r 0.11u 164.90s 99% 5500k
 
I was able to get it to end on the same file twice:
Code:
load: 2.12  cmd: find 83380 [running] 171.51r 0.10u 170.70s 99% 5484k
Scanning: /var/db/etcupdate/current/etc/autofs/include_ldap
Scanned: 93675

load: 2.12  cmd: find 83380 [running] 172.24r 0.10u 171.43s 99% 5484k
Scanning: /var/db/freebsd-update_bak/1b8a271a986060153e13bcf5acfc5cc6f602c99a73122b00cd2cda3b56c123df-rollback
Scanned: 94544

load: 2.12  cmd: find 83380 [running] 173.04r 0.10u 172.23s 99% 5484k

I think that I may have created the /var/db/freebsd-update_bak folder when I was attempting to troubleshoot why the system was saying there was a pending update when there wasn't. I am working on removing it and will report back.
 
IMO, it's "very interesting" that find traverses normally for several minutes (i.e. it keeps reporting "Scanning: [path]") up to some point, and then appears to go into a mode of just burning system CPU time, never getting back to find's normal CTRL-T handler (the one that prints "Scanning: [path]") after that point. I suspect find has traversed into a different local mount point (ZFS?) where it gets stuck in some kind of system CPU loop. I've never encountered this before myself.

The only suggestion I have at the moment is to run 100.chksetuid again in the foreground and determine the process ID of the find -sx process. Then, in another window or virtual console, after it seems to get stuck, run ktrace -dip PID (where PID is the find process ID), wait 20-30 seconds, stop the trace with ktrace -C, and then run kdump | less to look at the system calls find was making while ktrace was running. If there are very few or no system calls reported by ktrace, that would mean find was completely stuck in the kernel, which might explain why there is no "Scanning: [path]" after some point.

If kdump shows any "NAMI" (name lookup) records, the NAMI paths should at least tell you which filesystem(s) it was traversing during the 20-30 second kdump window.

When you run ktrace, you will need to be root, and your current working directory will need to be writable (to hold ktrace.out). Once you're all done looking at kdump output, don't forget to rm ktrace.out
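
Putting that together, the whole sequence looks roughly like this (a sketch; the pgrep step is just to find the PID, adjust the pattern if needed):
Code:
# terminal 1: run the security script in the foreground as root
/etc/periodic/security/100.chksetuid

# terminal 2: once it appears to be stuck
pgrep -l find        # note the PID of the find spawned by the script
ktrace -dip <PID>    # attach tracing to that PID (and its descendants)
sleep 30             # let it collect ~30 seconds of activity
ktrace -C            # stop all tracing
kdump | less         # look for NAMI (name lookup) records and their paths
rm ktrace.out        # clean up afterwards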

Given the log you have posted, I suspect that the hang has something to do with ZFS, but I don't know enough to determine exactly where all of the burned system-CPU time is going if ktrace does not reveal what find is doing during the apparent hang.
 
Ever since I updated it to 15.0 there's been a noticeable layer of overhead in the filesystem, but I can't pin it to any process yet. Aside from the update, there were no other coincident changes made. Things like changing ownership of a directory - it just takes so much longer now. Jail startups are like 20x as long as they used to be. Even when the load is under 1, there is this residual red in the htop that never used to be there.

I've checked for suspicious things but everything looks fine. It's pretty locked down as it is.

The 100.chksetuid got past the file after it was removed - I am running it in full again tomorrow and will report back. Thanks for the tips!
 
Ever since I updated it to 15.0 there's been a noticeable layer of overhead in the filesystem, but I can't pin it to any process yet. Aside from the update, there were no other coincident changes made. Things like changing ownership of a directory - it just takes so much longer now. Jail startups are like 20x as long as they used to be. Even when the load is under 1, there is this residual red in the htop that never used to be there.

I've checked for suspicious things but everything looks fine. It's pretty locked down as it is.

The 100.chksetuid got past the file after it was removed - I am running it in full again tomorrow and will report back. Thanks for the tips!

You're welcome. Thank you for the status update.

If the filesystem for your jails is on a separate disk, and that disk is slow due to hardware reasons (such as marginal sectors), that could explain repeatable slowness when starting your jails. Slowness could also be due to some data inconsistency within your ZFS dataset(s). You're on the right track; just keep your thinking cap on and "continue digging" until you understand what's actually happening, and then I think the answer will become clear.
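
A few generic checks along those lines (smartctl assumes the smartmontools package is installed, and the device name will differ on your system):
Code:
zpool status -v        # pool health, errors, and scrub/resilver state
gstat -p               # live per-disk I/O load and latency
smartctl -a /dev/ada0  # SMART health data for a disk (adjust the device name)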
 
Alright I have good news: things are snappy again.

Three days in a row earlier this week we restarted the server in an attempt to alleviate whatever was happening, but it never helped. For context, setting ownership on a directory with a standard web app framework (and no data files) took 7 minutes! Now it's 3 seconds.

I don't know what fixed it, but here are the changes performed before the most recent reboot, after which things were fine (none of them fixed it at the time they were made):
  • I removed the freebsd-update_bak directory.
  • I found "missing library" errors for node dependencies in two jails, so I upgraded node/npm to v24. The errors were for libcrypto, libssl and libutil: libraries which were in FreeBSD 14.3 but not in 15.0.
After the reboot I also scrubbed the zpool and there were no errors.

I haven't a clue what the problem was, but things are working fine now. Could it be the system picked up on the naming of freebsd-update_bak and got mixed up?

I will be re-enabling the security reports tonight. We are only using 200 GB of the 400 GB.
 