Solved: Services randomly stopped overnight

This is the 2nd time I've had services die overnight and they all appear to die at the same time via signal 15 or SIGTERM. I am running 2 jails on top of my 'host' system and I've observed:

1. host sshd, syslogd, powerd stopped
2. router [jail] sshd, syslogd, auditd, crowdsec, cron, dhclient stopped
3. workstation [jail] sshd, syslogd, auditd, cron stopped

In addition, the workstation previously did not need to be explicitly added to my pf tables to SSH or reach the Internet. Adding it, however, is how I restored network connectivity to it this time.

The only thing I have running that I can think of as a cause is my nightly update, which updates the system in a new Boot Environment, uses checkrestart to determine what needs to be restarted, and then emails me to reboot into the new BE and restart services. However, it does not restart or stop services itself, and it only performs updates in the new Boot Environment. The jails, on the other hand, are updated in place and then have a snapshot taken of them.

That said, I see a new Boot Environment was created a few days ago, not last night / this morning.

Regarding my workstation jail losing network connectivity without being explicitly added to the table, why was my workstation jail able to SSH to my router jail without being added before? My workstation jail has an epair directly to the router which goes into the bridge with my Local Area Network interface.

Any ideas on what to check for? I also don't see any obvious indications of intrusion, but I'm not an expert. I also don't suspect the system is out of memory or has any hardware issues.

EDIT: I clarified that router and workstation are jails.
 
You could try to limit the ZFS ARC with vfs.zfs.arc_max in /boot/loader.conf. For example, I had to limit the ARC on one of my servers with 16 GB of RAM like this:
vfs.zfs.arc_max="6G"
The right value is your choice and depends on storage capacity, RAM size, etc.
 
Any ideas on what to check for? I also don't see any obvious indications of intrusion, but I'm not an expert. I also don't suspect the system is out of memory or has any hardware issues.
Logfiles, and if those don't give you a conclusive answer then increase the logging verbosity a bit. Services like SSH and Syslog all support that.
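
As a starting point for that log search, here is a minimal sketch that lists termination messages in a log file. The function name find_term_events and the search patterns are assumptions; daemons such as syslogd and sshd typically mention "signal 15" when they exit on SIGTERM, but other services may word it differently.

```shell
#!/bin/sh
# Hypothetical helper: print lines in the given log file(s) that mention
# signal 15 or SIGTERM. The patterns are assumptions; extend them for
# daemons that report termination with different wording.
find_term_events() {
    grep -E -i 'signal 15|SIGTERM' "$@"
}

# Example: compare timestamps between the host and the jails.
# find_term_events /var/log/messages
```

If the timestamps line up to the second across the host and both jails, that points to a single sender rather than independent failures.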
 
The system in question has 16 GB of RAM. That is not a lot, but I think it is sufficient for my purposes.

OK, yeah, I don't see anything other than the services being terminated :(. I will see about increasing log verbosity. In that vein, I *should* only need to do that on the host: it is highly unlikely that a jail is causing services to stop in another jail, let alone on the host. And even if it were, with the host's logging turned up I should hopefully see something there that sheds light on where or what initiated the TERM.
 
This is the 2nd time I've had services die overnight and they all appear to die at the same time via signal 15 or SIGTERM. I am running 2 jails on top of my 'host' system and I've observed:
Are these services running on the host or in the jails?
Have you tried turning off the automatic updating?
 
I updated my original post: both router and workstation are running as jails. I will turn off automatic updating. It writes explicitly to a different logfile and, as best I can tell from the timing and contents of the logs, it is not the culprit.
 
This is the 2nd time I've had services die overnight and they all appear to die at the same time via signal 15 or SIGTERM.
Screwed up the signals in newsyslog.conf(5) perhaps? Instead of sending SIGHUP (or SIGUSR1) you're sending SIGTERM?

Code:
     signal  This optional field specifies the signal that will be sent to the
             daemon process (or to all processes in a process group, if the U
             flag was specified).  If this field is not present, then a SIGHUP
             signal will be sent.  Signal names must start with “SIG” and be
             the signal name, e.g., SIGUSR1.  Alternatively, signal can be the
             signal number, e.g., 30 for SIGUSR1.
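
A quick way to check for this is to scan the configuration for entries that name SIGTERM explicitly. This is only a sketch: the helper name find_sigterm_entries is hypothetical, and it assumes the signal is the last field on the line and directly follows a pid file path, which is how the field is usually written.

```shell
#!/bin/sh
# Hypothetical helper: print non-comment newsyslog.conf entries whose last
# field is SIGTERM (or its number, 15) and whose preceding field looks like
# a pid file path. Assumption: the signal field follows the pid file field.
find_sigterm_entries() {
    awk '
        /^[[:space:]]*#/ { next }   # skip comment lines
        NF < 2           { next }   # skip blank or too-short lines
        ($NF == "SIGTERM" || $NF == "15") && $(NF - 1) ~ /^\// { print }
    ' "$@"
}

# Example: check the main file and any included fragments.
# find_sigterm_entries /etc/newsyslog.conf /etc/newsyslog.conf.d/*.conf
```

Requiring a pid file path in the second-to-last field avoids flagging an entry whose rotation interval merely happens to be 15.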
 
Good call, but I'm not specifying any signals here, so it should be the default as above. But it is sporadic. Will keep digging.
 
Yeah, I got triggered by the "appear to die at the same time", and I've done something similar in the past (sending the wrong signal; killing the process instead of restarting) 😁

But if it's more sporadic, it might be OOM (Out-Of-Memory) that's killing random processes. So keep an eye on memory usage (including swap) too.
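
One way to test the OOM theory is to search the kernel messages for kill notices. A minimal sketch, assuming the kernel's notice contains the phrase "was killed:" (for example "was killed: out of swap space"); the helper name find_oom_kills is hypothetical. As far as I know the FreeBSD OOM killer uses SIGKILL rather than SIGTERM, so daemons that log a clean exit on signal 15 would point away from OOM.

```shell
#!/bin/sh
# Hypothetical helper: print kernel messages that report a process being
# killed by the system. The "was killed:" wording is an assumption about
# the message format; adjust the pattern if your release words it differently.
find_oom_kills() {
    grep -E 'pid [0-9]+ \(.*\).*was killed:' "$@"
}

# Example:
# find_oom_kills /var/log/messages
```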
 
One other thing: check that the periodic jobs (security checks, pkg checksums) and the update do not run at the same time.
They're executed sequentially, I'm still checking and will turn up the logs to see if I can figure it out. It is strange because it isn't every night. Hmm, interesting, my zsh history is messed up on my workstation, but the others are intact. It appears this last happened on 04/26.

I'm not using swap, so my system would be more memory constrained possibly.
 
Oh, the other thing I forgot about is my nightly session killer. The idea is that I want to free up resources as well as potentially break any persistent connections:
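Code:
pkill Xorg
pkill ssh-agent
# Signal every process owned by each logged-in user (pkill defaults to TERM)
who | awk '{print $1}' | xargs -L 1 -I _USER_ pkill -u _USER_
# Signal every process attached to each logged-in terminal
who | awk '{print $2}' | xargs -L 1 -I _TTY_ pkill -t _TTY_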

I had been using this script for years and haven't had an issue to my knowledge, but that would also kill anything running as the root user.

EDIT: I removed an option in the cmdline that was unset.

EDIT: This must be it because by default, it sends a TERM:
Code:
     -signal  A non-negative decimal number or symbolic signal name
              specifying the signal to be sent instead of the default
              TERM.  This option is valid only when given as the first
              argument to pkill.

Again, to reiterate, some incarnation of this had been running for years, I'll go through my history to see what it was earlier.

EDIT: around November of last year, I generalized the script to target all system processes instead of being confined to a jail, the idea being that I'd just run this on the host and the containers would not run this.
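
To illustrate why the script reaches so far: if root has a login session on the host, who lists root, and pkill -u root then sends TERM to every root-owned process the host can see, which on a FreeBSD host includes processes inside the jails. The sketch below is one possible way to narrow and preview it, not the original script. The helper name session_users is hypothetical; it drops root and de-duplicates the names, and the commented usage only lists matches with pgrep instead of signalling them.

```shell
#!/bin/sh
# Hypothetical helper: read `who` output on stdin and print each logged-in
# user once, skipping root so that system daemons are never targeted.
session_users() {
    awk '$1 != "root" && !seen[$1]++ { print $1 }'
}

# Dry run: list what would be matched instead of signalling it.
# who | session_users | xargs -I _USER_ pgrep -l -u _USER_
#
# On FreeBSD, adding "-j none" to pgrep/pkill restricts matches to
# processes that are not in a jail.
```

Running the pgrep preview for a few nights before switching back to pkill would show exactly which processes are in scope.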

 
You should have mentioned this earlier.

To test whether this script is responsible disable it or have it send a different signal.
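
As a small illustration of why switching signals is a useful test: a process ended by a signal leaves that signal's number visible to whoever waits on it, so a different signal in the script would leave a different, recognizable trace. A minimal sketch; the function name signal_exit_status is hypothetical, and it relies on the common shell convention that wait reports 128 plus the signal number.

```shell
#!/bin/sh
# Hypothetical demo: start a throwaway process, send it the given signal,
# and print the exit status the shell reports for it.
signal_exit_status() {
    sig=$1
    sleep 30 >/dev/null 2>&1 &
    pid=$!
    kill -s "$sig" "$pid"
    status=0
    # wait returns non-zero for a signalled child, so capture it explicitly
    wait "$pid" 2>/dev/null || status=$?
    echo "$status"
}

# signal_exit_status TERM   # prints 143 (128 + 15)
# signal_exit_status HUP
```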
 
I am fairly confident that is it, I will do some more reading on the signals I can send that would make it more apparent.

I ran the command directly and that was it: it killed every service I had found stopped. I think I need to revisit the original question and design and reconsider what I'm doing. Even if I revert to what I thought had been working for years, targeting processes within a jail, that too should fail.

I will disable this script and instead pursue updating login.conf as that probably makes the most sense. Rather than write a script, I should try to leverage the tools that are already there and more thoroughly tested and reliable.
 