Shell rc.d script returns status running while the pid is no longer alive

I've based my rc.d script on the odoo one, but I'm installing into a virtualenv.
From time to time the process crashes. However, /usr/local/etc/rc.d/odoo status still reports it as running.
When I check the pid stored in /var/run, there's no process with that pid.

To test this, I have now set procname to the path of the python3.8 binary in the virtualenv, but I don't think that's what procname is intended for. If I understood correctly, its purpose is to prevent the service from being reported as running when the pid actually belongs to some other binary.

How does this pid get checked in order to verify that it's still running?
 
How does this pid get checked in order to verify that it's still running?
If you copied the rc(8) script from finance/odoo, look at the two functions odoo_status() and odoo_stop(): you'll see they only check for the existence of the PID file. The status function doesn't check whether the pid itself is still running. When the application crashes, the pid file remains; it doesn't get deleted.

Code:
odoo_status()
{
	# If running, show pid
	if [ -f "${pidfile}" ]
	then
		echo "${name} is running as pid $(cat "${pidfile}")"
	else
		echo "${name} is not running"
	fi
}
 
OK, I thought there might be some other check mechanism in the rc.d framework that would do this "by default" perhaps.
So I basically have to adapt that function so that it actually checks the pid.
Thanks for the quick answer.
 
I thought there might be some other check mechanism in the rc.d framework that would do this "by default" perhaps.
There is. But the odoo rc script overrides the 'default' functions with its own _status() and _stop() functions.
 
The default behavior is to use check_pidfile (or check_process if no pidfile is set) from rc.subr(8) to check whether the process is indeed running. This uses ps(1) to look up the pid, so it can verify that it's not some reused pid but indeed the process we're looking for. It includes support for script interpreters.

I can only guess there might be very special situations where that logic fails. But then, not checking the pid at all seems a pretty stupid choice. You could at least check whether a process with that pid exists (e.g. with kill -0 ${pid}), that's far better than just nothing...
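
To illustrate the kill -0 suggestion, here is a minimal sketch of a status function that at least tests whether the recorded pid is alive. pid_is_alive is a hypothetical helper name, not part of rc.subr or the odoo port; ${name} and ${pidfile} are assumed to be set as in the original script. Note that kill -0 also fails when the caller lacks permission to signal the process, so this assumes the script runs as root or as the daemon's user.

Code:
```sh
# Hypothetical helper: succeeds only if the pid file is readable and the
# pid stored in it belongs to a live process.
pid_is_alive()
{
	_pidfile=$1
	[ -r "${_pidfile}" ] || return 1
	# read may report failure on a file without a trailing newline,
	# so its exit status is ignored and the value is validated instead.
	read -r _pid _junk < "${_pidfile}" || :
	case "${_pid}" in
	''|*[!0-9]*) return 1 ;;
	esac
	[ "${_pid}" -gt 0 ] || return 1
	# Signal 0 performs the existence check without sending a signal.
	kill -0 "${_pid}" 2>/dev/null
}

# Sketch of a status function built on the helper above.
odoo_status()
{
	if pid_is_alive "${pidfile}"
	then
		echo "${name} is running as pid $(cat "${pidfile}")"
	else
		echo "${name} is not running"
		return 1
	fi
}
```

This still cannot tell a reused pid from the real daemon, which is the gap the rc.subr default closes by also looking at the process name.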
 
The pid check is always a heuristic. I think the best version works as follows: When the daemon starts, it deposits its pid and some nonce-like information (like the time it started) in the pid file, or some similar place. When checking whether the daemon is still running, we check:
  (a) The pid file exists, is readable, and contains the expected information.
  (b) A process with that pid is running.
  (c) That process is running the correct executable.
  (d) The process, when interrogated via a side channel (like its command socket), confirms its nonce, and it matches the information in the pid file.

The normal rc.subr mechanism does (a) through (c), which in most cases is good enough. The even stricter version (d) is only required if the daemon is stateful and restarts of the daemon outside the control of the rc subsystem could cause damage. Error handling in those cases is tedious. It's usually good practice to combine this with a mechanism that makes sure two copies of the daemon are not running, except when carefully controlled for debugging (with the debugging instance typically on different ports, with different control and status files, and so on).
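
Checks (a) through (c) can be sketched in plain sh roughly as follows. This is not the rc.subr code, only an illustration of the idea; check_pid_and_name is a hypothetical helper name, and the sketch assumes a ps(1) that accepts -o comm= and -p.

Code:
```sh
# Hypothetical helper mirroring checks (a) to (c). Prints the pid and
# succeeds only if the pid file is readable, the pid is alive, and the
# process has the expected command name.
check_pid_and_name()
{
	_pidfile=$1
	_expected=$2

	# (a) the pid file exists, is readable, and holds a number
	[ -r "${_pidfile}" ] || return 1
	read -r _pid _junk < "${_pidfile}" || :
	case "${_pid}" in
	''|*[!0-9]*) return 1 ;;
	esac

	# (b) a process with that pid exists; ps prints nothing otherwise
	_comm=$(ps -o comm= -p "${_pid}" 2>/dev/null) || return 1
	[ -n "${_comm}" ] || return 1

	# (c) it runs the expected executable; some systems print a full
	# path here, so only the last path component is compared
	[ "${_comm##*/}" = "${_expected}" ] || return 1

	echo "${_pid}"
}
```

For a script started through an interpreter, the command name seen here is the interpreter (for example python3.8) rather than the script, which is why a real implementation needs the interpreter support mentioned earlier in the thread.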
 
ralphbsz: sure, it's always a heuristic, but if you also check the process name (like rc.subr does), it's a pretty good one. It's already unlikely that exactly the PID of your crashed daemon gets reused, and even more unlikely that a new instance of that same daemon gets that PID.

Your idea would of course work, but it would require extending the protocol a daemon must implement. If you think about implementing that idea, I'd suggest separating out the nonce into its own file, e.g. like
Code:
/var/run/frob.pid: 1234
/var/run/frob.1234.nonce: hello$frob
This would still allow it to be used with init scripts unaware of your new nonce.
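
A rough sketch of that two-file layout, assuming the naming scheme shown above. write_pid_and_nonce and read_nonce are hypothetical helper names, and in practice the daemon itself would write both files at startup.

Code:
```sh
# Hypothetical layout: <dir>/<name>.pid holds the pid and
# <dir>/<name>.<pid>.nonce holds the nonce.
write_pid_and_nonce()
{
	_dir=$1
	_name=$2
	_pid=$3
	_nonce=$4
	printf '%s\n' "${_pid}" > "${_dir}/${_name}.pid"
	printf '%s\n' "${_nonce}" > "${_dir}/${_name}.${_pid}.nonce"
}

# Prints the nonce that belongs to the pid currently in the pid file.
# Fails if the pid file or the matching nonce file is missing.
read_nonce()
{
	_dir=$1
	_name=$2
	[ -r "${_dir}/${_name}.pid" ] || return 1
	read -r _pid _junk < "${_dir}/${_name}.pid" || :
	case "${_pid}" in
	''|*[!0-9]*) return 1 ;;
	esac
	[ -r "${_dir}/${_name}.${_pid}.nonce" ] || return 1
	cat "${_dir}/${_name}.${_pid}.nonce"
}
```

A nonce-aware status check would compare the output of read_nonce with what the daemon reports over its side channel, while an older init script keeps reading only the .pid file.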

Talking about that topic, there's of course a reason systemd doesn't want daemons to "daemonize" at all. They want to perfectly reliably know the state of a daemon they launched, so they want to rely on SIGCHLD to learn about changes. This makes some sense, but I still hate the idea, as it comes with several drawbacks:
  • It's an intrusive change!
  • You either lose the ability to use your daemon without any "service manager" (friendly greetings to Windows' svchost), or you additionally implement classic "daemonizing" and offer a command-line switch.
  • You need an additional protocol to let the service manager know when your daemon's initialization is complete (so it's up and running). With classic daemonizing, this is "implicit": It's done when the parent process exits successfully.
So, if you think a solution is needed for e.g. an init-script to know "is this still the instance we started?", I'd like your idea better, because it could be an optional extension.

But finally: preventing multiple instances can be implemented in the daemon itself. The way to do it is to put a lock on your pidfile with fcntl(2) F_SETLK. This lock is automatically released whenever the owning process exits. So, on startup, you check the lock on your pidfile (F_GETLK reports the pid of the lock owner). If the pidfile is unlocked, it's stale. If the lock owner doesn't match the pid written in the file, you know you hit a bug and log a "fatal" message. If the lock owner matches, you abort startup and log an "info" or "warning" message that the service is already running.
 
TIL ... :) I just commented out the status function and tested the status command. All seems to work fine. Let's see how it behaves.
Thanks all
 