slurm-wlm: Unkillable slurmctld

Hello,

I deleted slurm-wlm 20.07 and installed 23.11.1_1 built using poudriere, with default options. I am running FreeBSD 13.2.

I can start the slurmctld daemon using onestart, but there are some errors in the log, probably caused by misconfiguration. Ideally it should be possible to stop the daemon, reconfigure and try again. However, onestop does not stop the daemon, and I have not found any way to kill it from the command line. I have tried sudo kill $(sudo pgrep slurm) and variations using the PID gleaned from pgrep. I have also tried onerestart, but that just starts another instance of the daemon.

If anyone has some ideas, please share them so I don't have to keep rebooting the machine.

Thanks,
sprock
 
You should be able to stop it using kill(1). That's what the rc(8) script does too. But the process might be in a zombie state or stuck waiting for some signal that never comes.

Code:
slurmctld_stop() {
    if [ -e $pidfile ]; then
        checkyesno slurmctld_enable && echo "Stopping $name." && \
            kill `cat $pidfile`
    else
        killall $name
    fi
}
 
I've tried kill, but that does not work; the daemon sails on. It is not in a zombie state:

Code:
sudo ps auwx | grep slurm
slurm   79120 100.0  0.1  17300  5580  -  R    09:18   259:35.37 slurmctld: slurmscriptd (slurmctld)
 
Have you tried any other signals? Like kill -9 <pid>? That should really, really kill a process. It might not respond to a polite SIGTERM (what kill(1) sends by default).

Code:
     9     SIGKILL          terminate process    kill program
Code:
     15    SIGTERM          terminate process    software termination signal
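
If you end up doing this repeatedly, the escalation from SIGTERM to SIGKILL can be scripted. This is only a minimal sketch: stop_with_escalation is a made-up helper name, not something from the port or from rc(8), and the 10-second default grace period is an assumption you may want to change.

```shell
#!/bin/sh
# Hypothetical helper (not part of the slurm-wlm port): ask politely with
# SIGTERM first, then fall back to SIGKILL if the process is still there.
stop_with_escalation() {
    pid=$1
    timeout=${2:-10}   # seconds to wait after SIGTERM (assumed default)

    # kill -0 sends no signal; it only checks that the process exists.
    kill -0 "$pid" 2>/dev/null || return 0

    kill -TERM "$pid" 2>/dev/null
    while [ "$timeout" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        timeout=$((timeout - 1))
    done

    # Still alive: SIGKILL cannot be caught or ignored by the process.
    kill -KILL "$pid" 2>/dev/null
    sleep 1

    # Succeed only if the process is really gone.
    ! kill -0 "$pid" 2>/dev/null
}
```

Run it as root (the daemon belongs to the slurm user) and pass the PID reported by pgrep or ps, plus an optional grace period in seconds, for example stop_with_escalation 79120 5.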
 
I tried 15, not 9.

Since then I see there is a problem with the port: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=276001

I have worked through some of the bind calls that are mentioned in the bug report, and found out how to disable cgroups in the config. However, I cannot get the slurmd daemon to connect to the controller. Unfortunately, socket programming is beyond my expertise. I don't think I can fix this.
 