Can't KILL Process in the STOP state

megapearl · May 18, 2016

Hello,

How can I force killing a PID in the STOP state?
kill -9 910 won't work...

I tried different commands, including killall(1), but the process stays...
I can't reboot the machine at the moment. It's the third time this has happened with this process. Maybe it's a bug in mono, but I don't know.

I'm running

Code:

FreeBSD fileserver.flissinger.local 10.2-RELEASE-p16 FreeBSD 10.2-RELEASE-p16 #65: Fri May 13 15:49:32 CEST 2016     donald@fileserver.flissinger.local:/usr/obj/usr/src/sys/FILESERVER  amd64

  PID USERNAME       THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
  910 sonarr           2  20    0   748M   553M STOP   11  26.2H   0.00% mono-sgen

Any help would be appreciated!
Thanks!

Donald.

SirDice · May 19, 2016

megapearl said:
How can I force killing a PID in the STOP state?
kill -9 910 won't work...

You can't kill a process that's already dead. It's a zombie process.

gpw928 · May 21, 2016

Hmm,

My reading of the ps(1) man page says that STATE should show "Z" for a zombie process.

Zombies only consume a tiny slot in the system process table, kept so that an anxious parent can wait(2) and get back the exit status.

Your pid has 553M resident, so I don't think it's a zombie.

Have you tried sending SIGCONT (continue after SIGSTOP)? This should get it runnable so it will get scheduled, and the kernel will then notice any other signals waiting for it.
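A minimal sketch of that check-then-continue sequence, run against a throwaway sleep process rather than the real PID. The helper name proc_state is made up for illustration. In ps(1) output a state starting with T means stopped and one starting with Z means zombie.

```shell
#!/bin/sh
# proc_state is a hypothetical helper: print the ps(1) state string for a PID.
proc_state() {
    ps -o stat= -p "$1" | tr -d ' '
}

# Stand-in for the stuck process: a sleep that we stop ourselves.
sleep 300 &
pid=$!
kill -STOP "$pid"
sleep 1
stopped_state=$(proc_state "$pid")
echo "after SIGSTOP: $stopped_state"    # state starts with T while stopped

# Resume it, so that it can be scheduled again.
kill -CONT "$pid"
sleep 1
resumed_state=$(proc_state "$pid")
echo "after SIGCONT: $resumed_state"

# Clean up the stand-in process.
kill -KILL "$pid"
wait "$pid" 2>/dev/null || true
echo "after SIGKILL: $(proc_state "$pid")"    # empty once the process is gone
```

If the real process still shows a T state after kill -CONT, it is not simply waiting for SIGCONT and something else is holding it.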

gpw928 · May 21, 2016

A thought on prevention. This looks like a long running process, without interactive tty required.

If this is the case, make sure that mono-sgen does not have a controlling terminal, to avoid getting hit with unexpected signals relating to controlling tty (e.g. SIGHUP, SIGTTIN, SIGTTOU). Also it needs to be a session and process group leader, to avoid getting hit with group signals (see killpg(2)).

If you need to do this, use daemon(8), and redirect file descriptors 0, 1, and 2 appropriately, e.g. to launch with Bourne shell:

daemon mono-sgen </dev/null 1>logfile 2>&1 &
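To check whether an already running process is detached in the way described, ps(1) can print its process group, session, and controlling terminal. A rough sketch: show_session is a made-up helper name, and the shell's own PID stands in for the mono-sgen PID.

```shell
#!/bin/sh
# show_session is a hypothetical helper: print PID, process group ID,
# session ID and controlling terminal for one process.
show_session() {
    ps -o pid,pgid,sid,tty -p "$1"
}

# Demonstrated on the current shell; substitute the PID of mono-sgen.
show_session "$$"
```

A process without a controlling terminal shows a placeholder such as - or ? in the terminal column (the exact placeholder depends on the ps implementation). A session leader has a session ID equal to its own PID.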

Chris_H · May 21, 2016

Greetings,
Neither kill -HUP 910 nor kill -KILL 910 gets it?
Further, top -P should list something like:

Code:

last pid: 62603;  load averages:  0.09,  0.12,  0.08  up 4+06:57:35  20:21:40
99 processes:  1 running, 97 sleeping, 1 zombie
CPU 0:  2.0% user,  0.0% nice,  0.8% system,  0.0% interrupt, 97.2% idle
CPU 1:  2.4% user,  0.0% nice,  0.4% system,  0.0% interrupt, 97.2% idle
CPU 2:  1.2% user,  0.0% nice,  1.6% system,  0.0% interrupt, 97.2% idle
Mem: 1150M Active, 1656M Inact, 750M Wired, 87M Cache, 415M Buf, 279M Free

Note the zombie count in the second row. Any indication of that on yours?
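The same information is available per process from ps(1), which avoids reading it off the top(1) summary line. A small sketch, assuming a ps that accepts -A and the stat keyword (FreeBSD and Linux both do); list_by_state is a hypothetical helper name.

```shell
#!/bin/sh
# list_by_state is a hypothetical helper: print every process whose ps(1)
# state starts with the given letter (Z = zombie, T = stopped).
list_by_state() {
    ps -A -o pid= -o stat= -o comm= | awk -v s="$1" 'substr($2, 1, 1) == s'
}

echo "zombies:"
list_by_state Z
echo "stopped:"
list_by_state T
```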

HTH

--Chris

SirDice · May 23, 2016

It may not be indicated as a zombie process but for all intents and purposes it is one. I've had it happen on a number of occasions. The process tries to STOP and then waits forever on some signal that never comes. Not much you can do about it, it's going to be stuck there. Dropping to single user mode might remove it but then you might as well reboot all the way.
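One way to see what a stuck process is waiting on is the wait channel column of ps(1). This is only a sketch: show_wait is a made-up helper name, and the shell's own PID stands in for the stuck one.

```shell
#!/bin/sh
# show_wait is a hypothetical helper: print state and kernel wait channel
# (the WCHAN column) for one process.
show_wait() {
    ps -o pid,stat,wchan,comm -p "$1"
}

# Demonstrated on the current shell; substitute the stuck PID.
show_wait "$$"

# FreeBSD only, not run here: procstat -kk <pid> prints the kernel stack
# of each thread, which shows where in the kernel the process is parked.
```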

Chris_H · May 23, 2016

Hello, megapearl .
I remember visiting this topic myself a while back. I ended up (re)creating preap -- Process REAPer.
I thought it was here that I brought it up. But the searches I performed were to no avail. So I went to the mailing lists, and searched. But again, no joy.

Frustrating as h3ll. I spent quite some time creating it, for a situation I had, not unlike yours. Sometime later, it seemed to resolve itself. So I stopped using it, and now can no longer find my work.

I'll keep looking for it. But in the meantime, performing a search on preap may give you some clues. As memory serves, it was a tool originally created by Sun Microsystems (now Oracle). I'm pretty sure I can find my work. But my real $job keeps me from giving it the time I'd like to.

--Chris

megapearl · Jun 28, 2016

Hi,

I restarted FreeBSD to get the process running again, but some weeks later the problem returned.
I can't send any signal to the PID in the STOP state; nothing happens.

top -P gives me:

Code:

last pid: 66410;  load averages:  0.21,  0.23,  0.23  up 6+10:28:31  22:05:56
87 processes:  1 running, 85 sleeping, 1 stopped
CPU 0:   0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 1:   0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 2:   0.7% user,  0.0% nice,  0.0% system,  0.0% interrupt, 99.3% idle
CPU 3:   0.0% user,  0.0% nice,  0.3% system,  0.0% interrupt, 99.7% idle
CPU 4:   0.0% user,  0.0% nice,  0.4% system,  0.0% interrupt, 99.6% idle
CPU 5:   0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 6:   0.0% user,  0.0% nice,  0.4% system,  0.0% interrupt, 99.6% idle
CPU 7:   0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 8:   0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 9:   0.0% user,  0.0% nice,  0.0% system,  0.4% interrupt, 99.6% idle
CPU 10:  0.8% user,  0.0% nice,  0.4% system,  0.0% interrupt, 98.8% idle
CPU 11:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 12:  0.0% user,  0.0% nice,  0.4% system,  0.0% interrupt, 99.6% idle
CPU 13:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 14:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 15:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
Mem: 637M Active, 7113M Inact, 54G Wired, 1287M Free
ARC: 51G Total, 6767M MFU, 42G MRU, 357K Anon, 268M Header, 1910M Other
Swap: 8192M Total, 8192M Free

  PID USERNAME       THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
  923 sonarr           2  20    0   858M   680M STOP   15  42.7H   0.00% mono-sgen

kill -HUP 923 and kill -KILL 923 don't work either.

I don't know whether this is a FreeBSD, mono, or sonarr issue.
It happens only with the mono process; that's the only one having this issue.
I will try the solution of gpw928 (daemon mono-sgen </dev/null 1>logfile 2>&1 &) and check preap too.

The rc script of sonarr is:

Code:

#!/bin/sh
#
# Author: Mark Felder <feld@FreeBSD.org>
#
# $FreeBSD$
#

# PROVIDE: sonarr
# REQUIRE: LOGIN
# KEYWORD: shutdown

# Add the following lines to /etc/rc.conf to enable sonarr:
# sonarr_enable="YES"

. /etc/rc.subr

name="sonarr"
rcvar=sonarr_enable

load_rc_config $name

: ${sonarr_enable="NO"}
: ${sonarr_user:="sonarr"}
: ${sonarr_data_dir:="/usr/local/sonarr"}

pidfile="${sonarr_data_dir}/nzbdrone.pid"
procname="/usr/local/bin/mono"
command="/usr/sbin/daemon"
command_args="-f ${procname} /usr/local/share/sonarr/NzbDrone.exe --nobrowser --data=${sonarr_data_dir}"
start_precmd=sonarr_precmd

sonarr_precmd()
{
        export XDG_CONFIG_HOME=${sonarr_data_dir}

        if [ ! -d ${sonarr_data_dir} ]; then
                install -d -o ${sonarr_user} ${sonarr_data_dir}
        fi
}

run_rc_command "$1"

Maybe I should write to the developers of sonarr (nzbdrone) too?

gpw928 · Jun 30, 2016

Hi,

The rc(8) script above already invokes the daemon(8) command in the way I suggested ( /usr/sbin/daemon -f does what's needed). Your process has a state of STOP, suggesting it has been sent a stop signal such as SIGSTOP. This leaves it un-runnable until SIGCONT is sent to it. Most signals sent to a stopped process stay pending (they are only acted on once the process becomes runnable again). Try sending SIGCONT to the problem process ( kill -CONT <pid>, which is signal 19 on FreeBSD). That might get it running.
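A caution on numeric signals: the numbers are not the same on every system. 19 is SIGCONT on FreeBSD but SIGSTOP on Linux, so the symbolic name is the safer choice. A small sketch that prints what the local system calls signals 17 to 19 (the variable name sig19 is just for illustration):

```shell
#!/bin/sh
# Print the names this system assigns to signal numbers 17, 18 and 19.
# FreeBSD: 17 = STOP, 19 = CONT.  Linux (x86): 18 = CONT, 19 = STOP.
for n in 17 18 19; do
    printf '%s = SIG%s\n' "$n" "$(kill -l "$n")"
done

# Keep the name of 19 around to make the difference explicit.
sig19=$(kill -l 19)
echo "kill -19 on this system sends SIG$sig19"

# Portable form, independent of the numbering:
#   kill -CONT <pid>
```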

I think that you are suffering from an application bug.

Cheers,

dohmniq · Jun 30, 2016

Another possibility is that the process is currently dumping core.
Here's an example of mysql (server): [edited for clarity]

# ps -augx -p10834
USER     PID  %CPU %MEM     VSZ     RSS TT  STAT STARTED    TIME COMMAND
mysql  10834   0.6 29.8 5238172 4990904  -  T    10:22AM 0:10.14 /usr/local/libexec/mysqld
# kill -9 10834
# ps -augx -p10834
USER     PID  %CPU %MEM     VSZ     RSS TT  STAT STARTED    TIME COMMAND
mysql  10834   0.6 29.9 5238172 4991928  -  T    10:22AM 0:10.14 /usr/local/libexec/mysqld
# ls -lh mysqld.core
-rw-------  1 mysql  mysql   4.9G Jun 30 10:32 mysqld.core

As the process dumps core, the "RSS" column will increase to meet the "VSZ" column. (Because virtual pages are paged in to be dumped?)
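To watch for that pattern on a suspect process, the two columns can be sampled repeatedly. A rough sketch: rss_vsz is a made-up helper name, the shell's own PID stands in for the dumping process, and the sample count is arbitrary.

```shell
#!/bin/sh
# rss_vsz is a hypothetical helper: print resident and virtual size
# (both in kilobytes) for one process.
rss_vsz() {
    ps -o rss= -o vsz= -p "$1"
}

# Take two samples one second apart; substitute the suspect PID for $$
# and raise the count to follow a long dump.
pid=$$
i=0
while [ "$i" -lt 2 ]; do
    rss_vsz "$pid" | {
        read -r rss vsz
        echo "rss=${rss}K vsz=${vsz}K"
    }
    i=$((i + 1))
    sleep 1
done
```

RSS climbing steadily towards VSZ while the state stays T would fit the core dump explanation.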

There seems to be no way to abort the core dump process. Plus dumping core seems to bring ALL virtual pages into memory, even ones that have been dumped already, so if you don't have enough free memory for ALL of VSZ then you hit memory issues and processes die, etc.
I'd love it if I could abort coredumps with a "kill -9". I'd also prefer it if the kernel could somehow "free" memory pages that have been dumped.

Mitigation is to limit core dump sizes with ulimit(1), or to disable them with the kern.coredump sysctl.
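A minimal sketch of both mitigations. The ulimit line affects only the current shell and the processes it starts afterwards, so for a service it would have to run in whatever starts the daemon. The sysctl line is FreeBSD-specific, needs root, and is shown as a comment only.

```shell
#!/bin/sh
# Forbid core files for this shell and everything it starts afterwards.
ulimit -c 0
echo "core file size limit: $(ulimit -c)"

# FreeBSD, system-wide, needs root (not run here):
#   sysctl kern.coredump=0
```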
