How to kill a running process?

Traditionally, with kill -9 we could unconditionally kill a running process. This doesn't work anymore:

Code:
# ps ax | grep 91293
91293  -  RJ     1121:57.70 pg_dump -U postgres -p 5432 -bF c -f /var/db/pg-i
# kill -9 91293
# ps ax | grep 91293
91293  -  RJ     1122:25.71 pg_dump -U postgres -p 5432 -bF c -f /var/db/pg-i
# kill -9 91293
# ps ax | grep 91293
91293  -  RJ     1122:31.02 pg_dump -U postgres -p 5432 -bF c -f /var/db/pg-i

That thing is in an endless loop, continuously eating 100% of one CPU. How to get rid of it?
 
pkill pg_dump or pkill 91293
Why?

pgrep, pkill – find or signal processes by name
HISTORY
The pkill and pgrep utilities first appeared in NetBSD 1.6. They are
modelled after utilities of the same name that appeared in Sun Solaris 7.
They made their first appearance in FreeBSD 5.3.
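
Worth noting: pgrep and pkill match against the process name (or, with -f, against the full argument list), not the PID, so pkill pg_dump is the form that applies here. A minimal sketch of the usual invocations, using the process from the example above:

Code:
# pgrep -l pg_dump
# pkill pg_dump
# pkill -9 -f 'pg_dump -U postgres'

The first command only lists matching PIDs and names, the second sends the default SIGTERM, and the third sends SIGKILL to any process whose full command line matches the pattern.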

Do you think pkill is in any way "better" than old-fashioned kill?

J Marks a process which is in jail(2)
Yes. I tried to kill it with kill -9 from within the jail, and from the host. Neither had any effect. No logging, either.
So that means processes running in a jail cannot be killed at all? ;)
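
For completeness, the state flags and the jail a process belongs to can be checked directly with ps(1); a small sketch using standard FreeBSD keywords and the PID from above:

Code:
# ps -o pid,jid,state,wchan,command -p 91293

Here jid is the jail ID (0 means not jailed), state shows R/D/Z and friends, and wchan names the kernel event the process is sleeping on, if any.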


And I found the, eh, "root cause": that process was writing to a file on a UFS filesystem. The disk had a bad block that could not be read:

Code:
# dd if=/dev/da6 of=/dev/null bs=64k
dd: /dev/da6: Input/output error
4267+0 records in
4267+0 records out
# dd if=/dev/zero of=/dev/da6 seek=4267 count=1 bs=64k
1+0 records in
1+0 records out
65536 bytes transferred in 1.092641 secs (59979 bytes/sec)
# dd if=/dev/da6 of=/dev/null bs=64k
# echo $?
0

Apparently all fine now, back in business. What I don't like is that it needed a reboot to get rid of these stuck processes.
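
If the drive supports SMART, it may also be worth checking whether the zeroed block was actually reallocated; a sketch assuming sysutils/smartmontools is installed (it is not part of the base system):

Code:
# smartctl -a /dev/da6

For ATA-style drives, attributes such as Reallocated_Sector_Ct and Current_Pending_Sector indicate whether the drive remapped the bad block after the overwrite.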
 
I've seen processes in jails defy killing, BUT, they've always had the state of D for uninterruptible. Eventually, though, they stop being uninterruptible and die as they should. Sometimes that can take minutes. Your process shows no D though.

Your situation seems very odd and possibly is a bug. Even if the file system is junk, the driver should time out and allow the process to be killed.
 
I've seen this happen to processes that were accessing an NFS filesystem. It's been a long while, though. Here's something similar:
 
I've seen processes in jails defy killing, BUT, they've always had the state of D for uninterruptible. Eventually, though, they stop being uninterruptible and die as they should. Sometimes that can take minutes. Your process shows no D though.
Yep. And I verified this one with top, and there was always one CPU shown at 100%. (In D state it would not eat CPU cycles.)

Your situation seems very odd and possibly is a bug. Even if the file system is junk, the driver should time out and allow the process to be killed.
There was no damage to the filesystem. The process was trying to create a new file; this did not return, and after the reboot fsck did show an inconsistency - as expected.
Then I found and zeroed the bad block, let the disk do whatever housekeeping it might require, and after that fsck reported no further errors. Maybe one of the files now contains wrong data - but the filesystem itself never had a flaw.

The logfiles show device errors, ending in Error 5, Retries exhausted.
And after these, about 100 errors like the following appear:
Code:
kernel: GEOM_ELI: g_eli_write_done() failed (error=5) da6p1.eli[WRITE(offset=279674880, length=131072)]
kernel:
kernel: g_vfs_done():da6p1.eli[WRITE(offset=279674880, length=131072)]error = 5

So it seems that Geli might be part of the problem and does not care much about device errors.

Anyway, after not being able to stop the process (or unmount), I unplugged the disk. It was all correctly detached (it even made it to syslog), and right after that the machine panicked in some "softdep_" code - and was not in the mood to write a coredump.

So there is no bug hunting possible from the data obtained. :/

I think the primary cause is that Geli reduces the fault tolerance of the underlying device. But another question is: how can a process receive a hard kill and still continue to obtain CPU time slices (on various cores)? I don't think I've ever seen that before...
 
I think the primary cause is that Geli reduces the fault tolerance of the underlying device. But another question is: how can a process receive a hard kill and still continue to obtain CPU time slices (on various cores)? I don't think I've ever seen that before...

The thing is, though, the kernel shouldn't give it an option.

I wonder if this is a shell built-in command that's running and somehow testing for X before attempting a true kill?
You know something like "If process is alive, kill it otherwise wait for it".
 
…bad block … Input/output error

Any number of unwanted or troublesome behaviours may ensue.

possibly is a bug. Even if the file system is junk, the driver should time out and allow the process to be killed. …

Depending on the context of an error, it's not unusual for an operating system halt to fail in response to shutdown -p now.

shutdown(8)

Ideally: things should be more graceful.

Realistically: it's sometimes necessary to force off the power.
 
Any number of unwanted or troublesome behaviours may ensue.
In the old times this was the case: a device error would likely require a reboot.
But nowadays we have all these modular designs, pluggable and hot-pluggable components - that should be more robust, but obviously it is not perfect or bug-free.
Also, nowadays we have much bigger installations: hundreds of subsystems, dozens of jails, lots of guests on a single physical instance - it's not always fun to crash such a setup in mid-flight.

I wonder if this is a shell built-in command that's running and somehow testing for X before attempting a true kill?
You know something like "If process is alive, kill it otherwise wait for it".
It is a shell built-in, in the default root shell /bin/csh. That points to /usr/src/contrib/tcsh - and I'm currently not in the mood to read all of that. ;)
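
tcsh does have a kill built-in, so one way to rule it out is to call the external utility directly; a quick sketch:

Code:
# where kill
# /bin/kill -9 91293

In tcsh, where lists the built-in as well as any matching binaries in the PATH; calling /bin/kill explicitly bypasses the built-in entirely.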
 
What happens when a process runs an endless loop in kernel code, within a syscall?
 
What happens when a process runs an endless loop in kernel code, within a syscall?
I'm not sure what you mean. If the endless loop is in the kernel code, the process won't be killable. If the endless loop is in user space, there will be a possibly infinitesimally small chance of killing it.
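
If it happens again, the kernel side can be inspected while the process is still stuck; procstat(1) from the base system can dump the kernel stacks of all threads of a process (using the PID from the example above):

Code:
# procstat -kk 91293

If repeated invocations keep showing the same stack inside the GELI/UFS write path, that would point to a loop inside a syscall rather than in user space.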
 
I'm not sure what you mean. If the endless loop is in the kernel code, the process won't be killable.
I thought so, and yes, this is exactly what I mean.

It seems such processes also ignore rctl/racct. Normally that would periodically stop the process to limit its CPU consumption, if configured that way.
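
For reference, this is roughly how such a per-process CPU cap would be set with rctl; a sketch assuming racct is enabled (kern.racct.enable=1 in /boot/loader.conf) and using the PID from above with an example 50% limit:

Code:
# rctl -a process:91293:pcpu:deny=50
# rctl -u process:91293

The first command adds a rule that is supposed to throttle the process to 50% of a CPU; the second shows the resources racct has accounted for it. In this case the stuck process apparently kept spinning regardless.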
 