Unkillable process

Seeker · Feb 4, 2011

User domy started processes that can't be killed, even by root:

Code:

# ps -U domy                                                     0 /root
  PID  TT  STAT      TIME COMMAND
81474  v1- D      0:00.59 /bin/sh /home/domy/bin/gui_logout.sh
82769  v1- D      0:00.49 /bin/sh /home/domy/bin/gui_logout.sh
83381  v1- D      0:00.48 /bin/sh /home/domy/bin/gui_logout.sh
# kill -9 81474 82769 83381                                      0 /root
# ps -U domy                                                     0 /root
  PID  TT  STAT      TIME COMMAND
81474  v1- D      0:00.59 /bin/sh /home/domy/bin/gui_logout.sh
82769  v1- D      0:00.49 /bin/sh /home/domy/bin/gui_logout.sh
83381  v1- D      0:00.48 /bin/sh /home/domy/bin/gui_logout.sh

Well, I did kill them, by turning off the laptop.

Which commands would give me insight into this issue?

vermaden · Feb 4, 2011

I also sometimes get such processes after using Samba over WiFi on FreeBSD. Unfortunately, only a reboot helps (sometimes even a hard reset is required).

Seeker · Feb 4, 2011

This is not to be tolerated, as it is a serious bug.
How do I find the cause of the issue?

MarcoB · Feb 4, 2011

Not even kill -9 worked?

vermaden · Feb 4, 2011

@MarcoB

Yes, that does not help either ...

jalla · Feb 4, 2011

Processes in I/O wait (state 'D') cannot be killed, even with -9.
The 'D' state is typical when a device is lost: a dead NFS server, a USB stick pulled without being unmounted first, etc.

Unless you can bring the missing resource back online there's no way out except a reboot, or even a hard reset in some cases.
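
To pick the stuck processes out of a longer listing, a small filter along these lines may help. It is only a sketch: `dstate_only` is a made-up helper name, and it assumes STAT is the third column, as in the `ps -U domy` output at the top of the thread.

```sh
#!/bin/sh
# Hypothetical helper: keep the header line plus every row whose STAT
# column (assumed to be field 3: PID TT STAT TIME COMMAND) starts with D.
dstate_only() {
    awk 'NR == 1 || $3 ~ /^D/'
}
```

Usage would be something like `ps -U domy | dstate_only`. If the column layout differs (for example with `ps axl`), the field number has to be adjusted.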

_martin · Feb 4, 2011

Have a look at ps(1): you can't kill a process in the D state ("uninterruptible wait"). A process in that state doesn't respond to signals, or rather, it is never woken up, so it doesn't get a chance to respond (it is blocked).
Also check signal(3).

It would help more to see what that script is actually doing.

You can check the process with truss(1) to see what is going on too:
# truss -p <PID>

Pushrod · Feb 4, 2011

There should be a way to kill those processes. What if the NFS server or USB stick is never coming back? The kernel should have the ability to kill off the process no matter what. I know that the current implementation is not a "bug", but it is not a smart policy.

Galactic_Dominator · Feb 4, 2011

Pushrod said:
There should be a way to kill those processes. What if the NFS server or USB stick is never coming back? The kernel should have the ability to kill off the process no matter what. I know that the current implementation is not a "bug", but it is not a smart policy.

I take it you're new to NFS. This is a long-standing issue, and why it is the way it is has been explained in great detail. Maybe NFSv4 addresses it, but you will have that problem with v2 and v3 clients regardless of OS.

gordon@ · Feb 5, 2011

From the mount_nfs(8) manpage:

If the server becomes unresponsive while an NFS file system is mounted, any new or outstanding file operations on that file system will hang uninterruptibly until the server comes back. To modify this default behaviour, see the intr and soft options.

I believe the intr and soft options work on most (or all) remotely mounted file systems.
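
For illustration only, an /etc/fstab entry using both options might look like the line below. The server name, export path, and mount point are made up, and `ro` is chosen here only because soft mounts are usually suggested for read-only data.

```
# Hypothetical NFS entry: soft = give up after retries instead of hanging,
# intr = let a signal interrupt a file operation stuck on a dead server
nas.example.org:/export/media  /mnt/media  nfs  ro,soft,intr  0  0
```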

_martin · Feb 5, 2011

gordon@ said:
From the mount_nfs(8) manpage:

I believe the intr and soft options work on most (or all) remotely mounted file systems.

True, but you pay the price in performance.

vermaden · Feb 5, 2011

gordon@ said:
From the mount_nfs(8) manpage:

I believe the intr and soft options work on most (or all) remotely mounted file systems.

Thanks, that solves that PITA problem for NFS, but what about SMBFS? I haven't found such options in the mount_smbfs man page; any hints on SMBFS?

Galactic_Dominator · Feb 5, 2011

vermaden said:
Thanks, that solves that PITA problem for NFS, but what about SMBFS? I haven't found such options in the mount_smbfs man page; any hints on SMBFS?

It's not a performance penalty but a data integrity penalty that you pay with such an option. With the hard timeout, at least you have the opportunity to recover. Soft mounts are generally regarded as reasonable only for non-critical and read-only data. Just making sure you know what you're getting yourself into.

vermaden · Feb 5, 2011

I use NFS/SMBFS mostly on my home NAS, which I use for movies/F1 races/music, so yes, it's mostly read-only content.

chrcol · Feb 9, 2011

Does this explain why processes stuck in the zfs state aren't killable?

It's a bug I've seen on 2 servers twice in the past month.

Seeker · Feb 10, 2011

~/bin/gui_logout.sh is executed when I click on an icon in cairo-dock.
When I click it relatively soon after starting the GUI, there are no problems, so testing always passes.

I've noticed that after some time, e.g. after I play some game in MAME, all graphics become sluggish and jumpy, and this persists even after I exit MAME.
I look at top -P, which reports ~90% idle CPU and 3.5 GB of truly free memory.

But everything in the GUI is kind of sluggish and jumpy.

Now, exactly in THAT STATE, when I click on the logout icon in cairo-dock, which executes my script, I get the lock mentioned above, BUT in this form (I click ONCE, but get 2 processes):

Code:

From memory:
# ps -U domy                                                     0 /root
  PID  TT  STAT      TIME COMMAND
I      0:00.59 /bin/sh /home/domy/bin/gui_logout.sh
D      0:00.49 /bin/sh /home/domy/bin/gui_logout.sh

The first is in state I (which I could kill), while the second is in the already mentioned D state.

I had to kill cairo-dock manually to exit the GUI.
And the reboot hangs and hangs, but proceeds sloooowly ...
I remember a line where init complained that some processes would not die: ps axl advised!

The Nvidia port has the 25* drivers; I've manually downloaded and installed NVIDIA-FreeBSD-x64-260.19.36
Everything stayed the same!

What am I supposed to do, and where should I look?
I'll be putting this in cairo's logout icon (the script, prepended with truss):

Code:

truss -faeo DEBUG_LOGOUT.txt ~/bin/gui_logout.sh

_martin · Feb 11, 2011

Well, you didn't paste any output from the .sh file, so one can only guess what it does.

As far as the trace file goes, I would look for what that shell script was doing when it got stuck: what is it waiting for?

Seeker · Feb 15, 2011

I've simplified gui_logout.sh to just call one function, which lists the devices mounted by the user. The script is called when a button in cairo-dock is clicked.

I suspect a bug in cairo-dock, as the script returns different output each time, especially at the grep line, where I get random grep exit codes (0, 2, 127).

If I execute gui_logout.sh directly in a terminal, everything is as expected, no matter how many times I execute it.
Here is a truss-ed trace.

The script stopped halfway through execution in the above example!

Is this a cairo-dock bug?

_martin · Mar 6, 2011

If I execute gui_logout.sh directly in a terminal, everything is as expected, no matter how many times I execute it.

I see you use fusefs. When do you use it: before or after X is started? Even better, have you tested this script when fusefs was already in use (i.e. a file system was already mounted using fuse)?

Maybe try using this script without fuse (to rule it out as the issue).

Seeker · Mar 16, 2011

I use it in all cases!

I've fixed it!
The problem was in grep's exit codes.

My script uses grep to get the user's mount points and relies on grep's exit code being 0 on success.

However, when executing it through cairo's button, I get valid output from grep, BUT with exit codes like 2, 127 ...
So the valid output from grep is never taken into consideration.

Now I've rewritten my script to rely on grep's output instead of its exit codes, and all is well now!
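
The actual script was not posted, so the following is only a sketch of the idea, not the real gui_logout.sh. The function name `user_mounts` is made up, and it assumes mount(8) output on stdin in the usual `device on /mount/point (..., mounted by user)` form, with no spaces in the mount point.

```sh
#!/bin/sh
# Hypothetical helper: print the mount points that belong to user $1.
# Reads `mount`-style lines on stdin.
user_mounts() {
    # Capture grep's output and deliberately ignore its exit status.
    matches=$(grep "mounted by $1" || true)
    # Decide on the captured text, not on $?.
    if [ -n "$matches" ]; then
        # Field 3 is the mount point in "dev on /mnt (opts)" lines.
        printf '%s\n' "$matches" | awk '{ print $3 }'
    fi
}
```

It would be called as `mount | user_mounts domy`. An empty result then simply means "nothing mounted by this user", whatever exit code grep happened to return.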

Duh!
Why did no one tell me this!