Solved Zombie process: how to effectively track down the cause?

Hi there,

For the past two weeks, one of my services (plexmediaserver) has been falling into zombie mode once or twice a day, forcing me to reboot the server (service plexmediaserver restart doesn't work in that case). Apart from Plex, everything else keeps working fine on the server when Plex crashes. Before that, the server had been running smoothly without any errors for 3 years.
I don't have a clue about what's happening, and Plex Media Server's log files didn't help, even in DEBUG mode.

I have already tried to:
- reinstall Plex Media Server with portmaster (but not all its dependencies; not sure that would help);
- analyze dmesg.today, but I found no relevant information (no information at all at the time of the Plex "crash").

Is there a way to track down what happens when that service goes crazy every X hours, so that I can understand the cause? Something like tracing or monitoring the running Plex Media Server process, which would point me in the right direction?

Thanks for your help,

--
Léo.
 
Do you understand what zombie processes are? They are marked in the output of ps as "<defunct>". They are processes that have already exited, but whose parent process hasn't "reaped" them yet (collected their return code). The problem isn't the zombie process itself; the problem is the parent process, which should either be listening for SIGCHLD and then performing a wait() call to get the return code, or be polling for return codes.
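
To make the reaping mechanism concrete, here is a minimal Python sketch (not taken from Plex; the function name demo_zombie and its parameters are made up for illustration). It forks a child that exits at once, leaves it unreaped for a short while, and then collects the exit code with a wait call. Signal 0 is used only to check whether the process-table entry still exists.

```python
import os
import time


def demo_zombie(exit_code=7, delay=0.2):
    """Fork a child that exits at once; reap it only after `delay` seconds."""
    pid = os.fork()
    if pid == 0:
        # Child: exit immediately without running any cleanup handlers.
        os._exit(exit_code)

    # Parent: while we sleep, the child exits but is not reaped yet,
    # so ps would list it in state Z (<defunct>).
    time.sleep(delay)
    os.kill(pid, 0)  # signal 0 only checks existence; the entry is still there

    # Reaping: fetch the exit status, which removes the process-table entry.
    _, status = os.waitpid(pid, 0)

    try:
        os.kill(pid, 0)
        entry_gone = False
    except ProcessLookupError:
        entry_gone = True
    return os.WEXITSTATUS(status), entry_gone
```

A parent that never reaches its waitpid() call, for example because it is deadlocked, leaves the child in the zombie state for as long as the parent lives.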

So, how can you get rid of zombie processes? It's really hard. You have to find the parent of the zombie; that can be done with some flag on the ps command (read the man page, and find out what flag gets you the parent process ID). Once you know the ID of the parent process, you need to debug the parent. If it is a Plex media server process, you need to find out why it is broken, and fix whatever the root cause is. You can try signalling the parent (first try kill -CHLD to tell it again that it has a child process, then try kill -KILL). But if the parent process is just wedged, you're going to have to reboot to get rid of the zombie.
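
As a rough illustration of the "find the parent" step, the sketch below parses ps output and pairs each zombie with its parent. It assumes the BSD-style invocation ps axo pid,ppid,stat,comm and that zombies carry a state beginning with Z; the function names are invented for this example.

```python
import subprocess


def parse_ps(text):
    """Parse `ps axo pid,ppid,stat,comm` output into a list of dicts."""
    rows = []
    for line in text.splitlines()[1:]:  # skip the header line
        parts = line.split(None, 3)     # the command may contain spaces
        if len(parts) < 4:
            continue
        pid, ppid, stat, comm = parts
        rows.append({"pid": int(pid), "ppid": int(ppid),
                     "stat": stat, "comm": comm.strip()})
    return rows


def zombies_with_parents(rows):
    """Return (zombie, parent) pairs; parent is None if it is not listed."""
    by_pid = {r["pid"]: r for r in rows}
    return [(r, by_pid.get(r["ppid"])) for r in rows
            if r["stat"].startswith("Z")]


def list_zombies():
    """Run ps on the live system and report its zombies with their parents."""
    out = subprocess.run(["ps", "axo", "pid,ppid,stat,comm"],
                         capture_output=True, text=True, check=True).stdout
    return zombies_with_parents(parse_ps(out))
```

If list_zombies() returns an empty list while a process still refuses to die, the process is probably not a zombie at all, and the parent is not the place to look.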

Another good question is this: Why do you want to get rid of the zombie? Zombies don't hurt, and they typically don't use any resources. As long as you don't have thousands of them, they shouldn't cause a problem. A bigger issue is this: You said that "service ... restart" doesn't work. What do you mean by that? It doesn't get rid of the zombie process, but does the server work? If yes, you can just restart the service and slowly accumulate zombies, which will go away on the next reboot. If the server doesn't work after a restart, then that is in and of itself a problem. A restart might get you a few hours or days of function, but you should really debug the root cause.

By the way, I don't run any media server on my machine ... used to run daapd many years ago, but it had so many problems, and so little use, I gave up.
 
Hi Ralph,

Thanks for your reply :)
I guess I didn't know EXACTLY what a zombie process was until you suggested I dig a little deeper into it.

CONTEXT :
I understand that my question lacked a bit of context: the server has been up and running fine for 3 years now, and it's only used to host Plex Media Server in order to serve video files at home. One week after I migrated from FreeBSD 11 to FreeBSD 12, videos stopped launching and stayed stuck at the "Loading" status indefinitely. So I tried to restart Plex with the service plexmediaserver restart command, but it never completed: it got stuck at the "Waiting for PID: XXXX..." step, also indefinitely. So the command doesn't actually restart the server. The commands kill -KILL XXXX and kill -9 XXXX didn't work either.

When I tried to understand what was going on, a quick search on the forums told me that a process that can't be killed is called a zombie process, and that the only solution is to reboot. That's all I knew about zombie processes when I asked my original question :)

For a week now, Plex has been crashing once or twice a day, and the only solution seems to be a reboot.
To avoid that, I'm trying to find the root cause.

ANSWERING YOUR QUESTIONS :
Thanks for pointing out some directions to follow.
I ran the ps xao pid,ppid,pgid,sid,comm command to find out the parent process ID:
Code:
PID     PPID     PGID     SID     COMMAND
1       0        1        1       init
[...]
14650   91352    91352    91352   Plex Transcoder
17131   91352    91352    91352   Plex Script Host
17824   91352    91352    91352   Plex DLNA Server
18344   91352    91352    91352   Plex Script Host
23957   91352    91352    91352   Plex Tuner Service
91352   1        91352    91352   Plex Media Server

Plex Media Server's PID is 91352, and its PPID is 1, which refers to the "init" process.
I'll search the forum for ways to debug the root cause with this new information, but if you have an idea about what could be done to understand what's happening here, that would be useful too :)
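
For readers who want to read the parent/child relationships out of a listing like the one above mechanically, here is a small Python sketch. It assumes the five columns printed by ps xao pid,ppid,pgid,sid,comm; the function name is made up for illustration.

```python
from collections import defaultdict


def children_by_parent(ps_text):
    """Map each PPID to the (pid, command) pairs listed under it."""
    tree = defaultdict(list)
    for line in ps_text.splitlines():
        parts = line.split(None, 4)  # the command may contain spaces
        if len(parts) < 5 or not parts[0].isdigit():
            continue  # skip the header, "[...]" and blank lines
        pid, ppid, _pgid, _sid, comm = parts
        tree[int(ppid)].append((int(pid), comm.strip()))
    return tree
```

Applied to the listing above, the entry for 91352 holds the five Plex helper processes, and the entry for 1 (init) holds Plex Media Server itself.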

Thanks again,

--
Léo.
 
Have you looked at the logs (dmesg output, /var/log/messages) for other problems? If your machine is using hard drives with "spinning rust", check if they could be going bad?
 
It would be interesting to see the output from ps that shows those zombie processes and their parents.

Actually, the term “zombie process” isn’t really appropriate. These aren’t really processes anymore. All of their resources have been released, there is no code or data from them left in memory – that’s why you cannot kill them; there simply isn’t anything to kill, because the process has already terminated. The only thing that remains is a leftover entry in the process table (this is what ps displays with the defunct flag). And the only purpose of that entry is to store the exit code of the process that has terminated, until the parent process fetches that exit code. When this happens, the entry in the process table is finally removed, so the “zombie” disappears.

As ralphbsz explained, zombies are usually not a problem. They indicate a bug in the parent process, though, because it fails to check the child’s exit code in time. So, when you see zombie entries in ps output, look for the parent process. You could try to send it a SIGCHLD signal (kill -CHLD); sometimes this helps to wake up the parent to “reap” the exit codes. However, in many cases the parent process has frozen because of some bug, a resource deadlock or similar things. This is why the normal restart procedure doesn’t work. If everything else fails, kill -KILL the parent process. When it dies, the remaining zombie entries are inherited by the next upper parent process, which is probably the init(8) process (PID 1), and this will fetch the exit codes, thus removing the zombies (a bug in init(8) is very unlikely). There should be no reason to reboot in this situation.
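
The escalation described above (SIGCHLD first, SIGKILL only as a last resort) can be sketched in Python as follows. The function name and the still_stuck callback are hypothetical: the caller has to supply its own check, for example "are zombies with this parent still listed by ps?".

```python
import os
import signal
import time


def nudge_then_kill(parent_pid, still_stuck, grace=5.0):
    """Send SIGCHLD first; send SIGKILL only if the parent stays stuck.

    `still_stuck` is a caller-supplied function (hypothetical here) that
    returns True while the problem persists, e.g. zombies are still listed.
    Returns the name of the last signal sent.
    """
    os.kill(parent_pid, signal.SIGCHLD)   # same as: kill -CHLD <pid>
    time.sleep(grace)                     # give the parent time to reap
    if not still_stuck():
        return "SIGCHLD"
    os.kill(parent_pid, signal.SIGKILL)   # same as: kill -KILL <pid>
    return "SIGKILL"
```

The sketch does not wait for the parent to die after SIGKILL; once it does die, its children are handed over to init(8), which reaps them.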

The real problem, of course, is the bug in the parent process (PID 91352 in your above output) that causes it to freeze or deadlock. This is probably difficult to debug, I guess. When it happens, you can try to attach a debugger to the process and look at stack traces, but this requires some developer skills and familiarity with the code in question.
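
Short of attaching a debugger, one lighter option is to capture the state of the hung process before rebooting. The Python sketch below is only an assumption about what might be useful here, not something tested against Plex: it assumes a FreeBSD system where ps(1) accepts the wchan keyword and procstat(1) -kk prints kernel stacks, and its function names are invented. If the process ignores kill -KILL, as reported earlier in the thread, it may be blocked inside the kernel, in which case the wait channel and kernel stack are the first things to look at.

```python
import subprocess
import time


def snapshot_commands(pid):
    """Commands worth capturing while the process is hung (FreeBSD assumed)."""
    return [
        # Process state and the wait channel it is sleeping on.
        ["ps", "-o", "pid,ppid,stat,wchan,comm", "-p", str(pid)],
        # Kernel stack of every thread in the process.
        ["procstat", "-kk", str(pid)],
    ]


def take_snapshot(pid, run=subprocess.run):
    """Run each command and return one text report; failures are recorded."""
    report = [time.strftime("%Y-%m-%d %H:%M:%S") + " pid " + str(pid)]
    for argv in snapshot_commands(pid):
        report.append("$ " + " ".join(argv))
        try:
            done = run(argv, capture_output=True, text=True, timeout=30)
            report.append(done.stdout + done.stderr)
        except (OSError, subprocess.TimeoutExpired) as exc:
            report.append("failed: " + repr(exc))
    return "\n".join(report)
```

Saving the text returned by take_snapshot() to a file each time the hang occurs gives a record that can be compared across incidents or posted along with a bug report.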

I’m not familiar with the Plex Media Server, but I assume it contains some optional modules. Maybe one of them is causing the problem. You could try to disable those modules, or build the software without them, and check if the problem persists.
 
If I were you, I would upgrade all the third-party software on this machine. It won't tell you why plexmediaserver crashes, but it could solve the problem. It often works.
 
One thing to note is that multimedia/plexmediaserver is a binary distribution. There is no source code. This will make debugging the code a lot more difficult. The FreeBSD binary is compiled by Plex themselves.
 
Hi everyone,

Thanks for all your insights and further information about the situation I encountered.
I was waiting for the issue to reproduce, but... the problem seems to be gone; this is a relief of course, but it's also very frustrating not to be able to understand the cause :)

Tingo: Unfortunately, I didn't notice anything unusual in the dmesg output :( And I have an SSD, no spinning rust :)
Emrion: That was my first (unsuccessful) move, and I checked every day whether a new Plex version was out; I suspect the last update, one week ago, finally fixed the bug I had.
olli@: Thanks for your detailed message; I now understand better what this kind of situation implies and how to think about it. Even if this bug is gone now, I'll be better prepared to face the next one ;)
SirDice: I wasn't aware of that, thanks for pointing that out!

Thanks again everyone,

--
Léo.
 