C flopen hanging even in non-blocking mode

Hi,

I am using flopen(3) to synchronize two processes. Both processes keep running for days at a time, but suddenly one of them hangs in _openat. This is seen on versions 10.2, 11.2 and 12, so I assume it is something to do with my code. flopen is called as
flopen("name", O_CREAT|O_WRONLY|O_EXLOCK|O_NONBLOCK, S_IRUSR|S_IWUSR);

The return value of flopen is checked, and the EWOULDBLOCK error is handled. The file descriptor is closed on exit from the critical section. Unfortunately, I cannot post the debug output here. Any input would be of great help.

--Thanks.
 
From the locking subsystem point of view: if you specify O_NONBLOCK to flopen, it should never deadlock; it will either succeed and lock the file, or return immediately with the error indicator set. So, purely from the locking point of view, the behavior you are seeing is a bug.

From the file system point of view: as far as I can remember, it should never be possible to hang an open call. There are atomic file system operations that need to block open temporarily (for example, atomic renames), but those should finish in finite time. The only way I know of to deadlock a file system does not apply to FreeBSD UFS and ZFS, so again, this should not happen unless there is a bug.

Could the cause be an IO error? For example, it is possible that the "system" (everything underneath your flopen call) had to start an IO to the disk to perform the operation (for example to the file system metadata), and that IO is simply not finishing, due to a problem in the IO stack (for example a firmware bug in your disk controller). If that's the case, you should find processes stuck in funny states (in Linux those are called D wait, I don't know how to diagnose this on FreeBSD, since it has never happened to me on FreeBSD), and you should be seeing error messages in system logs. On some Unix variants, there are watchdog timers that will abort stalled IOs after some period (often 90 or 300 seconds), so you might have to wait a few minutes. On other Unix variants, hardware-stalled IOs can take infinitely long (I once saw an example of an IO that was still pending after 4 days on a commercial Unix variant). Again, system logs would help diagnose this.

Now, the above statement that this "shall never happen" depends on an accurate definition of the word "deadlock": theoretically you should always get eventual progress, but under unusual workloads it might take a very long time. Nothing guarantees fairness, and your process might be the victim of starvation (the words fairness and starvation are terms of art in parallel systems, and have specific meanings). In practice, this will only happen if some other process is keeping the locking or file system subsystems insanely busy. You can try to debug that with top or ps, and see whether another process has gone berserk and is beating up the OS so much that it doesn't have any time to serve your flopen request.
 
ralphbsz, thanks for the detailed response. +1

I am sure the server isn't starving, as other processes on it are running fine. Assuming the "bug", as you said, is in our program, we wrote toy programs to reproduce the behaviour; they all run fine on the same server. As suggested, we will watch with top and ps. We are planning to replace flopen with named semaphores, but that seems to need a server process to initialize all the needed semaphores first.

--Thanks.
 
Hello,

Here is an observation which we feel may be connected to this problem. We have a daemon running on the server, and we see sockets stuck in the CLOSED state: 92 in total, plus one in LISTEN. Could this be related to the problem above? The daemon is written on an event model. We do not know exactly when this happens, but the code handles EV_EOF and closes the socket. Since the daemon runs in secure mode using OpenSSL, here is the sequence of calls when EV_EOF is hit:

- shut down TLS with SSL_shutdown(ssl)
- free the SSL object with SSL_free(ssl)
- shut down the socket for read/write with shutdown(sd, SHUT_RDWR)
- close the socket with close(sd)

This doesn't happen every day. We have been monitoring by voluntarily connecting to and disconnecting from the daemon; we see the ESTABLISHED and TIME_WAIT states, after which the entry is no longer in the netstat output.

Any pointers are highly appreciated.

--Thanks.
 