Solved Process-shared mutex issues in FreeBSD 12.3?

I am having timing-specific problems with a process-shared mutex in FreeBSD 12.3; sometimes pthread_mutex_lock or pthread_mutex_unlock will throw an exception. I have seen comments that process-shared mutexes were not supported in some older versions of FreeBSD, but it appears it should be supported in version 12. Is there something wrong with how I am using the pthread_mutex_t calls on FreeBSD?

See shm_open() and pthread_mutex_init()

Mutex helper class pseudocode:
Code:
struct My_shared_data_t
{
   pthread_mutex_t mutex;
   pthread_cond_t cond;
   // etc.
};

class My_mutex_helper_class
{
public:
   My_mutex_helper_class(const string &name)
   {
      if (shared_memory_exists(name))
      {
         pointer = get_shared_memory(name); // Uses shm_open
      }
      else
      {
         pointer = create_shared_memory(name); // Uses shm_open
         pthread_mutexattr_t attr;
         pthread_mutexattr_init(&attr);
         pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
         pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
         pthread_mutex_init(&pointer->mutex, &attr);
      }
   }
   wait() // with different overloads...
   {
      pthread_mutex_lock(&pointer->mutex);
      pthread_cond_wait(&pointer->cond, &pointer->mutex);
      // ... or similar pthread_cond_timedwait
      pthread_mutex_unlock(&pointer->mutex);
   }
   signal()
   {
      pthread_mutex_lock(&pointer->mutex);
      pthread_cond_signal(&pointer->cond);
      pthread_mutex_unlock(&pointer->mutex);
   }
private:
   Shared_memory_wrapper<My_shared_data_t> pointer;
}

Usage code
Code:
TEST(Event, SignalsFromNewProcess)
{
   My_mutex_helper_class helper("shared_name_a");
   int pid = fork();
   EXPECT_NE(pid, -1);
   if (pid == 0)
   {
      char * argv;
      argv = (char *) "signal";
      ::execvp("./some_other_process", &argv);
      // Other process:
      // My_mutex_helper_class helper2("shared_name_a");
      // for (int i = 0; i < 10; i++)
      // {
      //   helper2.signal();
      //   Thread::sleep(10);
      // }
   }
   else
   {
      // Original process
      for (int i = 0; i < 20; i++)
      {
         helper.wait(20);
      }
   }
}
 
This is going to be hard; I haven't done C++ posix locking in >10 years.

First question: Where does the exception occur? Exactly what is the exception?

Second, a question: It was my impression that to use a posix mutex that works across two processes, the mutex and condvar have to be in the shared memory. You do that correctly, I think. In particular, since you do the mutex initialization *before* the exec, the mutex will only be initialized once, good. But: Shouldn't the attr also be in shared memory? Or does the content of attr get copied into the mutex when you call pthread_mutex_init(..., &attr)?
 
First off, thanks for responding. For some time, I have been unsuccessfully trying to determine the "where"; sometimes simply adding cout statements changes the timing enough to cause things to succeed (or fail somewhere else). Trying to determine "Exactly what is the exception" turned out to be helpful:
  1. Red herring: The attempted TEST(Event, SignalsFromNewProcess) actually encounters a code path where pthread_mutex_destroy is called while other threads are using the mutex. This is NOT happening in my main application.
  2. It seems EOWNERDEAD is encountered; I am now calling pthread_mutex_consistent() when it is seen. However, that does not appear to be working, as I am then getting ENOTRECOVERABLE.
  3. It seems that the attr gets copied into the mutex, as I should not have seen EOWNERDEAD in another process if PTHREAD_MUTEX_ROBUST was not applied (but I'll test that further).
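For reference, a minimal sketch of the usual EOWNERDEAD handling on a robust mutex. The helper name lock_robust is hypothetical and not from the code above. When pthread_mutex_lock returns EOWNERDEAD the caller does hold the lock; if the mutex is then unlocked without pthread_mutex_consistent having been called, later lock attempts return ENOTRECOVERABLE.

```cpp
#include <cerrno>
#include <pthread.h>

// Hypothetical helper: locks a robust mutex and marks it consistent if the
// previous owner died while holding it. Returns 0 on success, otherwise the
// error number from pthread_mutex_lock or pthread_mutex_consistent.
// *recovered is set to true only when the EOWNERDEAD path was taken.
int lock_robust(pthread_mutex_t *m, bool *recovered)
{
   *recovered = false;
   int rc = pthread_mutex_lock(m);
   if (rc == EOWNERDEAD)
   {
      // The caller owns the lock here. Any repair of the protected data
      // belongs at this point, before the mutex is unlocked again.
      rc = pthread_mutex_consistent(m);
      *recovered = (rc == 0);
   }
   // ENOTRECOVERABLE falls through unchanged: the mutex can then only be
   // destroyed and initialized again.
   return rc;
}
```

A caller would check the return value, do its work, and call pthread_mutex_unlock as usual.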
Relevant output from main application (tried both fork and vfork with same behavior)
pthread_create()
pthread_create()
created shared memory region for "MY_KEY"
process 168d5 thread 18a3f My_mutex_helper_class::My_mutex_helper_class("MY_KEY") 0x800c86940 with pthread_mutex_t 0x800d9c218
pthread_create()
pthread_create()
pthread_create()
process 168d5 thread 18a3f My_mutex_helper_class::My_mutex_helper_class("MY_KEY") 0x800c85320 with pthread_mutex_t 0x800dbd218
===== calling vfork
===== called vfork, pid=168d8
===== calling execvp
===== calling dlopen
process 168d8 thread 18a04 My_mutex_helper_class::My_mutex_helper_class("MY_KEY") 0x800c3a2c0 with pthread_mutex_t 0x800d8a218
===== calling vfork
===== called vfork, pid=168d9
===== calling execvp
process 92373 thread 100927 My_mutex_helper_class("MY_KEY") 0x800c85320 ::pthread_mutex_lock("MY_KEY") failed with EOWNERDEAD. Calling pthread_mutex_consistent
===== calling dlopen
process 168d9 thread 18a32 My_mutex_helper_class::My_mutex_helper_class("MY_KEY")0x800c3a2c0 with pthread_mutex_t 0x800d8a218
process 92373 thread 100751 My_mutex_helper_class("MY_KEY") 0x800c86940 ::pthread_mutex_lock("MY_KEY") failed. The state protected by the mutex is not recoverable.
 
This program contains errors of the "undefined behaviour" kind, and until they are fixed, debugging subtle things like synchronization is pointless.
The execvp(3) function expects as a second argument a NULL-terminated array of pointers. This is not what you have provided (the "array" is not terminated properly) and from there all bets are off.
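A minimal sketch of building a properly terminated argument vector. The helper names make_argv and exec_program are made up for illustration. Note also that by convention argv[0] is the program name, so "signal" would normally go in argv[1].

```cpp
#include <string>
#include <unistd.h>
#include <vector>

// Hypothetical helper: builds the NULL-terminated pointer array execvp expects.
// The returned pointers stay valid only while args is alive and unmodified.
std::vector<char *> make_argv(std::vector<std::string> &args)
{
   std::vector<char *> argv;
   for (std::string &a : args)
      argv.push_back(&a[0]);
   argv.push_back(nullptr);   // the terminator the original code was missing
   return argv;
}

// Replaces the current process image; returns only if execvp fails.
int exec_program(std::vector<std::string> args)
{
   if (args.empty())
      return -1;
   std::vector<char *> argv = make_argv(args);
   return ::execvp(argv[0], argv.data());
}
```

In the child branch of the test this would be called as exec_program({"./some_other_process", "signal"}).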

I assume that you have initialized the condvar in a similar way, but you did not show that part.
You don't need to lock the mutex to signal the condvar (although it is only a performance issue if you do).

Most of these functions return error codes - capture them and check their values. Maybe the problem is visible earlier?

And what do you mean exactly by "exception"? These functions don't throw "exceptions" in the C++ sense, they return error codes. Do you get a signal, for example like SIGSEGV? That could indicate a memory corruption, not related to the use of synchronization on its own.
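One way to capture the return codes, as a rough sketch: a small wrapper that turns any non-zero pthread return value into a std::system_error. The names check_pthread and init_plain_mutex are hypothetical. The pthread functions return the error number directly rather than setting errno, so the return value is what gets reported.

```cpp
#include <cerrno>
#include <pthread.h>
#include <system_error>

// Hypothetical helper: throws if a pthread call reported an error.
// rc is the value returned by the pthread function, not errno.
void check_pthread(int rc, const char *what)
{
   if (rc != 0)
      throw std::system_error(rc, std::generic_category(), what);
}

// Example use: every call is wrapped, so the first failure is reported
// together with the name of the function that failed.
void init_plain_mutex(pthread_mutex_t *m)
{
   check_pthread(pthread_mutex_init(m, nullptr), "pthread_mutex_init");
}
```

Catching std::system_error and printing e.what() and e.code().value() then shows which call failed and with which error number.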
 
When you say the program throws an exception - do you mean a proper C++ exception?

If yes, who is converting an OS error to an exception?

If yes, try to catch it and print .what().
 
I think I figured this out (pending further testing). Both of the following must be fulfilled:

  1. Every pthread_mutex_t in shared memory must be initialized with an attribute that has PTHREAD_PROCESS_SHARED set via pthread_mutexattr_setpshared
  2. Every pthread_cond_t in shared memory must be initialized with an attribute that has PTHREAD_PROCESS_SHARED set via pthread_condattr_setpshared
After creating the shared memory:
C++:
      pthread_mutexattr_t mutexattr;
      // pthread_* functions return the error number directly; they do not set errno.
      int mutex_rc = ::pthread_mutexattr_init(&mutexattr);
      if (mutex_rc == EOK)
         mutex_rc = ::pthread_mutexattr_setpshared(&mutexattr, PTHREAD_PROCESS_SHARED);
      if (mutex_rc == EOK)
         mutex_rc = ::pthread_mutex_init(&pointer->mutex, &mutexattr);
      if (mutex_rc != EOK)
      {
         delete region;
         stringstream ss;
         ss << "::pthread_mutex_init (" << name << ") failed: [" << mutex_rc << "] " << strerror(mutex_rc);
         std::cout << ss.str() << std::endl;
         throw Event_exception (ss.str());
      }

      pthread_condattr_t condattr;
      bool cond_init_error = ::pthread_condattr_init(&condattr) != EOK
         || ::pthread_condattr_setpshared(&condattr, PTHREAD_PROCESS_SHARED) != EOK
         || ::pthread_cond_init(&pointer->cond, &condattr) != EOK;
      if (cond_init_error)
      {
         ::pthread_mutex_destroy (&pointer->mutex);
         delete region;
         stringstream ss;
         ss << "::pthread_condattr_init (" << name << ") failed";
         std::cout << ss.str() << std::endl;
         throw Event_exception (ss.str());
      }
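For anyone who wants to try this in isolation, here is a self-contained sketch of the same idea. It is not the code from my application: it uses an anonymous shared mapping and fork() instead of shm_open() and exec, and the names Shared, init_shared and signal_across_fork are made up for the example. It also adds a ready flag as the wait predicate, which the pseudocode earlier in the thread does not have.

```cpp
#include <pthread.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

struct Shared
{
   pthread_mutex_t mutex;
   pthread_cond_t cond;
   bool ready;   // predicate protected by mutex
};

// Initializes mutex and condvar as process-shared.
// Returns 0 on success or the pthread error number.
int init_shared(Shared *s)
{
   pthread_mutexattr_t ma;
   int rc = pthread_mutexattr_init(&ma);
   if (rc != 0)
      return rc;
   rc = pthread_mutexattr_setpshared(&ma, PTHREAD_PROCESS_SHARED);
   if (rc == 0)
      rc = pthread_mutex_init(&s->mutex, &ma);
   pthread_mutexattr_destroy(&ma);   // attr is no longer needed after init
   if (rc != 0)
      return rc;

   pthread_condattr_t ca;
   rc = pthread_condattr_init(&ca);
   if (rc == 0)
   {
      rc = pthread_condattr_setpshared(&ca, PTHREAD_PROCESS_SHARED);
      if (rc == 0)
         rc = pthread_cond_init(&s->cond, &ca);
      pthread_condattr_destroy(&ca);
   }
   if (rc != 0)
      pthread_mutex_destroy(&s->mutex);
   s->ready = false;
   return rc;
}

// The parent waits until a forked child signals. Returns true on success.
bool signal_across_fork()
{
   void *mem = mmap(nullptr, sizeof(Shared), PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_ANON, -1, 0);
   if (mem == MAP_FAILED)
      return false;
   Shared *s = static_cast<Shared *>(mem);
   if (init_shared(s) != 0)
   {
      munmap(mem, sizeof(Shared));
      return false;
   }

   bool ok = false;
   pid_t pid = fork();
   if (pid == 0)
   {
      // Child: set the predicate and signal, then leave without cleanup.
      pthread_mutex_lock(&s->mutex);
      s->ready = true;
      pthread_cond_signal(&s->cond);
      pthread_mutex_unlock(&s->mutex);
      _exit(0);
   }
   if (pid > 0)
   {
      pthread_mutex_lock(&s->mutex);
      while (!s->ready)   // loop guards against spurious and early wakeups
         pthread_cond_wait(&s->cond, &s->mutex);
      ok = s->ready;
      pthread_mutex_unlock(&s->mutex);
      waitpid(pid, nullptr, 0);
   }
   pthread_cond_destroy(&s->cond);
   pthread_mutex_destroy(&s->mutex);
   munmap(mem, sizeof(Shared));
   return ok;
}
```

The predicate loop matters because a signal sent before the parent reaches pthread_cond_wait would otherwise be lost and the parent would block forever.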
 
For future reference for anyone else encountering the same issue: depending on timing, the symptom could be a crash with a core dump, a proper C++ exception, or the process just freezing. GDB analysis of the core dump or frozen process showed:
Code:
Thread 1 (LWP 101240 of process 16354):
#0 0x00000008007fc8ca in ?? () from /lib/libthr.so.3
#1 0x00000008007f5d61 in ?? () from /lib/libthr.so.3
#2 0x00000008007f4b39 in pthread_mutex_lock () from /lib/libthr.so.3

I believe previous symptoms (possibly due to other code issues) included:
  1. C++ exception:
    Code:
    Fatal error 'mutex 0x800c89680 own 0x18956 is on list 0x80206f248 0x0' at line 154 in file /usr/src/lib/libthr/thread/thr_mutex.c
  2. Core dump with SEGFAULT in pthread_mutex_unlock()
 
... sometimes simply adding cout statements changes the timing enough to cause things to succeed (or fail somewhere else).

I'm so glad you found the problem!

It's that crazy stuff that you describe which makes debugging locking problems SO difficult. Parallelism is hard. In my (not at all humble) opinion there are two ingredients for surviving it: (a) Implement the basic locking functionality super carefully, and build tracing and debugging facilities into it. Then when something goes wrong, turn on traces. Without a reliable trace facility, debugging is hard. (b) Simplify your program as much as possible, and reduce the use of locking as much as possible: every lock that isn't there is a lock that has no bugs.
 