Pthread Question: Is Unlocking a Mutex from a Different Thread Supported?

kreyszig

Member


Messages: 21

It seems that unlocking a mutex from a thread different from the one that locked it is not supported by the POSIX standard, even though it seems to work in practice:

pthread_mutex_unlock unlocks the given mutex. The mutex is assumed to
be locked and owned by the calling thread on entrance to
pthread_mutex_unlock. If the mutex is of the ``fast'' kind,
pthread_mutex_unlock always returns it to the unlocked state. If it is
of the ``recursive'' kind, it decrements the locking count of the mutex
(number of pthread_mutex_lock operations performed on it by the calling
thread), and only when this count reaches zero is the mutex actually
unlocked.

On ``error checking'' mutexes, pthread_mutex_unlock actually checks at
run-time that the mutex is locked on entrance, and that it was locked
by the same thread that is now calling pthread_mutex_unlock. If these
conditions are not met, an error code is returned and the mutex remains
unchanged. ``Fast'' and ``recursive'' mutexes perform no such checks,
thus allowing a locked mutex to be unlocked by a thread other than its
owner. This is non-portable behavior and must not be relied upon.

If I actually need to do something equivalent, what is the most efficient way to do it?

For example, let's say that I have a main thread and two "slave" threads, and that I want to activate the slave threads at some point in the main thread and wait for them to complete before continuing. Right now I am doing the following:

Main Thread:
    Lock Mutex Ai
    Lock Mutex Bi
    Lock Mutex Af
    Lock Mutex Bf
    CONTINUE = true
    Start Thread A
    Start Thread B
    ...
    FOR LOOP {
        // Ask the slave threads to start
        Unlock Mutex Ai
        Unlock Mutex Bi
        ...
        // Do some stuff while the threads are running
        ...
        // Wait for the threads to complete
        Lock Mutex Af
        Lock Mutex Bf
        Read Thread A result
        Read Thread B result
        Combine Thread A & B results
        ...
    }
    ...
    // Stop the threads
    CONTINUE = false
    Unlock Mutex Ai
    Unlock Mutex Bi
    Unlock Mutex Af
    Unlock Mutex Bf
    Wait for Thread A to terminate
    Wait for Thread B to terminate

Thread A:
    INFINITE LOOP {
        Lock Mutex Ai
        IF (CONTINUE) {
            ...
            // Compute stuff
            ...
            Store result
        } ELSE return
        Unlock Mutex Af
    }

Thread B:
    INFINITE LOOP {
        Lock Mutex Bi
        IF (CONTINUE) {
            ...
            // Compute stuff
            ...
            Store result
        } ELSE return
        Unlock Mutex Bf
    }

To me this seems like an efficient way to do parallel processing (for a particular application where the results from both slave threads need to be combined after each iteration). Is there another way that is just as efficient, but POSIX-compliant, to do the same thing? My current code seems to be free of deadlocks and race conditions, but I would not like things to break in the event of an implementation change...

Thanks!
 

anemos

Member

Reaction score: 16
Messages: 58

Indeed, according to POSIX, an attempt to unlock a PTHREAD_MUTEX_NORMAL or PTHREAD_MUTEX_DEFAULT mutex from a thread that has not locked it results in undefined behaviour, whereas PTHREAD_MUTEX_ERRORCHECK and PTHREAD_MUTEX_RECURSIVE both return an error (EPERM).
Have you ever thought about using POSIX semaphores? A thread can sem_post() a semaphore which some other thread has been sem_wait()'ed on.
 
OP

kreyszig

Member


Messages: 21

anemos said:
Indeed, according to POSIX, an attempt to unlock a PTHREAD_MUTEX_NORMAL or PTHREAD_MUTEX_DEFAULT mutex from a thread that has not locked it results in undefined behaviour, whereas PTHREAD_MUTEX_ERRORCHECK and PTHREAD_MUTEX_RECURSIVE both return an error (EPERM).
Have you ever thought about using POSIX semaphores? A thread can sem_post() a semaphore which some other thread has been sem_wait()'ed on.

POSIX semaphores certainly seem very interesting. What would be ideal, I think, would be a way to block the main thread as long as some counter is non-zero, and to decrement this counter in threads A & B (some kind of "anti-semaphore"). Is there a POSIX "tool" that does that directly?
 

mart

Member

Reaction score: 18
Messages: 48

kreyszig said:
To me it seems to be an efficient way to do parallel processing (for a particular application where the results from both slave threads need to be combined after each iteration).
Depends on what you're happy to accept as 'efficient enough'. If this is a one-shot computation, and you have no need to scale beyond two threads, and the work being done by each thread is sufficient to hide all overhead, then it may well be worth correcting the code above.

If not (i.e. if performance and scalability are important), then you might want to consider reworking things.

Semaphores (counting or binary) and/or pthread condition variables can handle your use-case cleanly and correctly - it's a common pattern (specialization of 'single producer multiple consumer'). The 'specialization' in this case should simplify the model even further, and if you're really just looking to process a single block of work as quickly as possible (by distributing work units over multiple threads) then it can be done very efficiently if your infrastructure is set up correctly (thread pool, lock-free/atomic ops etc).

If, for whatever reason, you want to keep to the model you currently have, then it'd be great to have details of what you're doing - particularly what your mutexes are protecting (without those details it looks like you have possible race conditions, many more locks than necessary, are holding locks much longer than necessary and/or using them backwards).
 
OP

kreyszig

Member


Messages: 21

mart said:
Depends on what you're happy to accept as 'efficient enough'. If this is a one-shot computation, and you have no need to scale beyond two threads, and the work being done by each thread is sufficient to hide all overhead, then it may well be worth correcting the code above.

If not (i.e. if performance and scalability are important), then you might want to consider reworking things.

Semaphores (counting or binary) and/or pthread condition variables can handle your use-case cleanly and correctly - it's a common pattern (specialization of 'single producer multiple consumer'). The 'specialization' in this case should simplify the model even further, and if you're really just looking to process a single block of work as quickly as possible (by distributing work units over multiple threads) then it can be done very efficiently if your infrastructure is set up correctly (thread pool, lock-free/atomic ops etc).

If, for whatever reason, you want to keep to the model you currently have, then it'd be great to have details of what you're doing - particularly what your mutexes are protecting (without those details it looks like you have possible race conditions, many more locks than necessary, are holding locks much longer than necessary and/or using them backwards).
What I am trying to do is in fact more complicated than the situation I described in my first message. What I wanted to emphasize, I guess, is that I need synchronisation between the main thread and the other (dependent) threads very often. It is not a one-time computation, since the main thread requests computations from the dependent threads in a loop.

The dependent threads have to perform operations that can be quite simple or quite time-consuming, and they have dependencies between each other, so some of them have to be executed in sequential order while others are independent. This code is intended to be very flexible, so these dependencies and the number of threads are not fixed by design. Prior to this code, I have a function that analyzes all the operations to be performed along with their dependencies, and bunches them into a list of threads defined such that the number of operations executed in parallel at any given time is maximized, while no thread starts until all the operations it depends on are completed, avoiding race conditions.

So, to put it in other words, what I am trying to do is very similar to a "make" program when the Makefile contains a single final "target", the main differences being that it is executed many thousands of times in a loop (with, obviously, some dependencies modified at every iteration) and that the number of "operations" to complete is typically much smaller (~20-30).
 
OP

kreyszig

Member


Messages: 21

To be more accurate in my analogy with make, I should have said that I can have multiple final targets, but that I need to build them all every time.
 

anemos

Member

Reaction score: 16
Messages: 58

If I understood correctly, you will certainly need to implement some synchronization mechanism, as well as some way to block the main thread until the other threads have done a specific thing. You can use signals. Check out the following functions; they can prove very useful, but handle them with care.
If you need further help post some (simplified) code.

pthread_sigmask(3)
pthread_kill(3)
 

mart

Member

Reaction score: 18
Messages: 48

@kreyszig

Thanks for providing more details. The dependencies do make it more involved than the 'distribute a single chunk of work' case, but you can remove those details from the thread handling itself, and still simplify synchronization.

[ For simplicity, I'll assume you have prior knowledge of all required tasks (have already constructed the full dependency graph/DAG), and so can easily determine dependencies across tasks. ]

Ignore threading for a second; let's just deal with dependencies.

All tasks with zero dependencies can be processed immediately, independent of all other tasks. As tasks with zero dependencies are completed, the number of dependencies held by their parents decreases - creating more tasks with zero dependencies that can be processed independently.

Imagine a simple framework that processes this data serially. There are any number of ways, but let's imagine you maintain a single 'zero dependency' list/queue. You can pick any task from the zero queue, blindly, in any order, and execute it. Once that task is complete, you can (now, or at some time in the future) decrement the dependency count of any parent tasks. If any parent's count is now zero, you can (now, or at some time in the future) add it to the 'zero' queue.

Process the zero queue this way, and all tasks will be complete in the correct order.

Now imagine you have 16 'workers'. How does it scale? Easy - nothing changes. They simply have to process the zero queue. They don't have to worry about synchronization or dependencies across tasks. If there's something in the zero queue, they process it. If the zero queue is empty, they wait. Simple.

So, now the mechanics of the threading...

Maintain a simple thread pool (reuse your workers, don't hire/fire them after each task). The number of threads should equal the number of available processors (any more or any less is inefficient). The worker thread function should be trivial - if work is available in the zero queue, take it, and execute it. If not, wait. Have your main thread do nothing, just wait on completion of the workers.

'But you haven't dealt with how to access the shared queue! I need locks!'.

Deliberately avoided, as there are many ways to handle this - and the choice really depends on the workload. Arguably, the most efficient way is to avoid sharing, and locks, completely (maintain per-worker zero queues). Arguably the most flexible way is a lock-free queue (by definition, no locks). Arguably the most simple way is a lock. Importantly, you can change this without breaking the rest of the framework.

My personal choice in this case would be a lock-free queue. Your workload sounds unbalanced, so you'd probably have to add work-stealing to a per-worker queue system, which gets ugly. A lock-free queue keeps things simple, independent and is still very efficient.

'But there's a possible race with the dependency decrement! I need locks!'.

Good catch, but no cigar - atomic ops are your friend. No locks required, no races possible.

'Why did you add '...or at some time in the future...' to the flow, above?'.

Some architectures will make the dependency updating more complex than on PC/x86, where it can be as simple as a single instruction atomic op. In such cases, or in cases where you simply can't avoid handling billions of tiny tasks, it might be preferable to defer the update until enough have been pooled. [ It's preferable, but not required, to add to the zero queue immediately, making a lazy update possible. ]

'So exactly where do I use explicit synchronization functions, how many, and which ones?'

Across tasks, none. To control the workers, you need to inform them that work is available, and they need to inform you when they are fully done (i.e. all tasks are complete). They wait on the 'deploy/have work' signal, you wait on the 'dock/all done' signal. No need to be fancy here as they're called so infrequently. Semaphores would work, as would pthread cond vars. Just be careful to handle spurious wakeups in both cases (from signals etc).

[ Off the top of my head, it should be possible with 2 counting semaphores total - for any number of tasks, any number of workers, any number of dependencies. However, I'd personally do it with nWorkers+1 binary semaphores, and an atomic op (again, off the top of my head). One of the nicer things about this kind of framework is that you can build up to it in stages, verifying each stage before moving on to anything more complex - helps defeaturing... :)]
 