C pthreads

Ok now so here's a question for you:

I'm trying to find out how much of a problem it will be to traverse a directory hierarchy with opendir and readdir with multiple threads.

What I have gathered is this: opendir is thread safe, and readdir is thread safe only when each thread works on a different stream.

So what I don't understand, and maybe one of you fellows can clarify, is this: can I have multiple streams open on a single directory? If, say, a function that is called by several threads opens a stream to a given directory with opendir, do several open streams now exist on that one directory?

fopen has the handy flockfile, but I can't find an alternative with opendir. Trying to avoid mutexes as much as possible, because I think they are the kind of thing a newb can overuse by simple ignorance of when they actually are necessary.

Thankee kindly.
 
  • FYI Languages :

  • Pony: The strictest of the group. It is data-race free by design, which it achieves through reference capabilities (e.g., iso, val, ref) and an actor model that prevents shared mutable state between threads.
  • F#: While not strictly race-free in the way Pony is, its immutability-by-default model and functional-first approach significantly reduce the likelihood of race conditions. It manages concurrency safely through high-level abstractions like asynchronous workflows.
 
I'm fine with hard drive storage being the bottleneck; threads are meant to parallelize everything else.

For now, I am going to use a mutex on all sections of code that use opendir. But if somebody can give me an answer about parallel DIR streams for the same directory, I'll be much obliged.

Or if somebody knows of something equivalent to flockfile for directories, even better.

As for switching languages, it is not a possibility at this juncture.
 
So what I don't understand, and maybe one of you fellows can clarify, is this: can I have multiple streams open on a single directory? If, say, a function that is called by several threads opens a stream to a given directory with opendir, do several open streams now exist on that one directory?
So opendir() returns a bit of data representing a directory, then readdir() uses that to return a bit of data representing an entry in that directory. It has been a long time since I've looked at the source, but I think readdir() may have some internal static data, which is why it is not thread safe.
If your model is "a single thread does opendir() and then readdir() on that return" you maintain data consistency/coherency within that thread.

Can you have more than one thread doing "opendir()/readdir()" on the same directory? Sure; you can have multiple terminal windows open on $HOME and doing ls. It's not exactly the same but semantically close.

What you want to avoid is X = opendir() on thread A, Y = opendir() on thread B and then trying to pass X to thread B or Y to thread A. If you need to merge the readdir() data from thread A and thread B (like putting the output of readdir() into a single list) you'll need to figure out the best way to do that (perhaps a queue/list that multiple threads post to with a single reader thread that manipulates things).
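To make the "list that multiple threads post to" idea concrete, here is a minimal sketch of the posting side only. The names name_list, name_list_push, collector and collect_names are made up for illustration, and a real design might prefer a proper queue with a dedicated reader thread. Each worker keeps its own DIR stream; the only shared object is the list, and the mutex is held just long enough to append one copied name.

```c
#include <assert.h>
#include <dirent.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical shared list of entry names, guarded by one mutex. */
struct name_list {
    pthread_mutex_t lock;
    char **names;
    size_t len, cap;
};

/* Append a private copy of `name`. Returns 0 on success, -1 if out of memory. */
int name_list_push(struct name_list *list, const char *name)
{
    assert(list != NULL && name != NULL);
    size_t n = strlen(name) + 1;
    char *copy = malloc(n);          /* copy outside the lock */
    int rc = 0;

    if (copy == NULL)
        return -1;
    memcpy(copy, name, n);

    pthread_mutex_lock(&list->lock);
    if (list->len == list->cap) {
        size_t ncap = list->cap ? list->cap * 2 : 16;
        char **grown = realloc(list->names, ncap * sizeof *grown);
        if (grown == NULL) {
            rc = -1;
        } else {
            list->names = grown;
            list->cap = ncap;
        }
    }
    if (rc == 0)
        list->names[list->len++] = copy;
    pthread_mutex_unlock(&list->lock);

    if (rc != 0)
        free(copy);
    return rc;
}

/* Hypothetical per-thread argument: where to post and what to read. */
struct collector {
    struct name_list *out;
    const char *path;
};

/* Thread body: private opendir()/readdir(), shared list for the results. */
void *collect_names(void *arg)
{
    struct collector *c = arg;
    DIR *dir = opendir(c->path);
    struct dirent *e;

    if (dir == NULL)
        return NULL;
    while ((e = readdir(dir)) != NULL)
        name_list_push(c->out, e->d_name);   /* d_name is copied before the next readdir() */
    closedir(dir);
    return NULL;
}
```

Initialise the list with { PTHREAD_MUTEX_INITIALIZER, NULL, 0, 0 }, start one thread per directory with pthread_create(&tid, NULL, collect_names, &c), and read list.names only after every worker has been joined (or take the same lock when reading).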

I think answers may be driven by what you want to do.
 
Thank you guys so much.

The opendir() and readdir()s would both happen on the same thread, each readdir() operating on the opendir() in the thread in which it was called. So I probably won't need the fancy footwork. I was about to close the book at this conclusion, as there seems to be no issue if I can have multiple opendir()s on a single directory, but actually, now that I think of it, other threads might manipulate content of that directory while it is being traversed.

Which, by the way, would not be solved by a mutex.

So I probably will look into the flock solution.

I looked at getdirentries back when I was designing this part of my code, and for some reason decided that readdir was definitely preferable for my needs. I can't remember what that reason was, though I remember it sounded weighty at the time. But if things get complicated, it might get down to redesigning the traversal using getdirentries().

Thanks again, friends.
 
Just mentioning if you are using FreeBSD-15.0 or later.

inotify(2) support was added to FreeBSD 15 that will greatly aid in watching changes to a file or directory (or directory structure) on a file system. From the (above) man page: "they aim to be compatible with the Linux inotify(2) interface."

Hm. This actually might be just the thing. I'm looking at flock, and it seems like it might work, but it's not clear. Whereas with this, it sounds like it would be simple to just watch for any changes as I traverse, and restart or do something with the traversal routine if something does happen. I foresee this conflict as a relative rarity; I'm not bracing for a storm of changes to the directory as the traversal routine executes, so something like this might be the optimal solution.
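A rough sketch of that "watch while traversing, restart if something changed" idea, assuming the Linux-compatible inotify calls (inotify_init1, inotify_add_watch) are available as the man page quoted above suggests. The helper names watch_dir and dir_changed are made up for illustration. Note that a watch covers one directory only, not the hierarchy beneath it, so a recursive traversal would need one watch per directory.

```c
#include <assert.h>
#include <errno.h>
#include <sys/inotify.h>
#include <unistd.h>

/*
 * Start watching `path` for entries being created, deleted or renamed.
 * Returns a non-blocking inotify descriptor, or -1 on failure.
 */
int watch_dir(const char *path)
{
    int fd = inotify_init1(IN_NONBLOCK | IN_CLOEXEC);

    if (fd < 0)
        return -1;
    if (inotify_add_watch(fd, path,
            IN_CREATE | IN_DELETE | IN_MOVED_FROM | IN_MOVED_TO) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/*
 * Drain all pending events without inspecting them.
 * Returns 1 if anything changed since the previous call, 0 if nothing
 * did, -1 on a read error.
 */
int dir_changed(int fd)
{
    char buf[4096];   /* large enough for at least one event plus its name */
    int changed = 0;

    assert(fd >= 0);
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);

        if (n > 0) {              /* got events, keep draining */
            changed = 1;
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;
        if (n < 0 && errno != EAGAIN)
            return -1;
        return changed;           /* queue is empty */
    }
}
```

The intended use is: call watch_dir() before the first opendir(), run the traversal, then call dir_changed(); if it returns 1, throw the result away and traverse again. Since changes are expected to be rare, the loop should normally run once.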
 
can I have multiple streams open on a single directory? If, say, a function that is called by several threads opens a stream to a given directory with opendir, do several open streams now exist on that one directory?
Yes. Each opendir uses a separate file descriptor and buffer. This means readdir on each will return the same entries. If you want each thread to work on separate entries (and not the same ones), give the result of a single opendir to all the threads and let them each use readdir_r. If you look at the IMPLEMENTATION NOTES section in the pthread manpage, it tells you that libc is thread safe when you use pthread. Also, when in doubt, look at the sources (/usr/src/lib/libc/gen/readdir.c). By navigating the sources you will learn a lot more (as opposed to random bits by asking people, most of whom don't always have the most up-to-date info).
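In the spirit of small test programs, here is a minimal sketch of the first case: several threads each call opendir() on the same path and read their own stream to the end, with no locking. The names scan_job, scan_dir, scan_concurrently and MAX_SCANNERS are made up for illustration. It assumes nothing modifies the directory while the threads run, so every thread should see the same number of entries.

```c
#include <assert.h>
#include <dirent.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_SCANNERS 16   /* arbitrary cap for this sketch */

/* Hypothetical per-thread job record. */
struct scan_job {
    const char *path;   /* directory to traverse */
    long count;         /* entries seen, or -1 if opendir() failed */
};

/* Thread body: open a private stream, count its entries, close it. */
void *scan_dir(void *arg)
{
    struct scan_job *job = arg;
    DIR *dir = opendir(job->path);   /* this DIR * never leaves the thread */

    if (dir == NULL) {
        job->count = -1;
        return NULL;
    }
    job->count = 0;
    while (readdir(dir) != NULL)
        job->count++;
    closedir(dir);
    return NULL;
}

/*
 * Scan `path` from `nthreads` threads at once. Returns the entry count if
 * every thread saw the same number of entries, -1 otherwise.
 */
long scan_concurrently(const char *path, int nthreads)
{
    pthread_t tid[MAX_SCANNERS];
    struct scan_job job[MAX_SCANNERS];
    int started = 0;
    long result;

    assert(nthreads >= 1 && nthreads <= MAX_SCANNERS);
    for (int i = 0; i < nthreads; i++) {
        job[i].path = path;
        job[i].count = -1;
        if (pthread_create(&tid[i], NULL, scan_dir, &job[i]) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    if (started != nthreads)
        return -1;

    result = job[0].count;
    for (int i = 1; i < nthreads; i++)
        if (job[i].count != result)
            return -1;
    return result;
}
```

Calling scan_concurrently("/usr/src", 8), for example, runs eight independent streams over one directory. The count includes "." and ".." on filesystems that report them.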
 
Also, when in doubt, look at the sources (/usr/src/lib/libc/gen/readdir.c). By navigating the sources you will learn a lot more (as opposed to random bits by asking people, most of whom don't always have the most up-to-date info).

Already life changing advice. For some reason I had thought this would be a lot harder to do. But actually I was able to easily find this and related code, read it, and understand it. Also gives you an appreciation for the way these things are programmed, I really like how they did it.

But problems. Because I read and re-read the code, and I can't find anything readdir_r does different than readdir other than store in a buffer provided by the user. And I see nothing anywhere to suggest to me the existence of {NAME_MAX}, about which there is a big warning in red letters in the man page. So I can't figure out what the man page is trying to warn about. It reminds me of when I went digging through the source for zfs, and thought I was understanding everything, but there was a bunch of stuff behind a curtain I didn't even see that was actually essential.

As for the work, what I am trying to do is actually a lot simpler. I just want as many threads as wish to exist to be able to open the same directory on a different stream and return all entries. I don't need any particular thing done on each entry, they all are just part of a map. Whatever work I need done, I do through absolute paths and fopen or whatnot. But, if I understand what you are saying and what I am reading in the source here right, readdirs on independent opendirs can go on unmolested by readdirs on other DIR streams per given DIR stream. Which for my current purposes would be sufficient.

As for the thread safe advisory on the pthreads man page, I did see it, but I don't understand threads nearly well enough yet to take a phrase so vast at the face value it would have to a total noob like me. I have to take it as meaningless and answer the question per case as it comes up, until I finally understand what a library being thread safe actually means.

For example, in the code for readdir here, I can see that a pthread mutex is in fact used. Of course, that wouldn't solve the initial problem I was trying to solve, but it tells me how they guarantee atomicity (if I am using the word right) per specific readdir occurrence in a threaded environment.

Thanks again, doing in hours what would take years without you all's illustrious help.
 
But, if I understand what you are saying and what I am reading in the source here right, readdirs on independent opendirs can go on unmolested by readdirs on other DIR streams per given DIR stream.
correct.
Because I read and re-read the code, and I can't find anything readdir_r does different than readdir other than store in a buffer provided by the user.
This way each thread can copy a dir entry into its own buffer. The buffer associated with a DIR may get reused by another thread while the first one is using it. The readdir_r call is thread safe, but if you share the DIR buffer, that use is not thread safe.
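One way to get the same effect without readdir_r (which the man page warning mentioned above discourages) is to wrap plain readdir() in a lock and copy the name into a caller-owned buffer before releasing it. This is only a sketch under that assumption, and shared_dir and next_name are made-up names. The copy matters because the pointer readdir() returns may be overwritten by the next readdir() call on the same stream.

```c
#include <assert.h>
#include <dirent.h>
#include <pthread.h>
#include <string.h>

/* Hypothetical wrapper: one DIR stream shared by several threads. */
struct shared_dir {
    pthread_mutex_t lock;
    DIR *dir;
};

/*
 * Copy the next entry name into the caller's own buffer.
 * Returns 1 if a name was copied, 0 at end of directory (or on a
 * readdir() error), -1 if the name did not fit and was skipped.
 */
int next_name(struct shared_dir *sd, char *buf, size_t buflen)
{
    struct dirent *e;
    int rc = 0;

    assert(sd != NULL && buf != NULL && buflen > 0);
    pthread_mutex_lock(&sd->lock);
    e = readdir(sd->dir);
    if (e != NULL) {
        size_t n = strlen(e->d_name);

        if (n < buflen) {
            memcpy(buf, e->d_name, n + 1);   /* copy while still holding the lock */
            rc = 1;
        } else {
            rc = -1;
        }
    }
    pthread_mutex_unlock(&sd->lock);
    return rc;   /* `e` must not be used past this point */
}
```

Each thread calls next_name() in a loop with its own buffer, so every entry is handed to exactly one thread and nobody keeps a pointer into the shared stream.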
As for the thread safe advisory on the pthreads man page, I did see it, but I don't understand threads nearly well enough yet to take a phrase so vast at the face value it would have to a total noob like me. I have to take it as meaningless and answer the question per case as it comes up, until I finally understand what a library being thread safe actually means.
Thread safe means that if two threads make the same call at the same time on the same underlying object, their side effects on the object won’t be intermingled but will happen in some order. Concurrency is a vast subject and a bit tricky so I’d recommend reading up on it and writing some small programs to get a feel for it. When I was a newbie I internalized what I read from man pages by writing lots of small test programs. As Ronald Reagan once said, “Trust but verify”! Writing clear, unambiguous and complete documentation is very hard for most of us but test programs can help.
 
I was going to use kqueue, but inotify is just clearly superior in terms of monitoring a directory. Kqueue doesn't seem to understand the concept of a directory at all, so you have to track down every entry and monitor it individually (if I got it right).

So I ended up using it as an excuse for finally upgrading to FreeBSD 15. Happy camper.
Thread safe means that if two threads make the same call at the same time on the same underlying object, their side effects on the object won’t be intermingled but will happen in some order. Concurrency is a vast subject and a bit tricky so I’d recommend reading up on it and writing some small programs to get a feel for it. When I was a newbie I internalized what I read from man pages by writing lots of small test programs. As Ronald Reagan once said, “Trust but verify”! Writing clear, unambiguous and complete documentation is very hard for most of us but test programs can help.

This is generally also my school of thought. Because this project has somehow captivated me and gathered its own momentum, some of the discipline has been relaxed. But definitely with the awareness that I am taking (by my standards) shortcuts, and will pay for it in a number of ways. Still, whenever I run into a problem I can't understand, my solution is still to make a small test program, and modify it until I understand what I was doing wrong. This project is basically a lego contraption with pieces that I constructed myself. Well, and with generous input from a frankly unlikely group of good-willed experts.

We'll see. The design itself is simple enough that it is very newb-forgiving and prone to learning by doing. Nothing fancy.
 