Best formula for calculating how many threads to use for building?

My postinstaller needs to compile some stuff.
Of course, the compile part should be finished as quickly as possible.

In the Handbook, section 24.5.4.2, there is a note about the -j option for make(1).

Now I am thinking about what a good formula for determining the number of compile threads could/should look like.
My first thought looks like this:

Code:
use strict;
use warnings;

# Use sysctl -n so only the value is returned, not "hw.physmem: <value>".
chomp(my $mymemory = `sysctl -n hw.physmem`);
chomp(my $mycpus   = `sysctl -n hw.ncpu`);

my $mymemorypercompilethread = 500_000_000;   # assumed RAM per compile job
my $myresidualmem            = 1_000_000_000; # keep this much free for the OS

# How many jobs fit into memory, and a hard cap of 3 jobs per CPU.
my $maxthreadsinmem = int(($mymemory - $myresidualmem) / $mymemorypercompilethread);
my $maxthreadslimit = $mycpus * 3;

my $threadstocompilewith =
    ($maxthreadslimit > $maxthreadsinmem)
        ? $maxthreadsinmem
        : $maxthreadslimit;
$threadstocompilewith = 1 if $threadstocompilewith < 1;
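
For completeness, the computed value would then be handed to make(1) roughly like this (the source directory and build target are only placeholders):

Code:
# Hypothetical usage of the value computed above; path and target are placeholders.
chdir "/path/to/my/sources" or die "chdir: $!";
system("make", "-j$threadstocompilewith", "all") == 0
    or die "make exited with status $?";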
Maybe somebody has already done research on this topic and found a good formula for determining how many threads can be used without thrashing?
 
There is no single formula. It depends on the balance of CPU and IO. That in turn depends crucially on what you are compiling and how much CPU time is used in compiling, which changes from compiler to compiler and version to version, and depends crucially on the software itself; for example, some numerical codes are known to use a lot of CPU time in the compiler when optimizing.

Monitor memory usage as you do this; swapping is exceedingly inefficient (fortunately, these days most serious computer users have 1/4 or 1/2 TB of RAM on their development machine).

The IO system also depends on lots of things: how the compiler is implemented, for one; does it write temporary files, or does it keep intermediate files in memory? It used to be that C++ compilers expanded template instantiations into files in /tmp/ and then compiled those, so you might be thrashing in /tmp. It also depends crucially on caching, because in most compilation workflows files are written (like .o files) and then immediately read back (for linking); if they are still in memory, things will go fast. And it depends on file system specifics; for example, some linkers are (in-)famous among file system implementors for creating large sparse files and then filling in the gaps, and depending on how the file system implements pre-allocation, this can be efficient or not.
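
On the memory-usage point: if you want to watch for swapping from within the postinstaller itself, a minimal sketch could poll the FreeBSD sysctls vm.stats.vm.v_free_count and hw.pagesize while the build runs; the 512 MB low-water mark and the make invocation below are just assumptions:

Code:
use strict;
use warnings;
use POSIX ":sys_wait_h";

# Sketch: warn if free memory drops below an assumed low-water mark
# while a child build process is running.
my $lowwater = 512 * 1024 * 1024;              # assumed 512 MB threshold
chomp(my $pagesize = `sysctl -n hw.pagesize`);

my $pid = fork();
die "fork: $!" unless defined $pid;
if ($pid == 0) {
    exec("make", "-j8", "all") or die "exec: $!";   # placeholder build command
}

while (waitpid($pid, WNOHANG) == 0) {
    chomp(my $freepages = `sysctl -n vm.stats.vm.v_free_count`);
    warn "free memory is low, consider a smaller -j\n"
        if $freepages * $pagesize < $lowwater;
    sleep 5;
}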

The correct answer is: measure it for your situation and find the optimum. Having done that a few times, I can tell you that the answer may be all over the map. A few data points:

About 20 years ago, on PA-RISC systems running HP-UX and the HP compiler (not gcc), the optimum was 8 compile processes per CPU (meaning on an 8-CPU SMP machine, use 64 processes). The file system was local disk with an excellent local filesystem (VxFS).

Counter-example: a 30-node cluster, each node a high-end 24- or 32-core Intel CPU, but all nodes using a common NFS file server implemented with ext2 over gigE. There the optimum was 2 compiles per machine (not 2 per core, but 2 per node); anything more, and the NFS server beat itself to death. We then replaced the NFS file system with a cluster file system using Infiniband with direct access to disks (I think we had a few dozen or a few hundred disk drives), and the optimal number became extremely large: running many hundred compiles in parallel was faster than running just 100 or 200.

On my laptop (a standard Intel Mac with only 16 GB of RAM), the optimum with an SSD for large C++ projects seems to be 8 or 16 compiles, but the last stage (linking) is always the bottleneck and ends up running in a single process.
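
If you want to automate that measurement, a rough sketch in the same vein as the Perl above could time a clean rebuild for a few candidate -j values and report the fastest; the candidate list and the make targets are only assumptions:

Code:
use strict;
use warnings;
use Time::HiRes qw(time);

# Crude benchmark: time a clean rebuild for a few candidate -j values.
# The candidate list and the "clean"/"all" targets are placeholders.
my @candidates = (2, 4, 8, 16);
my %elapsed;

for my $jobs (@candidates) {
    system("make", "clean") == 0 or die "make clean failed";
    my $start = time();
    system("make", "-j$jobs", "all") == 0 or die "make -j$jobs failed";
    $elapsed{$jobs} = time() - $start;
    printf "-j%-2d  %.1f s\n", $jobs, $elapsed{$jobs};
}

my ($best) = sort { $elapsed{$a} <=> $elapsed{$b} } keys %elapsed;
print "Fastest on this machine and file system: -j$best\n";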

If you are looking for bottlenecks, the biggest factors are avoiding swap, using the file system cache efficiently, and having enough workload that the file system can perform disk accesses sequentially (write-behind, prefetching).
 