FreeBSD libc much slow than Linux

zirias@ · Mar 8, 2023

astyle said:
Yeah, but that's definitely not a typical use case.

It certainly is the usecase for functions that virtually any program will use, which is true for the string functions in libc.

astyle said:
If you run the stress test A on a Threadripper and on an Athlon, of course you'll notice that Threadripper will show far better performance.

I'd have some doubts about this looking at the straight-forward implementation of e.g. strcpy(): for (char *tmp = dest; *tmp++ = *src++;); return dest;. Short of optimizations done by the compiler, this copies one byte at a time with no opportunity to parallelize anything.

But still what Jose said: "real workloads" is what matters. If they show a relevant difference between FreeBSD's libc and glibc then it's time for action. (edit: as mentioned earlier, there are optimized machine-specific implementations available in FreeBSD's libc, and maybe they're just good enough ....)

cracauer@ · Mar 8, 2023

The following function...

Code:

const char *buf1 = "lkaskdjkasdjkajsdkjasddlasdjlaskdjklajsdljasldjlasdjlkajsldjlajsdl";
void test_strncpy()
{
  char buf2[8192];
  strncpy(buf2, buf1, sizeof(buf2)-1);
  if (buf2[1] == 'a')
    exit(1);
}

... in a tight loop takes 72.4 nsec on Linux Ubuntu 20.04.5 LTS and 166.1 nsec on FreeBSD-14-current.

ETA: processor is Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz in a dual-boot laptop.

zirias@ · Mar 8, 2023

cracauer@ still the question remains: is this relevant for real usage scenarios? That's something a benchmark just testing these functions in isolation can never answer....

jb82 · Mar 8, 2023

strncpy boils down to memcpy in glibc

astyle · Mar 8, 2023

cracauer@ said:
The following function...

Code:

const char *buf1 = "lkaskdjkasdjkajsdkjasddlasdjlaskdjklajsdljasldjlasdjlkajsldjlajsdl"; void test_strncpy() { char buf2[8192]; strncpy(buf2, buf1, sizeof(buf2)-1); if (buf2[1] == 'a') exit(1); }

... in a tight loop takes 72.4 nsec on Linux Ubuntu 20.04.5 LTS and 166.1 nsec on FreeBSD-14-current.

And processor speed doesn't matter in this case? like a 1.6 GHz Celeron vs a 4 GHz Xeon? Even if you can't parallelize execution or boil down instructions...

cracauer@ · Mar 8, 2023

zirias@ said:
cracauer@ still the question remains: is this relevant for real usage scenarios? That's something a benchmark just testing these functions in isolation can never answer....

Yes, my benchmark suffers from the same problems. It's a L1 cache burst runner. I was just curious whether I could reproduce the 10x with even a useless benchmark. I could not. For me it's more like 2x.

cracauer@ · Mar 8, 2023

astyle said:
And processor speed doesn't matter in this case? like a 1.6 GHz Celeron vs a 4 GHz Xeon? Even if you can't parallelize execution or boil down instructions...

Processor for my benchmark is Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz in a dual-boot laptop.

jb82 · Mar 8, 2023

cracauer@ said:
Yes, my benchmark suffers from the same problems. It's a L1 cache burster. I was just curious whether I could reproduce the 10x with even a useless benchmark. I could not. For me it's more like 2x.

Sorry, this is a wrong deduction. Having useless benchmark with factor 2 doesn't say anything with about a little bit more complicated stuff like what stress-ng does. It still can be 10x, well, probably not. Again, I'm not saying that FreeBSD implementation is rubbish. I'm just asking whether we shouldn't try a tiny bit harder and whether this isn't a low hanging fruit. Checking whether compiler properly vectorized those loops and such things.

Asking for more complex benchmarks is fair, but there's no universal answer to that, so it partially ruins this effort a priori.

zirias@ · Mar 8, 2023

jb82 said:
little bit more complicated stuff like what stress-ng does.

So, what does it do exactly? Did you have a look? Why do you think it's relevant?

jb82 said:
I'm just asking whether we shouldn't try a tiny bit harder and whether this isn't a low hanging fruit.

As for the low-hanging fruit, it certainly isn't. We're already in the area of custom assembly code. This is hard to read/understand/maintain and therefore hard to ensure correctness.

As for the question itself, whether there's a necessity, I still say: have a look at real scenarios, not isolated benchmarks. Otherwise you'd never know whether it would be worth the effort.

cracauer@ · Mar 8, 2023

jb82 said:
Sorry, this is a wrong deduction. Having useless benchmark with factor 2 doesn't say anything with about a little bit more complicated stuff like what stress-ng does. It still can be 10x, well, probably not. Again, I'm not saying that FreeBSD implementation is rubbish. I'm just asking whether we shouldn't try a tiny bit harder and whether this isn't a low hanging fruit. Checking whether compiler properly vectorized those loops and such things.

Asking for more complex benchmarks is fair, but there's no universal answer to that, so it partially ruins this effort a priori.

Well, the string routines are written in assembler on either OS. So the compiler has little influence other than tricking it into not optimizing the whole thing away.

jb82 · Mar 8, 2023

cracauer@ said:
Well, the string routines are written in assembler on either OS. So the compiler has little influence other than tricking it into not optimizing the whole thing away.

You're right. glibc e.g. provides specialized AVX2 implementations using vpcmpeqd, vpmovmskb and such.

Crivens · Mar 8, 2023

cracauer@ said:
Processor for my benchmark is Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz in a dual-boot laptop.

Just to verify: same compiler? Impact of ld.so dynamic lookup? (Avoid by static linkage or call func once before meassurement start)

ralphbsz · Mar 8, 2023

cracauer@ said:
That can go very quickly if you work on software that is running on thousands of CPUs.

Exactly. A very large fraction of all computer usage today (worldwide) is on machines used/owned by the hyperscalers. And they do modify standard libraries for performance gains. But those performance gains, on real-world workloads, tend to be very small: Single-digit percent, sometimes smaller. Having worked for some of the large computer companies and hyperscalers, I've seen enough projects where a "small" modification such as the one proposed by the OP is evaluated and put into production, because it creates savings, often on the order of 1% of overall CPU usage. Commonly this is actually not string handling (which is usually already "optimized enough"), but more malloc, crypto/checksum, and network protocol handling. But at the scale of something like Amazon or Google, saving a few percent of all CPU usage can get close to a billion dollars per year. Usually other changes (such as changing algorithms to make the code more cache friendly) have much bigger effects.

My educated guess: For the typical amateur system (server or desktop), the overall effect of this change is probably far below 1%, probably unmeasurably small (as systems are not repeatable at the sub-0.1% level). For specific workloads that are string-intensive, it is probably single digit percent.

zirias@ said:
I'd have some doubts about this looking at the straight-forward implementation of e.g. strcpy(): for (char *tmp = dest; *tmp++ = *src++;); return dest;. Short of optimizations done by the compiler, this copies one byte at a time with no opportunity to parallelize anything.

Actually, on modern CPUs (where the CPU cores are much faster than the memory interfaces, and spend a significant fraction of time waiting for memory), such an implementation is probably pretty good. The perceived inefficiency of moving one byte at a time will be hidden by the L1 cache, and repeated instructions spinning very fast. And for small strings (fewer than dozens of bytes), it may even be optimal, since setting up vector registers takes time too.

ralphbsz · Mar 8, 2023

To answer Crivens' question: This code was run in a tight loop, so things like ld.so should be amortized out. Code generation in both gcc and clang should be near optimal for short fragments.

cracauer@ said:
The following function...
const char *buf1 = "lkaskdjkasdjkajsdkjasddlasdjlaskdjklajsdljasldjlasdjlkajsldjlajsdl";
... in a tight loop takes 72.4 nsec on Linux Ubuntu 20.04.5 LTS and 166.1 nsec on FreeBSD-14-current.

Given the CPU speed of 2.2 GHz, and the fact that the string is 66 bytes long, that works out to 2.4 cycles per byte on Linux, and 5.5 on FreeBSD. Given that simple looped instructions tend to run single cycle (after decoding and assignment to execution units), this implementation is probably close to the simple loop given as an example above, suitably optimized.

cracauer@ · Mar 8, 2023

Crivens said:
Just to verify: same compiler? Impact of ld.so dynamic lookup? (Avoid by static linkage or call func once before meassurement start)

FreeBSD: gcc version 12.2.0 (FreeBSD Ports Collection)
Linux: gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)

I run for half a second in a loop, so lookup shouldn't matter.

cracauer@ · Mar 8, 2023

ralphbsz said:
To answer Crivens' question: This code was run in a tight loop, so things like ld.so should be amortized out. Code generation in both gcc and clang should be near optimal for short fragments.

Given the CPU speed of 2.2 GHz, and the fact that the string is 66 bytes long, that works out to 2.4 cycles per byte on Linux, and 5.5 on FreeBSD. Given that simple looped instructions tend to run single cycle (after decoding and assignment to execution units), this implementation is probably close to the simple loop given as an example above, suitably optimized.

The processor actually goes into 4.1 GHz turbo mode on the core the benchmark is running on.

cracauer@ · Mar 8, 2023

Code:

const char *buf1 = "lkaskdjkasdjkajsdkjasddlasdjlaskdjklajsdljaszdjlasdjlkajsldjlajsdl";

void test_strchr()
{
  const char *res;
  res = strchr(buf1, 'z');
  if (*res == 'a')
    exit(1);
}

Linux: 3.8 ns. (Linux has a bazillion assembly implementations, dunno which is in effect)
FreeBSD: 11 ns. (C implementation)

Link to glibc files: https://sourceware.org/git/?p=glibc.git;a=tree;f=sysdeps/x86_64/multiarch;hb=HEAD

FreeBSD actually has a special gcc implementation, but libc is compiled with llvm these days, so it is just pure C even though I use gcc to compile the benchmark.

Jose · Mar 9, 2023

ralphbsz said:
Exactly. A very large fraction of all computer usage today (worldwide) is on machines used/owned by the hyperscalers. And they do modify standard libraries for performance gains. But those performance gains, on real-world workloads, tend to be very small: Single-digit percent, sometimes smaller...

So true. A taste:

View: https://www.youtube.com/watch?v=ncHmEUmJZf4&t=218s

The whole talk is good, BTW.

Crivens · Mar 9, 2023

ralphbsz said:
To answer Crivens' question: This code was run in a tight loop, so things like ld.so should be amortized out

No. There is late binding, which would resolve the symbol at first use. If that is inside the loop, your nano second measurements will be meaningless.

Crivens · Mar 9, 2023

Benchmarking this more meaningfull:

Static linkage.
Disassemble to verify the compiler did not hose your loop (like moving that call out of that loop).
Use performance counters.
Try valgrind.

FreeBSD libc much slow than Linux

zirias@

cracauer@

zirias@

jb82

astyle

cracauer@

cracauer@

jb82

zirias@

cracauer@

jb82

Crivens

Administrator

ralphbsz

ralphbsz

cracauer@

cracauer@

cracauer@

Jose

Crivens

Administrator

Crivens

Administrator