FreeBSD libc much slower than Linux

Hey,
I benchmarked string functions from FreeBSD's libc and they seem to be substantially slower than the current ones on Linux.

Where can one find the specific version/implementation that was used for, e.g., 13.1-RELEASE?
 
That said, please read the code all you want, and suggesting actual improvements is most likely welcome. But since you didn't find the relevant code yourself (which surprises me a bit), I also have to add: please read the FreeBSD project's coding guidelines before suggesting changes, otherwise the time you invest might just be wasted ...
 
Feel free to check this bug report for stress-ng. I totally agree that proper benchmarking is very tricky; I was just a bit disappointed by such a huge difference.
Perhaps there is room for improvement in the libc used on FreeBSD. I'm just curious, not blaming anybody.

 
jb82 I had a quick look at a few implementations in the repo I linked above. Most of what I found is "straightforward", not "optimized".

I personally don't see a problem with that. Straightforward, clear code should always be the first choice: it's robust, portable, and avoids "esoteric" error conditions. You should never start optimizing unless you've proved that something actually is a performance bottleneck. Of course, that's pretty hard to judge for library functions; it very much depends on how they're used.

So, again, if you can come up with better code for some of these, with better meaning faster while still being readable, portable, etc ... then please do so ;)
 
Sounds like an acceptable modus operandi ;-)
On the other hand, stuff like libc string functions is so prevalent that many consumers could easily benefit, and IMHO a tiny fraction of esotericism would definitely be acceptable as well :cool:
Ok, I shall try a couple of experiments.
 
I dug a bit deeper. FreeBSD's libc also has machine-specific optimized functions, just like glibc. Of course the implementations differ. Let's look, for example, at where you would end up when calling strcpy(3) on an amd64 machine:

FreeBSD: https://cgit.freebsd.org/src/tree/lib/libc/amd64/string/stpcpy.S?h=releng/13.1
Glibc: https://sourceware.org/git/?p=glibc...70474d66bd22e0c4d6c8af4ef75994f386794;hb=HEAD

I guess the difference can be seen here:
FreeBSD: https://cgit.freebsd.org/src/tree/lib/libc/amd64/string?h=releng/13.1
Glibc: https://sourceware.org/git/?p=glibc...2748c6609fc62e4eefc9f217197ea8e5bca3c;hb=HEAD

Be aware that the Glibc link above contains more than just optimized string implementations, but there's still more string-related code there than in FreeBSD's libc. There are even multiple alternative implementations just for strcpy(), which I used as an example above.

Most of these optimized implementations are written in assembler. That's certainly already a burden for maintainability and readability (and, therefore, for ensuring correctness).

So, the key question remains: Where does it really offer a benefit to "optimize"? I have no doubts a benchmark just measuring libc string functions will show glibc as the "winner". But is it really relevant? I don't remember many "generic" benchmarks showing a serious edge for a GNU/Linux system over FreeBSD.

That all said, sure, do your experiments ;)
 
The next step is to run some CPU performance counters, but only if there isn't an obvious difference to start with. In the links above, GNU libc uses SSE2 to scan 8 bytes at a time, and does that in an unrolled loop, whereas FreeBSD has a comment about maybe unrolling the loop :). SSE2 is present in all amd64 chips, so it is fair game.

I haven't checked how the above benchmark does it, but a benchmark shouldn't just hit a single string in a loop. Having the whole benchmark, including all its data, fit in the L1 cache does not represent real-world usage of string functions, such as processing texts. Normally, taking single steps of 8 bytes at a time wouldn't matter that much, because the CPU would constantly stall: once you hit main memory, a few instructions more or less disappear in the noise. And using the same string over and over leads to unrealistically good branch prediction.

It would also be interesting to look at the kernel functions for e.g. memcpy. They might be different from the userland implementations.
 
PMC data is what I'm after. Whether or not stuff fits in L1 is IMHO not that clear-cut, because prefetchers these days can do tremendous work. I agree that microbenchmarking is pretty much worthless and that doing things in assembler is a step backwards. On the other hand, I was just curious; I wasn't objecting to or blaming anybody for the lower performance results in stress-ng -str.
 
cracauer@ I'm confused.

Wouldn't taking advantage of SSE2 in FreeBSD's amd64 strcpy(3) make it significantly faster? I can't imagine it would run slower than the current implementation in a real benchmark. Why not have it in there, to help possibly speed up other parts of FreeBSD?
 

Well, the fact of the matter is that Linux has a lot more resources pumped into it. Sometimes that leads to a lot more code for little measured benefit (e.g. the CPU scheduler), but sometimes it is arguably better. We need a volunteer, or somebody to finance the work, for our own SSE2-optimized routines.

Just to clarify, we can't take the glibc implementation because of the GPL.

But nothing prevents you from using the glibc routines if the license doesn't pose a problem for you: compile them into a shared library and use them via LD_PRELOAD in a string-heavy application.
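The LD_PRELOAD route can look like the following sketch. The file and library names are made up, and the function body here is just the plain fallback as a placeholder; the point is the mechanism, where the dynamic linker resolves the symbol from the preloaded object before libc's copy:

```c
/* fast_str.c -- hypothetical drop-in strcpy override.
 *
 * Build as a shared object and preload it, e.g.:
 *   cc -O2 -fPIC -shared -o libfaststr.so fast_str.c
 *   LD_PRELOAD=$PWD/libfaststr.so ./string_heavy_app
 *
 * Because libfaststr.so is searched before libc, every dynamically
 * linked strcpy() call in the application lands here instead. */
#include <string.h>

char *strcpy(char *dst, const char *src)
{
    char *ret = dst;
    while ((*dst++ = *src++) != '\0')   /* placeholder: swap in an */
        ;                               /* optimized body here     */
    return ret;
}
```

Note this only affects dynamically linked calls; statically linked binaries and calls the compiler has inlined as builtins won't go through the preloaded object.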
 
I'd like to add that it really helps to have identical hardware configs when comparing the performance of anything on FreeBSD vs. Linux. If the OP reads Phoronix, they should have a good idea of what it takes to make even a decent, reasonably fair comparison. Otherwise, there are just too many holes in the analysis, and it doesn't hold water. Getting those ducks in a row is a major challenge for most benchmarkers out there, even when they're backed by educational or corporate interests. If you pay enough attention, you can poke holes in about 90-95% of all the benchmarks you read about on Internet blogs.
 
"lies, damned lies and benchmarks"
Well, it's just an exaggerated way of saying that one should be careful with the interpretation ;)

To sum it up from my perspective:
  • It's always best to start with simple, readable (correct) code before thinking about optimizations
  • Optimizing functions that are used a lot definitely makes sense
  • Both FreeBSD's libc and glibc have machine-specific optimized string functions, but glibc has more of them and goes to greater lengths
  • Maintaining such optimized code and ensuring its correctness is a lot of work. There's room for improvement, but as already mentioned, FreeBSD has limited project resources
  • To decide whether it's worth optimizing further, a benchmark measuring just these functions doesn't really help; you need performance data from real usage in realistic scenarios
 
Also, consider Amdahl's Law. If you manage to shave some cycles off the string functions and spend a day on that, how many calls to that function does it take until you get a day back in return? Also, even if SSE takes fewer core cycles to do things, it does not magically speed up the memory interface. Even worse, the more optimized code is likely bigger and so needs more memory cycles to get into L1 and then the pipeline. Your benchmark data may tell you the new code takes only 20% of the cycles, but the wall clock disagrees. Optimization is NP-complete.
 
The trouble is not NP-completeness. The trouble is that the optimizations we are talking about are not well-defined.
 
The test the OP mentions is not even a benchmark:
(S)tress-ng will stress test a computer system in various selectable ways. It was designed to exercise various physical subsystems of a computer as well as the various operating system kernel interfaces...

(S)tress-ng was originally intended to make a machine work hard and trip hardware issues such as thermal overruns as well as operating system bugs that only occur when a system is being thrashed hard. Use stress-ng with caution as some of the tests can make a system run hot on poorly designed hardware and also can cause excessive system thrashing which may be difficult to stop.

(S)tress-ng can also measure test throughput rates; this can be useful to observe performance changes across different operating system releases or types of hardware. However, it has never been intended to be used as a precise benchmark test suite, so do NOT use it in this manner.
(Emphasis mine.)

It also appears to be very Linux-centric. I doubt these results have any relevance under real workloads.
 
That's correct, stress-ng isn't primarily a benchmark. But the difference isn't some 15-20%, it's 5-10x, so it was interesting to me. That's all.
 
That can go very quickly if you work on software that is running on thousands of CPUs.
Yeah, but that's definitely not a typical use case. And even a stress-limits comparison only makes sense if run on identical hardware setups. If you run stress test A on a Threadripper and on an Athlon, of course the Threadripper will show far better performance.
 
If you manage to shave some cycles off the string functions and spend a day on that, how many calls to that function does it need until you get a day in return?
Considering we'd get an optimized libc in FreeBSD saving CPU cycles, that day would be earned back very quickly; even more so for GNU libc, because it is used on many more machines.
 