Why are FreeBSD binaries smaller than Linux ones?

I did some more checks, and folks - the difference is in the toolchains used, more specifically in the linker. GCC uses GNU ld (the BFD linker) by default and has the option to use other linkers (gold, mold, LLD) as well.

LLVM, on the other hand, uses LLD by default.

Here we go, using a simple "Hello World" program.

Code:
#include <stdio.h>

int main(void)
{
    printf("Hello World.\n");
    return 0;
}

Compiled with GCC 12.2.0 under a recent Debian, standard parameters: 15925 bytes unstripped.

Compiled with the same setup, but with the LLD linker instead (gcc hello.c -o hello -fuse-ld=lld): 5928 bytes unstripped.

Compiled with the standard toolchain under FreeBSD 14.0-RELEASE: 9664 bytes unstripped.

So it's the linker, baby!

Just for fun, I also tested your "hello world" example with the flags I added when compiling QBE, and the results are the following:

FreeBSD: 4.5K
Linux: 8.0K
 
FreeBSD 10, released in 2014, started using LLVM/Clang for certain architectures; versions of FreeBSD before that used GCC. Since then, other parts of the GNU toolchain were gradually replaced too. I wonder how GCC and LLVM with the different linkers compare on a recent FreeBSD?


Good thought! 2014 was a decade ago, so I don't know if LLD (Clang itself doesn't seem to matter, based on our tests) was good enough back then. But it would be interesting if someone who used FreeBSD back in those versions could tell us whether they noticed any significant file-size changes!
 
Good thought! 2014 was a decade ago, so I don't know if LLD (Clang itself doesn't seem to matter, based on our tests) was good enough back then. But it would be interesting if someone who used FreeBSD back in those versions could tell us whether they noticed any significant file-size changes!
I meant comparing GCC to LLVM, using LLD and GNU ld, all on a recent (production) FreeBSD, since at one time GCC was the default on FreeBSD.

But back then things were different. FreeBSD compile times and unnecessary dependencies got much better with LLVM/Clang once it became the replacement in the base system. Later, a lot of those improvements made their way to GCC on FreeBSD as well.
 
I meant comparing GCC to LLVM, using LLD and GNU ld, all on a recent (production) FreeBSD, since at one time GCC was the default on FreeBSD.

Oh! Well, in that case, if you look at the discussion from the beginning, I do mention that an internet search says there shouldn't be a difference between compilers, but based on our tests the linker seemed to do all the "magic".

But back then things were different. FreeBSD compile times and unnecessary dependencies got much better with LLVM/Clang once it became the replacement in the base system. Later, a lot of those improvements made their way to GCC on FreeBSD as well.

I'm happy to hear that! FreeBSD seems nice. For now I have only tried it in a VM, and I liked it so much that in a few days, when 14.1 releases (if everything goes as scheduled), I plan to run it on real hardware. If I can carry over my whole workflow, I'll migrate to FreeBSD as my main OS!
 
I don't think the results of the different ld tests on the previous page account for the difference in values reported by size(1).
 
I don't think the results of the different ld tests on the previous page account for the difference in values reported by size(1).
Having written a smart linker in the past, I certainly do. I would need to take the resulting binaries apart to check for dead code included, but smart linking can make a big difference. And it is harder than it seems.
 
I don't think the results of the different ld tests on the previous page account for the difference in values reported by size(1).
So far it is the most plausible explanation for the observed size differences between Linux and FreeBSD binaries, all the more so since the behaviour is reproducible.

So if you doubt it, what's your hypothesis instead?
 
Having written a smart linker in the past, I certainly do. I would need to take the resulting binaries apart to check for dead code included, but smart linking can make a big difference. And it is harder than it seems.
What's "smart linking". You mean link time optimizations?
 
What's "smart linking". You mean link time optimizations?
Yes and no.
There are things a linker can do that the compiler can't. But the area gets blurry with LLVM and whole-program optimization (WPO), where the linker puts the low-level LLVM bitcode together and passes it to the code generator to make one binary. There you can then have global register allocation and more, but I digress.
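To make that concrete, here is a toy example of my own (the file names are made up; -flto is a real flag for both GCC and Clang):
Code:
/* add.c (hypothetical) */
int add1(int x) { return x + 1; }

/* main.c (hypothetical) - build with: cc -flto -O2 add.c main.c
   At link time the optimizer sees add1's body across the file
   boundary, inlines the call and folds main to "return 42" -
   something a per-file compile cannot do, since it never sees
   the definition of add1. */
int add1(int);
int main(void) { return add1(41); }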

Traditionally, a linker operated on object files as units. It took the object files from the command line, resolved their open symbols, and made several passes over the libraries to resolve and pick up the rest. The libraries were also just sets of object files.

A smart linker loads only what is needed. You start with the first code section in crt.o, which is the first object file on the command line, and mark it as used. Then you resolve only what that needs, not the whole object file. If an object contains a big static data structure, old-style linking would include it no matter what; a smart linker checks whether that symbol is referenced and, if not, leaves it out. You need to make many passes over the objects and libraries, which is why this takes more memory. Don't forget, back in the day the PDP came with memory in the kilobyte range.
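A minimal sketch of that mark phase, mine rather than any real linker's code; the section record is made up, and real linkers do this per section (cf. ld's --gc-sections) rather than per object file:
Code:
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical section record: a chunk of code/data plus what it references. */
struct section {
    const char *name;
    bool used;
    struct section **refs;
    size_t nrefs;
};

/* Classic worklist traversal: start at the entry point and pull in only
   what is referenced, transitively. Whatever stays unmarked is dropped. */
static void mark_live(struct section *entry)
{
    struct section *work[64];
    size_t top = 0;

    work[top++] = entry;
    while (top > 0) {
        struct section *s = work[--top];
        if (s->used)
            continue;                      /* already visited */
        s->used = true;
        for (size_t i = 0; i < s->nrefs; i++)
            work[top++] = s->refs[i];
    }
}

int main(void)
{
    struct section putsec   = { "puts",      false, NULL, 0 };
    struct section *mrefs[] = { &putsec };
    struct section mainsec  = { "main",      false, mrefs, 1 };
    struct section bigtable = { "big_table", false, NULL, 0 };  /* never referenced */
    struct section *crefs[] = { &mainsec };
    struct section crt      = { "crt0",      false, crefs, 1 };

    mark_live(&crt);                        /* crt0 is the root */

    struct section *all[] = { &crt, &mainsec, &putsec, &bigtable };
    for (size_t i = 0; i < 4; i++)
        printf("%-9s %s\n", all[i]->name, all[i]->used ? "kept" : "dropped");
    return 0;
}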

This approach can give you very small binaries, but it can also expose sloppy coding that depends on side effects. Take this as an example:
Code:
extern int setup_my_signal_handlers();
static int some_result = setup_my_signal_handlers();  // C++ dynamic initializer

int main(int ac, char** av)
{
....

Someone uses C++ static initialization to set up his signal handlers. Only, if he never touches "some_result" in his main() or in any other used code, the variable will be dropped and the setup function never called. Speaking of constructors, that is some more magic. How do you order them, and what can you do if they form a cycle? Using stream I/O to report an error in the memory allocator will be bad, because memory allocation needs to be initialized before I/O, so what now? You at least have to detect this and report it.
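For what it's worth, plain C under GCC/Clang can express the same run-before-main intent with the (real) constructor attribute; a dead-stripping linker then has to treat the generated .init_array entries as roots, which is exactly why standard linker scripts wrap that section in KEEP(). A small sketch of mine:
Code:
#include <signal.h>

/* Runs before main() via .init_array. Nothing ever references it by
   name, so a naive "drop everything unreferenced" pass would lose it. */
__attribute__((constructor))
static void setup_my_signal_handlers(void)
{
    signal(SIGINT, SIG_IGN);    /* placeholder handler setup */
}

int main(void)
{
    return 0;                   /* never mentions the function above */
}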

What else can a smart linker do, except throwing out the trash (and sometimes your car keys with it when you messed up)?

Coalesce common code. Template expansions will create identical code for different parameters. A list template instantiated with int, unsigned int, long, unsigned long, and pointers to different data structures will produce the same code, provided all the data types are 32 bits and are only checked for equality. You can throw out all the copies and keep one.
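The same effect shows up even without templates. The toy pair below (my example) is the kind of thing identical-code-folding linkers merge; lld spells the option --icf=all:
Code:
#include <stdbool.h>

/* On typical targets both compile to the same machine code: one integer
   compare, one set/return. An ICF pass can keep a single copy and point
   both symbols at it. */
bool int_equal(int a, int b)            { return a == b; }
bool uint_equal(unsigned a, unsigned b) { return a == b; }
The catch: folded functions end up sharing one address, while C and C++ promise that distinct functions compare unequal, which is why lld's more conservative --icf=safe mode exists.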

You can check which function calls which other function and place them next to each other, so they end up in the same cache line if small enough, or in the same memory page when bigger. Computing this call graph and deciding who goes where takes time, but it may save you a lot at run time later.

Now do the same with data and bss. But keep the entry point nailed down.
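A deliberately tiny greedy sketch of that layout idea, with made-up call counts (real linkers work from profile data and a proper clustering algorithm):
Code:
#include <stdio.h>

#define NFUNC 4

/* calls[i][j]: how often function i calls function j (invented numbers) */
static int calls[NFUNC][NFUNC] = {
    { 0, 9, 0, 1 },
    { 0, 0, 8, 0 },
    { 0, 0, 0, 2 },
    { 0, 0, 0, 0 },
};

int main(void)
{
    int placed[NFUNC] = { 0 };
    int order[NFUNC];
    int n = 0;

    order[n++] = 0;             /* the entry point stays first, "nailed down" */
    placed[0] = 1;

    while (n < NFUNC) {
        int last = order[n - 1], best = -1, w = -1;
        /* greedily pick the hottest not-yet-placed callee of the last one */
        for (int j = 0; j < NFUNC; j++)
            if (!placed[j] && calls[last][j] > w) {
                w = calls[last][j];
                best = j;
            }
        placed[best] = 1;
        order[n++] = best;
    }

    for (int i = 0; i < n; i++)
        printf("f%d ", order[i]);           /* prints: f0 f1 f2 f3 */
    printf("\n");
    return 0;
}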

The black magic comes when you start to patch instructions to make them PC-relative, where possible. PC-relative addressing is often limited to smaller displacements, so you pay with a NOP and win one less relocation entry to be processed later. In shared objects, that can save you from touching a complete page of code in the runtime memory image, so the effort is not to be sneezed at. You may end up with a completely PIC binary without setting a single option in the compiler.

Real black magic comes up when you detect that the source of a call and the destination do not share the same calling convention and you dig into the debug information to auto-magically create the stub code and insert it into the working set. I have not seen this outside of what I did, but it would make so many problems with modern languages and their bindings go away. Guess why I made that.
 
Yes and no.
There are things a linker can do that the compiler can't. But the area gets blurry with LLVM and whole-program optimization (WPO), where the linker puts the low-level LLVM bitcode together and passes it to the code generator to make one binary. There you can then have global register allocation and more, but I digress.

Traditionally, a linker operated on object files as units. It took the object files from the command line, resolved their open symbols, and made several passes over the libraries to resolve and pick up the rest. The libraries were also just sets of object files.

A smart linker loads only what is needed. You start with the first code section in crt.o, which is the first object file on the command line, and mark it as used. Then you resolve only what that needs, not the whole object file. If an object contains a big static data structure, old-style linking would include it no matter what; a smart linker checks whether that symbol is referenced and, if not, leaves it out. You need to make many passes over the objects and libraries, which is why this takes more memory. Don't forget, back in the day the PDP came with memory in the kilobyte range.

This approach can give you very small binaries, but it can also expose sloppy coding that depends on side effects. Take this as an example:
Code:
extern int setup_my_signal_handlers();
static int some_result = setup_my_signal_handlers();  // C++ dynamic initializer

int main(int ac, char** av)
{
....

Someone uses C++ static initialization to set up his signal handlers. Only, if he never touches "some_result" in his main() or in any other used code, the variable will be dropped and the setup function never called. Speaking of constructors, that is some more magic. How do you order them, and what can you do if they form a cycle? Using stream I/O to report an error in the memory allocator will be bad, because memory allocation needs to be initialized before I/O, so what now? You at least have to detect this and report it.

What else can a smart linker do, except throwing out the trash (and sometimes your car keys with it when you messed up)?

Just WOW! That detailed explanation was amazing! I can now see someone who enjoys what they are doing, and I'm happy to see that! Thanks a lot!

Coalesce common code. Template expansions will create identical code for different parameters. A list template instantiated with int, unsigned int, long, unsigned long, and pointers to different data structures will produce the same code, provided all the data types are 32 bits and are only checked for equality. You can throw out all the copies and keep one.

Yeah, but what if you do NOT want to do that? What if you want the code inlined, or you want different code in every executable (for whatever reason)? I think an option should be given.

You can check which function calls which other function and place them next to each other, so they end up in the same cache line if small enough, or in the same memory page when bigger. Computing this call graph and deciding who goes where takes time, but it may save you a lot at run time later.

Now do the same with data and bss. But keep the entry point nailed down.

Wow!!! That's actually neat, and I hadn't thought about it! I will surely implement it in my compiler and give you credit! Thank you!

The black magic comes when you start to patch instructions to make them PC-relative, where possible. PC-relative addressing is often limited to smaller displacements, so you pay with a NOP and win one less relocation entry to be processed later. In shared objects, that can save you from touching a complete page of code in the runtime memory image, so the effort is not to be sneezed at. You may end up with a completely PIC binary without setting a single option in the compiler.

When you say that it's limited to a smaller displacement, are you saying that because you have to use a signed number instead of the unsigned one you would use for an absolute address? And in that case, why do you pay with a NOP? What does a NOP have to do with it? I don't know what a "relocation entry" is; I'll look it up.

Real black magic comes up when you detect that the source of a call and the destination do not share the same calling convention and you dig into the debug information to auto-magically create the stub code and insert it into the working set. I have not seen this outside of what I did, but it would make so many problems with modern languages and their bindings go away. Guess why I made that.

I mean, like you said, you would need debug information to do that, and it would be very complicated. I like what languages like D and C++ do, where they allow you to choose the desired linkage for each symbol. That makes sense and ensures that you won't have bugs and other problems. Yes, you have to do things "manually", but at least they are correct. However, the best thing would be to have one language for everything, and I'm creating my language to try to do that, even though I know there's a 99% chance it will not happen.
 
When you say that it's limited to a smaller displacement, are you saying that because you have to use a signed number instead of the unsigned one you would use for an absolute address?
That depends on the architecture. On the m68k, "jsr" can take a 32-bit absolute address. You can shorten that to a "bsr d16(pc)", and now you have two bytes of nonsense in the instruction stream, so you place a NOP instruction there. x86 instruction formats are, well, as I have seen quoted from authority, "difficult to explain and impossible to love".
 
That depends on the architecture. On the m68k, "jsr" can take a 32-bit absolute address. You can shorten that to a "bsr d16(pc)", and now you have two bytes of nonsense in the instruction stream, so you place a NOP instruction there.

Wait. Aren't "jsr" and "bsr" different instructions? So why do you need to place a NOP?

x86 instruction formats are, well, as I have seen quoted from authority, "difficult to explain and impossible to love".

Well, I wouldn't say I love them, but so far I find them clearer, and I prefer them over RISC. RISC is harder for me to understand, and I cannot find good tutorials and docs for either RISC-V or ARM. The first one especially interests me, as it's future-proof and will at some point be the only ISA. But like I said, I can't even find good sources that translate opcodes to their binary format; my compiler will directly produce machine code, and I won't use an assembler, so the assembly syntax isn't the only thing I'm interested in. x86, on the other hand, is straightforward and has many more sources.
 
Let's look at a few examples from my Steam installation (that is just what I have at hand; those are Debian/Ubuntu binaries):
Code:
steam@desktop:/ % ls -lh {/home/steam/.steam/tmp/SteamLinuxRuntime_sniper/usr/lib/x86_64-linux-gnu,/usr/local/lib}/libfreetype.so*
lrw-------  1 steam  steam    21B  1 Jan  1970 /home/steam/.steam/tmp/SteamLinuxRuntime_sniper/usr/lib/x86_64-linux-gnu/libfreetype.so.6 -> libfreetype.so.6.17.4
-rw-r--r--  1 steam  steam   774K 28 Apr  2022 /home/steam/.steam/tmp/SteamLinuxRuntime_sniper/usr/lib/x86_64-linux-gnu/libfreetype.so.6.17.4
lrwxr-xr-x  1 root   wheel    16B 30 Mar 04:29 /usr/local/lib/libfreetype.so -> libfreetype.so.6
lrwxr-xr-x  1 root   wheel    21B 30 Mar 04:29 /usr/local/lib/libfreetype.so.6 -> libfreetype.so.6.20.1
-rwxr-xr-x  1 root   wheel   789K 30 Mar 04:29 /usr/local/lib/libfreetype.so.6.20.1
steam@desktop:/ % ls -lh {/home/steam/.steam/tmp/SteamLinuxRuntime_sniper/usr/lib/x86_64-linux-gnu,/usr/local/lib}/libzstd.so.1*
lrw-------  1 steam  steam    16B  1 Jan  1970 /home/steam/.steam/tmp/SteamLinuxRuntime_sniper/usr/lib/x86_64-linux-gnu/libzstd.so.1 -> libzstd.so.1.4.8
-rw-r--r--  1 steam  steam   870K  1 Mar  2021 /home/steam/.steam/tmp/SteamLinuxRuntime_sniper/usr/lib/x86_64-linux-gnu/libzstd.so.1.4.8
lrwxr-xr-x  1 root   wheel    16B  9 Apr 04:21 /usr/local/lib/libzstd.so.1 -> libzstd.so.1.5.6
-r-xr-xr-x  1 root   wheel   853K  9 Apr 04:21 /usr/local/lib/libzstd.so.1.5.6
steam@desktop:/ % ls -lh {/home/steam/.steam/tmp/SteamLinuxRuntime_sniper/usr/lib/x86_64-linux-gnu,/usr/local/lib}/libGL.so*
lrwxr-xr-x  1 steam  steam    14B  4 Jan  2022 /home/steam/.steam/tmp/SteamLinuxRuntime_sniper/usr/lib/x86_64-linux-gnu/libGL.so.1 -> libGL.so.1.7.0
-rw-r--r--  1 steam  steam   530K  4 Jan  2022 /home/steam/.steam/tmp/SteamLinuxRuntime_sniper/usr/lib/x86_64-linux-gnu/libGL.so.1.7.0
lrwxr-xr-x  1 root   wheel    10B 30 Mar 04:15 /usr/local/lib/libGL.so -> libGL.so.1
lrwxr-xr-x  1 root   wheel    14B 30 Mar 04:15 /usr/local/lib/libGL.so.1 -> libGL.so.1.7.0
-rwxr-xr-x  1 root   wheel   548K 30 Mar 04:15 /usr/local/lib/libGL.so.1.7.0
No noticeable difference whatsoever. Are we done with that circlejerk yet?
 
shkhln You are showing shared objects, a totally different kettle of fish. There you can't see what will be needed.

rempas, bsr and jsr do almost the same thing. The difference is:
"bsr d16(pc)" is 4 bytes.
"jsr 32imm" is 6 bytes and requires a relocation. This is not x86, mind you.

Without knowing who jumps where in the blob of code, you cannot shorten the code blob. So you need to pad it to "bsr d16; nop" so it is 6 bytes again.
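To make the byte math concrete, here is a sketch of mine that rewrites the 6-byte form in place. The opcodes are the real m68k encodings (0x4EB9 = jsr abs.l, 0x6100 = bsr.w, 0x4E71 = nop); the function itself is purely illustrative:
Code:
#include <stdint.h>
#include <stddef.h>

/* before: 4E B9 xx xx xx xx   jsr target.l      (6 bytes, needs a relocation)
   after:  61 00 dd dd 4E 71   bsr.w d16; nop    (4 + 2 bytes, no relocation)
   The 16-bit displacement is measured from the instruction address + 2. */
static void patch_jsr_to_bsr(uint8_t *code, size_t off, int16_t disp)
{
    uint16_t ud = (uint16_t)disp;

    code[off + 0] = 0x61;                 /* bsr.w opcode word */
    code[off + 1] = 0x00;
    code[off + 2] = (uint8_t)(ud >> 8);   /* big-endian displacement */
    code[off + 3] = (uint8_t)(ud & 0xff);
    code[off + 4] = 0x4E;                 /* nop keeps the total at 6 bytes, */
    code[off + 5] = 0x71;                 /* so nothing behind it moves      */
}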
 
rempas, bsr and jsr do almost the same thing. The difference is:
"bsr d16(pc)" is 4 bytes.
"jsr 32imm" is 6 bytes and requires a relocation.

Relocation by whom? The linker?

This is not x86, mind you.

What is this supposed to mean? What would x86 do differently? I mean, I know that it has different instructions, but what would it do differently when it comes to addresses, how they're treated, and how the linker works?

Without knowing who jumps where in the blob of code, you cannot shorten the code blob. So you need to pad it to "bsr d16; nop" so it is 6 bytes again.

What do you mean by "without knowing who jumps where"? BSR takes a displacement from PC; hence, we do know WHERE we will jump, based on my understanding. But I suppose my understanding is wrong...
 
Relocation by whom? The linker?



What is this supposed to mean? What would x86 do differently? I mean, I know that it has different instructions, but what would it do differently when it comes to addresses, how they're treated, and how the linker works?



What do you mean by "without knowing who jumps where"? BSR takes a displacement from PC; hence, we do know WHERE we will jump, based on my understanding. But I suppose my understanding is wrong...
We don't know the surroundings.
Imagine:
Code:
cmp d0,a0
beq L1
...
jsr something
...
L1:
You need to keep the length of the jsr part the same, or fix up every branch that crosses it, which you most likely can't. But the compiler may start out using the smaller instruction, and the linker will insert jump pads when needed.
 
We don't know the surroundings.
Imagine:
Code:
cmp d0,a0
beq L1
...
jsr something
...
L1:
You need to keep the length of the jsr part the same, or fix up every branch that crosses it, which you most likely can't. But the compiler may start out using the smaller instruction, and the linker will insert jump pads when needed.

I think I get it! Thanks a lot!
 
shkhln You are showing shared objects, a totally different kettle of fish. There you can't see what will be needed.
If you insist:
Code:
steam@desktop:~ % ls -lh {/home/steam/.steam/tmp/SteamLinuxRuntime_sniper/usr/bin,/usr/local/bin}/{aplay,bash,curl,glxgears,gzip,vkcube*,vulkaninfo,zenity}
-rwxr-xr-x  1 steam  steam    65K  2 Jun  16:18 /home/steam/.steam/tmp/SteamLinuxRuntime_sniper/usr/bin/aplay
-rwxr-xr-x  1 steam  steam   1.2M 27 Mar  2022 /home/steam/.steam/tmp/SteamLinuxRuntime_sniper/usr/bin/bash
-rwxr-xr-x  1 steam  steam   222K  3 Jan  23:33 /home/steam/.steam/tmp/SteamLinuxRuntime_sniper/usr/bin/curl
-rwxr-xr-x  1 steam  steam    24K  2 Jun  16:18 /home/steam/.steam/tmp/SteamLinuxRuntime_sniper/usr/bin/glxgears
-rwxr-xr-x  1 steam  steam    96K 10 Apr  2022 /home/steam/.steam/tmp/SteamLinuxRuntime_sniper/usr/bin/gzip
-rwxr-xr-x  1 steam  steam   259K 15 Feb  2023 /home/steam/.steam/tmp/SteamLinuxRuntime_sniper/usr/bin/vkcube
-rwxr-xr-x  1 steam  steam   263K 15 Feb  2023 /home/steam/.steam/tmp/SteamLinuxRuntime_sniper/usr/bin/vkcube-wayland
-rwxr-xr-x  1 steam  steam   292K 15 Feb  2023 /home/steam/.steam/tmp/SteamLinuxRuntime_sniper/usr/bin/vkcubepp
-rwxr-xr-x  1 steam  steam   615K 15 Feb  2023 /home/steam/.steam/tmp/SteamLinuxRuntime_sniper/usr/bin/vulkaninfo
-rwxr-xr-x  1 steam  steam   112K 21 Jun  2021 /home/steam/.steam/tmp/SteamLinuxRuntime_sniper/usr/bin/zenity
-rwxr-xr-x  1 root   wheel    75K  9 Apr  16:07 /usr/local/bin/aplay
-rwxr-xr-x  1 root   wheel   957K  9 Apr  04:21 /usr/local/bin/bash
-rwxr-xr-x  1 root   wheel   245K  9 Apr  04:23 /usr/local/bin/curl
-r-xr-xr-x  1 root   wheel    19K 17 Jul  2021 /usr/local/bin/glxgears
-r-xr-xr-x  1 root   wheel   130K  9 Apr  19:28 /usr/local/bin/gzip
-r-xr-xr-x  1 root   wheel   302K 18 Apr  04:16 /usr/local/bin/vkcube-display
-r-xr-xr-x  1 root   wheel   310K 18 Apr  04:16 /usr/local/bin/vkcube-wayland
-r-xr-xr-x  1 root   wheel   305K 18 Apr  04:16 /usr/local/bin/vkcube-xcb
-r-xr-xr-x  1 root   wheel   304K 18 Apr  04:16 /usr/local/bin/vkcube-xlib
-r-xr-xr-x  1 root   wheel   366K 18 Apr  04:16 /usr/local/bin/vkcubepp-display
-r-xr-xr-x  1 root   wheel   369K 18 Apr  04:16 /usr/local/bin/vkcubepp-wayland
-r-xr-xr-x  1 root   wheel   364K 18 Apr  04:16 /usr/local/bin/vkcubepp-xcb
-r-xr-xr-x  1 root   wheel   364K 18 Apr  04:16 /usr/local/bin/vkcubepp-xlib
-rwxr-xr-x  1 root   wheel   941K 18 Apr  04:16 /usr/local/bin/vulkaninfo
-rwxr-xr-x  1 root   wheel    93K 13 Apr  06:48 /usr/local/bin/zenity
steam@desktop:~ % ls -lh {/home/steam/.steam/tmp/SteamLinuxRuntime_sniper/usr/bin,/usr/bin}/openssl
-rwxr-xr-x  1 steam  steam   720K 13 Sep  2023 /home/steam/.steam/tmp/SteamLinuxRuntime_sniper/usr/bin/openssl
-r-xr-xr-x  1 root   wheel   724K  7 Mar  22:18 /usr/bin/openssl
There is naturally more variability with applications, because more features can be toggled at compile time, but there is still no reason to suspect some kind of whole-program optimization that is applied only on FreeBSD. That would be just nonsensical.
 
Anyway, rempas, if you installed FreeBSD with the default filesystem settings (ZFS) and insist on using du, you are measuring lz4 compression performance.
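Both numbers are visible from C as well (a small sketch of mine; POSIX counts st_blocks in 512-byte units, which is what du sums up, while ls -l shows st_size):
Code:
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    struct stat st;

    if (argc < 2 || stat(argv[1], &st) != 0) {
        perror("stat");
        return 1;
    }
    printf("apparent size (ls -l): %lld bytes\n", (long long)st.st_size);
    printf("on disk (du):          %lld bytes\n", (long long)st.st_blocks * 512LL);
    return 0;
}
On a compressed ZFS dataset the on-disk number is routinely smaller than the apparent one, which is the 4.5K-vs-8.0K effect seen earlier in the thread.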
 
These are all trivial differences, and you are comparing very different things (compilers, storage types, file systems and their compression). However, out of boredom, I did a few runs to demonstrate.

FreeBSD fbsd14 14.1-STABLE FreeBSD 14.1-STABLE stable/14-5c20fc180d GENERIC amd64
FreeBSD clang version 18.1.6 (https://github.com/llvm/llvm-project.git llvmorg-18.1.6-0-g1118c2e05e67)
Target: x86_64-unknown-freebsd14.1
Thread model: posix
InstalledDir: /usr/bin

#include <stdio.h>

int main() {
    printf("Hello World.\n");
}

[root@fbsd14 /tmp]# cc hello.c -o hello
[root@fbsd14 /tmp]# file hello
hello: ELF 64-bit LSB executable, x86-64, version 1 (FreeBSD), dynamically linked, interpreter /libexec/ld-elf.so.1, for FreeBSD 14.1 (1401501), FreeBSD-style, with debug_info, not stripped
[root@fbsd14 /tmp]# ls -l hello
-rwxr-xr-x 1 root wheel 9768 Jun 18 20:42 hello
[root@fbsd14 /tmp]# du -h hello
8.5K hello
[root@fbsd14 /tmp]# size hello
text data bss dec hex filename
1135 424 1888 3447 0xd77 hello

You want it super small? Sure.

[root@fbsd14 /tmp]# cc -Os hello.c -o hello && strip hello
[root@fbsd14 /tmp]# file hello
hello: ELF 64-bit LSB executable, x86-64, version 1 (FreeBSD), dynamically linked, interpreter /libexec/ld-elf.so.1, for FreeBSD 14.1 (1401501), FreeBSD-style, stripped
[root@fbsd14 /tmp]# ls -l hello
-rwxr-xr-x 1 root wheel 4472 Jun 18 20:43 hello
[root@fbsd14 /tmp]# du -h hello
4.5K hello
[root@fbsd14 /tmp]# size hello
text data bss dec hex filename
1116 424 1904 3444 0xd74 hello

[root@fbsd14 /tmp]# zfs get all zroot|grep compres
zroot compressratio 1.87x -
zroot compression lz4 local
zroot refcompressratio 1.00x -

And just for the heck of it, here is the same file on NTFS.

Size: 4.36 KB (4,472 bytes)
Size on disk: 8.00 KB (8,192 bytes)
 
These are all trivial differences, and you are comparing very different things (compilers, storage types, file systems and their compression). However, out of boredom, I did a few runs to demonstrate.

You want it super small? Sure.

And just for the heck of it, here is the same file on NTFS.

Oh, so if I understand correctly: files on FreeBSD are shown at their "full" size, but they are compressed by ZFS under the hood?
 