src.conf WITH_BIND_NOW


             Build all binaries with the DF_BIND_NOW flag set to indicate that
             the run-time loader should perform all relocation processing at
             process startup rather than on demand.
See src.conf(5)
I doubt it's used much and this question is very much a computer science one; it involves the O-notation. Are you familiar with that?

Do you understand symbol relocation at runtime?

Anyway, my clumsy attempt is:

At runtime the run-time linker has to establish symbol addresses, not just for the shared libraries (and their functions, pointers, etc.) that the program calls directly, but also for the libraries called by those libraries, and so on. It does this using a hash algorithm whose cost grows with the number of symbols and their length (think long C++ names). With this flag set, it attempts all of that at startup, before the program is even running. It gets even more complicated when one shared object references an interface and another both references and defines it. See LD_PRELOAD.

The alternative is so-called lazy relocation (lazy binding): when the program first encounters a new symbol, it is resolved on the fly, and load-up times are potentially O(R*n) better. (Don't quote me on that O-notation, please ;))

So, why do this? Primarily to stop one big thing: programs that have LONG run times, like web servers, can first reach a call into a shared library hours or days after invocation, and what happens if the symbol is missing? Crash.
Better to crash beforehand, at startup? Perhaps; it's a value judgement.
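
To make the difference concrete, here is a toy model in Python. It is only a sketch and not how rtld is implemented: `ToyLoader`, the dictionary standing in for the shared libraries, and the symbol names are all made up for illustration. With eager binding every imported symbol is looked up before "main" runs, so a missing symbol fails at startup; with lazy binding the lookup, and therefore the failure, is deferred until the first call.

```python
class ToyLoader:
    """Toy model of a run-time linker; not how rtld(1) is actually implemented."""

    def __init__(self, exported, imports, bind_now):
        self.exported = exported   # symbol name -> callable, stands in for the shared libraries
        self.bound = {}            # stands in for the resolved GOT/PLT slots
        self.lookups = 0           # number of symbol lookups performed so far
        if bind_now:
            for name in imports:   # eager: resolve everything before "main" runs
                self._resolve(name)

    def _resolve(self, name):
        self.lookups += 1
        if name not in self.exported:
            raise LookupError(f"undefined symbol: {name}")
        self.bound[name] = self.exported[name]
        return self.bound[name]

    def call(self, name, *args):
        func = self.bound.get(name)
        if func is None:           # lazy: the first call pays for the lookup
            func = self._resolve(name)
        return func(*args)


libs = {"strlen": len, "toupper": str.upper}
imports = ["strlen", "toupper", "rarely_used"]   # "rarely_used" is not defined anywhere

lazy = ToyLoader(libs, imports, bind_now=False)
print(lazy.call("strlen", "widget"), lazy.lookups)   # only one lookup has happened so far

try:
    ToyLoader(libs, imports, bind_now=True)          # fails immediately, at "startup"
except LookupError as err:
    print("eager:", err)
```

The lazy loader only trips over `rarely_used` if something eventually calls it, which is exactly the hours-later crash described above.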

I guess it could also help protect the Global Offset Table: DF_BIND_NOW is recorded in the dynamic section of the ELF object (not in the ELF header itself), and when it is combined with RELRO (linking with -z relro), the GOT can be made read-only once relocation has finished. Stops nasty little hackers.
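
If you want to see whether a given binary actually carries the flag, the dynamic section can be inspected directly. The following is a minimal sketch, assuming a little-endian ELF64 file (amd64, for instance); the function name `has_bind_now` is my own, and the constants are the tag and flag values from the ELF specification. It also accepts the older DT_BIND_NOW tag and the DF_1_NOW bit in DT_FLAGS_1, which some linkers emit for the same purpose.

```python
import struct

PT_DYNAMIC = 2
DT_NULL, DT_BIND_NOW, DT_FLAGS, DT_FLAGS_1 = 0, 24, 30, 0x6FFFFFFB
DF_BIND_NOW, DF_1_NOW = 0x8, 0x1


def has_bind_now(data: bytes) -> bool:
    """Return True if the ELF image in `data` requests eager binding."""
    # Assumption: only little-endian ELF64 is handled in this sketch.
    if data[:4] != b"\x7fELF" or data[4] != 2 or data[5] != 1:
        raise ValueError("not a little-endian ELF64 file")
    (e_phoff,) = struct.unpack_from("<Q", data, 0x20)
    e_phentsize, e_phnum = struct.unpack_from("<HH", data, 0x36)
    for i in range(e_phnum):
        off = e_phoff + i * e_phentsize
        p_type, _flags, p_offset, _vaddr, _paddr, p_filesz = struct.unpack_from(
            "<IIQQQQ", data, off
        )
        if p_type != PT_DYNAMIC:
            continue
        # Walk the (tag, value) pairs of the dynamic section.
        for pos in range(p_offset, p_offset + p_filesz, 16):
            tag, val = struct.unpack_from("<qQ", data, pos)
            if tag == DT_NULL:
                break
            if tag == DT_BIND_NOW:
                return True
            if tag == DT_FLAGS and val & DF_BIND_NOW:
                return True
            if tag == DT_FLAGS_1 and val & DF_1_NOW:
                return True
    return False
```

Usage would be something like `has_bind_now(open("/bin/ls", "rb").read())`; `readelf -d` shows the same information in its FLAGS line.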

And so you can see why I never became a teacher. :D
I pose myself a very simple question: will enabling the option make your OS faster or slower, and why?
It depends. On the workload.

Disclaimer: Not at all an expert on loading and binding, and shareable libraries are something I use, not something I understand.

Assuming mark_j's explanation is correct (and I have every reason to believe it is), this is an example of the classic computer engineering cache tradeoff: do the work now (efficiently and all at once, but there is a lot to do), versus do the work only when needed (perhaps you never have to do all of it, but each piece is less efficient). If you set this option, then at the time the program is loaded, there is a lot of work to do: all functions that call other functions (across shareable library borders) need to be cross-linked, which requires creating the complete table of all functions that exist, and tracing down all possible call paths (which the linker has gracefully marked in the shareable object/executable file). For a large executable (the apache server is a good example), this can take a long time. What does "long" mean? I have no idea, and that depends on CPU power and memory speed. My bet is on the order of milliseconds on modern machines. On the other hand, the code that does this table building / updating / linking doesn't get interrupted by any other work, the tables and the piece of code that works with them stay in cache, and all this gets done very efficiently.

With the flag off, these function calling tables only get built/updated when functions actually are called. If the executable, during its run, actually were to exercise all possible call graphs, this would be strictly less efficient (from the total runtime point of view) than doing the work up front. On the other hand, large executables (examples include apache, bind, sendmail) typically are highly configurable, and in any given installation, only a fraction of their functionality is used. For example, my mailer never routes to bitnet or uucp, my apache never serves TLS 1.0, and my bind never does zone transfers. So perhaps not all the work needs to be done, which means that deferring the work becomes the more efficient choice. Which case applies to you? Depends on your executables, their run time, and what matters.
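
As a back-of-the-envelope version of that tradeoff, here is a small Python sketch. The per-symbol costs are invented numbers in arbitrary units (I have not measured anything), and both function names are hypothetical; the point is only that lazy binding wins when the fraction of symbols actually used stays below the ratio of the two per-symbol costs.

```python
def total_binding_cost(n_symbols, symbols_used, eager_cost=1.0, lazy_cost=3.0):
    """Return (eager_total, lazy_total) in arbitrary units.

    eager_cost and lazy_cost are made-up per-symbol costs; lazy is assumed
    to be more expensive per symbol because each lookup happens in isolation.
    """
    eager_total = n_symbols * eager_cost      # everything is bound at startup
    lazy_total = symbols_used * lazy_cost     # only what is actually called
    return eager_total, lazy_total


def break_even_fraction(eager_cost=1.0, lazy_cost=3.0):
    """Fraction of symbols used at which both strategies cost the same."""
    return eager_cost / lazy_cost


eager, lazy = total_binding_cost(n_symbols=1000, symbols_used=100)
print(eager, lazy)              # lazy is cheaper when only 10% of symbols are used
print(break_even_fraction())    # with these invented costs: one third
```

This ignores when the cost is paid (startup versus mid-run), which is a separate question from the total.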

To make things more complicated: one cannot in all cases do all the load-linking at startup time, because an executable can dynamically load shared libraries while it runs (via dlopen). Do your executables do that in practice, at your installation? I have no idea.
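
For dynamically loaded libraries the eager/lazy choice is made per dlopen() call. A minimal sketch from Python, assuming a Unix-like system where `ctypes.CDLL` passes its `mode` argument to dlopen(): passing `None` as the name asks for a handle on the already-loaded main program, whereas a real plugin would be given by path (something like "./plugin.so", which is a hypothetical name here).

```python
import ctypes
import os

# RTLD_NOW: resolve all undefined symbols before dlopen() returns.
handle_now = ctypes.CDLL(None, mode=os.RTLD_NOW)

# RTLD_LAZY: function symbols may be resolved on first use instead.
handle_lazy = ctypes.CDLL(None, mode=os.RTLD_LAZY)

# Either handle can look up and call a libc function.
print(handle_now.getpid() == os.getpid())
print(handle_lazy.getpid() == os.getpid())
```

So even with every binary built to bind at startup, code loaded later goes through its own binding step at the time it is loaded.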

Another thing to consider is the real-time aspect. If you turn this flag off, the executable starts running sooner. If you really care that the greeting line "Hallo, program V123.45 starting now ..." shows up really fast, this matters to you. Or if you are running a giant executable (like Apache) just to show the help line to remind you what command-line parameters it takes, and only a small part of the code is actually used. On the other hand, if you are interested in the program having very predictable latency once it is running (perhaps it's doing real-time data acquisition, or it is part of a cluster and any delay will cause hundreds of other computers to have to wait), you're better off paying the latency price once at the beginning, and then be done with it.

The answer to all performance questions is: You'll have to benchmark that, for your workload, yourself. If the answer were nearly always obvious, you could be sure the default setting would be correct, and the switch would exist only for unusual situations.
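
A rough way to get a first number without rebuilding world is the LD_BIND_NOW environment variable, which asks the run-time linker to bind eagerly (see rtld(1)). The harness below is only a sketch: the function name `mean_startup_seconds` is made up, the timing includes fork/exec overhead, and the environment variable is a stand-in for, not the same thing as, building with WITH_BIND_NOW.

```python
import os
import subprocess
import sys
import time


def mean_startup_seconds(cmd, runs=20, bind_now=False):
    """Run `cmd` `runs` times and return the mean wall-clock time per run."""
    env = dict(os.environ)
    env.pop("LD_BIND_NOW", None)
    if bind_now:
        env["LD_BIND_NOW"] = "1"   # ask the run-time linker to bind everything at startup
    start = time.perf_counter()
    for _ in range(runs):
        subprocess.run(
            cmd,
            env=env,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    return (time.perf_counter() - start) / runs


if __name__ == "__main__":
    # Placeholder command; substitute the short-lived program you care about.
    cmd = [sys.executable, "-c", "pass"]
    lazy = mean_startup_seconds(cmd)
    eager = mean_startup_seconds(cmd, bind_now=True)
    print(f"lazy:  {lazy * 1e3:.2f} ms per run")
    print(f"eager: {eager * 1e3:.2f} ms per run")
```

Expect noisy results; repeat the measurement several times and on an otherwise idle machine before drawing conclusions.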
Will it run faster? My bet would be a no. Why? Imagine something as simple as:

find / -type f -exec grep "widget" {} \; -print

Every time these programs are called, the entire symbol table will be built, in theory, even for those symbols not directly invoked, like most of grep's functionality. That's wasted milliseconds for each iteration. Does grep use shared libraries? I'm sure it does.

So, with respect, I think your question is the wrong question to ask, because I am not sure speed is the end result of any of this.

For example, each architecture will (most likely?) have its own way of processing the PLT (Procedure Linkage Table) into the GOT (Global Offset Table), meaning one architecture might take X instructions, and another X+1 or even X+n.
Now, this area of expertise (CPU architecture) is not mine, but a quick look at assembly on AMD64:

Write a program that accesses an extern int:

movq    <<variable>>@GOTPCREL(%rip), %rax
movl    (%rax), %eax
popq    %rbp
.cfi_def_cfa %rsp, 8

Where <<variable>> is an external variable in the program. So, that's a few instructions.

Now, look at a call to an external (by inference, shared) object/function:

callq   <<external_call>>@PLT
addl    $1, %eax
popq    %rbp
.cfi_def_cfa %rsp, 8

Where <<external_call>> is an external function.

So, that's a few. Yes, a CPU runs damn fast, but it adds up if you're doing it all up front compared to only when needed.

It also depends on the number of external variables being referenced AND the number of shared functions. I mean, how do you produce a metric for this, short of going through the entire kernel (statically linked) and userland (dynamically linked, except the /rescue programs)?

Personally, I can see some good in enabling this if you're running tests on FreeBSD, but otherwise, I just don't see how it could speed up the OS (and frankly, if it did, wouldn't you think it would be enabled by default?)

But as ralphbsz says, the only definitive way to determine speed is to enable it and test.

I'm sorry I can't give you a black-and-white YES or NO. I can give you an "I don't know, but my gut says no". :)

P.S. Edit:
I just did a check on ARM for an external variable access (the x0/w0 registers and adrp mean this is AArch64 output, not ARMv7):

adrp    x0, :got:<<external>>
ldr     x0, [x0, #:got_lo12:<<external>>]
ldr     w0, [x0]

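ldr is a memory load, which can cost considerably more than a register-to-register mov (although the movq/movl pair in the AMD64 example also reads from memory). So, again, architecture matters.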