dhclient gives up with 'No working leases in persistent database' after power outage and slow isp network restart

ultramage · Saturday at 9:21 AM

After a power outage, my freebsd router is up within 30 seconds, but the ISP's modem takes a good 5 minutes before it turns on its LAN ports, and 5 more minutes for the ISP's dhcp infrastructure and internet routing to start working again. Previously (12.0), freebsd handled this outage gracefully, but subsequent code changes to dhclient and startup scripts broke it in multiple places.

I have already patched out a hardcoded 10-second timeout for link wait (previously fbsd either booted slow enough, or its interfaces started in 'active' state?). There is also the dhcprequest timeout, which if set too low, will make dhclient give up and reuse a valid saved lease. I've just discovered that previously this worked, and now it doesn't.

12.0

14.0

Code:

Starting dhclient.
re1: no link .....
......
got link
DHCPREQUEST on re1 to 255.255.255.255 port 67
DHCPDISCOVER on re1 to 255.255.255.255 port 67 interval 5
DHCPDISCOVER on re1 to 255.255.255.255 port 67 interval 6
No DHCPOFFERS received.
Trying recorded lease ***.***.***.***
bound: renewal in 38469 seconds.

Code:

Starting dhclient.
re1: no link .....
.. got link
DHCPREQUEST on re1 to 255.255.255.255 port 67
DHCPREQUEST on re1 to 255.255.255.255 port 67
DHCPDISCOVER on re1 to 255.255.255.255 port 67 interval 5
DHCPDISCOVER on re1 to 255.255.255.255 port 67 interval 6
No DHCPOFFERS received.
Trying recorded lease ***.***.***.***
No working leases in persistent database - sleeping.

I wish to know why this is happening. have checked the sbin/dhclient source history but did not immediately see a relevant change. The function 'state_panic()' hasn't been touched in 19 years, however it is complex and relies on global state, so one of the recent changes may have affected how it executes. It also doesn't do things directly but sets up a 'script' and executes it, meaning there's more complexity underneath and it's impossible to follow via cursory code review.

Cath O'Deray · Saturday at 2:08 PM

Which version of FreeBSD, exactly?

Packages of ports from quarterly, or latest?

freebsd--version -kru ; uname -aKU

pkg -vv | grep -B 1 -e url -e priority

ultramage · Saturday at 4:57 PM

FreeBSD srv 14.0-RELEASE-p6 FreeBSD 14.0-RELEASE-p6 #0 releng/14.0-n265417-d338712beb1-dirty: Wed Apr 17 19:53:37 CEST 2024

ports from github latest tree, last update apr 19; irrelevant since issue seems to be with base OS functionality (/sbin/dhclient, possibly rc startup scripts).
I have a single src edit that now makes dhclient wait for link longer than 10s. This is the reason why I don't have a log sample from 13.0 or 13.1, because the issue went unaddressed for 2 years and resulted in "no link... giving up" after every outage (causing a bunch of interface-dependent services to fail to start properly, and requiring an additional reboot to fix).

Cath O'Deray · Saturday at 6:26 PM

I sped through release notes for 12.1–14.0 inclusive, no relevant mention of DHCP or dhclient.

ultramage said:
…
12.0 …

re1: …

Have you tried net/realtek-re-kmod?

The BUGS section of re(4) concludes:

… If this driver is causing problems then an updated driver from the vendor can be found in ports under net/realtek-re-kmod.

a9a35c9e35370d09be5cbcd32f287b12f40812a5 was on release/14.0.0 in the src tree (not on any earlier releng branch).

ultramage · Saturday at 8:42 PM

I have never had any issues with the re module.
I always read updating notes and also did not see anything relevant.
I have checked and /var/db/dhclient.leases.re1 contains 2 valid leases (one shortly before the other).
I suspect that this is an old bug that went unnoticed because it might require specific network conditions that do not happen during normal operation:
- saved leases but no active lease (system reboot)
- network link up but dhcp server unreachable (or off-host firewalled)

I have installed a test vm and tried to reproduce. Reproduction was successful using the abovementioned setup, both in synchronous and background mode dhclient. The test system was a stock 14.0 vhd release vm. I first let it connect, then firewalled the host off on the hypervisor level (local pf is not enough) and rebooted.

I have also reproduced it with the 12.4, 12.2 and 12.1 vms.
The 12.0 vm does not reproduce and binds fine, confirming my original observations from historical logs. Hence the breakage happened between 12.0 and 12.1 release. Either directly in dhclient, or the way it's used in rc.

Cath O'Deray · Sunday at 7:21 AM

ultramage said:
… between 12.0 and 12.1 release. Either directly in dhclient, or the way it's used in rc.

A FreshBSD search for dhclient finds seven commits on releng/12.1. At a glance, I don't see an explanation.

~~Given that 12.1 is long gone, and that the manual page acknowledges bugs, would you be willing to try the ported driver?~~

ultramage · Sunday at 9:38 AM

It is clearly not a driver issue, the reproduction VMs use a virtual hyper-v nic (hn0) on top of brand new server hardware. I have already spent 3 years and double digit hours trying to make this stack of startup issues go away. Thanks to my research for this thread I now know it is not an issue specific to my custom configuration. I have enough spare resources to start a bunch of parallel buildworld instances to locate the exact faulty commit, but I figured I'd ask just in case before I waste 10 more hours on this.

Cath O'Deray · Sunday at 11:14 AM

I'll ping src developers on social media.

ultramage · Sunday at 11:19 AM

Ah, no need to go that far. Since it went unnoticed for 3+ years then bisection is probably the simplest way to locate the cause. I'll get everything set up and then loop through the tedious process of buildworld - merge - test. Once I have the exact commit, I'll open an issue on the bugtracker.

Andriy · Monday at 6:11 AM

I'd start with checking your /var/db/dhclient.leases.*, particularly various times recorded in the last (active) lease.
Note that they are in UTC.

ultramage · Monday at 6:20 AM

last reboot time: 2024/6/24 05:09:42

Code:

option dhcp-renewal-time 86400;
option dhcp-rebinding-time 151200;
renew 2 2024/6/25 05:07:04;
rebind 2 2024/6/25 23:07:04;
expire 3 2024/6/26 05:07:04;

looks normal

bakul · Monday at 6:49 AM

What does your /etc/dhclient.conf look like? One suggestion is to read its manpage carefully and see if you can set it up to deal with this situation.

ultramage · Monday at 8:06 AM

The file is empty, as I already stated this reproduces immediately and out of the box with at least the following:

Code:

FreeBSD-15.0-CURRENT-amd64-ufs-20240620-42c545c5f8a2-270847.vhd
FreeBSD-14.0-RELEASE-amd64.vhd
FreeBSD-12.4-RELEASE-amd64.vhd
FreeBSD-12.1-RELEASE-amd64.vhd
FreeBSD-12.1-RELEASE-amd64-disc1.iso

I have also managed to set up a build&test environment and have so far narrowed it down to a 1-month window:
PASS - Feb 1, 2019 - 5dedda2e27344c56cb3094e4602a565182e2bb53
FAIL - Mar 1, 2019 - 424fb3963f1223b823db03187ee67b2328baad1b

I will post a link to an issue report once I'm done with the search.

ultramage · Monday at 3:45 PM

279951 – dhclient unable to reuse recorded lease after timeout, since 12.1

bugs.freebsd.org

The failure is caused by a poorly thought out code change to dhclient in Feb 2019, resulting in the TIMEOUT callback to dhclient-script being treated as failed even though it succeeded, and the code giving up trying to pick a saved lease.
Ironically, openbsd experienced this same mistake in 2008 and also kept it around for 4 years, unnoticed, until a dev got fed up with the poor code structure and removed all that scripting stuff.

Cath O'Deray · Monday at 6:38 PM

Nice work!

Cath O'Deray said:
A FreshBSD search for dhclient finds seven commits on releng/12.1. At a glance, I don't see an explanation.

It was there, amongst the search results – FreeBSD/src b6c2f6e (r344237) — sbin/dhclient dhclient.c – but I'd never had made the connection.

ultramage · Monday at 7:35 PM

I started looking into this by searching for the status messages in the source code, to see what sort of circumstances lead to them. I could immediately tell that script_go's return value was the cause, but at that point I was missing all context as to what 'script' even meant. Couldn't even determine the execution flow because I didn't realize buf_close actually meant buf_send and that that thing was doing message-based ipc via pipe to a worker fork. Elaborate functionality, but very rough 1995 coding. That openbsd dev that removed it all probably saw the complexity as a huge security risk, and/or figured that core functionality shouldn't hinge on an editable script. I don't know how much use this functionality got in practice.

Cath O'Deray · Tuesday at 5:20 AM

Incidentally, 'Import dhcpcd into base' is wanted in the planning document for FreeBSD 15.0.

Andriy · Tuesday at 11:14 AM

I believe that the change that you called "poorly thought out" is actually correct and honors the documentation:

TIMEOUT The DHCP client has been unable to contact any DHCP servers.
However, an old lease has been identified, and its parameters
have been passed in as with BOUND. The client configuration
script should test these parameters and, if it has reason to
believe they are valid, should exit with a value of zero. If
not, it should exit with a nonzero value.

So, it's dhclient-script and its TIMEOUT handling that "thinks" that the saved lease is not valid (as you correctly deduced yourself).
The change just made dhclient correctly see that exit status as it was always meant to.

I suppose that the real problem comes from the ping test with a single ping.
If that check is not suitable for your environment (understandably) then you can modify the script or override it with an exit hook.

ultramage · Tuesday at 11:53 AM

The poorly thought out change was changing the return type / meaning of the return value of script_go() from "process completion state, 0=sucessfully exited normally" to "return/exit value of invoked script (OR shifted signal code)" without amending the if() statements that call it.

However, I have not spent time studying dhclient-script as it was a late-stage discovery. As you correctly point out, dhclient-script appears to be designed and documented for all of its invokable commands to return 0 on success. Hence the change is sound, in the sense that the return values of dhclient-script are meant to align with the return value checks in dhclient.c.

Close review of current dhclient-script reveals that all other commands just fall through to 0 (no error checking / no failure modes), so the only part that matters is "TIMEOUT". That script code was utterly incomprehensible to me and I have no idea why certain places return 0. So I just assumed that the script is written correctly and that since its execution eventually flows towards the end part, the intermittent exit points are failure cases. This seems to be incorrect.

The real problem is not with 'ping' - although I don't understand why this is used, and I don't know who "$1" is that is being pinged. Note that the entire block is conditional to if [ -n "$new_routers" ]; then. When TIMEOUT is invoked from state_panic(), this parameter is not set, meaning this part doesn't run, only the final portion runs - calling ifconfig and exiting with '1'. Which is not the right value.

ultramage · Tuesday at 1:22 PM

Code:

               Any options acquired from the server are
               passed using the option name described in dhcp-options(5),
               except that dashes (‘-’) are replaced by underscores (‘_’) in
               order to make valid shell variables, and the variable names
               start with “new_”.  So for example, the new subnet mask would
               be passed in $new_subnet_mask.

I misunderstood. Seems that script_write_params does var-dump the entire lease into the script state. Therefore, $new_routers should exist.

Code:

TIMEOUT)
...
    eval "$IFCONFIG $interface inet -alias $new_ip_address $medium"
    delete_old_routes
    exit_with_hooks 1
    ;;
esac

exit_with_hooks 0

I also misunderstood this end part. Paying attention, -alias means deleting an alias. So that last part does indeed behave as a failure handler, and just tries to clean up before returning. This code structure is just baffling. What it atually does:

Code:

TIMEOUT)
    add_new_address

    if $new_routers is not set then
        exit 1

    $LOGGER "New Routers ($interface): $new_routers"
    set "$new_routers"  <-- I think this puts them into $1 $2 $3...

    if ping "$1" doesn't get a response then
        exit 1

    if $new_ip_address != $alias_ip_address then
        add_new_alias
    add_new_routes

    if is_default_interface() and !add_new_resolv_conf()
        exit 1

    exit 0

I guess it couldn't be written like this because exit 1 required cleanup. Still could have been doable using a fake scope or goto, but w/e.

So, uh... is the issue simply that dhclient-script requires the saved gateway to be pingable, and that due to a past coding bug this requirement was being unintentionally ignored by dhclient.c until 12.1? I can imagine this requirement being meaningful when bringing a laptop from one network to another. That said, if dhcp is not responding then this sort of check is not sufficient due to frequently used private subnets.

Andriy · 2024-06-26T13:21:39+0100

ultramage said:
The poorly thought out change was changing the return type / meaning of the return value of script_go() from "process completion state, 0=sucessfully exited normally" to "return/exit value of invoked script (OR shifted signal code)" without amending the if() statements that call it.

Those things are equivalent if you think about it.
If a process exits normally then it has zero exit status.

Andriy · 2024-06-26T13:25:26+0100

ultramage said:
So, uh... is the issue simply that dhclient-script requires the saved gateway to be pingable, and that due to a past coding bug this requirement was being unintentionally ignored by dhclient.c until 12.1?

That's my impression as well.
Good thing is that the stuff should be customizable, although customization is not for the faint of heart

given all the required knowledge and shell script complexities.

ultramage · 2024-06-26T17:21:29+0100

Interestingly, the person who originally patched ISC's dhclient in 2003 to add dhclient-script and scriptable hooks, removed all that functionality in 2012.

Andriy · 2024-06-27T06:15:31+0100

Without checking any references, I'd guess that they didn't just remove it but replaced it with something better.

ultramage · 2024-06-27T07:18:45+0100

Looking further, the scripting/hooks system was part of isc-dhcp-client (freebsd script added in 1997), and was just ported/reformatted, not authored. Previously I was relying on the file's copyright header too much.
Can't tell at a glance if an alternative way of customizing the client's logic was added, or if the thing was simply deemed unnecessary. ISC discontinued development in 2022 and they're switching to dhcpleased, which so far looks like it only has basic configurability.