DHCP and PPP not working after upgrade from 13.0-RELEASE-p11 to 13.1-RELEASE

I've got a setup with a pair of routers running FreeBSD 13. Tonight I attempted an upgrade from 13.0-RELEASE-p11 to 13.1-RELEASE. However, after the upgrade, I was suddenly unable to complete any DHCP or PPPoE on my WAN interfaces on either router.

It looks like the same underlying problem as Thread install-13-1-rc-3-not-getting-ip-from-dhcp-with-atheros-ar8161-and-realtek-wifi.84927, but the OP there doesn't appear to have done anything specific to fix it; it just started working. For me, however, it fails consistently across both machines.

In my case, the 3 interfaces are vLANs on an underlying LACP LAGG of two Intel 82576 interfaces. Two of them attempt DHCP, and the third attempts PPPoE via mpd5.

Relevant parts of my rc.conf are:

Bash:
cloned_interfaces="lagg0 vlan11 vlan12 vlan13 [several more too...]"
ifconfig_igb0="-vlanhwtag mtu 9000 up"
ifconfig_igb1="-vlanhwtag mtu 9000 up"
ifconfig_igb2="-vlanhwtag mtu 9000 up"
ifconfig_igb3="-vlanhwtag mtu 9000 up"
ifconfig_lagg0="mtu 9000 laggproto lacp lagghash l2 laggport igb0 laggport igb1"

# Virtual Interfaces (WAN)
# WAN1 DOCSIS
create_args_vlan11="vlan 11 vlandev lagg0"
ifconfig_vlan11="mtu 1500 dhcp"
ifconfig_vlan11_alias0="link 54:e1:ad:15:ba:61"
# WAN2 FWA
create_args_vlan12="vlan 12 vlandev lagg0"
ifconfig_vlan12="mtu 1500 dhcp"
ifconfig_vlan12_alias0="link 54:e1:ad:15:ba:62"
# WAN3 DSL
create_args_vlan13="vlan 13 vlandev lagg0"
ifconfig_vlan13="mtu 1500 up"
mpd_enable="YES"
mpd_flags="-b WAN3"

I handle the actual DHCP up/down via a CARP script, but regardless of which host is MASTER, the result was the same:

On 13.0, the DHCP requests complete immediately and the PPPoE connection completes successfully.

On 13.1, dhclient sends DHCPDISCOVER repeatedly and eventually times out, whether it is run automatically (as part of my CARP scripts) or manually, and the PPPoE connection fails with repeated timeouts and no specific error, e.g.

Code:
Jun  4 22:42:17 dcr2 mpd[34258]: [vlan13_link0] PPPoE: Connecting to 'WAN3'
Jun  4 22:42:26 dcr2 mpd[34258]: [vlan13_link0] PPPoE connection timeout after 9 seconds
Jun  4 22:42:26 dcr2 mpd[34258]: [vlan13_link0] Link: DOWN event
Jun  4 22:42:26 dcr2 mpd[34258]: [vlan13_link0] LCP: Down event

Nothing else changed on my network, just the system upgrade from 13.0 to 13.1.

The only thing in the release notes I could see related to networking was the change of "net.inet.ip.broadcast_lowest", but adjusting this had no effect on either type of connection.

I did attempt some packet captures on the modem side. I do see the DHCPDISCOVER packets, but I get zero response from any of my upstream providers (two completely separate ones), unlike normal. That potentially points to a malformed packet of some kind, but I couldn't see any errors in them, and it doesn't explain why both dhclient *and* mpd5 are failing in a similar way. It seems like it could be an issue with the network drivers at some level (perhaps vLANs?), but...

All my other vLANs (12 of them) worked flawlessly, including DHCP requests *in* to the router from client devices. It was only these 3 outbound interfaces that seemed to have problems, which makes it even more confusing.
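
For comparing the two releases, capturing on the router itself with link-level headers would show the source MAC and 802.1Q tag of each outgoing frame. A rough sketch, assuming `lagg0` is the vLAN parent and the vLAN IDs match the rc.conf excerpt above; `capture_vlan` and the `TCPDUMP` override are hypothetical names used only here:

```sh
#!/bin/sh
# Sketch only: the interface name, VLAN IDs and frame count are assumptions
# taken from the rc.conf excerpt above. Capturing for real needs root.
TCPDUMP="${TCPDUMP:-tcpdump}"

# capture_vlan PARENT VLAN_ID FILTER
# Capture 20 frames matching FILTER inside the given 802.1Q VLAN on PARENT.
# -e prints the Ethernet header (source MAC, VLAN tag); -n skips name lookups.
capture_vlan() {
    "$TCPDUMP" -e -n -i "$1" -c 20 "vlan $2 and ($3)"
}

# Examples (run as root while dhclient or mpd5 is retrying):
#   capture_vlan lagg0 11 'udp port 67 or udp port 68'   # DHCP on WAN1
#   capture_vlan lagg0 13 'ether proto 0x8863'           # PPPoE discovery on WAN3
```

Running the same capture on a 13.0 host and a 13.1 host and comparing the Ethernet headers line by line is the point of the exercise; the filter expressions themselves are ordinary pcap syntax.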

I'm not really sure where else to look or what else would be useful to troubleshoot further; I'm a relative FreeBSD newbie aside from these routers, but very well-versed in Linux so hit me with the advanced commands. Does anyone have any advice, either for how to find more information about what's going on or a potential cause?
 
Bumping this - due to another incompatibility (with python 3.9), my 13.0 systems are now unusable (i.e., I can't manage them with Ansible any more). This effectively forces me to upgrade to 13.1 to continue managing them. But this problem still remains.

I did a bit more testing, and can continue to replicate the problem even on a physical interface, and with 2 different kinds of physical interfaces. Even with a direct connect between the modem and one of these interfaces - no switch, no LAGG, no vLANs, no pf - it's still failing.

Does anyone have any advice?
 
And in the most bizarre turn of events, manually setting my MAC address to that of the other (working) host suddenly fixes it, at least for one DHCP connection. I'm at a complete loss to explain what is going on here, but at least it seems like I can get one of my connections going.
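
Since the MAC address seems to matter, it may be worth confirming right after boot that each WAN vLAN actually carries the address set via `link` in rc.conf. A small sketch; the helper names `mac_of` and `mac_matches` are made up for illustration:

```sh
#!/bin/sh
# Sketch only: compares the MAC an interface currently reports with the
# value expected from rc.conf. Helper names are hypothetical.

# mac_of: read `ifconfig IFACE` output on stdin, print the ether address.
mac_of() {
    awk '$1 == "ether" { print $2; exit }'
}

# mac_matches EXPECTED: succeed if the ether address on stdin equals EXPECTED.
mac_matches() {
    [ "$(mac_of)" = "$1" ]
}

# Example:
#   ifconfig vlan11 | mac_matches 54:e1:ad:15:ba:61 || echo "vlan11: unexpected MAC"
```

This only checks what `ifconfig` reports; it says nothing about what the parent interface actually puts on the wire.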
 
The weirdness continues: I figured out what I did to make it work. At some point I ran `sudo /etc/rc.d/netif restart <iface>` on each interface, and it seems like as soon as I do that, it suddenly starts working.

That is, when the system first boots, all the interfaces are broken. Then, once I run `netif restart` on each one, they suddenly start working. Note that running a *general* `netif restart` doesn't help; it seems like I have to do it on each vLAN individually. When I do so, DHCP completes and PPPoE can come up. The visible config of each interface doesn't change at all though, for example:

Code:
joshua@dcr2.m.bonilan.net ~ $ sudo /etc/rc.d/netif restart vlan11
Stopping dhclient.
Waiting for PIDS: 36808.
Stopping Network: vlan11.
vlan11: flags=8842<BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=4000700<TSO4,TSO6,LRO,NOMAP>
        ether 54:e1:ad:15:ba:61
        groups: vlan
        vlan: 11 vlanproto: 802.1q vlanpcp: 0 parent interface: lagg0
        media: Ethernet autoselect
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
Destroyed clone interfaces: vlan11.
Created clone interfaces: vlan11.
Starting Network: vlan11.
vlan11: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=4000700<TSO4,TSO6,LRO,NOMAP>
        ether 54:e1:ad:15:ba:61
        groups: vlan
        vlan: 11 vlanproto: 802.1q vlanpcp: 0 parent interface: lagg0
        media: Ethernet autoselect
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>

So clearly *something* about how my `rc.conf` is applied changed between 13.0 and 13.1, but I don't see what it is.

At least I now have a workaround, but the idea of having to manually trigger these interface restarts on boot seems cumbersome, so I'd definitely like to get to the bottom of what might be going on.
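
One way to check whether the restart changes something the plain `ifconfig` output hides might be to snapshot the verbose output before and after the per-interface restart and diff the two. A minimal sketch, assuming a hypothetical scratch directory `/var/tmp/netif-snap` and helper names `snapshot` and `changed`, neither of which is part of FreeBSD:

```sh
#!/bin/sh
# Sketch only: record interface state before and after a per-interface
# restart and show what changed. SNAPDIR is an arbitrary, hypothetical path.
SNAPDIR="${SNAPDIR:-/var/tmp/netif-snap}"

# snapshot IFACE LABEL: save verbose ifconfig output under SNAPDIR.
snapshot() {
    mkdir -p "$SNAPDIR" && ifconfig -v "$1" > "$SNAPDIR/$1.$2"
}

# changed IFACE: print a unified diff of the two snapshots.
# Exit status 0 means identical, 1 means they differ.
changed() {
    diff -u "$SNAPDIR/$1.before" "$SNAPDIR/$1.after"
}

# Example (as root, on a freshly booted and still broken host):
#   snapshot vlan11 before; snapshot lagg0 before
#   /etc/rc.d/netif restart vlan11
#   snapshot vlan11 after;  snapshot lagg0 after
#   changed vlan11; changed lagg0
```

Snapshotting the parent `lagg0` as well is deliberate, since destroying and recreating a vLAN could plausibly alter state on the parent rather than on the vLAN itself.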
 
I have this in rc.conf:
Code:
dhclient_enable="NO"

and this in /etc/ppp/ppp.linkup:
Code:
PROXIMUS:
    add! default HISADDR
    add! default HISADDR6
    !bg /usr/sbin/service ipfw          onestop
    !bg /usr/sbin/service ntpd          onestop
    !bg /usr/sbin/service local_unbound onestop
    !bg /usr/sbin/service rtsold        onestop
    !bg /usr/sbin/service rtsold        onestart
    !bg /usr/local/bin/gsleep 0.5
    !bg /usr/sbin/service local_unbound onestart
    !bg /usr/sbin/ntpdate 0.europe.pool.ntp.org
    !bg /usr/sbin/ntpdate 1.europe.pool.ntp.org
    !bg /usr/sbin/service ntpd          onestart
    !bg /usr/sbin/service ipfw          onestart

Code:
netstat -rn
might give some interesting information.
 
I don't explicitly set `dhclient_enable` in my /etc/rc.conf - I'm actually running dhclient manually from scripts on a CARP failover (thanks, ISPs that refuse to give out more than one public IP...)

I also don't have an /etc/ppp/ppp.linkup, since I'm using mpd5, which has its own config. But it never even gets to that stage; it just times out.

And `netstat` doesn't really show anything while it's failing since I have no upstream routes.

I have two comments that were still hidden when you wrote that, Alain. They mention how I "fixed" it, and it definitely seems like rc.conf wonkiness. So, for completeness, here's the *whole* thing, raw and unedited. Maybe the flaw that forces me to trigger the netif restart after this is all set is obvious to someone with more experience :)

Code:
#######################
#                     #
#  BLSE dcrX rc.conf  #
#                     #
#######################
# Ansible managed - last modified on 2022-10-30 03:23:12
hostname="dcr2.m.bonilan.net"
clear_tmp_enable="YES"
local_unbound_enable="YES"
sshd_enable="YES"
sshguard_enable="YES"
ntpd_enable="YES"
powerd_enable="YES"
dumpdev="AUTO"
zfs_enable="YES"
devd_enable="YES"
inetd_enable="YES"
ifstated_enable="YES"
openbgpd_enable="YES"

# Wireguard site-to-site (managed via devd)
wireguard_enable="NO"
wireguard_interfaces="vpn1 vpn2 vpn3 "

# DHCPD
dhcpd_enable="YES"
dhcpd_conf="/usr/local/etc/dhcpd.conf"
dhcpd_ifaces="em0 vlan41 vlan42 vlan44 vlan45 vlan101 vlan102 vlan103 vlan114 "
dhcpd_chuser_enable="YES"
dhcpd_withuser="dhcpd"
dhcpd_withgroup="dhcpd"

# softflowd
softflowd_enable="YES"
softflowd_interfaces="vlan11 vlan12 pppoe13 tun1 tun2 tun3"
softflowd_vlan11_collector="10.22.0.11:9590"              
softflowd_vlan12_collector="10.22.0.11:9590"
softflowd_pppoe13_collector="10.22.0.11:9590"
softflowd_tun1_collector="10.22.0.11:9590"
softflowd_tun2_collector="10.22.0.11:9590"
softflowd_tun3_collector="10.22.0.11:9590"
softflowd_vlan11_extra_args="-v9"
softflowd_vlan12_extra_args="-v9"
softflowd_pppoe13_extra_args="-v9"
softflowd_tun1_extra_args="-v9"
softflowd_tun2_extra_args="-v9"
softflowd_tun3_extra_args="-v9"

# PF
gateway_enable="YES"
pf_enable="YES"
pf_rules="/etc/pf.conf"
pflog_enable="YES"
pflog_logfile="/var/log/pflog"
pfsync_enable="YES"
pfsync_syncdev="em1"

# Cloned interfaces
cloned_interfaces="lagg0 vlan11 vlan12 vlan13 vlan41 vlan42 vlan43 vlan44 vlan45 vlan100 vlan101 vlan102 vlan103 vlan104 vlan105 vlan111 vlan112 vlan113 vlan114 vlan115"

# LACP
ifconfig_igb0="-vlanhwtag mtu 9000 up"
ifconfig_igb1="-vlanhwtag mtu 9000 up"
ifconfig_igb2="-vlanhwtag mtu 9000 up"
ifconfig_igb3="-vlanhwtag mtu 9000 up"
ifconfig_lagg0="mtu 9000 laggproto lacp lagghash l2 laggport igb0 laggport igb1"

# Native Interfaces (MGMT, CARP)
ifconfig_em0="inet 10.22.0.3/24"
ifconfig_em0_alias0="inet vhid 22 advskew 100 pass blsecarp alias 10.22.0.1/24"
ifconfig_em1="inet 10.21.0.2/30"

# Virtual Interfaces (WAN)
# Start.ca DOCSIS
create_args_vlan11="vlan 11 vlandev lagg0"
ifconfig_vlan11="mtu 1500 dhcp"
ifconfig_vlan11_alias0="link 54:e1:ad:15:ba:61"

# Rogers FWA
create_args_vlan12="vlan 12 vlandev lagg0"
ifconfig_vlan12="mtu 1500 dhcp"
ifconfig_vlan12_alias0="link 54:e1:ad:15:ba:62"

# TekSavvy DSL
create_args_vlan13="vlan 13 vlandev lagg0"
ifconfig_vlan13="mtu 1500 up"
ifconfig_vlan13_alias0="link 00:1b:21:6c:6c:f3"
mpd_enable="YES"
mpd_flags="-b WAN3"

# Virtual Interfaces (CLIENT)
ifconfig_vlan41="inet 10.41.0.3/24 vlan 41 vlandev lagg0"
ifconfig_vlan41_alias0="inet vhid 41 advskew 100 pass blsecarp alias 10.41.0.1/24"
ifconfig_vlan42="inet 10.42.0.3/24 vlan 42 vlandev lagg0"
ifconfig_vlan42_alias0="inet vhid 42 advskew 100 pass blsecarp alias 10.42.0.1/24"
ifconfig_vlan43="inet 10.43.0.3/24 vlan 43 vlandev lagg0"
ifconfig_vlan43_alias0="inet vhid 43 advskew 100 pass blsecarp alias 10.43.0.1/24"
ifconfig_vlan44="inet 10.44.0.3/24 vlan 44 vlandev lagg0"
ifconfig_vlan44_alias0="inet vhid 44 advskew 100 pass blsecarp alias 10.44.0.1/24"
ifconfig_vlan45="inet 10.45.0.3/24 vlan 45 vlandev lagg0"
ifconfig_vlan45_alias0="inet vhid 45 advskew 100 pass blsecarp alias 10.45.0.1/24"

# Virtual Interfaces (PVC_CLUSTER)
ifconfig_vlan100="inet 10.100.0.3/24 vlan 100 vlandev lagg0"
ifconfig_vlan100_alias0="inet vhid 100 advskew 100 pass blsecarp alias 10.100.0.1/24"

# Virtual Interfaces (PROD_SERVER)
ifconfig_vlan101="inet 10.101.0.3/24 vlan 101 vlandev lagg0"
ifconfig_vlan101_alias0="inet vhid 101 advskew 100 pass blsecarp alias 10.101.0.1/24"
ifconfig_vlan102="inet 10.102.0.3/24 vlan 102 vlandev lagg0"
ifconfig_vlan102_alias0="inet vhid 102 advskew 100 pass blsecarp alias 10.102.0.1/24"
ifconfig_vlan103="inet 10.103.0.3/24 vlan 103 vlandev lagg0"
ifconfig_vlan103_alias0="inet vhid 103 advskew 100 pass blsecarp alias 10.103.0.1/24"
ifconfig_vlan104="inet 10.104.0.3/24 vlan 104 vlandev lagg0"
ifconfig_vlan104_alias0="inet vhid 104 advskew 100 pass blsecarp alias 10.104.0.1/24"
ifconfig_vlan105="inet 10.105.0.3/24 vlan 105 vlandev lagg0"
ifconfig_vlan105_alias0="inet vhid 105 advskew 100 pass blsecarp alias 10.105.0.1/24"

# Virtual Interfaces (TEST_PVC_CLUSTER)
ifconfig_vlan111="inet 10.111.0.3/24 vlan 111 vlandev lagg0"
ifconfig_vlan111_alias0="inet vhid 111 advskew 100 pass blsecarp alias 10.111.0.1/24"

# Virtual Interfaces (TEST_SERVER)
ifconfig_vlan114="inet 10.114.0.3/24 vlan 114 vlandev lagg0"
ifconfig_vlan114_alias0="inet vhid 114 advskew 100 pass blsecarp alias 10.114.0.1/24"
ifconfig_vlan115="inet 10.115.0.3/24 vlan 115 vlandev lagg0"
ifconfig_vlan115_alias0="inet vhid 115 advskew 100 pass blsecarp alias 10.115.0.1/24"

# Loopback for public IPs
ifconfig_lo0_alias01="inet alias 198.55.48.48/28"
ifconfig_lo0_alias02="inet alias 198.55.48.49/28"
ifconfig_lo0_alias03="inet alias 198.55.48.50/28"
ifconfig_lo0_alias04="inet alias 198.55.48.51/28"
ifconfig_lo0_alias05="inet alias 198.55.48.52/28"
ifconfig_lo0_alias06="inet alias 198.55.48.53/28"
ifconfig_lo0_alias07="inet alias 198.55.48.54/28"
ifconfig_lo0_alias08="inet alias 198.55.48.55/28"
ifconfig_lo0_alias09="inet alias 198.55.48.56/28"
ifconfig_lo0_alias10="inet alias 198.55.48.57/28"
ifconfig_lo0_alias11="inet alias 198.55.48.58/28"
ifconfig_lo0_alias12="inet alias 198.55.48.59/28"
ifconfig_lo0_alias13="inet alias 198.55.48.60/28"
ifconfig_lo0_alias14="inet alias 198.55.48.61/28"
ifconfig_lo0_alias15="inet alias 198.55.48.62/28"

As a workaround for now, I have the following as /etc/rc.local, which does solve the problem after a reboot. With it, my DHCP and PPPoE connections complete flawlessly; without it, they time out as described in the original post, but only on 13.1-RELEASE, not 13.0-RELEASE.

Code:
#!/bin/sh
# Ansible managed - last modified on 2022-10-30 03:39:58

# Work around bug with vLANs not starting properly
# https://forums.freebsd.org/threads/dhcp-and-ppp-not-working-after-upgrade-from-13-0-release-p11-to-13-1-release.85403/
/etc/rc.d/netif restart vlan11
/etc/rc.d/netif restart vlan12
/etc/rc.d/netif restart vlan13
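
The same workaround could also be written as a loop, which might be easier to template from Ansible. This is only a sketch: `NETIF`, `WAN_VLANS` and `restart_wan_vlans` are illustrative names, not rc.conf knobs, and the behaviour is intended to match the three explicit calls above.

```sh
#!/bin/sh
# Sketch only: loop form of the per-vLAN restart workaround.
# NETIF and WAN_VLANS are illustrative variables, not rc.conf settings.
NETIF="${NETIF:-/etc/rc.d/netif}"
WAN_VLANS="${WAN_VLANS:-vlan11 vlan12 vlan13}"

restart_wan_vlans() {
    for iface in $WAN_VLANS; do
        # Keep going if one restart fails so the other WANs still come up.
        "$NETIF" restart "$iface" || echo "netif restart $iface failed" >&2
    done
}
```

In /etc/rc.local the function definition would be followed by a single `restart_wan_vlans` call.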
 