lagg interfaces not coming up properly after update to 11.2

I finally updated our storage server from 11.1-RELEASE to 11.2-RELEASE and now lagg interfaces won't come up at boot:

Code:
# grep lagg /etc/rc.conf
cloned_interfaces="lagg0 vlan3 vlan4 bridge0 bridge4 bridge5"
ifconfig_lagg0="laggproto lacp laggport igb2 laggport igb3 laggport igb4 laggport igb5 inet 10.50.0.101/24 up"
[...]

# ifconfig lagg0
lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=800000<>
        ether 00:00:00:00:00:00
        inet 10.50.0.101 netmask 0xffffff00 broadcast 10.50.0.255 
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: no carrier
        groups: lagg 
        laggproto failover lagghash l2,l3,l4

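The ifconfig_lagg0 entry is evidently read from /etc/rc.conf, because the IP address is set, but all lagg options are ignored: the interface comes up with the default failover protocol, no member ports and an all-zero MAC address.

Creating the lagg interface manually works as expected: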
Code:
# ifconfig lagg0 create laggproto lacp laggport igb2 laggport igb3 laggport igb4 laggport igb5 inet 10.50.0.101/24
# ifconfig lagg0
lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
        options=6403bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 90:e2:ba:14:ca:18
        inet 10.50.0.101 netmask 0xffffff00 broadcast 10.50.0.255 
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: active
        groups: lagg 
        laggproto lacp lagghash l2,l3,l4
        laggport: igb2 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        laggport: igb3 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        laggport: igb4 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        laggport: igb5 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>

Have there been any changes to how lagg interfaces are set up at boot, or to how their configuration is read from rc.conf? This config has worked since the 10.x days, and there were never any warnings during updates about the syntax having changed...
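
To compare the two states quickly after each reboot, the ifconfig output can be reduced to just the protocol and the member ports. This is only a rough sketch: lagg_summary is a made-up helper name, and it assumes the output format shown in the pastes above.

```shell
#!/bin/sh
# Hypothetical helper (not part of FreeBSD): read ifconfig(8) output on
# stdin and print the lagg protocol and each member port, one per line.
lagg_summary() {
    awk '
        $1 == "laggproto" { print "proto", $2 }
        $1 == "laggport:" { print "port", $2 }
    '
}

# On a live system:
#   ifconfig lagg0 | lagg_summary
```

For the broken state above this prints only "proto failover"; for the manually created interface it prints "proto lacp" followed by one "port" line per member.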
 
I doubt this is something in 11.2. My server has been running 11-STABLE up until this weekend when I updated it to 12-STABLE. It has a lagg(4) interface and it's been working as expected.

Code:
cloned_interfaces="lagg0 bridge10"
ifconfig_igb0="up mtu 9014"
ifconfig_igb1="up mtu 9014"
ifconfig_lagg0="laggproto lacp laggport igb0 laggport igb1"
vlans_lagg0="10"
ifconfig_lagg0_10="inet 192.168.10.180 netmask 255.255.255.0"
Code:
dice@hosaka:~ % ifconfig lagg0
lagg0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 9014
        options=e527bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,WOL_MAGIC,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 00:25:90:f1:58:38
        laggproto lacp lagghash l2,l3,l4
        laggport: igb0 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        laggport: igb1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        groups: lagg
        media: Ethernet autoselect
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
dice@hosaka:~ % ifconfig lagg0.10
lagg0.10: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 9014
        options=200401<RXCSUM,LRO,RXCSUM_IPV6>
        ether 00:25:90:f1:58:38
        inet 192.168.10.180 netmask 0xffffff00 broadcast 192.168.10.255
        groups: vlan
        vlan: 10 vlanpcp: 0 parent interface: lagg0
        media: Ethernet autoselect
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
 
The update from 11.1-RELEASE-p6 to 11.2-RELEASE-p4 was the only thing changed on that host...

Every other configuration I might add to the ifconfig_lagg0 entry is properly processed; only laggproto/laggport seems to be ignored. No errors on dmesg or /var/log/messages:
Code:
# grep lagg /var/log/messages                                                                                                                                                   
Oct 24 14:09:27 stor1 kernel: lagg0: link state changed to UP

I also had to deal with a dying disk on that server today (it occasionally disappeared for a few seconds, sending the ZFS pool it belongs to on holiday until it magically came back and the pool became responsive again...), so I haven't had much time to look deeper into the problem, other than verifying that it no longer works via rc.conf but only manually, with exactly the same syntax (essentially copy and paste).
 
Check if the associated rc(8) scripts have been merged properly. It might have had a merge conflict somewhere that's preventing the script from working properly. Although I would have expected to see some error messages during boot if that was the case.
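
A quick way to look for leftovers from a failed merge is to search the rc(8) files for conflict markers. This is a sketch only: find_conflict_markers is a hypothetical helper, and it assumes the merge tool left the usual "<<<<<<<" / ">>>>>>>" marker lines behind.

```shell
#!/bin/sh
# Hypothetical helper: list files under the given paths that contain
# lines starting with the usual merge-conflict markers.
find_conflict_markers() {
    grep -rlE '^(<<<<<<<|>>>>>>>)' "$@" 2>/dev/null
}

# On the affected host (usual rc(8) locations):
#   find_conflict_markers /etc/rc.d /etc/rc.subr /etc/network.subr /etc/defaults/rc.conf
```

No output means no markers were found; it does not prove the scripts are otherwise identical to the release versions.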

My server was originally installed with 11.1 if I remember correctly. It had been steadily updated with 11-STABLE up until this weekend. I've had zero issues with regards to the lagg(4) interface configuration.
 
It seems there were no changes to /etc/rc.d/netif:

Code:
affected host with 11.2-RELEASE:
% grep '$FreeBSD' /etc/rc.d/netif                                                                                                                                      
# $FreeBSD: releng/11.2/etc/rc.d/netif 300931 2016-05-29 02:59:03Z ngie $

11.1-RELEASE:
% grep '$FreeBSD' /etc/rc.d/netif
# $FreeBSD: releng/11.1/etc/rc.d/netif 300931 2016-05-29 02:59:03Z ngie $

Excerpt from /var/log/messages from the last boot:
Code:
Oct 24 14:08:35 stor1 kernel: ix0: link state changed to UP
Oct 24 14:08:35 stor1 kernel: igb0: link state changed to UP
Oct 24 14:08:35 stor1 kernel: igb5: link state changed to UP
Oct 24 14:08:35 stor1 kernel: igb2: link state changed to UP
Oct 24 14:08:35 stor1 kernel: igb4: link state changed to UP
Oct 24 14:08:35 stor1 kernel: igb3: link state changed to UP
[...]
Oct 24 14:09:27 stor1 kernel: lagg0: link state changed to UP
Oct 24 14:09:27 stor1 kernel: vlan3: link state changed to UP
Oct 24 14:09:27 stor1 kernel: vlan4: link state changed to UP
Oct 24 14:09:27 stor1 kernel: igb3: link state changed to DOWN
Oct 24 14:09:27 stor1 kernel: igb4: link state changed to DOWN
Oct 24 14:09:27 stor1 kernel: igb5: link state changed to DOWN
Oct 24 14:09:30 stor1 kernel: igb4: link state changed to UP
Oct 24 14:09:30 stor1 kernel: igb3: link state changed to UP
Oct 24 14:09:31 stor1 kernel: igb5: link state changed to UP

I had suspected that lagg0 was set up before igb2 was ready, hence the missing ether address on the lagg interface, but the logs show nothing unusual on that front...
 
Have you tried a verbose boot? That might show more information.
 
Show us all the network-related ifconfig_* lines from /etc/rc.conf.

Most likely, you're missing the "up" lines for each of the members of the lagg.

Code:
cloned_interfaces="lagg0 vlan3 vlan4 bridge0 bridge4 bridge5"
ifconfig_igb2="up"
ifconfig_igb3="up"
ifconfig_igb4="up"
ifconfig_igb5="up"
ifconfig_lagg0="laggproto lacp laggport igb2 laggport igb3 laggport igb4 laggport igb5 inet 10.50.0.101/24"

If that doesn't work, you can split the lagg configuration in two, separating the LACP stuff from the IP stuff:
Code:
cloned_interfaces="lagg0 vlan3 vlan4 bridge0 bridge4 bridge5"
ifconfig_lagg0="laggproto lacp laggport igb2 laggport igb3 laggport igb4 laggport igb5"
ifconfig_lagg0_alias0="inet 10.50.0.101/24"
 
I also suspect that missing "up" lines for the members of the lagg are the problem.

However, if it's not, and even if there were no changes to rc.d/netif, it could be other parts of the rc.d system were not properly merged. Namely, rc.subr comes to mind (though that may have had no updates either between 11.1 and 11.2), and I'm sure there are probably others.
 
All interfaces are being up'ed:

Code:
ifconfig_igb2="mtu 9000 up"
ifconfig_igb3="mtu 9000 up"
ifconfig_igb4="mtu 9000 up"
ifconfig_igb5="mtu 9000 up"

Splitting the lagg configuration in two (LACP settings in ifconfig_lagg0, IP settings in ifconfig_lagg0_alias0) is what I initially tried; I also tried leaving out the IP configuration and adding 'up' to either line or both, all without any effect.

As said: the whole network config on that host hasn't been touched since ~10.3. For completeness here is the whole network configuration for that host:
Code:
### Networking

cloned_interfaces="lagg0 vlan3 vlan4 bridge0 bridge4 bridge5"

## onboard NICs
# igb0 = mgmt net
ifconfig_igb0="inet 10.50.50.101/24 -lro -tso up"
## 10GbE
ifconfig_ix0="inet 10.10.2.101/24 mtu 9000 up"

## quad port i350 PCIe NIC -> lagg0
ifconfig_igb2="mtu 9000 up"
ifconfig_igb3="mtu 9000 up"
ifconfig_igb4="mtu 9000 up"
ifconfig_igb5="mtu 9000 up"

ifconfig_lagg0="laggproto lacp laggport igb2 laggport igb3 laggport igb4 laggport igb5 inet 10.50.0.101/24 up"

ifconfig_vlan3="inet 10.50.52.101/24 vlan 3 vlandev lagg0 up"
ifconfig_vlan4="inet 10.50.51.101/24 vlan 4 vlandev lagg0 up"

(The bridges were once used for vnic jails but are no longer needed, so they have now been removed. I kept them in the paste to avoid confusion, because they also appeared in the output in my initial post.)

I've enabled verbose boot, but have to wait till after hours until I can take that host down again (yes, why grant budget for something silly like a redundant storage server...)
 
Don't mix tagged and untagged vlans on an interface, bad things will happen. Either do everything untagged, or do everything tagged.

IOW:
Code:
cloned_interfaces="lagg0 vlan1 vlan3 vlan4"

ifconfig_lagg0="laggproto lacp laggport igb2 laggport igb3 laggport igb4 laggport igb5"

ifconfig_vlan1="vlan 1 vlandev lagg0 inet 10.50.0.101/24"
ifconfig_vlan3="vlan 3 vlandev lagg0 inet 10.50.52.101/24"
ifconfig_vlan4="vlan 4 vlandev lagg0 inet 10.50.51.101/24"

FreeBSD tends to handle hybrid port configurations (mixing tagged and untagged vlans on a single wire) better than other OSes (Linux doesn't work with that setup anymore), but odd things will still happen. Been down that road and spent way too much time trying to get it working reliably. Switch-to-switch connections work okay in a hybrid config, but switch-to-client really should be either trunk (all tagged vlans) or access (just the untagged vlan) configs.

I'd comment out the vlan3/vlan4 lines and test if lagg0 comes up normally. If it does, then uncomment the vlan3/vlan4 lines and see if things break again. If they do, then try switching it to a vlan1/vlan3/vlan4 setup.
 
Code:
ifconfig_lagg0="laggproto lacp laggport igb2 laggport igb3 laggport igb4 laggport igb5 inet 10.50.0.101/24 up"
In circumstances like this, I've always put the IP configuration before the laggproto/laggport (i.e. ifconfig_lagg0="inet 10.50.0.101/24 laggproto lacp laggport igb2 laggport igb3 laggport igb4 laggport igb5 up"). Not sure if that could cause the issue but it may be worth trying. I certainly wouldn't use (and haven't tested) an *_aliasX without any IP configuration in the base line, so it doesn't surprise me if that doesn't work.
While I agree with the advice not to mix tagged and untagged vlans on an interface, if primarily for security reasons, his config should still work. I've done it in the past (due to a driver bug affecting VLANs that I was troubleshooting), and it worked fine. Still, it may be worth trying without for debugging the issue.

Lastly, I'm curious--if you simply run service netif restart or service netif start (not sure which one you'll need) after booting, do things come up properly? I'm assuming they don't...
 
I am seeing the same issue on FreeBSD 12.1-RELEASE. The lagg interface is created at boot however networking does not work and the server cannot ping the default gateway. Here are the lines from my rc.conf file.

Code:
ifconfig_ql0="up"
ifconfig_ql1="up"
ifconfig_ql2="up"
ifconfig_ql3="up"
cloned_interfaces="lagg0"
ifconfig_lagg0="laggproto lacp laggport ql0 laggport ql1 laggport ql2 laggport ql3 192.168.0.4/24 up"

One interesting thing to note is that networking *will* start working after I run a tcpdump on the interface.
 
I've had one running ever since 12-STABLE became available (shortly before 12.0-RELEASE):
Code:
lagg0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 9000
        options=8120b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,WOL_MAGIC,VLAN_HWFILTER>
        ether 00:25:90:f1:58:39
        laggproto lacp lagghash l2,l3,l4
        laggport: igb1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        laggport: igb2 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING>
        groups: lagg
        media: Ethernet autoselect
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
 
I've always had the same issue. I believe it's a problem with specific network interface hardware: if the MAC address is set to something else, as is the case with lagg, the NIC drops packets. When the interface is in promiscuous mode, it can receive and respond to these packets, which is why running tcpdump makes it work.

I work around this with the following in rc.conf

Code:
ifconfig_re0="ether aa:bb:cc:dd:ee:ff promisc up"
 
Does this mean that the interface will always be in promiscuous mode? Also, are you configuring the MAC address to match the real hardware address or just using a random ID?
 
Yes, it does mean the interface will always be in promiscuous mode. And I am using a real hardware address: I configure the re0 Ethernet interface to have the same MAC address as the wlan0 wifi interface, and then both are put into a lagg0 interface. The problem is that if re0 is not in promiscuous mode, it drops all packets without recognising that they are destined for itself.

If I configure it the other way around, setting the MAC of re0 on wlan0, then it works fine without promiscuous mode. But as I use wlan0 as my main interface, I wanted to make it the master interface.
 
Do I need to do this for *every* interface or just ql0? Can you post your full networking configuration here?
 
So my full networking configuration is this. Where aa:bb:cc:dd:ee:ff is the MAC address that is normally on the wlan0 interface.

Code:
wlans_ath0="wlan0"
create_args_wlan0="regdomain ETSI country GB"
ifconfig_re0="ether aa:bb:cc:dd:ee:ff promisc up"
ifconfig_wlan0="WPA up"
cloned_interfaces="lagg0"
ifconfig_lagg0="laggproto failover laggport re0 laggport wlan0 SYNCDHCP"
ifconfig_lagg0_ipv6="inet6 accept_rtadv"
rtsold_enable="YES"
 
Thanks. Looks like my issue might actually be something different. I'm able to get the lagg interface up now without the NICs being in promiscuous mode but when I reboot the server my network interfaces don't come up. I'm able to get the system online by logging in and running /etc/rc.d/netif restart but I really don't want to do this every time the server is rebooted.

Any idea what would cause this? Is there a way to tell lagg0 to be configured *after* the physical NICs are up?

For example, here is a screenshot of the configuration after a fresh reboot.

rpviewer (3).png
 
Do you have ifconfig_ql0="up" and ifconfig_ql2="up" in your rc.conf? I've found you always have to manually specify that any member of a lagg is "up" unless you have some other configuration that automatically brings it up.

That said, I've never needed to specify promiscuous mode, but in some cases I've needed to specify the MAC address (though not always IIRC).
 
Yes, the individual node interfaces need to be explicitly set to up. Not configured interfaces are, by default, down.
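
If you want to check this mechanically, something like the following could flag lagg members that have no "up" in their own ifconfig_<member> line. It's only a sketch: lagg_members_missing_up is a made-up helper, and it assumes simple one-line name="value" assignments like the ones posted in this thread (no shell variables, no line continuations).

```shell
#!/bin/sh
# Hypothetical helper: read rc.conf-style text on stdin and print each
# laggport member of the named lagg whose own ifconfig_<member> line is
# missing or does not contain the word "up".
lagg_members_missing_up() {
    awk -v lagg="$1" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)          # strip leading whitespace
            if (line ~ /^#/) next             # skip comments
            eq = index(line, "=")
            if (eq == 0) next                 # not an assignment
            name = substr(line, 1, eq - 1)
            value = substr(line, eq + 1)
            gsub(/"/, "", value)              # drop the quotes
            n = split(value, w, /[ \t]+/)
            if (name == ("ifconfig_" lagg)) {
                # the word after each "laggport" is a member interface
                for (i = 1; i < n; i++)
                    if (w[i] == "laggport") members[w[i + 1]] = 1
            } else if (name ~ /^ifconfig_/) {
                ifname = substr(name, 10)     # text after "ifconfig_"
                for (i = 1; i <= n; i++)
                    if (w[i] == "up") has_up[ifname] = 1
            }
        }
        END {
            for (m in members)
                if (!(m in has_up)) print m
        }
    ' | sort
}

# On a live system:
#   lagg_members_missing_up lagg0 < /etc/rc.conf
```

Empty output means every member listed in the lagg line has an explicit "up" somewhere in its own entry.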
 
Yes, I have the interfaces configured to come up at boot. Here is the current configuration.

Code:
ifconfig_ql0="up polling"
ifconfig_ql2="up polling"
cloned_interfaces="lagg0"
ifconfig_lagg0="laggproto lacp laggport ql0 laggport ql2 10.201.64.4/22"
defaultrouter="10.201.67.254"


I'm not sure what "polling" does or if it's even necessary though.

With this configuration lagg0 works if I log in to the console and run `/etc/rc.d/netif restart` but it does not work at boot time. Both interfaces will show flags as 0 as in the screenshot posted above. Unfortunately I don't have access to the switch but our network admins say there are no errors on the switch and the port channel is configured correctly.
 
Remove the polling keywords, you don't need them. See polling(4).
It's probably worth thinking about the benefits of polling(4).
I used a very similar technique in ancient versions of Unix (System III and System V) for RS232 (tty(4)) devices running networks, prior to the advent of Ethernet.
The benefit was that the receiver interrupts and associated context switching could be easily reduced by significant amounts (90% for one 9600 baud line).
I don't know what the comparable numbers are for polling Ethernet, but I would certainly investigate if I was dealing with constant heavy network traffic.
 
I'm not saying it's useless, it's certainly useful. But if you don't know what polling(4) is or does then you probably don't need it. People that actually need it know what it does and what it's for.
 