Solved LACP with FreeBSD - Problem

Hello everybody,

We have installed FreeBSD 10.3 on a new server which will be used as a backup and file storage node. The setup itself went fine, but a problem appeared when we tried to activate LACP on the server. We wanted to combine the throughput of two 10G links.

Our switch is a Netgear S3300-52X. On some of our OpenSuse machines LACP/LAG works with that switch, so I do not think that the problem is caused by the switch itself (or maybe only in combination with FreeBSD).

The internet connection without LACP on the FreeBSD machine worked well using one ethernet port.

When we configured LACP following the usual descriptions on the web, the server was unable to establish any connection to other machines in our subnet.

ifconfig shows the following output:

Code:
ix0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500  
options=e407bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
      ether yy:yy:yy:yy:yy:yy
      nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
      media:Ethernet autoselect (10Gbase-T <full-duplex,rxpause,txpause>)
      status=active
ix1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500    
options=e407bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
      ether yy:yy:yy:yy:yy:yy
      nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
      media:Ethernet autoselect (10Gbase-T <full-duplex,rxpause,txpause>)
      status=active
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
      options=600003<RXCSUM,TXCSUM,RXCSUM_IPV6,TXCSUM_IPV6>
      inet6 ::1 prefixlen 128
      inet6 fe80::1%lo0 prefixlen 64 scopeid 0x3
      inet 127.0.0.1 netmask 0xff000000
      nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500  
options=e407bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6>
      ether yy:yy:yy:yy:yy:yy
      inet xxx.xxx.xxx.xxx netmask 0xffffff80 broadcast xxx.xxx.xxx.xxx
      nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
      media: Ethernet autoselect
      status: active
      laggproto lacp lagghash l2,l3,l4
      laggport: ix0 flags=0<>
      laggport: ix1 flags=0<>

As you can see, we have tried to bundle the ix0 and ix1 adapters into one LACP connection (lagg0). The Ethernet addresses of all the ports are identical; the IP address and subnet mask worked well on a single port without LACP.

The rc.conf file looks as follows:

Code:
hostname="name"
keymap="german.iso.acc.kbd"
ifconfig_ix0="up"
ifconfig_ix1="up"
cloned_interfaces="lagg0"
ifconfig_lagg0="laggproto lacp laggport ix0 laggport ix1 xxx.xxx.xxx.xxx netmask 255.255.255.128"
defaultrouter="xxx.xxx.xxx.xxx"
sshd_enable="YES"
ntpd_enable="YES"
powerd_enable="YES"
dumpdev="AUTO"
zfs_enable="YES"
ezjail_enable="YES"

The loader.conf file includes the needed line for loading the drivers for using lagg interfaces:

Code:
kern.geom.label.gptid.enable="0"
zfs_load="YES"
if_lagg_load="YES"

On the Netgear switch, all we needed to do was create a LAG group and add to it the two ports that are connected to the FreeBSD machine.

I have searched all available forum entries for lagg/LACP/LAG with FreeBSD and unfortunately did not find any reported problem that looks (nearly) identical to ours.
My guess at the cause was that maybe too many option flags are set on the adapters ix0 and ix1 (all the other examples I have seen show far fewer flags), or that IPv6 is (partially) enabled on them (we only use IPv4 addresses).
I assume that the empty flags on the laggports (flags=0<>) mean the ports are not actually being used by lagg0 right now, which would explain why the connection does not work.
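
The laggport check described above can be scripted. This is only a rough sketch: the function name lagg_inactive_ports is made up for illustration, and it assumes the `ifconfig lagg0` output format shown earlier in this post. Ports that have completed LACP negotiation normally show flags such as ACTIVE,COLLECTING,DISTRIBUTING, so the function prints every laggport that lacks ACTIVE.

```shell
# Hypothetical helper: reads `ifconfig lagg0` output on stdin and prints
# the name of every laggport whose flags do not include ACTIVE.
# "flags=0<>" (as in the output above) means no negotiation took place.
lagg_inactive_ports() {
    awk '$1 == "laggport:" && $3 !~ /ACTIVE/ { print $2 }'
}

# Usage on the FreeBSD host (lagg0 is the interface name used in this thread):
#   ifconfig lagg0 | lagg_inactive_ports
```

With the output posted above, this would list both ix0 and ix1.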

I would be very grateful if any of you has an idea what causes our problem or what we could try next to fix it.
 
Apparently Netgear makes a royal mess of things by using confusing terminology. I've seen LAG being referred to as a trunk, which is a typical Cisco term for 802.1Q VLAN tagging and has nothing to do with link aggregation. That said, a trunk on HP is a bundle of ports and is probably more like LAG on Netgear. Totally confusing.

I think you need to figure out what protocol the Netgear actually supports. As you have a working OpenSuSe system it might be worthwhile to check there too. On FreeBSD LACP (as used by lagg(4)) is IEEE 802.1AX (formerly 802.3ad). There's also fec/loadbalance and this is probably more like Netgear's LAG.
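
One way to check on the Linux side, as a rough sketch: the Linux bonding driver reports its mode in /proc/net/bonding/<name>. The helper name bonding_mode and the bond name bond0 are assumptions for illustration; your openSUSE machines may use a different interface name or a different aggregation mechanism altogether.

```shell
# Hypothetical helper: reads a /proc/net/bonding/<name> file on stdin and
# prints only the bonding mode, e.g. "IEEE 802.3ad Dynamic link aggregation"
# for LACP or "load balancing (xor)" for a static hash-based bundle.
bonding_mode() {
    sed -n 's/^Bonding Mode: //p'
}

# Usage on the Linux host (bond0 is an assumed interface name):
#   bonding_mode < /proc/net/bonding/bond0
```

If the mode reported there is not 802.3ad, the working machines are not using LACP either.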
 
Thank you for your reply! Indeed, it seems that the FreeBSD LACP protocol was somehow incompatible with Netgear's LAG. We are now using loadbalance and the problem seems to be solved (the network connection works as usual and more than 10G of throughput could be measured).
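
The final configuration was not posted in this thread. As a minimal sketch, assuming only the laggproto keyword in the rc.conf shown earlier was changed and that the LAG on the switch is a static one (no LACP negotiation), the relevant lines would look roughly like this:

```shell
# /etc/rc.conf fragment (sketch, not the poster's actual file)
ifconfig_ix0="up"
ifconfig_ix1="up"
cloned_interfaces="lagg0"
# loadbalance distributes outgoing traffic by flow hash and does not
# negotiate with the switch, unlike laggproto lacp
ifconfig_lagg0="laggproto loadbalance laggport ix0 laggport ix1 xxx.xxx.xxx.xxx netmask 255.255.255.128"
```

Because traffic is distributed per flow, a single connection still uses one link; throughput above 10G only shows up with several flows in parallel.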
 
I once had to mess around with a Netgear ProSafe switch and link aggregation (before throwing it out the window...).
It appears these switches have a broken/incomplete/crappy LACP implementation and fail to negotiate on several connection options - IIRC flow control had to be disabled on the other end to get it working *sometimes*. However, losing one link always dropped the whole connection, so there was no point in using aggregation/LACP with the Netgear switch.
Netgear LAGs without LACP also never worked with either Cisco or Dell "passive" or "static" aggregations (IIRC they only used one link, and falling back to passive links didn't work either). I ended up using some kind of load balancing (round-robin IIRC) on the client side (Debian Linux back then), and a single link from one Cisco to the Netgear.
I didn't look into the problem any further - I just got rid of the Netgear and replaced it with another Cisco SG300.
 