lagg and bonding with Debian

Hi,

I am trying to set up an aggregated Ethernet connection between a ZFS server running FreeBSD 9.1 and my MythTV server running Ubuntu 11.04. The connection is direct, i.e. two Intel Gigabit interfaces on each host, with UTP cables connecting them back to back.

My eventual goal is to run NFS traffic with jumbo packets on the aggregated link. But first, I have to crawl...

I have tested the interfaces by configuring each pair as a plain back-to-back network connection. All four interfaces and both cables work.
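For reference, this is roughly the sort of thing I did to verify each link (the 10.9.9.x addresses are just throwaway placeholders for the test):
Code:
[orac] ifconfig em0 inet 10.9.9.1 netmask 255.255.255.0 up
[myth] sudo ifconfig eth1 10.9.9.2 netmask 255.255.255.0 up
[orac] ping -c 5 10.9.9.2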

I'm not sure whether I'm being naive about how lagg/bonding works, or whether I just have a technical problem.

The FreeBSD /etc/rc.conf says:
Code:
ifconfig_em0="up mtu=1500"
ifconfig_em1="up mtu=1500"
cloned_interfaces="lagg0"
ifconfig_lagg0="laggproto roundrobin em0 laggport em1 inet 10.0.0.1 netmask 255.255.255.0"

The Linux /etc/network/interfaces says:
Code:
auto eth1
iface eth1 inet manual
        bond-master bond0

auto eth2
iface eth2 inet manual
        bond-master bond0

auto bond0
iface bond0 inet static
        address 10.0.0.3
        network 10.0.0.0
        netmask 255.255.255.0
        mtu 1500
        bond_mode balance-rr
        bond_miimon 100
        bond_downdelay 200
        bond_updelay 200
        slaves eth1 eth2

After rebooting the FreeBSD system:
Code:
[orac#145] ifconfig -a
em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=4219b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,WOL_MAGIC,VLAN_HWTSO>
        ether 50:46:5d:76:25:9b
        inet6 fe80::5246:5dff:fe76:259b%em0 prefixlen 64 scopeid 0x2 
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
em1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=4219b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,WOL_MAGIC,VLAN_HWTSO>
        ether 68:05:ca:11:58:33
        inet6 fe80::6a05:caff:fe11:5833%em1 prefixlen 64 scopeid 0x6 
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        ether 00:00:00:00:00:00
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: no carrier
        laggproto roundrobin lagghash l2,l3,l4

Note that the lagg0 interface has no IP address configured. Maybe that's a clue. It's easy enough to add:
Code:
[orac#146] ifconfig lagg0 inet orac10 netmask 255.255.255.0
[orac#147] ifconfig lagg0
lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        ether 00:00:00:00:00:00
        inet 10.0.0.1 netmask 0xffffff00 broadcast 10.0.0.255
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: no carrier
        laggproto roundrobin lagghash l2,l3,l4

Note there is "no carrier" for lagg0. Maybe that's because of negotiation failure?
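I haven't tried it yet, but I believe the member ports can also be attached by hand, which might show whether the carrier comes up without another reboot; note that the ifconfig output above lists no laggport lines at all:
Code:
[orac] ifconfig lagg0 laggport em0 laggport em1
[orac] ifconfig lagg0 up
[orac] ifconfig lagg0 | grep -E 'status|laggport'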

Meanwhile, on the Linux side:
Code:
[myth#145] ifconfig -a
bond0     Link encap:Ethernet  HWaddr 68:05:ca:11:4f:05  
          inet addr:10.0.0.3  Bcast:10.0.0.255  Mask:255.255.255.0
          inet6 addr: fe80::6a05:caff:fe11:4f05/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:52 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 B)  TX bytes:9702 (9.7 KB)
eth1      Link encap:Ethernet  HWaddr 68:05:ca:11:4f:05  
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:52 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 B)  TX bytes:9702 (9.7 KB)
          Interrupt:16 Memory:fcfe0000-fd000000 

eth2      Link encap:Ethernet  HWaddr 68:05:ca:11:4f:05  
          BROADCAST SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
          Interrupt:17 Memory:f9fe0000-fa000000

FreeBSD netstat:
Code:
[orac#147] netstat -ni
Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
usbus     0 <Link#1>                               0     0     0        0     0     0
em0    1500 <Link#2>      50:46:5d:76:25:9b       24     0     0        0     0     0
em0    1500 fe80::5246:5d fe80::5246:5dff:f        0     -     -        3     -     -
usbus     0 <Link#3>                               0     0     0        0     0     0
usbus     0 <Link#4>                               0     0     0        0     0     0
usbus     0 <Link#5>                               0     0     0        0     0     0
em1    1500 <Link#6>      68:05:ca:11:58:33        0     0     0        0     0     0
em1    1500 fe80::6a05:ca fe80::6a05:caff:f        0     -     -        1     -     -
re0    1500 <Link#7>      50:46:5d:76:22:ad     2617     0     0      775     0     0
re0    1500 192.168.1.0/2 192.168.1.26           894     -     -      764     -     -
re0    1500 fe80::5246:5d fe80::5246:5dff:f        0     -     -        1     -     -
usbus     0 <Link#8>                               0     0     0        0     0     0
lo0   16384 <Link#9>                              46     0     0       46     0     0
lo0   16384 ::1/128       ::1                      4     -     -        4     -     -
lo0   16384 fe80::1%lo0/6 fe80::1                  0     -     -        0     -     -
lo0   16384 127.0.0.0/8   127.0.0.1               40     -     -       42     -     -
lagg0  1500 <Link#10>     00:00:00:00:00:00        0     0     0        0     0     0
lagg0  1500 10.0.0.0/24   10.0.0.1                 2     -     -        4     -     -

Linux netstat:
Code:
[myth#148] netstat -ni
Kernel Interface table
Iface   MTU Met   RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
bond0      1500 0         0      0      0 0            62      0      0      0 BMmRU
eth0       1500 0    285529      0      0 0          1382      0      0      0 BMRU
eth1       1500 0         0      0      0 0            62      0      0      0 BMsRU
lo        16436 0       149      0      0 0           149      0      0      0 LRU

It's not working. Any clues appreciated.

Cheers,

--
Phil
 
Here is my progress on the problem, for those who follow.

I changed the lagg configuration syntax in /etc/rc.conf to:
Code:
ifconfig_em0="up mtu=1500"
ifconfig_em1="up mtu=1500"
cloned_interfaces="lagg0"
ifconfig_lagg0="laggproto roundrobin laggport em0 laggport em1"
ipv4_addrs_lagg0="10.0.0.1/24"

This got the lagg0 interface up:

Code:
lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=4219b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,WOL_MAGIC,VLAN_HWTSO>
        ether 50:46:5d:76:25:9b
        inet 10.0.0.1 netmask 0xffffff00 broadcast 10.0.0.255
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: active
        laggproto roundrobin lagghash l2,l3,l4
        laggport: em1 flags=4<ACTIVE>
        laggport: em0 flags=4<ACTIVE>
There's still a bit of work to do, as a lot of packets are being dropped:
Code:
[orac#132] ping myth10      
PING myth10 (10.0.0.3): 56 data bytes
64 bytes from 10.0.0.3: icmp_seq=0 ttl=64 time=0.318 ms
64 bytes from 10.0.0.3: icmp_seq=2 ttl=64 time=0.369 ms
64 bytes from 10.0.0.3: icmp_seq=4 ttl=64 time=0.250 ms
64 bytes from 10.0.0.3: icmp_seq=5 ttl=64 time=0.287 ms
64 bytes from 10.0.0.3: icmp_seq=6 ttl=64 time=0.233 ms
64 bytes from 10.0.0.3: icmp_seq=7 ttl=64 time=0.229 ms
^C
--- myth10 ping statistics ---
10 packets transmitted, 6 packets received, 40.0% packet loss
round-trip min/avg/max/stddev = 0.229/0.281/0.369/0.050 ms
Will follow up as soon as I figure out the fix.

Cheers,
 
Finally read the Ubuntu bonding docs in /usr/share/doc/ifenslave-2.6/README.Debian.

The bonding syntax has changed. Ubuntu 11.04 /etc/network/interfaces now looks like this (note that I have changed from roundrobin to 802.3ad bonding):

Code:
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet static
        address 192.168.1.31
        netmask 255.255.255.0
        gateway 192.168.1.254

auto bond0
iface bond0 inet static
        address 10.0.0.3
        network 10.0.0.0
        netmask 255.255.255.0
        mtu 1500
        bond-mode 802.3ad
        bond-miimon 100
        bond-slaves none

auto eth1
iface eth1 inet manual
        bond-master bond0
        bond-primary eth1 eth2

auto eth2
iface eth2 inet manual
        bond-master bond0
        bond-primary eth1 eth2
The FreeBSD 9.1 /etc/rc.conf looks like this (lacp == 802.3ad):

Code:
ifconfig_em0="up mtu=1500"
ifconfig_em1="up mtu=1500"
cloned_interfaces="lagg0"
ifconfig_lagg0="laggproto lacp laggport em0 laggport em1"
ipv4_addrs_lagg0="10.0.0.1/24"
And it all works a treat. Jumbo packets and NFS over ZFS are next on the agenda.
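For the jumbo packet step, the change I have in mind is along these lines (untested as of this post; 9216 is the MTU I intend to try, and I believe it has to be raised on the member ports as well as on both ends of the link):
Code:
# FreeBSD /etc/rc.conf -- cloned_interfaces and ipv4_addrs_lagg0 stay as above
ifconfig_em0="up mtu 9216"
ifconfig_em1="up mtu 9216"
ifconfig_lagg0="laggproto lacp laggport em0 laggport em1 mtu 9216"

# Ubuntu /etc/network/interfaces -- in the bond0 stanza (the bonding
# driver should push the MTU down to the slave interfaces)
        mtu 9216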

Cheers,
 
You shouldn't expect increased speed from this, as a lagg between two hosts will only use one link.

You do have redundancy though, in case of a faulty interface or cable.
 
Hi,

I know that lagg/bonding will always transmit the packets of a given TCP connection on the same physical wire (to avoid packets arriving out of order), so no single connection can ever get more than what is possible with one wire. However, the community of connections should multiplex over all available wires (assuming the selected mode is not just fail-over).

I must admit that I don't know how this will pan out over NFS, but it will be fun to see.

Cheers,
 
Most link-aggregation protocols, especially the simpler ones, use either the destination MAC or the destination IP as part of the hash. In other words, all communication between the two hosts will go through one wire. The second wire will only be used if the first drops the link.
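For what it's worth, the hash layers are tunable on both sides, so including the TCP/UDP ports in the hash might at least spread several simultaneous connections. I think the knobs are the following (they only apply to the hashing modes, i.e. loadbalance/lacp on FreeBSD and 802.3ad/balance-xor on Linux, not to roundrobin), and since each direction is decided by the sending host, both ends need it:
Code:
# FreeBSD: hash on layer 3 and layer 4 (IP addresses and ports)
ifconfig lagg0 lagghash l3,l4
# Ubuntu: add to the bond0 stanza in /etc/network/interfaces
        bond-xmit-hash-policy layer3+4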
 
If I understand you correctly, the assumption is that NFS over UDP should work well with round-robin?

I may be wrong, but I think round-robin is discouraged in general. If you do some tests, it would be interesting to see the results.
 
Hi Jalla,

I will try both TCP and UDP for the NFS.
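On the client side (myth) I expect the comparison to come down to mount options along these lines (the export path and mount point are just placeholders):
Code:
[myth] sudo mount -t nfs -o vers=3,proto=tcp orac10:/tank/export /mnt/test
[myth] sudo mount -t nfs -o vers=3,proto=udp orac10:/tank/export /mnt/test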

The first task is to figure out whether any mode provides an aggregation benefit between two directly connected hosts. If that's not possible, I will try using a separate network on each wire, with an NFS mount for each.

There are a lot of variables to test - jumbo Ethernet packets, NFS (v3, v4, TCP, UDP), ZFS RAIDZ1 (SSD ZIL and L2ARC). My main interest is in sequential reading and writing.

I'll post the results, but not before next weekend.

Cheers,
 
I have now run some basic network benchmarks with lagg using 2 x 1 Gbit Intel interfaces on each host.

The executive summary is that for my situation, roundrobin (balance-rr) lagg (bonding) with 2 x 1 Gbit connections (mtu=1500) seems worthwhile.

The two hosts were:

Code:
orac: CPU: Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (3605.74-MHz K8-class CPU)
      O/S: FreeBSD 9.1-RELEASE
      Memory: 32 GB

myth: CPU: AMD Phenom(tm) II X4 840 Processor (3210.537 MHz processor)
      O/S: Ubuntu 11.04
      Memory: 4 GB
I have to confess that I only bought three Intel EXPI9301CT PCI Express x1 cards (with the Intel 82574L Ethernet chip), and that orac's em0 is an on-board Intel 82579V chip:
Code:
em0: <Intel(R) PRO/1000 Network Connection 7.3.2> port 0xf080-0xf09f mem 0xf7e00
000-0xf7e1ffff,0xf7e39000-0xf7e39fff irq 20 at device 25.0 on pci0
em0: Using an MSI interrupt
em1: <Intel(R) PRO/1000 Network Connection 7.3.2> port 0xe000-0xe01f mem 0xf7ac0
000-0xf7adffff,0xf7a00000-0xf7a7ffff,0xf7ae0000-0xf7ae3fff irq 16 at device 0.0
on pci7
em1: Using MSIX interrupts with 3 vectors

em0 can't seem to keep up with jumbo packets on the full-duplex tests, though it does better than em1 with ordinary (mtu 1500) packets on the same tests!

The two hosts were directly connected to each other with two Cat-6 Ethernet cables.

I had the interfaces configured with the following variations:

  • two stand-alone networks (see the sketch after this list);
  • one lagg/bonded LACP connection;
  • one lagg/bonded roundrobin connection;
  • mtu 1500; and
  • mtu 9K.
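The two stand-alone networks case was simply each cable on its own subnet, roughly like this (the addresses are examples; names like orac10a and myth10a in the nc commands below are just /etc/hosts aliases for the per-network addresses):
Code:
# FreeBSD /etc/rc.conf (orac10a and orac10b)
ifconfig_em0="inet 10.0.1.1 netmask 255.255.255.0"
ifconfig_em1="inet 10.0.2.1 netmask 255.255.255.0"

# Ubuntu /etc/network/interfaces (myth10a and myth10b)
auto eth1
iface eth1 inet static
        address 10.0.1.2
        netmask 255.255.255.0

auto eth2
iface eth2 inet static
        address 10.0.2.2
        netmask 255.255.255.0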
I used "systat -ifstat" to observe network throughput from orac.

I used netcat to generate the network traffic; a typical half-duplex test of one network connection looked like this:
Code:
[orac] nc -l 1234 >/dev/null
[myth] nc orac10a 1234 </dev/zero
with up to eight netcat instances running for full-duplex testing of four interfaces provisioning two networks.

Here is the most significant result:
Code:
roundrobin (balance-rr)  bonding.  2x1Gbit connections.  mtu=1500.
Full duplex read/write test of lagg0.  "lagg0 in" and "lagg0 out" matter.
[orac#138] nc -l 1234 >/dev/null
[myth#135] nc orac10a 1234 </dev/zero
[orac#129] nc myth10a 1235 </dev/zero
[myth.129] nc -l 1235 >/dev/null

Interface           Traffic               Peak                Total
  lagg0  in    206.319 MB/s        223.121 MB/s           44.634 GB
         out    82.597 MB/s         86.452 MB/s           10.877 GB

    em1  in    103.377 MB/s        111.364 MB/s           22.326 GB
         out    41.233 MB/s         43.269 MB/s            5.439 GB

    em0  in    102.949 MB/s        111.768 MB/s           22.311 GB
         out    41.365 MB/s         43.183 MB/s            5.437 GB
Sequential write bandwidth to orac ("lagg0 in") is by far the most important to me, and it's nearly double what you get with one NIC! Sequential read bandwidth ("lagg0 out") is adequate, at about 15% above what's possible with one NIC.

This compares favourably with what's possible on (the best) single interface:
Code:
No bonding.  2x1Gbit connections.  mtu=1500.
Full duplex test of em0.  "em0 in" and "em0 out" matter.
[orac#138] nc -l 1234 >/dev/null
[myth#135] nc orac10a 1234 </dev/zero
[orac#129] nc myth10a 1235 </dev/zero
[myth.129] nc -l 1235 >/dev/null

Interface           Traffic               Peak                Total
    em0  in    114.113 MB/s        116.826 MB/s           75.011 GB
         out    70.729 MB/s        110.547 MB/s           45.710 GB
Pushing the MTU to 9216 increases output at great cost to input:
Code:
roundrobin (balance-rr)  bonding.  2x1Gbit connections.  mtu=9216.
Full duplex read/write test of lagg0.  "lagg0 in" and "lagg0 out" matter.
[orac#138] nc -l 1234 >/dev/null
[myth#135] nc orac10a 1234 </dev/zero
[orac#129] nc myth10a 1235 </dev/zero
[myth.129] nc -l 1235 >/dev/null

Interface           Traffic               Peak                Total
  lagg0  in     20.164 MB/s         22.604 MB/s           34.736 GB
         out   140.395 MB/s        151.336 MB/s           14.216 GB

    em1  in     10.286 MB/s         11.382 MB/s           17.384 GB
         out    69.934 MB/s         75.563 MB/s            7.108 GB

    em0  in      9.878 MB/s         11.225 MB/s           17.359 GB
         out    70.461 MB/s         75.773 MB/s            7.108 GB
Just for the record, LACP with jumbo packets gave this (confirming that between two hosts LACP can't use more than one wire):
Code:
lacp bonding.  2x1Gbit connections.  mtu=9216.
Full duplex read/write test of lagg0.  "lagg0 in" and "lagg0 out" matter.
[orac#138] nc -l 1234 >/dev/null
[myth#135] nc orac10a 1234 </dev/zero
[orac#129] nc myth10a 1235 </dev/zero
[myth.129] nc -l 1235 >/dev/null

Interface           Traffic               Peak                Total
  lagg0  in    113.437 MB/s        118.228 MB/s           26.547 GB
         out    61.741 MB/s         70.614 MB/s            1.669 GB

    em1  in    113.439 MB/s        118.228 MB/s           26.547 GB
         out    61.741 MB/s         70.614 MB/s            1.669 GB

    em0  in      0.000 KB/s          0.023 KB/s           32.344 KB
         out     0.000 KB/s          0.023 KB/s           33.455 KB
I have a lot more numbers. Please contact me privately if you would like them.

I expect to get my SSD next week, so there is quite a lot more ZFS and NFS testing to be done.

Cheers,
 