[Solved] Horrendous network performance with VLAN

patpro

Active Member

Reaction score: 14
Messages: 199

Hello,

I'm running FreeBSD 9.3-RELEASE up-to-date, on two HP Proliant G6 server blades in the same enclosure. One with VLANs in the uplink, the other without VLANs.

Blade A is configured to access the network through a connection without VLANs (the link provided by Network team comes with no VLAN tagging). Its transfer rate is perfect, both up and down.

Blade A network configuration:
Code:
ifconfig_bxe0="inet x.y.z.141/24"
defaultrouter="x.y.z.1"
Code:
# ifconfig bxe0
bxe0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=507bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO>
    ether 00:17:a4:77:04:00
    inet x.y.z.141 netmask 0xffffff00 broadcast x.y.z.255
    nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
    media: Ethernet autoselect (10Gbase-CX4 <full-duplex>)
    status: active
Blade B is configured to access the network through a link carrying multiple VLANs, so I've created a network interface that uses one of these VLANs. Ping is OK, I can ssh to this server, and the transfer rate to the server (down) is not fantastic but OK, enough to perform pkg installations or a FreeBSD update. The transfer rate from the server to the rest of the world is abysmal, often stalling after a few hundred KB.

Blade B network configuration:
Code:
ifconfig_bxe0="UP"
vlans_bxe0="161"
ifconfig_bxe0_161="inet x.y.z.142/24"
defaultrouter="x.y.z.1"
Code:
# ifconfig bxe0
bxe0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=507bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO>
    ether 00:17:a4:77:04:10
    nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
    media: Ethernet autoselect (10Gbase-CX4 <full-duplex>)
    status: active
Code:
# ifconfig bxe0.161
bxe0.161: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=303<RXCSUM,TXCSUM,TSO4,TSO6>
    ether 00:17:a4:77:04:10
    inet x.y.z.142 netmask 0xffffff00 broadcast x.y.z.255
    inet6 fe80::217:a4ff:fe77:410%bxe0.161 prefixlen 64 scopeid 0x10
    nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
    media: Ethernet autoselect (10Gbase-CX4 <full-duplex>)
    status: active
    vlan: 161 parent interface: bxe0
Any idea?
 

J65nko

Well-Known Member

Reaction score: 127
Messages: 453

Does the output of # netstat -in show errors and/or collisions?

Code:
Name  Mtu Network  Address  Ipkts Ierrs Idrop  Opkts Oerrs  Coll
usbus  0 <Link#1>  0  0  0  0  0  0
vtnet  1500 <Link#2>  52:54:00:5a:58:78 27632784  0  0 19930560  0  0
vtnet  1500 95.170.82.0/2 95.170.82.241  14275816  -  - 19916521  -  -
vtnet  1500 fe80::5054:ff fe80::5054:ff:fe5  0  -  -  0  -  -
lo0  16384 <Link#3>  9883796  0  0  9883796  0  0
[snip]
Although it does not give statistics per interface, you could investigate the # netstat -s or # netstat -s -p tcp output. You would have to temporarily disable the non-VLAN NIC, save the output to a file, do the scp transfer that is slow, then rerun # netstat -s and compare the results with sdiff(1).
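The before/after comparison described above could be scripted roughly like this (a sketch; file names and the remote host are placeholders, not from the thread):

```shell
# Snapshot TCP counters, reproduce the slow transfer, snapshot again,
# then show only the counters that changed (FreeBSD; names are placeholders).
netstat -s -p tcp > /tmp/tcp-before.txt
scp BIGFILE remotehost:/dev/null
netstat -s -p tcp > /tmp/tcp-after.txt
sdiff -s /tmp/tcp-before.txt /tmp/tcp-after.txt
```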

A sample of netstat -s output:
Code:
tcp:
  29721199 packets sent
  21106850 data packets (54363701245 bytes)
  147533 data packets (196291880 bytes) retransmitted
  3561 data packets unnecessarily retransmitted
  193 resends initiated by MTU discovery
  5726127 ack-only packets (45833 delayed)
  0 URG only packets
  170 window probe packets
  642437 window update packets
  2101188 control packets
  24128695 packets received
  15049500 acks (for 54273214823 bytes)
  910363 duplicate acks
  9778 acks for unsent data
  7673670 packets (32016960097 bytes) received in-sequence
  27518 completely duplicate packets (4960174 bytes)
  3211 old duplicate packets
  218 packets with some dup. data (69851 bytes duped)
  4525 out-of-order packets (4528770 bytes)
  0 packets (0 bytes) of data after window
  0 window probes
  1285136 window update packets
  17680 packets received after close
  25 discarded for bad checksums
  0 discarded for bad header offset fields
  0 discarded because packet too short
  0 discarded due to memory problems
  618077 connection requests
  1632413 connection accepts
 
patpro (OP)

Active Member

Reaction score: 14
Messages: 199

I've done as you said, but the result is not very interesting:
Code:
    108 packets sent                     |        388 packets sent
        100 data packets (14210 bytes)             |            273 data packets (34387 bytes)
        4 data packets (3104 bytes) retransmitted     |            107 data packets (221952 bytes) retransmitted
        4 ack-only packets (2 delayed)             |            7 ack-only packets (3 delayed)
        0 control packets                 |            1 control packet
    162 packets received                     |        520 packets received
        104 acks (for 14212 bytes)             |            327 acks (for 102525 bytes)
        0 duplicate acks                 |            3 duplicate acks
        58 packets (6170 bytes) received in-sequence  |            199 packets (12049 bytes) received in-sequenc
        0 completely duplicate packets (0 bytes)      |            1 completely duplicate packet (48 bytes)
        0 packets with some dup. data (0 bytes duped) |            1 packet with some dup. data (47 bytes duped)
        0 window update packets                 |            1 window update packet
    0 connection requests                     |        1 connection request
    2 connections established (including accepts)         |        3 connections established (including accepts)
    104 segments updated rtt (of 73 attempts)         |        327 segments updated rtt (of 285 attempts)
    2 retransmit timeouts                     |        52 retransmit timeouts
    2 correct ACK header predictions             |        15 correct ACK header predictions
    4 correct data packet header predictions         |        5 correct data packet header predictions
    0 SACK recovery episodes                 |        1 SACK recovery episode
    0 segment rexmits in SACK recovery episodes         |        3 segment rexmits in SACK recovery episodes
    0 byte rexmits in SACK recovery episodes         |        4040 byte rexmits in SACK recovery episodes
    0 SACK options (SACK blocks) received             |        4 SACK options (SACK blocks) received
Left side is "before SCP start", right side is "after SCP stalled".

The # netstat -in output shows only zeros in the Ierrs, Idrop, and Oerrs columns.
 

Uniballer

Well-Known Member

Reaction score: 65
Messages: 340

Switch behavior is extremely important to VLAN implementation. So what switch is Blade B connected to? And what does the switch have to say about the stuff it is receiving from Blade B?
 

J65nko

Well-Known Member

Reaction score: 127
Messages: 453

This seems to be the only interesting thing:
Code:
2 retransmit timeouts | 52 retransmit timeouts
But that could be just a symptom of some other issue.
Besides the TCP statistics, netstat -s also provides a lot of other information, such as ICMP errors, but you did not post that.

I run a FreeBSD server in a VM at TransIP.eu and I was told to disable TCP segment offloading. From the /etc/sysctl.conf they provided:
Code:
# TransIP (2013-03-19)
# An issue in the current virtio drivers for FreeBSD (used for IO in the virtualized environment),
# specifically with TCP segment offloading (TSO), results in very poor network performance.
# FOR PROPER FUNCTIONING OF YOUR NETWORK CONNECTION, DO NOT REMOVE LINE BELOW
net.inet.tcp.tso=0
bxe(4) also describes a sysctl setting to disable TSO for this interface. Although it is a shot in the dark, you could give it a try ;)
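Neither of the two knobs was spelled out above, so here is a sketch of the generic ways to turn TSO off on FreeBSD (the ifconfig -tso flag clears both TSO4 and TSO6 on a single interface; bxe0 is the interface from this thread):

```shell
# Disable TSO globally at the TCP layer (affects new connections):
sysctl net.inet.tcp.tso=0
# Persist the setting across reboots:
echo 'net.inet.tcp.tso=0' >> /etc/sysctl.conf

# Alternatively, disable TSO on just one interface:
ifconfig bxe0 -tso
```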
 
patpro (OP)

Active Member

Reaction score: 14
Messages: 199

Switch behavior is extremely important to VLAN implementation. So what switch is Blade B connected to? And what does the switch have to say about the stuff it is receiving from Blade B?
Both blades are connected to the same switches (active/passive). I've found nothing in port and switch statistics that shows a problem.

I've made a tcpdump capture that I'm currently browsing in Wireshark. There is something odd: compared to the volume of data that went through the SCP connection (595 KB), the capture file is very large (10.8 MB). Most of it is LLC protocol traffic; I'm not sure whether that's normal. I must look more deeply into this capture.
 

kpa

Beastie's Twin

Reaction score: 1,814
Messages: 6,318

Play with the TSO4, TSO6, RXCSUM and TXCSUM options on the real interface and see if any of them make any difference in performance.
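Toggling those options one at a time, re-testing the transfer after each change, would look something like this (a sketch using standard FreeBSD ifconfig(8) flags on the bxe0 interface from this thread):

```shell
# Disable hardware offload features one at a time and re-test after each:
ifconfig bxe0 -tso        # clears both TSO4 and TSO6
ifconfig bxe0 -rxcsum     # disable receive checksum offload
ifconfig bxe0 -txcsum     # disable transmit checksum offload

# Re-enable with the same keywords without the leading '-':
ifconfig bxe0 tso rxcsum txcsum
```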
 
patpro (OP)

Active Member

Reaction score: 14
Messages: 199

Besides the TCP stuff netstat -s also provides a lot of other info like ICMP errors, but you did not post that.
ICMP stuff was full of zeros.

Play with the TSO4, TSO6, RXCSUM and TXCSUM options on the real interface and see if any of them make any difference in performance.
No difference at all. I even enabled debugging on interface bxe0, but that was not such a great idea (it locked me out of the box).
 

J65nko

Well-Known Member

Reaction score: 127
Messages: 453

A long quote from the Wikipedia Logical Link Control article:

In the seven-layer OSI model of computer networking, the logical link control (LLC) data communication protocol layer is the upper sublayer of the data link layer, which is itself layer 2. The LLC sublayer provides multiplexing mechanisms that make it possible for several network protocols (IP, IPX, Decnet and Appletalk) to coexist within a multipoint network and to be transported over the same network medium. It can also provide flow control and automatic repeat request (ARQ) error management mechanisms.

The LLC sublayer acts as an interface between the media access control (MAC) sublayer and the network layer.
Operation
The LLC sublayer is primarily concerned with:

  • Multiplexing protocols transmitted over the MAC layer (when transmitting) and decoding them (when receiving).
  • Providing node-to-node flow and error control
In today's networks, flow control and error management is typically taken care of by a transport layer protocol such as TCP, or by some application layer protocol, in an end-to-end fashion, i.e. retransmission is done from source to end destination. This implies that the need for LLC sublayer flow control and error management has reduced. LLC is consequently only a multiplexing feature in today's link layer protocols. An LLC header tells the data link layer what to do with a packet once a frame is received. It works like this: A host will receive a frame and look in the LLC header to find out to what protocol stack the packet is destined - for example, the IP protocol at the network layer or IPX. However, today most non-IP network protocols are abandoned.
So if TCP handles the error recovery, I wonder why you see so much LLC traffic in your capture.

The next section of the same Wikipedia article mentions slow networking:

Application examples
X.25 and LAPB
An LLC sublayer was a key component in early packet switching networks such as X.25 networks with the LAPB data link layer protocol, where flow control and error management were carried out in a node-to-node fashion, meaning that if an error was detected in a frame, the frame was retransmitted from one switch to next instead. This extensive handshaking between the nodes made the networks slow.
Could it be that something is wrong with the switch configuration?
 
patpro (OP)

Active Member

Reaction score: 14
Messages: 199

This is strange. The same switch (backplane) is used by these two blade servers running FreeBSD, and by about 14 other blade servers running VMware ESXi 5.x.
The ESXi blades use multiple VLANs and work perfectly. A blade running FreeBSD with no VLAN in its uplink (i.e. not sharing the same uplink as the ESXi blades) has very good network performance. The only blade with a problem is the one running FreeBSD and sharing the uplink of the ESXi blades.

Initially I thought the LLC traffic was due to the fact that I was running tcpdump on bxe0, the parent of my real network interface bxe0.161, hence capturing "noise". I've made a second capture on interface bxe0.161, and it was much smaller. This smaller capture shows many suspected TCP retransmissions and only 6 LLC packets.
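For reference, capturing on the VLAN child interface and counting LLC frames in the parent capture can be done like this (a sketch; the .pcap file names are placeholders):

```shell
# Capture on the VLAN interface itself, avoiding the parent's untagged noise;
# -s 0 captures full frames:
tcpdump -i bxe0.161 -w /tmp/vlan161.pcap -s 0

# Count LLC frames in a capture taken on the parent interface, using the
# pcap-filter 'llc' primitive:
tcpdump -r /tmp/bxe0.pcap llc | wc -l
```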

Strange results:

scp of a 347 MB file from an ESXi blade server to the FreeBSD blade (same switch, same blade chassis, same uplink):

Code:
#    scp esxi11.domain.tld:/tardisks/FILE /dev/null
Password:
FILE                                                                       100%  347MB  38.6MB/s   00:09
scp of a 347 MB file from the FreeBSD blade to an ESXi blade (same switch, same blade chassis, same uplink):

Code:
#    scp FILE  esxi11.domain.tld:/dev/null
Password:
FILE                                                                         0%  400KB 172.2KB/s - stalled -^
traceroute from FreeBSD blade to ESXi blade:

Code:
#    traceroute esxi11.domain.tld
traceroute to esxi11.domain.tld (x.y.z.151), 64 hops max, 52 byte packets
1  * * *
2  * * *
^C
traceroute from ESXi blade to FreeBSD blade:

Code:
# traceroute freebsdB.domain.tld
traceroute to freebsdB.domain.tld (x.y.z.142), 30 hops max, 40 byte packets
1  freebsdB (x.y.z.142)  0.123 ms  0.090 ms  0.078 ms
 
patpro (OP)

Active Member

Reaction score: 14
Messages: 199

I've just sent my mail to the list. Let's hope we can figure something out. I won't be too happy if I must install Linux instead :/
 
patpro (OP)

Active Member

Reaction score: 14
Messages: 199

That was fast: those guys were very helpful and my problem is now solved. Setting net.inet.tcp.tso to 0 in /etc/sysctl.conf was the solution:
Code:
# sysctl net.inet.tcp.tso
net.inet.tcp.tso: 0
# scp FILE esxi11.domain.tld:/dev/null
Password:
FILE                                  100%  347MB  19.3MB/s   00:18
(and J65nko was right with his "shot in the dark"! Next time I won't stick with ifconfig...)
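For completeness, verifying that the fix is in place could look like this (a sketch; note that since net.inet.tcp.tso is a TCP-layer toggle, the interface may still advertise the TSO capability flags in ifconfig output even though TSO is no longer used):

```shell
# Confirm the global TSO sysctl is off:
sysctl net.inet.tcp.tso

# The interface capability flags are unchanged by the sysctl; TSO4/TSO6
# may still appear here even though the TCP layer no longer uses them:
ifconfig bxe0 | grep options
```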
 

J65nko

Well-Known Member

Reaction score: 127
Messages: 453

For reference:
https://lists.freebsd.org/pipermail/freebsd-net/2014-December/040526.html

I wonder whether this issue with TSO (TCP segmentation offload) was introduced by the new congestion control modules described in Summary of five new TCP congestion control algorithms.

Code:
# ls -l /boot/kernel/cc_*
-r-xr-xr-x  2 root  wheel  24568 Jul 11 01:53 /boot/kernel/cc_cdg.ko
-r-xr-xr-x  1 root  wheel  71728 Jul 11 01:53 /boot/kernel/cc_cdg.ko.symbols
-r-xr-xr-x  2 root  wheel  18632 Jul 11 01:53 /boot/kernel/cc_chd.ko
-r-xr-xr-x  1 root  wheel  66592 Jul 11 01:53 /boot/kernel/cc_chd.ko.symbols
-r-xr-xr-x  2 root  wheel  9984 Jul 11 01:53 /boot/kernel/cc_cubic.ko
-r-xr-xr-x  1 root  wheel  35504 Jul 11 01:53 /boot/kernel/cc_cubic.ko.symbols
-r-xr-xr-x  2 root  wheel  14720 Jul 11 01:53 /boot/kernel/cc_hd.ko
-r-xr-xr-x  1 root  wheel  57640 Jul 11 01:53 /boot/kernel/cc_hd.ko.symbols
-r-xr-xr-x  2 root  wheel  12288 Jul 11 01:53 /boot/kernel/cc_htcp.ko
-r-xr-xr-x  1 root  wheel  38808 Jul 11 01:53 /boot/kernel/cc_htcp.ko.symbols
-r-xr-xr-x  2 root  wheel  15784 Jul 11 01:53 /boot/kernel/cc_vegas.ko
-r-xr-xr-x  1 root  wheel  59904 Jul 11 01:53 /boot/kernel/cc_vegas.ko.symbols
 
patpro (OP)

Active Member

Reaction score: 14
Messages: 199

I can't remember any of these modules being present in the output of kldstat on my server, but who knows...
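Checking which congestion control algorithm is actually in use does not require a cc_* module to be loaded (the default is built in); a sketch of the relevant queries:

```shell
# Which congestion control algorithms are available, and which is active:
sysctl net.inet.tcp.cc.available
sysctl net.inet.tcp.cc.algorithm

# Any cc_* modules loaded as kernel modules:
kldstat | grep cc_
```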
 