Breaking the 1 Gb/s barrier with iSCSI multipathing and LACP

Believe it or not, this is the second time I'm writing this. I had nearly completed the post when, for some reason, I thought it best to press CTRL+W without having saved a draft (don't ask why, I don't know):
http://i.stack.imgur.com/jiFfM.jpg

OK, lesson learned... So, I wanted to write about something very interesting that I have yet to find explained in detail anywhere: what is needed and what to expect. I have found many articles online that describe iSCSI multipathing, just not with LACP in the mix, and many failures but few successes. That is the gap I aim to fill here.

Here are some of the sources I've plowed through online:
http://n4f.siftusystems.com/index.php/2013/07/03/iscsi-multipathing-mpio/
http://nex7.blogspot.se/2013/03/ipmp-vs-lacp-vs-mpio.html
http://forums.freenas.org/index.php...esxi-setup-via-iscsi-having-some-issues.8557/
http://arstechnica.com/civis/viewtopic.php?t=1184984
http://etherealmind.com/iscsi-netwo...ipathing-hba-ha-high-availability-redundancy/
http://agnosticcomputing.com/2014/0...formance-oriented-zfs-box-for-hyper-v-vmware/
http://agnosticcomputing.com/2014/03/26/labworks-11-2-i-heart-the-arc-lets-pull-some-drives/
http://agnosticcomputing.com/2014/04/16/labworks-21-4-converged-hyper-v-switching-like-a-boss/
http://agnosticcomputing.com/2014/0...me-convergedswitching-for-hyper-v-now-please/

Storage specs
Code:
1x  Supermicro X8SIL-F
2x  Supermicro AOC-USAS2-L8i
2x  Supermicro CSE-M35T-1B
1x  Intel Core i5 650 3.2 GHz
4x  2 GB 1333 MHz DDR3 ECC RDIMM
8x  2 TB HDD (Mix of Seagate, Samsung, Western Digital)

FreeBSD-10.0-RELEASE-p2

Hypervisor specs
Code:
1x  Supermicro X8SIL-F
1x  Intel Xeon X3470 2.93 GHz
2x  2 GB 1333 MHz DDR3 ECC RDIMM
1x  CF to SATA Converter
1x  32 GB Compact Flash card (for boot/root)

CentOS release 6.5

Storage setup
Code:
Pool layout:
  pool: pool1
 state: ONLINE
  scan: scrub repaired 0 in 3h53m with 0 errors on Mon Jun  9 05:53:39 2014
config:

	NAME           STATE     READ WRITE CKSUM
	pool1          ONLINE       0     0     0
	  mirror-0     ONLINE       0     0     0
	    gpt/disk2  ONLINE       0     0     0
	    gpt/disk3  ONLINE       0     0     0
	  mirror-1     ONLINE       0     0     0
	    gpt/disk4  ONLINE       0     0     0
	    gpt/disk5  ONLINE       0     0     0
	  mirror-2     ONLINE       0     0     0
	    gpt/disk6  ONLINE       0     0     0
	    gpt/disk7  ONLINE       0     0     0
	  mirror-3     ONLINE       0     0     0
	    gpt/disk8  ONLINE       0     0     0
	    gpt/disk9  ONLINE       0     0     0

errors: No known data errors

Block mode:
            ashift: 12
            ashift: 12
            ashift: 12
            ashift: 12

Partition layout:
        2048  3907027080    1  disk2  (1.8T)
        2048  3907027080    1  disk3  (1.8T)
        2048  3907027080    1  disk4  (1.8T)
        2048  3907027080    1  disk5  (1.8T)
        2048  3907027080    1  disk6  (1.8T)
        2048  3907027080    1  disk7  (1.8T)
        2048  3907027080    1  disk8  (1.8T)
        2048  3907027080    1  disk9  (1.8T)

Volume creation:
# zfs create -o compress=lz4 -b 128k -s -V 500g pool1/hypervisor_1
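
If you want to double-check the 4k alignment and the zvol settings on your own box, something like this does it (just a quick sketch; this is not necessarily how the numbers above were pulled):
Code:
# zdb | grep ashift                  # expect ashift: 12 once per vdev on 4k disks
# zfs get volblocksize,compression,volsize pool1/hypervisor_1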

iSCSI target config:
/etc/ctl.conf
Code:
portal-group pg1 {
        discovery-auth-group no-authentication
        listen 172.16.10.11:3260
        listen 172.16.11.11:3261
}

target iqn.2014-06.bar.foo:storage.foo.bar:hypervisor-1 {
	auth-group no-authentication
	portal-group pg1

	lun 0 {
		path /dev/zvol/pool1/hypervisor_1
		size 500G
	}
}
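
With /etc/ctl.conf in place, the target daemon just needs to be enabled and started; standard FreeBSD service handling, nothing special:
Code:
# sysrc ctld_enable=YES             # persist across reboots
# service ctld start                # (restart after any later edits to ctl.conf)
# sockstat -4l | grep ctld          # should list 172.16.10.11:3260 and 172.16.11.11:3261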

Network setup

The switch used was a Netgear GS108T v2.
  1. Port 1,2,3,4 configured for jumbo frames
  2. Port 1,2 configured as LACP for storage (LAG 1)
  3. Port 3,4 configured as LACP for hypervisor (LAG 2)
  4. Create VLAN 10,11
  5. Set tagged VLANs 1, 10 and 11 on LAG 1 and 2 (I can't remember if this was needed, but just in case it was.)
  6. Configure LAG 1,2 for jumbo frames as well.

Storage network config

/etc/rc.conf
Code:
...
ifconfig_em0="mtu 9000 up"
ifconfig_em1="mtu 9000 up"
cloned_interfaces="lagg0 vlan1 vlan10 vlan11"
ifconfig_lagg0="up laggproto lacp laggport em0 laggport em1 lagghash l3,l4"
ifconfig_vlan1="inet 192.168.0.4 netmask 255.255.255.0 vlan 1 vlandev lagg0 mtu 1500"
ifconfig_vlan10="inet 172.16.10.11 netmask 255.255.255.0 vlan 10 vlandev lagg0 mtu 9000"
ifconfig_vlan11="inet 172.16.11.11 netmask 255.255.255.0 vlan 11 vlandev lagg0 mtu 9000"
...
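
After a service netif restart (or a reboot) it's worth making sure that LACP actually negotiated on both ports and that jumbo frames survive end to end. A quick sanity check (once the hypervisor side further down is configured as well):
Code:
# ifconfig lagg0                    # both laggports should show ACTIVE,COLLECTING,DISTRIBUTING
# ping -D -s 8972 172.16.10.10      # 9000 minus IP/ICMP headers, with don't-fragment set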

Hypervisor network config

/etc/modprobe.d/bonding.conf
Code:
alias bond0 bonding
options bond0 max_bonds=8 mode=802.3ad xmit_hash_policy=layer3+4 miimon=100 downdelay=0 updelay=0

/etc/sysconfig/network-scripts/ifcfg-eth0
Code:
NM_CONTROLLED="no"
BOOTPROTO="none"
DEVICE="eth0"
ONBOOT="yes"
USERCTL="no"
MASTER="bond0"
SLAVE="yes"

/etc/sysconfig/network-scripts/ifcfg-eth1
Code:
NM_CONTROLLED="no"
BOOTPROTO="none"
DEVICE="eth1"
ONBOOT="yes"
USERCTL="no"
MASTER="bond0"
SLAVE="yes"

/etc/sysconfig/network-scripts/ifcfg-bond0
Code:
DEVICE="bond0"
NM_CONTROLLED="no"
USERCTL="no"
BOOTPROTO="none"
BONDING_OPTS="mode=4 miimon=100 xmit_hash_policy=layer3+4"
TYPE="Ethernet"
MTU="9000"

/etc/sysconfig/network-scripts/ifcfg-bond0.1
Code:
DEVICE="bond0.1"
VLAN="yes"
BOOTPROTO="none"
NM_CONTROLLED="no"
BRIDGE="Public"
MTU="1500"

/etc/sysconfig/network-scripts/ifcfg-Public
Code:
TYPE="Bridge"
NM_CONTROLLED="no"
BOOTPROTO="none"
DEVICE="Public"
ONBOOT="yes"
IPADDR="192.168.0.9"
NETMASK="255.255.255.0"

/etc/sysconfig/network-scripts/ifcfg-bond0.10
Code:
DEVICE="bond0.10"
VLAN="yes"
BOOTPROTO="none"
NM_CONTROLLED="no"
BRIDGE="Jumbo_iSCSI_1"
MTU="9000"

/etc/sysconfig/network-scripts/ifcfg-Jumbo_iSCSI_1
Code:
TYPE="Bridge"
NM_CONTROLLED="no"
BOOTPROTO="none"
DEVICE="Jumbo_iSCSI_1"
ONBOOT="yes"
IPADDR="172.16.10.10"
NETMASK="255.255.255.0"

/etc/sysconfig/network-scripts/ifcfg-bond0.11
Code:
DEVICE="bond0.11"
VLAN="yes"
BOOTPROTO="none"
NM_CONTROLLED="no"
BRIDGE="Jumbo_iSCSI_2"
MTU="9000"

/etc/sysconfig/network-scripts/ifcfg-Jumbo_iSCSI_2
Code:
TYPE="Bridge"
NM_CONTROLLED="no"
BOOTPROTO="none"
DEVICE="Jumbo_iSCSI_2"
ONBOOT="yes"
IPADDR="172.16.11.10"
NETMASK="255.255.255.0"

# chkconfig NetworkManager off
# chkconfig network on
# service NetworkManager stop
# service network restart
Note: you can't do this over SSH of course, since it cuts your connection :)
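
Once the network service is back up, a few quick checks before moving on to iSCSI (plain bonding/bridge-utils/iproute2 stuff, nothing specific to this setup):
Code:
# cat /proc/net/bonding/bond0       # mode should read IEEE 802.3ad, both slaves MII Status: up
# brctl show                        # Public, Jumbo_iSCSI_1, Jumbo_iSCSI_2 with their bond0.X members
# ip link show bond0.10             # MTU 9000 on the iSCSI VLANs
# ping -M do -s 8972 172.16.10.11   # jumbo-sized ping with don't-fragment to the storage box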

iSCSI initiator configuration:
# yum install -y iscsi-initiator-utils device-mapper-multipath
# iscsiadm -m discovery -t sendtargets -p 172.16.10.11:3260
Code:
172.16.10.11:3260,-1 iqn.2014-06.bar.foo:storage.foo.bar:hypervisor-1
# iscsiadm -m discovery -t sendtargets -p 172.16.11.11:3261
Code:
172.16.11.11:3261,-1 iqn.2014-06.bar.foo:storage.foo.bar:hypervisor-1
# iscsiadm -m node --targetname "iqn.2014-06.bar.foo:storage.foo.bar:hypervisor-1" --portal "172.16.10.11:3260" --login
# iscsiadm -m node --targetname "iqn.2014-06.bar.foo:storage.foo.bar:hypervisor-1" --portal "172.16.11.11:3261" --login
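
Depending on what node.startup is set to in /etc/iscsi/iscsid.conf, the logins may already be restored at boot; if not, flip the nodes to automatic and make sure the service is enabled (same target and portals as above):
Code:
# iscsiadm -m node -T iqn.2014-06.bar.foo:storage.foo.bar:hypervisor-1 -p 172.16.10.11:3260 --op update -n node.startup -v automatic
# iscsiadm -m node -T iqn.2014-06.bar.foo:storage.foo.bar:hypervisor-1 -p 172.16.11.11:3261 --op update -n node.startup -v automatic
# chkconfig iscsi on
# iscsiadm -m session               # should list both sessions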

MPIO config:
/dev/sda here is my CF boot/root device so the iSCSI volumes came in as /dev/sdb and /dev/sdc. To verify this, run:
tail /var/log/messages
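
A slightly less noisy alternative to digging through the logs is asking iscsiadm which disk each session produced:
Code:
# iscsiadm -m session -P 3 | grep -E "Current Portal|Attached scsi disk"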

# scsi_id -g -u /dev/sdb
Code:
1FREEBSD_MYDEVID_0

/etc/multipath.conf
Code:
blacklist {
        devnode "sda"
}   

defaults {
        user_friendly_names	yes
}

multipaths {

        multipath {
                wwid                 "1FREEBSD_MYDEVID_0"
                alias                hypervisor-1
                path_grouping_policy multibus
                path_selector        "round-robin 0"
                no_path_retry        5        
        }

}
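
Then just make sure the multipath daemon is actually running and comes back after a reboot:
Code:
# chkconfig multipathd on
# service multipathd start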

Check multipath status:
# multipath -ll
Code:
hypervisor-1 (1FREEBSD_MYDEVID_0) dm-2 FREEBSD,CTLDISK
size=500G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=1 status=active
  |- 5:0:0:0 sdc 8:32 active ready running
  `- 4:0:0:0 sdb 8:16 active ready running

Using fdisk to partition /dev/mapper/hypervisor-1, you go like "n, [enter], [enter], [enter], [enter], w" to get a single partition covering the whole device. Done! Then create a filesystem in that partition:
# mkfs.ext4 /dev/mapper/hypervisor-1p1
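
Once it has a filesystem, you can also give it an /etc/fstab entry with _netdev if you want it mounted automatically at boot, after the network and the iSCSI sessions are up (/vmstore is just a name made up for this example):
Code:
/dev/mapper/hypervisor-1p1  /vmstore  ext4  defaults,_netdev  0 0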

Then you are free to mount that babe anywhere you want and use it for whatever. I'm using it as a VM store, and these numbers are from a virtual FreeBSD server with an 80 GB UFS drive running bonnie++:
Code:
Version  1.97       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
virtual.machine 32G   754  99 178318  47 62653  34  1346  98 196696  37 131.0   6
Latency             19592us     926ms     501ms   36170us     355ms     437ms
Version  1.97       ------Sequential Create------ --------Random Create--------
virtual.machine   -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 27388  67 +++++ +++ +++++ +++ 18330  46 +++++ +++ +++++ +++
Latency               100ms    3078us    3026us     167ms   34878us     113us

Read 'em and weep ;) Full disclosure though: for the purpose of measuring just the network throughput, I ran zfs set sync=disabled pool1/hypervisor_1 to show its full potential.
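
That is a benchmarking-only switch, so don't forget to put it back afterwards:
Code:
# zfs set sync=disabled pool1/hypervisor_1   # benchmarking only - drops sync write guarantees
# zfs set sync=standard pool1/hypervisor_1   # back to safe behaviour when done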

Last words: the caveat
For reasons unknown to me (please chip in if you know), watching the physical eth0 and eth1 in CentOS with the iftop utility, both iSCSI addresses go out through the same eth0 interface, in which case throughput is capped at 1 Gb/s. Reboot the hypervisor and look again, and both addresses go out through eth1 instead. Keep rebooting and soon enough you will see address 172.16.11.10 going out through eth0 and 172.16.10.10 through eth1 (or vice versa), and then you get 2 Gb/s throughput. No idea why that occurs, and as I said, if you know, do share :)

/Sebulon
 
Re: Breaking the 1 Gb/s barrier with iSCSI multipathing and LACP

Sebulon said:
Last words: the caveat
For reasons unknown to me (please chip in if you know), watching the physical eth0 and eth1 in CentOS with the iftop utility, both iSCSI addresses go out through the same eth0 interface, in which case throughput is capped at 1 Gb/s. Reboot the hypervisor and look again, and both addresses go out through eth1 instead. Keep rebooting and soon enough you will see address 172.16.11.10 going out through eth0 and 172.16.10.10 through eth1 (or vice versa), and then you get 2 Gb/s throughput. No idea why that occurs, and as I said, if you know, do share :)
/Sebulon

Could this be from layering two distinct solutions for providing redundancy? LACP alone distributes connections across the interfaces in the LAGG group to get double the bandwidth, so sometimes the multipath connections get distributed over both physical interfaces and sometimes they don't.
 
Re: Breaking the 1 Gb/s barrier with iSCSI multipathing and LACP

junovitch said:
Could this be from layering two distinct solutions for providing redundancy? LACP alone distributes connections across the interfaces in the LAGG group to get double the bandwidth, so sometimes the multipath connections get distributed over both physical interfaces and sometimes they don't.

Yes, that's quite possible; it may just be doing what is, in some way, intended. But I'll tell you something else: when using lagghash l2,l3 + xmit_hash_policy=layer2+3, another behaviour was observed; all addresses were distributed across the two NICs, except that the iSCSI addresses always wound up on the same interface, no matter how many times it was rebooted. Since the switch also has a part to play in this, I've ordered a Cisco SG200-08 just to find out whether switching the switch ( :) ) produces more predictable results.
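
For anyone wondering why the outcome flips at random: the bonding.txt that ships with that kernel generation describes the layer3+4 transmit hash as roughly ((source port XOR dest port) XOR ((source IP XOR dest IP) AND 0xffff)) modulo the number of slaves, if I read it right. The IP part works out identical for both iSCSI sessions here (the addresses only differ in their last bit), so with two slaves everything hinges on the parity of the ephemeral source ports the initiator happens to pick at login. A toy calculation with made-up source ports:
Code:
# (172.16.10.10 ^ 172.16.10.11) & 0xffff = 1, and likewise for the second session
# made-up ephemeral source ports 33001 and 33002, target ports 3260 and 3261:
echo $(( ((33001 ^ 3260) ^ 1) % 2 ))   # session 1 -> slave 0
echo $(( ((33002 ^ 3261) ^ 1) % 2 ))   # session 2 -> slave 0, same NIC on this "boot"
If the two source ports have the same parity the sessions split (the target ports 3260/3261 differ in their last bit), otherwise they collide, so it really is a coin toss at every reboot. It would also explain the layer2+3 behaviour: the MAC part is identical for both sessions and the IP part comes out the same, so they always hash to the same port.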

/Sebulon
 
Re: Breaking the 1 Gb/s barrier with iSCSI multipathing and LACP

Hi,

I just wanted to update that the observed behaviour is the same with the SG200-08 as with the GS108T v2; rebooting the hypervisor spreads the addresses evenly over the physical interfaces, but there's no guarantee that the iSCSI addresses end up on separate ones. Sometimes they do and sometimes they don't, which leads me to believe it's just by design. Oddly enough, the FreeBSD fileserver doesn't seem to act like this (its iSCSI addresses always wind up spread out), only the CentOS hypervisor does. I wonder what the difference is...

I'd love to know if there was a way to "tell" LACP (or whatever's in charge of this): "look, I really don't care much about the other addresses, but these ones should always go out over separate interfaces, whenever possible."

I'm also going to try changing the hashing policy in the switch and changing the servers' config to layer2+3 to see how that behaves.

/Sebulon
 