InfiniBand IPoIB link-layer address changes are causing kernel trap 12 on an iSCSI server

Hello all,

This is my first post in the FreeBSD forum.
What I have seen over the last 18 months has impressed me. The quality of the packages, the speed of the packet filter, and the concepts behind FreeBSD are reasons to switch other services of my customers to FreeBSD too. The latest nugget I found is OpenZFS, which seems to be best run on FreeBSD if you don't like Oracle's policies.

I'm an old-school Unix/Linux user with experience in Solaris, Linux and HP-UX, and for the last year and a half also FreeBSD.
I'm using FreeBSD for networking/firewalls/VPN and I'm now testing it as a ZFS iSCSI server for a Proxmox cluster.
The distribution is FreeNAS 11.2-U5, which is based on FreeBSD, so I hope it is OK that I post my InfiniBand problems here too.

I did not get any useful answers to our problems on the Mellanox, FreeNAS and Proxmox forums, which is why I'm here.

Here is the story:

We have a running Proxmox VE 5.4-13 three-node cluster with a separate 40Gbit InfiniBand dual-port card in each server for connecting to a FreeNAS iSCSI server (release FreeNAS-11.2-U5, the latest version, based on FreeBSD 11.2).
On FreeNAS we have a ConnectX-3 card with the latest firmware. We also tested a ConnectX-2 card and different firmware versions.
On Proxmox we have ConnectX-2 cards with the latest firmware.
The Grid Director 4036 switch is also running the latest firmware.

Proxmox:
For the cluster communication we have set up bond0 with two Gigabit cards.
For the VM communication we have set up bond1 with two 10Gbit SFP+ cards.

This is all generally working as it should with two exceptions:
The FreeNAS iSCSI server sometimes hits a kernel trap 12 and then reboots.
This happens more often when the I/O throughput is higher.
We cannot use the default MTU of 65520 because of connection errors.

On the FreeNAS side we also have a dual-port Mellanox InfiniBand ConnectX-3 card, which is connected to an InfiniBand switch (Grid Director 4036).
The three Proxmox cluster nodes are also connected to this switch through their InfiniBand cards.

The InfiniBand cards on both ends are configured for connected mode with an MTU of 40950 (with the default MTU of 65520 we got lots of connection errors).
We are using a multipath setup on Proxmox with two subnets for IP over InfiniBand.
This is working and we get a throughput of ~1-1.1 gigabytes per second with VMs running on each cluster node in parallel.
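
For reference, here is a rough sketch of how the reduced MTU can be applied by hand on the FreeBSD/FreeNAS side for testing (on FreeNAS the persistent value normally comes from the interface settings in the GUI, so treat this as an illustration, not our exact configuration):

# set the reduced MTU on both IPoIB interfaces (FreeBSD side)
ifconfig ib0 mtu 40950
ifconfig ib1 mtu 40950
# verify
ifconfig ib0 | grep mtu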

Sporadically we get a kernel trap on the FreeNAS server, which then reboots.
This can happen anywhere from half an hour to four days of uptime.
While the FreeNAS server is rebooting the VMs do not crash; they stall until the FreeNAS server is online again.
Nevertheless we have to fix it.


[Attached screenshot: FreeBSD-kernel-trap_packetSizeProblem.png]



I analyzed the FreeNAS crash dumps in /data/crash/ and found that the last thing happening before the kernel crashes are events like these:

<118>Fri Sep 6 05:20:07 CEST 2019
<6>arp: 10.20.24.111 moved from 80:00:02:08:fe:80:00:00:00:00:00:00:00:02:c9:03:00:09:9f:c1 to 80:00:02:09:fe:80:00:00:00:00:00:00:00:02:c9:03:00:09:9f:c2 on ib0
<6>arp: 10.20.24.110 moved from 80:00:02:08:fe:80:00:00:00:00:00:00:00:02:c9:03:00:09:20:e3 to 80:00:02:09:fe:80:00:00:00:00:00:00:00:02:c9:03:00:09:20:e4 on ib0
<6>arp: 10.20.25.111 moved from 80:00:02:09:fe:80:00:00:00:00:00:00:00:02:c9:03:00:09:9f:c2 to 80:00:02:08:fe:80:00:00:00:00:00:00:00:02:c9:03:00:09:9f:c1 on ib1
<6>arp: 10.20.25.110 moved from 80:00:02:09:fe:80:00:00:00:00:00:00:00:02:c9:03:00:09:20:e4 to 80:00:02:08:fe:80:00:00:00:00:00:00:00:02:c9:03:00:09:20:e3 on ib1
<6>arp: 10.20.24.111 moved from 80:00:02:09:fe:80:00:00:00:00:00:00:00:02:c9:03:00:09:9f:c2 to 80:00:02:08:fe:80:00:00:00:00:00:00:00:02:c9:03:00:09:9f:c1 on ib0
<4>ib0: packet len 12380 (> 2044) too long to send, dropping
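
To fish these lines out of the dumps, a simple grep over the text files in /data/crash/ is enough; the file name pattern below is an assumption and may need adjusting to whatever your system actually writes there:

ls -l /data/crash/
grep -h "arp: .* moved from" /data/crash/*.txt* 2>/dev/null | tail -n 20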


It is the same behavior every time. Both 20-octet IPoIB link-layer addresses on all three Proxmox clients change from time to time.
After that it looks like the FreeBSD server sometimes uses datagram mode for the new connections and then tries to send a large packet over this connection, which could come from a previous client request/connection in connected mode. I still have no evidence for that, but it seems to be the most logical explanation. The next step would be to dump the network traffic to get a clearer picture of what is happening; a capture sketch is below.
There is nothing interesting in the Linux logs on the Proxmox side related to the link-layer address issue.
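
A minimal capture sketch for that next step, assuming tcpdump can capture on the IPoIB interfaces on the FreeNAS box (file names and filters are only examples):

# capture full packets on both IPoIB ports into pcap files
tcpdump -i ib0 -s 0 -w /var/tmp/ib0.pcap &
tcpdump -i ib1 -s 0 -w /var/tmp/ib1.pcap &
# wait for the next "packet len ... too long to send" event, stop the captures,
# then look at the ARP traffic and any oversized frames around that time:
tcpdump -nn -e -r /var/tmp/ib0.pcap 'arp or greater 2044'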

The root cause seems to be the changing IP/link-layer addresses on the Proxmox client side, and secondarily the datagram-mode behavior on the FreeBSD side.

Maybe somebody has an idea what is happening here with the link-layer addresses and how to avoid it?

Here are more details of the setup:

On the FreeNAS side I'm using a separate subnet for each IB port and I have two portals in the iSCSI setup, one for each IP (10.20.24.100/24 & 10.20.25.100/24).

The kernel of FreeNAS:
root@freenas1[/data/crash]# uname -a
FreeBSD freenas1 11.2-STABLE FreeBSD 11.2-STABLE #0 r325575+6aad246318c(HEAD): Mon Jun 24 17:25:47 UTC 2019 root@nemesis:/freenas-releng/freenas/_BE/objs/freenas-releng/freenas/_BE/os/sys/FreeNAS.amd64 amd64

The modules loaded on FreeNAS side:
root@freenas1[/data/crash]# cat /boot/loader.conf.local
mlx4ib_load="YES" # make sure the Mellanox 4 InfiniBand kernel module is loaded
ipoib_load="YES" # make sure the IP over InfiniBand kernel module is loaded
kernel="kernel"
module_path="/boot/kernel;/boot/modules;/usr/local/modules"
kern.cam.ctl.ha_id=0

root@freenas1[/data/crash]# kldstat
Id Refs Address Size Name
1 72 0xffffffff80200000 25608a8 kernel
2 1 0xffffffff82762000 100eb0 ispfw.ko
3 1 0xffffffff82863000 f9f8 ipmi.ko
4 2 0xffffffff82873000 2d28 smbus.ko
5 1 0xffffffff82876000 8a10 freenas_sysctl.ko
6 1 0xffffffff8287f000 3aff0 mlx4ib.ko
7 1 0xffffffff828ba000 1a388 ipoib.ko
8 1 0xffffffff82d11000 32e048 vmm.ko
9 1 0xffffffff83040000 a74 nmdm.ko
10 1 0xffffffff83041000 e610 geom_mirror.ko
11 1 0xffffffff83050000 3a3c geom_multipath.ko
12 1 0xffffffff83054000 2ec dtraceall.ko
13 9 0xffffffff83055000 3acf8 dtrace.ko
14 1 0xffffffff83090000 5b8 dtmalloc.ko
15 1 0xffffffff83091000 1898 dtnfscl.ko
16 1 0xffffffff83093000 1d31 fbt.ko
17 1 0xffffffff83095000 53390 fasttrap.ko
18 1 0xffffffff830e9000 bfc sdt.ko
19 1 0xffffffff830ea000 6d80 systrace.ko
20 1 0xffffffff830f1000 6d48 systrace_freebsd32.ko
21 1 0xffffffff830f8000 f9c profile.ko
22 1 0xffffffff830f9000 13ec0 hwpmc.ko
23 1 0xffffffff8310d000 7340 t3_tom.ko
24 2 0xffffffff83115000 ab8 toecore.ko
25 1 0xffffffff83116000 ddac t4_tom.ko

Kernel running on Proxmox:
root@pvecn1:~# uname -a
Linux pvecn1 4.15.18-20-pve #1 SMP PVE 4.15.18-46 (Thu, 8 Aug 2019 10:42:06 +0200) x86_64 GNU/Linux

The modules loaded on Proxmox side:
root@pvecn1:~# cat /etc/modules-load.d/mellanox.conf
mlx4_core
mlx4_ib
mlx4_en
ib_cm
ib_core
ib_ipoib
ib_iser
ib_umad



The Infiniband network setup for example on the first Proxmox client:
# Mellanox Infiniband
auto ib0
iface ib0 inet static
address 10.20.24.110
netmask 255.255.255.0
pre-up echo connected > /sys/class/net/$IFACE/mode
#post-up /sbin/ifconfig $IFACE mtu 65520
post-up /sbin/ifconfig $IFACE mtu 40950

# Mellanox Infiniband
auto ib1
iface ib1 inet static
address 10.20.25.110
netmask 255.255.255.0
pre-up echo connected > /sys/class/net/$IFACE/mode
#post-up /sbin/ifconfig $IFACE mtu 65520
post-up /sbin/ifconfig $IFACE mtu 40950


On the Proxmox side I'm running a multipath setup.
This is the content of /etc/multipath.conf:
defaults {
polling_interval 2
path_selector "round-robin 0"
path_grouping_policy multibus
uid_attribute ID_SERIAL
rr_min_io_rq 1
rr_weight uniform
failback immediate
no_path_retry queue
user_friendly_names yes
}

...

ifconfig output for the two InfiniBand ports on the FreeNAS server looks like this:

ib0: flags=8043<UP,BROADCAST,RUNNING,MULTICAST> metric 0 mtu 40950
options=80018<VLAN_MTU,VLAN_HWTAGGING,LINKSTATE>
lladdr 80.0.2.8.fe.80.0.0.0.0.0.0.0.2.c9.3.0.3a.ed.41
inet 10.20.24.210 netmask 0xffffff00 broadcast 10.20.24.255
nd6 options=9<PERFORMNUD,IFDISABLED>
ib1: flags=8043<UP,BROADCAST,RUNNING,MULTICAST> metric 0 mtu 40950
options=80018<VLAN_MTU,VLAN_HWTAGGING,LINKSTATE>
lladdr 80.0.2.9.fe.80.0.0.0.0.0.0.0.2.c9.3.0.3a.ed.42
inet 10.20.25.210 netmask 0xffffff00 broadcast 10.20.25.255
nd6 options=9<PERFORMNUD,IFDISABLED>


ifconfig on the first Proxmox client:

ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 40950
inet 10.20.24.110 netmask 255.255.255.0 broadcast 10.20.24.255
inet6 fe80::202:c903:9:20e3 prefixlen 64 scopeid 0x20<link>
unspec 80-00-02-08-FE-80-00-00-00-00-00-00-00-00-00-00 txqueuelen 256 (UNSPEC)
RX packets 5596912 bytes 10293861835 (9.5 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 3744669 bytes 48471009082 (45.1 GiB)
TX errors 0 dropped 125 overruns 0 carrier 0 collisions 0

ib1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 40950
inet 10.20.25.110 netmask 255.255.255.0 broadcast 10.20.25.255
inet6 fe80::202:c903:9:20e4 prefixlen 64 scopeid 0x20<link>
unspec 80-00-02-09-FE-80-00-00-00-00-00-00-00-00-00-00 txqueuelen 256 (UNSPEC)
RX packets 6863837 bytes 8858149718 (8.2 GiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 6197948 bytes 96516756048 (89.8 GiB)
TX errors 0 dropped 257 overruns 0 carrier 0 collisions 0


Any hints are welcome for the following questions:
What is the root cause of the connection problems with the default MTU of 65520?
What is the root cause of the link-layer address switching?
Why does the FreeBSD ipoib driver fall back to datagram mode although connected mode is configured? Is this a bug?

Thank you in advance.

Regards, Ralf

Hopefully there are some people here with InfiniBand knowledge.
 
Hello all,

I think I have found an answer to the address change problem on the Linux clients:

I found a comment on embeddedlinux.org:
  • A Linux host replies to any ARP solicitation requests that specify a target IP address configured on any of its interfaces, even if the request was received on this host by a different interface. To make Linux behave as if addresses belong to interfaces, administrators can use the ARP_IGNORE feature described later in the section "/proc Options."
  • Hosts can experience the ARP flux problem, in which the wrong interface becomes associated with an L3 address. This problem is described in the text that follows.
Other sources:
http://www.mellanox.com/related-doc...OFED_Release_Notes-1.5.1-1.3.6_for_Oracle.txt

- When multiple vNics are connected to the same network, hosts can experience the "ARP flux" problem, in which the wrong interface becomes associated with an L3 address (FM #87335).

Workaround:

Set the following kernel configuration parameters: include the following lines in /etc/sysctl.conf and reboot the machine:
net.ipv4.conf.all.arp_ignore=1
net.ipv4.conf.all.arp_announce=2


https://downloads.openfabrics.org/O...release/OFED-1.4-docs/ipoib_release_notes.txt

3. Known Issues

1. If a host has multiple interfaces and
(a) each interface belongs to a different IP subnet,
(b) they all use the same InfiniBand Partition, and
(c) they are connected to the same IB Switch,
then the host violates the IP rule requiring different broadcast domains.
Consequently, the host may build an incorrect ARP table.

The correct setting of a multi-homed IPoIB host is achieved by using a different PKEY for each IP subnet.
If a host has multiple interfaces on the same IP subnet, then to prevent a peer from building an incorrect ARP entry (neighbor) set the net.ipv4.conf.X.arp_ignore value to 1 or 2, where X stands for the IPoIB (non-child) interfaces (e.g., ib0, ib1, etc). This causes the network stack to send ARP replies only on the interface with the IP address specified in the ARP request:

sysctl -w net.ipv4.conf.ib0.arp_ignore=1
sysctl -w net.ipv4.conf.ib1.arp_ignore=1

Or, globally,

sysctl -w net.ipv4.conf.all.arp_ignore=1

For the running kernel on each client I executed the following:
echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore; echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce; echo 1 >/proc/sys/net/ipv4/conf/ib0/arp_ignore; echo 1 >/proc/sys/net/ipv4/conf/ib1/arp_ignore

I added the corresponding post-up lines to /etc/network/interfaces to make this permanent; a sketch of what that can look like is below.
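
A sketch of those added lines in the ib0 stanza of /etc/network/interfaces (the ib1 stanza gets the same lines with conf.ib1; untested, adapt as needed):

# appended to the existing "iface ib0 inet static" stanza shown above
post-up /sbin/sysctl -w net.ipv4.conf.all.arp_ignore=1
post-up /sbin/sysctl -w net.ipv4.conf.all.arp_announce=2
post-up /sbin/sysctl -w net.ipv4.conf.ib0.arp_ignore=1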

Hopefully the kernel trap 12 is gone now.
Previously I received an ARP address change message on the server side every 1 to 15 minutes. That has stopped.
There have been no such ARP messages for 2 hours now.

Surprisingly I could also switch to an MTU of 65520, which previously did not work without lots of connection errors on each client.
On one client there is still something I have to check. I had two events like this:
connection3:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4408262456, last ping 4408263744, now 4408265024
Sep 9 04:53:45 pvecn3 kernel: [453492.780173] connection3:0: detected conn error (1022)
Sep 9 04:53:45 pvecn3 kernel: [453492.780363] scsi_io_completion: 10 callbacks suppressed

Conclusion:
FreeBSD is doing a good job, Linux was the problem child.

Regards,
Ralf
 