BCM57711 on FreeBSD 9.1 final (lot of issues)
Hi!
We recently installed FreeBSD 9.1 64-bit on a Dell PowerEdge R510 system with two BCM57711 cards (four 10 Gbit interfaces in total). Currently, in testing, the filer is connected with two 10 Gbps interfaces to a 10 Gbps Dell PowerConnect switch that serves some Linux clients, which also use 10 Gbps cards. We have run into a lot of trouble trying to get this setup working.
First issue:
Without any special tweaking, when we read from or write to the NFS server from a client, the network card crashes. In the logs I can see:
Code:
Jul 19 11:49:26 filer-01-a kernel: bxe0: ---------- Begin crash dump ----------
Jul 19 11:49:26 filer-01-a kernel: bxe0: ------------------------------ Idle Check ------------------------------
Jul 19 11:49:26 filer-01-a kernel: bxe0: ERROR CFC: AC > 1 - LCID 39 CID_CAM 0x7 Value is 0xc
Jul 19 11:49:26 filer-01-a kernel: bxe0: WARNING QM: VOQ_0, VOQ credit is not equal to initial credit. Values are 0xf8 0x140
Jul 19 11:49:26 filer-01-a kernel: bxe0: WARNING QM: P0 Byte credit is not equal to initial credit. Values are 0x5a1c 0x8000
Jul 19 11:49:26 filer-01-a kernel: bxe0: WARNING CCM: XX protection CAM is not empty. Value is 0x1
Jul 19 11:49:26 filer-01-a kernel: bxe0: WARNING XCM: XX protection CAM is not empty. Value is 0x1
Jul 19 11:49:26 filer-01-a kernel: bxe0: WARNING BRB1: BRB is not empty. Value is 0x3
Jul 19 11:49:26 filer-01-a kernel: bxe0: WARNING TCM: FIC0_INIT_CRD is not 64. Value is 0x30
Jul 19 11:49:26 filer-01-a kernel: bxe0: ERROR TSEM: interrupt status 0 is not 0. Value is 0x10000
Jul 19 11:49:26 filer-01-a kernel: bxe0: ERROR CSEM: interrupt status 0 is not 0. Value is 0x10000
Jul 19 11:49:26 filer-01-a kernel: bxe0: ERROR XSEM: interrupt status 0 is not 0. Value is 0x10000
Jul 19 11:49:26 filer-01-a kernel: bxe0: bxe_idle_chk(): Failed with 4 error(s) and 0 warning(s)!
Jul 19 11:49:26 filer-01-a kernel: bxe0: ------------------------------------------------------------------------
Jul 19 11:49:26 filer-01-a kernel: bxe0: ------------------------------ Idle Check ------------------------------
Jul 19 11:49:26 filer-01-a kernel: bxe0: ERROR CFC: AC > 1 - LCID 39 CID_CAM 0x7 Value is 0xc
Jul 19 11:49:26 filer-01-a kernel: bxe0: WARNING QM: VOQ_0, VOQ credit is not equal to initial credit. Values are 0xf8 0x140
Jul 19 11:49:26 filer-01-a kernel: bxe0: WARNING QM: P0 Byte credit is not equal to initial credit. Values are 0x5a1c 0x8000
Jul 19 11:49:26 filer-01-a kernel: bxe0: WARNING CCM: XX protection CAM is not empty. Value is 0x1
Jul 19 11:49:26 filer-01-a kernel: bxe0: WARNING XCM: XX protection CAM is not empty. Value is 0x1
Jul 19 11:49:26 filer-01-a kernel: bxe0: WARNING BRB1: BRB is not empty. Value is 0x4
Jul 19 11:49:26 filer-01-a kernel: bxe0: WARNING TCM: FIC0_INIT_CRD is not 64. Value is 0x30
Jul 19 11:49:26 filer-01-a kernel: bxe0: WARNING PRS: TCM current credit is not 0. Value is 0x10
Jul 19 11:49:26 filer-01-a kernel: bxe0: ERROR TSEM: interrupt status 0 is not 0. Value is 0x10000
Jul 19 11:49:26 filer-01-a kernel: bxe0: ERROR CSEM: interrupt status 0 is not 0. Value is 0x10000
Jul 19 11:49:26 filer-01-a kernel: bxe0: ERROR XSEM: interrupt status 0 is not 0. Value is 0x10000
Jul 19 11:49:26 filer-01-a kernel: bxe0: bxe_idle_chk(): Failed with 4 error(s) and 0 warning(s)!
Jul 19 11:49:26 filer-01-a kernel: bxe0: ------------------------------------------------------------------------
Jul 19 11:49:26 filer-01-a kernel: bxe0: ---------- End crash dump ----------
Even rebooting the system is not enough: after a reboot, I still can't ping any host on the network. The crash seems to leave the cards in a bad state that only a complete power cycle clears.
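When comparing dumps like the one above, it can help to tally the ERROR and WARNING lines per interface rather than rely on the summary line, which here reports "0 warning(s)" although WARNING lines are listed. This is only a rough sketch: the function name `bxe_idle_tally` is made up for illustration, and it assumes the syslog layout shown above, where the severity keyword directly follows the `bxe0:` tag.

```shell
#!/bin/sh
# Hypothetical helper: count ERROR and WARNING lines logged for one NIC.
# Assumes the severity keyword is the field right after the "<nic>:" tag.
# Usage: bxe_idle_tally /var/log/messages bxe0
bxe_idle_tally() {
    logfile="$1"
    nic="$2"
    awk -v tag="${nic}:" '
        {
            # Locate the driver tag (e.g. "bxe0:") and inspect the next field.
            for (i = 1; i < NF; i++) {
                if ($i == tag) {
                    if ($(i + 1) == "ERROR")   errors++
                    if ($(i + 1) == "WARNING") warnings++
                    break
                }
            }
        }
        END { printf "errors=%d warnings=%d\n", errors, warnings }
    ' "$logfile"
}
```

The count covers the whole file, so a dump that contains two idle-check passes is counted twice.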
We found out that disabling tso4, txcsum and rxcsum on the cards prevents this from happening.
So, although I don't think it is a real fix, let's say we have a workaround for this, set in rc.conf like this:
Code:
ifconfig_bxe0="inet 10.50.50.11 netmask 255.255.255.0 mtu 9000 -tso4 -txcsum -rxcsum"
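To confirm that the workaround actually took effect, the `options=` line of `ifconfig bxe0` can be checked for the three capabilities. A minimal sketch, assuming the usual `options=...<CAP,CAP,...>` layout of FreeBSD's ifconfig output; the function name `offloads_still_on` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `ifconfig <nic>` output on stdin and print any of
# RXCSUM, TXCSUM or TSO4 that are still listed in the options=...<...> line.
# Usage: ifconfig bxe0 | offloads_still_on
offloads_still_on() {
    awk '
        /options=/ && !/nd6 options=/ {
            # Keep only the comma-separated capability list between < and >.
            sub(/^.*</, "")
            sub(/>.*$/, "")
            n = split($0, caps, ",")
            for (i = 1; i <= n; i++)
                if (caps[i] == "RXCSUM" || caps[i] == "TXCSUM" || caps[i] == "TSO4")
                    print caps[i]
        }
    '
}
```

Empty output means none of the three offloads is enabled. Capabilities are compared as whole names, so VLAN_HWCSUM is not mistaken for RXCSUM or TXCSUM.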
Second issue:
Issuing an ifconfig mtu 9000 on the interfaces randomly produces this error:
Code:
Jul 19 09:47:03 filer-01-a kernel: bxe0: /usr/src/sys/dev/bxe/if_bxe.c(10934): Memory allocation failure! Cannot fill fp[04] RX chain.
Jul 19 09:47:03 filer-01-a kernel: bxe0: /usr/src/sys/dev/bxe/if_bxe.c(3921): NIC initialization failed, aborting!
Jul 19 09:47:12 filer-01-a kernel: bxe3: /usr/src/sys/dev/bxe/if_bxe.c(10934): Memory allocation failure! Cannot fill fp[04] RX chain.
Jul 19 09:47:12 filer-01-a kernel: bxe3: /usr/src/sys/dev/bxe/if_bxe.c(3921): NIC initialization failed, aborting!
That sounds quite bad, and I can't reproduce it with an MTU of 1500. (But does it make sense to use an MTU of 1500 on a 10 Gbps local network?)
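One thing that may be worth checking is whether the kernel runs out of 9k jumbo mbuf clusters when the RX chains are refilled. This is an assumption on my side, since nothing in the log confirms it. The sketch below pulls the 9k "denied" counter out of `netstat -m`. It assumes the line reads like `0/0/0 requests for jumbo clusters denied (4k/9k/16k)`, and the function name `jumbo9_denied` is made up.

```shell
#!/bin/sh
# Hypothetical helper: read `netstat -m` output on stdin and print how many
# requests for 9k jumbo clusters were denied.
# Assumes a line of the form:
#   <4k>/<9k>/<16k> requests for jumbo clusters denied (4k/9k/16k)
# Usage: netstat -m | jumbo9_denied
jumbo9_denied() {
    awk '/requests for jumbo clusters denied/ {
        split($1, n, "/")   # n[1]=4k, n[2]=9k, n[3]=16k
        print n[2]
    }'
}
```

If the counter is non-zero, the current limit can be read with `sysctl kern.ipc.nmbjumbo9`. Whether raising it would help here is untested.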
Third issue, part 1:
We've tried aggregating two interfaces (each with an MTU of 9000) using LAGG, like this:
Code:
ifconfig bxe0 up -tso4 -txcsum -rxcsum mtu 9000
ifconfig bxe2 up -tso4 -txcsum -rxcsum mtu 9000
ifconfig lagg0 create
ifconfig lagg0 up laggproto failover laggport bxe0 laggport bxe2 10.50.50.11/24
This instantly crashes the kernel and causes a machine reboot. The log says:
Code:
Jul 19 09:47:12 filer-01-a kernel:
Jul 19 09:47:12 filer-01-a kernel:
Jul 19 09:47:12 filer-01-a kernel: Fatal trap 12: page fault while in kernel mode
Jul 19 09:47:12 filer-01-a kernel: cpuid = 0; apic id = 20
Jul 19 09:47:12 filer-01-a kernel: fault virtual address = 0x6d
Jul 19 09:47:12 filer-01-a kernel: fault code = supervisor read data, page not present
Jul 19 09:47:12 filer-01-a kernel: instruction pointer = 0x20:0xffffffff808d5879
Jul 19 09:47:12 filer-01-a kernel: stack pointer = 0x28:0xffffff80003227f0
--*** BOOOM REBOOT ***--
Jul 19 09:49:49 filer-01-a syslogd: kernel boot file is /boot/kernel/kernel
/var/crash/core.txt.0 returns:
Code:
Unread portion of the kernel message buffer:
Fatal trap 12: page fault while in kernel mode
cpuid = 5; apic id = 33
fault virtual address = 0x6d
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff808d5879
stack pointer = 0x28:0xffffff80003227f0
frame pointer = 0x28:0xffffff8000322820
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 12 (swi6: task queue)
trap number = 12
panic: page fault
cpuid = 5
KDB: stack backtrace:
#0 0xffffffff809208a6 at kdb_backtrace+0x66
#1 0xffffffff808ea8be at panic+0x1ce
#2 0xffffffff80bd8240 at trap_fatal+0x290
#3 0xffffffff80bd857d at trap_pfault+0x1ed
#4 0xffffffff80bd8b9e at trap+0x3ce
#5 0xffffffff80bc315f at calltrap+0x8
#6 0xffffffff8045da8c at bxe_free_buf_rings+0x4c
#7 0xffffffff8046c0d5 at bxe_init_locked+0x125
#8 0xffffffff80470cfe at bxe_ioctl+0x4fe
#9 0xffffffff8099d08f at if_setlladdr+0x1ff
#10 0xffffffff8174c94a at lagg_port_setlladdr+0x8a
#11 0xffffffff8092cf55 at taskqueue_run_locked+0x85
#12 0xffffffff8092d0da at taskqueue_run+0x3a
#13 0xffffffff808be8d4 at intr_event_execute_handlers+0x104
#14 0xffffffff808c0076 at ithread_loop+0xa6
#15 0xffffffff808bb9ef at fork_exit+0x11f
#16 0xffffffff80bc368e at fork_trampoline+0xe
Uptime: 39m41s
Dumping 1505 out of 32735 MB:..2%..11%..21%..31%..41%..52%..61%..71%..81%..91%
...cropped...
Okay, I guess this again has something to do with MTU 9000, but this time it completely panics the kernel. This is no good.
Third issue, part 2: trying bonding with the normal MTU of 1500
Code:
ifconfig bxe0 up -tso4 -txcsum -rxcsum mtu 1500
ifconfig bxe2 up -tso4 -txcsum -rxcsum mtu 1500
ifconfig lagg0 create
ifconfig lagg0 up laggproto failover laggport bxe0 laggport bxe2 10.50.50.11/24
This time, no error messages, no crash.
But even though everything seems to be correct, the bonding is not working: we can't ping any host on the network, and lagg0 reports "no carrier". See:
Code:
bxe0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM>
ether 00:10:18:98:35:f8
inet6 fe80::210:18ff:fe98:35f8%bxe0 prefixlen 64 scopeid 0x3
nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
media: Ethernet autoselect (10Gbase-SR <full-duplex>)
status: active
bxe2: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM>
ether 00:10:18:98:35:f8
inet6 fe80::210:18ff:fe95:eaa0%bxe2 prefixlen 64 scopeid 0x5
nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
media: Ethernet autoselect (10Gbase-SR <full-duplex>)
status: active
lagg0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM>
ether 00:10:18:98:35:f8
inet6 fe80::7a2b:cbff:fe1a:eab1%lagg0 prefixlen 64 scopeid 0x14
inet 10.50.50.11 netmask 0xffffff00 broadcast 10.50.50.255
nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
media: Ethernet autoselect
status: no carrier
laggproto failover lagghash l2,l3,l4
laggport: bxe2 flags=0<>
laggport: bxe0 flags=1<MASTER>
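In the output above, neither laggport carries the ACTIVE flag, which matches the "no carrier" status on lagg0. As far as I know, a working failover lagg shows something like `flags=5<MASTER,ACTIVE>` on the master port. A small sketch for spotting this in `ifconfig lagg0` output; the function name `lagg_inactive_ports` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `ifconfig lagg0` output on stdin and print every
# laggport whose flags do not include ACTIVE.
# Usage: ifconfig lagg0 | lagg_inactive_ports
lagg_inactive_ports() {
    # Fields: $1 = "laggport:", $2 = port name, $3 = flags=N<...>
    awk '$1 == "laggport:" && $3 !~ /ACTIVE/ { print $2 }'
}
```

Fed the two laggport lines above, it lists both bxe2 and bxe0.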
Please note that prior to installing FreeBSD, the machine was running 64-bit Debian 7 GNU/Linux, where we had the cards bonded with an MTU of 9000 without any crash or stability issue. So it looks to me like there is something really wrong with the Broadcom driver on FreeBSD 9.1, at least with the NICs used in Dell servers.
Given that Broadcom itself doesn't supply drivers for FreeBSD, is there any possible fix?
Thanks for your attention and your help.
Cheers,
Sébastien