More RSS UDP tests – this time on a Dell R720

I've recently had the chance to run my RSS UDP test suite on a pair of Dell R720s. They came with on-board 10G Intel NICs (ixgbe(4) in FreeBSD), so I figured I'd run the test suite up on them.

Thank you to the Enterprise Storage Division at Dell for providing hardware for me to develop on!

The config is the same as in the previous blog post, but now I have two 8-core Sandy Bridge Xeon CPUs to play with. To simplify things (and to avoid having to solve NUMA-related issues) I'm running everything on the first CPU socket, which is also the socket the Intel NIC is attached to.

So:


  • CPU: Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz (2000.04-MHz K8-class CPU) x 2
  • RAM: 64GiB
  • HTT disabled
/boot/loader.conf:

[FONT=Courier New]# ... until ncpus is tunable, make it use 8 buckets.
net.inet.rss.bits=3
net.isr.maxthreads=8
net.isr.bindthreads=1[/FONT]
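
For reference, the bucket count follows from the bit count: net.inet.rss.bits=3 gives 2^3 = 8 buckets, one per netisr thread. Here's a minimal sketch for checking this at runtime, assuming the net.inet.rss.bits sysctl exported by the RSS code is present on your kernel:

[FONT=Courier New]/*
 * Sketch: read the RSS bit count at runtime and derive the number of
 * buckets (2^bits). Assumes the net.inet.rss.bits sysctl is present.
 */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
	int bits;
	size_t len = sizeof(bits);

	if (sysctlbyname("net.inet.rss.bits", &bits, &len, NULL, 0) != 0) {
		perror("sysctlbyname");
		exit(1);
	}
	printf("rss bits=%d, buckets=%d\n", bits, 1 << bits);
	return (0);
}[/FONT]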


This time I want to test with eight streams, so after some trial and error I found a set of client IPv4 addresses that hash into eight distinct RSS buckets (there's a sketch of how to check this after the list):

  • Server: 10.11.2.1/24
  • Client: 10.11.2.3/24, 10.11.2.2/24, 10.11.2.32/24, 10.11.2.33/24, 10.11.2.64/24, 10.11.2.65/24, 10.11.2.17/24, 10.11.2.18/24
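The "trial and error" above boils down to computing the Toeplitz hash over the source/destination pair and masking it down to net.inet.rss.bits bits. Here's an illustrative sketch; it assumes the standard Microsoft RSS key, while the key the kernel actually uses lives in sys/netinet/in_rss.c, so treat the output as a starting point rather than gospel:

[FONT=Courier New]/*
 * Illustrative only: map an IPv4 2-tuple (source, destination) to an RSS
 * bucket - Toeplitz hash over the two addresses, masked to 3 bits
 * (net.inet.rss.bits=3 == 8 buckets). The standard Microsoft RSS key is
 * assumed; the kernel's actual key is defined in sys/netinet/in_rss.c.
 */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static const uint8_t rss_key[40] = {
	0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2,
	0x41, 0x67, 0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0,
	0xd0, 0xca, 0x2b, 0xcb, 0xae, 0x7b, 0x30, 0xb4,
	0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30, 0xf2, 0x0c,
	0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa,
};

/* Toeplitz: for every input bit that's set, XOR in the 32-bit key window. */
static uint32_t
toeplitz_hash(const uint8_t *data, size_t datalen)
{
	uint32_t hash = 0, v;
	size_t i;
	int b;

	v = ((uint32_t)rss_key[0] << 24) | (rss_key[1] << 16) |
	    (rss_key[2] << 8) | rss_key[3];
	for (i = 0; i < datalen; i++) {
		for (b = 7; b >= 0; b--) {
			if (data[i] & (1U << b))
				hash ^= v;
			v <<= 1;
			if ((i + 4) < sizeof(rss_key) &&
			    (rss_key[i + 4] & (1U << b)))
				v |= 1;
		}
	}
	return (hash);
}

int
main(void)
{
	/* The client addresses from the list above, server as destination. */
	const char *clients[] = { "10.11.2.3", "10.11.2.2", "10.11.2.32",
	    "10.11.2.33", "10.11.2.64", "10.11.2.65", "10.11.2.17",
	    "10.11.2.18" };
	struct in_addr src, dst;
	uint8_t tuple[8];
	int i;

	inet_pton(AF_INET, "10.11.2.1", &dst);
	for (i = 0; i < 8; i++) {
		inet_pton(AF_INET, clients[i], &src);
		/* 2-tuple input: source address then destination address. */
		memcpy(&tuple[0], &src, 4);
		memcpy(&tuple[4], &dst, 4);
		printf("%-12s -> bucket %u\n", clients[i],
		    (unsigned)(toeplitz_hash(tuple, sizeof(tuple)) & 0x7));
	}
	return (0);
}[/FONT]

Whether these particular addresses land in buckets 0-7 depends on the key in use and on whether the NIC/stack hashes the 2-tuple or the 4-tuple, so you still end up iterating a little.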
The test setup was the same as before: the server runs one rss-udp-srv program that spawns one thread per RSS bucket, and the client side runs rss-clt programs to generate traffic - but now there are eight of them instead of four.
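
For context, here's a rough sketch of what a per-bucket receive worker looks like. It assumes the IP_RSS_LISTEN_BUCKET socket option from the RSS socket API and a one-to-one bucket-to-CPU mapping; the real rss-udp-srv in the freebsd-rss repository is the authoritative version and may do this differently:

[FONT=Courier New]/*
 * Sketch of one UDP receive worker per RSS bucket. Assumptions: the
 * IP_RSS_LISTEN_BUCKET socket option is available, and bucket N maps to
 * CPU N (the actual mapping comes from the RSS code).
 */
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/cpuset.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <pthread.h>
#include <stdint.h>
#include <string.h>
#include <err.h>

#define	NBUCKETS	8	/* 2^net.inet.rss.bits */
#define	PORT		8080	/* hypothetical test port */

static void *
bucket_worker(void *arg)
{
	int bucket = (int)(intptr_t)arg;
	struct sockaddr_in sin;
	char buf[2048];
	cpuset_t cs;
	int fd, opt = 1;

	if ((fd = socket(AF_INET, SOCK_DGRAM, 0)) < 0)
		err(1, "socket");

	/* Allow all workers to bind the same address/port. */
	if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &opt, sizeof(opt)) < 0)
		err(1, "SO_REUSEPORT");

	/* Only accept datagrams that hash into this RSS bucket. */
	if (setsockopt(fd, IPPROTO_IP, IP_RSS_LISTEN_BUCKET, &bucket,
	    sizeof(bucket)) < 0)
		err(1, "IP_RSS_LISTEN_BUCKET");

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(PORT);
	sin.sin_addr.s_addr = htonl(INADDR_ANY);
	if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0)
		err(1, "bind");

	/* Pin this thread to a CPU; assumes bucket N lives on CPU N. */
	CPU_ZERO(&cs);
	CPU_SET(bucket, &cs);
	if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1,
	    sizeof(cs), &cs) < 0)
		err(1, "cpuset_setaffinity");

	for (;;) {
		if (recv(fd, buf, sizeof(buf), 0) < 0)
			err(1, "recv");
		/* ... echo the payload back, as the test server does ... */
	}
	return (NULL);
}

int
main(void)
{
	pthread_t thr[NBUCKETS];
	int i;

	for (i = 0; i < NBUCKETS; i++)
		pthread_create(&thr[i], NULL, bucket_worker,
		    (void *)(intptr_t)i);
	for (i = 0; i < NBUCKETS; i++)
		pthread_join(thr[i], NULL);
	return (0);
}[/FONT]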

The results are what I expected: the contention is in the same place (UDP receive) and it's per-core - there's no contention between CPU cores.

Each CPU is transmitting and receiving 215,000 510-byte UDP frames a second, and it scales linearly: one CPU does 215,000 TX/RX frames a second, eight CPUs do 215,000 * 8. There's no degradation as the CPU core count increases.

That's 1.72 million packets per second in each direction. At 510 bytes a frame (1.72 million * 510 bytes * 8 bits) that's about 7 gigabits/sec in and out.

The other 8 cores are idle. Ideally we'd be able to run an application on those cores - so hopefully I can get my network / RSS library up and running enough to prototype an RSS-aware memcached and see if it'll handle this particular workload.

It's a far cry from what I think we can ultimately achieve - and yes, I know I could get more impressive-looking results with netmap, PF_RING or Intel's DPDK. What I'm trying to do is push the existing kernel networking subsystem to its limits so the issues can be exposed and fixed.

So, where's the CPU going?

In the UDP server program (pid 1620), it looks thus:

[FONT=Courier New]# pmcstat -P CPU_CLK_UNHALTED_CORE -T -w 1 -p 1620
PMC: [CPU_CLK_UNHALTED_CORE] Samples: 34298 (100.0%) , 155 unresolved

%SAMP IMAGE FUNCTION CALLERS
8.0 kernel fget_unlocked kern_sendit:4.2 kern_recvit:3.9
7.0 kernel copyout soreceive_dgram:5.6 amd64_syscall:0.9
3.6 kernel __mtx_unlock_flags ixgbe_mq_start
3.5 kernel copyin m_uiotombuf:1.8 amd64_syscall:1.2
3.4 kernel memcpy ip_output:2.9 ether_output:0.6
3.4 kernel toeplitz_hash rss_hash_ip4_2tuple
3.3 kernel bcopy rss_hash_ip4_2tuple:1.4 rss_proto_software_hash_v4:0.9
3.0 kernel _mtx_lock_spin_cooki pmclog_reserve
2.7 kernel udp_send sosend_dgram
2.5 kernel ip_output udp_send[/FONT]


In the NIC receive / transmit thread(s) (pid 12), it looks thus:

[FONT=Courier New]# pmcstat -P CPU_CLK_UNHALTED_CORE -T -w 1 -p 12

PMC: [CPU_CLK_UNHALTED_CORE] Samples: 79319 (100.0%) , 0 unresolved

%SAMP IMAGE FUNCTION CALLERS
10.3 kernel ixgbe_rxeof ixgbe_msix_que
9.3 kernel __mtx_unlock_flags ixgbe_rxeof:4.8 netisr_dispatch_src:2.1 in_pcblookup_mbuf:1.3
8.3 kernel __mtx_lock_flags ixgbe_rxeof:2.8 netisr_dispatch_src:2.4 udp_append:1.2 in_pcblookup_mbuf:1.1 knote:0.6
3.8 kernel bcmp netisr_dispatch_src
3.6 kernel uma_zalloc_arg sbappendaddr_locked_internal:2.0 m_getjcl:1.6
3.4 kernel ip_input netisr_dispatch_src
3.4 kernel lock_profile_release __mtx_unlock_flags
3.4 kernel in_pcblookup_mbuf udp_input
3.0 kernel ether_nh_input netisr_dispatch_src
2.4 kernel udp_input ip_input
2.4 kernel mb_free_ext m_freem
2.2 kernel lock_profile_obtain_ __mtx_lock_flags
2.1 kernel ixgbe_refresh_mbufs ixgbe_rxeof[/FONT]


It looks like there are some obvious optimisations to poke at (what the heck is fget_unlocked() doing up there?) and yes, copyout/copyin are really expensive but currently unavoidable. The toeplitz hash and bcopy aren't very nice either, but they're occurring in the transmit path because at the moment there's no efficient way to set both the outbound RSS hash and the RSS bucket ID when sending to a non-connected socket destination (i.e. specifying the destination IP:port as part of each send). There's also some lock contention that needs to be addressed.

The output of the netisr queue statistics looks good:

[FONT=Courier New]root@abaddon:/home/adrian/git/github/erikarn/freebsd-rss # netstat -Q
Configuration:
Setting Current Limit
Thread count 8 8
Default queue limit 256 10240
Dispatch policy direct n/a
Threads bound to CPUs enabled n/a

Protocols:
Name Proto QLimit Policy Dispatch Flags
ip 1 256 cpu hybrid C--
igmp 2 256 source default ---
rtsock 3 256 source default ---
arp 4 256 source default ---
ether 5 256 cpu direct C--
ip6 6 256 flow default ---
ip_direct 9 256 cpu hybrid C--

Workstreams:
WSID CPU Name Len WMark Disp'd HDisp'd QDrops Queued Handled
0 0 ip 0 25 0 839349259 0 49 839349308
0 0 igmp 0 0 0 0 0 0 0
0 0 rtsock 0 2 0 0 0 92 92
0 0 arp 0 0 118 0 0 0 118
0 0 ether 0 0 839349600 0 0 0 839349600
0 0 ip6 0 0 0 0 0 0 0
0 0 ip_direct 0 0 0 0 0 0 0
1 1 ip 0 20 0 829928186 0 286 829928472
1 1 igmp 0 0 0 0 0 0 0
1 1 rtsock 0 0 0 0 0 0 0
1 1 arp 0 0 0 0 0 0 0
1 1 ether 0 0 829928672 0 0 0 829928672
1 1 ip6 0 0 0 0 0 0 0
1 1 ip_direct 0 0 0 0 0 0 0
2 2 ip 0 0 0 835558437 0 0 835558437
2 2 igmp 0 0 0 0 0 0 0
2 2 rtsock 0 0 0 0 0 0 0
2 2 arp 0 0 0 0 0 0 0
2 2 ether 0 0 835558610 0 0 0 835558610
2 2 ip6 0 0 0 0 0 0 0
2 2 ip_direct 0 0 0 0 0 0 0
3 3 ip 0 1 0 850271162 0 23 850271185
3 3 igmp 0 0 0 0 0 0 0
3 3 rtsock 0 0 0 0 0 0 0
3 3 arp 0 0 0 0 0 0 0
3 3 ether 0 0 850271163 0 0 0 850271163
3 3 ip6 0 0 0 0 0 0 0
3 3 ip_direct 0 0 0 0 0 0 0
4 4 ip 0 23 0 817439448 0 345 817439793
4 4 igmp 0 0 0 0 0 0 0
4 4 rtsock 0 0 0 0 0 0 0
4 4 arp 0 0 0 0 0 0 0
4 4 ether 0 0 817439625 0 0 0 817439625
4 4 ip6 0 0 0 0 0 0 0
4 4 ip_direct 0 0 0 0 0 0 0
5 5 ip 0 19 0 817862508 0 332 817862840
5 5 igmp 0 0 0 0 0 0 0
5 5 rtsock 0 0 0 0 0 0 0
5 5 arp 0 0 0 0 0 0 0
5 5 ether 0 0 817862675 0 0 0 817862675
5 5 ip6 0 0 0 0 0 0 0
5 5 ip_direct 0 0 0 0 0 0 0
6 6 ip 0 19 0 817281399 0 457 817281856
6 6 igmp 0 0 0 0 0 0 0
6 6 rtsock 0 0 0 0 0 0 0
6 6 arp 0 0 0 0 0 0 0
6 6 ether 0 0 817281665 0 0 0 817281665
6 6 ip6 0 0 0 0 0 0 0
6 6 ip_direct 0 0 0 0 0 0 0
7 7 ip 0 0 0 813562616 0 0 813562616
7 7 igmp 0 0 0 0 0 0 0
7 7 rtsock 0 0 0 0 0 0 0
7 7 arp 0 0 0 0 0 0 0
7 7 ether 0 0 813562620 0 0 0 813562620
7 7 ip6 0 0 0 0 0 0 0
7 7 ip_direct 0 0 0 0 0 0 0
root@abaddon:/home/adrian/git/github/erikarn/freebsd-rss # [/FONT]


It looks like everything is being dispatched correctly; almost nothing is being queued and nothing is being dropped.

But yes, we're running out of socket buffers because each core is 100% pinned:

[FONT=Courier New]root@abaddon:/home/adrian/git/github/erikarn/freebsd-rss # netstat -sp udp
udp:
6773040390 datagrams received
0 with incomplete header
0 with bad data length field
0 with bad checksum
0 with no checksum
17450880 dropped due to no socket
136 broadcast/multicast datagrams undelivered
1634117674 dropped due to full socket buffers
0 not for hashed pcb
5121471700 delivered
5121471044 datagrams output
0 times multicast source filter matched[/FONT]


There's definitely room for improvement.
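
One obvious knob for buying a little headroom against those "full socket buffers" drops is a larger per-socket receive buffer. It won't fix a core pinned at 100%, but a minimal sketch (assuming the stock setsockopt(2) interface and that the size fits under kern.ipc.maxsockbuf) looks like this:

[FONT=Courier New]/*
 * Sketch: bump the receive buffer on the test socket to absorb short
 * bursts. This only buys headroom - if the core stays pinned at 100%,
 * the "dropped due to full socket buffers" counter will keep climbing.
 * The requested size must fit under kern.ipc.maxsockbuf.
 */
#include <sys/types.h>
#include <sys/socket.h>
#include <err.h>

static void
grow_rcvbuf(int fd)
{
	int sz = 4 * 1024 * 1024;	/* hypothetical 4MB receive buffer */

	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &sz, sizeof(sz)) < 0)
		err(1, "SO_RCVBUF");
}[/FONT]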