Performance troubleshooting of the FreeBSD networking stack and/or kqueue/kevent

Hi there,

First, I want to say that I understand the following questions are very broad and possibly only indirectly related to FreeBSD networking (I'm not sure). It's just that after days spent on the issue below, the only option I can see is to ask for help or a piece of advice.

There is a project called F-Stack. It glues the networking stack from FreeBSD 11 on top of DPDK: DPDK receives the packets from the network card in user space, and the FreeBSD stack then processes them, also in user space. F-Stack also provides a socket API and an epoll API which internally use FreeBSD's kqueue/kevent.
We set up a test comparing the performance of a transparent TCP proxy based on F-Stack against another one running on the standard Linux kernel. The tests ran on a KVM guest with 2 cores (Intel(R) Xeon(R) Gold 6139 CPU @ 2.30GHz) and 32 GB RAM. A 10 Gbps NIC was attached in passthrough mode.
The application-level code, the part which handles the epoll notifications and copies data between the sockets, is 100% the same in both proxies (see the sketch below). Both proxy applications are single-threaded, and in all tests we pinned them to core 1. The interrupts from the network card were pinned to the same core 1.
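
A rough sketch of that shared loop is below, just for context. This is a simplified illustration and not the actual code: the real application is C++, peer_of() is a hypothetical lookup of the other end of the proxied connection, the buffer and event-array sizes are arbitrary, and on the F-Stack build the same calls go through F-Stack's ff_-prefixed epoll wrappers instead of the native Linux ones.
Code:
/* Simplified sketch of the shared proxy loop (illustration only). */
#include <sys/epoll.h>
#include <unistd.h>

#define MAX_EVENTS 512

extern int peer_of(int fd);   /* hypothetical: find the paired socket */

static void proxy_loop(int epfd)
{
    struct epoll_event evs[MAX_EVENTS];
    char buf[16 * 1024];

    for (;;) {
        int n = epoll_wait(epfd, evs, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            int src = evs[i].data.fd;
            int dst = peer_of(src);

            ssize_t r = read(src, buf, sizeof(buf));
            if (r > 0)
                (void)write(dst, buf, (size_t)r);   /* copy data to the peer */
        }
    }
}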

Here are the test results:
1. The Linux-based proxy was able to handle about 1.7-1.8 Gbps before it started to throttle the traffic. No visible CPU usage was observed on core 0 during the tests; only core 1, where the application and the IRQs were pinned, took the load.
2. The DPDK+FreeBSD proxy was able to handle 700-800 Mbps before it started to throttle the traffic. No visible CPU usage was observed on core 0 during the tests; only core 1, where the application was pinned, took the load.
3. We did another test with the DPDK+FreeBSD proxy just to gather more information about the problem. We disabled the TCP proxy functionality and let the packets simply be IP-forwarded by the FreeBSD stack. In this test we reached up to 5 Gbps without the proxy throttling the traffic; we just don't have more traffic to redirect there at the moment.
4. We profiled the DPDK+FreeBSD proxy with Linux perf under 200 Mbps of traffic just to check whether some functionality is a visible bottleneck. If I understand the results correctly, the application spends most of its time reading packets from the network card, and after that the time is spent in the kqueue/kevent-related functionality.
Code:
# Children      Self       Samples  Command          Shared Object       Symbol                                               
# ........  ........  ............  ...............  ..................  .....................................................
#
    43.46%    39.67%          9071  xproxy.release   xproxy.release      [.] main_loop
            |          
            |--35.31%--main_loop
            |          |          
            |           --3.71%--_recv_raw_pkts_vec_avx2
            |          
            |--5.44%--0x305f6e695f676e69
            |          main_loop
            |          
             --2.68%--0
                       main_loop

    25.51%     0.00%             0  xproxy.release   xproxy.release      [.] 0x0000000000cdbc40
            |
            ---0xcdbc40
               |          
               |--5.03%--__cap_rights_set
               |          
               |--4.65%--kern_kevent
               |          
               |--3.85%--kqueue_kevent
               |          
               |--3.62%--__cap_rights_init
               |          
               |--3.45%--kern_kevent_fp
               |          
               |--1.90%--fget
               |          
               |--1.61%--uma_zalloc_arg
               |          
                --1.40%--fget_unlocked

    10.01%     0.00%             0  xproxy.release   [unknown]           [k] 0x00007fa0761d8010
            |
            ---0x7fa0761d8010
               |          
               |--4.23%--ff_kevent_do_each
               |          
               |--2.33%--net::ff_epoll_reactor_impl::process_events <-- Only this function is ours
               |          
               |--1.96%--kern_kevent
               |          
                --1.48%--ff_epoll_wait

     7.13%     7.12%          1627  xproxy.release   xproxy.release      [.] kqueue_kevent
            |          
            |--3.84%--0xcdbc40
            |          kqueue_kevent
            |          
            |--2.41%--0
            |          kqueue_kevent
            |          
             --0.88%--kqueue_kevent

     6.82%     0.00%             0  xproxy.release   [unknown]           [.] 0x0000000001010010
            |
            ---0x1010010
               |          
               |--2.40%--uma_zalloc_arg
               |          
                --1.22%--uma_zero_item

Here are the configuration options for the FreeBSD stack:
Code:
[freebsd.boot]
hz=100
fd_reserve=1024
kern.ncallout=524288
kern.sched.slice=1
kern.maxvnodes=524288
kern.ipc.nmbclusters=262144
kern.ipc.maxsockets=524000
net.inet.ip.fastforwarding=1
net.inet.tcp.syncache.hashsize=32768
net.inet.tcp.syncache.bucketlimit=32
net.inet.tcp.syncache.cachelimit=1048576
net.inet.tcp.tcbhashsize=524288
net.inet.tcp.syncache.rst_on_sock_fail=0
net.link.ifqmaxlen=4096
kern.features.inet6=0
net.inet6.ip6.auto_linklocal=0
net.inet6.ip6.accept_rtadv=2
net.inet6.icmp6.rediraccept=1
net.inet6.ip6.forwarding=0

[freebsd.sysctl]
kern.maxfiles=524288
kern.maxfilesperproc=524288
kern.ipc.soacceptqueue=4096
kern.ipc.somaxconn=4096
kern.ipc.maxsockbuf=16777216
kern.ipc.nmbclusters=262144
kern.ipc.maxsockets=524288
net.link.ether.inet.maxhold=5
net.inet.ip.redirect=0
net.inet.ip.forwarding=1
net.inet.ip.portrange.first=1025
net.inet.ip.portrange.last=65535
net.inet.ip.intr_queue_maxlen=4096
net.inet.tcp.syncache.rst_on_sock_fail=0
net.inet.tcp.rfc1323=1
net.inet.tcp.fast_finwait2_recycle=1
net.inet.tcp.sendspace=16384
net.inet.tcp.recvspace=16384
net.inet.tcp.cc.algorithm=cubic
net.inet.tcp.sendbuf_max=16777216
net.inet.tcp.recvbuf_max=16777216
net.inet.tcp.sendbuf_auto=1
net.inet.tcp.recvbuf_auto=1
net.inet.tcp.sendbuf_inc=16384
net.inet.tcp.recvbuf_inc=524288
net.inet.tcp.sack.enable=1
net.inet.tcp.msl=2000
net.inet.tcp.delayed_ack=1
net.inet.tcp.blackhole=2
net.inet.udp.blackhole=1

From the above tests and measurements I drew the following conclusions/observations (which, of course, could be wrong):
- The FreeBSD stack has no problem forwarding 5 Gbps of traffic, so the performance drop should be caused by one of the layers above it: TCP handling in the stack, the kqueue/kevent functionality, or the application's use of kqueue/kevent?
- The kqueue/kevent functionality appears in the CPU profile with much higher numbers than any application code. This could be because the application is using the kqueue in some wrong way. However, it uses it through the F-Stack epoll functions, which wrap FreeBSD's kqueue/kevent functions, and the application code with regard to epoll management is 100% the same in the two proxy applications. As far as I could see, the F-Stack wrappers over the kqueue/kevent functions are very thin and don't do anything suspicious (see the sketch after this list), but I don't have much experience with kqueue/kevent or FreeBSD as a whole.
- In the Linux proxy case, the IRQs may be handled on a given core, but the actual packet processing within the networking stack could happen on both cores, which could explain the better performance. However, we did not observe visible CPU usage on core 0 during the tests.
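
To make the kqueue side concrete (and because I ask about EV_CLEAR below), this is how I understand an epoll-style "add for read" registration translates into kevent(2). Whether F-Stack really sets EV_CLEAR (edge-triggered semantics) like this is an assumption on my part; it is exactly the kind of difference I am wondering about.
Code:
/* Illustration only: roughly an epoll_ctl(EPOLL_CTL_ADD, fd, EPOLLIN)
 * equivalent expressed with kevent(2). EV_CLEAR makes the filter behave
 * edge-triggered, unlike Linux's default level-triggered epoll. */
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>

static int watch_read(int kq, int fd, void *udata)
{
    struct kevent kev;

    EV_SET(&kev, fd, EVFILT_READ, EV_ADD | EV_CLEAR, 0, 0, udata);
    return kevent(kq, &kev, 1, NULL, 0, NULL);   /* register only, don't wait */
}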

And finally, after this long post, here are my questions:
1. Does anybody have observations or an educated guess about how much traffic I should expect the FreeBSD stack + kqueue to handle in the above scenario? Are the numbers low or expected?
2. Can anybody think of kqueue/kevent specifics, compared to Linux epoll, which could lead to worse performance? For example, the usage of the EV_CLEAR flag (see the sketch above)?
3. Are there counters in the FreeBSD stack I can check that would point me to potential bottlenecks?
4. Can somebody give me advice on what more to check/debug/profile, or which config/sysctl settings to tweak, to improve the performance of the DPDK+FreeBSD-based proxy?

Any help is appreciated!

Thanks in advance,
Pavel.