Small Packets - Higher CPU Usage

Hello,

I have a dual E5-2650 system, and when dealing with small packets, let's say:

1,500,000 PPS up to 2,000,000+ PPS

the system really starts using a lot of CPU. Before this I had dual E5630s (old CPUs compared to these, which have 8 cores and 16 threads each), and the performance didn't change at all after I replaced the CPUs with the better ones.

Does anybody have a tip on how this could be solved?

I did some system profiling and found the bottlenecks in the kernel, which are:

Code:
                                                     <spontaneous>
[1]     56.2    0.00       21.62                 taskqueue_thread_loop [1]
                0.01       19.34 1216937/1216937     taskqueue_run_locked [2]
                0.03        2.24 1216937/1216937     msleep_spin [26]

-----------------------------------------------

                0.01       19.34 1216937/1216937     taskqueue_thread_loop [1]
[2]     50.3    0.01       19.34 1216937         taskqueue_run_locked [2]
                0.90       17.75 1068049/1068049     lem_handle_rxtx [3]
                0.00        0.32 1216937/1217196     wakeup [63]
                0.32        0.00 1216937/45219536     spinlock_exit <cycle 1> [6]
                0.00        0.04  148888/148888      dummynet_task [103]
                0.00        0.00 1216937/21067380     spinlock_enter [87]

-----------------------------------------------

                0.90       17.75 1068049/1068049     taskqueue_run_locked [2]
[3]     48.4    0.90       17.75 1068049         lem_handle_rxtx [3]
                1.38       12.59 8877441/8877441     ether_input [4]
                0.25        3.50 8877441/8877441     lem_get_buf [19]
                0.02        0.00 1068049/1068049     lem_enable_intr [111]
                0.01        0.00 1068049/1068049     lem_txeof [118]
                0.00        0.00       4/854         _mtx_lock_sleep [382]
                0.00        0.00       2/543         lem_start_locked [416]

-----------------------------------------------

                1.38       12.59 8877441/8877441     lem_handle_rxtx [3]
[4]     36.3    1.38       12.59 8877441         ether_input [4]
                0.21       11.32 8877441/8877441     ether_demux [7]
                0.59        0.00 8877441/8878778     bcmp [52]
                0.23        0.18 8877441/8877459     random_harvest [60]
                0.06        0.00 8877441/8877441     mac_ifnet_create_mbuf [98]

-----------------------------------------------

[5]     30.6   11.76        0.04 45219536+31167706 <cycle 1 as a whole> [5]
               11.56        0.00 21067380             spinlock_exit <cycle 1> [6]
                0.19        0.00 41152830             critical_exit <cycle 1> [79]
                0.00        0.03 6035741             _thread_lock_flags <cycle 1> [107]
                0.00        0.01 2761692             sched_switch <cycle 1> [121]
                0.00        0.00    7175             tdq_lock_pair <cycle 1> [319]
                0.00        0.00    4498             _mtx_lock_spin <cycle 1> [428]
                0.00        0.00 2761692             mi_switch <cycle 1> [800]
                0.00        0.00 2596234             thread_lock_block <cycle 1> [802]
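
For anyone who wants to reproduce a call graph like the one above, the usual kgmon(8)/gprof(1) workflow is roughly this (a sketch only, assuming a kernel built with profiling support, e.g. via config -p KERNCONF):

Code:
kgmon -r                              # reset the profile buffers
kgmon -b                              # start collecting profile data
# ... run the traffic / workload for a while ...
kgmon -h                              # stop collection
kgmon -p                              # dump the buffers into gmon.out
gprof /boot/kernel/kernel gmon.out > profile.txt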

I must add that:

1) I turned off interrupt moderation (see the sketch after this list); otherwise I was much more limited.
2) I followed the tips from here: http://wiki.freebsd.org/NetworkPerformanceTuning
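
To give an idea of the kind of settings involved (illustrative values only, not necessarily the exact lines in use; tunable names differ between driver versions):

Code:
# /boot/loader.conf
hw.ix.max_interrupt_rate=0        # ix(4): interrupt throttle; 0 is commonly used to turn moderation off
                                  # (older drivers use the hw.ixgbe prefix instead of hw.ix)
hw.em.rx_int_delay=0              # em(4)/lem(4): zero the RX/TX interrupt delays
hw.em.rx_abs_int_delay=0
hw.em.tx_int_delay=0
hw.em.tx_abs_int_delay=0

# /etc/sysctl.conf -- typical suggestions from the wiki page
net.inet.ip.fastforwarding=1      # IP fastforwarding path
kern.ipc.nmbclusters=262144       # plenty of mbuf clusters for high packet rates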

But still, take a look:

1,250,000+ PPS
Code:
last pid: 18102;  load averages:  3.30,  2.75,  1.48                                                                                                                                                                 up 0+15:00:32  10:33:17
43 processes:  2 running, 39 sleeping, 1 zombie, 1 waiting
CPU 0:   0.0% user,  0.0% nice,  0.0% system, 76.4% interrupt, 23.6% idle
CPU 1:   0.0% user,  0.0% nice,  0.0% system, 76.3% interrupt, 23.7% idle
CPU 2:   0.0% user,  0.0% nice,  0.0% system, 74.8% interrupt, 25.2% idle
CPU 3:   0.0% user,  0.0% nice,  0.0% system, 76.7% interrupt, 23.3% idle

Here is the vmstat -i output:

Code:
irq276: ix0:que 0              203917287       3584
irq277: ix0:que 1              198976921       3497
irq278: ix0:que 2              198092556       3482
irq279: ix0:que 3              218340699       3837

netisr stats:

Code:
Configuration:
Setting                          Value      Maximum
Thread count                         1            1
Default queue limit                256        10240
Direct dispatch                enabled          n/a
Forced direct dispatch         enabled          n/a
Threads bound to CPUs         disabled          n/a

Protocols:
Name   Proto QLimit Policy Flags
ip         1    256   flow   ---
igmp       2    256 source   ---
rtsock     3   4096 source   ---
arp        7    256 source   ---
ip6       10    256   flow   ---

Workstreams:
WSID CPU   Name     Len WMark   Disp'd  HDisp'd   QDrops   Queued  Handled
   0   0  ip         0     2 1992536144        0        0       84 1992536228
          igmp       0     0        0        0        0        0        0
          rtsock     0     1        0        0        0     1340     1340
          arp        0     0     3956        0        0        0     3956
          ip6        0     0        0        0        0        0        0
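
One thing worth noting in the output above: netisr is running with a single thread and forced direct dispatch, so input processing happens in the context of the driver's interrupt/taskqueue threads. If spreading that work across netisr threads is worth trying, these are the knobs involved (example values only, sized for the 4 NIC queues here; extra netisr threads only matter once dispatch is no longer purely direct):

Code:
# /boot/loader.conf
net.isr.maxthreads=4          # one netisr thread per NIC queue instead of 1
net.isr.bindthreads=1         # bind each netisr thread to its own CPU
net.isr.defaultqlimit=4096    # deeper per-protocol input queues

# /etc/sysctl.conf
net.isr.dispatch=hybrid       # move away from forced direct dispatch (direct / hybrid / deferred)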
 
Yes, and it is even worse, especially because ixgbe has modern concepts that make polling 'useless'.
 