Multiple pipe/queue performance issues with Dummynet

Hi all,

I've been using dummynet for a while to perform degraded network testing, and it has been really useful.

Recently, we wanted to measure its performance limits on our hardware. To do this we set up a machine with 8 interfaces paired into 4 Ethernet bridges.

We are having throughput issues when more than 2 pipes are being used simultaneously. These issues appear to be independent of the bandwidths specified.

Test setup
Code:
Machine A      Dummynet Machine (DL 360G3)       Machine B
192.168.0.1    192.168.0.2       192.168.4.2     192.168.4.1
       .1.1           .1.2              .5.2            .5.1
       .2.1           .2.2              .6.2            .6.1
       .3.1           .3.2              .7.2            .7.1
With each /24 in its own VLAN on the switch (a total of 8 VLANs).
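For completeness, the bridges were created along these lines (a sketch only; the em interface names below are placeholders for whatever the NICs are actually called):

Code:
# Pair two NICs into a bridge; repeat for bridge1-bridge3 with the
# remaining interface pairs.
ifconfig bridge0 create
ifconfig bridge0 addm em0 addm em1 up
ifconfig em0 up
ifconfig em1 up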

We generate UDP traffic on the 4 interfaces on Machine B, destined for the corresponding interfaces on Machine A. Machine A has the UDP black hole sysctl turned on to prevent ICMP port-unreachable responses.
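The black hole setting is just the stock sysctl on Machine A, i.e. something like:

Code:
# Silently drop UDP datagrams arriving at closed ports instead of
# replying with ICMP port-unreachable
sysctl net.inet.udp.blackhole=1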

The symptoms

When we specify pipes as follows:
ipfw pipe 40 config bw 20Mbit/s queue 5 delay 0ms plr 0
ipfw pipe 41 config bw 20Mbit/s queue 5 delay 0ms plr 0
ipfw pipe 42 config bw 20Mbit/s queue 5 delay 0ms plr 0
ipfw pipe 43 config bw 20Mbit/s queue 5 delay 0ms plr 0

and put each UDP stream through its own pipe via ipfw rules, we see throughput of around 10Mbit/s per receiver. Dummynet's counters show that it is dropping the packets we'd expect to see forwarded at these bandwidths.

When we specify pipes as follows:
ipfw pipe 40 config bw 50Mbit/s queue 5 delay 0ms plr 0
ipfw pipe 41 config bw 50Mbit/s queue 5 delay 0ms plr 0
ipfw pipe 42 config bw 50Mbit/s queue 5 delay 0ms plr 0
ipfw pipe 43 config bw 50Mbit/s queue 5 delay 0ms plr 0

We see throughput of around 25Mbit/s per receiver.


When we generate 3 traffic streams and have 3 pipes in place, the performance is roughly two-thirds of the expected bandwidth.

When we generate 2 traffic streams and have 2 pipes in place, the performance matches the desired bandwidth (all the way up to 650Mbit/s per stream, which is where our traffic generators start to struggle).

We can get the desired throughput in most cases with 4 pipes by using exceedingly long queues (sometimes as much as a few thousand packets), but this leads to large, unpredictable latencies which we're keen to avoid.
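To put a number on that latency: the worst-case queueing delay is roughly queue length * frame size * 8 / bandwidth (our arithmetic, using the ~1042-byte frames from our test traffic):

Code:
# e.g. a 100-slot queue at 20Mbit/s:
#   100 * 1042 bytes * 8 bits / 20 Mbit/s  =~ 42 ms of added latency when full
# so the few-thousand-packet queues mentioned above would add close to a
# second or more
ipfw pipe 40 config bw 20Mbit/s queue 100 delay 0ms plr 0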

What we've tried
This is a list of the avenues we've explored:

  • Rebuilding the kernel with device polling
  • Rebuilding the kernel with HZ=10000
  • Increasing the hash size
  • With and without io_fast
  • With and without one_pass
  • Varying Bandwidths
  • Testing with a single core processor
  • Using masks to create dynamic queues instead of separate pipes


All of which still result in roughly:

resultant bandwidth =~ desired bandwidth * (2 / number of pipes [or dynamic queues])

whenever the number of pipes is greater than 1.
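For reference, these are roughly the knobs and variations referred to above (the values shown are examples, not necessarily the ones we settled on):

Code:
# Kernel config options for polling and the higher tick rate
options DEVICE_POLLING
options HZ=10000

# dummynet/ipfw sysctls we varied
sysctl net.inet.ip.dummynet.hash_size=256
sysctl net.inet.ip.dummynet.io_fast=1
sysctl net.inet.ip.fw.one_pass=1

# Dynamic queues via a mask on a single pipe instead of four separate pipes
ipfw pipe 50 config bw 20Mbit/s queue 5 mask dst-ip 0x000000ff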

We feel we must be missing the elephant in the room with this one, and we were hoping someone here might be able to shed light on where we're going wrong.

Many thanks
 
We've tried various rule combinations. Initially we used the /24 subnets to filter, but we've also filtered on the specific source and destination IP addresses (as above) and on specific port numbers to target the traffic.

I'll post some examples when I get a chance, but specifically:
- the ipfw counters show that we get the right number of packets arriving at the right rules (and not more)
- ipfw.packet_drops increases by the number of packets that we lose but expected to see forwarded

So I don't think it's our rule set that's the problem here.
 
This is an example of the rule sets that we've tried, but we've had the same result across many variations (specifying the rules in one direction only, using specific IPs or ports, getting rid of the ICMP rules, etc.).

We even got the same result when we used dynamic queues and just two pipes.

Code:
wan_dummy# ipfw show
00050      87       4002 allow ip from any to any mac-type 0x0806 layer2
00100    8276     652832 allow ip from any to any dst-port 22
00101    9586    1073704 allow ip from any to any src-port 22
00980     600      37222 allow ip from me to me dst-port 4475
00990       0          0 deny ip from any to me dst-port 4475
01000  148178  152173287 pipe 40 ip from 192.168.0.0/24 to 192.168.4.0/24 dst-port 1-65535
01001       0          0 pipe 40 icmp from 192.168.0.0/24 to 192.168.4.0/24
01002       0          0 pipe 40 ip from 192.168.4.0/24 to 192.168.0.0/24 src-port 1-65535
01003       0          0 pipe 40 icmp from 192.168.4.0/24 to 192.168.0.0/24
01010  150705  154924740 pipe 41 ip from 192.168.1.0/24 to 192.168.5.0/24 dst-port 1-65535
01011       0          0 pipe 41 icmp from 192.168.1.0/24 to 192.168.5.0/24
01012       0          0 pipe 41 ip from 192.168.5.0/24 to 192.168.1.0/24 src-port 1-65535
01013       0          0 pipe 41 icmp from 192.168.5.0/24 to 192.168.1.0/24
01020  148002  152146056 pipe 42 ip from 192.168.2.0/24 to 192.168.6.0/24 dst-port 1-65535
01021       0          0 pipe 42 icmp from 192.168.2.0/24 to 192.168.6.0/24
01022       0          0 pipe 42 ip from 192.168.6.0/24 to 192.168.2.0/24 src-port 1-65535
01023       0          0 pipe 42 icmp from 192.168.6.0/24 to 192.168.2.0/24
01030  143805  147831540 pipe 43 ip from 192.168.3.0/24 to 192.168.7.0/24 dst-port 1-65535
01031       0          0 pipe 43 icmp from 192.168.3.0/24 to 192.168.7.0/24
01032       0          0 pipe 43 ip from 192.168.7.0/24 to 192.168.3.0/24 src-port 1-65535
01033       0          0 pipe 43 icmp from 192.168.7.0/24 to 192.168.3.0/24
60000 1072105 1098784926 allow ip from any to any mac-type 0x0800 layer2
60100    1176     535284 allow tcp from any to any
60200     608      40304 allow udp from any to any
60300     938      70280 allow icmp from any to any
65535    4368    1601473 deny ip from any to any
wan_dummy#
 
You aren't specifying the interface in any of your pipe rules (or in any of your rules, in fact). Thus the packets are being sent into the pipe twice: once on entering the interface and once on leaving it. Hence all your throughput results will be halved, which is what you are seeing.

Add the interface to the rules, and I bet you'll see an improvement. :)
 
OK, I've not tried that. I'm a little bit skeptical, as the ipfw counters indicate that the packets are only going into the pipe once, but at this point I'll try anything!

Should I be using "via bridge<x>" or do I need to tie it to a specific interface?
 
Hrm, I haven't played with bridging in FreeBSD, so I'm not sure how that works in relation to IPFW. Some digging through the man pages will be in order.

I think you need to write the rules to use the actual interfaces (xl0, em0, etc.) rather than the bridge device, and to write them so that packets are sent to the pipes on the "outgoing" interface. (That's how you do it for routers, so it should be similar for bridges.)
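Something along these lines is what I have in mind (em0/em1 below are stand-ins for your actual outgoing interfaces):

Code:
# Queue each stream in its pipe only when leaving the interface that
# faces Machine A, so each packet is shaped exactly once
ipfw add 1000 pipe 40 ip from 192.168.4.0/24 to 192.168.0.0/24 out xmit em0
ipfw add 1010 pipe 41 ip from 192.168.5.0/24 to 192.168.1.0/24 out xmit em1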
 
I've had a play with this but it doesn't seem to make any difference.

With bridging you need to ensure that the following is set:
net.link.bridge.ipfw: 1

As I understand it, this means that packets passing through the bridge(s) are passed to ipfw. We're using the host as a bridge, not as a router, so I don't think the packets get processed twice or anything like that.
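For completeness, these are the bridge knobs we've been looking at (whether the pfil_* ones matter for double-processing is a guess on our part):

Code:
# Pass bridged packets to ipfw; the pfil knobs control whether the member
# interfaces and/or the bridge interface itself run the packet filter
sysctl net.link.bridge.ipfw=1
sysctl net.link.bridge.pfil_member
sysctl net.link.bridge.pfil_bridge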

Just to emphasise how weird I think this is ...

I launch my traffic generator, sending 1000-byte UDP packets (1042 bytes on the wire) for 20 seconds at 30Mbps across just two of the bridges, and I get pretty much exactly 20Mbps through.
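(The 1042 figure is just the payload plus the usual headers:)

Code:
# 1000 (UDP payload) + 8 (UDP) + 20 (IP) + 14 (Ethernet) = 1042 bytes on the wire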

Sending End
Code:
opened file for binary write
opened file for binary write
77338 packets sent
77341 packets sent
Receiving End
Code:
0 packets captured
30 packets received by filter
0 packets dropped by kernel
0 packets captured
29 packets received by filter
0 packets dropped by kernel
48085 packets captured
48114 packets received by filter
0 packets dropped by kernel
48085 packets captured
48137 packets received by filter
0 packets dropped by kernel

But if I get the traffic generator to also send at just 8Kbps across the other two bridges at the same time, the bandwidth on the first two gets throttled to half of what I've specified ...
Sending End
Code:
opened file for binary write
opened file for binary write
opened file for binary write
opened file for binary write
73406 packets sent
20 packets sent
75778 packets sent
20 packets sent
Receiving End
Code:
20 packets captured
39 packets received by filter
0 packets dropped by kernel
20 packets captured
39 packets received by filter
0 packets dropped by kernel
24712 packets captured
24730 packets received by filter
0 packets dropped by kernel
23969 packets captured
24012 packets received by filter
0 packets dropped by kernel

So it appears that simply having these other pipes active is crucifying my performance. As mentioned above, the problem scales: I can set my pipes at 200Mbps or 1Mbps and I always get pretty much exactly half of what I ask for if the other pipes are active.

Now I can mitigate this by making the queues on my pipes longer (e.g. 200), but then I'm getting unwanted queueing delays added into my overall picture.

I'm wondering whether dummynet is allocating a fixed share of time to each of the pipes, or something like that, but I'm really in the dark and have run out of configuration options that I can think of to change.
 
I've re-worked the test so that the dummynet host is routing rather than bridging.

ipfw rules (pipes 40-43 identical, restricting to 20Mbps)
Code:
00050      0         0 allow ip from any to any mac-type 0x0806 layer2
00100     41      3188 allow ip from any to any dst-port 22
00101     42      4648 allow ip from any to any src-port 22
00980      0         0 allow ip from me to me dst-port 4475
00990      0         0 deny ip from any to me dst-port 4475
01000  77905  80086340 pipe 40 ip from any to any xmit 192.168.0.2
01010  75519  77633532 pipe 41 ip from any to any xmit 192.168.1.2
01020     20     20560 pipe 42 ip from any to any xmit 192.168.2.2
01030     20     20560 pipe 43 ip from any to any xmit 192.168.3.2
60000      0         0 allow ip from any to any mac-type 0x0800 layer2
60100      0         0 allow tcp from any to any
60200 153464 157760992 allow udp from any to any
60300      0         0 allow icmp from any to any
65535      0         0 deny ip from any to any
Sending end (sending 30Mbps)
Code:
opened file for binary write
opened file for binary write
opened file for binary write
opened file for binary write
75519 packets sent
77905 packets sent
20 packets sent
20 packets sent
Receiving End (roughly capturing 10Mbps)
Code:
20 packets captured
20 packets captured
24999 packets captured
25757 packets captured

This is the same result I got when bridging rather than routing. If I only send across two of the pipes I get 20Mbps through.
 
OK, apologies to anyone who's spent time looking at this! The problem turned out to be with our traffic generation. Once we go above two generator processes the data is sent in bursts, so that although the rate looks right on a per-second basis, at a finer granularity each stream is either idle or sending above the specified bandwidth, and the pipes drop the excess during the bursts.
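For anyone hitting the same thing, a quick way to spot this is to capture on the sending side and bin the packet timestamps into 100ms buckets (the interface and port below are placeholders):

Code:
# Print packets per 100ms bucket; large swings between neighbouring buckets
# indicate bursty sending even when the per-second average looks correct
tcpdump -i em0 -nn -tt -c 20000 udp and port 5001 2>/dev/null \
  | awk '{ b = int($1 * 10); n[b]++ } END { for (b in n) print b / 10, n[b] }' \
  | sort -n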
 