Hi all,
I've been using dummynet for a while to perform degraded-network testing, and it has been really useful.
Recently, we wanted to measure its performance limits on our hardware. To do this we set up a machine with 8 interfaces paired into 4 Ethernet bridges.
We are seeing throughput problems when more than 2 pipes are used simultaneously, and these problems appear to be independent of the bandwidths specified.
Test setup
Code:
Machine A       Dummynet machine (DL 360G3)     Machine B
192.168.0.1     192.168.0.2     192.168.4.2     192.168.4.1
.1.1            .1.2            .5.2            .5.1
.2.1            .1.3            .6.2            .6.1
.3.1            .1.4            .7.2            .7.1
Each /24 is in its own VLAN on the switch (8 VLANs in total).
We generate UDP traffic from the 4 interfaces on Machine B to the corresponding interfaces on Machine A. Machine A has the UDP blackhole option turned on to suppress ICMP port-unreachable responses.
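For reference, this is roughly how the receivers and senders are set up. The blackhole behaviour uses the standard FreeBSD sysctl; the traffic-generation commands are just an illustrative sketch using iperf (the tool, addresses, and rates are assumptions, not necessarily what we ran):

```shell
# On Machine A (the receivers): silently drop UDP to closed ports,
# so no ICMP port-unreachable replies are generated.
sysctl net.inet.udp.blackhole=1

# On Machine B: one UDP stream per interface, e.g. with iperf.
# 20M matches the per-pipe bandwidth in the first test below;
# the destination addresses are Machine A's interfaces.
iperf -u -c 192.168.0.1 -b 20M -t 60 &
iperf -u -c 192.168.1.1 -b 20M -t 60 &
iperf -u -c 192.168.2.1 -b 20M -t 60 &
iperf -u -c 192.168.3.1 -b 20M -t 60 &
```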
The symptoms
When we specify pipes as follows:
ipfw pipe 40 config bw 20Mbit/s queue 5 delay 0ms plr 0
ipfw pipe 41 config bw 20Mbit/s queue 5 delay 0ms plr 0
ipfw pipe 42 config bw 20Mbit/s queue 5 delay 0ms plr 0
ipfw pipe 43 config bw 20Mbit/s queue 5 delay 0ms plr 0
and send each UDP stream through its own pipe via ipfw rules, we see throughput of around 10 Mbit/s per receiver. Dummynet's counters show it dropping the packets we would expect to pass at these bandwidths.
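For completeness, a minimal sketch of how each stream can be tied to its own pipe, assuming classification by destination address (the rule numbers are illustrative, not our exact ruleset):

```shell
# Classify each UDP stream into its own pipe by destination address.
ipfw add 100 pipe 40 udp from any to 192.168.0.1
ipfw add 101 pipe 41 udp from any to 192.168.1.1
ipfw add 102 pipe 42 udp from any to 192.168.2.1
ipfw add 103 pipe 43 udp from any to 192.168.3.1
```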
When we specify pipes as follows:
ipfw pipe 40 config bw 50Mbit/s queue 5 delay 0ms plr 0
ipfw pipe 41 config bw 50Mbit/s queue 5 delay 0ms plr 0
ipfw pipe 42 config bw 50Mbit/s queue 5 delay 0ms plr 0
ipfw pipe 43 config bw 50Mbit/s queue 5 delay 0ms plr 0
We see throughput of around 25 Mbit/s per receiver.
When we generate 3 traffic streams through 3 pipes, the throughput is roughly two-thirds of the expected bandwidth.
When we generate 2 traffic streams through 2 pipes, the throughput matches the configured bandwidth (all the way up to 650 Mbit/s per stream, which is where our traffic generators start to struggle).
We can reach the desired throughput in most cases with 4 pipes by using exceedingly long queues (sometimes as long as a few thousand packets), but this leads to large, unpredictable latencies that we're keen to avoid.
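For context on why the long queues hurt latency: the worst-case queueing delay is queue length × packet size ÷ pipe bandwidth. A quick sanity check, assuming 1500-byte packets (the packet size is an assumption):

```shell
# Worst-case queueing delay in ms = pkts * bytes * 8 * 1000 / bw_bps
echo $(( 5 * 1500 * 8 * 1000 / 20000000 ))      # 5-pkt queue at 20 Mbit/s -> 3 ms
echo $(( 2000 * 1500 * 8 * 1000 / 20000000 ))   # 2000-pkt queue -> 1200 ms
```

So a few-thousand-packet queue buys throughput at the cost of over a second of potential delay per pipe.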
What we've tried
This is a list of the avenues we've explored:
- Rebuilding the kernel with device polling
- Rebuilding the kernel with HZ=10000
- Increasing the hash size
- With and without io_fast
- With and without one_pass
- Varying Bandwidths
- Testing with a single core processor
- Using masks to create dynamic queues instead of separate pipes
All of which still result in:
    resultant bandwidth =~ desired bandwidth * (2 / number of pipes [or dynamic queues])
whenever the number of pipes is greater than 1.
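As a worked check of that relationship against the numbers above:

```shell
# 4 pipes configured at 20 Mbit/s each: 20 * 2 / 4
echo $(( 20 * 2 / 4 ))   # 10, matching the ~10 Mbit/s we observe
# 4 pipes configured at 50 Mbit/s each: 50 * 2 / 4
echo $(( 50 * 2 / 4 ))   # 25, matching the ~25 Mbit/s we observe
```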
We feel we must be missing the elephant in the room with this one, and we were hoping someone here might be able to shed light on where we're going wrong.
Many thanks