PF: How does BSD pf performance depend on CPU frequency and L2/L3 cache size?

Hi, hardware gurus!

How exactly does BSD pf performance (in terms of low latency, high PPS, etc.) depend on bus frequency, main CPU frequency, and L2/L3 cache size in multi-package (i.e., physically multi-CPU, like Intel E5500/5600 or E5-2000 series) server systems intended to work as a border firewall with a lot of dual/quad-port PCIe 2.0+ NICs?

Note, we are not talking about running the firewall in VMs - only bare metal + FreeBSD.

Thank you for your attention and any detailed answers!
 
I'm not aware of anyone having done a benchmarking series across different CPUs, so there's not really any information there.

It may be more useful for you to tell us what sort of performance your application needs.

To give you an idea, the latest version of FreeBSD pf can push around 18 Mpps (in simple setups) on my particular hardware configuration.
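(If you want to sanity-check pps numbers on your own box, netstat can print per-second interface counters - the interface name here is illustrative:

  netstat -w 1 -I ix0

This reports packets in/out each second while a test is running.)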
 
> I'm not aware of anyone having done a benchmarking series across different CPUs, so there's not really any information there.
Or maybe that's because such information is the property of some IT company. ;)

> It may be more useful for you to tell us what sort of performance your application needs.
At the moment, I'm interested in:
1. pfSense (let's call it “a GUI for pf + some other stuff”)
2. HAProxy

Speeds of 10-30G, copper Ethernet.
Intel CPUs

> To give you an idea, the latest version of FreeBSD pf can push around 18 Mpps (in simple setups) on my particular hardware configuration.
Could you please describe your setup (of course, just drop the parts about security)? Thanks!
 
> At the moment, I'm interested in:
> 1. pfSense (let's call it “a GUI for pf + some other stuff”)
You'll want to talk to Netgate about that. They have more performance numbers than I do.

> 2. HAProxy
>
> Speeds of 10-30G, copper Ethernet.
> Intel CPUs
In 64-byte packets? 1500 bytes? 9000 bytes?

> Could you please describe your setup (of course, just drop the parts about security)? Thanks!
It's a development setup. It's entirely artificial.
 
> You'll want to talk to Netgate about that. They have more performance numbers than I do.
The fellows on the Netgate user forum don't agree with that: most of them run pfSense-based firewalls for small networks (a hotel, a campus, a 10-20 person office, etc.), fell in love with the GUI, and have no motivation to dig down into how the bus, CPU, NIC, and FreeBSD network stack actually work...
So the best of them just redirect me to the FreeBSD user forum or the same Calimet articles.

> In 64-byte packets? 1500 bytes? 9000 bytes?
1500
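(For context, some back-of-the-envelope math: with standard Ethernet overhead of 38 bytes per frame on the wire, 10 Gbit/s at 1500-byte frames is 10 × 10^9 / ((1500 + 38) × 8) ≈ 0.81 Mpps, and 30 Gbit/s is roughly 2.4 Mpps - well below the 18 Mpps figure quoted earlier.)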

> It's a development setup. It's entirely artificial.
It may be interesting anyway. ;)

Thank you!
 
> The fellows on the Netgate user forum don't agree with that: most of them run pfSense-based firewalls for small networks (a hotel, a campus, a 10-20 person office, etc.), fell in love with the GUI, and have no motivation to dig down into how the bus, CPU, NIC, and FreeBSD network stack actually work...
> So the best of them just redirect me to the FreeBSD user forum or the same Calimet articles.
They are mistaken. I contract for Netgate (but explicitly do not speak for them). They have a team dedicated to performance testing. Talk to them; they can help.
Many users run the free (Community Edition, CE) version, but Netgate has commercial offerings, including support, that are suitable for business use at a larger scale than the examples you list here.

> It may be interesting anyway. ;)
That setup uses Dell T640s (Intel Xeon Silver 4210, 2.2 GHz, 10 cores, 32 GB RAM) with Chelsio T62100 NICs. I stress that this is a development setup and not a production system with real traffic.
 

A sub-question (no need to create a separate thread, because the question is too close to this one):

How EXACTLY do the following technologies:

Intel SpeedStep
Multi-Threads
CPU Enhanced Halt (C1E)
C3/C6/C7 State Support
CPU EIST Function
AHCI
QPI Frequency
Memory Scrubs

impact FreeBSD performance as a 24/7 firewall (pf / ipfw)? (See the sysctl sketch below for how to inspect some of these from a running system.)

P.S. For example, I know for certain that with C3 enabled, a single-threaded application (like pf) lets the other cores go to sleep, which slightly speeds up the first core.
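For anyone who wants to poke at these from a running FreeBSD system: the power-management side (SpeedStep/EIST, C-states) and SMT are exposed through sysctl(8) and loader tunables. A minimal sketch - the CPU number and the chosen values are illustrative, not recommendations:

  # Current and available P-states (SpeedStep / EIST, via cpufreq(4))
  sysctl dev.cpu.0.freq dev.cpu.0.freq_levels

  # C-states offered by the firmware, and the deepest state the kernel may enter
  sysctl dev.cpu.0.cx_supported hw.acpi.cpu.cx_lowest

  # Cap idle at C1 to avoid deep-sleep wakeup latency on a latency-sensitive box
  sysctl hw.acpi.cpu.cx_lowest=C1

  # Disable SMT (“Multi-Threads”) at boot if testing shows it hurts pf throughput
  echo 'machdep.hyperthreading_allowed="0"' >> /boot/loader.conf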
 
IMHO, knowing exactly how a system will perform (compute + pf, in this case) depends on so many factors that it would be impossible to guess. When companies like Netgate give you specs on how their appliances perform, it is because they have a discrete and consistent set of hardware and variables to run tests on - and even then, you can only hope that their “real traffic” (that is, short/long-lived flows, UDP traffic, TCP optimizations based on OSes, MTU, applications, etc.) and setup (number of rules, options, clients behind it) match your real traffic and setup. If your hardware deviates from what has been tested, your best approach is to test it yourself; and if you don't have the hardware yet, look for tests on approximately similar hardware and extrapolate. Cheers.
 
> IMHO, knowing exactly how a system will perform (compute + pf, in this case) depends on so many factors that it would be impossible to guess. [...] If your hardware deviates from what has been tested, your best approach is to test it yourself; and if you don't have the hardware yet, look for tests on approximately similar hardware and extrapolate.

In general, I agree with you and the others here.

So, at some point in discussions like this, we always conclude that we have to “play” with real hardware in a lab environment.

Because we are talking about only one use case - firewalling - the test we really need is to simulate “real traffic”, change the settings (BIOS/UEFI), and take measurements.
(I write “BIOS/UEFI” because the initial message in this thread was about hardware technologies: power management (ACPI, etc.), memory management (fast scrub, patrol scrub, mirrored RAM banks, etc.), and proprietary CPU features (SpeedStep, Multi-Threads, etc.).)

What set of software would you (and maybe some others) suggest for this sort of testing?

iperf/iperf3, rdmon, ping - what else?
 
> iperf/iperf3, rdmon, ping - what else?
Can you invite the IT and marketing departments to a ping party? :D Seriously though, iperf and similar tools are good for testing throughput and latency over a small number of flows, but to simulate a real-world deployment you will need something that generates hundreds of flows per client, so it can stress the CPU, memory, and NICs. I seem to remember reading on this forum that pf is not multi-threaded, so I would also research whether multiple cores and threads will make a difference, or whether just a low number of cores and a high clock will do the trick.

I haven't tested any of these tools, but it looks like there are several, like Seagull (http://gull.sourceforge.net/index.html), that can help.
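As a crude starting point, iperf3 alone can approximate a number of concurrent clients with parallel streams (a rough sketch; the server address, stream counts, and rates are illustrative):

  # 16 parallel TCP streams through the firewall for 60 seconds
  iperf3 -c 192.0.2.10 -P 16 -t 60

  # UDP at a fixed offered load; -l 1472 keeps datagrams within a 1500-byte MTU
  iperf3 -c 192.0.2.10 -u -b 2G -P 8 -l 1472

For raw packets-per-second stress (rather than flow realism), the netmap-based pkt-gen tool that ships in the FreeBSD source tree can transmit minimum-size frames at line rate, e.g. “pkt-gen -f tx -i ix0”.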
 

Thank you again for these suggestions.

But these articles look a little bit outdated:
1. They describe a sort of synthetic “lab” setup, but in real life network flows have more of a “mobile device” character: a lot of packet loss and latency;
2. Because of 1., modern congestion control (CC) algorithms are coming: BBR2 (already implemented in FreeBSD 13.x), QUIC, RACK...

So, tuning the FreeBSD network stack primarily via NIC buffer/queue tuning is becoming outdated - most of the work now happens in drivers/kernel code.
(Because NIC developers are not fast enough to implement modern CC “directly in silicon”, for the next 2-4 years we will only see implementations at the OS level.)
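For what it's worth, the alternate TCP stacks can already be selected at run time on FreeBSD 13.x. A minimal sketch - as far as I know these modules require a kernel built with “options TCPHPTS”:

  # Load the RACK TCP stack and make it the system default
  kldload tcp_rack
  sysctl net.inet.tcp.functions_default=rack

  # List the TCP stacks currently available
  sysctl net.inet.tcp.functions_available

(Note that on a pure packet-forwarding firewall the TCP stack mostly matters for locally terminated connections, e.g. HAProxy.)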

Am I missing something?
 