minimum settings for 10Gbit ix

What is/are the recommended tunable settings for an ix-based 10Gbit Base-T NIC on a busy, high-usage network? Are there any adjustments to these settings when there are two ports (ix0, ix1) and one or both are in use?

For starters,
net.isr.maxthreads="-1"

right?
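
In other words, I'm starting from something like this in /boot/loader.conf (the values are just my first guess at a baseline, not a recommendation; the netisr names are the stock FreeBSD tunables):

/boot/loader.conf
# one netisr thread per core (-1 = number of CPUs)
net.isr.maxthreads="-1"
# keep the netisr threads from migrating between cores
net.isr.bindthreads="1"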
 
One thing I don't see mentioned much is NUMA performance.
You can actually get worse performance with multiple CPUs because traffic ends up spanning NUMA domains between the sockets.
So if you are using multiple CPUs, it helps to know which PCIe slot is connected to which CPU.
CPU0 is generally more responsive, but depending on the board layout most of the PCIe slots might be assigned to only one CPU. So consider the PCIe bandwidth in use on each CPU when deciding where to place the network card.
For example, if you find that CPU0 has most of your peripherals hanging off it, try using a slot on CPU1 instead to balance the I/O load.
There is no general guide for this; you have to benchmark and move your network adapter around to see the effects.
If your Intel 10G network adapters are built onto the motherboard, there is not much you can do about it.
What hardware are you using? What other high-bandwidth peripherals are installed?
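
If it helps, this is roughly how I check where a card landed and which CPUs end up servicing it. The commands are stock FreeBSD, but the irq number and core list below are only placeholders for illustration:

pciconf -lv | grep -A3 '^ix'   # find the ix ports and their PCI addresses
vmstat -i | grep ix0           # list the interrupt vectors (queues) the card is using
cpuset -l 0-7 -x 264           # example: pin irq 264 to cores 0-7 on the socket local to that slot

The slot-to-socket mapping itself still has to come from the board manual or the BIOS.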
 
I have these settings among a few others for a "routing firewall"...

Turn off Hyper-Threading!

sysctl.conf
# Performance tuning per the Intel driver docs.
hw.intr_storm_threshold=0 # Default 1000
# Turn off flowcontrol
dev.ix.2.fc=0 # Default 3
dev.ix.3.fc=0 # Default 3

If it is a router where traffic passes through, I would also add the following in rc.conf:
Network card: -lro -tso4 -tso6 -vlanhwtso
Misc:
harvest_mask="351"
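
Assuming the ports are ix0 and ix1 (adjust the unit numbers to whatever your box actually has), the rc.conf side of that might look something like this; treat it as a sketch to adapt, not a drop-in config:

/etc/rc.conf
# disable LRO/TSO so forwarded traffic is not mangled by the offloads
ifconfig_ix0="up -lro -tso4 -tso6 -vlanhwtso"
ifconfig_ix1="up -lro -tso4 -tso6 -vlanhwtso"
# trim the entropy harvesting sources that cost cycles on a pure router
harvest_mask="351"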


I also have a few more sysctl settings, but I do not want to post them as they are specific to my setup and not necessarily good for you.

I have a Xeon-D at 2.4 GHz. I route over 9 Gbit/s through it with 10,000 active states. Typical CPU usage under a load like that is 15-40%. I have 500 rules in pf.conf.

So I think my setup delivers very well.

A tip: do not tune everything just because you can. Tune one setting at a time. Also note that increasing some buffer settings in sysctl.conf introduces latency that you do not want, so be careful...

Chelsio, for example, has different important tunables...
Regards
/Peo
 
For example, if you find that CPU0 has most of your peripherals hanging off it, try using a slot on CPU1 instead to balance the I/O load.
CPU0 and CPU1? What is this, 2005? ix NICs will use multiple CPUs, so it's not as simple as this. You can't even answer this question without knowing more about what the OP is doing. Is this a single-NIC system/server, or a router/bridge? How many CPUs? What GHz? What load? Is the NIC on board?
 
So instead of helping this user in another thread, I have to defend my words. Here it goes.
CPU0 and CPU1? What is this, 2005? ix nics will use multiple cpus so it's not as simple as this.
All PCIe slots must use a PCIe controller. Where are those controllers located? These days they sit on the CPU package itself.
So if your NIC uses PCIe lanes, every Transaction Layer Packet (TLP) must first go through that CPU's PCIe root complex and into its L3 cache.
Some motherboards spread the PCIe lanes across different physical CPUs.
On my SuperMicro X9 and X10 LGA2011 boards the PCIe slot lanes are split up between the two CPUs.
For example, PCIe slots 1, 2, 4 and 5 are assigned to CPU1's PCIe root port, and slots 3, 6 and 7 are assigned to CPU2.
Here is the relevant BIOS screen:

But on my Tyan LGA2011 board all PCIe slots are assigned to CPU1 and the motherboard peripherals are assigned to CPU2.

So ask yourself how exactly your Intel 10GbE network interface uses both CPUs. The answer is NUMA.
Remote NUMA access adds latency to the TLP's path, so it is not as fast as a single-socket implementation.
Multiple CPU cores on a single die do not use NUMA; they talk over their own internal ring bus.
So a single-CPU board will offer faster throughput than a multi-socket board.

On-die packet routing in hardware is superior to crossing sockets over NUMA, even with QPI enabled.
Writing from a device in a PCIe slot to memory on a remote NUMA node through QPI incurs latency in several ways.
Netflix is working to improve NUMA performance.
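
To make that concrete on the FreeBSD side, this is the kind of thing I would try on a two-socket box where the ix card hangs off the second socket and cores 12-23 belong to that socket; the irq and pid values are only placeholders, so check your own topology first:

sysctl vm.ndomains        # how many NUMA domains the kernel sees
vmstat -i | grep ix0      # note the irq numbers of the card's queues
cpuset -l 12-23 -x 280    # pin a queue irq to cores on the NIC's local socket
cpuset -l 12-23 -p 1234   # likewise pin the forwarding or serving process (pid 1234 is an example)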
 
I'm just wondering why anyone would buy a motherboard with multiple physical CPUs. Are you decrypting spy messages? More than 10 or 12 CPUs is overkill for most practical applications. Dual/quad-socket boards were a bad idea in 2005 and they're a bad idea now. I remember the jokers with FreeBSD 4.x running dual-socket boards; they were slower than one CPU and they had no clue.

As I mentioned, you can't give advice with the info the OP provided. There are tons of factors. Onboard 10G NICs often share PCIe lanes, so they're utterly useless for high-speed applications.
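
One quick sanity check for the shared/narrow-lanes point: look at the negotiated PCIe link of the port. The exact output varies by release, but pciconf is standard:

pciconf -lc ix0
# look for the PCI-Express capability line, e.g. "link x4(x8) speed 5.0(8.0)"
# a 10G port that negotiated a narrow or slow link will bottleneck well below line rate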
 
I'm just wondering why anyone would buy a MB with multiple physical CPUs.
Virtualization would be the number one reason for me.
If you turn off hyperthreading you are left with a small number of cores.
My 2650L v3 chips have 12 cores each. With a second socket I can increase that to 24 cores.
With that I can run 5 or 6 VMs comfortably.

I could also see a use case in software development. Compilers can really take advantage of many cores, and that speeds up development work.
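
For the compile case the win is easy to see; an ordinary parallel build like this (run from /usr/src) will happily load every core you give it:

# build the base system using one job per core
make -j $(sysctl -n hw.ncpu) buildworld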
 
There are programming languages designed explicitly to exploit multiple cores effectively, Erlang being one of the most common. I've been using multicore setups since the late 90s, for a simple reason: you could get two of last year's CPUs for lower *cost* than this year's fastest one, and the performance (at the time Windows NT, but now obviously FreeBSD) was noticeably better. Maybe not 2x better, but certainly 1.5x, and for less cost too!
 
This depends on what you're doing. Used 12-core CPUs are $225 on eBay if you want to cheap out. There's no way two 6-core CPUs are faster than one 12-core of the same GHz. Plus dual-socket motherboards are more expensive, and the memory is more expensive. If you have 10 Gb/s to move, saving a couple hundred dollars is foolhardy.
 
If you turn off hyperthreading you are left with a small number of cores.
Turning off hyperthreading seems incredibly dumb; the fewer physical cores you have per virtual machine, the more useful hyperthreading is (it's most useful with one physical core). The biggest obstacle in multicore utilization is CPU contention under load, and the more CPUs available the better, even if they're hyperthreads.

As for development, I like more GHz and fewer cores (the CPUs are cheaper, too). My main dev boxes are a quad-core 3.6 GHz E3 and a 3.5 GHz 6-core E5, and they both rock.
 
Used 12-core CPUs are $225 on eBay if you want to cheap out. There's no way two 6-core CPUs are faster than one 12-core of the same GHz.

Umm, yeah, you're doing the math wrong here. That's not what I said.

Last year's 12-core CPU is almost half the price of this year's 12-core CPU, so you can get 24 cores for a similar price if you use last year's kit. The big difference I notice these days is that power consumption is dropping. The last time I upgraded equipment, the power costs over a couple of years were what made it *cheaper* to update. I haven't taken this into account here, and maybe it makes all the difference?

WRT second-hand gear, yes, there is always a bargain to be had. I saw two full racks of prime Intel hardware going for a song recently from a bank foreclosure. If I could have figured out a way to get them into my cellar along with the 3-phase power they required, well... home heating in winter would never have been a problem again :D
 
Using this as a guide is like getting restaurant advice from a 5-year-old. Seriously. Dumbing down your entire system by disabling hyperthreading because the default settings of many Ethernet drivers are bone-headed makes no sense. Tune the card. Without card tuning (number of queues, interrupt moderation, etc.) the benchmarks are useless. They're testing the default config of drivers which are largely written by some guy who couldn't wait to be finished writing the driver. I remember sparring with Jack Vogel, who did the Intel drivers back in the day; a good guy but a terrible programmer. I rewrote the igb driver in 2010 and it was way more efficient than the stock driver.

They're also testing throughput without any relation to efficiency; using 80% of a 10-core system to achieve 98% link utilization is not better than using 80% of 2 cores to achieve 90%. One 3.5 GHz core can easily forward 10 Gb/s; spreading the load across cores isn't necessarily a good idea for every application. If you have CPUs to burn, fine. But it's no way to benchmark.
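
For the ix driver specifically, the knobs I mean are loader tunables along these lines. The names are from the legacy ixgbe driver, so check sysctl hw.ix on your release before trusting them, and benchmark one change at a time:

/boot/loader.conf
hw.ix.num_queues="4"             # limit the rx/tx queue pairs instead of one per core
hw.ix.max_interrupt_rate="16000" # cap interrupts per second per queue (interrupt moderation)
hw.ix.rx_process_limit="512"     # packets processed per rx pass (default 256)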
 
Last year's 12-core CPU is almost half the price of this year's 12-core CPU, so you can get 24 cores for a similar price if you use last year's kit.
I don't need 24 cores. I'd rather buy two systems than put 24 cores into one.
 
On a side note, I see that the em/igb driver has been replaced in FreeBSD 12 by something written by some chick from the now-defunct nextBSD. Was this done because the driver is better, or because it combined em, lem and igb into one driver? Did anyone bother to bench it? Will our systems be 10% slower because someone decided the new driver is more elegant?
 
been replaced in FreeBSD 12 by something written by some chick from the now-defunct nextBSD.
Misogyny much? Tread carefully, you're on very, very thin ice with remarks like that.

Was this done because the driver is better, or because it combined em, lem and igb into one driver? Did anyone bother to bench it? Will our systems be 10% slower because someone decided the new driver is more elegant?
 
Misogyny much? Tread carefully, you're on very, very thin ice with remarks like that.

You sound like a fool. If it was a male I would have said "some dude"; what name would you call me then? That kind of oversensitivity is so ridiculous and boring. I'm always amused that such things seem to bother girly-men more than they do actual women. It's also amusing that you seem to think that being here is some sort of privilege that requires everyone to conform to your view of the world.

iflib has nothing to do with driver performance; the entire rx/tx interface has been rewritten, so it's essentially a new driver for all igb and em cards. THIS is why upgrading is not something to just be done randomly; it's quite possible that if you're running em or igb cards and upgrade to 12, you have a whole new set of issues to benchmark and could end up with a much slower machine.
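
If anyone wants to check that before and after an upgrade, a rough way to do it, assuming iperf3 from packages on both ends and that you watch CPU cost as well as throughput (the address is just an example):

pkg install iperf3
iperf3 -s                        # on the receiver
iperf3 -c 10.0.0.1 -P 4 -t 60    # on the sender: 4 parallel streams for 60 seconds
top -SHP                         # meanwhile, watch per-CPU load and the interrupt threads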
 