Single Socket Boards versus Dual Sockets

Recently I bought a couple of Dual Socket LGA2011 boards. I was really enamored with seeing 48 cores available with hyperthreading.
Once that newness wore off, I started benchmarking my NVMe with diskinfo -t and noticed the speeds seemed low.

So I bought a few Single Socket LGA2011 boards to see if the PCIe bus takes a hit from spreading IO over two CPUs.

Where does NUMA play into this? Is it not very efficient?

The reason I ask is that I see CPU benchmarks where a single CPU shows ~15K but a dual-CPU run only shows ~20K [1].
So the second CPU is only contributing <50%.
Is this just a benchmarking phenomenon, or is this really how additional CPUs contribute to load sharing?

Is this just a load-related issue? Benchmarking is synthetic, and the extra cores are very useful when compiling software.
I wonder how much of a hit Quad LGA2011 takes.

[1]
 
I guess you could see it as the extra CPU bringing depth, maybe not contributing so much to overall speed.
 
You have two sockets, each containing a chip that also holds the memory interface and the PCIe interface.

Let's assume your synthetic benchmark is either single-threaded or uses many threads. On average, half the benchmark threads will start out on CPU 1 (if multithreaded, half the threads will be there; if single-threaded, the probability of it being on that CPU is 50%), and the other half on CPU 2. When the benchmark starts, it will allocate memory, which will very probably be attached to whichever CPU the threads are running on.
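
To make that concrete, here is a minimal sketch (my own, assuming Linux with libnuma installed; compile with cc numa_where.c -lnuma) that prints which CPU and which NUMA node the calling thread happens to be running on:

#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>   /* sched_getcpu() */
#include <numa.h>    /* numa_available(), numa_node_of_cpu() */

int main(void)
{
    if (numa_available() < 0) {
        printf("No NUMA support on this system.\n");
        return 0;
    }
    int cpu  = sched_getcpu();          /* CPU this thread is currently on */
    int node = numa_node_of_cpu(cpu);   /* socket/node that CPU belongs to */
    printf("running on CPU %d, NUMA node %d\n", cpu, node);
    return 0;
}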

Your disk interface (in your case NVMe, in other cases SATA or SAS HBAs) is physically attached to one CPU or the other. Given the discussion above, half of your memory is on the wrong CPU, as are (statistically) half your threads.

NUMA does figure into it, massively. Say for example your benchmark program is multi-threaded (I typically run such IO benchmarks with several thousand threads), but the memory allocation happens in the startup code, before the code forks into thousands of threads. Then all memory will be attached to the one CPU the program was running on before it forked, so on average half the threads will have to reach across to the other CPU for their memory, and will contend for that link. So you either have to write code that is NUMA-aware, or you have to do the memory management after forking into threads (which is much harder, because now all the threads need to coordinate).
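
To illustrate the allocate-before-fork trap, here is a minimal sketch (my own, assuming Linux's default first-touch policy, under which a page is placed on the node of the CPU that first writes it; compile with cc first_touch.c -lpthread). Buffer A is first written by the main thread before the workers exist, so its pages sit on the main thread's node; buffer B is first written inside each worker, so its pages end up local to wherever that worker runs:

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NTHREADS 4
#define BUFSIZE  (64UL * 1024 * 1024)   /* 64 MiB per worker */

static char *shared_a[NTHREADS];        /* first touched by main() */

static void *worker(void *arg)
{
    long id = (long)arg;

    /* Buffer B: allocated and first written here, so under first-touch its
     * pages are placed on the node this worker thread is running on.       */
    char *local_b = malloc(BUFSIZE);
    memset(local_b, 0, BUFSIZE);

    /* Buffer A was already written by main(); if this worker landed on the
     * other socket, these accesses cross the inter-CPU link.                */
    memset(shared_a[id], 1, BUFSIZE);

    free(local_b);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];

    for (long i = 0; i < NTHREADS; i++) {
        shared_a[i] = malloc(BUFSIZE);
        memset(shared_a[i], 0, BUFSIZE);   /* first touch: pages go to main's node */
    }
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (long i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}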

The link between the CPUs is slow ... not in absolute terms, but compared to the bandwidth between each individual CPU and the memory and IO attached to that individual CPU. Now you see why having two sockets for your benchmark can really screw you up?

Here is what you should do (theoretically, in practice it is nearly impossible): Buy two of everything: network interfaces, storage IO interfaces, memory, and CPU. Attach all disks dual-ported to both IO interfaces (that can be done with SAS disks). Whenever a logical operation wants to do IO, it makes an informed decision which memory to use, which disk interface to use, and over which network interface to send the data. The logical operation then gets queued on a thread that happens to be running on the CPU that minimizes the number of trips between the CPUs (threads mostly have CPU affinity). This is a lot of work to implement.
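
A rough sketch of the node-local part of that idea (assuming Linux with libnuma, a hypothetical do_node_local_io() worker, and that the NVMe hangs off node 0; the real node for a given device can be read from its numa_node attribute in sysfs or seen with lstopo; compile with cc node_local.c -lnuma):

#include <numa.h>
#include <stdio.h>
#include <string.h>

#define IO_BUF_SIZE (1UL * 1024 * 1024)

/* Hypothetical worker: keep both the thread and its buffer on one node. */
static void do_node_local_io(int node)
{
    /* Restrict this thread to the CPUs of that node... */
    if (numa_run_on_node(node) != 0) {
        perror("numa_run_on_node");
        return;
    }
    /* ...and put its IO buffer in that node's memory. */
    char *buf = numa_alloc_onnode(IO_BUF_SIZE, node);
    if (buf == NULL)
        return;

    memset(buf, 0, IO_BUF_SIZE);   /* stand-in for the real read()/write() work */

    numa_free(buf, IO_BUF_SIZE);
}

int main(void)
{
    if (numa_available() < 0) {
        printf("No NUMA support on this system.\n");
        return 0;
    }
    do_node_local_io(0);   /* assumption: the NVMe sits on node 0 */
    return 0;
}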
 
Given your last phrase, should he try different OSes, like Linux or DragonFly BSD? And if it's all the same, is the hardware still ahead of the software?
 
Honestly, I don't know that I can recommend a particular OS over another. As you can probably guess from the detail above, I've dealt with this optimization, and it was hard. That was under Linux, on both amd64 (a.k.a. x86-64) and PowerPC. Understanding how NUMA control works, and how it interacts with memory allocation and IO/network, and with thread scheduling, was a lot of work. I've also seen it done on a commercial Unix using a different CPU architecture, where this for some reason was less of an issue (the reasons could include that the hardware was just generally much faster so less optimization was necessary, or that a different team of experts handled it, or that this other Unix is just so much better at thread scheduling and memory management that it has the throughput affinity built-in). I have no idea how this is handled in FreeBSD; the answer probably depends on whether there are developers who care about this particular use case (high-throughput IO on NUMA-class machines).

I guess Phishfry could just install Linux and try it again, and see whether he gets different benchmark results. The problem is: he will spend lots of effort doing that setup, and getting the benchmark and tuning it, and then very likely get rather different results. Drilling down to where the differences come from will be a lot of work. Is that worth it? His call.
 
Good read from Matt, even if I disagree with his Toshiba/OCZ beef. I have used heatsinks on mine from day one.
I have five XG3-512GB modules. They are generally weaker than the Samsung line but acceptable.
Recently I dove into Samsung, starting with a pair of 960GB PM953 modules. They are in the M.2 format but long, at 110mm.
I mounted them on a Supermicro bifurcation card in a graid1 setup, hosting my VMs.
Now I have found something better: the PM983. These are in the U.2 format and are very fast. I am using the 1TB models, which have half the write speed of the 2TB and larger models, but they are affordable at $150.
I am fooling with various U.2 adapters, from a $20 paddle card to an LSI 9440-8i.
The main advantage I see with "Enterprise NVMe" is something that is not alluded to much: sustained transfer rate.
With M.2 being so tiny it is a disaster to keep cool even with heatsinks and fans. Who needs that with a server?
So that is where U.2 drives shine. They run cooler due to density alone. They do still get very warm.

I will concede that 970PRO is the speed king and 970EVO is number two. I say phooey on the M.2 format. It's great for laptops.
 