How to measure performance

This concerns more-or-less the Intel Core i series of CPUs.

Normally I don't care about performance. If the machine works and gets things done in a reasonable way, everything is fine. I care much more about stability, so if lowering the bus clock by 5% might increase the safety margin against soft errors, I would do it.

Now I got my hands on a bunch of RAM that is not on the QVL (qualified vendor list) of the board (but is within spec and so should work nevertheless), and the modules carry two different speed ratings: a conservative JEDEC one and an XMP one.

Corollary: it seems we no longer live in times where hardware has a precise functional rating that says "it is guaranteed to run up to this speed". We now have a culture of so-called "overclockers", and the industry caters to them by providing parts meant to be "overclocked", including instructions on how to do so. Not only is this not what one should understand as overclocking (i.e. running parts under conditions beyond what the manufacturer specifies), it also strays from what I understand as engineering work: it no longer answers the question "what is this part specified to do?" in a measurable way.
(Just imagine somebody selling an airplane and stating: it endures a maximum tolerable g-force of 2.8, but if you like extreme flying, you can also use a g-force of 3.4. o_O )

Curious as I am, I would now like to know whether there is a notable difference between these ratings, i.e. get some actual metrics. So I checked and found benchmarks/hpl, which should be more or less the established way to determine those MFlops figures that are all over the technical press.

But then I learned: this tool does not simply measure the ad-hoc throughput of the system (like dd would show the ad-hoc throughput of a drive); instead it needs to be carefully tuned to the actual topology in order to max out the compute throughput of the cores. On my first tries the figures shown were about a factor of 1000 below what the press says my CPU should do.
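For reference, the tuning happens in HPL's input file, HPL.dat. A minimal sketch of the lines that matter - the values here are assumptions for a generic 4-core machine with 8 GB of RAM, not known-good settings for this board:

    HPL.out      output file name (if any)
    6            device out (6=stdout)
    1            # of problem sizes (N)
    28000        Ns    <- problem size, roughly sqrt(0.8 * RAM / 8 bytes)
    1            # of NBs
    192          NBs   <- block size, typically 96..256, worth sweeping
    0            PMAP process mapping (0=Row-major)
    1            # of process grids (P x Q)
    2            Ps    <- P x Q must equal the number of MPI ranks
    2            Qs       (here 2 x 2 = 4, one rank per core)

Getting N, NB or the process grid badly wrong can easily cost a large factor in the reported GFlops.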

Finally I started to watch what the CPUs are doing, with sysutils/intel-pcm. That port shows the detailed core activity and an IPC value - instructions per cycle (per core). And while I was used to seeing about 1 IPC (per core) during e.g. compiling, it now showed 3.5 IPC.
So that is what hyper-threading is about (my chip doesn't have it), why Intel charges a lot of extra money for chips that have it enabled, and why it is not useful on high-performance mathematical workloads: at 3.5 IPC the execution units are already well utilized, so a second thread per core has little left to fill.

Questions that remain:
When the CPU shows about 3.5 IPC and clocks at 2.9 GHz with 4 cores, that should make about 40 GFlops. But strangely, the hpl tool shows a value that is always precisely 1/4 of that, i.e. 10 GFlops. Why is this? (It is not related to the 4 cores; the same 1/4 factor applies when running on a single core.)
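For comparison, the usual peak-flops arithmetic for this chip - assuming an Ivy Bridge core can retire a 4-wide AVX double-precision add and a 4-wide multiply each cycle, i.e. 8 flops/cycle:

    \[
      R_{\text{peak}} = N_{\text{cores}} \times f \times \text{flops/cycle}
                      = 4 \times 2.9\,\text{GHz} \times 8
                      = 92.8\ \text{GFlop/s}
    \]

Note that IPC counts all retired instructions (loads, stores, address arithmetic, branches), not floating-point operations, so 3.5 IPC x 2.9 GHz x 4 cores gives ~40 G instructions/s - a different unit than HPL's GFlop/s, and the two would only coincide if every instruction performed exactly one flop.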

And: has anybody achieved figures with that hpl tool that vaguely resemble those quoted in the technical press? Probably with some high-end mainboard? (Mine is just basic consumer standard, specifically an i5-3570T chip on an ASUS P8B75-V board.)
 
The processor is listed with 25.6 GB/s. Depending on how the benchmark is implemented, the memory might be the bottleneck - for example if a value is read from memory instead of a CPU register. In this case it should be limited to about 6 GFlop/s (a sketch of such a memory-bound kernel follows below).
Also, you might want to check whether the 3.5 IPC is valid for the floating-point instructions used by the benchmark. I have not tested CPUs, but I had a similar experience with java/aparapi and GPUs, where I thought I should be able to get 6 TFlop/s with half precision but got about 1.5, and the only reasonable explanation was that the aparapi developers used doubles internally, which was probably the bottleneck.
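To illustrate the bandwidth bound, here is a minimal C sketch of a memory-bound kernel (a STREAM-like triad; the array size is an assumption chosen to defeat the caches):

    /* Memory-bound "triad": every element travels over the memory bus,
     * so DRAM bandwidth, not the cores, caps the flop rate. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (16 * 1024 * 1024)        /* 16 Mi doubles = 128 MiB per array */

    int main(void)
    {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        if (!a || !b || !c)
            return 1;
        for (size_t i = 0; i < N; i++) {
            b[i] = 1.0;
            c[i] = 2.0;
        }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];   /* 2 flops, ~24 bytes of traffic */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        /* At 25.6 GB/s and 24 bytes per 2 flops, the ceiling is about
         * 25.6e9 / 24 * 2 = 2.1 GFlop/s, however fast the cores are. */
        printf("%.2f GFlop/s\n", 2.0 * N / s / 1e9);
        free(a); free(b); free(c);
        return 0;
    }

HPL, by contrast, blocks the matrix so that most operands are reused from cache - which is exactly why its block size needs tuning.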
 
The processor is listed with 25.6 GB/s.

Hm, where did You get that one? (It seems to be a throughput value, and I vaguely remember having seen it before.)
I could refer to this article that mentions the i5-3570 (without the T) at 105 GFlops.
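For what it's worth, 25.6 GB/s is exactly the theoretical dual-channel DDR3-1600 figure, which is what Intel's ARK lists as maximum memory bandwidth for this class of chip:

    \[
      2\ \text{channels} \times 8\ \text{B/transfer} \times 1600\ \text{MT/s}
      = 25.6\ \text{GB/s}
    \]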

Depending on how the benchmark is implemented, the memory might be the bottleneck

Memory should be a bottleneck, and if that can be pinpointed, one could see the practical difference between the memory parameters. But first one should understand what is actually being measured.
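For the record, HPL solves a dense N x N linear system and derives its GFlop/s figure from the nominal LU operation count and the wall-clock time, not from hardware counters:

    \[
      \text{GFlop/s} = \frac{\frac{2}{3}N^{3} + 2N^{2}}{t \times 10^{9}}
    \]

So the reported number is "useful" flops per second; any extra instructions the cores execute along the way simply do not count.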

Also, you might want to check whether the 3.5 IPC is valid for the floating-point instructions used by the benchmark.

Yeah, the hard way would be to look into the source, and then into the library - which is a Fortran library, and I don't think I want to go that far...
 