How should I query the level 1 data cache size in C?

The E-cores and P-cores are completely identical from an x86_64 application's point of view. If they were not, a thread could not be preempted while running on a P-core and rescheduled to continue running on an E-core. That is not just what the E-core/P-core-unaware FreeBSD scheduler has them do all the time (and what every other operating system also did when Alder Lake was released); it is what the chip is designed to do. Most application threads cause burst load every now and then, and otherwise sit idle or do minor tasks while waiting for more work. The OS scheduler can read a model-specific register to learn whether the CPU's microcode considers a thread worthy of being migrated to the other efficiency level, but processing this MSR is optional.

Where's the "Atom" coming from? Each of my Emerald Rapids' E-cores outperforms a Skylake-SP core, and that's with AVX512 code, which Atoms don't support. According to Intel's slides, the main difference between the E-cores and the P-cores is the number of vector co-processors: the E-cores have 1 and the P-cores have 3. That makes sense, as the detection and distribution of out-of-band executable vector code to the available vector co-processors is done by the instruction pipeline optimizer, which is invisible to x86_64 applications. So x86_64 applications have no means of knowing how many co-processors are present or in use, which allows this number to change at runtime.

Intel has also had trouble with the excessive power consumption and heat generation of its vector co-processors ever since it introduced AVX2, while most applications, even now, a decade later, do not use AVX at all during their lifetime. Reducing the number of cores capable of high AVX performance lets most applications run at the same performance, while selected applications can run faster on a performance core that has additional co-processors. All Xeon Silver 44xx/45xx have only P-cores, while on the Gold 54xx/55xx/64xx/65xx most cores are E-cores, and most Golds have fewer P-cores than their Silver "competitors". Yet outside of microbenchmarking, the Golds outperform the Silvers in work/time, and even more so in work/watt, i.e. they have better TCO and efficiency.
IIRC, Alder Lake had several variants, and a difference in having AVX512 or not between P- and E-cores could be a fatal difference for apps that mandate AVX512 at build time.

So schedulers SHALL be 100% aware of the core types, strictly speaking.

But why are the FreeBSD base system and apps not affected?
Simply because AVX512 is usually not enabled by default.
Note that apps which auto-detect CPU features at startup could crash if they initially run on P-cores (unpinned) but are at some point moved to E-cores.
 
Reminder: This whole discussion was originally really about cache line size, not about cache size, or the presence or absence of certain cache levels. The OP wants to allocate certain data structures so they don't span cache lines (presumably they are small enough to fit into a single cache line).

My very brutal proposal to the OP would be this: we know that the ISA provides atomic operations on 128-bit integers. Therefore the cache line size must be at least 128 bits and a multiple of 128 bits. If the data structure can be squeezed into 128 bits, they're golden. Now that may not be completely trivial, and probably means that no whole pointers can be stored in the data structure. That can be easily worked around.
 
If my intr and cc processes on btop show bottlenecking, can I blame everyone on this discussion thread?

Presenting: Base System — The Final Overload. Rated R for Redlining.

- this is my sneak preview for the screenshots
 
Alder Lake cores have the full feature set available on all cores. If a scheduler isn't aware of the core variants, it will push AVX-heavy applications onto E-cores, where they will run at degraded performance, but they will run nevertheless.

FreeBSD started deploying AVX512-accelerated codepaths around a decade ago. As of FreeBSD 14.3, GENERIC uses AVX512 in ZFS, the cryptographic stack (IPsec, GELI, ...), sodium, OpenSSL's userland code, and many other components.

We also know that there is no "supported" x86_64 environment around that has a cache line size other than 64 bytes.
 
Which arm of the community is working on processes, multiprocessing, intr, and high throughput?

[Attachment: cvsrust.png]
 
If you have a specific question or suggestion, you can explain it here, but don't expect it to reach developers. If you want to enter into a detailed technical discussion with developers, the mailing lists are more appropriate.
 