C How should I query the level 1 data cache size in C?

I use sysconf(_SC_LEVEL1_DCACHE_SIZE) on Linux,
but it seems FreeBSD does not have this.

I need to query cache size in the following code:

C:
// aligned to cache line to avoid false sharing
void *
allocate_shared(size_t size) {
    size_t cache_line_size = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    assert(cache_line_size > 0);
    size_t real_size = ((size / cache_line_size) + 1) * cache_line_size;
    void *pointer = aligned_alloc(cache_line_size, real_size);
    memset(pointer, 0, real_size);
    assert(pointer);
    assert(pointer_is_8_bytes_aligned(pointer));
    assert(pointer_is_cache_line_aligned(pointer));
    return pointer;
}
 
I think OP was asking how to do it programmatically. macOS exposes this value via sysctl; FreeBSD doesn't seem to.
 

Code:
    int value = -1;
#if (defined(__APPLE__) || defined(__FreeBSD__) || defined(__NetBSD__)) && defined(HW_L2CACHESIZE)
    int mib[2] = {CTL_HW, HW_L2CACHESIZE};
    size_t len = sizeof(value);
    if (sysctl(mib, 2, &value, &len, NULL, 0) < 0) {
      return -1;  // error
    }
#endif
    return value;
 
Unfortunately, at least on the 14.2 amd64 machines we've seen, that sysctl node does not exist:
Code:
# sysctl hw.l2cachesize
sysctl: unknown oid 'hw.l2cachesize'
 
Are cache sizes even constant on any given machine these days? What about Intel's E-cores? And AMD's dual-CCD X3D chips?

Also, using this for optimization is problematic. Because the caches are associative, you can't fill them to the hilt with just the data you want (unless you take that into account).
 
Cache size optimizations usually only make sense when you want maximum performance, as in number crunching running at full throttle. Any small-but-weak cores do not factor in here. This is also usually best handled in the lower-level libraries, like BLAS/LAPACK/..., which can even auto-tune to the cache size and which employ algorithms that exploit the cache's workings to the maximum. I would not bet against the people writing that code that I could do better. I tried, and while my code still scaled better than linearly with the number of cores, that was not good enough.

Optimizing for cache line size usually makes much more sense; Valgrind (its cachegrind tool) will help you sort your data structures for cache line locality. Do that first; then the cache size may start to be a thing to address.
 
Optimizing for cache sizes and/or cache line sizes on recent heterogeneous (non-fully-symmetric) CPUs would probably require near-perfect scheduler support, contributed directly by each CPU vendor.
 
Thanks for the advice. I am writing a simple (single-producer, single-consumer) queue,
and my experiments show the effect of the cache size optimization is about 2x-3x.

Here is part of the optimization (I also use tricks like cached cursors, but those are not shown in the code):

Before:

C:
struct queue_t {
    size_t size;
    size_t mask;
    void **values;
    atomic_cursor_t front_cursor;
    atomic_cursor_t back_cursor;
    destroy_fn_t *destroy_fn;
};

queue_t *
queue_new(size_t size) {
    assert(size > 1);
    assert(is_power_of_two(size));
    queue_t *self = new_shared(queue_t);
    self->size = size;
    self->mask = size - 1;
    self->values = allocate_pointers(size);
    self->back_cursor = 0;
    self->front_cursor = 0;
    return self;
}

After:

C:
struct queue_t {
    size_t size;
    size_t mask;
    void **values;
    atomic_cursor_t *front_cursor;
    atomic_cursor_t *back_cursor;
    destroy_fn_t *destroy_fn;
};

queue_t *
queue_new(size_t size) {
    assert(size > 1);
    assert(is_power_of_two(size));
    queue_t *self = new_shared(queue_t);
    self->size = size;
    self->mask = size - 1;
    self->values = allocate_pointers(size);
    self->back_cursor = new_shared(atomic_cursor_t);
    self->front_cursor = new_shared(atomic_cursor_t);
    return self;
}

Where:

C:
#define new(type) allocate(sizeof(type))
#define new_shared(type) allocate_shared(sizeof(type))
 
Maybe using PAGE_SIZE is a good workaround.
I will do more experiments to see.
Memory pages and CPU memory caches are very different things. Pages are used to translate virtual addresses into physical addresses, while caches are used to accelerate access to frequently used lines of memory. A line is usually smaller, and never larger, than a page.

A line can be atomically locked for access using TSX-NI, but process/thread memory protections affect full pages only, as the kernel maps memory in pages rather than subpage ranges. TSX-NI, on the other hand, cannot lock a full page. As another user already mentioned, some AMD CPUs have cores with varying MMUs and caches on the same socket, which means the kernel would have a hard time keeping the locking consistent if it worked at subpage level.

On current "computer" hardware, in "most" situations your L1 cache size is on the order of 100 KB, consisting of 64-byte cache line entries that map onto fractions of 4 KB memory pages. Some special-purpose systems have larger pages, e.g. the Xeon Scalable series supports 1 GB pages (source). The page size does not, however, affect the cache line size - instead, large pages are only useful if the system will run only a few processes, each assumed to be very large, in which case the kernel can map a single 1 GB page per page fault rather than having to page fault up to 250k times to map 1 GB of 4 KB pages for the same effect... if the TLB can even store 250k records. Otherwise it will have to purge some "stale" entries in order to fit new pages, and then re-register the stale entry later on if it gets re-accessed.

After:

C:
struct queue_t {
    size_t size;
    size_t mask;
    void **values;
    atomic_cursor_t *front_cursor;
    atomic_cursor_t *back_cursor;
    destroy_fn_t *destroy_fn;
};

queue_t *
queue_new(size_t size) {
    assert(size > 1);
    assert(is_power_of_two(size));
    queue_t *self = new_shared(queue_t);
    self->size = size;
    self->mask = size - 1;
    self->values = allocate_pointers(size);
    self->back_cursor = new_shared(atomic_cursor_t);
    self->front_cursor = new_shared(atomic_cursor_t);
    return self;
}
Don't use assert(3) in production code; return a null pointer or another error indicator instead, and check the return value of queue_new in the invoking contexts.
aligned_alloc(3) and friends also return null pointers on error, which has to be checked and handled rather than asserted on.

If you can get away with shipping debug builds that don't remove the assertions altogether - which would strip you of the error "handling" - you don't need to optimize for memory caching either. Not to mention that debug binaries are usually larger and have text sections interleaved with "bloat" that wastes precious L1 cache, on top of using more/slower instruction sequences.
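A sketch of allocate_shared rewritten with return-value error handling instead of assert(), with the pointer checked before memset touches it. The 64-byte fallback for systems without _SC_LEVEL1_DCACHE_LINESIZE (like FreeBSD) is my assumption:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Every failure path returns NULL; the caller must check the result. */
void *allocate_shared_checked(size_t size)
{
    long line = -1;
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
#endif
    if (line <= 0)
        line = 64;                        /* assumed common line size */
    size_t cls = (size_t)line;
    if (size >= SIZE_MAX - cls)
        return NULL;                      /* would overflow the rounding */
    size_t real_size = (size / cls + 1) * cls;  /* OP's round-up formula */
    void *p = aligned_alloc(cls, real_size);    /* real_size is a multiple
                                                   of cls, as C11 requires */
    if (p == NULL)
        return NULL;                      /* check BEFORE memset */
    memset(p, 0, real_size);
    return p;
}
```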
 
If FreeBSD supports /proc/cpuinfo like Linux does, then that information can be retrieved by reading the pseudofile and parsing the requested data from it.

I think, though, that the larger argument is this: IMHO, trying to program around cache size strikes me as an optimization with very limited value.
 
Don't use assert(3) in production code; return a null pointer or another error indicator instead, and check the return value of queue_new in the invoking contexts.
IMHO it is perfectly fine to use assert() as long as you use the NDEBUG macro to disable it in production builds. Conditional compilation is your friend: #ifdef...#endif
 
IMHO it is perfectly fine to use assert() as long as you use the NDEBUG macro to disable it in production builds. Conditional compilation is your friend: #ifdef...#endif
Conditional compilation is a testing nightmare, as you have to test all possible combinations of the macros, which quickly explodes in complexity. Assert-based error handling is troublesome to test, too: you can't verify the "error handling" code's ability to handle the error without crashing the test binary, and if you configure your test suite to require a crash on a given input, it is difficult to tell whether it crashed because the error was identified or because a different issue terminated the program. To properly validate that the application crashed in the expected way, you have to tune the test framework a lot, and many test frameworks cannot do it at all. Return code checking is much more precise and doesn't require complex test framework setup, either.

To take an example, assume the code in this post's snippet is to be tested to handle a failure of aligned_alloc(3):
I use sysconf(_SC_LEVEL1_DCACHE_SIZE) on Linux,
but it seems FreeBSD does not have this.

I need to query cache size in the following code:

C:
// aligned to cache line to avoid false sharing
void *
allocate_shared(size_t size) {
    size_t cache_line_size = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    assert(cache_line_size > 0);
    size_t real_size = ((size / cache_line_size) + 1) * cache_line_size;
    void *pointer = aligned_alloc(cache_line_size, real_size);
    memset(pointer, 0, real_size);
    assert(pointer);
    assert(pointer_is_8_bytes_aligned(pointer));
    assert(pointer_is_cache_line_aligned(pointer));
    return pointer;
}
namely the sequence
Code:
    void *pointer = aligned_alloc(cache_line_size, real_size);
    memset(pointer, 0, real_size);
    assert(pointer);
    assert(pointer_is_8_bytes_aligned(pointer));
    assert(pointer_is_cache_line_aligned(pointer));
Let's assume we test this snippet with the FreeBSD libc's aligned_alloc performing an allocation we know in advance it cannot handle, and we want to verify the snippet handles the error, with the expectation that the test application asserts and thereby terminates. For this purpose, the test framework is instructed that this test is known, and in fact expected, to fail. After all, we're abort(3)-ing in the assert(pointer) line, right?

When the test is run, the test application abnormally terminates, returning to the test framework's runner with a non-success child process exit code, which is what we told the framework to expect - so the error handling works, right?

What happened, however, was that the first line assigned NULL to pointer, the next line then asked memset(3) to write through that NULL, and memset caused a segmentation fault. The segfault then caused the abnormal termination, without the error handling code ever having been involved, since the value of pointer is checked only after it has been passed to memset.


If the snippet is fixed by first asserting that the return value is acceptable, and only invoking memset after the assertion passes, the release build - where the assertion, i.e. the error handling, was "optimized out" - will still run straight into memset and segfault.


The next line after the null pointer check asserts 8-byte alignment. This smells like the memory chunk is meant to be used for mmintrin.h stuff, or handwritten SIMD assembly code. If it is used for that purpose, the value should be checked at allocation, and on failure the code should emit a diagnostic any end user understands before returning to the application's main loop or retrying the allocation - continuing into the SIMD code would raise SIGILL on some opcodes and coredump, and hardly anyone would be able to understand what happened.

If the assertion is "optimized out" for release builds, the code should provide a meaningful diagnostic nevertheless, and equally have a means of recovery: retry the allocation, return to the main loop to process other input that doesn't need 8-byte alignment while the SIMD request is handled by another thread/process/node, or use a potentially slower non-SIMD implementation to process the request. E.g. multimedia/ffmpeg does multithreaded video decoding using xmmintrin code, but has pure C fallback implementations for when it can't use xmmintrin, which it checks at runtime on a per-thread level, so it can be using SIMD on some threads and non-SIMD on others (which AMD folks are probably quite happy about)... it also has a number of CVEs for asserting on values only after having used them, just like the snippet that passes an unchecked pointer to memset and then asserts on it being non-null.


For the last line, I doubt it should be handled as an error at all. All it affects is performance, "eventually", and I wouldn't bother unless a profiler proves that cache locality is actually an issue. Number crunching apps - more or less the only ones measurably affected by cache behavior - are commonly run on clusters and GPUs. In either environment, the process can be moved between nodes at runtime, meaning the cache line size might change at runtime. With cheap node usage, such as cloud infrastructure bought on demand and cancelled on job completion to save costs, compute nodes can be live migrated transparently at run time, potentially causing the application to continue running on a system with a different cache line size. FreeBSD can handle being live migrated, and reinvoking sysctl(2) will provide the new cache line size for future allocations, but what happens to already allocated memory chunks? How much performance would it waste to keep checking whether the process got live migrated? If you bail and coredump over an unaligned allocation - losing all data of all threads that isn't in consistent storage - restart the process in the hope it will successfully allocate aligned memory, and re-request work from the compute cluster's master node, thereby making the master node work more too, is that really faster than just proceeding with unaligned memory and losing some cache locality on a single thread for a single request?
 
If FreeBSD supports /proc/cpuinfo like Linux does, then that information can be retrieved by reading the pseudofile and parsing the requested data from it.

I think, though, that the larger argument is this: IMHO, trying to program around cache size strikes me as an optimization with very limited value.
FreeBSD has a Linux compatibility layer that includes Linux procfs support. It's implemented by the linprocfs(5) filesystem, which is not mounted on /proc by default. Native FreeBSD applications don't use procfs(5) (FreeBSD's procfs, which is different from linprocfs), so you might be able to just mount it on /proc. The safe/recommended way is to mount it elsewhere, though, and to use it in a jail or chroot containing a Linux userspace.

Code:
    alonso@sunstream /usr/home/alonso % uname -a
    FreeBSD sunstream.purpleflowergarden.twilightparadox.com 14.3-RELEASE FreeBSD 14.3-RELEASE releng/14.3-n271432-8c9ce319fef7 GENERIC amd64
    alonso@sunstream /usr/home/alonso % mkdir /tmp/proc
    alonso@sunstream /usr/home/alonso % sudo mount -t linprocfs  linproc /tmp/proc
    alonso@sunstream /usr/home/alonso % cat /tmp/proc/cpuinfo 
    processor       : 0
    vendor_id       : GenuineIntel
    cpu family      : 6
    model           : 63
    model name      : Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
    stepping        : 2
    cpu MHz         : 2400.00
    cache size      : 256 KB
    physical id     : 0
    siblings        : 32
    core id         : 0
    cpu cores       : 32
    apicid          : 0
    initial apicid  : 0
    fpu             : yes
    fpu_exception   : yes
    cpuid level     : 6
    wp              : yes
    flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm constant_tsc xsaveopt
    bugs            : 
    bogomips        : 4800.00
    clflush size    : 64
    cache_alignment : 64
    address sizes   : 46 bits physical, 48 bits virtual
    power management:  

    processor       : 1
    vendor_id       : GenuineIntel
    [...]
 
FreeBSD has a Linux compatibility layer that includes Linux procfs support. It's implemented by the linprocfs(5) filesystem, which is not mounted on /proc by default. Native FreeBSD applications don't use procfs(5) (FreeBSD's procfs, which is different from linprocfs), so you might be able to just mount it on /proc. The safe/recommended way is to mount it elsewhere, though, and to use it in a jail or chroot containing a Linux userspace.

Code:
    alonso@sunstream /usr/home/alonso % uname -a
    FreeBSD sunstream.purpleflowergarden.twilightparadox.com 14.3-RELEASE FreeBSD 14.3-RELEASE releng/14.3-n271432-8c9ce319fef7 GENERIC amd64
    alonso@sunstream /usr/home/alonso % mkdir /tmp/proc
    alonso@sunstream /usr/home/alonso % sudo mount -t linprocfs  linproc /tmp/proc
    alonso@sunstream /usr/home/alonso % cat /tmp/proc/cpuinfo
    processor       : 0
    [...]
For Linux apps, there's /compat/linux/proc for linprocfs. This is quite logical, as (IIUC) Linux apps running under the Linuxulator think that /compat/linux is the root directory. The same goes for /compat/linux/sys and linsysfs.

And what is exposed is (IIUC) what is already somehow exposed (or at least maintained internally) by the kernel and easily converted to Linux style (or already matching in the first place).

My understanding is that if continuously retaining and maintaining this info is beneficial for the scheduler, and exposing it as-is as read-only data is enough, it could happen in the future.
But overhauling the scheduler itself should happen first, as the schedulers currently available to choose from are (just my opinion, though) not a good fit for asymmetric multiprocessors like big.LITTLE and P-/E-cores.
 
xibo
I think we have different ideas about the use of assert(). It is for when you DO want the program to terminate on error (for the purpose of debugging and quickly knowing where it crashed). All I'm saying is that those assert()s are benign because they are conditionally compiled out with proper NDEBUG macro use.

As for conditional code blocks? It's way too situation-dependent to state unequivocally that they are wrong or right.

Re: OP's snippet -
upon quick inspection, the list of assert()s appears redundant, since aligned_alloc() can only return null or a properly aligned pointer for the supplied arguments.

For the most part I'll stick with C++ and throw exceptions wherever possible. I used to be a strong advocate of standard function prototypes where the return value indicated both an enum status and pass/fail, but I've come to love C++ exceptions... except when debugging multi-threaded apps, which is a PITA in all cases. LOL
 
I hope everybody also realizes that many consumer CPUs these days have different cores with different cache sizes in one machine.

So a global query with one answer makes no sense.
Is that true? The only time I've run across that was an embedded-system ARM multicore that had a generic ARM core and a separate realtime-tuned ARM core for heavy lifting.
 
And Intel Core processors starting from Alder Lake have P-cores (performance cores) and E-cores (efficient cores), which are different architectures (Core and Atom respectively, at the beginning).
 
And Intel Core processors starting from Alder Lake have P-cores (performance cores) and E-cores (efficient cores), which are different architectures (Core and Atom respectively, at the beginning).
The E-cores and P-cores are completely identical from an x86_64 application's point of view. If they were not, a thread could not be preempted while running on a P-core and rescheduled to continue running on an E-core - which isn't just what E-core/P-core-unaware FreeBSD has them do all the time, and what all other operating systems were also doing when Alder Lake was released; it's also what the chip is designed to do. Most application threads cause burst load every now and then, then become idle or do minor tasks while waiting for more work, and the OS scheduler can read a machine-specific register to learn whether the CPU's firmware considers a thread worthy of being migrated to the other efficiency level - but processing this MSR is optional.

Where's the "Atom" coming from? Each of my Emerald Rapids' E-cores outperforms a Skylake-SP core, and that's with AVX-512 code, which Atoms don't support. According to Intel's slides, the main difference between the E-cores and the P-cores is the number of vector co-processors: the E-cores have 1 and the P-cores have 3. That makes sense, as the detection of out-of-band executable vector code and its distribution to the available vector co-processors is done by the instruction pipeline optimizer, which is invisible to x86_64 applications - so applications have no means of knowing how many co-processors are present or in use, allowing that number to change at runtime.

Intel has also had trouble with the excessive power consumption and heat generation of their vector co-processors ever since they introduced AVX2, while most applications, even now, a decade later, never use AVX during their entire lifetime. So reducing the number of cores with high AVX performance lets most applications run at the same performance, while selected applications can run faster on a performance core that has additional co-processors: all 4th/5th-generation Xeon Silver processors have only P-cores, while the 4th/5th-generation Golds have mostly E-cores, and most Golds have fewer P-cores than their Silver "competitors" - yet outside of microbenchmarking, the Golds outperform the Silvers in work/time, and even more so in work/watt, i.e. have better TCO and efficiency.
 