How should I query the level 1 data cache size in C?

On Linux I use sysconf(_SC_LEVEL1_DCACHE_SIZE),
but it seems FreeBSD does not have this.

I need to query the cache line size in the following code:

C:
// aligned to cache line to avoid false sharing
void *
allocate_shared(size_t size) {
    long cache_line_size = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    assert(cache_line_size > 0);  // sysconf returns -1 on failure
    size_t real_size = ((size / (size_t)cache_line_size) + 1) * (size_t)cache_line_size;
    void *pointer = aligned_alloc((size_t)cache_line_size, real_size);
    assert(pointer);  // check before touching the memory
    assert(pointer_is_8_bytes_aligned(pointer));
    assert(pointer_is_cache_line_aligned(pointer));
    memset(pointer, 0, real_size);
    return pointer;
}
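The predicates `pointer_is_8_bytes_aligned` and `pointer_is_cache_line_aligned` are not defined in the post; here is a minimal sketch of what they presumably look like, assuming they just test the address modulo the alignment (these definitions and the fixed 64-byte line size are my assumptions, not the OP's code):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size; the real code queries it via sysconf(). */
#define CACHE_LINE_SIZE 64

static bool pointer_is_aligned_to(const void *p, size_t alignment) {
    return ((uintptr_t)p % alignment) == 0;
}

static bool pointer_is_8_bytes_aligned(const void *p) {
    return pointer_is_aligned_to(p, 8);
}

static bool pointer_is_cache_line_aligned(const void *p) {
    return pointer_is_aligned_to(p, CACHE_LINE_SIZE);
}
```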
 
I think the OP was asking how to do it programmatically. macOS exposes this value via sysctl; FreeBSD doesn't seem to.
 

Code:
    int value = -1;
#if (defined(__APPLE__) || defined(__FreeBSD__) || defined(__NetBSD__)) && defined(HW_L2CACHESIZE)
    int mib[2] = {CTL_HW, HW_L2CACHESIZE};
    size_t len = sizeof(value);
    if (sysctl(mib, 2, &value, &len, NULL, 0) < 0) {
        return -1;  // error
    }
#endif
    return value;
 
Unfortunately, at least on the FreeBSD 14.2 amd64 machines we've seen, that sysctl node does not exist:
Code:
# sysctl hw.l2cachesize
sysctl: unknown oid 'hw.l2cachesize'
 
Are cache sizes even constant on any given machine now? What about Intel's E-cores, or AMD's dual-CCD X3D chips?

Also, using this for optimization is problematic: because caches are associative, you can't fill them to the hilt with just the data you want (unless you take the associativity into account).
 
Cache-size optimizations usually only make sense when you want maximum performance, as in number crunching running at full throttle. Any small-but-weak cores do not factor in there. This is also usually best handled in the lower-level libraries, like BLAS/LAPACK/..., which can even auto-tune to the cache size and which employ algorithms that exploit how the cache works to the maximum. I would not bet against the people writing that code that I could do better. I tried, and while my code still scaled better than linearly with the number of cores, that was not good enough.
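For reference, the main cache-size technique those BLAS-style libraries use is blocking (tiling): the matrices are processed in tiles small enough that the working set stays resident in cache across the inner loops. A minimal sketch; N and BLOCK are illustrative values, whereas real libraries auto-tune the tile size to the measured cache:

```c
#include <string.h>

#define N 64
#define BLOCK 16  /* illustrative: pick so the tiles fit in L1 */

/* c = a * b for N x N row-major matrices, computed tile by tile so
 * each BLOCK x BLOCK tile of a, b, and c is reused while hot in cache. */
static void matmul_blocked(const double *a, const double *b, double *c) {
    memset(c, 0, N * N * sizeof *c);
    for (size_t ii = 0; ii < N; ii += BLOCK)
        for (size_t kk = 0; kk < N; kk += BLOCK)
            for (size_t jj = 0; jj < N; jj += BLOCK)
                for (size_t i = ii; i < ii + BLOCK; i++)
                    for (size_t k = kk; k < kk + BLOCK; k++) {
                        double aik = a[i * N + k];
                        for (size_t j = jj; j < jj + BLOCK; j++)
                            c[i * N + j] += aik * b[k * N + j];
                    }
}
```

The loop order (i, k, j innermost) also keeps the innermost accesses sequential, which is the cache-line-locality point made below.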

Optimizing for cache line size usually makes much more sense; Valgrind (cachegrind) will help you there with sorting your data structures for cache line locality. Do that first; then cache size may start to be a thing to address.
 
Maybe optimizing for cache sizes and/or cache line sizes on recent heterogeneous (non-fully-symmetric) CPUs would almost mandatorily require near-perfect scheduler support, contributed directly by each CPU vendor.
 
Thanks for the advice. I am writing a simple (single-producer, single-consumer) queue,
and I ran experiments to see the effect of the cache optimization, which is about a 2x-3x speedup.

Here is part of the optimization (I also do tricks like caching cursors, but those are not shown in the code):

Before:

C:
struct queue_t {
    size_t size;
    size_t mask;
    void **values;
    atomic_cursor_t front_cursor;
    atomic_cursor_t back_cursor;
    destroy_fn_t *destroy_fn;
};

queue_t *
queue_new(size_t size) {
    assert(size > 1);
    assert(is_power_of_two(size));
    queue_t *self = new_shared(queue_t);
    self->size = size;
    self->mask = size - 1;
    self->values = allocate_pointers(size);
    self->back_cursor = 0;
    self->front_cursor = 0;
    return self;
}

After:

C:
struct queue_t {
    size_t size;
    size_t mask;
    void **values;
    atomic_cursor_t *front_cursor;
    atomic_cursor_t *back_cursor;
    destroy_fn_t *destroy_fn;
};

queue_t *
queue_new(size_t size) {
    assert(size > 1);
    assert(is_power_of_two(size));
    queue_t *self = new_shared(queue_t);
    self->size = size;
    self->mask = size - 1;
    self->values = allocate_pointers(size);
    self->back_cursor = new_shared(atomic_cursor_t);
    self->front_cursor = new_shared(atomic_cursor_t);
    return self;
}
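An alternative to the two extra heap allocations, shown here only as a sketch, is to pad the cursors inline with C11 alignas so each one sits on its own cache line. The fixed 64-byte line size and the `cursors_t` / `atomic_size_t` names are my assumptions, not the OP's types:

```c
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

#define CACHE_LINE 64  /* assumed line size; query it at runtime if possible */

/* Each cursor starts on its own (assumed) cache line via alignment
 * padding, so producer and consumer never write to the same line. */
typedef struct {
    alignas(CACHE_LINE) atomic_size_t front_cursor;
    alignas(CACHE_LINE) atomic_size_t back_cursor;
} cursors_t;

_Static_assert(offsetof(cursors_t, back_cursor) % CACHE_LINE == 0,
               "back_cursor must start on its own cache line");
```

The trade-off: one fewer indirection and no extra allocations, but the line size is baked in at compile time instead of queried at startup.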

Where:

C:
#define new(type) allocate(sizeof(type))
#define new_shared(type) allocate_shared(sizeof(type))
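`allocate()` and `allocate_pointers()` are likewise not shown in the thread; a guess at their shape, assuming they return zeroed memory like allocate_shared does (these definitions are mine, not the OP's):

```c
#include <assert.h>
#include <stdlib.h>

/* Plain zeroed allocation; new_shared() instead goes through the
 * cache-line-aligned allocate_shared() from the first post. */
static void *allocate(size_t size) {
    void *p = calloc(1, size);
    assert(p);
    return p;
}

/* Zeroed array of `count` void pointers, used for queue->values. */
static void **allocate_pointers(size_t count) {
    void **p = calloc(count, sizeof *p);
    assert(p);
    return p;
}
```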
 