How to make "impossible" memory allocations fail in a sane way?

Cath O'Deray · Jan 8, 2022

Ambert said:
… @obsigna -- Next time there is a call for Foundation-supported project ideas, …

The call for proposals is open.

FreeBSD Foundation Soliciting Project Proposals

Hello everyone, The call for proposals has been sent out. -- Joe (with Foundation hat on)

forums.freebsd.org

FreeBSD Foundation 2022 Call for Proposals : freebsd

zirias@ · Jan 8, 2022

Ambert said:
You should not take my word for it. For the moment I can't have access to a FreeBSD installation (I am stuck on a Linux install), so the code was not tested properly. That's why most of my sentences begin with "I think that". But I think you can execute this code on any FreeBSD install and you will understand how it works just by looking at the output of the program. I could have made that program prettier, but it would not have been a good introduction to mmap().

I don't think you can make it "prettier" the way I meant it. Although mmap() is specified in POSIX, anonymous mappings aren't specified there, so it's also unclear whether this "re-mapping" page-wise would really work.

But if this code works on FreeBSD, it would at least mean you can write a program on FreeBSD reserving a large contiguous address space without having to rely on overcommit, which is already kind of cool.

Ambert said:
In addition, if you want to write portable code, malloc() is a good bet. Otherwise you have to abstract the memory handling in some kind of library, and use mmap() for Unix systems and VirtualAlloc() for Windows systems (and potentially other low-level memory functions for other operating systems). And VirtualAlloc() cannot do everything mmap() can.

That's one reason I think some extended (and portable) interface would be nice *). Of course, supporting Windows in addition to mmap() as a backend would be more work for an implementation...

Probably it's much too late for all my thoughts here because overcommit is just an "accepted fact" and people wouldn't start adopting better interfaces in different OS and applications ?. So you can just add the "OOM killer" to the list of unforseeable environmental events that can always bring down your application/service in an unexpected way...

-----
*) edit: For simple usecases like a potentially growing array, a simple extension like this could be enough:

Code:

void *mreserve(size_t size);
void *mallocfrom(void *pool, size_t size);
void munreserve(void *pool);

Of course, this won't suffice if what you need is some sparse data structure...

_martin · Jan 8, 2022

covacat said:
try to run some large process that uses shm

I don't have anything useful at hand. stress-ng had issues running it. It's been over 20 years since I coded anything shm related, didn't have time to write anything myself. But I think it's worth exploring further. Or maybe even checking current PRs.

Zirias As you mentioned somewhere above protect(1) is a FreeBSD way to achieve what you need (assuming I understood you want to exlcude your process from OOM). You can also call procctl(2) within your process (if run as root).

Cath O'Deray · Jan 8, 2022

covacat said:
try to run some large process that uses shm …

_martin said:
I don't have anything useful at hand. stress-ng had issues running it. …

Is stress2 of any relevance? (Just curious.)

freebsd-src/tools/test/stress2 at main · freebsd/freebsd-src

The FreeBSD src tree publish-only repository. Experimenting with 'simple' pull requests.... - freebsd/freebsd-src

github.com

README

covacat · Jan 8, 2022

_martin said:
I don't have anything useful at hand. stress-ng had issues running it

you can probably hack /usr/src/tests/sys/posixshm

zirias@ · Jan 8, 2022

_martin said:
Zirias As you mentioned somewhere above protect(1) is a FreeBSD way to achieve what you need (assuming I understood you want to exlcude your process from OOM). You can also call procctl(2) within your process (if run as root).

Yes this would enable me to rule out the OOM killer as one possible "crash reason" for my process and I think it would even be feasible here: It mostly doesn't need much RAM. Just occassionally, it uses a hash function from security/libargon2 which is very memory-hungry, but that memory is quickly released again.

Still I'm thinking about ways to conceptually improve the situation (although, sure, this will most probably not lead anywhere ?). It would make so much more sense if an application could learn when it can't get the RAM it needs (as in, the actual RAM, not just address-space reservation) and react to it. Most of the time a (somewhat) clean exit would be the best it could do, but that's better than "crashing". And thinking about my usage of libargon2: It's just ONE function of the service, so in cases like this, you could even think about just giving the client a temporary error in case the function currently can't get the RAM it needs. Well, you can at least dream of better design...

_martin · Jan 8, 2022

Well if you have coredump available you could debug the reason for the crash of your program. At least that's where I'd start. There's also the question what "crash" means in your case.

That's the beauty of the virtual address space -- you as a program don't know if the address is ram, swap or anything in between. Also strictly speaking "address space reservation" doesn't make much sense, memory reservation does. Userspace (64b) address space is way larger than any available ram (at least for now, excluding some special computers).

As memory is lazy allocated you could malloc and write to it. If it's a bigger chunk than PAGE_SIZE writing one byte to each page is enough (technically one bit but granularity is one byte). You're wasting system memory this way but those pages would be allocated to your process. They could still get swapped out in case of memory pressure but they will be there. malloc (using this term loosely to say memory allocator) will most likely hold to these pages even when freed as it does its thing on it. munmap() would release those pages immediately.

zirias@ · Jan 8, 2022

_martin said:
Well if you have coredump available you could debug the reason for the crash of your program.

I really wonder where this repeated misunderstanding comes from, maybe from the fact this was moved to the "programming" section

There is no problem with my code, and it doesn't crash either. I just added "sane" handling of an OOM situation, and when trying to test it more realistically, I noticed it's impossible with the default setting of vm.overcommit. So, yes, it's basically a question about overcommit behavior of FreeBSD and the corresponding sysctl.

_martin said:
"address space reservation" doesn't make much sense

I beg to differ, the usecase for it was first mentioned by Ambert in this thread: You want some growing memory to be contiguous in virtual address space. With malloc(), this isn't possible in the presence of multiple parts of the program (or even shared libs) using it. So what programs do is reserve just a huge chunk of memory, relying on overcommit behavior...

_martin · Jan 8, 2022

Zirias said:
this would enable me to rule out the OOM killer as one possible "crash reason" for my process

Judging from this.

Virtual address space is flat. With some small exceptions you have full 128TiB virtual address space available to you.

malloc (allocator) will always provide you flat, contiguous space in requested size (i.e. allocated chunk is always contiguous). It is allocating space on already contiguous mmaped chunk (big chunk that is used at starting ground for allocator). You as a userspace programmer should not care if two chunks of allocated memory are next to each other (e.g malloc(32), malloc(32)). But, for example, if you need to extend the dynamically allocated buffer (hence a need for larger contiguous chunk) you can use realloc. Allocator will take care of that. And for sure no, text/data/bss/stack/vdso/shared libs segment or any other custom mapped segment into the address space is a problem.

You can also write your own allocator or avoid it by using mmap. If you know what you're doing you can use MAP_FIXED to keep chunks at known addresses. I can't think of reason why you'd need to, but there's a way (only time I need that is when writing exploits and I need to know where I am, e.g. jumping to custom exploit page).

Important note though: contiguous chunk in virtual address does not mean contiguous chunk in physical memory.

zirias@ · Jan 8, 2022

_martin said:
But, for example, if you need to extend the dynamically allocated buffer (hence a need for larger contiguous chunk) you can use realloc. Allocator will take care of that.

You should be aware in presence of heap fragmentation, realloc() often needs to copy all the contents to a new location. This is not acceptable for some usecases (we've had all of this in this thread before....). It is acceptable for my service here, I'm using it.

_martin said:
You can also write your own allocator or avoid it by using mmap. If you know what you're doing you can use MAP_FIXED to keep chunks at known addresses.

This has been discussed here before as well. Sure it works, but it also requires to actually request the memory upfront (although the OS will only try to really provide it at a page fault, but then the program isn't in control any more to handle it if it can't be fullfilled). Also, this is not really portable, but that's just a side note.

_martin said:
Important note though: contiguous chunk in virtual address does not mean contiguous chunk in physical memory.

This is less relevant for most usecases.

obsigna · Jan 8, 2022

Here comes my preliminary summary on what I understood about the problem and how I started to mitigate this.

My understanding of the problem:
Actually the title of this thread tells it: „How to make "impossible" memory allocations fail in a sane way?“. My first reaction was, what is the problem? Only check for OOM in your program than call exit(3) and let the atexit(3) handlers do the cleaning up. Then I learned about two unbelievably ridiculous circumstances which prohibit this „sane“ way of OOM handling within our programs:

malloc(3) never fails, unless our program asks it to allocate 1 petabyte of memory. Therefore our program never would learn about an OOM situation early enough for doing anything sane about it.
The FreeBSD system got implemented an OOM Killer, i.e. a cave man of the stone age who, in case of a problem, smashes the head of the next person around with a big club - it could at least signal a TERM before KILL (bronze age), couldn’t it?

My mitigation:

I had already an implementation of a malloc wrapper, which got introspection facilities, like total allocation and count of allocated chunks. This was already very handy when hunting memory leaks, since on program exit I require both to be 0. With that already in place, it was easy to impose a limit for the total allocation (512 MB default) which can be changed on the command line. In case the total allocation exceeds the defined limit, the malloc-wrapper returns NULL, and my program does whatever would be appropriate in the given situation.
The wrapper actually calls a = malloc(s) and now in the same breath it calls madvise(a, s, MADV_WILLNEED|MADV_PROTECT) (thanks to one of the suggestions of Ambert).

According to madvise(2), MADV_PROTECT:

Informs the VM system this process should not be killed when the swap space is exhausted. The process must have superuser privileges. This should be used judiciously in processes that must remain running for the system to properly function.

I need to check somehow, whether this is really the case, though.

My wish:
A modern age OOM supervisor (not killer), which implements a new signal OOM and sends this to all user land processes first. I would happily let my process finish in a „sane" way, when it would receive a signal like this. In case the OOM signal does not free enough memory, it could send a TERM to the memory hogs (bronze age), for only then eventually falling back to the stone age (smashing heads) behaviour.

shkhln · Jan 8, 2022

obsigna said:
unbelievably ridiculous

Considering that every other shared resource I can think of (ISP bandwidth, road capacity, hospital beds, whatever) is provisioned under assumption that it won't be simultaneously needed by 100% of potential users, "ridiculous" is not the word I'll use. More like "inevitable".

obsigna said:
The wrapper actually calls a = malloc(s) and now in the same breath it calls madvise(a, s, MADV_WILLNEED|MADV_PROTECT)

Do you enjoy deadlocks?

obsigna · Jan 8, 2022

shkhln said:
Do you enjoy deadlocks?

Deadlock of what into what?

shkhln · Jan 8, 2022

obsigna said:
Deadlock of what in what?

At least with overcommit enabled, FreeBSD most likely still won't return any nulls to you, and since FreeBSD can't kill your process it would rather pause it until enough memory is available. Then you'll be like this guy: https://forums.freebsd.org/threads/out-of-memory.69755/.

obsigna · Jan 8, 2022

shkhln said:
Considering that every other shared resource I can think of (ISP bandwidth, road capacity, hospital beds, whatever) is provisioned under assumption that it won't be simultaneously needed by 100% of potential users, "ridiculous" is not the word I'll use. More like "inevitable".

In developed countries, doctors do not smash the heads of patients when they ran out of beds, only to make space for new occupants. This would be unbelievably ridiculous, wouldn’t it?

shkhln · Jan 8, 2022

obsigna said:
In developed countries, doctors do not smash the heads of patients when they ran out of beds, only to make space for new occupants. This would be unbelievably ridiculous, wouldn’t it?

They do triaging.

obsigna · Jan 8, 2022

shkhln said:
They do triaging.

If it comes that far, according to well thought rules. And there are procedures, like a signal OOB (out of beds) before a TERM before a KILL. See my wish.

_martin · Jan 8, 2022

Zirias said:
but it also requires to actually request the memory upfront (although the OS will only try to really provide it at a page fault

That's why I mentioned writing to each page (preallocation) and unmap it when not needed.

Still relevant was the trigger of the panic you were able to achieve. Can you confirm you have p4 generic kernel running on your system ? I'll try to massage the bug a bit.

obsigna · Jan 8, 2022

shkhln said:
At least with overcommit enabled, FreeBSD most likely still won't return any nulls to you, and since FreeBSD can't kill your process it would rather pause it until enough memory is available. Then you'll be like this guy: https://forums.freebsd.org/threads/out-of-memory.69755/.

This is a different situation. I ship measurement controllers operated by FreeBSD, and when the measurement daemon needs to be killed only for somebody wants to run Firefox or Chrome or alike on it, then it is pointless that a measurement controller without a measurement daemon continuous to operate. In this case, I prefer the user holds down the power button for a few seconds. Some users are clever enough to just no more run Firefox/Chrome etc., others would need to go through this some more times, but eventually they will understand as well.

zirias@ · Jan 9, 2022

_martin said:
That's why I mentioned writing to each page (preallocation) and unmap it when not needed.

That's not really a solution either if your usecase is some dynamic, potentially large and growing array. You want simple indices to work, so it needs to be contiguous in memory. But once you unmap what you don't need currently, other code-paths might put a mapping exactly in this location of virtual address space...

As there are programs allocating huge chunks of memory they will most likely never really use, I assume they have usecases like for example this. That's why I think decoupling the reservation of address space from that of actual pages backing it could improve the situation, you could at least meet these needs without overcommit then.

_martin said:
Still relevant was the trigger of the panic you were able to achieve. Can you confirm you have p4 generic kernel running on your system ? I'll try to massage the bug a bit.

It's almost GENERIC, except for device sg (Linux-compatible raw SCSI devices). I'd be very surprised if that made a difference. What's probably relevant is that my system was already heavily overcommitting (I checked with roughly the same programs running later, vm.swap_reserved was at >600GB on this machine with 8GB RAM / 8GB swap).

How to make "impossible" memory allocations fail in a sane way?

Cath O'Deray

FreeBSD Foundation Soliciting Project Proposals

zirias@

_martin

Cath O'Deray

freebsd-src/tools/test/stress2 at main · freebsd/freebsd-src

covacat

zirias@

_martin

zirias@

_martin

zirias@

obsigna

Profile disabled

shkhln

obsigna

Profile disabled

shkhln

obsigna

Profile disabled

shkhln

obsigna

Profile disabled

_martin

obsigna

Profile disabled

zirias@