How to make "impossible" memory allocations fail in a sane way?

But: OOM should be a condition that can be handled, at least with a clean exit. And this would work if I ever got an error (NULL returned) from malloc() et al. So, for now, I'm back to the sysctl that's supposed to control "overcommit" behavior in FreeBSD ;)
Am I missing something?

I use atexit(3) handlers for this. It works perfectly even in OOM conditions. Your malloc wrapper must not call abort(3) but exit(3). The difference is that abort simply pulls the plug on the executable, while exit calls all registered atexit handlers in reverse order before stopping the program. I wouldn't even be very surprised if the btree stuff installed an atexit handler, but on abort it won't get called. Anyway, for me atexit handlers together with a malloc-exit-wrapper instead of a malloc-abort-wrapper are the sane OOM solution, and here it works without a hiccup.
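For illustration, a minimal standalone sketch of that approach (the xmalloc name and the handler are mine, not taken from any real project): the wrapper exits instead of aborting, so the registered atexit(3) handler still runs when an allocation fails.
Code:
#include <stdio.h>
#include <stdlib.h>

static void cleanup(void)
{
    /* flush/close files, sync the database, free global state, ... */
    fputs("cleanup ran\n", stderr);
}

static void *xmalloc(size_t size)
{
    void *p = malloc(size);
    if (!p)
    {
        fputs("out of memory\n", stderr);
        exit(1);    /* exit(3) runs the atexit handlers; abort(3) would not */
    }
    return p;
}

int main(void)
{
    atexit(cleanup);
    char *buf = xmalloc(1 << 20);
    buf[0] = 0;
    free(buf);
    return 0;
}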
 
Am I missing something?
Yes, the simple fact that malloc() just (almost?) never fails to begin with, at least with the default setting of vm.overcommit=0. You can't handle what you don't know.
I use atexit(3) handlers for this. It works perfectly even in OOM conditions.
That's a way to do it. I instead decided on a longjmp() back to where the crucial cleanup is executed anyway*. But neither of these helps if malloc() just succeeds and the problem only arises when using the memory, forcing the kernel to engage the OOM killer...

----
*) using this generic "panic" function:
Code:
void Service_panic(const char *msg)
{
    if (running) for (int i = 0; i < numPanicHandlers; ++i)
    {
        panicHandlers[i](msg);
    }
    logsetasync(0);
    logmsg(L_FATAL, msg);
    if (running) longjmp(panicjmp, -1);
    else abort();
}
together with the following around the service main loop:
Code:
    running = 1;
    if (setjmp(panicjmp) < 0) goto shutdown;

    // [...] main event loop

shutdown:
    running = 0;
    // [...] cleanup
(And btw, atexit() wouldn't work for my service because it has worker threads. The threadpool registers a "panic handler" that checks whether it's called on a worker thread; in that case it longjmp()s out of the thread job first, and the handler for a finished thread job on the main thread then calls Service_panic() again...)
 
Excerpt from malloc(3):
RETURN VALUES
Standard API
The malloc() and calloc() functions return a pointer to the allocated
memory if successful; otherwise a NULL pointer is returned and errno is
set to ENOMEM.


Non-standard API
The mallocx() and rallocx() functions return a pointer to the allocated
memory if successful; otherwise a NULL pointer is returned to indicate
insufficient contiguous memory was available to service the allocation
request.

Are you telling me that malloc always returns non-NULL pointers, even in cases where it cannot provide the requested memory?

I have to admit that I only check whether the returned pointer is NULL, and in that case do the OOM handling. If the actual behaviour of malloc is different from what is written in the man page above, then you would need to file a PR against malloc.

I also see that many things have changed with this jemalloc sophistication; perhaps not everything is really useful.
 
Are you telling me that malloc always returns non-NULL pointers, even in cases where it cannot provide the requested memory?
Yes.
I have to admit that I only check whether the returned pointer is NULL, and in that case do the OOM handling. If the actual behaviour of malloc is different from what is written in the man page above, then you would need to file a PR against malloc.
It's most likely not malloc() causing that (which, IIRC, uses mmap() internally ... traditionally, sbrk() was used, but I guess that's a thing from the past). The problem is that the OS gives you anonymous mappings, even if they can't be backed. If you read this whole thread, it's actually "by design" (and the sysctl vm.overcommit is meant to give some control over this overcommit behavior).
 
(And btw, atexit() wouldn't work for my service because it has worker threads. The threadpool registers a "panic handler" that checks whether it's called on a worker thread; in that case it longjmp()s out of the thread job first, and the handler for a finished thread job on the main thread then calls Service_panic() again...)
While I believe you that atexit(3) does not work in your case, I just checked it (again) with one of my heavily threaded daemons, and here atexit handlers get executed regardless of which thread exit(3) is called from. I restrict myself to the standard pthread(2) API, though.

It's most likely not malloc() causing that (which, IIRC, uses mmap() internally ... traditionally, sbrk() was used, but I guess that's a thing from the past). The problem is that the OS gives you anonymous mappings, even if they can't be backed. If you read this whole thread, it's actually "by design" (and the sysctl vm.overcommit is meant to give some control over this overcommit behavior).

Well this is almost unbelievable. So, malloc(3) gives me memory which cannot be used????? This would be one of the buggiest bugs that I ever heard of.

EDIT: I never touched vm.overcommit, here it is 0.
 
While I believe you that atexit(3) does not work in your case, I just checked it (again) with one of my heavily threaded daemons, and here atexit handlers get executed regardless of which thread exit(3) is called from. I restrict myself to the standard pthread(2) API, though.
The problem is not that it wouldn't be called. The problem is that the cleanup would have to access memory that was last modified by a different thread without any memory barrier (e.g. pthread_mutex) in-between. That's one of the things that "just work fine" in 99% of the cases, but can go horribly wrong.
Well this is almost unbelievable. So malloc(3) gives me memory which cannot be used????? This would be one of the buggiest bugs that I ever heard of.
Actually, not really. Ambert outlined a somewhat "sane" use case for that (in a nutshell, ensuring an unfragmented virtual address space for large and potentially growing objects). Still, it's unfortunate and destroys all hope of reacting to OOM with a graceful exit...
 
Now, I just checked it:

Code:
#include <stdlib.h>
#include <stdio.h>
#include <inttypes.h>

int main(int argc, char *const argv[])
{
    if (argc < 2) return 1;

    long size = strtol(argv[1], NULL, 10) * 1024L * 1024L * 1024L;
    void *p = malloc(size);
    printf("%ld, 0x%016" PRIXPTR "\n", size, (uintptr_t)p);

    return 0;
}

Code:
CyStat-210:~ root# ./oomcheck 10
10737418240,      0x0000000801200700
CyStat-210:~ root# ./oomcheck 100
107374182400,     0x0000000801200700
CyStat-210:~ root# ./oomcheck 1000
1073741824000,    0x0000000801200700
CyStat-210:~ root# ./oomcheck 10000
10737418240000,   0x0000000801200700
CyStat-210:~ root# ./oomcheck 100000
107374182400000,  0x0000000801200700
CyStat-210:~ root# ./oomcheck 1000000
1073741824000000, 0x0000000000000000

What the fuck?!?

malloc returns a valid pointer for 100000 GB; who could have expected this? Well, at least for an allocation of 1 petabyte (10¹⁵ bytes) it does not want to give a guarantee anymore. Was this ridiculous behaviour introduced by jemalloc?
 
Was this ridiculous behaviour introduced by jemalloc?
No. As I said before, I'm pretty sure the user-space allocator has nothing to do with that. It will just pass through the error (if any) given by the kernel.

edit: Try to write to that memory (e.g. with memset()); you'll see your system go into heavy swapping, and finally the OOM killer randomly killing large processes.
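For example, a hypothetical variant of the oomcheck program above that actually touches the allocation: malloc() still succeeds (with overcommit), but the memset() forces the kernel to back the pages, and for sizes beyond RAM + swap that's where the swapping and the OOM killer come in. Run this on a scratch machine only.
Code:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char *const argv[])
{
    if (argc < 2) return 1;

    size_t size = strtoul(argv[1], NULL, 10) * 1024UL * 1024UL * 1024UL;
    char *p = malloc(size);
    if (p == NULL)
    {
        puts("malloc failed");
        return 1;
    }
    memset(p, 0xff, size);    /* this is what actually commits the pages */
    puts("survived");
    free(p);
    return 0;
}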
 
Summing up this thread in a nutshell so far:
  • setting vm.overcommit to 1 should indeed prohibit allocations exceeding the available backing store (physical RAM + swap)
  • the kernel panic I've seen changing that sysctl was seen by others as well and is most likely a bug
  • there's a usecase for reserving a "ridiculous" amount of memory from your application: ensure contiguous virtual address space
I'd conclude the traditional APIs are lacking. There should be a way to just reserve address space, without reserving actual memory 😔
 
surprise, on my mac i just allocated 1tb
Code:
macmini:~ mac$ ./mm 1
1073741824, 0x00000001189F3000
macmini:~ mac$ ./mm 1000
1073741824000, 0x0000000108E1F000
 
surprise, on my mac i just allocated 1tb
Yes, this behavior seems pretty widespread. I know Linux does the same as well. And now that I've learned a "sane" use case for it, I think it can't really be fixed without changed/improved APIs. Having contiguous address space for a potentially growing object, that's a valid requirement...
 
Yes, the simple fact that malloc() just (almost?) never fails to begin with, at least with the default setting of vm.overcommit=0.
I think it should work with resource limits. At least I don't see why it wouldn't. (Just a nitpick.)
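A quick sketch of that idea (RLIMIT_AS and the 256 MiB / 1 GiB figures are just picked for illustration): with the address space capped via setrlimit(2), the oversized allocation should indeed come back as NULL.
Code:
#include <sys/types.h>
#include <sys/resource.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* cap the process' address space at 256 MiB (both soft and hard limit) */
    struct rlimit rl = { 256UL << 20, 256UL << 20 };
    if (setrlimit(RLIMIT_AS, &rl) != 0)
    {
        perror("setrlimit");
        return 1;
    }

    void *p = malloc(1UL << 30);              /* request 1 GiB */
    printf("malloc(1 GiB) -> %p\n", p);       /* expected: 0x0 (NULL) */
    free(p);
    return 0;
}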
 
I'd conclude the traditional APIs are lacking. There should be a way to just reserve address space, without reserving actual memory 😔

Maybe the MAP_GUARD flag is specifically designed for that purpose (cf. mmap(2)). And I don't know if memory pages reserved with the PROT_NONE protection are really a problem when vm.overcommit is set to 1.
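To make that concrete, a rough, untested sketch of the "reserve now, commit later" pattern (my own guess at how it could be used, not something verified against vm.overcommit=1): reserve a large contiguous range with PROT_NONE, then make only the needed prefix accessible with mprotect(2). Whether the untouched PROT_NONE part stays out of the swap accounting under vm.overcommit=1 is exactly the open question.
Code:
#include <sys/types.h>
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
    size_t reserve = 1UL << 40;    /* 1 TiB of address space, nothing backed yet */
    size_t commit  = 1UL << 20;    /* 1 MiB that will actually be used */

    void *base = mmap(NULL, reserve, PROT_NONE,
                      MAP_ANON | MAP_PRIVATE, -1, 0);
    if (base == MAP_FAILED)
    {
        perror("mmap");
        return 1;
    }

    /* commit only the part we need right now */
    if (mprotect(base, commit, PROT_READ | PROT_WRITE) != 0)
    {
        perror("mprotect");
        return 1;
    }

    ((char *)base)[0] = 1;   /* touch only the committed prefix */
    printf("reserved %zu bytes at %p, committed %zu\n", reserve, base, commit);
    return 0;
}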

Zirias, when you perform the test suggested by covacat (setting vm.overcommit to 1 in sysctl.conf), I suggest you try it with several swap sizes on your computer. Maybe increasing the size of the swap will make things better (cf. the definition of vm.overcommit in tuning(7)).

Also, I don't know much about signal handling, but maybe it is possible to catch the signal sent by the OOM killer to your process, and do some cleaning before termination.
 
Summing up this thread in a nutshell so far:
  • setting vm.overcommit to 1 should indeed prohibit allocations exceeding the available backing store (physical RAM + swap)
No, physical ram has nothing to do with it. It's reserved swap preserved when a process begins versus physical swap space available.

  • the kernel panic I've seen changing that sysctl was seen by others as well and is most likely a bug
  • there's a usecase for reserving a "ridiculous" amount of memory from your application: ensure contiguous virtual address space
I'd conclude the traditional APIs are lacking. There should be a way to just reserve address space, without reserving actual memory 😔
 
(About programs malloc'ing memory, then not using it)
How is that ever "reasonable"? If you said "rarely", ok, that's why swapping out pages makes sense. But not use it at all? Why should you ever reserve memory if you'll never write to it? I'd call that ill program design...
You are right, it's not very common. But it happens. Example: I create a vector that is supposed to hold up to 1 million entries (because that's a reasonable upper limit for how many things my program has to deal with). This run, there are only 10K entries. Maybe I'm using a data structure that's deliberately sparse for great insert/remove performance, and I just don't care that I'm wasting a few dozen megabytes, because memory is cheap. It particularly happens with server code that does internal caching with good cache management: If the workload the server has to handle doesn't need much caching (great locality), some memory will go unused. That's fine, the malloc() calls were nearly free, and don't use many resources. That's exactly the idea behind overcommit.
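A tiny illustration of that point (the numbers are made up): reserve room for the worst case, touch only what this run needs, and with overcommit the untouched tail costs next to nothing.
Code:
#include <stdlib.h>

#define MAX_ENTRIES 1000000   /* reasonable upper bound for this program */

int main(void)
{
    /* ~8 MB of address space reserved, most of it never written this run */
    double *v = malloc(MAX_ENTRIES * sizeof *v);
    if (v == NULL) return 1;

    for (size_t i = 0; i < 10000; ++i)   /* this run only has 10K entries */
        v[i] = (double)i;

    free(v);
    return 0;
}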

All the countless reasons your program could crash aside (you can eliminate intrinsic reasons in theory, but not environmental reasons): Running out of memory is a condition that allows at least a "graceful" exit, if your program would learn about it the moment it tries to reserve memory.
In practice, recovering from malloc failure (even if such a thing commonly happened) is harder than it seems. There are several reasons. One is that the "graceful exit" code probably needs memory allocation too. One technique I've seen used is to send all such emergency exit code through one common routine. And then at startup time, reserve one memory buffer (maybe 1MB) that is never ever used, and the first thing the emergency exit code does is to free that memory buffer, so the exit code can function with a few mallocs.

But even that fails today. The reason is that most big modern programs are multi-threaded (and have to be, to take advantage of multi-core CPUs and to overlap network and IO latencies). So one thread runs out of memory, and longjmp's to the common exit routine. But that exit routine cannot synchronously stop all other threads from malloc'ing (since any synchronous locking mechanism would be too slow), and even if it frees an emergency reserve pool, that will be immediately consumed by the other threads. I've tried writing such "out of memory recovery" code, and after weeks of messing with it, gave up. You're better off looking at the problem you're trying to solve, estimating how much memory is available (you know the machine, you know what other software is running), and planning accordingly. And if (god forbid) someone runs a giant memory hog (like emacs'ing a 1GB log file) on the machine, it's game over. Doctor, it hurts when I do that. Well, then stop doing it.
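For reference, a condensed sketch of that emergency-reserve technique (all names here are invented): allocate a spare block at startup and release it on the way into the common exit routine, so the exit path itself can still allocate a little. As argued above, this gets shaky once several threads keep allocating concurrently.
Code:
#include <stdio.h>
#include <stdlib.h>

static void *emergency_reserve;    /* allocated at startup, never otherwise used */

static void emergency_exit(const char *msg)
{
    free(emergency_reserve);       /* give the exit path some malloc headroom */
    emergency_reserve = NULL;
    fprintf(stderr, "fatal: %s\n", msg);
    /* ... flush state to disk, close files, ... */
    exit(1);
}

static void *xmalloc(size_t size)
{
    void *p = malloc(size);
    if (p == NULL) emergency_exit("out of memory");
    return p;
}

int main(void)
{
    emergency_reserve = malloc(1 << 20);   /* ~1 MB kept in reserve */
    char *work = xmalloc(4096);
    work[0] = 0;
    free(work);
    free(emergency_reserve);
    return 0;
}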

The bad thing about that practice is: Once the system learns it can't map all the currently needed pages to physical RAM any more, the only resort is the OOM killer, randomly killing some large process (so, any process in the system can be affected).
Or your program catches a segfault. Just as painful and unpleasant.

A broken program just reserving insane amounts of memory will be able to bring down other processes on the same machine. That's something virtual memory was originally designed to avoid.
There are many things that traditional operating systems were designed to guarantee. For example, isolation between users ... and we've given up on that; we instead move competing users into VMs, containers, or jails. I mean, we even do things like running a simple and harmless piece of software (the DNS server) in a jail, just "because". I think what you're saying is that OS design has not reached its goal. I agree, but that's the world we live in. Writing reliable and performant software in an imperfect world can require gritting your teeth and accepting reality.

About stack space, I don't really see a problem with that. As long as you use neither VLAs, stuff like alloca(), nor recursion, you can guarantee an upper bound for the stack usage of your program (and any algorithm can be implemented without these).
Agree. With good coding practices, running out of stack space should be rare. You just have to make sure all programmers on the project understand that.

(About shkhln's comment: "if some process consumes an amount of memory you failed to predict, this is already a problem.")
This comment right here trumps the whole thread.
Sadly true. If you want to build reliable production systems, look at all the processes on the machine.
 
No, physical ram has nothing to do with it. It's reserved swap preserved when a process begins versus physical swap space available.
That wouldn't make much sense (and doesn't correspond to the wording in the manpage; why only at the beginning of a process? It only talks about reserved swap vs. available swap). The consequence would be that with vm.overcommit=1, userspace could never get any memory if there wasn't any swap at all.

BUT I guess I found why changing the setting fails so badly. Looking at the machine (8 GB RAM, 8 GB swap) where I tried that, running normally, I see vm.swap_reserved at > 600 GB :what: – that's just insane... I still think the kernel shouldn't panic, but of course any userspace allocation will immediately fail.
Example: [...] That's exactly the idea behind overcommit.
That's pretty similar to the use case Ambert already described. It boils down to this: what you actually want is to reserve address space. I think it could be solved by allowing address space and the memory backing it to be reserved separately...
One is that the "graceful exit" code probably needs memory allocation too.
Depends on what it's doing. Persisting something to disk is probably pretty common; if that needs dynamic allocations, then yes. I don't use an "emergency buffer"; the quick path out (taken with the longjmp()) frees a few objects anyway, so there's hope this will be enough.
So one thread runs out of memory, and longjmp's to the common exit routine. But that exit routine cannot synchronously stop all other threads from malloc'ing (since any synchronous locking mechanism would be too slow), and even if it frees an emergency reserve pool, that will be immediately consumed by the other threads.
In my design, the thread exiting for OOM (or any other "panic" reason) would tell the main thread (via some shared/locked memory allocated by the main thread) that there was a panic, which causes the main thread to signal all other threads to exit... IF malloc() reliably returned NULL, this should be enough.
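Roughly sketched (heavily simplified, and all names are mine, not from the actual service), that design would look something like this; note that the whole thing hinges on malloc() really reporting the failure:
Code:
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

static atomic_int panicked;    /* set by any worker that hits OOM */

static void *worker(void *arg)
{
    (void)arg;
    void *p = malloc(1 << 20);
    if (p == NULL)             /* only useful if malloc really reports OOM */
    {
        atomic_store(&panicked, 1);
        return NULL;
    }
    /* ... do the actual job ... */
    free(p);
    return NULL;
}

int main(void)
{
    pthread_t workers[4];
    for (int i = 0; i < 4; ++i)
        pthread_create(&workers[i], NULL, worker, NULL);
    for (int i = 0; i < 4; ++i)
        pthread_join(workers[i], NULL);

    if (atomic_load(&panicked))
    {
        /* graceful shutdown: persist state, close sockets, ... */
        return 1;
    }
    return 0;
}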
You're better off looking at the problem you're trying to solve, estimating how much memory is available
There is no concrete problem with my code; I thought that was obvious from my initial post. OOM is a condition I can't predict (it depends on what else is running on the machine) and I'd like to react with a graceful exit IF it ever happens (best effort), that's all. It seems overcommit makes that impossible in practice.
 
you probably don't run your service as root, but just in case: vm.overcommit is not enforced for root
It's started as root, but drops privileges early on (just after setting up its listening sockets). uid/gid are configurable, I currently use nobody/nogroup.

But the thing is: I haven't even run it with vm.overcommit=1 so far, because when trying to test it on my desktop/dev machine, I immediately got this kernel panic. 😔
 
OOM is a condition I can't predict (it depends on what else is running on the machine) and I'd like to react with a graceful exit IF it ever happens (best effort), that's all.

Since the OOM condition is handled by the operating system, and it solves it by sending termination signals to some guilty-looking processes, maybe the "best effort" you can do is write a function that terminates your application nicely when it receives the SIGTERM signal. According to the handbook:

Handbook said:
Two signals can be used to stop a process: SIGTERM and SIGKILL. SIGTERM is the polite way to kill a process as the process can read the signal, close any log files it may have open, and attempt to finish what it is doing before shutting down. In some cases, a process may ignore SIGTERM if it is in the middle of some task that cannot be interrupted.

SIGKILL cannot be ignored by a process. Sending a SIGKILL to a process will usually stop that process there and then. [1]

That way, you would offer the operating system a choice: either reclaim your memory gently and let you clean up, or reclaim your memory abruptly and make a mess.
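A minimal sketch of such a SIGTERM handler (the handler only sets a flag, since doing real work inside a signal handler isn't async-signal-safe; the actual cleanup happens in the main loop):
Code:
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t terminate;

static void on_sigterm(int sig)
{
    (void)sig;
    terminate = 1;     /* just note the request; do the work in main() */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigterm;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGTERM, &sa, NULL);

    while (!terminate)
    {
        /* ... main loop / event handling ... */
        sleep(1);
    }

    /* flush logs, sync the database, close sockets, ... */
    puts("clean shutdown");
    return 0;
}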

Handling the SIGTERM signal is useful for other terminating conditions, not only OOM. For instance, the handbook says that a shutdown will:

Handbook said:
send all processes the TERM signal, and subsequently the KILL signal to any that do not terminate in a timely manner

----------------------
Edit: I found a similar thread on the freebsd-hackers mailing list: Why kernel kills processes that run out of memory instead of just failing memory allocation system calls? -- Basically, they say that there is no easy way to handle an OOM condition in a program's code, due to the design of the operating system (overcommitment).
-----------------------
Edit2: For your particular case, I think I have found a solution: 1) Create a child process. 2) Do your main computation in the child. 3) You keep a small summary of the changes made to your database since the last sync() in the parent process (a "diff"). 4) The child is a much more interesting target for the OOM killer, so it is killed first. 5) When the parent finds out the child process has been killed, the parent process writes the diff to a log file, and terminates. 6) When the program restarts, it reads the log file and applies the changes to the database.
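A compressed sketch of steps 1, 2, 4 and 5 (write_diff() and the child's work are placeholders): the parent checks how the child terminated and persists the diff if it was killed.
Code:
#include <sys/types.h>
#include <sys/wait.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void write_diff(void)
{
    /* persist the pending changes to a log file */
}

int main(void)
{
    pid_t pid = fork();
    if (pid < 0)
    {
        perror("fork");
        return 1;
    }

    if (pid == 0)
    {
        /* child: do the memory-hungry main computation here */
        _exit(0);
    }

    int status;
    waitpid(pid, &status, 0);
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL)
    {
        /* the child was most likely shot by the OOM killer */
        write_diff();
        return 1;
    }
    return 0;
}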
 
Ambert, handling SIGTERM is a must for a well-behaved daemon, but the OOM killer doesn't send a signal you could handle; it just terminates your process forcefully (SIGKILL).
 
I found a similar thread on the freebsd-hackers mailing list: Why kernel kills processes that run out of memory instead of just failing memory allocation system calls? -- Basically, they say that there is no easy way to handle an OOM condition in a program's code, due to the design of the operating system (overcommitment).
There is no "easy" way, but there certainly is a "best effort" way... But this *is* an interesting response (quoting from there):
The fork() should give the child a private "copy" of the 1 GB buffer, by
setting it to copy-on-write.
That's yet another API shortcoming; fork() shouldn't be the only way to start a new process... (edit: yes, I remember there was vfork(), e.g. on Linux, trying to solve exactly this problem, but it was an ill-defined disaster -- and Windows only offers CreateProcess() and no fork(), which is just as bad...)
The disadvantage, of
course, is that if someone calls the bluff, then we kill random processes.
That's exactly the issue ;)
although programs can in theory handle failed allocations and respond
accordingly, in practice they don't do so and just quit anyway.
and this sounds to me like a chicken/egg problem. Programs don't bother handling it because they know it's moot anyway, because the OS doesn't tell them the error at a time they could react ;)
(edit2: adding your usecase of having contiguous address space to the picture, I think the solution would really be to provide a way to reserve just address space and still require the program to reserve memory backing it before it can be used, so there's a chance to get and handle that error when it happens)
For your particular case, I think I have found a solution: 1) Create a child process. 2) Do your main computation in the child. 3) You keep a small summary of the changes made to your database since the last sync() in the parent process (a "diff"). 4) The child is a much more interesting target for the OOM killer, so it is killed first. 5) When the parent finds out the child process has been killed, the parent process writes the diff to a log file, and terminates. 6) When the program restarts, it reads the log file and applies the changes to the database.
MUCH too complex. As I said above, there is no problem with my code needing an unusually large amount of memory. And then, I guess just adding a sync() after sensitive changes would probably be much less overhead than this inter-process communication with extra buffers etc...
 
There is another solution, but it's nasty: Give up on the idea that a computer can be shared between multiple users or multiple processes. Go back to the 1950s, and use a dedicated computer for the task you want to run. Except that today you don't use a real physical computer, but instead get yourself a virtual one.

So if your task can calculate how much memory it will use at maximum, and can do internal tracking of memory usage, just instantiate a VM of that memory size, and run your code as a single-purpose VM.

The reason this is nasty is: It means that the whole raison d'etre of operating systems (which is resource control, management and virtualization) has failed at the OS layer, and is instead being done at the VM layer. And you need to be sure your VM layer does the memory allocation the way you like it (namely guaranteed), which isn't always the case. I vaguely remember that kubernetes will overcommit memory too. Oops ...
 
It means that the whole raison d'etre of operating systems (which is resource control, management and virtualization) has failed at the OS layer, and is instead being done at the VM layer.
Even worse, it's just the same thing taken to a different level (memory ballooning).

The simple truth is: memory is a limited resource, and the concept of virtual memory was once invented to make sure no application can bring down other applications (or even the OS itself). The resource "memory" should be given out under a strict "first come, first served" policy, with optional resource limits configurable by the system's administrator...
 