When I was an IBM systems programmer in the ‘70s, there was an unofficial way to apply live patches to the running system, using a non-IBM-sanctioned utility called Corezap. (They didn’t call it the kernel, but same concept.) It usually worked, but any error in entering the patch could cause an instant crash, as could interactions with other modules that the patch author didn’t anticipate because they never expected the patch to be applied to a running system.
With that history, I find the idea of live kernel patching really scary. You can avoid a reboot, but sometimes you end up crashing and having to recover a workload that unexpectedly had the rug pulled out from under it.
I like the idea of making the workload more portable.
Another example from back in the day: The IMS database system had a transaction-processing component called IMS/DC. One kind of program, a Batch Message Processing program (BMP), would take messages (entered from terminals) from a queue and process them. IMS had a checkpoint/restart scheme that saved the state of things as part of the IMS call that retrieved the message. If the BMP blew up, you could fix it and restart it at the most recent checkpoint. Since it restarted at the message-retrieval call, you could change the program any way you wanted, and IMS didn’t care. (This was in stark contrast to the regular checkpoint/restart facility, which didn’t let you change anything, and was more for when the hardware fell over.)
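The shape of that scheme can be sketched in a few lines of Python. This is a toy, not IMS: the queue, the checkpoint file, and all the names here are made up, but the key property is the same one described above: the checkpoint is taken as part of the retrieval call, so a restart re-delivers the in-flight message to whatever (possibly changed) program picks it up.

```python
import json
import os

QUEUE = ["msg-1", "msg-2", "msg-3"]   # stand-in for the IMS message queue
CKPT = "bmp.ckpt"                     # hypothetical checkpoint file

def run_bmp(process, crash_after=None):
    """Drain the queue, checkpointing at each message retrieval.
    crash_after simulates the program blowing up after N messages."""
    pos = 0
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            pos = json.load(f)["pos"]     # restart at the most recent checkpoint
    out = []
    while pos < len(QUEUE):
        with open(CKPT, "w") as f:
            json.dump({"pos": pos}, f)    # checkpoint taken at the retrieval call
        msg = QUEUE[pos]
        if crash_after is not None and len(out) >= crash_after:
            raise RuntimeError("BMP blew up")
        out.append(process(msg))
        pos += 1
    return out

if os.path.exists(CKPT):
    os.remove(CKPT)                       # fresh start for the demo
try:
    run_bmp(str.upper, crash_after=1)     # blows up while on the second message
except RuntimeError:
    pass
restart_out = run_bmp(str.lower)          # restarted with a *changed* program
```

Because the checkpoint records the retrieval position rather than anything about the program itself, the restart happily runs completely different processing logic over the remaining messages.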
Now, this sort of scheme is much more applicable to transaction-oriented processing. It is a lot harder to do something like this for long-running compute-heavy processes.
But my point is that I believe a safer future comes from figuring out how to make it possible to suspend, move, and resume applications, rather than changing the kernel on the fly.
If we do go the live kernel patching route, it would seem that patches involving multiple modules would have to indicate whether the changes must be applied in a certain order, or even simultaneously. (I’m not sure how you would do that. Perhaps patch a wait into the lead module, do the rest of the patch, and then patch out the wait and resume everyone who is waiting? [I don’t know enough about FreeBSD internals to know how you would do this. In the mainframe world it would be a POST.])
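I don’t know how a kernel would actually do this either, but the wait-then-POST idea above can be sketched in user-space Python, with a `threading.Event` standing in for the wait/POST pair and function rebinding standing in for the patch; all the names here are made up.

```python
import threading

gate = threading.Event()
gate.set()                          # gate open during normal operation

def callee_v1():
    return "old"

def callee_v2():
    return "new"

callee = callee_v1                  # stand-in for the second module being patched

def lead_module():
    gate.wait()                     # the patched-in wait: blocks while patching
    return callee()                 # so callers never see a half-applied patch

def apply_patch():
    global callee
    gate.clear()                    # 1. patch a wait into the lead module
    callee = callee_v2              # 2. patch the other module(s) at leisure
    gate.set()                      # 3. the "POST": resume everyone waiting

before = lead_module()
apply_patch()
after = lead_module()
```

Callers entering the lead module while the gate is closed simply block until the whole multi-module patch is in place, which is the ordering guarantee the paragraph above is asking for.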