Before 'systemd', how was it historically done to keep services/daemons/kernel running through failures?

8bitGlitch · Feb 15, 2021

So, I am not sure where to put this, and I am not starting a flame war. However, I am interested in understanding what distros and Unix operating systems did before systemd had hooks into everything.

Systemd provides a number of resources, which sort of breaks POSIX and the Unix philosophy; however, one of the features I do like is the watchdog feature, but from my understanding this is only for daemons that start in userspace, not the kernel.

But... this got me thinking. Historically, how did operating systems like Unix or the various RTOSes provide a mechanism to keep track of whether daemons failed? Are these features extended into the kernel to keep track of the kernel itself? Maybe I am confusing a few concepts, but take the OS for satellites and other remote devices: the kernel needs to keep running, so what happens if some error is detected but the OS needs to stay up?

I might be confusing a lot of concepts together, but I hope to have a real discussion in which folks can educate me on historical concepts and possible different designs that you would not see in consumer servers or embedded devices like cellphones.

Getting ready for Feb 18th!!!!! Our Return to the RED PLANET!!! Yeah!!!!

Jose · Feb 15, 2021

8bitGlitch said:
Systemd provides a number of resources, which sort of breaks POSIX and the Unix philosophy; however, one of the features I do like is the watchdog feature, but from my understanding this is only for daemons that start in userspace, not the kernel.

This kind of process supervision does not belong in PID 1, and has been implemented many times by other projects. From an essay written by the author of the Musl C library:

None of the things systemd "does right" are at all revolutionary. They've been done many times before. DJB's daemontools, runit, and Supervisor, among others, have solved the "legacy init is broken" problem over and over again (though each with some of their own flaws).

I recommend that whole essay. He proposes a stunningly simple init daemon. I'm not enough of a systems hacker to know how realistic that approach is. If you're interested in a system that incorporates a supervisor that is not Systemd, I recommend Void Linux. I used it briefly and was impressed.

8bitGlitch said:
But... this got me thinking. Historically, how did operating systems like Unix or the various RTOSes provide a mechanism to keep track of whether daemons failed? Are these features extended into the kernel to keep track of the kernel itself? Maybe I am confusing a few concepts, but take the OS for satellites and other remote devices: the kernel needs to keep running, so what happens if some error is detected but the OS needs to stay up?

It's still around, and confusingly called "sysvinit". Confusing because Sysv init is a category of things. My understanding is that package had a set of sample rc scripts and a basic implementation of a PID 1 init daemon. Most distros heavily modified the rc framework, but left the init daemon more or less alone. Supervision was left to other systems. Along came Systemd, and you know the rest.

The earliest sort-of analog I can think of is the inetd(8) "super" server.

The simplest rc scripts framework might be LFS's. I've never tried it, though:

9.2. LFS-Bootscripts-20230728

Edit: One more thing, I avoid the whole "supervision" approach. If your daemon crashes and needs to be restarted, it has a bug that needs to be fixed, and it should stay crashed so someone notices. This papering over serious problems reminds me of Windows reboot syndrome.

zirias@ · Feb 15, 2021

8bitGlitch said:
Historically, how did operating systems like Unix or the various RTOSes provide a mechanism to keep track of whether daemons failed?

Not at all.

And, if you think about it: it's *generally* a bad idea to just restart a failing service. After all, it normally fails for a reason.

Still, there can be the need for supervision (like: informing an admin upon failure), and there were always tools doing exactly that (e.g. the daemontools). There's no reason at all to incorporate THAT functionality in init.

Of course, just restarting a failing service MIGHT be an option depending on the situation, and there are tools that can (optionally) do that, but it should be obvious that this is just a workaround for an unsolved problem.

sko · Feb 15, 2021

Zirias said:
And, if you think about it: it's *generally* a bad idea to just restart a failing service. After all, it normally fails for a reason.

Exactly THIS! I always hated that argument and considered it a bad habit.

A service never fails without a reason that has to be looked into, so automatic restart of a service should only be used in very special cases. For those cases there have been various solutions for ages before systemd even was someone's wet dream... But the systemd approach IMHO exactly matches the typical tenor of the docker/kubernetes fandom: "if something fails, just start 2 new instances".

As for other bloated (but quite well designed) service management systems: SMF on Solaris switches a failed service into "maintenance" mode, so this service can't be restarted and isn't loaded on reboot UNTIL you specifically issue a svcadm clear <service>. This is IMHO the only correct way to deal with failing services. I'm not a big fan of SMF, but it definitely gets a lot of things right where systemd is just completely useless and/or broken (by design).
So if you are interested in how systemd should have been implemented, you should investigate SMF.

To get back on topic:
I'm using daemon(8) to daemonize various jobs/services/scripts and use its -r option for stuff that I specifically want to keep running (e.g. the fancontrol script on my homeserver). I often wish '-r' would take another argument to specify how many times daemon should try to restart the service.
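The two ideas in this post (a bounded number of restarts, and a "maintenance" state that someone has to clear by hand) can be combined in a few lines. What follows is only a minimal sketch in Python: the `Supervisor` class, its `max_restarts` parameter and the state names are hypothetical illustrations, not how daemon(8) or SMF are actually implemented.

```python
import subprocess
import sys
import time


class Supervisor:
    """Restart a failing command a limited number of times, then give up.

    Hypothetical illustration only; names are not taken from daemon(8) or SMF.
    """

    def __init__(self, argv, max_restarts=3, delay=0.0):
        self.argv = argv                  # command to run, as a list
        self.max_restarts = max_restarts  # restarts allowed after the first failure
        self.delay = delay                # seconds to wait before each restart
        self.state = "offline"
        self.exit_codes = []              # exit status of every attempt

    def run(self):
        # A service in maintenance stays down until a human clears it.
        if self.state == "maintenance":
            raise RuntimeError("service is in maintenance; call clear() first")
        restarts = 0
        while True:
            self.state = "online"
            code = subprocess.call(self.argv)  # blocks until the command exits
            self.exit_codes.append(code)
            if code == 0:
                self.state = "offline"         # clean exit: nothing to restart
                return self.state
            if restarts >= self.max_restarts:
                self.state = "maintenance"     # stop papering over the failure
                return self.state
            restarts += 1
            time.sleep(self.delay)

    def clear(self):
        # Analogous in spirit to an admin acknowledging the failure.
        if self.state == "maintenance":
            self.state = "offline"


if __name__ == "__main__":
    # A command that always fails: one initial run plus two restarts.
    failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    sup = Supervisor(failing, max_restarts=2)
    print(sup.run())  # prints "maintenance"
```

The point of the design is that the restart budget buys some uptime, while the maintenance state makes sure the failure is eventually noticed instead of being retried forever.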

eldaemon · Feb 15, 2021

monit is another route some people have taken to restart on failure. That said, restart on failure isn't always the best idea.

SirDice · Feb 15, 2021

8bitGlitch said:
However, I am interested in understanding what distros and Unix operating systems did before systemd had hooks into everything.

Golden oldie: sysutils/daemontools

8bitGlitch · Feb 15, 2021

Wow... thank you all for GREAT posts and for keeping this discussion going without a 'flame war'. I too believe services fail for a 'bad/error' reason and should not be restarted; however, I didn't know the historical approach or the thinking behind previous projects. In addition, I missed Solaris; I wish I had been able to use it, and I do remember the commercials in the mid/late 90s; however, I was still using my Amiga and C128 at that time.

We all agree that services/daemons that fail should remain failed until reviewed; however, it is funny that major vendors do not see it this way. Just the other day, I was reviewing logs on a Cisco Nexus series switch whose log buffer was filling up. The OS was not out of date; however, it was generating log entries in the thousands. The logs showed that systemd kept restarting the snmpd daemon. After working with TAC, it turned out there is a known bug in their implementation of snmpd; however, they just let systemd restart it.

Interesting that Android is getting put into a number of embedded items, and it does not have systemd; however, parties are pushing systemd for embedded devices. The latter is another concept, or idea that got me thinking.

kpedersen · Feb 15, 2021

Depending on the use-case, you might even just consider rebooting at 3am each night.

Code:

# crontab -e

# run once a day at 03:00 (minute 0, hour 3)
0 3 * * * shutdown -r now

richardtoohey2 · Feb 15, 2021

8bitGlitch said:
We all agree that services/daemons that fail should remain failed until reviewed; however, it is funny that major vendors do not see it this way.

I don't know that everyone agrees - but I think it's more the "BSD way" if there's such a thing.

If you've got a "throwaway" service that's delivering not-very-important web pages ("brochureware" or catalogue websites) you might prefer uptime to correctness. If the failure means that 90% of your services are fine and 10% are having issues, do you always want to stop the whole show?

Personally I'm in the "fail early (and stop) and with lots of noise" camp but I can see for some purposes - you might want to be able to say "ignore that and carry on".

8bitGlitch · Feb 15, 2021

How are mission-critical systems on satellites handled, or anything else so far away that you cannot just power cycle it manually?

PMc · Feb 15, 2021

8bitGlitch said:
How are mission-critical systems on satellites handled, or anything else so far away that you cannot just power cycle it manually?

High-Availability.
For business applications there are two redundant instances, one is active the other standby, and there is some kind of heartbeat showing that the active application is healthy. When that heartbeat ceases, the active application is forced offline and the standby becomes active.
On critical technical stuff we usually have three identical and independent systems all up+running, and for anything to be done, at least two of them must agree on it.
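Both patterns described here reduce to very small decision rules. The sketch below, in Python, is an assumption-laden illustration: the function names, the `timeout` parameter and the `quorum` default are made up for the example, and a real system would also have to deal with fencing the old active node, clock skew and replicas that fail silently.

```python
from collections import Counter


def should_fail_over(last_heartbeat, now, timeout):
    """Active/standby: take over once the heartbeat has been silent too long.

    All three arguments are times in the same unit (e.g. seconds).
    """
    return (now - last_heartbeat) > timeout


def majority_vote(outputs, quorum=2):
    """Three independent systems: act only on a value that `quorum` of them share.

    Returns the agreed value, or None when no value reaches the quorum.
    """
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= quorum else None


if __name__ == "__main__":
    print(should_fail_over(last_heartbeat=10.0, now=16.0, timeout=5.0))  # True
    print(majority_vote(["open", "open", "close"]))                      # open
    print(majority_vote(["open", "close", "hold"]))                      # None
```

With three replicas and a quorum of two, a single faulty system is simply outvoted; if all three disagree, nothing is done.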

SirDice · Feb 15, 2021

8bitGlitch said:
How are mission-critical systems on satellites handled, or anything else so far away that you cannot just power cycle it manually?

They run custom software. Specifically built to do the one thing that satellite is supposed to do.

zirias@ · Feb 15, 2021

SirDice said:
They run custom software. Specifically built to do the one thing that satellite is supposed to do.

This. I wouldn't be surprised to still find systems there that don't even have the notion of a "process" but execute a fixed schedule of recurring jobs instead. And of course, all redundant.

drhowarddrfine · Feb 16, 2021

This doesn't make any sense to me. What did they do before systemd? The same thing they are doing now or pretty close to it! The only thing that uses systemd is Linux so this doesn't make any sense at all.

Beastie7 · Feb 16, 2021

I would love an implementation of SMF in FreeBSD. It's really smart software. I believe it has some configuration management functionality built-in too.

garry · Feb 16, 2021

8bitGlitch said:
..... but say OS for satellites and other remote devices. The kernel needs to keep running, so what happens if some error is detected, but the OS needs to stay up?

When I have done embedded systems (e.g. perimeter security and access control for nuclear and military sites) it was of course silly to say that an unhandled exception "needs to be fixed before continuing operation". Some such critical systems must have a mechanism for recovering from a "fatal" error. For those I depended on a hardware watchdog timer. The executive process would be responsible for resetting the timer (e.g. at least once every 10 ms). If the hardware timer ever counted down to zero it fired a non-maskable interrupt that re-started the system. In other words it was basically a jump to the ROM entry point as if the system were just turned on. (My entire operating system and control program were in that ROM).

I don't see how systemd would help me in such a system. How does systemd recover a "dead" CPU (hardware glitch, tight infinite loop in an interrupt handler, whatever)? For the mythical system that "must recover from all errors and keep running" you need special approaches to software development (e.g. a realtime systems expert working in Ada) and special hardware support.

[opinion] To suppose that the developers of systemd know anything about realtime and critical systems is to ignore that they write massive, complex, and buggy code. Nothing in systemd will help you keep a system running in the face of unanticipated exceptions. If the several hundred thousand lines of code for systemd's core services are still running, and weren't themselves the source of the error, you haven't encountered a real problem yet. If you just need to keep re-starting a driver while testing it you will want to use a special test harness to set up and supervise your testing of your driver. I think that systemd would only get in the way of proper testing. I don't see any use for systemd in achieving fault tolerance.

garry · Feb 16, 2021

"Before systemd how was it historically done to keep services / daemons running?"

I think that Unix itself was an answer to the question. Multics couldn't be completed and couldn't be made reliable because it was too complex. Under Unix services were made reliable by breaking the system into small parts that are humble and cooperative.

The answer to "how to keep it running" is "keep it cohesive" not "make it more complex".

a6h · Feb 16, 2021

8bitGlitch said:
How are mission-critical systems on satellites handled, or anything else so far away that you cannot just power cycle it manually?

Short answer:
WDT.

Here's a lame/simplistic explanation:

Intro:
WDT has a counter register.
Counter has a limit value T (time, i.e. timeout), which must be set.
Clock increments the counter.

Source of clock:
* Internal oscillator
* System clock
* Backup RC oscillator

Procedure:
Software has to reset (kick) the counter periodically and regularly, before it reaches the value T.
If it doesn't => counter will overflow => WDT will reset the processor.

Program:
Embedded program has a main function.
Main function has an infinite loop.
Loop calls subroutines => successful (less than T) => kick the WDT => no reset.

Failure:
Any failure, e.g. software bug, hardware failure, EMI, etc => invalid software state =>
delay in loop and/or infinite loop, etc => timeout => WDT will reset the processor.

Internal WDT:
* Internal to µC/µP
* Less reliable

External WDT:
* Stand-alone IC
* Separate clock
* Monitoring VCC
* Better reliability

Footnote:
µP: Microprocessor
µC: Microcontroller
WDT: WatchDog Timer
VCC: Voltage Common Collector
T: a time value
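The procedure above can be simulated in software to see the behaviour. This is only a sketch in Python under stated assumptions: `WatchdogTimer`, `run_main_loop` and the tick counts are hypothetical stand-ins for a hardware counter and real subroutine run times, and a real WDT resets the processor rather than returning a flag.

```python
class WatchdogTimer:
    """Software model of a WDT: a counter that must be kicked before it hits T."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks  # the value T
        self.counter = 0
        self.resets = 0                     # how many times the WDT fired

    def kick(self):
        """Called by healthy software to restart the countdown."""
        self.counter = 0

    def tick(self):
        """One clock increment; returns True if the watchdog fired."""
        self.counter += 1
        if self.counter >= self.timeout_ticks:
            self.resets += 1
            self.counter = 0                # the system starts over after a reset
            return True
        return False


def run_main_loop(wdt, pass_durations):
    """Simulate the main loop; each entry is how many ticks one pass takes.

    Returns the indices of the passes that were cut short by a reset.
    """
    reset_during = []
    for index, ticks in enumerate(pass_durations):
        fired = False
        for _ in range(ticks):
            if wdt.tick():
                fired = True
                break
        if fired:
            reset_during.append(index)      # pass never completed
        else:
            wdt.kick()                      # subroutines finished in time
    return reset_during


if __name__ == "__main__":
    wdt = WatchdogTimer(timeout_ticks=5)
    # The third pass hangs for 9 ticks, longer than T = 5.
    print(run_main_loop(wdt, [2, 3, 9, 1]))  # [2]
    print(wdt.resets)                        # 1
```

Note that the loop only kicks the watchdog after a full pass has completed, so a hang anywhere in the pass is caught; kicking from a timer interrupt instead would hide a stuck main loop.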

sko · Feb 16, 2021

This mode of operation is actually still essentially the same today (e.g. on engine control units or other microcontrollers) as over 50 years ago on the Apollo Guidance Computer. Essentially that computer had a master control loop with a given time-frame in which all routines were ordered by importance and had to finish. If they didn't, the control loop aborted all running/further routines and restarted, throwing an alarm.
The "1201 Alarm" and "1202 Alarm" messages Armstrong reported back to mission control during the lunar descent were exactly those alarms (due to a frequency- but not phase-locked antenna controller that constantly triggered and overloaded the system).
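The pattern described here (routines ordered by importance, a fixed time frame, and an alarm when the frame overruns) can be sketched as below. This is a simplified Python illustration with made-up job names and abstract cost units; it is not a reconstruction of the actual Apollo software, and it models an overrun by dropping the least important jobs rather than by restarting anything.

```python
def run_frame(jobs, budget):
    """Run one frame of a priority-ordered control loop.

    jobs:   list of (priority, name, cost); a lower priority number is more
            important, and cost is in the same abstract units as budget.
    budget: total cost that fits into one frame.

    Returns (names of completed jobs, alarm flag).
    """
    used = 0
    completed = []
    for priority, name, cost in sorted(jobs):
        if used + cost > budget:
            # Overload: skip everything less important and raise an alarm.
            return completed, True
        used += cost
        completed.append(name)
    return completed, False


if __name__ == "__main__":
    jobs = [(2, "display", 3), (1, "guidance", 4), (3, "radar", 5)]
    print(run_frame(jobs, budget=12))  # (['guidance', 'display', 'radar'], False)
    print(run_frame(jobs, budget=10))  # (['guidance', 'display'], True)
```

Because the jobs are sorted by importance first, an overload costs only the least important work, and the alarm tells the operator that something was dropped.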

Jose · Feb 16, 2021

I took a similar approach with a flaky program I wrote. I called it a dead man's switch. It's funny how there's really nothing new under the Sun.

It was probably a thread deadlock I couldn't find because I was young and arrogant, and didn't know about jstack yet.

8bitGlitch · Feb 16, 2021

@ Garry, Vigole and Sko:

Thank you for your input. I wanted to talk to folks that did embedded systems for projects like NASA and other mission-critical work that might be beyond the scope of Linux and FreeBSD. I just find it odd that the Linux kernel is getting shoehorned into everything under the sun, and companies are creating problems to promote their solution. (yeah looking at you BlueHat).

Due to the latter, it got me thinking that something had to be better and in use long before systemd was supposed to help all of our embedded devices; however, if you look at Android and Cell phones - they are operating without issues, and car companies are leaving QNX for Android. So it would seem that a turnkey solution which does not use systemd is good enough for our automobiles.

Which brought me to the idea or question - "what was done historically" and "what does embedded stuff like on airplanes or satellites use".

I also suspect, or it wouldn't surprise me, that some flavor of BSD and some corporate entity have something even grander in use due to the nature of the BSD license.

zirias@ · Feb 16, 2021

Sorry to say that, 8bitGlitch, but in the domain you're asking about, systemd is just irrelevant. As are many other things. systemd is ONE solution for "init" in a general-purpose OS (and, IMHO, a bad one), and this is already where the analogy breaks: systems that are mission-critical (in a sense that human lives could depend on them) are not "general purpose". They are specifically tailored to their purpose.

8bitGlitch · Feb 16, 2021

Zirias said:
Sorry to say that, 8bitGlitch, but in the domain you're asking about, systemd is just irrelevant. As are many other things. systemd is ONE solution for "init" in a general-purpose OS (and, IMHO, a bad one), and this is already where the analogy breaks: systems that are mission-critical (in a sense that human lives could depend on them) are not "general purpose". They are specifically tailored to their purpose.

I thought it would be irrelevant as well; however, if one drinks the Kool-Aid and listens to BlueHat, one would think systemd is critically needed in the embedded market sector. When the sales pitch comes up for systemd, the embedded sector is always brought up, and is one of the driving factors to sell the concept, yet we see projects like GNOME get hooks into systemd. Not to mention someone from BlueHat has been quoted saying something along the lines of [we want to be in your car, and systemd will get us there], or something to that extent.

From my initial post, and my thinking when I opened this discussion - I feel educated on the historical items that came before systemd and that are still being used. Coming from CentOSvill 7th Ave, used on a limited scale, I am grateful for FreeBSD's method of controlling start up. It is very refreshing, and I can understand the concepts; however, I cannot say the same for systemd, and all the 'systemctl' commands I had to look up.

I want to thank everyone that had insight and input into this discussion and I appreciate all the feedback. I hope having the background will make me a better FreeBSD user, able to support and promote the ecosystem next time someone says RedHat, Debian or another non-BSD, Linux-kernel-based distro is recommended.

I just have not found a way to sell it to management, beyond my personal use. I did roll out a syslog server, which I hope to have up and running over next weekend.

sko · Feb 17, 2021

8bitGlitch said:
if you look at Android and Cell phones - they are operating without issues,

I highly object.
The devices in my surroundings that are by far the most annoying, mainly due to (extremely) poor software quality and hence often bad stability and annoying bugs, are Android devices, despite the fact that I still keep their numbers at a bare minimum. I can't remember how often I had to unplug my FireTV stick because it lags horribly or is completely unresponsive (I mean, really? that thing has a quad core CPU FFS!!). Same on the Android phone in my car which also acts up every few days and essentially only has to stream some music, which was possible without any hassle 15 years ago with the computing equivalent of today's refrigerator. My impression of Android is: code quality throughout the whole stack decreases rapidly, so they throw more and more abstraction layers and frameworks at it "because this makes it easier to write working code", which leads to an overall even more bloated and bug-ridden system.
The last OS I'd rely on for something that just absolutely has to work would be Android (I'm not counting that mouse-only OS here). Yes, it boots on everything, but the ecosystem is a dumpster fire.
That's why my main phone is still a Blackberry Classic with BBOS (QNX), just because it still 'just works'™

8bitGlitch said:
I just have not found a way to sell it to management, beyond my personal use. I did roll out a syslog server, which I hope to have up and running over next weekend.

It might be because I mostly worked as a one-man show in small/mid-sized companies, but my attitude to this is just: If my boss doubts my expertise when I tell him X is best suited for that task, he can start searching for a new sysadmin. But I can imagine how it works in bigger companies, so I'd suggest starting with low-hanging fruit and e.g. promoting OpenBSD for security-critical stuff and/or networking. Word about OpenBSD's security should be known even to the most ignorant levels of management and as soon as "BSD" is somewhat familiar for them it might be easier to slip a few more servers in here and there, even running FreeBSD.

drhowarddrfine · Feb 17, 2021

8bitGlitch said:
Which brought me to the idea or question - "what was done historically" and "what does embedded stuff like on airplanes or satellites use".

I was an electronic engineer and designed a medical computer for eye surgery. If you ever had eye surgery, there's a 50/50 chance the surgeon used my machine, before 2000, and current machines by the same company still use my basic design.

I used a real-time operating system that was not Unix, BSD, Linux or anything similar. There was no disk drive (it booted out of EPROM) but it did save data in nonvolatile memory. It had a real time clock and watchdog timer as described above. I guess sitting in an operating room with probes inflating someone's eye is as mission critical as you can get.
