uptime

With servers I have been in a situation where physical access to the machine was difficult and limited (drive to the DC, find the people with access). In several cases there was no KVM device connected, perhaps not even one available. That means every reboot should be 100% sure. In such a situation one naturally does not want to reboot when the system is working and doing its job...

Today I have one such box running:
Code:
$ date; uptime
Sat Dec 11 11:57:33 EET 2021
11:57AM  up 310 days, 20:05, 6 users, load averages: 1.68, 1.12, 0.91

... and I know that I can temporarily order KVM service, but last time I couldn't get it working because the service provider's device is probably old and uses EOL Flash. I had no idea how to get Flash working on my FreeBSD desktop...
Very true. I managed servers that were remotely accessible only via serial port, and yes, you have to be damn certain, otherwise...
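
One thing that takes some of the sweat out of the kernel-update case on FreeBSD is nextboot(8), which boots a given kernel exactly once: if it panics, a remote power cycle (e.g. via the provider's control panel) brings the old kernel back. A rough sketch; the kernel name NEWKERNEL is just a placeholder for a kernel directory under /boot:
Code:
# try the new kernel exactly once, keeping the old one as the default
nextboot -k NEWKERNEL
shutdown -r now
# if it comes up fine, make it the default; if it panics, the next
# boot falls back to the old kernel automatically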
 
Perhaps uptime became a religion because often these 'ancient' operating systems took a very long time to reboot.
Definitely used to be slow. In the 70s and early 80s, booting an IBM mainframe typically took 1.5 hours for a cold boot (from power-up) and 45 minutes for a warm boot (if everything already had power). I remember that one of the big advances of the IBM 43x1 was that it booted faster; that was considered a requirement for the mid-range environment it was typically deployed into. At the same time, minicomputers (VAX, Data General, and a Four Phase machine branded Philips that we used) still needed 5-10 minutes to boot.

Today we actually have the opposite religion, with machines being automatically and regularly rebooted. That's sort of an admission that software (including operating systems) is imperfect and that cruft will accumulate.
 
Definitely used to be slow. In the 70s and early 80s, booting an IBM mainframe typically took 1.5 hours for a cold boot (from power-up) and 45 minutes for a warm boot (if everything already had power). I remember that one of the big advances of the IBM 43x1 was that it booted faster; that was considered a requirement for the mid-range environment it was typically deployed into. At the same time, minicomputers (VAX, Data General, and a Four Phase machine branded Philips that we used) still needed 5-10 minutes to boot.
And you would sweat through every second hoping something wouldn't fail & you'd be down the entire day & looking for a new job the next. :)
 
On mainframes, things usually don't fail, and they don't crash. Or to be more accurate: things fail all the time, but they keep running, perhaps a little slower than usual, or with less disk space or less memory. That is, until a junior programmer by mistake manages to crash the OS from a user process (in Fortran, no assembly required). The third time it happened, one of the operators came out to my terminal and asked me to "not do that again, and show us exactly what you did". I had been wondering why the machine crashed every time I wanted to compile/link/run my analysis program.

The answer, by the way, was a wonderfully subtle bug. When you program in Fortran 77, it so happens that the entry part of the program (not a subroutine, but the main body) is implicitly in a subroutine called "MAIN". You don't know that, it's not documented, it's just an internal convention. My problem was that I had defined a subroutine that was also called MAIN, which did most of the work of the program (I probably had other subroutines with names like SETUP, READINP and WRTRSLT).

The problem happened because our system ran a combination of the Hitachi F77 compiler with the IBM linker. The compiler happily prepared a module called MAIN (with the main program), and another module called MAIN (with the subroutine). The linker would take the first one and copy it into the executable. It would then notice that a subroutine named MAIN is being called, and put another copy of the first one in (not the second one, due to an incompatibility between Hitachi and IBM). It would then notice that a subroutine called MAIN is being called, and put yet another copy of the first one in. The linker "knew" that Fortran programs can't be recursive, and it "knew" that object modules have unique names, and it was supposed to have a table of all modules it had already linked, but the incompatibility in naming conventions between Hitachi and IBM broke that. So the linker would create the executable in memory, trying to copy infinitely many copies of MAIN into it. Unfortunately, the linker is a system program, so it is exempt from things like memory quotas, and it exhausted and overwrote all memory on the machine, causing an OS crash.

No problem, I would go have a snack. After a while, the machine comes back up, I remembered what I had been working on, and I restarted the compile/link/execute cycle, causing another snack break. And again. During the third time, the operators had noticed the number of the terminal that was running the last process before the machine went down, and they found me because I had my coffee mug and all my paperwork sitting there and always returned to the same place. So I explained it to them. They went and got some of the system programmers, who looked at my code, and looked at some IBM documentation, and looked at some logs of the machine, and after an hour or so they told me that I had rediscovered a bug that was known to IBM but not yet to customers. All I had to do to fix the program was to rename the subroutine to anything other than MAIN, and it worked perfectly.
 
On some old IBM systems you could name your binary the same as the syncer process that runs when the machine is on emergency power and about to fail. That process would not be interrupted, was not killable, and was highly annoying to operators. Together with another little-known feature of the batch processing, you could run your CPU-hogging code at any time while the bean counters were looking at non-responding spreadsheets in the meantime.
 
With servers I have been in a situation where physical access to the machine was difficult and limited (drive to the DC, find the people with access). In several cases there was no KVM device connected, perhaps not even one available. That means every reboot should be 100% sure. In such a situation one naturally does not want to reboot when the system is working and doing its job...

Today I have one such box running:
Code:
$ date; uptime
Sat Dec 11 11:57:33 EET 2021
11:57AM  up 310 days, 20:05, 6 users, load averages: 1.68, 1.12, 0.91

... and I know that I can temporarily order KVM service, but last time I couldn't get it working because the service provider's device is probably old and uses EOL Flash. I had no idea how to get Flash working on my FreeBSD desktop...
Today it looks like I am going to decommission that machine. I have transferred all the VMs to another machine, and one of the drives in the zpool is already giving errors...

Code:
# date; uptime
Thu Nov  3 21:10:17 EET 2022
 9:10PM  up 638 days,  5:18, 7 users, load averages: 1.61, 1.40, 1.28

But, yes, FreeBSD as a system seems stable.
 
I wanted to take a moment to share an exciting milestone regarding my FreeBSD server. I'm thrilled to report that the server has been running continuously for an impressive 954 days since last boot and counting!

No need to criticize, I know that this will not end well...

Code:
# date ; uptime
Fri Sep 15 19:42:56 EEST 2023
 7:42PM  up 954 days,  2:51, 10 users, load averages: 1.14, 1.48, 1.54

Somehow one failed drive in the ZFS pool also came back to life by itself (this seems like magick):

Code:
        NAME        STATE     READ WRITE CKSUM
        zroot       ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            ada0p3  ONLINE       0     0     0
            ada1p3  ONLINE       0     0     0
            ada2p3  ONLINE       0     0     0
            ada3p3  ONLINE       0     0     0
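
A drive that comes back on its own is not necessarily trustworthy; a scrub plus a look at the drive's SMART data would show whether it really recovered. A rough sketch, assuming the suspect disk is ada2 and sysutils/smartmontools is installed:
Code:
# re-read and verify every block in the pool
zpool scrub zroot
# watch scrub progress and any new read/write/checksum errors
zpool status -v zroot
# check the drive's own error counters (adjust the device name)
smartctl -a /dev/ada2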

I must admit that I haven't had as much time to focus on server maintenance recently, but this remarkable uptime speaks volumes about the robustness of the infrastructure and the stability of FreeBSD.

In Buddhism, there is a profound wisdom that states: "All things that have a beginning must also have an end." Just as in the realm of computers, where every system that boots up will eventually power down.
 
The whole "my uptime is better than your uptime" and uptime records stuff is so 1990s. Back then, we had uptime measured as follows:

  • Windows: days
  • Linux: months
  • FreeBSD: years
Some of our FreeBSD systems set up in the 90s had uptimes of several years; I believe around five years was our record. But then the threat landscape changed and frequent security patches became a necessity. Now the uptime on Windows is one patch cycle (a month), and Linux is similar. On FreeBSD it is the time between one security patch affecting the kernel and the next.
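
On FreeBSD it is at least easy to see whether a given patch actually needs that reboot: something like the following compares the running kernel with what is installed on disk.
Code:
# apply the latest security patches for the installed release
freebsd-update fetch install
# -r = running kernel, -k = installed kernel, -u = installed userland;
# if -k is newer than -r, the patch touched the kernel and a reboot is due
freebsd-version -r -k -u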
 
During the COVID lockdown I had a ratty little T23 ThinkPad in my office providing me with an "emergency" tunnel into some of the machines there. That ended up having an uptime of a couple of years.

It was i386 (P3) and these days most updates actually reduce stability on aging platforms. Its only listening ports were for SSH and that doesn't seem to have had any major (unauthenticated) security issues for a while. Even to this day it is probably safer than most fully updated machines.
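
Verifying that sort of exposure is quick on FreeBSD, since sockstat in the base system lists every listening socket and the process behind it:
Code:
# show all listening IPv4/IPv6 sockets and their owning processes;
# anything besides sshd (or services bound only to 127.0.0.1) deserves a look
sockstat -46l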
 