FreeBSD standalone dump

This isn't one of those dreaded "Why isn't FreeBSD more like...." posts. But after almost 50 years working with IBM mainframes, I am curious how FreeBSD sysadmins deal with full system crashes.

On IBM mainframes, booting (IPL, or Initial Program Load, in IBMese) does not clear storage. There is a separate button, System Reset, that does that. So if the system crashes, you can boot a program called Standalone Dump. There is some space in low storage reserved for it, so it loads in there, and can dump storage (including pages that are paged out) to a tape. (I don't know if they still use a tape, but they did when I was a sysadmin many years ago.) There are then programs you can run after the system comes back up to analyze the dumped storage.

I am wondering how FreeBSD sysadmins deal with crashes like that. I am also wondering if the answer is that FreeBSD doesn't crash all that often. When I started as an IBM mainframe sysadmin in the '70s, the operating system crashed a lot. I don't think that happens much any more, with 50 years' worth of development of recovery code.
 
Pretty cool. The only constraint is that the kernel panic processing must be in good enough shape to dump storage. I suspect that these days that is not a problem. Much more convenient (and quick) than a standalone dump would be.
 
Yes, all UNIX and UNIX-like systems dump core to swap. At boot savecore(8) will save the dump to /var/crash.
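
For anyone who wants to check what savecore(8) has already saved, here is a minimal Python sketch. It assumes the dumps are uncompressed and follow savecore's usual vmcore.N naming (with a matching info.N beside each one); the function name `list_crash_dumps` is made up for this example.

```python
import re
from pathlib import Path

# savecore(8) writes each dump as vmcore.N, where N is a sequence number.
# Assumption: uncompressed dumps only; compressed dumps carry an extra suffix.
_DUMP_RE = re.compile(r"^vmcore\.(\d+)$")


def list_crash_dumps(crash_dir="/var/crash"):
    """Return (sequence_number, path) pairs for saved dumps, oldest first."""
    root = Path(crash_dir)
    if not root.is_dir():
        return []
    dumps = []
    for entry in root.iterdir():
        match = _DUMP_RE.match(entry.name)
        if match:
            dumps.append((int(match.group(1)), entry))
    # Sort numerically so that vmcore.10 comes after vmcore.2.
    return sorted(dumps, key=lambda pair: pair[0])


if __name__ == "__main__":
    for number, path in list_crash_dumps():
        print(number, path)
```

The last entry in the list is the most recent dump, which is normally the one to hand to a developer.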

Unlike MVS (I spent 19 years as an MVS systems programmer before switching tracks to UNIX 30 years ago), the dump facility is built into the kernel. The kernel will not dump core during early boot, when its dump facility has yet to be initialized. Whereas if MVS failed during NIP, no problem: one could always IPL SAD.

On the flip side, FreeBSD includes a facility called ddb(4), an interactive debugger on the console. Think of it as an Omegamon for FreeBSD, except that the system is effectively halted while in DDB. In Omegamon under MVS, by contrast, one could do many of the things we can do with DDB while the system was running normally.

The ability to IPL a mini-O/S like SAD allowed a company I worked for to produce a commercial product called Stand Alone Edit (SAE). It provided an ISPF Edit-like interface on the console, allowing the systems programmer to edit a dataset like sys1.parmlib to make an un-IPL-able system IPL again. In FreeBSD we have single user state. It's much more convenient than booting another mini-O/S to edit a broken file. Though, in those rare cases when that doesn't even work, one can maintain an alternate partition (or disk) as a rescue mechanism, in the same vein as Stand Alone Edit was.

When I left the MVS world, MVS/ESA was pretty solid. Crashes were rare, though deadlocks (a spin loop where two threads were in a deadly embrace) were still a thing. We used to get a lot of these with FreeBSD years ago, but here too they are much less common now. FreeBSD includes a number of facilities, enabled by default only in -CURRENT, that allow developers to more quickly identify potential deadlocks, such as the WITNESS kernel build option. You'll see LOR (lock order reversal) messages on the console when this option is enabled and a potential deadlock has been discovered.
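
To illustrate what a lock order reversal is, here is a toy Python sketch. It is not how WITNESS is implemented, only the general idea under simple assumptions: remember which locks were taken while others were held, and complain when a later acquisition contradicts that order. The class name and the lock names "A" and "B" are invented for the example.

```python
from collections import defaultdict


class LockOrderChecker:
    """Toy checker: records lock acquisition order and reports reversals."""

    def __init__(self):
        self.order = defaultdict(set)  # order[a] = locks taken while a was held
        self.held = []                 # locks currently held, oldest first
        self.reports = []

    def _reachable(self, start, target):
        """True if target follows start in the recorded acquisition order."""
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.order[node])
        return False

    def acquire(self, lock):
        for holder in self.held:
            # If "lock before holder" was seen earlier, taking lock while
            # holding holder reverses that order: a potential deadlock.
            if self._reachable(lock, holder):
                self.reports.append(f"lock order reversal: {holder} then {lock}")
            else:
                self.order[holder].add(lock)
        self.held.append(lock)

    def release(self, lock):
        self.held.remove(lock)


checker = LockOrderChecker()
checker.acquire("A")
checker.acquire("B")      # establishes the order A before B
checker.release("B")
checker.release("A")
checker.acquire("B")
checker.acquire("A")      # B before A contradicts the earlier order
print(checker.reports)    # ['lock order reversal: B then A']
```

Note that nothing actually deadlocks in this run. Like a LOR message, the report only says that two code paths take the same locks in opposite orders, which could deadlock if they ever ran at the same time.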

Another reason for MVS crashes was its shared memory architecture. Shared memory (the CSA) would become horribly fragmented if the system was up for more than a week or two, maybe a month in some cases. UNIX shared memory is shared between applications and not mapped into every address space, kind of like MVS Cross Memory Services.
 
I am wondering how FreeBSD sysadmins deal with crashes like that. I am also wondering if the answer is that FreeBSD doesn't crash all that often.
Sysadmins don't deal with crashes these days. It's not their role, and indeed, a "full system" crash is a very rare thing today, not only in FreeBSD but in most major systems. Even FreeBSD -CURRENT (the development branch) rarely crashes. Regarding RELEASE versions of FreeBSD, in the 10 years I've been running them, I've seen exactly 3 crashes, all for the same reason: something that seemed to be triggered by some issues with ZFS (PR 275594), which didn't cause a panic but showed a message about timeouts trying to swap something in, leading to a partial "live halt".

So, developers should deal with these when they happen. Sysadmins are no longer forced to analyze crashes, because they almost never happen. They can just report an issue, and a developer might ask them to collect a dump if the crash can't be reproduced.

That's just to put this into perspective and clarify the typical roles; of course, if you're interested in the debugging and analysis mechanisms, learn about them. ;)
 
I almost forgot … on 10th September, I had a main-n272143-8fa5e0f21fd1 kernel panic that lasted more than eleven hours; "……very slowly printing panic-related lines on two displays. …".


Amongst the comments:

"… probably not super surprising"
 
ddb sounds a bit like DSS (Dynamic Support System), which was removed from MVS around 1978 (just after I learned to use it). You would change a byte in low storage, then press the RESTART key, and it would bring up an interactive debugger. You could also set traps, and when they hit, everything would stop and DSS would come up. It was replaced by SLIP traps, which would produce a dump and keep going. This was much more practical in a production data center.

I am familiar with Stand Alone Edit. I really wanted it when I was an MVS sysprog, but could never convince my bosses to buy it. We had a single-pack system for emergencies, but people kept forgetting that it was supposed to be a self-contained system and putting in links to other volumes, which caused problems when we needed it for emergencies.

I use single-user mode a lot when I have configuration problems, and I agree, it is *much* more convenient than booting up a mini system.

BTW, working with MVS for about 45 years (after six years as a sysprog, I went to work for a software vendor) really gives me an appreciation for all the messages FreeBSD puts out during startup, just like MVS. One thing I hate about Windows is that it will just sit there for five minutes with the indicator spinning, then will give you an inscrutable STOP code, and you don't really have any hints as to what the problem might be.
 
Sysadmins don't deal with crashes these days. It's not their role, and indeed, a "full system" crash is a very rare thing today, not only in FreeBSD but in most major systems. Even FreeBSD -CURRENT (the development branch) rarely crashes. Regarding RELEASE versions of FreeBSD, in the 10 years I've been running them, I've seen exactly 3 crashes, all for the same reason: something that seemed to be triggered by some issues with ZFS (PR 275594), which didn't cause a panic but showed a message about timeouts trying to swap something in, leading to a partial "live halt".

So, developers should deal with these when they happen. Sysadmins aren't forced any more to look into analyzing crashes, because they almost never happen. They can just report an issue, and maybe a developer might ask them to collect a dump if the crash can't be reproduced.

So far just to put this into perspective and clarify the typical roles; of course if you're interested in the debugging and analysis mechanisms, learn about them. ;)
Actually MVS systems programmers didn't usually deal with crashes either. My company was a large IBM user, and we had a resident IBM Field Engineer who had his own office and looked at all the system dumps. I have no idea what smaller customers did. Back then, IBM supplied source listings for MVS, but not the actual source, and system programmers regularly modified it with machine-code patches to make it do what they needed. So the only reason a systems programmer would be looking at standalone dumps was if it was caused by local mods. (That was a long time ago. Since then, IBM stopped providing source listings and provided a ton of user exit points, which was better for everyone.)
 
ddb sounds a bit like DSS (Dynamic Support System), which was removed from MVS around 1978 (just after I learned to use it). You would change a byte in low storage, then press the RESTART key, and it would bring up an interactive debugger. You could also set traps, and when they hit, everything would stop and DSS would come up. It was replaced by SLIP traps, which would produce a dump and keep going. This was much more practical in a production data center.

I am familiar with Stand Alone Edit. I really wanted it when I was an MVS sysprog, but could never convince my bosses to buy it. We had a single-pack system for emergencies, but people kept forgetting that it was supposed to be a self-contained system and putting in links to other volumes, which caused problems when we needed it for emergencies.

Sadly, the company folded in 1990. I doubt anyone managed to salvage the SAE sources before they disappeared.

I use single-user mode a lot when I have configuration problems, and I agree, it is *much* more convenient than booting up a mini system.

Agreed.

BTW, working with MVS for about 45 years (after six years as a sysprog, I went to work for a software vendor) really gives me an appreciation for all the messages FreeBSD puts out during startup, just like MVS. One thing I hate about Windows is that it will just sit there for five minutes with the indicator spinning, then will give you an inscrutable STOP code, and you don't really have any hints as to what the problem might be.
The only thing I miss are the message IDs, like the IEA or IEF messages. It was so very handy to look up the messages in the Messages and Codes manual (or manuals these days). Unfortunately, this is not the UNIX way.

I did spend three or four months on a special project working with an MVS application group at my present employer about eleven years ago. I kept typing h, i, j, and k to move around the editor, forgetting I was using ISPF instead of vi. Once a person starts using vi(1), it's difficult to get it out of one's system. The PA2 key was my lifesaver.
 
Actually MVS systems programmers didn't usually deal with crashes either. My company was a large IBM user, and we had a resident IBM Field Engineer who had his own office and looked at all the system dumps. I have no idea what smaller customers did. Back then, IBM supplied source listings for MVS, but not the actual source, and system programmers regularly modified it with machine-code patches to make it do what they needed. So the only reason a systems programmer would be looking at standalone dumps was if it was caused by local mods. (That was a long time ago. Since then, IBM stopped providing source listings and provided a ton of user exit points, which was better for everyone.)
Crashes were rare. Spin loops, not so rare. Then again, spin loops used to plague FreeBSD too. Not so much now but we can still run across the occasional one these days.

Yeah. The FE had his own office.

Source listings were provided on microfiche. Sources were written in PL/S, with Assembler translations. JES/2 and JES/3 sources were shipped with the system. At one site we had about 40k source code mods to JES/2. It took me about four months to replace the source code mods with exits.

In addition to source code listings and program logic manuals, which described in detail the inner workings of the nucleus (kernel), they also shipped CPU logic manuals for the IBM 3080 series of machines. We had machine logic manuals for 3081 and 3083. The firmware was so buggy that one needed the CPU logic manuals in order to discern whether to open a trouble ticket with IBM that the CPU was not working as documented.

The Intel and AMD documentation is similar to the 370 POP (Principles of Operation) but doesn't have the level of detail the CPU logic manuals had.

Subsequent CPUs were not shipped with CPU logic manuals because they'd finally figured out where the firmware bugs were and fixed them.

I still have S/360 (from my high school days), 370/XA, and 370/ESA POP manuals in my bookshelf here. More for nostalgia than anything else.
 
The only thing I miss are the message IDs, like the IEA or IEF messages. It was so very handy to look up the messages in the Messages and Codes manual (or manuals these days). Unfortunately, this is not the UNIX way.
True. The message ID prefixes matched the module name prefixes, but there was a funny quirk. Suppose you had a module that was an interface between TCAM (TeleCommunications Access Method, for non-MVSers) and VTAM (Virtual Telecommunications Access Method). Since it was a TCAM module, it should have the IED prefix assigned to TCAM. But since it was a VTAM module, it should have the IST prefix assigned to VTAM. So it would have both! If the module name was, say, IEDINTFC, it would have an alias of ISTINTFC. I only saw this with modules that did not issue messages; I don't know which prefix they used for messages.

Another nice thing about message IDs: they made localization easier. With the Natural Language Support subsystem, you could set up a template for an existing message, based on message ID, that showed where the variable fields in the message were, then define prototypes for other languages, and it would grab the variable fields and stuff them into the translated version. This was very handy since so much of MVS code was written without a thought of translation.
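
A rough Python sketch of that template-and-prototype idea, for anyone who has not seen it. This is not the MVS facility itself: the message ID XYZ123I, its wording, and the translations are all invented for illustration, and the lookup tables are ordinary dictionaries.

```python
import re

# Hypothetical message ID and wording, invented for this example.
# The template marks where the variable fields sit in the original message.
TEMPLATES = {
    "XYZ123I": re.compile(r"^XYZ123I JOB (?P<job>\S+) ENDED, RC=(?P<rc>\d+)$"),
}

# Prototypes for other languages, keyed by (message ID, language).
PROTOTYPES = {
    ("XYZ123I", "de"): "XYZ123I JOB {job} BEENDET, RC={rc}",
    ("XYZ123I", "fr"): "XYZ123I TRAVAIL {job} TERMINE, RC={rc}",
}


def translate(message, language):
    """Re-issue an existing message in another language, keyed by message ID."""
    msg_id = message.split(" ", 1)[0]
    template = TEMPLATES.get(msg_id)
    prototype = PROTOTYPES.get((msg_id, language))
    if template is None or prototype is None:
        return message  # no template or translation: pass through unchanged
    match = template.match(message)
    if match is None:
        return message  # text does not fit the template: pass through
    # Grab the variable fields and stuff them into the translated prototype.
    return prototype.format(**match.groupdict())


print(translate("XYZ123I JOB PAYROLL ENDED, RC=0004", "de"))
# XYZ123I JOB PAYROLL BEENDET, RC=0004
```

The point is that the program issuing the message never changes; everything hangs off the message ID.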

I am not sure how FreeBSD handles this. I just scanned Chapter 25, Localization, of the FreeBSD Handbook, but did not see how you get the right messages for your locale. I will have to investigate more.
 
Source listings were provided on microfiche. Sources were written in PL/S, with Assembler translations. JES/2 and JES/3 sources were shipped with the system. At one site we had about 40k source code mods to JES/2. It took me about four months to replace the source code mods with exits.

When I worked for General Motors, one of my colleagues at another division had so many local mods to JES2 4.0, that when IBM came out with JES2 4.1, it was easier for him to refit the 4.1 changes to 4.0 than it would have been to refit his local mods to 4.1.

The fact that we had the PL/S source but no PL/S compiler was irritating. At the 1978 SHARE (IBM user conference), they were handing out buttons that said "We want to fix it in the language it broke in."

There was a PL/S Language Reference Manual, but it was IBM Company Confidential. One of our IBM SEs (translation for non-MVSers: SEs, Systems Engineers, help with the configuration of the operating system and licensed applications; FEs, Field Engineers, debug operating system problems; and CEs, Customer Engineers, install the hardware and debug hardware problems) was very accommodating, though, and although she could not let me read the manual, she went out of her way to make sure I saw which desk drawer she put it in when she went home at night.
 
BTW, when I worked for a software vendor, I was working on a project where I needed XML versions of our product's messages. My plan was to use Natural Language Support to generate XML versions, with the variable fields plugged into XML elements. The project got cancelled, but I thought it was a nice way to get an XML version of the messages without making any changes to the application.
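
Roughly what I had in mind, as a Python sketch rather than anything that ever ran on MVS. The message ID XYZ123I, the message text, and the element names are hypothetical, picked only to show variable fields landing in XML elements.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical message layout, invented for this example.
TEMPLATE = re.compile(r"^(?P<id>\S+) JOB (?P<job>\S+) ENDED, RC=(?P<rc>\d+)$")


def message_to_xml(message):
    """Turn a plain-text message into XML, one element per variable field."""
    match = TEMPLATE.match(message)
    if match is None:
        raise ValueError(f"message does not fit the template: {message!r}")
    root = ET.Element("message", id=match.group("id"))
    for name in ("job", "rc"):
        ET.SubElement(root, name).text = match.group(name)
    return ET.tostring(root, encoding="unicode")


print(message_to_xml("XYZ123I JOB PAYROLL ENDED, RC=0004"))
# <message id="XYZ123I"><job>PAYROLL</job><rc>0004</rc></message>
```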
 
When I worked for General Motors, one of my colleagues at another division had so many local mods to JES2 4.0, that when IBM came out with JES2 4.1, it was easier for him to refit the 4.1 changes to 4.0 than it would have been to refit his local mods to 4.1.

We had a redistribution of assignments. Our manager got the six of us together in a meeting room. His first words were, "Roy doesn't do JES." Later on during the meeting, when JES came up for discussion, he asked, "Who wants to do JES?" My colleague, Larry, piped up, "I do." The manager replied, "OK, Cy does JES." My thought was, like, thanks a lot, Larry. I think I was destined to do JES/2 at that site all along. I suspected he wanted to establish who was boss and make sure JES would be cleaned up. Roy would never have removed his 40k lines of source code mods. It took a few months, but the source mods were replaced with exits. And I managed them using SMP/E rather than the cobbled-together build JCL Roy had.

Roy and one other teammate would play mind games with the manager. The manager threw me into the mix in an attempt to regain control of the team. I did my job and survived.
 
The only thing I miss are the message IDs, like the IEA or IEF messages. It was so very handy to look up the messages in the Messages and Codes manual (or manuals these days). Unfortunately, this is not the UNIX way.
Manual is a foreign word nowadays. Try it on anyone under 50, no matter how technically sophisticated.
 
In some ways, that may be a good thing. When I was doing technical support for a bunch of applications groups, they would come in and ask questions that were clearly in the documentation, and I wondered why they didn't read it. Then one day I looked up at our bookcase of binders full of IBM documentation, which was about ten feet long and from the floor to the ceiling, and I realized that they were too intimidated by the mass of documentation and didn't know where to start. Man pages, as well as search engines, have made it much easier for programmers to find answers.
 
In my experience people don't read the man pages either, and there's a reason why LMGTFY is a thing.

I read the manuals. Maybe I'm the only one.

I would much rather have a concise and well-written manual than be expected to watch some harebrained, 5-minute YouTube video full of corny music and stupid cutesy transitions.
 