How does one investigate failure to boot?

I'll start with a brief outline of my adventure installing FreeBSD on a Dell R730 server, which led to this thread. Here's how it went today:
- got a new (refurb, really) PowerEdge R730 server,
- tweaked some obvious settings in BIOS,
- updated every firmware in the system to Dell's latest and greatest via Dell's Lifecycle Controller,
- that included its PERC H730 RAID controller (which really uses an LSI SAS3108 chip),
- all updates went smoothly,
- plugged in a USB stick with the BSD image,
- connected to the remote console via iDRAC - essentially KVM over the wire IIUC, so everything is done via an HTML5 console as if I had a monitor and keyboard plugged into that server,
- all goes well until it doesn't - stuff flashes past (good luck reading that) until something bad enough happens that it sends the system into an immediate reboot without any warning or grace, and you're staring at the BIOS startup screen.

Q1:
First thought. WTF just happened and how do I even begin investigating this? I can't possibly read this fast. So here's question number 1. In this scenario is there a way to recover dmesg or some kind of print out? Is it being written to that USB flash drive or anywhere at all? Is there a way to write it somewhere I can later review?

I thought of nothing better to do than to try and screen-capture something while it flashes past.
I got lucky. Twice. Here's that screencap:
[attached screenshot: rpviewer (9).png]


Lucky the first time simply because I managed to capture that. Lucky the second time because I spotted mfi mentioned there, and I knew that was the RAID controller driver from my past adventures with an R720 and its PERC H710 controller. Which leads us to ...

Q2:
Being lucky like this sucks. I'll take it this time, but it may not happen tomorrow. How would you (a) retrace your steps and reproduce what happened - that's essentially my Q1 above - and (b) go about finding a solution when you've no clue what any of that stuff on the screen means?

Well, I knew roughly what mfi was about, so naturally that RAID controller was our prime suspect. Some man page reading and googling later, I arrived at mrsas(4). It turns out my chip and that card should in fact be using that driver, but for whatever reason mfi(4) takes priority unless you override that. So, at boot I drop into the loader prompt:
Code:
OK set mrsas_load="YES"
OK set hw.mfi.mrsas_enable="1"
OK boot
Bingo, we hit the jackpot and we boot - all is well with the world.

We install the system, drop into shell and write mrsas_load="YES" to /boot/loader.conf and hw.mfi.mrsas_enable="1" to /boot/device.hints, reboot and confirm everything works.

Q3:
That behavior of just wiping state and rebooting seems incredibly abnormal. IMO it should only ever be allowed to happen when hardware actually fails. So, this driver somehow faulted ... but surely it shouldn't flat out abort mission and leave me staring at BIOS load screen? Is this normal, expected behavior, or are we looking at a genuine bug?

Dear oldhats. What do you do when stuff like this happens to you?
Thank you
 
Ideally, your machine should have a serial interface, COM1 or such.
Ideally, you would switch console to serial where you can log the whole story.
You can also introduce a kernel debugger etc. (there are man pages on all the stuff). The kernel debugger would then engage at your panic, and you can look into all the CPU registers etc. ;)

In fact, I never did that. I always did it the way you do it here: see what we've got, apply some logical thinking, and solve the problem.

Another common practice is called "minimal config": exclude all of the optional hardware until we have only the minimum necessary, and so try to figure out which component introduces the failure.
 
Ideally, your machine should have a serial interface, COM1 or such.
Ideally, you would switch console to serial where you can log the whole story.
I've seen that mentioned in various places but every single time the writer assumed the reader would totally know what they were talking about and I still have no idea what that serial console is and what I can do with it.
 
You can also introduce a kernel debugger etc. (there are man pages on all the stuff). The kernel debugger would then engage at your panic, and you can look into all the CPU registers etc. ;)
That was going to be a follow-up question, but I didn't want to derail the conversation. I'd totally want to know how to attach the debugger (e.g. when you are not on the same machine but remote like that) and investigate for reals. If anyone could teach me, that'd be rad. Scientific approach and step debugging trump luck ... or at least they should.
 
I've seen that mentioned in various places but every single time the writer assumed the reader would totally know what they were talking about and I still have no idea what that serial console is and what I can do with it.
Well, a serial console is a serial console. Question: what knowledge are you missing? Do you know what a modem is? What a nullmodem is? What a terminal is? That traditionally a Unix machine would allow login not only on the console (and via network), but also on a number of terminals attached to the serial ports (directly wired, or via modems and telephone dial-up)?

The whole point of a serial console is that you move the boot console away from keyboard+monitor and onto one of the serial interfaces. Obviously you therefore need another end to connect to, either a terminal (not very common anymore today) or another computer/laptop/whatever.
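
As a rough sketch of what "moving the boot console" can look like on FreeBSD, these are the loader settings described in the Handbook's serial console section, placed in /boot/loader.conf. The 115200 speed is an assumption and has to match whatever the other end uses; on a board where a port other than COM1 is the one actually wired, the port has to be changed as well.

```sh
# /boot/loader.conf (sketch; values are assumptions, adjust to your hardware)
boot_multicons="YES"              # allow more than one console at boot
boot_serial="YES"                 # ask the kernel to use the serial console
comconsole_speed="115200"         # must match the speed on the receiving end
console="comconsole,vidconsole"   # serial console plus the regular video console
```

On the receiving machine, a terminal program such as cu(1) (for example `cu -l /dev/cuaU0 -s 115200`, where the device name depends on your adapter) then shows the boot messages, and that session can be logged to a file.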

BTW: do you have physical access to that Dell machine, or is it in some compute center somewhere else in the world (which would make things a bit more difficult)?

And another question out of personal curiosity: on which OS do you run the browser that would give you an iDRAC console? (I didn't get that working with current Firefox on FreeBSD - I can replay the boot logs, but cannot get it to work interactively)
 
there are man pages on all the stuff
Do you remember which man pages, please?

Well, a serial console is a serial console. Question: what knowledge are you missing? Do you know what a modem is? What a nullmodem is? What a terminal is? That traditionally a Unix machine would allow login not only on the console (and via network), but also on a number of terminals attached to the serial ports (directly wired, or via modems and telephone dial-up)?
I realize I may not even know what a console is anymore. To me it's always been what I described: user interaction between me and the machine, a "chess"-like turn-based game. Would a "not serial" console even be useful? Like a "concurrent" console - now that would be a mess. Isn't every console serial? A modem is that box that sends beeps and bops over a telephone wire. I don't recall what a nullmodem is exactly, but I seem to remember you could enter those modem AT codes by hand - is that what you mean? I've no idea what a terminal is. Seriously, it means many things in many fields. Now, with the serial port we are getting somewhere. That, I think, is an actual port I can see on the back of my server. So IIUC there may be a way to redirect boot output to that port or something that stands in for it. How do I do this? Remotely? Do I communicate Telnet-style or something? So, yeah, essentially: how do I capture that output? What are the steps to take? I have more questions than when we started :)
 
Guys, not so fast please. 1st of your issues:
  • The BeaSD keeps the boot messages in /var/run/dmesg.boot. So you can always less /var/run/dmesg.boot.
  • The periodic(8) stuff keeps /var/log/dmesg.{to,yester}day
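
To pull just the storage controller lines out of such a saved boot log, a small shell helper along these lines could be used. The function name is made up for illustration, and the two driver names are simply the ones discussed in this thread.

```sh
# Sketch: print the lines a given boot log has for the mfi/mrsas drivers.
# Usage: controller_lines [file]   (defaults to /var/run/dmesg.boot)
controller_lines() {
    # Match lines that start with the driver name plus unit number, e.g. "mrsas0:"
    grep -E '^(mfi|mrsas)[0-9]+:' "${1:-/var/run/dmesg.boot}"
}
```

Calling `controller_lines` with no argument reads /var/run/dmesg.boot; passing /var/log/dmesg.yesterday instead looks at the older copy. Note that these files only exist once the system has booted far enough to write them.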
Now I'll read on where I left off in your 1st post, hopefully I have more nonsense to comment on. I like that, it's my beloved hobby...
 
but surely it shouldn't flat out abort mission and leave me staring at BIOS load screen?
That screenshot looks like the kernel debugger, kdb, not BIOS. It was invoked due to a kernel panic. kdb then caused the reboot. You could try the following sysctls:
Code:
debug.debugger_on_panic: 0
kern.panic_reboot_wait_time: -1

There should be a crash dump, according to the Handbook, Chapter 10. Kernel Debugging:
After rebooting, your system should save a dump in /var/crash along with a matching summary from crashinfo(8).
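
For an installed system, a minimal sketch of the rc.conf settings that crash dumps depend on is shown below. It assumes a swap device large enough to hold the dump; an installer booted from a USB stick normally has nowhere to save one, so this would not have caught the panic described in the first post.

```sh
# /etc/rc.conf (sketch)
dumpdev="AUTO"        # pick the dump device automatically from the configured swap
dumpdir="/var/crash"  # where savecore(8) stores the dump on the next boot
```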
It's unfortunate that this happened.
 
Yep, before you start hooking up remote kernel debuggers (which are definitely cool, don't get me wrong), have it stop when it encounters a panic(9), not reboot it instantly. That alone will give you more time to actually look at the dump on screen. As you already found out you can tell quite a lot from it (you recognized mfi(4) and prior experience already provided you with a possible solution). If that doesn't provide enough clues you can take a look at the crash dump. But it requires quite a bit of kernel knowledge to make sense of it. That crash dump can certainly be useful in the hands of a seasoned kernel developer.
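
A minimal sketch of making the machine stop at the panic instead of rebooting, assuming both sysctls are accepted as loader tunables on your release. In /boot/loader.conf:

```sh
# /boot/loader.conf (sketch; assumes these are loader tunables on your release)
debug.debugger_on_panic="0"       # do not enter the debugger on panic
kern.panic_reboot_wait_time="-1"  # wait for a key press instead of rebooting
```

When booting installer media, where editing loader.conf is awkward, the same two variables can be entered at the loader prompt with `set` before typing `boot`.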


drop into shell and write mrsas_load="YES" to /boot/loader.conf
Don't need to load it, mrsas(4) is built in with the GENERIC kernel.
 
Oh, really? mergemaster(8) offers to merge it, doesn't freebsd-update(8) have a similar mechanism?
Hmm. Good question. By default, yes: fgrep device.hints /usr/src/usr.sbin/freebsd-update/freebsd-update.conf
Code:
MergeChanges /etc/ /boot/device.hints
More or less, it "belongs" to the kernel; i.e. it is installed with the kernel when you build your own kernel version. IIRC it gets overwritten by make installkernel. That's non-trivial to verify, because the Makefiles are deeply nested ad ultimo.
 
Hmm. Good question. By default, yes: fgrep device.hints /usr/src/usr.sbin/freebsd-update/freebsd-update.conf
But that's the answer, freebsd-update(8) should not just overwrite it (full snippet):
Code:
# When upgrading to a new FreeBSD release, files which match MergeChanges
# will have any local changes merged into the version from the new release.
MergeChanges /etc/ /boot/device.hints

More or less, it "belongs" to the kernel; i.e. it is installed with the kernel when you build your own kernel version. IIRC it gets overwritten by make installkernel. That's non-trivial to verify, because the Makefiles are deeply nested until ultimo.
No, it was never touched here by either installworld or installkernel targets. I need a change to it on my server to have the console on COM2, as this is strangely the one on my board that's actually wired – so I would notice immediately.

I assume it would be written by the distribution target.
 
Also, in this specific case, the use of a systems management console (iDrac) might give you access to: system management logs, console output logs, virtual serial console, virtual media (usb, cd-rom, ??) and so on. Time to learn how to use what you have access to.
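
One way to use the iDRAC for capturing boot output is Serial over LAN. The sketch below assumes that IPMI over LAN is enabled on the iDRAC, that ipmitool is installed on the machine you connect from, and that the OS console is already directed to the serial port the iDRAC redirects. The function name, host, user and log file are placeholders, not anything taken from this thread.

```sh
# Sketch: log a Serial over LAN session from the management controller to a file.
# Usage: capture_sol <bmc-host> <user> [logfile]
# No password is passed on the command line, so ipmitool prompts for it.
capture_sol() {
    host=$1
    user=$2
    log=${3:-console.log}
    # Everything the console prints is shown on screen and appended to the log.
    ipmitool -I lanplus -H "$host" -U "$user" sol activate | tee -a "$log"
}
```

With something like this running before the server is powered on, the messages that flash past end up in the log file and can be read at leisure afterwards.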
 