I'll start with a brief outline of my adventure installing FreeBSD on a Dell R730 server that led to this thread. Here's how it went today:
- got a new (refurb, really) PowerEdge R730 server,
- tweaked some obvious settings in BIOS,
- updated every firmware in the system to Dell's latest and greatest via Dell's Lifecycle Controller,
- that included its PERC H730 RAID controller (really an LSI SAS3108 chip),
- all updates went smoothly,
- plugged in a USB stick with the BSD image,
- connected to the remote console via iDRAC - essentially KVM over the wire IIUC, so everything is done via an HTML5 console as if I had a monitor and keyboard plugged into that server,
- all goes well until it doesn't - stuff flashes past - good luck reading that - until something bad enough happens that simply sends the system into an immediate reboot without any sort of warning or grace, and you're staring at the BIOS startup screen etc.
Q1:
First thought: WTF just happened, and how do I even begin investigating this? I can't possibly read that fast. So here's question number 1: in this scenario, is there a way to recover dmesg or some kind of printout? Is it being written to that USB flash drive, or anywhere at all? Is there a way to write it somewhere I can review later?
I thought of nothing better to do than to try and screen-capture something while it flashed past.
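One way to take luck out of this, as a sketch and not a verified recipe for this particular box: keep the kernel from rebooting right after a panic, and mirror the console to a serial port, which iDRAC can redirect so that a second machine can log it. Typed at the loader prompt before booting the installer, it could look like the lines below. The names are standard loader variables and kernel tunables. The 115200 speed and the loader's default serial port are assumptions (comconsole_port may need setting, depending on how serial redirection is configured in BIOS), and a UEFI boot would want efi in place of vidconsole. The wait-time tunable only helps if the reset comes from a kernel panic and not from a hard reset.

```
set boot_verbose="YES"
set kern.panic_reboot_wait_time="-1"
set boot_multicons="YES"
set console="comconsole,vidconsole"
set comconsole_speed="115200"
boot
```

With a wait time of -1 a panicking kernel waits for a key press instead of rebooting, so the last screen stays readable.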
I got lucky. Twice. Here's that screencap:
First time lucky simply because I managed to capture that. Second, because I spotted mfi mentioned there, and I knew that was the RAID controller driver from my past adventures with an R720 and its PERC H710 controller. Which leads us to ...
Q2:
Being lucky like this sucks. I'll take it this time, but it may not happen tomorrow. How would you go about (a) retracing your steps and reproducing what happened (that's essentially my Q1 above), and (b) finding a solution when you have no clue what any of that stuff on the screen means?
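For (b), one thing that can help: FreeBSD device messages start with the driver name plus a unit number (mfi0, em0, ...), and a driver usually has a section 4 man page under that name. Assuming the console output has been saved to a text file, a small sh sketch like this lists the driver names seen in it, which gives something concrete to look up. The function name driver_names and the file name boot.log are made up for illustration.

```sh
#!/bin/sh
# List the distinct driver names that start lines in a saved console log.
# A line such as "mfi0: ..." yields "mfi".
# Hypothetical helper: driver names that themselves contain digits are skipped.
driver_names() {
    sed -n 's/^\([a-z][a-z_]*\)[0-9][0-9]*: .*/\1/p' "$1" | sort -u
}
```

Usage would be along the lines of `driver_names boot.log`, followed by `man 4 mfi` for whichever name looks suspicious.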
Well, I knew roughly what mfi was about, so naturally that RAID controller was our prime suspect. Some man reading and googling later I arrive at mrsas(4). Turns out my chip and that card should in fact be using that driver, but for whatever reason mfi(4) takes priority unless you override that. So, at boot I drop into the loader prompt:
Code:
$ set mrsas_load="YES"
$ set hw.mfi.mrsas_enable="1"
$ boot
Bingo, we hit the jackpot and we boot - all is well with the world.
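Typing these in at every boot does not scale, so the same two settings need to end up in the boot configuration of the installed system. A minimal sh sketch of doing that without adding duplicate lines on repeated runs follows. ensure_line and persist_mrsas are hypothetical helper names, and the optional directory argument exists only so the sketch can be tried against a scratch directory instead of /boot.

```sh
#!/bin/sh
# Append a line to a file unless that exact line is already present.
ensure_line() {
    file=$1
    line=$2
    touch "$file"
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Persist the two settings used at the loader prompt.
# $1 is an optional boot directory (an assumption for testing); default is /boot.
persist_mrsas() {
    boot_dir=${1:-/boot}
    ensure_line "$boot_dir/loader.conf" 'mrsas_load="YES"'
    ensure_line "$boot_dir/device.hints" 'hw.mfi.mrsas_enable="1"'
}
```

Run as root, `persist_mrsas` with no argument targets /boot. After the reboot, `dmesg | grep mrsas` is one way to check which driver attached to the controller.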
We install the system, drop into a shell and write mrsas_load="YES" to /boot/loader.conf and hw.mfi.mrsas_enable="1" to /boot/device.hints, reboot and confirm everything works.
Q3:
That behavior of ... just wiping state and rebooting seems incredibly abnormal. IMO it is only ever allowed to happen when hardware actually fails. So, this driver somehow faulted ... but surely it shouldn't flat out abort mission and leave me staring at the BIOS load screen? Is this normal, expected behavior, or are we looking at a genuine bug?
Dear oldhats. What do you do when stuff like this happens to you?
Thank you