I agree. When it prints "da0 at mps1 bus 0 scbus1 target 0 lun 0", it is done with disk da0. What is it working on right now? We don't know (since it doesn't finish), but most likely the next disk on the same controller. You say the power consumption goes sky high; that probably means the CPU is working like mad. So here's my educated guess: something is wrong with disk da1 (the second SSD); or, to be more exact, something is wrong with the interaction of da1, the LSI card and its firmware, and the FreeBSD driver, and that is what's eating all the CPU. Just pull that disk out, reboot, and see what happens. If that fixes the problem, switch the physical connections of the two disks; that will probably cause them to be initialized in the opposite order. This experiment will tell us whether the problem always follows this particular SSD, or whether it hits whatever disk happens to be second. You could even move the problem disk to the other server and see what happens.
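(If you do swap the cables, something like the following should confirm which physical SSD ended up as which daX afterwards; the serial numbers GEOM reports are the easiest way to tell the two drives apart. The da0/da1 names are just from the example above.)
Code:
# bus/target mapping of every CAM-attached disk
camcontrol devlist -v
# serial number ("ident") of each disk, to match against the drive label
geom disk list da0 da1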
Did you check that the two SSDs (and all disks in general) have up-to-date firmware versions?
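(For SATA SSDs behind an mps(4) HBA the installed firmware revision should show up in the ATA identify data; smartmontools from ports reports it too. A quick sketch:)
Code:
# firmware revision is part of the identify data (SATA devices)
camcontrol identify da0
# or, with sysutils/smartmontools installed
smartctl -i /dev/da0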
My next proposal is unrealistic: once you have narrowed the problem down to a particular disk, attach a SAS analyzer to the connection. My educated guess is that the hang with high power consumption happens because something in the device driver (FreeBSD) or firmware (LSI) goes into a tight loop, triggered by some unusual condition; attaching the SAS analyzer will help us find out whether we're still communicating with the disk in that loop, and if so, it will probably tell us what condition causes the upset. Alas, you probably don't own a SAS analyzer, and even if you could borrow/rent/buy one (they cost tens of thousands), learning how to use it takes a week or two.
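A poor man's alternative, short of a real analyzer: if I remember correctly, mps(4) has a debug_level knob that can be set as a loader tunable, so the driver logs to the console what it is doing while it probes. The exact bit values are documented in the man page; the mask below is only an illustration, so double-check mps(4) before using it.
Code:
# at the loader prompt, before booting the kernel that hangs
set hw.mps.1.debug_level=0x1f
boot
# or persistently, in /boot/loader.conf
hw.mps.1.debug_level="0x1f"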
In general, I agree that LSI controllers are the best quality out there. But they are not perfect. Sometimes their drivers or firmware have bugs. Some firmware versions have such nasty bugs that in my previous jobs we had a list of firmware versions that were prohibited from being used at customer sites (from vague memory it was something about FW 19.x versus 20.x, but don't rely on that). The good news is that LSI/Broadcom (the disk controller division was merged/sold, first to Avago, then to Broadcom) is very good about fixing problems once they hear about them. Again, it's unrealistic for you as an individual user to reach into their engineering department, but a phone call to their support line might be useful (both to hear whether they have advice, and to let them know that a problem has been seen).
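For what it's worth, the firmware version the mps(4) driver found is printed when the controller attaches, so it can be read back from the boot messages; there should also be version information under the dev.mps sysctl tree.
Code:
# firmware and driver versions logged at attach time
grep -i firmware /var/run/dmesg.boot
# version information exposed by the driver (adjust the unit number)
sysctl dev.mps.1 | grep -i version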
Thanks for this really useful reply, ralphbsz. In line with your recommendations, I have performed the following tests.
First I should say that while the problem does occur during a reboot (I'd say on average one in four reboots), it seems to happen on every boot right after powering on.
Both disks are on the same firmware version, and there is a newer version available. Intel doesn't say much about it, but thomas-krenn.com says it fixes:
Code:
These firmware versions contain the following enhancements:
• Optimized drive shutdown sequence for better handling during poor system shutdown
• Improved power on behavior when resuming from an unsafe shutdown.
• Improvements to PS3 resume behavior
• Improvements to PHY initialization process
• Improvements to PERST# and CLKREQ# detection for corner case issues
• Improved end of life management of bad blocks for better reliability
These firmware versions contain fixes for the following issues:
• Fixed potential issue of incorrect data may be read during resume from low power state
Several of these fixes sound like something that could be related to the issue I am facing.
I have also observed that once the system hangs it stays hung: even if I yank the disk it seemed to be complaining about, it remains stuck in that state.
I just tried powering on the system with only da0/SSD0 in, and it still hangs with the same error message.
I then shut the power, plugged in da1/SSD1 and disconnected da0/SSD0, and booted up; on the first boot the system came up fine. I then rebooted nine times, and on the tenth boot it hung with the same error as before.
Then I proceeded to swap the physical locations of the disks, first moving SSD0 to the location of SSD1 (with only SSD0 plugged in), and this time I got through 10 reboots and a few cold boots without the hang. While I didn't get the error this time, the fact that it only appeared on the tenth boot in the previous test makes this result somewhat tentative.
Then I shut down and plugged SSD1 back in, so that both SSDs were installed, but in the opposite locations from the original. The first boot from cold went fine (at this point I needed to run
zpool scrub
a few times to get back in business), but after one or two reboots it hung again with the same error, except this time "da0" was SSD1.
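(For reference, zpool status is the quick way to confirm a scrub finished without finding errors.)
Code:
zpool status -v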
So from this we can conclude: both SSDs can trigger the hang, both on their own and when installed together, in either location.
Another thing that just struck me: the other server (where the hang does not occur) has six hard drives in addition to the two SSDs. On both servers the SSDs are connected to the onboard controller, which is second on the PCI-E bus (the first discrete card comes first on the PCI-E bus), as can be seen in the listing I posted earlier. I have two hard drives on each controller card, so the first two hard drives come in as da0 and da1, and the two SSDs end up as da2 and da3. This means that on the server where I don't get this problem the SSDs do not come in as da0 and da1.

I should mention that I installed and configured FreeBSD on both servers "simultaneously", which is how I noticed that only one of them hangs. During the installation I had the hard drives unplugged, but I installed them in the first server relatively soon after finishing the installation.
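For anyone wanting to double-check the ordering described above: pciconf lists the HBA instances in PCI enumeration order, which shows which controller (and therefore which set of disks) gets probed first.
Code:
# mps HBA instances and their PCI addresses, in enumeration order
pciconf -l | grep ^mps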
Because of this, I took two old 500 GB SATA hard drives, put them on the first discrete controller, and put the SSDs back in their original locations. With the first discrete controller card coming first on the PCI-E bus and the onboard controller second, the two SSDs now come in as da2 (SSD0) and da3 (SSD1). The first boot from power-on went smoothly, and afterwards I managed about 15 reboots before I had to get some sleep. I just did two more power-ups and it seems to boot up reliably now.
So it would appear that the combination of the LSI firmware, these SSDs, and the FreeBSD mps driver somehow does not like the SSDs being first in line to be probed. Until I install real drives in this second server, these old 500 GB drives seem to work around the issue. I might well update the SSD firmware, just to get the fixes Intel mentioned, but at least the server seems to work now.
What do you make of all of this?