LSI HBA SAS9305-24i / no disk spinup at boot

NOTE: This thread is a continuation of a thread posted in the FreeNAS forums. As the issue is related to FreeBSD 10.x and not FreeNAS 9.10, I decided to ask you guys for additional help. Original thread


A few weeks ago we received two new Supermicro machines with LSI SAS9305-24i HBAs and eight 10 TB disks preinstalled. We set up FreeNAS 9.10, 10.0.2 and 10.0.4 and tried to create volumes, but found that neither the GUI nor the CLI shows our HGST disks.

Running camcontrol devlist shows the 8 disks on scbus0:

Code:
<HGST HUH721010AL5200 A21D> at scbus0 target 0 lun 0 (pass0)
<HGST HUH721010AL5200 A21D> at scbus0 target 1 lun 0 (pass1)
<HGST HUH721010AL5200 A21D> at scbus0 target 2 lun 0 (pass2)
<HGST HUH721010AL5200 A21D> at scbus0 target 3 lun 0 (pass3)
<HGST HUH721010AL5200 A21D> at scbus0 target 4 lun 0 (pass4)
<HGST HUH721010AL5200 A21D> at scbus0 target 5 lun 0 (pass5)
<HGST HUH721010AL5200 A21D> at scbus0 target 6 lun 0 (pass6)
<HGST HUH721010AL5200 A21D> at scbus0 target 7 lun 0 (pass7)
<INTEL SSDSC2BB240G7 N2010101> at scbus9 target 0 lun 0 (pass8,ada0)
<INTEL SSDSC2BB240G7 N2010101> at scbus10 target 0 lun 0 (pass9,ada1)

We tried to flash the newest firmware to the controller using
sas3flash -o -c 0 -f SAS9305_24i_IT_p.bin
but this failed: the FreeNAS machine rebooted a few seconds after running the command, without actually flashing anything. After downloading the sas3flash and sas3ircu binaries from the Broadcom website, the flashing process succeeded, but it did not change the behaviour of the system.
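
In case it helps others, something along these lines should list the controllers and the firmware/BIOS versions they are running afterwards (options as documented for the Broadcom sas3flash utility; not re-tested here):

Code:
sas3flash -listall      # list all SAS3 controllers and their firmware/BIOS versions
sas3flash -c 0 -list    # detailed information for controller 0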

sas3ircu 0 display shows that the controller has been recognized and lists the 8 connected disks:

Code:
Avago Technologies SAS3 IR Configuration Utility.
Version 15.00.00.00 (2016.11.21)
Copyright (c) 2009-2016 Avago Technologies. All rights reserved.

Read configuration has been initiated for controller 0
------------------------------------------------------------------------
Controller information
------------------------------------------------------------------------
  Controller type                         : SAS3224
  BIOS version                            : 8.33.00.00
  Firmware version                        : 14.00.00.00
  Channel description                     : 1 Serial Attached SCSI
  Initiator ID                            : 0
  Maximum physical devices                : 1023
  Concurrent commands supported           : 5888
  Slot                                    : 61
  Segment                                 : 0
  Bus                                     : 3
  Device                                  : 0
  Function                                : 0
  RAID Support                            : No
------------------------------------------------------------------------
IR Volume information
------------------------------------------------------------------------
------------------------------------------------------------------------
Physical device information
------------------------------------------------------------------------
Initiator at ID #0

Device is a Hard disk
  Enclosure #                             : 1
  Slot #                                  : 0
  SAS Address                             : ******
  State                                   : Ready (RDY)
  Size (in MB)/(in sectors)               : 9537535/19532873727
  Manufacturer                            : HGST
  Model Number                            : HUH721010AL5200
  Firmware Revision                       : A21D
  Serial No                               : ******
  Unit Serial No(VPD)                     : ******
  GUID                                    : N/A
  Protocol                                : SAS
  Drive Type                              : SAS_HDD

Device is a Hard disk
  Enclosure #                             : 1
  Slot #                                  : 1
  SAS Address                             : ******
  State                                   : Available (AVL)
  Manufacturer                            : HGST
  Model Number                            : HUH721010AL5200
  Firmware Revision                       : A21D
  Serial No                               : ******
  Unit Serial No(VPD)                     : ******
  GUID                                    : N/A
  Protocol                                : SAS
  Drive Type                              : SAS_HDD

(removed output for the other disks)

After searching the FreeBSD and FreeNAS forums and doing some research on Google, we found that running
camcontrol start pass0
spins up disk 0. Repeating this for the other 7 disks spins them all up, but they still show up neither in the GUI nor in the CLI. After a reboot (not a shutdown) the disks stay up, are recognized, mounted and shown in the GUI/CLI. But when starting the machine after a shutdown, the disks no longer spin up unless we run the above commands again.
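
For reference, a quick loop over the pass devices spins up all eight disks at once (a sketch, assuming pass0 through pass7 as shown in the devlist above):

Code:
# spin up all eight SAS disks via their pass-through devices
for i in 0 1 2 3 4 5 6 7; do
    camcontrol start pass$i
done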

We had a look at the mainboard BIOS as well as the HBA BIOS but were unable to find any setting that affects this behaviour. Nothing we have tried changes the disk spinup.

When checking the dmesg log files we see the following error
Code:
(da7:mpr0:0:7:0): SERVICE ACTION IN(16). CDB: 9e 10 00 00 00 00 00 00 00 00 00 00 00 20 00 00
(da7:mpr0:0:7:0): SCSI sense: NOT READY asc:4,1c (Logical unit not ready, additional power use not yet granted)
(da7:mpr0:0:7:0):
(da7:mpr0:0:7:0): Field Replaceable Unit: 0
(da7:mpr0:0:7:0): Command Specific Info: 0
(da7:mpr0:0:7:0):
(da7:mpr0:0:7:0): Descriptor 0x80: f5 56
(da7:mpr0:0:7:0): Descriptor 0x81: 00 00 00 00 00 00
(da7:mpr0:0:7:0): fatal error, failed to attach to device

In the /boot/loader.conf file we also tried to raise the spinup wait time from 3 seconds to 30.
Code:
dev.mpr.0.spinup_wait_time="30"
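
If the mpr(4) driver picks the tunable up, the active value should be visible at runtime; an assumption I have not verified on this machine:

Code:
sysctl dev.mpr.0.spinup_wait_time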

To further debug the issue I installed CentOS 7, and without having to touch anything the disks spin up and are attached to the system. Creating partitions and formatting works just fine.
It looks like FreeNAS/FreeBSD does something differently during boot.

When installing the latest FreeBSD 11 release I get this output from camcontrol devlist:
Code:
<HGST HUH721010AL5200 A21D>        at scbus0 target 0 lun 0 (pass0)
<HGST HUH721010AL5200 A21D>        at scbus0 target 1 lun 0 (pass1)
<HGST HUH721010AL5200 A21D>        at scbus0 target 2 lun 0 (pass2)
<HGST HUH721010AL5200 A21D>        at scbus0 target 3 lun 0 (pass3)
<HGST HUH721010AL5200 A21D>        at scbus0 target 4 lun 0 (pass4)
<HGST HUH721010AL5200 A21D>        at scbus0 target 5 lun 0 (pass5)
<HGST HUH721010AL5200 A21D>        at scbus0 target 6 lun 0 (pass6)
<HGST HUH721010AL5200 A21D>        at scbus0 target 7 lun 0 (pass7)
<AHCI SGPIO Enclosure 1.00 0001>   at scbus5 target 0 lun 0 (ses0,pass8)
<INTEL SSDSC2BB240G7 N2010101>     at scbus10 target 0 lun 0 (ada0,pass9)
<INTEL SSDSC2BB240G7 N2010101>     at scbus11 target 0 lun 0 (ada1,pass10)
<AHCI SGPIO Enclosure 1.00 0001>   at scbus12 target 0 lun 0 (ses1,pass11)
This shows two additional SGPIO devices that do not show up with FreeNAS 9.10/FreeBSD 10.


Now the question: what else should/can we try to get the disks to spin up correctly at boot?

Hardware:
- Intel Server Board S2600CWTR
- Supermicro 24 bay 3.5" chassis with 2 PSU SC846BA-R920B
- 128GB ECC RAM (4x32GB)
- Intel Xeon E5-2620v4
- 8x10TB HGST SAS 12G drives, hot-swappable (data)
- 2x Intel Enterprise SSD (OS)
 
I don't know how to help you convince your LSI cards to spin up the disks. You'll need to deal with LSI documentation, or their tech support people (I know some of those people, and they are very friendly and competent).

But: I know that the Linux kernel DOES spin up SCSI disks. If it detects a block device during initialization time, and it responds to a "Test Unit Ready" command with "not ready", then Linux will send a "Start Unit" command. At least most of the time it will; this might be configurable.
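
For what it's worth, the same two commands can be issued by hand on Linux with sg3_utils to watch that behaviour in isolation (the device name is just an example):

Code:
sg_turs /dev/sg0           # TEST UNIT READY; reports "not ready" while the disk is spun down
sg_start --start /dev/sg0  # START UNIT; spins the disk up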

Here are two suggestions for working around the problem. One: Your disks MIGHT be connected via SAS expanders: I don't know whether you have enough LSI cards to dedicate a SAS port directly from the LSI HBA to each disk, or whether there are SAS expanders on your SuperMicro chassis disk backplane. If there are SAS expanders: those can also spin up the drive; there are low-level SAS primitives called something like "notify spin up" which sit below the SCSI protocol layer and automatically spin up the drive when the SAS SMP protocol (the management protocol that handles device discovery) first sees the disk. You might study the documentation for the SuperMicro chassis, or contact their tech support, and ask them whether their expanders can be configured to spin up drives.

And if all that fails: You could manually add camcontrol start to some rc.* file, and spin them up yourself. That's likely pretty late, and might cause trouble with the order of discovering and mounting file systems (where does the rc.* file come from if the disk it's on isn't spinning yet?), but perhaps you can boot from a disk that's always on.
 
Thanks for your response. As far as I can tell there are SAS expanders on the backplane, as there are 6 cables connecting the HBA (we have only one) to the backplane, yet 24 disks can be used in this system. I don't see (and don't have) any cable connecting a disk directly to the HBA.
I believe the issue is either a bug somewhere in the mpr driver or FreeBSD itself, or that it is configurable somewhere, but I am unable to find where. The only thing I know is that it can hardly be a hardware issue, as with CentOS 7 it works just fine.

Sure, I can use camcontrol start passX in an rc.* script, but I don't think that should be the real solution. Since we have two dedicated SSDs for the OS, which are always on, booting would not be an issue. But the script would need to be executed before FreeBSD tries to attach the devices and mount the ZFS volume. Since I am completely new to FreeBSD, I have no idea which rc.* script I should use or where to find any related settings for the disk spin-up.

But: I know that the Linux kernel DOES spin up SCSI disks. If it detects a block device during initialization time, and it responds to a "Test Unit Ready" command with "not ready", then Linux will send a "Start Unit" command. At least most of the time it will; this might be configurable.
Which is exactly my point: why does Linux do it but FreeBSD doesn't?
 
First the unimportant stuff: I just looked up the spec for your 9305-24i HBA. It has 24 physical ports, which are grouped into six SFF8643 internal x4 cables. So it is possible that there are no SAS expanders in this setup, and each of those 24 SAS ports goes directly to a disk drive. Also, in your camcontrol devlist output, the only SES enclosure controllers you list are not connected via SAS but via SATA (the lines "AHCI SGPIO Enclosure"); but there is no law that all SAS expanders have to be SES enclosure controllers. Therefore we can neither prove nor disprove that there are SAS expanders in your system, so far. The only way to be sure would be either to contact SuperMicro or read their documentation to find out how their disk backplane of that enclosure is built, or to disassemble the server and physically look for expander chips, or to use some SAS diagnostic tools. On Linux that can be done by inspecting things in the /sys file system (it's difficult but possible); I don't know how to do it on FreeBSD. It can also be done by communicating with the 9305 HBA using its management tools. But in reality, this is a side-show: The only reason to find SAS expanders would be to convince them to spin up the disks for us; but we have several other options to do that.
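
Looking at the camcontrol(8) man page, there do appear to be SMP subcommands that could help answer the expander question from FreeBSD; ses0 below is only a placeholder device name, and I have not tried these against your hardware:

Code:
camcontrol devlist -v       # show which scbus/controller each device hangs off
camcontrol smpmaninfo ses0  # SMP REPORT MANUFACTURER INFORMATION from an expander
camcontrol smpphylist ses0  # list the expander's phys and what is attached to each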

You asked: why does Linux spin up drives, and FreeBSD doesn't? To begin with, I'm not 100% sure that FreeBSD doesn't do it; it's possible that this is configurable somewhere in the block device or SAS stack. As a matter of fact, the man pages for the mpr(4), mpt(4) and mps(4) drivers show some discussion of spinup/spindown, so maybe those drivers can be convinced to take care of it (or maybe they interface to functionality in the HBA firmware that could).

Underlying this is a deep philosophical question: what is the purpose of an operating system, and which tasks should it do? How much should the OS modify the hardware state to make life easier for other components? In some parts of the high-end RAID/storage/file-system community, the fact that Linux "tries to be helpful" is actually considered a problem: if my file system or RAID software spins a drive down deliberately, that's because I want it to be down and stay down. If it needs to be spun up again, I have my fingers in all layers of the IO stack, and I can spin it up when I want. I don't like the fact that every time Linux reboots it goes and spins up drives behind my back. This gets particularly annoying when external disk enclosures are connected to multiple hosts: the left host is currently controlling the disks because the right host is having maintenance done to it, at some point the right host gets rebooted (for example for a kernel upgrade), and suddenly disks get spun up that the left host should be controlling.

I understand why Linux does this: that OS is designed for casual users (desktop and laptop people who neither know nor care about the details of their hardware), and is not really intended for enterprise-grade, industrial-strength use, where control of the hardware is in the hands of complex software stacks. In spite of the fact that RHEL has the word "enterprise" in it, Linux remains deep in its heart an OS written by college students for other college students.

So, what options do you have? Number one: get the full documentation (or contact tech support) for the LSI (= Avago = Broadcom) HBA, and determine whether it can be configured to spin the drives up for you. If yes, you'll have to control the card to do that; either go in through the configuration screens in the BIOS, or find the special programs that do that. Second, get the documentation for the disk drives you use (the SCSI manuals). They probably implement the standard SCSI mode pages that determine whether the drives spin up automatically on power-up or hard reset. Next, maybe try to find SAS expanders in your hardware (if there are any), and contact the vendor to see whether they can manage the spinup (unlikely, but possible). As a last resort, put the camcontrol start commands in /etc/rc.local.
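
On FreeBSD, those mode pages can be inspected (and, carefully, edited) with camcontrol, for example the Power Condition page, assuming da0 is one of the HGST drives once it has attached. Whether that particular page governs spin-up at power-on for these drives is something to confirm in the HGST manual:

Code:
camcontrol modepage da0 -m 0x1a        # display the Power Condition mode page (current values)
camcontrol modepage da0 -m 0x1a -P 2   # compare against the drive's default values
camcontrol modepage da0 -m 0x1a -e     # edit the current values in $EDITOR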

You ask about the order. Since this is not your root file system, it doesn't really matter: Configure your ZFS to not attempt auto-mounting, then put the spinup commands in rc.local, and put the ZFS mount command after it. Done.
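
A minimal /etc/rc.local along those lines might look like this; the pool name and the pass device numbers are assumptions for this particular setup:

Code:
#!/bin/sh
# spin up the SAS disks the HBA left spun down
for i in 0 1 2 3 4 5 6 7; do
    camcontrol start pass$i
done
# give CAM a chance to re-probe the now-ready targets
camcontrol rescan all
# import the pool once the disks are attached ("tank" is a placeholder name)
zpool import tank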
 
Thanks for your help, I'll give it a try.

I already tried the mpr configuration options that are listed in the man pages, without success. The controller BIOS does not have many settings either, and those it does have don't change the behaviour.
I'll try to contact the vendor/manufacturer again as well as the workaround with rc.local.
 
Meanwhile we have figured out that the issue seems to be the disk drives themselves. We exchanged two of them for older disks and those spin up just fine. The culprit might be the new SAS3 feature POWER DISABLE, which is documented by HGST here. We still have to figure out how we can disable this feature or tell FreeBSD to start the disks manually, as the rc.local script approach did not work: the disks spin up too late and are not attached, so they show up with size 0. I haven't yet figured out how I can "rescan" and "re-attach" the disks after spin-up.
 
It could be that it isn't the drive itself (you say older ones work), but some mode page settings on the drives, which might differ between drive generations. With a few hours of research you could figure out all the mode pages that control power (it's pretty complicated), and use sg3_utils to dump them from both drive generations and find the difference. A big investment of time.
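
On Linux, something along these lines would dump all mode pages from one drive of each generation so they can be diffed (device names are just examples):

Code:
sg_modes -a /dev/sg2 > new_drive_modepages.txt
sg_modes -a /dev/sg3 > old_drive_modepages.txt
diff -u old_drive_modepages.txt new_drive_modepages.txt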

Have you thought about upgrading the firmware on the drives? I have not personally experienced drive firmware influencing how drives power up, but it's possible. Before going down that path it might be a good idea to talk to tech support at HGST (who will probably have to refer you to engineering).

On the power disable feature turning into a problem (disks get stuck in reset): Supposedly that only affects SATA disks, not SAS.
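
On your question about re-attaching after a manual spin-up: CAM can be asked to re-probe a bus or a single target once the drives are up, although I have not verified that the da devices then attach with the correct size on your hardware:

Code:
camcontrol rescan all     # re-scan all busses
camcontrol rescan 0:7:0   # or just one target (bus:target:lun)
camcontrol reset 0:7:0    # heavier hammer: reset the target, then rescan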
 
Our vendor contacted HGST and they said there is no firmware update or alternative firmware available to disable the Pin 3 behaviour, so that is a dead end.
I also thought it should only affect SATA drives. But since we have SAS disks, I guess the problem is somewhere on the (SAS2) backplane, where Pin 3 is wired in a way that does not follow the SAS2 standard.
I still don't know why the issue is not present on CentOS.

For now we are sticking with a startup script that checks whether the ZFS pool is mounted and restarts if it is not.
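
The check itself is nothing fancy; roughly along these lines, with the pool name as a placeholder and assuming that "restart" means rebooting the box (which, as noted earlier, leaves the disks spinning):

Code:
#!/bin/sh
# if the pool did not come up because the disks spun up too late, reboot so they stay spinning
if ! zpool list -H -o name | grep -qx tank; then
    reboot
fi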
 
Do your SAS cables have SGPIO "sub-cables"? Did you plug them into the backplane of your server?
I had a similar problem. Removing these cables from the backplane made it work in the end.
 