FreeBSD does not find two out of eight chassis slots

OS: FreeBSD-12

HW: Supermicro SuperServer 5027R-WRF with eight bay chassis.

Code:
    grep -i ses /var/run/dmesg.boot
    ses0 at ahciem0 bus 0 scbus7 target 0 lun 0
    ses0: <AHCI SGPIO Enclosure 1.00 0001> SEMB S-E-S 2.00 device
    ses0: SEMB SES Device

This is an offline spare that developed problems while in service and was replaced. The problem was that two of the bays began to flash a red LED whenever a drive was inserted. The remaining six bays did not. I have checked the wiring and everything looks OK.

When I run `sesutil map` I only see six bays:

Code:
    # sesutil map
    ses0:
        Enclosure Name: AHCI SGPIO Enclosure 
        Enclosure ID:                0
        Element 0, Type: Array Device Slot
            Status: Unsupported (0x00 0x00 0x00 0x00)
        Element 1, Type: Array Device Slot
            Status: Unknown (0x06 0x00 0x00 0x00)
            Description: SLOT 000
        Element 2, Type: Array Device Slot
            Status: Unknown (0x06 0x00 0x00 0x00)
            Description: SLOT 001
        Element 3, Type: Array Device Slot
            Status: Unknown (0x06 0x00 0x00 0x00)
            Description: SLOT 002
        Element 4, Type: Array Device Slot
            Status: Unknown (0x06 0x00 0x00 0x00)
            Description: SLOT 003
        Element 5, Type: Array Device Slot
            Status: Unknown (0x06 0x00 0x00 0x00)
            Description: SLOT 004
        Element 6, Type: Array Device Slot
            Status: Unknown (0x06 0x00 0x00 0x00)
            Description: SLOT 005
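
To confirm the count without eyeballing the listing, the described slots can be counted mechanically. This is only a minimal sketch, assuming the `sesutil map` output format shown above; the function name `count_described_slots` is made up for illustration.

```shell
#!/bin/sh
# count_described_slots (hypothetical helper): reads `sesutil map` output on
# stdin and prints how many elements carry a "Description: SLOT nnn" line.
count_described_slots() {
    awk '/Description: SLOT/ { n++ } END { print n + 0 }'
}

# Usage on the live system (sesutil normally needs root):
#   sesutil map | count_described_slots
```

On an eight-bay chassis one would hope for 8 here; the listing above has only SLOT 000 through SLOT 005.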

The system manual states that a blinking red LED on a drive caddy indicates that the drive is 'rebuilding'. However, the built-in RAID is not in use and the four occupied slots are configured as a ZFS pool. That pool is clean.

One oddity I noticed is that the blinking LED only starts after the FreeBSD boot process starts. Looking into the log files, it appears that this issue first arose following the upgrade from FreeBSD-10.3 to 11.1. This is likely a coincidence, as I did not remark on the changed behaviour until some time later, but originally all these bays were filled with 3TB drives and there was no problem before the upgrade.

The question I have is: what is the likely cause of this problem?
 
Disk enclosures have controllers in them; that's the device that you talk to from the computer via SES. They are typically microcontrollers that are built into SAS expanders (note: SAS != SES, that's not a spelling error). It seems that in your case the controller has a mind of its own, and has decided to (a) turn red LEDs on in blinking mode, and (b) hide two slots from the computer (SCSI host) when the host sends a map command.

I have no idea what kind of enclosure controller you are using; there are many varieties. My guess is that this controller is designed to work with a specific RAID system, which is not present in your setup. Most likely FreeBSD during initialization is sending some commands to the controller (at least a SCSI reset command), which causes confusion.

Debugging this in detail is likely a lot of work, and may require detailed knowledge of the SES protocol (which fortunately I'm beginning to forget).
 
But do the drives show up in camcontrol devlist -v?
Yes, camcontrol sees the drive in the suspect slot. The suspect slots had the drives removed when the system was taken out of service. The top four slots were never configured into the zpool and only held drives to be used as warm spares.

I placed a spare drive into one of the slots that has the blinking LED problem and ran camcontrol. It found the drive <ATA WDC WD1002FAEX-0 1D05>:

Code:
#  camcontrol devlist -v
scbus0 on isci0 bus 0:
<ATA WDC WD1002FAEX-0 1D05>        at scbus0 target 3 lun 0 (pass0,da0)
<>                                 at scbus0 target -1 lun ffffffff ()
scbus1 on ahcich0 bus 0:
<WDC WD30EFRX-68AX9N0 80.00A80>    at scbus1 target 0 lun 0 (pass1,ada0)
<>                                 at scbus1 target -1 lun ffffffff ()
scbus2 on ahcich1 bus 0:
<WDC WD30EFRX-68AX9N0 80.00A80>    at scbus2 target 0 lun 0 (pass2,ada1)
<>                                 at scbus2 target -1 lun ffffffff ()
scbus3 on ahcich2 bus 0:
<WDC WD30EFRX-68AX9N0 80.00A80>    at scbus3 target 0 lun 0 (pass3,ada2)
<>                                 at scbus3 target -1 lun ffffffff ()
scbus4 on ahcich3 bus 0:
<WDC WD30EFRX-68AX9N0 80.00A80>    at scbus4 target 0 lun 0 (pass4,ada3)
<>                                 at scbus4 target -1 lun ffffffff ()
scbus5 on ahcich4 bus 0:
<>                                 at scbus5 target -1 lun ffffffff ()
scbus6 on ahcich5 bus 0:
<>                                 at scbus6 target -1 lun ffffffff ()
scbus7 on ahciem0 bus 0:
<AHCI SGPIO Enclosure 1.00 0001>   at scbus7 target 0 lun 0 (ses0,pass5)
<>                                 at scbus7 target -1 lun ffffffff ()
scbus-1 on xpt0 bus 0:
<>                                 at scbus-1 target -1 lun ffffffff (xpt0)

When I switch the drive to one of the spare slots that does not have the blinking LED issue I see this:

Code:
#  camcontrol devlist -v
scbus0 on isci0 bus 0:
<ATA WDC WD1002FAEX-0 1D05>        at scbus0 target 1 lun 0 (da0,pass0)
<>                                 at scbus0 target -1 lun ffffffff ()
scbus1 on ahcich0 bus 0:
<WDC WD30EFRX-68AX9N0 80.00A80>    at scbus1 target 0 lun 0 (pass1,ada0)
<>                                 at scbus1 target -1 lun ffffffff ()
scbus2 on ahcich1 bus 0:
<WDC WD30EFRX-68AX9N0 80.00A80>    at scbus2 target 0 lun 0 (pass2,ada1)
<>                                 at scbus2 target -1 lun ffffffff ()
scbus3 on ahcich2 bus 0:
<WDC WD30EFRX-68AX9N0 80.00A80>    at scbus3 target 0 lun 0 (pass3,ada2)
<>                                 at scbus3 target -1 lun ffffffff ()
scbus4 on ahcich3 bus 0:
<WDC WD30EFRX-68AX9N0 80.00A80>    at scbus4 target 0 lun 0 (pass4,ada3)
<>                                 at scbus4 target -1 lun ffffffff ()
scbus5 on ahcich4 bus 0:
<>                                 at scbus5 target -1 lun ffffffff ()
scbus6 on ahcich5 bus 0:
<>                                 at scbus6 target -1 lun ffffffff ()
scbus7 on ahciem0 bus 0:
<AHCI SGPIO Enclosure 1.00 0001>   at scbus7 target 0 lun 0 (ses0,pass5)
<>                                 at scbus7 target -1 lun ffffffff ()
scbus-1 on xpt0 bus 0:
<>                                 at scbus-1 target -1 lun ffffffff (xpt0)

The difference is slight:

Code:
scbus0 on isci0 bus 0:
<ATA WDC WD1002FAEX-0 1D05>        at scbus0 target 3 lun 0 (pass0,da0)
<>                                 at scbus0 target -1 lun ffffffff ()

vs.

Code:
scbus0 on isci0 bus 0:
<ATA WDC WD1002FAEX-0 1D05>        at scbus0 target 1 lun 0 (da0,pass0)
<>                                 at scbus0 target -1 lun ffffffff ()

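Rather than comparing two full listings by eye, the lines for one controller can be pulled out of each capture and compared directly. The following is a rough sketch, assuming the `camcontrol devlist -v` layout shown above; `devs_on_controller` is a hypothetical helper name, not part of any FreeBSD tool.

```shell
#!/bin/sh
# devs_on_controller (hypothetical helper): reads `camcontrol devlist -v`
# output on stdin and prints "<target> <peripherals>" for every real device
# on the bus whose controller name is given as $1 (for example isci0).
devs_on_controller() {
    awk -v ctl="$1" '
        # Bus header, e.g. "scbus0 on isci0 bus 0:"
        /^scbus/ { on = ($3 == ctl); next }
        # Device line, e.g. "<model> at scbus0 target 3 lun 0 (pass0,da0)";
        # the "<>" wildcard entries with target -1 are skipped.
        on && /^<.+>/ && !/target -1/ {
            for (i = 1; i <= NF; i++)
                if ($i == "target") print $(i + 1), $NF
        }
    '
}

# Usage on the live system:
#   camcontrol devlist -v | devs_on_controller isci0
```

Running it once with the drive in each bay and comparing the printed target numbers shows which SCU target each bay corresponds to.
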
I had problems with this system when I upgraded FreeBSD (see Thread 70131). Reviewing the log files more carefully, I found that the problem with the LEDs was first noticed shortly after this upgrade.

Since this is a spare machine anyway, and as one of its disk drives has now started to report read errors, I am going to reinstall FreeBSD from scratch and see if the issue was specific to FreeBSD-12.
 
FreeBSD-13.1 exhibits the same behaviour. So, I guess the choices are to leave those two bays empty or put up with the flashing LEDs.
 
Still haven't isolated which drives are on which controller? Look at page 5-12 in the manual and determine which ones are connected to the chipset (I-SATA0 through I-SATA5) and which ones are on SCU (S-SATA1 through S-SATA4)
 
There are 8 cables coming from the enclosure. Four go to the connectors labeled S-SATA1 through S-SATA4 and four go to the connectors labeled I-SATA0 through I-SATA3. This arrangement makes no sense to me but that is how the system arrived from the factory in 2017.
 
According to the manual:
Code:
S-SATA1 ~ S-SATA4  =  SCU-based SATA 3.0 ports (6Gb/s)
I-SATA0 ~ I-SATA5  =  Intel-based SATA ports 
                     (I-SATA0 and I-SATA1 = SATA 3.0, 
                      I-SATA2~I-SATA5 = SATA 2.0)
 
The two slots that produce the blinking LEDs are connected to S-SATA3 and S-SATA4.
Welp, there's our first clue.

A five-minute search of the internet says this is an Intel "Patsburg SCU", that it shows up as an isciX device (where X is some number; see isci(4)), and that "PCH RAID" should be turned off in the BIOS.

I'd check that BIOS setting, search the logs for isci, and twiddle some settings from the man page.
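
To keep the connector labels straight while moving cables around, the mapping discussed in this thread can be written down as a small lookup. This is just a sketch built from the manual excerpt and the camcontrol output quoted earlier; `controller_for_port` is a made-up name, and the label ranges are taken from this thread, not verified against other boards.

```shell
#!/bin/sh
# controller_for_port (hypothetical helper): maps a motherboard connector
# label to the controller and FreeBSD driver it hangs off, per this thread.
controller_for_port() {
    case "$1" in
        S-SATA[1-4]) echo "SCU, driver isci(4)" ;;
        I-SATA[0-5]) echo "chipset AHCI, driver ahci(4)" ;;
        *)           echo "unknown" ;;
    esac
}

# Usage:
#   controller_for_port S-SATA3
```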
 
I checked for and turned off the SCU in BIOS. There was no mention of RAID that I could see.

Code:
# grep -i isci  /var/log/*
/var/log/dmesg.today:isci0: <Intel(R) C600 Series Chipset SAS Controller (SATA mode)> port 0xe000-0xe0ff mem 0xfa47c000-0xfa47ffff,0xfa000000-0xfa3fffff irq 16 at device 0.0 on pci6
/var/log/dmesg.today:da0 at isci0 bus 0 scbus0 target 3 lun 0

/var/log/messages:Jul 22 09:37:54 vhost03 kernel: da0 at isci0 bus 0 scbus0 target 3 lun 0
/var/log/messages:Jul 22 09:37:54 vhost03 kernel: (da0:isci0:0:3:0): Periph destroyed
/var/log/messages:Jul 22 09:38:30 vhost03 kernel: da0 at isci0 bus 0 scbus0 target 1 lun 0
/var/log/messages:Jul 22 09:54:35 vhost03 kernel: da0 at isci0 bus 0 scbus0 target 1 lun 0
/var/log/messages:Jul 22 09:54:35 vhost03 kernel: (da0:isci0:0:1:0): Periph destroyed
/var/log/messages:Jul 22 09:55:08 vhost03 kernel: da0 at isci0 bus 0 scbus0 target 2 lun 0
/var/log/messages:Jul 22 09:58:01 vhost03 kernel: da0 at isci0 bus 0 scbus0 target 2 lun 0
/var/log/messages:Jul 22 09:58:01 vhost03 kernel: (da0:isci0:0:2:0): Periph destroyed
/var/log/messages:Jul 22 09:58:31 vhost03 kernel: da0 at isci0 bus 0 scbus0 target 0 lun 0
The only isci entries have to do with the insertion and removal of a USB flash drive.
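
To see at a glance which isci targets those kernel messages touch, the log lines can be tallied per target. A minimal sketch, assuming the message format shown above; `isci_targets` is a hypothetical helper name. Note that the same "at isci0 bus" line is printed on both attach and detach, so the count covers both kinds of event.

```shell
#!/bin/sh
# isci_targets (hypothetical helper): reads syslog lines on stdin and prints
# "<target> <count>" for every "... at isciN bus ... target T ..." message,
# sorted by target number.
isci_targets() {
    awk '/ at isci[0-9]+ bus / {
        for (i = 1; i <= NF; i++)
            if ($i == "target") seen[$(i + 1)]++
    }
    END { for (t in seen) print t, seen[t] }' | sort -n
}

# Usage on the live system:
#   isci_targets < /var/log/messages
```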


The blinking LED remains.
 
You don't want to turn off the SCU itself, otherwise its ports won't work. Just turn off the RAID feature.

Looks like maybe Control-M lets you into the RAID BIOS if it is enabled? The manual points to this file:

"Onboard SATA RAID Oprom" should be turned off under "South Bridge" settings. This would presumably turn off Control-M, if that's enabled.

Other notes in the manual say to make sure the SATA mode for each port is set to "AHCI Mode" and not "RAID Mode".

Code:
(da0:isci0:0:1:0): Periph destroyed

Do a `grep "Jul 22 09:3" /var/log/*` to get more information about why that occurred. If changing the BIOS settings doesn't help, you may have to turn on the logging parameters described in the man page.
 
RAID is/was off. SATA mode is/was off. SCU is enabled. There is nothing under South-Bridge BIOS settings that refers to RAID. The SCU detects the drive in the cage.

Code:
grep "Jul 22 09:3" /var/log/*
/var/log/auth.log:Jul 22 09:30:33 vhost03 sshd[49259]: user root login class  [preauth]
/var/log/auth.log:Jul 22 09:30:33 vhost03 syslogd: last message repeated 2 times
/var/log/auth.log:Jul 22 09:30:33 vhost03 sshd[49259]: Accepted publickey for root from 216.185.71.41 port 31095 ssh2: RSA SHA256:cJBXJBwve7zD8D1AM24vWsFYwrhz68ntuYbEiaxLp94
/var/log/messages:Jul 22 09:37:54 vhost03 kernel: da0 at isci0 bus 0 scbus0 target 3 lun 0
/var/log/messages:Jul 22 09:37:54 vhost03 kernel: da0: <ATA WDC WD1002FAEX-0 1D05>  s/n WD-WCATR9946843 detached
/var/log/messages:Jul 22 09:37:54 vhost03 kernel: (da0:isci0:0:3:0): Periph destroyed
/var/log/messages:Jul 22 09:38:30 vhost03 kernel: da0 at isci0 bus 0 scbus0 target 1 lun 0
/var/log/messages:Jul 22 09:38:30 vhost03 kernel: da0: <ATA WDC WD1002FAEX-0 1D05> Fixed Direct Access SPC-3 SCSI device
/var/log/messages:Jul 22 09:38:30 vhost03 kernel: da0: Serial Number WD-WCATR9946843
/var/log/messages:Jul 22 09:38:30 vhost03 kernel: da0: 300.000MB/s transfers
/var/log/messages:Jul 22 09:38:30 vhost03 kernel: da0: Command Queueing enabled
/var/log/messages:Jul 22 09:38:30 vhost03 kernel: da0: 953869MB (1953525168 512 byte sectors)
 
You're not giving the exact responses that are in the manual, so I'm never really sure if the right things are getting set or the manual is wrong. Either way, it seems like the SCU controller is the problem and it's detaching the disks for whatever reason. I would assume it is because something is trying to RAID them or the SCU is just physically broken. I'd mash Control-M a bunch of times during bootup to see if there's a RAID BIOS enabled. Maybe also completely reset the BIOS settings and then re-set all your preferences, making sure the RAID options are off (maybe with the disks out during the reset and verification).

Another option is that the backplane or a cable is broken. Take a SATA data cable that runs from the SCU to a bay that blinks red, and plug it into a backplane bay that has otherwise worked fine on the Intel chipset. Maybe completely swap cables between a working bay and a blinking bay.
 