Reboot hangs with "root mount waiting for: CAM" if drives are in standby_z state

I've put together a new FreeBSD (14.3-RELEASE-p2) machine to use as (among other things) a backup server--it's an i5-3570K with two mirrored SSDs attached to the on-board SATA, and 12 drives connected to a pair of LSI adapters (a 9211-8i and a 9210-8i, both using the mps driver). The drives are in a ZFS pool that is (or will be) used exclusively as a backup destination, and will be imported prior to backup and exported afterwards (so it's used no more than once a night--I haven't settled on the backup frequency yet).
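The intended nightly cycle is basically this (a rough sketch--the pool name "backup" and the actual backup step are placeholders):
Code:
#!/bin/sh
# Nightly backup sketch: pool name and the backup step itself are placeholders.
zpool import backup || exit 1
# ... run the actual backup into the pool here (zfs receive, rsync, etc.) ...
zpool export backup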

From cold boot the system works great. I've tested with the 12 drives connected and disconnected; I get about 6 repetitions of "root mount waiting for: CAM" when they're disconnected, and about 10-15 when they're connected. The drives are not the fastest to initialize, so this seems reasonable.

The drives have EPC power management, and use the idle_a (0 sec), idle_b (120 sec), idle_c (600 sec), and standby_z (900 sec) states. I do not want to turn off power management on drives that are used no more than once a day (for no more than an hour or two)--the savings are significant, especially if the drives reach the standby_z state.
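For reference, I'm inspecting and setting those timers with camcontrol's epc subcommand (syntax per camcontrol(8); da0 is just an example device):
Code:
# Show the drive's supported EPC power conditions and timers
camcontrol epc da0 -c list
# Example: enable the Standby_z timer, set it to 900 seconds, and save it
camcontrol epc da0 -c timer -p Standby_z -T 900 -e -s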

Unfortunately that standby_z state is an issue. As long as the drives are in one of the idle states (a/b/c), rebooting the server is fine. It spends more time in the "root mount waiting for: CAM" loop the deeper the idle state, but it eventually brings the drives up and continues the boot. However, if the drives have reached standby_z, it hangs indefinitely at that point, and the only solution is to hard power down the server with the power button. I can't actually tell whether the system is hardlocked or just caught in an infinite loop it never recovers from (the lights on the keyboard still work, but I wasn't able to scroll back to see any other messages).

However, while FreeBSD is up and running, the drives come back from standby_z just fine. It takes some time for each drive to do so, but if I (for example) import the zpool while the drives are in standby_z, it takes about 38 seconds for the drives to return to ready and for the pool to import (note that because it is a RAIDZ2, it'll import as soon as 10 of the drives become available, which isn't ideal but seems to work). While this is happening, however, CAM is spamming syslog with errors (command timeout/not ready), starting about 8 seconds after the import command is issued.

I've noticed that the drives report standby_z as their power state when queried ("Current power state: Standby_z(0x00)"), but if they were previously in standby_z and are coming back to idle_a, they instead report "Current power state: PM0:Active or PM1:Idle(0xff)". Once they finish coming back, though, they correctly report "Current power state: Idle_a(0x81)".
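For reference, that's the output of querying the drive's power state with something along the lines of:
Code:
camcontrol epc da0 -c status -P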

Anybody have any ideas of what I can do to fix any of this? The machine serves other functions that need to be available 24/7, so powering it off when the backup isn't running is not an option. Thus far the only solution I've come up with is to use camcontrol to permanently disable the standby_z power state, but there have to be other drives out there with that power state, so it seems like there ought to be some other solution...
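For completeness, the workaround I mean would be roughly this, per drive (command form per camcontrol(8)):
Code:
# Disable the Standby_z power condition and save the setting, so the
# drive never drops below Idle_c on its own
camcontrol epc da0 -c state -p Standby_z -d -s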
 
You might want to check to make sure the drives are running their latest firmware. Also see if the LSI card can be configured to send START/STOP UNIT at boot?
 
However, if the drives have reached standby_z, it hangs indefinitely at that point, and the only solution is to hard power down the server with the power button. I can't actually tell whether the system is hardlocked or just caught in an infinite loop it never recovers from (the lights on the keyboard still work, but I wasn't able to scroll back to see any other messages).
Where exactly does it hang? Is it the BIOS? In that case, you'll have to either fix the drive firmware, or the BIOS, or reconfigure your drives to not go to sleep. Anecdote about that below.

If it is not the BIOS, then look at the FreeBSD messages as they scroll by: what is the OS attempting to do when it "hangs indefinitely"? If it is when starting a service, then you could fix it by adding another service that starts earlier, and whose only purpose is to wake up the disks.
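If it does turn out to be a service, a minimal early rc.d script along these lines could do the waking (purely a sketch; the name, ordering, and device list are assumptions to adapt):
Code:
#!/bin/sh
# Hypothetical "wake the disks early" service -- a sketch only.
# PROVIDE: wakedisks
# REQUIRE: devd
# BEFORE: zpool zfs

. /etc/rc.subr

name="wakedisks"
rcvar="wakedisks_enable"
start_cmd="wakedisks_start"
stop_cmd=":"

wakedisks_start()
{
	# Reading one sector from each disk forces it to spin up.
	# Device names are placeholders; adjust for the real system.
	for d in /dev/da0 /dev/da1 /dev/da2 /dev/da3; do
		dd if="$d" of=/dev/null bs=512 count=1 >/dev/null 2>&1 &
	done
	wait
}

load_rc_config $name
run_rc_command "$1"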

Anecdote: About 10 years ago, I was working on systems that had several hundred disks attached. And we discovered that a combination of older firmware in our SAS HBAs with certain disk errors would make it impossible to boot. The problem is that during boot you don't get any feedback on which disk is causing the problem, so diagnosing it was painful: disconnect or pull out half the disks, reboot and see whether it comes up, half the time it does not (so it must be the other half), split the disks in half again, lather, rinse and repeat. That typically took several hours, since every boot takes several minutes. Ultimately, after years of cursing this problem, we discovered that the machine was actually not hung, but the SCSI subsystem in the BIOS had so many retries that a boot took several hours and eventually succeeded. And once the OS is up, all the tools for drilling down and finding the problem are available. Note that this was not about sleeping (we didn't care about noise or power consumption, this was a high-end HPC system), but about defective disks.
 
You might want to check to make sure the drives are running their latest firmware. Also see if the LSI card can be configured to send START/STOP UNIT at boot?

The drives are brand new. I suppose it's possible they're not running the latest firmware, but I don't see a newer one anywhere. Seagate model ST26000DM000, firmware version EN03.

I don't think the LSI cards can be configured to do that. I know they do support staggered spinup, but I think that only takes effect for cold boot, and the system works fine for cold boot. It's rebooting that's the issue.

Where exactly does it hang? Is it the BIOS? In that case, you'll have to either fix the drive firmware, or the BIOS, or reconfigure your drives to not go to sleep. Anecdote about that below.

If it is not the BIOS, then look at the FreeBSD messages as they scroll by: what is the OS attempting to do when it "hangs indefinitely"? If it is when starting a service, then you could fix it by adding another service that starts earlier, and whose only purpose is to wake up the disks.

Anecdote: About 10 years ago, I was working on systems that had several hundred disks attached. And we discovered that a combination of older firmware in our SAS HBAs with certain disk errors would make it impossible to boot. The problem is that during boot you don't get any feedback on which disk is causing the problem, so diagnosing it was painful: disconnect or pull out half the disks, reboot and see whether it comes up, half the time it does not (so it must be the other half), split the disks in half again, lather, rinse and repeat. That typically took several hours, since every boot takes several minutes. Ultimately, after years of cursing this problem, we discovered that the machine was actually not hung, but the SCSI subsystem in the BIOS had so many retries that a boot took several hours and eventually succeeded. And once the OS is up, all the tools for drilling down and finding the problem are available. Note that this was not about sleeping (we didn't care about noise or power consumption, this was a high-end HPC system), but about defective disks.

It is not the BIOS, it's well after that. It hangs while FreeBSD is booting. Here's the relevant part of the dmesg from a successful boot. When it hangs, "Root mount waiting for: CAM" just repeats indefinitely, and it never reaches the ada0 (and subsequent) lines. I'm pretty sure these messages are all from the kernel and happen well before any services start.

Code:
Trying to mount root from ufs:/dev/mirror/root []...
uhub1: 8 ports with 8 removable, self powered
uhub0: 2 ports with 2 removable, self powered
uhub3: 2 ports with 2 removable, self powered
Root mount waiting for: CAM usbus1 usbus3
ugen1.2: <vendor 0x8087 product 0x0024> at usbus1
uhub4 on uhub3
uhub4: <vendor 0x8087 product 0x0024, class 9/0, rev 2.00/0.00, addr 2> on usbus1
ugen3.2: <vendor 0x8087 product 0x0024> at usbus3
uhub5 on uhub0
uhub5: <vendor 0x8087 product 0x0024, class 9/0, rev 2.00/0.00, addr 2> on usbus3
uhub4: 6 ports with 6 removable, self powered
uhub5: 8 ports with 8 removable, self powered
ugen3.3: <vendor 0x13ba Barcode Reader> at usbus3
ukbd0 on uhub5
ukbd0: <vendor 0x13ba Barcode Reader, class 0/0, rev 1.10/0.01, addr 3> on usbus3
kbd2 at ukbd0
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
ada0 at ahcich2 bus 0 scbus4 target 0 lun 0
ada0: <Samsung SSD 870 EVO 500GB SVT02B6Q> ACS-4 ATA SATA 3.x device

I did find a tunable for the mps driver, but I'm not exactly sure what it does:
hw.mps.spinup_wait_time=NNNN
where NNNN represents the number of seconds to wait for SATA devices to spin up when the device fails the initial SATA Identify command.

I dug through the source of the mps driver and it seems to only be referenced in a call from mpssas_add_device to mpssas_get_sas_address_for_sata_disk(). I'm not sure when this actually gets called, whether it's before or after the waiting message, but it's worth trying something like 30 to see if it helps.
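In other words, something like this in /boot/loader.conf (the value is just a guess to experiment with):
Code:
hw.mps.spinup_wait_time="30"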
 
Firmware for Seagate drives: For amateurs, the easiest solution is to install Seagate's management software under Windows, and use it to check. There is also a tool on Seagate's web site where you can enter your model and serial number, and they'll tell you what firmware is available. Professionals and big companies (under NDA with Seagate) have their own software, and get copies of firmware.

Hang in "waiting for CAM" means that the OS is waiting to actually talk to a disk drive. My hunch would be that a low-level FreeBSD driver (could be the interaction between the OS and the MPS driver) doesn't know how to deal with drives that are in the sleep state. Have you tried the same procedure in

The model ST26000DM000 seems to be a HAMR drive, by default formatted as CMR. You said they are new. Do you mean "reconditioned"? They could be surplus from medium-size customers (not your local small bank or insurance company, but also not from the FAANG). Have you checked the lifetime statistics with SMART? Sorry if this sounds insulting, but the computer parts market is full of scammers.
 
Hearing this, we would try setting hw.mps.enable_ssu=3 in /boot/loader.conf; this ought to cause the driver to send START STOP UNIT to the hard drives during the "shutdown" part of rebooting, which should hopefully knock some sense into the drives during the subsequent boot.
 
Firmware for Seagate drives: For amateurs, the easiest solution is to install Seagate's management software under Windows, and use it to check. There is also a tool on Seagate's web site where you can enter your model and serial number, and they'll tell you what firmware is available. Professionals and big companies (under NDA with Seagate) have their own software, and get copies of firmware.

Hang in "waiting for CAM" means that the OS is waiting to actually talk to a disk drive. My hunch would be that a low-level FreeBSD driver (could be the interaction between the OS and the MPS driver) doesn't know how to deal with drives that are in the sleep state. Have you tried the same procedure in

The model ST26000DM000 seems to be a HAMR drive, by default formatted as CMR. You said they are new. Do you mean "reconditioned"? They could be surplus from medium-size customers (not your local small bank or insurance company, but also not from the FAANG). Have you checked the lifetime statistics with SMART? Sorry if this sounds insulting, but the computer parts market is full of scammers.

They are new, they are not reconditioned. They are Barracuda HAMR drives I shucked myself from Seagate Expansion enclosures in sealed boxes, purchased direct from Seagate.

Hearing this, we would try setting hw.mps.enable_ssu=3 in /boot/loader.conf; this ought to cause the driver to send START STOP UNIT to the hard drives during the "shutdown" part of rebooting, which should hopefully knock some sense into the drives during the subsequent boot.
I can give that a shot, but I've been doing a lot more experimenting. The cards have a setting for "how many drives do you want me to try and spin up at once", and it can be set as high as 15. (There are a lot of other timeouts in there I tweaked as well.) After making those tweaks, the 4 drives attached to the 9211 actually kick out of standby simultaneously when it hits that waiting-for-CAM message. If I prevent the other 8 drives from going into standby, FreeBSD will actually boot even with those 4 in standby.

However no tweaks to the 9210 seem to make any difference, it doesn't even attempt to spin them up during the boot sequence (they are quite audible when coming out of standby if you are nearby). At this point I am suspecting that there is actually something unique about it that is causing issues. I think I have another 9211 in a different machine (that doesn't need any sort of standby support), and I'm going to see if swapping them makes any difference (the 9210 should be fine in it).

Edit: I am incorrect. The card in the other machine is also a 9210-8i. I'm going to order another 9211-8i and see if that solves the problem entirely. Without the 9210-8i in the system, everything seems to work perfectly fine (though of course I only have 4 drives connected).
 
Oh! Yeah, do you also have the latest firmware on those controllers? Our mpr device had some weird behavior until we pulled it up to 16.00.14.00.
 
Same firmware, same BIOS (though BIOS shouldn't matter because I have the adapter BIOS disabled). All the same settings in the adapter configuration utility as well. The 9211 works (kicks the drives out of standby during boot) and the 9210 just doesn't.

The drives are in 4x3 enclosures with activity lights, and when coming out of standby_z the light goes out, then comes back on when the drive is back in idle_a. I've watched it when I kick the drives in the OS (change camcontrol epc settings) and when I go into the adapter settings and ask it to show the topology (where it does it one at a time on both cards, you can watch one light turn off and then back on).

If all the drives are in standby during a warm boot, when it hits the "root mount waiting on: CAM" section you can now watch all 4 lights of the drives connected to the 9211 go out simultaneously and then come back on (it was one at a time before I messed around with the adapter settings). The ones connected to the 9210 don't even flicker. The funny thing is the only reason I even have the 9211 is because I pulled it out of my file server to replace it with a 9207 to address some performance issues there (it was a marginal improvement, but that's all I expected).

I'll try hw.mps.enable_ssu=3 if for some reason replacing the 9210 with the 9211 doesn't solve the issue. I'll update when I get the new card installed.
 
Root mount waiting for: CAM
Root mount waiting for: CAM
ada0 at ahcich2 bus 0 scbus4 target 0 lun 0
ada0: <Samsung SSD 870 EVO 500GB SVT02B6Q> ACS-4 ATA SATA 3.x device
This log sequence kind of strongly hints that ada0 is causing the waiting.
 
It does look that way, but it's not. I can simply remove the HDDs from the hot swap bays while it is hung and the boot will suddenly continue. Plus the hang never happens unless at least some of the drives are in standby_z prior to reboot. I ran this machine for weeks before I finally had everything to put the HDDs in and it never hung. It also doesn't hang if the 9210-8i is removed from the system, or has no drives plugged into it.

For some reason once it gets past the wait, ada0 and ada1 are always the first drives detected. I'm guessing it enumerates them in bus order and the motherboard bus is always first. Here's the successful boot in its entirety, from the waiting for CAM messages to the swap being mounted:
Code:
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
ada0 at ahcich2 bus 0 scbus4 target 0 lun 0
ada0: <Samsung SSD 870 EVO 500GB SVT02B6Q> ACS-4 ATA SATA 3.x device
ada0: Serial Number S6PWNS0T505167E
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 512bytes)
ada0: Command Queueing enabled
ada0: 476940MB (976773168 512 byte sectors)
ada0: quirks=0x3<4K,NCQ_TRIM_BROKEN>
ada1 at ahcich3 bus 0 scbus5 target 0 lun 0
ada1: <Samsung SSD 870 EVO 500GB SVT02B6Q> ACS-4 ATA SATA 3.x device
ada1: Serial Number S6PWNS0T505419W
ada1: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 512bytes)
ada1: Command Queueing enabled
ada1: 476940MB (976773168 512 byte sectors)
ada1: quirks=0x3<4K,NCQ_TRIM_BROKEN>
ses0 at ahciem0 bus 0 scbus10 target 0 lun 0
ses0: <AHCI SGPIO Enclosure 2.00 0001> SEMB S-E-S 2.00 device
ses0: SEMB SES Device
ses0: ada0,pass12 in 'Slot 00', SATA Slot: scbus4 target 0
ses0: ada1,pass13 in 'Slot 01', SATA Slot: scbus5 target 0
da0 at mps0 bus 0 scbus0 target 1 lun 0
da0: <ATA ST26000DM000-3Y8 EN03> Fixed Direct Access SPC-4 SCSI device
da0: Serial Number ZXA0LPYR
da0: 600.000MB/s transfers
da0: Command Queueing enabled
da0: 24796160MB (50782535680 512 byte sectors)
da1 at mps0 bus 0 scbus0 target 3 lun 0
da1: <ATA ST26000DM000-3Y8 EN03> Fixed Direct Access SPC-4 SCSI device
da1: Serial Number ZXA0QF71
da1: 600.000MB/s transfers
da1: Command Queueing enabled
da1: 24796160MB (50782535680 512 byte sectors)
da8 at mps1 bus 0 scbus1 target 3 lun 0
da8: <ATA ST26000DM000-3Y8 EN03> Fixed Direct Access SPC-4 SCSI device
da8: Serial Number ZXA0P8FB
da8: 600.000MB/s transfers
da8: Command Queueing enabled
da8: 24796160MB (50782535680 512 byte sectors)
da9 at mps1 bus 0 scbus1 target 5 lun 0
da9: <ATA ST26000DM000-3Y8 EN03> Fixed Direct Access SPC-4 SCSI device
da9: Serial Number ZXA1378E
da9: 600.000MB/s transfers
da9: Command Queueing enabled
da9: 24796160MB (50782535680 512 byte sectors)
da10 at mps1 bus 0 scbus1 target 7 lun 0
da10: <ATA ST26000DM000-3Y8 EN03> Fixed Direct Access SPC-4 SCSI device
da10: Serial Number ZXA1278J
da10: 600.000MB/s transfers
da10: Command Queueing enabled
da10: 24796160MB (50782535680 512 byte sectors)
da2 at mps0 bus 0 scbus0 target 4 lun 0
da2: <ATA ST26000DM000-3Y8 EN03> Fixed Direct Access SPC-4 SCSI device
da2: Serial Number ZXA09YK4
da2: 600.000MB/s transfers
da2: Command Queueing enabled
da2: 24796160MB (50782535680 512 byte sectors)
da11 at mps1 bus 0 scbus1 target 11 lun 0
da11: <ATA ST26000DM000-3Y8 EN03> Fixed Direct Access SPC-4 SCSI device
da11: Serial Number ZXA0SN0L
da11: 600.000MB/s transfers
da11: Command Queueing enabled
da11: 24796160MB (50782535680 512 byte sectors)
da3 at mps0 bus 0 scbus0 target 5 lun 0
da3: <ATA ST26000DM000-3Y8 EN03> Fixed Direct Access SPC-4 SCSI device
da3: Serial Number ZXA0M23V
da3: 600.000MB/s transfers
da3: Command Queueing enabled
da3: 24796160MB (50782535680 512 byte sectors)
da4 at mps0 bus 0 scbus0 target 6 lun 0
da4: <ATA ST26000DM000-3Y8 EN03> Fixed Direct Access SPC-4 SCSI device
da4: Serial Number ZXA0YR6X
da4: 600.000MB/s transfers
da4: Command Queueing enabled
da4: 24796160MB (50782535680 512 byte sectors)
da5 at mps0 bus 0 scbus0 target 7 lun 0
da5: <ATA ST26000DM000-3Y8 EN03> Fixed Direct Access SPC-4 SCSI device
da5: Serial Number ZXA0PY51
da5: 600.000MB/s transfers
da5: Command Queueing enabled
da5: 24796160MB (50782535680 512 byte sectors)
da6 at mps0 bus 0 scbus0 target 94 lun 0
da6: <ATA ST26000DM000-3Y8 EN03> Fixed Direct Access SPC-4 SCSI device
da6: Serial Number ZXA0RVWB
da6: 600.000MB/s transfers
da6: Command Queueing enabled
da6: 24796160MB (50782535680 512 byte sectors)
da7 at mps0 bus 0 scbus0 target 95 lun 0
da7: <ATA ST26000DM000-3Y8 EN03> Fixed Direct Access SPC-4 SCSI device
da7: Serial Number ZXA0RZC5
da7: 600.000MB/s transfers
da7: Command Queueing enabled
da7: 24796160MB (50782535680 512 byte sectors)
GEOM_MIRROR: Device mirror/swap launched (2/2).
GEOM_MIRROR: Device mirror/root launched (2/2).
 
Zaragon my only guess at the moment is that it is something between the controller (its firmware) and the disks.
Maybe the controller sees the sleeping disks but does not know how to properly / quickly wake them up.
So, the controller reports their presence to the driver, but when the driver probes some information about the disks it takes a long time.
Disks being SATA may also play a role.
 
If you don't use anything on the USB bus to boot from you can disable it with

I could, but it only takes a second or so for it to enumerate the USB devices and get to the point where it's waiting on only CAM. Probably not really related, especially since it boots fine otherwise if the problem drives are not in standby_z (either in one of the other idle modes, or not present).

Zaragon my only guess at the moment is that it is something between the controller (its firmware) and the disks.
Maybe the controller sees the sleeping disks but does not know how to properly / quickly wake them up.
So, the controller reports their presence to the driver, but when the driver probes some information about the disks it takes a long time.
Disks being SATA may also play a role.

That's my thought as well: something unique to the 9210, which is an IT-only card (supposedly it cannot be flashed with IR firmware), versus the 9211, which is an IR card (even though I have it flashed with IT firmware). There's almost certainly additional hardware/logic on the physical card itself to support the IR functionality even if the IR firmware isn't present. Is any of it active when IT firmware is flashed? I have no way of knowing. There's definitely a variable at play here, because drives hooked to one card work and the same drives hooked to the other don't.

If I can find another drive in my pile of currently unused hardware that supports EPC/standby_z, I can try to reproduce the problem with it to see if it makes any difference. It would help to know whether this is 100% an mps hardware/firmware/driver issue as opposed to being drive-model dependent. My WDC WD160EDGZ drives definitely support EPC, since I had problems with them in my file server, but I don't know if they support standby_z.
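Checking whether a candidate drive even supports EPC should just be a matter of pointing camcontrol's epc subcommand at it and seeing whether it returns anything sensible, e.g.:
Code:
# If the drive lacks EPC this should simply error out; if it works, the
# listing shows whether Standby_z is among the supported power conditions
camcontrol epc da0 -c list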

Can you also share all dev.mps sysctl values?

Here they are. Based on which drives are attached to each, mps1 is the 9211-8i (with 4 drives attached), which has no issues, while mps0 is the 9210-8i (with 8 drives attached), which has the issue.

Code:
dev.mps.1.use_phy_num: 1
dev.mps.1.dump_reqs_alltypes: 0
dev.mps.1.encl_table_dump:
dev.mps.1.mapping_table_dump:
dev.mps.1.spinup_wait_time: 60
dev.mps.1.chain_alloc_fail: 0
dev.mps.1.enable_ssu: 1
dev.mps.1.max_io_pages: -1
dev.mps.1.max_chains: 16384
dev.mps.1.chain_free_lowwater: 16369
dev.mps.1.chain_free: 16384
dev.mps.1.io_cmds_highwater: 21
dev.mps.1.io_cmds_active: 0
dev.mps.1.msg_version: 2.0
dev.mps.1.driver_version: 21.02.00.00-fbsd
dev.mps.1.firmware_version: 20.00.07.00
dev.mps.1.max_evtframes: 32
dev.mps.1.max_replyframes: 2048
dev.mps.1.max_prireqframes: 128
dev.mps.1.max_reqframes: 2048
dev.mps.1.msix_msgs: 1
dev.mps.1.max_msix: 16
dev.mps.1.disable_msi: 0
dev.mps.1.disable_msix: 0
dev.mps.1.debug_level: 0x3,info,fault
dev.mps.1.%iommu: rid=0x200
dev.mps.1.%parent: pci2
dev.mps.1.%pnpinfo: vendor=0x1000 device=0x0072 subvendor=0x1734 subdevice=0x1177 class=0x010400
dev.mps.1.%location: slot=0 function=0 dbsf=pci0:2:0:0
dev.mps.1.%driver: mps
dev.mps.1.%desc: Avago Technologies (LSI) SAS2008
dev.mps.0.use_phy_num: 1
dev.mps.0.dump_reqs_alltypes: 0
dev.mps.0.encl_table_dump:
dev.mps.0.mapping_table_dump:
dev.mps.0.spinup_wait_time: 60
dev.mps.0.chain_alloc_fail: 0
dev.mps.0.enable_ssu: 1
dev.mps.0.max_io_pages: -1
dev.mps.0.max_chains: 16384
dev.mps.0.chain_free_lowwater: 16347
dev.mps.0.chain_free: 16384
dev.mps.0.io_cmds_highwater: 44
dev.mps.0.io_cmds_active: 0
dev.mps.0.msg_version: 2.0
dev.mps.0.driver_version: 21.02.00.00-fbsd
dev.mps.0.firmware_version: 20.00.07.00
dev.mps.0.max_evtframes: 32
dev.mps.0.max_replyframes: 2048
dev.mps.0.max_prireqframes: 128
dev.mps.0.max_reqframes: 2048
dev.mps.0.msix_msgs: 1
dev.mps.0.max_msix: 16
dev.mps.0.disable_msi: 0
dev.mps.0.disable_msix: 0
dev.mps.0.debug_level: 0x3,info,fault
dev.mps.0.%iommu: rid=0x100
dev.mps.0.%parent: pci1
dev.mps.0.%pnpinfo: vendor=0x1000 device=0x0072 subvendor=0x1000 subdevice=0x3040 class=0x010700
dev.mps.0.%location: slot=0 function=0 dbsf=pci0:1:0:0
dev.mps.0.%driver: mps
dev.mps.0.%desc: Avago Technologies (LSI) SAS2008
dev.mps.%parent:

The only value I'm setting myself is hw.mps.spinup_wait_time=60 in /boot/loader.conf, but it made no difference, so I'm not sure I need it.

Here's everything in the dmesg relative to mps0 & mps1. No obvious differences that I can see.

Code:
regal:/root# dmesg | grep mps0
mps0: <Avago Technologies (LSI) SAS2008> port 0xe000-0xe0ff mem 0xf7ec0000-0xf7ec3fff,0xf7e80000-0xf7ebffff irq 16 at device 0.0 on pci1
mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
da0 at mps0 bus 0 scbus0 target 1 lun 0
da1 at mps0 bus 0 scbus0 target 3 lun 0
da2 at mps0 bus 0 scbus0 target 4 lun 0
da3 at mps0 bus 0 scbus0 target 5 lun 0
da4 at mps0 bus 0 scbus0 target 6 lun 0
da5 at mps0 bus 0 scbus0 target 7 lun 0
da6 at mps0 bus 0 scbus0 target 94 lun 0
da7 at mps0 bus 0 scbus0 target 95 lun 0
regal:/root# dmesg | grep mps1
mps1: <Avago Technologies (LSI) SAS2008> port 0xd000-0xd0ff mem 0xf7dc0000-0xf7dc3fff,0xf7d80000-0xf7dbffff irq 17 at device 0.0 on pci2
mps1: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
mps1: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
da8 at mps1 bus 0 scbus1 target 3 lun 0
da9 at mps1 bus 0 scbus1 target 5 lun 0
da11 at mps1 bus 0 scbus1 target 11 lun 0
da10 at mps1 bus 0 scbus1 target 7 lun 0
 
Zaragon I don't have any suggestions except to collect more diagnostic info.
You can set, for example, dev.mps.0.debug_level=0x7f in loader.conf.
You might need to increase the msgbuf size to something large (e.g. kern.msgbufsize=10m) so that messages produced before syslogd starts (and can save them to log files) are not lost.
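In other words, roughly this in /boot/loader.conf (covering both controllers, since either could be involved):
Code:
dev.mps.0.debug_level="0x7f"
dev.mps.1.debug_level="0x7f"
kern.msgbufsize="10m"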
 
I'll give those two settings a shot and let you know when I have more data. This machine is now my firewall / NAT gateway, so I can't take it down during business hours (I work remotely). My new (to me) 9211-8i should also be here today, so I'm hoping to confirm whether this is a 9210-specific issue (or if the 9211 I have is just somehow special).
 
Okay, I've gotten the new card in. The debug output wasn't particularly helpful; it spewed a lot of messages before getting to the hang point, but the only messages around that time are:
Code:
mps1: mpssas_startup_decrement releasing simq
mps1: mpssas_startup_decrement refcount 0
mps0: mpssas_startup_decrement releasing simq
mps0: mpssas_startup_decrement refcount 0
However, with the new card in, I've been able to observe some new behavior. Every scenario below involves powering on the PC, booting FreeBSD, and using camcontrol to monitor the power state of the drives until all of them are in standby_z. Once they are, I reboot.
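The monitoring step is just a quick loop along these lines (a rough sketch; the device list matches my setup and will obviously vary):
Code:
#!/bin/sh
# Wait until every HDD reports Standby_z, then it's safe to test the reboot.
drives="da0 da1 da2 da3 da4 da5 da6 da7 da8 da9 da10 da11"
while :; do
	all_standby=1
	for d in $drives; do
		camcontrol epc "$d" -c status -P | grep -q Standby_z || all_standby=0
	done
	[ "$all_standby" -eq 1 ] && break
	sleep 60
done
echo "All drives are in Standby_z"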

With both cards in place and connected, the behavior is exactly what it was before.

If I remove the adapter that only has 4 drives before starting the entire process (notably, this is the adapter that is able to spin its 4 drives up when both cards are in place), then the remaining adapter will actually spin up all 8 of its drives (4 at a time, probably one SAS port first and then the other), the drives will be recognized, and the system will boot, every time. Same thing happens if I remove the 8-drive card and boot with just the 4 drive card.

With both cards in place, the 4-drive card will spin up its drives successfully, but it will never recognize them. I don't get any messages about da0-da3 whatsoever, just repeated messages about waiting for CAM.

Here's where it gets fun. If I hit the reset button after it spins up those 4 drives? When it comes back around to the same place, the 8-drive card will spin up its drives, and then the system will boot with all 12 drives recognized.

So there is some sort of race condition / lock / resource contention that is only happening when two cards are present, and both cards need to spin up drives. It feels like a software issue at this point since both physical cards are capable of spinning up the drives in isolation (even while both are present in the box) as long as it doesn't need to spin up a drive on both cards.

Does anyone else have any suggestions or is it just time to hit bugzilla? I've included the dmesg from a successful boot with the mps debugs on, but I don't know how helpful that will be. I can't capture the unsuccessful boot because it never gets anywhere I can read it except on the console, and I don't know what I'm looking for.
 

The useful stuff is in those messages.
Yeah, they're included in the dmesg above. There's nothing different that I can see between when it hangs and when it doesn't--but I am comparing by eye, because when it hangs it never completes the boot.

If you have a specific message you're looking for that might be different, let me know, I can take a picture or two.
 