Solved How to make hotplugging work?

PMc

Daemon

Reaction score: 691
Messages: 1,388

I am looking for a trick, program, or port that makes hotplugging work with SATA disks. I can't find one anywhere; it seems to be mentioned nowhere.

With SCSI things were dead simple: just hot-plug it, and it works. It doesn't matter if it is supported or not. Just make the bus idle, plug the connector (rather more than less cleanly), and run "camcontrol rescan" afterwards.
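The SCSI rescan step above could be wrapped roughly like this. This is only a sketch: `camcontrol rescan` takes `all` or a bus number, while the helper names `run` and `rescan_cam` and the `DRY_RUN` variable are invented for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper: rescan one CAM bus, or all buses when no argument is given.
rescan_cam() {
    run camcontrol rescan "${1:-all}"
}
```

With `DRY_RUN=1` set, `rescan_cam 3` only prints the command it would run, which is a safe way to check the invocation before touching a live bus.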

With SATA things seem similarly simple: hotplug just does not work.

The first thing we were told was that it cannot work in IDE/ATA mode; we would need to switch to AHCI, where it would be supported. Nowadays everything uses AHCI, hotplug is supported, and it still does not work.

At one time there was a command, atacontrol, that very much looked like it would make things work. At least it had the necessary command-line options. (I never dared to try it out.)

But in Rel.10 atacontrol was removed, and we were told that we should use camcontrol instead. So hotplugging was no longer supported for (S)ATA, only hot*un*plugging was now possible: one can unplug a device from the running system, plug in a new device that will nicely connect and spin up, and will then become visible to the OS after the next reboot.

Occasionally people complain about that, like here: https://www.truenas.com/community/threads/hotswap-not-working-anymore.75960/ - but then there is only shrugging, and nobody knows a solution.

The problem is, one would need something that does the same thing that camcontrol rescan does for SCSI devices: make the adapter/bus find and recognize new devices. This is what atacontrol apparently did. camcontrol cannot do it for SATA devices, because it needs a bus to scan, and with IDE/ATA/SATA the bus is contained in the device.

I observed this missing hotplug capability on versions up to 12.3 and on several consumer boards, so I thought the operation was just not supported on those. But now I find that a Haswell Xeon server board behaves exactly the same: SATA hotplug is not possible; after a device is unplugged, no other device is recognized on that mainboard connector (plugging it into a different connector might work if there is an unused one).
At that point one would need to rerun the detection that happens on initial boot, but there seems to be no known tool that does that.
 

mark_j

Daemon

Reaction score: 795
Messages: 1,371

Hotplugging should "just" work. The only thing I can think of off-hand: is there an option in your BIOS/UEFI to enable hot-plugging?

If the SATA drive is a modern one (not Molex-style power), then it should "just work", as the logic is in the drive hardware.
 

Phishfry

Beastie's Twin

Reaction score: 2,838
Messages: 5,840

I have not seen such behavior. Unmount filesystem and camcontrol eject ${disk}. Boom

Same with hotswap plugin. Camcontrol just works. No need to scan anything. Just like direct attach devices.

Are you sure you have hotswap enabled in the BIOS for SATA?
 
PMc

mark_j said: Hotplugging should "just" work. The only thing I can think of off-hand is in your BIOS/UEFI is there an option for hot-plugging to be enabled?
No. There was such an option on the consumer boards, but I don't remember how it was set.
On the Xeon server board there is no such option, only enable/disable for each connector as a whole, and a switch to choose LSI RAID versus Intel RAID. (I want neither.)

mark_j said: If the SATA drive is a modern (not molex-style power), then it should "just work" as the logic is in the drive hardware.
It never worked.

I just gave it a try for details, as there are some spare ports still available on the board.

Controller involved:
ahci0: <Intel Wellsburg AHCI SATA controller>
ahci1: <Intel Wellsburg AHCI SATA controller>
Disks involved:
ada0: <WDC WD5000AAKS-00A7B2 01.03B01> ATA8-ACS SATA 2.x device (mfd ~2011)
ada2: <ST3000DM008-2DM166 CC26> ACS-2 ATA SATA 3.x device (mfd ~2017)

Operation:
  1. Unplug power from ada2:
    nothing at all happens, disconnect does not get detected (until software crash is achieved or some midlevel subsystem like ZFS gets i/o failures)
  2. Unplug the SATA data cable from ada2 as well
    nothing at all happens
  3. replug both wires
    nothing at all happens
  4. During operation 1, the power wire that also connects to ada0 gets slightly bent. Consequently, ada0 spins down and up again (SATA power connectors are broken by design). This is instantly detected and ada0 is reported as unavailable.
  5. fully unplug and replug ada0
    nothing at all happens. The disk is not detected anymore.
  6. plug in a replacement disk
    nothing at all happens. The replacement disk is not detected
  7. plug any of the disks into a spare mainboard connector
    the disk gets immediately detected.
  8. unplug it again
    the unplug gets immediately detected.
I don't believe that this can be considered operational. It rather looks like a typical dry-weather-umbrella that fails to work when actually needed.

Actually, I am not even interested in hotplugging. My concern is with the ST3000DM008 Seagate drive. Seagate's reading of the spec is that when a drive is supposed to require 11.75-12.25 volts, it will immediately disconnect at 11.749 volts.
Now, given the fantastic quality of the usual power connectors, a loss of 0.25 volts across a connector is easily achieved; usually half a year of operation will do.

This wouldn't be a problem, since ZFS would nicely attach the drive back into the array when told to do so, IF there were a means to get the drive back after re-seating the connections.
 

mark_j

Does a devctl rescan help the PCI bus see it?

Edit: I must add: this is not a custom kernel where you forgot the PCI_HP option, is it? Just asking in case.
 
PMc

I figured out a bit more:
  • on the consumer board the hotplugging might indeed be switched off in the BIOS.
  • on the server board things are weird: there are 10 connectors on 2 controllers:
    ahci0 -> ahcich0..3
    ahci1 -> ahcich4..9

  • ahcich6..9 will detect any drive immediately when plugged in, and mark it unavailable immediately when unplugged
  • ahcich0..5 will not detect anything except some kind of supply power instability on the device. They will mark the device unavailable in that case, or when it is failed by the application (i/o error), and will never accept a new device afterwards.
  • This does not line up with the two controllers. There is no mention whatsoever in any documentation. There is no option in the BIOS that would individually configure the connectors, except enable/disable (and that just switches them off).
  • It is unclear whether this is a general "feature" of the Wellsburg/C612 chipset, or the integrator has smoked some really bad grass when making this up.
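To see which disk sits on which channel, the output of `camcontrol devlist -v` can be condensed into disk/channel pairs. The following is a minimal sketch; the function name `map_channels` is made up, and it assumes the usual output layout of a `scbusN on ahcichN bus 0:` header line followed by `<model> at scbusN ... (pass0,ada0)` device lines.

```shell
#!/bin/sh
# Hypothetical helper: read "camcontrol devlist -v" output on stdin and
# print one "<disk> <channel>" pair per attached peripheral.
map_channels() {
    awk '
        # Header line such as "scbus0 on ahcich0 bus 0:" names the channel.
        /^scbus[0-9]+ on / { chan = $3; next }
        # Device line: the last field lists the peripherals, e.g. "(pass0,ada0)".
        /^</ {
            names = $NF
            gsub(/[()]/, "", names)
            n = split(names, p, ",")
            for (i = 1; i <= n; i++)
                if (p[i] != "" && p[i] !~ /^pass[0-9]+$/)
                    print p[i], chan
        }
    '
}
```

Typical use would be `camcontrol devlist -v | map_channels`, which makes it easier to check whether a disk is on one of the channels that accept hotplug (ahcich6..9 on this board) or on one that does not.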
mark_j said: Does a devctl rescan help the PCI bus see it?
Ahh, now we are getting somewhere! This appears to be the Swiss Army knife I was looking for!

devctl rescan ahcich2 -> Device not configured
devctl rescan ahci0 -> Device not configured
devctl rescan pci1 -> no message
devctl attach ahcich2 -> Device busy
devctl enable ahcich2 -> Device busy
devctl detach ahcich2 -> Device detached
devctl attach ahcich2 -> Karamba!! normal detection message and everything works again. :)
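The detach/attach sequence that worked above can be sketched as a small shell function. The names `run` and `reattach_channel` and the `DRY_RUN` variable are invented for this example; only `devctl detach` and `devctl attach` come from the thread. It is meant for a channel whose disk is already gone or marked unavailable, as in the test above.

```shell
#!/bin/sh
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper: detach an AHCI channel and attach it again so that
# the channel probes for a device as it does at boot.
reattach_channel() {
    chan=$1
    case "$chan" in
        ahcich[0-9]*) ;;
        *) echo "usage: reattach_channel ahcichN" >&2; return 2 ;;
    esac
    # Attach only if the detach succeeded.
    run devctl detach "$chan" && run devctl attach "$chan"
}
```

Running it once with `DRY_RUN=1` (for example `DRY_RUN=1; reattach_channel ahcich2`) shows the two commands without executing them; run as root without `DRY_RUN` to actually cycle the channel.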

mark_j said: Edit: I must add. This is not a custom kernel and you forgot PCI_HP option? Just asking in case.
Indeed, I am very capable of producing such kind of flaws. But in this case, I didn't forget it - I switched it off deliberately! I had assumed this concerns PCI hotplug, which is a different thing: ripping out PCI cards from the running system and replacing them. This is possible with certain machine hw designs, and with PCI-e it should be more widely possible. But I am not interested in doing that.

And, according to the current state of affairs, PCI_HP is indeed not needed for option 1 (use only the four upper ports for the respective devices) or option 2 (the Swiss Army knife) above.
 

mark_j

It seems, as per the datasheet (p. 453), that the vendor must be hard-coding this functionality (see bit 28 of CAP - Host Capabilities Register (D31:F2)).
Why they think the owner of the box shouldn't be able to select such things defies logic.
So, all is solved now?
 
PMc

mark_j said: It seems, as per the datasheet (p. 453), that the vendor must be hard-coding this functionality (see bit 28 of CAP - Host Capabilities Register (D31:F2)).

Bit 28? There is talk about a mechanical switch. That may exist in some hw design, but usually not on separately marketed mainboards.

If one starts reading this, there is more to be found. Look for instance at bit 5 in the same table (CAP.SXS) and then at bit 21 of the individual PxCMD (13.4.2.7, page 466). This is about external devices, and it sounds like only external devices would need hotplug capability. And that could make sense, because the usual number of SATA devices that are considered to fit into a case is 6. They might have thought that the other 4 could be used for external connectors, and only these would need hotplug. (Normal people don't open their case and re-seat the connectors when a failure occurs. They buy a new drive instead.) This would also explain why these 4 are the uppermost device numbers.
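For anyone who wants to check these bits against a raw register value, here is a small decoding sketch. The bit positions for SMPS (CAP bit 28), SXS (CAP bit 5) and ESP (PxCMD bit 21) are the ones discussed above; HPCP (bit 18) and MPSP (bit 19) are added from the AHCI specification and are not mentioned in the thread. The function names are made up, and the register values have to be obtained separately.

```shell
#!/bin/sh
# Hypothetical helper: print bit number $2 of value $1 (hex values like 0x20 are fine).
bit() {
    echo $(( ($1 >> $2) & 1 ))
}

# CAP: bit 28 = SMPS (mechanical presence switch), bit 5 = SXS (external SATA).
decode_cap() {
    echo "SMPS=$(bit "$1" 28) SXS=$(bit "$1" 5)"
}

# PxCMD: bit 18 = HPCP (hot plug capable port), bit 19 = MPSP, bit 21 = ESP.
decode_pxcmd() {
    echo "HPCP=$(bit "$1" 18) MPSP=$(bit "$1" 19) ESP=$(bit "$1" 21)"
}
```

For example, `decode_pxcmd 0x00240000` reports a port that is flagged both hot-plug capable and external; comparing the per-port values for ahcich0..5 against ahcich6..9 would show whether the firmware really marks only the upper four ports this way.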

But then, if this is the right place, wouldn't it be (at least theoretically) possible to make this tunable from device.hints?

mark_j said: So, all is solved now?

*laugh* all solved?

I've had 4 issues with 12.3-RC, and only two of them got fixes in the RELEASE.
There are 3 further issues now: one was already noted by others and has a working fix/patch; the others can be worked around somehow.
Then this one, which might or might not yield a patch.
And there is still a nice stack of things to evaluate/verify/analyze, in the form of a physical notepad (real paper) on my desk.
(This is not counting the ipfw issues.)

And I had mostly worried whether the board would fit into a standard tower case mechanically and thermally, which then turned out to be the least of the problems: 10 disks are in, another 4 are coming, and cooling works fine although the really fast fans were not available when I ordered.
 