Solved mpr driver -- "Dual Domain" disk array, both machines cannot see all disks

girgen@

Developer
Hi,

We have two machines connected to the same disk enclosures. This is called a "dual domain" setup, and the purpose is naturally to achieve high resilience: if one machine dies, the other takes over. At the moment, storage1 is active and storage2 is "passive". storage1 is running FreeBSD 12.0 and storage2 is running FreeBSD 12.1. Could that be an explanation?

Now, we needed to add more space and popped in eight new disks. Here's the problem: only three of the disks show up in dmesg and /dev on the active storage1, whereas all eight show up on the passive storage2.

The machines are identical and both run the mpr driver:
Code:
mpr0: <Avago Technologies (LSI) SAS3216> port 0x2000-0x20ff mem 0x92c00000-0x92c0ffff irq 16 at device 0.0 numa-domain 0 on pci5
mpr0: Firmware: 09.00.100.00, Driver: 18.03.00.00-fbsd
mpr0: IOCCapabilities: 7a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc,FastPath,RDPQArray>

Here's what /var/log/messages on storage1 showed when the disks were put in:
Code:
Apr 20 15:23:12 storage1 kernel: mpr0: Found device <401<SspTarg>,End Device> <12.0Gbps> handle<0x002d> enclosureHandle<0x0002> slot 49
Apr 20 15:23:12 storage1 kernel: mpr0: At enclosure level 0 and connector name (    )
Apr 20 15:23:13 storage1 kernel: ses1: da17,pass21: Element descriptor: '{"Name":"DriveBay10"}   '
Apr 20 15:23:13 storage1 kernel: ses1: da17,pass21: SAS Device Slot Element: 1 Phys at Slot 9, Not All Phys
Apr 20 15:23:13 storage1 kernel: ses1:  phy 0: SAS device type 1 id 0
Apr 20 15:23:13 storage1 kernel: ses1:  phy 0: protocols: Initiator( None ) Target( SSP )
Apr 20 15:23:13 storage1 kernel: ses1:  phy 0: parent 51402ec000ff95fd addr 5000cca264b77e39
Apr 20 15:23:13 storage1 kernel: da17 at mpr0 bus 0 scbus2 target 101 lun 0
Apr 20 15:23:13 storage1 kernel: da17: <HP MB014000JWUDB HPD1> Fixed Direct Access SPC-4 SCSI device
Apr 20 15:23:13 storage1 kernel: da17: Serial Number 9RK7XB1C
Apr 20 15:23:13 storage1 kernel: da17: 1200.000MB/s transfers
Apr 20 15:23:13 storage1 kernel: da17: Command Queueing enabled
Apr 20 15:23:13 storage1 kernel: da17: 13351936MB (27344764928 512 byte sectors)
Apr 20 15:24:03 storage1 kernel: mpr0: Found device <401<SspTarg>,End Device> <12.0Gbps> handle<0x0027> enclosureHandle<0x0002> slot 48
Apr 20 15:24:03 storage1 kernel: mpr0: At enclosure level 0 and connector name (    )
Apr 20 15:24:13 storage1 kernel: ses1: da18,pass22: Element descriptor: '{"Name":"DriveBay9"}'
Apr 20 15:24:13 storage1 kernel: ses1: da18,pass22: SAS Device Slot Element: 1 Phys at Slot 8, Not All Phys
Apr 20 15:24:13 storage1 kernel: ses1:  phy 0: SAS device type 1 id 0
Apr 20 15:24:13 storage1 kernel: ses1:  phy 0: protocols: Initiator( None ) Target( SSP )
Apr 20 15:24:13 storage1 kernel: ses1:  phy 0: parent 51402ec000ff95fd addr 5000cca264bf1135
Apr 20 15:24:13 storage1 kernel: da18 at mpr0 bus 0 scbus2 target 100 lun 0
Apr 20 15:24:13 storage1 kernel: da18: <HP MB014000JWUDB HPD1> Fixed Direct Access SPC-4 SCSI device
Apr 20 15:24:13 storage1 kernel: da18: Serial Number 9RKD2H3C
Apr 20 15:24:13 storage1 kernel: da18: 1200.000MB/s transfers
Apr 20 15:24:13 storage1 kernel: da18: Command Queueing enabled
Apr 20 15:24:13 storage1 kernel: da18: 13351936MB (27344764928 512 byte sectors)
Apr 20 15:24:35 storage1 kernel: mpr0: Found device <401<SspTarg>,End Device> <12.0Gbps> handle<0x0025> enclosureHandle<0x0002> slot 47
Apr 20 15:24:35 storage1 kernel: mpr0: At enclosure level 0 and connector name (    )
Apr 20 15:24:48 storage1 kernel: ses1: da19,pass23: Element descriptor: '{"Name":"DriveBay8"}'
Apr 20 15:24:48 storage1 kernel: ses1: da19,pass23: SAS Device Slot Element: 1 Phys at Slot 7, Not All Phys
Apr 20 15:24:48 storage1 kernel: ses1:  phy 0: SAS device type 1 id 0
Apr 20 15:24:48 storage1 kernel: ses1:  phy 0: protocols: Initiator( None ) Target( SSP )
Apr 20 15:24:48 storage1 kernel: ses1:  phy 0: parent 51402ec000ff95fd addr 5000cca264bf21d1
Apr 20 15:24:48 storage1 kernel: da19 at mpr0 bus 0 scbus2 target 99 lun 0
Apr 20 15:24:48 storage1 kernel: da19: <HP MB014000JWUDB HPD1> Fixed Direct Access SPC-4 SCSI device
Apr 20 15:24:48 storage1 kernel: da19: Serial Number 9RKD3LDC
Apr 20 15:24:48 storage1 kernel: da19: 1200.000MB/s transfers
Apr 20 15:24:48 storage1 kernel: da19: Command Queueing enabled
Apr 20 15:24:48 storage1 kernel: da19: 13351936MB (27344764928 512 byte sectors)

And this is from storage2:
Code:
Apr 20 15:21:07 storage2 kernel: mpr0: Found device <401<SspTarg>,End Device> <12.0Gbps> handle<0x002c> enclosureHandle<0x0003> slot 52
Apr 20 15:21:07 storage2 kernel: mpr0: At enclosure level 0 and connector name (    )
Apr 20 15:21:16 storage2 kernel: da17 at mpr0 bus 0 scbus2 target 90 lun 0
Apr 20 15:21:16 storage2 kernel: da17: <HP MB014000JWUDB HPD1> Fixed Direct Access SPC-4 SCSI device
Apr 20 15:21:16 storage2 kernel: da17: Serial Number 9RKBZ0KL
Apr 20 15:21:16 storage2 kernel: da17: 1200.000MB/s transfers
Apr 20 15:21:16 storage2 kernel: da17: Command Queueing enabled
Apr 20 15:21:16 storage2 kernel: da17: 13351936MB (27344764928 512 byte sectors)
Apr 20 15:22:34 storage2 kernel: mpr0: Found device <401<SspTarg>,End Device> <12.0Gbps> handle<0x002b> enclosureHandle<0x0003> slot 51
Apr 20 15:22:34 storage2 kernel: mpr0: At enclosure level 0 and connector name (    )
Apr 20 15:22:40 storage2 kernel: ses0: da18,pass21 in '{"Name":"DriveBay12"}   ', SAS Slot: 1+ phys at slot 11
Apr 20 15:22:40 storage2 kernel: ses0:  phy 0: SAS device type 1 phy 0 Target ( SSP )
Apr 20 15:22:40 storage2 kernel: ses0:  phy 0: parent 51402ec000fe3b3d addr 5000cca264be3859
Apr 20 15:22:40 storage2 kernel: da18 at mpr0 bus 0 scbus2 target 89 lun 0
Apr 20 15:22:40 storage2 kernel: da18: <HP MB014000JWUDB HPD1> Fixed Direct Access SPC-4 SCSI device
Apr 20 15:22:40 storage2 kernel: da18: Serial Number 9RKBM1DC
Apr 20 15:22:40 storage2 kernel: da18: 1200.000MB/s transfers
Apr 20 15:22:40 storage2 kernel: da18: Command Queueing enabled
Apr 20 15:22:40 storage2 kernel: da18: 13351936MB (27344764928 512 byte sectors)
Apr 20 15:23:07 storage2 kernel: mpr0: Found device <401<SspTarg>,End Device> <12.0Gbps> handle<0x002d> enclosureHandle<0x0002> slot 49
Apr 20 15:23:07 storage2 kernel: mpr0: At enclosure level 0 and connector name (    )
Apr 20 15:23:14 storage2 kernel: ses1: da19,pass22 in '{"Name":"DriveBay10"}   ', SAS Slot: 1+ phys at slot 9
Apr 20 15:23:14 storage2 kernel: ses1:  phy 0: SAS device type 1 phy 1 Target ( SSP )
Apr 20 15:23:14 storage2 kernel: ses1:  phy 0: parent 51402ec000ff95ff addr 5000cca264b77e3a
Apr 20 15:23:14 storage2 kernel: da19 at mpr0 bus 0 scbus2 target 101 lun 0
Apr 20 15:23:14 storage2 kernel: da19: <HP MB014000JWUDB HPD1> Fixed Direct Access SPC-4 SCSI device
Apr 20 15:23:14 storage2 kernel: da19: Serial Number 9RK7XB1C
Apr 20 15:23:14 storage2 kernel: da19: 1200.000MB/s transfers
Apr 20 15:23:14 storage2 kernel: da19: Command Queueing enabled
Apr 20 15:23:14 storage2 kernel: da19: 13351936MB (27344764928 512 byte sectors)
Apr 20 15:24:00 storage2 kernel: mpr0: Found device <401<SspTarg>,End Device> <12.0Gbps> handle<0x0027> enclosureHandle<0x0002> slot 48
Apr 20 15:24:00 storage2 kernel: mpr0: At enclosure level 0 and connector name (    )
Apr 20 15:24:12 storage2 kernel: ses1: da20,pass23 in '{"Name":"DriveBay9"}', SAS Slot: 1+ phys at slot 8
Apr 20 15:24:12 storage2 kernel: ses1:  phy 0: SAS device type 1 phy 1 Target ( SSP )
Apr 20 15:24:12 storage2 kernel: ses1:  phy 0: parent 51402ec000ff95ff addr 5000cca264bf1136
Apr 20 15:24:12 storage2 kernel: da20 at mpr0 bus 0 scbus2 target 100 lun 0
Apr 20 15:24:12 storage2 kernel: da20: <HP MB014000JWUDB HPD1> Fixed Direct Access SPC-4 SCSI device
Apr 20 15:24:12 storage2 kernel: da20: Serial Number 9RKD2H3C
Apr 20 15:24:12 storage2 kernel: da20: 1200.000MB/s transfers
Apr 20 15:24:12 storage2 kernel: da20: Command Queueing enabled
Apr 20 15:24:12 storage2 kernel: da20: 13351936MB (27344764928 512 byte sectors)
Apr 20 15:24:32 storage2 kernel: mpr0: Found device <401<SspTarg>,End Device> <12.0Gbps> handle<0x0025> enclosureHandle<0x0002> slot 47
Apr 20 15:24:32 storage2 kernel: mpr0: At enclosure level 0 and connector name (    )
Apr 20 15:24:32 storage2 kernel: ses1: da21,pass24 in '{"Name":"DriveBay8"}', SAS Slot: 1+ phys at slot 7
Apr 20 15:24:32 storage2 kernel: ses1:  phy 0: SAS device type 1 phy 1 Target ( SSP )
Apr 20 15:24:32 storage2 kernel: ses1:  phy 0: parent 51402ec000ff95ff addr 5000cca264bf21d2
Apr 20 15:24:49 storage2 kernel: da21 at mpr0 bus 0 scbus2 target 99 lun 0
Apr 20 15:24:49 storage2 kernel: da21: <HP MB014000JWUDB HPD1> Fixed Direct Access SPC-4 SCSI device
Apr 20 15:24:49 storage2 kernel: da21: Serial Number 9RKD3LDC
Apr 20 15:24:49 storage2 kernel: da21: 1200.000MB/s transfers
Apr 20 15:24:49 storage2 kernel: da21: Command Queueing enabled
Apr 20 15:24:49 storage2 kernel: da21: 13351936MB (27344764928 512 byte sectors)
Apr 20 15:25:41 storage2 kernel: mpr0: Found device <401<SspTarg>,End Device> <12.0Gbps> handle<0x002a> enclosureHandle<0x0003> slot 50
Apr 20 15:25:41 storage2 kernel: mpr0: At enclosure level 0 and connector name (    )
Apr 20 15:25:41 storage2 kernel: ses0: da22,pass25 in '{"Name":"DriveBay11"}   ', SAS Slot: 1+ phys at slot 10
Apr 20 15:25:41 storage2 kernel: ses0:  phy 0: SAS device type 1 phy 0 Target ( SSP )
Apr 20 15:25:41 storage2 kernel: ses0:  phy 0: parent 51402ec000fe3b3d addr 5000cca264bdca2d
Apr 20 15:26:03 storage2 kernel: da22 at mpr0 bus 0 scbus2 target 88 lun 0
Apr 20 15:26:03 storage2 kernel: da22: <HP MB014000JWUDB HPD1> Fixed Direct Access SPC-4 SCSI device
Apr 20 15:26:03 storage2 kernel: da22: Serial Number 9RKBBPYL
Apr 20 15:26:03 storage2 kernel: da22: 1200.000MB/s transfers
Apr 20 15:26:03 storage2 kernel: da22: Command Queueing enabled
Apr 20 15:26:03 storage2 kernel: da22: 13351936MB (27344764928 512 byte sectors)
Apr 20 15:26:46 storage2 kernel: mpr0: Found device <401<SspTarg>,End Device> <12.0Gbps> handle<0x0028> enclosureHandle<0x0003> slot 49
Apr 20 15:26:46 storage2 kernel: mpr0: At enclosure level 0 and connector name (    )
Apr 20 15:26:51 storage2 kernel: ses0: da23,pass26 in '{"Name":"DriveBay10"}   ', SAS Slot: 1+ phys at slot 9
Apr 20 15:26:51 storage2 kernel: ses0:  phy 0: SAS device type 1 phy 0 Target ( SSP )
Apr 20 15:26:51 storage2 kernel: ses0:  phy 0: parent 51402ec000fe3b3d addr 5000cca264bebf9d
Apr 20 15:26:51 storage2 kernel: da23 at mpr0 bus 0 scbus2 target 87 lun 0
Apr 20 15:26:51 storage2 kernel: da23: <HP MB014000JWUDB HPD1> Fixed Direct Access SPC-4 SCSI device
Apr 20 15:26:51 storage2 kernel: da23: Serial Number 9RKBX1NL
Apr 20 15:26:51 storage2 kernel: da23: 1200.000MB/s transfers
Apr 20 15:26:51 storage2 kernel: da23: Command Queueing enabled
Apr 20 15:26:51 storage2 kernel: da23: 13351936MB (27344764928 512 byte sectors)
Apr 20 15:26:54 storage2 kernel: mpr0: Found device <401<SspTarg>,End Device> <12.0Gbps> handle<0x002e> enclosureHandle<0x0003> slot 48
Apr 20 15:26:54 storage2 kernel: mpr0: At enclosure level 0 and connector name (    )
Apr 20 15:27:03 storage2 kernel: ses0: da24,pass27 in '{"Name":"DriveBay9"}', SAS Slot: 1+ phys at slot 8
Apr 20 15:27:03 storage2 kernel: ses0:  phy 0: SAS device type 1 phy 0 Target ( SSP )
Apr 20 15:27:03 storage2 kernel: ses0:  phy 0: parent 51402ec000fe3b3d addr 5000cca264bce109
Apr 20 15:27:03 storage2 kernel: da24 at mpr0 bus 0 scbus2 target 86 lun 0
Apr 20 15:27:03 storage2 kernel: da24: <HP MB014000JWUDB HPD1> Fixed Direct Access SPC-4 SCSI device
Apr 20 15:27:03 storage2 kernel: da24: Serial Number 9RKAW5MC
Apr 20 15:27:03 storage2 kernel: da24: 1200.000MB/s transfers
Apr 20 15:27:03 storage2 kernel: da24: Command Queueing enabled
Apr 20 15:27:03 storage2 kernel: da24: 13351936MB (27344764928 512 byte sectors)

As you can see, storage1 stops at da19 whereas storage2 sees all the new disks.
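To make that comparison concrete, here is a minimal sketch that pairs each "Found device" line with the "Serial Number" line that follows it and lists the disks storage2 reports but storage1 does not. The log excerpts are copied from the two listings above; the function names are made up for illustration, and the pairing logic assumes each device's serial number is logged before the next device is announced on the same host, which holds for these excerpts but is not guaranteed in general.

```python
import re
from collections import defaultdict

# "Found device" and "Serial Number" lines copied from the two logs above.
LOG = """\
Apr 20 15:23:12 storage1 kernel: mpr0: Found device <401<SspTarg>,End Device> <12.0Gbps> handle<0x002d> enclosureHandle<0x0002> slot 49
Apr 20 15:23:13 storage1 kernel: da17: Serial Number 9RK7XB1C
Apr 20 15:24:03 storage1 kernel: mpr0: Found device <401<SspTarg>,End Device> <12.0Gbps> handle<0x0027> enclosureHandle<0x0002> slot 48
Apr 20 15:24:13 storage1 kernel: da18: Serial Number 9RKD2H3C
Apr 20 15:24:35 storage1 kernel: mpr0: Found device <401<SspTarg>,End Device> <12.0Gbps> handle<0x0025> enclosureHandle<0x0002> slot 47
Apr 20 15:24:48 storage1 kernel: da19: Serial Number 9RKD3LDC
Apr 20 15:21:07 storage2 kernel: mpr0: Found device <401<SspTarg>,End Device> <12.0Gbps> handle<0x002c> enclosureHandle<0x0003> slot 52
Apr 20 15:21:16 storage2 kernel: da17: Serial Number 9RKBZ0KL
Apr 20 15:22:34 storage2 kernel: mpr0: Found device <401<SspTarg>,End Device> <12.0Gbps> handle<0x002b> enclosureHandle<0x0003> slot 51
Apr 20 15:22:40 storage2 kernel: da18: Serial Number 9RKBM1DC
Apr 20 15:23:07 storage2 kernel: mpr0: Found device <401<SspTarg>,End Device> <12.0Gbps> handle<0x002d> enclosureHandle<0x0002> slot 49
Apr 20 15:23:14 storage2 kernel: da19: Serial Number 9RK7XB1C
Apr 20 15:24:00 storage2 kernel: mpr0: Found device <401<SspTarg>,End Device> <12.0Gbps> handle<0x0027> enclosureHandle<0x0002> slot 48
Apr 20 15:24:12 storage2 kernel: da20: Serial Number 9RKD2H3C
Apr 20 15:24:32 storage2 kernel: mpr0: Found device <401<SspTarg>,End Device> <12.0Gbps> handle<0x0025> enclosureHandle<0x0002> slot 47
Apr 20 15:24:49 storage2 kernel: da21: Serial Number 9RKD3LDC
Apr 20 15:25:41 storage2 kernel: mpr0: Found device <401<SspTarg>,End Device> <12.0Gbps> handle<0x002a> enclosureHandle<0x0003> slot 50
Apr 20 15:26:03 storage2 kernel: da22: Serial Number 9RKBBPYL
Apr 20 15:26:46 storage2 kernel: mpr0: Found device <401<SspTarg>,End Device> <12.0Gbps> handle<0x0028> enclosureHandle<0x0003> slot 49
Apr 20 15:26:51 storage2 kernel: da23: Serial Number 9RKBX1NL
Apr 20 15:26:54 storage2 kernel: mpr0: Found device <401<SspTarg>,End Device> <12.0Gbps> handle<0x002e> enclosureHandle<0x0003> slot 48
Apr 20 15:27:03 storage2 kernel: da24: Serial Number 9RKAW5MC
"""

FOUND = re.compile(
    r"(\S+) kernel: mpr\d+: Found device .* enclosureHandle<(0x[0-9a-f]+)> slot (\d+)"
)
SERIAL = re.compile(r"(\S+) kernel: da\d+: Serial Number (\S+)")


def disks_by_host(log_text):
    """Return {host: {serial: (enclosure_handle, slot)}} (hypothetical helper)."""
    pending = {}                 # host -> location of the most recently announced device
    result = defaultdict(dict)
    for line in log_text.splitlines():
        found = FOUND.search(line)
        if found:
            host, handle, slot = found.groups()
            pending[host] = (handle, int(slot))
            continue
        serial = SERIAL.search(line)
        if serial and serial.group(1) in pending:
            host, number = serial.groups()
            result[host][number] = pending.pop(host)
    return dict(result)


def missing_on(disks, host, reference):
    """Disks that `reference` reports but `host` does not (hypothetical helper)."""
    return {s: loc for s, loc in disks[reference].items() if s not in disks[host]}


disks = disks_by_host(LOG)
missing = missing_on(disks, "storage1", "storage2")
for serial, (handle, slot) in sorted(missing.items()):
    print(f"{serial}: enclosureHandle {handle}, slot {slot}")
```

For these excerpts, every one of the five disks that storage1 is missing sits in the enclosure that storage2 reports as enclosureHandle 0x0003, while the three disks both hosts see are in 0x0002. Enclosure handles are assigned by each controller, so the numbers are only meaningful per host, but the pattern suggests that storage1 is not seeing one whole enclosure path rather than individual disks.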

Tried camcontrol rescan all but that didn't help.

Is there any other way to get the mpr controller to actually rescan and find all the disks? Will they be seen after a reboot? Will upgrading to 12.1 help?

Thanks,
Palle
 
Using camcontrol and the dmesg output, you are reporting what the FreeBSD OS is seeing. To help debug this, let's go one level lower: ask the SAS controller. That might help differentiate where the problem is. Communication with the LSI SAS controllers is done using the mptutil programs, but it's been a few years since I used them directly.

The next question is going to be: What is the SAS topology? Given this many disk drives, you definitely have SAS expanders. Given that you have SES enclosure controllers showing up, most likely you have external JBOD enclosures, which use expander chips also as enclosure controllers. Depending on the type of hardware, you might be able to communicate with the expanders/controllers, and ask them how many disk drives they're seeing.

And let me ask a really dumb question: Could it be that some of the new disks you inserted are not actually dual-ported SAS disks, but single-ported SATA disks, with the SAS fabric hiding that fact from you (SATA disks on a SAS fabric look like SAS disks, at least under Linux)? That would be a super-simple explanation.
 
I thought you would need to use a multipath control utility to pull this off.
Not if the disks are visible on only one path from each host; each host is single-path.

An interesting question is how they are going to handle the failover, to make sure at any moment in time the disks are mounted from at most one host, and most of the time from exactly one. But that's not the problem we're discussing here.
 
I just used sas3ircu to display all disks, and the difference is the same there as in the camcontrol.
All eight new disks have the same model number, MB014000JWUDB, so I doubt that some would not be dual domain.

2 HP machines storage1, storage2 are connected with mini-SAS connectors to 2 disk enclosures.

servers are: ProLiant DL360 Gen9
with Avago SAS9305-16e HBA controller installed for storage management

the two storage enclosures are : HP D3600

We use ZFS and handle failover, including avoiding split brain, through a quorum that decides which machine is active at any time, but as ralphbsz writes, that's another discussion... :)
 
Is there any way you can connect to the enclosure? I just looked at the documentation for the HP D3600 online, and I don't see any connectors on the controller (in the back). I've worked with enclosures that have Ethernet ports on their SAS controllers, allowing diagnostic connections. You could ask the enclosure controllers (both of them) what disk drives they see. On the HP boxes, I have NO idea how to do that, never worked with them.

You did talk to the Avago (=Broadcom =LSI) cards already, using the sas3ircu utility (the mptutil package is probably too old for these cards). That confirms that the 9305 card is not seeing the disks, which means either the SAS controller in the enclosure is "hiding" them, or the 9305 is not communicating with the enclosure correctly.

Have you opened a trouble ticket with the supplier? In my world, this is where technical support should get some engineering resources into the problem.
 
Do you have a spare disk to swap in? What do you see in HP SIM?

I'm not sure that those disks are dual-port.

Dual domain:
In a dual domain deployment, two paths exist from the disk enclosure to the host. In a dual domain deployment, both I/O modules in the disk enclosure are used. Because dual domain deployments provide two paths to the storage, access is ensured, even in the event of device, cable, or power failure. In dual domain environments, dual-port disk drives are required.
 
Don't think there are any connectors apart from the SAS connector.

Talking to the supplier seems like the next stop, you're right. I just wanted to rule out the possibility that the OS did something fishy.
 
The eight disks all have identical model numbers, and three out of eight are seen by both machines, so I'm pretty sure the disks are actually dual-port.
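The logs earlier in the thread give a second hint that points the same way. For the three disks both hosts see, each host reaches the disk through a different SAS address, reported on the ses1 "phy 0: parent ... addr ..." lines. A minimal sketch, with the values copied by hand from those lines, computes the difference between the two addresses per disk. It assumes that a dual-ported drive exposes one SAS address per port and that the two are consecutive, which is common but not something this thread confirms for these particular drives.

```python
# (serial, address seen by storage1, address seen by storage2),
# copied from the ses1 "phy 0: parent ... addr ..." lines in the logs.
SEEN = [
    ("9RK7XB1C", "5000cca264b77e39", "5000cca264b77e3a"),
    ("9RKD2H3C", "5000cca264bf1135", "5000cca264bf1136"),
    ("9RKD3LDC", "5000cca264bf21d1", "5000cca264bf21d2"),
]


def port_offsets(rows):
    """Return {serial: numeric difference between the two SAS addresses}."""
    return {serial: int(b, 16) - int(a, 16) for serial, a, b in rows}


for serial, offset in port_offsets(SEEN).items():
    print(f"{serial}: storage2 address is storage1 address + {offset}")
```

Each host seeing a different, adjacent address for the same serial number is what one would expect if the hosts are attached to different ports of the same dual-ported drive.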
 
This problem was fixed by moving a connector, connecting the disk cabinet with the HBA controller, from one socket to another. Silly, but it worked for me.
 