11.1 device ses0 masking da5

Alan Lundin · Dec 21, 2017

I recently upgraded two Dell R515s, each with 8 disks, from 11.0 to 11.1. After the upgrade, device ses0 is found, but it uses the same target and LUN as one of the disks, which is now masked out and can't be seen. The exact same thing happens on each of the R515s.

Code:

# camcontrol devlist
<IBM ULTRIUM-HH4 E4J1>             at scbus0 target 4 lun 0 (sa0,pass0)
<DELL PV-124T 0091>                at scbus0 target 4 lun 1 (pass1,ch0)
<ATA Hitachi HUA72202 A3HA>        at scbus1 target 3 lun 0 (pass2,da0)
<ATA Hitachi HUA72202 A3HA>        at scbus1 target 4 lun 0 (pass3,da1)
<ATA Hitachi HUA72202 A3HA>        at scbus1 target 5 lun 0 (pass4,da2)
<ATA TOSHIBA MG03ACA2 FL1D>        at scbus1 target 6 lun 0 (pass5,da3)
<ATA Hitachi HUA72202 A3HA>        at scbus1 target 7 lun 0 (pass6,da4)
<DP BACKPLANE 1.10>                at scbus1 target 8 lun 0 (pass7,ses0)
<ATA Hitachi HUA72202 A3HA>        at scbus1 target 9 lun 0 (pass8,da5)
<ATA ST2000NM0011 PA09>            at scbus1 target 10 lun 0 (pass9,da6)
<TSSTcorp DVD+-RW TS-L633J D150>   at scbus6 target 0 lun 0 (cd0,pass10)

dmesg shows that the ses0 device is now found before the da(4) devices are probed, so I presume ses(4) devices now get discovery priority.

This didn't happen with versions 11.0, 10.x, and 9.x. I booted 11.0 off a rescue CD, and ses0 isn't found so all 8 disks *are* found.

I tried compiling the kernel with 'device ses' removed, but oddly, though /dev/ses0 isn't created, the ses hardware is still found and given a passthrough device. The 6th disk is still masked from visibility.

Code:

<DP BACKPLANE 1.10>                at scbus1 target 8 lun 0 (pass7)

Is there any way to move the ses probe back in the probe ordering, provide da hints to override, or otherwise make the useless ses0 device go away, and get my disk back?

BTW, I say ses0 is useless because, though it responds, it doesn't provide any information other than 'ses0: OK'.
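The listing above can also be checked mechanically. What follows is a minimal sketch, not a supported tool: it assumes the `camcontrol devlist` line layout shown in this post, and the helper names (`parse_devlist`, `disks_on_bus`) are made up for illustration. It counts the da(4) disks on a bus and reports where the enclosure device landed.

```python
import re

# One line of `camcontrol devlist` output, in the layout shown in this thread:
# <DP BACKPLANE 1.10>   at scbus1 target 8 lun 0 (pass7,ses0)
DEVLIST_RE = re.compile(
    r"<(?P<desc>[^>]*)>\s+at scbus(?P<bus>\d+) target (?P<target>\d+) "
    r"lun (?P<lun>\d+) \((?P<periphs>[^)]*)\)"
)

def parse_devlist(text):
    """Return one dict per device line found in the devlist text."""
    devices = []
    for line in text.splitlines():
        m = DEVLIST_RE.search(line)
        if m:
            devices.append({
                "desc": m.group("desc"),
                "bus": int(m.group("bus")),
                "target": int(m.group("target")),
                "lun": int(m.group("lun")),
                "periphs": m.group("periphs").split(","),
            })
    return devices

def disks_on_bus(devices, bus):
    """Names of the da(4) peripherals attached on the given scbus."""
    return sorted(
        p
        for d in devices if d["bus"] == bus
        for p in d["periphs"] if re.fullmatch(r"da\d+", p)
    )

# Output copied from the post above.
SAMPLE = """\
<IBM ULTRIUM-HH4 E4J1>             at scbus0 target 4 lun 0 (sa0,pass0)
<DELL PV-124T 0091>                at scbus0 target 4 lun 1 (pass1,ch0)
<ATA Hitachi HUA72202 A3HA>        at scbus1 target 3 lun 0 (pass2,da0)
<ATA Hitachi HUA72202 A3HA>        at scbus1 target 4 lun 0 (pass3,da1)
<ATA Hitachi HUA72202 A3HA>        at scbus1 target 5 lun 0 (pass4,da2)
<ATA TOSHIBA MG03ACA2 FL1D>        at scbus1 target 6 lun 0 (pass5,da3)
<ATA Hitachi HUA72202 A3HA>        at scbus1 target 7 lun 0 (pass6,da4)
<DP BACKPLANE 1.10>                at scbus1 target 8 lun 0 (pass7,ses0)
<ATA Hitachi HUA72202 A3HA>        at scbus1 target 9 lun 0 (pass8,da5)
<ATA ST2000NM0011 PA09>            at scbus1 target 10 lun 0 (pass9,da6)
<TSSTcorp DVD+-RW TS-L633J D150>   at scbus6 target 0 lun 0 (cd0,pass10)
"""

devices = parse_devlist(SAMPLE)
disks = disks_on_bus(devices, 1)
enclosures = [d for d in devices
              if any(p.startswith("ses") for p in d["periphs"])]

print(len(disks), "da disks on scbus1:", disks)
for e in enclosures:
    print("enclosure", e["desc"], "at target", e["target"])
```

With the sample above this reports seven da disks on scbus1 (one short of the eight installed) and the backplane at target 8.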

ralphbsz · Dec 23, 2017

What is your SCSI hardware? Is this SAS or parallel?

I've heard stories of people using nasty tricks on parallel SCSI hardware, to allow more than 7 or 15 targets on the bus (6 or 14 when using dual-initiator setups for failover). Those tricks involve having two targets that share an address, or use the same address as the initiator. These tricks violate the SCSI standard, but as long as you live within a closed universe (of firmware, drivers, and targets), they happen to work.

Can you post your dmesg output of SCSI discovery?

And ses devices can be very useful, because you can use them to actually manage your disks (turn power off and on, control human-visible indicator lights, find out which disk is physically where, measure their supply current and temperature, and such things). Just because you don't happen to use or need that functionality (right now) doesn't mean other people don't enjoy it.

Alan Lundin · Dec 23, 2017

Thanks ralphbsz. I really appreciate your responding to this.

ralphbsz said:
What is your SCSI hardware? Is this SAS or parallel?

It's an LSI Fusion-MPT 2 controller (mps device) attached to 8 SATA drives.

I've heard stories of people using nasty tricks on parallel SCSI hardware, to allow more than 7 or 15 targets on the bus (6 or 14 when using dual-initiator setups for failover). Those tricks involve having two targets that share an address, or use the same address as the initiator. These tricks violate the SCSI standard, but as long as you live within a closed universe (of firmware, drivers, and targets), they happen to work.

I'm not sure anything tricky is needed as this hardware has worked fine with versions 9.0 through 11.0.

Can you post your dmesg output of SCSI discovery?

You bet:

Code:

...
Dec 21 06:48:43 geim kernel: mps0: <Avago Technologies (LSI) SAS2008> port 0xfc00-0xfcff mem 0xef1f0000-0xef1fffff,0xef180000-0xef1bffff irq 24 at device 0.0 on pci1
Dec 21 06:48:43 geim kernel: mps0: Firmware: 02.15.63.00, Driver: 21.02.00.00-fbsd
Dec 21 06:48:43 geim kernel: mps0: IOCCapabilities: 185c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,IR>
Dec 21 06:48:43 geim kernel: pcib2: <ACPI PCI-PCI bridge> at device 4.0 on pci0
Dec 21 06:48:43 geim kernel: pci2: <ACPI PCI bus> on pcib2
Dec 21 06:48:43 geim kernel: mps1: <Avago Technologies (LSI) SAS2008> port 0xec00-0xecff mem 0xef3f0000-0xef3fffff,0xef380000-0xef3bffff irq 44 at device 0.0 on pci2
Dec 21 06:48:43 geim kernel: mps1: Firmware: 07.15.04.00, Driver: 21.02.00.00-fbsd
Dec 21 06:48:43 geim kernel: mps1: IOCCapabilities: 185c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,IR>
...
Dec 21 06:48:43 geim kernel: mps1: SAS Address for SATA device = 2a023f2457733a58
Dec 21 06:48:43 geim kernel: mps1: SAS Address from SATA device = 2a023f2457733a58
Dec 21 06:48:43 geim kernel: mps1: SAS Address for SATA device = 6a739c4696e1ffc5
Dec 21 06:48:43 geim kernel: mps1: SAS Address from SATA device = 6a739c4696e1ffc5
Dec 21 06:48:43 geim kernel: mps1: SAS Address for SATA device = 6a739c4696e3dbc5
Dec 21 06:48:43 geim kernel: mps1: SAS Address from SATA device = 6a739c4696e3dbc5
Dec 21 06:48:43 geim kernel: mps1: SAS Address for SATA device = 3f47341e86957c75
Dec 21 06:48:43 geim kernel: mps1: SAS Address from SATA device = 3f47341e86957c75
Dec 21 06:48:43 geim kernel: ugen0.2: <vendor 0x0d3d USB CAT5> at usbus0
Dec 21 06:48:43 geim kernel: ukbd0 on uhub1
Dec 21 06:48:43 geim kernel: ukbd0: <EP1> on usbus0
Dec 21 06:48:43 geim kernel: mps1: SAS Address for SATA device = 6a739c469cc4fbc5
Dec 21 06:48:43 geim kernel: mps1: SAS Address from SATA device = 6a739c469cc4fbc5
Dec 21 06:48:43 geim kernel: kbd2 at ukbd0
Dec 21 06:48:43 geim kernel: ugen2.2: <vendor 0x0424 product 0x2514> at usbus2
Dec 21 06:48:43 geim kernel: uhub6 on uhub5
Dec 21 06:48:43 geim kernel: uhub6: <vendor 0x0424 product 0x2514, class 9/0, rev 2.00/0.00, addr 2> on usbus2
Dec 21 06:48:43 geim kernel: uhub6: MTT enabled
Dec 21 06:48:43 geim kernel: mps1: SAS Address for SATA device = 6a739c4698e8edc5
Dec 21 06:48:43 geim kernel: mps1: SAS Address from SATA device = 6a739c4698e8edc5
Dec 21 06:48:43 geim kernel: mps1: SAS Address for SATA device = 6a739c349fe7edc5
Dec 21 06:48:43 geim kernel: mps1: SAS Address from SATA device = 6a739c349fe7edc5
Dec 21 06:48:43 geim kernel: mps1: SAS Address for SATA device = 6a739c469cd2f1c5
Dec 21 06:48:43 geim kernel: mps1: SAS Address from SATA device = 6a739c469cd2f1c5
Dec 21 06:48:43 geim kernel: uhub6: 4 ports with 4 removable, self powered
Dec 21 06:48:43 geim kernel: ses0 at mps1 bus 0 scbus1 target 8 lun 0
Dec 21 06:48:43 geim kernel: ses0: <DP BACKPLANE 1.10> Fixed Enclosure Services SPC-3 SCSI device
Dec 21 06:48:43 geim kernel: ses0: 150.000MB/s transfers
Dec 21 06:48:43 geim kernel: ses0: SCSI-3 ENC Device
Dec 21 06:48:43 geim kernel: sa0 at mps0 bus 0 scbus0 target 4 lun 0
Dec 21 06:48:43 geim kernel: cd0 at ata2 bus 0 scbus6 target 0 lun 0
Dec 21 06:48:43 geim kernel: sa0: <IBM ULTRIUM-HH4 E4J1> Removable Sequential Access SPC-4 SCSI device
Dec 21 06:48:43 geim kernel: cd0: <TSSTcorp DVD+-RW TS-L633J D150> Removable CD-ROM SCSI device
Dec 21 06:48:43 geim kernel: da3 at mps1 bus 0 scbus1 target 6 lun 0
Dec 21 06:48:43 geim kernel: da6 at mps1 bus 0 scbus1 target 10 lun 0
Dec 21 06:48:43 geim kernel: sa0: Serial Number 1068051751
Dec 21 06:48:43 geim kernel: da3: da6: cd0: Serial Number R8B06GMB508112
Dec 21 06:48:43 geim kernel: sa0: 600.000MB/s transfers<ATA ST2000NM0011 PA09> Fixed Direct Access SPC-3 SCSI device
Dec 21 06:48:43 geim kernel:
Dec 21 06:48:43 geim kernel: <ATA TOSHIBA MG03ACA2 FL1D> Fixed Direct Access SPC-3 SCSI device
Dec 21 06:48:43 geim kernel: cd0: 150.000MB/s transfersda6: Serial Number Z1P64N6V
Dec 21 06:48:43 geim kernel: da3: Serial Number 35A1K5NXF
Dec 21 06:48:43 geim kernel: (da6: 300.000MB/s transfers
Dec 21 06:48:43 geim kernel: da6: Command Queueing enabled
Dec 21 06:48:43 geim kernel: da3: 300.000MB/s transfersSATA, da6: 1907729MB (3907029168 512 byte sectors)
Dec 21 06:48:43 geim kernel:
Dec 21 06:48:43 geim kernel: da3: Command Queueing enabled
Dec 21 06:48:43 geim kernel: UDMA5, da1 at mps1 bus 0 scbus1 target 4 lun 0
Dec 21 06:48:43 geim kernel: ATAPI 12bytes, PIO 8192bytes)
Dec 21 06:48:43 geim kernel: da1: da3: 1907729MB (3907029168 512 byte sectors)
Dec 21 06:48:43 geim kernel: cd0: Attempt to query device size failed: NOT READY, Medium not present - tray closed
Dec 21 06:48:43 geim kernel: <ATA Hitachi HUA72202 A3HA> Fixed Direct Access SPC-3 SCSI device
Dec 21 06:48:43 geim kernel: da0 at mps1 bus 0 scbus1 target 3 lun 0
Dec 21 06:48:43 geim kernel: da2 at mps1 bus 0 scbus1 target 5 lun 0
Dec 21 06:48:43 geim kernel: da1: Serial Number JK11D1BEHHGWGZ
Dec 21 06:48:43 geim kernel: da2: da0: da1: 300.000MB/s transfers<ATA Hitachi HUA72202 A3HA> Fixed Direct Access SPC-3 SCSI device
Dec 21 06:48:43 geim kernel:
Dec 21 06:48:43 geim kernel: da1: Command Queueing enabled
Dec 21 06:48:43 geim kernel: <ATA Hitachi HUA72202 A3HA> Fixed Direct Access SPC-3 SCSI device
Dec 21 06:48:43 geim kernel: da2: Serial Number JK11D1BEHHK3UZ
Dec 21 06:48:43 geim kernel: da1: 1907729MB (3907029168 512 byte sectors)
Dec 21 06:48:43 geim kernel: da0: Serial Number JK11D1BEH6NVGZ
Dec 21 06:48:43 geim kernel: da2: 300.000MB/s transfersda4 at mps1 bus 0 scbus1 target 7 lun 0
Dec 21 06:48:43 geim kernel:
Dec 21 06:48:43 geim kernel: da0: 300.000MB/s transfersda4: da2: Command Queueing enabled
Dec 21 06:48:43 geim kernel:
Dec 21 06:48:43 geim kernel: da0: Command Queueing enabled
Dec 21 06:48:43 geim kernel: <ATA Hitachi HUA72202 A3HA> Fixed Direct Access SPC-3 SCSI device
Dec 21 06:48:43 geim kernel: da2: 1907729MB (3907029168 512 byte sectors)
Dec 21 06:48:43 geim kernel: da0: 1907729MB (3907029168 512 byte sectors)
Dec 21 06:48:43 geim kernel: da4: Serial Number JK11D1BEHHER5Z
Dec 21 06:48:43 geim kernel: da5 at mps1 bus 0 scbus1 target 9 lun 0
Dec 21 06:48:43 geim kernel: da4: 300.000MB/s transfers
Dec 21 06:48:43 geim kernel: da4: Command Queueing enabled
Dec 21 06:48:43 geim kernel: da5: <ATA Hitachi HUA72202 A3HA> Fixed Direct Access SPC-3 SCSI device
Dec 21 06:48:43 geim kernel: da5: Serial Number JK11D1BEHHKAKZ
Dec 21 06:48:43 geim kernel: da4: 1907729MB (3907029168 512 byte sectors)
Dec 21 06:48:43 geim kernel: da5: 300.000MB/s transfersch0 at mps0 bus 0 scbus0 target 4 lun 1
Dec 21 06:48:43 geim kernel:
Dec 21 06:48:43 geim kernel: da5: Command Queueing enabled
Dec 21 06:48:43 geim kernel: ch0: <DELL PV-124T 0091> Removable Changer SCSI-2 device
Dec 21 06:48:43 geim kernel: da5: 1907729MB (3907029168 512 byte sectors)
Dec 21 06:48:43 geim kernel: ch0: Serial Number CJ2LR40546
Dec 21 06:48:43 geim kernel: ch0: 600.000MB/s transfers
Dec 21 06:48:43 geim kernel: ch0: Command Queueing enabled
Dec 21 06:48:43 geim kernel: ch0: 16 slots, 1 drive, 1 picker, 0 portals
Dec 21 06:48:43 geim kernel: ch0: quirks=0x2<NO_DVCID>
...

On versions prior to 11.1, the sixth disk was found at

mps1 bus 0 scbus1 target 8 lun 0

but ses0 takes that address now, and the disk has completely disappeared.

And ses devices can be very useful, because you can use them to actually manage your disks (turn power off and on, control human-visible indicator lights, find out which disk is physically where, measure their supply current and temperature, and such things). Just because you don't happen to use or need that functionality (right now) doesn't mean other people don't enjoy it.

Sorry about that. I didn't mean to imply the ses device was useless in general. It's just useless on my hardware as it doesn't see any sub-objects:

Code:

  # sesutil status
   ses0: OK
  # sesutil map
   ses0:
  # getencstat -v /dev/ses0
   /dev/ses0: Enclosure Status <OK>

So in this particular case, the only effect I can see is that it masks out one of my disks.

Thanks for any help.

ralphbsz · Dec 24, 2017

No idea. The mps driver initially sees all 8 disks: it reports, 8 times in a row, "SAS Address for/from SATA device", each time with a different SATA WWN. I have no idea who numbers the targets at the SCSI busses; that's probably the mps driver. And it is also not clear who loses the one disk: is the problem that it is assigned the wrong target number? Or is the problem that the da driver forgets to examine it?

My only suggestion: Open a bug report, and mention both mps and da in it. Maybe a developer recognizes the symptoms.

Oh, and a dirty hack: Is there no way to totally disable the SES0 device? Since it is the backplane, you probably can't just unplug it (since the SAS cable probably goes physically to the backplane, and from there feeds both the SES device and the disks). Most likely, the backplane contains a SAS expander, and those often do double duty as a SES target. But see whether you can find any Dell documentation. Perhaps there is a way to go into its configuration (might be accessible via BIOS utilities or special interfaces), and just turn it off. That might help, and it doesn't seem to be useful for you.

An interesting question is: Why does Dell provide a SES target, when it doesn't seem to be usable (sesutil map doesn't see any elements)? No idea. Perhaps it is old enough to not be standard-conforming (the SES standard was created relatively late, and there is hardware out there that predates the standard). Plus the SES standard makes a college student's kitchen sink look clean and organized; it was a garbage dump, and every vendor threw everything into it, so incompatible implementations aren't uncommon. Perhaps Dell OEM'ed the SAS expander chip from some vendor, and was too cheap to customize it enough to turn off the default SES target that comes with the sample firmware from the vendor.
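The observation that mps sees all 8 disks while only 7 attach can be checked by counting lines in the log. This is a rough sketch that assumes the message wording in the dmesg output posted above; the function name is hypothetical and the sample is an excerpt with the syslog prefix trimmed.

```python
import re

def discovered_vs_attached(dmesg_text, adapter="mps1"):
    """Count SATA devices the adapter reported versus da(4) disks attached."""
    a = re.escape(adapter)
    # Only the "for" line is matched; the "from" line repeats the same address.
    sas = set(re.findall(
        rf"{a}: SAS Address for SATA device = ([0-9a-f]+)", dmesg_text))
    # No word boundary before "da": kernel messages are interleaved in this
    # log, e.g. "transfersda4 at mps1 ...".
    attached = set(re.findall(rf"(da\d+) at {a} ", dmesg_text))
    return len(sas), len(attached)

# Excerpt from the log posted above (timestamp and host prefix removed).
SAMPLE_DMESG = """\
mps1: SAS Address for SATA device = 2a023f2457733a58
mps1: SAS Address for SATA device = 6a739c4696e1ffc5
mps1: SAS Address for SATA device = 6a739c4696e3dbc5
mps1: SAS Address for SATA device = 3f47341e86957c75
mps1: SAS Address for SATA device = 6a739c469cc4fbc5
mps1: SAS Address for SATA device = 6a739c4698e8edc5
mps1: SAS Address for SATA device = 6a739c349fe7edc5
mps1: SAS Address for SATA device = 6a739c469cd2f1c5
ses0 at mps1 bus 0 scbus1 target 8 lun 0
da3 at mps1 bus 0 scbus1 target 6 lun 0
da6 at mps1 bus 0 scbus1 target 10 lun 0
UDMA5, da1 at mps1 bus 0 scbus1 target 4 lun 0
da0 at mps1 bus 0 scbus1 target 3 lun 0
da2 at mps1 bus 0 scbus1 target 5 lun 0
da2: 300.000MB/s transfersda4 at mps1 bus 0 scbus1 target 7 lun 0
da5 at mps1 bus 0 scbus1 target 9 lun 0
"""

found, attached = discovered_vs_attached(SAMPLE_DMESG)
print(found, "SATA devices reported,", attached, "da disks attached")
```

A mismatch between the two counts, as in this excerpt, points at target assignment or probing rather than at the adapter failing to see the drive.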

Alan Lundin · Dec 24, 2017

Thanks for looking at this ralphbsz.

I too was hoping I could rearrange the devices in the BIOS, so that was one of the first things I looked at before I came here, but sadly they don't have anything there that helps this problem.

I'm limited on how much I can fool around with a system used for production, so I think I'll take a quick look at cam/scsi/scsi_enc.c to see if I can kill the probe early and be done with it. I have a couple days I can do reboots before we are back in production mode. If I can't find a working source-code mod, I'll probably move back to 11.0 and hope the lack of ports support won't haunt us.

Thanks again. It was very kind of you to look at our problem and share your knowledge.

Terry_Kennedy · Dec 26, 2017

Alan Lundin said:
It's an LSI Fusion-MPT 2 controller (mps device) attached to 8 SATA drives.

Is it a Dell PERC controller or just a generic card you've installed in the system? The Dell cards have special knowledge of the Dell backplanes and vice versa.

Alan Lundin · Dec 26, 2017

It is a Dell PERC controller. I don't remember off the top of my head which, but I'm pretty sure it was one of the PERC H3x0 series -- whichever was available around 2011.

ralphbsz · Dec 26, 2017

If you have a different SAS card sitting around (name-brand not OEM LSI/Avago/Broadcom, or PMC-Sierra), try it. The problem *might* go away. But I don't think spending $300 to buy a new card is a good investment; the problem might also be some strange incompatibility between the 11.1 kernel and the Dell SAS backplane.

Obnoxious side remark: I should probably say that I hate working with SES. But then, the standard is so buggy and ambiguous, and implementations are so bizarre, it pays many people's salary (including a good fraction of mine for several years), so it is an excellent employment assistance program for people in the storage industry.

Alan Lundin · Dec 26, 2017

If you have a different SAS card sitting around (name-brand not OEM LSI/Avago/Broadcom, or PMC-Sierra), try it. The problem *might* go away.

I could see a different card could work, but unfortunately, I don't have an extra.

...the problem might also be some strange incompatibility between the 11.1 kernel and the Dell SAS backplane.

That seems like a good bet since it had no problem at all with 11.0 and before. I was able to completely remove the scsi_enc* code for a kernel build, but something is still enabling the DP Backplane in 11.1, so I'll likely need to go back to 11.0. I've filed a bug report, so I'll wait a few days to see if anyone there has some insight or a quick patch.

...I should probably say that I hate working with SES. ...

I hear you. It seems like an area where things could easily get out-of-hand quickly.

SirDice · Dec 27, 2017

I noticed you have two similar cards:

Code:

Dec 21 06:48:43 geim kernel: mps0: <Avago Technologies (LSI) SAS2008> port 0xfc00-0xfcff mem 0xef1f0000-0xef1fffff,0xef180000-0xef1bffff irq 24 at device 0.0 on pci1
Dec 21 06:48:43 geim kernel: mps0: Firmware: 02.15.63.00, Driver: 21.02.00.00-fbsd
...
Dec 21 06:48:43 geim kernel: mps1: <Avago Technologies (LSI) SAS2008> port 0xec00-0xecff mem 0xef3f0000-0xef3fffff,0xef380000-0xef3bffff irq 44 at device 0.0 on pci2
Dec 21 06:48:43 geim kernel: mps1: Firmware: 07.15.04.00, Driver: 21.02.00.00-fbsd

Both cards appear to be the same but there's a huge difference in firmware versions. Have you looked for updated firmware? I would try to get them both at the same firmware version. It may not solve the problem though.

And, as there are two cards, is it possible one of the connections got swapped from one card to the other?

Alan Lundin · Dec 27, 2017

That's a good catch. I missed that and haven't looked into firmware availability.

But in this case, the card with the problem is the one with the newer firmware. If I updated the older one, might I not be risking creating yet another problem?

ralphbsz · Dec 28, 2017

And if you *down*-grade the newer one, you would be risking bricking it. Risk is what adds spice to life. Personally, I'm going to have a turkey sandwich without mustard and a glass of milk, since my tolerance for risk is low.

Alan Lundin · Dec 28, 2017

And if you *down*-grade the newer one, ...

Yep, I'm with you.

Alan Lundin · Dec 28, 2017

This is sort of a ridiculous, 'Hail-Mary' question, but would anyone here happen to know or have an intuition about possible problems if I just rebooted to the 11.0 kernel while keeping the 11.1 'world?'

ralphbsz · Dec 29, 2017

Never tried something this crazy. It would probably work; during upgrades, mismatches like that do occur temporarily, and at least the upgrade tends to finish. But see the discussion about risk and spices above: You are rubbing your whole computer with tabasco sauce. Not suitable for the main course of the meal (like a production server), but makes for a nice appetizer.

Alan Lundin · Dec 29, 2017

It turns out I couldn't do it anyway. I thought I had saved the 11.0 kernel, but alas it was gone on both machines. A 'make buildkernel' on the 11.0 tree from an 11.1 system failed, so I'm rebuilding world now.

We go back into production on Jan 2nd, so my plan now is to install and boot off the 11.0 kernel and perhaps let the 'missing' disk rebuild (it's in a ZFS RAIDZ2 pool) today. Then I'll wait until the last minute to see if the developers respond to the bug report before reverting completely back to 11.0.

Michael Usher · Jan 18, 2018

I encountered the same problem. Fortunately, I was building up a new system, so it wasn't in production yet.

I found a workaround in the mps(4) manual page. Add the following to /boot/loader.conf:

Code:

hw.mps.use_phy_num="-1"

I also tried to use device.hints as outlined in cam(4) to change the target number of ses0 or to set da0 to the overloaded target, but these attempts failed.

Note that this workaround is risky if you have any duplicate physical numbers in the SAS hierarchy. I have no expanders, so it is fine.
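Before rebooting a production box, it can be worth confirming that the tunable really is in the file. A small sketch, assuming a simple `name="value"` loader.conf layout and that a later assignment overrides an earlier one; the `loader_tunable` helper and the sample file contents are hypothetical.

```python
import re

def loader_tunable(conf_text, name):
    """Return the last value assigned to `name` in loader.conf-style text.

    Returns None when the variable is not set. Anything after '#' is
    treated as a comment, which is a simplification.
    """
    value = None
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()
        m = re.fullmatch(r'([\w.]+)\s*=\s*"?([^"]*)"?', line)
        if m and m.group(1) == name:
            value = m.group(2)
    return value

# Hypothetical /boot/loader.conf contents for illustration.
SAMPLE_CONF = """\
zfs_load="YES"
# workaround for ses0 taking a disk's target
hw.mps.use_phy_num="-1"
"""

setting = loader_tunable(SAMPLE_CONF, "hw.mps.use_phy_num")
print("hw.mps.use_phy_num =", setting)
```

On a real system the text would come from reading /boot/loader.conf; the sample string just keeps the sketch self-contained.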

Alan Lundin · Jan 18, 2018

I wouldn't have made that connection. I'll give it a try as soon as I can reboot the machine (which is in heavy use right now).

Thanks for the response and info!!!

Alan Lundin · Jan 22, 2018

Michael Usher said:
... Add the following to /boot/loader.conf:

Code:

hw.mps.use_phy_num="-1"

...

Michael, you are officially my hero!!! I found the action slow this morning, so I took the opportunity to reboot with your loader suggestion, and HOO, HOO it worked. Both machines are currently resilvering.

Many, many thanks to you Michael!
