Other Any known issues with 64 or more targets per SAS controller?

We have a couple of FreeBSD 12.1 servers using HP H241 SAS HBAs talking to disks in HP D6020 external SAS cabinets (70 10TB disks per cabinet). Now we've decided to double that space and got a second fully equipped D6020 (70x12TB), but I'm having problems getting FreeBSD to see all the drives at the same time. There should be around 140 in total (70 per controller), plus a few (6) connected to the internal server bays (on their own controller), but only 48 are seen per controller. A "camcontrol devlist" shows targets up to 63 (starting at target 16) but no higher.

Is there some configurable limit in place that keeps it from probing targets above 63 on a controller? I've been checking the manual pages but so far I haven't found anything useful...

The H241 controller & ciss driver isn't the most stable combo, btw. If it loses contact with the D6020 box we get an instant kernel reboot... :)
 
I know that the LSI SAS HBA (the 92... and 93... series) can handle over 300 dual-ported drives, or ~700 devices per HBA. So the limit is not fundamental to the SAS architecture. I have no idea about the ciss driver though, nor about the Compaq/HP HBAs.

I know Terry Kennedy (recently active in another thread here on the forum) runs FreeBSD machines with many many disks. Let's hope he sees this, or if not, you may be able to ping him.
 
Yeah, in theory the HP H241 can handle many disks (according to the documentation) so it's probably something about the ciss driver.

I normally prefer the LSI SAS3008-based controllers but alas we didn't have that choice for the HP servers. However, I'm considering installing some third-party LSI controllers instead of the HP ones if I can't get this to work properly. We use Dell-branded (HBA330) LSI HBAs on our Dell servers, but there we don't have this silly amount of drives per server so not really a comparable situation.
 
Seems the ciss driver actually does see all 70 drives; it's just that some of them (22, going by the counts below: 70 vs. 48) get lost on their way up to the CAM layer. Ho hum, what fun...
Code:
# cciss_vol_status -V /dev/ciss0|egrep 1200|wc -l
      70

# camcontrol devlist |egrep 1200|wc -l
      48

# cciss_vol_status -V /dev/ciss0
Controller: Smart HBA H241
  Board ID: 0x21c8103c
  Logical drives: 0
  Running firmware: 5.04
  ROM firmware: 5.04
  Physical drives: 70
         connector 2E box 1 bay 1                 HP      MB012000JWDFD                                    5PGTHKAC     HPK
         connector 2E box 1 bay 2                 HP      MB012000JWDFD                                    5PG7HWAE     HPK
         connector 2E box 1 bay 3                 HP      MB012000JWDFD                                    5PGU3VVC     HPK
... 
        connector 2E box 2 bay 31                 HP      MB012000JWDFD                                    5PGMXMKE     HK
         connector 2E box 2 bay 32                 HP      MB012000JWDFD                                    5PGTSTWC     HK
         connector 2E box 2 bay 33                 HP      MB012000JWDFD                                    5PGN890E     HK
         connector 2E box 2 bay 34                 HP      MB012000JWDFD                                    5PGTU3EC     HK
         connector 2E box 2 bay 35                 HP      MB012000JWDFD                                    5PGU59MC     HK
/dev/ciss0: (Smart HBA H241) Enclosure D6020 (S/N: 7CE952P06X) on Bus 2, Physical Port 2E status: OK.
/dev/ciss0: (Smart HBA H241) Enclosure D6020 (S/N: 7CE952P06X) on Bus 3, Physical Port 2E status: OK.
/dev/ciss0(Smart HBA H241:0): Non-Volatile Cache status:
                   Cache configured: No

# camcontrol devlist 
<HP MB012000JWDFD HPD2>            at scbus1 target 16 lun 0 (da0,pass0)
<HP MB012000JWDFD HPD2>            at scbus1 target 17 lun 0 (da1,pass1)
<HP MB012000JWDFD HPD2>            at scbus1 target 18 lun 0 (da2,pass2)
<HP MB012000JWDFD HPD2>            at scbus1 target 19 lun 0 (da3,pass3)
...
<HP MB012000JWDFD HPD2>            at scbus1 target 62 lun 0 (da46,pass46)
<HP MB012000JWDFD HPD2>            at scbus1 target 63 lun 0 (da47,pass47)
<HP MB010000JWAYK HPD3>            at scbus4 target 49 lun 0 (da48,pass48)
<HP MB010000JWAYK HPD3>            at scbus4 target 50 lun 0 (da49,pass49)
<MK000240GWEZF HPG6>               at scbus6 target 0 lun 0 (ada0,pass50)
<MK000240GWEZF HPG6>               at scbus7 target 0 lun 0 (ada1,pass51)
<MK000960GWCFA HPG0>               at scbus8 target 0 lun 0 (ada2,pass52)
<MK000960GWCFA HPG0>               at scbus9 target 0 lun 0 (ada3,pass53)
<MK000480GWCEV HPG0>               at scbus10 target 0 lun 0 (ada4,pass54)
<MK000480GWCEV HPG0>               at scbus11 target 0 lun 0 (ada5,pass55)
 
If you know that the ciss driver does the right thing, but the problem occurs between it and the CAM layer, then you have isolated it to a part of FreeBSD (since both of those pieces are FreeBSD software, not outside). Time to open a PR, get on the mailing list, or contact developers.
 
Yeah, already did that :)


(Trying to pinpoint it in the kernel now with some additional printf-debugging :)
 
Btw, the source code for cciss_vol_status contains some really silly bugs. I can understand why it's marked with "contains vulnerabilities": it uses "&&" instead of "&" for bitwise AND in a number of places and accesses outside an array in another place. Easily fixed, though.

Looks like nobody has updated that one for a number of years now.
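
To illustrate the "&&" vs "&" class of bug mentioned above, here is a minimal sketch. It is not taken from the cciss_vol_status source; the flag name and value are made up for the example. The logical AND only asks whether both operands are non-zero, so it reports the flag as set for any non-zero status word.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical status bit, chosen only for this illustration. */
#define STATUS_FLAG_FAILED 0x04u

/* Buggy: "&&" is logical AND, so this is true for ANY non-zero status. */
static bool flag_set_buggy(uint32_t status)
{
    return status && STATUS_FLAG_FAILED;
}

/* Fixed: "&" masks out the one bit we actually care about. */
static bool flag_set_fixed(uint32_t status)
{
    return (status & STATUS_FLAG_FAILED) != 0;
}
```

With a status of 0x01 (flag not set) the buggy version still returns true, while the fixed version returns false.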
 
Alright. I think I might have found out what's happening inside the ciss driver now.

It probably works reasonably well for controllers used in "RAID" mode. But in "HBA" mode (which we use, since we run ZFS), where the controller only serves physical disks, it incorrectly uses the "max number of supported logical drives" value it gets from the controller (64 for the HP H241 in our case) as an upper limit on the "target" number presented to CAM. And when probing physical drives it offsets the target number by 16 (probably because the controller says we have 0 logical drives, which the driver incorrectly overrides with a compile-time value of 16 since some old controllers set that value to 0), so... 64-16 = 48 drives. (It also fails to detect the two SES devices that are there as well.)
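
The 64-16 = 48 arithmetic can be checked with a small model. This is only a sketch, not code from the ciss driver: the function name is hypothetical, and it assumes the drives sit on consecutive target numbers starting at the offset, which may not match what the controller really does.

```c
#include <stddef.h>

/* Hypothetical model, not the ciss driver source: count how many physical
 * drives end up with a target number below the limit handed to CAM.
 * Assumes consecutive target numbers starting at first_target. */
static size_t visible_drives(size_t first_target, size_t ndrives,
                             size_t target_limit)
{
    size_t visible = 0;

    for (size_t i = 0; i < ndrives; i++) {
        if (first_target + i < target_limit)
            visible++;
    }
    return visible;
}
```

Under those assumptions, visible_drives(16, 70, 64) gives 48, matching what "camcontrol devlist" shows per controller.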

Time for some kernel driver code hacking to force this code to behave better.
 
Ok, after some kernel driver debugging it turns out the "ciss" driver has (at least) two bugs when using HP HBAs (at least when used in JBOD mode, which one really wants to use, and probably also in RAID mode if you leave most of the drives outside "logical volumes" and have many drives attached):

1. It sets a "max_target" value to "max_logical_volumes" (which for an HP H241 is 64). "max_target" is used by FreeBSD to limit the number of targets to probe. And since the H241/D6020 combo seems to start enumerating physical drives at target 16, it won't probe any drives after target 63. So... 64-16 = 48. Fixed that one by setting max_target to max(highest target found+1, max_logical_volumes).

2. It sets "initiator_id" also to "max_logical_volumes". So after one has fixed #1 it will silently skip drives with target id = max_logical_volumes. Fixed that one by setting it to max_target+1 instead.

With those two fixes in use it finds all drives (and the two SES devices that are found at target 119 & 121 which previously went undetected...)
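
For anyone wanting to see the two fixes side by side, here is a rough standalone sketch of the logic. It is not the actual driver patch: the struct and function names are invented for the example, and the probe rule (scan targets 0 through max_target, skipping the initiator's own ID) is my assumption about how CAM uses those two values.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical container for the two values the driver reports to CAM. */
struct cam_limits {
    unsigned max_target;
    unsigned initiator_id;
};

/* Fix #1: max_target = max(highest target found + 1, max_logical_volumes).
 * Fix #2: initiator_id = max_target + 1, so it never collides with a drive. */
static struct cam_limits fixed_limits(const unsigned *targets, size_t ntargets,
                                      unsigned max_logical_volumes)
{
    struct cam_limits lim;
    unsigned highest = 0;

    for (size_t i = 0; i < ntargets; i++) {
        if (targets[i] > highest)
            highest = targets[i];
    }
    lim.max_target = (highest + 1 > max_logical_volumes)
        ? highest + 1 : max_logical_volumes;
    lim.initiator_id = lim.max_target + 1;
    return lim;
}

/* Assumed probe rule: targets up to max_target, except the initiator ID. */
static bool would_probe(struct cam_limits lim, unsigned target)
{
    return target <= lim.max_target && target != lim.initiator_id;
}
```

With the old values (max_target = 64, initiator_id = 64) this model probes target 63 but skips target 64. With the highest target found at 121 and max_logical_volumes = 64, the fixed values come out as max_target = 122 and initiator_id = 123, so targets 64, 119 and 121 are all probed.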
 