Maximum number of disks?

Hi,
A simple question (to which I cannot find an answer; a hint where to look is also appreciated):

How many /dev/da* (or /dev/pass*) devices is it possible to have in FreeBSD (11+)?

Background: the server talks to a Fibre Channel storage array, and for each "disk" defined on the array there are 8 disks on the FreeBSD side (due to FC multipathing).
For example, if 18 disks (LUNs) are presented from the storage array, FreeBSD will see them as 144 disks. Where is the upper limit?
It's obvious that Fibre Channel is limited to 255 devices (LUN numbers), but that would be seen by FreeBSD as 2040 /dev/da* entries.

thx
 
The CAM unit number is a uint32, so that looks OK.
I tried this and it worked:

Code:
# for i in $(jot 2000); do dd if=/dev/zero of=file_$i bs=16k count=1; mdconfig -t vnode file_$i; done
# gstat -f ^md -b | wc -l
2002
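
If anyone repeats that test, a cleanup sketch (assuming the backing files were created in the current directory and that no other md devices on the box are in use, since this detaches all of them):

Code:
# for u in $(mdconfig -l); do mdconfig -d -u ${u#md}; done
# rm file_*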
 
I keep rereading the OP and thinking "please for the sake of your sanity use labels to make whatever filesystems/zpools you use"
 
mer: While I don't use FreeBSD on a SAN, I remember very vividly hitting this kind of issue on a few big accounts on HP-UX (back then 11.11 and 11.23; 11.31 made it much better).
The issue the OP is asking about, though, is the same, especially when multipath is working on top of those devices. Labeling does not avoid the possible issues he may run into (being unable to zone in more disks because he hit the limit).
 
I keep rereading the OP and thinking "please for the sake of your sanity use labels to make whatever filesystems/zpools you use"
Absolutely. When dealing with systems at scale, a few tools are needed for comfort and convenience.

First: use multipath at a very low level, so instead of having to deal with eight different /dev/daX devices that are all the same disk, there is only one /dev/multipath/A device.

Second: if you are dealing with JBODs, use some good enclosure management software, so you don't have to identify a disk as /dev/daX or /dev/multipath/A, but as "middle rack, third enclosure from the bottom, disk in the 7th slot counting left to right".

Third: write some software (scripts, or part of a larger system) that keeps really good records. For example: Seagate model 1234, serial number ABC987, was last seen in the middle rack, third enclosure from the bottom, disk in the 7th slot counting left to right, on November 21 at 2:49 in the afternoon. It was identified as /dev/multipath/A, which consisted of the individual paths /dev/daX and /dev/daY. Its partition label says it is called RalphBSz_data_4, and it is part of the ZFS pool home_data.

Then, any time this information changes, update it (automatically), or leave log records so you can track the history. When things go wrong, these kinds of records are super useful, in particular if some part of the hardware fails in a way that prevents you from inquiring what is where.

Quiz: Where would you store those records? On the disks themselves? If yes, how? <- Just kidding, don't answer that, it's a really complex question.
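
Along the lines of the record-keeping point above, a minimal sketch of what such a script could look like on FreeBSD (the log path is just an example, and it assumes gmultipath and ZFS are in use):

Code:
#!/bin/sh
# Sketch: dump a timestamped inventory of multipath devices, their member
# paths and serial numbers, plus current labels and pool membership.
LOG="/var/log/disk-inventory.$(date +%Y%m%d-%H%M%S)"
{
    date
    gmultipath list                    # each multipath geom and its consumer daX paths
    for dev in /dev/da*; do
        d=${dev#/dev/}
        case $d in *p[0-9]*|*s[0-9]*) continue ;; esac    # skip partitions/slices
        echo "== $d =="
        camcontrol inquiry "$d" -S     # serial number as reported by the device
        geom disk list "$d" | grep -E 'descr|lunid|ident'
    done
    glabel status                      # GEOM labels currently visible
    zpool status                       # which devices each pool is built from
} > "$LOG"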
 
I was curious to see whether I could push it past 4096 without looking at the source code (because sometimes it's just fun to try). The way my lab is set up, though, I was not able to use multipath here. I used Solaris as the target and pushed 4274 disks to FreeBSD. I did a rescan (camcontrol rescan) and... kernel panic :). Perfect, I found a bug.

So after a fresh reboot I see the following:
Code:
# camcontrol devlist |grep COMSTAR | wc -l
    4274
# camcontrol inquiry /dev/da4261
pass4262: <SUN COMSTAR 1.0> Fixed Direct Access SPC-3 SCSI device
pass4262: Serial Number 8287
pass4262: 150.000MB/s transfers, Command Queueing Enabled

It would actually be interesting to see how well you can leverage 8 paths to a single disk. We used to have 8 paths on EVA storages: 4 per controller, 2 controllers per array. Usually 2 paths per controller were optimized (on the storage end). Nowadays our setups tend to go with 4 paths (2 in each fabric) or even 2 (but then the FS is mirrored, so other redundancy exists).
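
For anyone who wants to measure that on FreeBSD, a rough sketch; it assumes a multipath device named testlun already exists and that da10 and da18 are two of its member paths (all names are placeholders). gmultipath defaults to Active/Passive, so spreading the load has to be requested explicitly:

Code:
# gmultipath configure -A testlun      # Active/Active: round-robin I/O over all paths
# dd if=/dev/multipath/testlun of=/dev/null bs=1m count=10000 &
# gstat -f '^da1[08]'                  # watch how the reads spread across the member paths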
 
Thanks for all the answers.
I'm planning to use gmultipath, so one multipath device per virtual disk on the SAN, to keep things simple.
Our storage array uses 2 paths per controller, there are 2 controllers per site, and we have 2 sites which are in sync (giving us datacenter-level redundancy) = 8 paths per initiator.
On top of the multipath devices there will be ZFS, as a filesystem which should handle 80+ TB in a single filesystem. (Any suggestion for another filesystem appropriate for this?)
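
For reference, a minimal sketch of that layout (the names lun0/lun1, the da numbers, and the pool name tank are all made up; gmultipath label writes its metadata to the last sector of the LUN, so the remaining paths to the same LUN are picked up automatically when they are tasted):

Code:
# gmultipath label -v lun0 /dev/da0     # labeling one path per LUN is enough
# gmultipath label -v lun1 /dev/da1
# gmultipath status                     # should list 8 providers per multipath device
# zpool create tank multipath/lun0 multipath/lun1    # ZFS on the multipath devices, not on raw daX nodes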
 
2 paths per controller, 2 controllers per site makes sense, so 4 paths per site for a disk. But why (and even how) would you access the same disk from the other site? (I'm curious, I've never seen that kind of implementation.) I understand the disk sync between DCs (a business-copy-like feature), but those storages are separate entities being synced over FC to have 1:1 data.
ZFS is a good choice, I'd stick with that.

That's a question for the solution architect; it requires knowledge of the underlying setup. Generally, though, a good place to start is: how many LUs to spread that 80 TB over (is there a future plan to expand that FS)? Will ZFS do the mirroring (assuming not, as you mentioned the DC sync)? For heavy loads you can go deeper: depending on the storage type and controller, you may have an option to optimize port usage to balance the load (each LU uses a dedicated pair of ports; spread the port usage over the LUs used in the raidzX).
 
2 paths per controller, 2 controllers per site makes sense, so 4 paths per site for a disk. But why (and even how) would you access the same disk from the other site? (I'm curious, I've never seen that kind of implementation.) I understand the disk sync between DCs (a business-copy-like feature), but those storages are separate entities being synced over FC to have 1:1 data.
ZFS is a good choice, I'd stick with that.

That's a question for the solution architect; it requires knowledge of the underlying setup. Generally, though, a good place to start is: how many LUs to spread that 80 TB over (is there a future plan to expand that FS)? Will ZFS do the mirroring (assuming not, as you mentioned the DC sync)? For heavy loads you can go deeper: depending on the storage type and controller, you may have an option to optimize port usage to balance the load (each LU uses a dedicated pair of ports; spread the port usage over the LUs used in the raidzX).
The storage system does synchronous replication between sites, so there is a guarantee that each byte written on one site is also stored on the other site (this is not DR replication, which is asynchronous). From the initiator's point of view all 8 paths are equal, as reading and writing are taken care of by the storage system. You can find that feature in a few storage systems (NetApp Metrocluster, EMC Metrocluster, Pure ActiveSync; I believe HP 3PAR was able to do something very similar with a stretched-cluster configuration... and probably others too).

The primary idea is to use ZFS as the filesystem and for volume management. ZFS features for redundancy/deduplication/compression are actually already handled by the storage and are not needed. (Ideally ZFS should use these disks as a RAID0 equivalent.)

Yes, the idea is to be able to scale beyond 80+ TB over a 2-3 year period for a single filesystem. The workload on this system is relatively "light", so there is no need for premature optimization around path usage.
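
Since the plan is to grow past 80 TB later: with a plain striped pool (the RAID0 equivalent mentioned above), growth is just a matter of presenting another LUN, labeling it, and adding it. A sketch with placeholder names, assuming a pool called tank as in the earlier example:

Code:
# gmultipath label -v lun8 /dev/da8     # newly presented LUN (placeholder name)
# zpool add tank multipath/lun8         # stripes the new device into the pool
# zpool list tank                       # capacity grows immediately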
 
I'm working with 3par/XP/P9500, but the way our clusters are set up we never see the remote site under a single device. That's why I was interested to see how the system deals with the path from the remote site, be it a synchronously replicated disk.
Just to compare: due to the /big/ cost reductions, metro clusters are not used in our environment any more; rather, a "classic failover approach" is pushed, which means failover = short downtime. But then many of our customers moved to HANA solutions where replication is handled and dealt with a bit differently.

So it seems you are on the way to having this set up just fine; the device limit on FreeBSD is high.
 
To begin with, ZFS can easily handle an 80 TB file system; with modern disks, that's just a handful of them.

The storage system does synchronous replication between sites, so there is a guarantee that each byte written on one site is also stored on the other site (this is not DR replication, which is asynchronous).
You say workload is light and performance not that important, but synchronous replication is going to be slow. If you think about network speeds, you need to sort of add one ping time to every IO. Unless you are using expensive dedicated networks, that's going to add several ms to each IO request. Note that typical hard disk response times are 5-10 ms for random IO (typical disks can do 100-200 IOps), and much faster for sequential IO, so adding several ms to each request will change the way the system behaves. For example, a disk that completes a random IO in 8 ms (125 IOps) drops to 100 IOps if you add 2 ms per request, while a flash LUN that answers in 0.5 ms drops by roughly a factor of five. Unless you are using deep queueing, and have a workload (and file system parameters) that can take advantage of that queueing, it might slow you down by a factor of several.

ZFS features for redundancy/deduplication/compression are actually already handled by the storage and are not needed.
From a disk space point of view, this sounds reasonable. But consider that dedup/compression also reduce IO volume: every time ZFS can decide that a block is a duplicate and doesn't need to be read or written, that's one IO less, in particular given that your IOs will be slower than usual. But that's the kind of optimization that really needs to be thought through and well tested before deploying.

... so there is no need for premature optimization around path usage.
Just be careful you don't abuse the paths. For example: if you are running on site A but using the storage controller that is physically on site B, then every byte written will go over the (slow) long-distance link twice: first for the IO you are doing, and then back to synchronously update the copy on site A. So at least for writes, you can easily reduce the traffic on your link by a factor of two by going to the correct (local) path when available. And if the link is ever overloaded (even for a very short period, like sub-second), reducing traffic will help improve latency, as there is less wait time.
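
One way to do that on the FreeBSD side with gmultipath (a sketch; device and path names are placeholders) is to keep the device in the default Active/Passive mode and pin the active path to a local one:

Code:
# gmultipath configure -P lun0      # Active/Passive: all I/O goes through a single path
# gmultipath prefer lun0 da0        # da0 = a path through the local site's controller
# gmultipath getactive lun0         # confirm which path is currently active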

By the way, I'm not saying that this is a bad plan; I'm just worried about unintended (performance) consequences.
 
You say workload is light and performance not that important, but synchronous replication is going to be slow.
Well, slow is relative :) There is no expected bottleneck on throughput, and latency is unavoidable by the laws of physics.

If you think about network speeds, you need to sort of add one ping time to every IO. Unless you are using expensive dedicated networks, that's going to add several ms to each IO request. Note that typical hard disk response times are 5-10 ms for random IO (typical disks can do 100-200 IOps), and much faster for sequential IO, so adding several ms to each request will change the way the system behaves. Unless you are using deep queueing, and have a workload (and file system parameters) that can take advantage of that queueing, it might slow you down by a factor of several.
fcping shows round-trip times in the 1 ms range or lower. The second factor is that disk write latency doesn't matter much, because once both storage arrays have the IO stored in flash they consider it "written" (moving the data from flash to the magnetic platters can happen later). So a smart storage system is only limited by its (flash) cache size and the latency of the interconnect.
And classic HDs are on the way out as the main storage medium anyway.

From a disk space point of view, this sounds reasonable. But consider that dedup/compression also reduce IO volume: every time ZFS can decide that a block is a duplicate and doesn't need to be read or written, that's one IO less, in particular given that your IOs will be slower than usual. But that's the kind of optimization that really needs to be thought through and well tested before deploying.
Yes, but ZFS dedup requires a large amount of RAM to hold the dedup tables, so... sorry, no dedup at the filesystem level.
Also, the data is already compressed (JPEGs), so ZFS compression won't gain anything.

Just be careful you don't abuse the paths. For example: if you are running on site A but using the storage controller that is physically on site B, then every byte written will go over the (slow) long-distance link twice: first for the IO you are doing, and then back to synchronously update the copy on site A. So at least for writes, you can easily reduce the traffic on your link by a factor of two by going to the correct (local) path when available. And if the link is ever overloaded (even for a very short period, like sub-second), reducing traffic will help improve latency, as there is less wait time.
I'm aware of that problem; I hope ALUA can help with it. I'm also going with all paths active to spread the load.

By the way, I'm not saying that this is a bad plan; I'm just worried about unintended (performance) consequences.
Thanks for the warning. I was primarily worried about OS limits, as we already have a similar setup and it works OK.
 