ZFS: Poor performance on unrelated single drives on the same HBA as a zpool during resilvering

These backplanes are either SAS2-846EL1 or SAS-846EL2; the EL2 has dual connectors but I don't believe they are used for performance, only for redundancy in case of failure. Either way I know for a fact that I only have a single cable to each backplane as I have two HBAs which have two ports each, and there are four backplanes (two in the chassis, two in the disk shelf). Could it be the same issue?
 

I think I need a diagram, but it sure sounds like it could.
 
I worked intensively with LSI SAS HBAs about 10 years ago. They are wonderful, but also very annoying. They can have both bandwidth and IOps limitations. The bandwidth ones are pretty obvious (they are usually capable of maxing out both the SAS and PCI buses when using large IOs and deep queues). We were using them to connect to several hundred disks (two servers, a few HBAs each), and we were able to max out the servers at about 10-12 GByte/s (including running RAID and checksum codes). That was using Linux, on both Intel and PowerPC platforms.

But: Where the cards can easily fall apart is IOps, which is much harder to understand, tune, and debug. There is an enormous amount of driver work that needs to be done to get great performance out, and when I say "driver", I mean not just the OS driver for the card itself, but the block subsystem above it (which can hold long queues of IOs), memory management (after all, every pending IO in the queue has a memory buffer that is pinned for the duration, and latencies can get high when you have deep queues), and the firmware in the card. And without deep queueing (I used to aim for 5-10 IOs pending on each drive at all times; 20 or 50 is better), you won't get good performance on random IOs.

One thing we discovered the hard way is the following: The firmware in the HBA and the OS driver stack have a lot of error handling and recovery built in. If there are incompatibilities between disk and HBA, those can show up as low-level IO errors, which may occur very frequently. Every time an error happens, the HBA wipes its queue and aborts many other IOs (or lets them fail), and then some layer far above automatically retries them. Net result: no error actually makes it up to the application layer (because retries cure all ills), but IOps throughput is really bad, because in effect IO is being single-tracked.
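(If you want to see whether your queues actually stay deep during a resilver, a quick and non-authoritative way on FreeBSD is to watch gstat's L(q) column and check how many tags CAM will queue per drive; da0 below is just a placeholder device name:

    # live per-device queue length and latency; L(q) is the number of queued IOs
    gstat -p -I 1s

    # number of simultaneous transactions (tags) CAM will queue to da0
    # (da0 is a placeholder, substitute your drives)
    camcontrol tags da0 -v

If L(q) on the unrelated drives collapses to 0 or 1 while their latency climbs during a resilver, that points at exactly this kind of queue-killing behaviour.)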

Our fix was to work closely with engineers from LSI, the disk manufacturers, and the SAS expander vendors. That included running special diagnostic firmware versions, collecting undocumented statistics, and occasionally having SAS analyzers on the bus. Took months.

Why am I telling this story? I would not be surprised if the IO pattern that ZFS resilvering generates can, in some cases, cause the HBAs to do performance-killing things by making queuing work less well. I wonder whether it would be possible to instrument ZFS on FreeBSD with individual IO performance metrics (traces, or averages) and see how many IOs are queued on each drive and what the IO latency is (as a function of IO size, distance from the previously finished IO, and queue depth). That would be days or weeks of work.
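As a crude starting point (nothing ZFS-specific, and the field names are from memory, so treat it as a sketch), the DTrace io provider on FreeBSD can give per-device latency and IO-size distributions without touching ZFS itself:

    # latency (us) and IO size per device, keyed on the in-flight bio pointer
    # (dev_name / b_bcount are the io.d translator fields, as I recall)
    dtrace -n '
    io:::start { ts[arg0] = timestamp; }
    io:::done /ts[arg0]/ {
        @lat_us[args[1]->dev_name] = quantize((timestamp - ts[arg0]) / 1000);
        @bytes[args[1]->dev_name]  = quantize(args[0]->b_bcount);
        ts[arg0] = 0;
    }'

Correlating those distributions with queue depth during a resilver would show whether the unrelated drives are being starved of queued IOs or just seeing their latencies blow up.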

There used to be a guy named Terry Kennedy who did a lot of large IO work on FreeBSD (disk and tape); I know he had a long history of doing the same thing on VAXes under VMS earlier. He might be able to shine some light on this, but I haven't heard from him in years.
 
That's really helpful information, thanks. Do you think it would be worthwhile to file a bug? I can reproduce this pretty much at will.

Also, I have gone ahead and edited the thread title to be more descriptive of what the actual issue is. It doesn't look like the GELI layer has much to do with this issue; it seems to be purely a ZFS/HBA issue.
 
Is this HBA connected at PCIe 2.0 x4?

Wide port (four lanes), 2400 MB/s, half duplex: so this is your max speed in one direction.

Wide port (four lanes), 4800 MB/s, full duplex: this is the limit of the SAS 9211-8i.
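(For reference, the arithmetic behind those numbers as I understand it: SAS2 runs each lane at 6 Gbit/s, which after 8b/10b encoding is roughly 600 MB/s of payload, so a x4 wide port is about 4 x 600 = 2400 MB/s in each direction, or 4800 MB/s if you count both directions.)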
 
I have looked at the manual for my motherboard, it is the X9DR3-F (SuperMicro motherboard). The manual covers both that and the X9DRi-F, and the only difference between those two models seems to be whether it has 4 (X9DRi-F) or 8 (X9DR3-F) onboard Storage Control Units. I'm not using the onboard units (because I was never able to flash them to something that worked well), so they are probably not relevant.

All slots on the motherboard are PCI-E 3.0 x8 or x16. The external HBA (LSI 9207-8e) is plugged into "CPU1 Slot2 PCI-E 3.0 x 16" and the internal HBA (LSI 9211-8i) is plugged into "CPU2 Slot4 PCI-E 3.0 x 16". The 9211-8i is a SAS2, PCI-E 2.0 card (so the slot being PCI-E 3.0 doesn't really matter), but the 9207-8e is a SAS2, PCI-E 3.0 card. However, performance was largely similar between the main chassis (connected to the 9211-8i) and the disk shelf (connected to the 9207-8e), so it seems to me that it is unlikely that PCI-E 2.0 vs PCI-E 3.0 is the determining factor here. It's possible that SAS2 vs SAS3 is an issue, but I have no way of verifying that.
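(For what it's worth, the raw PCIe arithmetic supports that: PCIe 2.0 is 5 GT/s per lane with 8b/10b encoding, about 500 MB/s of payload, so x8 is roughly 4 GB/s; PCIe 3.0 is 8 GT/s with 128b/130b encoding, about 985 MB/s per lane, so x8 is roughly 7.9 GB/s. The generation difference would only start to matter once the aggregate streaming rate across the card approaches 4 GB/s.)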
 
Then we can rule out PCIe: the cards run at x8 (even in an x16 slot), because the HBAs themselves are x8.

I checked the X9DR3-F manual and it's a bit strange, because it uses the Intel C606/C602 chipset, and those are PCIe 2.0, so at first I didn't see how this MB could support PCIe 3.0.
OK, the slots are connected to the CPUs, not to the C606.

You can test the HBA limits with some SSD disks in RAID 0, to see what your max IOPs and max BW are, using fio.
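Something like this, as a rough sketch (/dev/da0 is a placeholder device, posixaio is what I'd try as the engine on FreeBSD, and these only read from the raw device):

    # 4k random reads, deep queue: IOPs ceiling of the HBA path
    # (/dev/da0 is a placeholder, point it at your SSDs)
    fio --name=randread --filename=/dev/da0 --ioengine=posixaio --direct=1 \
        --rw=randread --bs=4k --iodepth=32 --numjobs=4 --runtime=60 \
        --time_based --group_reporting

    # 1M sequential reads: bandwidth ceiling of the HBA path
    fio --name=seqread --filename=/dev/da0 --ioengine=posixaio --direct=1 \
        --rw=read --bs=1m --iodepth=16 --numjobs=1 --runtime=60 \
        --time_based --group_reporting

Run it against one SSD first, then against several in parallel, and watch where the per-device numbers stop scaling.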
 
Careful: The PCIe and SAS lane count restricts mostly large sequential throughput (MByte/s), not the number of IOs per second (IOps), which matters more for small and random IOs.
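(Rough illustration with typical numbers rather than anything measured here: a 7200 rpm drive does on the order of 100-200 random IOs per second, and at 4 KB each that is well under 1 MB/s, so for random IO the lane count is irrelevant; what matters is how many IOs the whole stack can keep queued and completed per second.)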
 

Right, I think we've pretty much established it's not a raw throughput issue, because it doesn't matter how many unrelated drives there are. In the disk shelf I had 9 5400 RPM drives all writing at full speed (180-220 MB/s depending on the drive), whereas in the chassis I only had 3 7200 RPM drives (but writing at 270 MB/s). As soon as the resilver kicked off on the same HBA, the throughput of all unrelated drives tanked to 40-70 MB/s. If it were a throughput limitation, I would have expected to see roughly the same total throughput for the unrelated drives, since it was the same zpool doing the same resilvering each time. But the 3 chassis drives managed about 205 MB/s between them, while the 9 drives in the disk shelf managed about 450 MB/s under the same conditions.

Another thing that points against it being a throughput issue is that every drive slows down to a fraction of its own maximum throughput: 270 MB/s drives down to ~80 MB/s, 220 MB/s drives down to ~70 MB/s, and slower drives down to anywhere between 40-60 MB/s. If it were a throughput limitation, I would think we'd see roughly the same total aggregate throughput amongst all unrelated drives (we've either got 450 MB/s of excess bandwidth, or we don't). Three times as many drives managed twice as much throughput under the same circumstances. That points to something else (likely IOPS) as the bottleneck.
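To put my own numbers above side by side (all approximate):

    Location     Unrelated drives    Normal speed         During resilver     Aggregate
    Chassis      3x 7200 RPM         ~270 MB/s each       ~70-80 MB/s each    ~205 MB/s
    Disk shelf   9x 5400 RPM         180-220 MB/s each    ~40-70 MB/s each    ~450 MB/s

Three times the drive count yields roughly twice the aggregate, which is not the pattern a shared bandwidth ceiling would produce.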


Interesting (and potentially useful) benchmark, though I'm not sure how much to extrapolate this data to running 36-66 spinning disks. Can you elaborate?

Anyway, I am still considering replacing my 9211-8i with a 9207-8i, but that's not because I expect it will fix this particular issue. Instead I'm considering it because I see roughly a 10% performance delta under heavy load with the same zpool in the chassis vs in the disk shelf (with the disk shelf, connected to the 9207, being faster). I suspect that I have just enough fast 7200 RPM disks that I'm in an edge case where the conventional wisdom that "SAS2308 is not necessary for hard drives, only for SSDs" isn't 100% correct. 45 disks on one HBA isn't all that common, I'd expect. If I do go that route, I might actually see if there's a performance benefit to one HBA per SAS expander. I have enough PCIe 3.0 slots (each CPU has 3 slots, all of which are x8 or x16); I just lack a second LSI card with external ports. That would split things up to no more than 24 drives on a single HBA.
 
We used to run a system with two (external) HBAs, 4 SAS ports each, connected to about 350 disk drives total (nearly all spinning, a handful of them SSDs), meaning about 40 disks per SAS port (not per SAS card). These were all LSI cards (don't remember the model number), using PCIe Gen3 x8 slots. I think at the time we were still using 6 Gbit SAS (may have been 12 Gbit by the time we shipped to paying customers). The disks were 7200 rpm, in data-center grade enclosures with good expander architectures (let's not discuss which expander chips to use and which to avoid, that's too personal and feelings were hurt).

BUT: Our IO pattern was optimized for throughput, because that's what most of our customers were interested in. So we usually did 1-2 MByte IOs (meaning multiple tracks per contiguous IO), with roughly 10 IOs queued per drive (which helps minimize seek times by using the elevator in the drive). Our smallest read IOs were typically around 32K, except for tiny writes to the SSDs for logging, but the IO pattern for benchmarks was dominated by large IOs.

Our performance limiter for the standard tests was always the PCI or SAS throughput; we tuned until we were near the hardware limitations. Much of the tuning was making sure that many/most queued IOs actually reach the disk drive, because that's the way to get disk performance.
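(Back-of-envelope on why that was the limiter, assuming typical 7200 rpm streaming rates of 150-200 MB/s: a 6 Gbit x4 SAS wide port tops out around 2400 MB/s in one direction, while 40 drives at 150-200 MB/s each would want 6000-8000 MB/s, so with large sequential IOs you hit the port long before you run out of drive throughput.)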
 

Hmm, sounds like two backplanes (with no more than 24 drives per backplane), per SAS port (so two cards, four ports) is probably about as good as it'll get then, and no point in trying 1 HBA per backplane.

As far as backplanes, my setup is using the BPN-SAS2-826/846/847 EL1 backplanes. Not sure if those qualify as good or bad expander chips, but that's what I've got, and I'm stuck with them. Converting to SAS3 is massively outside my budget and would require replacing both HBAs, the cables, the backplanes in the server chassis, and the entire disk shelf.
 
Judging by your symptoms (one workload kills overall performance), it's hard to guess where the problem is. Diagnosing and tuning this will require looking at the IOs themselves, or experience with similar situations.
 
No update to share on the terrible performance of the other drives during resilver, but in case anyone is curious about the 9207 vs the 9211: I did end up ordering a 9207 because I needed another card for another project (the 9211 got stolen for that one), and it did in fact make up that 10% performance difference, with performance in the disk shelf (via the 9207-8e) and performance in the main chassis (via the 9207-8i) now essentially identical. There was no need to split the drives up among additional adapters.

So if anyone in the future is ever curious: at least in my case, 36 7200 RPM EXOS drives on one adapter is apparently past the point where PCIe 3.0 makes a difference. It's not a large difference, but it is measurable.
 