Other GEOM RAID: Slow with gstat reporting high %busy on RAID device under CPU load

GrandAdmiralThrawn · Oct 24, 2021

Hello,

I have another question about the GEOM RAID class. I'm currently using it to join four PCIe 4.0 NVMe SSDs together in a RAID-0 for temporary data, no parity calculations involved. Load type on the device is both linear as well as random I/O, pretty somewhat equally distributed between reads and writes:

Read/Write distribution on the array

Layout: 4 physical disks <-> GEOM RAID <-> GPT partition <-> UFS.

The RAID has been created like this (command is off the top of my head, but should be correct according to $ diskinfo /dev/raid/r0): # graid label -s 65536 DDF BEASTRAID RAID0 /dev/nvd1 /dev/nvd2 /dev/nvd3 /dev/nvd4
The partition has been created like this: # gpart add -t freebsd-ufs -a 65536 -b 64 -l BEASTRAID /dev/raid/r0
Finally, the file system has been created like this: # newfs -E -L BEASTRAID -b 16384 -d 65536 -f 16384 -g 1073741824 -h 32 -t -U /dev/raid/r0p1, tuned towards my workload and file/folder structures to the best of my knowledge.

What I'm seeing is that the array performs badly as soon as the CPU is fully loaded, and gstat reporting very high %busy numbers on the RAID device, but not on the underlying disks. See here:

gstat showing unusually high %busy values on the GEOM RAID device and its partition

So as mentioned, raid/r0 is the device, raid/r0p1 is the UFS-formatted GPT partition on it, and nvd1, nvd2, nvd3 & nvd4 are the GEOM RAID component disks. Both the partition/filesystem as well as the raw graid device have those high %busy states when doing some I/O on the array. In the case where this screenshot was taken, said load was one thread writing a file to the machine via Samba/CIFS over just GBit ethernet. Transfer rates were as low as 20-30MiB/s!

All Samba server processes where running at real-time priority level 30 to ensure it wouldn't slow down the transfer. There are far fewer Samba processes than cores/threads on the machine, so raising them to real-time priority is not starving anything to death. It's done like this as superuser:

Code:

ps ax | grep 'smbd' | grep -v 'grep' | sed 's/^ *//g' | cut -d' ' -f1 | while read -r pid; do
  rtprio 30 -"${pid}"
done

ps ax | grep 'nmbd' | grep -v 'grep' | sed 's/^ *//g' | cut -d' ' -f1 | while read -r pid; do
  rtprio 30 -"${pid}"
done

With no CPU load present, linear transfer rates can exceed 4GiB/s (locally of course), with network transfers in the 70-80MiB/s range, but no dice under heavy CPU load, where it's much slower, with IOPS dropping by a factor of about 10 and linear transfers by a factor of around 8 according to the simple diskinfo benchmarks.

The UFS partition /dev/nvd0p2 on the single /dev/nvd0 SSD is entirely unaffected by this, so it's only happening with GEOM RAID involved. The single /dev/nvd0 disk is a Corsair Force mp600 2TB, whereas the GEOM RAID component disks are of the same make, just one notch smaller: 4 × Corsair Force mp600 1TB.

Is there a way that I can ensure I/O stays fast and snappy on graid level 0 devices even under very high CPU loads? If yes, how can I do it?

Thank you very much!

ralphbsz · Oct 24, 2021

GrandAdmiralThrawn said:
What I'm seeing is ... gstat reporting very high %busy numbers on the RAID device, but not on the underlying disks.

Attempt at an explanation: First, consider a workload where IOs coming into the RAID device never have to be split across multiple physical disks, and the graid layer is implemented with infinite parallelism (it can handle as many IOs in parallel as the physical devices can, and we only block to wait for IOs on the physical disk). In that case, the RAID device can not be more busy than the physical devices.

But in your case, it is. So lets consider a workload where the IOs coming into the RAID device are very big, so much so that each IO needs to use all four physical devices. In that case, RAID IOs will be much slower than physical IOs, because each RAID IO has to wait for the last of the four devices to respond, and the physical devices are executing their IOs out of order. So one possible theory is: Your RAID stripe size is too small, compared to the IO size coming into the RAID layer. Can you measure the size of those IOs?

The other assumption we need to check is the parallelism. Let's go back to the original workload with small IOs to think about that. Say for example each of the NVMe disks is capable of handling 64 IOs at the same time. But the graid layer can only handle 64 IOs too. So the first 64 IOs start, and get immediately sent to the four physical disks; the average number of IOs on physical disks is now 16, and they should not be blocking. The next 64 IOs come in (before any of the first IOs have finished). Now the graid layer is guaranteed to block, and the physical disks are (by construction!) guaranteed to not block, even if the distribution of IOs across them is not uniform. So: How many IOs in parallel can the physical disks handle (that should be possible to look up), how many can the graid layer handle (no idea how to see that or adjust it, but maybe someone knows), and how many does your workload send (that's really hard to measure)? The only tangible bit of advice here is: See whether the maximum number of IOs in the graid layer can be adjusted.

GrandAdmiralThrawn · Oct 24, 2021

Thank you for you reply! I would like to emphasize one thing though: This only turns bad when CPU load is high! Otherwise, everything is great and very fast for all I/O workload types, even the more problematic ones with variable block sizes (somewhat like a few bytes up to ~512kiB or so per block) for consecutive writes. The component disks have pretty good specs, too: 680.000 read IOPS and 600.000 write IOPS each (per second), with theoretical maximum throughput just shy of 5GiB/s per disk. Even a single one can easily handle everything I can throw at them blazingly fast.

Types of I/O vary, but I won't go into much detail, because from large block size linear I/O over variable block size random I/O all the way to even BIO_DELETE commands sent to it on file deletions, everything behaves the same.

Low or no CPU load: The RAID's crazy fast and %busy values are reasonable in gstat.

Full CPU load: The RAID's crazy slow and those high %busy values show up in gstat.

I'm (likely naively) thinking that userland CPU load just gets in the way of GEOM RAID doing it's work splitting data across the four component disks on writes, or reassembling it on reads. On that train of thought I'm assuming that the %busy value skyrockets simply because graid has to wait for CPU cycles too much, or maybe suffers from context switching. But then again, that RAID layer lives in the FreeBSD kernel, so it should always have much higher priority than any of my programs doing computations on the CPUs?

Although I have to admit: I don't really know anything about priorities on the FreeBSD kernel level and how this interacts with userland processes.

I have just looked for tuning knobs by running $ sysctl kern.geom.raid, but couldn't really find anything.

Jose · Oct 24, 2021

Have you found this yet?

SystemTuning - FreeBSD Wiki

ralphbsz · Oct 24, 2021

GrandAdmiralThrawn said:
This only turns bad when CPU load is high!

That's ugly. It does indeed point to a CPU scheduling issue. Not my area of expertise.

richardtoohey2 · Oct 24, 2021

What's causing the high CPU load? Nothing to do with interrupts? Don't think so, just asking. And sure you'd notice and have reported it, but worth checking the CPU load is expected.

My experiences with FreeBSD and NVMe SSDs have been positive so far, but not tried anything like you are doing (yet) so interested.

GrandAdmiralThrawn · Oct 24, 2021

Interrupts? I don't think so either. It's mostly video transcoding jobs I'm launching myself, that's the bulk of it. Those jobs do very little I/O and don't need much memory bandwidth either, just CPU. The resulting load is pretty much what I'd expect, it's the same on Linux and Windows boxes I've used before as well. There can be several hundred threads wanting something from the CPUs though. So, I have 64 logical CPUs, but at times, there might be 500-800 threads fighting for CPU (e.g. on MS Windows, this would make the UI stutter). Maybe this affects GEOM RAID somehow? As said, the single NVMe disk with just UFS on it behaves fine even when it's like that.

Jose, I will give that a good read tomorrow (it's really late around here by now)!

Thank you all!

Phishfry · Oct 26, 2021

Well you are using the GEOM graid module for NVMe drives (with DDF) while it was intended for motherboard softraid SATA drives.
So right there you are using the module for something it is not intended for.
But if you must persist with this module (versus the correct GEOM module -gstripe-) then consider experimenting with pinning the process to certain CPU's. See if this helps.
Is your machine using NUMA domain or single CPU? It could be the QPI bus is becoming saturated at fuller loads.

cpuset(1)

www.freebsd.org

GrandAdmiralThrawn · Oct 26, 2021

I've tried pinning it to a CPU: # cpuset -l 0 -p 80. Here, 80 is the PID of the "g_raid" process. Then I've retested it, but the behaviour did unfortunately not change at all. Out of curiosity I did this as well: # rtprio 31 -80. This had an effect actually. /dev/raid/r0p1's %busy would still show the same high load, but /dev/raid/r0's values are now lower by 30-40%. So it's less %busy when running at real-time priority level 31.

Besides, you mention gstripe. But gstripe is slower than graid level 0 according to my tests, and does not pass BIO_DELETE / ATA TRIM commands through to the disks. I've tested this, observing with $ gstat -d. Hence I would say that gstripe is not advisable for SSDs in its current form, at least not for mixed or write-heavy I/O, as write performance will degrade over time due to the SSDs having to do read-modify-write cycles on writes as soon as all blocks have been written to at least once. This is why I dropped gstripe and went for graid.

Also I don't think that using graid with DDF as a pure software RAID is wrong? DDF is the format of certain hardware RAID controllers like made by Adaptec and ICP, so I was of the impression that graid's support for DDF was meant for ex-/importing arrays to and from such hardware RAID controllers. Like when your Adaptec controller dies and you want to keep using your array in pure software mode.

For BIOS SoftRAIDs I assume that the other formats are simply for correctly interpreting the meta data the BIOS wrote to the disks when creating the RAID array in BIOS. Otherwise it just runs as a software RAID with zero help from any spezialized hardware, so how does it make a difference as to whether you run it in a machine with or without such a BIOS? I would think it would make none, performance-wise. But it's great stuff, portability-wise. Like, you have to switch mainboards? Just take the array with you, as FreeBSD would not need to care about what the BIOS can or can not do...

Please correct me, if I'm wrong!

Edit: Sorry, I forgot to mention: It's a single sTRX40 socket with an AMD Threadripper 3970X on it. So no NUMA.