UFS Question about software RAID-0 with SSDs

I would like to create a striped RAID-0 made up of four NVMe SSDs. It's for unimportant temporary data, but I need the space, and lots of speed would be nice to have as well, so I'd like to use RAID-0 rather than something like JBOD/CONCAT. I'd like to use UFS rather than ZFS as the file system. Disk redundancy and verifiable data integrity are not required.

There are two GEOM classes which seem to be able to handle this: gstripe(8) and graid(8). Both can do RAID level 0, with graid supporting higher RAID levels and several vendor-compatible metadata formats as well. I'm just not sure which one I should pick now.

Are there any advantages/disadvantages that one has over the other when used for RAID-0? I'm not planning to make the array compatible with any hardware controllers or their firmware, so I don't really care about the metadata format. The array does not have to be portable to other operating systems either, so it's fine if it's FreeBSD-specific.

Since this will be built with SSDs however, TRIM / BIO_DELETE support is required. According to what I found on the web, this works with graid. Does gstripe also support TRIM / BIO_DELETE?
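
For context, the creation commands I have in mind look roughly like this, going by the examples in the respective man pages (the nvd0-nvd3 device names and the 128 kiB strip size are just placeholders, not a final choice):

# gstripe label -v -s 131072 data nvd0 nvd1 nvd2 nvd3
# newfs -U -t /dev/stripe/data

versus

# graid label -s 131072 DDF data RAID0 nvd0 nvd1 nvd2 nvd3
# newfs -U -t /dev/raid/r0

The newfs -t flag is what enables TRIM on the file system, which is why the BIO_DELETE question matters to me.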

The OS is FreeBSD 12.1-RELEASE at the time of writing and will be updated to 12.2-RELEASE in a few weeks or so.

Thank you!
 
I tried a 4-drive gstripe array with Samsung PM983 NVMe drives and I was very disappointed with the speed.
I tried several different stripe sizes, but it did not help.
Feel free to try it yourself. Maybe there was some magic setting I was missing.
I don't remember what my exact speed was, but it was not much more than a single drive.
I was hoping for the 8 GB/s that four drives should be able to deliver.
I also tried graid3, with poor results, even in a 5-drive array.
Single-drive speeds seem to be the most efficient.
Even an LSI SAS9400 tri-mode controller with hardware RAID-0 did not give me great NVMe results.
 
I use gmirror with UFS on top. It works well and gives about double the speed for sequential loads. Here are some rough benchmarks with 2x consumer NVMe drives for the various configurations I tried.

Also, make sure your drives aren't overheating, as they will throttle down. If you don't already have heatsinks on them and you're seeing poor single-drive performance, then that's the first thing I'd check.


a) 2x nvme drives in ZFS mirror
- this gave the performance of a single drive at about 1GB/s
- my testing always suggests that ZFS mirrors don't increase performance (unless across spanned mirrors)

b) 2x nvme drives in gmirror with ZFS (non-mirrored) on top
- this was a bit more reasonable at around 2GB/s
- my CPU is sluggish, so I think it's just the extra overhead slowing it down

c) 2x nvme drives in gmirror using UFS instead
- this was by far the best performing with 2.6GB/s+ sequential scanning
- I see real world Postgresql table/index scans of above 800Mb/s (DB is the bottleneck here)


It's difficult to design an application that can efficiently scan through data that quickly, so you have to think about what you actually need to achieve.
 
I'm monitoring my current single SSD with Zabbix, using an "external check" item based on a script I wrote that reads the temperatures with smartctl(8). I'm planning to do the same for the new SSDs, which will arrive tomorrow and will be installed in a few weeks, once the machine has completed its current set of larger compute jobs. The SSDs will go into a riser card from ASRock; the card's metal shroud contacts the SSDs and acts as a heatsink, and there's also a fan installed. I'll be keeping an eye on those temperatures. This is what the graph of my system SSD currently looks like:

ssd-temp.png

My expectation is that the RAID SSDs will run quite a bit hotter. I'll see when I get there.
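
For reference, a check like that can be as simple as something along these lines (a sketch rather than my exact script; nvme0 and the awk pattern assume smartctl's usual NVMe output with a line like "Temperature: 38 Celsius"):

$ smartctl -a /dev/nvme0 | awk '/^Temperature:/ { print $2 }'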

What I will be doing on the array is A/V interleaved de-/remultiplexing, so I will demultiplex large M2TS and MKV container files (up to 100GiB) containing data streams (audio, video, and some others), and remultiplex them as well. I'm not quite sure how large the interleave block sizes really are, but most likely it will be the size of an individual video frame.

If true, that will be somewhere in the 32kiB - 512kiB range according to some simple math. That's an estimate, could also be a bit lower or higher. So the way I understand it, say a video frame is 50kiB in size, then the software will interleave 50kiB of video data, and all the audio data that should be played for the time the video frame is being displayed, for a total that's... maybe 55kiB or 60kiB or something. And then the next video frame will be interleaved with its corresponding audio and other stuff.

I'm leaning towards a stripe block size of somewhere in between 64kiB - 128kiB with a relatively large buffer in RAM. Because hey, I have lots of RAM: 256GiB.

There is also some dependency on the CPU, as the CPU also has to do... "something" when doing that, so I might even run into a CPU bottleneck in the end. We'll see.

I'd like both writes and reads to be fast. By definition, gmirror should only boost read performance, right? Given that it's actually for mirroring?

Given that gstripe comes with a tunable buffer size for its "fast" mode, I'll try that first. I'll use dd(1) for benchmarking with a few block sizes once everything is set up.
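
From my reading of gstripe(8), enabling that would look something like the following; the 16 MiB value is just my guess at a reasonable buffer, and I believe the maxmem knob is a boot-time tunable rather than a runtime one:

# sysctl kern.geom.stripe.fast=1

and in /boot/loader.conf:

kern.geom.stripe.maxmem="16777216"

Whether that actually helps is exactly what I'll be benchmarking.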

Thanks for your results, Jerome!

Edit: I wrote some stupid things above. They're now significantly less stupid. ;)
 
Alright, it's been a while, but I can now answer my own question: The GEOM stripe class does not support BIO_DELETE/TRIM, as I've verified using UFS with the TRIM flag enabled on top of the array. The GEOM raid class, however, does support it, and it's also about 30-40% faster for me when using the same stripe block size of 16kiB (tuned towards my workload/data after some re-evaluation). This was tested today on FreeBSD 12.2-RELEASE.

I have verified this by deleting very large files while monitoring the arrays and component disks using $ gstat -d.
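
The verification itself was nothing fancy, roughly along these lines (a sketch; the device path and mountpoint are placeholders for my actual setup):

# tunefs -t enable /dev/raid/r0        (file system unmounted at this point)
# mount /dev/raid/r0 /mnt/raid0
# rm /mnt/raid0/some-large-file.bin
# gstat -d                             (in a second terminal; the delete columns show BIO_DELETE activity)

With gstripe, no delete activity showed up on the component disks; with graid, it did.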

Performance was below what a single disk can do for gstripe (~2.5GiB/s @ unbuffered 1MiB blocked read) and slightly above that for graid (~4GiB/s @ unbuffered 1MiB blocked read). I can't say it scales well, but maybe that's also because of my rather small stripe block size.

Overall, it gets the job done. I have my 4-disk NVMe RAID-0 running with BIO_DELETE/TRIM support.
 
I didn't run any sophisticated tests. If you would like some to be done that I can and am willing to do (no array recreation or reformatting at this point), just let me know what to do and how to do it.

I only did sequential, single-process, single-threaded tests using 100GiB files and the file system directly as input & output. The result was a gain of 30-40% over a single disk if you assume the faster value as 100%. If you assume the slower value as 100%, then it's about 50-60% gain (I always forget which one I'm supposed to assume as 100%...). For reading at 1MiB block size, it reached just around 4GiB/s. Writes were just slightly slower at ~3.9GiB/s. Single-disk speed at the same settings and UFS block size sits around 2.5-2.6GiB/s. Variations are roughly ±100MiB/s or so.

So with four PCIe 4.0 Corsair mp600 1TB SSDs, it's not even twice as fast as a single one. And that's at zero CPU load and zero swapping to disk.

The commands used for reading and writing look like this (real mountpoint replaced as it contains my user name):

# time dd bs=1M count=102400 if=/mnt/raid0/testfile.bin of=/dev/null
$ time dd bs=1M count=102400 if=/dev/zero of=/mnt/raid0/testfile.bin


After that I would divide 102400 by the seconds shown by timing the runs to get MiB/s, then divide by 1024 to get GiB/s.
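
As a made-up example of that arithmetic: if the 102400 MiB transfer completes in 25.6 seconds, that's 102400 / 25.6 = 4000 MiB/s, and 4000 / 1024 ≈ 3.9 GiB/s.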

Note: All tests were run on top of a UFS2 file system formatted with default block size, not on or from the raw raid device!
 
Phishfry, according to the email notification in my inbox, you replied here a few hours ago, but I can't see your post anywhere. Or did you remove your reply (the one about GEOM RAID and mirror)? Just asking.

Anyway, just recently I noticed that I made a stupid beginner's mistake during the creation of the RAID-0 array, which makes my results questionable. Instead of partitioning the RAID device and properly aligning everything to the SSD block and RAID strip sizes, I just formatted it directly. I assume that what I'm running now is quite likely a mis-aligned file system, which should impact performance. Maybe it doesn't affect linear I/O performance as much as random I/O, but still.

At some point, when the array is not in use I'll have to re-do it and re-run the benchmarks.
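
For reference, the re-do I have in mind is roughly this (an untested sketch; raid/r0 is what the graid volume shows up as here, and 1 MiB alignment is simply a safe multiple of both the SSD pages and my strip size):

# gpart create -s gpt raid/r0
# gpart add -t freebsd-ufs -a 1m -l raid0data raid/r0
# newfs -U -t /dev/gpt/raid0data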
 
No, I was confused about graid and deleted the post.
I thought graid was only for SATA and motherboard SoftRAID.
I had no idea it could be used for NVMe. That is the first I have heard of it.
 
The way I see it is this (somebody correct me if I'm wrong): GRAID supports several RAID metadata formats. I'm using the default one, called DDF, which I think is also used by some Adaptec RAID controllers. The full list is in graid(8).

But GRAID implements those hardware/firmware RAID metadata formats in software, independent of the hardware you're using. I guess (though I've never tried it) you can probably import your arrays from a true hardware RAID into a GRAID software one. Or maybe even create them with GRAID and then export them to, say, an Adaptec or Intel or Silicon Image controller, as long as the drives have the right interface, like SATA or SAS.

But since GRAID does it all in software you can create an array in "Intel" mode on an AMD machine. Or even with USB pendrives if you want. I guess this can be nice if you want to migrate from platform to platform without re-creating your RAID arrays.

Well, given I messed mine up in terms of alignment I'll have to re-create it even without migration. Even well-made software cannot protect me from doing stupid things I guess. :p
 
But since GRAID does it all in software you can create an array in "Intel" mode on an AMD machine.
That is what I was wondering.
So you do not really need the feature in the BIOS for graid to work (unless booting from it)?
I did read up on DDF since you quoted that.
I am doing RAID experiments on gear as time permits. Maybe I will throw it in the basket.
 
No, you don't need BIOS/UEFI/firmware support. I'm running GRAID with DDF metadata on an AMD Threadripper machine with zero firmware support for it. It also works with the Intel metadata format; I didn't try any others. I believe the choice of metadata format is mostly there for interoperability reasons and maybe RAID level / feature support. For example, not every format supports things like double-parity RAID-6.

My reformatting/realigning of that file system on my RAID-0 will not happen anytime soon though, I fear. I have quite a few computation jobs running that read from and write to it again, and they will take weeks to complete. But once it's done, I will post the results here.
 
For GEOM RAID you need BIOS and chipset support, as the actual parity or striping is done by the chipset. The only thing you miss with graid(8) is the battery-backed write cache; that's why RAID-5 is not recommended on those BIOS-based RAIDs, as the write performance is poor.

Edit:
For this RAID, the max speed will be limited to ~3.93 GB/s (PCIe 3.0 x4, 32 Gbit/s, DMI 3.0) by the transfer speed between the CPU and the PCH.
For better performance you will need an NVMe interface connected directly to the CPU.

Edit2:
The new 600-series chipset will provide DMI 4.0 and a transfer rate of ~7.69 GB/s (PCIe 3.0 x8), so you will see increased performance in software-based RAID.
 
Is that really the case though? My apologies for questioning that statement, it's just that I find it strange that a RAID-0 would work in pure software (with, as I said, no BIOS or chipset support present), whereas a RAID-5 would not.

According to my sysctl output, I now have an "aacraid" device running, which indicates an Adaptec RAID, even though I have no such controller installed. It also shows e.g. "kern.geom.raid.raid5.enable: 1", so I'd assume I could create a RAID-5 with DDF metadata without any actual RAID controller present.

I'd love to just try and see whether it works or not, but unfortunately, it'll have to wait for a few weeks. O:‑)
 
On entry-level servers you have an option in the BIOS to select the RAID type (Adaptec/Intel); it's just a software-based RAID. You still need support from the BIOS to recognize the boot device, otherwise you can't boot from this RAID volume.
1630441058499.png
 
Alright, that I understand. For booting. Can't boot stripe sets without firmware support, makes perfect sense.

I guess I just forgot to mention the details. I'm not booting from my RAID device. It's simply for data. The boot device is a single NVMe SSD. That's where FreeBSD and all my software are installed.

My RAID-0 is separate, based on four additional NVMe SSDs.
 
My apologies for digging up my old thread, but I now have additional information to contribute. I wiped the GEOM RAID-0 and rebuilt it with two different stripe block sizes (256 kiB and 1 MiB) to see whether performance would scale up when doing so. I also increased the UFS filesystem block size to 64 kiB.
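
For the 1 MiB variant, the rebuild looked roughly like this (a sketch rather than my exact command history; the device names are placeholders, and the partitioning follows the alignment lesson from earlier in the thread):

# graid label -s 1048576 DDF data RAID0 nvd1 nvd2 nvd3 nvd4
# gpart create -s gpt raid/r0
# gpart add -t freebsd-ufs -a 1m raid/r0
# newfs -U -t -b 65536 -f 8192 /dev/raid/r0p1

The -b 65536 -f 8192 pair is what gives the 64 kiB UFS block size (UFS wants the block/fragment ratio to be at most 8:1).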

Additionally, the processes [geom] and [g_raid DDF] were run at real-time priority level 31 and nice level -20, which made sure any resulting performance uplift would not be compromised during periods of excessive CPU load, which in this case is around 99% of the time.
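
(For anyone wanting to reproduce that, it boils down to something like the following with rtprio(1) and renice(8); this is a sketch, with <pid> standing in for the kernel process IDs:

# ps ax | grep -E '\[geom\]|\[g_raid DDF\]'      (note the PIDs)
# rtprio 31 -<pid>
# renice -20 -p <pid>

repeated for both processes.)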

Here's the performance result when running BIO_DELETE, meaning NVMe DEALLOCATE, across 4 SSDs in said RAID-0 under CPU loads above 100, with the logical CPU count being 64:
  • 16 kiB stripe block size: ≈350 MiB/s
  • 256 kiB stripe block size (16×): ≈5,5 GiB/s (≈16×)
  • 1 MiB stripe block size (64×): ≈12,5 – 15 GiB/s (≈36,6× - ≈43,9×)

Here we can see that the performance of the array scales almost linearly from the 16 kiB to the 256 kiB stripe block size. GEOM RAID splits larger requests into blocks no bigger than its stripe block size (which I noticed when monitoring the array with sysutils/gstat-rs), so a larger stripe block size means far fewer, larger requests reaching the disks.

Further upwards, scaling is no longer linear, so we can see diminishing returns. But it's still nowhere near "bad", with the theoretical maximum being 19.34 GiB/s according to manufacturer specifications.

This concludes my experiences with GEOM RAID-0 with UFS on top. I'm now really satisfied with the performance of the array. If you're doing I/O operations larger than 4 kiB, I would recommend using larger stripe block sizes and also larger filesystem block sizes. Also, I'd recommend using GEOM RAID over GEOM STRIPE* with flash drives, due to the former having support for FreeBSD's BIO_DELETE.

*Edit: I originally wrote MIRROR here, but I actually meant GEOM STRIPE.
 
Maybe you can increase the read performance by setting a read-ahead to several times the stripe size?
 
Thank you for the input! To be honest I have not really looked at read performance again, only measuring BIO_DELETE, as it was prohibitively slow with 4 kiB stripe blocks and negatively affected other I/O operations as well when running.
However, $ sysctl -a | grep read | grep ahead only gives "kern.cam.ada.read_ahead: 1", and I couldn't find anything with $ sysctl -a | grep ufs either, so it's not a sysctl tuning knob? I looked at tunefs(8) and graid(8), but again, didn't find any way to configure this.

Could you tell me how to configure read-ahead?

Thanks!
 
The vfs.read_max sysctl governs VFS read-ahead and is expressed as the number of blocks to pre-read if the heuristics algorithm decides that the reads are issued sequentially. It is used by the UFS, ext2fs and msdosfs file systems. With the default UFS block size of 32KiB, a setting of vfs.read_max=64 (the default value in 9.x) will allow speculatively reading up to 2048 KiB. (The formula is: block size * vfs.read_max). This setting may be increased to get around disk I/O latencies, especially where these latencies are large such as in virtual machine emulated environments. It may be tuned down in specific cases where the I/O load is such that read-ahead adversely affects performance or where system memory is really low.
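
For example, with the 64 kiB UFS block size mentioned earlier in the thread (values purely for illustration):

$ sysctl vfs.read_max
vfs.read_max: 64
# sysctl vfs.read_max=256        (256 blocks x 64 kiB = 16 MiB of speculative read-ahead)

To make it persistent across reboots, the same setting can go into /etc/sysctl.conf as vfs.read_max=256.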
 
Thank you very much!

On my system, $ sysctl vfs.read_max reports "64", so this is the default value. With my UFS block size that results in 64 x 64 = 4096 kiB, or 4 MiB. That's exactly one full stripe over my 4-disk RAID-0. I decided to try setting it to "256" and see where it goes. That equals four full stripes.

I tried to read two relatively large, uncached files (one 18,35 GiB, the other 26,43 GiB) with dd at a block size of 64 kiB, so at my UFS block size. I'm assuming that the heuristics would consider this linear reading and a candidate for read-ahead. So the first was read with vfs.read_max set to 64, and the second one with it set to 256. Free system RAM at the time of the test was ≈72 GiB, load was around 120.

The results, however, are not convincing:
  1. vfs.read_max: 64 resulted in 2,074 GiB/s
  2. vfs.read_max: 256 resulted in 1,872 GiB/s
Looks like performance has actually dropped. To verify this, I re-ran the test on two more uncached files, one 26.43 GiB, the other 24.38 GiB. There is only one significant difference: unlike in the first round, load was now at 150-170 for the duration of the test because my compute jobs were just having a peak. That tends to happen every few minutes and then lasts for a few more. The second round resulted in the following:
  1. vfs.read_max: 64 resulted in 1,043 GiB/s
  2. vfs.read_max: 256 resulted in 0,636 GiB/s
Naturally, it would be much better to benchmark this on an idle system with no significant CPU load fluctuations, but there is already a bit of a trend. I think I'll keep vfs.read_max at the default setting.

Edit: Maybe I'll send SIGSTOP to all my compute processes (some last for weeks or months, so I won't just kill them), test again and then resume the processes with SIGCONT. Edit 2: Though I've probably run out of uncached files for now...
 
My hunch would be that the GEOM already reads a complete stripe. To save time, you might need a read-ahead of twice that size, so the reads overlap on each drive.

And do you use a decimal "." or ","? Because around 2 TB/s would be more than I had imagined.
 
I'm from Austria, so our decimal is ",". Sorry for the confusion! 2 TiB/s of uncached reads would be physically impossible with the SSDs I have, no matter the tuning. :)
 