UFS GRAID3 usage

I think I have found the RAID level I am happy with. RAID3.
Are there many users of graid3 out there? I would like to hear your opinions.
I don't think I have heard any questions about it, so I assume it is not very popular.
It seems to be best suited to big files rather than small files.

I am not a RAID expert. How does RAID3 stack up? I have tried geom RAID0, RAID1 and RAID10 so far.
Looking at Wikipedia, it is rarely used, but FreeBSD's RAID3 seems to be different.
Wikipedia says RAID3 needs synchronized spindles, but I see nothing about that with FreeBSD.
https://en.wikipedia.org/wiki/Standard_RAID_levels

Anybody know the scoop? Absolute BSD, 2nd edition, had a nice write-up I used for learning.
But with ZFS being the buzzword, I wonder if geom_raid3 is still relevant. Is it deprecated?
FreeBSD had a graid5, but it never seems to have made it into -CURRENT. Anybody know the scoop on that?

FreeBSD geom_raid3 ticked all the boxes for me: speed, space, and EASY. I have already yanked drives and hot-swapped in replacements for testing. Looks like all systems are GO.
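
For anyone following along, this is roughly the shape of the setup; a minimal sketch only, with placeholder device names (da1, da2, da3), label (gr0) and mount point. graid3(8) has the full details:
Code:
# create a 3-disk RAID3 array named "gr0" and put UFS on it
graid3 label -v gr0 da1 da2 da3
newfs /dev/raid3/gr0
mount /dev/raid3/gr0 /mnt

# check array health
graid3 status

# after hot-swapping a failed disk: drop the dead component,
# then insert the replacement into component slot 1 and let it rebuild
graid3 forget gr0
graid3 insert -n 1 gr0 da2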

I hate to cross-post from other forums, but this FreeNAS post highlights some of my confusion.
Is the GRAID3 drive count any odd number > 1, or (2^n + 1)?
https://forums.freenas.org/index.php?threads/ufs-and-raid0-raid1-or-raid3.8389/
 
I don't know the details of the geom graid implementation; I have never used it, other than a quick peek at the graid(8) page.

In general, RAID3 is a bad idea, and it is nearly completely obsolete. There are several reasons:
(a) Like all parity-based RAID levels (2 and up), small writes are very expensive, since they require a read-modify-write cycle to update a block in place. RAID1 and RAID10 do not have this penalty.
(b) For small reads: since each data block (as small as 512 bytes) is spread over all disks, even the smallest read requires starting an IO on all disks. If each disk can only do one thing at a time, that means the whole array will only ever work on one IO at a time. Compare that with RAID5: if the array has N disks and you start many small IOs at the same time, the RAID layer can run up to N small reads simultaneously.
(c) For all IOs (and a serious problem for large IOs), each IO is started on all disks simultaneously. If the disks are not perfectly synchronized, you always have to wait for the last disk to finish. And because they are not synchronized, the platters will always be in different positions, so you fundamentally always have to wait for close to a whole platter rotation, not half of one on average (see the worked example below). This has gotten much worse with modern disks, which use heavy revectoring and format every zone differently per drive (so-called adaptive layout). Think of it as an extreme case of the "convoy effect".
(d) From a throughput viewpoint, for highly sequential large-IO workloads (video streaming, supercomputing), it has no advantage over RAID4 and RAID5 (although RAID5 is a little harder to implement).
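
To put a rough number on point (c), here is a back-of-the-envelope calculation, assuming 7200 RPM drives and a 5-disk array with unsynchronized spindles (the numbers are purely illustrative). The expected maximum of five independent rotational offsets is 5/6 of a revolution, versus 1/2 for a single disk:
Code:
# 7200 RPM -> one revolution takes 60/7200 s = 8.33 ms;
# a single disk waits half a revolution on average, while a 5-disk
# unsynchronized stripe waits for the slowest platter: 5/6 of a revolution
echo "r = 60/7200*1000; r; r/2; (5/6)*r" | bc -l
# prints ~8.33 (ms per revolution), ~4.17 (single disk), ~6.94 (5-disk RAID3)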

So today, in practice, only a small subset of RAID levels still exists: 1 (and 10), 5, and various multi-parity forms of 5 (sometimes called RAID-6 and such). RAID-3 used to be used a little, but I think in the last 10 or 15 years I have not seen it in any new design. In practice, modern production implementations on big computers are much more complex than RAID-1, 10 and 5, as they use variable layouts, higher parity counts, and more interesting layouts.

In theory, RAID-3 can be done with any number of disks that is 3 or greater (the same is true for RAID-2 through 5). In practice, there may be implementation constraints that make it easier with certain numbers of disks. The restriction to "2^n + 1" might come from the fact that the geom implementors didn't bother to write the code for splitting a 512-byte logical block across the data disks when the split leaves a remainder (which requires doing math at a granularity finer than a sector, and occasionally doing atomic multi-sector updates, which you need to do for parity writes anyway). This is not meant as criticism ... I spent ~10 years of my life working on a storage system that could only use RAID levels that were "2^n + [2,3,4]" encodings (with n = 2...4, but placed on an arbitrary number of disks, as long as the number of disks was a dozen or larger). An implementor has to make choices between generality and schedule/budget/quality.
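
The arithmetic behind that constraint, as I understand it (my own illustration, not taken from the geom source):
Code:
# bytes per data disk when a 512-byte block is split across N data disks
echo "512/2; 512/4; 512/8; 512/3; 512/5" | bc -l
# 256, 128 and 64 split evenly; 170.67 and 102.4 leave a remainder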

You might ask: If RAID-2, 3 and 4 are de-facto useless, why were they even proposed? The answer is that Garth Gibson's original RAID paper was his PhD thesis, and it wasn't meant as a marketing white paper, nor as an implementor's guide, but as a taxonomy of what is feasible. Just because it is described and the nomenclature standardized doesn't mean that it is a good idea. Disclaimer: I see Garth regularly, and he was my manager for a while.

Honestly, I have no idea why you are even interested in messing with the geom-based RAID system. Why don't you just use ZFS, and pick a sensible RAID layout there? ZFS's RAID system is way better than anything else available on commodity computers (just for starters: you can add bigger disks, dynamically change the RAID level, you get checksums, ...).
 
What I am really foggy about is that it appears even if you add more disks, i.e. 3, 5, or 9, it still only uses one disk for parity.
I don't understand how that is possible.

This weekend I am actually using geom nested RAID for learning, with RAID3+1 and RAID30. Six-disk minimum.
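
Roughly what I am playing with, sketched from memory (the device and label names below are just placeholders, not my actual layout):
Code:
# two 3-disk RAID3 arrays ...
graid3 label r3a da0 da1 da2
graid3 label r3b da3 da4 da5

# ... striped together into a RAID30, then formatted
gstripe label r30 raid3/r3a raid3/r3b
newfs /dev/stripe/r30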

Maybe one day I will work my way to ZFS, but right now I am enjoying geom. It is very easy to use and it automatically rebuilds quite nicely. I read about resilvering times on ZFS and it makes me shudder. My rebuilds are quite quick comparatively with geom.
I think the KISS philosophy is why I am here. Maybe one day I will need the advanced features of ZFS. Right now I don't.
The reason I am wary is that I got bitten by a ZFS upgrade on FreeNAS around 2012. A ZFS filesystem version 13 upgrade wiped me out.
I didn't have the skills to recover, and it left a very bitter taste in my mouth.

Plus there is something negative to me about a Larry Ellison file system. One day, when ZFS becomes thoroughly entrenched and Larry loses some big clients, he will be back to haunt you. He is a snake.
 
What I am really foggy about is that it appears even if you add more disks, i.e. 3, 5, or 9, it still only uses one disk for parity.
I don't understand how that is possible.
Simple. The following diagrams show the disks A, B, C, ... as columns; I number the data bytes 1, 2, 3, ...; the parity bytes are shown as sums (so 1+2 means the parity of bytes 1 and 2, and 1+2+3+4 the parity of bytes 1...4):
Code:
3 disks:             5 disks:
A   B     C          A    B    C    D      E
1   2    1+2         1    2    3    4    1+2+3+4
3   4    3+4         5    6    7    8    5+6+7+8
5   6    5+6         9   10   11   12    9+10+11+12
7   8    7+8        13   14   15   16    13+14+15+16
...                       ...
... and I was too lazy to draw the one for 9 disks, which should be obvious.
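
If you want to convince yourself that one parity column is enough no matter how many data columns there are, XOR a few bytes in the shell. A toy illustration only (the byte values are arbitrary), not how geom does it internally:
Code:
# the parity of three data bytes is just their XOR
echo $(( 0x11 ^ 0x22 ^ 0x44 ))    # parity = 0x77 = 119

# "lose" the middle byte: XOR of the survivors and the parity recovers it
echo $(( 0x11 ^ 0x44 ^ 0x77 ))    # 0x22 = 34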

This weekend I am actually using geom nested RAID for learning, with RAID3+1 and RAID30. Six-disk minimum.
If you have 6 disks, you should really use a code that is 2-fault tolerant (3+1 is, but it only gives you 2 disks' worth of capacity). Here's why: with modern disks, the probability of hitting a sector read error during a resilver is very high (fundamentally it is ~1 on large RAID arrays). That means that tolerating only a single disk failure is suicidal. Any large storage system has to tolerate two faults to be production-worthy.
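
To put a number on that, a rough back-of-the-envelope, assuming 2 TB drives and the usual consumer-class spec of one unrecoverable read error per 10^14 bits (enterprise drives are often rated an order of magnitude better). Rebuilding a single-parity array of six such disks means reading the five survivors end to end:
Code:
# bits read during the rebuild, divided by bits per expected read error
echo "(5 * 2*10^12 * 8) / 10^14" | bc -l
# = 0.8 expected unreadable sectors per rebuild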

I read about resilvering times on ZFS and it makes me shudder. My rebuilds are quite quick comparatively with geom.
ZFS rebuild is slow when there is concurrent activity, because ZFS automatically (and sensibly) throttles the rebuild in that case. Also, ZFS is not a very fast file system on low-powered machines, which affects rebuild even more than normal IO. On the other hand, ZFS rebuilds can be faster than a separate RAID layer: if the file system is only half full, ZFS doesn't need to rebuild free space; it only rebuilds the blocks used by files, not all blocks.

Maybe one day I will need the advanced features of ZFS. Right now I don't.
Yes, you do. Checksums alone are a reason to switch to ZFS. Today, undetected read errors will hit you on large storage systems, so checksums have become important (again, if you want production reliability).
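
For what it's worth, checking for silent corruption on ZFS is a one-liner; a sketch, assuming a pool named tank:
Code:
# read every allocated block and verify it against its checksum
zpool scrub tank

# the CKSUM column counts blocks that failed verification
# (and were repaired from redundancy, if any exists)
zpool status -v tank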

Plus there is something negative to me about a Larry Ellison file system.
It is not a Larry Ellison file system. It was written at Sun, long before Larry took that company over. I have a few friends who worked on it, who would never work for Oracle. And today, the version of ZFS that you run in FreeBSD has nothing to do with Larry either; it is open source.
He is a snake.
Absolutely. He is scum. His treatment of women is probably the lowest point of his scumminess. I don't like the guy one bit. Fortunately, I don't have to deal with him.

But getting back to your situation: If you are doing this as a form of experimentation, or for a home server that does not have important data on it, go ahead, and play with whatever system you want. Once you cross the threshold where your data has some value, different rules apply, and you need to use common sense and stop playing.
 
Yes, it's all experimentation for me. I assembled a 24-bay rig, but realized I am still storage-ignorant. So I temporarily downgraded to a 1U 8-bay rig that might end up as a cold storage machine.

With experimentation I am frequently rebooting, and the wear and tear of 24 drives restarting was not good. So, to figure out the essentials of storage, I am using 5 drives in an 8-bay 1U chassis. I bought some zero-hour 10K.6 drives for dirt cheap. Now that I have used them, I see the problem: they run stupidly hot. I am not stacking them, to allow better airflow.
The machinist in me wonders how backplane designers planned on airflow getting to the drives. I thought about drilling a bunch of holes in the void areas for experimentation. No wonder rack cases are so loud: they have to force air through a brick wall known as the backplane.
 
As long as you don't power-cycle, the restarts should not cause wear and tear. Matter of fact, older disks (from 5 or more years ago) don't get any wear and tear from reads and writes either. Modern disks do get worn out by reading and writing, which is why manufacturers now put a limit on it (often 550 TB/year), or else the warranty is void.

Use smartctl to check the temperature. 35-40 degrees C is good. Over 50 is bad, and around 60 you really need to improve cooling.
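
For example (assuming the first drive shows up as /dev/da0; adjust the device name for your controller):
Code:
# print the drive's SMART attributes and pull out the temperature line
smartctl -A /dev/da0 | grep -i temperature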

Typically, cooling works by taking air in at the front, along the drives, and then out upward or downward. Or there is a little space next to the backplane for the air to go through. But you're right: today, rack-mounted computers are designed to go into data centers, and they are optimized for (a) good cooling, (b) reliable cooling (meaning redundant fans), and (c) low power consumption (efficient airflow if possible). Noise is not a concern at all. Data centers are run at uncomfortable temperatures (either too hot or too cold for humans; there are different ways to do it) and at insane noise levels. That's because in today's data centers, humans are very rarely present, which means the comfort of the human is not important.

Old joke: how do you run a computer? Hire a man and a dog. The man to feed the dog. The dog to bite the man if he tries to touch the computer.
 
You're absolutely right about the airflow. The fans pull it across the drives, into the tunnel for the CPU heatsinks, and out the back.
What would be the number-two filesystem you would choose behind ZFS, including on any other OS?
 
Of the publicly available ones ... the new Microsoft reliable file system (ReFS, the successor to NTFS), which is available in the newest Windows Server edition: it has checksums and built-in RAID, and with all the reservations about MS, they tend to have very high quality server OSes. I hear really good things about it, but I don't know any of the people involved in developing it (I live in Silicon Valley, and I think it is implemented in Redmond or Kirkland, 12 hours by car away).

In theory Hammer might be good (the specs certainly look nice), but I have no idea what its real-world survivability and reliability are: given the size of its staff, it can't have the 100-person support department that you need for an industrial-quality product.

Linux BtrFS (not BetrFS) is in theory a nice design, but the quality of the implementation is such that it has become nothing but a machine for destroying your data.

SGI's XFS and CxFS were really good 10 or 15 years ago, but then SGI failed, and it hasn't had much attention paid to it since: still a good, fast, reliable design, but without checksums and RAID features, which is where the real value-add is today.

IBM's GPFS is great, but extremely expensive (designed for people who buy $10-100M supercomputers and think that a $100K license for the file system is a rounding error). Plus, to get the checksum and RAID features, you need to buy your storage from IBM and friends, so it is unaffordable for home users. Forget Lustre: all the downsides of GPFS without any of the upside.

The dark horse in the race is Veritas' VxFS: I don't think there is a way an individual can buy it (matter of fact, I don't know how to buy it without getting HP-UX installed), but it used to have a great reputation.

I was recently talking to one of the implementors of the MacOS file system (he's a relative of a friend), and was surprised at how many modern and sensible features they have added recently; given that it is only sold on a very limited set of computers, most of which are not suitable as servers, it's nice to see that there is advanced R&D going into it. But I don't know whether it has end-to-end checksums, and forget RAID (how would you put multiple disks into a Macbook?).

In the Linux world, I use ext4. Not because it's great, but because it's so old that it's pretty reliable, and because I respect the folks who work on it. If you run it on a cloud machine (where the underlying storage already has checksums and reliability/availability built in), that's good enough. Other than a Raspberry Pi, I no longer have any Linux machines that are not in the cloud.

Do you now see why I run FreeBSD on my home server in the basement? ZFS is just head and shoulders above everything else in the low cost area.

At work I use or develop ... stuff that I won't talk about.
 
It seems Gluster has lost some of its luster to Ceph for Red Hat. I noticed they finally killed off BtrFS.
I forgot about XFS.
 
Careful: RedHat can neither "kill off BtrFS", nor "deprecate LSI SAS2008". All they can do is to say that they no longer support it in those Linux distributions that they sell and support. And while they can fund the development of Ceph (they have after all bought Sage's company and are pouring money into development), they can't make gluster disappear.

All this means is: if a customer buys a copy of RHEL, it might come without the SAS2008 kernel module (driver) prebuilt. Nothing prevents a RHEL customer from downloading the code from kernel.org and building and running it himself. Other than ... if they afterwards call RedHat and say "I have a problem with my system when using a SAS2008 controller", then RedHat can easily say "sorry, you're on your own, that stuff is unsupported".

You have to remember: RHEL is not bought or used by hobbyists or small-time users. It's too expensive. It's used by companies who want (or need) good support. They won't buy a used IBM ServeRAID on eBay, reflash it themselves, and then expect it to be supported; instead, they'll spend $10K on a new rackmount machine and get on with life. Similarly, if RedHat doesn't do BtrFS, then RedHat's customers will install using XFS or ext4.

Now the part where you are correct is this: 20 years ago, Linux development was mostly done by amateur volunteers. Today, the situation has completely changed, and most development is done by paid staff, with RedHat, Intel, IBM and Oracle being the largest contributors. If RedHat stops putting bug fixing and improvement effort into things like BtrFS, SAS2008 and gluster, they might dry up and die on the vine, unless volunteers are found.
 
I decided to try graid3 with 3 NVMe drives. These are Samsung PM983 1.92TB drives. A single drive shows 1500 MB/sec in diskinfo.
The array is seriously slow:
Code:
diskinfo -t /dev/raid3/gr0p1
/dev/raid3/gr0p1
    1024            # sectorsize
    3840766779392    # mediasize in bytes (3.5T)
    3750748808      # mediasize in sectors
    0               # stripesize
    20480           # stripeoffset
    233473          # Cylinders according to firmware.
    255             # Heads according to firmware.
    63              # Sectors according to firmware.
    No              # TRIM/UNMAP support
    Unknown         # Rotation rate in RPM

Seek times:
    Full stroke:      250 iter in   0.031007 sec =    0.124 msec
    Half stroke:      250 iter in   0.030998 sec =    0.124 msec
    Quarter stroke:      500 iter in   0.063941 sec =    0.128 msec
    Short forward:      400 iter in   0.047966 sec =    0.120 msec
    Short backward:      400 iter in   0.047953 sec =    0.120 msec
    Seq outer:     2048 iter in   0.210953 sec =    0.103 msec
    Seq inner:     2048 iter in   0.256304 sec =    0.125 msec

Transfer rates:
    outside:       102400 kbytes in   0.137516 sec =   744641 kbytes/sec
    middle:        102400 kbytes in   0.138308 sec =   740377 kbytes/sec
    inside:        102400 kbytes in   0.138677 sec =   738407 kbytes/sec

I have destroyed that array and I am going to compare it to a ZFS raidz1 array.
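
The plan for the comparison, roughly (just a sketch; the pool name and nvd device names below are placeholders for the three PM983s):
Code:
# build a 3-drive raidz1 pool on the NVMe drives
zpool create tank raidz1 nvd0 nvd1 nvd2

# rough sequential write test (compression off, so the zeros are actually written)
zfs set compression=off tank
dd if=/dev/zero of=/tank/testfile bs=1m count=10240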
 