ZFS: The Art of Storage

Code:
                    Linux
  
  
 --------------------------------------------
|     /data1   |    /data2    |    /data3    |
 --------------------------------------------
|     XFS      |     XFS      |    XFS       |  <--- mkfs.xfs
 --------------------------------------------
| LVM Volume 1 | LVM Volume 2 | LVM Volume 3 |  <--- lvcreate
 --------------------------------------------
|            LVM Volume Group                |  <--- pvcreate & vgcreate
 --------------------------------------------
|               RAID Volume                  |  <--- mdadm
 --------------------------------------------
|   GPT    |    GPT   |   GPT    |    GPT    |  <--- parted
 --------------------------------------------
| /dev/sda | /dev/sdb | /dev/sdc | /dev/sdd  |
 --------------------------------------------
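
  # Roughly, that Linux stack is assembled bottom-up with the commands noted on
  # the right. This is only a sketch: the device names, RAID level, and VG/LV
  # names and sizes are illustrative, not from the original post.

  parted -s /dev/sda mklabel gpt mkpart data 1MiB 100%   # GPT + one partition, repeat for sdb..sdd
  mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
  pvcreate /dev/md0                                      # LVM physical volume on the RAID device
  vgcreate datavg /dev/md0                               # LVM volume group
  lvcreate -L 100G -n vol1 datavg                        # one logical volume per /dataN
  mkfs.xfs /dev/datavg/vol1
  mount /dev/datavg/vol1 /data1                          # assuming /data1 exists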
  
  
  
  
            DragonFly BSD HAMMER 1
  
  
 --------------------------------------------
|            |     /data1    |    /data2     |
 --------------------------------------------
|            |  /pfs/@@-1:01 | /pfs/@@-1:02  |  <--  hammer pfs-master
 --------------------------------------------
|            |       HAMMER 1                |  <-- newfs_hammer -L DATA
 --------------------------------------------
|            |     /dev/ar0s1a               |  <-- disklabel64 (partitions)
 --------------------------------------------
| /dev/ar0s0 |     /dev/ar0s1                |  <-- gpt (slices)
 --------------------------------------------
|                Hardware RAID               |
 --------------------------------------------
| /dev/da0 | /dev/da1 | /dev/da2 | /dev/da3  |
 --------------------------------------------
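
  # A rough sketch of the upper HAMMER1 layers above (DragonFly); the device,
  # mount point and PFS names here are illustrative, not taken from the thread,
  # and the mount points are assumed to exist.

  newfs_hammer -L DATA /dev/ar0s1a        # create the HAMMER1 file system
  mount_hammer /dev/ar0s1a /mydata        # mount the file system root
  hammer pfs-master /mydata/pfs/data1     # create a pseudo-filesystem (PFS)
  mount_null /mydata/pfs/data1 /data1     # null-mount the PFS where it is used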

  
  
                 FreeBSD ZFS

  
 --------------------------------------------
|      /data1         |      /data2          |
 --------------------------------------------
|     dataset1        |      dataset2        |  <-- zfs create -o compress=lz4
 --------------------------------------------
|              ZFS Storage Pool              |  <-- zpool create
 --------------------------------------------
|   GPT    |    GPT   |   GPT    |    GPT    |  <-- gpart
 --------------------------------------------
| /dev/da0 | /dev/da1 | /dev/da2 | /dev/da3  |
 --------------------------------------------


Note that the GPT layer for ZFS can be skipped. It is merely there to protect the user from problems with HDDs of slightly unequal sizes, as well as to make it easier to identify the block devices (whole disks or partitions on a disk), so one doesn't get the ambiguity caused by the OS renumbering devices depending on which devices were found in hardware (which turns da0 into da1). The labeling part can also be accomplished with glabel from the GEOM framework, but YMMV, as both GEOM and ZFS store metadata at the end of the HDD.
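
For completeness, here is a rough sketch of how the ZFS stack in the last diagram could be put together on FreeBSD using GPT labels; the pool layout (raidz), dataset names and label names below are just examples, not a recommendation:

Code:
gpart create -s gpt da0                          # repeat for da1-da3
gpart add -t freebsd-zfs -l disk0 -a 1m da0      # the label shows up as /dev/gpt/disk0
zpool create tank raidz gpt/disk0 gpt/disk1 gpt/disk2 gpt/disk3
zfs create -o compression=lz4 -o mountpoint=/data1 tank/dataset1
zfs create -o compression=lz4 -o mountpoint=/data2 tank/dataset2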

Here is the original thread with my post:

https://marc.info/?l=dragonfly-users&m=150741646224774&w=2

It could be useful for people like me who don't know much.
 
@Oko

That Linux setup with LVM and XFS is quite 'risky' because it does not protect you from bit rot. There is no data integrity checking in XFS; it exists only in HAMMER1/2 and ZFS.

It's better to use BTRFS on Linux as it provides data integrity checking; just do not use the RAID5/6 features of BTRFS.

It amazes me that companies like Red Hat or SUSE try to sell their 'storage appliances' holding PETABYTES of data while their solutions do not guarantee data integrity ...
 

The write-up is motivated by an article in the most recent BSD Magazine in which a guy is trying to sell HAMMER1 as a better LVM. I just told myself: "That makes no sense, as he is comparing completely different things: a logical volume manager and a file system." I had to draw a diagram just to clear my head.

I am making no claims about how useful the Linux setup is. As a matter of fact, I only use ZFS to store data at work, so I think that says enough about what I think of the first option. Guys with lots of data here in Pittsburgh (Google, Uber and many others) live and die by ZFS (too many FreeNAS installations if you ask me). The first option is how things were done in the 90s. I also looked a little more at LVM snapshots, which I thought were useless (last time I looked was many years ago). Sure enough, they are useless, very expensive, and of fixed size.
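
For reference, a classic (non-thin) LVM snapshot has to be given a fixed amount of copy-on-write space up front; the names and size below are just an example:

Code:
lvcreate --snapshot --size 10G --name vol1-snap datavg/vol1
# the snapshot becomes invalid once 10G of changes accumulate on the origin volume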

I respectfully disagree with you on the BTRFS part. I will repeat my point of view that BTRFS is vaporware. Aside from the fact that people who are brave enough to put their data on that shit are losing their files left and right, Red Hat has given up on it completely, and they are now trying to sell masked LVM snapshots of XFS as real snapshots. I am not sure which marketing trick they will use to mask the lack of COW, checksums, data integrity and so on. As you know, Ubuntu has also given up on BTRFS and is moving towards ZFS, possibly as a FUSE implementation as opposed to a kernel implementation. With the exception of possibly SUSE in Europe, all other Linux distros are a statistical error in the U.S., so I think that pretty much validates my point on BTRFS.

You can check Tomohiro's improved (from the internal point of view) diagram for HAMMER1

https://marc.info/?l=dragonfly-users&m=150746222231191&w=2

I think there is some other interesting information in that thread: it reveals the sorry state of LVM2 on DragonFly, there is talk about FreeBSD GEOM from the point of view of the people who rejected it, information about the old gpt vs. the new gpart (DF only uses the FreeBSD 5.0 implementation of gpt), and some discussion of disklabel64 and why it is necessary on DF.
 
That Linux setup with LVM and XFS is quite 'risky' because it does not protect you from bit rot. There is no data integrity checking in XFS; it exists only in HAMMER1/2 and ZFS.
Traditionally, file systems did not provide data integrity checking, because it was deemed unnecessary. The disk drive itself has extensive data integrity checking (with very elaborate ECC), and disk drive vendors specify a rate of about 10^-15 undetected errors (errors per bit read and written). In addition, the transport mechanisms (like SCSI cables) have their own checking (even parallel SCSI had a parity wire, and all the serial protocols have CRCs on the wire). That's a tiny number, and when disks and file systems were gigabytes in size, the probability of "bit rot" (undetected data errors) was tiny.

And then, of those cases of "bit rot", a good fraction are off-track writes: during a write operation, the servo mechanism (which moves the head from track to track) doesn't work perfectly enough, and the data is written correctly, but not in a place where it will be read. The next read of the same data then returns good and valid data, just not the data written most recently. This is something that checksum mechanisms have a hard time guarding against, because the data that's read is valid and has a good checksum, just not the checksum that should be expected. I don't know whether ZFS and BTRFS can help with that problem; other file systems have checksum mechanisms that do help (by deliberately storing the expected checksum in several places on several disks at once).

With T10 DIF, the CRC mechanism can be extended from the platter of a SCSI disk to at least the HBA; that is standard today, and easy to set up. It still doesn't guard against off-track writes, and it creates bizarre error behavior that makes diagnosing a system difficult: when a CRC error occurs, the HBA may have to "fake" a SCSI error condition with artificial ASC/ASCQ codes, but that's better than not detecting the error at all.

All these problems become serious as file systems routinely hit many terabytes (today, an amateur home user can easily have a 10 TB file system on a single disk), and large file systems are often measured in petabytes (which today takes only a little over 100 disks, and file systems with 1,000 to 10,000 disks are common in large installations). At these sizes, the small probability of data corruption *per byte* multiplied by the large size becomes a good chance that there is an error somewhere.
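
To put rough numbers on that (purely illustrative, reusing the 10^-15 per-bit figure from above):

Code:
awk 'BEGIN { rate = 1e-15;                     # undetected errors per bit read
             tenTB = 10 * 8e12; onePB = 8e15;  # sizes in bits
             printf "10 TB read once: %.2f expected errors, 1 PB: %.0f\n",
                    tenTB * rate, onePB * rate }'
# prints: 10 TB read once: 0.08 expected errors, 1 PB: 8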

I see running without strong data checksums as similar to running with a single-fault-tolerant RAID code (like mirroring or RAID5): it was a good idea 10 years ago. It is still better than nothing, but it is no longer adequate for large systems with high expectations of reliability and availability.

It's better to use BTRFS on Linux as it provides data integrity checking; just do not use the RAID5/6 features of BTRFS.
Personally, I've heard many stories that BTRFS is so buggy that it should not be used at all, other than by hobbyists and enthusiasts. We used to refer to it with terms like "machine to create data loss". Note that Red Hat is removing BTRFS from its release. I wouldn't store anything on it.

It amazes me that companies like Red Hat or SUSE try to sell their 'storage appliances' holding PETABYTES of data while their solutions do not guarantee data integrity ...
Does Red Hat sell a "storage appliance"? I think they sell an operating system, which others can then integrate into a storage appliance. To quote the former CTO of NetApp: selling RAID5 today amounts to professional malpractice. If you build large storage servers or storage systems today, you have to have an answer to undetected read errors and to the error rate overwhelming some RAID codes.
 

I was not entirely clear about BTRFS: I would not put my data on it now, but after 3-4 years of development it may be stable and usable, and it is the only file system on Linux that is GPL licensed and has data integrity checking built in. I also like the Ubuntu way more, incorporating ZFS. I love all these screams of 'CDDL is not compatible with GPL' bullshit and so on, but even VMware used the Linux kernel, modified it in the ESX/ESXi products, did not commit the changes back, and nobody has done anything to them, so nobody will do anything against 'friendly' ZFS use on Ubuntu.

I would also prefer that Linux fanatics stop praying to the GPL that much and incorporate ZFS, but it ain't gonna happen at the Red Hat or SUSE corporations.

Besides, Red Hat dropped support for BTRFS mostly because they (Red Hat) do not have any BTRFS developers; they only have XFS and LVM developers, which is why they want to stick with those.

Personally, I am waiting for HAMMER2 to stabilize and become feature complete. Matt Dillon shows how much the FreeBSD project has lost by not having him.
 
Personally, I've heard many stories that BTRFS is so buggy that it should not be used at all, other than by hobbyists and enthusiasts. We used to refer to it with terms like "machine to create data loss". Note that Red Hat is removing BTRFS from its release. I wouldn't store anything on it.

I was not entirely clear about BTRFS: I would not put my data on it now, but after 3-4 years of development it may be stable and usable, and it is the only file system on Linux that is GPL licensed and has data integrity checking built in.

Does Red Hat sell a "storage appliance"? I think they sell an operating system, which others can then integrate into a storage appliance. To quote the former CTO of NetApp: selling RAID5 today amounts to professional malpractice. If you build large storage servers or storage systems today, you have to have an answer to undetected read errors and to the error rate overwhelming some RAID codes.
Yes, they do: just as their Linux is called Red Hat Enterprise Linux, their storage is called Red Hat Enterprise Storage :)

You can compare the SUSE and Red Hat 'storage' offerings with other 'enterprise' solutions here:
https://www.suse.com/docrep/documents/taz9nke6i8/5_year_tco_case_study.pdf
 
I'm amazed that both SUSE and Red Hat market Ceph-based solutions as enterprise storage. That will make for unhappy customers, and more sales for EMC/NetApp/IBM/Dell/HP/...
 