ZFS ZFS RaidZ Capacity does not make sense

kballow · Dec 21, 2022

Hello All,
I have 4 30TB SSD's, yep you heard that right.
Does it sound right for a 4 Wide Raidz1 with these disks to give me 56TB of space? A Raid 10 would give similar amounts.
Please see below. I used default options for creating the raidz.

Code:

(19:27:16)X123-~> zfs list
NAME     USED  AVAIL     REFER  MOUNTPOINT
Raidz1  3.81T  52.0T     3.81T  /Raidz1

(19:27:18)X123-~> zpool status
  pool: Raidz1
 state: ONLINE
config:

    NAME        STATE     READ WRITE CKSUM
    Raidz1      ONLINE       0     0     0
      raidz1-0  ONLINE       0     0     0
        nvd2    ONLINE       0     0     0
        nvd3    ONLINE       0     0     0
        nvd4    ONLINE       0     0     0
        nvd5    ONLINE       0     0     0

errors: No known data errors

(19:27:22)X123-~> zpool list
NAME     SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
Raidz1   112T  7.61T   104T        -         -     0%     6%  1.00x    ONLINE  -

Alain De Vos · Dec 21, 2022

Try

Code:

zfs list -r -o name,usedbydataset -s usedbydataset | grep -v "@"
zfs list -o space

Eric A. Borisch · Dec 21, 2022

Due to the way raidz-n works, each allocation must be a multiple of n+1, while still meeting redundancy requirements. The first requirement means you will have at worst 50% (2 blocks (one on two different disks) allocated for 1 block of data) and at best 75% (second requirement; 1 block on each of four drives for three blocks of data) efficiency. Your actual efficiency will be somewhere between those depending on file sizes and IO behavior (and snapshots).

(It can actually be significantly worse if you are writing primarily small files that are smaller than the zpool’s block size/ashift.)

~~My guess is that zfs is using a ~worst case when reporting, as it can’t tell the future.~~ [Edit: a test pool confirms, as Alain De Vos points out it should show at 3/4 size for raidz1 and four disks, at least on 13.1; what version of FreeBSD are you on?]

So depending on your use case, you may be happier with a stripe of mirrors. Turning on compression can help recover some of this loss if you have compressible files.

Alain De Vos · Dec 21, 2022

If n is the total number of disks.
For a mirror the capacity is 50%
For raid-Z1 it is (n-1)/n.
For raid-Z2 it is (n-2)/n
For raid-Z3 it is (n-3)/n

SirDice · Dec 21, 2022

Also note that disks use decimal prefixes, whereas FreeBSD and ZFS use binary prefixes. So 30 TB is roughly ~27 TeBi. The bigger the disk the bigger this difference is.

Binary prefix - Wikipedia

en.wikipedia.org

kballow · Dec 21, 2022

Alain De Vos said:
Try

Code:

zfs list -r -o name,usedbydataset -s usedbydataset | grep -v "@" zfs list -o space

Code:

[root@X123 ~]# zfs list -r -o name,usedbydataset -s usedbydataset | grep -v "@"
NAME    USEDDS
Raidz1   3.81T

[root@X123 ~]# zfs list -o space
NAME    AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
Raidz1  52.0T  3.81T        0B   3.81T             0B       716M

kballow · Dec 21, 2022

SirDice said:
Also note that disks use decimal prefixes, whereas FreeBSD and ZFS use binary prefixes. So 30 TB is roughly ~27 TeBi. The bigger the disk the bigger this difference is.

Binary prefix - Wikipedia

en.wikipedia.org

Hi Yes I understand that its around 27TB which is where the 112TB Raw Number comes from.
I just have a hard time seeing where all the space is going, I understand that 1 disk is for parity, Then there is padding, filesystem, etc. but Raidz1 is taking 2 Disks worth it seems because When I raid10, i got 54TB which we know is half.

kballow · Dec 21, 2022

Eric A. Borisch said:
Due to the way raidz-n works, each allocation must be a multiple of n+1, while still meeting redundancy requirements. The first requirement means you will have at worst 50% (2 blocks (one on two different disks) allocated for 1 block of data) and at best 75% (second requirement; 1 block on each of four drives for three blocks of data) efficiency. Your actual efficiency will be somewhere between those depending on file sizes and IO behavior (and snapshots).

(It can actually be significantly worse if you are writing primarily small files that are smaller than the zpool’s block size/ashift.)

~~My guess is that zfs is using a ~worst case when reporting, as it can’t tell the future.~~ [Edit: a test pool confirms, as Alain De Vos points out it should show at 3/4 size for raidz1 and four disks, at least on 13.1; what version of FreeBSD are you on?]

So depending on your use case, you may be happier with a stripe of mirrors. Turning on compression can help recover some of this loss if you have compressible files.

I am using FreeBSD 13.1 as well. I googled intesly to see if there is a way to break up the space usage and have failed. I can not imagine there is not a way to get all spaced used information somehow. It seems like there is RAW and then Avaiable, nothing in between like padding, filesystem, etc.

covacat · Dec 21, 2022

have you set copies=2 or something ?

kballow · Dec 21, 2022

covacat said:
have you set copies=2 or something ?

Does not look like it,
Raidz1 copies 1 default
I posted the other info too.

Code:

[root@X123 ~]# zfs get all
NAME    PROPERTY              VALUE                  SOURCE
Raidz1  type                  filesystem             -
Raidz1  creation              Tue Dec 20 14:44 2022  -
Raidz1  used                  3.81T                  -
Raidz1  available             52.0T                  -
Raidz1  referenced            3.81T                  -
Raidz1  compressratio         1.00x                  -
Raidz1  mounted               yes                    -
Raidz1  quota                 none                   default
Raidz1  reservation           none                   default
Raidz1  recordsize            128K                   default
Raidz1  mountpoint            /Raidz1                default
Raidz1  sharenfs              off                    default
Raidz1  checksum              on                     default
Raidz1  compression           lz4                    local
Raidz1  atime                 on                     default
Raidz1  devices               on                     default
Raidz1  exec                  on                     default
Raidz1  setuid                on                     default
Raidz1  readonly              off                    default
Raidz1  jailed                off                    default
Raidz1  snapdir               hidden                 default
Raidz1  aclmode               discard                default
Raidz1  aclinherit            restricted             default
Raidz1  createtxg             1                      -
Raidz1  canmount              on                     default
Raidz1  xattr                 on                     default
Raidz1  copies                1                      default
Raidz1  version               5                      -
Raidz1  utf8only              off                    -
Raidz1  normalization         none                   -
Raidz1  casesensitivity       sensitive              -
Raidz1  vscan                 off                    default
Raidz1  nbmand                off                    default
Raidz1  sharesmb              off                    default
Raidz1  refquota              none                   default
Raidz1  refreservation        none                   default
Raidz1  guid                  10265362000983470673   -
Raidz1  primarycache          all                    default
Raidz1  secondarycache        all                    default
Raidz1  usedbysnapshots       0B                     -
Raidz1  usedbydataset         3.81T                  -
Raidz1  usedbychildren        716M                   -
Raidz1  usedbyrefreservation  0B                     -
Raidz1  logbias               latency                default
Raidz1  objsetid              54                     -
Raidz1  dedup                 off                    default
Raidz1  mlslabel              none                   default
Raidz1  sync                  standard               default
Raidz1  dnodesize             legacy                 default
Raidz1  refcompressratio      1.00x                  -
Raidz1  written               3.81T                  -
Raidz1  logicalused           3.80T                  -
Raidz1  logicalreferenced     3.80T                  -
Raidz1  volmode               default                default
Raidz1  filesystem_limit      none                   default
Raidz1  snapshot_limit        none                   default
Raidz1  filesystem_count      none                   default
Raidz1  snapshot_count        none                   default
Raidz1  snapdev               hidden                 default
Raidz1  acltype               nfsv4                  default
Raidz1  context               none                   default
Raidz1  fscontext             none                   default
Raidz1  defcontext            none                   default
Raidz1  rootcontext           none                   default
Raidz1  relatime              off                    default
Raidz1  redundant_metadata    all                    default
Raidz1  overlay               on                     default
Raidz1  encryption            off                    default
Raidz1  keylocation           none                   default
Raidz1  keyformat             none                   default
Raidz1  pbkdf2iters           0                      default
Raidz1  special_small_blocks  0                      default

mer · Dec 21, 2022

kballow do you have the command you used to create the pool? I think you should be able to get it from zpool history.

Why am I asking? I'm not sure what you are expecting to see. Based on zpool status and your OP, I'm guessing you are expecting 4*30TB or roughly "120GB" of space, but you are only seeing 56TB (roughly half), is that about right?

kballow · Dec 21, 2022

mer said:
kballow do you have the command you used to create the pool? I think you should be able to get it from zpool history.

Why am I asking? I'm not sure what you are expecting to see. Based on zpool status and your OP, I'm guessing you are expecting 4*30TB or roughly "120GB" of space, but you are only seeing 56TB (roughly half), is that about right?

I used the command
zpool create Raidz1 raidz nvd2 nvd3 nvd4 nvd5
My thought process is my 30TB disks which are really 28ish TB with raw capacity of 112TB should give me something closer to 84TB, understanding that there is other elements like filesystem, padding, parity, the number should go down but losing a whole ~30TB seems excessive. Perhaps I am missing something...

Eric A. Borisch · Dec 21, 2022

An example, with 2048 4k (=8MB logical) files (on ashift=12):

Code:

$ zpool list -v testpool; zfs list testpool; du -hA /testpool; du -h /testpool
NAME              SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
testpool         15.5G  18.9M  15.5G        -         -     0%     0%  1.00x    ONLINE  -
  raidz1-0       15.5G  18.9M  15.5G        -         -     0%  0.11%      -    ONLINE
    /root/diska      -      -      -        -         -      -      -      -    ONLINE
    /root/diskb      -      -      -        -         -      -      -      -    ONLINE
    /root/diskc      -      -      -        -         -      -      -      -    ONLINE
    /root/diskd      -      -      -        -         -      -      -      -    ONLINE

NAME       USED  AVAIL     REFER  MOUNTPOINT
testpool  13.7M  10.9G     12.9M  /testpool

8.0M    /testpool

 13M    /testpool

$ zfs list -o refer,logicalreferenced,used testpool
   REFER  LREFER   USED
   12.9M   8.44M  13.7M

Note this is with raidz1 with 4x4G (files), and it shows some of the different (8, 13, and 19M, depending on where you look) accounting that occurs. But, you can see this shows ~ 3x4G for zfs list available, so something is strange with yours only showing 2x(device size). As mer suggested, was does zpool history Raidz1 show?

kballow · Dec 21, 2022

Eric A. Borisch said:
An example, with 2048 4k (=8MB logical) files (on ashift=12):

Code:

$ zpool list -v testpool; zfs list testpool; du -hA /testpool; du -h /testpool NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT testpool 15.5G 18.9M 15.5G - - 0% 0% 1.00x ONLINE - raidz1-0 15.5G 18.9M 15.5G - - 0% 0.11% - ONLINE /root/diska - - - - - - - - ONLINE /root/diskb - - - - - - - - ONLINE /root/diskc - - - - - - - - ONLINE /root/diskd - - - - - - - - ONLINE NAME USED AVAIL REFER MOUNTPOINT testpool 13.7M 10.9G 12.9M /testpool 8.0M /testpool 13M /testpool $ zfs list -o refer,logicalreferenced,used testpool REFER LREFER USED 12.9M 8.44M 13.7M

Note this is with raidz1 with 4x4G (files), and it shows some of the different (8, 13, and 19M, depending on where you look) accounting that occurs. But, you can see this shows ~ 3x4G for zfs list available, so something is strange with yours only showing 2x(device size). As mer suggested, was does zpool history Raidz1 show?

Nothing special

Code:

[root@X123 ~]# zpool history Raidz1
History for 'Raidz1':
2022-12-20.14:44:06 zpool create Raidz1 raidz nvd2 nvd3 nvd4 nvd5
2022-12-20.14:44:25 zfs set compression=lz4 Raidz1

Also I destroyed the Raidz1 and created with 3 Wide to show it scales with the same problem

Code:

[root@X123 ~]# zpool create Raidz1 raidz nvd2 nvd3 nvd4

[root@X123 ~]# zpool list
NAME     SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
Raidz1  83.8T  13.9M  83.8T        -         -     0%     0%  1.00x    ONLINE  -

[root@X123 ~]# zpool status
  pool: Raidz1
 state: ONLINE
config:

    NAME        STATE     READ WRITE CKSUM
    Raidz1      ONLINE       0     0     0
      raidz1-0  ONLINE       0     0     0
        nvd2    ONLINE       0     0     0
        nvd3    ONLINE       0     0     0
        nvd4    ONLINE       0     0     0

errors: No known data errors

[root@X123 ~]# zfs list
NAME     USED  AVAIL     REFER  MOUNTPOINT
Raidz1  5.62M  41.8T     1.50M  /Raidz1

kballow · Dec 21, 2022

Here is the Raid 10 Data point showing it works as expected.

Code:

[root@X123 ~]# zpool create Raid10 mirror nvd2 nvd3 mirror nvd4 nvd5

[root@X123 ~]# zpool status
  pool: Raid10
 state: ONLINE
config:

    NAME        STATE     READ WRITE CKSUM
    Raid10      ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        nvd2    ONLINE       0     0     0
        nvd3    ONLINE       0     0     0
      mirror-1  ONLINE       0     0     0
        nvd4    ONLINE       0     0     0
        nvd5    ONLINE       0     0     0

errors: No known data errors

[root@X123 ~]# zpool list
NAME     SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
Raid10  55.9T  5.81M  55.9T        -         -     0%     0%  1.00x    ONLINE  -

[root@X123 ~]# zfs list
NAME     USED  AVAIL     REFER  MOUNTPOINT
Raid10  5.81M  55.7T     1.50M  /Raid10

mer · Dec 21, 2022

So my understanding of zpool history for the create that should have created a stripe of the 4 30TB devices, to give total of 120TB minus some overhead (I think nominally 110TB or so).

The Raid10 makes sense "mirror the 30TB and then stripe over the mirrors", so mirror 30TB gives a mirror of 30TB, then stripe 2 30TB mirrors to get 60TB (or the actual 56TB).

The raidz with 3 of the devices seemingly scaling is interesting. Almost like a device is being used as parity?

VladiBG · Dec 21, 2022

There's metaslab allocation lost and slop space reservation which depend of the number of the vdev in the group.

https://www.bsdcan.org/2016/schedule/attachments/366_ZFS%20Allocation%20Performance.pdf

ZFS Storage Overhead - WintelGuy.com

ZFS Storage Overhead

wintelguy.com

kballow · Dec 21, 2022

VladiBG said:
There's metaslab allocation lost and slop space reservation which depend of the number of the vdev in the group.

https://www.bsdcan.org/2016/schedule/attachments/366_ZFS%20Allocation%20Performance.pdf

ZFS Storage Overhead - WintelGuy.com

ZFS Storage Overhead

wintelguy.com

Yes I do understand that the rule of thumb about 3.9% is reserved for these items, but that still does not account for the ~24TB missing after i take out the 3.9% from the theoretical 84TB

Summary
Raw: 112TB
Expected before padding and slope: 84TB
After 3.9% accounted for: ~80TB
What I got: 56TB
Missing: ~24TB

Eric A. Borisch · Dec 21, 2022

Can you try recreating this?

Code:

# truncate -s 4g /root/disk{a,b,c,d}
# zpool create testpool raidz1 /root/disk{a,b,c,d}
# zpool list -v testpool
NAME              SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
testpool         15.5G   191K  15.5G        -         -     0%     0%  1.00x    ONLINE  -
  raidz1-0       15.5G   191K  15.5G        -         -     0%  0.00%      -    ONLINE
    /root/diska      -      -      -        -         -      -      -      -    ONLINE
    /root/diskb      -      -      -        -         -      -      -      -    ONLINE
    /root/diskc      -      -      -        -         -      -      -      -    ONLINE
    /root/diskd      -      -      -        -         -      -      -      -    ONLINE
# zfs list testpool
NAME       USED  AVAIL     REFER  MOUNTPOINT
testpool   143K  11.2G     32.9K  /testpool

You can see on my (13.1) system, when I create a raidz1 with 4 devices (files here), it lists (for zfs list) the available space as ~ 3x single device.

kballow · Dec 21, 2022

Eric A. Borisch said:

Can you try recreating this?

Code:

# truncate -s 4g /root/disk{a,b,c,d}
# zpool create testpool raidz1 /root/disk{a,b,c,d}
# zpool list -v testpool
NAME              SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
testpool         15.5G   191K  15.5G        -         -     0%     0%  1.00x    ONLINE  -
  raidz1-0       15.5G   191K  15.5G        -         -     0%  0.00%      -    ONLINE
    /root/diska      -      -      -        -         -      -      -      -    ONLINE
    /root/diskb      -      -      -        -         -      -      -      -    ONLINE
    /root/diskc      -      -      -        -         -      -      -      -    ONLINE
    /root/diskd      -      -      -        -         -      -      -      -    ONLINE
# zfs list testpool
NAME       USED  AVAIL     REFER  MOUNTPOINT
testpool   143K  11.2G     32.9K  /testpool

You can see on my (13.1) system, when I create a raidz1 with 4 devices (files here), it lists (for zfs list) the available space as ~ 3x single device.

Wow good test to prove that the zfs raidz works as expected.

Code:

[root@X123 ~]# truncate -s 4g /root/disk{a,b,c,d}

[root@X123 ~]# zpool create testpool raidz1 /root/disk{a,b,c,d}

[root@X123 ~]# zpool list -v testpool
NAME              SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
testpool         15.5G   227K  15.5G        -         -     0%     0%  1.00x    ONLINE  -
  raidz1-0       15.5G   227K  15.5G        -         -     0%  0.00%      -    ONLINE
    /root/diska      -      -      -        -         -      -      -      -    ONLINE
    /root/diskb      -      -      -        -         -      -      -      -    ONLINE
    /root/diskc      -      -      -        -         -      -      -      -    ONLINE
    /root/diskd      -      -      -        -         -      -      -      -    ONLINE
    
[root@X123 ~]# zfs list testpool
NAME       USED  AVAIL     REFER  MOUNTPOINT
testpool   138K  11.2G     32.9K  /testpool

So this makes me think there is an issue with the SSD Size being 30TB. I wonder if there is some sort of Limit set or its an actual limitation for parity? Wonder if there is a sysctl tunable for this... more research to come apparently

Alain De Vos · Dec 22, 2022

Simple parity disks don't contain actual data.

covacat · Dec 22, 2022

you can create gnop devices on your disk of various sizes and then create the raidz
you can probably to a binary approach to see where the problem hits
so try 16TB if problem go to 8TB if not to 23TB ... etc

mer · Dec 22, 2022

OP, does dmesg or "geom disk list" give reasonably accurate sizes for the devices?

covacat · Dec 22, 2022

mer said:
OP, does dmesg or "geom disk list" give reasonably accurate sizes for the devices?

they seem ok from the pool sizes and his mirror setup

mer · Dec 22, 2022

covacat said:
they seem ok from the pool sizes and his mirror setup

I realized that after I asked, but figured it won't hurt to get confirmation.

ZFS ZFS RaidZ Capacity does not make sense

Administrator