ZFS RaidZ Capacity does not make sense

Yeah, the 30TB NVMe SSDs report the correct sizes in every configuration except raidz, raidz2, and raidz3.
I am trying to go back to VROC RAID5 now, as I have run out of ideas about what the issue could be after playing with every tunable ZFS has.
I have another machine with a similar setup coming, and I will try raidz with different sizes of fake disks to see where the issue tops out, and share the results with everyone.
 
OK all, so I got the second box. Check out these interesting findings.
1. First I created four 28TB partitions, one per disk, mounted as /Disk1 through /Disk4 (a sketch of the partitioning commands follows the df output):

Code:
[root@X124 /]# df -h
Filesystem     Size    Used   Avail Capacity  Mounted on
/dev/nvd0p4     46G     19G     24G    44%    /
devfs          1.0K    1.0K      0B   100%    /dev
fdescfs        1.0K    1.0K      0B   100%    /dev/fd
procfs         4.0K    4.0K      0B   100%    /proc
/dev/nvd0p8    670G    696M    616G     0%    /data
/dev/nvd1p8    670G     24K    616G     0%    /data2
/dev/nvd0p6     31G    6.6G     22G    23%    /tmp
/dev/nvd0p5     31G     20M     28G     0%    /var
/dev/nvd2p1     27T     32M     25T     0%    /Disk1 <- Here
/dev/nvd3p1     27T     32M     25T     0%    /Disk2 <- Here
/dev/nvd4p1     27T     32M     25T     0%    /Disk3 <- Here
/dev/nvd5p1     27T     32M     25T     0%    /Disk4 <- Here
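
A rough sketch of how one of these partitions could have been created and mounted (the exact commands aren't in the post; the device name and flags below are placeholders):

Code:
# hypothetical reconstruction of step 1 for a single disk
gpart create -s gpt nvd2                      # put a GPT label on the disk
gpart add -t freebsd-ufs nvd2                 # one partition using the whole disk -> nvd2p1
newfs -U /dev/nvd2p1                          # UFS with soft updates
mkdir -p /Disk1 && mount /dev/nvd2p1 /Disk1   # mount where the df output shows it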

2. Created a 24TB fake file on each one and made a raidz out of them (a note on the reported sizes follows the listing):
Code:
[root@X124 /]# truncate -s 24TB /Disk1/data

[root@X124  /]# truncate -s 24TB /Disk2/data

[root@X124  /]# truncate -s 24TB /Disk3/data

[root@X124  /]# truncate -s 24TB /Disk4/data

[root@X124  /]# zpool create Raidz raidz /Disk1/data /Disk2/data /Disk3/data /Disk4/data

[root@X124  /]# zpool list -v
NAME              SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
Raidz            96.0T   230K  96.0T        -         -     0%     0%  1.00x    ONLINE  -
  raidz1-0       96.0T   230K  96.0T        -         -     0%  0.00%      -    ONLINE
    /Disk1/data      -      -      -        -         -      -      -      -    ONLINE
    /Disk2/data      -      -      -        -         -      -      -      -    ONLINE
    /Disk3/data      -      -      -        -         -      -      -      -    ONLINE
    /Disk4/data      -      -      -        -         -      -      -      -    ONLINE
    
[root@X124  /]# zfs list
NAME    USED  AVAIL     REFER  MOUNTPOINT
Raidz   143K  71.7T     32.9K  /Raidz
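
A side note on reading those numbers: zpool list reports raw pool space with parity included, while zfs list reports usable space after parity and ZFS's internal reservation. Assuming the commands below are run on the same box, a quick cross-check looks like this:

Code:
# the four 24T backing files add up to the 96.0T raw SIZE zpool reports (parity included)
# a 4-wide raidz1 keeps one disk's worth of parity, so the usable space is about:
echo "96 * 3 / 4" | bc                          # 72, matching the ~71.7T AVAIL (less ZFS's reservation)
# exact byte counts, for comparison:
zpool list -p -o name,size Raidz
zfs get -p -o name,property,value available Raidz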

I confirmed it works by writing some data

Code:
[root@X124 /Raidz]# dd if=/dev/zero of=test.img bs=1M
^C44164+0 records in
44163+0 records out
46308261888 bytes transferred in 62.660866 secs (739030030 bytes/sec)

[root@X124 /Raidz]# df -h
Filesystem     Size    Used   Avail Capacity  Mounted on
/dev/nvd0p4     46G     19G     24G    44%    /
devfs          1.0K    1.0K      0B   100%    /dev
fdescfs        1.0K    1.0K      0B   100%    /dev/fd
procfs         4.0K    4.0K      0B   100%    /proc
/dev/nvd0p8    670G    696M    616G     0%    /data
/dev/nvd1p8    670G     24K    616G     0%    /data2
/dev/nvd0p6     31G    6.6G     22G    23%    /tmp
/dev/nvd0p5     31G     20M     28G     0%    /var
/dev/nvd2p1     27T     14G     25T     0%    /Disk1
/dev/nvd3p1     27T     14G     25T     0%    /Disk2
/dev/nvd4p1     27T     14G     25T     0%    /Disk3
/dev/nvd5p1     27T     14G     25T     0%    /Disk4
Raidz           72T     42G     72T     0%    /Raidz
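
The per-file usage above is consistent with raidz1 parity overhead too; a rough check, assuming the ~42G on the Raidz line is the payload and parity is spread evenly across the four backing files:

Code:
echo "scale=1; 42 * 4 / 3 / 4" | bc             # 14.0, matching the 14G shown for each /Disk*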

So in the end I got 72TB of usable space, which is more than using the disks directly, which gave me 56TB. Note that I could have squeezed another 1TB out of each disk, since I made each file 24TB instead of the maximum 25TB each GPT partition had available, which puts the real value closer to 75TB.
We are now within 5TB of the expected 80TB result.
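
For what it's worth, df -h and zfs list use binary units (1T = 2^40 bytes) while drive vendors use decimal terabytes, so if the 80TB target is a decimal figure, the gap is mostly a units difference. A quick conversion, purely illustrative:

Code:
echo "scale=2; 72 * 2^40 / 10^12" | bc          # ~79.16, so 72T binary is roughly 79 decimal TB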

So what gives?
 
Can you recreate the raid with the /dev/nvd{2,5}p1 devices instead of the files on those partitions and see if the capacity problem persists?
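
Something along these lines, as a sketch (it destroys the file-backed test pool and reuses the raw GPT partitions, so the exact steps and flags are assumptions):

Code:
zpool destroy Raidz
umount /Disk1 /Disk2 /Disk3 /Disk4              # the partitions currently hold mounted filesystems
zpool create -f Raidz raidz nvd2p1 nvd3p1 nvd4p1 nvd5p1
zpool list -v
zfs list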
 
That is interesting. Lots of people have differing opinions on "should I give ZFS the whole device or just a partition?" My opinion, formed a long time ago, is "partitions". One reason is that you can guarantee partitions are the same size, but you can't always guarantee whole devices are. When creating partitions you can also set alignment values, which improves performance (it eliminates write amplification).
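
For example (just a sketch; the alignment value and partition type are illustrative), a ZFS partition with an explicit alignment could be created like this:

Code:
# -a 1m aligns the partition start (and size) to 1 MiB boundaries,
# so writes line up with the SSD's internal page/erase-block layout
gpart create -s gpt nvd2
gpart add -t freebsd-zfs -a 1m nvd2             # uses the rest of the disk -> nvd2p1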
But that's just me, my opinion; others may differ.
 