ZFS space confusion

Hi! I just stood up a new system with a 9-disk RAIDZ3 pool and copied the data over from an existing dataset on a different server (a 3-disk RAIDZ1 pool) via zfs send.

On the old server, the dataset takes 5.32TB. On the new server, it takes 6.54TB.

As I dug in, I noticed that every file is taking up significantly more space (according to du) than it should (and did on the old system).

An example:
Code:
# ls -l .wget-hsts
-rwxrwxr-x+ 1 test_user  wheel  175 May 24  2019 .wget-hsts

# du -Ah .wget-hsts
512B    .wget-hsts

# du -h .wget-hsts
 11K    .wget-hsts

As you can see, du (without -A, i.e. not the apparent size) shows the file taking up 11K (and in fact 11K is the smallest size I see for any file).

I've researched as much as I can, but haven't found any explanation for why a 175-byte file would take 11K on disk.

The new system is using a recordsize of 128K and an ashift of 12 (these are new 4k SAS drives). So I could understand the file taking 4kB due to the sector size, but I can't figure out why it would take more. Does anyone have any idea? Happy to provide more info, of course.

(I do understand that a 9 disk RAIDZ3 isn't ideal from a space-efficiency perspective, but this system is going into a remote colo that will be difficult to get out to so I wanted the maximum redundancy I could get. As far as I can tell, that shouldn't affect the output of du; my understanding is that du calculates without taking parity into account.)

Thanks in advance!
 
I can't diagnose your problem, but can confirm you appear to have one...

My recordsize and ashift are at their default values.

My zroot is a mirror, and small files take 4.5K:
Code:
[sherman.154] $ zpool version
zfs-2.0.0-FreeBSD_gf11b09dec
zfs-kmod-2.0.0-FreeBSD_gf11b09dec
[sherman.155] $ zfs get all zroot | grep record   
zroot  recordsize            128K                   default
[sherman.156] $  zpool get all zroot | grep ashift
zroot  ashift                         0                              default
[sherman.157] $ zdb -C | grep ashift | uniq
            ashift: 12
[sherman.158] $ ls -lad .Xauthority
-rw-------  1 phil  phil  348 Mar 23 21:27 .Xauthority
[sherman.159] $ cat .Xauthority | wc -c
     348
[sherman.160] $ du -Ah  .Xauthority
512B    .Xauthority
[sherman.161] $ du -h .Xauthority
4.5K    .Xauthority
 
tl;dr: raidz3 + 4k + small files = pain.

Long version: read this first.

RAID-Z also requires that each allocation be a multiple of (p+1)

That multiple is in sectors of 2^ashift (4k here), so for raidz3 (p = 3) with ashift=12 the smallest allocation is actually 16k, regardless of how it is being reported, and that covers anything up to 4k of data (after compression). As you add more data the allocation grows monotonically, but not smoothly: there is the "multiple of four" requirement (16k, then 32k for anything up to 20k of user data), and then further overhead comes in as you fill out your stripe width.

The good news is that with larger files you approach the 6/9 efficiency you might have expected from 3x parity on 9 drives, but because RAIDZ is fundamentally different from (and more flexible than) traditional block-device RAID, you'll never quite reach it (without compression). However, if you have compressible data, and your files are not, in general, small, you can do much better than 6/9 efficiency overall with raidz.
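If it helps to see the arithmetic, here's a rough /bin/sh sketch of that allocation rule (loosely modeled on OpenZFS's vdev_raidz_asize(); it counts raw sectors only and ignores metadata and compression, and the function name is just for illustration):

Code:
#!/bin/sh
# Rough model of the raidz allocation rule: ceil(bytes/sector) data sectors,
# plus `parity` sectors per row of (width - parity) data sectors, padded up
# to a multiple of (parity + 1). Prints the raw allocation in bytes.
raidz_asize() {
    bytes=$1; ashift=$2; width=$3; parity=$4
    sector=$((1 << ashift))
    data=$(( (bytes + sector - 1) / sector ))
    rows=$(( (data + width - parity - 1) / (width - parity) ))
    total=$(( data + rows * parity ))
    mult=$(( parity + 1 ))
    total=$(( (total + mult - 1) / mult * mult ))
    echo $(( total * sector ))
}

raidz_asize 175    12 9 3   # 16384  -- a 175-byte file: 1 data + 3 parity sectors
raidz_asize 20480  12 9 3   # 32768  -- 20k of user data still fits in 32k
raidz_asize 131072 12 9 3   # 212992 -- a full 128k record: 32/52 ~ 62% efficient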

I put together a spreadsheet at one time; I think the “layout examples” tab is the best place to start:


If the drives present as 512B (emulated 512B sectors) and you care more about space efficiency than performance, you can recreate the pool with ashift=9 (512B); all the rules still apply, but the multiple-of-4 padding pain is much less.
 

Thank you for your response!

I'd actually stumbled upon the Delphix article while searching for this issue, but it didn't seem to match what I'm seeing here, and that might be because of a misunderstanding on my part.

I'd thought that du reported usage without counting the parity needed by the file. Similar to how zpool shows raw space while zfs shows the "usable" space post-parity, I'd understood du to be effectively the latter. That also makes sense, as du approximately matches what I see in zfs list.

If du is counting all blocks, including parity blocks, then this starts to make sense. Was that a misunderstanding on my part, and du does count parity space as well, or am I missing something else?

(At the same time, based on your calculations above, my smallest file should take 16k, but instead I'm seeing it at 11k. stat agrees with du, showing the small files taking 21 512-byte blocks.)

Fortunately, I have plenty of space even in this incredibly inefficient config; at my current rate of growth it should last me well over 5 years. But I hate not understanding what I'm seeing, so I'm trying to get to the bottom of it! (And these drives don't present as 512e; they're 4k only.)
 
I'm not sure what to make of du's output, but it's certainly not capturing the actual usage in terms of blocks on disk.

For fun I spun up a 9-wide file-backed raidz3 ashift=12 pool and put 200MB worth of 28k files in it (7 blocks of user data per file; each ends up needing 6 blocks of parity and three blocks of padding to hit all the requirements in this layout, so it's intentionally pathological at only 7/16 ≈ 43% efficiency of user data to space consumed on disk). You can see the actual usage quite clearly via zpool (the fs is otherwise empty):

Code:
[/testpool/fs]# du -Amd 0 .                                     
200     .
[/testpool/fs]# du -md 0 .
289     .
[/testpool/fs]# zpool list testpool
NAME       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
testpool  8.50G   471M  8.04G        -         -     0%     5%  1.00x    ONLINE  -
[/testpool/fs]# rm *
[/testpool/fs]# zpool list testpool
NAME       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
testpool  8.50G  5.20M  8.49G        -         -     0%     0%  1.00x    ONLINE  -

You can see at the top that du reports 200MB apparent usage, and 289MB without the apparent flag.

Subtracting the allocation before and after the deletion, we see that the 200MB of user-data files (along with the extra filesystem metadata that must also exist) consumed 467M, which is just shy of (a little worse than, since metadata entries need some space too) the 43% calculated above. (Note that the system only seems to update the zpool usage values periodically, every 10-15 seconds or so, so if you want to retry the experiment, have a little patience.)

For reference, the reason it needs so much is again the (parity + 1) allocation-size rule (a multiple of 4 blocks for raidz3) combined with the nine-wide geometry (at most 6 data blocks per row with 3 parity), so 7 4k user-data blocks become:
Code:
DDDDDDPPP
DPPPXXX
D = data; P = parity; X = padding to satisfy the allocation rule at 16 blocks (7/16 ≈ 43%).

This is a fairly pathological case for this layout, but files of 1, 2, or 3 user blocks (up to 12k after compression) are even worse. If you're mainly writing large files and hitting the 128k record size, you'll get about 62% efficiency (close to the 6/9 limit for 3 parity on 9 devices). You can eke a bit more out by increasing the record size further (but only for large files that will use it). And as mentioned in the post I linked above, if you've got compressible data, you can do much better (but the files still need to be large enough to benefit...).
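If anyone wants to reproduce the file-backed experiment, something along these lines should do it (the /var/tmp paths and the file count here are made up for illustration, and the pool is a throwaway):

Code:
# nine sparse 1G files as vdevs, forced to ashift=12
for i in 1 2 3 4 5 6 7 8 9; do truncate -s 1G /var/tmp/rz3-$i; done
zpool create -o ashift=12 testpool raidz3 /var/tmp/rz3-1 /var/tmp/rz3-2 \
    /var/tmp/rz3-3 /var/tmp/rz3-4 /var/tmp/rz3-5 /var/tmp/rz3-6 \
    /var/tmp/rz3-7 /var/tmp/rz3-8 /var/tmp/rz3-9
zfs create testpool/fs

# roughly 200MB of 28k (7-block) incompressible files
i=0
while [ $i -lt 7300 ]; do
    head -c 28672 /dev/random > /testpool/fs/file$i
    i=$((i + 1))
done

sync; sleep 15                  # pool stats only update every so often
du -Amd 0 /testpool/fs          # apparent size
du -md 0 /testpool/fs           # size according to du
zpool list testpool             # ALLOC shows the real cost, parity and padding included
zpool destroy testpool          # clean up, then rm /var/tmp/rz3-*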
 
The space occupied by boot environments can be slightly mind-bending.

Here, at the time of writing, each non-active environment occupies 255 M of space, or more:

Code:
% bectl list -c creation
BE                    Active Mountpoint Space Created
n250511-5f73b3338ee-d -      -          4.94G 2021-11-13 15:43
n252381-75d20a5e386-b -      -          6.81G 2022-01-12 23:23
n252450-5efa7281a79-a -      -          6.49G 2022-01-14 19:27
n252483-c8f8299a230-b -      -          4.84G 2022-01-17 14:24
n252505-cc68614da82-a -      -          4.90G 2022-01-18 14:26
n252531-0ce7909cd0b-h -      -          5.71G 2022-02-06 12:24
n252997-b6724f7004c-c -      -          6.17G 2022-02-11 23:07
n253116-39a36707bd3-e -      -          5.66G 2022-02-20 07:03
n253343-9835900cb95-c -      -          1.54G 2022-02-27 14:58
n253627-25375b1415f-e -      -          4.58G 2022-03-12 18:20
n253776-d5ad1713cc3-b -      -          1.56G 2022-03-18 09:31
n253861-92e6b4712b5-a -      -          255M  2022-03-19 07:40
n253861-92e6b4712b5-b NR     /          166G  2022-03-21 12:38
%

With zfs-list(8), the USED column:

Code:
% zfs list -o space | head -n 16
NAME                                      AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
august                                     221G   662G        0B     88K             0B       662G
august/ROOT                                221G   166G        0B     88K             0B       166G
august/ROOT/n250511-5f73b3338ee-d          221G  7.21M        0B   7.21M             0B         0B
august/ROOT/n252381-75d20a5e386-b          221G  5.03M        0B   5.03M             0B         0B
august/ROOT/n252450-5efa7281a79-a          221G  3.27M        0B   3.27M             0B         0B
august/ROOT/n252483-c8f8299a230-b          221G  5.46M        0B   5.46M             0B         0B
august/ROOT/n252505-cc68614da82-a          221G  3.20M        0B   3.20M             0B         0B
august/ROOT/n252531-0ce7909cd0b-h          221G  1.90M        0B   1.90M             0B         0B
august/ROOT/n252997-b6724f7004c-c          221G  1.87M        0B   1.87M             0B         0B
august/ROOT/n253116-39a36707bd3-e          221G  7.15M        0B   7.15M             0B         0B
august/ROOT/n253343-9835900cb95-c          221G  2.89M        0B   2.89M             0B         0B
august/ROOT/n253627-25375b1415f-e          221G  6.92M        0B   6.92M             0B         0B
august/ROOT/n253776-d5ad1713cc3-b          221G  2.49M        0B   2.49M             0B         0B
august/ROOT/n253861-92e6b4712b5-a          221G  2.82M        0B   2.82M             0B         0B
august/ROOT/n253861-92e6b4712b5-b          221G   166G      114G   52.1G             0B         0B
%

– each non-active environment uses 7.21 M or less:

Code:
% zfs list -o space -S used | grep ROOT/n
august/ROOT/n253861-92e6b4712b5-b          221G   166G      114G   52.1G             0B         0B
august/ROOT/n250511-5f73b3338ee-d          221G  7.21M        0B   7.21M             0B         0B
august/ROOT/n253116-39a36707bd3-e          221G  7.15M        0B   7.15M             0B         0B
august/ROOT/n253627-25375b1415f-e          221G  6.92M        0B   6.92M             0B         0B
august/ROOT/n252483-c8f8299a230-b          221G  5.46M        0B   5.46M             0B         0B
august/ROOT/n252381-75d20a5e386-b          221G  5.03M        0B   5.03M             0B         0B
august/ROOT/n252450-5efa7281a79-a          221G  3.27M        0B   3.27M             0B         0B
august/ROOT/n252505-cc68614da82-a          221G  3.20M        0B   3.20M             0B         0B
august/ROOT/n253343-9835900cb95-c          221G  2.89M        0B   2.89M             0B         0B
august/ROOT/n253861-92e6b4712b5-a          221G  2.82M        0B   2.82M             0B         0B
august/ROOT/n253776-d5ad1713cc3-b          221G  2.49M        0B   2.49M             0B         0B
august/ROOT/n252531-0ce7909cd0b-h          221G  1.90M        0B   1.90M             0B         0B
august/ROOT/n252997-b6724f7004c-c          221G  1.87M        0B   1.87M             0B         0B
%

Clearer, with snapshot information shown by bectl(8), but still mind-bending:

Code:
% bectl list -c creation -s
BE/Dataset/Snapshot                                         Active Mountpoint Space Created

n250511-5f73b3338ee-d
  august/ROOT/n250511-5f73b3338ee-d                         -      -          7.21M 2021-11-13 15:43
    august/ROOT/n253861-92e6b4712b5-b@2021-11-14-00:24:29-0 -      -          4.93G 2021-11-14 00:24

n252381-75d20a5e386-b
  august/ROOT/n252381-75d20a5e386-b                         -      -          5.03M 2022-01-12 23:23
    august/ROOT/n253861-92e6b4712b5-b@2022-01-14-19:27:21-0 -      -          6.80G 2022-01-14 19:27

n252450-5efa7281a79-a
  august/ROOT/n252450-5efa7281a79-a                         -      -          3.27M 2022-01-14 19:27
    august/ROOT/n253861-92e6b4712b5-b@2022-01-17-04:49:36-0 -      -          6.49G 2022-01-17 04:49

n252483-c8f8299a230-b
  august/ROOT/n252483-c8f8299a230-b                         -      -          5.46M 2022-01-17 14:24
    august/ROOT/n253861-92e6b4712b5-b@2022-01-18-14:26:02-0 -      -          4.83G 2022-01-18 14:26

n252505-cc68614da82-a
  august/ROOT/n252505-cc68614da82-a                         -      -          3.20M 2022-01-18 14:26
    august/ROOT/n253861-92e6b4712b5-b@2022-01-19-16:22:12-0 -      -          4.90G 2022-01-19 16:22

n252531-0ce7909cd0b-h
  august/ROOT/n252531-0ce7909cd0b-h                         -      -          1.90M 2022-02-06 12:24
    august/ROOT/n253861-92e6b4712b5-b@2022-02-07-11:25:41-0 -      -          5.71G 2022-02-07 11:25

n252997-b6724f7004c-c
  august/ROOT/n252997-b6724f7004c-c                         -      -          1.87M 2022-02-11 23:07
    august/ROOT/n253861-92e6b4712b5-b@2022-02-12-17:19:08-0 -      -          6.17G 2022-02-12 17:19

n253116-39a36707bd3-e
  august/ROOT/n253116-39a36707bd3-e                         -      -          7.15M 2022-02-20 07:03
    august/ROOT/n253861-92e6b4712b5-b@2022-02-23-00:42:44-0 -      -          5.65G 2022-02-23 00:42

n253343-9835900cb95-c
  august/ROOT/n253343-9835900cb95-c                         -      -          2.89M 2022-02-27 14:58
    august/ROOT/n253861-92e6b4712b5-b@2022-03-05-15:47:28-0 -      -          1.54G 2022-03-05 15:47

n253627-25375b1415f-e
  august/ROOT/n253627-25375b1415f-e                         -      -          6.92M 2022-03-12 18:20
    august/ROOT/n253861-92e6b4712b5-b@2022-03-14-23:40:12-0 -      -          4.58G 2022-03-14 23:40

n253776-d5ad1713cc3-b
  august/ROOT/n253776-d5ad1713cc3-b                         -      -          2.49M 2022-03-18 09:31
    august/ROOT/n253861-92e6b4712b5-b@2022-03-19-07:40:16-0 -      -          1.56G 2022-03-19 07:40

n253861-92e6b4712b5-a
  august/ROOT/n253861-92e6b4712b5-a                         -      -          2.82M 2022-03-19 07:40
    august/ROOT/n253861-92e6b4712b5-b@2022-03-21-12:38:37   -      -          253M  2022-03-21 12:38

n253861-92e6b4712b5-b
  august/ROOT/n253861-92e6b4712b5-b                         NR     /          166G  2022-03-21 12:38
  n253861-92e6b4712b5-b@2021-07-10-04:31:39-0               -      -          13.8G 2021-07-10 04:31
  n253861-92e6b4712b5-b@2021-11-13-15:43:33-0               -      -          4.94G 2021-11-13 15:43
  n253861-92e6b4712b5-b@2021-11-14-00:24:29-0               -      -          4.93G 2021-11-14 00:24
  n253861-92e6b4712b5-b@2022-01-14-19:27:21-0               -      -          6.80G 2022-01-14 19:27
  n253861-92e6b4712b5-b@2022-01-17-04:49:36-0               -      -          6.49G 2022-01-17 04:49
  n253861-92e6b4712b5-b@2022-01-18-14:26:02-0               -      -          4.83G 2022-01-18 14:26
  n253861-92e6b4712b5-b@2022-01-19-16:22:12-0               -      -          4.90G 2022-01-19 16:22
  n253861-92e6b4712b5-b@2022-02-07-11:25:41-0               -      -          5.71G 2022-02-07 11:25
  n253861-92e6b4712b5-b@2022-02-12-17:19:08-0               -      -          6.17G 2022-02-12 17:19
  n253861-92e6b4712b5-b@2022-02-23-00:42:44-0               -      -          5.65G 2022-02-23 00:42
  n253861-92e6b4712b5-b@2022-03-05-15:47:28-0               -      -          1.54G 2022-03-05 15:47
  n253861-92e6b4712b5-b@2022-03-07-03:48:38-0               -      -          838M  2022-03-07 03:48
  n253861-92e6b4712b5-b@2022-03-14-23:40:12-0               -      -          4.58G 2022-03-14 23:40
  n253861-92e6b4712b5-b@2022-03-19-07:40:16-0               -      -          1.56G 2022-03-19 07:40
  n253861-92e6b4712b5-b@2022-03-21-12:38:37                 -      -          253M  2022-03-21 12:38
%
 
I'm not sure what to make of du's output, but it's certainly not capturing the actual usage in terms of blocks on disk.

For fun I spun up a 9-wide file-backed raidz3 ashift=12 pool and put 200MB worth of 28k files in it (7 blocks of user data per file; each ends up needing 6 blocks of parity and three blocks of padding to hit all the requirements in this layout, so it's intentionally pathological at only 7/16 ≈ 43% efficiency of user data to space consumed on disk).
This is great, thank you.

I did something similar and ran a couple of different experiments this morning. (Unfortunately I don't have zpool numbers, because I couldn't add new disks and the existing pool has data I'd prefer not to destroy.) But zfs list and du mostly agree, and I'm seeing crazy numbers at the low end that are far beyond the pathological case that the article and spreadsheet calculate.

The first experiment essentially replicated what you did above, with 24k files:

Code:
root@[/mnt/core/tmp]# i=0; while [ $i -lt 1000 ]; do cat /dev/random |head -c 24576 > file$i; let i=$i+1; done
root@[/mnt/core/tmp]# ls |wc -l  
    1000
root@[/mnt/core/tmp]# du -sh
 32M    .
root@[/mnt/core/tmp]# du -shA
 23M    .
root@[/mnt/core/tmp]# du -sh file0
 30K    file0
root@[/mnt/core/tmp]# du -shA file0
 24K    file0
root@[/mnt/core/tmp]# zfs list -o space core/tmp
NAME      AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
core/tmp  43.7T  33.1M        0B   33.1M             0B         0B

So I'm getting approximately the same efficiency as you have above (comparing du -A against du, where du matches zfs list).


The second experiment was creating 1000 files of only 24B apiece on a new dataset.
Code:
root@[/mnt/core/tmp]# i=0; while [ $i -lt 1000 ]; do cat /dev/random |head -c 24 > file$i; let i=$i+1; done
root@[/mnt/core/tmp]# ls |wc -l
    1000
root@[/mnt/core/tmp]# du -sh
 13M    .
root@[/mnt/core/tmp]# du -shA
501K    .
root@[/mnt/core/tmp]# zfs list -o space core/tmp                                                           
NAME      AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
core/tmp  43.7T  13.9M        0B   13.9M             0B         0B

root@[/mnt/core/tmp]# du -h file0
 11K    file0
root@[/mnt/core/tmp]# du -Ah file0
512B    file0
root@[/mnt/core/tmp]# ls -l file0
-rw-r--r--  1 root  wheel  24 Mar 25 06:51 file0

So... you can see that, NOT including parity (based on zfs list), these files are taking up 13M, with 501kB apparent. That's a 26x difference, which doesn't map to anything discussed in the Delphix article as far as I can see.

Given the 4k disks, I could imagine the actual pre-parity size being 4MB, and while I don't fully understand it yet, maybe metadata adds another 4M. But from your example above it's clear that zfs list does not include parity, given the comparison you showed with zpool. I'd think the padding referred to in the article would count as part of the parity overhead that doesn't show up in zfs list or du... and it almost can't be counted here anyway, since I'm not seeing the 16 blocks mentioned, just 11. (And I think that with a 1-block file there wouldn't need to be any padding, since it would be just 1 data block and 3 parity blocks, taking up 16kB of space.)

So I'm still quite confused.
 
Another interesting data point: when I pull from /dev/zero instead of /dev/random, du -sh (for 1000 files) reports 3M (which is probably correct, rounded to the nearest MiB), since each file will take a single block with compression. So... a 24B file from /dev/random takes up 11kB, while a 24B file from /dev/zero (with compression) takes up 512B according to du (which seems to report in 512B units). I'm not sure how this fits into the puzzle above, but it suggests that somehow real data is taking up a lot more real space.
 
Small enough files (<112B) can take advantage of embedded_data (see zpool-features(7)) and get stored where the block pointer typically lives in the meta-structure of ZFS, just to make things more confusing. ;)
That's even more confusing! That suggests that a 24B file shouldn't take up any data blocks at all! (I verified that "embedded_data" was enabled on this pool).

I should stick with something easy like quantum mechanics....
 