ZFS vastly different dataset size on FreeBSD vs illumos/SmartOS

I am currently replacing a lot of old backup patchwork made up of AMANDA jobs and lots of shell scripts with ZFS-based backups.
One usage scenario for ZFS replication is our SmartOS zones, where the zone datasets (mostly zvols for KVM VMs) are snapshotted and replicated via send|receive to a storage server. With these datasets I see a _much_ higher disk usage on FreeBSD compared to the original datasets on illumos/SmartOS.

Both pools use the same ashift (=12 for 4k drive alignment) and datasets are set to use LZ4 compression. However, actual used space on FreeBSD for the replicated datasets is nearly twice as high (USED & REFER) as on illumos.

smartOS # uname -a
SunOS vhost1 5.11 joyent_20170706T001501Z i86pc i386 i86pc
smartOS # zfs list -o name,used,lused,refer,usedbysnapshots,compress,compressratio,dedup zones/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk0
NAME                                               USED  LUSED  REFER  USEDSNAP  COMPRESS  RATIO          DEDUP
zones/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk0  57.0G  94.9G  45.7G     11.3G       lz4  1.68x            off

FBSD # uname -a
FreeBSD stor1 11.0-RELEASE-p10 FreeBSD 11.0-RELEASE-p10 #5 r309898M: Fri May  5 12:14:20 CEST 2017   root@stor1:/usr/obj/usr/src/sys/NETGRAPH_VIMAGE  amd64
FBSD # zfs list -o name,used,lused,refer,usedbysnapshots,compress,compressratio,dedup -r stor1/backups/zones/winsrv1/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk0
NAME                                                                     USED  LUSED  REFER  USEDSNAP  COMPRESS  RATIO          DEDUP
stor1/backups/zones/winsrv1/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk0  97,4G  95,0G  78,3G     19,1G       lz4  1.31x            off
This output is from a freshly replicated dataset ( ssh zfs send -peDR | zfs recv -ue).

I can see similar behavior with other replicated datasets where FreeBSD is using ~70-100% more disk space than the original dataset on smartOS.

To eliminate some variables introduced by the replication stream, I transferred just the current state of the dataset to the FreeBSD host:
FBSD # ssh root@ zfs send -e zones/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk0@--head-- | zfs recv -ue stor1/test
FBSD # zfs list -o name,used,lused,refer,compress,compressratio -r stor1/test/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk0
NAME                                                    USED  LUSED  REFER  COMPRESS  RATIO
stor1/test/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk0  76,1G  75,9G  76,1G       lz4  1.34x
Also note how LUSED is lower than USED and/or REFER - shouldn't it always be higher when compression is used? At least that's what I see with every "native" and replicated dataset from other FreeBSD systems, except for very small datasets (just a few KB), where more metadata than actual data is being written...

Let's see what happens when sending this dataset back to smartOS:
FBSD # zfs send -e stor1/test/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk0 | ssh root@ zfs recv -ue zones/test

smartOS # zfs list -ro name,used,lused,refer,compress,compressratio zones/test/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk0
NAME                                                    USED  LUSED  REFER  COMPRESS  RATIO
zones/test/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk0  43.3G  75.7G  43.3G       lz4  1.76x
smartOS # zfs list -ro name,used,lused,refer,compress,compressratio zones/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk0@--head--
NAME                                                        USED  LUSED  REFER  COMPRESS  RATIO
zones/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk0@--head--  1.20G      -  43.3G         -  1.76x

REFER is back to its original size; COMPRESSRATIO is back up at 1.76x just as the original source.
FreeBSD used nearly 33GB more space (+75% !!) for the same data...

For one, the LZ4 implementation in FreeBSD seems to be much less efficient than the one illumos is using (assuming the COMPRESSRATIO is correct). But even this wouldn't account for all the additional space used on FreeBSD, so there may be another, additional problem? (Metadata?)

Can anyone try ZFS send|receive between FreeBSD and illumos machines and confirm (or refute) these findings?

I've never really looked behind the curtains of ZFS (or any filesystem at all...), so if any dev or filesystem wizard could provide me with some insight on where to start figuring out what is going wrong here, I'd be happy to try to shed some light on this (possible) issue.
And what's the blocksize property set to on each system? Are they the same? Have you gone through all the properties on each pool/dataset on each system to make sure they are the same?

It's interesting that the compression ratios on the two systems are so different.
I'm guessing (while we wait for a response) that you've changed the layout in the transition here, from, for example, a mirror to a raidz, or from raidz to raidz2... With zvols (which tend to have volblocksize << 128k) and ashift=12, the actual layout efficiency for raidz* varies greatly with raidz width.
Are the configurations (raidzn?; number of devices, etc.) you are comparing the same?

Oh my... I completely forgot to compare that, because all my/our pools are usually made up of mirror vdevs - EXCEPT for this pool on this SmartOS host, which is a single 3-disk raidz1 vdev.
If "zfs list" really shows actual "physical" disk usage, not "filesystem" usage, this may well be part of the culprit here.
Doing the math (if I understood the concepts correctly), the numbers roughly add up:
43.3G on raidz1 with about 66% disk space efficiency and 1.76 CR = 50.28G "raw" data
76.1G on (multiple) mirrors with roughly 50% space efficiency and 1.34 CR = 50.92G "raw" data
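Following the post's own back-of-envelope reasoning, the arithmetic can be written out as a quick sketch (the `estimated_raw_gb` helper and the 66%/50% layout-efficiency factors are this post's assumptions, not exact ZFS accounting):

```python
# Rough cross-check: undo the assumed layout efficiency and the reported
# compression ratio to estimate the raw, uncompressed payload size.
def estimated_raw_gb(used_gb, layout_efficiency, compressratio):
    return used_gb * layout_efficiency * compressratio

# SmartOS: 43.3G USED on a 3-disk raidz1 (~66% efficiency assumed), CR 1.76x
smartos = estimated_raw_gb(43.3, 0.66, 1.76)
# FreeBSD: 76.1G USED on mirror vdevs (~50% efficiency assumed), CR 1.34x
freebsd = estimated_raw_gb(76.1, 0.50, 1.34)
print(round(smartos, 2), round(freebsd, 2))  # both land near ~50G "raw"
```

Both estimates end up within less than a gigabyte of each other, which is why the numbers "roughly add up" here.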

So at least part of the fuzz was my fault for not looking at the vdev configurations. Sorry about that :oops:

But there still is a ~0.3 difference in LZ4 compression ratio between illumos and FreeBSD... I'll try to set up another system/VM with a single 3-disk raidz1 pool to check if this is another side effect of the layout or if there really is a difference between the two operating systems. (Maybe I can also throw in a Linux VM for further comparison.)
Just a quick update, as I haven't had much time to investigate this further yet.

I just made an incremental send|receive for the delta between those latest 2 snapshots on the smartOS host:
NAME                                                                       USED  AVAIL  REFER  MOUNTPOINT
zones/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk1@2017-10-02_01.25.00--5d   604K      -   214G  -
zones/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk1@2017-10-04_11.04.04--5d  1.56M      -   217G  -

FBSD # ssh root@ zfs send -I @2017-10-02_01.25.00--5d zones/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk1@2017-10-04_11.04.04--5d | mbuffer -s 128 -m 5G | zfs recv -ue stor1/backups/zones/winsrv1
in @  0.0 KiB/s, out @ 66.8 MiB/s,  107 GiB total, buffer   0% full
summary:  107 GiByte in 29min 37.5sec - average of 61.8 MiB/s

107GiB transferred for a delta of 1.56MB. What?

I already have an HBA + some SSDs lying on my desk to put in my workstation, so I can set up 2 clean VMs with identical pool configurations and try to reproduce this behaviour.
I'm on field service tomorrow and away on Friday, so it will be next week before I can spend some time on this issue. Meanwhile I'm grateful for any ideas or comments on what might be wrong here.
You want to look at ‘written’ for the delta between snapshots, not ‘used’, which is roughly the space you would free up by deleting the snapshot.
Sorry, another case of "don't test anything if the coffee hasn't kicked in yet" :oops:
The actual delta is 80.2G:
# zfs list -o name,written -rt snapshot zones/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk1
NAME                                                                      WRITTEN
zones/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk1@2017-10-02_01.25.00--5d     613K
zones/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk1@2017-10-04_11.04.04--5d    80.2G

So 107GiB transferred is still ~30% more.
The delta on the FreeBSD host however is even larger:
# zfs list -rt snapshot -o name,written stor1/backups/zones/winsrv1/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk1                                             
NAME                                                                                            WRITTEN
stor1/backups/zones/winsrv1/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk1@2017-10-02_01.25.00--5d     368K
stor1/backups/zones/winsrv1/c86060a8-15b3-c641-d3e9-9cb03a1d6878-disk1@2017-10-04_11.04.04--5d     120G

Before making any further tests or assumptions, I'm going to set up those clean, new VMs and pools so I can eliminate some variables between the systems/pools first.
zpool list -v and zfs get all [zvol] from both systems would be helpful for this discussion.

The source’s written size was likely smaller than the send size due to compression. (There is likely also some minor overhead for the stream, but my bet is on compression for the bulk of the difference here.)
On Tuesday I finally found the time to set up 2 VMs and started some testing...
Initially each VM has 2x 400GB zvols as backing storage, configured as a mirror with ashift=12.

root@fbsdtest:~ # uname -a
FreeBSD fbsdtest 11.1-RELEASE FreeBSD 11.1-RELEASE #0 r321309: Fri Jul 21 02:08:28 UTC 2017     root@releng2.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC  amd64
root@fbsdtest:~ # zdb | egrep 'ashift| name'
    name: 'test'
            ashift: 12
root@fbsdtest:~ # zpool list -v
test         399G   316K   399G         -     0%     0%  1.00x  ONLINE  -
  mirror     399G   316K   399G         -     0%     0%
    ada1         -      -      -         -      -      -
    ada2         -      -      -         -      -      -

[root@smarttest ~]# uname -a
SunOS smarttest 5.11 joyent_20170928T144204Z i86pc i386 i86pc
[root@smarttest ~]# zdb | egrep 'ashift| name'
    name: 'zones'
            ashift: 12
[root@smarttest ~]# zpool list -v
zones        398G  1.00G   397G         -     0%     0%  1.00x  ONLINE  -
  mirror     398G  1.00G   397G         -     0%     0%
    c1t1d0      -      -      -         -      -      -
    c1t2d0      -      -      -         -      -      -

For testing I replicated a zvol from our SmartOS host to 'smarttest' (in hindsight I should have chosen a smaller one :oops:). To check whether LZ4 behaves differently on illumos vs FreeBSD, this baseline volume is not compressed. I then locally send|receive it into a volume with compression=lz4, so I end up with an uncompressed and an LZ4-compressed volume on 'smarttest':
[root@smarttest ~]# zfs list -rt volume -o name,used,lused,refer,usedsnap,compress,ratio,dedup zones/test
zones/test/lz4/test0  73.1G  95.1G  58.3G     14.8G       lz4  1.31x
zones/test/test0      95.4G  95.0G  75.9G     19.5G       off  1.00x

Now I transfer both of these volumes to 'fbsdtest' via send -eLR | recv -ue. To cross-check whether received vs "native" LZ4 makes any difference, I also locally send|receive the uncompressed volume into another compressed dataset, ending up with one uncompressed and two LZ4-compressed volumes:
root@fbsdtest:~ # zfs list -rt volume -o name,used,lused,refer,usedsnap,compress,ratio test
test/lz4/test0_native  73.1G  95.1G  58.3G     14.8G       lz4  1.31x
test/lz4/test0_recv    73.1G  95.1G  58.3G     14.8G       lz4  1.31x
test/test0             95.4G  95.0G  75.9G     19.5G       off  1.00x
So this is just how it should look - the volume sizes all match perfectly.

The only variation left on the production systems is the vdev configuration of the pools. To compare its impact, I added 7x 200G vdevs to 'fbsdtest' to create a 2x2 mirror pool (as on the production SmartOS host) and a 3-disk RAIDZ1 pool (as on the production FreeBSD storage/backup server) and propagated the volumes to these pools. To compare the used space on all 3 pools, I also removed one of the compressed volumes from the earlier test.
So the pools and their contents now look like this:
root@fbsdtest:~ # zpool list -vo name,size,alloc,free,cap                                                                                                                                     
test         399G   169G   230G    42%
  mirror     399G   169G   230G         -    27%    42%
    ada1        -      -      -         -      -      -
    ada2        -      -      -         -      -      -
test2x2      398G   168G   230G    42%
  mirror     199G  85.4G   114G         -    29%    42%
    ada3        -      -      -         -      -      -
    ada4        -      -      -         -      -      -
  mirror     199G  82.9G   116G         -    28%    41%
    ada5        -      -      -         -      -      -
    ada6        -      -      -         -      -      -
test3z1      596G   336G   260G    56%
  raidz1     596G   336G   260G         -    33%    56%
    ada7        -      -      -         -      -      -
    ada8        -      -      -         -      -      -
    ada9        -      -      -         -      -      -

root@fbsdtest:~ # zfs list -rt volume -o name,used,lused,refer,usedsnap,compress,ratio
test/lz4/test0_native     73.1G  95.1G  58.3G     14.8G       lz4  1.31x
test/test0                95.4G  95.0G  75.9G     19.5G       off  1.00x
test2x2/lz4/test0_native  72.7G  94.6G  58.3G     14.4G       lz4  1.68x
test2x2/test0             95.4G  95.1G  75.9G     19.6G       off  1.00x
test3z1/lz4/test0_native  96.9G  95.1G  77.5G     19.4G       lz4  1.31x
test3z1/test0              127G  95.0G   101G     25.8G       off  1.00x

root@fbsdtest:~ # zpool list -o name,size,alloc,free,cap
test      399G   169G   230G    42%
test2x2   398G   168G   230G    42%
test3z1   596G   336G   260G    56%

To cross-check whether the numbers are reported the same on illumos, I attached the disks/volumes for the pools "test2x2" and "test3z1" to the SmartOS VM - the numbers are all identical, so there is no difference in the reporting logic.

So on the single-mirror pool and the 2x2 mirror pool the same data uses the same amount of space - 169G, or 42% of the pool's usable capacity of ~400G (2x400G mirror and 2x2x200G mirrors).
Looking at the RAIDZ1 pool, it has a reported size of 596G. According to various sources, this is due to the implementation of RAIDZ, which includes the parity data in these numbers (size and allocated space) - so for a 3-disk pool the reported numbers are ~33% higher. For the pool size this is perfectly plausible, but allocated space * 0.66 = 221G and therefore still way off...
However I try to mangle the numbers, RAIDZ always ends up using more space for the same data. Either there is some logic (magic) in the reported space allocation for RAIDZ I don't understand, or RAIDZ is - at least for such small vdevs - very inefficient.

But at least I ruled out variations in ZFS behaviour between platforms - that's a good thing, and I was sceptical of that theory from the beginning, given it is basically the same OpenZFS codebase.
The allocations in raidzn must occur in multiples of (n+1) * 2**ashift bytes - in your case, multiples of 8KB. This is done to ensure that when an allocation is removed, the space can be reused without relying on adjacent allocations being freed. Yes, this is different from block-based raid5 or raid6.

What is your volblocksize? I'm guessing 4k or 8k. Here's why:

With D data, P parity, and X padding to round the allocation up to a multiple of (n+1) sectors (2**ashift = 4K each), and ‘.’ part of some other allocation, the allocations and efficiencies (payload / allocation size) for your 3-wide z1 are:

      +- DISKS -+
PAYLD 012 012 012  TOT  Alloc Efficiency (Payload / Total allocation)
4k    DP. ... ...  8k   50% (Worst case)
8k    DDP X.. ...  16k  50%
12k   DDP DPX ...  24k  50%
16k   DDP DDP ...  24k  67% (Best case == 3-wide raid5)
20k   DDP DDP DP.  32k  62.5%

and so on...

This is not monotonic, but it lies somewhere between 50% and 67% for all cases with this layout. The smallest allocations are always the worst case. Note that raidz width and ashift both factor in heavily.

It appears ZFS assumes best-case alloc efficiency - 67% for a 3-drive Z1 - when reporting available sizes. Said another way, an empty raidz1 of 3x 100MB devices will show 200MB available. To achieve this, all allocations would need to be best-case.

If we look at REFER for the non-compressed test/test0 vs. test3z1/test0, we have 75.9G / 101G = 75% relative efficiency (where “100% relative efficiency” would be the best-case 67% alloc efficiency).

relative * best_case = actual: 75% * 67% == 50% (predicted from table above for 4-12k payloads) raw efficiency.
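The allocation rule and the table above can be sketched in code (a simplified model under the stated assumptions - one parity sector per stripe row, total padded to a multiple of parity+1 sectors; the `raidz_alloc` function name is made up for this sketch):

```python
import math

def raidz_alloc(payload, width=3, parity=1, ashift=12):
    """Bytes a raidz vdev allocates for one block of `payload` bytes.

    Simplified model: stripe rows of up to (width - parity) data
    sectors each get `parity` parity sectors, and the total is padded
    up to a multiple of (parity + 1) sectors so freed space remains
    usable without relying on adjacent allocations.
    """
    sector = 1 << ashift                       # 4 KiB with ashift=12
    data = math.ceil(payload / sector)         # data sectors
    rows = math.ceil(data / (width - parity))  # stripe rows
    total = data + rows * parity               # add parity sectors
    pad = parity + 1                           # round up to p+1 sectors
    total = math.ceil(total / pad) * pad
    return total * sector

K = 1024
for p in (4, 8, 12, 16, 20):
    a = raidz_alloc(p * K)
    print(f"{p:2d}k payload -> {a // K}k allocated, "
          f"{100 * p * K / a:.1f}% efficient")
```

This reproduces the table exactly: an 8k block allocates 16k (50%), a 16k block allocates 24k (67%, the best case for a 3-wide z1).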

Hope that helps and makes sense. Avoid small volblocksizes on raidz. I typically use 32k on raidz3 with compression; seems to be a decent compromise on the performance / space efficiency curve.
Wow, thanks! This helped me a lot!

The volblocksize of the test data is indeed 8k. I did the math following your examples for some other volumes on the production systems, and it all makes perfect sense now, thank you!
The blog entry by Matt Ahrens was also very helpful.
So from what I've learned, the bottom line for zvols is: to increase space efficiency, use a larger volblocksize and/or more disks per vdev.

For filesystem datasets ZFS uses variable stripe sizes, correct? So the efficiency of these datasets mainly depends on the actual data? What role does recordsize play (if any) here?

The only thing I still don't get is the reported size of the 3-disk raidz1 pool. Mirror pools are reported with their actual usable size (1/2 the physical space for a 2-way mirror), but the raidz pool with (roughly) the total physical size of all providers?
The man pages (on illumos/smartOS and FreeBSD) are pretty scarce here:
    size        Total size of the storage pool.
A few points:

A wider raidzn will be more efficient (with these small blocks) in general, but not as compared to the efficiency of a vanilla raidN (no z) of the same parity and width. See the second link (“not monotonic”) in my last message for efficiency (2nd tab) and efficiency compared to raidN (first tab).

Efficiency on a dataset depends on the actual files. Lots of small files have the same issue; lots of large (where large means > recordsize) files will do much better. Setting recordsize=1m will reduce this (raidz padding) overhead to a very low level if you are mainly storing media files, with very few drawbacks. (As always, test - it's not a good choice for database storage, for example: large files accessed/modified in small chunks.) As documented elsewhere, recordsize is an upper bound; the actual record size of a file is determined when it is initially created.
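To illustrate why larger blocks reduce the padding overhead, the same simplified allocation model from the earlier table (pad to multiples of parity+1 sectors; `raidz_alloc` is a sketch, not ZFS code, and the 9-wide raidz3 is an illustrative assumption, not a layout from this thread):

```python
import math

def raidz_alloc(payload, width, parity, ashift=12):
    """Simplified raidz allocation: data sectors plus one set of parity
    sectors per stripe row, padded to a multiple of (parity+1) sectors."""
    sector = 1 << ashift
    data = math.ceil(payload / sector)
    rows = math.ceil(data / (width - parity))
    total = data + rows * parity
    pad = parity + 1
    return math.ceil(total / pad) * pad * sector

K = 1024
for size in (8 * K, 32 * K, 128 * K, 1024 * K):
    a = raidz_alloc(size, width=9, parity=3)
    print(f"{size // K:4d}k block -> {100 * size / a:.1f}% space efficiency")
```

On this hypothetical 9-wide raidz3, an 8k block achieves only 25% efficiency, 32k reaches 50%, and large blocks approach the layout's best case of 6/9 ≈ 66.7% - which is the reasoning behind preferring larger record/volblock sizes on raidz.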

The zpool listing seems in line with the man page when interpreted as “the size of the storage consumed by this pool.” 3x200 raidz1 consumes 600 GB of hard drive space; 2x200 mirror consumes 400 GB of hard drive space.
The zpool listing seems in line with the man page when interpreted as “the size of the storage consumed by this pool.” 3x200 raidz1 consumes 600 GB of hard drive space; 2x200 mirror consumes 400 GB of hard drive space.

The mirrored pools consist of 2x400GB (test) and 2x2x200GB (test2x2); so the total consumed storage for these pools should be 800GB following that logic. This is what confuses me about the zpool list "size" output.
My mistake; I see you’re correct — I must not look at zpool list too often, other than for capacity % !! ;)

I suppose you could interpret it as the unique storage space managed (as the mirrors will be ~identical, while each member of the raidz contains different information), but that's not clear at all from the man page. You could crack open the source and see what the actual logic is...
I just found this bug report for ZoL [1] which points to a passage in the zpool(8) manual page:
The space usage properties report actual physical space available to the storage pool. The physical space can be different from the total amount of space that any contained datasets can actually use. The amount of space used in a raidz configuration depends on the characteristics of the data being written. In addition, ZFS reserves some space for internal accounting that the zfs(8) command takes into account, but the zpool command does not. For non-full pools of a reasonable size, these effects should be invisible. For small pools, or pools that are close to being completely full, these discrepancies may become more noticeable.

So zpool reports the physical space available to the pool - i.e. all space that can be used for "unique" blocks/data (mirrored data is not unique).
The important, if somewhat vague, passage regarding the reported size from zpool list is:
In addition, ZFS reserves some space for internal accounting that the zfs(8) command takes into account, but the zpool command does not.
"Internal accounting" includes parity data, which for zpool is just like any other (user) data - therefore parity isn't subtracted from the reported pool size for raidz pools/vdevs.

TL;DR: The size reported by zpool is all the space available on the vdev/pool for writing any type of data, including parity.
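That interpretation matches the numbers from the test pools above; as a sketch (the `zpool_size_gb` helper is made up for illustration, and it ignores labels and the small reserved areas, which is why the real listings show slightly less than the raw disk sizes):

```python
# How `zpool list` SIZE relates to the vdev layout: mirrors count one
# copy only, raidz counts every disk because parity is stored like any
# other data from the pool allocator's point of view.
def zpool_size_gb(vdevs):
    """vdevs: list of (kind, disk_count, disk_gb) tuples."""
    total = 0
    for kind, disks, gb in vdevs:
        if kind == "mirror":
            total += gb          # only one copy of the data is counted
        else:                    # raidz*: parity counts like user data
            total += disks * gb
    return total

print(zpool_size_gb([("mirror", 2, 400)]))      # ~400 (pool 'test', shown as 399G)
print(zpool_size_gb([("mirror", 2, 200)] * 2))  # ~400 ('test2x2', shown as 398G)
print(zpool_size_gb([("raidz1", 3, 200)]))      # ~600 ('test3z1', shown as 596G)
```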

[1] https://github.com/zfsonlinux/zfs/issues/1942