ZFS Size discrepancy between same dataset on sending and receiving host

I've got a dataset without children that I sent from one server to another with the code below

Code:
$ ssh remotehost zfs send zpool/dataset@snap | zfs recv zroot/dataset

Now the two datasets have a nearly 2x difference in their 'used' property. I have read the relevant man pages, but I fail to understand the discrepancy here.

From the original:
Code:
# remotehost
$ zfs get used,usedbydataset,logicalused,refer,copies,compress,compressratio,volblocksize zpool/dataset
NAME           PROPERTY       VALUE           SOURCE
zpool/dataset  used           53.4G           -
zpool/dataset  usedbydataset  51.7G           -
zpool/dataset  logicalused    20.1G           -
zpool/dataset  referenced     51.7G           -
zpool/dataset  copies         2               local
zpool/dataset  compression    zstd            local
zpool/dataset  compressratio  1.98x           -
zpool/dataset  volblocksize   -               -

And from the receiving host
Code:
# localhost
$ zfs get used,usedbydataset,logicalused,refer,copies,compress,compressratio,volblocksize zroot/dataset
NAME           PROPERTY       VALUE           SOURCE
zroot/dataset  used           27.9G           -
zroot/dataset  usedbydataset  27.2G           -
zroot/dataset  logicalused    19.9G           -
zroot/dataset  referenced     27.2G           -
zroot/dataset  copies         2               received
zroot/dataset  compression    zstd            received
zroot/dataset  compressratio  2.00x           -
zroot/dataset  volblocksize   -               -
 
What is the output of the following on both hosts?
Code:
zfs list -r -o space,usedbydataset,logicalused,refer,copies,compress,compressratio,volblocksize -t all zroot/dataset
zdb -C zroot | grep ashift
zpool status zroot
 
remotehost; sender
Code:
$ zfs list -r -o space,usedbydataset,logicalused,refer,copies,compress,compressratio,volblocksize -t all zpool/dataset
NAME                               AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD  USEDDS  LUSED     REFER  COPIES  COMPRESS        RATIO  VOLBLOCK
zpool/dataset                      42.8G  53.4G     1.72G   51.7G             0B         0B   51.7G  20.2G     51.7G  2       zstd            1.98x         -
zpool/dataset@snap1                    -  1.14G         -       -              -          -       -      -     50.8G  -       -               2.04x         -
zpool/dataset@snap2                    -   405M         -       -              -          -       -      -     51.5G  -       -               2.04x         -
$
$ zpool status zpool
  pool: zpool  
 state: ONLINE
  scan: scrub repaired 0B in 00:05:20 with 0 errors on Mon Jul 31 01:27:37 2023
config:

        NAME        STATE     READ WRITE CKSUM
        zpool       ONLINE       0     0     0
          da2p1     ONLINE       0     0     0

errors: No known data errors


localhost; receiver (the removed drive is a flaky cache device; no observed impact on the pool other than the obvious performance hit)
Code:
$ zfs list -r -o space,usedbydataset,logicalused,refer,copies,compress,compressratio,volblocksize -t all zroot/dataset
NAME                               AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD  USEDDS  LUSED  REFER  COPIES  COMPRESS        RATIO  VOLBLOCK
zroot/dataset                      16.3T  27.9G      773M   27.2G             0B         0B   27.2G  19.9G  27.2G  2       zstd            2.00x         -
zroot/dataset@snap1                    -   773M         -       -              -          -       -      -  26.1G  -       -               2.04x         -
zroot/dataset@snap2                    -     0B         -       -              -          -       -      -  27.2G  -       -               2.04x         -
$
$ zpool status zroot
  pool: zroot
 state: ONLINE
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 0B in 02:34:11 with 0 errors on Wed Jul 17 14:41:11 2024
config:

        NAME            STATE     READ WRITE CKSUM
        zroot           ONLINE       0     0     0
          mirror-0      ONLINE       0     0     0
            ada2p4.eli  ONLINE       0     0     0
            ada3p4.eli  ONLINE       0     0     0
          mirror-1      ONLINE       0     0     0
            ada0p1.eli  ONLINE       0     0     0
            ada1p1.eli  ONLINE       0     0     0
        logs
          mirror-3      ONLINE       0     0     0
            nda1p1      ONLINE       0     0     0
            nda2p1      ONLINE       0     0     0
        cache
          nda0p1        REMOVED      0     0     0

errors: No known data errors


The first send was of snap1; I have since sent snap2. The sender continues to accrue data, but that data is quite small. The size difference was essentially the same immediately after the first send.
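
From memory, the follow-up incremental was along these lines (the exact invocation may have differed slightly):
Code:
$ ssh remotehost zfs send -i zpool/dataset@snap1 zpool/dataset@snap2 | zfs recv zroot/dataset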
 
Right, you have a different setup of vdevs on each host; the first one shown is just a single vdev (da2p1). That could explain some difference, but a factor of approximately 2 seems strange.
Ashift difference might reveal more: can you give the output of zdb -C zroot | grep ashift?
 
Sorry, I skipped right over the zdb command in your first response.

remotehost; sender
Code:
$ zdb -C zpool | grep ashift
                ashift: 12

localhost; receiver (whole output, trimmed to only the nodes that have an ashift set, due to the differences in ashift values)
Code:
MOS Configuration:
        version: 5000
        name: 'zroot'
        ...
        hole_array[0]: 2
        vdev_children: 4
        vdev_tree:
            type: 'root'
            id: 0
            guid: 13913236799939257359
            create_txg: 4
            children[0]:
                type: 'mirror' // children are mirror vdev disks
                id: 0
                ...
                ashift: 12
                ...
            children[1]:
                type: 'mirror' // children are mirror vdev disks
                id: 1
                ...
                ashift: 12
                ...
            children[2]:
                type: 'hole' // no children
                id: 2
                ...
                ashift: 0
                ...
            children[3]:
                type: 'mirror' // children are SLOG vdev disks
                id: 3
                ...
                ashift: 12
 
How would the different topology affect how much space is used? The single-device pool is the one showing more space used. If anything, the data stored on the mirrors consumes more total raw disk space, yet that pool reports significantly less used.
 
The question about ashift made me double check. All disks on both machines are listed as having a sectorsize of 512. Not sure if this is a factor that would interact with ashift.

remotehost; sender
Code:
$ geom disk list | grep -i sectorsize
   Sectorsize: 512
   Sectorsize: 512
   Sectorsize: 512

localhost; receiver
Code:
$ geom disk list | grep -i sectorsize
   Sectorsize: 512
   Sectorsize: 512
   Sectorsize: 512
   Sectorsize: 512
   Sectorsize: 512
   Sectorsize: 2048 // this is a Blu-ray drive
   Sectorsize: 512
   Sectorsize: 512
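
If the physical sector size matters here, geom also reports a Stripesize (which, as I understand it, usually reflects the physical sector size); both can be pulled with something like:
Code:
$ geom disk list | grep -Ei 'sectorsize|stripesize'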
 
Unfortunately, I don't have an answer for the space difference; I cannot see a culprit that I could point to.

I do have some remarks and observations.

How would the different topology affect how much space is used? The single-device pool is the one showing more space used. If anything, the data stored on the mirrors consumes more total raw disk space, yet that pool reports significantly less used.
To your first question, in general: padding; you are using a stripe of 2-way mirrors. However, I'm much inclined to agree with your remark that this would likely have resulted in a difference the other way around.

Your ashifts do not look suspicious to me (an ashift=9 where 12 or 13 would be appropriate could be problematic).

I noticed that you have copies=2. Given the single vdev (da2p1) on the remote host, this increases redundancy; not as much as a mirror of two physical disks, but it helps. I see that you also have copies=2 on your local host, where you already have a ZFS mirror. There it would also help, but I personally would not do that, as the mirror already provides redundancy. That totals four copies, and the extra two will probably slow down writes to the mirror by about a factor of 2. If those happen to be NAND flash SSDs, I'd say you are increasing wear unnecessarily.

Shot in the dark: you didn't by any chance happen to set copies=2 on your local mirror after your send-receive transfer?
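
If you still have it, one way to check would be the pool history; if I remember the man page right, -i adds internally logged events and -l the long format with user and host details:
Code:
$ zpool history -il zroot | grep -i copies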
 
Thanks for the attention on this.

I set copies and compression on remotehost upon creating the dataset, so those have been there from the beginning. Localhost received those properties with the first send from remotehost.

The remote's disk is redundant storage from a public cloud provider, and also backed up regularly, so I figured I could afford to go with just a single "disk". I set copies for peace of mind on the dataset. I don't mind it on the receiving side, as this dataset is transient anyway. I brought it local to do some analysis and work on refactoring the application that generates all this.

I can tell you that the write profile of the application is lots of very small files with gaps of minutes in between them. Is it possible that the average amount of data in the ZIL is smaller than a ZFS block size? If the writes are routinely smaller than block size, could that account for this apparent inflation? If it is possible for that sort of write inflation, would a send | recv compact the data in terms of blocks on disk?
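
If a block-level view would help, I could also try a block statistics dump; as I understand it (I may be misreading the man page), something like this walks every block in the pool and prints size breakdowns, so it can take a while:
Code:
$ zdb -bb -L zpool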
 
would a send | recv compact the data in terms of blocks on disk?
This is the one thought I held back for you to try, as I saw it as a remote possibility and a bit cumbersome in your setup because of the "distance" between local and remote (although we're not talking TBs here). This could make the problem go away; however, it would not, as a consequence, directly solve the mystery!

That ZIL notion of yours is an interesting one. I think the ZIL write delay is at most something like 5s; in the scale of things, 5s is a pretty long time. I couldn't say definitively one way or the other.
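
If I have my knobs right, that ~5s figure corresponds to the transaction group timeout; on FreeBSD it should be visible as a sysctl (this is my assumption of the relevant tunable):
Code:
$ sysctl vfs.zfs.txg.timeout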

I'd consider documenting this thoroughly, and if no one here comes up with an explanation, then pose this problem on the appropriate FreeBSD mailing list or the OpenZFS GitHub site; I'm inclined to go directly to the OpenZFS GitHub. Either way, it may be wise to keep that original "big" remote dataset available.
 
I'm happy to take this to a developers' mailing list. I can condense the conversation in this thread. Is there anything else worthwhile to include in context or documentation?
 
So, here's some additional detail, based on looking at du. The size reported by du is the same as that reported by ZFS as usedbydataset. But the apparent size is much smaller, and also smaller than ZFS's logicalused.

This makes me think it has to do with the many very small files stored in this dataset. I have many files of <512B. A quick spot check of three typical files of <512B shows them reported by du as 512B with -A and as 8.5KiB without -A.
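
If my back-of-the-envelope arithmetic is right (this is just my assumption), ~8.5KiB is about what a tiny file should cost on this dataset:
Code:
# assumption: ashift=12 means a 4 KiB minimum allocation, and copies=2 doubles it:
#   2 x 4 KiB = 8 KiB allocated for a <512B file, plus a little metadata ~ 8.5 KiB
# spot check on one file ('small-file.txt' is a hypothetical example path):
$ du -h  /dataset/small-file.txt    # allocated size
$ du -Ah /dataset/small-file.txt    # apparent (logical) size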

remotehost; sender:
Code:
$ zfs list -r -o space,usedbydataset,logicalused,refer,copies,compress,compressratio,volblocksize -t all zpool/dataset
NAME                               AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD  USEDDS  LUSED     REFER  COPIES  COMPRESS        RATIO  VOLBLOCK
zpool/dataset                      42.8G  53.4G     1.72G   51.7G             0B         0B   51.7G  20.2G     51.7G  2       zstd            1.98x         -
$ du -h -d0 /dataset
 52G    /dataset
$ du -hA -d0 /dataset
 18G    /dataset


localhost; receiver:
Code:
$ zfs list -r -o space,usedbydataset,logicalused,refer,copies,compress,compressratio,volblocksize -t all zroot/dataset
NAME                               AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD  USEDDS  LUSED  REFER  COPIES  COMPRESS        RATIO  VOLBLOCK
zroot/dataset                      16.3T  27.9G      773M   27.2G             0B         0B   27.2G  19.9G  27.2G  2       zstd            2.00x         -
$ du -h -d0 /zroot/dataset  
 28G    /zroot/dataset
$ du -hA -d0 /zroot/dataset
 18G    /zroot/dataset
 
Is there anything else worthwhile to include in context or documentation?
Nothing comes to mind at the moment.

This makes me think that it has to do with the many very small files stored in this dataset. I have many files of <512B
Hmm, but on the face of it, both hosts' datasets have not shown radically different settings so far; why would these many small files result in a (big) difference in space used between the two datasets?

- Suggestion
If you can, create a separate pool & dataset locally, on some partition outside of your local mirror, that mimics the remote host's dataset on da2p1: thus a single vdev (for consistency, don't forget to set copies=2 at creation time and use the same ashift, assuming the underlying disk, if it is a different one, allows the same ashift). Then:
If it is possible for that sort of write inflation, would a send | recv compact the data in terms of blocks on disk?
Ideally, two tests: the first with the remote host as source/sender (if easily doable), and the second with your local mirror as source/sender. I'd like to see what happens if we can eliminate any unexpected effects of the local stripe of mirrors. Especially the first test would be a clearer A to A' comparison when put to a mailing list.

(I'm still contemplating whether there isn't some "factor of 2" staring us in the face that we've been overlooking; that, or there's something wrong.)

Edit:
I set copies and compression on remotehost upon creating the dataset, so those have been there from the beginning.
When doing the transfer by zfs send/receive, the data gets decompressed & serialized, and on the receiving side deserialized and compressed according to the compression setting and the algorithm used: zstd. All clear on the receiving end. However, on the sending end: you must be sure that not only was compression set at creation time, but also that zstd was used from the get-go. AFAIK, lz4 would have been/still is the default; I don't know all the details about zstd. Compression works per record, and a very small file still occupies at least one block, so with lots of very small files it would not bring much in space savings. That being said, I still can't see how this relatively poor compression efficiency for very small files would work out differently between the two datasets.
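
As an aside: if you want to take the recompress step out of the equation on a future test, I believe zfs send can transmit records as they are stored on disk (compressed) with -c, plus -L for large blocks; roughly:
Code:
# 'zroot/dataset_ctest' is just a placeholder target name
$ ssh remotehost zfs send -Lc zpool/dataset@snap2 | zfs recv zroot/dataset_ctest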
 
For testing, would a new file-based zpool running on top of my local mirrors work? Or should it be completely independent disks?

As for the history of the datasets:

remotehost, sender:
Code:
$ zpool history
History for 'zpool':
2022-05-17.02:44:06 zpool create zpool da2p1
2022-05-17.03:05:35 zfs set dedup=off zpool
2022-05-17.03:22:18 zfs create -v -o atime=off -o compress=zstd -o copies=2 -o dedup=off -o mountpoint=/dataset zpool/dataset
...

And the remainder of truncated history is scrubs, imports, snapshots, one other dataset creation, and the sends in question.

For localhost, the target dataset did not exist, but was created upon receive:
localhost; receiver
Code:
$ doas zpool history                                                                                                                                                  
History for 'zroot':
...snip installer-driven configuration and a couple years of maintenance...
2024-07-19.12:31:16 zfs recv zroot/dataset

All dataset configuration on localhost was from the recv. The pool into which it was received also has zstd compression set (but not copies=2).
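
For completeness, the property sources can be double-checked by filtering on source; as I read the man page, this lists only locally set and received values:
Code:
$ zfs get -s local,received copies,compression zroot zroot/dataset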
 
A file could probably work. I'd prefer to create as directly comparable a situation as possible to the one on your remote host: so only one disk, one vdev, one partition, needing and getting the same ashift. The size of the disk or partition shouldn't matter. You'll decide what's feasible and how that aligns with your own ideas.
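
A minimal sketch of what I have in mind, assuming a file-backed pool is acceptable (the file path, size, and pool name are just placeholders; here I let receive pin the properties with -o so they are in effect from the very first write):
Code:
$ truncate -s 64G /var/tmp/testpool.img
$ zpool create -o ashift=12 testpool /var/tmp/testpool.img
$ ssh remotehost zfs send zpool/dataset@snap2 | zfs recv -o atime=off -o compression=zstd -o copies=2 testpool/dataset
$ zfs list -o space,usedbydataset,logicalused,compressratio testpool/dataset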
 