ZFS: Causes for a file reporting larger than its apparent size?

I run a Usenet service with FreeBSD 13.2-RELEASE-p4 and INN in an ESXi VM atop NVMe storage, using the virtual NVMe controller, if that matters. I'm noticing a strange issue where the space reported for a 10GB news buffer file is larger than its actual/apparent size, and I do not understand why.

Code:
# du cnfs2
12612853    cnfs2
# du -A cnfs2
10000000    cnfs2

This file is continuously written to, and when writing reaches the end it wraps around and overwrites data at the beginning (known as a CNFS buffer in INN/Usenet speak). The first few times the buffer fills and wraps, the space reported by 'du' is smaller than the file (thanks to the small amount of compression from ZFS), but after it wraps a number of times 'du' starts to report larger sizes.

This file was originally created with 'dd if=/dev/zero of=cnfs2 bs=1k count=10000000', has been written to by INN for a few days, and has wrapped ~13 times.

When I copy the file over to a Linux box with ext4 it looks 'normal':

Code:
# du cnfs2
10000000    cnfs2
# du --apparent-size cnfs2
10000000    cnfs2

I then copied the file back from the Linux box to a new folder and now it is using double?!
Code:
# du -A cnfs2
10000000    cnfs2
# du cnfs2
20224841    cnfs2

When I remove that file ~20GB of space are recovered according to df.

What would cause this kind of behavior? It feels like a recent change in behavior after a patch release, as I don't recall seeing this a year+ ago. Typically the file consumes about 9.5GB of space given the little bit of compression provided by ZFS.
 
One possible reason is snapshots.

Defining the exact disk usage of a file in a log-structured and garbage collecting system such as ZFS is difficult. If the physical file layout on disk contains holes with data that is logically deleted, but has not been garbage collected or compacted yet, then are those holes really used? If nothing else is using them, and perhaps can't even use them, they're not free space either.
 
Originally thought it could be caused by a snapshot I forgot to remove, but I have none. For the heck of it I tried a scrub and trim but no change.

It is possible there are holes in the file, as some articles that get stored inside the file are later deleted (spam). I've been using these buffer files for many years and, until recently, had never noticed one of them 'grow' beyond its actual size.
 
du(1)
Code:
DESCRIPTION
     The du utility displays the file system block usage for each file
     argument and for each directory in the file hierarchy rooted in each
     directory argument. [...]

ZFS uses dynamic block sizes and du has no concept of that, so its numbers can be misleading, especially when you ask it about something like on-disk size, where ZFS compression also comes into play...
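
If you want numbers that already account for compression at the dataset level, the used/logicalused properties are usually more telling than du; for example (dataset name is a placeholder):
Code:
# zfs get used,logicalused,compressratio <dataset>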

You can use zdb(8) to examine objects on ZFS datasets in such detail. Use zdb -dd <dataset> to see everything on that dataset. I created a dataset and generated a file via dd if=/dev/random of=testfile bs=1M count=100 and ended up with these contents on the dataset:
Code:
# zdb -dd zroot/usr/home/sko/test
Dataset zroot/usr/home/sko/test [ZPL], ID 107896, cr_txg 48816777, 100M, 8 objects

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         0    6   128K    16K    56K     512    32K   12.50  DMU dnode
        -1    1   128K    512      0     512    512  100.00  ZFS user/group/project used
        -2    1   128K    512      0     512    512  100.00  ZFS user/group/project used
        -3    1   128K    512      0     512    512  100.00  ZFS user/group/project used
         1    1   128K     1K     8K     512     1K  100.00  ZFS master node
         2    2   128K   128K   100M     512   100M  100.00  ZFS plain file
        32    1   128K    512      0     512    512  100.00  SA master node
        33    1   128K    512      0     512    512  100.00  ZFS delete queue
        34    1   128K    512      0     512    512  100.00  ZFS directory
        35    1   128K  1.50K     8K     512  1.50K  100.00  SA attr registration
        36    1   128K    16K    16K     512    32K  100.00  SA attr layouts
        37    1   128K    512      0     512    512  100.00  ZFS directory

    Dnode slots:
        Total used:             8
        Max used:              37
        Percent empty:  78.378378

To examine a file you have to find its object number and add it to the zdb command:
Code:
# zdb -dd zroot/usr/home/sko/test 2
Dataset zroot/usr/home/sko/test [ZPL], ID 107896, cr_txg 48816777, 100M, 8 objects

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         2    2   128K   128K   100M     512   100M  100.00  ZFS plain file

This file is rather boring: the on-disk size (dsize) is 100M, the same as its logical size (lsize), since random data isn't compressible.
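
(Side note, in case it isn't obvious how to find the object number for an arbitrary file: as far as I know the object ID of a ZFS plain file matches its inode number, so something like the following prints the number to feed to zdb.)
Code:
# ls -i testfile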

If I create the file from /dev/zero you can see the actual disksize is 0 thanks to compression:

Code:
# dd if=/dev/zero of=testfile bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes transferred in 0.017417 secs (6020568238 bytes/sec)

# zdb -dd zroot/usr/home/sko/test 2
Dataset zroot/usr/home/sko/test [ZPL], ID 107896, cr_txg 48816777, 96K, 8 objects

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         2    2   128K   128K      0     512   100M    0.00  ZFS plain file

But always keep in mind there's also metadata involved that isn't shown here, as we only look at the object that actually holds the file data.


To get even more information about an object, add two more 'd's to the command:
Code:
# zdb -dddd zroot/usr/home/sko/test 2
Dataset zroot/usr/home/sko/test [ZPL], ID 107896, cr_txg 48816777, 100M, 8 objects, rootbp DVA[0]=<0:16518ddc000:1000> DVA[1]=<0:1cfd1158000:1000> [L0 DMU objset] fletcher4 uncompressed unencrypted LE contiguous unique double size=1000L/1000P birth=48816872L/48816872P fill=8 cksum=e6115edcc:280bf1688c10:3bd08f14d61f54:3f6fab287118a259

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
         2    2   128K   128K   100M     512   100M  100.00  ZFS plain file
                                               168   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED 
        dnode maxblkid: 799
        uid     0
        gid     0
        atime   Wed Oct 11 13:18:04 2023
        mtime   Wed Oct 11 13:18:04 2023
        ctime   Wed Oct 11 13:18:04 2023
        crtime  Wed Oct 11 13:18:04 2023
        gen     48816872
        mode    100644
        size    104857600
        parent  34
        links   1
        pflags  40800000004

Adding another 'd' would give you details for every single block of the object.
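
For completeness, that would be (output omitted here, as it lists every block pointer with its DVAs):
Code:
# zdb -ddddd zroot/usr/home/sko/test 2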
 
Thanks - here's what I've got. I don't understand why the size on disk is greater than the logical size though. Adding another d provides output I can't understand. :-)

Code:
# zdb -dddd zroot/ROOT/default 34227873
Dataset zroot/ROOT/default [ZPL], ID 389, cr_txg 8, 1.50T, 298769837 objects, rootbp DVA[0]=<0:ddfc5f9000:1000> DVA[1]=<0:12314a7c000:1000> [L0 DMU objset] fletcher4 uncompressed unencrypted LE contiguous unique double size=1000L/1000P birth=8986828L/8986828P fill=298769837 cksum=13718114dc:37392589acec:5401b2ab6aa3a6:5a64a32bd5329286

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
  34227873    3   128K   128K  11.5G     512  9.54G  100.00  ZFS plain file
                                               168   bonus  System attributes
    dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
    dnode maxblkid: 78124
    uid     8
    gid     8
    atime    Sat Jun 25 03:04:40 2022
    mtime    Wed Oct 11 06:46:29 2023
    ctime    Wed Oct 11 06:46:29 2023
    crtime    Sat Jun 25 03:04:40 2022
    gen    832251
    mode    100644
    size    10240000000
    parent    53742
    links    1
    pflags    40800000004
 
[...] I then copied the file back from the Linux box to a new folder and now it is using double?!
Code:
# du -A cnfs2
10000000    cnfs2
# du cnfs2
20224841    cnfs2

When I remove that file ~20GB of space are recovered according to df.
I'd like to know what zfs list -o space reports on the dataset to which cnfs2 belongs, just before versus just after its deletion. For a consistent, stable comparison, you'd probably best create a new, separate dataset for this file as the target of that last copy.

edit: what ZFS setup are you using: an n-way mirror or ZFS RAIDZx?
Alternatively, when you have a separate dataset containing solely cnfs2, you could also get an accurate assessment by performing a dry run: zfs destroy -nv <dataset> (that does include the administrative overhead of the dataset itself, but compared to the file size that should be minimal)
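
A possible sequence, just as a sketch (dataset name and mountpoint are placeholders):
Code:
# zfs create zroot/cnfstest
# cp /path/to/cnfs2 /zroot/cnfstest/
# zfs list -o space -p zroot/cnfstest
# rm /zroot/cnfstest/cnfs2
# zfs list -o space -p zroot/cnfstest

With the file still in place, zfs destroy -nv zroot/cnfstest would be the dry-run alternative.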
 
I'd like to know what zfs list -o space reports on the dataset to which cnfs2 belongs, just before versus just after its deletion. For a consistent, stable comparison, you'd probably best create a new, separate dataset for this file as the target of that last copy.
That particular example seems to be a weird fluke I can't explain; any subsequent attempt to copy the file results in the same size as the source.

That said, I duplicated the file, and here is a before and after of removing it, with the irrelevant lines snipped:
Code:
# zfs list -o space -p
NAME                       AVAIL           USED  USEDSNAP         USEDDS  USEDREFRESERV      USEDCHILD
zroot                56464748544  1721793101824         0          98304              0  1721793003520
zroot/ROOT           56464748544  1663108702208         0          98304              0  1663108603904
zroot/ROOT/default   56464748544  1663108603904         0  1663108603904              0              0

# rm cnfs2

# zfs list -o space -p
NAME                       AVAIL           USED  USEDSNAP         USEDDS  USEDREFRESERV      USEDCHILD
zroot                68907159552  1709350690816         0          98304              0  1709350592512
zroot/ROOT           68907159552  1650666745856         0          98304              0  1650666647552
zroot/ROOT/default   68907159552  1650666647552         0  1650666647552              0              0
 
Maybe something is wrong with the zroot dataset/filesystem. I added another virtual disk to the VM and created a new dataset and copied the file there:
Code:
# du -A cnfs2
10000000    cnfs2
# du cnfs2
9702029    cnfs2
 
I was comparing the two datasets, and the only option/difference I see is that the new one I created does not have dedup enabled.
 
I enabled dedup so all ZFS options are now the same, removed all files from the new dataset, and recopied the cnfs2 file:
Code:
# zdb -dddd -O additional test2

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
       128    3   128K   128K  9.32G     512  9.54G  100.00  ZFS plain file

This also does not make sense to me: when I make a copy of the file and remove it, for a brief time zfs list -o space shows space being reclaimed, but then a few seconds later it is back to the original values:
Code:
# zfs list -o space
NAME                AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
additional           762G  18.7G        0B   18.6G             0B      76.1M
zroot               63.8G  1.55T        0B     96K             0B      1.55T
zroot/ROOT          63.8G  1.50T        0B     96K             0B      1.50T
zroot/ROOT/default  63.8G  1.50T        0B   1.50T             0B         0B

# zfs list -o space
NAME                AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
additional           762G  18.7G        0B   18.6G             0B      76.1M
zroot               52.4G  1.55T        0B     96K             0B      1.55T
zroot/ROOT          52.4G  1.50T        0B     96K             0B      1.50T
zroot/ROOT/default  52.4G  1.50T        0B   1.50T             0B         0B
 
You might have a complicating factor in your FreeBSD VM running on top of ESXi. I have very little knowledge of ESXi and I don't know in what way it grants your FreeBSD VM disk access. Ideally it should grant FreeBSD exclusive and total control of the (NVMe) devices that make up your FreeBSD ZFS pool, without interfering or adding a translation layer. If not, that might "trouble" ZFS in accurately assessing data properties of the underlying hardware devices of the pool. I don't know exactly what kind of influence that might have on ZFS, but I do know that ZFS is meant to be run with complete hardware access to its physical disks (that's why ZFS on top of hardware RAID is not recommended, for example).

Regarding your last message (#11): unstable disk space measurements right after you have deleted a large file are something that happens because it takes some time to process all the blocks on disk invalidated by the deletion of the file*. However, I can't explain why the AVAIL space seems to diminish over time as displayed. I don't know whether certain settings at the ESXi level might have anything to do with that. Does ESXi (via its settings) provision the available disk blocks for the FreeBSD VM in a flexible manner?

I enabled dedup, so now all ZFS options are the same,
I don't know the details of the data in your ZFS pool but, in general, deduplication is not worth the effort more often than one would think. It's quite resource hungry and can also impede ZFS I/O performance. Have you ever assessed whether you really want/need deduplication?

As you seem to have had dedup enabled from the beginning, try zdb -S zroot to get more insight; especially relevant are the last few lines of the output.

P.S. I'm still interested in your ZFS setup: mirror or RAIDZx.

___
* In ZFS, deletion of a file is instantaneous because of the nature of ZFS' implementation as a Copy On Write (COW) filesystem. (Based on that same principle, a ZFS snapshot is (nearly) instantaneous as well.) After the invalidation of all ZFS data blocks belonging to the deleted file, these blocks are administratively processed as soon as possible. Besides watching the unstable values in the AVAIL column, you can follow these changes more precisely via the freeing pool property (see zpoolprops(7)); try using zpool-list(8) with that property.
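
For example, something like:
Code:
# zpool get freeing zroot
# zpool list -o name,size,alloc,free,freeing zroot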
 
Originally thought it could be caused by a snapshot I forgot to remove, but I have none. For the heck of it I tried a scrub and trim but no change.

It is possible there are holes in the file, as some articles that get stored inside the file are later deleted (spam). I've been using these buffer files for many years and, until recently, had never noticed one of them 'grow' beyond its actual size.
We may be talking about different kinds of holes.

In the POSIX file system interface, it is possible for an application to create a file that contains a hole. That means there is a part of the file's virtual address space that can be read, but that has never been written and typically uses no disk space. Reads on that space will succeed and return zeros. The easy way to do it is to seek to a place in the file that has not previously been written, then write there. This is actually quite common when files are created with a sequence of writes at random positions. In your case, you prefilled the file with dd, so this would not create holes. There are also specialized system calls to punch holes into existing files. On Linux, one uses the fallocate(2) call; on FreeBSD, that only exists in the glibc compatibility tools (and I don't know whether it actually implements punch_hole), but one can achieve similar effects using combinations of seek, truncate and write.
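
As an aside, you can create such a logical hole on FreeBSD with truncate(1) and watch the apparent and allocated sizes diverge (file name is just an example):
Code:
# truncate -s 1G sparse.dat
# du -A sparse.dat
# du sparse.dat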

The other kind of hole that I was talking about earlier is in the physical layout of the file on disk, not in the logical (virtual) address space. It happens most commonly in log-structured file systems such as ZFS. To first order, ZFS only writes to disk by appending to an internal log file. As an example, consider an idle ZFS system, where one program is sequentially writing a file. Most likely, the file will physically be contiguous on disk, at the end of the log. A while later someone goes back to the file, seeks, and overwrites a small part somewhere in the middle. ZFS will probably write the new data to the current end of the log, not overwrite the existing block in place. What happens to the existing block on disk that is no longer needed? It is logically free to overwrite (unless it is needed for a snapshot of the file). But because ZFS will only write new data at the ends of its log, that unused block is not going to be used. At some later point, ZFS will perform "garbage collection" or "compaction" or "defragmentation" or such operations, make the file more contiguous again, and that will make it possible for the log to write to that unused block again.

This unused block is what I was referring to as a hole. From a disk space usage viewpoint, is that unused block part of the file? No, it was deallocated, and is just "waste" on disk. Waste that is unusable until the data layout is cleaned up and that part of the disk becomes the log again. On the other hand, since that disk block is unusable (we can't write to it right now), its cost should really be charged to the file whose block was overwritten. Yet the normal du mechanism (and the underlying block counting) doesn't see that. This is what I mean by space accounting being interestingly complex in a log-structured system.
 
You might have a complicating factor in your FreeBSD VM running on top of ESXi. I have very little knowledge of ESXi and I don't know in what way it grants your FreeBSD VM disk access. Ideally it should grant FreeBSD exclusive and total control of the (NVMe) devices that make up your FreeBSD ZFS pool, without interfering or adding a translation layer. If not, that might "trouble" ZFS in accurately assessing data properties of the underlying hardware devices of the pool. I don't know exactly what kind of influence that might have on ZFS, but I do know that ZFS is meant to be run with complete hardware access to its physical disks (that's why ZFS on top of hardware RAID is not recommended, for example).

Regarding your last message (#11): unstable disk space measurements right after you have deleted a large file are something that happens because it takes some time to do the "garbage collecting" on disk (ZFS is a COW filesystem). However, I can't explain why the AVAIL space seems to diminish over time as displayed. I don't know whether certain settings at the ESXi level might have anything to do with that. Does ESXi (via its settings) provision the available disk blocks for the FreeBSD VM in a flexible manner?


I don't know the details of the data in your ZFS pool but, in general, deduplication is not worth the effort more often than one would think: it's quite resource hungry and can also impede ZFS I/O performance. Have you ever assessed whether you really want/need deduplication?
As you seem to have had dedup enabled from the beginning, try zdb -S zroot to get more insight; especially relevant are the last few lines of the output.

P.S. I'm still interested in your ZFS setup: mirror or RAIDZx.
I have no idea how VMWare exposes storage; I know and think little about storage, except that FreeBSD sees an NVMe controller. I may not need deduplication, but since it was enabled for zroot via the Auto (ZFS) setup options, I figured I'd make sure the two datasets were equivalent in case that skewed the numbers.

It has a single volume attached and I used the Guided root on ZFS setup option, so technically a mirror? I notice just now with zpool list that it thinks there is 105GB free, but I don't see that number anywhere else. Today I added another VMWare disk and created the 'additional' pool for testing.

Code:
# zpool list -v
NAME         SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
additional   796G  9.40G   787G        -         -     0%     1%  2.00x    ONLINE  -
  nvd1       800G  9.40G   787G        -         -     0%  1.18%      -    ONLINE
zroot       1.66T  1.55T   105G        -         -    91%    93%  1.00x    ONLINE  -
  nvd0p3    1.66T  1.55T   105G        -         -    91%  93.8%      -    ONLINE

# zdb -S zroot
Simulated DDT histogram:

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1     142M    546G    449G    687G     142M    546G    449G    687G
     2    8.30K    418M    154M    167M    18.2K    892M    329M    357M
     4      699   43.5M   16.1M     17M    3.33K    214M   80.3M   84.9M
     8      265   29.7M   13.8M   13.8M    2.91K    340M    160M    161M
    16      179   22.2M   13.3M   13.3M    3.83K    486M    296M    296M
    32      110   13.8M   8.60M   8.60M    4.96K    635M    398M    398M
    64       67   8.38M   5.71M   5.71M    5.62K    719M    500M    500M
   128       25   2.88M   1016K   1020K    4.66K    536M    185M    186M
   256       22   2.75M    968K    968K    6.53K    836M    287M    287M
 Total     142M    547G    449G    688G     142M    551G    451G    690G

dedup = 1.00, compress = 1.22, copies = 1.53, dedup * compress / copies = 0.80
 
Just to get a clearer picture of your storage setup:
What actual physical storage hardware is in use for your ZFS storage? I can deduce one (NVMe) disk; is there more?
What is the output of zpool status?
 
Absolutely; I'm happy to provide anything that may help me understand more of what I am seeing, even if it turns out to be 'normal' for ZFS.

The VM is running on an ESXi 8.0 node, a Dell PowerEdge R440 with a Samsung 970 EVO 2TB NVMe and a VMFS 6 datastore on top of that NVMe. The volume in ESXi is attached via the virtualized NVMe controller (not the standard ESXi SATA controller, since I'm using NVMe storage underneath; as I understand it, this should support OS-issued TRIM commands and the like). Prior to attaching the 'additional' volume it was a single 1.7TB volume provisioned with the Root on ZFS (Guided) option during FreeBSD setup.

Code:
# zpool list
NAME         SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
additional   796G  9.40G   787G        -         -     0%     1%  2.00x    ONLINE  -
zroot       1.66T  1.55T   105G        -         -    91%    93%  1.00x    ONLINE  -

# zpool status
  pool: additional
 state: ONLINE
config:

    NAME        STATE     READ WRITE CKSUM
    additional  ONLINE       0     0     0
      nvd1      ONLINE       0     0     0

errors: No known data errors

  pool: zroot
 state: ONLINE
  scan: scrub repaired 0B in 01:48:24 with 0 errors on Wed Oct 11 09:20:49 2023
config:

    NAME        STATE     READ WRITE CKSUM
    zroot       ONLINE       0     0     0
      nvd0p3    ONLINE       0     0     0

errors: No known data errors
 
Ok, I have to admit this is not exactly what I had in mind based on the beginning of the thread.

ZFS guards your data very well: it checksums just about everything when your data is on disk, warns you when errors affect data integrity, and will continue to serve all its data correctly. However, once there is no redundancy left, any further error can immediately endanger your data. That means using a pool with redundancy, i.e. a mirrored or RAIDZx setup (preferably with x of at least 2) where the data is spread over an appropriate number of VDEVs located on different physical storage devices.

Without redundancy a data error in a ZFS pool cannot be corrected, and in the worst case (damaged pool metadata) that can mean the complete loss of the pool. There is no ZFS equivalent of UFS' fsck(8), so you'll have to resort to backups when a pool is lost.

If your ZFS data is valuable to you, especially when you are using ZFS in a production or professional environment, you'd be well advised to get acquainted with the basic principles regarding data integrity, setup & management and backup of a ZFS filesystem (you may have that already in place at the VMWare level).

Search/look through the forum and you'll find valuable ZFS information and references. Some relevant ZFS resources:
Though Absolute FreeBSD has only basic/intermediate (but important) ZFS information, it contains a wealth of FreeBSD (sysadmin) info. I recommend the two ZFS books; for a couple of bucks you can get a DRM-free PDF version.

There are a lot more ZFS resources available on the web, in video format as well. A lot cover the technical inner workings of ZFS. There are some videos more oriented towards ZFS users or admins, but I presume you'd prefer something more easily referenced on hand.

It is important, when adding a VDEV to a pool, to take care of the proper value for ashift: use the appropriate value when specifying -o ashift=xx on the command line, or set the minimum value through the relevant sysctl, for example sysctl vfs.zfs.min_auto_ashift=12. See also two of my earlier messages: this & that. Choosing the wrong ashift, or different ashifts for the various VDEVs of a pool, can be disastrous for performance (and can also have a negative impact on storage efficiency). One video: Preferred Ashift by George Wilson - OH SH*FT!!
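
To illustrate the command-line part (pool and device names are placeholders):
Code:
# sysctl vfs.zfs.min_auto_ashift=12
# zpool create -o ashift=12 <pool> <vdev>
# zpool attach -o ashift=12 <pool> <existing-device> <new-device>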

How exactly the VMWare setup of underlying volumes may affect the working of a ZFS pool I cannot tell you, but you have to make sure that VDEV redundancy is spread over an appropriate number of hardware disks. For example, I imagine that for a two-way mirror you should create the two mirror halves from volumes that are located on different physical disks. Inform yourself appropriately, perhaps from someone who has experience in the ZFS-on-VMWare domain.

As you are currently using a consumer NVMe, depending on your desired data durability and estimated write cycles (especially in relation to your cyclical buffer file), you may want to consider (semi-)professional-grade SSDs/NVMes; Power Loss Protection (PLP) is a desirable feature in case of power failure.

Having secured a redundant setup (for your situation a mirror, at least two-way, comes to mind), there are some aspects of ZFS that can make your life much easier. Some useful articles from Klara Systems:
  1. Managing Boot Environments
  2. Let’s Talk OpenZFS Snapshots (including its first, basic part)
  3. Demystifying OpenZFS 2.0 - a small overview of OpenZFS 2.0 additions

Lastly, and IMO definitely the less important issue (you're also not experiencing any performance problems):
Code:
# zpool list -v
# zdb -S zroot
Simulated DDT histogram:

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1     142M    546G    449G    687G     142M    546G    449G    687G
[...]
dedup = 1.00, compress = 1.22, copies = 1.53, dedup * compress / copies = 0.80
"dedup=1.00" tells that the dataset on hand does not have data to be "deduplicated". I understand, you'd want to duplicate the dedup setting for your additional pool to not skew any comparison. Looking at this data point, I don't see any direct advantage in keeping dedup enabled. In general: files that have been deduplicated ("dedup-ed") don't get "un-deduplicated" after disabling the property; it works akin to compression. Only when moved outside the ZFS dataset is data "freed" from deduplication and "re"-deduplication is depend on the properties of the target.
 
Most of my production workloads use hardware-based RAID with VMWare datastores on top. This VM is an outlier, and the NVMe was needed for performance. Besides the one 10GB file I've been referencing, this volume also holds ~800,000,000 Usenet articles stored in individual files, all of which get recorded in a history file and overview database. At times I'm importing several million files from other servers in batches, so maximum I/O was the goal, as storing each article produces a number of I/O operations beyond writing the article itself to a file. Once the archiving project is complete these articles will be sorted chronologically and transmitted to their final home on RAID-based storage. I do have multiple copies of this data in other places should I have a failure before I'm done.

Thanks for the links and the mention of ashift; mine defaulted to 12 when the pool was created. I'm a Linux convert and most of my storage operations these days are performed at the VMWare level. Aside from this VM, everything else runs on hardware RAID with backups created using VMWare snapshots and other VMWare-specific tools. Diving into FreeBSD and ZFS has been a pleasant user experience but also a huge learning experience where I'm back to square one on a lot of fundamentals; it's been a lot of fun, though.

Now that I'm learning more about ZFS and datasets, I will be spending more time setting up the final server where this data will live, with separate datasets for the /usr/local/news/spool/articles folder, etc. It would be nice for the spool to live in its own dataset so I can use ZFS send/receive to archive it on occasion.
 
I've gone down a rabbit hole regarding ZFS and SSD/NVMe storage... I noticed that after 2 months of usage on my 2TB NVMes I've gone through 10% of the life (100TBW) of the NVMe containing the ZFS volume. I'm thinking ZFS may not be the right filesystem for my use case. The FreeBSD box runs as the primary news server and I have a Linux box (ext4) that is secondary. Both servers are fed the same dataset from a single source. The NVMe where the Linux box's ext4 volume lives is reporting 8TBW, which is more in line with other metrics I have regarding how much data has flowed into the boxes from the source. Besides the occasional 'freebsd-update', the applications and operations performed by these two machines are identical. Both run the same applications and ingest identical data, yet the NVMe where the ZFS volume lives is seeing 12.5x more data written. To me that is an insane amount of overhead, especially atop hardware RAID solutions.

I'm not sure if having dedup on this whole time has been a large factor in the number of writes, but I have atime and relatime off and I'm not sure what else I can do to decrease the amount of writes. At this rate, even an enterprise SSD rated for 3,000+ TBW is only going to last a few years. I know I made some mistakes, like letting the pool get over 80% full, but that's the way life goes sometimes. Maybe root on ZFS and the rest of my storage on UFS is the better fit for me.

The Usenet server software (INN) is ancient, single-threaded, and extremely heavy on small writes. Some files are written to constantly or regenerated and overwritten daily, like the history and overview databases, so these are probably poor choices for ZFS on solid state storage, but the article storage is a 'write once, read many' workload and would probably be fine on ZFS.

If you have any other suggestions for tuning ZFS for solid state storage I'm open to further reading even though most of it is over my head.

None of this gets to the root of my original question, though: how does a file consume more space on disk than its size? When this all started I had multiple CNFS buffers: two were 100GB files, plus the one remaining 10GB file I've been referencing above. The two 100GB files were consuming 130GB of disk space each according to every tool I used to investigate, and when I removed both the pool reported ~260GB of reclaimed space. I left the 10GB file in place in case it would be helpful for troubleshooting, but as I've reached a point where this is way over my head, and I have another system with identical function and dataset, all I can do is point my finger at ZFS and say it isn't the best fit for me. But I don't like giving up that easily. ;)
 
After more reading, I'm even more confused about the proper value to use for ashift when running atop ESXi, but I think the default of 12 is a safe bet from what I've read. What I did not realize, and what may be particularly impactful for my use case, is that the default recordsize is 128k. The vast majority of the hundreds of millions of files are <64k; I'd even venture to say (without actually calculating) that 90% or more are smaller than 32k, since we are talking mostly about text Usenet articles. It sounds like I would benefit from a smaller recordsize.
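
For what it's worth, recordsize is a per-dataset property that only applies to data written after it is changed, and as far as I understand files smaller than the recordsize are stored in a single block sized to fit anyway, so small articles aren't padded out to 128k. Checking or changing it would look something like this (dataset name is a placeholder):
Code:
# zfs get recordsize zroot/news
# zfs set recordsize=32K zroot/news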
 
Hmmm, maybe I need to back up a minute... I did some playing around with how VMWare presents the block device to the VM, and no matter which 'virtual controller' type (SCSI, SATA, NVMe, etc.) I chose, the OS sees 512-byte sectors. Does this mean I should really be using an ashift value of 9 inside the VM?
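
For what it's worth, the 512-byte sectors are likely just what the virtual controller emulates; the flash underneath works in 4K (or larger) pages, so ashift=12 is generally considered the safer choice. To see what the virtual disk reports and what ashift the existing pool was created with (device and pool names taken from earlier in the thread), something like:
Code:
# diskinfo -v /dev/nvd0
# zdb -C zroot | grep ashift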
 
Are you running an INN server with 1-file-per-article storage?
Yes, all text articles are stored using INN's tradspool format (each article is a file stored in a hierarchy of folders based on newsgroup), with binaries and some junk groups stored in CNFS buffers. This isn't optimal long-term, but I have a myriad of reasons why I am aggregating the archive this way. Once pulling in the archives is complete, the articles sorted into chronological order, and a lot of the spam removed, they will be fed into a new server using CNFS exclusively.
 