ZFS: Discrepancy in expected vs. actual capacity of ZFS filesystem

The actual capacity of my filesystem is smaller than what I expected/calculated, and I'd like to better understand why. I suspect this may have something to do with some facet of ZFS like ashift or recordsize that I'm forgetting to account for. Please note that I do not believe this is related to misinterpreting the results returned by zpool list vs. zfs list.

Quick background:

I've been using ZFS on several OSes (OpenSolaris, FreeBSD, Linux) for a little over 8 years now, and I'm familiar with the differences between the various sizes reported by zpool list, zfs list, and df.

My understanding of those differences is, briefly:
  • zpool list returns the number of bytes in use and the number of bytes available across all storage devices in a pool. It does not care how those bytes are used (they could be data, parity, or housekeeping such as metaslab space maps); it simply reports how many bytes are used and how many are available.
  • zfs list returns the total number of bytes required by the filesystem. It intelligently accounts for parity devices (by not including them), filesystem metadata (by including it), compression (by returning the space actually consumed on disk, i.e. after compression), and deduplication (by returning the space required to store all duplicates).
  • df could return a value smaller or larger than zfs list since it does not understand how to interpret compression, snapshots, etc.
I know this is a common stumbling block for some, so I wanted to get that out of the way first.

As for my setup, I'm using the default recordsize=128K. My zpool was created with ashift=12 and contains a single RAID-Z2 vdev comprised of 11x8TB drives:

Code:
# zdb | grep "ashift"
            ashift: 12
# zfs get recordsize backuptank0
NAME         PROPERTY    VALUE    SOURCE
backuptank0  recordsize  128K     default
# zpool status
  pool: backuptank0
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Sun Jul 31 03:33:54 2016
config:

    NAME              STATE     READ WRITE CKSUM
    backuptank0       ONLINE       0     0     0
      raidz2-0        ONLINE       0     0     0
        Z840BYL8-enc  ONLINE       0     0     0
        Z840BYSH-enc  ONLINE       0     0     0
        Z840BYXA-enc  ONLINE       0     0     0
        Z840BZA6-enc  ONLINE       0     0     0
        Z840BZG7-enc  ONLINE       0     0     0
        Z840BZK0-enc  ONLINE       0     0     0
        Z840CVPP-enc  ONLINE       0     0     0
        Z840DW23-enc  ONLINE       0     0     0
        Z840E0SR-enc  ONLINE       0     0     0
        Z840KHKL-enc  ONLINE       0     0     0
        Z840KK3X-enc  ONLINE       0     0     0

errors: No known data errors

Each drive is exactly 8001563222016 bytes in size, or roughly 8002GB|7452GiB, or 8TB|7.27TiB.

11 of these drives should yield a total capacity of 88017195442176 bytes, or roughly 88017GB|81972GiB, or 88TB|80.05TiB. The base2 value seems to agree well with zpool list, though ZFS's use of "T" to denote units is ambiguous:

Code:
# zpool list
NAME          SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
backuptank0    80T  20.9T  59.1T         -    16%    26%  1.00x  ONLINE  -

Since this is a RAID-Z2, I would expect 2*8TB|7.27TiB = 16TB|14.54TiB to be lost to parity, bringing the total capacity available to the filesystem down to 72TB|65.51TiB.
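
To make that naive arithmetic explicit, here is a small Python sketch of the same numbers (a hypothetical helper script, not part of ZFS; it only assumes that parity costs exactly two whole drives, which is the expectation being tested):

Code:
# Naive capacity expectation for 11 x 8 TB drives in a RAID-Z2 vdev (sketch only).
DRIVE_BYTES = 8001563222016        # exact per-drive size quoted above
NDISKS, NPARITY = 11, 2            # RAID-Z2 of 11 disks

TB, TiB = 10**12, 2**40

raw = NDISKS * DRIVE_BYTES                         # roughly what zpool list reports as SIZE
naive_usable = (NDISKS - NPARITY) * DRIVE_BYTES    # pretend parity costs exactly 2 drives

print(f"raw pool size: {raw / TB:.2f} TB = {raw / TiB:.2f} TiB")                     # ~88.02 TB = ~80.05 TiB
print(f"naive usable:  {naive_usable / TB:.2f} TB = {naive_usable / TiB:.2f} TiB")   # ~72.01 TB = ~65.5 TiB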

The actual value of 59T (15.5T used + 43.5T avail; not sure if that's base2 or base10) that I'm seeing reported is much smaller than the expected 72TB|65.51TiB I calculated above:

Code:
# zfs list
NAME          USED  AVAIL  REFER  MOUNTPOINT
backuptank0  15.5T  43.5T  15.5T  /backuptank0

Assuming zfs is reporting in base2, where am I losing those 6.51TiB (65.51TiB - 59TiB) of expected vs. actual capacity? That's 9.93%!
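
The gap itself, computed the same way (a sketch that assumes the 15.5T and 43.5T printed by zfs list are base-2 TiB):

Code:
# Gap between the naive expectation and what zfs list reports (sketch only).
DRIVE_BYTES = 8001563222016
TiB = 2**40

expected = 9 * DRIVE_BYTES / TiB    # ~65.5 TiB, with parity naively costing two drives
reported = 15.5 + 43.5              # USED + AVAIL from zfs list, assumed to be TiB

gap = expected - reported
print(f"gap: {gap:.2f} TiB ({gap / expected * 100:.1f}% of the expectation)")   # ~6.5 TiB, ~9.9%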

The pool has deduplication disabled and has no snapshots. It does have lz4 compression enabled, though all data in the pool at the moment is incompressible.
 
I hate it when people reply without a real answer, but I'm going to try anyway.

I went through this a while back; the additional filesystem consumption has to do with overhead from the 4K blocks. If you set ashift to 9 you will see numbers at about what you are hoping for. I can't remember the exact calculation, but it assumes some of the 4K blocks will not be full even though they are allocated. I don't think the loss is entirely real, though: if you filled backuptank0 with one big file using nothing but full 4K blocks, I believe it would fill to the higher water mark. I can't remember the exact details right now, but it's a place to start.
 
lkateley I believe you are referring to the so-called "slack". If you have 4K blocks, writing a 2K file will show up as using 2K but the whole 4K block will be in use. So if you have a large number of 2K files, there's actually twice as much space in use as data. To combat this rather inefficient way of storing things, modern filesystems like FreeBSD's UFS and ZFS use some form of block sub-allocation. This allows, for example, two 2K files to occupy just one 4K block. But some blocks will still not be filled completely, leaving some unusable free space.
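
To put rough numbers on that slack effect, here is a toy Python calculation (just the round-up arithmetic for 4 KiB allocation units, not anything ZFS-specific):

Code:
# Toy illustration of slack: each file occupies a whole number of 4 KiB
# sectors, so the unused tail of the last sector is wasted.
SECTOR = 4096

for filesize in (2048, 4096, 6144, 128 * 1024):
    allocated = -(-filesize // SECTOR) * SECTOR     # round up to a sector multiple
    print(f"{filesize:>7} bytes of data -> {allocated:>7} bytes on disk "
          f"({allocated - filesize} bytes of slack)")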
 
Appreciate the replies, lkateley and SirDice. I did explore the possibility that this was due to ashift (or some other parameter I set without fully realizing the effects it would have), but I was never able to fully explain the 9.93% difference, which just seemed too high to me (hence my post here).

To elaborate, some of the effects I believe I've accounted for are:
  • 1/64th of the raw pool size reserved for increasing allocator efficiency (space ZFS keeps free so that deletion still works on a completely full pool); a quick sense of its size is sketched just after this list.
  • The two uberblock sections per disk. My understanding is that these only ever reach a few megabytes in size at most.
  • The ~200 metaslabs (and their space maps) ZFS keeps per top-level vdev (there's only the one vdev here). Although I'm no longer able to find the Sun webpage I originally read this on, I recall these typically being a few tens of MiB in size. I do remember reading that for filesystems with an extremely large number of very small files they could inflate to hundreds of MiB, but the vast majority of files on this filesystem are in the single-GiB range - lots of large .tar.gz's, very few files under 1MiB.
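
To put a rough number on the first item above, assuming the reservation really is 1/64th of the raw pool size as described:

Code:
# Approximate size of a 1/64th-of-raw-pool reservation (assumption from the list above).
DRIVE_BYTES = 8001563222016
raw = 11 * DRIVE_BYTES

reserve = raw / 64
print(f"{reserve / 2**40:.2f} TiB")    # ~1.25 TiB - nowhere near the missing ~6.5 TiB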
I've been discussing this with a few folks on IRC, and they pointed me to a table (seen here) that illustrates the expected allocation overhead; for this system (11 disks, RAID-Z2, ashift=12, 128KiB recordsize) that figure is 7.38%.
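
For anyone curious where a figure like 7.38% comes from, here is a rough Python sketch of the usual RAID-Z accounting for one full 128KiB record: parity sectors are added per stripe row, and each allocation is padded up to a multiple of (nparity + 1) sectors. This is only an approximation of the linked table's method, not the table itself, and may round slightly differently:

Code:
# Rough estimate of RAID-Z allocation overhead for one full 128 KiB record.
import math

NDISKS, NPARITY = 11, 2        # RAID-Z2 vdev of 11 disks
ASHIFT = 12                    # 4 KiB sectors
RECORDSIZE = 128 * 1024        # 128 KiB records

sector = 1 << ASHIFT
data = RECORDSIZE // sector                                  # 32 data sectors
rows = math.ceil(data / (NDISKS - NPARITY))                  # 4 stripe rows
parity = NPARITY * rows                                      # 8 parity sectors
alloc = data + parity
alloc = math.ceil(alloc / (NPARITY + 1)) * (NPARITY + 1)     # pad to a multiple of 3 -> 42

ideal = data * NDISKS / (NDISKS - NPARITY)                   # cost if parity were exactly 2/11
print(f"{alloc} sectors allocated vs {ideal:.2f} ideal "
      f"-> {(alloc / ideal - 1) * 100:.2f}% overhead")       # ~7.4%

That lands very close to the 7.38% from the table.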

Maybe the remaining ~2.55% really is due to the above effects? That doesn't feel right, but I'm not able to come up with any other explanation. Hmm.
 