ZFS New array with identical data using nearly 50% more space

I have a 3 x 1 TB mirror-0 array, and I'm migrating the data to a second 2 x 2 TB mirror-0 array, using rsync.

The only differences between the mirrors are:

1. The version of FreeBSD which created it (FreeBSD 10.4-STABLE versus FreeBSD 12.0-STABLE). Note that the original array has had "zpool upgrade" done since moving to 12.0-STABLE.
2. The logical size of the array (1TB versus 2TB)
3. The ashift value (ashift=9 versus ashift=12)
4. The use of different partitioning schemes (MBR adaXs1 versus GPT adaXp1)

This is how it looks once the data is copied:

Code:
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
db      928G   673G   255G        -         -    75%    72%  1.00x  ONLINE  -
dbnew  1.81T   985G   871G        -         -     0%    53%  1.00x  ONLINE  -


NAME  PROPERTY              VALUE                  SOURCE
db    used                  673G                   -
db    referenced            673G                   -
db    compressratio         2.11x                  -
db    logicalused           1.37T                  -
db    logicalreferenced     1.37T                  -

NAME  PROPERTY              VALUE                  SOURCE
dbnew  used                  985G                   -
dbnew  referenced            985G                   -
dbnew  compressratio         2.11x                  -
dbnew  logicalused           1.37T                  -
dbnew  logicalreferenced     1.37T                  -


The two arrays have identical compressratio/logicalused/logicalreferenced values (as expected, since the data is the same), but for some reason the new array consumes an extra 312GB (about +46%) of space.

I know that ashift=12 will result in less usable space, but surely the penalty shouldn't be this large?

Any ideas? Thank you.
 

Thanks, interesting thread. I did a search before starting this one but obviously missed it. I was searching for 46% ;)

I'm thinking - and I'm happy to be proven wrong - that ashift=9 could still work with a 4k drive, so long as:

- The partition is aligned to 4k
- zfs primarycache/secondarycache is never set to "none" (ie: reads must be cached)
- zfs sync is never set to "always" (ie: writes must be cached)

(Note that the latter requirement can be defeated if the application itself gratuitously forces synchronous flushes, overriding ZFS's ability to choose when, and in what order, to write out pending sectors. Working around such behaviour may require setting sync=disabled, which is potentially risky.)
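
For anyone wanting to verify those conditions on their own setup, something along these lines should do it (the device and dataset names here are placeholders, not the ones from my pools):

Code:
# partition start and size should be multiples of 8 sectors (4k alignment on 512b-sector drives)
gpart show ada1

# reads and writes must remain cacheable
zfs get primarycache,secondarycache,sync tank/mydata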

In my case, I was migrating to a larger array to give myself some headroom, but a 46% reduction in storage space hardly makes it worth it, so, I'm prepared to take a performance hit.
 
These are mirrors rather than raids, so it’s a little different from PMc’s setup.

I’m guessing you have lots of small files? Assuming the drives are not 4kn (have at least emulated 512b sectors) you could go with ashift=9.
 
These are mirrors rather than raids, so it’s a little different from PMc’s setup.

I’m guessing you have lots of small files? Assuming the drives are not 4kn (have at least emulated 512b sectors) you could go with ashift=9.

Not really. 2505 files in total, with 99%+ of the used storage for MySQL files. 1.37TB (apparent size) over 2505 files doesn't seem to explain "slack" as the reason for the wasted space. Is there a command which may give some clues about this?

I've recreated the array with ashift=9 and I'm currently populating it. The only issue is that "zpool status" complains about the sector size mismatch, which is annoying, but I can live with it. (Interestingly, despite ZFS knowing the drive uses 4k sectors internally, dmesg talks only about 512-byte sectors when probing the drive at boot; no mention of "4k" or "4096".)
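
For reference, this is roughly how the sector sizes can be checked and the smaller ashift forced at pool-creation time with FreeBSD's in-base ZFS (device and pool names as in my setup; I believe newer OpenZFS also accepts -o ashift=9 on zpool create):

Code:
# logical sector size vs. physical stripe size as GEOM reports them
diskinfo -v /dev/ada3 | grep -E 'sectorsize|stripesize'

# cap the automatically chosen ashift at 9 before creating the pool
sysctl vfs.zfs.max_auto_ashift=9
zpool create dbtest ada3s1

# confirm what the pool actually got
zdb -C dbtest | grep ashift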
 
These are mirrors rather than raids, so it’s a little different from PMc’s setup.

Oops, my fault. I only read as far as "3 x 1TB" and didn't read carefully.

I'm thinking - and I'm happy to be proven wrong - that ashift=9 could still work with a 4k drive, so long as:

Spinning disk or SSD? On a spinning disk it is only a performance issue; on an SSD it might increase wear.

In my case, I was migrating to a larger array to give myself some headroom, but a 46% reduction in storage space hardly makes it worth it, so, I'm prepared to take a performance hit.

I would think it is worth first figuring out where that 46% has gone - it must be somewhere. I have found plain mirrors to behave quite predictably - but you have compression on, and that makes things a little less uniform. At least my thread could give some ideas on how to hunt such an issue down...

1.37TB (apparent size) over 2505 files doesn't seem to explain "slack" as the reason for the wasted space.

No. And sparse files should have been caught by the compression.

Is there a command which may give some clues about this?

I found du gives surprisingly accurate data - that means starting from the bottom, looking at (some of) the individual files, and comparing the space they use. The ZFS space accounting is a bit weird, but it usually does sum up correctly, once one gets a feel for how it is meant to be interpreted.
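
In that spirit, comparing the allocated size against the apparent size of the same files on both pools might be the quickest way to see where the space goes (the paths below are placeholders for wherever the MySQL data lives):

Code:
# allocated (on-disk) size of each file on the old pool
du -sh /db/mysql/*
# apparent (logical) size of the same files
du -sAh /db/mysql/*

# repeat on the new pool and compare per file
du -sh /dbnew/mysql/*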
 
I’m guessing there are no snapshots on the new pool, since you said it was updated with rsync. Not sure what to make of this space loss. Hopefully the new (ashift=9) version fixes it for you.

What do you have your recordsize set to? My best guess is that previously some records/writes after compression were << 4096 bytes and fit into a few 512b sectors rather than being stuck allocating 4096b (the smallest possible allocation with ashift=12). If that logic holds, compressratio must report the logical compression rate (ignoring allocation restrictions) rather than the physically realized compression rate on disk.
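
(For what it's worth, the recordsize in question can be read straight off both pools:)

Code:
zfs get recordsize,compression db dbnew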
 
Following this up, I haven't really been able to do much more diagnosing with the array, because I couldn't afford more downtime.

I did try ashift=12 on the remaining original 1TB disk (which is now unused), but it ran out of space before I could copy everything over. I don't consider that a complete fail, since it again clearly shows that ashift=12 gobbles up significantly more space for this data; the same data can be stored in less than 1TB with ashift=9.

Here's the configuration of both the original and subsequent test arrays (I used the zpool history of the first to create the second). Note the recordsize is set to 8k:

Code:
History for 'dbtest':
2019-06-24.21:18:27 zpool create dbtest ada3s1
2019-06-24.21:18:34 zfs set recordsize=8K dbtest
2019-06-24.21:18:39 zfs set atime=off dbtest
2019-06-24.21:18:42 zfs set compression=lz4 dbtest

Another experiment I did was to rebuild one of my backup arrays (a two-drive stripe) with ashift=9 (it was originally created with ashift=12).

Here's a comparison of ashift=12 and ashift=9:

Code:
NAME     SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
rddbak  3.62T  3.04T   601G         -    61%    83%  1.03x  ONLINE  -

NAME     SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
rddbak  3.62T  2.96T   684G         -    55%    81%  1.03x  ONLINE  -

Not such a substantial difference; this time I got back about 2.3% free space. Note this array uses deduping.
 
After upgrading MySQL and changing the database engine from MyISAM to InnoDB, the performance penalty of misaligned 4k writes was too great, so I've had to move to ashift=12 again, on a new array with slightly larger drives:

Code:
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
db     1.81T  1.04T   790G        -         -    68%    57%  1.00x  ONLINE  -
dbnew  2.72T  1.45T  1.27T        -         -     1%    53%  1.00x  ONLINE  -

db -> 2x2TB (1x512b, 1x4k), ashift=9
dbnew -> 2x3TB (both 4k), ashift=12

Again there's a massive loss of space; on the first array the data consumes 1.04T, on the second the exact same data consumes 1.45T. :( As we've discussed in this thread, it's not the sort of overhead you'd expect when moving from 512b to 4k alignment, so I wonder where on earth that extra 400GB+ of wasted space has gone...
 
I have a 3 x 1 TB mirror-0 array, and I'm migrating the data to a second 2 x 2 TB mirror-0 array, using rsync.

The only differences between the mirrors are:

1. The version of FreeBSD which created it (FreeBSD 10.4-STABLE versus FreeBSD 12.0-STABLE). Note that the original array has had "zpool upgrade" done since moving to 12.0-STABLE.
2. The logical size of the array (1TB versus 2TB)
3. The ashift value (ashift=9 versus ashift=12)
4. The use of different partitioning schemes (MBR adaXs1 versus GPT adaXp1)

This is how it looks once the data is copied:

Code:
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
db      928G   673G   255G        -         -    75%    72%  1.00x  ONLINE  -
dbnew  1.81T   985G   871G        -         -     0%    53%  1.00x  ONLINE  -


NAME  PROPERTY              VALUE                  SOURCE
db    used                  673G                   -
db    referenced            673G                   -
db    compressratio         2.11x                  -
db    logicalused           1.37T                  -
db    logicalreferenced     1.37T                  -

NAME  PROPERTY              VALUE                  SOURCE
dbnew  used                  985G                   -
dbnew  referenced            985G                   -
dbnew  compressratio         2.11x                  -
dbnew  logicalused           1.37T                  -
dbnew  logicalreferenced     1.37T                  -


The two arrays have identical compressratio/logicalused/logicalreferenced values (as expected, since the data is the same), but for some reason the new array consumes an extra 312GB (about +46%) of space.

I know that ashift=12 will result in less usable space, but surely the penalty shouldn't be this large?

Any ideas? Thank you.

The values are not exact and you should take them with a grain of salt (at least, that is what Michael Lucas writes in his ZFS book).

What happens when you compare the datasets?
Code:
zfs list -d2 -o space
 
The values are not exact and you should take them with a grain of salt (at least, that is what Michael Lucas writes in his ZFS book).

The numbers suggest that about one sixth of the array (at the current 53% capacity) has apparently disappeared into thin air. How far off do they need to be before we consider there may be a problem? :)

What happens when you compare the datasets?
Code:
zfs list -d2 -o space

Code:
# zpool list
NAME                 SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
dbnew               2.72T  1.45T  1.27T        -         -     1%    53%  1.00x  ONLINE  -
db                  1.81T  1.04T   790G        -         -    67%    57%  1.00x  ONLINE  -

# zfs list -d2 -o space
NAME                                 AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
dbnew                                1.18T  1.45T     45.0M   1.45T              0       402M
db                                    732G  1.04T        1K   1.04T              0      1.52G

The AVAIL (zfs list) and FREE (zpool list) values differ a little, but not by such a large margin.
 
Output of zfs list -ro compressratio,recordsize,name ?

I’m guessing what you’re seeing is the inability to compress at least 2x, which is what would be required to save anything with compression on an ashift=12, recordsize=8K file system.

I’m also a little confused by the performance concern while still using dedup - or have you moved off of dedup?
 
Output of zfs list -ro compressratio,recordsize,name ?

I’m guessing what you’re seeing is the inability to compress at least 2x, which is what would be required to save anything with compression on an ashift=12, recordsize=8K file system.

I’m also a little confused by the performance concern while still using dedup - or have you moved off of dedup?

Re dedup, that was a separate experiment where I dropped back to ashift=9 to regain some space in a crowded backup set. Unrelated.

Back to the main concern: the output of zfs list -ro compressratio,recordsize,name shows identical values for the old and new arrays (as expected). Note that since changing to the MySQL InnoDB engine, the recordsize has been changed to 16k.

Code:
RATIO  RECSIZE  NAME
2.21x      16K  dbnew      # ashift=12
2.21x      16K  db      # ashift=9

Here's some additional information...


zdb -C
Code:
dbnew:
    version: 5000
    name: 'dbnew'
    state: 0
    txg: 18372
    pool_guid: 5580125512365205868
    hostid: 3190631135
    hostname: '(scrubbed)'
    com.delphix:has_per_vdev_zaps
    vdev_children: 1
    vdev_tree:
        type: 'root'
        id: 0
        guid: 5580125512365205868
        create_txg: 4
        children[0]:
            type: 'mirror'
            id: 0
            guid: 17266329692481671090
            metaslab_array: 68
            metaslab_shift: 34
            ashift: 12
            asize: 3000586731520
            is_log: 0
            create_txg: 4
            com.delphix:vdev_zap_top: 65
            children[0]:
                type: 'disk'
                id: 0
                guid: 12905941275437905905
                path: '/dev/diskid/DISK-19L5LJ7ASp1'
                whole_disk: 1
                create_txg: 4
                com.delphix:vdev_zap_leaf: 66
            children[1]:
                type: 'disk'
                id: 1
                guid: 18139536924588654206
                path: '/dev/diskid/DISK-19M5WB1ASp1'
                whole_disk: 1
                create_txg: 4
                com.delphix:vdev_zap_leaf: 67
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data


zfs get all dbnew | grep -v default
Code:
NAME  PROPERTY              VALUE                  SOURCE
db    type                  filesystem             -
db    creation              Mon Oct 14 18:33 2019  -
db    used                  1.45T                  -
db    available             1.18T                  -
db    referenced            1.45T                  -
db    compressratio         2.21x                  -
db    mounted               yes                    -
db    recordsize            16K                    received
db    mountpoint            /db                    received
db    compression           lz4                    received
db    atime                 off                    received
db    createtxg             1                      -
db    xattr                 off                    temporary
db    version               5                      -
db    utf8only              off                    -
db    normalization         none                   -
db    casesensitivity       sensitive              -
db    guid                  4098295464672474452    -
db    primarycache          all                    received
db    usedbysnapshots       85.0M                  -
db    usedbydataset         1.45T                  -
db    usedbychildren        733M                   -
db    usedbyrefreservation  0                      -
db    mlslabel                                     -
db    sync                  standard               received
db    refcompressratio      2.21x                  -
db    written               545M                   -
db    logicalused           2.28T                  -
db    logicalreferenced     2.28T                  -

Count of files on array: 24604
 
Has all of the data on dbnew been rewritten (NOT send/recv) since the change to 16k? I would have expected a little better result (storage efficiency), given the compression ratio that is available (which, as you can see, is calculated independently of the recordsize).

You’re still using less space than logically referenced — what you are seeing is that ZFS compression on small record sizes works much better on ashift=9, since there are effectively eight sizes it can save (reduce) a 4K record to, and 16 it can save an 8k record to, while those numbers are 1 and 2 for ashift=12. It has to allocate enough to store the compressed data, so you’re always rounding up to an available storage size, but ashift=12 makes those round-ups more painful.
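
To put rough numbers on that rounding with a hypothetical record (not one taken from this pool): an 8K record that compresses to about 4.5K needs nine 512b sectors with ashift=9, but two whole 4K sectors with ashift=12, i.e. no savings at all.

Code:
# bytes allocated for a record whose compressed size is 4608 bytes (4.5K)
compressed=4608
echo $(( (compressed + 511)  / 512  * 512  ))   # ashift=9  -> 4608
echo $(( (compressed + 4095) / 4096 * 4096 ))   # ashift=12 -> 8192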
 
Has all of the data on dbnew been rewritten (NOT send/recv) since the change to 16k? I would have expected a little better result (storage efficiency), given the compression ratio that is available (which, as you can see, is calculated independently of the recordsize).

This is a good point. Some of the data would have been rewritten because of conversion from MyISAM to InnoDB, but I have not explicitly 'regenerated' the data through copying, nor have all tables been converted yet.

I'll use the original array drives to create another ashift=12 array, then copy back the files using rsync. We'll see how that compares.

You’re still using less space than logically referenced — what you are seeing is that ZFS compression on small record sizes works much better on ashift=9, since there are effectively eight sizes it can save (reduce) a 4K record to, and 16 it can save an 8k record to, while those numbers are 1 and 2 for ashift=12. It has to allocate enough to store the compressed data, so you’re always rounding up to an available storage size, but ashift=12 makes those round-ups more painful.

This is the first hypothesis I've seen that provides a plausible explanation. Thank you. Painful is an understatement! With the regeneration at 16k block size, the wasted space should be reduced, at least slightly, since up to two compressed blocks can fit, versus one? I'll report back once it's complete.
 
3TB array with many files still using recordsize of 8k, copied via zfs send|zfs recv: 1.48T alloc
After rsyncing to temporary array, then back, to force all files to use recordsize of 16k: 1.23T alloc, compressratio 2.33x

So the regeneration with new recordsize has freed up about 250GB.

One thing I noticed is that at least one particular file copied at a very slow speed: around 5-10 MB/s. I confirmed the bottleneck was on the read side by using dd to copy the file to /dev/null.

After the regeneration, the file reads at 200MB/sec. I rebooted to ensure that the data wasn't cached in memory, and saw the same result.

These two remarkably different read speeds are from the exact same array (same physical hardware, same ZFS config) and the same file (bit for bit identical); the only difference is that the array contents have been copied back over. This seems like too large a discrepancy to be explained by the change from 8k to 16k recordsize, and I'm pretty sure the file would have been all 16k records anyway (it's a 29GB InnoDB database file).
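
(For reference, the read check was nothing fancy; roughly the following, with the path being a stand-in for the actual InnoDB file:)

Code:
# sequential read of the cold file, discarding the data
dd if=/db/mysql/bigtable.ibd of=/dev/null bs=1m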
 

One thing I noticed is that at least one particular file copied at a very slow speed: around 5-10 MB/s. I confirmed the bottleneck was on the read side by using dd to copy the file to /dev/null.

After the regeneration, the file reads at 200MB/sec. I rebooted to ensure that the data wasn't cached in memory, and saw the same result.

These two remarkably different read speeds are from the exact same array (same physical hardware, same ZFS config) and the same file (bit for bit identical); the only difference is that the array contents have been copied back over. This seems like too large a discrepancy to be explained by the change from 8k to 16k recordsize, and I'm pretty sure the file would have been all 16k records anyway (it's a 29GB InnoDB database file).

ZFS is copy-on-write; what this means for any file with lots of small random writes (like a database file) is that, over time, logically contiguous portions are physically scattered.

What you’ve done is what traditional defragmentation tools do — re-lay out data such that the physical and logical proximities align. In general, ZFS mitigates this penalty via aggressive caching of both the most recently and most frequently used data, but when you’re just trying to read all of the data from (logical) start to finish, there’s not much that can be done — especially if you’re starting with a cold cache. The switch to larger record sizes will also help reduce this penalty in general, but that is a much smaller effect than re-laying out the data on magnetic media.

Your “regenerated” data is going to have much better physical alignment having been only written once, and sequentially at that; your dd read tests confirm it.

This effect is also essentially non-existent on SSDs, as there is no physical head to seek across the platter for “near” or “far” blocks.
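
If you ever want to re-lay out just one cold file rather than the whole pool, a plain copy followed by a rename achieves the same thing, since every record gets rewritten sequentially (a sketch only; the path is a placeholder and nothing must be writing to the file while you do it):

Code:
# rewrite the file sequentially, then swap it into place
cp /db/mysql/bigtable.ibd /db/mysql/bigtable.ibd.defrag
mv /db/mysql/bigtable.ibd.defrag /db/mysql/bigtable.ibd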
 
3TB array with many files still using recordsize of 8k, copied via zfs send|zfs recv: 1.48T alloc
After rsyncing to temporary array, then back, to force all files to use recordsize of 16k: 1.23T alloc, compressratio 2.33x

So the regeneration with new recordsize has freed up about 250GB.

This aligns with my hypothesis above that it is the reduced compression flexibility with ashift=12 that you are witnessing. Larger and larger record sizes will ameliorate the condition (so long as files are large enough to use them) but with a commensurate increase in latency for small operations.
 
Your “regenerated” data is going to have much better physical alignment having been only written once, and sequentially at that; your dd read tests confirm it.

That would make sense, except that this data had already been rewritten only a couple of days ago when I received the new drives. I'm assuming that zfs send | zfs recv will correctly reassemble fragmented files into contiguous records? You can see from my recent update bump that fragmentation reported by "zpool list" went from 67% on the old array, to 1% on the new. This was before I used rsync to force all existing files to use the new recordsize.

This aligns with my hypothesis above that it is the reduced compression flexibility with ashift=12 that you are witnessing. Larger and larger record sizes will ameliorate the condition (so long as files are large enough to use them) but with a commensurate increase in latency for small operations.

I'm wondering whether I should just move to 32k (or higher) from the start, since the InnoDB files seem to be highly compressible. Much of the access pattern on this server is random reads.
 
That would make sense, except that this data had already been rewritten only a couple of days ago when I received the new drives. I'm assuming that zfs send | zfs recv will correctly reassemble fragmented files into contiguous records? You can see from my recent update bump that fragmentation reported by "zpool list" went from 67% on the old array, to 1% on the new. This was before I used rsync to force all existing files to use the new recordsize.



I'm wondering whether I should just move to 32k (or higher) from the start, since the InnoDB files seem to be highly compressible. Much of the access pattern on this server is random reads.

Alas, no. The fragmentation value refers to the free space in the pool. A send/recv (especially with intermediate snapshots) plays out transactions over time, so it will create files with fragmentation. I’m not sure whether a full send (no intermediate snapshots) does anything differently, but that would be an interesting comparison.
 
Alas, no. The fragmentation value refers to the free space in the pool. A send/recv (especially with intermediate snapshots) plays out transactions over time, so it will create files with fragmentation. I’m not sure whether a full send (no intermediate snapshots) does anything differently, but that would be an interesting comparison.

It was a send of a full snapshot. Something like:

Code:
zfs snapshot -r db@MOVING
zfs send -R db@MOVING | zfs receive -dF dbnew

The comment in the (currently) only answer to a question at Stack Overflow suggests that a clean send (rather than multiple incremental updates) should effectively recreate files as contiguous sets of records, but my experience with zfs send versus rsync makes me wonder.

I was thinking that it may be worth periodically refreshing a regularly updated database file stored on an HDD, to improve sequential access when a full table scan is required and to bring a possibly spread-out file back into the same area on disk. However, since MySQL has its own internal block and indexing system, the file is likely fragmented at more than one level anyway. 24 hours ago I added a fast NVMe SSD as L2ARC, and now that it's warmed up, HDD read levels have dropped a fair bit, so the concern about fragmentation may be moot (now I just have to worry about SSD endurance :) ).

Still, it would be handy if ZFS had defragmenting functionality built into it...
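
(For anyone curious, adding the L2ARC device was a one-liner; the device name below is a placeholder for my NVMe partition:)

Code:
# attach an NVMe partition as a cache (L2ARC) device and watch it fill
zpool add db cache nvd0p1
zpool iostat -v db 5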
 
If you had any earlier snapshots, zfs send -R will send them, in creation order, up to the named snapshot.

So if @MOVING was the only snapshot when you performed the send, then perhaps it could re-layout the data. Again, it would be an interesting thing to see; I don’t rightly know what it does in that case; it seems plausible it could send a nice contiguous stream.

In the long term, however, having extra caching like you’ve got now and letting the ARC and L2ARC do their job will serve you better than worrying about defragmenting your data. If performance is paramount, move to all solid state. 😉
 
The numbers suggest that about one sixth of the array (at the current 53% capacity) has apparently disappeared into thin air. How far off do they need to be before we consider there may be a problem? :)



Code:
# zpool list
NAME                 SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
dbnew               2.72T  1.45T  1.27T        -         -     1%    53%  1.00x  ONLINE  -
db                  1.81T  1.04T   790G        -         -    67%    57%  1.00x  ONLINE  -

# zfs list -d2 -o space
NAME                                 AVAIL   USED  USEDSNAP  USEDDS  USEDREFRESERV  USEDCHILD
dbnew                                1.18T  1.45T     45.0M   1.45T              0       402M
db                                    732G  1.04T        1K   1.04T              0      1.52G

The AVAIL (zfs list) and FREE (zpool list) values differ a little, but not by such a large margin.
I don't know, I have never had this issue.
If you sent the data, the two datasets should be bit-for-bit the same. There's no reason to believe that one would get larger.
Compare your zpools' properties - maybe there is something funky there? Are the pools of the same version?

Also, could it be that some of the datasets have the "copies" property set to >1?

Or maybe you have some child dataset that you did not notice?
I don't know otherwise.
 
I found this article, which provides some additional insight into the issues discussed in this thread, particularly where the ZFS record size is close to the disk sector size: https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSRecordsizeAndCompression?showcomments . It also mentions that compression on relatively small amounts of data may be poor.

Related to my earlier experiment where I forced ashift=9 (from default of 12) on a backup set to improve compression - I'm going to try setting a record size of 1MB, up from the default of 128k, to see how compression changes. Note that a larger record size will incur a performance penalty when doing partial updates. In my case, the backup set is getting quite full, so I'm prepared to accept that potential hit in performance.
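
The property changes themselves are trivial; the important bit (as noted above) is that they only apply to newly written data, so the backup set has to be repopulated afterwards:

Code:
zfs set recordsize=1M rddbak
zfs set compression=gzip rddbak
# existing files keep their old records until they are rewritten, hence the full re-copy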
 
Here's the results of changing:

1. Recordsize from 128k to 1M
2. Compression from lz4 to gzip (default compression ratio)

Before:

Code:
NAME     SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
rddbak  3.62T  3.16T   473G         -    51%    87%  1.03x  ONLINE  -

NAME    PROPERTY              VALUE                  SOURCE
rddbak  used                  3.27T                  -
rddbak  available             358G                   -
rddbak  referenced            3.26T                  -
rddbak  compressratio         1.16x                  -
rddbak  recordsize             128K                  default

Pool recreated (with identical commands, using zpool history), then populated with the exact same data (using rsync to copy a snapshot):

Code:
NAME     SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
rddbak  3.62T  2.99T   651G         -     8%    82%  1.03x  ONLINE  -

NAME    PROPERTY              VALUE                  SOURCE
rddbak  used                  3.08T                  -
rddbak  available             532G                   -
rddbak  referenced            3.08T                  -
rddbak  compressratio         1.24x                  -
rddbak  recordsize            1M                     local

These two changes clawed back a couple hundred more gigs. Combined with ashift=9 (from default of 12, which saves about 84GB on this set) it's an overall improvement of about 6% by space.

Again, I should stress that there will be a performance penalty by making these changes; it's a temporary compromise to avoid having to replace the drives.
 