ZFS: Huge difference in space allocated between a UFS drive and a 2-drive ZFS striped pool

I have a UFS-formatted drive which, according to df -h, has around 739 GB in use.

I created a 2-drive ZFS striped pool using zpool create my-pool /dev/ada2.nop /dev/ada3.nop. Before creating the pool, I ran gnop create -S 4096 /dev/adaX on each drive.
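
(For reference, a minimal way to confirm that the pool actually picked up the 4K sector size; my-pool is just the pool name from above:)
zdb -C my-pool | grep ashift    # a 4K-sector pool should report ashift: 12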

My issue is that when I copied the contents of the UFS drive to the ZFS pool (via rsync -av --delete /src /dst), the disk space used on the pool is a LOT more than on the source. According to df -h, the pool is already 890GB full and rsync has not finished copying. A difference of 150+ GB seems like a lot, even allowing for sector size and other overhead.
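
(A minimal sketch of commands that could break down where the space is going; the dataset and mount point names are assumptions:)
zpool list my-pool             # allocated vs. free space at the pool level
zfs list -o space my-pool      # usage broken down per dataset
du -sh /my-pool/dst            # allocated size of the copied tree
du -sAh /my-pool/dst           # apparent (logical) size for comparison (FreeBSD du's -A)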

Can someone please explain the space used difference?

Thanks a lot :)
 
Some of the extra space used may be due to hard links that are copied as separate files; rsync needs an additional flag (-H) to preserve those links. That is probably not a significant expansion, but it could be.
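
For example, something along these lines (with the paths from your original command) should carry the hard links across:
rsync -avH --delete /src /dst    # -H / --hard-links preserves hard links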

Using 4K blocks is also going to add overhead for small files. I think ZFS has block sub-allocation, but it is still not as efficient for small files as using smaller blocks.
 
I just checked, and all the space is consumed by a single folder. The source folder uses only 500+ GB, while on the destination it uses 890 GB! :O

Would it be better to skip gnop and just create the zpool directly? Most of the files in this folder are videos of 100+ MB each.
 
Block size should not make much difference with files that large, but it's worth trying. Incidentally, there is no need to create a gnop(8) device for each drive. ZFS will use the largest block size of any device in a pool for that pool's block size. So only one is needed.

For rsync options, consider -axHAXS, or even -axHAXS --delete --fileflags --force-change.
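
Spelled out (with the same source and destination paths as before), that would be something like:
rsync -axHAXS --delete /src /dst
rsync -axHAXS --delete --fileflags --force-change /src /dst    # if your rsync build includes the fileflags patch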
 
Have you considered using LZ4 compression, if you aren't storing already compressed data? It works on the fly, generally has a negligible effect on performance, and can even improve performance in some cases...
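
If you want to try it, something like this should do (dataset name assumed; it only affects data written afterwards):
zfs set compression=lz4 my-pool
zfs get compression,compressratio my-pool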
 
I don't think it will solve my problem, though. As you already said, the data is already compressed.

The folder I'm trying to copy is only <550 GB (according to du -h -d 1), and it won't even fit on a 2x500GB striped array. :/
 
Well, the only other thing I think you can try is to create the pool without using gnop. This will hurt performance, but you will get the optimal (for ZFS) space utilization.
 
I already tried that and got the same results. I also tried a simple geom-based striped array; same results again, and even less maximum space available.

1. 2x500GB ZFS stripe: 899GB
2. 2x500GB geom-stripe: 810GB
 
Oh, given that you are not using a redundant RAID level, you could consider turning off checksums. I don't think they account for much space, but you would save a little. Of course, you won't know if any on-disk corruption occurs; however, with RAID0 ZFS isn't capable of fixing it anyway, so it's no worse than UFS or any other "normal" file system.
 
So the geom-stripe has less available space, but can you rsync the data to it? Does it fit, I mean? The geom-stripe is UFS, same as the source, I assume?
If you cannot fit the data on the geom-stripe, then the issue isn't ZFS; maybe you have some hard or soft links that aren't being handled properly by rsync?
 
I honestly don't think the 500+ GB folder (which contains subfolders) will fit. I copied 2 subfolders as a test (it takes several hours to copy all that data); according to du, they occupy only 57 GB on the standalone HDD, but close to 70 GB (checked using df -h) on the geom-based stripe.
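
(For what it's worth, a sketch of measuring both copies with the same tool; the paths are placeholders:)
du -sh /source/testdirs     # allocated size on the standalone HDD
du -sh /stripe/testdirs     # allocated size on the geom stripe, same metric as above
du -sAh /stripe/testdirs    # apparent (logical) size, to separate allocation overhead from data size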
 
you could consider turning off checksums. I don't think they account for much space, but you would save a little
Please do not do that. Checksums are part of the metadata, so I'm fairly sure there would be no space savings at all, but it would make unreported data loss much more likely.
 
rsync does not handle sparse files efficiently without the --sparse (-S) flag. So if you have sparse files and have neither set this flag nor enabled compression, that would explain your result...
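
For example (with the same paths as earlier in the thread):
rsync -avS --delete /src /dst    # -S / --sparse recreates sparse files sparsely on the destination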
 
The gain of 150+ GB sounds a lot like what I observed when I restored my home directory, which had been on a zpool with 512-byte-sector drives, onto one with 4K drives. I'm not sure whether df would know that I had dabbled with dedup on the old zpool, but it doesn't know that I'm using compression.

I don't think ZFS does sub-allocation within sectors; rather, instead of the filesystem's block size (recordsize, default 128K) being fixed, it adjusts down toward the sector size (512 bytes or 4K are the common ones; I think 8K may be the maximum? I don't know what my SSDs use for page size, but it's probably not a big issue if it is 8K).
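
(A quick way to look at this on an existing dataset; the dataset name is an assumption:)
zfs get recordsize,used,referenced,compressratio my-pool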

The only reason people think of turning off checksums is speed, though the difference between fletcher2/4 and none is mainly something only a benchmark tool would detect. (I've seen the recommendation to turn them off for the swap volume, but I once had a pair of flaky drives, and my thinking in putting swap on ZFS was to catch those errors... I eventually corrupted my zpool beyond recovery, and it turned out the real culprit was a bad DIMM.)

Using sha256 might be noticeable, but that's only the tip of the iceberg of what made dedup so painful. Dedup requires sha256 (and overrides the checksum property), as it needs a stronger checksum to decide whether a block is a dupe. There is a verify option that does an additional byte-for-byte comparison in case two blocks have the same signature; I can't imagine what that would be like. But at the time I knew I would have lots of dupes, as I was recovering from Windoze... I only had an image of a corrupt disk and a couple of bad backups to recover all my files from, onto a zpool using the same drives I had been using. I had been using fakeraid before, and wondered how it decides which side is correct when recovering... which it needed to do after every blue screen.

TheDreamer
 