Solved: Strange issue with folder size on ZFS

I must destroy and recreate a ZFS pool. Please don't ask me why; it's a long story.

The pool contains just one volume, named "depo".

Of course, I had to back up the whole volume. Since ZFS replication won't work (long story...), I decided to mount two NTFS-formatted drives (ada0 and da0) and copy the ZFS volume to each of them separately.
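For what it's worth, mounting the NTFS drives went roughly like this, via ntfs-3g from the fusefs-ntfs package (the partition names here are placeholders, not necessarily my exact ones):
Code:
# kldload fusefs
# ntfs-3g /dev/ada0s1 /mnt
# ntfs-3g /dev/da0s1 /media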

Code:
# df -h
Filesystem            Size    Used   Avail Capacity  Mounted on
...
tank                  3,4T    117K    3,4T     0%    /tank
tank/depo             3,5T    112G    3,4T     3%    /tank/depo
/dev/fuse             900G    109G    790G    12%    /mnt
/dev/fuse             298G    277G     21G    93%    /media

Please note that df was run after the copying had completed, and that the original volume (tank/depo) is 112G in size.

The copies themselves were made with these commands, run one after the other:
Code:
# cp -Rv /tank/depo /mnt/
# cp -Rv /tank/depo /media/

When it was done, just to make sure I wouldn't lose any data, I compared the folder sizes:
Code:
# du -sh /mnt/depo
109G    /mnt/depo
# du -sh /media/depo
109G    /media/depo
# du -sh /tank/depo
111G    /tank/depo

So why are the two copies the same size while the source folder is bigger? Does it have to do with the internal workings of ZFS? Could it be related to the 250+ snapshots of said volume?

Is there anything I need to worry about here? Or can I go ahead and destroy the pool?
 
I would guess yes, it may be related to the snapshots. Sometimes it can be confusing as to "where" snapshot data is accounted for. If you have enough physical space, why not copy the data to a new dataset, delete the original /tank/depo, and rename the new dataset to /tank/depo? Destroying a dataset that still has snapshots may give errors indicating you need to delete the snapshots first.
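A minimal sketch of that shuffle, assuming the new dataset is called tank/depo2 (the name is just a placeholder; zfs destroy -r removes the dataset together with its snapshots):
Code:
# zfs create tank/depo2
# cp -Rv /tank/depo/ /tank/depo2/
# zfs destroy -r tank/depo
# zfs rename tank/depo2 tank/depo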
 
I would guess yes, it may be related to the snapshots. Sometimes it can be confusing as to "where" snapshot data is accounted for. If you have enough physical space, why not copy the data to a new dataset, delete the original /tank/depo, and rename the new dataset to /tank/depo? Destroying a dataset that still has snapshots may give errors indicating you need to delete the snapshots first.
To cut a very long story short: I replaced a disk and the resilvering restarts on every reboot. The pool has to go; there is no other way. Please don't make me tell the whole story; it spans 4 months.

I just have to be absolutely sure that I will be able to recover all the files from backups after I recreate the pool.
 
To cut a very long story short: I replaced a disk and the resilvering restarts on every reboot. The pool has to go; there is no other way. Please don't make me tell the whole story; it spans 4 months.
OK, resilvering implies redundancy of some sort. What is the pool: a mirror, RAID-Z-something? If it's a mirror, I think one of the pair should be complete. If the resilver always happens on the same device (say, a mirror of ada0 and ada1, and ada1 always resilvers), remove that drive (power/data cables) and then copy from the degraded mirror.
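If pulling cables feels risky, taking the device offline in software amounts to the same thing; a rough sketch using the example names from above (the pool and device labels would need to match what zpool status actually reports):
Code:
# zpool offline tank ada1
# zpool status tank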
 
OK, resilvering implies redundancy of some sort. What is the pool: a mirror, RAID-Z-something? If it's a mirror, I think one of the pair should be complete. If the resilver always happens on the same device (say, a mirror of ada0 and ada1, and ada1 always resilvers), remove that drive (power/data cables) and then copy from the degraded mirror.
It's a RAID-Z vdev. Yes, the resilver always happens on the same device (ada3). I don't understand how copying from the degraded mirror would make any difference.

I just need to know whether there is anything about this situation I should be worried about. I cannot afford to lose data.
 
I assume building a new pool on new disks and restoring the backup is not an option?

Regarding the size: there have been threads here recently comparing the output of df with that of the ZFS tools. Of course, that requires a pool which is still available.

Regarding preserving the state of the disks which hold the original pool: it might be safer to make 1:1 copies with dd(1) and work on those instead of the original disks. This is of course a matter of resources. But if something unlucky happens, then only the copies are gone, and they can be recreated.
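A minimal sketch of such a 1:1 copy (the device names are placeholders; the target disk must be at least as large as the source, and anything on it gets overwritten):
Code:
# dd if=/dev/ada3 of=/dev/da1 bs=1m conv=noerror,sync status=progress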
 
I assume building a new pool on new disks and restoring the backup is not an option?
Not anymore...

So why is the ZFS volume size different? Can I safely assume that all the files have been copied to the backup disks?
 
Can I safely assume that all the files have been copied to the backup disks?
So you can access the pool? In that case you can use mtree(8) to record, for example, the MD5 checksum of every file in the pool. This generates a specification file. Then you can run mtree(8) with different parameters to compare the checksums of the backup data against the data from the pool.
 
du measures the disk space usage reported by the file system, in units of 512-byte blocks. But space allocation in file systems is not necessarily done in 512-byte units. There are lots of interesting complexities in there. For example, some file systems do not even store tiny files (hundreds to low thousands of bytes) in blocks of their own, but squirrel them away in the metadata, so such files use zero blocks. Some file systems only allocate space in units larger than 512 bytes (4K is not uncommon). For sparse files, the granularity of the sparseness boundaries can differ. For these reasons, the results of du are not exactly comparable between different file systems.
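One way to see this is du's -A flag, which reports the apparent size instead of the allocated blocks; the apparent sizes should agree across the file systems even when the allocated sizes do not:
Code:
# du -Ash /tank/depo
# du -Ash /mnt/depo
# du -Ash /media/depo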
 
So you can access the pool? In that case you can use mtree(8) to record, for example, the MD5 checksum of every file in the pool. This generates a specification file. Then you can run mtree(8) with different parameters to compare the checksums of the backup data against the data from the pool.

I wouldn't know how to do that, and I cannot help wondering how mtree(8) could report a match when the sizes don't agree. But it is still worth trying; better safe than sorry, right?

chrbr, could you give an example of how to use mtree(8)?
 
Please find an example below. First I create a directory with just three files. The content of each file is simply a string with a number appended. The result is as below.
Code:
chris@thinkpad ~/tmp> ls a/
file1   file2   file3
chris@thinkpad ~/tmp> cat a/file1 a/file2 a/file3
nummer1
nummer2
nummer3
Then I create another directory and copy file1 and file3 only. Additionally I change the content of file3. The result is as below.
Code:
chris@thinkpad ~/tmp> ls b/
file1   file3
chris@thinkpad ~/tmp> cat b/file1  b/file3
nummer1
nummer4
Now I have two directories to compare. First I generate the specification file for directory a using the md5 keyword. For a quick test one could use size instead. The procedure in the example generates a.spec as below.
Code:
chris@thinkpad ~/tmp> mtree -c -k md5 -p ~/tmp/a/ > ./a.spec
chris@thinkpad ~/tmp> cat a.spec
#          user: chris
#       machine: thinkpad
#          tree: /usr/home/chris/tmp/a
#          date: Thu Oct 19 11:15:38 2023

# .
/set type=file
.               type=dir
    file1       md5digest=53b31075b6ccd7abf84c450ef965a87f
    file2       md5digest=9f9b93d39c0132bfc65d1842a10e2748
    file3       md5digest=a57e9a18f09d4a71fb39ad5aac3203b1
Now I can check the content of directory b against the specification of directory a.
Code:
chris@thinkpad ~/tmp> mtree -k md5 -f ~/tmp/a.spec -p ~/tmp/b/
file3:  md5digest (0xa57e9a18f09d4a71fb39ad5aac3203b1, 0x5006486f4da99da595e862172ad944a0)
./file2 missing
There is no message about file1 because it is identical to the one in the original directory. Since one file has not been copied and one has been changed, the corresponding mismatches against the spec file are reported on standard output.

mtree(8) walks the complete tree, which means that subdirectories are included as well.
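Applied to the paths in this thread, the same two steps would look roughly like this (an untested sketch; /root/depo.spec is just a placeholder location, and checksumming 112G will take a while):
Code:
# mtree -c -k md5 -p /tank/depo > /root/depo.spec
# mtree -k md5 -f /root/depo.spec -p /mnt/depo
# mtree -k md5 -f /root/depo.spec -p /media/depo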
 
It's a RAID-Z vdev. Yes, the resilver always happens on the same device (ada3). I don't understand how copying from the degraded mirror would make any difference.
Since it's not a mirror, it doesn't really matter. But my line of thinking was:
If it's a mirror and it's always the same device that resilvers, there may be physical issues with that device.
Unplug the problem device, leaving the mirror "degraded" because it has only one device left.
The data will at least be consistent until you can fix the problem device.
 