ZFS size reporting cache coherency

Hi. On both FreeBSD 13.1-RELEASE and 14.0-RELEASE, if I write files to a ZFS file system and then run du or ls -l, the wrong size is reported for several seconds; it is reported as far too small. In my example, it reports only 1 block at first, then finally reports the size as 11 blocks. This seems to be some kind of cache coherency issue.

I am developing some scripts that require getting and recording a directory size after writing files to it, but this creates a significant obstacle. Am I overlooking something? Perhaps a ZFS tuning parameter to correct it? If I run the test in a UFS file system the problem does not exist. It immediately reports the correct size. I have included a short script that reproduces the problem.

sh:
#!/bin/sh

# Cleanup before starting the test
rm -rf testdir

# Make a test directory, and create a test file in it to use some space.
mkdir -p testdir
echo Creating testdir/testfile
dd if=/dev/random of=testdir/testfile bs=1k count=10 2> /dev/null
echo
ls -l testdir

du -s testdir   # Reports wrong value (way too small)
# After sleeping 5 seconds, it reports the correct size.
# It sometimes reports the correct value after 3 seconds but is not
# consistent until after 5 seconds.
sleep 5
du -s testdir
 
Thanks NapoleonWils0n for your quick response. However, that only reports the usage of the entire pool; I need it for a single directory tree. Also, zpool list shows it in GB, which, for small amounts of data, does not change. I need it at least down to kilobytes.

I accidentally hit the post button on the last reply and I see no way to delete it.
 
I would expect differences between du and ls.
ZFS is a transactional filesystem, and the actual disk usage might not even be known before the transaction group gets closed (txg timeout) - consider copies, compression, deduplication, raidz, ...
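For example, any of these dataset properties can change what du eventually reports (the pool/dataset name here is only a placeholder):
Code:
zfs get compression,copies,dedup,recordsize pool/dataset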
 
I came across this post on the forum:

Code:
zpool get -p -H size | awk 'size=$3/1024^2 {print $1,$2,size "MB"}'
That is a useful command; however, as I mentioned, I need to know how much space a single directory tree is consuming, like what du reports, not the size of an entire pool.
 
I accidentally hit the post button on the last reply and I see no way to delete it.
I do Ctrl-A, Ctrl-X, then add a "." as the post.
 
There might be a difference when I run this in my raid/dedup pools - there usually is.
Currently I'm sitting on a train with only a gadget, so I can't easily access my sites.
You could try reducing the txg timeout to 1 second, but there is still no guarantee that it gets serviced under high load.
 
The default txg timeout is 5 seconds, so I think that is what you are seeing. And as you are using /dev/random in your test, there is no compression gain - but I don't think that is what you want to use in production...
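To see the compression effect, you could extend the test with some compressible data, e.g. (reusing the testdir from the reproduction script; the difference assumes compression is enabled on the dataset):
Code:
dd if=/dev/zero of=testdir/zeros bs=1k count=10 2> /dev/null       # highly compressible
dd if=/dev/random of=testdir/randfile bs=1k count=10 2> /dev/null  # incompressible
sleep 5
du -k testdir/zeros testdir/randfile   # with compression=on, the zeros use far less space on disk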
 
Yeah, you appear to be right. Five seconds is what I am seeing as the point at which it consistently reports the correct disk usage. I have not found how to change the txg timeout. I tried
zfs set txg.timeout=2 data
for testing, but it says
cannot set property for 'data': invalid property 'txg.timeout'

It looks like I have a workaround for my needs, though. I will either run my scripts in a UFS file system or mount a tmpfs file system to run them in. I guess this is one of those things I just cannot do inside of a ZFS file system.

Either way, thanks, everybody, for your responses.
 
Yes, it is a kernel parameter:
sysctl vfs.zfs.txg.timeout
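A quick sketch of reading and lowering it at runtime (assuming the sysctl is writable on your release; the value of 1 second is only an illustration, and it affects the whole system):
Code:
# show the current transaction group timeout in seconds (default 5)
sysctl vfs.zfs.txg.timeout
# lower it for this boot; add vfs.zfs.txg.timeout=1 to /etc/sysctl.conf to persist it
sysctl vfs.zfs.txg.timeout=1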

But this parameter doesn't offer a correct solution to the problem, as it is only a timeout. That means that while the file write syscall happens, ZFS puts the data into the cache only. Then, after the txg timeout, it starts to place it onto disk - but we do not know how long that will take; on a heavily loaded system this may be quite some time.
Using O_SYNC doesn't help us either, because that does not change the processing; it only copies the cache content into the ZIL immediately, from where it can be recovered after a crash.

For now I fear there is no correct solution to the problem. The ZFS cache does not need to know the disk space accounting, and apparently it is not computed until the actual disk write happens.
 
Another thing to note is that compression does not happen until the actual write, so there is simply no way of knowing beforehand how much space will be taken on disk.
 
I am developing some scripts that require getting and recording a directory size after writing files to it,

If you explain to us WHY you need the on-disk size, we might be able to help you.

The whole concept of "how much space does a file / directory tree consume" is fraught with difficulty. Consider compression, write-behind, dedup, metadata overhead, lazy deletion, transaction overhead, log cleaning, all that stuff. In the case of a file system with built-in redundancy (like ZFS), add copies, RAID, snapshots. My suspicion is that you're trying to verify that the data has really been written to disk. There might be better ways to do that; and at some level, you need to have some trust in the system.
 
If you explain to us WHY you need the on-disk size, we might be able to help you.
OK. I will try to elaborate further.

We have scripts that do in-house packaging of various software. They copy files to a staging area, tar them up, and record various information about the package, such as the unpacked size, which they get by using du on the staging directory prior to tarring and compressing them. So we do not need to know how much space it actually consumes on the file system. We need to know how much data there is.
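Roughly along these lines - a simplified sketch, where the paths and names are placeholders rather than our actual scripts:
Code:
#!/bin/sh
STAGEDIR=/tmp/stage/mypkg                              # hypothetical staging area
UNPACKED_KB=$(du -sk "$STAGEDIR" | awk '{print $1}')   # record the unpacked size in KB
echo "unpacked-size-kb: $UNPACKED_KB" > mypkg.meta
tar -czf mypkg.tgz -C "$STAGEDIR" .                    # then tar and compress the staged tree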

We have been using these scripts and variations of them for years across various Linux systems and BSD systems using UFS, and have never had a problem extrapolating the correct sizes immediately after being written, whether they are still cached or not. I am not an expert on ZFS, obviously, but this issue caught us by surprise because I have always expected and experienced cache coherence on any Unix-based file system we have used, meaning that once the writing process closes the files, any process that reads or stats them should see the correct sizes regardless of whether they have been physically written to disk or not. IMHO, whether the FS does compression, mirroring, or whatever should be transparent to the userland processes. If I do a du, for example, I expect to know how big a file is, not how much space it consumes compressed on the underlying file system.

From my tests, ZFS does report this correctly, but only after the up to 5 second delay.

I really like ZFS and use it a lot on FreeBSD but this is a strange idiosyncrasy that appears like either a bug, or an architectural design flaw. I do not think it should report wrong information until after it has completed the transactions and physically written the data several seconds later.

As a follow-up: the individual file sizes are reported correctly immediately. It is only the directory sizes that are not correct until the data is written to disk.
 
As a follow-up: the individual file sizes are reported correctly immediately. It is only the directory sizes that are not correct until the data is written to disk.
Ah, now things start to make sense.

Let's have some fun. Try UFS:
Code:
root:/ # sh /media/admin/testscript
Creating testdir/testfile

total 12
-rw-r--r--  1 root  wheel  10240 Mar 21 02:47 testfile
16      testdir
16      testdir

Now try ZFS:
Code:
root:/media/admin # sh /media/admin/testscript
Creating testdir/testfile

total 1
-rw-r--r--  1 root  staff  10240 Mar 21 02:48 testfile
1       testdir
13      testdir
root:/media/admin # cd /var/sysup/
root:/var/sysup # sh /media/admin/testscript
Creating testdir/testfile

total 1
-rw-r--r--  1 root  wheel  10240 Mar 21 02:49 testfile
1       testdir
17      testdir

It's not only delayed, it's something different everywhere.

So we do not need to know how much space it actually consumes on the file system. We need to know how much data there is.
But du is short for "disk usage". It doesn't tell you how much data is there; it tells you how much space is actually consumed.
 
Thank you for your detailed message!

... such as the unpacked size, which they get by using du on the staging directory ...
Makes perfect sense ... except ...

If I do a du, for example, I expect to know how big a file is, ...
In traditional file systems, that expectation has worked most of the time. For sparse files (where not all of the file is written to disk), the answer from du versus stat or ls will differ, but sparse files are somewhat rare in the real world. At least for files it works; the size of a directory object is not well defined, but directories are typically a small fraction of space usage.
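If you want to see that sparse-file difference concretely, something like this will do it (truncate(1) creates a file with a large logical size and no allocated data blocks):
Code:
truncate -s 10M sparsefile   # logical size 10 MiB, no data written
ls -l sparsefile             # reports 10485760 bytes
du -k sparsefile             # reports only the few KB actually allocated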

But even if it has traditionally worked, it is still philosophically wrong. The purpose of du is to see how much disk space is in use, and that number makes sense as part of an overall space management strategy. Look at it this way: If you know your disk has 1,000,000 blocks available, and you run du on all the things you are currently storing and find they use 950,000 blocks, then you know that you could write stuff that uses another 50,000 blocks (after all the compression/dedup/replication/metadata overhead). For this to work even halfway reasonably, du has to report the actual space used, not the logical size.

In your case, you want to know how many bytes the file will occupy in a tar. With the exception of sparse files (and I don't know how tar handles them; it's probably option- and version-dependent), that's really the logical size of the file. May I suggest something that is too much work: stop using du, and instead find the sizes of all files involved (with a recursive set of stat() calls), and add the numbers? I think that will give you a more accurate answer.
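On FreeBSD that could look something like this - a sketch that sums the apparent sizes of the regular files, with testdir standing in for your staging directory:
Code:
find testdir -type f -exec stat -f %z {} + | awk '{total += $1} END {print total " bytes"}'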

Or, if that's too much work (I fully understand!) just wait 5 seconds and then call du.

... but this is a strange idiosyncrasy that appears like either a bug, or an architectural design flaw. I do not think it should report wrong information until after it has completed the transactions and physically written the data several seconds later.
I agree, it's at least an idiosyncrasy, and perhaps one should call it a bug. ZFS does not report the space used in its log structures, because those logs can be cleaned, hardened, and garbage collected. But it should estimate the space that things currently in the log will eventually use; unfortunately, that's pretty hard to estimate without completely working through the logs.
 
For this to work even halfway reasonably, du has to report the actual space used, not the logical size.

May I suggest something that is too much work: stop using du, and instead find the sizes of all files involved (with a recursive set of stat() calls), and add the numbers? I think that will give you a more accurate answer.
Thanks for your suggestions.

Further testing on FreeBSD 14.0-RELEASE confirms you guys are right about du showing the actual space consumed, not the combined space based on the file sizes. My original tests were on my FreeBSD 13.1-RELEASE workstation, which does not have compression turned on for ZFS, so after the delay it was reporting what I expected, just as on UFS. Apparently compression is turned on by default starting in FreeBSD 14, which I just confirmed with the "zfs get compression" command. So even after waiting the 5 seconds, I will not be able to rely on the total sizes reported by du or ls -l.

To do this on ZFS, then, it looks like I'm going to have to take your suggestion of doing a stat of all the files in the tree and adding them up. I actually thought about that possibility the other day but avoided it initially, hoping I could keep things simple and fast with a plain du like I have been doing. You just reaffirmed that that is what I will probably have to do. :)
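For reference, the check is just this (zroot is the installer's default pool name; yours may differ):
Code:
zfs get compression zroot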
 
du(1) has the -A option ("apparent size"). Maybe that's what you want?
Yes!! I had not discovered the -A option to du. That looks like exactly what I want. It seems to report the correct data usage I am looking for, with and without compression turned on, and without waiting for the txg timeout.
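For anyone who finds this thread later, the comparison against the test script above would be along these lines:
Code:
du -A -s -k testdir   # apparent size in kilobytes, correct immediately
du -s -k testdir      # allocated size; only settles after the txg commit and reflects compression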

Since the apparent size is available from ZFS, I guess I take back what I said about it appearing to be a ZFS bug :).

Thank you very much. You just saved me some work.
 