'du' behavior

Hello,

I'm experiencing questionable output from du(1).

The script below shows the effect, at least on my system:
Code:
rm -rf test
mkdir test
cp /boot/kernel/kernel test/kernel1
cp /boot/kernel/kernel test/kernel2
cp /boot/kernel/kernel test/kernel3
cp /boot/kernel/kernel test/kernel4
cp /boot/kernel/kernel test/kernel5
cp /boot/kernel/kernel test/kernel6
cp /boot/kernel/kernel test/kernel7
cp /boot/kernel/kernel test/kernel8
ls -l test
du -s test/*

and the du output is not what I expected:

Code:
total 64876
-r-xr-xr-x  1 root  wheel  21551736 Jun 28 12:31 kernel1
-r-xr-xr-x  1 root  wheel  21551736 Jun 28 12:31 kernel2
-r-xr-xr-x  1 root  wheel  21551736 Jun 28 12:31 kernel3
-r-xr-xr-x  1 root  wheel  21551736 Jun 28 12:31 kernel4
-r-xr-xr-x  1 root  wheel  21551736 Jun 28 12:31 kernel5
-r-xr-xr-x  1 root  wheel  21551736 Jun 28 12:31 kernel6
-r-xr-xr-x  1 root  wheel  21551736 Jun 28 12:31 kernel7
-r-xr-xr-x  1 root  wheel  21551736 Jun 28 12:31 kernel8
21153   test/kernel1
21153   test/kernel2
21153   test/kernel3
1417   test/kernel4
1   test/kernel5
1   test/kernel6
1   test/kernel7
1   test/kernel8

The test was run on a ZFS filesystem, if that matters.

Any comments? Bug or feature?
 
First guess would be the filesystem being full during the copy of kernel4.
 
This is on ZFS so it's very unlikely that the / dataset got full unless there are quotas in place.
 
This script might be easier:

Code:
#!/bin/sh

mkdir testing

for a in `seq 1 8`; do
        cp /boot/kernel/kernel testing/kernel$a;
done;

ls -l testing
du -s testing/*
I don't think this is a bug but a timing issue. If you issue the du -s $dir/* command after the script has finished, you'll get the normal output you'd expect. Therefore I can only conclude that while the script is running some data is still being processed (probably cache related), which would explain the difference.
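As an illustration, a minimal variation of the script with a second, delayed du pass (the sleep duration here is an arbitrary guess, not a documented value):

Code:
#!/bin/sh
# Same test as above, but run du twice: immediately and after a short delay
mkdir testing

for a in $(seq 1 8); do
        cp /boot/kernel/kernel testing/kernel$a
done

du -s testing/*   # often reports far fewer blocks than the file sizes suggest
sleep 10          # arbitrary delay, gives ZFS time to commit its transaction groups
du -s testing/*   # should now match the actual file sizes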
 
No, the filesystem is only 11% used, with 400 GB free, and there are no quotas.

And yes, it seems the result depends on the fact that the write operations are queued and the blocks aren't actually written at the time the du command is run.

Waiting a couple of seconds and repeating the du -s test/* command leads to the expected result.
That's why I wrote 'bug or feature?'.

But honestly, to me it looks more like a bug.
 
Is there compression involved? As far as I know, ZFS writes in a "lazy" fashion when compression is used, and it can take a few seconds until the file properties reported by the stat(2) syscall match what is actually on disk.
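If it helps to rule that out, the compression property can be checked per dataset (the dataset name below is just an example):

Code:
# Show whether compression is enabled on the dataset holding the test directory
zfs get compression zroot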
 
I haven't used ZFS in a while, but this could be a feature of it that you just have to live with. ZFS pushes many boundaries already and exposes problems in UNIX APIs; this could be one of them. One famous example is running df(1) on a system that uses ZFS: the free space numbers you get don't add up in any way you would expect, because the free space reported for each dataset is actually the free space of the whole pool.
 
One famous example is running df(1) on a system that uses ZFS: the free space numbers you get don't add up in any way you would expect, because the free space reported for each dataset is actually the free space of the whole pool.
Not always true:

Code:
breve:/home/peter $ df -lh |grep zroot
zroot                            8.4G    2.8G    5.7G    33%    /
zroot/home                       6.9G    1.2G    5.7G    18%    /home
zroot/src                        6.6G    903M    5.7G    13%    /usr/src
zroot/var                        6.9G     30M    6.9G     0%    /var
zroot/var/db                     7.0G    151M    6.9G     2%    /var/db

Here you can see that df fully honors my reservation setting for the zroot/var filesystem:

Code:
breve:/home/peter $ zfs get reservation zroot/var
NAME       PROPERTY     VALUE   SOURCE
zroot/var  reservation  2.44G   received

As such, the behavior of df is fully consistent here, in my opinion.
 
Yes, but you have to set those reservations (or quotas, for that matter) yourself; I was referring to the default behavior.
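For reference, setting them is a one-liner per dataset (names and sizes below are examples, not taken from this system):

Code:
zfs set reservation=2G zroot/var               # guarantee 2 GB to zroot/var
zfs set quota=10G zroot/home                   # cap zroot/home at 10 GB
zfs get reservation,quota zroot/var zroot/home # verify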
 
Of course, I can live with any behavior once I know about it.

And while the df(1) output is logical in the ZFS context, I cannot say the same for du(1).

Regarding the stat(2) syscall, it is worth noting that the ls(1) output is immediately and always correct; therefore I guess that du(1) doesn't rely on the same stat() field as ls(1): ls -l reports the file size (st_size), while du reports the allocated blocks (st_blocks), and it is the block count that lags behind on ZFS.
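A quick way to look at both fields side by side is stat(1); this is just an illustrative check against one of the copies from the test above:

Code:
# st_size (%z, in bytes) vs. allocated blocks (%b, 512-byte blocks), both from stat(2)
stat -f "size=%z blocks=%b %N" test/kernel1
# Right after the copy, %z already shows the full size while %b may still be tiny.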
 
It's a feature.

ZFS writes data out in chunks, known as transaction groups. By default, this writing happens every 5 seconds. Thus, depending on how much data is being written out, you may need to wait 6 seconds in order to make sure everything has been written to the pool.

The default used to be 30 seconds. :)
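If I'm not mistaken, the interval is exposed as a sysctl on FreeBSD, so it can be checked (and tuned, although the default is usually fine):

Code:
# Transaction group commit interval, in seconds (default 5)
sysctl vfs.zfs.txg.timeout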
 
The fact that writes are delayed is true for practically every filesystem out there, but that, in my opinion, should not affect the report of 'used blocks'.

As an example in the same domain, an application can read back a block from a newly written file, and whether that block has actually reached the disk yet is irrelevant. Or should I suppose this is also not true for ZFS?
 
I made an additional test using du -A test/*; the -A flag is intended to display the apparent size of a file, which is mainly useful when examining sparse files.

In the case discussed here the result should be the same with or without the -A flag, and it is NOT.

At the very least, I can now state that the du output is inconsistent with the du -A output.

NB: I'm deliberately ignoring the du implementation details.
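For comparison, on a genuinely sparse file the two outputs are supposed to differ; a small sketch (file name and size are arbitrary):

Code:
# Create a 100 MB file that allocates almost no blocks
truncate -s 100m sparsefile
du -sh sparsefile      # allocated blocks: close to zero
du -A -sh sparsefile   # apparent size: 100M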
 
I believe the word "transaction" is the important one here. If ZFS wrote the file metadata (the directory entry and so on) first and completely, so that the output of du immediately matched the size of the file once it was completely written to disk, you would end up with a serious inconsistency if the system crashed before the actual writes to the file had finished. The directory entry would claim that the file is complete, but parts of it would be missing, and I don't think ZFS (nor the standard UNIX file handling APIs) has any provision to record the fact that parts of a file are missing because of such an incident. So, in summary, ZFS writes only in a way that is atomic: transactions either succeed completely or are rolled back completely in case of failure.
 
I can agree that the transaction management in ZFS could be a possible source of the observed du behavior, but I do not agree that "it is a feature".
 
The fact that writes are delayed is true for practically every filesystem out there ... I do not agree that "it is a feature".

You can call it a "side-effect," then. It's the expected result of a deliberate design choice. Your script removes a directory, then immediately recreates that directory, then immediately fills that directory, then asks for the space consumed by the files in that directory. The shell script is executing these things before ZFS can report accurate information to the kernel, because its copy-on-write and transactional nature dictates that it does not immediately destroy and overwrite data. So it needs time to rewrite pointers, update its uberblocks and ditto blocks, eventually remove the old directory (the one you ordered deleted, which wasn't actually deleted yet), update its own accounting (such as that reported by zfs list), and only then report its true capacity and the true size of the files. If you're executing du the instant a copy operation completes, you're not going to get accurate information.

This is not the journaling or write caching many filesystems do. ZFS is fundamentally different from every other filesystem out there. Expect quirks. Learn how it works.
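As a practical workaround sketch (not something I found documented): forcing the outstanding data to the pool before measuring should stabilize the numbers, assuming sync(8) commits the pending transaction group on ZFS, which I believe it does:

Code:
cp /boot/kernel/kernel test/kernelX   # "kernelX" is just a placeholder name
sync                                  # flush dirty data; should commit the pending txg
du -s test/kernelX                    # the block count should now reflect what was written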
 
Learn how it works.

Yes, of course, but how would that fix a program, du, that at certain times returns an incorrect value and a few seconds later a correct one?

Or are you telling me that I have to insert a sleep(10) or something like that before using du? Is that what you mean?

Something is not quite right.

You can call it a "side-effect," then.

Fine; at the very least I believe it should be documented, and possibly fixed. If a fix is not possible, it would be better to make du refuse to work on ZFS. That's my view.

What I have learned so far is that I cannot rely on du, and therefore cannot use it anymore. Fine too, I will live with that.
 
Are you guys still sure this behaviour is due to ZFS "transactions"?

My first thought was ZFS deduplication, as these are duplicate files and du shows actual block usage rather than file size. I would assume the inconsistency depending on when du is called is due to the deduplication work being deferred so as not to hinder write performance.

You could test the theory by copying a set of more unique files instead.
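One way to test that theory (a sketch; the file count and sizes are arbitrary) is to use random data, which can be neither deduplicated nor compressed:

Code:
#!/bin/sh
# Copy unique, incompressible files instead of eight identical kernels
mkdir testing-unique

for a in $(seq 1 8); do
        # 20 MB of random data per file
        dd if=/dev/random of=testing-unique/file$a bs=1m count=20
done

ls -l testing-unique
du -s testing-unique/*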
 
Yes, of course, but how would that fix a program, du, that at certain times returns an incorrect value and a few seconds later a correct one?
You're leaving out a very important detail here: on ZFS. I tried it myself and it only occurs on ZFS, not on UFS. Which, as KPA above me also mentioned, isn't all that surprising, because they don't call ZFS "a next-gen filesystem" without reason.

Question being: is there really something to fix here?

I agree that it would help if this caveat were mentioned in the du manual page, but other than that I don't think there's much to fix here. And if I play the devil's advocate, I might even be able to argue that there isn't anything to fix:

du manual page said:
The du utility displays the file system block usage for each file argument and for each directory in the file hierarchy rooted in each directory argument.
If the file system isn't using any blocks at the time du is run, then how would du be able to fix this in the first place? How would that make du broken, as you hinted?

It's not du, it's the underlying filesystem. And the manual page covers this, as shown above.

My first thought was ZFS deduplication, as these are duplicate files and du shows actual block usage rather than file size.
I tested this myself on two ZFS-powered systems, one without mirroring and one with; both showed the same behaviour. So this isn't a deduplication issue but a ZFS-specific one.
 
I agree that it would help if this caveat were mentioned in the du manual page, but other than that I don't think there's much to fix here.
Thanks, that was what I was hoping to read. (And I thought that "ZFS only" was already implied.)

I can now add that I have performed a similar test on OpenIndiana, and the result is similar; it is definitely ZFS related.

I will also explain how I ran into the problem: I was comparing mkuzip and mkulzma output sizes:
Code:
mkulzma testfile
mkuzip testfile
du testfile*

Then I noticed that testfile.uzip was smaller than testfile.ulzma, which wasn't expected.
The subsequent steps allowed me to detect the erratic du behavior.

If I had inverted the two compression commands, I would not have noticed the problem, and most likely I would have come to the forum claiming a 100x improvement in lzma compression compared to zip compression. :D

Seriously, what may be worrying is that the size error can go unnoticed.

Think of using du -s on a just-created tree to allocate an md(4) device of a specific size to copy that tree into: you will find yourself in trouble without any real error in your code, and, on top of that, with an issue that is time/load dependent.
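To make that concrete, here is a hypothetical sketch (device and path names are made up) of the kind of script that would silently break when du is run too early:

Code:
#!/bin/sh
# Size a swap-backed md(4) device from a du measurement of a freshly created tree.
# If du runs before ZFS has committed its transaction groups, SIZE_KB comes out
# far too small and the copy below fails with "no space left on device".
SIZE_KB=$(du -sk tree | cut -f1)
MD=$(mdconfig -a -t swap -s ${SIZE_KB}k)
newfs /dev/${MD}
mount /dev/${MD} /mnt
cp -R tree/ /mnt/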
 
I tested this myself on two ZFS-powered systems, one without mirroring and one with; both showed the same behaviour. So this isn't a deduplication issue but a ZFS-specific one.

I thought deduplication was separate from mirroring... it's basically the concept of compression at the logical volume level, which is separate from mirroring the entire volume.

I'm asking because I'm new to ZFS, but from what I read about dedup on ZFS it sounds like a separate option; I can't find what the default is on FreeBSD.
 
Deduplication has to be explicitly turned on if you want to use it, and it has nothing to do with mirroring.
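For the record, it can be checked and enabled per dataset, and the default is off (pool/dataset names below are examples):

Code:
zfs get dedup zroot                  # shows "off" unless someone turned it on
zfs set dedup=on zroot/somedataset   # enabling it (mind the RAM cost of the dedup table)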
 