ZFS: flaky directory appears and disappears

I am new to ZFS. A directory under ZFS is behaving strangely:
Code:
[nyan@orange ~]$ ls nyan16 |wc
   23099   23099 1108752
[nyan@orange ~]$ ls nyan16 |wc
   17561   17561  842928
[nyan@orange ~]$ ls nyan16 |wc
   12413   12413  595824
[nyan@orange ~]$ ls nyan16 |wc
    7735    7735  371280
[nyan@orange ~]$ ls nyan16 |wc
ls: nyan16: No such file or directory
       0       0       0
[nyan@orange ~]$ ls nyan16 |wc
   16205   16205  777840
[nyan@orange ~]$

The files in the directory are not being altered. I am running zpool scrub to check the pool at the moment, and will reboot the system afterwards. However, this is rather concerning. What could have caused this behaviour?
 
Never seen that. Any other information? ZFS layout / properties?
Code:
[nyan@orange ~]$ zpool get all
NAME   PROPERTY                       VALUE                          SOURCE
zroot  size                           2.72T                          -
zroot  capacity                       14%                            -
zroot  altroot                        -                              default
zroot  health                         ONLINE                         -
zroot  guid                           9802395473873768945            default
zroot  version                        -                              default
zroot  bootfs                         zroot/ROOT/default             local
zroot  delegation                     on                             default
zroot  autoreplace                    off                            default
zroot  cachefile                      -                              default
zroot  failmode                       wait                           default
zroot  listsnapshots                  off                            default
zroot  autoexpand                     off                            default
zroot  dedupditto                     0                              default
zroot  dedupratio                     1.00x                          -
zroot  free                           2.31T                          -
zroot  allocated                      416G                           -
zroot  readonly                       off                            -
zroot  comment                        -                              default
zroot  expandsize                     -                              -
zroot  freeing                        0                              default
zroot  fragmentation                  2%                             -
zroot  leaked                         0                              default
zroot  feature@async_destroy          enabled                        local
zroot  feature@empty_bpobj            active                         local
zroot  feature@lz4_compress           active                         local
zroot  feature@multi_vdev_crash_dump  enabled                        local
zroot  feature@spacemap_histogram     active                         local
zroot  feature@enabled_txg            active                         local
zroot  feature@hole_birth             active                         local
zroot  feature@extensible_dataset     enabled                        local
zroot  feature@embedded_data          active                         local
zroot  feature@bookmarks              enabled                        local
zroot  feature@filesystem_limits      enabled                        local
zroot  feature@large_blocks           enabled                        local
zroot  feature@sha512                 enabled                        local
zroot  feature@skein                  enabled                        local
Is anything using this directory?
The directory is not currently in use. It was in extremely heavy use yesterday: many small log files were being untarred into it and then deleted.
Zpool status?
At this time, it looks like this:
Code:
root@orange:/home/nyan # zpool status
  pool: zroot
 state: ONLINE
  scan: scrub in progress since Tue May 30 21:37:38 2017
        187G scanned out of 383G at 38.5M/s, 1h26m to go
        0 repaired, 48.91% done
config:

   NAME        STATE     READ WRITE CKSUM
   zroot       ONLINE       0     0     0
     mirror-0  ONLINE       0     0     0
       ada0p3  ONLINE       0     0     0
       ada1p3  ONLINE       0     0     0

errors: No known data errors
 
I am new to ZFS. A directory under ZFS is behaving strangely:
I'm not sure, but I do have a theory. While ZFS performs certain operations, it is not uncommon for regular tools to display incorrect values. If I recall correctly, that's mostly caused by a small (mostly cosmetic) quirk in the way ZFS presents information during those operations (but I'd have to look this up to be fully sure).

You can sometimes see the same behavior when operations are being performed on the contents of a directory (such as adding or removing files) and you then check the used space with du.

But having said that, I never encountered something as bizarre as this: a directory vanishing outright...
 
What is the actual output of the ls? All you give us are line counts from wc. Are the names of the files in there actually valid file names? Could it be that the deletion of the "many small log files" is still running? That would explain the first few numbers you report, where the number of files is shrinking until the directory finally vanishes. It would not explain the directory springing back into existence afterwards, though.

Any error or warning messages on the console, or something in /var/log/messages?
 
What is the actual output of the ls?
Various log files, about 500 bytes to 2000 bytes long.
All you give us are line counts from wc. Are the names of the files in there actually valid file names? Could it be that the deletion of the "many small log files" is still running?
Very unlikely: the file count was going both up and down, and the deletion of the files had finished several hours earlier.
That would explain the first few numbers you report, where the number of files is shrinking until the directory finally vanishes. It would not explain the directory springing back into existence afterwards, though.
The directory was disappearing and then reappearing, and the number of files was changing randomly. The actual job which was deleting the files crashed several hours before this, for unknown reasons. I was able to restart the job without problem after rebooting.
Any error or warning messages on the console, or something in /var/log/messages?
Between logging in as root in an effort to delete the files, and shutting down the computer, there is nothing in /var/log/messages except the following lines:
Code:
May 30 21:37:38 orange ZFS: vdev state changed, pool_guid=9802395473873768945 vdev_guid=2120269406060973466
May 30 21:37:38 orange ZFS: vdev state changed, pool_guid=9802395473873768945 vdev_guid=1457454098546718061
May 31 00:28:13 orange su: nyan to root on /dev/pts/1
May 31 00:28:16 orange shutdown: reboot by nyan:
After rebooting the computer, the number of files stabilised at about 14,000, and the directory could then be removed.
 
Very strange; it makes very little sense. If the size of the directory was increasing (new directory entries appearing), and the whole directory itself was disappearing and reappearing, without anyone recreating the directory or creating new entries in it, this must be a bug in ZFS.

Obviously, during that period you didn't check the content of that directory carefully: It would be interesting to know whether the directory itself had the correct content (sensible file names), whether the files themselves were undamaged, and whether the file attributes (permissions, mtime/atime and all that) were correct. That might help with debugging; but since the problem has gone away, there may be nothing left to debug.

Just one question: Could it be that the problem wasn't the file system, but the program that was doing the deleting? Was that just a "rm" command in a shell script, or was it custom-written code? It turns out that the readdir() interface can be a little non-obvious. If someone uses telldir() and seekdir() while at the same time modifying the directory (by unlink()'ing directory entries), strange but logical things can happen, which may surprise the novice programmer. You say that the delete job crashed, which makes me suspect that it wasn't doing everything correctly. Perhaps the problem was an interaction between an inherently broken (but usually kinda-sorta-working) delete program and a rare bug in ZFS that it exposed.
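To make the pitfall concrete, here is a hypothetical Perl sketch (not the poster's script; only the directory name is borrowed from the thread). POSIX leaves it unspecified whether readdir() returns entries that are added or removed after the directory was opened, so deleting while iterating can skip or revisit entries; collecting the names first avoids that.
Code:
#!/usr/bin/perl
# Hypothetical illustration of the readdir() pitfall described above;
# this is NOT the poster's script.
use strict;
use warnings;

my $dir = "nyan16";    # directory name borrowed from the thread

# Risky pattern: unlinking entries while still walking the handle.
# On some filesystems this can skip or revisit entries.
opendir(my $dh, $dir) or die "opendir $dir: $!";
while (defined(my $entry = readdir($dh))) {
    next if $entry eq '.' || $entry eq '..';
    unlink("$dir/$entry") or warn "unlink $entry: $!\n";
}
closedir($dh);

# Safer pattern: read all the names first, then delete.
opendir($dh, $dir) or die "opendir $dir: $!";
my @names = grep { $_ ne '.' && $_ ne '..' } readdir($dh);
closedir($dh);
for my $name (@names) {
    unlink("$dir/$name") or warn "unlink $name: $!\n";
}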
 
Just one question: Could it be that the problem wasn't the file system, but the program that was doing the deleting? Was that just a "rm" command in a shell script, or was it custom-written code?
The precise code was a Perl script which first gets a listing of the tar file with tar tfz, then extracts all the files with tar xfz, then goes through the listing and moves each file with the "move" routine from the File::Copy module or, if the "move" operation fails, unlinks the extracted file using Perl's built-in "unlink", and at the end of processing removes every directory mentioned in the tar file with the shell command "rmdir $dir". This run produced errors as follows (a sketch of the script is shown after the error output):
Code:
No such file 'nyan16'.
rmdir: nyan16/: Directory not empty
rmdir nyan16/ failed.
No such file 'nyan16'.
rmdir: nyan16/: Directory not empty
rmdir nyan16/ failed.
After the reboot, the above errors did not recur.
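For reference, here is a minimal Perl sketch of the job as described above. It is a reconstruction, not the original code: the tarball name, the destination directory, and the error handling are all assumptions for illustration.
Code:
#!/usr/bin/perl
# Hypothetical reconstruction of the job described above; variable
# names and paths are made up for illustration.
use strict;
use warnings;
use File::Copy qw(move);

my $tarball = "logs.tar.gz";   # assumed name
my $destdir = "processed";     # assumed destination, assumed to exist

# 1. Get the listing of the tar file.
my @listing = `tar tfz $tarball`;
chomp @listing;

# 2. Extract everything.
system("tar", "xfz", $tarball) == 0 or die "tar xfz failed: $?";

# 3. Move each extracted file; unlink it if the move fails.
my %dirs;
for my $entry (@listing) {
    if ($entry =~ m{/$}) {     # directory entries end in a slash
        $dirs{$entry} = 1;
        next;
    }
    move($entry, "$destdir/") or do {
        warn "move $entry failed: $!";
        unlink($entry) or warn "unlink $entry failed: $!";
    };
}

# 4. Remove the directories mentioned in the tar file.
for my $dir (keys %dirs) {
    system("rmdir $dir") == 0 or warn "rmdir $dir failed.";
}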
It turns out that the readdir() interface can be a little non-obvious. If someone uses telldir() and seekdir() while at the same time modifying the directory (by unlink()'ing directory entries), strange but logical things can happen, which may surprise the novice programmer. You say that the delete job crashed, which makes me suspect that it wasn't doing everything correctly. Perhaps the problem was an interaction between an inherently broken (but usually kinda-sorta-working) delete program and a rare bug in ZFS that it exposed.
I do not have any record of what caused the crash of the job. There is no readdir, telldir or seekdir within my Perl script.
 
Given what you say about your Perl job, this smells more and more like a ZFS bug. Hoping that some ZFS developers (I'm not one) are reading along here. On the other hand, given the meagre debugging data, they'll probably not be able to do anything concrete.
 
Maybe a shot in the dark, but is "nyan16" a plain directory or a separate dataset, and is the snapdir property set to "visible"?
I've seen strange behaviour with find(1) once, when the .zfs directory is visible and snapshots are being created/destroyed while such a tool is running.

Still assuming it is a separate dataset:
Does zfs list -o name,used,lused,refer,mountpoint give reasonable values? I.e. if the directory is (or should be) empty, all values should report only a few K (example commands below).

Is this directory present on the/a parent dataset? Some tools also act flaky with overlay mounts. Unmount the nyan16 dataset and check if there is anything left in the directory.
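For example (assuming nyan16 is a dataset directly under zroot and mounted at /home/nyan/nyan16; adjust the names to the actual layout):
Code:
# Check space accounting and the snapdir property on the assumed dataset:
zfs list -o name,used,lused,refer,mountpoint zroot/nyan16
zfs get snapdir zroot/nyan16
# Unmount it and see whether a plain directory with leftover files hides underneath:
zfs unmount zroot/nyan16
ls -la /home/nyan/nyan16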
 