ZFS: A power outage left my ZFS file system with inaccessible files and directories.

My power went out while building ports via poudriere (I don't know whether it is related, but I figured I'd share it). Upon reboot, several files and directories are in a 'phantom' state. As far as I can tell, the affected files were not being edited at the time, nor are they frequently accessed. These files and directories are missing from the file system and, at the same time, present and cannot be altered. E.g., ls will show the name of the file or directory in question along with the error No such file or directory. cp or mv returns File exists. To work around the problem, I moved the parent directory containing the phantom files aside and restored the files in question from a snapshot.
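For anyone hitting the same thing, that workaround can be sketched like this; the paths, dataset layout, and snapshot name are hypothetical, so adjust them to your system:

```shell
# Hypothetical paths and snapshot name -- adjust to your pool layout.
# Move the directory containing phantom entries aside...
mv /usr/local/etc/rc.d /usr/local/etc/rc.d.broken
# ...then restore a known-good copy from a snapshot via the hidden .zfs
# directory of the dataset that contains it.
cp -Rp /usr/local/.zfs/snapshot/daily-2024-04-20/etc/rc.d /usr/local/etc/rc.d
```

This works because snapshots are exposed read-only under each dataset's .zfs/snapshot directory, so no rollback of the live dataset is needed.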

Also, a scrub (zpool scrub) came back clean.

Questions:
What happened?
How can I return the files to their original state or make it so they can be deleted?
I only noticed these phantom files because of the programs that broke. Is there a way I can find other phantom files?
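One rough way to sweep for more of them (a sketch, nothing ZFS-specific): find(1) stats every entry it reads from a directory, so a name that the directory listing returns but stat(2) rejects shows up on its stderr. The starting path is just an example:

```shell
# Names that appear in a directory listing but fail stat(2) land on stderr.
# -x stays on one filesystem (FreeBSD find; GNU find spells it -xdev).
find /usr/local -x 2>&1 >/dev/null \
  | grep 'No such file or directory' \
  || echo 'no phantom entries found under /usr/local'
```

On a healthy tree this prints the fallback message; on a damaged one it lists the phantom candidates.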
 
Generally there is no escape except what you already did, moving them away. There is no fsck for ZFS.
Ok, I was hoping that there was something that could be done with zdb. After looking at its manpage, it didn't seem like it was going to be a simple feat.

Is this even possible with zfs' sync on? ZFS is designed specifically to avoid this. I had abrupt power offs many times, zero problems.
I've had many abrupt power cycles over the years, and this is the first time I've encountered something like this.
 
These files and directories are missing from the file system and, at the same time, present and cannot be altered. E.g., ls will show the name of the file or directory in question along with the error No such file or directory. cp or mv returns File exists.
This sounds like your ZFS metadata is corrupted, in a rather bad way. Obviously, this should never have happened.

There might be other explanations, but I can't think of any sensible ones right now. You can get very bizarre effects if file names contain unicode characters (for example, you might have multiple files that "look like" they have the same name, but in reality the names are different, they just render the same way on the screen). To the clueless user trying to access them, it might seem like a file both exists and doesn't exist: you do "ls" and see a file that looks like it is named "a", but "cat a" fails, because the file name is really a character from another language that just happens to look exactly like a. Usually, doing something like "ls -1 > /tmp/foo" and then examining /tmp/foo with hexdump tends to find those problems. But this is very rare.
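The byte-level check described above might look like this; the directory is whatever one you're suspicious of, and -B is a FreeBSD ls flag:

```shell
# Dump the raw bytes of a listing; two names that render identically on
# screen will show different byte sequences here.
ls -1 > /tmp/names
hexdump -C /tmp/names
# On FreeBSD, ls -B prints non-printable bytes as \xxx octal escapes,
# which also exposes lookalike characters directly:
LC_ALL=C ls -B
```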

As already said: Fixing this with zdb is impractical for most people, and probably not worth the effort.

I suspect you have found a bug. In theory, you should report it and allow some developer to access your file system to see what the exact situation is. In practice, it seems unlikely a developer will have the time to do this.

Purely idle curiosity: What version of FreeBSD (and therefore ZFS) were you running? Has your hardware ever shown any symptoms of memory errors (like random spontaneous crashes)? Does your motherboard have ECC? I'm not saying "blame the hardware", but memory errors in a metadata block are a possibility.
 
If one thinks about the way ZFS does writes: they wind up in a transaction group (TXG), which is queued and flushed to the device either at an interval or when it grows and hits a size limit.

That means there is potential for software to think "data is written" but it hasn't reached the device yet. If you lose power at just the correct instant, you can lose data.
Now extend that to metadata and I can imagine you can get inconsistency.
By definition a build machine is doing lots of write operations.

One could set up various things to mitigate this (a ZIL on a separate SLOG device, configuring synchronous writes for perhaps just metadata, redundancy in the vdevs).
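Those mitigations would look roughly like the following; every pool, device, and dataset name here is made up for illustration, and these commands change pool behavior, so treat them as a sketch rather than a recipe:

```shell
# Hypothetical pool/device/dataset names throughout.
zpool add zroot log /dev/nvd1p1       # dedicated SLOG device for the ZIL
zfs set sync=always zroot/var/db      # force synchronous writes per dataset
zfs set redundant_metadata=all zroot  # keep extra copies of all metadata
zpool attach zroot nvd0p3 nvd2p3      # turn a single-disk vdev into a mirror
```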

But I think every filesystem has that potential if subjected to a sudden loss of power at exactly the wrong instant. Think of a basic gjournal device: writes go to the journal first, then to the "filesystem". If you lose power when only half the data has been written to the journal, the replay on reboot only recovers what made it into the journal.

Above is just my opinion, based on my understanding of ZFS and may not reflect reality.
 
This has happened here, too. I kept it:

Code:
/mnt/lib/python3.7/ctypes # ls -l
total 25
drwxr-xr-x  3 root wheel 3 Dec 10  2020 test
/mnt/lib/python3.7/ctypes # ls -l test
ls: __pycache__: No such file or directory
drwxr-xr-x  3 root wheel 3 Dec 10  2020 .
drwxr-xr-x  3 root wheel 3 Dec 10  2020 ..
(notice the "3" in the entry count of the directory)

There is a thread about it: https://forums.freebsd.org/threads/no-longer-fun-with-upgrading-file-offline.77959/#post-486310

Strangely, back then ls could read the directory, but the file had a 'reparse' flag that didn't belong there. Now, after a couple of upgrades, it has changed to not being found at all.
 
Purely idle curiosity: What version of FreeBSD (and therefore ZFS) were you running? Has your hardware ever shown any symptoms of memory errors (like random spontaneous crashes)? Does your motherboard have ECC? I'm not saying "blame the hardware", but memory errors in a metadata block are a possibility.
Code:
  % uname -a
FreeBSD mediamaster 15.0-RELEASE-p6 FreeBSD 15.0-RELEASE-p6 GENERIC amd64
When this happened, I was on p5.

Yes, I have had plenty of panics, but that has been because of some interplay between FreeBSD, XLibre, and (I think) AMD graphics. I dual boot with Devuan, also running XLibre, and do not have crashes there. I don't think my ZFS woes or the panics are hardware related.

That means there is potential for software to think "data is written" but it hasn't reached the device yet. If you lose power at just the correct instant, you can lose data.
Now extend that to metadata and I can imagine you can get inconsistency.
This does make sense to me; however, it doesn't fit what I experienced. The files and directories that I had problems with were not written during the build or recently. Some directories might have been written at the same time as each other; /usr/local/etc/bastille and /usr/local/share/bastille are one example. I also have problems with /usr/local/etc/rc.d and several files in /usr/local/share/X11/kbd/compat.

Code:
  % ls -la compatold

ls: README: No such file or directory
ls: accessx: No such file or directory
ls: basic: No such file or directory
ls: caps: No such file or directory
ls: complete: No such file or directory
ls: iso9995: No such file or directory
ls: japan: No such file or directory
ls: ledcaps: No such file or directory
ls: ledcompose: No such file or directory
ls: misc: No such file or directory
ls: mousekeys: No such file or directory
total 3
drwxr-xr-x  2 root wheel 13 Apr 21 05:09 .
drwxr-xr-x  9 root wheel 10 Apr 22 06:26 ..
Did you write to the pool yet?
Yes, this is my root partition. I have been resolving the files as I have come across them.
A ZIL might have helped with this. I would like to build a server that would better handle this workload, but I was waiting for better RAM and SSD prices (a day that may never come).

Another issue that I've had with this computer is that the NVMe drive will sometimes freeze, especially under heavy load. The power outage could have happened at such a time. The NVMe issue also doesn't happen when running Devuan.
 
Another issue that I've had with this computer is that the NVMe drive will sometimes freeze, especially under heavy load.
That may be a good thing. I had that happen before too. Turns out, in my case, it was happening when I was maxing out the write speed onto my zroot with sync on; that will block anything else writing to the same drive. You probably do want sync on, but it needs to be managed.
 
That means there is potential for software to think "data is written" but it hasn't reached the device yet. If you lose power at just the correct instant, you can lose data.
Yes, and this is true for nearly all (POSIX) file systems. When an application writes to a file (and even when it closes the file), there is no guarantee that the data is actually on disk and will be readable in the future. That guarantee only exists once the application calls some form of fsync, or has opened the file in a sync mode. But this is not what the OP is seeing here.

Now extend that to metadata and I can imagine you can get inconsistency.
No. A correctly implemented file system should NEVER develop an internal inconsistency. What the OP is seeing (a file that exists when looked at one way and doesn't exist when looked at another) must be impossible. Obviously, it is possible in the real and imperfect world, since the OP is seeing it. But you are right with the following observation: the fact that disk writes can be delayed is usually the mechanism by which file systems develop inconsistencies. For this reason, file system implementations are super careful about writing things in the correct order, or use mechanisms such as fsck after a crash to put things back into a consistent state. It seems that in this case, something went wrong. Given that the OP has not been seeing memory errors, the likely explanation is a bug in ZFS. I've implemented file systems for a living, and I've seen and fixed many of these things. Usually, ZFS is extremely good about this, since a lot of fundamental design decisions (in particular the log-structured writing) make it easy to do it right and hard to do it wrong. But mistakes happen.

But I think every filesystem has the potential if subjected to sudden loss of power at the exact right instant. Think of just basic gjournal device: writes go to the journal first then the "filesystem". If you lose power when only half the data has been written to the journal, on reboot replay the journal only gets what's in the journal.
Yes and no. File systems make certain guarantees. The big ones are explained above: If a problem occurs (detected disk write error, or crash or power loss), the system will be in a consistent state, but perhaps not the state the user = application wished it to be in. If the application calls for a sync write, the data will be "on disk" in the sense that a future read will find it as written (except for future read errors occurring). Note that the state does not have to be transactionally or ACID correct across multiple files, only for one file at a time: if the application first creates file A and then B, and the power fails, B may be on disk, but A may be missing.

Few applications use sync writes, because of the massive performance penalty on spinning disks (and the moderate penalty on SSDs). The idea is that you usually just restart the (idempotent) program and get the correct result. That theory doesn't always work; in particular, make with sloppy makefiles can easily get confused. The way I think of it: it's not data loss if the application didn't call sync at the correct time.

But none of this discussion is relevant to the OP's problem: their power loss left ZFS in a broken state.
 
ralphbsz thanks for the insights. I know that a filesystem should never get an internal inconsistency, but humans write the code, so "should never" is not "will never". I was kind of hedging things there.

Hard power fails are always an interesting case because we can't quantify "there is enough power reserve to flush the next 37 bits to the device" and it all comes down to recovering.

the system will be in a consistent state, but perhaps not the state the user = application wished it to be in.
agree 100%. The problem is that the user/application considers its own expected state to be "the correct state".

But none of this discussion is relevant to the OP's problem: their power loss left ZFS in a broken state.
Again, agree. Just speculating as to how it could have happened. Could be a bug in sw, could be a hw thing, could be a combination. Hard to tell unless it can be reproduced.
 
Just for the record, ZFS is supposed to be safe against this. It carefully orders writes so that a termination of the queue writes at any point in time doesn't lead to inconsistencies.

The usual explanation/excuse for this happening anyway is that the drive disobeyed that given write order when flushing its cache. Bad things are happening in drive software in the name of speed.
 
Just for the record, ZFS is supposed to be safe against this. It carefully orders writes so that a termination of the queue writes at any point in time doesn't lead to inconsistencies.

The usual explanation/excuse for this happening anyway is that the drive disobeyed that given write order when flushing its cache. Bad things are happening in drive software in the name of speed.
This is possible. It is also possible that ZFS does something that it shouldn't do. And it is also possible (and quite common nowadays) that both parties do something in a scope that is not strictly defined, and taken together this opens a gap where malfunction might appear.

You are right in that ZFS should not allow such things to happen, and code can indeed be written in a way that such flaws become logically impossible, but that would require the code to be formally verified, and ZFS isn't.
In fact, given the huge amount of features added over time, it is quite likely that there are hidden flaws lingering around[*]. But that is true for almost all other software likewise, and in comparison ZFS is still one of the most reliable pieces.

[*] Just to give you an example of an obvious defect (not related to this issue here): about half a year ago I noticed that my ZFS was accumulating a count in
kstat.zfs.misc.arcstats.l2_cksum_bad. It did so repeatedly, and after switching hardware and not finding a tangible defect, I went through my local configuration adjustments and switched off the vfs.zfs.l2arc.trim_ahead feature (which I had enabled because it seemed a "good thing"). The issue hasn't reappeared since. Apparently that feature isn't implemented correctly.
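For anyone who wants to check for the same symptom, the knobs involved (sysctl names as given above) can be inspected and flipped like this:

```shell
# Read the L2ARC checksum-error counter; a steadily rising value is the
# symptom described above.
sysctl kstat.zfs.misc.arcstats.l2_cksum_bad
# Turn the trim-ahead feature back off (0 disables it)...
sysctl vfs.zfs.l2arc.trim_ahead=0
# ...and persist that across reboots in /etc/sysctl.conf:
#   vfs.zfs.l2arc.trim_ahead=0
```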
 
The usual explanation/excuse for this happening anyway is that the drive disobeyed that given write order when flushing its cache. Bad things are happening in drive software in the name of speed.
Thank you for that observation; I had forgotten about that possibility. If the OP is running on spinning rust, this is not very likely, as those disks typically work on only one IO at a time and will not perform a write until the previous one is completed. On the other hand, SSDs of all kinds tend to have internal parallelism, which file systems (and the IO layer below them) have to handle carefully.

And it turns out OP not only is running on an SSD, they've even had problems with it:

Another issue that I've had with this computer is that the NVMe drive will sometimes freeze, especially under heavy load. The power outage could have happened at such a time. The NVMe issue also doesn't happen when running Devuan.

Suspicious. Hard to debug and fix.
 
The usual explanation/excuse for this happening anyway is that the drive disobeyed that given write order when flushing its cache. Bad things are happening in drive software in the name of speed.
likely.
Another issue that I've had with this computer is that the NVMe drive will sometimes freeze, especially under heavy load.
This may have to do with writes arriving when there are no pre-erased blocks: erasing a block on an SSD is much slower than writing data. Of course, it could also be due to bad drive firmware, but under heavy write load this may be normal behavior.
 
I had poudriere ZFS datasets corrupted after a power outage. So the solution: unmount them recursively and destroy them.
The problem was not ZFS but poudriere handling bad data.
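In case it helps the next person, that cleanup can be sketched as follows. The dataset name is hypothetical (check ZPOOL/ZROOTFS in your poudriere.conf; something like zroot/poudriere is typical), and zfs destroy -r is irreversible, so inspect first:

```shell
# Hypothetical dataset name -- confirm it in poudriere.conf before running.
zfs list -r zroot/poudriere        # inspect what would be destroyed
zfs destroy -r -f zroot/poudriere  # -r recurse into children, -f force-unmount
# poudriere recreates its datasets when jails/ports trees are set up again.
```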
 