ZFS ZFS: Corrupt data? Having an issue deleting a directory, causes processes to run away

Andriy · Oct 14, 2024

Nobody proved that the bug was / is because of bad memory.

moobsd · Oct 16, 2024

Thread has gone a bit silent since the latest revelation. Does the BSD dev's have interest in resolving this, not really sure what angle it should be taken from. Clearly looks like a bug to me that was never resolved. Almost wishing I stuck with UFS

PMc · Oct 17, 2024

Well yes, this is a known bug closed for convenience. :/
As it appears to have happened on linux too, it is not a matter for the BSD devs, but for the ZFS guys - whoever that might be.

For sure there is defective data on disk. Working around and properly handling that defective data is usually not feasible, and is actually an infinite task when trying to cater for all kinds of imaginable corruption; it is preferable to figure out how that corruption happened to be written to disk in the first place. But apparently in this case, nobody knows that. As already mentioned above, I once had a similar case of data written that should be impossible to be written. And while I have ECC and mirroring, I would neither exclude the possibility of hardware malfunction nor of bugs in ZFS.

Second, if the actual cause remains obscure, what precisely should be done about it? I think an endless loop in kernel code is a VERY evil thing, and one should code in a way to avoid that. But what to do instead? Checking for theoretically impossible errors costs time, and that hurts everybody always.

So I fear, as long as nobody can showcase how the error gets initially created, not much will happen. No life-support depends on this not happening, people will delete the defective filespace and restore from backup.

Andriy · Oct 17, 2024

To the excellent words of PMc I can only add that it is better to contribute to the bug report and get involved with developers than hope that forum posts would magically resolve the bug.

Cath O'Deray · Oct 17, 2024

PMc said:
… this is a known bug closed for convenience. :/ …

I hope that you don't mean openzfs/zfs issue 5346.

PMc · Oct 17, 2024

Cath O'Deray said:
I hope that you don't mean openzfs/zfs issue 5346.

Yes I do. What's the problem? That report reads "Closed", and it describes quite precisely what we have seen here too.
It was closed after the creator had figured he had probably defective memory, so there was a possible (but unprovable) explanation (if they run the machine four years before detecting the bad memory, it may well have gone defect at a later time). Then, there is another person in that report who observed the same issue happen. Adding the case of here, the likelyhood of an actual bug in ZFS does increase.

As I have mentioned occasionally, I recognize two levels of software quality (interplanetary and interstellar), and in this scenario such reports would not be closed. There would be a third path besides "actively working on a fix" and "no feasible way of action, so just close it for convenience", and that would be "under surveillance".

What I observe is, such critical surveillance does happen only for security-related issues. There is not much focus and drive behind identifying and weeding out all the "ordinary" races and corner-cases. And I don't like that.

So, that's my viewpoint. You may have a different one, then please explain.

mer · Oct 17, 2024

PMc said:
There is not much focus and drive behind identifying and weeding out all the "ordinary" races and corner-cases. And I don't like that.

You're not the only one. In my experience the probability of getting a problem fixed is often related to:
It's obvious
It's reproducible
I have a coredump
I have detailed, relevant logs

And in my experience races and corner-cases are the hardest to fix because it may be reproducible, but difficult to do and often no coredump or detailed logs. "I reboot and it went away".

moobsd · Oct 17, 2024

PMc said:
Yes I do. What's the problem? That report reads "Closed", and it describes quite precisely what we have seen here too.
It was closed after the creator had figured he had probably defective memory, so there was a possible (but unprovable) explanation (if they run the machine four years before detecting the bad memory, it may well have gone defect at a later time). Then, there is another person in that report who observed the same issue happen. Adding the case of here, the likelyhood of an actual bug in ZFS does increase.

As I have mentioned occasionally, I recognize two levels of software quality (interplanetary and interstellar), and in this scenario such reports would not be closed. There would be a third path besides "actively working on a fix" and "no feasible way of action, so just close it for convenience", and that would be "under surveillance".

What I observe is, such critical surveillance does happen only for security-related issues. There is not much focus and drive behind identifying and weeding out all the "ordinary" races and corner-cases. And I don't like that.

So, that's my viewpoint. You may have a different one, then please explain.

I understand your point of view, the only issue I have with that methodology is that someone 5 years from now is going to run into the same issue, post the issue here, everyone is going to look over the issue, get confused, run through the same steps to diagnose, find out it's the same issue, and then ignore it. Secondary to that, while my issue (and weirdly, the original reporter in the github report) had to do with very unimportant cache directories, nothing really says it couldn't be something more important. But again, agree that it seems to be a very fringe circumstance.

VladiBG · Oct 17, 2024

issue 5346 was an example of what happen when there's directory corruption. It's not necessary this corruption to be caused by bad Memory.

ralphbsz · Oct 17, 2024

PMc said:
There is not much focus and drive behind identifying and weeding out all the "ordinary" races and corner-cases. And I don't like that.

If they are not reproducible, and there is no debugging information (such as core dumps when the problem first occurred, and traces of how it happened), then it can not be diagnosed nor fixed. Keeping a bug open that will not be addressed is pointless.

Should it be reproduced in the future, and then better debugging information is available, it is easy to reopen the bug.

Andriy · Oct 17, 2024

mer said:
You're not the only one. In my experience the probability of getting a problem fixed is often related to:
It's obvious
It's reproducible
I have a coredump
I have detailed, relevant logs

And in my experience races and corner-cases are the hardest to fix because it may be reproducible, but difficult to do and often no coredump or detailed logs. "I reboot and it went away".

This is where determined / stubborn / obsessed people are sometimes of the greatest help:

when the problem is not obvious at all
when it's not reproducible at all but happens from time to time on one particular system
when there's hardly any debugging or diagnostic data (with the vanilla code)

Those people would keep banging on it.
Building customer kernels with diagnostic printf-s.
Making observations and systematizing them.
Trying to reach out to subject matter expert and raise their interest / curiosity.

PMc · Oct 18, 2024

ralphbsz said:
If they are not reproducible, and there is no debugging information (such as core dumps when the problem first occurred, and traces of how it happened), then it can not be diagnosed nor fixed.

Sure it can. If there might be bug, then it is confined into the code, it can't escape. You just need to read the code and understand what it does (in any possible case).
That is called logical verification - and the main problem with that is, it is not compatible with make-money-fast.

Usually, when accidents happen, when somebody gets killed or an airplane falls from the sky, we do not try to reproduce it. Instead we anlyze each and every detail we can obtain. You see, that is a very different approach: the police officer does not just wait until the killer sends in a written guilty plea to the right department.

And now imagine, if on this planet we had leaders in charge instead of criminals, and we had done the right thing, and instead of staging petty wars for profit, we would have stayed in space and by now would be mining the asteroid belt, and if you would know that your children are out there and depend on the software doing the right thing - then you would easily develop a different viewpoint.

moobsd · Oct 18, 2024

At any rate, seems to me that it has to do with how chrome is writing out it's root cache directory. Not betting my life on that though, just the similarity in the two issues began with a directory created by chrome. Think there was something about a archlinux pacman cache directory involved in that as well, but I'm not getting into that.

Appreciate all of the contributions to this thread thus far. I'm sort of going to block this out of my mind now, at least for the time being. I have noticed other applications I run from time to time triggers a discovery to portions of my home directory that causes the loop to start again (wine comes to mind, think it has to do with the winecfg drives, but technically anything that sorts through my homedir will trigger it) but seems to be pretty isolated scenario. Anyway, thanks!

Cath O'Deray · Oct 19, 2024

moobsd said:
chrome

Do you use (Chromium with) Widevine?

htop?

Cath O'Deray · Oct 20, 2024

moobsd said:
…

Upon restarting from single user mode at this point, it seemed that there was a kernel panic (image attached)

…
init died (signal 0, exit 0)
…

Can you share an edition of the photograph that's less cropped? I'd like to see more of what preceded the init lines.