ZFS ZFS: Corrupt data? Having an issue deleting a directory, causes processes to run away

ralphbsz · Sep 26, 2024

You use the words "folder" (I think you mean directory), and "seek into" (I think you mean read the directory, or cd into it). Can you explain exactly what you do when you try to access the directory by hand? Then you say "unable to seek...", and I'd love to know what "unable to" means. What exactly happens when you try "cd /home/user..." or "ls -lF /home/user..."? I want to reduce the problem to really simple operations, and see which ones succeed and which ones fail (and how they fail). I know this might be tedious, since any of these operations might cause an OS hang/crash and reboot.

So let's start with:

ls /home/user/tmp, which proves that the parent directory is OK, and that it can find an entry called Cache_Data
ls -lF /home/user/tmp, which shows us that Cache_Data is a directory, what its permissions are, and how big it is. If there are ACLs or EAs involved, use the ls options to show those too.
ls /home/user/tmp/Cache_Data, which shows that the directory is readable, and what entries it contains. The number of entries should be reasonable, whatever that might mean.
ls -lF /home/user/tmp/Cache_Data, which shows that the directory is traversable (meaning the dirent structures returned by readdir have valid content and point to real things)

If all of this succeeds with no problems, we've learned a lot of things - in particular that the reason the find keeps failing must be buried further down in the directory hierarchy. If any of these fail, we can start debugging that one isolated failure.
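A minimal sketch of how the checks above could be run one at a time without tying up the terminal, assuming a POSIX sh. The helper name probe and the PROBE_SECS variable are made up for this example. Each command is started in the background and polled once a second, so a command that never returns is reported as still running instead of blocking the shell. The sketch deliberately does not try to kill anything.

```shell
#!/bin/sh
# probe CMD [ARGS...]: hypothetical helper, not a standard tool.
# Runs CMD in the background with its output discarded, polls it once a
# second for up to PROBE_SECS seconds (default 5), and prints one line:
# "ok", "failed", or "still running".
probe() {
    limit=${PROBE_SECS:-5}
    "$@" >/dev/null 2>&1 &
    pid=$!
    waited=0
    # kill -0 sends no signal; it only tests whether the process still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$limit" ]; then
            echo "still running after ${limit}s (pid $pid): $*"
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    # The command has exited; collect its exit status.
    if wait "$pid"; then
        echo "ok: $*"
    else
        echo "failed: $*"
    fi
}
```

Called as probe ls /home/user/tmp, then probe ls -lF /home/user/tmp, and so on down the list, the first line that does not start with "ok" names the step worth debugging in isolation.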

And just to be obnoxious and repeat myself: Concurrently with those checks above, learning a little about zdb and wandering around in there wouldn't hurt either.

moobsd · Sep 26, 2024

ralphbsz said:
You use the words "folder" (I think you mean directory), and "seek into" (I think you mean read the directory, or cd into it). Can you explain exactly what you do when you try to access the directory by hand? Then you say "unable to seek...", and I'd love to know what "unable to" means. What exactly happens when you try "cd /home/user..." or "ls -lF /home/user..."? I want to reduce the problem to really simple operations, and see which ones succeed and which ones fail (and how they fail). I know this might be tedious, since any of these operations might cause an OS hang/crash and reboot.

So let's start with:

ls /home/user/tmp, which proves that the parent directory is OK, and that it can find an entry called Cache_Data

ls -lF /home/user/tmp, which shows us that Cache_Data is a directory, what its permissions are, and how big it is. If there are ACLs or EAs involved, use the ls options to show those too.

ls /home/user/tmp/Cache_Data, which shows that the directory is readable, and what entries it contains. The number of entries should be reasonable, whatever that might mean.

ls -lF /home/user/tmp/Cache_Data, which shows that the directory is traversable (meaning the dirent structures returned by readdir have valid content and point to real things)

If all of this succeeds with no problems, we've learned a lot of things - in particular that the reason the find keeps failing must be buried further down in the directory hierarchy. If any of these fail, we can start debugging that one isolated failure.

And just to be obnoxious and repeat myself: Concurrently with those checks above, learning a little about zdb and wandering around in there wouldn't hurt either.

Yeah I mean, are you trying to just be obnoxious and repeat yourself? I actually used folder and directory interchangeably because in layman's terms they pretty much mean the same thing. Is layman's not simple enough for you? Are you reading this forum from a terminal TTY or something? When I mentioned seek, I prefaced it with the idea of any form of seek, whether it be cd'ing, ls'ing, or running something like find. "Unable to" really means exactly what it might seem, "to not be able to".

What exactly are you wanting by all of your starters... to get the debug information from truss? If so, why not just say that? I have no issues getting that if that's the case, but really what exactly is it that you would do with it? Wouldn't they all be using system calls similar to the ones find did?

Here's a real question though: what is zdb, and what does "wandering around in there" mean?

richardtoohey2 · Sep 26, 2024

moobsd said:
what is zdb and "wandering around in there" mean?

man zdb: zdb – display ZFS storage pool debugging and consistency information

So the suggestion is that you have a play with the ZFS debugger tool - that might help spot what is happening in your situation (or might help in the future).

I've just started using ZFS myself so I'll add looking at zdb to my TODO list.

PMc · Sep 26, 2024

moobsd said:
Yeah I mean, are you trying to just be obnoxious and repeat yourself?

It is just that the phenomenon does not really make sense yet.
As it seems, processes get stuck when entering that directory. But then also, they do not really get stuck, because then they would be in "D" state (and have something they wait for), not in "R". The latter means endless-loop, and probably nobody here has seen a process starting an endless loop by simply entering into a directory.
So this is quite creepy. And probably many here would like to be hands-on with the system and then use their individually favoured debug tools to look closer into this. I for my part would kill that find in a way to obtain a core dump and then look into the source code where that loop is actually walking along - but then you say it cannot be killed...
The other thing one would want to know is, what is actually in this directory? Normally there should be just web content files collected by chromium. (Curious thing: what would happen if one tried to run chromium?) A way to find out what is in there would be zdb. zdb is the debugger for ZFS, and it can tell all the low-level bits and bytes in the pool - and it is quite a pain to read and understand that...

Once I had something vaguely similar: somehow a file in the ZFS had acquired a wrong flag (of those flags which can be set with chflags). Nobody could explain how that flag might have appeared, because it is not used in FreeBSD, and nothing in the code or in ZFS is able to handle it - so anything accessing that file would just fail, and it was impossible to delete the file without deleting the pool.
These strange effects usually fall into two categories: they are either spurious effects resulting from a cosmic ray flipping some bit, or they are (rarely encountered) bugs.

PMc · Sep 27, 2024

moobsd said:

Here is truss for the manual run of /etc/periodic/security/110.neggrpperm

Code:

49289: getdirentries(5,"\M-fA\n\0\0\0\0\0\^A\0\0\0\0\0\0"...,4096,{ 0x0 }) = 200 (0xc8)
49289: fstatat(AT_FDCWD,"Cache_Data",{ mode=drwx------ ,inode=101914,size=110419,blksize=16384 },AT_SYMLINK_NOFOLLOW) = 0 (0x0)
49289: fstatat(AT_FDCWD,"tmp2",{ mode=-rw-r--r-- ,inode=768596,size=2207,blksize=4096 },AT_SYMLINK_NOFOLLOW) = 0 (0x0)
49289: fstatat(AT_FDCWD,"old",{ mode=-rw-r--r-- ,inode=672354,size=9411,blksize=9728 },AT_SYMLINK_NOFOLLOW) = 0 (0x0)
49289: fstatat(AT_FDCWD,"new",{ mode=-rw-r--r-- ,inode=699468,size=9376,blksize=9728 },AT_SYMLINK_NOFOLLOW) = 0 (0x0)
49289: getdirentries(5,0x2ecb260f1000,4096,{ 0x1f284822 }) = 0 (0x0)
49289: close(5)                     = 0 (0x0)
49289: open("Cache_Data",O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC,017354302024) = 5 (0x5)
49289: fcntl(5,F_ISUNIONSTACK,0x0)         = 0 (0x0)
49289: fstat(5,{ mode=drwx------ ,inode=101914,size=110419,blksize=16384 }) = 0 (0x0)
49289: fchdir(0x5)

... and there it stays. Doesn't explain much of anything outside the fact that Cache_Data is what's causing this problem, and I cannot delete it / seek into it.

Well, if you look closely, this shows a few things.

There is no result code for the final fchdir(). Let's assume truss works nicely and prints the syscall on entering the kernel and the result after leaving the kernel; then this means we are still inside the kernel.
So we would be looping inside the kernel - which is an ugly thing and should not happen. That might explain why we cannot kill the thing. A normal signal can very likely not be delivered while in the kernel, and I do not exactly know how kill -9 acts.
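The other interesting thing is the size=110419 reported for that directory. On ZFS the size of a directory is its number of entries, so apparently there are about 110419 files in it. That is not really few (but should still be manageable).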
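The "no result code" observation can be checked mechanically on a longer log. This is only a sketch: the helper name unfinished_syscalls is made up, and it assumes the log contains nothing but syscall lines in the format shown in the excerpt above, where a finished call ends in ") = result".

```shell
#!/bin/sh
# unfinished_syscalls: hypothetical helper, not part of truss.
# Reads truss output on stdin and prints only the lines that have no
# "= <result>" after a closing parenthesis, i.e. syscalls that were
# entered but had not returned when the log was written.
unfinished_syscalls() {
    grep -Ev '\)[[:space:]]*=[[:space:]]'
}
```

With truss writing to a file through its -o option (say /tmp/truss.out, a path picked for this example), running unfinished_syscalls < /tmp/truss.out from a second terminal should print just the fchdir(0x5) line for the run quoted above.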

Cath O'Deray · Sep 27, 2024

moobsd said:
… into the /etc/periodic/security/110.neggrpperm … After this change, I was able to run the job successfully, …

Without the change, did the cron job never complete on 14.1?

Cath O'Deray · Sep 27, 2024

moobsd said:
… this find process … can't be killed …

With the original, nonmodified 110.neggrpperm:

service cron stop

– and then (with what's below as an example):

/bin/kill -- -32935

moobsd said:

Here is the htop tree view

Code:

10446 root        20   0 12916  2552 S   0.0  0.0  0:00.19 ├─ /usr/sbin/cron -s
32935 root        21   0 12916  2556 S   0.0  0.0  0:00.00 │  └─ cron: running job
33469 root        40   0 13376  2924 S   0.0  0.0  0:00.00 │     └─ /bin/sh - /usr/sbin/periodic daily
34220 root        40   0 12712  2136 S   0.0  0.0  0:00.00 │        └─ lockf -s -t 0 /var/run/periodic.daily.lock /bin/sh /usr/sbin/periodic LOCKED daily
34580 root        68   0 13376  2908 S   0.0  0.0  0:00.00 │           └─ /bin/sh /usr/sbin/periodic LOCKED daily
36476 root        68   0 13376  2920 S   0.0  0.0  0:00.00 │              ├─ /bin/sh /usr/sbin/periodic LOCKED daily
65291 root        68   0 13376  2916 S   0.0  0.0  0:00.00 │              │  └─ /bin/sh /etc/periodic/daily/450.status-security
65938 root        68   0 13376  2912 S   0.0  0.0  0:00.00 │              │     └─ /bin/sh - /usr/sbin/periodic security
66524 root        68   0 12712  2128 S   0.0  0.0  0:00.00 │              │        └─ lockf -s -t 0 /var/run/periodic.security.lock /bin/sh /usr/sbin/periodic LOCKED security
66649 root        68   0 13376  2916 S   0.0  0.0  0:00.00 │              │           └─ /bin/sh /usr/sbin/periodic LOCKED security
68743 root        20   0 13376  2924 S   0.0  0.0  0:00.00 │              │              ├─ /bin/sh /usr/sbin/periodic LOCKED security
73105 root        36   0 13376  2936 S   0.0  0.0  0:00.00 │              │              │  └─ /bin/sh - /etc/periodic/security/110.neggrpperm
74439 root        37   0 13376  2928 S   0.0  0.0  0:00.00 │              │              │     └─ /bin/sh - /etc/periodic/security/110.neggrpperm
74779 root        20   0 31200 17812 R 100.0  0.1  8h13:51 │              │              │        ├─ / /usr/src /zroot /home /usr/local/bastille /usr/local/poudriere /var/mail /var/log/bastille /usr/local/poudri

Does there remain a non-killable find?

Side note

Below the / /usr/src /zroot /home /usr/local/bastille /usr/local/poudriere /var/mail /var/log/bastille /usr/local/poudri… line, above, I half-expected to see lines for these two processes:

tee /dev/stderr

wc -l

PMc · Sep 27, 2024

Code:

# dtrace -n 'profile-1  { stack(); ustack(); }'

This should list each CPU and what it is currently doing, once a second.

A CPU does either compute inside the kernel or some user process. If it is idle, it computes the idle task inside the kernel.
It will be shown whether it computes the kernel or which user process (or library).

We would expect either some stack starting with "find", or starting with "kernel" and having "sys_fchdir" in the third-from-bottom line (give or take a bit).

ralphbsz · Sep 27, 2024

moobsd said:
Yeah I mean, are you trying to just be obnoxious and repeat yourself?

Yes, I'm being obnoxious and repeating myself. Because in this thread, I haven't yet seen information that would be required to diagnose the problem surgically and cleanly. There are vague descriptions of something going wrong (what exactly goes wrong?) when doing a complex series of operations (namely a find). I would like to see a single operation that goes wrong, and then see what exactly that "wrongness" is (user space hang, looping, kernel hang, return code, ...). And when I say "operation" in this context, I mean syscall. That's why I asked about relatively low-level programs (such as ls with various options), because they run a small and understandable set of sys calls.

I actually used folder and directory interchangeably because in laymen terms they pretty much mean the same thing, is laymen not simple enough for you?

Actually, the thing called "folder" (typically a folder on a GUI) can also be a softlink, while the term "directory" is unambiguous.

When I mentioned seek, I prefaced it with the idea of any form if seek, whether it be cd'ing, ls'ing, or running something like find.

There is a world of difference between ls, ls, ls, cd, and find. And I mentioned "ls" several times because depending on options, ls runs different operations. It nearly always does opendir followed by readdir, but whether it runs the opendir on "." or on a named entity depends on the arguments. And whether it then runs stat depends on the options. I would like to see exactly which syscall fails, and in which fashion.

The term "seek" is used heavily in file system interfaces and implementation. It does not mean at all what you are using it for.

"Unable to" really means exactly what it might seem, "to not be able to"..

What exactly happens when you try (other than: the find hangs, and it isn't even clear yet where it hangs)? Can you reduce the complex find to a simpler operation? Can you give us more details about exactly in what fashion it fails?

Look, in the ideal world, if I were being paid to debug this, I would ask you to execute exactly the following series of syscalls or C library calls, with exactly the following arguments, and report exactly what happens on each step. And I'd e-mail you a small program (in a language du jour) that does exactly this. I don't get paid to help debug your problems, so I'm trying to get some clear and crisp information, using language that makes the information actionable, with the minimum hassle for everyone.

What exactly are you wanting by all of your starters... to get the debug information from truss? If so, why not just say that?

You can use truss to run any of the small examples, that wouldn't hurt, and it would probably even help. But it isn't even necessary. It would already be great if you could report something like "Step A worked with no problem, step B caused the following error message to be printed, and step C hung, didn't react to Control-C, and ps showed the hung process to be in D state".

Wouldn't they be all using a similar system call than find did?

No, find uses a lot of different system calls, and then a long sequence of them.

Here's a real question though, what is zdb and "wandering around in there" mean?

Sorry about not explaining that. Every file system has metadata, which is everything that is not "data", which is defined as the content of the files. So metadata includes things like

directories (which are lists of names, and then pointers to what these names are),
whether the object pointed to by a name is a file, directory, link, or something else,
attributes of the object, such as mtime and atime, permissions, size (important for directories in this problem I suspect), and link counts,
a few more uncommon things including ACLs (a more complex way to express permissions), EAs (extended attributes), and flags (is this object changeable or has it been archived),
and file system internal things that make everything work, like inode numbers and allocation bitmaps.

Zdb is a program that allows a user to read that metadata in quite a raw format, and then use it to follow links, most importantly directory entries. Following those structures is what I meant by "wandering around". What I didn't mention is "take a look while you wander". For example, if this were a file system I was familiar with, I would start by looking at the /home/user/tmp directory: Does it have a sensible number of entries? How many of the entries are directories? Is the link count of the directory 2 + number of subdirectories? Is one of the entries something called "Cache_Data", and is that entry an object of type directory? Does the stat of that entry look like it would be readable, and does it have any suspicious looking flags, EAs, or ACLs? Is its size somewhat reasonable (0 or a huge number are implausible)? Where on disk is Cache_Data stored? Is that place on disk plausible, and is it not shared with any other object? If I look at these blocks on disk, does their content look like directory entries should look? Does the number of directory entries found on disk for Cache_Data match its size, reasonably or perfectly? If I try to read Cache_Data as a directory, do I get names and objects, and the correct number? Does it have . and .. entries? Is the link count on those entries good? How many subdirectories are there, and does that match the link count itself? And so on and so on.

With just a few minutes, a ZFS internals expert (which I am not!) would be able to validate that the directory itself is in great health, or find a problem in the metadata structures. If they find a problem, how did that happen, and does the syndrome match a known cause? If there is no problem with the metadata, then why does reading it "not work" (whatever that might mean)?
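As a possible starting point for that wandering, here is a sketch that builds the zdb command for one directory. It relies on ZFS reporting the object number as the inode number. The helper name zdb_object_cmd is made up, and the dataset name has to be supplied by whoever runs it. The helper only prints the command, and it finds the inode with ls -di, which stats the directory without opening or reading it.

```shell
#!/bin/sh
# zdb_object_cmd DATASET PATH: hypothetical helper, not part of ZFS.
# Prints (does not run) a zdb command that dumps the object behind PATH.
zdb_object_cmd() {
    dataset=$1
    # ls -di prints "<inode> <name>"; keep only the first field.
    inode=$(ls -di "$2" | awk '{ print $1; exit }')
    # An empty result means ls could not stat the path.
    [ -n "$inode" ] || return 1
    echo "zdb -dddd $dataset $inode"
}
```

For example, zdb_object_cmd zroot/home /home/user/tmp/Cache_Data would print a command ending in 101914 if the inode matches the truss output earlier in the thread. The dataset name zroot/home is a guess for illustration only.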

Mirror176 · Sep 29, 2024

If you didn't have one already, you would want to make sure the data is backed up and for troubleshooting it would be wise to have an offline dd image of this drive too. Several issues come to mind that could be relevant to such a problem besides just FS code:

drive overheating
drive is failing
communication path to drive is faulty
drive has a firmware bug
BIOS/UEFI issue
FreeBSD has no swap configured; in my experience this normally requires memory to be used up or overallocated before processes become stuck/nonresponsive.

Have you confirmed firmware and BIOS are up to date? What drive and motherboard model? Does the drive have good SMART status and pass its tests? Anything special/different/customized to the ZFS pool beyond it being created with 14.1 on a single disk without any extra geom layers like geli added?

A successful clone of the drive that then repeats a hardware problem sounds like either a hardware failure outside the drive, the new drive has the same type of problem, or the data is already in a bad state. Are there issues reading from the pool if placed in read only mode? Does dd fail to read the device fully? Can the pool be scrubbed successfully?

Have you tried to delete offending content from single user mode or when booted from separate media?

If normal tracing tools are failing to be useful, debugger tools may still get farther. DTrace may be beneficial too with its many hooks both in and out of the kernel.

If you can find a particular directory/file that causes the glitch, then the calls to it that are troublesome can be more quickly narrowed down. I'm not a coder but thought that stat can make more selective calls to reading properties of the object.

Some examples of debugging ZFS issues including userspace and kernelspace issues are found at https://www.youtube.com/watch?v=JoD_Kmqnkgg.

ZFS under heavy disk I/O usually makes noticeable impacts on my machine's responsiveness but that has been mostly magnetic drive testing/response and those have horrible multitasking and random-read abilities.

Andriy · Sep 29, 2024

If a process is stuck and unkillable, then procstat -kk is the best tool to see where it is stuck.

PMc · Sep 29, 2024

Andriy said:
If a process is stuck and unkillable, then procstat -kk is the best tool to see where it is stuck.

That is exactly the point why I wrote the above message, procstat -kk does NOT show processes stuck in a loop:

kstack | -k
Display the stacks of kernel threads in the process, excluding
stacks of threads currently running on a CPU

That's why I was happy finding this one:

PMc said:
Code:

# dtrace -n 'profile-1 { stack(); ustack(); }'

This actually shows what is going on.

Anyway, the OP seems to have disappeared, so probably they found what was going wrong...

Andriy · Sep 30, 2024

PMc said:
procstat -kk does NOT show processes stuck in a loop

I am not sure if that's entirely true. I think that only if the running / looping thread also has interrupts disabled then procstat won't be able to "sample" it.
But I am also not sure if DTrace profile probe would be able to sample such a thread either.
FWIW, procstat used to be able to collect stack traces even from such threads.
But then the mechanism for stack collection was changed to something more light weight and that ability was lost.
See commit 1c29da02798d9 aka r357334 in subversion.

PMc · Sep 30, 2024

Andriy said:
I am not sure if that's entirely true. I think that only if the running / looping thread also has interrupts disabled then procstat won't be able to "sample" it.
But I am also not sure if DTrace profile probe would be able to sample such a thread either.

Me neither. Just add it to the toolbox and time will tell.

moobsd · Sep 30, 2024

Haven't forgotten about you all. Having a busy Monday. Will check back with you all soon throughout the week. Thanks for all your support thus far.

Andriy · Sep 30, 2024

Another useful tool is pmcstat(8), on hardware where it works, of course.
Profiling interrupts are unmaskable, so the tool should see all kinds of things more reliably.
But the interface is not very intuitive and there used to be some bugs.

Cath O'Deray · Oct 4, 2024

Cath O'Deray said:
… I don't know whether it's a bug, I'm fairly certain that there's no problem if the cron job is given time to complete. …

Cath O'Deray said:
…

– and then (with what's below as an example):

/bin/kill -- -32935

Not allowing time for normal completion of a cron job:

Code:

root@mowa219-gjp4-zbook-freebsd:~ # pkg upgrade -f -r local-poudriere emulators/virtualbox-ose-kmod sysutils/sysctlbyname-improved-kmod x11/nvidia-driver-470
Updating local-poudriere repository catalogue...
Fetching meta.conf: 100%    178 B   0.2kB/s    00:01   
Fetching data.pkg: 100%  145 KiB 148.2kB/s    00:01   
Processing entries: 100%
The provides database is up-to-date.
local-poudriere repository update completed. 528 packages processed.
All repositories are up to date.
pkg: Cannot get an advisory lock on a database, it is locked by another process
root@mowa219-gjp4-zbook-freebsd:~ # service cron stop
Stopping cron.
Waiting for PIDS: 2968.
root@mowa219-gjp4-zbook-freebsd:~ # ps aux | grep cron
root         29049   0.0  0.0      14260    2028  -  I    03:01      0:00.00 cron: running job (cron)
root         48954   0.0  0.0        508     316  5  D+   04:16      0:00.00 grep cron
root@mowa219-gjp4-zbook-freebsd:~ # /bin/kill -- -29049
kill: -29049: No such process
root@mowa219-gjp4-zbook-freebsd:~ #

Why no such process?

Cath O'Deray · Oct 4, 2024

Ah,

Code:

root@mowa219-gjp4-zbook-freebsd:~ # service cron stop
Stopping cron.
Waiting for PIDS: 2968.
root@mowa219-gjp4-zbook-freebsd:~ # ps aux | grep cron
root         29049   0.0  0.0      14260    2028  -  I    03:01      0:00.00 cron: running job (cron)
root         48954   0.0  0.0        508     316  5  D+   04:16      0:00.00 grep cron
root@mowa219-gjp4-zbook-freebsd:~ # /bin/kill -- -29049
kill: -29049: No such process
root@mowa219-gjp4-zbook-freebsd:~ #

…

Code:

root@mowa219-gjp4-zbook-freebsd:~ # date
Fri Oct  4 04:35:07 BST 2024
root@mowa219-gjp4-zbook-freebsd:~ # /bin/kill -9 29049
root@mowa219-gjp4-zbook-freebsd:~ # /bin/kill -- -29051
root@mowa219-gjp4-zbook-freebsd:~ #

So:

after SIGKILL killing the one process for cron: running job (cron)
it became possible to /bin/kill -- the group of processes that remained of the tree that was previously rooted in the one above – with the negative number for the root of the current group.

moobsd · Oct 4, 2024

Hey all

It's sort of been one of those weeks for me but did want to log on and provide a brief update. As mentioned previously I have the job behind cron with the adjustment to ignore the directory in question which has put a band-aid on the problem. This *potentially* could be caused by a corrupt BIOS but haven't had a lot of time to look more closely. By potentially I mean that the system's BIOS hasn't been updated in a few years; when I checked for available updates, my system was behind several revisions (version 1.14.1 if anyone cares to poke through the release notes).

The reason I'm suspecting possible BIOS corruption is because I've attempted to install the latest version but for whatever reason it doesn't appear to want to install, so at this time I'm sort of stuck on that version. My plan when I get some time to do so is reset the CMOS and attempt another update. After that, hopefully if the update works I'll come back and see about troubleshooting this further.

Since others have asked, the system's specs are below. Aside from the new HDD the rest of the system is stock.

Dell Latitude 5420
16GB RAM

New HDD: Silicon Power 512GB NVMe M.2 PCIe Gen3x4 2280 SSD (SP512GBP34A60M28)

Original HDD: Think it was a SanDisk, sort of a piece of crap, smaller form factor. Model I suspect is in the spec sheet for the laptop.

BIOS versions: https://www.dell.com/support/home/e...oscode=biosa&productcode=latitude-5420-laptop ... Just scroll down a bit and click other available versions

Thanks again, will come back soon.

moobsd · Oct 4, 2024

Another small update:

Went through the fixes for the BIOS versions, didn't see much aside from mostly vulnerability fixes, but at any rate there are a crap load of fixes and who knows if they would even mention everything they fixed in the notes.

I will also mention that the system is primarily plugged into a docking station, model WD19TBS.

Thanks

Mirror176 · Oct 5, 2024

moobsd said:
Another small update:

Went through the fixes for the BIOS versions, didn't see much aside from mostly vulnerability fixes, but at any rate there are a crap load of fixes and who knows if they would even mention everything they fixed in the notes.

I will also mention that the system is primarily plugged into a docking station, model WD19TBS.

Thanks

As a general concept, BIOS updates do not mention all changes/fixes. As risky/scary as doing a BIOS update can be, I've fixed too many strange bugs and seen too many performance improvements to leave out updates.

There certainly are risks that shouldn't just blindly be overlooked. It is best to confirm if there is a way to update the BIOS by inserting a USB stick (not all will be compatible) with a file and pressing a button or maybe a key to know you can likely recover from a bad/incorrect update. Given the choice, I always prefer to default to an update done this way or through the BIOS menus reading from USB or internal drive next. The fewer dependencies in the way, the less likely there will be trouble due to things like OS updates/availability later.

I have also dealt with a bricked MSI motherboard years ago (amd phenom ii 955 cpu or something) that failed an update despite the 'update successful' message (I suspect the BIOS file may have been fragmented on the USB stick and blindly+improperly read as if it was a sequential file from its start without any sanity checks but never properly diagnosed). MSI's fix would be sending the board overseas to have the BIOS reprogrammed. Chip was surface mount soldered to the motherboard but I found a JTAG(?) header on the board that used a smaller-than-normal set of pins so I couldn't just splice something up off of USB headers or anything I had access to. I bought a set of pins for making a VGA/serial/etc. female connector and custom made a cable. Those pins do not hold on properly but I was able to very carefully place them over the pins and isolate them from each other with strips of wax paper and while carefully holding the cable by hand I ran a BIOS reflashing tool (probably on FreeBSD but don't remember) from another machine connected at the other end with a serial port. That was cheaper+slower than buying a new board and cheaper+faster than sending it off for repair but was certainly a task to work through.

Mirror176 · Oct 6, 2024

moobsd said:
It's sort of been one of those weeks for me but did want to log on and provide a brief update. As mentioned previously I have the job behind cron with the adjustment to ignore the directory in question which has put a band-aid on the problem. This *potentially* could be caused by a corrupt BIOS but haven't had a lot of time to look more closely. By potentially I mean that the systems BIOS hasn't been updated in a few years, when I checked for available updates my system is behind several revisions (version 1.14.1 if anyone cares to poke through the release notes).

The reason I'm suspecting possible BIOS corruption is because I've attempted to install the latest version but for whatever reason it doesn't appear to want to install, so at this time I'm sort of stuck on that version. My plan when I get some time to do so is reset the CMOS and attempt another update. After that, hopefully if the update works I'll come back and see about troubleshooting this further.

Assuming I found the correct machine's notes..
1.18.2 - Fixed the issue where the BIOS update fails when you try to update it through BIOS flash update - Remote option in the BIOS setup.

1.27.0 - Dell update tools will not allow reverting to a version before this once this is installed; may be doable with 3rd party tools but Dell restricts it due to security fixes and other bugfixes being considered important by Dell.

Their advice of "You can install the updates in the background while using the system" is one I recommend against. Maybe their system can avoid applying it unless a write is verified successfully but I'd rather have a way to reflash from a bad BIOS state than have such a verification during flash be given to me in case things still go bad. I'd start with a fresh cold boot and just do that task when doing it. If using Windows >7, either do a reboot and power off+on system at the BIOS step or do a reboot + power off+on through Windows after to avoid any chance of fast boot issues. Alternatively disable fast boot and cold boot it. I haven't tested it, but I think there was another fast boot bypass like holding shift during poweroff or something.

Saw some notes about fixes for system that stopped responding (after resuming from shut down? probably typo for after resuming from sleep but maybe refers to Windows fast boot bugfixes) with potentially lesser matches of POST stops responding issues being fixed.

Otherwise skimming notes I'm just seeing things related to caps lock LED, power on issues (including and excluding issues related to BIOS updates being applied), charging issues, BIOS boot issues, BIOS security issues, and issues involving certain devices being connected and/or devices disabled through BIOS. I didn't try to follow the CVEs to see if there is anything more than security to expect from such fixes/changes.

Skimming that list, I saw enough things I wouldn't want failing me even in a basic use scenario that I'd be focused on getting that BIOS updated. If you cannot succeed, I'd reach out to Dell for assistance.

moobsd · Oct 14, 2024

PMc said:
That is exactly the point why I wrote the above message, procstat -kk does NOT show processes stuck in a loop:

That's why I was happy finding this one:

This actually shows what is going on.

Anyway, the OP seems to have disappeared, so probably they found what was going wrong...

Some updates here.. I upgraded my BIOS successfully all the way to the latest release. The issue did not get corrected. I booted into single user mode and ran a few of the commands as suggested here, the dtrace is attached. From what I can see aside from the ACPI messages, there are traces of the same call that never appear to end (this is using rm -rf tmp/Cache_Data):

Code:

0  86466                       :profile-1
              zfs.ko`zap_leaf_lookup_closest+0x100
              kernel`kern_getdirentries+0x221
              kernel`sys_getdirentries+0x29
              kernel`amd64_syscall+0x100
              kernel`0xffffffff80fd765b

              libc.so.7`__sys_getdirentries+0xa
              libc.so.7`readdir+0x2d
              libc.so.7`0x25bd315a964
              libc.so.7`fts_read+0x38c
              rm`0x253b1351f69
              rm`0x253b1351789
              libc.so.7`__libc_start1+0x12a
              rm`0x253b135151d
              `0x34ba8a003008

Here is also the procstat (ran a few times):

Code:

  PID    TID COMM                TDNAME              KSTACK                      
   88 100335 rm                  -                   fzap_cursor_retrieve+0x206 zap_cursor_retrieve+0x1ed zfs_freebsd_readdir+0x383 VOP_READDIR_APV+0x20 kern_getdirentries+0x221 sys_getdirentries+0x29 amd64_syscall+0x100 fast_syscall_common+0xf8
  PID    TID COMM                TDNAME              KSTACK                      
   88 100335 rm                  -                   fzap_cursor_retrieve+0x206 zap_cursor_retrieve+0x1ed zfs_freebsd_readdir+0x383 VOP_READDIR_APV+0x20 kern_getdirentries+0x221 sys_getdirentries+0x29 amd64_syscall+0x100 fast_syscall_common+0xf8
  PID    TID COMM                TDNAME              KSTACK                      
   88 100335 rm                  -                   fzap_cursor_retrieve+0x206 zap_cursor_retrieve+0x1ed zfs_freebsd_readdir+0x383 VOP_READDIR_APV+0x20 kern_getdirentries+0x221 sys_getdirentries+0x29 amd64_syscall+0x100 fast_syscall_common+0xf8
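To put a number on "ran a few times", the samples can be tallied by their innermost kernel frame. This is just a sketch: the helper name top_frames is made up, and it assumes output in the column layout shown above, with no spaces inside COMM or TDNAME, so that the stack starts at the fifth field.

```shell
#!/bin/sh
# top_frames: hypothetical helper, not part of procstat.
# Reads repeated `procstat -kk PID` output on stdin, skips the header
# lines, and counts how often each innermost kernel frame (the first
# entry of the KSTACK column) was seen, most frequent first.
top_frames() {
    awk '$1 != "PID" && NF >= 5 { print $5 }' | sort | uniq -c | sort -rn
}
```

Something like for i in 1 2 3 4 5; do procstat -kk 88; sleep 1; done | top_frames would feed it five samples of the rm process from above. If every sample lands on the same frame, as the three runs quoted here do, that suggests the thread keeps spinning in or below that function rather than making progress.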

Upon restarting from single user mode at this point, it seemed that there was a kernel panic (image attached)

Mirror176 said:
There certainly are risks that shouldn't just be blindly be overlooked. It is best to confirm if there is a way to update the BIOS by inserting a USB stick (not all will be compatible) with a file and pressing a button or maybe a key to know you can likely recover from a bad/incorrect update. Given the choice, I always prefer to default to an update done this way or through the BIOS menus reading from USB or internal drive next. The less dependencies in the way mean the less likely things may be trouble due to things like OS updates/availability later.

Typically I do the same, however I was only able to flash them via Ubuntu's FW upgrade tool. No idea why.

Edit: now with actual dtrace txt

VladiBG · Oct 14, 2024

zap_leaf_lookup_closest

Directories corrupted, commands hanging indefineltly · Issue #5346 · openzfs/zfs

I'm having an issue with that from what I can tell is related to zfs, I could be wrong though. I have the directory that I'm unable to run ls on, or rather ls never returns. When I run ls on the di...

github.com

moobsd · Oct 14, 2024

VladiBG said:
zap_leaf_lookup_closest

Directories corrupted, commands hanging indefineltly · Issue #5346 · openzfs/zfs

I'm having an issue with that from what I can tell is related to zfs, I could be wrong though. I have the directory that I'm unable to run ls on, or rather ls never returns. When I run ls on the di...

github.com

Yeah, that indeed looks to be very similar to what is going on for me. I don't have memory issues though (confirmed with memtest and Dell's diagnostic tools).
