Frequent kernel panic and ZFS mount problems

crinx · Jul 8, 2023

Hello everybody,
I am not an IT professional, but as far as I understand them I do appreciate the FreeBSD OS, ZFS, bhyve and all that stuff, and I decided to learn more about them. So I built a custom computer that I consider appropriate for FreeBSD and ZFS, with the following configuration:
- processor AMD Ryzen 5 3400G 3.76GHz
- motherboard Asus Tuf Gaming B550-plus AMD B550 socket AM4
- ram Corsair Vengeance LPX Black DIMM DDR4 32 GB (2x16GB) CL15 3200 MT/s
- LSI 9300-8i PCI E 3.0 12Gbps HBA
- two DELL 600 GB 10k SAS ISE 12Gbps
- two 1TB HDD Seagate SAS 10k 2.5''
- cables SAS Controller SAS HD SFF 8643
- an extra Gigabit Ethernet card 1000/100/10 Mbps

Here is my ZFS configuration:
zpool0
  mirror-0
    da2p4
    da3p4
zpool1
  mirror-0
    da0
    da1
(da0 and da1 are the 1 TB disks; da2 and da3 are the 600 GB disks)
I created zpool0 at installation time. The second zpool was created afterwards with zpool create ...

I installed FreeBSD 13.2 stable using ZFS, then the desired packages, then a Windows 10 VM with bhyve. But I experienced frequent kernel panics, and after a few such events I ran into "Mounting from ZFS: ... failed with error 5". I thought that I had done something wrong with the software installation, so I reinstalled everything. After a few such cycles I considered the possibility of a hardware problem, and after the last ZFS mount issue I rebooted the computer from USB install media and, in the shell, after importing the zpool, I ran

Code:

zpool scrub zpool0

then in the output of

Code:

zpool status

the recommended action was to run zpool clear ... or to replace two disks (da2, da3).
After

Code:

zpool clear zpool0 ...

everything looked normal, but at reboot the ZFS mount issue persisted.
So I sent the computer back to the hardware technicians who built it, but the problem is that in my hometown it seems nobody has a clue how to build a non-Windows computer, and as for ZFS, well, they haven't even heard of it.
1. Is it possible that my computer configuration is the cause of the problems?
2. Now I cannot boot and log in to my computer. Is it possible to check /var/log/messages from the shell of the USB installer?
3. Is HD Tune Pro 5.70 a suitable tool for checking for HDD I/O errors if the disks are working under FreeBSD? (This tool is used by the hardware firm to test the computer.)
4. Can I install the smartmontools package from the shell of the USB install media in order to use the smartctl command to check the disks?
5. Any other guidance would be very much appreciated.

ralphbsz · Jul 9, 2023

Error 5 is an I/O error (EIO). Maybe you have a disk problem?

You say you "cannot boot and login". What does that even mean? What happens when you try? What are the error messages?

Yes, you can use a live USB (like the USB installer), and mount the disk and check /var/log/messages.

For checking HDD errors, begin with looking at dmesg (if the system is running) and /var/log/messages. Then use the smartmontools utilities, in particular smartctl, to see what the disk drive health is. But that is only half the battle: The problem may very well be in the disk interface (SATA cables, insufficient power supply, damaged HBA), and not in the disks themselves. I don't think "HD Tune" is worth anything.

I don't know how to install software on the target system while you are running the OS from the USB install media.
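
The smartctl checks mentioned above could look roughly like the sketch below. It assumes smartctl (from smartmontools) is already installed and on PATH, which is not the case on stock install media, and it takes the device names da0 to da3 from the first post. The `run` helper, the `DRY_RUN` variable and the function names are made up for illustration.

```shell
#!/bin/sh
# Sketch only: assumes smartctl is installed; disk names come from the thread.
DISKS="${DISKS:-da0 da1 da2 da3}"

# With DRY_RUN=1 the commands are only printed, not executed (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Print all SMART information (health status, error counters, self-test log).
check_disks() {
    for d in $DISKS; do
        run smartctl -a "/dev/$d"
    done
}

# Start a long (extended) self-test on one disk; results show up later in "smartctl -a".
start_long_test() {
    run smartctl -t long "/dev/$1"
}
```

Running `DRY_RUN=1 check_disks` first shows what would be executed. As noted above, clean SMART output does not rule out cable, power or HBA problems.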

richardtoohey2 · Jul 9, 2023

Also make sure motherboard BIOS is up-to-date.

crinx · Jul 9, 2023

Thank you both for replying.
I can boot the computer but I cannot log in because the computer is stuck at "Mount zfs ... error 5".
But I can use the shell of the install media. I was not sure whether I can install the smartmontools utilities from this shell and then use them from the same shell to check the disks.
What would be the proper way or tools to check the disk interface (cables, power supply, HBA)?
I don't know exactly how I could mount the disk to check /var/log/messages.
Something like

Code:

# mount -t zfs /dev/da2 /mnt

would be correct?

ralphbsz · Jul 9, 2023

No, for ZFS one doesn't use the mount command. Usually it auto-mounts; if not, one uses the "zfs mount" command. In your case, it seems it tried to auto-mount. If you look at the console messages when that failed with error 5, what do you see?

From an installer, I would try to just read the disk, with the dd command. You should get the same error message in dmesg.
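
As a rough sketch of the "zfs mount" route from the installer shell: the pool name zpool0 is taken from the first post, while the dataset name zpool0/ROOT/default is only an assumption based on the installer's usual layout and may differ on this machine. The `run` helper, `DRY_RUN` and the function name are hypothetical. The pool is imported read-only and without mounting anything, then the root dataset is mounted under /mnt before the rest.

```shell
#!/bin/sh
# Sketch only: POOL comes from the thread, ROOT_DS is an assumed dataset name.
POOL="${POOL:-zpool0}"
ROOT_DS="${ROOT_DS:-$POOL/ROOT/default}"
ALTROOT="${ALTROOT:-/mnt}"

# With DRY_RUN=1 the commands are only printed, not executed (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

inspect_pool_logs() {
    # -f: import even if the pool was last used by another system
    # -N: do not mount any datasets yet
    # -o readonly=on: do not write to a possibly damaged pool
    # -R: mount everything relative to an alternate root
    run zpool import -f -N -o readonly=on -R "$ALTROOT" "$POOL" || return 1
    # Mount the root dataset first, then the remaining datasets on top of it.
    run zfs mount "$ROOT_DS" || return 1
    run zfs mount -a
    run tail -n 200 "$ALTROOT/var/log/messages"
}
```

If the import itself fails with an I/O error, that points back to the disks or the path to them rather than to the file system layout.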

_martin · Jul 9, 2023

The first thing to check is the kernel panic message and the backtrace that was shown when it happened. As those panics were frequent, there's a good chance ZFS is a victim of the crashes; i.e. the frequent crashes got you to the state ZFS is in now.
You had somebody local check the PC, so I'd assume basic checks were done (PSU, cables, memory tests, ...).

If you can still reinstall FreeBSD, do that and record/capture the kernel panic message and its backtrace. That will give us a starting point for troubleshooting.

crinx · Jul 10, 2023

Thanks everybody,
Now the computer is being checked by some local hardware technicians of whom I don't have very high expectations, but we'll see. If they cannot find the cause of the crashes, I'll take the computer back and try your suggestions. As ralphbsz suggested, I'll begin with a dd check of the disks, probably something like

Code:

# dd if=/dev/da2 of=/dev/null bs=1m

followed by

Code:

# dmesg

Thanks again everybody, it is very much appreciated. I'll come back, probably in a couple of days, when I get the computer back from the hardware firm.

richardtoohey2 · Jul 10, 2023

I built a computer recently and it was horrendously crashy - but this was with Windows.

BIOS updates made it rock-stable.

The other stability issues were power connectors - I managed to not connect the power cable properly on the motherboard leading to sporadic issues. After that plus BIOS updates it was fine for about a week … then the power connector worked loose from the PSU.

Doesn't help with your ZFS issues but worthwhile getting all BIOS and other firmware up-to-date and checking connections.

SirDice · Jul 10, 2023

Also check any overclocking that might happen on memory. Enabling XMP should be fine, provided your DIMMs actually support it. But some mainboards can automagically tune memory timings and such, that doesn't always work correctly.

crinx · Jul 13, 2023

I've got my computer back. It was tested with some Windows tools (HD Tune Pro 5.70) and nothing was found.
I tried to reinstall FreeBSD 13.2 stable, but it didn't work (the installer stopped suddenly with an unknown error) until I completely erased the disks (da2, da3) with

Code:

# dd if=/dev/zero of=/dev/da2 bs=64k

After I reinstalled FreeBSD with ZFS, I began reinstalling the packages, using ports this time. I successfully installed vim, xorg, sudo, bspwm and rxvt-unicode, and they work without issue so far. But when I tried to install a browser (iridium), at the end of the installation process the computer exited with signal 10, the login prompt appeared, but the password was not recognized. The problem repeated identically when I tried to install firefox.
I attached the /var/log/messages file and the output of dmesg -a.

SirDice · Jul 13, 2023

These don't look good:

Code:

Jul 12 23:35:14 z2x2nx1 kernel: pid 2934 (egrep), jid 0, uid 0: exited on signal 11 (core dumped)
Jul 12 23:35:16 z2x2nx1 kernel: pid 49804 (sed), jid 0, uid 0: exited on signal 11 (core dumped)
Jul 12 23:35:52 z2x2nx1 kernel: pid 62076 (sed), jid 0, uid 0: exited on signal 11 (core dumped)
Jul 12 23:35:55 z2x2nx1 kernel: pid 69096 (sed), jid 0, uid 0: exited on signal 11 (core dumped)
Jul 12 23:35:58 z2x2nx1 kernel: pid 76756 (sed), jid 0, uid 0: exited on signal 11 (core dumped)
Jul 12 23:36:00 z2x2nx1 kernel: pid 15400 (sed), jid 0, uid 0: exited on signal 11 (core dumped)
Jul 12 23:36:02 z2x2nx1 kernel: pid 3052 (sed), jid 0, uid 0: exited on signal 11 (core dumped)
Jul 12 23:36:04 z2x2nx1 kernel: pid 16974 (sed), jid 0, uid 0: exited on signal 11 (core dumped)
Jul 12 23:36:06 z2x2nx1 kernel: pid 38976 (sed), jid 0, uid 0: exited on signal 11 (core dumped)
Jul 12 23:36:08 z2x2nx1 kernel: pid 61713 (sed), jid 0, uid 0: exited on signal 11 (core dumped)
Jul 12 23:36:11 z2x2nx1 kernel: pid 55244 (sed), jid 0, uid 0: exited on signal 11 (core dumped)
Jul 12 23:36:13 z2x2nx1 kernel: pid 22287 (sed), jid 0, uid 0: exited on signal 11 (core dumped)
Jul 12 23:36:15 z2x2nx1 kernel: pid 49456 (sed), jid 0, uid 0: exited on signal 11 (core dumped)
Jul 12 23:36:19 z2x2nx1 kernel: pid 26229 (sed), jid 0, uid 0: exited on signal 11 (core dumped)
Jul 12 23:36:21 z2x2nx1 kernel: pid 37501 (sed), jid 0, uid 0: exited on signal 11 (core dumped)
Jul 12 23:37:00 z2x2nx1 kernel: pid 44572 (cron), jid 0, uid 0: exited on signal 10 (core dumped)
Jul 12 23:38:01 z2x2nx1 kernel: pid 61198 (sh), jid 0, uid 0: exited on signal 10 (core dumped)
Jul 12 23:39:13 z2x2nx1 kernel: pid 48401 (csh), jid 0, uid 0: exited on signal 10 (core dumped)

That all happens just before it reboots. It's primarily sed(1) and egrep(1) that seem to crash and burn. You also got a lot of these during a flurry of package installs. I would run a memory check, just to be sure.
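
To see at a glance which programs crash and with which signal, log lines like the ones above can be tallied with a small filter. This is a minimal sketch that assumes the message format shown in the excerpt; the function name is made up for illustration.

```shell
#!/bin/sh
# Tally "exited on signal N" kernel messages by program name and signal number.
# Reads syslog-style lines on stdin and prints "count program signal",
# highest count first.
tally_signal_exits() {
    awk '/exited on signal/ {
        prog = ""; sig = ""
        for (i = 1; i <= NF; i++) {
            # The program name is the first fully parenthesised field, e.g. "(sed),"
            if (prog == "" && $i ~ /^\(.*\),?$/) {
                prog = $i
                gsub(/[(),]/, "", prog)
            }
            # The signal number is the field right after the word "signal"
            if ($i == "signal") sig = $(i + 1)
        }
        if (prog != "" && sig != "") count[prog " " sig]++
    }
    END { for (k in count) print count[k], k }' | sort -rn
}
```

It could be used as `tally_signal_exits < /var/log/messages`. Many unrelated base-system programs dying with signal 10 or 11 is the pattern that makes a hardware cause worth checking.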

crinx · Jul 13, 2023

Thanks for replying, SirDice.
I would run a memory check, but I don't know how to do it.
Meanwhile I experienced the first kernel panic in the current cycle of installing, kernel panics, and final ZFS mount problem. I attached the resulting file /var/crash/core.txt.0.
If you would be so kind as to explain how to perform a memory check, the required tools and the necessary steps, I'll do it as soon as possible.

_martin · Jul 13, 2023

You can set up a bootable USB stick, something like this.

Your core.txt doesn't have anything useful because you don't have gdb installed (use pkg install gdb to have it ready for next time). Still, the older dmesg and/or messages should give you info about the crash. Can you share those? They should include the message "kernel panic" and the messages around it.

I wonder what tests were done when that system was booted into Windows. It is suspicious that Windows can boot/use it just fine (maybe just not enough memory stress?).

With gdb one could check those core dumps to see what the common denominator is. Can you check if you have anything under / or /root? The default naming convention is {comm}.core, e.g. sed.core.

SirDice · Jul 13, 2023

_martin said:
You can set up a bootable USB stick, something like this.

Yep. If you install sysutils/memtest86 it contains an image you can burn to a USB stick. Then boot from that stick and let it run for a while.

_martin said:
It is suspicious that Windows can boot/use it just fine (maybe just not enough memory stress?).

I've had that happen once. Memory errors somewhere in the upper region of the memory. All was fine until it started using that part of the memory, only happened when I had a poudriere bulk running for some time. Then suddenly weird crashes and build failures. After a reboot it worked fine for some time, until that upper memory region was hit again. Memtest found the issue quite quickly.

tingo · Jul 13, 2023

Note carefully: for memtest type operations "a while" is defined as between 4 - 12 hours or more.

crinx · Jul 13, 2023

From the hardware technicians I received four funny images with a lot of green rectangles and reassurance over the phone that everything is OK, as I could see from the pictures.
I asked them whether they tested the PSU; they said they tested it from the BIOS, where they saw that all the voltage values are OK.
I insisted that they give me a more detailed response, and they sent me the following text:
"The system was tested and run under Windows.
The HDDs were tested with HD Tune and HDD Sentinel and they work perfectly, without bad sectors or other issues.
The RAM was tested for 5 hours with memtest.
The system is working under normal working conditions."

I installed gdb with # pkg install gdb and right now I am trying to squeeze out another core.txt file by trying to install a browser.
I also installed memtest86 with # pkg install memtest86 and I'll run a test immediately; I intend to run it for 12 hours or more.
Thanks everybody, your support is great.

yuripv79 · Jul 13, 2023

crinx said:
right now I am trying to squeeze out another core.txt file by trying to install a browser

You don't need to, just run crashinfo on the core file you already have.

crinx · Jul 13, 2023

I found two .core files, /root/csh.core and /root/shutdown.core. The first one is too big to be attached.

yuripv79 · Jul 13, 2023

I mean the ones in /var/crash/; now that you have gdb installed, crashinfo should be able to decode them properly.

_martin · Jul 13, 2023

We don't know which exact version of FreeBSD you are running, so the core dump you shared doesn't help us much (we need to match it to the exact binary it was dumped from).
As you do have those two core files and gdb installed, could you share the output of these:

gdb `which csh` /root/csh.core
gdb `which shutdown` /root/shutdown.core

For each execute these gdb commands:

Code:

bt
x/12i $pc
i r

So we could check if there's something fishy going on.
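
The same three gdb commands can also be run non-interactively, which makes it easier to capture the output in a file for posting. This is a minimal sketch using gdb's -batch and -ex options; the `run` helper, `DRY_RUN` and the function name are hypothetical, and the core file paths are the ones mentioned above.

```shell
#!/bin/sh
# With DRY_RUN=1 the command is only printed, not executed (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Print backtrace, 12 instructions at the program counter, and the registers
# for one binary/core pair, then exit.
dump_core_info() {
    prog_path="$1"
    core_file="$2"
    run gdb -batch -ex "bt" -ex 'x/12i $pc' -ex "info registers" "$prog_path" "$core_file"
}
```

For example, `dump_core_info "$(which csh)" /root/csh.core > csh-core.txt 2>&1` would write everything to a text file that is small enough to attach.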

The memtest run you're showing is not complete. The overall test percentage has to reach 100%. While it may be version dependent, I think it should display a big "PASS" after it finishes.

If you experienced a kernel panic after you installed gdb, core.txt should now show you the reason why it crashed (but even without it, dmesg and/or messages should have some info).

Erichans · Jul 13, 2023

crinx said:
I am not an IT professional, but as far as I understand them I do appreciate the FreeBSD OS, ZFS, bhyve and all that stuff, and I decided to learn more about them. [...]

I installed FreeBSD 13.2 stable using ZFS, then the desired packages, then a Windows 10 VM with bhyve. But I experienced frequent kernel panics, and after a few such events I ran into "Mounting from ZFS: ... failed with error 5". [...]

_martin said:
We don't know which exact version of FreeBSD you are running, so the core dump you shared doesn't help us much (we need to match it to the exact binary it was dumped from).

@crinx: as you are new to FreeBSD, is there a particular reason that you selected a -STABLE version? These are supported just as well as -RELEASE versions, but for your error hunt it helps to have a clearly identified version, so debugging could just as well be done with 13.2-RELEASE (which you'll be able to update more easily too).

crinx · Jul 14, 2023

Initially I thought that the cause of the crashes must be the software, and when I saw the STABLE version, I changed the installation media accordingly because ... stable sounds great.

Now I am struggling with memtest: instead of booting from USB and starting the test, the BIOS shows up. I tried to boot from the keyboard, then I changed the boot order in the BIOS, with the same outcome: the BIOS shows up instead of memtest.

cy@ · Jul 14, 2023

SirDice said:

These don't look good:

Code:

Jul 12 23:35:14 z2x2nx1 kernel: pid 2934 (egrep), jid 0, uid 0: exited on signal 11 (core dumped)
Jul 12 23:35:16 z2x2nx1 kernel: pid 49804 (sed), jid 0, uid 0: exited on signal 11 (core dumped)
Jul 12 23:35:52 z2x2nx1 kernel: pid 62076 (sed), jid 0, uid 0: exited on signal 11 (core dumped)
Jul 12 23:35:55 z2x2nx1 kernel: pid 69096 (sed), jid 0, uid 0: exited on signal 11 (core dumped)
Jul 12 23:35:58 z2x2nx1 kernel: pid 76756 (sed), jid 0, uid 0: exited on signal 11 (core dumped)
Jul 12 23:36:00 z2x2nx1 kernel: pid 15400 (sed), jid 0, uid 0: exited on signal 11 (core dumped)
Jul 12 23:36:02 z2x2nx1 kernel: pid 3052 (sed), jid 0, uid 0: exited on signal 11 (core dumped)
Jul 12 23:36:04 z2x2nx1 kernel: pid 16974 (sed), jid 0, uid 0: exited on signal 11 (core dumped)
Jul 12 23:36:06 z2x2nx1 kernel: pid 38976 (sed), jid 0, uid 0: exited on signal 11 (core dumped)
Jul 12 23:36:08 z2x2nx1 kernel: pid 61713 (sed), jid 0, uid 0: exited on signal 11 (core dumped)
Jul 12 23:36:11 z2x2nx1 kernel: pid 55244 (sed), jid 0, uid 0: exited on signal 11 (core dumped)
Jul 12 23:36:13 z2x2nx1 kernel: pid 22287 (sed), jid 0, uid 0: exited on signal 11 (core dumped)
Jul 12 23:36:15 z2x2nx1 kernel: pid 49456 (sed), jid 0, uid 0: exited on signal 11 (core dumped)
Jul 12 23:36:19 z2x2nx1 kernel: pid 26229 (sed), jid 0, uid 0: exited on signal 11 (core dumped)
Jul 12 23:36:21 z2x2nx1 kernel: pid 37501 (sed), jid 0, uid 0: exited on signal 11 (core dumped)
Jul 12 23:37:00 z2x2nx1 kernel: pid 44572 (cron), jid 0, uid 0: exited on signal 10 (core dumped)
Jul 12 23:38:01 z2x2nx1 kernel: pid 61198 (sh), jid 0, uid 0: exited on signal 10 (core dumped)
Jul 12 23:39:13 z2x2nx1 kernel: pid 48401 (csh), jid 0, uid 0: exited on signal 10 (core dumped)

That all happens just before it reboots. It's primarily sed(1) and egrep(1) that seem to crash and burn. You also got a lot of these during a flurry of package installs. I would run a memory check, just to be sure.

Indeed, bad memory trashed a zpool on me once. If you have bad RAM, all the filesystem checksums and error correction won't do you any good. It's like building a house on sand without a foundation. The house can be built well but without a good foundation (RAM) the house (ZFS) will come crashing down.

crinx · Jul 17, 2023

I ran a memory test with memtest86 for 71 hours; there were 15 passes with no errors. Should I conclude that there are no RAM problems, or should I try other memory tests?

crinx · Jul 18, 2023

I've just got a new core.txt file, with gdb installed this time.
