Frequent kernel panic and ZFS mount problems

Just for clarification: Those crashes only happen with an OS *installed to disk*?

Any chance that HBA doesn't run IT mode/firmware and there is some form of RAID configured with those disks? I've seen ZFS corruption with some buggy RAID-firmwares on more than one occasion.

Additionally, check the error log of those disks (sg_logs --all), especially log pages 0x02-0x06, i.e. the error counter pages and the non-medium error page (=> cabling/backplane problems!).
Since this seems to be some kind of repurposed gaming system: how are those SAS disks connected to the HBA?
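Pulling just those pages could look like this (a sketch: /dev/da0 is a placeholder device name, and the loop is guarded so it only does anything where sg_logs is installed):

```shell
# Sketch: dump the error counter pages (0x02-0x05) and the non-medium
# error page (0x06) for one drive. /dev/da0 is a placeholder; repeat
# for each disk in the pool.
if command -v sg_logs >/dev/null 2>&1; then
    for page in 0x02 0x03 0x04 0x05 0x06; do
        sg_logs --page="$page" /dev/da0
    done
else
    echo "sg_logs not found; install it with: pkg install sg3_utils"
fi
```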
 
I noticed the crashes only with the FreeBSD OS installed to disk. I didn't notice crashes when the computer was running from the USB install, and this weekend the computer ran for three days with the memtest86 USB stick. (I am sorry if I don't understand the first question correctly.)
The computer was built by a local hardware company. I specifically asked them to avoid RAID configurations.
I checked in the BIOS; the SATA mode is set to AHCI. I don't know where else to check.
They connected the two SAS disks to one HBA port with SFF-8643 (SAS HD) controller cables.
I tried to execute sg_logs --all but the command is unknown. What package should I install to run this command?
 
As was asked before: is there any reason why you are using the STABLE and not the RELEASE version of FreeBSD? Can you try this on a RELEASE version of 13.2?

The core.txt you shared shows you hit a GP (general protection fault) in the kernel during zfs process execution. A GP can be many things; without the exact kernel you're running it's still impossible to tell why the trap was caught.

Code:
...
#4 0xffffffff81095f08 at calltrap+0x8
#5 0xffffffff80f588f4 at uma_zalloc_arg+0x104
#6 0xffffffff823aa450 at arc_buf_alloc_impl+0x50
..
It hit the GP in uma_zalloc_arg(); this is not necessarily a memory issue (it could be, but it can be a bug too). That's why this should be tested on RELEASE with the GENERIC kernel.
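Getting a backtrace out of the dump could look roughly like this (a sketch with the default FreeBSD paths; kgdb comes with the gdb package on 13.x, and /var/crash/vmcore.last is assumed to exist after a panic):

```shell
# Sketch: batch-mode backtrace from the last kernel crash dump.
# Paths are the FreeBSD defaults; adjust if dumpdir is configured differently.
KERNEL=/boot/kernel/kernel
VMCORE=/var/crash/vmcore.last
if command -v kgdb >/dev/null 2>&1 && [ -e "$VMCORE" ]; then
    # kgdb wraps gdb, so gdb's -batch/-ex options should work here
    kgdb -batch -ex bt "$KERNEL" "$VMCORE"
else
    echo "kgdb or crash dump not present on this system"
fi
```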

Running memtest for 3 days can be considered a solid test. :) But is there a reason why you chose an older version of it? The memtest home page shows a much higher version than the one you have.
I'm not saying the older versions aren't OK, but I'd expect you to opt for the newest version available.
 
Initially I used RELEASE installation media obtained without validating the .img file with sha256sum, even though I downloaded it multiple times. After a few crashes I switched to STABLE.
Now I'll go back to RELEASE, if I'm able to validate the .img file.
The memtest version was what I got after installing the memtest package.
 
I'd use one from the link I shared. In theory it should be OK to use older versions, but it just makes sense to use the newer releases. Maybe they improved the testing patterns over the years, etc.

Are you able to provide the output of those gdb commands I shared above? To see what it crashed on.

Now I'll go back to RELEASE, if I'm able to validate the .img file.

Do you use this machine (i.e. the machine that has these problems) to download the img?

I would recommend doing these steps:
- Use 13.2-RELEASE (patch to the latest-greatest with freebsd-update(8)), and don't forget to install gdb. Wait for the crash, then share the crash (core.txt) along with the output of freebsd-version -kru so we can match it.
- Install 13.2-RELEASE but opt for UFS instead of ZFS. Then do as above and try to trigger the panic (make -j8 buildworld from sources is always a good test).
- Try a different disk for the system and test the two points above.
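The first of those steps could be sketched as the following command sequence (assuming a fresh 13.2-RELEASE install; the guard keeps it inert on non-FreeBSD systems):

```shell
# Sketch of step one: patch 13.2-RELEASE, install gdb,
# and record the exact versions to share alongside core.txt.
if command -v freebsd-update >/dev/null 2>&1; then
    freebsd-update fetch install   # bring the system to the latest patch level
    pkg install -y gdb             # needed later to inspect the crash dump
    freebsd-version -kru           # kernel/running-kernel/userland versions
    # After a panic, look for /var/crash/core.txt.* to share
else
    echo "freebsd-update not available; this sketch targets FreeBSD"
fi
```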
 
In the morning I'll retry providing the output of those gdb commands.
Then I'll reinstall the OS using 13.2-RELEASE, but the problem is that I cannot validate the ...memstick.img file with sha256sum or sha512sum, even though I downloaded it twice. I didn't use the faulty machine to download the .img file, but another FreeBSD machine. Is something wrong with these files? Should I proceed without validating with sha256sum?
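For the checksum step: the release images come with a CHECKSUM.SHA256-* file in the same download directory, and the comparison itself can be demonstrated with a throwaway file (the FreeBSD file names in the comments are examples of the naming pattern, not verified paths):

```shell
# On the real image, fetch e.g. CHECKSUM.SHA256-FreeBSD-13.2-RELEASE-amd64
# from the same directory as the memstick image, then compare its line for
# the image with the output of: sha256sum FreeBSD-13.2-RELEASE-amd64-memstick.img
#
# Self-contained demo of that comparison with a throwaway file:
printf 'hello\n' > /tmp/demo.img
echo "5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03  /tmp/demo.img" \
    > /tmp/demo.img.sha256
sha256sum -c /tmp/demo.img.sha256   # prints "/tmp/demo.img: OK" on a match
```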
 
I was able to perform some of the gdb commands. I attached the outputs.
 

Attachments

  • gdb_bt_cshcore.txt
  • which_csh.txt
  • which_shutdown.txt
I noticed the crashes only with the FreeBSD OS installed to disk. I didn't notice crashes when the computer was running from the USB install, and this weekend the computer ran for three days with the memtest86 USB stick.

To me this sounds like there is something fishy about those disks, be it the disks themselves, bad/loose cabling, or the controller configuration (again: IT mode; no RAID).
Did you look at the log pages of the drives used in the root pool?
 
sko, I tried sg_logs --all but the command is unknown, and I don't know which package I should install. In the BIOS the SATA mode is set to AHCI, and the PCIEX16_1 bandwidth bifurcation configuration (which I suppose is the HBA controller) is set to Auto Mode. I don't see RAID anywhere.
 
Those SATA settings are for the on-board SATA controller; the drives are connected to the SAS controller. So you have to enter the controller BIOS at boot (IIRC Alt+S for LSI/Broadcom) to see what settings may have been applied there.
To quickly check the firmware version and board variant you can also use mprutil show adapter. If it shows "LSI3008-IR" under "Board Name", you are running the RAID firmware.
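That check could be sketched as follows (mprutil is in the FreeBSD base system for mpr(4) HBAs; the "Board Name" wording follows the output described above):

```shell
# Sketch: report the HBA board name to see which firmware flavour it runs.
# "-IR" (e.g. LSI3008-IR) = integrated RAID firmware; "-IT" = plain HBA firmware.
if command -v mprutil >/dev/null 2>&1; then
    mprutil show adapter | grep -i "board name"
else
    echo "mprutil not available (FreeBSD base utility for mpr(4) HBAs)"
fi
```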

But again: this smells very strongly of a faulty drive/connection; the controller firmware/configuration comes much later in the line of possible suspects...

sg_logs is part of sysutils/sg3_utils
 
Some general & specific information (some already mentioned).
Just for clarification: Those crashes only happen with an OS *installed to disk*?

Any chance that HBA doesn't run IT mode/firmware and there is some form of RAID configured with those disks? I've seen ZFS corruption with some buggy RAID-firmwares on more than one occasion.
I noticed the crashes only with the FreeBSD OS installed to disk. I didn't notice crashes when the computer was running from the USB install, and this weekend the computer ran for three days with the memtest86 USB stick. (I am sorry if I don't understand the first question correctly.)
The computer was built by a local hardware company. I specifically asked them to avoid RAID configurations.
I checked in the BIOS; the SATA mode is set to AHCI. I don't know where else to check.
They connected the two SAS disks to one HBA port with SFF-8643 (SAS HD) controller cables.
I tried to execute sg_logs --all but the command is unknown. What package should I install to run this command?
Usually HBA/RAID disk interface cards have options to support various hardware RAID configurations (see for example RAID). Apart from the immediate clear difference of hardware versus software RAID, there are many important differences between the traditional (hardware & software) RAID systems and ZFS RAID-Zn (n=1,2,3). ZFS is intended to function reliably without any (traditional) hardware RAID between your motherboard (OS) and your disks. Cards that have (support for) hardware RAID can often be set in a mode where those hardware RAID functions are removed/disabled/inactivated; the card is then (flashed) in IT mode; see What are IT mode, HBA mode, RAID mode in (SAS) Controllers? Often an interface card in IT mode will show "-IT" at the end of its firmware version string.

For search of an unknown command/package, use:
  1. for the appropriate OS version, look up the command - sg_logs
  2. and/or use an appropriate string to search FreshPorts
  3. install the desired package
    or download and build/compile the desired port (this becomes a package when installed on your FreeBSD system)
FreshPorts lets you view and query a wide variety of information about all ports and packages for FreeBSD (it is not where they are stored or downloaded from).
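From the command line, the lookup can be sketched with pkg(8) (the search terms here are just examples):

```shell
# Sketch: searching for a package by name or description with pkg(8).
if command -v pkg >/dev/null 2>&1; then
    pkg search sg3_utils      # match against package names
    pkg search -c "SCSI"      # -c searches the comment/description text
    # The ports-mgmt/pkg-provides plugin can map a file back to a
    # package, e.g.: pkg provides bin/sg_logs
else
    echo "pkg not available; this sketch targets FreeBSD"
fi
```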

[...] and when I saw the STABLE version, I changed the installation media accordingly because ... stable sounds great.
Short version: FreeBSD -STABLE versions are development versions; stable refers to the stable ABI (Application Binary Interface), not to the general stability of the version (-STABLE versions are quite stable though). The FreeBSD -RELEASE versions are intended for production-grade systems: usually your starting point into the FreeBSD world. Naturally, -RELEASE versions have a stable ABI as well. The designations -CURRENT, -STABLE and -RELEASE have a specific FreeBSD meaning; for their differences and inherent relation to major/minor FreeBSD version numbers, see: Changes to the FreeBSD Support Model and the thread around this message.*

The FreeBSD Handbook contains a lot of information, for example: Finding Software -> FreshPorts and Updating and Upgrading FreeBSD -> FreeBSD-STABLE.

___
* Note: this is (very) different from other xxxNIX-es.
 
I ran sg_logs --all /dev/daN for every disk and I attached the outputs. The active disks are da2, da3.
As for the HBA controller, I attached an image from my BIOS of what I believe represents that controller. Auto Mode is selected, the other option being RAID. I suppose it is OK as it is.
 

Attachments

  • sg_logs_all_da0.txt
  • sg_logs_all_da1.txt
  • sg_logs_all_da2.txt
  • sg_logs_all_da3.txt
  • IMG_0227.JPG
Bingo.

da0:
Code:
Read error counter page  [0x3]
  Errors corrected without substantial delay = 123143520
  Errors corrected with possible delays = 0
  Total rewrites or rereads = 0
  Total errors corrected = 123143520
  Total times correction algorithm processed = 0
  Total bytes processed = 53317232128
  Total uncorrected errors = 0

Verify error counter page  [0x5]
  Errors corrected without substantial delay = 4120
  Errors corrected with possible delays = 0
  Total rewrites or rereads = 0
  Total errors corrected = 4120
  Total times correction algorithm processed = 0
  Total bytes processed = 0
  Total uncorrected errors = 0

da1:
Code:
Read error counter page  [0x3]
  Errors corrected without substantial delay = 195215204
  Errors corrected with possible delays = 0
  Total rewrites or rereads = 0
  Total errors corrected = 195215204
  Total times correction algorithm processed = 0
  Total bytes processed = 53319728128
  Total uncorrected errors = 0

Verify error counter page  [0x5]
  Errors corrected without substantial delay = 3749
  Errors corrected with possible delays = 0
  Total rewrites or rereads = 0
  Total errors corrected = 3749
  Total times correction algorithm processed = 0
  Total bytes processed = 0
  Total uncorrected errors = 0

Those two drives *definitely* have some issues. Compare the corrected error numbers to the other two drives (= 0).
Given those are SAS2 Seagates, my bet is on their crappy firmware that doesn't play well with the SAS3 HBA. I also RMA'd several Seagate drives because of such issues back when SAS3 was relatively new, and those drives caused all kinds of problems.


Also, those drives are *VERY* old:
Code:
  Date of manufacture, year: 2014, week: 29
Code:
  Date of manufacture, year: 2014, week: 46

True, the "number of hours powered up" values are quite low, but those values can be (easily) cleared for some vendors and firmware versions; that's what's usually also done with "refurbished" drives.

Given those are rather small (by today's standards) drives, I'd replace (at least) those two with SSDs and call it a day. E.g. Samsung PM1643a 960GB SAS SSDs can be found for ~160EUR new (US prices should be even lower). Using spinning disks for small (e.g. root/boot) pools really makes no sense nowadays, as SSDs and even small enterprise-grade M.2 NVMe drives have become very cheap and are multiple orders of magnitude faster than spinning rust. For a low-load home server almost every consumer-grade SATA SSD will be significantly better in every way than HDDs, especially for the root pool...
 
Thanks very much sko and everybody. Great job, great support.
I'll change the disks. The problem is that in my area I had a hard time finding a hardware company willing and able to build a non-Windows, non-RAID computer. I put those HDDs in the computer because the guys who finally agreed to build it told me that it is impossible to use SSD or NVMe drives with an HBA controller. It seems they are not only ignorant but also dishonest, because I bought those Seagate drives as new.
But with the level of support I found here I feel encouraged to continue my FreeBSD ambition.
Thanks again everybody.
 
[...] I put those HDDs in the computer because the guys who finally agreed to build it told me that it is impossible to use SSD or NVMe drives with an HBA controller. It seems they are not only ignorant but also dishonest, because I bought those Seagate drives as new.
Just to be sure: your LSI 9300-8i is suited for SAS and SATA disks; see for example LSI_SAS_9300-8i_UG_v1-3-2.pdf or Broadcom SAS 9300-8i. SAS interface cards are "downward" compatible with SATA drives: you can connect SATA drives to a SAS interface card (not the other way around). That means you can connect SATA SSDs (non-NVMe) to your 9300-8i; however, those can just as well be connected to the SATA interfaces on your motherboard. Given the current trend in (especially) consumer NVMe drives (a lot are in the same price range as, or perhaps even cheaper than, their SATA counterparts of equivalent size), NVMe drives are usually the preferred option and much faster than their SATA SSD counterparts.

From a quick scan of your ASUS TUF GAMING B550-PLUS motherboard specs, I see that it has one PCIe 3.0 and one PCIe 4.0 M.2 NVMe connection; thus it would be possible to put two PCIe 3.0 NVMe M.2 drives on your motherboard and use them as a ZFS mirror. Based on the present two NVMe options, this means further extensions in size, as well as RAID-Zn, are somewhat limited; all depending on your intended or desired use.
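Creating such a mirror could be sketched as follows (pool name "tank" and device names nvd0/nvd1 are placeholders; on 13.x NVMe disks typically show up as nvd0/nvd1, and note the command destroys any data on those devices):

```shell
# Sketch: two-way ZFS mirror over two NVMe drives. Pool name "tank" and
# devices nvd0/nvd1 are placeholders -- this DESTROYS data on those devices.
if command -v zpool >/dev/null 2>&1 && [ -e /dev/nvd0 ] && [ -e /dev/nvd1 ]; then
    zpool create tank mirror /dev/nvd0 /dev/nvd1
    zpool status tank
else
    echo "zpool or NVMe devices not present; commands shown for reference"
fi
```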
 
I was able to perform some of the gdb commands. I attached the outputs.
Following up on "my" part of the investigation: the output you shared didn't include the gdb commands. But from what you shared, it always crashed at the same code (somewhere in libc; I'm unable to match it as you're running STABLE).

It would still be interesting to see a full crash dump on the RELEASE version. My suggestions above include the advice others gave you: try it with a different disk.

It's still interesting that the company testing your box under Windows was not able to trigger the crash or at least detect the issue.
 
[...] It would still be interesting to see a full crash dump on the RELEASE version. My suggestions above include the advice others gave you: try it with a different disk.
I agree; as it stands, one might conclude (not based on decidedly conclusive facts) that a disk (interface?) behaving badly in a redundant ZFS configuration could trigger a panic ... (instead of the expected error messages).
 
It's still interesting that the company testing your box under Windows was not able to trigger the crash or at least detect the issue.

Because NTFS and Windows don't give a crap about data integrity. They will chug on even with bogus data until they actually hit something that has bit-rotted beyond recognition and causes them to crash and burn.

Also: the drives stated they "corrected" (A LOT of) errors*. The thing is, if a disk is so deeply damaged that even the ECC information rots away in mere seconds, it may even start to "correct" undamaged data, because the checksums are wrong. As long as the drive reports back "here is your data, I have corrected it for you", neither SMART nor other tools (that rely on SMART data) may trigger an alert. ZFS should detect such things very quickly; it would be interesting to see whether multiple zpool scrubs reveal checksum errors. My bet is they will. I've seen this with malfunctioning SSDs before: every scrub detected more and more checksum errors from the same drive, yet its SMART values were all perfectly clean until shortly before the drive finally died and went dark.
If both drives in a mirror constantly return bogus data, even ZFS is powerless and you end up with corrupted data.
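Checking for that could be sketched like this (the pool name zroot is a placeholder; the guard keeps it inert where no such pool exists):

```shell
# Sketch: scrub the pool and watch the CKSUM column for growing error counts.
POOL=zroot   # placeholder pool name
if command -v zpool >/dev/null 2>&1 && zpool list "$POOL" >/dev/null 2>&1; then
    zpool scrub "$POOL"
    zpool status -v "$POOL"   # CKSUM column shows per-device checksum errors
else
    echo "pool $POOL not present on this system"
fi
```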

* I doubt the firmware will ever admit to returning corrupt data and increase that counter, except maybe shortly before it completely dies. Firmware lies. Always. *Especially* HDD and RAID controller firmware...
 
Because NTFS and Windows don't give a crap about data integrity.
But I'd still expect a BSOD if the system worked with faulty data the same way FreeBSD crashed here.

The kernel backtrace the OP was able to provide suggests the crash is not voluntary (i.e. a panic not triggered by an actual integrity check, where panicking would be the lesser of two evils) but rather an unexpected event. A GP would fit this description: bogus data stored in memory. All this is just an assumption based on the very little info provided. Looking at the crash dump would definitely shed more light on why the system crashed.

Plugging another disk into the HBA, or even better using a different disk outside of the HBA, as part of the test would make sense.
 
I've just ordered the SAS SSDs mentioned by sko; they'll arrive by the 3rd of August. I'll replace the two Seagate disks with the new ones, install the RELEASE version and the packages, starting with gdb, and I'll share the outcome on this forum.

_martin, I strongly believe that the company which assembled the computer sold refurbished disks as new (and possibly other components), so they had no reason to really test the computer.
As an explanation of the crashes they told me: "See what happens if you insist on your Linux and that strange RAID" (i.e. ZFS in their understanding). "With Windows we can help you immediately." For those guys everything that's not Windows is Linux. In my hometown, Timișoara, România, this was the only company I found that agreed to build such a computer.

Never mind, I do appreciate the FreeBSD OS, so I am not going to give up. I have already installed FreeBSD on three older computers, but I need a better-suited machine to use ZFS, jails, etc. After all, what has happened so far is, for me, learning by doing, the hard way.
 
I've just ordered the SAS SSDs mentioned by sko; they'll arrive by the 3rd of August. I'll replace the two Seagate disks with the new ones, install the RELEASE version and the packages, starting with gdb, and I'll share the outcome on this forum.

_martin, I strongly believe that the company which assembled the computer sold refurbished disks as new (and possibly other components), so they had no reason to really test the computer.
There is no way to "refurbish" disks. "Refurbishing" disks usually means falsifying the error log in the disk.
 
PMc, I see, I used the term improperly. However, I am sure you understand what I meant.
It appears to be a widespread nuisance. Some companies retire their enterprise disks after five years of operation, because the failure probability can grow after that time.
These disks are given to some company for secure erase and disposal. (The disposal is actually paid for.)
Then, strangely, these disks reappear on eBay and Amazon as "refurbished", with price tags that are beyond good and evil.
 