Kernel panic and reboot dramatically (related to ZFS)

Hi All:
I upgraded from 12.1 to 12.2 this week and the nightmare began: the system reboots randomly, and when I checked the log I found lots of

kernel: mfi0: COMMAND 0xfffffe0098274a08 TIMEOUT AFTER 851 SECONDS

and finally got

kernel: panic: I/O to pool 'HOME' appears to be hung on vdev guid 2612857462550589903 at '/dev/mfid0p4'

It seems related to the zpool, as 'HOME' is my ZFS pool. Has anyone else had this issue after upgrading, and is there a solution?


Thanks in advance for any of your help.
 
That could be a hardware or firmware issue? Firmware updated? EDIT: NOT suggesting you start changing firmware now - just asking if you changed it recently or know what version etc. it is. Then check for any known issues with the version you have.

mfi definitely the right driver? Rather than mrsas e.g. https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=198463

But if you've been running FreeBSD 12.1 happily on there for some time, seems unlikely.

Possibly FreeBSD is saying the RAID/disk controller is wodged or taking too long to respond, so it might be lower-level than ZFS.

Tried a full power-cycle?
 
Is it possible that mfid0p4 disk is dying? That could potentially hang up the entire bus and/or cause other disruptions. Still shouldn't panic in my view even if it's completely hosed. But I can imagine ZFS getting really weird responses and then triggering code with really bad information.
 
Yes, it had been running for a long time before 12.2, and I have not upgraded any hardware-related firmware.

As for the suggestion of a full power-cycle, I will try it and report back.
 
I've had Dell PERCs using mfi (maybe and/or mrsas) "wodge" for a while - showing "suspfs" as the status in top, but usually after waiting a while they catch up and unwodge themselves. This is from a couple of years ago; happened a couple of times when doing a big MySQL import plus copying a lot of files (100s of GB) - so making the file system (on UFS) very busy.

And as SirDice says, could also be a drive failure.

If you are using RAID, have you turned off the RAID capabilities of the mfi?

Have you got any more information about the machine? Other people may have the same hardware and can confirm if 12.2 worked for them or if they also encountered issues.

I've installed 12.2 on Dell R430 with H330 RAID:

da0: <DELL PERC H330 Mini 4.30> Fixed Direct Access SPC-3 SCSI device, using mrsas:
...
AVAGO MegaRAID SAS FreeBSD mrsas driver version: 07.709.04.00-fbsd
mrsas0: <AVAGO Fury SAS Controller> port 0x2000-0x20ff mem 0x91d00000-0x91d0ffff,0x91c00000-0x91cfffff irq 26 at device 0.0 numa-domain 0 on pci2

... and a Dell T330 with H730 RAID:

da0: <DELL PERC H730 Adp 4.30> Fixed Direct Access SPC-3 SCSI device
...
AVAGO MegaRAID SAS FreeBSD mrsas driver version: 07.709.04.00-fbsd
mrsas0: <AVAGO Invader SAS Controller> port 0x3000-0x30ff mem 0x92c00000-0x92c0ffff,0x92b00000-0x92bfffff at device 0.0 on pci1

I've not rolled out to production servers yet so have an interest in knowing of any issues!

Can you boot off a USB/CD and check the disks from that? Run mfiutil or other diagnostic tools? Not sure if mfiutil would be on the live CD, but it might be.

Check in the BIOS settings to see if anything reported in the events/logs to do with that drive?

Good luck.
 
Hi richardtoohey2
I spent 3 days trying with the C6 power-saving features disabled and it did not help: the box still kernel panics and reboots randomly (although on the first day it stayed stable for 15 hours).

As for the drive issue mentioned by SirDice, the hard disks are nearly new, as I have only used them for about half a year (4 x 4TB WD Data Center series), so a drive failure seems less likely.

In your last post you mention using the "mrsas" driver instead. How could I do that, as the card is recognized by default with the "mfi" driver?
By the way, I am using an "LSI MegaRAID SAS 9260-8I with CacheCade Pro 2.0".

Many Thanks.
 
New drives can fail - and the error message you are getting seems (I'm no expert!) hardware-related.

Some cards can use either the mfi or the mrsas driver. "man mrsas" lists LSI 92xx cards but not yours. "man mfi" lists your card explicitly: "LSI MegaRAID SAS 9260", so mfi seems correct. You can use mfiutil to view the drive status etc.
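
As a rough illustration of the mfiutil suggestion, a small wrapper could collect the controller's view of things in one go. This is only a sketch: the helper names (mfiutil_commands, collect) are made up for this example, unit 0 is assumed from the "mfi0" in the log, and the subcommands are the read-only "show" ones described in mfiutil(8). It needs to run as root on the affected box.

```python
import shutil
import subprocess

# Read-only mfiutil(8) subcommands worth looking at first.
SUBCOMMANDS = ["show adapter", "show drives", "show volumes", "show events"]


def mfiutil_commands(unit=0):
    """Build the argv lists; unit 0 corresponds to mfi0 (an assumption)."""
    return [["mfiutil", "-u", str(unit)] + sub.split() for sub in SUBCOMMANDS]


def collect(unit=0):
    """Run each command and return {command string: combined output}.

    Returns an empty dict when mfiutil is not installed (e.g. not on FreeBSD).
    """
    if shutil.which("mfiutil") is None:
        return {}
    results = {}
    for argv in mfiutil_commands(unit):
        proc = subprocess.run(argv, capture_output=True, text=True)
        results[" ".join(argv)] = proc.stdout + proc.stderr
    return results


if __name__ == "__main__":
    for command, output in collect().items():
        print("### " + command)
        print(output)
```

Saving the output once while the box is healthy and again after a timeout would make it easier to spot a drive or volume whose state has changed.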

If you are using ZFS then you should be making sure you are turning off the RAID functionality of the card, but if you've been running 12.1 for some time without issue then doesn't seem relevant.

If it can boot and run for 15 hours then it looks like an intermittent fault, which will be more difficult to find. Was the 15 hours of up-time on 12.2?
 
Looks like there was a newer mfi driver added to 12.2, as it is mentioned in the release notes.

To me the timeout error indicates either hardware or driver (configuration) problem.
 
The strangest thing I have found is that if I stop the ZFS snapshot creation (a sh script I run via crontab), the system no longer times out, panics, or reboots!

root@HOME:~ # uptime
12:03AM up 2 days, 16:59, 1 user, load averages: 0.12, 0.13, 0.14
 
Doesn't that still point to it most likely being a hardware/firmware issue?

If you do something in ZFS (e.g. the snapshot) that touches the bad sector or triggers the firmware issue, you get the hang/timeout & crash.

If you stop ZFS touching the failed/failing drive/sector/area (or the firmware code path) - no crash.

It still comes down to checking the hardware with any tools you've got e.g. mfiutil (anything in the event logs?)
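
To see how often the controller stalls and which vdev ZFS complains about, the system log can be summarised with something like the following. It is a minimal sketch, assuming the log lines look exactly like the two messages quoted at the start of the thread; the function name summarise and the regular expressions are written for this example only.

```python
import re
from collections import Counter

# Patterns modelled on the two messages quoted at the start of the thread.
TIMEOUT_RE = re.compile(
    r"kernel: (mfi\d+): COMMAND (0x[0-9a-fA-F]+) TIMEOUT AFTER (\d+) SECONDS"
)
PANIC_RE = re.compile(
    r"panic: I/O to pool '([^']+)' appears to be hung "
    r"on vdev guid (\d+) at '([^']+)'"
)


def summarise(lines):
    """Count timeouts per controller, track the longest, collect hung-vdev panics."""
    timeouts = Counter()
    longest = 0
    panics = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            timeouts[match.group(1)] += 1
            longest = max(longest, int(match.group(3)))
            continue
        match = PANIC_RE.search(line)
        if match:
            panics.append(
                {"pool": match.group(1), "guid": match.group(2), "device": match.group(3)}
            )
    return {"timeouts": dict(timeouts), "longest_seconds": longest, "panics": panics}
```

It takes any iterable of lines, so an open log file can be passed in directly, for example summarise(open("/var/log/messages", errors="replace")) if that is where your kernel messages end up. If the timeouts cluster around the times the snapshot cron job runs, that supports the idea that the snapshot is what touches the problem area.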
 