Solved: mfi0 timeout on LSI 2108, 10.2-RELEASE

I have old server hardware running 10.2-RELEASE that had been running for quite a long time without problems, until several power outages hit it; yesterday the server suddenly froze, displaying mfi0 timeout errors. As for the drive layout, I'm using 24x1T SATA for the data itself (raidz1, zsata) and 1x1T SATA for zroot. A normal boot causes the OS to get stuck after "Trying to mount root from zfs:zroot/ROOT/default". I was able to enter single-user mode and do some basic troubleshooting. Executing any ZFS command in single-user mode (zpool, zfs) also makes the mfi0 timeout error pop up. Here is some information I could dig up:

pciconf -lv

Code:
mfi0@pci0:3:0:0:        class=0x010400 card=0x070015d9 chip=0x00791000 rev=0x04 hdr=0x00
    vendor     = 'LSI Logic / Symbios Logic'
    device     = 'MegaRAID SAS 2108 [Liberator]'
    class      = mass storage
    subclass   = RAID

root@:~ # mfiutil show adapter

Code:
mfi0 Adapter:
    Product Name: LSI 2108 MegaRAID
   Serial Number:
        Firmware: 12.15.0-0239
     RAID Levels: JBOD, RAID0, RAID1, RAID5, RAID6, RAID10, RAID50
  Battery Backup: not present
           NVRAM: 32K
  Onboard Memory: 512M
  Minimum Stripe: 8K
  Maximum Stripe: 1M

root@:~ # mfiutil show volumes

Code:
mfi0 Volumes:
 Id     Size    Level   Stripe  State  Cache   Name
mfid0 (  930G) RAID-0      64K OPTIMAL Enabled
mfid1 (  930G) RAID-0      64K OPTIMAL Enabled
mfid2 (  930G) RAID-0      64K OPTIMAL Enabled
mfid3 (  930G) RAID-0      64K OPTIMAL Enabled
mfid4 (  930G) RAID-0      64K OPTIMAL Enabled
mfid5 (  930G) RAID-0      64K OPTIMAL Enabled
mfid6 (  930G) RAID-0      64K OPTIMAL Enabled
mfid7 (  930G) RAID-0      64K OPTIMAL Enabled
mfid8 (  930G) RAID-0      64K OPTIMAL Enabled
mfid9 (  930G) RAID-0      64K OPTIMAL Enabled
mfid10 (  930G) RAID-0      64K OPTIMAL Enabled
mfid11 (  930G) RAID-0      64K OPTIMAL Enabled
mfid12 (  930G) RAID-0      64K OPTIMAL Enabled
mfid13 (  930G) RAID-0      64K OPTIMAL Enabled
mfid14 (  930G) RAID-0      64K OPTIMAL Enabled
mfid15 (  930G) RAID-0      64K OPTIMAL Enabled
mfid16 (  930G) RAID-0      64K OPTIMAL Enabled
mfid17 (  930G) RAID-0      64K OPTIMAL Enabled
mfid18 (  930G) RAID-0      64K OPTIMAL Enabled
mfid19 (  930G) RAID-0      64K OPTIMAL Enabled
mfid20 (  930G) RAID-0      64K OPTIMAL Enabled
mfid21 (  930G) RAID-0      64K OPTIMAL Enabled
mfid22 (  930G) RAID-0      64K OPTIMAL Enabled
mfid23 (  930G) RAID-0      64K OPTIMAL Enabled

Yeah, it's kind of a recipe for disaster running ZFS on top of fake RAID and RAID0, but I'm not the one who built it. I've tried disabling C-states in the BIOS and adding hw.mfi.mrsas_enable=1 to /boot/loader.conf. Since the OS can detect the hardware, I don't think it's a driver issue.
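For reference, that loader.conf entry is just the single line below; as I understand it, it only tells the kernel to prefer the mrsas(4) driver for cards that both drivers support, so it may well do nothing for a 2108.
Code:
# /boot/loader.conf
# hand dual-supported MegaRAID cards to mrsas(4) instead of mfi(4);
# the 2108 is likely mfi(4)-only, so this is probably a no-op here
hw.mfi.mrsas_enable=1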

Some say there is a patch for this:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=140416
but I don't know how to apply the patch in my current situation. I can give more information if necessary. Any help would be appreciated, thanks!
 
I have old server hardware running 10.2-RELEASE that had been running for quite a long time without problems, until several power outages hit it; yesterday the server suddenly froze, displaying mfi0 timeout errors. As for the drive layout, I'm using 24x1T SATA for the data itself (raidz1, zsata) and 1x1T SATA for zroot.
I stopped reading your post at this spot. IIRC the LSI 2108 is an intelligent RAID controller with cache memory. What is ZFS doing on top of a hardware RAID controller? Please educate me, but I would swear that there is no way to put the LSI 2108 into IT mode. The very fact that you are talking about mfi (the hardware RAID driver) and ZFS in the same post is a big red flag and unlikely to get any attention from serious storage guys.
 
I stopped reading your post at this spot. IIRC the LSI 2108 is an intelligent RAID controller with cache memory. What is ZFS doing on top of a hardware RAID controller? Please educate me, but I would swear that there is no way to put the LSI 2108 into IT mode. The very fact that you are talking about mfi (the hardware RAID driver) and ZFS in the same post is a big red flag and unlikely to get any attention from serious storage guys.
Yup, that's why I added a statement later; maybe I should've written it at the very beginning:
Yeah, it's kind of a recipe for disaster running ZFS on top of fake RAID and RAID0, but I'm not the one who built it.
So in a quest for some help I just gave it a try. Thanks anyway for the warning ;)
 
Anyway, just a quick update: I've updated the mfi firmware.
mfiutil show firmware
Code:
mfi0 Firmware Package Version: 12.15.0-0239
mfi0 Firmware Images:
Name  Version                          Date         Time         Status
APP   2.130.403-4660                   Aug 14 2015  01:44:33     active
BIOS  3.30.02.2_4.16.08.00_0x06060A05  07/23/2014                active
PCLI  04.04-020:#%00009                May 04 2012  14:47:14     active
BCON  6.0-54-e_50-Rel                  Sep 08 2014  17:14:26     active
NVDT  2.09.03-0058                     Sep 07 2015  10:38:48     active
BTBL  2.02.00.00-0000                  Sep 16 2009  21:37:06     active
BOOT  09.250.01.219                    4/4/2011     15:58:38     active

Also, I can now mount zroot in multi-user mode (without mounting the data zpool); however, importing the data pool makes the mfi0 timeout pop up again.
 
So in a quest for some help I just gave it a try. Thanks anyway for the warning ;)
I am very sorry for your problems, but they seem to be self-inflicted wounds. I hope you get better answers than mine. IIRC the driver for those HW RAID cards was updated between the 8.x and 9.x releases; I am 100% sure I have seen this on the FreeNAS forum. You might be able to play with an alternative or older driver for the LSI RAID card. This is really all I can say. The FreeNAS forum might be more helpful, if you don't hit assholes like me who don't want to be bothered with your special case.
 
If I understood correctly, you are using 24 drives in a single raidz1????

Try booting with a recent mfsbsd image and see if you can import your pool. Remove the power supplies first.
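If the mfsbsd image boots, something like this should let you test the import without writing anything to the pool (pool name taken from your first post; adjust to taste):
Code:
# force-import the data pool read-only under an alternate mount root
zpool import -f -o readonly=on -R /mnt zsata
zpool status zsata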
 
If I understood correctly, you are using 24 drives in a single raidz1????

Try booting with a recent mfsbsd image and see if you can import your pool. Remove the power supplies first.
Nope, I'm using the 24 drives for raidz3. I've tried removing the power supplies, but no result; previously I could list all the pools, but importing the data pool caused the mfi timeout again.

Fortunately, I was able to resolve the problem using https://lists.freebsd.org/pipermail/freebsd-fs/2012-October/015507.html
In case you don't want to follow the URL, here's the tweak:

Code:
kern.maxfiles=5000000
kern.maxvnodes=5000000
vfs.zfs.zil_disable="1"
vfs.zfs.prefetch_disable="1"
vfs.zfs.txg.timeout="5"
The above solved the system being unable to import or mount the pool.
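(For anyone else trying this: these look like /boot/loader.conf entries, so a reboot is needed for them to take effect. Note that vfs.zfs.zil_disable may no longer exist on 10.x, where the per-dataset sync property replaced it; an unknown tunable is simply ignored. The rest can be checked after the reboot with something like:)
Code:
# verify the boot-time tunables after rebooting
sysctl kern.maxfiles kern.maxvnodes vfs.zfs.prefetch_disable vfs.zfs.txg.timeout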

I have also gone into the card's BIOS settings and, under advanced settings, changed "Forward Read" to "none". This solved the mfi0 timeout.

Now I have a chance to move the data to other storage and reconfigure the troublesome server. Thanks for your help ;)
 
Fortunately, I was able to resolve the problem using https://lists.freebsd.org/pipermail/freebsd-fs/2012-October/015507.html
In case you don't want to follow the URL, here's the tweak:

Code:
kern.maxfiles=5000000
kern.maxvnodes=5000000
vfs.zfs.zil_disable="1"
vfs.zfs.prefetch_disable="1"
vfs.zfs.txg.timeout="5"
The above solved the system being unable to import or mount the pool.
Those settings look like they will have a pretty serious impact on performance. But if the performance is sufficient to let you migrate the data to a different system, then it is probably OK for you.

I have a somewhat different opinion than Oko about using advanced-function cards for ZFS - it works reasonably well as long as you create a volume for each drive. And if you're a purist, you can set this in /boot/loader.conf
Code:
hw.mfi.allow_cam_disk_passthrough=1
But doing that is basically talking to the drives "behind the controller's back" and usually leads to problems down the line (like when someone sees the controller BIOS reporting "unconfigured volumes" during boot and tries to "fix" it).
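If anyone does go the passthrough route: my understanding (an assumption on my part, not something stated above) is that you also need the mfip(4) passthrough module loaded so CAM can see the disks, after which they show up as da(4) devices alongside the mfid volumes:
Code:
# /boot/loader.conf -- raw-disk passthrough behind mfi(4)
hw.mfi.allow_cam_disk_passthrough=1
mfip_load="YES"    # CAM passthrough module; my assumption, check your setup
After a reboot, camcontrol devlist should then list the physical drives next to the RAID volumes.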

I think a more serious issue in your case is that you have a controller which has a maximum of 8 ports (some versions are 4 ports), so you have an expander backplane in there. Since your PCI ID indicates a Supermicro product, this is either a motherboard or an add-on card (AOC) and an expander backplane, likely either a BPN-SASx-xxxEL1 or BPN-SASx-xxxEL2. As such, your controller isn't talking to the drives, it is talking to the expander. The expander is handling the SAS to SATA translation as well as error recovery (or lack thereof). A single mis-behaving drive can cause the expander to issue resets to a group of drives, and the controller doesn't know what's going on and reports a timeout. SAS expanders with SATA drives seem to go "Resets for EVERYBODY!" whenever there's a problem. That's the reason that companies like Dell (who sell SATA drives as a lower-cost option with these controllers) put a SAS to SATA interposer on every drive sled - the expander sees SAS drives and is happier.
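If you want to try to identify a misbehaving drive before (or after) you migrate the data, the controller's own view is worth a look; for example:
Code:
# list the physical drives as the controller/expander sees them
mfiutil show drives
# dump the controller event log and look for resets or media errors
# around the times of the mfi0 timeouts
mfiutil show events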

I think what your tuning changes above have accomplished is to simply lower the throughput enough that the problem isn't triggered, or is triggered but resolves itself before more I/O queues up. I also think you have a hardware problem, since you say this only started recently, after some power hits, and the system has otherwise been stable for some time.
 