
mfi1: TIMEOUT AFTER 5669 SECONDS + Kernel Panic

Discussion in 'General' started by knuckolls, Jan 3, 2011.

  1. knuckolls

    This is my first post here on the forums, and I'm a relatively new FreeBSD sysadmin. We ordered a Dell R710 with two RAID controllers: the PERC H700 for 6 internal drives (mfi0), and the PERC H800 for a 24-drive direct-attached storage (DAS) enclosure (mfi1). I have a zpool on 4 of the internal drives on mfi0 and a zpool across all 24 drives in the mfi1 DAS.

    From time to time I have noticed that reads and writes to the mfi1 DAS hang for a long time. While they are hung, timeout error messages show up in /var/log/messages. After some seemingly arbitrary amount of time we're back to normal operation. Days may pass before it shows up in the logs again, and the reported timeout can be anywhere from about 30 to 15,000 seconds; I have not yet discerned a pattern. Here is an example:

    Code:
    ...
    Jan  2 06:06:28 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 3959 SECONDS
    Jan  2 06:06:58 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 3989 SECONDS
    Jan  2 06:07:28 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 4019 SECONDS
    Jan  2 06:07:58 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 4049 SECONDS
    Jan  2 06:08:28 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 4079 SECONDS
    Jan  2 06:08:58 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 4109 SECONDS
    Jan  2 06:09:28 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 4139 SECONDS
    Jan  2 06:09:58 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 4169 SECONDS
    Jan  2 06:10:28 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 4199 SECONDS
    Jan  2 06:10:58 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 4229 SECONDS
    Jan  2 06:11:28 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 4259 SECONDS
    Jan  2 06:11:58 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 4289 SECONDS
    Jan  2 06:12:28 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 4319 SECONDS
    Jan  2 06:12:58 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 4349 SECONDS
    Jan  2 06:13:28 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 4379 SECONDS
    Jan  2 06:13:58 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 4409 SECONDS
    Jan  2 06:14:28 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 4439 SECONDS
    Jan  2 06:14:58 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 4469 SECONDS
    Jan  2 06:15:28 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 4499 SECONDS
    Jan  2 06:15:58 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 4529 SECONDS
    Jan  2 06:16:28 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 4559 SECONDS
    ...
    

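    For what it's worth, a quick way to see how often this has been happening, without waiting for it to scroll by on the console, is a grep over the syslog (just a sketch; it assumes the default /var/log/messages location):
    Code:
    # count all controller timeout messages, then show the most recent few
    grep -c 'TIMEOUT AFTER' /var/log/messages
    grep 'mfi1: COMMAND' /var/log/messages | tail -n 5
    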

    I have a hunch that this is occurring when copying large amounts of data (e.g. 0.5 TB) between the drive arrays. I've also seen it occur during a hefty amount of database activity on mfi1. This database is not heavily used as an online database and is mostly for data warehousing, so I've waited a bit to attack the problem while other fires have demanded my attention.

    This morning the server was unresponsive, and the following information was in the logs and on screen. A hard shutdown and startup seems to have solved the problem with no data loss (thanks, ZFS checksumming).
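
    For anyone in the same boat wanting to double-check their pools after a crash like this, a scrub is the thorough way to do it (a sketch, using the pool names shown in my next post):
    Code:
    # start scrubs on both pools, then check for any pools reporting errors
    zpool scrub data
    zpool scrub tank
    zpool status -x    # prints "all pools are healthy" when nothing is wrong
    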

    Recovered from /var/log/messages. Also shown on screen before panic message below:
    Code:
    Jan  2 06:29:28 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 5339 SECONDS
    Jan  2 06:29:58 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 5369 SECONDS
    Jan  2 06:30:28 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 5399 SECONDS
    Jan  2 06:30:58 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 5429 SECONDS
    Jan  2 06:31:28 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 5459 SECONDS
    Jan  2 06:31:58 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 5489 SECONDS
    Jan  2 06:32:28 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 5519 SECONDS
    Jan  2 06:32:58 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 5549 SECONDS
    Jan  2 06:33:28 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 5579 SECONDS
    Jan  2 06:33:58 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 5609 SECONDS
    Jan  2 06:34:28 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 5639 SECONDS
    Jan  2 06:34:58 onyx kernel: mfi1: COMMAND 0xffffff80009c7870 TIMEOUT AFTER 5669 SECONDS
    


    Transcribed from on-screen before shutdown, not present in any logs.
    Code:
    Fatal trap 12: page fault while in kernel mode
    cpuid = 6; apic id = 34
    fault virtual address = 0x290
    fault code = supervisor read data, page not present
    instruction pointer = 0x20
    stack pointer = 0x28
    frame pointer = 0x28
    code segment = base 0x0, limit 0xffff, type 0x16
                   DPL 0, pres 1, long 1, def32 0, gran 1
    processor eflags = interrupt enabled, resume, IOPL = 0
    current process = 20376 (ruby)
    trap number = 12
    panic: page fault
    cpuid = 6
    Cannot Dump. Device not defined or enabled.
    


    I'm at a loss, but it appears to be related to the mfi1 timeouts. I'll add dumps of info that may be helpful.
     
  2. knuckolls

    # mfiutil -u 0 show adapter
    Code:
    mfi0 Adapter:
        Product Name: PERC H700 Integrated
       Serial Number: 06C0032
            Firmware: 12.3.0-0032
         RAID Levels: JBOD, RAID0, RAID1, RAID5, RAID6, RAID10, RAID50
      Battery Backup: present
               NVRAM: 32K
      Onboard Memory: 1024M
      Minimum Stripe: 8K
      Maximum Stripe: 1M
    


    # mfiutil -u 1 show adapter
    Code:
    mfi1 Adapter:
        Product Name: PERC H800 Adapter
       Serial Number: 06H00B2
            Firmware: 12.3.0-0032
         RAID Levels: JBOD, RAID0, RAID1, RAID5, RAID6, RAID10, RAID50
      Battery Backup: present
               NVRAM: 32K
      Onboard Memory: 1024M
      Minimum Stripe: 8K
      Maximum Stripe: 1M
    


    # mfiutil -u 0 show config
    Code:
    mfi0 Configuration: 5 arrays, 5 volumes, 0 spares
        array 0 of 2 drives:
            drive 2 (   75G) ONLINE <INTEL SSDSA2M080 02HD serial=CVPO0360003F080JGN> SATA enclosure 1, slot 2
            drive 3 (   75G) ONLINE <INTEL SSDSA2M080 02HD serial=CVPO036003PH080JGN> SATA enclosure 1, slot 3
        array 1 of 1 drives:
            drive 0 (  419G) ONLINE <SEAGATE ST3450857SS ES63 serial=3SK1D97L> SAS enclosure 1, slot 0
        array 2 of 1 drives:
            drive 1 (  419G) ONLINE <SEAGATE ST3450857SS ES63 serial=3SK1C818> SAS enclosure 1, slot 1
        array 3 of 1 drives:
            drive 5 (  419G) ONLINE <SEAGATE ST3450857SS ES63 serial=3SK1D8PZ> SAS enclosure 1, slot 5
        array 4 of 1 drives:
            drive 4 (  419G) ONLINE <SEAGATE ST3450857SS ES63 serial=3SK1D8RV> SAS enclosure 1, slot 4
        volume mfid0 (74G) RAID-1 64K OPTIMAL <os-wal> spans:
            array 0
        volume mfid1 (419G) RAID-0 64K OPTIMAL <one> spans:
            array 1
        volume mfid2 (419G) RAID-0 64K OPTIMAL <two> spans:
            array 2
        volume mfid3 (419G) RAID-0 64K OPTIMAL <three> spans:
            array 3
        volume mfid4 (419G) RAID-0 64K OPTIMAL <four> spans:
            array 4
    


    # mfiutil -u 1 show config
    Code:
    mfi1 Configuration: 24 arrays, 24 volumes, 0 spares
        array 0 of 1 drives:
            drive 1 (  137G) ONLINE <FUJITSU MBE2147RC D905 serial=D304PAB06N8S> SAS enclosure 1, slot 0
        array 1 of 1 drives:
            drive 2 (  137G) ONLINE <FUJITSU MBE2147RC D905 serial=D304PAB06N86> SAS enclosure 1, slot 1
        array 2 of 1 drives:
            drive 3 (  137G) ONLINE <FUJITSU MBE2147RC D905 serial=D304PAB06N8D> SAS enclosure 1, slot 2
        array 3 of 1 drives:
            drive 4 (  137G) ONLINE <FUJITSU MBE2147RC D905 serial=D304PAB06N95> SAS enclosure 1, slot 3
        array 4 of 1 drives:
            drive 5 (  137G) ONLINE <FUJITSU MBE2147RC D905 serial=D304PAB06N7N> SAS enclosure 1, slot 4
        array 5 of 1 drives:
            drive 0 (  137G) ONLINE <FUJITSU MBE2147RC D905 serial=D304PAB06N7U> SAS enclosure 1, slot 5
        array 6 of 1 drives:
            drive 6 (  137G) ONLINE <FUJITSU MBE2147RC D905 serial=D304PAB06N71> SAS enclosure 1, slot 6
        array 7 of 1 drives:
            drive 7 (  137G) ONLINE <FUJITSU MBE2147RC D905 serial=D304PAB06N6U> SAS enclosure 1, slot 7
        array 8 of 1 drives:
            drive 8 (  137G) ONLINE <FUJITSU MBE2147RC D905 serial=D304PAB06NAD> SAS enclosure 1, slot 8
        array 9 of 1 drives:
            drive 11 (  137G) ONLINE <FUJITSU MBE2147RC D905 serial=D304PAB06NAT> SAS enclosure 1, slot 9
        array 10 of 1 drives:
            drive 10 (  137G) ONLINE <FUJITSU MBE2147RC D905 serial=D304PAB06MPY> SAS enclosure 1, slot 10
        array 11 of 1 drives:
            drive 9 (  137G) ONLINE <FUJITSU MBE2147RC D905 serial=D304PAB06N5V> SAS enclosure 1, slot 11
        array 12 of 1 drives:
            drive 21 (  137G) ONLINE <FUJITSU MBE2147RC D905 serial=D304PAB06NAK> SAS enclosure 1, slot 12
        array 13 of 1 drives:
            drive 20 (  137G) ONLINE <FUJITSU MBE2147RC D905 serial=D304PAB06NB2> SAS enclosure 1, slot 13
        array 14 of 1 drives:
            drive 15 (  137G) ONLINE <FUJITSU MBE2147RC D905 serial=D304PAB06NBT> SAS enclosure 1, slot 14
        array 15 of 1 drives:
            drive 14 (  137G) ONLINE <FUJITSU MBE2147RC D905 serial=D304PAB06N9E> SAS enclosure 1, slot 15
        array 16 of 1 drives:
            drive 13 (  137G) ONLINE <FUJITSU MBE2147RC D905 serial=D304PAB06NB6> SAS enclosure 1, slot 16
        array 17 of 1 drives:
            drive 12 (  137G) ONLINE <FUJITSU MBE2147RC D905 serial=D304PAB06NAF> SAS enclosure 1, slot 17
        array 18 of 1 drives:
            drive 18 (  137G) ONLINE <FUJITSU MBE2147RC D905 serial=D304PAB06NBB> SAS enclosure 1, slot 18
        array 19 of 1 drives:
            drive 19 (  137G) ONLINE <FUJITSU MBE2147RC D905 serial=D304PAB06N9A> SAS enclosure 1, slot 19
        array 20 of 1 drives:
            drive 22 (  137G) ONLINE <FUJITSU MBE2147RC D905 serial=D304PAB06N9C> SAS enclosure 1, slot 20
        array 21 of 1 drives:
            drive 23 (  137G) ONLINE <FUJITSU MBE2147RC D905 serial=D304PAB06N9G> SAS enclosure 1, slot 21
        array 22 of 1 drives:
            drive 16 (  137G) ONLINE <FUJITSU MBE2147RC D905 serial=D304PAB06NAJ> SAS enclosure 1, slot 22
        array 23 of 1 drives:
            drive 17 (  137G) ONLINE <FUJITSU MBE2147RC D905 serial=D304PAB06NB5> SAS enclosure 1, slot 23
        volume mfid5 (136G) RAID-0 64K OPTIMAL <01_00_00> spans:
            array 0
        volume mfid6 (136G) RAID-0 64K OPTIMAL <01_00_01> spans:
            array 1
        volume mfid7 (136G) RAID-0 64K OPTIMAL <01_00_02> spans:
            array 2
        volume mfid8 (136G) RAID-0 64K OPTIMAL <01_00_03> spans:
            array 3
        volume mfid9 (136G) RAID-0 64K OPTIMAL <01_00_04> spans:
            array 4
        volume mfid10 (136G) RAID-0 64K OPTIMAL <01_00_05> spans:
            array 5
        volume mfid11 (136G) RAID-0 64K OPTIMAL <01_00_06> spans:
            array 6
        volume mfid12 (136G) RAID-0 64K OPTIMAL <01_00_07> spans:
            array 7
        volume mfid13 (136G) RAID-0 64K OPTIMAL <01_00_08> spans:
            array 8
        volume mfid14 (136G) RAID-0 64K OPTIMAL <01_00_09> spans:
            array 9
        volume mfid15 (136G) RAID-0 64K OPTIMAL <01_00_10> spans:
            array 10
        volume mfid16 (136G) RAID-0 64K OPTIMAL <01_00_11> spans:
            array 11
        volume mfid17 (136G) RAID-0 64K OPTIMAL <01_00_12> spans:
            array 12
        volume mfid18 (136G) RAID-0 64K OPTIMAL <01_00_13> spans:
            array 13
        volume mfid19 (136G) RAID-0 64K OPTIMAL <01_00_14> spans:
            array 14
        volume mfid20 (136G) RAID-0 64K OPTIMAL <01_00_15> spans:
            array 15
        volume mfid21 (136G) RAID-0 64K OPTIMAL <01_00_16> spans:
            array 16
        volume mfid22 (136G) RAID-0 64K OPTIMAL <01_00_17> spans:
            array 17
        volume mfid23 (136G) RAID-0 64K OPTIMAL <01_00_18> spans:
            array 18
        volume mfid24 (136G) RAID-0 64K OPTIMAL <01_00_19> spans:
            array 19
        volume mfid25 (136G) RAID-0 64K OPTIMAL <01_00_20> spans:
            array 20
        volume mfid26 (136G) RAID-0 64K OPTIMAL <01_00_21> spans:
            array 21
        volume mfid27 (136G) RAID-0 64K OPTIMAL <01_00_22> spans:
            array 22
        volume mfid28 (136G) RAID-0 64K OPTIMAL <01_00_23> spans:
            array 23
    


    # zpool status
    Code:
      pool: data
     state: ONLINE
     scrub: none requested
    config:
    
            NAME        STATE     READ WRITE CKSUM
            data        ONLINE       0     0     0
              raidz1    ONLINE       0     0     0
                mfid1   ONLINE       0     0     0
                mfid2   ONLINE       0     0     0
              raidz1    ONLINE       0     0     0
                mfid3   ONLINE       0     0     0
                mfid4   ONLINE       0     0     0
    
    errors: No known data errors
    
      pool: tank
     state: ONLINE
     scrub: none requested
    config:
    
            NAME        STATE     READ WRITE CKSUM
            tank        ONLINE       0     0     0
              mirror    ONLINE       0     0     0
                mfid5   ONLINE       0     0     0
                mfid6   ONLINE       0     0     0
              mirror    ONLINE       0     0     0
                mfid7   ONLINE       0     0     0
                mfid8   ONLINE       0     0     0
              mirror    ONLINE       0     0     0
                mfid9   ONLINE       0     0     0
                mfid10  ONLINE       0     0     0
              mirror    ONLINE       0     0     0
                mfid11  ONLINE       0     0     0
                mfid12  ONLINE       0     0     0
              mirror    ONLINE       0     0     0
                mfid13  ONLINE       0     0     0
                mfid14  ONLINE       0     0     0
              mirror    ONLINE       0     0     0
                mfid15  ONLINE       0     0     0
                mfid16  ONLINE       0     0     0
              mirror    ONLINE       0     0     0
                mfid17  ONLINE       0     0     0
                mfid18  ONLINE       0     0     0
              mirror    ONLINE       0     0     0
                mfid19  ONLINE       0     0     0
                mfid20  ONLINE       0     0     0
              mirror    ONLINE       0     0     0
                mfid21  ONLINE       0     0     0
                mfid22  ONLINE       0     0     0
              mirror    ONLINE       0     0     0
                mfid23  ONLINE       0     0     0
                mfid24  ONLINE       0     0     0
            logs        ONLINE       0     0     0
              mirror    ONLINE       0     0     0
                mfid25  ONLINE       0     0     0
                mfid26  ONLINE       0     0     0
            spares
              mfid27    AVAIL
              mfid28    AVAIL
    
     
  3. SirDice

    On what version of FreeBSD?
     
  4. knuckolls

    That's a funny thing to forget to mention. It's 8.1-stable.

    uname -a
    Code:
    FreeBSD onyx 8.1-RELEASE FreeBSD 8.1-RELEASE #0: Mon Jul 19 02:36:49 UTC 2010     root@mason.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC  amd64
    
     
  5. SirDice

    No, it's 8.1-RELEASE.

    You could try updating to 8-STABLE. Or wait a little while for 8.2-RELEASE.
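
    If you go the 8-STABLE route, the usual update-from-source procedure looks roughly like this (a sketch; see the Handbook chapter on updating from source for the full details, and point the supfile at a cvsup mirror and the RELENG_8 tag first):
    Code:
    csup -L 2 /usr/share/examples/cvsup/stable-supfile
    cd /usr/src
    make buildworld
    make buildkernel KERNCONF=GENERIC
    make installkernel KERNCONF=GENERIC
    shutdown -r now
    # after rebooting (ideally into single-user mode):
    cd /usr/src
    mergemaster -p
    make installworld
    mergemaster
    reboot
    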
     
  6. kkdzo

    How about # mfiutil -u 1 cache mfid1 ?

    You might want to check your cache settings for all your volumes.
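
    The quickest way to dump the current cache policy for every volume on the H800 in one go is a small loop (just a sketch; it assumes the mfid5 through mfid28 names from the config you posted):
    Code:
    # print the cache settings of each volume on controller 1
    for i in $(jot 24 5); do
        echo "=== mfid${i} ==="
        mfiutil -u 1 cache mfid${i}
    done
    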

    Remember also that when the battery is discharging, the write policy will change from Write-Back to Write-Through if the remaining capacity drops below a certain level, and you'll see messages like these in /var/log/messages:
    Code:
    kernel: mfi0: 2244 (347643416s/0x0008/WARN) - Current capacity of the battery is below threshold
    kernel: mfi0: 2245 (347643416s/0x0008/WARN) - BBU disabled; changing WB virtual disks to WT

    This tends to seriously affect performance.

    Once the battery charges back above the threshold, you'll see this in your logs:
    Code:
    kernel: mfi0: 2261 (347649626s/0x0008/info) - Current capacity of the battery is above threshold
    kernel: mfi0: 2262 (347649626s/0x0008/info) - BBU enabled; changing WT virtual disks to WB
    
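
    You can also query the BBU state directly with mfiutil (sketch):
    Code:
    mfiutil -u 0 show battery    # H700
    mfiutil -u 1 show battery    # H800
    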
     
  7. knuckolls

    Thanks for the information about the battery; that was a likely candidate. The actual problem turned out to be a pretty serious bug in the firmware of the Dell PERC H700 and H800 RAID controllers.

    In the end it wasn't FreeBSD's fault, but I'll go ahead and enumerate the solution for wayward googlers with the same problem.

    Release Notes:
    http://ftp.us.dell.com/SAS-RAID/DELL_PERC-H800-ADAPTER_A06_R285986.txt

    Affected firmware version: 12.3.0-0032
    Updated firmware version: 12.10.0-0025

    From the release notes:
    1. Corrects a potential scenario where a drive may repeatedly fail
    during i/o.
    2. Fixes an issue in which IEEE scatter gather lists may be
    misinterpreted by firmware leading to possible PCIe fatal error.

    Here's what I had to do to get the H800 firmware updated. Dell only supplies executables for Windows and Red Hat Linux, so to avoid adapting the installation script to FreeBSD I had to do the following (a quick way to verify the result is sketched after the list):

    1. Burn and boot off of a Fedora 14 LiveCD. [http://fedoraproject.org/en/get-fedora]
    2. Download the H800 Redhat Firmware Update package [http://bit.ly/8KYCXw]
    3. yum install compat-libstdc++-33.i686
    4. su
    5. ./RAID_FRMW_LX_R285986.BIN
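
    After rebooting back into FreeBSD, the new firmware version can be confirmed with mfiutil, e.g. (sketch):
    Code:
    mfiutil -u 1 show adapter | grep Firmware
    mfiutil -u 0 show adapter | grep Firmware
    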

    The H700 has the same problem, and you can follow the same instructions to update its firmware.
     
  8. kkdzo

    Thanks for posting the solution.
     
  9. knuckolls

    A quick note for those who find this solution: my PERC H800 seems to be cured of all timeout problems, but the H700, with the same firmware update, showed some timeouts in /var/log/messages just now. So while this may be a valid solution for the H800, the jury is still out on the H700.

    Code:
    Jan  8 14:30:55 onyx kernel: mfi0: COMMAND 0xffffff80009b7b28 TIMEOUT AFTER 41 SECONDS
    
     
  10. knuckolls

    Note that the firmware is now up to date, yet the problem still persists.

    # sudo mfiutil -u 0 show adapter
    Code:
    mfi0 Adapter:
        Product Name: PERC H700 Integrated
       Serial Number: 06C0032
            Firmware: 12.10.0-0025
         RAID Levels: JBOD, RAID0, RAID1, RAID5, RAID6, RAID10, RAID50
      Battery Backup: present
               NVRAM: 32K
      Onboard Memory: 1024M
      Minimum Stripe: 8K
      Maximum Stripe: 1M
    
     
  11. Mussolini

    Hi all,

    I had the same problem here with an LSI 9260-4i. The controller is plugged into an Intel S1200BTL board inside a 24-bay Supermicro chassis running FreeBSD 9 and ZFS.
    After trying lots of things, including other releases of FreeBSD, I solved this problem by changing the read policy from adaptive to normal in the controller setup.
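
    For controllers driven by mfi(4), like the PERCs earlier in this thread, it should also be possible to make this change from FreeBSD instead of the controller BIOS (a sketch; it assumes "normal" corresponds to turning read-ahead off, and mfid0 is just a placeholder volume name):
    Code:
    mfiutil cache mfid0                    # show the current cache and read-ahead policy
    mfiutil cache mfid0 read-ahead none    # alternatives are 'readahead' and 'adaptive'
    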

    I hope this can help someone here.


    Best,

    Danilo
     