Still trying to solve zfs / mps driver SCSI timeout and disk lost problem

I am creating a new thread to replace my old one, since my old one is now off topic (originally from a panic in 8.2-STABLE).

So the story is,

I chose the LSI 9211-8i because I wanted support for 3 TB disks. A few people who use the mps driver have issues where perfectly good disks time out and won't come back. It is unknown if it is the fault of the mps driver, ahci, cam, expanders, card firmware, etc.

Here is an excerpt from the log.

Code:
    Oct  1 18:06:12 bcnas1 kernel: (da3:mps0:0:0:0): SCSI command timeout on device handle 0x000a SMID 632
    Oct  1 23:15:01 bcnas1 kernel: : SCSI command timeout on device handle 0x000a SMID 1010
    Oct  1 23:15:01 bcnas1 kernel: (da3:mps0:0:0:0): SCSI command timeout on device handle 0x000a SMID 174
    ...
    Oct  1 23:15:01 bcnas1 kernel: mps0: (0:0:0) terminated ioc 804b scsi 0 state c xfer 0
    Oct  1 23:15:01 bcnas1 last message repeated 6 times
    Oct  1 23:15:01 bcnas1 kernel: mps0: mpssas_abort_complete: abort request on handle 0x0a SMID 931 complete
    Oct  1 23:15:01 bcnas1 kernel: mps0: mpssas_complete_tm_request: sending deferred task management request for handle 0x0a SMID 191

My old thread: http://forums.freebsd.org/showthread.php?p=149376
Sebulon's thread with an AOC-USAS2-L8i card (using ZFS): http://forums.freebsd.org/showthread.php?t=27128
Jason on a mailing list (using SAS disks and UFS, with many identical servers with same issue): http://osdir.com/ml/freebsd-scsi/2011-11/msg00006.html

Workarounds I came across and test results:
  • Sebulon's periodic workaround of setting
    Code:
    daily_status_security_chksetuid_enable="NO"
    doesn't work for me.
  • Updating my firmware to the latest IR version did not work.
  • Jason's workaround of using mpslsi doesn't work for me.
  • Jason's workaround of setting disk tags to 1 on all disks didn't work.
    # camcontrol tags -N 1 da#
  • in /boot/loader.conf:
    Code:
    vfs.zfs.vdev.min_pending="1"
    vfs.zfs.vdev.max_pending="1"
    did not solve the problem. (This time the whole root system was lost)
  • IT firmware instead of IR using this LSI page
    also see link from olav
  • Replace both root disks or put them on the onboard controller. (EDIT 2012-01-20: so far this seems to be working well; uptime is 29 days)

My next experiment:
  • Try the Crucial SSDs again with firmware 0009 (the old firmware was version 0001). (So far this seems to work, but it somehow caused a URE and made the disk fail the SMART short self-test, so I am not done testing.)
Other things I thought of trying, but probably won't need to:
  • Doug's suggestion to set the SMP timeouts http://lists.freebsd.org/pipermail/freebsd-scsi/2011-November/005108.html
  • disabling native command queuing
  • disabling AHCI
  • see if # camcontrol reset all works with mps (since it is ignored with mpslsi)
  • Try a non-mps controller and flash it to IT to support 3TB disks based on this info
  • (decided against this... with the mps driver before, it was random which of the 2 SSDs failed.) Put the disk in a different bay and port. The disk is currently in the front 24 disk backplane. I could try the back 12 disk backplane where the other root disk is. (maybe some backplanes just don't work with certain disks...) And if that fails, plug it into the onboard port.
  • Try the fixed version 8 firmware mentioned in "mps driver instability under stable/8"
  • Try without expanders / move the 2 SSDs to a place where they have their own channel (SSD in 1 port, other 3 empty)
  • Try
    Code:
    hw.pci.enable_msix="0"
    hw.pci.enable_msi="0"
    (last attempt resulted in a lockup... but maybe that only means it shouldn't be done while the system is running). The idea behind this is a mix of: this page about mfi (not mps); this page (again about mfi, but it mentions interrupts and has an interesting patch); and this piece of code from /usr/src/sys/dev/mps/mps.c, which has a terribly scary comment saying that it is simply an assumption that no flush/check is needed with MSI, and which shows that with MSI there is no flush (which is what the previous patch was all about). (Please note that I really don't know what I am rambling on about, but will try anything.)
    Code:
    void
    mps_intr(void *data)
    {
            struct mps_softc *sc;
            uint32_t status;
    
            sc = (struct mps_softc *)data;
            mps_dprint(sc, MPS_TRACE, "%s\n", __func__);
    
            /*
             * Check interrupt status register to flush the bus.  This is
             * needed for both INTx interrupts and driver-driven polling
             */
            status = mps_regread(sc, MPI2_HOST_INTERRUPT_STATUS_OFFSET);
            if ((status & MPI2_HIS_REPLY_DESCRIPTOR_INTERRUPT) == 0)
                    return;
    
            mps_lock(sc);
            mps_intr_locked(data);
            mps_unlock(sc);
            return;
    }
    
    /*
     * In theory, MSI/MSIX interrupts shouldn't need to read any registers on the
     * chip.  Hopefully this theory is correct.
     */
    void
    mps_intr_msi(void *data)
    {
            struct mps_softc *sc;
    
            sc = (struct mps_softc *)data;
            mps_lock(sc);
            mps_intr_locked(data);
            mps_unlock(sc);
            return;
    }
 
# uname -a
Code:
FreeBSD bcnas1.bc.local 8.2-STABLE FreeBSD 8.2-STABLE #0: Thu Sep 29 15:06:03 CEST 2011     root@bcnas1.bc.local:/usr/obj/usr/src/sys/GENERIC  amd64

# grep -Ev "^#|^$" /boot/loader.conf
Code:
zfs_load="YES"
vfs.root.mountfrom="zfs:zroot"
mps_load="NO"
mpslsi_load="YES"
hw.memtest.tests=0
autoboot_delay="3"
if_ixgb_load="YES"
if_ixgbe_load="YES"
inet.tcp.tcbhashsize=4096
loader_logo="beastie"
net.inet.tcp.syncache.hashsize=1024
net.inet.tcp.syncache.bucketlimit=100
net.isr.bindthreads=0
net.isr.direct=1
net.isr.direct_force=1
net.isr.maxthreads=3
vm.kmem_size="44g"
vm.kmem_size_max="44g"
vfs.zfs.arc_min="80m"
vfs.zfs.arc_max="42g"
vfs.zfs.arc_meta_limit="24g"
vfs.zfs.vdev.cache.size="32m"
vfs.zfs.vdev.cache.max="256m"
vfs.zfs.vdev.min_pending="4"
vfs.zfs.vdev.max_pending="32"
kern.maxfiles="950000"
zfs_load="YES"
ahci_load="YES"
siis_load="YES"
 
Somehow I doubt the "reset all" will do anything when the disks are lost.

# camcontrol reset all
Code:
Reset of bus 0 was successful

/var/log/messages
Code:
Dec 12 11:22:11 bcnas1bak kernel: mpslsi0: mpssas_action faking success for abort or reset
 
Update: It crashed again today. Setting this in /boot/loader.conf did not prevent it.
Code:
vfs.zfs.vdev.min_pending="1"
vfs.zfs.vdev.max_pending="1"

Although the crash was different today... I think both root disks were lost, so there is no record of the timeouts in /var/log/messages, but I took some photos of the console with a camera. And this time no graceful shutdown was possible, so I had to hard reset it, and to my surprise, there were no checksum errors on the root disks (I haven't run a scrub on the rest yet).
 
It crashed again today at around 5:00-5:15pm. Again it looked like both root disks were lost, but this time there were timeouts reported in /var/log/messages and on the console, all for the same disk. And then when I rebooted, the disk never came back, so it ran degraded.

So flashing IT firmware does not fix the problem. Maybe it even makes it worse. But I will add "replace the disk" to my list.
 
Today I unplugged and replugged many disks on the backup system with no issues. Then I unplugged the (idle) SSD, watched all the disk lights blink red, plugged it back in, and got the same timeouts, plus a segmentation fault from # gpart show. So I think the SSD is to blame. It is running firmware version 0001, so I will try upgrading it to version 0009.

# smartctl -i /dev/da5
Code:
smartctl 5.40 2010-10-16 r3189 [FreeBSD 8.2-STABLE amd64] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     M4-CT256M4SSD2
Serial Number:    0000000011120304FB4E
Firmware Version: 0001
User Capacity:    256,060,514,304 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 6
Local Time is:    Mon Jan  9 11:43:04 2012 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled


And also I tried something like # camcontrol reset 0:11:0 while the disk was timed out, and got a kernel panic, much like when I ran # gpart recover da5 long ago. (I would say that is a bug, whether or not the disk is the root cause)
 
Hey man,

I lost a disk yesterday during a send/recv, just thought you should know.

Code:
Jan 18 07:38:19 fs2-7 kernel: (da6:mps1:0:0:0): SCSI command timeout on device handle 0x0009 SMID 130
Jan 18 07:39:02 fs2-7 kernel: mps1: (1:0:0) terminated ioc 804b scsi 0 state c xfer 0
Jan 18 07:39:02 fs2-7 kernel: mps1: mpssas_abort_complete: abort request on handle 0x09 SMID 130 complete
Jan 18 07:39:02 fs2-7 kernel: mps1: (1:0:0) terminated ioc 804b scsi 0 state c xfer 0
Jan 18 07:39:02 fs2-7 kernel: mps1: (1:0:0) terminated ioc 804b scsi 0 state c xfer 0
Jan 18 07:39:02 fs2-7 kernel: mps1: (1:0:0) terminated ioc 804b scsi 0 state 0 xfer 0
Jan 18 07:39:02 fs2-7 kernel: mps1: (1:0:0) terminated ioc 804b scsi 0 state 0 xfer 0
Jan 18 07:39:02 fs2-7 kernel: mps1: mpssas_remove_complete on target 0x0000, IOCStatus= 0x8
Jan 18 07:39:02 fs2-7 kernel: (da6:mps1:0:0:0): lost device
Jan 18 07:39:02 fs2-7 kernel: (da6:mps1:0:0:0): Synchronize cache failed, status == 0xa, scsi status == 0x0
Jan 18 07:39:02 fs2-7 kernel: (da6:mps1:0:0:0): removing device entry

Had to reboot the server to get it back. Have you found some way to get them back after dropouts like these? I have tried physically pulling it out and inserting it again; it doesn't work.

/Sebulon
 
I believe my issue is solved (the cause, that is; not the bad handling of the failure by FreeBSD/camcontrol/expanders/etc.). The "avoid using swap" and "disable the setuid check" workarounds are not needed.

I am quite sure that my issue was only the firmware on the SSD (my root disk). Upgrading one SSD to version 0009 and testing with my "hot pull test" always passes with the upgraded one, and nearly always fails on a non-upgraded one.

The "hot pull test":
Code:
# dd if=/dev/random of=/somewhere/on/the/disk bs=128k
pull disk
wait 1 second
put disk back in
wait 1 second
pull disk
wait 1 second
put disk back in
wait 1 second
hit ctrl+c on the dd command
wait for messages to stop on tty1 / syslog.
wait for a message saying that the disk was reattached (probably 15+ seconds)
# gpart show
# zpool status
# zpool online <pool> <disk>
# zpool status

If gpart show does not seg fault, and zpool online causes the disk to resilver, then it is all good.
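For repeated runs, the pass/fail decision can be scripted. Here is a minimal sketch; the health strings it greps for are assumptions based on the description above, and on the real system you would feed it the output of zpool status for your pool:

```shell
# Hypothetical helper for the hot pull test: decide pass/fail from
# captured `zpool status` text. The exact health strings checked here
# are assumptions based on the test description above.
hot_pull_result() {
    # $1: captured output of `zpool status <pool>`
    case "$1" in
        *DEGRADED*) echo FAIL ;;    # disk never came back / pool degraded
        *ONLINE*)   echo PASS ;;    # disk resilvered or stayed online
        *)          echo UNKNOWN ;; # unexpected output
    esac
}

# on the real system, something like:
#   hot_pull_result "$(zpool status zroot)"
```

Note the DEGRADED branch is checked first, since a degraded pool's status output also lists the healthy members as ONLINE.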

About 40% of the time, the bad SSD passes the test if pulled only once, and so far 0% of the time if pulled twice. In one test out of all of them, the red lights blinked on all disks on the controller when the bad disk was pulled. In all "pass" cases the pool stays ONLINE; in all "fail" cases it runs DEGRADED. No panics or pool faults were caused.

The test would probably trigger the failure even without dd running, but sometimes in that case (or maybe only when I don't wait 1 second in between), the disk just stays "ONLINE" and is never actually lost.

So pulling the disk is not going to fix it... pulling is what causes the problem. And if I put the disk in another computer, it works fine. If I put it back in the same system with the same HBA (I only have one, so I can't test another), then it never comes back without a reboot. It does come back if I reboot with the disk in, or if I reboot with the disk out and plug it in later. So I don't believe that resetting the server resets the disk in some special way... I think the hardware just remembers the ID of the disk and won't do the normal restart until it forgets the ID.

So try my hot pull test. If it works (and by "works" I mean "causes a failure"), then you have a quick way to evaluate other solutions, such as upgrading firmware or using different disks. (Too bad the disk market is messed up.) BTW, I recently found another thread about Samsung Spinpoint disks with timeouts (without mps). So I would suspect the firmware first.

http://forums.freebsd.org/showthread.php?p=161841#post161841

And have you tried mpslsi? It handles a specific error that mps does not; see my link above about Jason Wolfe (http://osdir.com/ml/freebsd-scsi/2011-11/msg00006.html).

I am interested in how your situation develops... so let me know how that works.
 
And there it went again:
Code:
Jan 26 00:08:17 fs2-7 kernel: mpslsi0: mpssas_scsiio_timeout checking sc 0xffffff80003a5000 cm 0xffffff80003e8498
Jan 26 00:08:17 fs2-7 kernel: (da5:mpslsi0:0:6:0): WRITE(10). CDB: 2a 0 1 cc f1 f1 0 0 8 0 length 4096 SMID 603 command timeout cm 0xffffff80003e8498 ccb 0xffffff00076db800
Jan 26 00:08:17 fs2-7 kernel: mpslsi0: mpssas_alloc_tm freezing simq
Jan 26 00:08:17 fs2-7 kernel: mpslsi0: timedout cm 0xffffff80003e8498 allocated tm 0xffffff80003b8148
Jan 26 00:08:21 fs2-7 kernel: (da5:mpslsi0:0:6:0): WRITE(10). CDB: 2a 0 1 cc f1 f1 0 0 8 0 length 4096 SMID 603 completed timedout cm 0xffffff80003e8498 ccb 0xffffff00076db800 during recovery ioc 8048 scsi 0 state c xfer (noperiph:mpslsi0:0:6:0): SMID 1 abort TaskMID 603 status 0x4a code 0x0 count 1
Jan 26 00:08:21 fs2-7 kernel: (noperiph:mpslsi0:0:6:0): SMID 1 finished recovery after aborting TaskMID 603
Jan 26 00:08:21 fs2-7 kernel: mpslsi0: mpssas_free_tm releasing simq
Jan 26 00:08:58 fs2-7 kernel: (da5:mpslsi0:0:6:0): WRITE(10). CDB: 2a 0 1 cc f1 f1 0 0 8 0 length 4096 SMID 1000 terminated ioc 804b scsi 0 state c xfer 0
Jan 26 00:08:59 fs2-7 kernel: mpslsi0: mpssas_alloc_tm freezing simq
Jan 26 00:08:59 fs2-7 kernel: mpslsi0: mpssas_lost_target targetid 6
Jan 26 00:08:59 fs2-7 kernel: (da5:mpslsi0:0:6:0): lost device
Jan 26 00:08:59 fs2-7 kernel: mpslsi0: mpssas_remove_complete on handle 0x000e, IOCStatus= 0x0
Jan 26 00:08:59 fs2-7 kernel: mpslsi0: mpssas_free_tm releasing simq
Jan 26 00:09:04 fs2-7 kernel: (da5:mpslsi0:0:6:0): Synchronize cache failed, status == 0x39, scsi status == 0x0
Jan 26 00:09:04 fs2-7 kernel: (da5:mpslsi0:0:6:0): removing device entry

/Sebulon
 
Are you using the latest firmware on that device? What device is it exactly?

I can't know for sure whether the real-world test of time will make my SSD crash again, but the latest firmware does fix losing the device after a hot pull. Yesterday, I put an SSD with updated firmware back in the machine as the ZIL and cache (but not as root disk this time). So I will eventually find out whether my hot pull test corresponds to the real-world test.

BTW. I also tried a 2TB Seagate Green today (ST2000DL003), which fails the hot pull test.

At some point I will try a 2TB 5k RPM Deskstar Hitachi (HDS5C3020ALA632; my favorite cheap consumer disk... fastest and possibly the most reliable). I already tried a 3TB version of that Hitachi Deskstar, which works fine. (In theory the 5k ones are more reliable than the 7k) [UPDATE: It has firmware ML6OA580 and passes the test.]

I have some 3 TB WD Greens too... maybe I will try one of those. [UPDATE: It is a WDC WD30EZRX-00MMMB0 and has firmware 80.00A80 and passes the test.]

I also have an old 1TB WD Green with bad sectors that runs ultra-slow... might try that too (probably won't try it).


And just to repeat my opinion, my reading of the situation is: the device does something wrong; FreeBSD handles it badly; and then when the device is re-added and should work fine, FreeBSD cannot use it until a reboot, due to a bug in FreeBSD or an inability to work around some problem in the rest of the IO system. (I conclude this from my assumption, based on extensive experience with Linux but without performing the exact same test there, that the same test would pass on all disks on Linux.)

So maybe we could ask for help from someone that can troubleshoot the drivers to see what can be done to get the disk back. Where should we start?
 
Sebulon said:
Have you found some way to get them back after dropouts like these? I have tried physically pulling it out and inserting it again; it doesn't work.

Yesterday, someone on the freebsd-stable mailing list said that # camcontrol rescan gets his disk back when it times out.

Code:
Jan  7 10:04:24 zfs kernel: ahcich3: Timeout on slot 27 port 0
Jan  7 10:04:24 zfs kernel: ahcich3: is 00000000 cs 20000000 ss 38000000 rs 38000000 tfd c0 serr 00000000 cmd 0004dd17
Jan  7 10:04:56 zfs kernel: ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080)


It was a really dead device; the only way to get it back was to power-cycle it by pulling it out, sticking it back in, and then running # camcontrol rescan.
 
To add to the list:

Device Model: TOSHIBA MK2002TSKB
Firmware Version: MT2A

Fails the hot pull test. This disk passed when writing from /dev/zero, but failed writing from a real file. And I tested the SSD again with a real file, which passed. I also tested the Deskstar and WDC Green again just in case, which passed again.

The disk light was not solid on, and gstat showed low load writing from /dev/zero, which is why I decided to use a real file. Maybe this changed since I upgraded to the latest 8-STABLE on Feb. 4th.
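A sketch of preparing an incompressible source for the write load, since streaming /dev/zero gave suspiciously low gstat load; the path and size here are illustrative assumptions, and you would scale the count up for a real run (/dev/urandom also exists on FreeBSD and is equivalent to /dev/random there):

```shell
# Build a small incompressible test file so the disk actually has to
# write real data (writes of pure zeros may be handled too "lightly").
# Path and size are assumptions for illustration; scale count up as needed.
dd if=/dev/urandom of=/tmp/pullsrc.bin bs=128k count=8 2>/dev/null  # 1 MiB sample
wc -c < /tmp/pullsrc.bin

# then, during the hot pull test, stream it at the disk under test:
#   dd if=/tmp/pullsrc.bin of=/somewhere/on/the/disk bs=128k
```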

And then I tried # camcontrol rescan all which hung. And then I tried # camcontrol rescan 0:60:0 for the specific disk, which said it succeeded, but did not let me use the disk again. Then I tried # for num in {1..60}; do camcontrol rescan 0:${num}:0; done which hung after saying 0:17:0 was successful. So I rebooted.
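One caveat for anyone retrying that loop: {1..60} is a bash-ism, and FreeBSD's /bin/sh does not expand it. A plain POSIX counter works in any shell; this sketch uses the same 0:<target>:0 IDs as above, with echo left in as a dry-run guard:

```shell
# Portable rewrite of the rescan loop: a POSIX while-counter instead of
# the bash-only {1..60} brace expansion. The target IDs follow the
# 0:<target>:0 pattern from the post above; `echo` makes this a dry run.
rescan_all() {
    num=1
    while [ "$num" -le "$1" ]; do
        echo camcontrol rescan "0:${num}:0"
        num=$((num + 1))
    done
}

rescan_all 60   # drop the `echo` inside the loop to actually rescan
```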
 
To add to the list:

Device Model: ST3000DM001-9YN166
Firmware Version: CC49

Passes the hot pull test (unlike the previous 2 TB version of the disk, with firmware CC45).

And with that disk in, the hard disk light seems to be reversed on systems with LSI 9211-8i and SAS expanders, but not on a USB dock or my desktop's onboard controller.

light on = idle,
light off = disk activity

Hah!


And I might note that starting a partition at sector 64 (which is aligned for 4096-byte physical sectors) makes resilvering VERY slow (maybe 2 hours), unlike on the WDC spinning disk I tested, which also has 512-byte logical and 4096-byte physical sectors. Starting at logical sector 129024 went much faster (a 20-minute resilver); 129024 is a multiple of 8 sectors (aligned for 4096-byte physical sectors on these new advanced-format disks), of 63 sectors (for stone-age CHS layouts), and of 2048 sectors (1 MiB, good for SSDs).

Code:
64*512 = 32768
32768%4096 = 0
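The same arithmetic can be done for any start sector; here is a small sketch (the byte boundaries checked are my choices for illustration):

```shell
# Check whether a partition start (in 512-byte logical sectors) is
# aligned to a given byte boundary: aligned iff start*512 % boundary == 0.
aligned() {  # usage: aligned <start_sector> <boundary_bytes>
    [ $(( $1 * 512 % $2 )) -eq 0 ] && echo yes || echo no
}

aligned 64 4096        # sector 64 is 4 KiB aligned
aligned 129024 4096    # sector 129024 is 4 KiB aligned too
aligned 129024 1048576 # ...and 1 MiB aligned (129024 = 63 * 2048)
aligned 63 4096        # a stone-age sector-63 start is not
```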

Code:
root@somelinuxmachine # parted -s /dev/sdb unit s print
Model: ATA ST3000DM001-9YN1 (scsi)
Disk /dev/sdb: 5860533168s
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
 
thethirdnut said:
So if I read your posts here correctly the only way you got your SSD's to operate correctly is by removing them from the LSI card?

Did you find any other better solution to a more stable mps driver? I am running 9.0-RELEASE atm.

No, upgrading the firmware on them makes them work fine. The SSDs have now been running for half a year without any problems. I don't know if this works for all devices, as obviously firmware is different on different models of disks/SSDs.

I don't believe mps is the only thing to blame. My hypothesis is that the firmware sucks and fails, and then mps or whatever other FreeBSD bits just don't handle it gracefully, so even if you remove the device, you can't get it to show up again until you reboot.
 
Thanks Peetaur.

I have my SSD - Crucial M4 64GB (fw 000F) - running off the SATA3 port on the mobo so use-case here is different. I am going to try manually loading the new mpslsi module into 9.0-RELEASE and will let you know if it makes any difference for me as well.

In my case there is one IBM M1015 port with 4 x 2 TB Seagate SATA3 HDDs on it. There was all sorts of wonkiness going on during a 1.4 TB transfer. A reboot fixed everything.
 