Problems with mpt

Matty · Sep 16, 2010

Today after booting my NAS I got these errors from the mpt driver.
The machine wouldn't boot with the drives attached, and I got these errors after running the

Code:

camcontrol rescan all

command.

I hooked up a spare drive, which runs fine on a SATA port, so it's either the cable or the controller. Or am I missing something?

Code:

mpt0: mpt_cam_event: 0x16
mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
mpt0: mpt_cam_event: 0x12
mpt0: Unhandled Event Notify Frame. Event 0x12 (ACK not required).
mpt0: mpt_cam_event: 0x16
mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
(probe3:mpt0:0:3:0): SCSI status error
(probe3:mpt0:0:3:0): MODE SENSE(6). CDB: 1a 0 a 0 14 0 
(probe3:mpt0:0:3:0): CAM status: SCSI Status Error
(probe3:mpt0:0:3:0): SCSI status: Check Condition
(probe3:mpt0:0:3:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(probe3:mpt0:0:3:0): Retrying command (per sense data)
GEOM: new disk da0pass0 at mpt0 bus 0 scbus0 target 3 lun 0

pass0: <ATA SAMSUNG HD502HJ 00E4> Fixed Direct Access SCSI-5 device 
pass0: Serial Number S20BJ90SC92988      
pass0: 300.000MB/s transfers
pass0: Command Queueing enabled
da0 at mpt0 bus 0 scbus0 target 3 lun 0
da0: <ATA SAMSUNG HD502HJ 00E4> Fixed Direct Access SCSI-5 device 
da0: Serial Number S20BJ90SC92988      
da0: 300.000MB/s transfers
da0: Command Queueing enabled
da0: 476940MB (976773168 512 byte sectors: 255H 63S/T 60801C)
mpt0: request 0xffffff80002daf80:313 timed out for ccb 0xffffff0002554000 (req->ccb 0xffffff0002554000)
mpt0: attempting to abort req 0xffffff80002daf80:313 function 0
mpt0: mpt_send_handshake_cmd: db ignored
mpt0: soft reset failed: device not running
mpt0: WARNING - Failed hard reset! Trying to initialize anyway.
mpt0: mpt_cam_event: 0x0
mpt0: mpt_cam_event: 0x0
mpt0: completing timedout/aborted req 0xffffff80002daf80:313
mpt0: mpt_cam_event: 0x16
mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
mpt0: mpt_cam_event: 0x16
mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
mpt0: mpt_cam_event: 0x16
mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
mpt0: mpt_cam_event: 0x12
mpt0: Unhandled Event Notify Frame. Event 0x12 (ACK not required).
mpt0: mpt_cam_event: 0x16
mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
mpt0: mpt_cam_event: 0x16
mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
mpt0: mpt_cam_event: 0x16
mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
(probe0:mpt0:0:3:4): Bus Reset issued
(probe0:mpt0:0:3:4): Retrying command
(da0:mpt0:0:3:0): Bus Reset issued
(da0:mpt0:0:3:0): Retrying command
(da0:mpt0:0:3:0): SCSI status error
(da0:mpt0:0:3:0): READ(10). CDB: 28 0 3a 38 60 2f 0 0 1 0 
(da0:mpt0:0:3:0): CAM status: SCSI Status Error
(da0:mpt0:0:3:0): SCSI status: Check Condition
(da0:mpt0:0:3:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da0:mpt0:0:3:0): Retrying command (per sense data)
mpt0: request 0xffffff80002db2e0:320 timed out for ccb 0xffffff0002554000 (req->ccb 0xffffff0002554000)
mpt0: request 0xffffff80002db370:321 timed out for ccb 0xffffff0002592800 (req->ccb 0xffffff0002592800)
mpt0: attempting to abort req 0xffffff80002db2e0:320 function 0
mpt0: mpt_send_handshake_cmd: db ignored
mpt0: soft reset failed: device not running
mpt0: WARNING - Failed hard reset! Trying to initialize anyway.
mpt0: mpt_cam_event: 0x0
mpt0: completing timedout/aborted req 0xffffff80002db2e0:320
mpt0: completing timedout/aborted req 0xffffff80002db370:321
mpt0: mpt_cam_event: 0x16
mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
mpt0: mpt_cam_event: 0x16
mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
mpt0: mpt_cam_event: 0x16
mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
mpt0: mpt_cam_event: 0x12
mpt0: Unhandled Event Notify Frame. Event 0x12 (ACK not required).
mpt0: mpt_cam_event: 0x16
mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
mpt0: mpt_cam_event: 0x16
mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
mpt0: mpt_cam_event: 0x16
mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
(da0:mpt0:0:3:0): Bus Reset issued
(da0:mpt0:0:3:0): Retrying command
(probe0:mpt0:0:3:5): Bus Reset issued
(probe0:mpt0:0:3:5): Retrying command
(da0:mpt0:0:3:0): SCSI status error
(da0:mpt0:0:3:0): READ(10). CDB: 28 0 3a 38 60 2f 0 0 1 0 
(da0:mpt0:0:3:0): CAM status: SCSI Status Error
(da0:mpt0:0:3:0): SCSI status: Check Condition
(da0:mpt0:0:3:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da0:mpt0:0:3:0): Retrying command (per sense data)
mpt0: request 0xffffff80002db640:327 timed out for ccb 0xffffff0002554000 (req->ccb 0xffffff0002554000)
mpt0: attempting to abort req 0xffffff80002db640:327 function 0
mpt0: mpt_send_handshake_cmd: db ignored
mpt0: soft reset failed: device not running
mpt0: WARNING - Failed hard reset! Trying to initialize anyway.
mpt0: mpt_cam_event: 0x0
mpt0: completing timedout/aborted req 0xffffff80002db640:327
mpt0: mpt_cam_event: 0x16
mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
mpt0: mpt_cam_event: 0x16
mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
mpt0: mpt_cam_event: 0x16
mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
mpt0: mpt_cam_event: 0x12
mpt0: Unhandled Event Notify Frame. Event 0x12 (ACK not required).
mpt0: mpt_cam_event: 0x16
mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
mpt0: mpt_cam_event: 0x16
mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
mpt0: mpt_cam_event: 0x16
mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
(probe0:mpt0:0:3:6): Bus Reset issued
(probe0:mpt0:0:3:6): Retrying command
(da0:mpt0:0:3:0): Bus Reset issued
(da0:mpt0:0:3:0): Retrying command
(da0:mpt0:0:3:0): SCSI status error
(da0:mpt0:0:3:0): READ(10). CDB: 28 0 3a 38 60 2f 0 0 1 0 
(da0:mpt0:0:3:0): CAM status: SCSI Status Error
(da0:mpt0:0:3:0): SCSI status: Check Condition
(da0:mpt0:0:3:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da0:mpt0:0:3:0): Retrying command (per sense data)
mpt0: request 0xffffff80002db9a0:334 timed out for ccb 0xffffff0002554000 (req->ccb 0xffffff0002554000)
mpt0: request 0xffffff80002dba30:335 timed out for ccb 0xffffff0002592800 (req->ccb 0xffffff0002592800)
mpt0: attempting to abort req 0xffffff80002db9a0:334 function 0
mpt0: mpt_send_handshake_cmd: db ignored
mpt0: soft reset failed: device not running
mpt0: WARNING - Failed hard reset! Trying to initialize anyway.
mpt0: mpt_cam_event: 0x0
mpt0: completing timedout/aborted req 0xffffff80002db9a0:334
mpt0: completing timedout/aborted req 0xffffff80002dba30:335
mpt0: mpt_cam_event: 0x16
mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
mpt0: mpt_cam_event: 0x16
mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
mpt0: mpt_cam_event: 0x16
mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
mpt0: mpt_cam_event: 0x16
mpt0: Unhandled Event Notify Frame. Event 0x16 (ACK not required).
(da0:mpt0:0:3:0): Bus Reset issued
(da0:mpt0:0:3:0): Retrying command
(probe0:mpt0:0:3:7): Bus Reset issued
(probe0:mpt0:0:3:7): Retrying command
(pass0:mpt0:0:3:0): lost device
(pass0:mpt0:0:3:0): removing device entry
(da0:mpt0:0:3:0): lost device
(da0:mpt0:0:3:0): Selection timeout
(da0:mpt0:0:3:0): Retrying command
(da0:mpt0:0:3:0): Error 6, Unretryable error
(da0:mpt0:0:3:0): Invalidating pack
(da0:mpt0:0:3:0): Synchronize cache failed, status == 0xa, scsi status == 0x0
(da0:mpt0:0:3:0): removing device entry
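
One thing the log does show is which block the READ(10) keeps retrying. In a standard 10-byte READ/WRITE CDB, bytes 2-5 are the big-endian logical block address and bytes 7-8 are the transfer length. A minimal Python sketch of that decode follows; the function name `decode_rw10` is made up for illustration and is not part of any FreeBSD tool.

```python
def decode_rw10(cdb_text):
    """Decode a READ(10)/WRITE(10) CDB given as space-separated hex bytes."""
    cdb = [int(b, 16) for b in cdb_text.split()]
    if len(cdb) != 10 or cdb[0] not in (0x28, 0x2A):
        raise ValueError("not a 10-byte READ(10)/WRITE(10) CDB")
    opcode = "READ(10)" if cdb[0] == 0x28 else "WRITE(10)"
    # Bytes 2-5: logical block address, big-endian.
    lba = int.from_bytes(bytes(cdb[2:6]), "big")
    # Bytes 7-8: transfer length in blocks, big-endian.
    blocks = int.from_bytes(bytes(cdb[7:9]), "big")
    return opcode, lba, blocks


# CDB copied from the da0 lines in the log above.
op, lba, blocks = decode_rw10("28 0 3a 38 60 2f 0 0 1 0")
print(op, lba, blocks)  # READ(10) 976773167 1
```

da0 reports 976773168 sectors, so the command that keeps timing out is a one-block read of the very last sector of the disk.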

Matty · Sep 16, 2010

Well, I tried the controller in another computer. It didn't get recognized by the BIOS. I put it back in the one it came from. Same problem: BIOS/FB doesn't recognize it any more.

Kinda sucks.. Will try to RMA it.

quillo · Sep 21, 2010

Hi Matty,

I've been having a lot of the same problems and can't seem to find a definitive reason for the issue.

Can you tell me more about your setup? I'm using an LSI 3081E-R controller connected to an HP SAS expander. I have 2x WD 160GB, 13x Samsung 2TB and 5x WD 640GB drives connected; all are SATA.

It seems that the problem also exists for other kernels, as there was discussion about a similar issue over at the Nexenta forum. I really want to try to fix this because it is wreaking havoc on my ZFS array. In fact, it may have just caused both of my arrays to die during a scrub *AS I WAS WRITING THIS POST*.

Code:

  pool: online0
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-HC
 scrub: scrub completed after 3h5m with 0 errors on Tue Sep 21 14:33:37 2010
config:

        NAME        STATE     READ WRITE CKSUM
        online0     UNAVAIL      1    38     2  insufficient replicas
          mirror    ONLINE       2    76     5
            da15    ONLINE       9 1.21K     0
            da16    ONLINE       4    80     5
          mirror    ONLINE       0     0     0
            da17    ONLINE       3   592     0
            da18    ONLINE       0     0     0
        spares
          da19      AVAIL   

errors: 5 data errors, use '-v' for a list

  pool: vault0
 state: DEGRADED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-HC
 scrub: scrub in progress for 5h11m, 57.79% done, 3h47m to go
config:

        NAME        STATE     READ WRITE CKSUM
        vault0      DEGRADED     3     0     0
          raidz2    DEGRADED     7     3   116
            da2     ONLINE       6     7    15  130K repaired
            da3     FAULTED     16   298    65  corrupted data
            da4     ONLINE       7     2     0  1K repaired
            da5     ONLINE       7     2     0  1K repaired
            da6     ONLINE       5     2     2  2K repaired
            da7     ONLINE      10     3     2  2K repaired
            da8     ONLINE       7     2     0  512 repaired
            da9     ONLINE     129    59     1
            da10    ONLINE       7     4    90  1.02M repaired
            da11    ONLINE       6     2     0  1.50K repaired
            da12    ONLINE       7     3     0  1K repaired
            da13    ONLINE       7     3     0  1.50K repaired
        spares
          da14      AVAIL
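
With this many devices it can help to pull the rows out of the `zpool status` config table and list anything that is not ONLINE. A rough Python sketch, assuming the column layout shown above (NAME, STATE, READ, WRITE, CKSUM, optional note); `device_rows` and the field names are invented for this example, and the counters are kept as strings because they can carry a suffix such as `1.21K`.

```python
import re

# A few rows copied from the vault0 output above.
SAMPLE = """\
        NAME        STATE     READ WRITE CKSUM
        vault0      DEGRADED     3     0     0
          raidz2    DEGRADED     7     3   116
            da2     ONLINE       6     7    15  130K repaired
            da3     FAULTED     16   298    65  corrupted data
            da8     ONLINE       7     2     0  512 repaired
        spares
          da14      AVAIL
"""

# name, state, three counters, then an optional free-text note
ROW_RE = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\S+)\s+(\S+)\s+(\S+)(?:\s+(.*))?$")


def device_rows(text):
    """Return one dict per table row that has all three error counters."""
    rows = []
    for line in text.splitlines():
        m = ROW_RE.match(line)
        if m and m.group(1) != "NAME":  # skip the header row
            name, state, read, write, cksum, note = m.groups()
            rows.append({"name": name, "state": state, "read": read,
                         "write": write, "cksum": cksum, "note": note or ""})
    return rows


rows = device_rows(SAMPLE)
unhealthy = [r["name"] for r in rows if r["state"] != "ONLINE"]
print(unhealthy)  # ['vault0', 'raidz2', 'da3']
```

Spare rows such as `da14 AVAIL` have no counters, so the pattern skips them.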

Code:

Sep 21 16:17:34 xxxxxx kernel: mpt0: request 0xffffff80005c36c0:26594 timed out for ccb 0xffffff00072db000 (req->ccb 0xffffff00072db000)
Sep 21 16:17:34 xxxxxx kernel: mpt0: attempting to abort req 0xffffff80005c36c0:26594 function 0
Sep 21 16:17:35 xxxxxx kernel: mpt0: mpt_wait_req(1) timed out
Sep 21 16:17:35 xxxxxx kernel: mpt0: mpt_recover_commands: abort timed-out. Resetting controller
Sep 21 16:18:59 xxxxxx kernel: mpt0: mpt_cam_event: 0x0
Sep 21 16:18:59 xxxxxx kernel: mpt0: mpt_cam_event: 0x0
Sep 21 16:18:59 xxxxxx kernel: mpt0: completing timedout/aborted req 0xffffff80005c36c0:26594
Sep 21 16:18:59 xxxxxx kernel: mpt0: mpt_cam_event: 0x16
Sep 21 16:18:59 xxxxxx kernel: mpt0: mpt_cam_event: 0x12
Sep 21 16:18:59 xxxxxx kernel: mpt0: mpt_cam_event: 0x12
Sep 21 16:18:59 xxxxxx kernel: mpt0: mpt_cam_event: 0x1b
Sep 21 16:18:59 xxxxxx kernel: mpt0: mpt_cam_event: 0x12
Sep 21 16:18:59 xxxxxx last message repeated 54 times
Sep 21 16:18:59 xxxxxx kernel: mpt0: mpt_cam_event: 0x16
Sep 21 16:18:59 xxxxxx kernel: (da3:mpt0:0:44:0): Synchronize cache failed, status == 0x4e, scsi status == 0x0
Sep 21 16:18:59 xxxxxx kernel: (da0:mpt0:0:41:0): WRITE(10). CDB: 2a 0 3 7a 54 5f 0 0 8 0 
Sep 21 16:18:59 xxxxxx kernel: (da0:mpt0:0:41:0): CAM Status: SCSI Status Error
Sep 21 16:18:59 xxxxxx kernel: (da0:mpt0:0:41:0): SCSI Status: Check Condition
Sep 21 16:18:59 xxxxxx kernel: (da0:mpt0:0:41:0): UNIT ATTENTION asc:29,0
Sep 21 16:18:59 xxxxxx kernel: (da0:mpt0:0:41:0): Power on, reset, or bus device reset occurred
Sep 21 16:18:59 xxxxxx kernel: (da0:mpt0:0:41:0): Retrying Command (per Sense Data)
---snip---
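
To get a feel for how often each event fires, the syslog lines can be tallied. A small Python sketch, assuming the message formats shown above; `tally` is a hypothetical helper, and it credits syslog's "last message repeated N times" line to the event line directly before it.

```python
import re
from collections import Counter

# Lines copied from the log above.
SAMPLE = """\
Sep 21 16:17:34 xxxxxx kernel: mpt0: request 0xffffff80005c36c0:26594 timed out for ccb 0xffffff00072db000 (req->ccb 0xffffff00072db000)
Sep 21 16:18:59 xxxxxx kernel: mpt0: mpt_cam_event: 0x16
Sep 21 16:18:59 xxxxxx kernel: mpt0: mpt_cam_event: 0x12
Sep 21 16:18:59 xxxxxx kernel: mpt0: mpt_cam_event: 0x1b
Sep 21 16:18:59 xxxxxx kernel: mpt0: mpt_cam_event: 0x12
Sep 21 16:18:59 xxxxxx last message repeated 54 times
Sep 21 16:18:59 xxxxxx kernel: mpt0: mpt_cam_event: 0x16
"""

EVENT_RE = re.compile(r"mpt\d+: mpt_cam_event: (0x[0-9a-fA-F]+)")
REPEAT_RE = re.compile(r"last message repeated (\d+) times")


def tally(text):
    """Count mpt_cam_event codes and request timeouts in syslog text."""
    events = Counter()
    timeouts = 0
    last_event = None  # event code of the previous line, if it was an event
    for line in text.splitlines():
        m = EVENT_RE.search(line)
        if m:
            last_event = m.group(1)
            events[last_event] += 1
            continue
        r = REPEAT_RE.search(line)
        if r and last_event is not None:
            events[last_event] += int(r.group(1))
            continue
        last_event = None
        if "timed out for ccb" in line:
            timeouts += 1
    return events, timeouts


events, timeouts = tally(SAMPLE)
print(dict(events), timeouts)
```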

Matty · Sep 22, 2010

My setup is way smaller:

Supermicro USAS-L8i SAS HBA.
It's basically an LSI 1068 chip. Not sure if it's the same one you've got, but it does look very similar.

Connected are 4 Samsung F3 1TB SATA drives via an Adaptec SAS->4SATA cable.

I think I overheated the card due to bad airflow, because after I removed the card to test it in another computer it would not initialize at all (so no LSI BIOS). When I connect a drive to the card, 1 of the 8 amber LEDs should change, and it doesn't even do that any more.

I sent the card for RMA. Will post updates.

Did you try changing some cables? Or maybe the combo with the expander isn't really working. When did the problems start?

quillo · Sep 22, 2010

Hey Matty,

I've always had the issue... Good news is that I managed to recover my array. I ended up having to do a hard reboot because the system just got stuck in a loop of SCSI bus resets (the system disks are running off this controller too until I get a bigger chassis).

I found a possible workaround reading this OSol bug report:
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6894775

Based off this, I made the following changes in /boot/loader.conf:

Code:

# Limit queue size for ZFS
vfs.zfs.vdev.min_pending="1"
vfs.zfs.vdev.max_pending="1"

# Disable MSI as a potential workaround for MPT being a colossal jerk
hw.pci.enable_msix="0"
hw.pci.enable_msi="0"

I rebooted the machine and found this has the unexpected effect of dramatically increasing multi-threaded IOPS, I'm guessing because the queue size of 35 was just too massive. I'm unsure at this stage if it has fixed my issue; I'm too scared to run another scrub in case it breaks everything again.
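
After a reboot it is worth confirming that the tunables actually took effect. A minimal Python sketch that compares `name="value"` lines from loader.conf with `name: value` lines as printed by `sysctl`; the function names and the sample sysctl text are assumptions for illustration, and in practice the two strings would come from /boot/loader.conf and from real `sysctl` output.

```python
# The settings from /boot/loader.conf above.
LOADER_CONF = '''\
# Limit queue size for ZFS
vfs.zfs.vdev.min_pending="1"
vfs.zfs.vdev.max_pending="1"
hw.pci.enable_msix="0"
hw.pci.enable_msi="0"
'''

# Sample text in sysctl's "name: value" form (illustrative input only).
SYSCTL_OUT = '''\
hw.pci.honor_msi_blacklist: 1
hw.pci.enable_msix: 0
hw.pci.enable_msi: 0
'''


def parse_loader_conf(text):
    """Return {name: value} for name="value" lines, ignoring comments."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip()] = value.strip().strip('"')
    return settings


def parse_sysctl(text):
    """Return {name: value} for 'name: value' lines."""
    values = {}
    for line in text.splitlines():
        name, sep, value = line.partition(": ")
        if sep:
            values[name.strip()] = value.strip()
    return values


def check(conf_text, sysctl_text):
    """Map each loader.conf tunable to 'ok', 'differs' or 'not reported'."""
    live = parse_sysctl(sysctl_text)
    result = {}
    for name, value in parse_loader_conf(conf_text).items():
        if name not in live:
            result[name] = "not reported"
        elif live[name] == value:
            result[name] = "ok"
        else:
            result[name] = "differs"
    return result


result = check(LOADER_CONF, SYSCTL_OUT)
for name, status in result.items():
    print(name, status)
```

Anything that shows up as "not reported" simply was not in the sysctl text that was passed in, so it needs a separate look.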

My HBA and expander both get very hot, especially so because I replaced all the high-speed Delta fans in my chassis with quieter alternatives. I also bought a PCI slot cooler, which I haven't installed yet as I don't think that heat is the main issue.

Matty · Sep 22, 2010

Found this and it may be worth a shot:

Code:

Eventually, I became desperate and flashed the IR (Integrated Raid) firmware over the top of the IT
firmware.  Since then, I have had no errors in dmesg of any kind.

I even removed the workarounds from /etc/system and still have had no issues.  The mpt driver is
exceptionally quiet now.

I'm interested to know if anyone who has a 1068E based card is having these problems using the IR firmware, or
if they all seem to be IT (initiator target) related.

http://comments.gmane.org/gmane.os.solaris.opensolaris.zfs/31982

Matty · Sep 22, 2010

quillo said:

Will try this when I get the card back. The card is broken (confirmed by the dealer) and has been sent back to Supermicro.

I had the pending settings at 2 and 10 for quite some time, but then again I hadn't had any issues with the pool for weeks.

Can't you disable msi for MPT only?

quillo · Sep 23, 2010

Matty said:
Can't you disable msi for MPT only?

Not sure, I couldn't find anything specific to MPT and I'm not really sure what to look for, e.g.:

Code:

[xxxxx@xxxxx ~]$ sysctl -a |grep mpt
kern.sched.preemption: 1
kern.sched.preempt_thresh: 64
dev.mpt.0.%desc: LSILogic SAS/SATA Adapter
dev.mpt.0.%driver: mpt
dev.mpt.0.%location: slot=0 function=0 handle=\_SB_.PCI0.MRP9.S7F0
dev.mpt.0.%pnpinfo: vendor=0x1000 device=0x0058 subvendor=0x1000 subdevice=0x3140 class=0x010000
dev.mpt.0.%parent: pci4
dev.mpt.0.debug: 3
dev.mpt.0.role: 1
[xxxxx@xxxxx ~]$ sysctl -a |grep msi
hw.bce.msi_enable: 1
hw.pci.honor_msi_blacklist: 1
hw.pci.enable_msix: 0
hw.pci.enable_msi: 0

Regarding the IR vs IT firmware... My understanding was that IR firmware has a maximum drive limit of 16 or something along those lines, which is slightly less than my requirements.