[updated 05-Jan-13] unstable mps driver on 9.0-RELEASE - how to proceed?

Hi All,

I have been experiencing a flaky mps driver on 9.0-RELEASE-amd64. This occurred during a 1.4TB rsync transfer of media over NFSv4 to the new FreeBSD server, which has a 10 x 2TB raidz2 pool on HDDs attached to 3 x IBM M1015s (LSI SAS2008 chips with IT firmware v14).

The three affected drives (out of four SATA3 HDDs) were all on the same mps0 device. da2 was removed from the zpool; da0 and da3 experienced errors but didn't drop out.

A reboot of the system brought everything back and da2 resilvered back into the zpool.

---------------

My questions relate to mentions of a new LSI-supported mps driver which, as I understand it, is NOT in 9.0-RELEASE-amd64...

Q1) Does adding mps_load="YES" to /boot/loader.conf in 9.0-RELEASE have any benefit at all with respect to using a different driver for the LSI SAS2008 cards?

Q2) Would upgrading to 9.0-STABLE be of great benefit to me since I also hear that this new mps driver is used in STABLE or should I wait for 9.1-RELEASE instead?

Q3) As an alternative approach, what about downloading mpslsi.ko and loading it?


Anyway, lots of questions, and I haven't found any definitive answers either way. Feedback would be much appreciated. Specs and dmesg to follow:

Code:
[cpu]		Intel Xeon E3-1220-V2
[mobo]		Supermicro X9SCM-F (bios 2.0a)
[ram]		(4x) Crucial CT51272BA1339 [4GB DDR3 Unbuffered ECC]
[ssd]		Crucial M4 64GB (fw 000F)
[sas card]	(3x) IBM M1015 (IT mode v14)
.....	
[hdd]		(4x) 2TB Seagate ST2000DL003
		(3x) 2TB WD WD20EARS
		(1x) 2TB WD WD20EARX
		(1x) 2TB Hitachi HDS5C3020ALA632 
		(1x) 2TB Samsung/Seagate ST2000DL004
[os]		FreeBSD 9.0-RELEASE amd64
[NFS]		v4
[ZFS]		v28, dedupe, compression OFF

Code:
Dec  4 21:25:11 e1220 kernel: mps0: <LSI SAS2008> port 0xe000-0xe0ff mem 0xf7a00000-0xf7a03fff,0xf7980000-0xf79bffff irq 16 at device 0.0 on pci1
Dec  4 21:25:11 e1220 kernel: mps0: Firmware: 14.00.00.00
Dec  4 21:25:11 e1220 kernel: mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
Dec  4 21:25:11 e1220 kernel: pcib2: <ACPI PCI-PCI bridge> irq 16 at device 1.1 on pci0
Dec  4 21:25:11 e1220 kernel: pci2: <ACPI PCI bus> on pcib2
Dec  4 21:25:11 e1220 kernel: mps1: <LSI SAS2008> port 0xd000-0xd0ff mem 0xf7400000-0xf7403fff,0xf7380000-0xf73bffff irq 17 at device 0.0 on pci2
Dec  4 21:25:11 e1220 kernel: mps1: Firmware: 14.00.00.00
Dec  4 21:25:11 e1220 kernel: mps1: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
Dec  4 21:25:11 e1220 kernel: pcib3: <ACPI PCI-PCI bridge> irq 19 at device 6.0 on pci0
Dec  4 21:25:11 e1220 kernel: pci3: <ACPI PCI bus> on pcib3
Dec  4 21:25:11 e1220 kernel: mps2: <LSI SAS2008> port 0xc000-0xc0ff mem 0xf6e00000-0xf6e03fff,0xf6d80000-0xf6dbffff irq 19 at device 0.0 on pci3
Dec  4 21:25:11 e1220 kernel: mps2: Firmware: 14.00.00.00
Dec  4 21:25:11 e1220 kernel: mps2: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
.....
.....

Dec  6 05:39:53 e1220 kernel: mps0: (0:2:0) terminated ioc 804b scsi 0 state c xfer 0
Dec  6 05:39:53 e1220 last message repeated 2 times
Dec  6 05:39:53 e1220 kernel: mps0: (0:2:0) terminated ioc 804b scsi 0 state c xfer 65536
Dec  6 05:39:53 e1220 kernel: mps0: (0:2:0) terminated ioc 804b scsi 0 state c xfer 0
Dec  6 05:39:53 e1220 last message repeated 15 times
Dec  6 05:47:31 e1220 kernel: mps0: (0:2:0) terminated ioc 804b scsi 0 state c xfer 0
Dec  6 05:47:31 e1220 last message repeated 18 times
Dec  6 07:23:54 e1220 su: leeandang to root on /dev/pts/0
Dec  6 18:02:46 e1220 kernel: mps0: (0:2:0) terminated ioc 804b scsi 0 state c xfer 0
Dec  6 18:02:46 e1220 kernel: mps0: (0:2:0) terminated ioc 804b scsi 0 state c xfer 0
Dec  6 18:02:46 e1220 kernel: mps0: (0:2:0) terminated ioc 804b scsi 0 state c xfer 16384
Dec  6 18:02:46 e1220 kernel: mps0: (0:2:0) terminated ioc 804b scsi 0 state c xfer 0
Dec  6 18:02:46 e1220 last message repeated 3 times
Dec  6 18:03:42 e1220 kernel: mps0: mpssas_remove_complete on target 0x0002, IOCStatus= 0x0
Dec  6 18:03:42 e1220 kernel: (da2:mps0:0:2:0): lost device - 0 outstanding
Dec  6 18:03:44 e1220 kernel: (da2:mps0:0:2:0): removing device entry
Dec  6 18:03:47 e1220 kernel: da2 at mps0 bus 0 scbus0 target 2 lun 0
Dec  6 18:03:47 e1220 kernel: da2: <ATA ST2000DL003-9VT1 CC32> Fixed Direct Access SCSI-6 device 
Dec  6 18:03:47 e1220 kernel: da2: 600.000MB/s transfers
Dec  6 18:03:47 e1220 kernel: da2: Command Queueing enabled
Dec  6 18:03:47 e1220 kernel: da2: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
Dec  6 18:03:54 e1220 kernel: mps0: Failure 0x4b reseting device 0x000a
Dec  6 18:06:31 e1220 kernel: (da3:mps0:0:3:0): READ(10). CDB: 28 0 11 16 5e a8 0 0 20 0 
Dec  6 18:06:31 e1220 kernel: (da3:mps0:0:3:0): CAM status: SCSI Status Error
Dec  6 18:06:31 e1220 kernel: (da3:mps0:0:3:0): SCSI status: Check Condition
Dec  6 18:06:31 e1220 kernel: (da3:mps0:0:3:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Dec  6 18:20:52 e1220 kernel: (da2:mps0:0:2:0): SCSI command timeout on device handle 0x000a SMID 899
Dec  6 18:20:52 e1220 kernel: mps0: mpssas_abort_complete: abort request on handle 0x0a SMID 899 complete
Dec  6 18:21:29 e1220 login: ROOT LOGIN (root) ON ttyv0
Dec  6 18:21:52 e1220 kernel: (da2:mps0:0:2:0): SCSI command timeout on device handle 0x000a SMID 368
Dec  6 18:21:52 e1220 kernel: mps0: mpssas_abort_complete: abort request on handle 0x0a SMID 368 complete
Dec  6 18:21:54 e1220 login: ROOT LOGIN (root) ON ttyv1
Dec  6 18:22:52 e1220 kernel: (da2:mps0:0:2:0): SCSI command timeout on device handle 0x000a SMID 784
Dec  6 18:22:52 e1220 kernel: mps0: mpssas_abort_complete: abort request on handle 0x0a SMID 784 complete
Dec  6 18:23:52 e1220 kernel: (da2:mps0:0:2:0): SCSI command timeout on device handle 0x000a SMID 537
Dec  6 18:23:52 e1220 kernel: mps0: mpssas_abort_complete: abort request on handle 0x0a SMID 537 complete
...
Dec  6 18:42:35 e1220 kernel: mps0: mpssas_abort_complete: abort request on handle 0x0a SMID 427 complete
Dec  6 20:56:07 e1220 kernel: (da0:mps0:0:0:0): READ(10). CDB: 28 0 16 7a 1f 20 0 0 20 0 
Dec  6 20:56:07 e1220 kernel: (da0:mps0:0:0:0): CAM status: SCSI Status Error
Dec  6 20:56:07 e1220 kernel: (da0:mps0:0:0:0): SCSI status: Check Condition
Dec  6 20:56:07 e1220 kernel: (da0:mps0:0:0:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
....
Dec  6 21:21:40 e1220 kernel: (da3:mps0:0:3:0): READ(10). CDB: 28 0 17 47 e7 68 0 0 20 0 
Dec  6 21:21:40 e1220 kernel: (da3:mps0:0:3:0): CAM status: SCSI Status Error
Dec  6 21:21:40 e1220 kernel: (da3:mps0:0:3:0): SCSI status: Check Condition
Dec  6 21:21:40 e1220 kernel: (da3:mps0:0:3:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
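
With a log this long it can help to tally how many messages each disk is producing. Below is a minimal Python sketch, assuming syslog lines in the format shown above. The function name `tally_device_events` and the regex are illustrative, not part of any FreeBSD tool. It only counts lines that carry a CAM device tag such as `(da2:mps0:0:2:0):`, so the bare `mps0: (0:2:0) terminated ...` lines are not included.

```python
import re
from collections import Counter

# Matches CAM device tags such as "(da2:mps0:0:2:0):" in kernel log lines.
# Also accepts the mpslsi driver name, in case that module is loaded instead.
DEVICE_TAG = re.compile(r"\((da\d+):(mps(?:lsi)?\d+):\d+:\d+:\d+\):")

def tally_device_events(lines):
    """Count tagged log lines per (controller, disk) pair."""
    counts = Counter()
    for line in lines:
        match = DEVICE_TAG.search(line)
        if match:
            counts[(match.group(2), match.group(1))] += 1
    return counts

# A few lines copied from the log above.
sample_lines = [
    "Dec  6 18:03:42 e1220 kernel: (da2:mps0:0:2:0): lost device - 0 outstanding",
    "Dec  6 18:06:31 e1220 kernel: (da3:mps0:0:3:0): CAM status: SCSI Status Error",
    "Dec  6 18:20:52 e1220 kernel: (da2:mps0:0:2:0): SCSI command timeout on device handle 0x000a SMID 899",
    "Dec  6 18:20:52 e1220 kernel: mps0: mpssas_abort_complete: abort request on handle 0x0a SMID 899 complete",
    "Dec  6 20:56:07 e1220 kernel: (da0:mps0:0:0:0): CAM status: SCSI Status Error",
]

counts = tally_device_events(sample_lines)
for (controller, disk), n in sorted(counts.items()):
    print(controller, disk, n)
```

To run it over a whole log, pass an open file instead of the sample list, for example `tally_device_events(open("/var/log/messages"))`, assuming that is where the kernel messages are being written.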
 
Although I did not have problems using the old driver with newer firmware, I'm still using the newer driver from the LSI web site in 9.0-RELEASE:
http://www.lsi.com/products/storagecomponents/Pages/LSISAS9211-8i.aspx

The old driver reportedly has drop-out problems and worse error handling than the new one, so I would install the new one.

If you install the driver as /boot/modules/mpslsi.ko (IMO third-party modules should not reside in /boot/kernel, which is where the LSI instructions want you to put them) and put
Code:
mpslsi_load="YES"
in /boot/loader.conf, the built-in driver will be overridden. I'm actually using a Supermicro X8SI6-F board; here are its kernel messages:

Code:
Nov  6 23:58:12 server kernel: mpslsi0: <LSI SAS2008> port 0xc000-0xc0ff mem 0xfb43c000-0xfb43ffff,0xfb440000-0xfb47ffff irq 16 at device 0.0 on pci2
Nov  6 23:58:12 server kernel: mpslsi0: Firmware: 14.00.01.00, Driver: 14.00.00.00
Nov  6 23:58:12 server kernel: mpslsi0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
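
One way to confirm which driver actually attached is to look at the device name prefix on the "Firmware:" line: `mpslsi0` for the LSI module, `mps0` for the built-in driver. The Python sketch below does that for lines in the format shown in this thread. The name `attached_controllers` and the regex are hypothetical helpers for illustration only, and other driver versions may print this line differently.

```python
import re

# Matches e.g. "kernel: mpslsi0: Firmware: 14.00.01.00, Driver: 14.00.00.00"
# and also   "kernel: mps0: Firmware: 14.00.00.00" (no driver version printed).
ATTACH = re.compile(
    r"kernel: (mps(?:lsi)?)(\d+): Firmware: ([\d.]+)(?:, Driver: ([\d.]+))?"
)

def attached_controllers(lines):
    """Return one dict per controller 'Firmware:' line found in the log."""
    found = []
    for line in lines:
        match = ATTACH.search(line)
        if match:
            found.append({
                "driver": match.group(1),
                "unit": int(match.group(2)),
                "firmware": match.group(3),
                "driver_version": match.group(4),  # None if not printed
            })
    return found

# Lines copied from the two boot logs quoted in this thread.
boot_lines = [
    "Dec  4 21:25:11 e1220 kernel: mps0: Firmware: 14.00.00.00",
    "Nov  6 23:58:12 server kernel: mpslsi0: Firmware: 14.00.01.00, Driver: 14.00.00.00",
]

controllers = attached_controllers(boot_lines)
for c in controllers:
    print(c["driver"], c["unit"], c["firmware"], c["driver_version"])
```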
 
thethirdnut said:
Q2) Would upgrading to 9.0-STABLE be of great benefit to me since I also hear that this new mps driver is used in STABLE or should I wait for 9.1-RELEASE instead?

Although it has not been officially announced, an SVN update will get you to 9.1-RELEASE, which includes the proper driver for your card.
 
@Sfynx, @gkontos

Thank you both for your replies.

I will try loading the LSI module manually for now, and potentially move up to 9.1-RELEASE once it is official. I have everything ready; I just need to test, and will report back.
 
This is still in the TBD phase... I have loaded the v15 mpslsi driver, first with v14 and then with v15 of the IT firmware on the SAS2008 cards.

I'm still having issues. HOWEVER, I now have a CONFIRMED dead HDD after the latest mpslsi freak-out... it brutally fails both short and long smartctl tests and is to be RMA'd.

Hopeful question on my part: could this bad HDD be the cause of the mps / mpslsi freak-outs? It was on the same port, backplane and HBA port where these other issues occurred: drives would drop out and then come back after a reboot.

In other words, when stressed, this HDD was on the same HBA port as the other drives that were experiencing issues. Short tests on the 9 other HDDs have thus far all come up clean... I'll run long tests on all of them overnight as well and then stress-test them further.

What do you folks think about the possibility of this one bad HDD causing the mps driver instability? Possible, probable, or merely wishful thinking, and I'll likely still have issues?

TIA
 
This is now SOLVED - thanks all for the assistance.

It turns out the root cause of the issue was 2 dud HDDs out of 10. The dud drives each had approx 300 hours on them before the problem popped up. One was a 2TB Seagate ST2000DL003 and the other was a 2TB WD WD20EARX. They both had non-zero 'Current_Pending_Sector' and 'Offline_Uncorrectable' values, failed smartctl long tests brutally, and ZFS just generally gave up on them after accruing mps errors.
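
For anyone wanting to check their own drives for the same two attributes, here is a minimal Python sketch that scans the attribute table printed by `smartctl -A` and reports non-zero raw values. The helper name `suspicious_attributes` is illustrative, and the sample text is made-up data in the usual smartctl column layout, not output from the drives in this thread.

```python
WATCHED = ("Current_Pending_Sector", "Offline_Uncorrectable")

def suspicious_attributes(smartctl_output):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0.

    Assumes the usual `smartctl -A` layout, where the attribute name is the
    second column and RAW_VALUE is the tenth.
    """
    found = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[1] in WATCHED and fields[9].isdigit():
            raw = int(fields[9])
            if raw > 0:
                found[fields[1]] = raw
    return found

# Hypothetical sample, for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
"""

flagged = suspicious_attributes(SAMPLE)
print(flagged)
```

In practice the input would be captured from something like `smartctl -A /dev/da2`. A non-zero value is a reason to run a long self-test on that drive, not proof of failure on its own.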

I replaced them both with 2 x 2TB WD20EFRX (Red). I also upgraded the M1015 firmware and the mpslsi driver to v15, and upgraded to 9.1-RELEASE.

After a 4.4TB transfer plus reading back md5sums (about 9TB effective) and a scrub of 7TB of data, it didn't skip a beat... worked very well.
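
The read-back check can be scripted along these lines. This is a minimal Python sketch, assuming you already have a mapping of file paths to the md5 digests recorded on the source side; `md5_of_file` and `find_mismatches` are illustrative names, not the tooling actually used for the test above.

```python
import hashlib

def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the md5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_mismatches(expected):
    """Given {path: expected_md5_hex}, return the paths whose digest differs."""
    return [path for path, want in expected.items()
            if md5_of_file(path) != want.lower()]
```

Reading every file back this way forces the pool to deliver all of the data once more, which is what makes it a useful stress test on top of a scrub.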

Now I just need to knock on wood that no other drives fail on me soon. BTW, all of these drives had been burnt in to some degree before being put into use... the test I have added now is an extra smartctl long test before inserting drives into machines.
 