ZFS Seagate ST10000NM0016 (10TB Enterprise) with LSI 9207-8i

Hi,

I have an issue with Seagate ST10000NM0016 drives sporadically refusing to work. 8 of them are attached to a 9207-8i controller and assembled in a RAIDZ2. At unpredictable intervals, drives throw errors like

Code:
(da6:mps2:0:45:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 365 command timeout cm 0xfffffe0001006f10 ccb 0xfffff801668ce800
        (noperiph:mps2:0:4294967295:0): SMID 2 Aborting command 0xfffffe0001006f10
mps2: Sending reset from mpssas_send_abort for target ID 45
        (da6:mps2:0:45:0): WRITE(16). CDB: 8a 00 00 00 00 01 8c 17 f6 68 00 00 00 08 00 00 length 4096 SMID 783 terminated ioc 804b scsi 0 state c xfer 0
mps2: Unfreezing devq for target ID 45
(da6:mps2:0:45:0): WRITE(16). CDB: 8a 00 00 00 00 01 8c 17 f6 68 00 00 00 08 00 00
(da6:mps2:0:45:0): CAM status: CCB request completed with an error
(da6:mps2:0:45:0): Retrying command
(da6:mps2:0:45:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da6:mps2:0:45:0): CAM status: Command timeout
(da6:mps2:0:45:0): Retrying command
(da6:mps2:0:45:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da6:mps2:0:45:0): CAM status: SCSI Status Error
(da6:mps2:0:45:0): SCSI status: Check Condition
(da6:mps2:0:45:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da6:mps2:0:45:0): Error 6, Retries exhausted
(da6:mps2:0:45:0): Invalidating pack

The SCSI opcodes that I have observed failing are WRITE(16), READ(16), and SYNCHRONIZE CACHE(10). The problem occurs with every drive I have and does not seem to correlate with load: a 3 TB resilver typically completes without issues, while a light sequential read may trigger an error.
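To see which devices and opcodes are actually affected, the kernel messages can be tallied mechanically. The following is a minimal sketch, assuming the log lines have the same shape as the excerpt above; the function names `tally_opcodes` and `decode_rw16` are made up for illustration and are not part of any FreeBSD tool.

```python
import re
from collections import Counter

# Matches CAM lines such as:
#   (da6:mps2:0:45:0): WRITE(16). CDB: 8a 00 ...
# Only daN peripherals are counted; (noperiph:...) lines are skipped.
CMD_RE = re.compile(
    r"\((?P<dev>da\d+):[^)]*\):\s+(?P<op>[A-Z][A-Z ]*\(\d+\))\. CDB:"
)


def tally_opcodes(log_text):
    """Count (device, opcode) pairs over all CDB lines in a log excerpt."""
    counts = Counter()
    for line in log_text.splitlines():
        m = CMD_RE.search(line)
        if m:
            counts[(m.group("dev"), m.group("op"))] += 1
    return counts


def decode_rw16(cdb_hex):
    """Return (lba, blocks) for a READ(16) or WRITE(16) CDB given as hex bytes."""
    b = bytes.fromhex(cdb_hex)
    if len(b) != 16 or b[0] not in (0x88, 0x8A):
        raise ValueError("not a READ(16)/WRITE(16) CDB")
    lba = int.from_bytes(b[2:10], "big")      # bytes 2..9: logical block address
    blocks = int.from_bytes(b[10:14], "big")  # bytes 10..13: transfer length
    return lba, blocks


if __name__ == "__main__":
    lba, blocks = decode_rw16("8a 00 00 00 00 01 8c 17 f6 68 00 00 00 08 00 00")
    # With 512-byte logical sectors, 8 blocks is the 4096 bytes shown in the log.
    print(hex(lba), blocks, blocks * 512)
```

Feeding the full contents of /var/log/messages to `tally_opcodes` gives a per-drive count that makes it easier to tell whether one disk or all of them are involved.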

The same setup worked for several years with WD disks with nary a hiccup.

Can someone recommend next steps for me to try for debugging?

Further details about my system:
Code:
# uname -a
FreeBSD ... 11.0-STABLE FreeBSD 11.0-STABLE #0 r321665+c0805687fec(freenas/11.0-stable): Tue Sep  5 16:07:24 UTC 2017     root@gauntlet:/freenas-11-releng/freenas/_BE/objs/freenas-11-releng/freenas/_BE/os/sys/FreeNAS.amd64  amd64

Code:
# camcontrol identify da3
pass3: <ST10000NM0016-1TT101 SNB0> ACS-3 ATA SATA 3.x device
pass3: 600.000MB/s transfers, Command Queueing Enabled

protocol              ATA/ATAPI-10 SATA 3.x
device model          ST10000NM0016-1TT101
firmware revision     SNB0
serial number         ...
WWN                   ...
cylinders             16383
heads                 16
sectors/track         63
sector size           logical 512, physical 4096, offset 0
LBA supported         268435455 sectors
LBA48 supported       19532873728 sectors
PIO supported         PIO4
DMA supported         WDMA2 UDMA6
media RPM             7200

Code:
# lspci -vvv
....
05:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)
        Subsystem: LSI Logic / Symbios Logic 9207-8i SAS2.1 HBA
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 16
        Region 0: I/O ports at b000
        Region 1: Memory at fb2b0000 (64-bit, non-prefetchable)
        Region 3: Memory at fb2c0000 (64-bit, non-prefetchable)
        Expansion ROM at fb300000 [disabled]
        Capabilities: [50] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [68] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 4096 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L0s, Exit Latency L0s <64ns, L1 <1us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range BC, TimeoutDis+, LTR-, OBFF Not Supported
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [d0] Vital Product Data
                Not readable
        Capabilities: [a8] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [c0] MSI-X: Enable+ Count=16 Masked-
                Vector table: BAR=1 offset=0000e000
                PBA: BAR=1 offset=0000f000
 
The same setup worked for several years with WD disks with nary a hiccup.
You might be hitting some power issues. It's possible the Seagates draw more power than the WDs and it's pushing your power-supply past its limits.
 
A timeout usually means that communication somewhere in the stack (between the host OS and the drive) was disrupted. Power is one possible cause, as SirDice said. Another common cause is back-level firmware (in SAS HBAs, expanders, or drives). Occasionally, faulty SAS cables cause problems like this (although that is much rarer than with SATA cables). I would check whether any firmware can be upgraded (don't forget that if you have any SAS backplanes, they may have expander chips with firmware), and reseat all data cables.
 
So there is indeed newer firmware for the LSI 9207-8i available, and the system is warning me about mps driver mismatching the firmware:

Code:
mps0: Firmware: 17.00.01.00, Driver: 21.01.00.00-fbsd

The latest firmware available from Broadcom is P20, from April 2016. However, there seem to be a lot of people unhappy with P20, some even deliberately downgrading to P16.

Before I upgrade - does anyone here have any experiences with running LSI 9207-8i with P20 firmware?
 
Code:
mps0: Firmware: 20.00.02.00, Driver: 21.02.00.00-fbsd
Code:
root@molly:~ # mpsutil show adapter
mps0 Adapter:
       Board Name: SAS9207-8i
   Board Assembly: H3-25412-00K
        Chip Name: LSISAS2308
    Chip Revision: ALL
    BIOS Revision: 7.39.00.00
Firmware Revision: 20.00.02.00
  Integrated RAID: no

PhyNum  CtlrHandle  DevHandle  Disabled  Speed   Min    Max    Device
0       0001        0009       N         6.0     1.5    6.0    SAS Initiator
1       0002        000a       N         6.0     1.5    6.0    SAS Initiator
2       0003        000b       N         6.0     1.5    6.0    SAS Initiator
3       0004        000c       N         6.0     1.5    6.0    SAS Initiator
4       0005        000d       N         6.0     1.5    6.0    SAS Initiator
5       0006        000e       N         6.0     1.5    6.0    SAS Initiator
6       0007        000f       N         6.0     1.5    6.0    SAS Initiator
7       0008        0010       N         6.0     1.5    6.0    SAS Initiator
Seems to be running fine since I got it. But this is on my home server so I'm less concerned about performance or other issues.
 
So there is indeed newer firmware for the LSI 9207-8i available, and the system is warning me about mps driver mismatching the firmware:

Code:
mps0: Firmware: 17.00.01.00, Driver: 21.01.00.00-fbsd
That is informational, not a warning. You'll still see it with P20. Mismatched firmware / driver is more critical when a new series is launched, as a newer driver may issue requests not supported by older firmware. But the SAS2 firmware interface has been stable for quite some time.
The latest firmware available from Broadcom is P20, from April 2016. However, there seem to be a lot of people unhappy with P20, some even deliberately downgrading to P16.
20.00.00.00 had some serious problems. By the time 20.00.04.00 came out (4th rebuild release) most of them had been squared away, and 20.00.07.00 has been rock solid everywhere I've used it.
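As a rough illustration of that rule of thumb, here is a minimal sketch that parses the mps boot message and sorts the firmware version into the buckets described above. The thresholds come only from this post, the bucket labels and function names are invented for the example, and it applies to SAS2 (mps) controllers only, since SAS3 (mpr) firmware uses a different numbering.

```python
import re

# Matches the boot message, e.g.:
#   mps0: Firmware: 17.00.01.00, Driver: 21.01.00.00-fbsd
FW_RE = re.compile(r"^(?P<ctl>mps\d+): Firmware: (?P<fw>\d+(?:\.\d+){3}), Driver: (?P<drv>\S+)")


def parse_firmware(line):
    """Return (controller, firmware_tuple) or None if the line does not match."""
    m = FW_RE.match(line.strip())
    if not m:
        return None
    fw = tuple(int(part) for part in m.group("fw").split("."))
    return m.group("ctl"), fw


def assess(fw):
    """Bucket a SAS2 firmware version using the thresholds quoted in this thread."""
    if fw[0] < 20:
        return "pre-P20"            # older phase, opinions in the thread differ
    if fw < (20, 0, 4, 0):
        return "early-P20"          # reported to have serious problems
    if fw < (20, 0, 7, 0):
        return "mid-P20"            # most problems reported fixed
    return "P20-07-or-later"        # reported as solid


if __name__ == "__main__":
    for line in (
        "mps0: Firmware: 17.00.01.00, Driver: 21.01.00.00-fbsd",
        "mps0: Firmware: 20.00.02.00, Driver: 21.02.00.00-fbsd",
    ):
        ctl, fw = parse_firmware(line)
        print(ctl, ".".join(f"{p:02d}" for p in fw), assess(fw))
```

The driver version is deliberately ignored here, because the firmware/driver line is informational and the two numbers are not expected to match.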
 
Warning: My data applies to the 6Gbit LSI SAS HBAs, and not necessarily to the 12Gbit ones. It also applies to Linux hosts using the LSI driver (which may be slightly different from the LSI mps/mpt driver that's in FreeBSD). With those caveats:

Any LSI firmware that's less than version 20 is very bad trouble, and is to be avoided. When firmware 20 first came out, it fixed many problems, but introduced one new serious one, which was very quickly fixed in a minor spin which I refer to as "20-4". That's probably the 20.00.04.00 that Terry is referring to. Since then, I only know of minor issues that were fixed (at least in mainstream free-market cards, the OEM branded ones with system-specific special firmware have had extra new versions). Unfortunately, the exact version numbers are in e-mails that are with my former employer, so I can't easily look up the details.

So what I'm really saying is: Agree with Terry (except that I don't have the details).
 
20.00.00.00 had some serious problems. By the time 20.00.04.00 came out (4th rebuild release) most of them had been squared away, and 20.00.07.00 has been rock solid everywhere I've used it.
As it appears I'm still at 20.00.02.00, I might give 20.00.07.00 a go some time tonight. Thanks for reminding me about the firmware :)

I haven't run into any issues yet, but like I said, it's my home server. It does get used a lot, there's about 12 TB of ZFS pools attached to it. Mostly filled with crap I downloaded over the years. It would be painful if it got destroyed but nothing life-threatening. And big part of the fun of collecting is the hunt :D
 
SirDice: You're probably OK. If I remember right, the problems with the firmware were that under intense workload (we had roughly 100 disks connected per HBA) the adapter would "forget" IOs under rare conditions, which then would cause the OS driver to reset the adapter, which causes other IOs to be forgotten (deliberately), and the whole thing goes into runaway. Most of these problems are only visible when doing an enormous number of IOs at the same time (we often had several thousand IOs outstanding). A single-user file system, where only a few files are used at once, probably wouldn't experience this.
 
I upgraded to 20.00.07.00 last week. On the positive side, there appear to be fewer SCSI CDB errors now. On the other hand, I still get them.

The errors seem to be triggered by writes, even under a tiny load of 40 KB/s. On the other hand, 3 successive scrubs (of a 70 TB pool) did not trigger any. I think this excludes the cables and, together with the fact that I now have half as many Seagate disks as I used to have WDs, the power supply as well.

If anyone had good experience with ST10000NM0016, what controller are you using?
 
I have the same issues with this exact drive. I even switched from an LSI 2008 controller (firmware 20.00.07.00) to an LSI 3008 (firmware 15.00.00.00). Both are in IT mode with the newest firmware available.

The errors happen for me about once every one or two weeks, usually when I am not home and the system is idle. My feeling is that this is some kind of power-saving or wake-up time issue.

This is my latest error:

Code:
mpr0: Sending reset from mprsas_send_abort for target ID 6
   (da3:mpr0:0:6:0): WRITE(16). CDB: 8a 00 00 00 00 02 d5 9f 7f 40 00 00 00 08 00 00 length 4096 SMID 971 terminated ioc 804b scsi 0 state c xfer 0
mpr0: Unfreezing devq for target ID 6
(da3:mpr0:0:6:0): WRITE(16). CDB: 8a 00 00 00 00 02 d5 9f 7f 40 00 00 00 08 00 00
(da3:mpr0:0:6:0): CAM status: CCB request completed with an error
(da3:mpr0:0:6:0): Retrying command
(da3:mpr0:0:6:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da3:mpr0:0:6:0): CAM status: Command timeout
(da3:mpr0:0:6:0): Retrying command
(da3:mpr0:0:6:0): WRITE(16). CDB: 8a 00 00 00 00 02 d5 9f 7f 40 00 00 00 08 00 00
(da3:mpr0:0:6:0): CAM status: SCSI Status Error
(da3:mpr0:0:6:0): SCSI status: Check Condition
(da3:mpr0:0:6:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da3:mpr0:0:6:0): Retrying command (per sense data)
   (da3:mpr0:0:6:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 594 terminated ioc 804b scsi 0 state c xfer 0
(da3:mpr0:0:6:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da3:mpr0:0:6:0): CAM status: CCB request completed with an error
(da3:mpr0:0:6:0): Retrying command
(da3:mpr0:0:6:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
(da3:mpr0:0:6:0): CAM status: SCSI Status Error
(da3:mpr0:0:6:0): SCSI status: Check Condition
(da3:mpr0:0:6:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da3:mpr0:0:6:0): Error 6, Retries exhausted
(da3:mpr0:0:6:0): Invalidating pack
   (pass3:mpr0:0:6:0): ATA COMMAND PASS THROUGH(16). CDB: 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00 length 0 SMID 893 terminated ioc 804b scsi 0 state c xfer 0
   (pass3:mpr0:0:6:0): ATA COMMAND PASS THROUGH(16). CDB: 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00 length 0 SMID 266 terminated ioc 804b scsi 0 state c xfer 0
   (pass3:mpr0:0:6:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d0 00 01 00 00 00 4f 00 c2 00 b0 00 length 512 SMID 424 terminated ioc 804b scsi 0 state c xfer 0
   (pass3:mpr0:0:6:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d0 00 01 00 00 00 4f 00 c2 00 b0 00 length 512 SMID 265 terminated ioc 804b scsi 0 state c xfer 0
   (pass3:mpr0:0:6:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 06 00 4f 00 c2 00 b0 00 length 512 SMID 492 terminated ioc 804b scsi 0 state c xfer 0
   (pass3:mpr0:0:6:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 06 00 4f 00 c2 00 b0 00 length 512 SMID 682 terminated ioc 804b scsi 0 state c xfer 0
   (pass3:mpr0:0:6:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 01 00 4f 00 c2 00 b0 00 length 512 SMID 764 terminated ioc 804b scsi 0 state c xfer 0
   (pass3:mpr0:0:6:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 01 00 4f 00 c2 00 b0 00 length 512 SMID 524 terminated ioc 804b scsi 0 state c xfer 0
   (pass3:mpr0:0:6:0): ATA COMMAND PASS THROUGH(16). CDB: 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00 length 0 SMID 417 terminated ioc 804b scsi 0 state c xfer 0
   (pass3:mpr0:0:6:0): ATA COMMAND PASS THROUGH(16). CDB: 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00 length 0 SMID 572 terminated ioc 804b scsi 0 state c xfer 0
   (pass3:mpr0:0:6:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d0 00 01 00 00 00 4f 00 c2 00 b0 00 length 512 SMID 899 terminated ioc 804b scsi 0 state c xfer 0
   (pass3:mpr0:0:6:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d0 00 01 00 00 00 4f 00 c2 00 b0 00 length 512 SMID 916 terminated ioc 804b scsi 0 state c xfer 0
   (pass3:mpr0:0:6:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 06 00 4f 00 c2 00 b0 00 length 512 SMID 550 terminated ioc 804b scsi 0 state c xfer 0
   (pass3:mpr0:0:6:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 06 00 4f 00 c2 00 b0 00 length 512 SMID 999 terminated ioc 804b scsi 0 state c xfer 0
   (pass3:mpr0:0:6:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 01 00 4f 00 c2 00 b0 00 length 512 SMID 744 terminated ioc 804b scsi 0 state c xfer 0
   (pass3:mpr0:0:6:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 01 00 4f 00 c2 00 b0 00 length 512 SMID 371 terminated ioc 804b scsi 0 state c xfer 0
   (pass3:mpr0:0:6:0): ATA COMMAND PASS THROUGH(16). CDB: 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00 length 0 SMID 999 terminated ioc 804b scsi 0 state c xfer 0
   (pass3:mpr0:0:6:0): ATA COMMAND PASS THROUGH(16). CDB: 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00 length 0 SMID 744 terminated ioc 804b scsi 0 state c xfer 0
   (pass3:mpr0:0:6:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d0 00 01 00 00 00 4f 00 c2 00 b0 00 length 512 SMID 371 terminated ioc 804b scsi 0 state c xfer 0
   (pass3:mpr0:0:6:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d0 00 01 00 00 00 4f 00 c2 00 b0 00 length 512 SMID 508 terminated ioc 804b scsi 0 state c xfer 0
   (pass3:mpr0:0:6:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 06 00 4f 00 c2 00 b0 00 length 512 SMID 236 terminated ioc 804b scsi 0 state c xfer 0
   (pass3:mpr0:0:6:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 06 00 4f 00 c2 00 b0 00 length 512 SMID 356 terminated ioc 804b scsi 0 state c xfer 0
   (pass3:mpr0:0:6:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 01 00 4f 00 c2 00 b0 00 length 512 SMID 271 terminated ioc 804b scsi 0 state c xfer 0
   (pass3:mpr0:0:6:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 01 00 4f 00 c2 00 b0 00 length 512 SMID 211 terminated ioc 804b scsi 0 state c xfer 0
   (pass3:mpr0:0:6:0): ATA COMMAND PASS THROUGH(16). CDB: 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00 length 0 SMID 592 terminated ioc 804b scsi 0 state c xfer 0
   (pass3:mpr0:0:6:0): ATA COMMAND PASS THROUGH(16). CDB: 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00 length 0 SMID 477 terminated ioc 804b scsi 0 state c xfer 0
   (pass3:mpr0:0:6:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d0 00 01 00 00 00 4f 00 c2 00 b0 00 length 512 SMID 424 terminated ioc 804b scsi 0 state c xfer 0
   (pass3:mpr0:0:6:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d0 00 01 00 00 00 4f 00 c2 00 b0 00 length 512 SMID 266 terminated ioc 804b scsi 0 state c xfer 0
   (pass3:mpr0:0:6:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 06 00 4f 00 c2 00 b0 00 length 512 SMID 265 terminated ioc 804b scsi 0 state c xfer 0
   (pass3:mpr0:0:6:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 06 00 4f 00 c2 00 b0 00 length 512 SMID 492 terminated ioc 804b scsi 0 state c xfer 0
   (pass3:mpr0:0:6:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 01 00 4f 00 c2 00 b0 00 length 512 SMID 524 terminated ioc 804b scsi 0 state c xfer 0
   (pass3:mpr0:0:6:0): ATA COMMAND PASS THROUGH(16). CDB: 85 08 0e 00 d5 00 01 00 01 00 4f 00 c2 00 b0 00 length 512 SMID 764 terminated ioc 804b scsi 0 state c xfer 0
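
The ATA COMMAND PASS THROUGH(16) lines can be decoded by hand. A minimal sketch, assuming the standard SAT layout of the 16-byte CDB (features in byte 4, count in byte 6, LBA low/mid/high in bytes 8/10/12, ATA command in byte 14); the function name is made up for illustration. It suggests these are SMART queries, probably from a monitoring tool such as smartd (an assumption, the log does not say what issued them), that were terminated after the device had already been invalidated.

```python
# ATA SMART subcommands, selected via the FEATURES register.
SMART_FEATURES = {
    0xD0: "SMART READ DATA",
    0xD5: "SMART READ LOG",
    0xDA: "SMART RETURN STATUS",
}

# SAT protocol field values that appear in the log above.
PROTOCOLS = {3: "non-data", 4: "PIO data-in"}


def decode_ata_pt16(cdb_hex):
    """Decode the interesting fields of an ATA PASS-THROUGH(16) CDB."""
    b = bytes.fromhex(cdb_hex)
    if len(b) != 16 or b[0] != 0x85:
        raise ValueError("not an ATA PASS-THROUGH(16) CDB")
    protocol = (b[1] >> 1) & 0x0F
    features, count = b[4], b[6]
    lba_low, lba_mid, lba_high = b[8], b[10], b[12]
    command = b[14]
    description = "unknown"
    # ATA command 0xB0 with the 0x4F/0xC2 signature in LBA mid/high is SMART.
    if command == 0xB0 and (lba_mid, lba_high) == (0x4F, 0xC2):
        description = SMART_FEATURES.get(features, "SMART (other subcommand)")
    return {
        "protocol": PROTOCOLS.get(protocol, str(protocol)),
        "command": command,
        "features": features,
        "count": count,
        "lba_low": lba_low,  # for SMART READ LOG this is the log address
        "description": description,
    }


if __name__ == "__main__":
    for cdb in (
        "85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00",
        "85 08 0e 00 d0 00 01 00 00 00 4f 00 c2 00 b0 00",
        "85 08 0e 00 d5 00 01 00 06 00 4f 00 c2 00 b0 00",
    ):
        print(decode_ata_pt16(cdb))
```

If that reading is right, the pass-through failures are a consequence of the drive dropping off the bus rather than a separate problem.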
 
Hi, I'm trying to wake up this old thread. I have the same drives as you and am experiencing the same issues.
Do you have any recommendations for what I should try? I have changed cables and HBA. Running the LSI 3008 using the 20.00.00.00 driver...
 
Hi, I'm trying to wake up this old thread. I have the same drives as you and am experiencing the same issues.
Do you have any recommendations for what I should try? I have changed cables and HBA. Running the LSI 3008 using the 20.00.00.00 driver...
Do you mean the 20.00.00.00 driver or the 20.00.00.00 firmware? If you meant firmware, all 20.00.xx.00 versions older than 20.00.04.00 had severe problems, and the last problems weren't fixed until 20.00.07.00.
 