Other SCSI Status Error with Status: OK?

Greetings!

Yesterday we had a server crash, and browsing through the logs produced the following picture:

First, lots of these messages appeared in the logs:
Code:
<2>Jul 30 04:10:56 storage kernel: (da0:mrsas0:0:0:0): READ(16). CDB: 88 00 00 00 00 06 f8 91 29 68 00 00 01 00 00 00
<2>Jul 30 04:10:56 storage kernel: (da0:mrsas0:0:0:0): CAM status: SCSI Status Error
<2>Jul 30 04:10:56 storage kernel: (da0:mrsas0:0:0:0): SCSI status: OK
<2>Jul 30 04:10:56 storage kernel: (da0:mrsas0:0:0:0): READ(16). CDB: 88 00 00 00 00 06 f8 91 20 e8 00 00 01 00 00 00
<2>Jul 30 04:10:56 storage kernel: (da0:mrsas0:0:0:0): CAM status: SCSI Status Error
<2>Jul 30 04:10:56 storage kernel: (da0:mrsas0:0:0:0): SCSI status: OK
<2>Jul 30 04:10:56 storage kernel: (da0:mrsas0:0:0:0): READ(16). CDB: 88 00 00 00 00 06 f8 91 2b 68 00 00 01 00 00 00
<2>Jul 30 04:10:56 storage kernel: (da0:mrsas0:0:0:0): CAM status: SCSI Status Error
<2>Jul 30 04:10:56 storage kernel: (da0:mrsas0:0:0:0): SCSI status: OK
<2>Jul 30 04:10:56 storage kernel: (da0:mrsas0:0:0:0): READ(16). CDB: 88 00 00 00 00 06 f8 91 22 e8 00 00 01 00 00 00
<2>Jul 30 04:10:56 storage kernel: (da0:mrsas0:0:0:0): CAM status: SCSI Status Error
<2>Jul 30 04:10:56 storage kernel: (da0:mrsas0:0:0:0): SCSI status: OK
<2>Jul 30 04:10:56 storage kernel: (da0:mrsas0:0:0:0): READ(16). CDB: 88 00 00 00 00 06 f8 91 2d a8 00 00 01 00 00 00
<2>Jul 30 04:10:56 storage kernel: (da0:mrsas0:0:0:0): CAM status: SCSI Status Error
<2>Jul 30 04:10:56 storage kernel: (da0:mrsas0:0:0:0): SCSI status: OK
<2>Jul 30 04:10:56 storage kernel: (da0:mrsas0:0:0:0): READ(16). CDB: 88 00 00 00 00 06 f8 91 24 e8 00 00 01 00 00 00
<2>Jul 30 04:10:56 storage kernel: (da0:mrsas0:0:0:0): CAM status: SCSI Status Error
<2>Jul 30 04:10:56 storage kernel: (da0:mrsas0:0:0:0): SCSI status: OK
<2>Jul 30 04:10:56 storage kernel: (da0:mrsas0:0:0:0): READ(16). CDB: 88 00 00 00 00 06 f8 91 2f e8 00 00 01 00 00 00
<2>Jul 30 04:10:56 storage kernel: (da0:mrsas0:0:0:0): CAM status: SCSI Status Error
<2>Jul 30 04:10:56 storage kernel: (da0:mrsas0:0:0:0): SCSI status: OK
<2>Jul 30 04:10:56 storage kernel: (da0:mrsas0:0:0:0): READ(16). CDB: 88 00 00 00 00 06 f8 91 26 e8 00 00 01 00 00 00
<2>Jul 30 04:10:56 storage kernel: (da0:mrsas0:0:0:0): CAM status: SCSI Status Error
<2>Jul 30 04:10:56 storage kernel: (da0:mrsas0:0:0:0): SCSI status: OK
<2>Jul 30 04:10:56 storage kernel: (da0:mrsas0:0:0:0): READ(16). CDB: 88 00 00 00 00 06 f8 91 32 68 00 00 01 00 00 00
<2>Jul 30 04:10:56 storage kernel: (da0:mrsas0:0:0:0): CAM status: SCSI Status Error
<2>Jul 30 04:10:56 storage kernel: (da0:mrsas0:0:0:0): SCSI status: OK
<2>Jul 30 04:10:56 storage kernel: (da0:mrsas0:0:0:0): READ(16). CDB: 88 00 00 00 00 06 f8 91 34 28 00 00 01 00 00 00
<2>Jul 30 04:10:56 storage kernel: (da0:mrsas0:0:0:0): CAM status: SCSI Status Error
<2>Jul 30 04:10:56 storage kernel: (da0:mrsas0:0:0:0): SCSI status: OK
<2>Jul 30 04:10:56 storage kernel: (da0:mrsas0:0:0:0): READ(16). CDB: 88 00 00 00 00 06 f8 91 37 28 00 00 00 80 00 00
<2>Jul 30 04:10:56 storage kernel: (da0:mrsas0:0:0:0): CAM status: SCSI Status Error
<2>Jul 30 04:10:56 storage kernel: (da0:mrsas0:0:0:0): SCSI status: OK
... more of the same...
<2>Jul 30 04:11:09 storage kernel: (da0:mrsas0:0:0:0): READ(16). CDB: 88 00 00 00 00 06 f8 91 3e 68 00 00 01 00 00 00
<2>Jul 30 04:11:09 storage kernel: (da0:mrsas0:0:0:0): CAM status: SCSI Status Error
<2>Jul 30 04:11:09 storage kernel: (da0:mrsas0:0:0:0): SCSI status: OK
<2>Jul 30 04:11:09 storage kernel: (da0:mrsas0:0:0:0): Invalidating pack
<2>Jul 30 04:11:09 storage kernel: g_vfs_done():ufs/vol2[WRITE(offset=6596581294080, length=32768)]error = 6
<2>Jul 30 04:11:09 storage kernel: g_vfs_done():/dev: got error 6 while accessing filesystem
<2>Jul 30 04:11:09 storage kernel: ufs/vol2[WRITE(offset=6941998514176, length=32768)]error = 6
<2>Jul 30 04:11:09 storage kernel: /srv/vol2: got error 6 while accessing filesystem
<2>Jul 30 04:11:09 storage kernel: g_vfs_done():ufs/vol2[WRITE(offset=6286074085376, length=4096)]error = 6
<2>Jul 30 04:11:09 storage kernel: /srv/vol2: got error 6 while accessing filesystem
<2>Jul 30 04:11:09 storage kernel: g_vfs_done():ufs/vol2[WRITE(offset=6286074089472, length=4096)]error = 6
<2>Jul 30 04:11:09 storage kernel: /srv/vol2: got error 6 while accessing filesystem
<2>Jul 30 04:11:09 storage kernel: softdep_deallocate_dependencies: got error 6 while accessing filesystem
<2>Jul 30 04:11:09 storage kernel: g_vfs_done():
<2>Jul 30 04:11:09 storage kernel: ufs/vol2[WRITE(offset=6290013126656, length=4096)]error = 6
<2>Jul 30 04:11:09 storage kernel: /srv/vol2: got error 6 while accessing filesystem
<2>Jul 30 04:11:09 storage kernel: da0 at mrsas0 bus 0 scbus0 target 0 lun 0
<2>Jul 30 04:11:09 storage kernel: da0: <LSI MR9381-4i4e 4.23> s/n 006405f317f1beb51ef00e670bb00506 detached

The system probably panicked a bit later and restarted.

What bothers me is the CAM status: SCSI Status Error immediately followed by SCSI status: OK. My SCSI debugging powers are on vacation, so please help!

Here is some software and hardware information:

# freebsd-version -ku
Code:
10.3-RELEASE-p20
10.3-RELEASE-p20
# ./storcli64 /c0 show
Code:
Product Name = LSI MegaRAID SAS 9380-4i4e
FW Package Build = 24.3.0-0050
BIOS Version = 6.17.04.0_4.16.08.00_0x06060A01
FW Version = 4.230.20-3532
Driver Name = mrsas
Driver Version = 06.707.04.03-fbsd

Vendor Id = 0x1000
Device Id = 0x5D
SubVendor Id = 0x1000
SubDevice Id = 0x9381
Host Interface = PCI-E
Device Interface = SAS-12G
Bus Number = 1
Device Number = 0
Function Number = 0
Drive Groups = 1

TOPOLOGY :
========

----------------------------------------------------------------------------
DG Arr Row EID:Slot DID Type  State BT      Size PDC  PI SED DS3  FSpace TR
----------------------------------------------------------------------------
 0 -   -   -        -   RAID5 Optl  Y  14.552 TB dflt N  N   dflt N      N 
 0 0   -   -        -   RAID5 Optl  Y  14.552 TB dflt N  N   dflt N      N 
 0 0   0   8:0      9   DRIVE Onln  N   3.637 TB dflt N  N   dflt -      N 
 0 0   1   8:1      11  DRIVE Onln  N   3.637 TB dflt N  N   dflt -      N 
 0 0   2   8:2      12  DRIVE Onln  N   3.637 TB dflt N  N   dflt -      N 
 0 0   3   8:3      10  DRIVE Onln  N   3.637 TB dflt N  N   dflt -      N 
 0 0   4   8:4      13  DRIVE Onln  N   3.637 TB dflt N  N   dflt -      N 
----------------------------------------------------------------------------

VD LIST :
=======

-----------------------------------------------------------------
DG/VD TYPE  State Access Consist Cache Cac sCC      Size Name   
-----------------------------------------------------------------
0/0   RAID5 Optl  RW     Yes     RWTD  -   ON  14.552 TB Arrary1
-----------------------------------------------------------------

PD LIST :
=======

-----------------------------------------------------------------------
EID:Slt DID State DG     Size Intf Med SED PI SeSz Model            Sp
-----------------------------------------------------------------------
8:0       9 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0034     U 
8:1      11 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0034     U 
8:2      12 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0034     U 
8:3      10 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0034     U 
8:4      13 Onln   0 3.637 TB SAS  HDD N   N  512B ST4000NM0034     U 
-----------------------------------------------------------------------

Thank you for your help!
 
Yes, a failing disk drive is a valid cause, but somehow I would expect RAID5 to be able to survive this.

Also, I'm wondering why SCSI Status Error is immediately followed by SCSI status: OK; to me that reads the same as saying error: No error.

Can you give some hints on how to check the drives' S.M.A.R.T. status? The megaraid argument to smartctl's -d option seems to be gone.

# camcontrol devlist
Code:
<LSI MR9381-4i4e 4.23>             at scbus0 target 0 lun 0 (pass0,da0)
<LSI SAS3x36 0601>                 at scbus1 target 8 lun 0 (pass1,ses0)
<AVAGO MR9361-8i 4.60>             at scbus2 target 0 lun 0 (pass2,da1)
<AHCI SGPIO Enclosure 1.00 0001>   at scbus8 target 0 lun 0 (pass3,ses1)
<WDC WD5003ABYZ-011FA0 01.01S03>   at scbus9 target 0 lun 0 (ada0,pass4)
<AHCI SGPIO Enclosure 1.00 0001>   at scbus15 target 0 lun 0 (pass5,ses2)
# smartctl -a -d megaraid,0 /dev/pass0
Code:
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-RELEASE-p20 amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

/dev/pass0: Unknown device type 'megaraid'
=======> VALID ARGUMENTS ARE: ata, scsi, nvme[,NSID], sat[,auto][,N][+TYPE], usbcypress[,X], usbjmicron[,p][,x][,N], usbprolific, usbsunplus, 3ware,N, hpt,L/M/N, cciss,N, areca,N/E, atacam, auto, test <=======

Use smartctl -h to get a usage summary
 
Yes, a failing disk drive is a valid cause, but somehow I would expect RAID5 to be able to survive this.
RAID 5 doesn't prevent disk errors. All RAID 5 does is make sure you can still access your data when one of the disks dies completely. It does not and will not stop disks from producing errors.

I'm not sure how to access the disks individually for smartctl(8); the mrsas(4) driver works a little differently than mfi(4) (you can't use mfiutil(8) either, but sysutils/megacli should work; see the sketch below).
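
Something like this should dump the per-drive error counters (a rough sketch, untested; I believe the port installs the binary as MegaCli, and [8:0] is the enclosure:slot pair from your topology):
Code:
# list every physical drive on adapter 0, including the media error count,
# predictive failure count and the "S.M.A.R.T alert flagged" field
MegaCli -PDList -a0

# or query a single drive by enclosure:slot, e.g. 8:0 from the topology above
MegaCli -PDInfo -PhysDrv [8:0] -a0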
 
At a low SCSI level, it is quite possible to get a "SCSI status" (meaning the target returns non-zero status followed by sense data) while the actual operation (read or write) is OK. That happens, for example, with deferred errors. Example: one initiator uses "mode select" to change a disk drive parameter. Another initiator then tries to perform a read, and the status it gets back is "I have something important to tell you: someone else changed a parameter", which superficially looks like an error. For details, read up on the SCSI unit attention concept. BUT: normal operating systems (like FreeBSD) know how to handle these cases, at least for normal hardware (with multi-initiator multi-path disks it can get a little dicey, but that's probably not your situation). You can poke at this yourself; see the sketch below.
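
A TEST UNIT READY with verbose error reporting will decode and print the sense data (key/ASC/ASCQ) whenever the device returns a non-OK status (a sketch; harmless to run, it transfers no data):
Code:
# send TEST UNIT READY to da0; -v makes camcontrol decode and print
# any returned sense data instead of just reporting a failure
camcontrol tur da0 -v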

About the question of how to get SMART information from the drives: I am assuming that you are using your LSI MegaRAID adapter to create a RAID-ed volume out of multiple drives, and the MegaRAID then exposes that volume to the OS as a single disk, right? You are not using software RAID? In that case, the SCSI status you are seeing is not from an individual disk drive but from that virtualized volume, and you can't use the normal smartctl(8) command to get SMART information from individual drives, since FreeBSD sees only a virtual volume, not the drives. SirDice's advice to get the answer through megacli should work; a storcli variant is sketched below. You will probably also be able to access the same functionality by rebooting and going into the BIOS setup screen of the MegaRAID card (although the user interface of those things tends to be awful, and reading the documentation first is probably a good idea).
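
Since you already have storcli64 on the machine, its per-drive detail view is worth a try as well; I believe it reports the error counters and the drive's S.M.A.R.T. alert flag (a sketch, untested on your firmware):
Code:
# detailed information for all drives in all enclosures on controller 0,
# including media/other error counts and "S.M.A.R.T alert flagged by drive"
./storcli64 /c0/eall/sall show all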

Question for experts on the FreeBSD SCSI stack: Is there a way to enable verbose debugging messages in the SCSI stack, which show not just the final SCSI status, but the whole sense data (including ASC/ASCQ)? That might help narrow the problem down.
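
Partially answering myself: camcontrol(8) has a debug subcommand, but as far as I remember it only produces output on a kernel built with options CAMDEBUG, so it may not help on a stock kernel (a sketch):
Code:
# turn on informational and CDB tracing for da0's path (bus 0, target 0, lun 0);
# needs a kernel compiled with options CAMDEBUG
camcontrol debug -I -c 0:0:0
# ...reproduce the errors, then switch the tracing off again
camcontrol debug off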
 
I gave up trying to read the S.M.A.R.T. attributes, but I came across some loosely related issues and decided to read the controller's event log instead. It turned up this:

# ./storcli64 /c0 show events
Code:
seqNum: 0x00001eb5
Time: Sun Jul 30 01:10:54 2017

Code: 0x000000df
Class: 1
Locale: 0x10
Event Description: SAS wide port 0 lost link on PHY 0
Event Data:
===========
Port: 0
Phy : 0

seqNum: 0x00001eb6
Time: Sun Jul 30 01:10:54 2017

Code: 0x000000df
Class: 1
Locale: 0x10
Event Description: SAS wide port 0 lost link on PHY 1
Event Data:
===========
Port: 0
Phy : 1

... one for each hard drive...

seqNum: 0x00001eb9
Time: Sun Jul 30 01:10:57 2017

Code: 0x0000010c
Class: 1
Locale: 0x02
Event Description: PD 09(e0x08/s0) Path 5000c50084faf1e1  reset (Type 03)
Event Data:
===========
Device ID: 9
Enclosure Index: 8
Slot Number: 0
Error: 3

... more of the same for other hard drives...

seqNum: 0x00001ebe
Time: Sun Jul 30 01:10:57 2017

Code: 0x0000010c
Class: 1
Locale: 0x02
Event Description: Encl PD 08 Path 500304801ea23f3d  reset (Type 03)
Event Data:
===========
Device ID: 8
Enclosure Index: 8
Slot Number: 255
Error: 3

...

seqNum: 0x00001ebf
Time: Sun Jul 30 01:10:57 2017

Code: 0x00000070
Class: 1
Locale: 0x02
Event Description: Removed: PD 09(e0x08/s0)
Event Data:
===========
Device ID: 9
Enclosure Index: 8
Slot Number: 0

seqNum: 0x00001ec0
Time: Sun Jul 30 01:10:57 2017


Code: 0x000000f8
Class: 0
Locale: 0x02
Event Description: Removed: PD 09(e0x08/s0) Info: enclPd=08, scsiType=0, portMap=00, sasAddr=5000c50084faf1e1,0000000000000000
Event Data:
===========
Device ID: 9
Enclosure Device ID: 8
Enclosure Index: 1
Slot Number: 0
SAS Address 1: 5000c50084faf1e1
SAS Address 2: 0


seqNum: 0x00001ec1
Time: Sun Jul 30 01:10:57 2017

Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 09(e0x08/s0) from ONLINE(18) to FAILED(11)
Event Data:
===========
Device ID: 9
Enclosure Index: 8
Slot Number: 0
Previous state: 24
New state: 17


seqNum: 0x00001ec2
Time: Sun Jul 30 01:10:57 2017

Code: 0x00000051
Class: 0
Locale: 0x01
Event Description: State change on VD 00/0 from OPTIMAL(3) to DEGRADED(2)
Event Data:
===========
Target Id: 0
Previous state: 3
New state: 2

Basically, after the drives disappeared one by one, the volume understandably went offline. That explains the software-side troubles, but not why and how it happened. I suppose we had better check the cabling and tighten all the screws :)

Thanks!
 
Check the card itself too. I used to have an old Promise SATA controller that died; the controller chip got extremely hot and started producing CAM errors for just about everything attached to it. Needless to say, it was the controller card itself that was broken and needed to be replaced.
 