Other Deciphering SCSI errors

Lately I've been running several of our drives through read scans, as I've noticed a few with sectors that simply will not read. I'm doing this with a simple dd if=/dev/(a)daX of=/dev/null bs=1m. On some machines dd has died when it hit something that couldn't be read, prompting me to replace the drive.
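
For anyone who wants the scan to keep going past an unreadable region rather than stop at the first one, here is a minimal sketch in Python. It is not what dd does internally; the names read_scan, chunk_size and max_consecutive_errors are made up for illustration, and it assumes the device can be opened read-only and read in 1 MiB chunks.

```python
import os

def read_scan(path, chunk_size=1024 * 1024, max_consecutive_errors=16):
    """Read `path` front to back; return (bytes_read, failed_offsets).

    Hypothetical helper: on a read error it records the offset and skips
    one chunk ahead instead of aborting.
    """
    failed_offsets = []
    bytes_read = 0
    offset = 0
    consecutive_errors = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            try:
                data = os.pread(fd, chunk_size, offset)
            except OSError:
                failed_offsets.append(offset)
                offset += chunk_size
                consecutive_errors += 1
                # Give up if errors never stop (e.g. the device went away).
                if consecutive_errors >= max_consecutive_errors:
                    break
                continue
            if not data:  # end of device or file
                break
            consecutive_errors = 0
            bytes_read += len(data)
            offset += len(data)
    finally:
        os.close(fd)
    return bytes_read, failed_offsets
```

Calling read_scan("/dev/da6") would return the number of bytes read plus a list of chunk offsets that failed, which can then be compared against the LBAs in dmesg.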

However, while performing these scans, I've noticed that some drives do not cause dd to stop, but do generate some interesting dmesg output, e.g.:
Code:
(da6:mps0:0:60:0): READ(10). CDB: 28 00 ba 74 66 00 00 01 00 00
(da6:mps0:0:60:0): CAM status: SCSI Status Error
(da6:mps0:0:60:0): SCSI status: Check Condition
(da6:mps0:0:60:0): SCSI sense: ABORTED COMMAND asc:47,3 (Information unit iuCRC error detected)
(da6:mps0:0:60:0): Retrying command (per sense data)

...or alternatively (and far more rarely), the SCSI sense line is:
(da6:mps0:0:60:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
This particular machine is the only one using a SAS HBA. My understanding is that, since dd did not exit prematurely, the error is specific to the HBA, the backplane, or the wiring between them, and not a problem with the drive itself. Is this safe to assume, or could the HBA in this particular machine be masking problems reading the disk?
 
The error ASC/ASCQ = 47/3 is exactly what it says: a CRC error was detected in an information unit. What is an information unit? It is a small, atomic piece of the SCSI protocol: for example, the command itself going from initiator (computer) to target (disk), the status return, or data going in either direction. So this is a communication error between the HBA and the drive.

The log tells you clearly what happens next: The command is retried. Assuming that there are no more error messages, I would expect that the second time, the command worked. So you have an intermittent communication problem on your SAS network. I would start by reseating cables.

The second thing is not an error, but it gets transmitted and printed the same way errors do. The SCSI target is telling the initiator about an interesting condition that has occurred, namely that it was recently powered up or reset. That can easily make sense, and you should be able to tell whether the drive was recently powered up. The SAS HBA might also be resetting the drive in response to communication problems, or because of hung IOs. Again, I think your power or communications cabling is suspect.

Nothing here points at problems in the actual mechanism of the disk.
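
To check whether the retries are isolated or come in clusters, the sense lines can be tallied per device. This is a rough sketch that assumes the dmesg lines look exactly like the ones quoted in this thread; the regex and the name tally_sense are made up for illustration, not part of any FreeBSD tool.

```python
import re
from collections import Counter

# Matches lines such as:
# (da6:mps0:0:60:0): SCSI sense: ABORTED COMMAND asc:47,3 (...)
SENSE_RE = re.compile(
    r"^\((?P<dev>[a-z]+\d+):[^)]*\): SCSI sense: (?P<key>[A-Z ]+?) "
    r"asc:(?P<asc>[0-9a-f]+),(?P<ascq>[0-9a-f]+)"
)

def tally_sense(lines):
    """Count (device, sense key, asc, ascq) tuples in dmesg-style lines."""
    counts = Counter()
    for line in lines:
        m = SENSE_RE.match(line.strip())
        if m:
            counts[(m["dev"], m["key"], m["asc"], m["ascq"])] += 1
    return counts

if __name__ == "__main__":
    sample = [
        "(da6:mps0:0:60:0): SCSI status: Check Condition",
        "(da6:mps0:0:60:0): SCSI sense: ABORTED COMMAND asc:47,3 "
        "(Information unit iuCRC error detected)",
        "(da6:mps0:0:60:0): SCSI sense: UNIT ATTENTION asc:29,0 "
        "(Power on, reset, or bus device reset occurred)",
    ]
    for key, n in tally_sense(sample).items():
        print(key, n)
```

If the 47/3 count is confined to one device (or one bay), that supports the cabling or connector theory rather than a failing disk.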
 
Yeah, I was able to take the machine down briefly, and swap things around. I'm getting similar errors on a different drive in the same bay, so I strongly suspect there's just something wrong with that particular bay. There's not much cabling per se, just the single one that runs from the HBA to the backplane. I'll swap it out when I get a chance but my fear is it lies in the backplane itself or some hardware component of the HBA.

Assuming that there are no more error messages, I would expect that the second time, the command worked.
That's what I was gathering. I should have been more clear: these messages come in 'bursts' of many (around 10 to 20) in quick succession, with slightly different CDBs for each one, and then it falls quiet for quite some time. However, this might only be due to the unusual workload of reading 15 or 16 drives simultaneously at maximum throughput.
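
For comparing the CDBs within a burst, a READ(10) CDB can be decoded by hand: byte 0 is the opcode (0x28), bytes 2-5 are the big-endian LBA, and bytes 7-8 are the transfer length in blocks. A small sketch, with decode_read10 being a made-up helper name:

```python
def decode_read10(cdb_hex):
    """Return (lba, block_count) from a READ(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # transfer length in blocks
    return lba, blocks

if __name__ == "__main__":
    # CDB taken from the dmesg output earlier in the thread.
    lba, blocks = decode_read10("28 00 ba 74 66 00 00 01 00 00")
    print(lba, blocks)  # 3128190464 256
```

Assuming 512-byte sectors (not stated anywhere in the thread), 256 blocks is a 128 KiB read. If the LBAs in a burst are adjacent, the failures are hitting consecutive requests in the sequential scan rather than scattered spots on the disk.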

I'll probably just stick a disk that has relatively little I/O there, such as a system disk, and cross my fingers that things don't get worse.
 
I'm getting similar errors on a different drive in the same bay, so I strongly suspect there's just something wrong with that particular bay.
The connector inside the drive bay might be a bit dodgy. Maybe a cold solder joint.
 
However, this might only be due to the unusual workload of reading 15 or 16 drives simultaneously at maximum throughput.
In that case, it is possible that the power supply is causing the problem. Maybe the voltage is dropping slightly below the acceptable range when under maximum load, so the communication signals become too “weak” for a short period of time, especially for the drive farthest from the power supply. Either that, or a cabling / connector problem, as others have mentioned.
 
In that case, it is possible that the power supply is causing the problem.
I thought about that, but I don't think it is, as no other drives experience the issue. It has two PSUs, which I believe are in a fully redundant configuration, and both are plugged in. It does draw ~2A (~240VA) while reading all the disks, which isn't too bad. The bay that has the issue is also very close to where the PSU <-> backplane connection is. I'll pull it apart when I have some more time to investigate.

I'm still most suspicious of some part of the data path. It may just be crosstalk or a bad connection somewhere on the path between the physical port on the drive and the HBA.
 
That number sort of makes sense: when fully busy, disks can consume 13-15W each; times 16 drives, plus some power supply inefficiency, 2A at 240V is in the ballpark. As you said, most likely your problem (*) is caused by the data path, which is why I would suspect connectors and cables first.

Footnote: Not clear that you even have a real problem ... occasional CRC errors may be something you can ignore.
 
2A at 240V
That's 'VA', not 'V', as I'm on 120V. 2A @ 120V = 240VA. Anyway, the system draws substantially more power when it's under CPU load (e.g. make -j 40 buildworld), at least 3A.
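
Spelled out, the arithmetic from the last two posts looks like this. It is only a back-of-the-envelope sketch: the function names are made up, and the power factor needed to convert VA to real watts is unknown here, so none is applied.

```python
def apparent_power_va(volts, amps):
    """Apparent power drawn from the wall, in volt-amperes."""
    return volts * amps

def drive_power_range_w(drives, watts_low=13, watts_high=15):
    """Estimated range of power used by busy drives, in watts."""
    return drives * watts_low, drives * watts_high

if __name__ == "__main__":
    print(apparent_power_va(120, 2))   # 240
    print(drive_power_range_w(16))     # (208, 240)
```

So 16 busy drives at 13-15W each would account for roughly 208-240W, which is about the whole measured 240VA before counting the rest of the system; the per-drive estimate is probably on the high side for this hardware.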
Footnote: Not clear that you even have a real problem ... occasional CRC errors may be something you can ignore.
That's kind of my feeling right now. The commands never appear to fully time out as dd does complete. I've also not seen any until now, so I suspect I won't ever see them during normal load.
 