UFS FreeBSD 11.1 on SSD: CAM status: Uncorrectable parity/CRC error

giorgiob · Aug 16, 2017

I have just done a fresh install of FreeBSD 11.1 on a small home server with an SSD. The installation seemed to be successful: no errors whatsoever.

After the installation I wanted to restore my old data on the server and copied some tar files to the (UFS-formatted) SSD. Each file is about 4 GB in size. I then wanted to check the files (I have sha512 sums for each file). While computing the sums, I got several similar errors logged in the console:

Code:

(ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED, ACB: 60 00 c0 10 88 40 04 00 00 01 00 00
(ada0:ahcich0:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada0:ahcich0:0:0:0): Retrying command

The command completes and the computed sha512 is correct. So, even though there seem to be errors reading the disk, the data is read correctly.
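A check of this kind can be sketched in Python. This is only an illustration of verifying a large file against a known SHA-512 sum, not the command actually used in this thread; the function names `sha512_of_file` and `matches` and the 1 MiB chunk size are assumptions made for the example.

```python
import hashlib


def sha512_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-512 digest of a file, read in fixed-size chunks.

    Reading in chunks keeps memory use constant even for multi-gigabyte
    tar files. The chunk size is an arbitrary choice for this sketch.
    """
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare the computed digest with a previously recorded hex digest."""
    return sha512_of_file(path) == expected_hex.strip().lower()
```

Because every byte of the file has to be read back from the disk, a run like this doubles as a read test of the blocks the files occupy, which is why the console errors showed up during the check.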

I then rebooted and mounted the disk read only and ran a check with
smartctl -s on -t long /dev/ada0
and the test passed.
I also rebooted into a GNU/Linux rescue disk and ran
badblocks -s /dev/sda
from there: this also found no bad blocks.

I have also looked at https://forums.freebsd.org/threads/45101/
but in that question smartctl was reporting errors, so it seems to me that that situation was different.

So both tools found no errors but still I get these error messages in the console from time to time. Do you have any suggestions as to what I should check next? Can it be a broken disk?

ralphbsz · Aug 16, 2017

That error usually indicates a communication problem, most often on the SATA interface to the disk. In particular, look for the following: if the error happens *once*, the log says the command is being retried (as in your example above), and the same error does *not* immediately happen again, then the retry succeeded. That is a strong indicator of a communication problem. How do you know whether a second error report is a retry that also failed? Look at the ACB listed in the report. It includes the disk address, so different ACBs mean different IOs and not a failed retry.

If retries also fail, then eventually the error will be propagated up to the file system and perhaps even the user application. In that case, there will be more entries in the system log.
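The ACB comparison described above can be sketched as a small Python script that reads console or log text and counts how many distinct ACBs were reported. It is a minimal sketch, assuming log lines shaped like the ones quoted in the first post; the regular expression and the names `parse_acbs` and `summarize` are made up for this example. It compares ACB strings as a whole and does not try to decode the disk address from them.

```python
import re
from collections import Counter

# Matches lines shaped like the one quoted in this thread, e.g.
# (ada0:ahcich0:0:0:0): READ_FPDMA_QUEUED, ACB: 60 00 c0 10 88 40 04 00 00 01 00 00
# The exact log layout is an assumption based on that single sample.
ACB_LINE = re.compile(
    r"^\((?P<dev>[^:)]+):[^)]*\):\s+(?P<cmd>\w+)[.,]\s+ACB:\s+"
    r"(?P<acb>(?:[0-9a-fA-F]{2}\s*)+)$"
)


def parse_acbs(log_text):
    """Return (device, command, acb) tuples in the order they were logged."""
    records = []
    for line in log_text.splitlines():
        m = ACB_LINE.match(line.strip())
        if m:
            # Normalize whitespace and case so identical ACBs compare equal.
            acb = " ".join(m.group("acb").split()).lower()
            records.append((m.group("dev"), m.group("cmd"), acb))
    return records


def summarize(records):
    """Count error reports per (device, ACB) pair.

    A pair reported more than once is a candidate for a retry that failed
    again; pairs reported exactly once were most likely fixed by the retry.
    """
    counts = Counter((dev, acb) for dev, _cmd, acb in records)
    repeated = {key: n for key, n in counts.items() if n > 1}
    return {"reports": len(records), "distinct": len(counts), "repeated": repeated}
```

If `reports` equals `distinct` and `repeated` is empty, every logged error concerned a different IO, which fits the pattern of retries that succeeded. A repeated ACB is only a hint: it could also be the same block being read again later and hitting a new transfer error.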

If these are indeed communication errors, and go away on retries, then my first guess would be to check the cabling to the disk; the second guess would be to check the power supply situation (in particular connections). If all that is in excellent shape, then it gets more interesting.

giorgiob · Aug 17, 2017

I think I have found the problem: thanks for pointing me in the right direction!!

To make a long story short: I think it is the SATA cable. I will replace it.

What I tried in detail.

First test. I opened the box and checked that the power supply and the SATA cable were connected properly. I connected the SATA cable to another slot (maybe one slot is defective?).
I started the system and ran the sha512 verification again. I got 250 errors in total, each error on a different disk address. So, if I understand correctly, each read succeeded on the first retry.

Second test. I connected the drive to a SATA-USB adapter, attached to my laptop and booted FreeBSD on it. I mounted the drive and ran the SHA512 check: no errors.

Third test. I connected the drive to the server using another SATA cable and ran the SHA512 check: no errors.

So I inspected the old SATA cable and found a scratch / small cut on one side. It probably got damaged while I mounted it: the case has some sharp edges and I even cut my finger while trying to remove the disk. So I tested again with the old SATA cable and the errors are there!
