Other Large number of CAM status: Uncorrectable parity/CRC errors when resilvering/scrubbing

Hi,

I have two Sil3124 PCI-X S-ATA controllers in my NAS box. They used to work fine, but I recently had to replace a disk in my raid-z1 array, and during the resilver they work fine at first but after a while produce a large number of CAM status errors. All the errors concern disks connected to the Sils; none seem to affect the disk connected to the native S-ATA port on the motherboard.

Does this indicate that the controllers are about to fail, given that it happens under load? I thought the Sil3124 was a decent controller.
Code:
(ada4:siisch5:0:0:0): READ_FPDMA_QUEUED. ACB: 60 40 31 5c d8 40 01 00 00 00 00 00
(ada4:siisch5:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada4:siisch5:0:0:0): Retrying command
(ada4:siisch5:0:0:0): READ_FPDMA_QUEUED. ACB: 60 10 10 78 a8 40 ae 00 00 00 00 00
(ada4:siisch5:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada4:siisch5:0:0:0): Retrying command
(ada4:siisch5:0:0:0): READ_FPDMA_QUEUED. ACB: 60 10 10 76 a8 40 ae 00 00 00 00 00
(ada4:siisch5:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada4:siisch5:0:0:0): Retrying command
(ada4:siisch5:0:0:0): READ_FPDMA_QUEUED. ACB: 60 10 10 02 00 40 00 00 00 00 00 00
(ada4:siisch5:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada4:siisch5:0:0:0): Retrying command
(ada2:siisch0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 10 10 76 a8 40 ae 00 00 00 00 00
(ada2:siisch0:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada2:siisch0:0:0:0): Retrying command
(ada2:siisch0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 10 10 02 00 40 00 00 00 00 00 00
(ada2:siisch0:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada2:siisch0:0:0:0): Retrying command
(ada5:siisch6:0:0:0): READ_FPDMA_QUEUED. ACB: 60 40 1a 6d e6 40 01 00 00 00 00 00
(ada5:siisch6:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada5:siisch6:0:0:0): Retrying command
(ada5:siisch6:0:0:0): READ_FPDMA_QUEUED. ACB: 60 10 10 76 a8 40 ae 00 00 00 00 00
(ada5:siisch6:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada5:siisch6:0:0:0): Retrying command
(ada5:siisch6:0:0:0): READ_FPDMA_QUEUED. ACB: 60 10 10 02 00 40 00 00 00 00 00 00
(ada5:siisch6:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada5:siisch6:0:0:0): Retrying command
(ada2:siisch0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 40 1b c0 a2 40 01 00 00 00 00 00
(ada2:siisch0:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada2:siisch0:0:0:0): Retrying command
(ada5:siisch6:0:0:0): READ_FPDMA_QUEUED. ACB: 60 c0 7b 95 d3 40 01 00 00 00 00 00
(ada5:siisch6:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada5:siisch6:0:0:0): Retrying command
(ada2:siisch0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 40 fc 0c 39 40 01 00 00 00 00 00
(ada2:siisch0:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada2:siisch0:0:0:0): Retrying command
(ada5:siisch6:0:0:0): READ_FPDMA_QUEUED. ACB: 60 80 0f dc 22 40 01 00 00 00 00 00
(ada5:siisch6:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada5:siisch6:0:0:0): Retrying command
(ada2:siisch0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 40 20 c9 68 40 01 00 00 00 00 00
(ada2:siisch0:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada2:siisch0:0:0:0): Retrying command
(ada4:siisch5:0:0:0): READ_FPDMA_QUEUED. ACB: 60 80 82 b9 fb 40 01 00 00 00 00 00
(ada4:siisch5:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada4:siisch5:0:0:0): Retrying command
(ada5:siisch6:0:0:0): READ_FPDMA_QUEUED. ACB: 60 40 48 ba 57 40 02 00 00 00 00 00
(ada5:siisch6:0:0:0): CAM status: ATA Status Error
(ada5:siisch6:0:0:0): ATA status: 41 (DRDY ERR), error: 04 (ABRT )
(ada5:siisch6:0:0:0): RES: 41 04 48 ba 57 40 02 00 00 00 00
(ada5:siisch6:0:0:0): Retrying command
(ada2:siisch0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 80 c0 09 4c 40 01 00 00 00 00 00
(ada2:siisch0:0:0:0): CAM status: CCB request was invalid
(ada2:siisch0:0:0:0): Error 22, Unretryable error
(ada5:siisch6:0:0:0): READ_FPDMA_QUEUED. ACB: 60 40 8e c1 69 40 01 00 00 00 00 00
(ada5:siisch6:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada5:siisch6:0:0:0): Retrying command
(ada5:siisch6:0:0:0): READ_FPDMA_QUEUED. ACB: 60 40 5c 5f f0 40 01 00 00 00 00 00
(ada5:siisch6:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada5:siisch6:0:0:0): Retrying command
(ada4:siisch5:0:0:0): READ_FPDMA_QUEUED. ACB: 60 10 10 78 a8 40 ae 00 00 00 00 00
(ada4:siisch5:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada4:siisch5:0:0:0): Retrying command
(ada4:siisch5:0:0:0): READ_FPDMA_QUEUED. ACB: 60 10 10 76 a8 40 ae 00 00 00 00 00
(ada4:siisch5:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada4:siisch5:0:0:0): Retrying command
(ada4:siisch5:0:0:0): READ_FPDMA_QUEUED. ACB: 60 10 10 02 00 40 00 00 00 00 00 00
(ada4:siisch5:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada4:siisch5:0:0:0): Retrying command
 
I'm not familiar with that particular controller card, but given the CRC errors I would first check that the hard drive cables are good and that the controller card is seated properly. If the problem persists, run a SMART test on the affected drive(s). Resilvering can tax a drive and bring a pending failure to light. Others here may have better ideas.
 

I have tried changing the cables and reseating the controller, with no effect. I also looked at the SMART data with smartctl, and the drives show a raw read error rate of zero. The UDMA CRC error counts are high on several disks, which would point to either a cable or a controller problem.
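For anyone wanting to compare those two counters across several disks, here is a minimal Python sketch that pulls the raw values out of `smartctl -A` text. The sample output is made up for illustration (it is not from the drives in this thread), and the function name `raw_values` is hypothetical; it assumes the usual ten-column attribute table with the raw value in the last column.

```python
def raw_values(smartctl_output,
               wanted=("Raw_Read_Error_Rate", "UDMA_CRC_Error_Count")):
    """Return {attribute_name: raw_value} for the wanted SMART attributes."""
    values = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row starts with "ID#" and is skipped by the isdigit() check.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in wanted:
            values[fields[1]] = int(fields[9])
    return values


# Illustrative, invented sample of `smartctl -A` output (assumption).
sample_output = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       1423
"""

print(raw_values(sample_output))
```

A zero read error rate next to a large CRC count, as in this made-up sample, suggests the data is being damaged in transit over the link rather than on the platters.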
 
If it were one drive, it would be easy to point the finger at the drive. Since it's three, the way to go is to look at what they have in common. Having already changed the cables and reseated everything is a good start. You said you have two controllers. Do all the symptoms show on just one controller? Can you move a drive that is showing errors on one controller to the other? Is there a shared backplane between the drives?
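One rough way to see what the failing drives have in common is to tally the CAM status lines per device and channel. The Python sketch below does that for a few lines excerpted from the dmesg output posted above; the function name `tally_cam_errors` is hypothetical, and the regular expression assumes the `(device:channel:bus:target:lun): CAM status: ...` layout seen in that log.

```python
import re
from collections import Counter

# Matches lines such as:
# (ada4:siisch5:0:0:0): CAM status: Uncorrectable parity/CRC error
CAM_STATUS = re.compile(r"^\((\w+):(\w+):\d+:\d+:\d+\): CAM status: (.+?)\s*$")


def tally_cam_errors(dmesg_text):
    """Count CAM status messages per (device, channel, status) triple."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = CAM_STATUS.match(line.strip())
        if match:
            counts[match.groups()] += 1
    return counts


# A few lines excerpted from the log posted earlier in the thread.
dmesg_excerpt = """\
(ada4:siisch5:0:0:0): READ_FPDMA_QUEUED. ACB: 60 40 31 5c d8 40 01 00 00 00 00 00
(ada4:siisch5:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada4:siisch5:0:0:0): Retrying command
(ada2:siisch0:0:0:0): READ_FPDMA_QUEUED. ACB: 60 80 c0 09 4c 40 01 00 00 00 00 00
(ada2:siisch0:0:0:0): CAM status: CCB request was invalid
(ada2:siisch0:0:0:0): Error 22, Unretryable error
(ada5:siisch6:0:0:0): CAM status: Uncorrectable parity/CRC error
(ada5:siisch6:0:0:0): CAM status: Uncorrectable parity/CRC error
"""

cam_counts = tally_cam_errors(dmesg_excerpt)
for (device, channel, status), n in sorted(cam_counts.items()):
    print(f"{device} on {channel}: {n} x {status}")
```

Run against the full dmesg output, a tally like this would show whether the errors cluster on one channel, one controller, or every Sil-attached disk.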
 

No, the symptoms show on disks attached to either controller. There are five identical disks (WD 1.5TB Caviar Black) in the array: two on each Sil controller and one on the motherboard's S-ATA controller. There were lots of similar errors, so I pasted only part of the dmesg output; I think the fourth disk attached to the Sils also had some errors. Earlier I had only one controller installed, but I happened to have a spare, so I put it in to see whether it would change anything. The disk connected to the motherboard has had no issues at all.

However, for some reason the resilver completed last night with no errors, so the disks are probably fine. I set vfs.zfs_scrub_limit=1 in /boot/loader.conf to try to limit the resilver speed, and nothing else was running. I don't know if that helped. Anyway, I guess the issue is solved for now, but I'm not sure whether I should replace the controller(s) to avoid problems in the future. o_O




EDIT: Also, has FreeBSD's support for Marvell chips improved? The Supermicro SAT2-MV8 seems to be a pretty cheap controller, but I have heard that it has had problems with FreeBSD.
 