Solved ZFS: Trying to replace faulted drive with new, getting "cannot zero first 4096 bytes"

Hey everyone

I have a zpool in a raidz2 configuration. One of the drives faulted, so I took it offline and put a new drive in its slot. Now I'm trying to run zpool replace {poolname} /dev/ada0 and I'm getting: cannot zero first 4096 bytes of '/dev/ada0': Input/output error.

Taking a look at dmesg, I'm seeing the following:
Code:
# dmesg | grep ada0
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <ST5000LM000-2AN170 0001> ACS-3 ATA SATA 3.x device
ada0: Serial Number WCXXXDCX
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 512bytes)
ada0: Command Queueing enabled
ada0: 4769307MB (9767541168 512 byte sectors)
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 10 10 02 00 40 00 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: ATA Status Error
(ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 84 (ICRC ABRT )
(ada0:ahcich0:0:0:0): RES: 41 84 10 02 00 00 00 00 00 10 00
(ada0:ahcich0:0:0:0): Retrying command, 3 more tries remain
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 10 10 02 00 40 00 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: ATA Status Error
(ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 84 (ICRC ABRT )
(ada0:ahcich0:0:0:0): RES: 41 84 10 02 00 00 00 00 00 10 00
(ada0:ahcich0:0:0:0): Retrying command, 2 more tries remain
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 10 10 02 00 40 00 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: ATA Status Error
(ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 84 (ICRC ABRT )
(ada0:ahcich0:0:0:0): RES: 41 84 10 02 00 00 00 00 00 10 00
(ada0:ahcich0:0:0:0): Retrying command, 1 more tries remain
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 10 10 02 00 40 00 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: ATA Status Error
(ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 84 (ICRC ABRT )
(ada0:ahcich0:0:0:0): RES: 41 84 10 02 00 00 00 00 00 10 00
(ada0:ahcich0:0:0:0): Retrying command, 0 more tries remain
(ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 10 10 02 00 40 00 00 00 00 00 00
(ada0:ahcich0:0:0:0): CAM status: ATA Status Error
(ada0:ahcich0:0:0:0): ATA status: 41 (DRDY ERR), error: 84 (ICRC ABRT )
(ada0:ahcich0:0:0:0): RES: 41 84 10 02 00 00 00 00 00 10 00
(ada0:ahcich0:0:0:0): Error 5, Retries exhausted

I'm not sure how to proceed with this. It's a new drive, straight out of the packaging, so I'd be surprised if it was faulty already. It is connected through a SATA card, but there are 9 other drives connected via that card which are running well.

Any ideas how I can get this into my pool?
 
The logs above tell us that:
  • The disk known as ada0 exists, is connected, and is a certain model and serial number.
  • The disk communicates with the host.
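  • The disk is reporting an error on every write: status 41 is DRDY plus ERR, and error 84 is ICRC plus ABRT. DRDY on its own only means the drive is ready; the ERR bit is what signals the failure, and ICRC is an interface CRC error, meaning the data was corrupted in transit between the controller and the disk.
So writes to the disk are not working. The disk might be defective, or it might have an installation problem. I would start by checking the SATA cable and the power wiring; a marginal data cable is a common cause of ICRC errors, and perhaps the disk is not receiving enough power to function fully. If you have checked that, you can try using smartctl to see what the disk itself reports.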
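
For reference, the two register bytes in the log (status 41, error 84) can be decoded bit by bit. The following is only a minimal sketch in plain sh: the function names decode_bits, decode_ata_status and decode_ata_error are made up for this example, and only a subset of the ATA status and error bits is named.

```shell
#!/bin/sh
# Decode an ATA status or error register byte (hex, as printed by CAM in
# dmesg) into the names of the bits that are set. The helper names are
# hypothetical and only the most commonly seen bits are listed.

decode_bits() {
    # $1 = hex byte without the 0x prefix; remaining args = "mask:NAME" pairs
    value=$((0x$1))
    shift
    out=""
    for pair in "$@"; do
        mask=${pair%%:*}
        name=${pair#*:}
        if [ $((value & $mask)) -ne 0 ]; then
            out="${out:+$out }$name"
        fi
    done
    printf '%s\n' "$out"
}

# Status register: busy, device ready, device fault, data request, error
decode_ata_status() {
    decode_bits "$1" 0x80:BSY 0x40:DRDY 0x20:DF 0x08:DRQ 0x01:ERR
}

# Error register: interface CRC error, uncorrectable data, ID not found,
# command aborted
decode_ata_error() {
    decode_bits "$1" 0x80:ICRC 0x40:UNC 0x10:IDNF 0x04:ABRT
}
```

With the values from the log above, decode_ata_status 41 prints "DRDY ERR" and decode_ata_error 84 prints "ICRC ABRT", which matches what the kernel already shows in parentheses. The ICRC bit is the one that points at the link between controller and disk rather than at the platters.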
 
Thanks everyone. I rewired the power so there wasn't so much daisy-chaining, but that didn't help. However, replacing the SATA cable with a newly purchased one from a different manufacturer seems to have fixed the issue.

zpool replace {poolname} {old_id} /dev/ada0 is currently resilvering, so everything is looking great!
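
To confirm that the link errors really stopped after a cable swap, one option is to count the ICRC lines per device in the kernel log before and after. This is a rough sketch, assuming the log lines have the same shape as the dmesg output earlier in this thread; count_icrc is a made-up helper name. On many SATA drives, smartctl -A also reports a UDMA_CRC_Error_Count attribute that can be watched the same way, though whether a given drive exposes it is an assumption.

```shell
#!/bin/sh
# Count CAM error lines that mention ICRC, grouped by device name.
# Reads kernel-log text on stdin and prints "device count", sorted.
count_icrc() {
    awk '/ATA status:/ && /ICRC/ {
             dev = $1
             sub(/^\(/, "", dev)   # drop the leading "("
             sub(/:.*/, "", dev)   # keep only the device name, e.g. ada0
             n[dev]++
         }
         END { for (d in n) print d, n[d] }' | sort
}

# Typical use on a live system (not run here):
#   dmesg | count_icrc
```

If the count for the replaced drive stops growing while the resilver is writing to it, the new cable is most likely doing its job.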
 
> However, replacing the SATA cable with a newly purchased one from a different manufacturer seems to have fixed the issue.
This always baffles me. How does a cable go bad when it's sitting in a desktop PC, plugged in, and nobody touches it? I've had it happen to me, and I'm now at the point where I keep extra cables on hand, the highest quality I can get, and that's the first thing I swap.
 
> This always baffles me.

Cables are weird. They can do the darnedest things.

Many years ago, we had one SAS cable that would work great in normal use (in a super busy server), but would consistently not work once we enabled T10DIF (meaning on-disk checksums). After much trial and error, we narrowed the problem down to literally one cable. Completely bizarre.
 
> This always baffles me. How does a cable go bad when it's sitting in a desktop PC, plugged in, and nobody touches it? I've had it happen to me, and I'm now at the point where I keep extra cables on hand, the highest quality I can get, and that's the first thing I swap.
Both of my disks were fast at the beginning, and then one day one disk was slow ( https://forums.freebsd.org/threads/one-disk-slow.86332/ ); in the end it turned out to be cable related. I have no idea how a cable goes bad when nobody touches it.
 