Question about damaged SATA drive

pkc · Jan 2, 2014

I have a 1 TB SATA drive that is known to be damaged in some way. About 12 hours ago I started copying 150 GB from a USB drive to it using cp, but so far only 50 GB have transferred. Given the damage, that makes sense.

However, the terminal session from which I started cp over SSH is almost unresponsive: keyboard input gets a response only after a minute or so. This is not a problem, because I can abort the transfer if I want, and if I log in on a separate session the system is as responsive as usual. I was just wondering what the technical reason for this would be.

Thanks.

ralphbsz · Jan 5, 2014

Educated guess: it is so slow because the damaged drive has to retry some I/Os many, many times. During this time, the process is stuck in the kernel. A disk drive can easily take a few seconds to retry an I/O (if it recalibrates). The minute you are seeing is a bit extreme; the only way I can explain it is that the disk drive itself spends a few seconds on each I/O, and then the kernel (probably some part of the SATA and block device stack) retries the I/O another few times.
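To put rough numbers on that guess, here is a small Python sketch of how drive-level and kernel-level retries would multiply. The per-attempt time and the attempt count are made-up illustrative values, not measurements from this system, and the function names are hypothetical. The throughput line uses only the figures from the question (50 GB in about 12 hours) and assumes decimal gigabytes.

```python
def worst_case_io_seconds(drive_seconds_per_attempt, kernel_attempts):
    """Upper bound for one I/O if every kernel attempt hits the drive's slow path.

    Both arguments are assumptions chosen for illustration.
    """
    return drive_seconds_per_attempt * kernel_attempts


def average_throughput_mb_per_s(gigabytes, hours):
    """Average transfer rate in MB/s, taking 1 GB = 1000 MB."""
    return gigabytes * 1000 / (hours * 3600)


if __name__ == "__main__":
    # Assumed: the drive spends 5 s per attempt and the kernel tries 4 times.
    print(worst_case_io_seconds(5, 4), "s for a single I/O")

    # Figures from the question: 50 GB copied in about 12 hours.
    print(round(average_throughput_mb_per_s(50, 12), 2), "MB/s on average")
```

With those assumed values a single I/O lands around 20 seconds, so a full minute would need either slower drive-level retries or more kernel attempts than assumed here.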

My usual rule of thumb is that I/Os should finish (at the disk drive level, not counting kernel retries) within a few seconds, even under the most extreme workloads (with many dozens of I/Os queued) and even in error cases. Error handling in the kernel can exceed that, but only up to a few dozen seconds (20 or 30 seconds for an I/O is the absolute upper limit). The only reason for longer I/O times is kernel bugs; on one Unix-like OS (name withheld to protect the innocent), we ended up rebooting the machine after about 90 hours, and the I/O still hadn't finished.
