ZFS SATA drive disappears under load

Here's an odd one. I built a system that boots and roots off a 2 GB CF card (UFS) in a primary IDE slot. There are also on-board SATA ports, and I've plugged in a 500 GB drive for storing backups. The whole drive is given over to a ZFS pool with multiple file systems underneath (dedup off, compression LZJB, copies=2).

Everything is nicely stable until I try to use rsync to push a significant amount (50 GB-ish) over to the drive. At some point during disk writes, the drive suddenly disappears as though it has been literally disconnected from the port.

Code:
Oct 30 10:07:23 bupbox kernel: (ada0:ata2:0:0:0): WRITE_DMA48. ACB: 35 00 76 c7 38 40 26 00 00 00 05 00
Oct 30 10:07:23 bupbox kernel: (ada0:ata2:0:0:0): CAM status: Command timeout
Oct 30 10:07:23 bupbox kernel: (ada0:ata2:0:0:0): Retrying command
Oct 30 10:07:23 bupbox kernel: ada0 at ata2 bus 0 scbus0 target 0 lun 0
Oct 30 10:07:23 bupbox kernel: ada0: <WDC WD5000AAKS-65YGA0 12.01C02> s/n WD-WCAS84345972 detached
Oct 30 10:07:23 bupbox kernel: (ada0:ata2:0:0:0): Periph destroyed

The only thing that brings the drive back is a total reboot: a HARD power off and back on.

Any ideas what might cause this? I was thinking it could be drive temperature, although I have a fan pointed at it. Bad RAM? Something else?
 
SiliconImage SIL3112 SATA RAID, BIOS v4.4.02. Just the one drive connected to it, no RAID setup.

Edit: some SMART info for you
Code:
=== START OF INFORMATION SECTION ===
Model Family:  Western Digital Caviar Blue (SATA)
Device Model:  WDC WD5000AAKS-65YGA0
Serial Number:  WD-WCAS84345972
LU WWN Device Id: 5 0014ee 1ab4899c0
Firmware Version: 12.01C02
User Capacity:  500,107,862,016 bytes [500 GB]
Sector Size:  512 bytes logical/physical
Device is:  In smartctl database [for details use: -P show]
ATA Version is:  ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.5, 1.5 Gb/s
Local Time is:  Thu Oct 30 11:38:21 2014 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
 
I have always had a similar problem when copying to (as opposed to from) external flash memory and disks, anything that is not native IDE or SATA. So I use --bwlimit=1000 (or maybe 2000, which is riskier). That throttles rsync to about 1/10 speed and has proven very, very reliable. YMMV, but recommended.
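
For anyone who wants to script the throttled transfer described above, here is a minimal sketch in Python. It only assembles and runs an rsync command line with --bwlimit; the function names and the example paths are hypothetical, and the default of 1000 is simply the value mentioned in this post (rsync reads the number as KiB per second).

```python
import subprocess


def build_rsync_cmd(src, dest, bwlimit_kib=1000):
    """Return an rsync argument list throttled to bwlimit_kib KiB/s."""
    if bwlimit_kib <= 0:
        raise ValueError("bwlimit_kib must be positive")
    # -a is archive mode; --bwlimit caps the transfer rate.
    return ["rsync", "-a", f"--bwlimit={bwlimit_kib}", src, dest]


def run_throttled(src, dest, bwlimit_kib=1000):
    """Run the transfer and return rsync's exit code (0 means success)."""
    return subprocess.run(build_rsync_cmd(src, dest, bwlimit_kib)).returncode


if __name__ == "__main__":
    # Hypothetical paths, shown only to illustrate the resulting command.
    print(" ".join(build_rsync_cmd("/data/", "/backup/data/")))
```

Whether a given limit is low enough to avoid the timeouts is something only testing on the affected hardware can show.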
 
Can you test the drive connected directly to the motherboard? I also recommend updating the firmware on the RAID card.
 
Thanks, this seems like it might be a really helpful workaround. Slower transfers beat hard lockups, anyway! According to this site, --bwlimit works in rsyncd.conf server-side as well, which should keep it from dying no matter what the client is doing.

I haven't tried doing rsync-over-ssh yet but that may also have some benefit or at least offer some other options.

It is directly connected to the motherboard : ) I was reading info off the chip soldered on the board.
 
Usually motherboards that have a RAID chip also have non-RAID ports. Sometimes the connectors are different colors. If you can identify the system or motherboard, it would help.

And the power supply is still a suspect. Heavy load is exactly when those symptoms appear. If the motherboard is more than a few years old, it is worth inspecting the power supply capacitors near the processor also.
 
Do you have another drive (different model) that you can test with, simply to eliminate this drive as the source of the problem?

A part number of WDxxxxxxxx-65xxxx normally indicates an HP OEM model. HP occasionally requests strange features in their firmware, and they don't test for compatibility with controllers they don't use. Most non-OEM Western Digital drives will end with -00xxxx, -01xxxx, or -02xxxx (there are exceptions).
 
Thanks for the tips, guys.

Setting --bwlimit 1024 didn't help... well, actually, it had an interesting effect: the drive still unplugged itself, but after a much longer delay as the transfers took longer. It's almost as though it allows some amount of data to be written and then craps out, regardless of how long that time takes.

I don't have any other SATA drives, or a power supply tester. I guess in the absence of other tools, I'll just have to live with it for now.
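
One way to check the "fixed amount of data, then it dies" impression without rsync in the picture is a plain write test that reports how far it got. The sketch below is only an assumption about how such a test could look, not something anyone in this thread ran: the function name is made up, and it writes random data because the pool has LZJB compression enabled and zeros would barely touch the disk. If the drive detaches, the write may hang instead of raising an error, so the last progress line printed is the number to look at.

```python
import os


def write_until_failure(path, total_bytes, chunk_bytes=1024 * 1024,
                        log=lambda msg: print(msg, flush=True)):
    """Write random data to path in chunks, syncing after each chunk.

    Returns (bytes_written, error), where error is None on success or the
    OSError that stopped the run.
    """
    written = 0
    try:
        with open(path, "wb") as f:
            while written < total_bytes:
                n = min(chunk_bytes, total_bytes - written)
                f.write(os.urandom(n))   # random data defeats compression
                f.flush()
                os.fsync(f.fileno())     # push the chunk out to the device
                written += n
                log(f"{written} bytes written")
    except OSError as exc:
        return written, exc
    return written, None
```

Running it a few times at different chunk sizes and comparing the last reported byte count would show whether the failure point really tracks the amount written rather than elapsed time.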
 
Limiting the amount of data transferred by rsync is a hack to hide a problem. A power supply tester might not detect a problem. Swapping power supplies is a better test. "Just living with it" is a way to get corrupted data.
 
Another rsync hint: copying the whole disk in one run may use more memory and crash. If you copy each filesystem separately (or split the disk into similarly sized parts), rsync does not have to cache as much information about the transfer while doing so. That is the other half of how I use it here, in addition to the slowdown from the --bwlimit parameter.
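
The "one run per filesystem" idea above can be sketched roughly like this. It is a hypothetical helper, not a tested backup script: it assumes each top-level directory under the source root is a sensible unit to copy on its own, it skips plain files at the top level, and the names plan_transfers, source_root and dest_root are made up for the example.

```python
import os


def plan_transfers(source_root, dest_root, bwlimit_kib=1000):
    """Build one throttled rsync command per top-level directory."""
    cmds = []
    for name in sorted(os.listdir(source_root)):
        path = os.path.join(source_root, name)
        # Only real directories get their own run; top-level files and
        # symlinks are skipped in this sketch.
        if os.path.isdir(path) and not os.path.islink(path):
            cmds.append([
                "rsync", "-a", f"--bwlimit={bwlimit_kib}",
                path + "/",                           # copy contents of the dir
                os.path.join(dest_root, name) + "/",  # into a matching dir
            ])
    return cmds
```

Each returned list can be passed to subprocess.run in turn, so a failure part-way through only affects the part that was running.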
 