ZFS SATA drive disappears under load

Hornpipe2 · Oct 30, 2014

Here's an odd one. I built a system that boots/roots off a UFS 2 GB CF card in a primary IDE slot. There are also on-board SATA ports, and I've plugged in a 500 GB drive for storing backups - the whole thing given over to a ZFS pool with multiple file systems underneath (dedupe off, LZJB compression, copies=2).

Everything is nicely stable until I try to use rsync to push a significant amount (50 GB-ish) over to the drive. At some point during disk writes, the drive suddenly disappears as though it has been literally disconnected from the port.

Code:

Oct 30 10:07:23 bupbox kernel: (ada0:ata2:0:0:0): WRITE_DMA48. ACB: 35 00 76 c7 38 40 26 00 00 00 05 00
Oct 30 10:07:23 bupbox kernel: (ada0:ata2:0:0:0): CAM status: Command timeout
Oct 30 10:07:23 bupbox kernel: (ada0:ata2:0:0:0): Retrying command
Oct 30 10:07:23 bupbox kernel: ada0 at ata2 bus 0 scbus0 target 0 lun 0
Oct 30 10:07:23 bupbox kernel: ada0: <WDC WD5000AAKS-65YGA0 12.01C02> s/n WD-WCAS84345972 detached
Oct 30 10:07:23 bupbox kernel: (ada0:ata2:0:0:0): Periph destroyed

Only thing that brings the drive back is a ~~total reboot~~ HARD power off and back on.

Any ideas what might cause this? I was thinking it could be drive temperature, although I have a fan pointed at it. Bad RAM? Something else?

wblock@ · Oct 30, 2014

Bad power supply, maybe.

Hornpipe2 · Oct 30, 2014

Is smartd known to interfere with ZFS in any way?

wblock@ · Oct 30, 2014

Not to my knowledge. What controller is being used?

Hornpipe2 · Oct 30, 2014

SiliconImage SIL3112 SATA RAID, BIOS v4.4.02. Just the one drive connected to it, no RAID setup.

Edit: some SMART info for you

Code:

=== START OF INFORMATION SECTION ===
Model Family:  Western Digital Caviar Blue (SATA)
Device Model:  WDC WD5000AAKS-65YGA0
Serial Number:  WD-WCAS84345972
LU WWN Device Id: 5 0014ee 1ab4899c0
Firmware Version: 12.01C02
User Capacity:  500,107,862,016 bytes [500 GB]
Sector Size:  512 bytes logical/physical
Device is:  In smartctl database [for details use: -P show]
ATA Version is:  ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.5, 1.5 Gb/s
Local Time is:  Thu Oct 30 11:38:21 2014 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

jb_fvwm2 · Oct 30, 2014

I always had a similar problem when copying to (vs. from) external flash memory and disks... anything not native IDE or SATA. So I use --bwlimit=1000 (or maybe 2000, which is more risky). That throttles rsync to about 1/10 speed and has proven very, very reliable. YMMV, but recommended.

wblock@ · Oct 30, 2014

Can you test the drive connected directly to the motherboard? I also recommend updating the firmware on the RAID card.

Hornpipe2 · Oct 30, 2014

jb_fvwm2 said:
I always had a similar problem when copying to (vs. from) external flash memory and disks... anything not native IDE or SATA. So I use --bwlimit=1000 (or maybe 2000, which is more risky). That throttles rsync to about 1/10 speed and has proven very, very reliable. YMMV, but recommended.

Thanks, this seems like it might be a really helpful workaround - slower transfers beat hard lockups, anyway! According to this site, --bwlimit works in rsyncd.conf server-side as well, which should keep it from dying no matter what the client is doing.

I haven't tried doing rsync-over-ssh yet but that may also have some benefit or at least offer some other options.

wblock@ said:
Can you test the drive connected directly to the motherboard? I also recommend updating the firmware on the RAID card.

It is directly connected to the motherboard : ) I was reading info off the chip soldered on the board.

wblock@ · Oct 30, 2014

Usually motherboards that have a RAID chip also have non-RAID ports. Sometimes the connectors are different colors. If you can identify the system or motherboard, it would help.

And the power supply is still a suspect. Heavy load is exactly when those symptoms appear. If the motherboard is more than a few years old, it is worth inspecting the power supply capacitors near the processor also.

Terry_Kennedy · Oct 31, 2014

Hornpipe2 said:
Everything is nicely stable until I try to use rsync to push a significant amount (50 GB-ish) over to the drive. At some point during disk writes, the drive suddenly disappears as though it has been literally disconnected from the port.

Code:

Oct 30 10:07:23 bupbox kernel: ada0: <WDC WD5000AAKS-65YGA0 12.01C02> s/n WD-WCAS84345972 detached

Do you have another drive (different model) that you can test with, simply to eliminate this drive as the source of the problem?

A part number of WDxxxxxxxx-65xxxx normally indicates an HP OEM model. HP occasionally requests strange features in their firmware, and they don't test for compatibility with controllers they don't use. Most non-OEM Western Digital drives will end with -00xxxx, -01xxxx, or -02xxxx (there are exceptions).

Hornpipe2 · Nov 8, 2014

Thanks for the tips, guys.

Setting --bwlimit 1024 didn't help... well, actually, it had an interesting effect: the drive still unplugged itself, but after a much longer delay as the transfers took longer. It's almost as though it allows some amount of data to be written and then craps out, regardless of how long that time takes.
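
For a sense of scale, here is a rough sketch of how long a throttled transfer of this size runs. It assumes the --bwlimit value is in units of 1024 bytes per second (as newer rsync man pages describe it) and takes "50 GB-ish" as 50 decimal gigabytes. The function name `transfer_hours` is made up for illustration.

```python
def transfer_hours(total_bytes: int, bwlimit_kib_per_s: int) -> float:
    """Hours needed to move total_bytes when capped at bwlimit_kib_per_s KiB/s.

    Assumption: the rsync --bwlimit value counts units of 1024 bytes per second.
    """
    bytes_per_second = bwlimit_kib_per_s * 1024
    return total_bytes / bytes_per_second / 3600


if __name__ == "__main__":
    fifty_gb = 50 * 10**9  # the "50 GB-ish" figure, taken as decimal GB
    for limit in (1000, 1024, 2000):
        hours = transfer_hours(fifty_gb, limit)
        print(f"--bwlimit={limit}: about {hours:.1f} hours")
```

At a cap around 1024 the full transfer takes more than half a day, so a drop that comes "after a much longer delay" could still happen after a similar amount of data has been written.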

I don't have any other SATA drives. Or a power supply tester. I guess in absence of other tools, I'll just have to live with it for now.

wblock@ · Nov 8, 2014

Limiting the amount of data transferred by rsync is a hack to hide a problem. A power supply tester might not detect a problem. Swapping power supplies is a better test. "Just living with it" is a way to get corrupted data.

jb_fvwm2 · Nov 8, 2014

Another rsync hint: copying the whole disk in one run may use more memory and crash. If you copy each filesystem separately (or similarly sized parts of the disk), rsync does not have to cache as much information about the transfer. That is the other part of how I use it here, in addition to the slowdown by parameter.

Jimmy · Nov 11, 2014

Do you get any write failures prior to the disk reset? I have seen this with a faulty disk drive.
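
One way to check for earlier failures is to scan the kernel log for CAM error lines that precede the detach message. The sketch below is only an illustration: the helper name `errors_before_detach` is made up, the line format is assumed to match the excerpt posted earlier in this thread, and the sample lines are copied from that excerpt.

```python
import re

# Assumed syslog layout, based on the excerpt in this thread:
# "Oct 30 10:07:23 bupbox kernel: <message>"
LINE_RE = re.compile(r"^(\w{3}\s+\d+\s+[\d:]+)\s+(\S+)\s+kernel:\s+(.*)$")

SAMPLE = [
    "Oct 30 10:07:23 bupbox kernel: (ada0:ata2:0:0:0): WRITE_DMA48. ACB: 35 00 76 c7 38 40 26 00 00 00 05 00",
    "Oct 30 10:07:23 bupbox kernel: (ada0:ata2:0:0:0): CAM status: Command timeout",
    "Oct 30 10:07:23 bupbox kernel: (ada0:ata2:0:0:0): Retrying command",
    "Oct 30 10:07:23 bupbox kernel: ada0 at ata2 bus 0 scbus0 target 0 lun 0",
    "Oct 30 10:07:23 bupbox kernel: ada0: <WDC WD5000AAKS-65YGA0 12.01C02> s/n WD-WCAS84345972 detached",
    "Oct 30 10:07:23 bupbox kernel: (ada0:ata2:0:0:0): Periph destroyed",
]


def errors_before_detach(log_lines, device="ada0"):
    """Collect CAM status lines for `device` seen before its first detach.

    Returns (errors, detach_line); detach_line is None if no detach was found.
    """
    errors = []
    for raw in log_lines:
        line = raw.strip()
        match = LINE_RE.match(line)
        if not match:
            continue  # not a kernel line in the assumed format
        stamp, _host, msg = match.groups()
        if msg.startswith(f"{device}:") and msg.endswith("detached"):
            return errors, line
        if msg.startswith(f"({device}:") and "CAM status:" in msg:
            status = msg.split("CAM status:", 1)[1].strip()
            errors.append((stamp, status))
    return errors, None


if __name__ == "__main__":
    found, detach = errors_before_detach(SAMPLE)
    for stamp, status in found:
        print(f"{stamp}: {status}")
    print("detach:", detach)
```

To run it against a real log, pass the lines of the system log file (commonly /var/log/messages on FreeBSD, though that path is an assumption here) instead of `SAMPLE`.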