Solved ZFS write errors caused by "out of memory"?

cracauer@

Developer
Is this really running out of memory, as in out of RAM? It didn't use up the swap space it has.

I don't think it is the disks. ada1 shows no SMART errors at all. ada5 has some old ones, but none from the time of these writes. The operation that triggered this was writing via rsync.

raidz2, dedup on, compression maybe on (different filesystems, not sure where the error hit). Some snapshots taken, but only 1 or 2.

Code:
ata4: FAILURE - out of memory in start
(ada1:ata4:0:1:0): WRITE_DMA48. ACB: 35 00 f8 29 5f 40 44 00 00 00 00 01
(ada1:ata4:0:1:0): CAM status: CCB request was invalid
(ada1:ata4:0:1:0): Error 22, Unretryable error
ata7: FAILURE - out of memory in start
(ada5:ata7:0:0:0): WRITE_DMA48. ACB: 35 00 10 20 87 40 b7 00 00 00 00 01
(ada5:ata7:0:0:0): CAM status: CCB request was invalid
(ada5:ata7:0:0:0): Error 22, Unretryable error
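
To see whether the failures cluster on one disk or one channel, the messages can be grouped by the device tuple at the start of each line. This is only a rough sketch: the sample text is the excerpt quoted in this post, and the names `LOG`, `PERIPH` and `failed_requests` are made up for illustration.

```python
import re
from collections import Counter

# Excerpt copied from the kernel messages in this post. In practice this
# would be read from dmesg output or /var/log/messages instead.
LOG = """\
ata4: FAILURE - out of memory in start
(ada1:ata4:0:1:0): WRITE_DMA48. ACB: 35 00 f8 29 5f 40 44 00 00 00 00 01
(ada1:ata4:0:1:0): CAM status: CCB request was invalid
(ada1:ata4:0:1:0): Error 22, Unretryable error
ata7: FAILURE - out of memory in start
(ada5:ata7:0:0:0): WRITE_DMA48. ACB: 35 00 10 20 87 40 b7 00 00 00 00 01
(ada5:ata7:0:0:0): CAM status: CCB request was invalid
(ada5:ata7:0:0:0): Error 22, Unretryable error
"""

# Matches lines such as "(ada1:ata4:0:1:0): Error 22, Unretryable error"
# and captures the disk name, the channel name and the error number.
PERIPH = re.compile(r"^\((\w+):(\w+):\d+:\d+:\d+\): Error (\d+),")


def failed_requests(log_text):
    """Count 'Error N' lines per (disk, channel, error number)."""
    counts = Counter()
    for line in log_text.splitlines():
        m = PERIPH.match(line)
        if m:
            disk, channel, err = m.groups()
            counts[(disk, channel, int(err))] += 1
    return counts


if __name__ == "__main__":
    for (disk, channel, err), n in sorted(failed_requests(LOG).items()):
        print(f"{disk} on {channel}: error {err} x{n}")
```

For the excerpt above this reports one failed request each for ada1 on ata4 and ada5 on ata7, so two different channels are involved.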

It resilvered very quickly.

Code:
  pool: cbackup6
state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 2.35G in 0h4m with 0 errors on Sat Dec 20 10:56:42 2014
config:

        NAME        STATE     READ WRITE CKSUM
        cbackup6    ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            ada0p3  ONLINE       0     0     0
            ada1p3  ONLINE       0    33     0
            ada2p3  ONLINE       0     0     0
            ada3p3  ONLINE       0     0     0
            ada5p3  ONLINE       0    67     0

errors: No known data errors
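
The per-device counters in that status output can also be pulled out mechanically, which helps when comparing runs after moving drives around. A minimal sketch, assuming the config table has been saved as text; `STATUS`, `ROW` and `devices_with_errors` are hypothetical names, and the regex only handles plain integer counters like the ones shown here.

```python
import re

# Config table copied from the zpool status output in this post.
STATUS = """\
        NAME        STATE     READ WRITE CKSUM
        cbackup6    ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            ada0p3  ONLINE       0     0     0
            ada1p3  ONLINE       0    33     0
            ada2p3  ONLINE       0     0     0
            ada3p3  ONLINE       0     0     0
            ada5p3  ONLINE       0    67     0
"""

# name, state, then three integer counters (READ, WRITE, CKSUM).
# The header row does not match because its last columns are not digits.
ROW = re.compile(r"^\s*(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for rows with a nonzero counter."""
    found = {}
    for line in status_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header row or anything that is not a device row
        name, _state, read, write, cksum = m.groups()
        counts = (int(read), int(write), int(cksum))
        if any(counts):
            found[name] = counts
    return found


if __name__ == "__main__":
    for name, (r, w, c) in devices_with_errors(STATUS).items():
        print(f"{name}: read={r} write={w} cksum={c}")
```

On the table above this lists only ada1p3 (33 write errors) and ada5p3 (67 write errors), the same two disks that appear in the kernel messages.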

This is 11-current from August. It's a 12 GB original i7 Xeon. All disks are on the in-chipset ports.

Reviewing dmesg it looks like I forgot to turn on AHCI for this one (gotta love mainboards forgetting the BIOS settings).
Code:
atapci1: <Intel ICH10 SATA300 controller> port 0xb000-0xb007,0xac00-0xac03,0xa880-0xa887,0xa800-0xa803,0xa480-0xa48f,0xa400-0xa40f irq 19 at device 31.2 on pci0


I'll move this to a stable release and turn on AHCI. I'm just curious what the correct interpretation of the error message is. Or does somebody have other thoughts on this?
 
I've had similar errors with a dodgy SATA cable. You might want to check that as it's a rather cheap and easy fix.
 
Hm. Yeah. I think I should at least shuffle the drive positions around, so that if it is a cable, a port or the SATA frame, the errors would pop up on a different drive.
 
This was an ugly one. The disk array ended up in a wedged state that caused both panics from segfaults and plain machine hangs (deadlocks, I presume), zfs mount -a hanging forever, etc. That persisted after changing the motherboard, disk frame and SATA cables. I nuked the pool and made a new one, and it looks like it's rolling now.

One of those SAS 19" rackmounts with 12 or 16 drive slots and double-backed SAS cables to them looks pretty attractive now...
 