Hi guys,
I have been working on an issue with my ZFS pool for the last week but I cannot catch a break. I am hoping someone here may know what has happened.
Put simply, about a week ago what was a perfectly healthy, functioning ZFS array started locking up whenever I try to write to it. Reads still work, but any write hangs the pool.
Originally the issue presented itself as some services that write to the array getting stuck in the zio->i state in top. I posted about it here: forums.freebsd.org/showthread.php?t=42072.
After disabling everything that wrote to the array, I scrubbed it and found that a single file was unrecoverable, plus some checksum errors. I deleted the file, cleared the checksum errors and scrubbed again. That scrub came up with different checksum errors, so I scrubbed once more to be sure and received yet another new set of checksum errors!
I have tried disabling write-cache but that did not appear to help:
Code:
/boot/loader.conf
### Load RAID Drivers
vfs.zfs.prefetch_disable="1"
vfs.zfs.cache_flush_disable="1"
zfs_load="YES"
### Load RAID Drivers
hpt27xx_load="YES"
I have also seen some suggestions to use a ZIL, which I had not heard of before. It sounds a bit risky.
I've found similar issues where the advice was to check that the power supply is sufficient, but this system has been fine for 6 months and I have not added any new devices or made any real changes to the software.
What else can I look into?
More information
The controller card's BIOS shows all disks as healthy, and so does zpool:
Code:
#zpool status -x
all pools are healthy
I am currently scrubbing again after another failed write test. The progress so far is as follows:
Code:
#zpool status -v
  pool: datastore
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
  scan: scrub in progress since Sun Oct 6 20:06:36 2013
        8.74T scanned out of 10.3T at 172M/s, 2h38m to go
        334K repaired, 84.88% done
config:
        NAME          STATE     READ WRITE CKSUM
        datastore     ONLINE       0     0     0
          raidz2-0    ONLINE       0     0     0
            da2       ONLINE       0     0     7  (repairing)
            da3       ONLINE       0     0     0
            da4       ONLINE       0     0     1  (repairing)
            da5       ONLINE       0     0     3  (repairing)
            da8       ONLINE       0     0     0
            da6       ONLINE       0     0     0
            da7       ONLINE       0     0     2  (repairing)
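To keep track of which disks pick up checksum errors from one scrub to the next, something like the following rough Python sketch could tally the CKSUM column. It is only an illustration: the function name `cksum_errors` and the regular expression are my own, the sample text is copied from the output above, and it assumes the counters are plain integers (abbreviated counts such as `1.2K` would be skipped).

```python
import re

# Sample rows copied from the `zpool status -v` output above.
STATUS_TEXT = """\
NAME STATE READ WRITE CKSUM
datastore ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
da2 ONLINE 0 0 7 (repairing)
da3 ONLINE 0 0 0
da4 ONLINE 0 0 1 (repairing)
da5 ONLINE 0 0 3 (repairing)
da8 ONLINE 0 0 0
da6 ONLINE 0 0 0
da7 ONLINE 0 0 2 (repairing)
"""

# Columns: name, state, read, write, cksum, optional note in parentheses.
# Leading whitespace is allowed so indented output parses as well.
ROW = re.compile(
    r"^\s*(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)(?:\s+\((.*)\))?\s*$"
)


def cksum_errors(text):
    """Return {name: cksum_count} for every row with a non-zero CKSUM."""
    errors = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header line or anything that is not a table row
        name = match.group(1)
        cksum = int(match.group(5))
        if cksum > 0:
            errors[name] = cksum
    return errors


if __name__ == "__main__":
    for name, count in sorted(cksum_errors(STATUS_TEXT).items()):
        print(name, count)
```

Saving the result after each scrub and comparing the dictionaries would show whether the errors stay on the same disks or move around.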
Here is the messages log for bootup and the current scrub:
Code:
/var/log/messages
...
(bootup)
Oct 6 20:00:30 bsd kernel: da2 at hpt27xx0 bus 0 scbus2 target 0 lun 0
Oct 6 20:00:30 bsd kernel: da2: <HPT DISK 0_0 4.00> Fixed Direct Access SCSI-0 device
Oct 6 20:00:30 bsd kernel: da2: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
Oct 6 20:00:30 bsd kernel: da1 at ahd1 bus 0 scbus1 target 2 lun 0
Oct 6 20:00:30 bsd kernel: da1: <SEAGATE ST373307LW 0003> Fixed Direct Access SCSI-3 device
Oct 6 20:00:30 bsd kernel: da1: 320.000MB/s transfers (160.000MHz DT, offset 63, 16bit)
Oct 6 20:00:30 bsd kernel: da1: Command Queueing enabled
Oct 6 20:00:30 bsd kernel: da1: 70007MB (143374744 512 byte sectors: 255H 63S/T 8924C)
Oct 6 20:00:30 bsd kernel: da0 at ahd0 bus 0 scbus0 target 4 lun 0
Oct 6 20:00:30 bsd kernel: da0: <SEAGATE ST373307LW 0003> Fixed Direct Access SCSI-3 device
Oct 6 20:00:30 bsd kernel: da0: 160.000MB/s transfers (80.000MHz DT, offset 63, 16bit)
Oct 6 20:00:30 bsd kernel: da0: Command Queueing enabled
Oct 6 20:00:30 bsd kernel: da0: 70007MB (143374744 512 byte sectors: 255H 63S/T 8924C)
Oct 6 20:00:30 bsd kernel: da3 at hpt27xx0 bus 0 scbus2 target 1 lun 0
Oct 6 20:00:30 bsd kernel: da3: <HPT DISK 0_1 4.00> Fixed Direct Access SCSI-0 device
Oct 6 20:00:30 bsd kernel: da3: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
Oct 6 20:00:30 bsd kernel: da4 at hpt27xx0 bus 0 scbus2 target 2 lun 0
Oct 6 20:00:30 bsd kernel: da4: <HPT DISK 0_2 4.00> Fixed Direct Access SCSI-0 device
Oct 6 20:00:30 bsd kernel: da4: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
Oct 6 20:00:30 bsd kernel: da5 at hpt27xx0 bus 0 scbus2 target 3 lun 0
Oct 6 20:00:30 bsd kernel: da5: <HPT DISK 0_3 4.00> Fixed Direct Access SCSI-0 device
Oct 6 20:00:30 bsd kernel: da5: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
Oct 6 20:00:30 bsd kernel: da6 at hpt27xx0 bus 0 scbus2 target 4 lun 0
Oct 6 20:00:30 bsd kernel: da6: <HPT DISK 0_4 4.00> Fixed Direct Access SCSI-0 device
Oct 6 20:00:30 bsd kernel: da6: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
Oct 6 20:00:30 bsd kernel: da7 at hpt27xx0 bus 0 scbus2 target 5 lun 0
Oct 6 20:00:30 bsd kernel: da7: <HPT DISK 0_5 4.00> Fixed Direct Access SCSI-0 device
Oct 6 20:00:30 bsd kernel: da7: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
Oct 6 20:00:30 bsd kernel: da8 at hpt27xx0 bus 0 scbus2 target 6 lun 0
Oct 6 20:00:30 bsd kernel: da8: <HPT DISK 0_6 4.00> Fixed Direct Access SCSI-0 device
Oct 6 20:00:30 bsd kernel: da8: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
...
(scrubbing)
Oct 6 21:17:08 bsd kernel: hpt27xx: Device error information 0x1000000
Oct 6 21:17:08 bsd kernel: hpt27xx: Task file error, StatusReg=0x41, ErrReg=0x54, LBA[0-3]=0x5146b746,LBA[4-7]=0x0.
Oct 6 21:17:09 bsd kernel: hpt27xx: Device error information 0x8000000080000000
Oct 6 23:27:57 bsd kernel: hpt27xx: Device error information 0x1000000
Oct 6 23:27:57 bsd kernel: hpt27xx: Task file error, StatusReg=0x41, ErrReg=0x54, LBA[0-3]=0x12f19e19,LBA[4-7]=0x0.
Oct 6 23:27:57 bsd kernel: hpt27xx: Device error information 0x8000000080000000
Oct 7 02:42:15 bsd kernel: hpt27xx: Device error information 0x1000000
Oct 7 02:42:15 bsd kernel: hpt27xx: Task file error, StatusReg=0x41, ErrReg=0x4, LBA[0-3]=0xda69d09e,LBA[4-7]=0x0.
Oct 7 02:42:15 bsd kernel: hpt27xx: Device error information 0x8000000080000000
Oct 7 03:39:39 bsd kernel: hpt27xx: Device error information 0x1000000
Oct 7 03:39:39 bsd kernel: hpt27xx: Task file error, StatusReg=0x41, ErrReg=0x4, LBA[0-3]=0xd9eb36ce,LBA[4-7]=0x0.
Oct 7 03:39:39 bsd kernel: hpt27xx: Device error information 0x8000000080000000
Oct 7 06:16:46 bsd kernel: hpt27xx: Device error information 0x1000000
Oct 7 06:16:46 bsd kernel: hpt27xx: Task file error, StatusReg=0x41, ErrReg=0x54, LBA[0-3]=0xd296addb,LBA[4-7]=0x0.
Oct 7 06:16:46 bsd kernel: hpt27xx: Device error information 0x8000000080000000
Oct 7 06:23:52 bsd kernel: hpt27xx: Device error information 0x1000000
Oct 7 06:23:52 bsd kernel: hpt27xx: Task file error, StatusReg=0x41, ErrReg=0x54, LBA[0-3]=0x739bcfc8,LBA[4-7]=0x0.
Oct 7 06:23:53 bsd kernel: hpt27xx: Device error information 0x8000000080000000
Oct 7 09:00:11 bsd kernel: hpt27xx: Device error information 0x1000000
Oct 7 09:00:11 bsd kernel: hpt27xx: Task file error, StatusReg=0x41, ErrReg=0x54, LBA[0-3]=0xb59a2cb1,LBA[4-7]=0x0.
Oct 7 09:00:11 bsd kernel: hpt27xx: Device error information 0x8000000080000000
Oct 7 10:51:37 bsd hsflowd: res_search(_sflow._udp, C_IN, 16) failed : Operation timed out (h_errno=1)
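In case it helps anyone read the "Task file error" lines, here is a rough Python sketch that decodes the register values. It rests on assumptions I cannot confirm: that hpt27xx prints the raw ATA status and error registers, and that LBA[0-3] is an ordinary 32-bit hexadecimal sector number. The bit names follow the ATA command set as I understand it (they vary a little between ATA revisions), and the helper names `decode_bits` and `decode_line` are mine, not part of any driver or tool.

```python
import re

# ATA status register bits (bit number -> name). Assumed to apply here.
STATUS_BITS = {7: "BSY", 6: "DRDY", 5: "DF", 3: "DRQ", 0: "ERR"}
# ATA error register bits (bit 7 is the interface CRC flag for Ultra DMA).
ERROR_BITS = {7: "ICRC", 6: "UNC", 4: "IDNF", 2: "ABRT"}

# Matches the register fields in the hpt27xx lines shown in the log above.
TASK_FILE = re.compile(
    r"StatusReg=0x([0-9a-fA-F]+), ErrReg=0x([0-9a-fA-F]+), "
    r"LBA\[0-3\]=0x([0-9a-fA-F]+)"
)


def decode_bits(value, names):
    """List the names of all known bits set in value, highest bit first."""
    return [
        name
        for bit, name in sorted(names.items(), reverse=True)
        if value & (1 << bit)
    ]


def decode_line(line):
    """Decode one 'Task file error' log line, or return None if no match."""
    match = TASK_FILE.search(line)
    if match is None:
        return None
    status, error, lba = (int(group, 16) for group in match.groups())
    return {
        "status": decode_bits(status, STATUS_BITS),
        "error": decode_bits(error, ERROR_BITS),
        "lba": lba,
    }


if __name__ == "__main__":
    sample = (
        "Oct 6 21:17:08 bsd kernel: hpt27xx: Task file error, "
        "StatusReg=0x41, ErrReg=0x54, LBA[0-3]=0x5146b746,LBA[4-7]=0x0."
    )
    print(decode_line(sample))
```

Under those assumptions, StatusReg=0x41 is DRDY plus ERR, ErrReg=0x54 is UNC (uncorrectable data), IDNF (ID not found) and ABRT (command aborted), and ErrReg=0x4 is ABRT alone. Note that these lines do not say which disk reported the error.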
Thanks all
D