Pool freezes when writing

Hi guys,

I have been working on an issue with my ZFS pool for the last week but I cannot catch a break. I am hoping someone here may know what has happened.

Put simply, as of about a week ago, what was a perfectly healthy and functioning ZFS array has started locking up when I try to write to it. I can read fine, but any write causes it to hang.

Originally the issue presented itself as some services that write to the array getting stuck in the top(1) state zio->i. I posted about it here: forums.freebsd.org/showthread.php?t=42072.

After disabling everything that wrote to the array, I scrubbed it and found a single unrecoverable file plus some checksum errors. I deleted the file, cleared the checksum errors and scrubbed again; this came up with different checksum errors, so I scrubbed once more to be sure and received yet another new set of checksum errors!
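
For reference, the scrub/clear cycle I have been running is roughly this (the pool is called datastore):
Code:
# zpool scrub datastore       # start a scrub
# zpool status -v datastore   # check progress and per-device error counts
# zpool clear datastore       # reset the error counters before the next test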

I have tried disabling write-cache but that did not appear to help:
Code:
/boot/loader.conf
### ZFS
vfs.zfs.prefetch_disable="1"
vfs.zfs.cache_flush_disable="1"
zfs_load="YES"
### Load RAID Drivers
hpt27xx_load="YES"
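
(The tunables can be checked after a reboot with sysctl(8), e.g.:)
Code:
# sysctl vfs.zfs.prefetch_disable vfs.zfs.cache_flush_disable
vfs.zfs.prefetch_disable: 1
vfs.zfs.cache_flush_disable: 1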

I have also noticed some suggestions to add a separate ZIL (log) device, which I had not heard of before. It sounds a bit risky.
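
If I do end up trying it, from what I have read the syntax would be roughly this (partition names are just examples):
Code:
# zpool add datastore log gpt/zlog        # dedicated ZIL/log device
# zpool add datastore cache gpt/zcache0   # L2ARC cache device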

I've found similar issues where the advice was to check that enough power was being supplied, but this system has been fine for six months and I have not added any new devices or made any real software changes.

What else can I look into?



More information

The controller card's BIOS shows all disks as healthy, and so does zpool:

Code:
#zpool status -x
all pools are healthy

I am currently scrubbing again after another failed write test; the progress is as follows:

Code:
#zpool status -v
  pool: datastore
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
  scan: scrub in progress since Sun Oct  6 20:06:36 2013
        8.74T scanned out of 10.3T at 172M/s, 2h38m to go
        334K repaired, 84.88% done
config:

	NAME        STATE     READ WRITE CKSUM
	datastore   ONLINE       0     0     0
	  raidz2-0  ONLINE       0     0     0
	    da2     ONLINE       0     0     7  (repairing)
	    da3     ONLINE       0     0     0
	    da4     ONLINE       0     0     1  (repairing)
	    da5     ONLINE       0     0     3  (repairing)
	    da8     ONLINE       0     0     0
	    da6     ONLINE       0     0     0
	    da7     ONLINE       0     0     2  (repairing)

Here is the messages log for bootup and the current scrub.
Code:
/var/log/messages
...
(bootup)
Oct  6 20:00:30 bsd kernel: da2 at hpt27xx0 bus 0 scbus2 target 0 lun 0
Oct  6 20:00:30 bsd kernel: da2: <HPT DISK 0_0 4.00> Fixed Direct Access SCSI-0 device 
Oct  6 20:00:30 bsd kernel: da2: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
Oct  6 20:00:30 bsd kernel: da1 at ahd1 bus 0 scbus1 target 2 lun 0
Oct  6 20:00:30 bsd kernel: da1: <SEAGATE ST373307LW 0003> Fixed Direct Access SCSI-3 device 
Oct  6 20:00:30 bsd kernel: da1: 320.000MB/s transfers (160.000MHz DT, offset 63, 16bit)
Oct  6 20:00:30 bsd kernel: da1: Command Queueing enabled
Oct  6 20:00:30 bsd kernel: da1: 70007MB (143374744 512 byte sectors: 255H 63S/T 8924C)
Oct  6 20:00:30 bsd kernel: da0 at ahd0 bus 0 scbus0 target 4 lun 0
Oct  6 20:00:30 bsd kernel: da0: <SEAGATE ST373307LW 0003> Fixed Direct Access SCSI-3 device 
Oct  6 20:00:30 bsd kernel: da0: 160.000MB/s transfers (80.000MHz DT, offset 63, 16bit)
Oct  6 20:00:30 bsd kernel: da0: Command Queueing enabled
Oct  6 20:00:30 bsd kernel: da0: 70007MB (143374744 512 byte sectors: 255H 63S/T 8924C)
Oct  6 20:00:30 bsd kernel: da3 at hpt27xx0 bus 0 scbus2 target 1 lun 0
Oct  6 20:00:30 bsd kernel: da3: <HPT DISK 0_1 4.00> Fixed Direct Access SCSI-0 device 
Oct  6 20:00:30 bsd kernel: da3: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
Oct  6 20:00:30 bsd kernel: da4 at hpt27xx0 bus 0 scbus2 target 2 lun 0
Oct  6 20:00:30 bsd kernel: da4: <HPT DISK 0_2 4.00> Fixed Direct Access SCSI-0 device 
Oct  6 20:00:30 bsd kernel: da4: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
Oct  6 20:00:30 bsd kernel: da5 at hpt27xx0 bus 0 scbus2 target 3 lun 0
Oct  6 20:00:30 bsd kernel: da5: <HPT DISK 0_3 4.00> Fixed Direct Access SCSI-0 device 
Oct  6 20:00:30 bsd kernel: da5: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
Oct  6 20:00:30 bsd kernel: da6 at hpt27xx0 bus 0 scbus2 target 4 lun 0
Oct  6 20:00:30 bsd kernel: da6: <HPT DISK 0_4 4.00> Fixed Direct Access SCSI-0 device 
Oct  6 20:00:30 bsd kernel: da6: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
Oct  6 20:00:30 bsd kernel: da7 at hpt27xx0 bus 0 scbus2 target 5 lun 0
Oct  6 20:00:30 bsd kernel: da7: <HPT DISK 0_5 4.00> Fixed Direct Access SCSI-0 device 
Oct  6 20:00:30 bsd kernel: da7: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
Oct  6 20:00:30 bsd kernel: da8 at hpt27xx0 bus 0 scbus2 target 6 lun 0
Oct  6 20:00:30 bsd kernel: da8: <HPT DISK 0_6 4.00> Fixed Direct Access SCSI-0 device 
Oct  6 20:00:30 bsd kernel: da8: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
...
(scrubbing)
Oct  6 21:17:08 bsd kernel: hpt27xx: Device error information 0x1000000
Oct  6 21:17:08 bsd kernel: hpt27xx: Task file error, StatusReg=0x41, ErrReg=0x54, LBA[0-3]=0x5146b746,LBA[4-7]=0x0.
Oct  6 21:17:09 bsd kernel: hpt27xx: Device error information 0x8000000080000000
Oct  6 23:27:57 bsd kernel: hpt27xx: Device error information 0x1000000
Oct  6 23:27:57 bsd kernel: hpt27xx: Task file error, StatusReg=0x41, ErrReg=0x54, LBA[0-3]=0x12f19e19,LBA[4-7]=0x0.
Oct  6 23:27:57 bsd kernel: hpt27xx: Device error information 0x8000000080000000
Oct  7 02:42:15 bsd kernel: hpt27xx: Device error information 0x1000000
Oct  7 02:42:15 bsd kernel: hpt27xx: Task file error, StatusReg=0x41, ErrReg=0x4, LBA[0-3]=0xda69d09e,LBA[4-7]=0x0.
Oct  7 02:42:15 bsd kernel: hpt27xx: Device error information 0x8000000080000000
Oct  7 03:39:39 bsd kernel: hpt27xx: Device error information 0x1000000
Oct  7 03:39:39 bsd kernel: hpt27xx: Task file error, StatusReg=0x41, ErrReg=0x4, LBA[0-3]=0xd9eb36ce,LBA[4-7]=0x0.
Oct  7 03:39:39 bsd kernel: hpt27xx: Device error information 0x8000000080000000
Oct  7 06:16:46 bsd kernel: hpt27xx: Device error information 0x1000000
Oct  7 06:16:46 bsd kernel: hpt27xx: Task file error, StatusReg=0x41, ErrReg=0x54, LBA[0-3]=0xd296addb,LBA[4-7]=0x0.
Oct  7 06:16:46 bsd kernel: hpt27xx: Device error information 0x8000000080000000
Oct  7 06:23:52 bsd kernel: hpt27xx: Device error information 0x1000000
Oct  7 06:23:52 bsd kernel: hpt27xx: Task file error, StatusReg=0x41, ErrReg=0x54, LBA[0-3]=0x739bcfc8,LBA[4-7]=0x0.
Oct  7 06:23:53 bsd kernel: hpt27xx: Device error information 0x8000000080000000
Oct  7 09:00:11 bsd kernel: hpt27xx: Device error information 0x1000000
Oct  7 09:00:11 bsd kernel: hpt27xx: Task file error, StatusReg=0x41, ErrReg=0x54, LBA[0-3]=0xb59a2cb1,LBA[4-7]=0x0.
Oct  7 09:00:11 bsd kernel: hpt27xx: Device error information 0x8000000080000000
Oct  7 10:51:37 bsd hsflowd: res_search(_sflow._udp, C_IN, 16) failed : Operation timed out (h_errno=1)


Thanks all
D
 
I have disabled the write cache on each disk using the hptsvr web UI.

This seems to have worked, but I do not understand why it changed after working fine for so long.

Could it be that the array is filling up?
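
I will check the utilisation with zpool list, e.g.:
Code:
# zpool list datastore

From what I have read, write performance can drop off noticeably once a pool gets beyond roughly 80-90% full.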
 
Data is getting corrupted somewhere between the disk and the CPU (hence the checksum errors). Check your SATA cables, your power cables, and your RAM. Reseat all the cables, or replace them if you have spares. Run memtest86+ or similar overnight. Check the CPU heatsink/fan and the temperatures of the CPU and case. Run a few SMART tests on the disks, just in case.
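
For the SMART tests, smartmontools should do it. Behind the HighPoint card you may need the -d hpt,L/M/N pass-through option described in smartctl(8); the plain form would be something like:
Code:
# smartctl -a /dev/da2        # full SMART report for one disk
# smartctl -t long /dev/da2   # start a long self-test; check the result later with -a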
 
If you can move the array over to the motherboard controller, it would take the RAID controller out of the equation. Unless it is the motherboard controller...

These sound like spurious errors. The first suspect in that situation is RAM, which can fail suddenly. Power supply is another suspect, but I would expect a power supply failing to cause a gradually-increasing error rate. If Memtest shows a problem, that does not isolate the component, but doing a binary search with the RAM might narrow it down.
 
Thanks guys :)

I'll look into the hardware a bit further, but unfortunately I cannot move the array off the controller.

Currently all SMART tests come up OK.

I'll try memtest86+ tonight and see how it fares.
 
Back again, sorry it has taken so long.

I found one of the RAM sticks was faulty, so I replaced them all; they were about four years old, so it was about time. I also reseated the cabling and cleaned the CPU fan. This had no effect, so I looked into some software options: I upgraded to FreeBSD 9.2 and have started using a ZFS log and cache device, but the issue is the same.

Below are all the logs I can find. I cut them pretty short but still needed two posts!

#dmesg
Code:
FreeBSD 9.2-RELEASE
...
hpt27xx0: <odin> mem 0xfc4e0000-0xfc4fffff,0xfc480000-0xfc4bffff irq 16 at device 0.0 on pci1
hpt27xx: adapter at PCI 1:0:0, IRQ 16
...
ZFS filesystem version: 5
ZFS storage pool version: features support (5000)
Timecounters tick every 1.000 msec
hpt27xx: Attached device index 40 (Path 04 | Target 00 | E0/Sff)  00000000
hpt27xx: Attached device index 41 (Path 05 | Target 00 | E0/Sff)  00000000
hpt27xx: Attached device index 42 (Path 06 | Target 00 | E0/Sff)  00000000
hpt27xx: Attached device index 43 (Path 07 | Target 00 | E0/Sff)  00000000
hpt27xx: Attached device index 00 (Path 00 | Target 00 | E0/Sff)  00000000
hpt27xx: Attached device index 01 (Path 01 | Target 00 | E0/Sff)  00000000
hpt27xx: Attached device index 02 (Path 02 | Target 00 | E0/Sff)  00000000
hpt27xx: Attached device index 03 (Path 03 | Target 00 | E0/Sff)  00000000
hpt27xx0: [GIANT-LOCKED]
(probe38:hpt27xx0:0:8:0): INQUIRY. CDB: 12 00 00 00 24 00 
(probe38:hpt27xx0:0:8:0): CAM status: Invalid Target ID
(probe38:hpt27xx0:0:8:0): Error 22, Unretryable error
...
(probe284:hpt27xx0:0:254:0): INQUIRY. CDB: 12 00 00 00 24 00 
(probe284:hpt27xx0:0:254:0): CAM status: Invalid Target ID
(probe284:hpt27xx0:0:254:0): Error 22, Unretryable error
...
da2 at hpt27xx0 bus 0 scbus11 target 0 lun 0
da2: <HPT DISK 0_0 4.00> Fixed Direct Access SCSI-0 device 
da2: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
da3 at hpt27xx0 bus 0 scbus11 target 1 lun 0
da3: <HPT DISK 0_1 4.00> Fixed Direct Access SCSI-0 device 
da3: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
da4 at hpt27xx0 bus 0 scbus11 target 2 lun 0
da4: <HPT DISK 0_2 4.00> Fixed Direct Access SCSI-0 device 
da4: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
da5 at hpt27xx0 bus 0 scbus11 target 3 lun 0
da5: <HPT DISK 0_3 4.00> Fixed Direct Access SCSI-0 device 
da5: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
da6 at hpt27xx0 bus 0 scbus11 target 4 lun 0
da6: <HPT DISK 0_4 4.00> Fixed Direct Access SCSI-0 device 
da6: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
da7 at hpt27xx0 bus 0 scbus11 target 5 lun 0
da7: <HPT DISK 0_5 4.00> Fixed Direct Access SCSI-0 device 
da7: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
da8 at hpt27xx0 bus 0 scbus11 target 6 lun 0
da8: <HPT DISK 0_6 4.00> Fixed Direct Access SCSI-0 device 
da8: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)
da9 at hpt27xx0 bus 0 scbus11 target 7 lun 0
da9: <HPT DISK 0_7 4.00> Fixed Direct Access SCSI-0 device 
da9: 1907729MB (3907029168 512 byte sectors: 255H 63S/T 243201C)

There are thousands of these errors in /var/log/messages

Code:
bsd kernel: (probe252:hpt27xx0:0:252:0): INQUIRY. CDB: 12 00 00 00 24 00 
bsd kernel: (probe252:hpt27xx0:0:252:0): CAM status: Invalid Target ID
bsd kernel: (probe252:hpt27xx0:0:252:0): Error 22, Unretryable error

The scrub is still finding heaps of checksum errors:

#zpool status -v datastore
Code:
  pool: datastore
 state: ONLINE
config:

	NAME           STATE     READ WRITE CKSUM
	datastore      ONLINE       0     0     3
	  raidz2-0     ONLINE       0     0     9
	    da2        ONLINE       0     0     1
	    da3        ONLINE       0     0     0
	    da4        ONLINE       0     0     0
	    da5        ONLINE       0     0    16
	    da9        ONLINE       0     0    16
	    da7        ONLINE       0     0     0
	    da8        ONLINE       0     0     0
	logs
	  gpt/zlog     ONLINE       0     0     0
	cache
	  gpt/zcache0  ONLINE       0     0     0


zfs-stats is a great script for collecting a summary of the ZFS configuration and performance.
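
It is in ports/packages (sysutils/zfs-stats, I believe):
Code:
# pkg install zfs-stats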


#zfs-stats -a
Code:
ZFS Subsystem Report				Fri Nov  1 16:06:59 2013

System Information:
	Kernel Version:				902001 (osreldate)
	Hardware Platform:			amd64
	Processor Architecture:			amd64
	ZFS Storage pool Version:		5000
	ZFS Filesystem Version:			5

FreeBSD 9.2-RELEASE #0 r255898: Thu Sep 26 22:50:31 UTC 2013 root
 4:06PM  up  5:46, 3 users, load averages: 1.11, 1.17, 1.16

System Memory:

	2.46%	169.97	MiB Active,	2.76%	190.49	MiB Inact
	29.14%	1.96	GiB Wired,	0.26%	17.75	MiB Cache
	65.38%	4.41	GiB Free,	0.01%	552.00	KiB Gap

	Real Installed:				7.00	GiB
	Real Available:			99.65%	6.98	GiB
	Real Managed:			96.63%	6.74	GiB

	Logical Total:				7.00	GiB
	Logical Used:			34.14%	2.39	GiB
	Logical Free:			65.86%	4.61	GiB

Kernel Memory:					1.47	GiB
	Data:				98.36%	1.44	GiB
	Text:				1.64%	24.66	MiB

Kernel Memory Map:				5.90	GiB
	Size:				22.86%	1.35	GiB
	Free:				77.14%	4.55	GiB

ARC Summary: (HEALTHY)
	Memory Throttle Count:			0

ARC Misc:
	Deleted:				3.37m
	Recycle Misses:				3.04k
	Mutex Misses:				818
	Evict Skips:				0

ARC Size:				28.69%	1.43	GiB
	Target Size: (Adaptive)		100.00%	5.00	GiB
	Min Size (Hard Limit):		10.00%	512.00	MiB
	Max Size (High Water):		10:1	5.00	GiB

ARC Size Breakdown:
	Recently Used Cache Size:	93.75%	4.69	GiB
	Frequently Used Cache Size:	6.25%	320.00	MiB

ARC Hash Breakdown:
	Elements Max:				565.35k
	Elements Current:		99.84%	564.45k
	Collisions:				3.61m
	Chain Max:				23
	Chains:					98.98k

ARC Efficiency:					6.35m
	Cache Hit Ratio:		62.21%	3.95m
	Cache Miss Ratio:		37.79%	2.40m
	Actual Hit Ratio:		53.11%	3.37m

	Data Demand Efficiency:		1.40%	2.35m

	CACHE HITS BY CACHE LIST:
	  Most Recently Used:		6.14%	242.41k
	  Most Frequently Used:		79.22%	3.13m
	  Most Recently Used Ghost:	30.91%	1.22m
	  Most Frequently Used Ghost:	1.48%	58.47k

	CACHE HITS BY DATA TYPE:
	  Demand Data:			0.83%	32.94k
	  Prefetch Data:		0.00%	0
	  Demand Metadata:		75.40%	2.98m
	  Prefetch Metadata:		23.77%	938.29k

	CACHE MISSES BY DATA TYPE:
	  Demand Data:			96.83%	2.32m
	  Prefetch Data:		0.00%	0
	  Demand Metadata:		0.13%	3.21k
	  Prefetch Metadata:		3.04%	72.87k

L2 ARC Summary: (HEALTHY)
	Passed Headroom:			196.16k
	Tried Lock Failures:			198.61k
	IO In Progress:				0
	Low Memory Aborts:			0
	Free on Write:				65
	Writes While Full:			1
	R/W Clashes:				0
	Bad Checksums:				0
	IO Errors:				0
	SPA Mismatch:				0

L2 ARC Size: (Adaptive)				1.46	GiB
	Header Size:			0.52%	7.77	MiB

L2 ARC Breakdown:				2.40m
	Hit Ratio:			0.01%	195
	Miss Ratio:			99.99%	2.40m
	Feeds:					20.70k

L2 ARC Buffer:
	Bytes Scanned:				13.59	TiB
	Buffer Iterations:			20.70k
	List Iterations:			1.32m
	NULL List Iterations:			375.65k

L2 ARC Writes:
	Writes Sent:			100.00%	5.04k

VDEV cache is disabled

ZFS Tunables (sysctl):
	kern.maxusers                           384
	vm.kmem_size                            6442450944
	vm.kmem_size_scale                      1
	vm.kmem_size_min                        0
	vm.kmem_size_max                        8589934592
	vfs.zfs.l2c_only_size                   538247168
	vfs.zfs.mfu_ghost_data_lsize            219091968
	vfs.zfs.mfu_ghost_metadata_lsize        199026176
	vfs.zfs.mfu_ghost_size                  418118144
	vfs.zfs.mfu_data_lsize                  0
	vfs.zfs.mfu_metadata_lsize              975539200
	vfs.zfs.mfu_size                        978570240
	vfs.zfs.mru_ghost_data_lsize            4304284672
	vfs.zfs.mru_ghost_metadata_lsize        646848512
	vfs.zfs.mru_ghost_size                  4951133184
	vfs.zfs.mru_data_lsize                  196034560
	vfs.zfs.mru_metadata_lsize              198780928
	vfs.zfs.mru_size                        417726464
	vfs.zfs.anon_data_lsize                 0
	vfs.zfs.anon_metadata_lsize             0
	vfs.zfs.anon_size                       2761216
	vfs.zfs.l2arc_norw                      1
	vfs.zfs.l2arc_feed_again                1
	vfs.zfs.l2arc_noprefetch                1
	vfs.zfs.l2arc_feed_min_ms               200
	vfs.zfs.l2arc_feed_secs                 1
	vfs.zfs.l2arc_headroom                  2
	vfs.zfs.l2arc_write_boost               8388608
	vfs.zfs.l2arc_write_max                 8388608
	vfs.zfs.arc_meta_limit                  1342177280
	vfs.zfs.arc_meta_used                   1341455184
	vfs.zfs.arc_min                         536870912
	vfs.zfs.arc_max                         5368709120
	vfs.zfs.dedup.prefetch                  1
	vfs.zfs.mdcomp_disable                  0
	vfs.zfs.nopwrite_enabled                1
	vfs.zfs.write_limit_override            0
	vfs.zfs.write_limit_inflated            22469603328
	vfs.zfs.write_limit_max                 936233472
	vfs.zfs.write_limit_min                 33554432
	vfs.zfs.write_limit_shift               3
	vfs.zfs.no_write_throttle               0
	vfs.zfs.zfetch.array_rd_sz              1048576
	vfs.zfs.zfetch.block_cap                256
	vfs.zfs.zfetch.min_sec_reap             2
	vfs.zfs.zfetch.max_streams              8
	vfs.zfs.prefetch_disable                1
	vfs.zfs.no_scrub_prefetch               0
	vfs.zfs.no_scrub_io                     0
	vfs.zfs.resilver_min_time_ms            3000
	vfs.zfs.free_min_time_ms                1000
	vfs.zfs.scan_min_time_ms                1000
	vfs.zfs.scan_idle                       50
	vfs.zfs.scrub_delay                     4
	vfs.zfs.resilver_delay                  2
	vfs.zfs.top_maxinflight                 32
	vfs.zfs.write_to_degraded               0
	vfs.zfs.mg_alloc_failures               8
	vfs.zfs.check_hostid                    1
	vfs.zfs.deadman_enabled                 1
	vfs.zfs.deadman_synctime                1000
	vfs.zfs.recover                         0
	vfs.zfs.txg.synctime_ms                 1000
	vfs.zfs.txg.timeout                     5
	vfs.zfs.vdev.cache.bshift               16
	vfs.zfs.vdev.cache.size                 0
	vfs.zfs.vdev.cache.max                  16384
	vfs.zfs.vdev.trim_on_init               1
	vfs.zfs.vdev.write_gap_limit            4096
	vfs.zfs.vdev.read_gap_limit             32768
	vfs.zfs.vdev.aggregation_limit          131072
	vfs.zfs.vdev.ramp_rate                  2
	vfs.zfs.vdev.time_shift                 29
	vfs.zfs.vdev.min_pending                4
	vfs.zfs.vdev.max_pending                10
	vfs.zfs.vdev.bio_delete_disable         0
	vfs.zfs.vdev.bio_flush_disable          0
	vfs.zfs.vdev.trim_max_pending           64
	vfs.zfs.vdev.trim_max_bytes             2147483648
	vfs.zfs.cache_flush_disable             1
	vfs.zfs.zil_replay_disable              0
	vfs.zfs.sync_pass_rewrite               2
	vfs.zfs.sync_pass_dont_compress         5
	vfs.zfs.sync_pass_deferred_free         2
	vfs.zfs.zio.use_uma                     0
	vfs.zfs.snapshot_list_prefetch          0
	vfs.zfs.version.ioctl                   3
	vfs.zfs.version.zpl                     5
	vfs.zfs.version.spa                     5000
	vfs.zfs.version.acl                     1
	vfs.zfs.debug                           0
	vfs.zfs.super_owner                     0
	vfs.zfs.trim.enabled                    1
	vfs.zfs.trim.max_interval               1
	vfs.zfs.trim.timeout                    30
	vfs.zfs.trim.txg_delay                  32
 
The following logs were taken during a 'lockup'. I am streaming to XBMC via FTP; since this issue began, the stream will stop and it takes minutes to buffer again.

#zpool iostat -v 1
Code:
                  capacity     operations    bandwidth
pool           alloc   free   read  write   read  write
datastore      10.9T  1.71T    117    419  7.93M  2.47M
  raidz2       10.9T  1.71T    117    419  7.93M  2.47M
    da2            -      -     88     71  2.31M   565K
    da3            -      -     97     72  2.33M   567K
    da4            -      -     99     64  2.28M   564K
    da5            -      -     86     65  2.29M   561K
    da9            -      -     74     52  1.37M   560K
    da7            -      -     87     64  2.31M   559K
    da8            -      -     96     68  2.29M   560K
logs               -      -      -      -      -      -
  gpt/zlog      128K  31.7G      0      0      0      0
cache              -      -      -      -      -      -
  gpt/zcache0  1.32G  35.0G      0     16      0   156K

#zfs-mon -a
Code:
Cache hits and misses:
                                1s    10s    60s    tot
                   ARC hits:    22     66     79     79
                 ARC misses:    16     15     16     16
       ARC demand data hits:     0      0      0      0
     ARC demand data misses:    16     15     16     16
   ARC demand metadata hits:    22     35     38     38
 ARC demand metadata misses:     0      0      0      0
                 L2ARC hits:     0      0      0      0
               L2ARC misses:    16     15     16     16

Cache efficiency percentage:
                        10s    60s    tot
                ARC:  81.48  83.16  83.16
    ARC demand data:   0.00   0.00   0.00
ARC demand metadata: 100.00 100.00 100.00
              L2ARC:   0.00   0.00   0.00

unrar seems to be the culprit to some degree. I will look into it, but any other advice would be greatly welcome :)

#top
Code:
53 processes:  1 running, 52 sleeping
CPU:  0.2% user,  0.0% nice,  3.7% system,  0.5% interrupt, 95.6% idle
Mem: 178M Active, 177M Inact, 2017M Wired, 12M Cache, 322M Buf, 4517M Free
ARC: 1453M Total, 936M MFU, 380M MRU, 272K Anon, 124M Header, 13M Other
Swap: 512M Total, 512M Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
 2248 root         32  20    0   394M   220M usem    1  41:42  2.15% python2.7
 2976 root          1  20    0 30044K  9440K zio->i  0   0:11  0.00% unrar
 3038 media         1  20    0 24844K  4188K zio->i  1   0:04  0.00% pure-ftpd
 1327 root          1  20    0 24840K  3988K select  1   0:00  0.00% pure-ftpd
 2938 ghostcorps    1  20    0 24844K  4112K select  1   0:00  0.00% pure-ftpd
 3135 ghostcorps    1  20    0 24844K  4192K select  1   0:00  0.00% pure-ftpd
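
(If it helps, next time it locks up I can also grab the kernel stacks of the stuck processes with procstat, using the PIDs from the top output above, e.g.:)
Code:
# procstat -kk 2976   # the stuck unrar process
# procstat -kk 3038   # the stuck pure-ftpd session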
 
Hi,

From your first post it seems errors are being detected on multiple disks, which would imply the issue is with some common component. The most obvious culprit would seem to be the disk controller/HBA, but I guess it could equally be several other components such as the PSU or motherboard. However, without spares to swap in I'm not sure how you can test any of these further than you already have.

thanks, Andy.
 
It would certainly seem unlikely to be software related to me, given how basic a function checksumming is to ZFS and that there is a large ZFS user base not seeing this issue. But anything is possible ;)
Is the data in the pool important? If yes, have you got a backup yet? Can you do destructive testing on the disks, or attach an additional disk to perform testing on? That would seem to be the next step, i.e. test writing data to a disk on the same controller via UFS, or even via dd to the raw disk device. I'd expect that to confirm the read/write errors. If you have onboard SATA that you could use to compare with, that could be a useful test.
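
A rough sketch of what I mean, assuming a spare disk that shows up as, say, da10 (do NOT run the write line against a pool member):
Code:
# dd if=/dev/da10 of=/dev/null bs=1m count=10000   # sequential read test
# dd if=/dev/zero of=/dev/da10 bs=1m count=10000   # DESTRUCTIVE write test, spare disk only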

thanks, Andy.
 
AndyUKG said:
It would certainly seem unlikely to be software related to me, given how basic a function checksumming is to ZFS and that there is a large ZFS user base not seeing this issue. But anything is possible ;)
Is the data in the pool important? If yes, have you got a backup yet? Can you do destructive testing on the disks, or attach an additional disk to perform testing on? That would seem to be the next step, i.e. test writing data to a disk on the same controller via UFS, or even via dd to the raw disk device. I'd expect that to confirm the read/write errors. If you have onboard SATA that you could use to compare with, that could be a useful test.

thanks, Andy.

Thanks Andy, I was certainly not implying that I had found an inherent fault in ZFS. If there is any software issue, I would put money on it being user error!


An Update:

The SMART tests showed all disks 'OK', but I looked a bit deeper with the controller management software and found one disk listed as 'OK' yet with over 9000 errors! Needless to say, I am now resilvering its replacement and will report back.
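
For anyone following along, the replace step is basically this (device names here are just illustrative):
Code:
# zpool replace datastore da6 da10   # old device, new device
# zpool status -v datastore          # watch the resilver progress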

On a side note, the resilvering seems to halt every so often. I waited hours to see if it would progress, but decided to do a restart, which kicked the resilver off again from where it had stalled. This has happened a few times over the last couple of days. The whole process should have taken 16 hours according to zpool status.
 
Another option is to try firmware updates for the motherboard, controller, etc., but if the data is important, make sure you have a backup before doing this.
 
Thanks da1, that is a pretty drastic option but things are getting pretty drastic...

I am scrubbing now and have already repaired nearly 700 checksum errors on the replacement. I am hoping these are leftover corruption from the bad drive.

After this I will see where it stands.
 
Great... the scrub is still going but now the RAID management is showing bad sectors on another disk!

Am I currently doing more damage every time I scrub the array?
 
You're actually saving your data. :) The pool finds bad sectors on the disk, notices checksum errors, re-writes the data to the offending disk (which, via CoW, means it's written to a different location on the disk), and carries on.

However, finding more and more bad sectors is a very good indication of a drive starting to fail. Work out a time to replace it. :)
 
Thanks @phoenix, I thought I had found the bad disk and replaced it. It showed over 9000 repairs in the RAID management program, and its replacement had 1.5k checksum errors repaired in the scrub.

I thought perhaps the intense activity of a scrub might be compounding the problem. The odd thing is that even if there is no activity on the array other than the scrub, it still finds checksum errors every time.

My next step is to replace the CPU. After that the motherboard, then I suppose the controller...
 