Faulty drive or not

While scrubbing my 3-disk raidz1 pool I noticed that the data shared from it was no longer accessible through my Samba shares. I immediately logged in to my server and saw that zpool reported a checksum error on the 3rd disk. I checked the logs and saw the following:
Code:
Feb 21 20:49:19 hp root: ZFS: checksum mismatch, zpool=tank path=/dev/label/zdisk3 offset=104439013376 size=65536
Feb 21 20:49:19 hp root: ZFS: checksum mismatch, zpool=tank path=/dev/label/zdisk3 offset=104439865344 size=65536
Feb 21 20:49:21 hp kernel: ad4: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=203997424
Feb 21 20:49:32 hp kernel: ata2: SIGNATURE: 00000101
Feb 21 20:50:12 hp kernel: ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Feb 21 20:50:45 hp su: gkontos to root on /dev/pts/1
Feb 21 20:50:53 hp kernel: ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
Feb 21 20:51:33 hp kernel: ad4: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
Feb 21 20:52:14 hp kernel: ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly
Feb 21 20:52:14 hp kernel: ad4: TIMEOUT - READ_DMA retrying (0 retries left) LBA=203997424
Feb 21 20:52:16 hp kernel: ad4: FAILURE - READ_DMA
 status=ff<BUSY,READY,DMA_READY,DSC,DRQ,CORRECTABLE,INDEX,ERROR>
 error=ff<ICRC,UNCORRECTABLE,MEDIA_CHANGED,NID_NOT_FOUND,MEDIA_CHANGE_REQEST,ABORTED,NO_MEDIA,ILLEGAL_LENGTH> LBA=203997424
Feb 21 20:52:26 hp kernel: ata2: SIGNATURE: 00000101
Feb 21 20:52:26 hp kernel: ad4: TIMEOUT - READ_DMA retrying (1 retry left) LBA=203997680
Feb 21 20:52:29 hp kernel: ad4: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=203999088
Feb 21 20:52:39 hp kernel: ata2: SIGNATURE: 00000101
Feb 21 20:53:19 hp kernel: ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
Feb 21 20:54:00 hp kernel: ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
I tried to terminate the scrub without success, so I rebooted the server. When it came up I started another scrub. The errors kept popping up and the pool reported errors again. I removed the drive and put it in my desktop to examine it. smartmontools reported no errors. I zeroed it out and no errors showed up during the process either. So I decided to plug it back into the server after removing and cleaning the SATA cables.
So far resilvering is going OK with no problems. I also plan on scrubbing it again.

Could it really be a bad SATA cable that caused all the trouble?

The system has been running for more than a year with that configuration, currently at 8.2-RELEASE.

Thanks for your input
 
3rd scrub in a row:
Code:
  pool: tank
 state: ONLINE
 scrub: scrub completed after 1h33m with 0 errors on Tue Feb 22 12:14:16 2011
config:

	NAME              STATE     READ WRITE CKSUM
	tank              ONLINE       0     0     0
	  raidz1          ONLINE       0     0     0
	    label/zdisk1  ONLINE       0     0     0
	    label/zdisk2  ONLINE       0     0     0
	    label/zdisk3  ONLINE       0     0     0
I guess SATA cables can fail too x(
 
Yes, it could be the cable, or how the cable is routed in the case. If rerouting/replacing it doesn't solve the problem, time for a new drive. :)
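
As a rough aid for telling the two cases apart, the sketch below counts ICRC, timeout and failure messages per device in a log excerpt. ICRC errors are CRC failures on the link between the drive and the controller, so a log dominated by them tends to point at cabling, while timeouts and failures that keep coming back after a cable swap point more toward the drive. This is only a hint, not a diagnosis. The patterns are based solely on the message shapes posted in this thread, and the function name `classify` and the category names are made up for illustration.

```python
import re
from collections import Counter

# Patterns derived only from the log lines shown in this thread (assumption:
# other ATA driver versions may word these messages differently).
PATTERNS = {
    "icrc": re.compile(r"UDMA ICRC error"),
    "timeout": re.compile(r"TIMEOUT - |taskqueue timeout|timed out"),
    "failure": re.compile(r"FAILURE - "),
}
# Matches disk devices such as "ad4"; channel lines such as "ata2" are skipped.
DEVICE = re.compile(r"kernel: (ad\d+): ")


def classify(lines):
    """Return {device: Counter} with ICRC, timeout and failure counts."""
    counts = {}
    for line in lines:
        match = DEVICE.search(line)
        if match is None:
            continue
        device = match.group(1)
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts.setdefault(device, Counter())[name] += 1
    return counts


# A few lines copied from the first log excerpt in this thread.
sample = [
    "Feb 21 20:49:21 hp kernel: ad4: WARNING - READ_DMA UDMA ICRC error (retrying request) LBA=203997424",
    "Feb 21 20:49:32 hp kernel: ata2: SIGNATURE: 00000101",
    "Feb 21 20:50:12 hp kernel: ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly",
    "Feb 21 20:52:14 hp kernel: ad4: TIMEOUT - READ_DMA retrying (0 retries left) LBA=203997424",
    "Feb 21 20:52:16 hp kernel: ad4: FAILURE - READ_DMA",
]

if __name__ == "__main__":
    for device, counter in classify(sample).items():
        print(device, dict(counter))
```

Running it against /var/log/messages before and after replacing the cable would show whether the ICRC count stops growing.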
 
It seems that it is time for a new drive.
Code:
Feb 22 15:54:55 hp kernel: ad4: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
Feb 22 15:55:35 hp kernel: ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly
Feb 22 15:55:35 hp kernel: ad4: FAILURE - READ_DMA48 timed out LBA=489976661
Feb 22 16:05:15 hp smartd[53526]: Device: /dev/ad4, ATA error count increased from 43 to 44
At least now I know what's wrong :P
 
It's also useful to turn on wifi, bluetooth and any other peripherals present in your system. In my case it was timing out because I had turned off those two devices.
 
hamis said:
It's also useful to turn on wifi, bluetooth and any other peripherals present in your system.
In my case it was timing out because I had turned off those two devices.
This is probably more a coincidence than anything technical.

gkontos, you might want to install sysutils/smartmontools. This will monitor the SMART capabilities of the drives and can inform you of any imminent drive failures.
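
If smartd is logging to syslog, the "ATA error count increased" messages can be picked out with a short script. The following is a minimal sketch that assumes the message format matches the smartd line posted earlier in this thread; the function name `ata_error_increase` is hypothetical and other smartd versions may word the message differently.

```python
import re

# Assumed format, taken from the smartd line quoted in this thread.
SMARTD_LINE = re.compile(
    r"smartd\[\d+\]: Device: (?P<dev>\S+), "
    r"ATA error count increased from (?P<old>\d+) to (?P<new>\d+)"
)


def ata_error_increase(line):
    """Return (device, old_count, new_count), or None if the line does not match."""
    match = SMARTD_LINE.search(line)
    if match is None:
        return None
    return match.group("dev"), int(match.group("old")), int(match.group("new"))


if __name__ == "__main__":
    line = ("Feb 22 16:05:15 hp smartd[53526]: Device: /dev/ad4, "
            "ATA error count increased from 43 to 44")
    print(ata_error_increase(line))
```

A steadily rising count on one device is a reason to look at that drive more closely.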
 
Hi guys :e

This thread is a year old but thanks for the tips.

@hamis
I never use bluetooth, wifi, etc. on a server.

@SirDice
Please look at my last post!

Bye for now.
 