gmirror: inconsistencies while writing/reading large files

Hi All,

I've got a problem with my gmirror array:
My scenario:
1. I copy a big file to the array (~500MB or more). There are no errors while copying.

2. The file contents change slightly every time the file is read, for example:
Code:
fiziak 18:51 tmp% md5 openbsd.img.*
MD5 (openbsd.img.1) = 8e1dfa9071bf7aa63caf51e213fff95b
MD5 (openbsd.img.2) = 5ddf995750379419a28966b92bebbbd8
MD5 (openbsd.img.3) = 540a1c5854bf23e73057155999745c69
MD5 (openbsd.img.4) = 540a1c5854bf23e73057155999745c69
MD5 (openbsd.img.5) = 540a1c5854bf23e73057155999745c69
fiziak 18:53 tmp% md5 openbsd.img.*
MD5 (openbsd.img.1) = 8e1dfa9071bf7aa63caf51e213fff95b
MD5 (openbsd.img.2) = 5ddf995750379419a28966b92bebbbd8
MD5 (openbsd.img.3) = 540a1c5854bf23e73057155999745c69
MD5 (openbsd.img.4) = a83f9a23c87b5050583079c3b1f26ed6
MD5 (openbsd.img.5) = 540a1c5854bf23e73057155999745c69
fiziak 18:57 tmp% md5 openbsd.img.*
MD5 (openbsd.img.1) = 2560e643acbecca8d48b5ffe289ce305
MD5 (openbsd.img.2) = cc5f7ffd4ec041f54dbddb5fc9c52f0b
MD5 (openbsd.img.3) = 540a1c5854bf23e73057155999745c69
MD5 (openbsd.img.4) = 540a1c5854bf23e73057155999745c69
MD5 (openbsd.img.5) = 540a1c5854bf23e73057155999745c69

Note that openbsd.img.[1-5] are all copies of the same file.
540a1c5854bf23e73057155999745c69 is the right MD5 sum.

3. The files are not identical on both disks. I checked that by unmounting the array and mounting each of the two providers separately. The MD5 sums differ between the disks, but on each disk they stay the same across repeated MD5 runs.

So generally I'd assume it's a hardware issue with one of the disks, but I'm getting no errors from either the OS or SMART, and the disks seem to behave normally when mounted separately.
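
With no errors reported anywhere, one way to gather more evidence is to find out where the two providers' copies of a file actually differ. The following is a rough sketch, assuming each provider is mounted separately and the same file is reachable under two paths. `differing_blocks` and the 64 KiB block size are illustrative choices, not anything gmirror provides.

```python
def differing_blocks(path_a, path_b, block_size=64 * 1024):
    """Return indices of fixed-size blocks whose bytes differ between two files."""
    mismatches = []
    index = 0
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        while True:
            block_a = file_a.read(block_size)
            block_b = file_b.read(block_size)
            if not block_a and not block_b:
                break  # both files exhausted
            if block_a != block_b:
                # Also catches the case where one file is shorter than the other.
                mismatches.append(index)
            index += 1
    return mismatches


if __name__ == "__main__":
    import sys

    # Usage: python script.py /mnt/disk_a/openbsd.img /mnt/disk_b/openbsd.img
    if len(sys.argv) == 3:
        bad = differing_blocks(sys.argv[1], sys.argv[2])
        print(f"{len(bad)} differing block(s); first few: {bad[:10]}")
```

A handful of scattered bad blocks points more towards corruption in transit (cable, controller) or a few bad sectors than towards a mirror that was never synchronized.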

Any tips are appreciated.

Cheers,
Szmytson
 
Please post the output from:
Code:
smartctl -a <device>
Specifically, pay attention to Reallocated_Sector_Ct. It can tell you whether the disk is remapping sectors behind the back of the OS/FS. Catching this kind of silent corruption is one of the things that makes ZFS so nice, since it checksums every block.
 
Thanks for the prompt reply.
Here is Reallocated_Sector_Ct and some other values I thought might be interesting:
/dev/ad6:
Code:
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       1013409
200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
/dev/ad10:
Code:
  5 Reallocated_Sector_Ct   0x0033   099   099   010    Pre-fail  Always       -       18
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       304715
200 Multi_Zone_Error_Rate   0x000a   099   099   000    Old_age   Always       -       132
The full output can be found here:
/dev/ad6 - http://pastebin.com/m43483781
/dev/ad10 - http://pastebin.com/m51622735

I guess /dev/ad10 is the candidate to be replaced.
I'm not sure what Multi_Zone_Error_Rate is.
Should I be worried about the high Hardware_ECC_Recovered counts?

Cheers,
Szmytson
 
At work we always replace drives immediately when we see Reallocated_Sector_Ct start going up. I'm not sure about Hardware_ECC_Recovered; do some googling and see what you come up with.
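
Watching for Reallocated_Sector_Ct "going up" means comparing the raw value between two snapshots of the attribute table. Here is a small sketch in Python, assuming the ten-column layout shown in the post above. The helper names are hypothetical, and the parsing is deliberately simple, so treat it as an illustration, not a robust smartctl parser.

```python
def parse_smart_attributes(text):
    """Map attribute name -> raw value for rows of a SMART attribute table."""
    attributes = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the 10th column is the raw value.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                attributes[fields[1]] = int(fields[9])
            except ValueError:
                continue  # raw value is not a plain integer; skip it
    return attributes


def has_grown(before, after, name="Reallocated_Sector_Ct"):
    """True if the raw value of `name` increased between two snapshots."""
    return after.get(name, 0) > before.get(name, 0)


# Rows copied from the posts above.
AD6 = """
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       1013409
"""
AD10 = """
  5 Reallocated_Sector_Ct   0x0033   099   099   010    Pre-fail  Always       -       18
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       304715
"""

for label, table in (("ad6", AD6), ("ad10", AD10)):
    raw = parse_smart_attributes(table)
    print(label, "reallocated sectors:", raw["Reallocated_Sector_Ct"])
```

In practice you would save the output of smartctl for the same disk at two points in time and feed both snapshots to `has_grown`.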

Perhaps your drive is still under warranty and you can run the tools provided by the drive manufacturer to test it. Those tools will hopefully tell you more and give the manufacturer the info they need to issue an RMA.
 
I suspect the controller, since Hardware_ECC_Recovered is increasing constantly on both drives.

What's more, I got something like this half an hour ago:
Code:
Nov 18 19:51:43 fiziak kernel: ata5: reiniting channel ..
Nov 18 19:51:43 fiziak kernel: ata5: SATA connect time=0ms
Nov 18 19:51:43 fiziak kernel: ata5: reset tp1 mask=01 ostat0=d8 ostat1=00
Nov 18 19:51:43 fiziak kernel: ata5: stat0=0xd8 err=0xd8 lsb=0xd8 msb=0xd8
Nov 18 19:51:43 fiziak last message repeated 16 times
Nov 18 19:51:43 fiziak kernel: ata5: stat0=0x50 err=0x00 lsb=0x47 msb=0xac
Nov 18 19:51:43 fiziak kernel: ata5: reset tp2 stat0=50 stat1=00 devices=0x0
Nov 18 19:51:43 fiziak kernel: ad10: FAILURE - device detached
Nov 18 19:51:43 fiziak kernel: subdisk10: detached
Nov 18 19:51:43 fiziak kernel: ad10: detached
Nov 18 19:51:43 fiziak kernel: ata5: reinit done ..

I'm gonna check the disks in another computer on Saturday.

Thanks for the tips brd@,
Szmytson
 
[SOLVED] Thread: gmirror: inconsistencies while writing/reading large files

Hi all,

I just wanted to quickly follow up on this issue.
I managed to pinpoint the culprit - it was a faulty SATA cable.

In the meantime I bought a new controller, an Adaptec 2410SA with real hardware RAID, and I'm really pleased with it.

The Adaptec won't even connect the disk through the faulty cable: it complains that the disk is not responding, while the old controller didn't seem to have any problem with it.
Just for the record, the old controller was a cheap "Sil 3114 SATALink/SATARaid Controller": class=0x010400 card=0x71141095 chip=0x31141095 rev=0x02 hdr=0x00.

Anyway, thanks for your help, brd@.
Much appreciated.
 
If you have reallocated sectors, you should swap out your disks anyway.
You should also run SMART self-tests periodically to catch faults as early as possible, by scheduling them in /usr/local/etc/smartd.conf and running the smartd daemon.
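
As a rough illustration, a smartd.conf for the two disks in this thread might look like the fragment below. This is an untested sketch: the device names come from the posts above, the schedule is arbitrary, and the directives should be checked against the smartd.conf manual page for your smartmontools version.

```
# /usr/local/etc/smartd.conf (sketch, not a tested configuration)
# -a: monitor health status and all SMART attributes
# -s: short self-test every day at 02:00, long self-test every Saturday at 03:00
/dev/ad6  -a -s (S/../.././02|L/../../6/03)
/dev/ad10 -a -s (S/../.././02|L/../../6/03)
```

On FreeBSD the daemon is normally enabled with smartd_enable="YES" in /etc/rc.conf, assuming smartmontools was installed from ports or packages.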
 