ZFS error confusion

Bobbla · Nov 3, 2011

Hallo

So the story goes: after installing a new HDD controller and some light use, I all of a sudden get a message that something has become corrupt. And sure enough, zpool status -v tells me that I have two degraded disks, and I'm sorta *beep*ed...

However, after a restart everything seems fine, sorta. I still have some corrupt files, but that is better than nothing I suppose. Then after a while the pool yet again goes to degraded status, but when I try to run zpool status -v it apparently freezes the computer halfway through the command. Meaning I could see the disks and all, but it wouldn't show any of the errors. At this point I went fuuu and hit the start button, at which point I could see that it tried to turn itself off. So I guess it was not completely frozen. However, after waiting a little I ended up holding in the start button, because I got some S5-state error about something not responding (I think it was S5). Maybe I was a little impatient, who knows.

And so the next time I turned it on I got a lot of errors when running the zpool status -v command. It listed 2 directories, 2 files and a <metadata>:<0x4> thingy (as far as I can remember). At which point I went 2x fuu and started to search around. The metadata tag thingy is supposedly something called the Meta Object Set (MOS); I've got no real idea what it's about.

At this point I turned the computer off, power and everything. Then I replaced some old SATA cables with newer ones; this has apparently worked before... Anyway, I started up the computer and nothing, everything was/is fine. Where did all the errors go?

zpool status -v
gives the good kind of feedback "no known data errors" and everything is "ONLINE".

I'm now confused about this whole thing. I am happy there are no more errors, but I can't stop feeling skeptical about it. What is this? And should I be concerned? :\:q

Sylhouette · Nov 3, 2011

Bobbla said:
At this point I turned the computer off, power and everything. Then I replaced some old SATA cables with newer ones; this has apparently worked before... Anyway, I started up the computer and nothing, everything was/is fine. Where did all the errors go?

I would be a little on guard if this happened to me, but if a problem went away after changing cables, I would say it was bad cabling.
With bad hardware come strange errors.
Sometimes they are there, sometimes not; sometimes they show up right away, sometimes they come later.

I would, however, make a backup of all your data (but you probably already have one) and run a scrub. An export / import step would also give me a safer feeling.

regards
Johan

gkontos · Nov 3, 2011

You should keep an eye on your logs. /var/log/messages in particular.
By the way, what is the brand of your controller?

Bobbla · Nov 4, 2011

Sylhouette said:
I would, however, make a backup of all your data (but you probably already have one) and run a scrub. An export / import step would also give me a safer feeling.

No backup really. I am hoping the raidz will hold; it has so far. Dunno what an export/import is gonna do for my feelings. I dunno about the scrub either: last time I tried, it sorta died about 3/4 of the way through. But in the end I guess I'll try... and why do you talk like that?

gkontos said:
You should keep an eye on your logs. /var/log/messages in particular.
By the way, what is the brand of your controller?

The logs didn't really make me any wiser, and it seems I am the proud owner of an Intel SASUC8I flashed with IT firmware from LSI.

Sylhouette · Nov 4, 2011

Bobbla said:
No backup really, I am hoping the raidz will hold

BAD BAD Bobbla :e

Never ever trust your data to one spot.
A raidz can fail, a raidz2 can fail, UFS can fail, ext3 can fail, ext4 can fail, NTFS can fail, RAID 5, 6, 10 and so on can fail. Well, you get the point.

A power glitch or spike can destroy all your disks at once, and no RAID can protect against that.
Secondly, accidents never come alone; I learned that the hard way.

So always back up your data.

Are you seeing the errors again, or is it silent now?
If all is quiet, it is 99% certain it was a bad cable.
This also could have caused the scrub errors you saw before the change.
A scrub reads all the data in the pool and verifies it against its checksums, to make sure every bit is OK.
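
To give a rough idea of what that means, here is a toy Python sketch. It is not how ZFS is implemented; it only assumes three equally sized data blocks, one SHA-256 checksum per block and a single XOR parity block (conceptually similar to single-parity raidz). All names in it are made up for illustration.

```python
import hashlib
from functools import reduce


def xor_blocks(blocks):
    """XOR equally sized byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def scrub(stored, checksums, parity):
    """Verify every block against its checksum; rebuild bad ones from parity.

    Returns the indexes of the blocks that were repaired.
    Toy model: it can only repair one bad block per stripe.
    """
    repaired = []
    for i, blk in enumerate(stored):
        if hashlib.sha256(blk).digest() == checksums[i]:
            continue  # block is fine
        # Rebuild the block from the other blocks plus parity.
        others = [b for j, b in enumerate(stored) if j != i]
        candidate = xor_blocks(others + [parity])
        # Only accept the rebuilt block if it matches the checksum.
        if hashlib.sha256(candidate).digest() == checksums[i]:
            stored[i] = candidate
            repaired.append(i)
    return repaired


# Hypothetical data, written while everything was healthy.
data = [b"block-A!", b"block-B!", b"block-C!"]
checksums = [hashlib.sha256(b).digest() for b in data]
parity = xor_blocks(data)

# Simulate a bad cable corrupting what comes back from "disk" 1.
stored = list(data)
stored[1] = b"block-X!"

repaired = scrub(stored, checksums, parity)
print("repaired blocks:", repaired)
print("data intact again:", stored == data)
```

If too many copies are bad at the same time, there is nothing left to rebuild from, and that is when a pool reports a permanent error.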

regards
Johan Hendriks

Bobbla · Nov 4, 2011

Sylhouette said:
BAD BAD Bobbla :e
..always back up your data.

Well, it's a question of money. I don't know where you are from, but maybe students are richer there.

Sylhouette said:
Are you seeing the errors again, or is it silent now?
If all is quiet, it is 99% certain it was a bad cable.
This also could have caused the scrub errors you saw before the change.

I opened a file that I know was in the error message before, but everything seemed fine. I am scrubbing at the moment and I have about 5TB left to scrub, so in about 5 hours we'll know..

Crossing fingers on it being the cable's fault; it's the cheapest alternative.

Bobbla · Nov 4, 2011

The scrub is complete and I have the following error left:

Code:

storagepool/storage:<0x16268>

There also seems to have been some minor repair on two different disks.

Some of the content of the /var/log/messages file:

Code:

ZFS: checksum mismatch, zpool=storagepool path=/dev/ad[8,6,10,2]/da[6,7] offset=876262691328/876262691840 size=26624/26112
ZFS: zpool I/O failure, zpool=storagepool error=86

All messages except the I/O failure came twice, before da4 came once and this came out (a couple of times):

Code:

(da5:mpt0:0:5:0): READ(10). CDB: 28 0 8 1 88 bd 0 1 0 0
(da5:mpt0:0:5:0): CAM status: SCSI Status Error
(da5:mpt0:0:5:0): SCSI status: Check Condition
(da5:mpt0:0:5:0): SCSI sense: MEDIUM ERROR info: 80189b8 asc:11,0 (Unrecovered read error)
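
For what it's worth, the READ(10) line above can be decoded by hand. Below is a minimal Python sketch, assuming the standard READ(10) layout (byte 0 is the opcode 0x28, bytes 2-5 are the starting LBA, bytes 7-8 are the number of blocks, all big-endian) and assuming the sense "info" value is the LBA of the block that could not be read, which is the usual meaning for a medium error. The function name is made up for illustration.

```python
def decode_read10(cdb_hex: str) -> dict:
    """Decode a READ(10) CDB given as space-separated hex bytes."""
    cdb = bytes(int(tok, 16) for tok in cdb_hex.split())
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    return {
        "lba": int.from_bytes(cdb[2:6], "big"),     # first block requested
        "blocks": int.from_bytes(cdb[7:9], "big"),  # number of blocks requested
    }


# Values copied from the log lines above.
cmd = decode_read10("28 0 8 1 88 bd 0 1 0 0")
failed_lba = int("80189b8", 16)  # assumed to be the unreadable block

offset = failed_lba - cmd["lba"]
print("read starts at LBA", cmd["lba"], "for", cmd["blocks"], "blocks")
print("failing LBA", failed_lba, "is", offset, "blocks into that read")
```

Under those assumptions the failing block lies inside the range the command asked for, which would point at a real unreadable sector on da5 rather than only at cabling. A SMART check of that disk would be the obvious next step.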

And I think that was all of it, now I'm hungry and need food >.<