ZFS ZFS_DEBUG_MODIFY and ECC

zygoptera · May 16, 2016

This is a cross-post: I have started a similar thread in the NAS4Free forum, but I guess there may be more ZFS implementation experts here. Please let me know if this post is over the line.

I have been reading up on ZFS and non-ECC vs ECC memory in connection with a disk upgrade after a near crash, and the acquisition of an old second-hand professional rack server.

My conclusion is that non-ECC probably works well, but ECC is better.
I read this recent thread in the NAS4Free forum,
viewtopic.php?f=4&t=6244#p68136
and read in a link that one of the ZFS creators, Matthew Ahrens, recommends turning on the flag
ZFS_DEBUG_MODIFY
in this link
http://blog.codinghorror.com/to-ecc-or-not-to-ecc/

This flag, according to Ahrens, reduces the risk from memory errors by checksumming data in memory and verifying it before committing a write, if I understand it correctly.

Questions/Suggestions
1.
My question for those who understand the fine-print details of ZFS: is this flag useful if you also have ECC?
Does it increase safety, or maybe even reduce it?
I think ECC commonly corrects single-bit errors but not double-bit or multi-bit errors. Would the flag help with that?

2.
Since Ahrens recommends it, and most FreeBSD users probably don't have ECC, why not have it turned on by default?
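
To make the single-bit versus double-bit point in question 1 concrete, here is a minimal Python sketch of a SECDED (single error correction, double error detection) code on a 4-bit value, using an extended Hamming(8,4) layout. It is a toy illustration only: real ECC DIMMs typically protect 64 data bits with 8 check bits, and the names `encode` and `decode` are made up for this example.

```python
def encode(nibble):
    """Encode a 4-bit value into an 8-bit extended Hamming codeword (list of bits)."""
    code = [0] * 8
    # Data bits live at positions 3, 5, 6, 7; parity bits at 1, 2, 4.
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    for p in (1, 2, 4):
        # Parity bit p covers every position whose index has bit p set.
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    # Position 0 holds an overall parity bit over positions 1..7.
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (value, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for p in (1, 2, 4):
        if sum(code[i] for i in range(1, 8) if i & p) % 2:
            syndrome |= p
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips, assumed to be one: the syndrome is its position
        # (0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return None, "uncorrectable"
    value = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return value, status


if __name__ == "__main__":
    word = encode(0b1011)
    word[5] ^= 1                # one flipped bit
    print(decode(word))
    word[2] ^= 1                # a second flipped bit
    print(decode(word))
```

In this sketch any single flipped bit is corrected and any two flipped bits are reported as uncorrectable instead of being fixed. Three or more flips can be miscorrected without notice, which is the kind of case where an independent checksum could still add something.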

kpa · May 16, 2016

Sounds interesting, but I have big reservations about the ZFS_DEBUG_MODIFY option. It sounds like it could be a major performance hog, since it checksums everything that is being written one additional time before the standard checksumming for data written to disk is performed. The name of the flag (DEBUG) already suggests that it's meant only for testing and debugging ZFS and not for real production use. Does anyone have any benchmarks with ZFS_DEBUG_MODIFY turned on?

How did you get the idea that Mr. Ahrens recommends using it? All I read is that ZFS can mitigate the problem, with no assessment of whether that is a smart thing to do by default.

zygoptera · May 16, 2016

Yes, I would like to see a performance test with and without the flag too.
I have a typical lightly used server at home: it never runs high on CPU and has always been fast enough, at least so far. Maybe the overhead would be acceptable in a home scenario, and the flag would reduce the risk of memory-induced errors without having to go ECC. Kind of a workable compromise for users who don't want to go ECC or simply don't know about it...

That said, I already went the ECC route by getting an old Supermicro server that will be fine, I hope, although a bit power hungry.

The Ahrens quote is well hidden, about two thirds of the way down in the link I quoted above.
(http://blog.codinghorror.com/to-ecc-or-not-to-ecc/)

I have not tried to track down the "Hardforum" post. I don't know that forum, but have not looked for it either.

 

...And at the risk of "appeal to authority" here's ZFS co-founder and current ZFS dev at Delphix, Matthew Ahrens from a thread on Hardforum; see bolded sections re: the ability of ZFS to mitigate in-memory corruption with a specific debug flag (at a performance cost) as well as the bottom line re: ECC



There's nothing special about ZFS that requires/encourages the use of ECC RAM more so than any other filesystem. If you use UFS, EXT, NTFS, btrfs, etc without ECC RAM, you are just as much at risk as if you used ZFS without ECC RAM. Actually, ZFS can mitigate this risk to some degree if you enable the unsupported ZFS_DEBUG_MODIFY flag (zfs_flags=0x10). This will checksum the data while at rest in memory, and verify it before writing to disk, thus reducing the window of vulnerability from a memory error.



I would simply say: if you love your data, use ECC RAM. Additionally, use a filesystem that checksums your data, such as ZFS.

edit: Changed formatting.
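
The mechanism described in the quote (checksum a buffer while it sits in memory, then verify it just before the write) can be sketched roughly as follows. This is not ZFS code: `GuardedBuffer` and `write_to_disk` are hypothetical names, the "disk" is a plain Python list, and CRC32 is only a stand-in for whatever checksum ZFS uses internally.

```python
import zlib


class GuardedBuffer:
    """A data buffer that remembers a checksum taken when it was filled."""

    def __init__(self, data: bytes):
        self.data = bytearray(data)
        # Checksum taken once, while the data is believed to be good.
        self.checksum = zlib.crc32(self.data)

    def verify(self) -> bool:
        # Recompute and compare; a mismatch means the bytes changed in memory.
        return zlib.crc32(self.data) == self.checksum


def write_to_disk(buf: GuardedBuffer, disk: list) -> None:
    """Refuse to write a buffer whose contents no longer match its checksum."""
    if not buf.verify():
        raise ValueError("buffer changed while at rest in memory")
    disk.append(bytes(buf.data))


if __name__ == "__main__":
    disk = []
    buf = GuardedBuffer(b"important data")
    write_to_disk(buf, disk)        # clean buffer: written

    buf.data[3] ^= 0x04             # simulate a bit flip in RAM
    try:
        write_to_disk(buf, disk)
    except ValueError as err:
        print("write refused:", err)
```

A flip that happens before the checksum is first computed would still go unnoticed in this sketch, which fits the quote's wording about reducing the window of vulnerability and not closing it. The extra checksum pass on every buffer is also where the performance cost mentioned in the thread would come from.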

kpa · May 16, 2016

That still does not sound like a recommendation to use the flag. It only states that the flag exists and can mitigate the problem, with potential performance penalties.

zygoptera · May 16, 2016

kpa said:
That still does not sound like a recommendation to use the flag, only stating that the flag exists and can mitigate the problem but with potential performance penalties.

Maybe "recommend" is too strong a word, but I think it is very close... why else suggest it?

I hope I will see a test soon. I am not sure how to set the flag, I don't know how to test it myself in a way that would give reliable data, and I don't have a test system to play with. I don't like to play with a live server...

kpa · May 16, 2016

People who have a scientific or engineering background tend to talk that way, expressing possibilities without taking a stance one way or the other. This style of talk is often mistaken by "laymen" for a definitive recommendation when in fact it was never meant as one.

zygoptera · May 16, 2016

Yeah, maybe. I tend to do the same; I have a PhD myself, but not in computer science...
It is more fun to look at opportunities than problems... ;-)

Anyway, I like the idea of having a poor man's "memory error detection" to keep bad memory from causing incorrect data.

It would also be good to hear an opinion from a FreeBSD ZFS developer who understands the nitty-gritty details...
Or maybe you are one of them?
