ECC RAM for ZFS pool

Hello everyone,

I've been using FreeBSD for over a year now (I come from Windows and have some Linux experience) and I'm really happy with the OS. Lately I've been playing around with ZFS: first I read everything I could on the subject (even the Solaris man pages/docs), then I tried it in VMs, and finally I installed it on an old system to check everything (I know, I'm a bit paranoid, but better safe than sorry).

The next step is to install it on my server (basically a file server, so only Samba plus a few other daemons) where I have all my data (work, email, backups, photos...). I'm pretty confident about everything, but there's one thing I have doubts about. I read that for ZFS it would be better (almost mandatory) to use ECC RAM. The idea is that when scrubbing a pool, ZFS checks data against checksums held in RAM; faulty RAM holds wrong data, so the scrub messes up everything and, after that, nothing can be done.

Is this true?
Do I really need ECC (maybe even registered)?
I know ECC is better, but it costs much more (considering you also need a processor and motherboard that support it), and this is a home server with a very limited budget. On the other hand, data integrity is really important to me, so I'd like to know beforehand if I'm putting myself in a dangerous situation.

Thanks in advance for any info,

Gherardo
 
ZFS keeps vital stuff in memory for a long time, and it wants a lot of memory. So if you want to run your system for a longer stretch of time, the chance of a bit flipping in important memory is higher than it would be for something that reboots every ten minutes or so. I am not a good guide here (a bit biased), as I have already spent the extra cash for ECC memory plus a board that supports it. Having seen the system log of a server report each bit flip detected and fixed makes you a bit itchy about the risk.

So my recommendation would be: yes, if you can do it, do it. You are always in a dangerous situation, every minute of your life. You can only mitigate the risk, not get rid of it. You must decide what you want to afford, what you can lose, and whether you can afford to lose it.
 
You absolutely do want ECC if you're concerned about data integrity. ZFS handles only on-disk integrity; it cannot help with in-memory corruption, and you could be happily storing terabytes of corrupted data that looks just fine on disk if there's nothing to detect and correct errors in memory. Use ECC.
 
On episode 31 of BSD Now, Allan Jude talks about asking Matt Ahrens about ECC memory. The first important point was that zpool scrub is read-only, so it will not corrupt on-disk data if there is a memory problem. Second, ZFS copes with memory problems better than most filesystems, thanks to its built-in checksums.
 
@wblock, are you sure that scrub is read-only? And that the data structures in RAM are also protected by hashes? I would doubt both of those, but I have not looked at the source code. When it comes to a scrub, the damage found may be repaired if enough redundancy is available - so there would also be a write involved. Or am I completely wrong here?
 
Scrub is read-only as long as everything is OK in the pool. When a problem is detected and needs to be corrected, there will be writes to the pool, and I'm pretty sure those writes cannot be guaranteed to be correct with the ZFS checksums alone.
 
Crivens said:
@wblock, are you sure that scrub is read-only?

Given the source, reasonably sure. I may have misunderstood, though.

And that the data structures in RAM are also protected by hashes?

No, they are (to my knowledge) not. ZFS will not protect you from undetected RAM failures, but as filesystems go, such errors are more detectable with ZFS than with other filesystems.

I would doubt both of those, but I have not looked at the source code. When it comes to a scrub, the damage found may be repaired if enough redundancy is available - so there would also be a write involved. Or am I completely wrong here?

You're right. But since these are soft errors, and the checksum would be recalculated, I'd suspect the odds of it corrupting data would be very low. It would take a rare bit error and then another in the same place. But it depends on the way the code works.
 
wblock@ said:
And that the data structures in RAM are also protected by hashes?

No, they are (to my knowledge) not. ZFS will not protect you from undetected RAM failures, but as filesystems go, such errors are more detectable with ZFS than with other filesystems.
That is surely true. But it cannot guard you against brownouts in memory or bits flipped by radiation or quantum effects. I think Intel owns the biggest lead safe, used as a test chamber where cosmic radiation does not reach the circuits. They found out that these days it is hard to obtain plastics or ceramics to cover the memory chips that are suitably free of radioactive traces. They have reached the point where this becomes a real problem, so I would not wager on any major advance in shrinking the processes.

And we have not mentioned noise on the connector cable, or that all these things can also occur on the storage unit itself (its own memory, its own CPU). How many bit errors are there on a SATA or SAS link per TB? Are they checked by the drive/driver? Yes, I absolutely love ZFS for its ability to report such things. People complaining that ZFS makes their storage more fragile should instead consider that they were simply never told before just how fragile their systems are.

With the currently available memory capacities, you may get 2 to 10 bits flipped in your machine in one day (a fact-based guesstimate). This depends on make, place, altitude, moon phase and the hair color of the janitor. Long story short - this is a real problem, and it is very hard to control.

What I would consider the worst case here is that the scrub takes data from the cache, not the storage medium, and may by that route even skip the hash checking. It is in core memory, so it is supposed to be clean, yes? Now one read finds that there is a problem, and ZFS uses the already in-core data to "fix" it, basically making things worse. When that is your superblock, you have interesting times ahead of you.

Basically, I would recommend ECC memory for anything that is planned to be run 24/7 or where the possibility of such errors would screw something up really bad.
 
Surely this would not be a problem on non-ECC installations, as this is a common computer-science issue: if the checksum computed in memory does not match what is on disk, and we are not using ECC, then check again, and a third time, akin to the 'Byzantine generals problem'.

http://en.wikipedia.org/wiki/Byzantine_fault_tolerance

The benefit of ECC in this scenario would be speed: you would know an uncorrectable error was generated by the memory read and that the sector needs to be retested, whereas with no error generated a mismatch really does mean bit rot on disk.
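
If it helps, here is a rough sketch of that re-check idea (this is not how ZFS actually behaves; the read_block callback and the use of SHA-256 are just stand-ins for illustration):

    # Re-read and re-verify a block a few times before trusting a checksum
    # mismatch that might have been caused by a transient RAM error.
    import hashlib

    def block_looks_good(read_block, expected_sha256, attempts=3):
        """read_block() must return the block's raw bytes, freshly read
        from the device each time it is called."""
        for _ in range(attempts):
            data = read_block()                        # fresh read from disk
            digest = hashlib.sha256(data).hexdigest()  # recomputed in (possibly flaky) RAM
            if digest == expected_sha256:
                return True      # the earlier mismatch was transient
        return False             # consistent mismatch: likely real on-disk corruption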

I say 'surely' but I'm hoping someone who actually knows can chime in?
 
The real cost of ECC is not the small price difference of the DIMMs. It is that ECC forces you to buy more expensive motherboards and CPUs. That's particularly true if you go for compact or low-power systems.

There exist file systems that protect the buffer cache and the most important in-memory data structures with checksums / hashes / CRCs. They are not commonly or freely available. I don't know exactly how ZFS compares, never having looked at its source code, but it does protect data on disk with checksums, and it stands to reason that this protection carries through to pages in the buffer cache. Since that is the bulk of the bytes in RAM, this addresses most of the memory error problem.

But one has to see the risk of using non-ECC RAM in perspective. The #1 cause of data loss in file systems is disks. That is particularly true with today's very large disks, which still have significant uncorrectable bit error rates. Single-fault-tolerant RAID encodings are no longer considered adequate, since after the complete failure of one disk there is significant risk of data loss due to a read error on a second disk. Calculate it for yourself: for a 4 TB disk (3.2 * 10^13 bits) and a BER of 10^-15, you have a 3.2% chance that a RAID-1 or RAID-5 rebuild will fail. Yet few people (outside of commercial storage servers) use RAID at all, and even fewer use double-fault-tolerant RAID.
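
A quick sketch of that back-of-the-envelope calculation, using the figures from the paragraph above (a rough independent-error approximation, not a drive datasheet):

    # 4 TB disk read in full during a rebuild, with an unrecoverable
    # bit error rate (BER) of 10^-15 per bit, as in the example above.
    disk_bits = 4e12 * 8        # 4 TB ~= 3.2 * 10^13 bits
    ber = 1e-15

    # Expected number of bad bits, and the chance of hitting at least one.
    expected_errors = disk_bits * ber                 # 0.032
    p_rebuild_fails = 1 - (1 - ber) ** disk_bits      # roughly 0.031
    print(f"expected bad bits during rebuild: {expected_errors:.3f}")
    print(f"chance the rebuild hits at least one: {p_rebuild_fails:.1%}")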

We also have to recognize that file system and RAID alone will never be able to cure all data loss. First, these systems have software bugs. In my experience, once you have 2-fault-tolerant RAID codes, the next big cause of data loss is the file system structures self-destructing (and this applies to high-quality professional systems that people pay big bucks for). Often this happens due to an unclean shutdown and restart (which stresses file systems more than normal operation).

Second, user error can destroy a whole file system in a big hurry. We all joke about "rm -Rf /", but things like that really happen all the time (usually due to innocent mistakes, like creating a file named "-Rf", or by deliberately writing to files like databases). Another good source of error is admins who pick the wrong disk; a running joke among my colleagues is that no redundant storage system can survive the admin formatting all of its disks as ReiserFS.

Then there is disaster tolerance. The story of the company that had a backup data center (with synchronous replication) in the *other* tower of the World Trade Center is too tragic to make jokes about. But as a realistic example: I recently had a disk enclosure where the voltage regulator failed and destroyed every disk drive in the enclosure. If the system had not been redundant across multiple enclosures, the result would have been massive data loss; for a small system with a single enclosure, this would have been the end. Fire sprinklers that turn on by mistake are also a good source of massive data loss, more so than fires in the data center.

The best protection against such problems is backup, to a different device on a different failure domain. The easiest thing to do for a small system is an external USB-connected disk, rsync'ing to it every few days, and then storing it in a different place (I used to take my backup disk and leave it in my office; now I do network-based remote backup). Tapes are good for this too, but they are not cool and hip these days (even though they work remarkably well, better than most other technologies).
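
For a small system, such a job can be as simple as the sketch below (the paths are placeholders and the rsync flags are just one reasonable choice, run every few days from cron or by hand):

    # Mirror the data set to an external USB disk with rsync.
    import subprocess

    SRC = "/tank/data/"              # placeholder; trailing slash copies the contents
    DST = "/mnt/usb-backup/data/"    # placeholder mount point of the USB disk

    # -a preserves permissions and timestamps; --delete mirrors removals too.
    # Drop --delete if you want the backup to keep files you deleted locally.
    subprocess.run(["rsync", "-a", "--delete", SRC, DST], check=True)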

Where does this leave us? If you have money, by all means buy a motherboard with ECC for a file server. If you are short on money and are not using 2-fault tolerant RAID and remote backup yet, then throwing money at ECC memory is the wrong investment. Get yourself a second and third disk first, and then an external backup drive or a tape drive second. A UPS is also an excellent investment. After you have these heavy hitters in place, then worry about ECC.
 
ralphbsz said:
...
The best protection against such problems is backup, to a different device on a different failure domain. The easiest thing to do for a small system is an external USB-connected disk, rsync'ing to it every few days, and then storing it in a different place (I used to take my backup disk and leave it in my office; now I do network-based remote backup). Tapes are good for this too, but they are not cool and hip these days (even though they work remarkably well, better than most other technologies).

Where does this leave us? If you have money, by all means buy a motherboard with ECC for a file server. If you are short on money and are not using 2-fault tolerant RAID and remote backup yet, then throwing money at ECC memory is the wrong investment. Get yourself a second and third disk first, and then an external backup drive or a tape drive second. A UPS is also an excellent investment. After you have these heavy hitters in place, then worry about ECC.

The ordering in gain-per-buck is a good one, but it does not apply everywhere. I would not worry too much about power loss, as I live in a place where power is quite reliable. It has been more than 15 years since I have seen the lights go out around me for external reasons. Also, ZFS is pretty good at surviving such scenarios; you may lose what you were just doing, but most of the file system should be there. Taking snapshots from time to time would be a good thing. (Note to world - a tool to dig through the on-disk structures and re-assemble ZFS content from a snapshot point would be great!)

But let me add to your priority list, which is a good idea, the point of 'knowledge'. You can spend time and money on user education, training and experience. That will most likely pay off much better than investing in a third backup solution. The bad thing about this is, alas, that the cost it saves shows up in unaccountable situations, something the bean counters cannot get their heads around (or cannot officially).

So a small test playground, say one of these 1-liter cases that comes in at about 100€, should be an absolute must for admins and interested home users.

If you do not use multiple-fault-tolerant RAID (RAID-Z2 here ;) ), then at least use mtree on the files you do not touch regularly. Also set them to read-only when possible. This is what can be done for the folder containing the pictures of your kids, so you do not push files that may have become corrupt without your knowledge into the backup, possibly overwriting correct versions. Check the mtree from cron. Every disk that went bad on my systems in recent years was reported to me first by ZFS, not by SMART. Any more suggestions, or maybe a "good practice" thread here that I may have missed and could write this into?
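
For anyone without mtree handy, a minimal sketch of the same idea (the directory and manifest paths are placeholders, and SHA-256 stands in for whatever mtree would record):

    # Poor man's mtree: record SHA-256 sums for files that should never
    # change, then report anything that differs on later runs (from cron).
    import hashlib, json, os, sys

    ROOT = "/tank/photos"                     # placeholder directory to watch
    MANIFEST = "/var/db/photos-sha256.json"   # placeholder manifest path

    def hash_tree(root):
        sums = {}
        for dirpath, _, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                with open(path, "rb") as f:
                    sums[os.path.relpath(path, root)] = hashlib.sha256(f.read()).hexdigest()
        return sums

    current = hash_tree(ROOT)
    if not os.path.exists(MANIFEST):
        with open(MANIFEST, "w") as f:
            json.dump(current, f, indent=2)   # first run: record the baseline
    else:
        with open(MANIFEST) as f:
            baseline = json.load(f)
        for path, digest in sorted(baseline.items()):
            if current.get(path) != digest:
                print(f"CHANGED or MISSING: {path}", file=sys.stderr)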
 
Two years ago I bought a cheap HP ProLiant server for about 300 euro. It came with ECC memory. So ECC-capable hardware does not have to be expensive ;)
 
Crivens said:
The ordering in gain-per-buck is a good one, but it does not apply everywhere. I would not worry too much about power loss, as I live in a place where power is quite reliable. It has been more than 15 years since I have seen the lights go out around me for external reasons.

It occurred to me that most people could say the same about memory errors, although it's not really the same situation. To me, a UPS is as necessary as a good power supply, or rather, they are both part of the same thing.

I suspect over the next few years we will see more and more ECC, the same as we will see disk redundancy become more common. The larger capacities will make it necessary.
 
Crivens said:
I would not worry too much about power loss, as I live in a place where power is quite reliable. It has been more than 15 years since I have seen the lights go out around me for external reasons.
You are lucky! And probably in Europe. I live on the edge of Silicon Valley, in the highest high-tech community in the world. Every time the power goes out for more than one second, my server sends me an e-mail. In the winter (when it rains, even in otherwise sunny California) I get these e-mails every few days. Usually, the power comes back after a few seconds (that usually means some electrical line worker had to switch something, or a wet tree branch touched the power line). If the power doesn't come back within ~10 minutes, my server shuts down, to protect the battery in the UPS. That happens maybe 5 times per year. And about once a year my wife calls: she just got home with our son, there is no electricity, she's running the house on the gasoline generator, the kid needs the internet and the files to do homework, but the UPS refuses to stay online with the generator. I always have to remind her that the UPS is sensitive to frequency fluctuations, but that she can bypass it.

Also, ZFS is pretty good at surviving such scenarios; you may lose what you were just doing, but most of the file system should be there.
I hope that ZFS is pretty good at surviving hard crashes, and near-perfect at surviving orderly shutdowns. But for file systems in general, shutting down and starting up is the hardest thing to do. That's probably where a large fraction of the data loss due to software bugs happens.

I completely agree that education and training are a great investment. Albeit one that is difficult to measure and justify.

If you do not use multiple-fault-tolerant RAID (RAID-Z2 here ;) ), then at least use mtree on the files you do not touch regularly. Also set them to read-only when possible. This is what can be done for the folder containing the pictures of your kids, so you do not push files that may have become corrupt without your knowledge into the backup, possibly overwriting correct versions. Check the mtree from cron.

On my previous server (OpenBSD with UFS) I had such a script; in certain directories (ripped music, scanned documents, pictures from the camera) it would automatically set the nouchg flag on files that hadn't been modified in 48 hours. Haven't gotten around to porting that over to my new server (FreeBSD with ZFS). One more thing for my to-do list. Thanks for reminding me.
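
In case it is useful to anyone, a minimal sketch of such a script in Python (I am assuming the intent is to set the user immutable flag, i.e. chflags uchg, on files untouched for 48 hours; the directory list is a placeholder):

    # Mark regular files that have not changed in 48 hours as immutable.
    # BSD-specific: relies on st_flags and os.chflags().
    import os, stat, time

    WATCH_DIRS = ["/tank/music", "/tank/scans", "/tank/photos"]   # placeholders
    AGE = 48 * 3600                                               # 48 hours in seconds

    now = time.time()
    for root in WATCH_DIRS:
        for dirpath, _, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                st = os.lstat(path)
                if stat.S_ISREG(st.st_mode) and now - st.st_mtime > AGE:
                    # Equivalent of "chflags uchg" on this file.
                    os.chflags(path, st.st_flags | stat.UF_IMMUTABLE)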

When I bought my current motherboard, there were no low-power Atom boards in mini-ITX with ECC. This discussion made me look, and: you can now get a somewhat low-power mini-ITX motherboard with ECC! Costs a few hundred $ more, and takes a little more power. Probably on the next upgrade.

Every disk that went bad on my systems in recent years was reported to me first by ZFS, not by SMART.
SMART is good, but not great. There was some data a few years ago, collected by a few Google people. I can greatly simplify the result as follows: of all the disks that fail, only half gave a SMART warning before failing; and of all the disks that gave a SMART warning about imminent failure, only half actually failed. What do we learn from that? If a disk gives a SMART warning, just replace it and put a spare in. And always be ready for disks to fail without warning.
 