Do I need ECC RAM for a small home server with ZFS and RAID1 (NAS, Nextcloud, jails)?

Hello,
I would like to build a new small home server.
I wonder if I need ECC RAM.
I have read that this is a myth:
"
Myth #2 - ECC RAM must be used.

○ All filesystems benefit from ECC RAM and ZFS is no different here.

○ ZFS without ECC RAM is safer than other filesystems with ECC RAM (checksums).
"
But I wanted to ask again just to be sure.
Thank you very much.
Frank
 
Yeah, ZFS helps a bit if you have, or are starting to have, memory corruption. When ZFS says some files are bad, you definitely want to stop overwriting one-shot backups and run memory testing.

I am a big proponent of ECC memory. Not so much for the error correction, but for the warning messages that you get when your RAM goes bad. If you don't have ECC, you can't tell whether you need ECC.

ZFS has little to do with it, but if you run large filesystems (ZFS or not) you probably want ECC.
 
ECC is a good idea for any system that is long running and keeps data in memory for a long time.

With the amount of memory in machines these days, flipped bits are very likely. Writing cached data to storage, or reusing it, will then give you problems. Our admin at university showed us the log from a Sun machine that reported flipped bits caught by ECC. When that thing is running long calculations (VLSI P&R, for example), it would absolutely suck to have a bit error somewhere along the line. ZFS keeps data in memory for a long time, even compressed. Having bit errors in ARC will make life interesting, and running consumer-grade hardware for months and expecting nothing to go wrong is just asking for this.

The rule of ECC for ZFS comes from an area where uptimes are not measured in days, and where losing a compute run or messing up your data will cost serious money compared to the few wampum extra for ECC over regular memory.
 
The myth is the word "must". Your question is "need". You do not need ECC and it is not a must-have, that is true.

If you are selecting ZFS for data integrity, then it stands to reason you should select ECC, for maintaining that integrity. Otherwise you are gambling that you will not need it, which you won't know until something weird happens. If you're comfortable in that scenario with blaming it on weird FreeBSD or ZFS problems and moving on, then I guess you don't need it.
 
To answer the question, you must ask how much you value your data and system reliability. If both are important, then ECC RAM is mandatory.

There's plenty of empirical evidence to support this stance. For instance, this paper from Google, which starts its list of conclusions with:

"About a third of machines and over 8% of DIMMs in our fleet saw at least one correctable error per year."

Just one memory error might irretrievably corrupt some data or crash your system.
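
To get a feel for what that figure could mean for a small box, here is a back-of-the-envelope sketch. It rests on assumptions that are mine, not the paper's: that the 8% per-DIMM yearly rate carries over to a home server as-is, and that DIMMs and years are independent of each other. The function name is made up for illustration.

```python
def p_at_least_one_error(dimms: int, years: float = 1.0,
                         p_dimm_year: float = 0.08) -> float:
    """Chance that at least one DIMM sees a correctable error.

    Assumptions (not from the paper): every DIMM independently has the
    same probability p_dimm_year of seeing an error in a given year.
    The quote says "over 8%", so 0.08 is on the low side.
    """
    p_none = (1.0 - p_dimm_year) ** (dimms * years)
    return 1.0 - p_none


if __name__ == "__main__":
    for dimms in (1, 2, 4):
        p = p_at_least_one_error(dimms)
        print(f"{dimms} DIMM(s), 1 year: {p:.1%}")
```

Under these assumptions the chance grows quickly with the number of DIMMs and the number of years the box stays in service, which is the point of the quote above.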

I don't have ECC memory on my media clients because I don't care enough to do it. I don't have ECC memory in my notebook because it's not supported. It's an imperfect world...
 
A lot of what is said above is correct. I want to add one observation on the relationship between ZFS and ECC DRAM.

It is often said that ZFS "needs", "wants" or "benefits" from ECC. That is partially wrong: ZFS does not keep particularly more data in memory than other file systems do. As a matter of fact, on a typical Unix machine, fundamentally all unused memory is used as a cache for recent data, so in a nutshell all memory is used at all times, and is therefore also vulnerable to memory errors. This is true for all Unixes, and all file systems, including ZFS on FreeBSD. So from this viewpoint, ZFS doesn't need ECC any more than other file systems do.

All Unix file systems keep data, and more importantly metadata, in memory before it is written to disk, typically for seconds (5 or 30 seconds are common). During that time, the data is highly vulnerable to memory errors, because if the in-memory copy is corrupted, the corrupted data will be written to disk. ZFS is not particularly better or worse in how long it keeps data in memory. One thing that helps ZFS is that it calculates checksums of data very early on, so if the copy in memory is corrupted and then copied to disk, on the next read the wrong checksum will flag the corrupted data as invalid. Which is good (at least we're not operating on wrong data) and bad (you will now get a checksum error, which most people will (mis-)interpret as a disk error).
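
The timing argument above can be sketched in a few lines. This is a toy model, not ZFS code: SHA-256 stands in for whatever checksum the file system uses, and `write_block`, `read_ok` and `flip_bit` are hypothetical helpers invented for the illustration.

```python
import hashlib


def checksum(block: bytes) -> bytes:
    # SHA-256 is a stand-in for the file system's block checksum.
    return hashlib.sha256(block).digest()


def flip_bit(block: bytes, bit: int) -> bytes:
    """Return a copy of block with one bit inverted (simulated RAM error)."""
    damaged = bytearray(block)
    damaged[bit // 8] ^= 1 << (bit % 8)
    return bytes(damaged)


def write_block(block: bytes, flip_before: bool = False,
                flip_after: bool = False):
    """Simulate the write path; returns (stored_data, stored_checksum)."""
    if flip_before:
        block = flip_bit(block, 3)   # corrupted before the checksum exists
    stored_sum = checksum(block)
    if flip_after:
        block = flip_bit(block, 5)   # corrupted while waiting in memory
    return block, stored_sum


def read_ok(stored) -> bool:
    """Verify a stored block against its stored checksum."""
    data, stored_sum = stored
    return checksum(data) == stored_sum


if __name__ == "__main__":
    original = b"payroll record 42"
    print("clean write verifies:", read_ok(write_block(original)))
    print("flip after checksum verifies:",
          read_ok(write_block(original, flip_after=True)))
    stored = write_block(original, flip_before=True)
    print("flip before checksum verifies:", read_ok(stored),
          "| data intact:", stored[0] == original)
```

A flip after the checksum was taken is caught on the next read. A flip before the checksum was taken verifies fine, because the checksum faithfully describes the already-wrong data; only ECC would have noticed that one.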

I don't think ZFS keeps checksums of all its in-memory data structures at all times. Doing so would be prohibitively expensive, in particular in the free software arena, where too many decisions are driven by (often stupid) benchmarks; all I need to say is "Larabel". Some commercial storage systems do much more extensive checksum protection in memory, not just of data blocks, but also of internal data structures, such as allocation tables; I know of no such technology in the free software world.

The fact that ZFS keeps checksums of data on disk makes it more valuable to have ECC, but that logic is a little difficult to explain. For most storage systems, the biggest source of data corruption and data loss is hardware problems, with the disks themselves (both spinning rust and flash are far from perfect) and with interfaces, including good-quality commercial ones (a colleague used to cut SAS cables that caused CRC errors in half with wire cutters, to make sure they didn't get reused). To prevent data loss, you use RAID-like technologies, which ultimately rely on redundancy (trading more storage use for higher durability), and ZFS does a good job supporting that. That leaves silent data corruption as the next biggest problem. ZFS uses checksums to guard against most of these problems; most other file systems do not use checksums. Therefore, on ZFS the largest remaining source of data corruption is memory problems; other file systems are still dominated by disk and interface hardware problems. So on ZFS it is RELATIVELY (not absolutely!) more valuable to invest in memory protection once you have invested in disk redundancy, and that gives you a much better hardened system.

I hope the above explanation is at least slightly clear. And having said that: Do as I say, not as I do: my personal server at home does NOT have ECC memory. I would like it to, and on the next motherboard upgrade it's going to happen, but last time I bought a new motherboard, other concerns made ECC impractical.
 
For Intel RAID, the cache is protected with ECC:

Error checking and correction (ECC) DRAM technology protects the data while it is in cache. The ECC scheme generates 8 bits of check data for every 64 bits of regular data transferred. The memory controller uses this information to detect and correct data errors originating inside the DRAM chip or across the memory bus.
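
The 8-check-bits-per-64-data-bits figure matches what a textbook Hamming code with one extra overall parity bit (SECDED: single error correction, double error detection) needs. The sketch below assumes such a textbook code; real memory controllers use their own code layouts, so treat `encode` and `decode` as illustrative names, not as what any particular controller does.

```python
def check_bits_needed(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def encode(data):
    """Encode a list of 0/1 data bits into a SECDED codeword.

    Index 0 holds the overall parity bit, power-of-two indexes hold the
    Hamming check bits, and every other index holds a data bit.
    """
    r = check_bits_needed(len(data))
    n = len(data) + r
    word = [0] * (n + 1)
    bits = iter(data)
    for pos in range(1, n + 1):
        if pos & (pos - 1):              # not a power of two: data position
            word[pos] = next(bits)
    for i in range(r):
        p = 1 << i
        parity = 0
        for pos in range(1, n + 1):
            if pos & p:
                parity ^= word[pos]
        word[p] = parity                 # makes parity over group p even
    word[0] = sum(word[1:]) % 2          # makes parity over the whole word even
    return word


def decode(word):
    """Return (status, data); status is 'ok', 'corrected' or 'uncorrectable'."""
    word = list(word)
    syndrome = 0
    for pos in range(1, len(word)):
        if word[pos]:
            syndrome ^= pos              # XOR of the indexes of all set bits
    overall = sum(word) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1 and syndrome < len(word):
        word[syndrome] ^= 1              # syndrome 0: the parity bit itself flipped
        status = "corrected"
    else:
        return "uncorrectable", None     # even number of flips: detect only
    data = [word[pos] for pos in range(1, len(word)) if pos & (pos - 1)]
    return status, data


if __name__ == "__main__":
    data = [(i * 5 + i // 3) % 2 for i in range(64)]
    word = encode(data)
    print("codeword length:", len(word))
    word[17] ^= 1                        # simulate a single flipped bit
    print("after one flip:", decode(word)[0])
    word[40] ^= 1                        # and a second one
    print("after two flips:", decode(word)[0])
```

For 64 data bits, `check_bits_needed` gives 7, and the overall parity bit brings the total to 8, which is where the 72-bit width of ECC memory comes from under this assumption.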

Another good read on HDDs and their end-to-end protection is T10 Protection Information (T10-PI), which describes how the data on the HDD is protected. Its predecessor is the IBM FM/MFM track format of the floppy disk and the way the checksum is calculated for each sector.
 
Thank you very much for the detailed answers.
I will probably put together a new system with ECC. My current home server is more than 8 years old and also has ECC.
Many thanks and best regards
Frank
 
As this discussion has centered around ECC (volatile RAM) memory, it's perhaps worthwhile mentioning another aspect of ZFS that may well catch (new) ZFS users by surprise.

ECC memory stands for Error Correction Code memory, meaning bit errors are not only recorded but also corrected, at least where it concerns single-bit errors. For ZFS and secondary storage, that is non-volatile drives, be they flash or spinning-platter based, correction is not that automatic.

ZFS has extremely good safeguards for ensuring that data served from storage is not faulty. However, it can only correct data errors when the necessary redundancy is present. When there is no redundancy (or insufficient redundancy given enough errors), ZFS cannot correct the error(s). For example, a non-redundant one-disk ZFS setup (with the standard copies=1 setting) will fail when discovering a single error when data* is being scanned (scrub) or serviced. You need redundancy: either an n-way mirror or a RAIDZ-1, -2 or -3 setup. Perhaps with the exception of cases where only a single disk can be deployed, like a laptop, I would suggest: do not consider a non-redundant ZFS setup.

The unexpected aspect is that when sufficient redundancy is not available the ZFS pool will fail and cannot be imported. Note that there is no fsck(8) equivalent for ZFS: you'll have to resort to backups.

___
* ZFS metadata has built-in redundancy.
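
The difference between detecting and correcting can be sketched with a toy model. This is not how ZFS is implemented: SHA-256 stands in for the block checksum, each `bytearray` stands in for the same block on one disk, and `read_with_repair` is a hypothetical helper.

```python
import hashlib


def checksum(block: bytes) -> bytes:
    # SHA-256 is a stand-in for the file system's block checksum.
    return hashlib.sha256(block).digest()


def read_with_repair(copies, expected_sum: bytes) -> bytes:
    """Return a copy that matches the checksum and rewrite the bad ones.

    copies: list of bytearray, one per disk holding this block.
    Raises IOError when no copy matches, i.e. the error is detected
    but there is nothing left to correct it from.
    """
    good = next((bytes(c) for c in copies
                 if checksum(bytes(c)) == expected_sum), None)
    if good is None:
        raise IOError("checksum mismatch on every copy")
    for c in copies:
        if bytes(c) != good:
            c[:] = good                  # overwrite the damaged copy
    return good


if __name__ == "__main__":
    block = b"important data"
    stored_sum = checksum(block)

    mirror = [bytearray(block), bytearray(block)]
    mirror[0][0] ^= 0x01                 # damage the copy on the first disk
    print("mirror read:", read_with_repair(mirror, stored_sum))
    print("first disk repaired:", mirror[0] == block)

    single = [bytearray(block)]
    single[0][0] ^= 0x01                 # same damage, no second copy
    try:
        read_with_repair(single, stored_sum)
    except IOError as err:
        print("single disk:", err)
```

With two copies the read succeeds and the damaged copy is rewritten; with one copy the same damage can only be reported.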
 
Keep in mind that ZFS has no fsck.

If the corrupted bits are not in data but in metadata, you can face hard-to-correct situations.
 