ECC or non-ECC

  • Total voters: 56
Why would I need ECC memory on an HTPC? So your question is broad.

ECC if you care about your data. How about that for a short answer.

On my security camera server I have ECC. On my laptop I don't (and can't).

It doesn't make me feel any different.
 
Well, I imagine data integrity comes at a price... either a dollar amount, a performance metric, a risk, peace of mind, etc...

Even when needed, is it worth it? Is it worth the dollar amount?

Are the overhead, potential bottlenecks, and possible reduced performance worth the cost? In which scenarios do you think it's worth it?

In an ideal world, should all RAM be ECC?

In what scenarios should ECC be most strongly considered or weighted for? You say for your security camera; is that a good example of when ECC should be used? Wouldn't fast hard drives be better? idk...

It's a broad question indeed, but it's been a long time since I've asked it, and a lot of technological changes happened...

And on that note, are there generational boundaries where you should weigh in favor of ECC, like maybe older systems pre-something should use ECC more than more modern systems, again, idk... that's why I'm asking...
 
Well, I imagine data integrity comes at a price... either a dollar amount, a performance metric, a risk, peace of mind, etc...
The price is a dollar amount. I don't think there is a performance penalty, if you pay enough. The highest available memory bandwidth is probably found on high-dollar server machines, which all have ECC.

Risk and peace of mind are not a price.

Even when needed, is it worth it? Is it worth the dollar amount?
Depends. How much money do you lose every time the computer crashes and you have to wait a minute or ten while it reboots? How much money do you lose if memory errors (which do exist, but are rare) silently corrupt your data? That heavily depends on the usage. For a laptop that's on my lap and used to browse the web and send e-mail, a crash means I get to get up and pour another glass of wine, and continue working a minute later. My laptop has nearly no data stored on it, so the risk of corruption is very low. On the other hand, for a server at a bank, which is needed to operate thousands of ATMs, and where the data is the content of the customers' bank accounts, the answer is different.

In what scenarios should ECC be most strongly considered or weighted for? You say for your security camera; is that a good example of when ECC should be used? Wouldn't fast hard drives be better? idk...
ECC does not compete with fast hard drives. It competes with RAID and good storage systems (such as ZFS): both make your computer more reliable, and make loss of data or corruption of data less likely.

One factor that goes into it is the amount of RAM. Most laptops or consumer computers sold today seem to have in the neighborhood of 4 to 16 GB of RAM. Many enterprise servers have hundreds of GB, and 1TB of RAM is starting to be seen in production regularly. On one hand, that means that the cost of ECC becomes much higher; on the other hand, it means that the utility of ECC is also much higher (much more data in RAM that makes a big target for corruption).
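
To put rough numbers on that scaling, here is a hedged back-of-envelope sketch, not a measurement: if you assume some fixed error rate per megabit, the expected number of correctable errors grows linearly with installed RAM. The FIT figure below is an assumption chosen to be in the broad range that large field studies (such as the 2009 Google study mentioned later in this thread) have reported as fleet-wide averages; real machines vary enormously, and errors cluster on a minority of DIMMs.

```python
# Back-of-envelope only: how installed RAM scales exposure to memory errors.
# FIT = failures (here: correctable errors) per 10^9 device-hours.
# FIT_PER_MBIT is an assumed, illustrative figure; fleet-wide averages of this
# rough order have been reported, but the median machine sees far fewer errors
# because errors cluster heavily on a small fraction of DIMMs.

FIT_PER_MBIT = 30_000
HOURS_PER_YEAR = 24 * 365

def expected_errors_per_year(ram_gb):
    megabits = ram_gb * 8 * 1024          # GB -> Mbit
    return megabits * FIT_PER_MBIT * HOURS_PER_YEAR / 1e9

for gb in (8, 64, 512, 1024):
    print(f"{gb:>5} GB: ~{expected_errors_per_year(gb):,.0f} expected correctable errors/year")
```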

For an amateur or home user, it is indeed a tradeoff: You can invest $300 into a better motherboard and more expensive DIMMs and get ECC, or into a second hard disk, and get RAID. Which one is a better deal? I don't know.

Would I spend an extra $100 or $200 to get ECC on my user interface device (I use Mac laptops): Absolutely, if I could. Unfortunately, the only laptops with ECC seem to be very large and impractical portable workstations (Lenovo and Dell make them), which cost several thousand $ more than a sensible laptop. Not a useful discussion to have, for lack of options.

On a home server, it is a more interesting question. Personally, I use RAID but not ECC at home, but I know that I'm very biased (being a professional storage person). And when I bought my server a few years ago, I was interested in very small physical size and low power consumption; I don't think ECC in a micro-ATX form factor is even a thing. If I could get ECC the next time I upgrade, I would probably do it.

For an enterprise user, the answer is nearly always ECC for server-class machines; and if it isn't, it's because people have thought about it carefully and have made the tradeoff. For compute engines, it's more mixed; I've seen large clusters with inexpensive non-ECC machines in them (but then carefully managed so crashes don't take the whole cluster down).
 
ECC does not compete with fast hard drives. It competes with RAID and good storage systems (such as ZFS): both make your computer more reliable, and make loss of data or corruption of data less likely.

I don't see how ECC compares with RAID or mirrors... Like one is for stored data, whereas the other is for computed data... or is it not possible for the computer to store corrupted computed data?
 
The files have to get to disk somehow. This somehow is via memory.

My point about my security cam server versus my HTPC is that I could stand to lose a frame of OTA TV, but on something like my security camera server I need every frame to be there.
 
I just never had to deal with corrupted data before, not that I know of, or noticed... I don't really know what it's like, or what I'm/we're talking about, by the same token...

Sometimes you know things but they don't really make sense until you're able to observe them...

What about Zn shields? Do you think one should mind gamma rays and radiation when it comes to data integrity?
 
I don't see how ECC compares with RAID or mirrors.
ECC for memory works pretty much the same way as RAID 5 does for disks. It's slightly different but the idea is the same. You can lose one memory chip and there will be enough redundancy in the rest of the chips to keep the data intact.
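
To make the RAID-5 analogy a bit more concrete, here is a minimal, purely illustrative sketch of the SECDED (single-error-correct, double-error-detect) idea that ECC builds on. Real ECC DIMMs do this in hardware over 64-bit words with 8 check bits; the toy code below protects a single byte, and none of it reflects how an actual memory controller is implemented.

```python
# Toy SECDED sketch: 8 data bits + 4 Hamming check bits + 1 overall parity bit.

def encode(byte):
    data = [(byte >> i) & 1 for i in range(8)]
    code = [0] * 13                       # positions 1..12, index 0 unused
    j = 0
    for pos in range(1, 13):
        if pos & (pos - 1):               # not a power of two -> data position
            code[pos] = data[j]
            j += 1
    for p in (1, 2, 4, 8):                # check bit p covers positions with bit p set
        for pos in range(1, 13):
            if pos != p and pos & p:
                code[p] ^= code[pos]
    overall = 0
    for pos in range(1, 13):
        overall ^= code[pos]              # extra bit turns SEC into SECDED
    return code, overall

def decode(code, overall):
    code = code[:]                        # don't mutate the caller's copy
    syndrome = 0
    for pos in range(1, 13):
        if code[pos]:
            syndrome ^= pos               # XOR of set positions = error location
    parity = overall
    for pos in range(1, 13):
        parity ^= code[pos]
    if syndrome and parity:               # one bit flipped: fix it
        code[syndrome] ^= 1
        status = "corrected"
    elif syndrome:                        # two bits flipped: detect, can't fix
        status = "uncorrectable"
    else:
        status = "ok"                     # clean (or only the extra parity bit flipped)
    byte = 0
    bits = [code[pos] for pos in range(1, 13) if pos & (pos - 1)]
    for i, b in enumerate(bits):
        byte |= b << i
    return byte, status

code, ov = encode(0xA5)
code[6] ^= 1                              # simulate a single-bit memory error
print(decode(code, ov))                   # -> (165, 'corrected')
code[3] ^= 1                              # a second flip in the same word
print(decode(code, ov))                   # -> status 'uncorrectable', data untrustworthy
```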

I just never had to deal with corrupted data before, not that I know of, or noticed... I don't really know what it's like, or what I'm/we're talking about, by the same token...
Believe me when I say you will never want to deal with it. It's a royal pain trying to recover from it and there's never a guarantee the data you recovered is not corrupted in some very subtle way (a few flipped bits here and there).

Back in the olden days on the Amiga there was this virus called "Lamer Exterminator". It was a royal pain if you were infected. It was one of the first that was memory resident (it was still active after a reboot). If it was active, it hid itself when you tried to read the boot sector. But the worst part of it was that it randomly filled tracks on disk with the "LAMER" text. One track at a time, at random intervals. So at first you don't notice it's active. Then you get more and more weird disk errors. Until you realize half your disk was silently overwritten by this monster and there was no way to recover from it anymore.

Same with memory errors. They can silently corrupt files in memory before the data is written to disk. And it can take a while for you to notice those files are corrupt. Now imagine you diligently back up your data every day. Your backups will be corrupted too because you're backing up bad files. If this goes unnoticed long enough, all your backups will be worthless too. And then you get hired to sort this mess out. An expensive, tedious exercise which could have been avoided if they had spent just a little bit more money initially.
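
One cheap habit that at least makes this kind of silent rot visible before it poisons every backup generation is keeping a checksum manifest of the files you care about and verifying it before each backup run. A minimal sketch of the idea follows; the paths are made-up placeholders, and this is an illustration, not a recommendation of any particular tool.

```python
# Minimal checksum manifest: detect silently corrupted files before they get
# copied into yet another backup generation. Paths below are placeholders.

import hashlib, json, pathlib

DATA_DIR = pathlib.Path("/home/me/photos")            # hypothetical data directory
MANIFEST = pathlib.Path("/home/me/photos.sha256.json")

def sha256(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest():
    manifest = {str(p): sha256(p) for p in DATA_DIR.rglob("*") if p.is_file()}
    MANIFEST.write_text(json.dumps(manifest, indent=2))

def verify_manifest():
    manifest = json.loads(MANIFEST.read_text())
    return [p for p, digest in manifest.items()
            if pathlib.Path(p).is_file() and sha256(pathlib.Path(p)) != digest]

# Run build_manifest() once, then verify_manifest() before every backup;
# a non-empty result means a file changed that you didn't knowingly edit.
```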
 
Well, more and more people handle sometimes-irreplaceable data on their computers... You may not think that someone's pictures from a vacation to Cuba are irreplaceable data, but maybe to them those pictures mean the world, and they figure they're safe on the computer, and then, I don't know, they rotate them or perform some sort of operation on them, and the files all end up streaked with bad bits, or worse, unreadable...

Sure it's not like it affects millions of people, but I'm pretty sure those pictures were probably worth $40 extra to that person... not to mention that they have to replace their non-ECC RAM anyway...

I was starting to think ECC was just for database servers, but you're making me realize that today, there's a database server (often even more than one) on all computers out there, and that once the data's gone... it's gone. So I think I'm going to change my answer from "Depends..." to "ECC"...
 
And when I bought my server a few years ago, I was interested in very small physical size and low power consumption; I don't think ECC in a micro-ATX form factor is even a thing. If I could get ECC the next time I upgrade, I would probably do it.

Exactly what I needed; I bought HP N40L MicroServers with ECC RAM.
 
ECC, definitely. I do care about my data.

It's a pity it's not widely popular on desktop boards. Especially in 2017, when there's no problem overpaying for a smartphone.
 
Honestly, I wonder why Non-ECC is even a thing... I think all memory should be ECC...

I think there's cost-cutting in favor of efficiency, and there's cost-cutting in favor of greed (just being cheap). I think efficiency is a good thing, whereas being greedy or cheap should be punishable by death.

Life is already shitty enough; people who make a point of making it worse should meet their makers at the gallows.
 
Honestly, I wonder why Non-ECC is even a thing... I think all memory should be ECC...

I completely agree. Personally I think it's also a historic thing - it was more expensive to produce these modules before.

But just a thought: most FS operations go through RAM, and you have no way of knowing if "data in" == "data out"... Scary...

My job is managing big corporate servers, from small RX Itanium servers through blade servers, up to high-end Itanium Superdomes. Of course, all of these have ECC RAM. Over two to three years you can find single-bit errors on a considerably large number of RAM modules. Imagine if you used non-ECC RAM -- that's like playing Russian roulette with your data.
 
I completely agree. Personally I think it's also a historic thing - it was more expensive to produce these modules before.
It is still more expensive; you end up using a few percent more RAM (in the sense of gates and silicon area, perhaps not chips or DIMMs). That costs money. For large computer users (Google, Facebook, the US government, ...) that is a tradeoff; but those large users can make informed choices. Where I agree with you: consumer computers should all have ECC, because (a) the extra cost is minimal compared to the price elasticity of consumers, and (b) end users are not capable of making those informed choices.

But just a thought: most FS operations go through RAM, and you have no way of knowing if "data in" == "data out"...
Over two to three years you can find single-bit errors on a considerably large number of RAM modules. Imagine if you used non-ECC RAM -- that's like playing Russian roulette with your data.
It's not quite that bad. In large systems, file system data is written to disk pretty quickly, so at least the write path (application -> disk) is only vulnerable to memory errors for about 30s or less. The read path for cache hits is obviously still a problem. And some file systems (for example ZFS, but others too) protect the data with checksums, and the checksums are also kept in memory for the data in memory. Obviously, this is not perfect: during the calculation of the checksum the data is still unprotected, but in a well-designed system, the checksum is calculated first and checked last, to keep that window as small as possible. Some kernel software products even keep light-weight checksums of other long-lived data structures in memory, to protect them against bit rot. Again, this is not perfect (the cost of protecting every data structure would be too high, it would amount to implementing ECC in software), but it covers a very large fraction of the memory in use.
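
As a rough illustration of that "checksum early, verify late" pattern, here is a sketch of the principle only, not ZFS's or any kernel's actual code: compute a lightweight checksum as soon as a long-lived buffer is produced, and re-check it immediately before the data is consumed, so a flip during the time in between is at least detected rather than silently written out.

```python
# Sketch of "checksum early, verify late" for a long-lived in-memory buffer.
# Purely illustrative of the principle described above.

import zlib

class CheckedBuffer:
    def __init__(self, data: bytes):
        self._data = bytearray(data)
        self._crc = zlib.crc32(self._data)        # computed as early as possible

    def get(self) -> bytes:
        if zlib.crc32(self._data) != self._crc:   # verified as late as possible
            raise IOError("in-memory corruption detected (checksum mismatch)")
        return bytes(self._data)

buf = CheckedBuffer(b"important block about to be written to disk")
buf._data[5] ^= 0x01          # simulate a bit flip while the block sat in RAM
try:
    buf.get()
except IOError as e:
    print(e)                  # corruption is caught before the bad data is used
```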

Still, if I had to specify a server, I would go for ECC if reasonably possible. Thank you, Phishfry, for pointing out some MicroATX boards with ECC; it might be time to do a hardware upgrade.
 
ECC does not compete with fast hard drives. It competes with RAID and good storage systems (such as ZFS): both make your computer more reliable, and make loss of data or corruption of data less likely.

ECC is highly encouraged with ZFS, so it's not at all in competition. Memory is one of the few weak links in the chain, as memory corruption will screw with your on-disk checksums and lead to corruption, potentially of the entire pool, no matter how much redundancy you have. Memory corruption is potentially far worse with ZFS than with a non-checksummed filesystem.
 
Believe me when I say you will never want to deal with it. It's a royal pain trying to recover from it and there's never a guarantee the data you recovered is not corrupted in some very subtle way (a few flipped bits here and there).
[...] And it can take a while for you to notice those files are corrupt. Now imagine you diligently back up your data every day. Your backups will be corrupted too because you're backing up bad files. If this goes unnoticed long enough, all your backups will be worthless too.
I was in that situation and had to sort that mess out, to save at least part of the valuable data.
After that I came to the conclusion that I no longer want this kind of hidden data rot, decided "Never again", sold my non-ECC desktop PCs, and bought some cheap used workstations with buffered ECC DDR3, which is the cheapest RAM in $/GB when bought used in bulk.
My laptop has no ECC, so I equipped it with 14800 RAM modules from a first-source manufacturer, clocked at 10600 to make sure it's well underclocked. It seems quite reliable, but who really knows.

I completely agree. Personally I think it's also a historic thing - it was more expensive to produce these modules before.
Not really. Up to the mid-to-late 1980s there was a ninth socket (parity) for each byte row. When memory modules came along, at first they commonly had ECC even in consumer grade.
But the cheapo mind slowly took over. Some 8088 PC clones already had a dip switch to deactivate parity checking, so people could save 1/9th of the memory cost.
More and more memory modules were sold with the 9th chip missing. In the late 1990s it was almost impossible to find consumer grade PCs that were still able to use ECC/parity protected memory.
And the data rot age began...
 
... That's awful... For a single chip... All that for a single, miserable, chip... it's not like they don't have the design for it, either, or have to draw a special ECC card... they draw the non-ECC from the ECC one...

That's so low...
 
In computers with large memory (lots of enterprise machines ship with 256 or 512GB these days, and 1TB is not uncommon), the cost of memory is a driving factor. Now multiply that across a large cluster (with thousands of machines, which for the likes of Google and Facebook and supercomputers and non-existing agencies would be a small cluster). In that situation, a customer can make the deliberate tradeoff that his machines will crash occasionally (very rarely, memory errors are actually not common), and generate wrong results (also very rarely). The wrong results will usually be caught by the toolchain (since usually they create corruption which is detected by the next processing stage), and crashes can be handled transparently by rerunning jobs on remaining machines. So the performance loss due to crashes/reruns is very small, and the cost saving is many percent, which works out to millions (it is not uncommon for single customers to buy $10M or $100M clusters). Economically, this may be a win. It may also be a loss, if the cost of worrying about wrong results is worse. If I remember right, Google uses ECC for all machines, even data processing; but I know some large analytics customers deliberately do not use ECC.

For an end-user with one machine, and without a well-developed data processing chain, cluster management, and job scheduling tools, the answer is obviously very different. There, the memory saving is a few dollars, and the cost of one crash is high, and of one data corruption very high (it might mean days of downtime).
 
In that situation, a customer can make the deliberate tradeoff that his machines will crash occasionally (very rarely, memory errors are actually not common), and generate wrong results (also very rarely). The wrong results will usually be caught by the toolchain (since usually they create corruption which is detected by the next processing stage), and crashes can be handled transparently by rerunning jobs on remaining machines.

I disagree; this is not how business is done where I work (note: it doesn't mean it's not done like this somewhere).

Silent corruption is bad. You need to trust your data (trust, but audit). Even the weirdest solution architect would not sacrifice ECC RAM to save a small amount of money. We have some PROD HANA boxes with 12TB of RAM (SuperdomeX). The cost of these boxes is so huge that you don't think about saving spare change with non-ECC RAM.

But even the price of entry-level servers is high enough not to consider non-ECC RAM. And even if you scale it to a few hundred servers, it's not that huge a saving. If you need that many servers (and more), you are in business -- money makes money. And if you are a small starting company -- you can't afford to have silent data corruption go undetected. The fines you would (probably) have to pay to a customer would ruin your business.
And that's all business...

Personally, I don't want a ZFS server storing my personal data with non-ECC RAM. I bought 32GB of ECC RAM (8GB DDR3-1600MHz Kingston ECC CL11 w TS Intel) in 2013 for 303.60 EUR. Unfortunately I don't know how much it would have been for non-ECC ones; I'm guessing around 150 EUR maybe. So not that huge a difference overall.
 
...they draw the non-ECC from the ECC one...

I see, I was oversimplifying in my historical lookback for brevity. The topic is more complex, so let me tell a bit more history.

DRAM data security is not a new issue, and this is one of the reasons why mission-critical embedded hardware often is static (SRAM, static processors etc).

Memory safety has always been held in high esteem in "serious" computing. Before the advent of the IBM PC, microcomputers were effectively toys for enthusiasts. Maybe except for the S-100-bus-based CP/M and MP/M systems, which ran the killer app "WordStar" and were much used professionally, even though belittled by the "real computing" mainframe world.
And all these 8-bit systems had no memory parity checking.

Memory safety in the form of parity checking was introduced by the IBM PC into the microcomputing world.
This was a big reliability leap, which was very important because at the same time hard disks began to become a mass-market item, once miniaturization reached the 5-inch full-height form factor.
The first PC hard disks were a whopping 10 megabytes in size and blazing fast (80 ms average access time) compared to the then industry-standard 8" 1.2MB diskette.
In comparison to diskettes, which were typically specified for around 40 hours of operating lifetime but in reality often lasted much less, this was a revolution.
People were accustomed to regularly changing much-used diskettes every few weeks or months. A hard disk had a practically unlimited lifetime in comparison.
Before the advent of hard disks, microcomputer users were thus accustomed to regular diskette failures, and not much backup education was necessary.

Apple products, for example, were considered hobbyists' toys back then. Apple is an example of a company that began using ECC memory quite late, albeit AFAIK only for their professional-grade products.
The "real computing" scene, i.e. Big Blue, DEC, Amdahl, NCR, etc., who served professionals who depend on their data lasting more than a few months, had in contrast used error correction on RAM and disk as far back as the 1950s.

--

The codinghorror article drhowarddrfine mentions is a good example of the mindset that led to the discarding of RAM data integrity checking when deemed economical.
It begins by showing Google's first handmade servers, from when they were still a startup, as an example of commercial use of non-ECC RAM.
But think about it... their business is to rebuild their data continuously. Small mishaps will practically go undetected and go away by themselves. So I think this is a typically misleading kind of example when arguing against ECC.

Google's 2009 study is quite good, but they also see only a part of the whole picture. There are many more neglected reasons why soft and hard errors happen that I rarely see discussed, if at all.

I still remember the discussion around 1980 when, in the course of the transition from 16 kbit to 64 kbit DRAMs, it became common to cover the chips with a particle-shielding layer, as it turned out that the structures, then still many micrometers wide and thousands of times bigger than today's cutting edge, had become so small that a single alpha particle could flip bits in the DRAMs if it hit the right spot at the right time.
And you just cannot shield against the trillions of cosmic particles that hit the Earth every second. The most energetic cosmic particles, though subatomic in size, carry the kinetic energy of a well-pitched baseball because they travel at almost the speed of light. It is really hard to imagine the immensely destructive micro-wreckage when such a particle hits an atomic nucleus in, say, a memory cell.
And keep in mind that the radiation is not constant; there are outbursts and spikes, like solar flares.

It is a well-observable fact in huge server farms that the reboot rate rises sharply when exceptionally strong flares happen (still far short of a Carrington event).

And now imagine your computer happens to get hit by a cosmic particle shower which flips, say, 10, 100, or 1000 bits of your gigabytes.
If you have ECC memory, the risk that you will suffer data rot is relatively low. Your computer might report a spurious increase in corrected errors or even reboot; that will probably be all.
But if you have no ECC, you could end up with very nasty scenarios, without noticing the actual incident at all.

I just have to look at my own data rot case.
I started to notice more and more corrupted files. Here and there, just random. Just a bit. It was easy to recognize in text files. Images either displayed distorted or even crashed the viewer, etc.
That was very disturbing because I knew that these files had once been intact. I checked my backup DVDs and got the impression that the main damage must have happened at some point, or over some time interval, quite a long time ago.
At the time in question my main PC (consumer grade, no ECC) was running Linux with the ext3 filesystem (I was on Linux for a few years because my Symbios hardware prevented me from booting FreeBSD).
I don't know what caused the errors. RAM? Bus? Other things? The RAM error issue can be mitigated to a high degree by using ECC. The disk error issue likewise.

So I finally changed to ECC hardware and ZFS, and hope I won't experience such a thing again.
It really felt like a cosmic particle storm had turned my whole data set into Swiss cheese, full of holes. The scariest thing was realizing how late I noticed it.
 
There is also a difference in price depending on who is buying the memory. Home users and small businesses usually pay a lot more for memory (and everything else) than big business, which does not need to rely on local distribution.

If you need a sizeable quantity of memory sticks, you can call the factory in China and buy directly from them. It will cost a lot less, but you will need to handle all the import-related stuff (or have someone handle it for you), and know the right contacts abroad so you don't fall for a scam.

Just take a look at the prices on something like eBay or AliExpress, and keep in mind that those (re-)sellers are already making money on it.
 
Luckily, as a small user, I am not in a situation where I must use the latest brand-new hardware. My computing needs can be perfectly fulfilled with used workstations.
Many of these are of much higher quality than any brand-new consumer PC, and their performance is comparable.

What I also like is that registered DDR3 ECC memory (sold in bulk on eBay by sellers specializing in refurbished stuff) costs about one third of the price of even used non-ECC consumer-grade memory!
And you can put many more modules into most workstations than into any consumer PC!

So this is my recommendation to those of you FreeBSD friends who have similar needs :)
 
... That's awful... For a single chip... All that for a single, miserable, chip... it's not like they don't have the design for it, either, or have to draw a special ECC card... they draw the non-ECC from the ECC one...
It is worse than that. 30-pin SIMMs (the 9-chip ones) often came with "logic parity" (fake parity), where the SIMM computed the expected parity value based on the data that was in the 8 memory chips (which may not have been the data that the system actually thought it was storing). There was enough of this that a number of BIOS companies skipped the parity test on power-up and instead just zeroed all of memory (they needed to write a predictable pattern to all of memory on the off chance that there was real parity memory in there, in which case the system would get an NMI if it read from uninitialized memory that happened to have the wrong parity).
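
For anyone who hasn't run into "logic parity" before, here is a tiny illustrative sketch of why it defeats the purpose: real parity stores a ninth bit at write time and compares it at read time, while a fake-parity module simply regenerates the bit from whatever is currently in the eight data chips, so the comparison can never fail.

```python
# Illustration of why "logic parity" (fake parity) defeats the purpose.

def parity(byte):
    return bin(byte).count("1") & 1

# Real parity module: the check bit remembers what was written.
stored_byte, stored_parity = 0b10110010, parity(0b10110010)
stored_byte ^= 0b00000100                      # a bit rots in the DRAM
print(parity(stored_byte) != stored_parity)    # True -> NMI, error caught

# "Logic parity" module: the check bit is recomputed from the rotten data.
stored_byte = 0b10110010
stored_byte ^= 0b00000100                      # same bit rot
fake_parity = parity(stored_byte)              # regenerated on the fly
print(parity(stored_byte) != fake_parity)      # False -> error sails through
```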

Things got somewhat better with 72-pin modules, as there were now 32 bits of data per module. A parity module would have 36 bits, one parity bit per byte. But when installed in pairs, that gave you 8 parity bits for 64 bits of memory. ECC became practical.
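
The arithmetic behind "8 check bits for 64 data bits" works out neatly: a Hamming code needs r check bits with 2^r >= data_bits + r + 1 for single-error correction, and one extra overall parity bit upgrades it to SECDED. A quick illustrative check:

```python
# Why a 72-bit stored word (64 data + 8 check bits) is exactly enough for SECDED.
def sec_check_bits(data_bits):
    # smallest r with 2**r >= data_bits + r + 1 (single-error correction)
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r

r = sec_check_bits(64)     # -> 7
print(r + 1)               # -> 8: plus one overall parity bit for double-error
                           #    detection, which a pair of 36-bit modules provides
```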

However, there were other issues to confuse buyers - FPM (traditional DRAM) vs EDO, parity/ECC vs non, and worst of all, gold leads vs tin leads. Tin leads were used in piercing (pointy pin) sockets, and gold leads were used in non-piercing (wiping finger) sockets. Putting a gold module in a piercing socket damaged the socket, as the pins "stubbed their toes" on the gold contacts (while gold is soft, it is a very thin layer on top of copper, which isn't).
 