Anybody use STEC ZeusRAM SSD?

asomers

Developer
The STEC ZeusRAM is a SAS SSD based on supercap-backed DRAM that flushes to flash upon power loss. When it first came out, it was a great choice for ZFS log devices. But by now it's old. Today we rebooted the last server at $WORK that still had any of these devices installed. Not just a soft reboot either, but a power cycle. And when the server came back up, ALL FOUR SSDS HAD FAILED. That's really weird. SMART did not (and still does not) show any warning signs for any of them. But I can't read a byte from any of the four.

WTF? Maybe there was a power surge that fried them. But if so, it didn't affect any other equipment. Or maybe their supercaps had all failed and we didn't notice until power cycling? Or, worst of all, maybe some firmware bug caused them to fail to boot after a certain date, or a certain number of power-on hours? Has anybody else had a similar experience?

For the record, these are model STM0001952A6, firmware revision C023.
 
How old are they? I remember getting some of the first STEC Zeus drives when they came out (must have been 2007 or 2008). They were heinously expensive (five-digit prices), connected only via Fibre Channel, and were amazingly fast and reliable. We used them to replace real RAM disks, which were 6U rack mounts with a few dozen GB of dynamic RAM, a CPU, 200 pounds of lead-acid batteries, and two Fibre Channel interfaces.

The STECs are not completely based on DRAM. They are mostly flash, with a "small" DRAM write buffer that is backed by supercap. Having the small write buffer means that recently modified data that is overwritten rapidly doesn't have to be flushed to flash (pun!) immediately, which greatly improves flash lifetime. Nice design. One of the interesting issues with the Zeus drives is that their performance characteristics are "weird". Reads are obviously somewhere between fast and extremely fast. Writes can be really fast (if you keep overwriting the same thing in DRAM), or pretty fast (if the DRAM needs to destage into flash), or somewhat slow (if the write pattern is such that the DRAM buffer is completely defeated and every data page has to go to flash).
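To make the write-buffer effect concrete, here is a rough toy model in Python. It is only a sketch under my own assumptions: I am guessing at an LRU eviction policy and page-granular destaging, and the function name and parameters are made up for illustration. It is not how the actual firmware works; it just counts how many writes would reach flash for a given write pattern.

```python
from collections import OrderedDict


def flash_writes(pattern, buffer_pages):
    """Count pages destaged to flash for a sequence of page writes.

    Toy model (assumed, not STEC's real design): a write-back buffer
    holding `buffer_pages` pages with LRU eviction. Overwriting a page
    that is already buffered costs no flash write; evicting a page to
    make room costs one. Pages still buffered at the end are not counted.
    """
    buffer = OrderedDict()
    destaged = 0
    for page in pattern:
        if page in buffer:
            # Overwrite absorbed in DRAM; just refresh its LRU position.
            buffer.move_to_end(page)
            continue
        if len(buffer) >= buffer_pages:
            # Buffer full: destage the least recently written page.
            buffer.popitem(last=False)
            destaged += 1
        buffer[page] = True
    return destaged


# Hot set that fits in the buffer: every overwrite is absorbed.
hot = [0, 1, 2, 3] * 250
print(flash_writes(hot, buffer_pages=8))      # 0

# Cycling over twice as many pages as the buffer holds: LRU is defeated
# and every write after the first 8 forces a destage.
thrash = [i % 16 for i in range(1000)]
print(flash_writes(thrash, buffer_pages=8))   # 992
```

The two runs correspond to the "really fast" and "somewhat slow" ends of the range: same number of writes issued, wildly different number of flash writes.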

In your case, I think the most plausible theory is that their supercaps had all silently failed. Now when they try to power up, they probably go into some internal safety mode, since they can't even find their own internal metadata.

Is STEC still around? Do you have a service contract? Can you try contacting them? They might be able to recover something to the last state recorded in flash; you may have lost whatever was in the DRAM buffer.
 