Has anyone seen any analysis of modern SSDs similar to Backblaze's HDD failure (and SMART indicator) analysis?
Yes, google for "Bianca Schroeder", "Arif Merchant", SSD and Google (this is not a joke, Arif and Bianca did the study using data from Google data centers). Bianca is the master (mistress?) of disk reliability, and Arif is likely the smartest guy in the storage industry. If I remember right, the conclusion is that SSDs are not perfect, and (just like with disks) the current rate of errors is a good predictor of the future rate of errors, and of eventual failure. The other fascinating thing is that accumulated write cycles are not as good a predictor of failure as one would have thought (we all thought that SSDs get worn out by writing and nothing else), but age is a reasonably good predictor (no, I don't remember why). What I completely forgot is whether their measured failure rate / MTBF / UBER matches the manufacturers' specifications or not, and how it compares with the known data for disk drives. When I see them, we tend to talk about personal stuff, not work.
No, I don't know of anything like the Backblaze analysis, which tells you which brands to buy and which brands to avoid. Bianca and Arif are academics and researchers, and they need good relations with all vendors, so in their academic publications they cannot explicitly praise one brand and dump on another. Backblaze doesn't have such restrictions.
In my personal experience: SSDs are different from disks. Yes, they occasionally have media errors, just like disks. Unlike disks, they tend to be perfect (many SSDs are perfect for their economic lifespan, and most SSDs are perfect when they are young), until they start failing; in contrast, even good new disks may have an occasional error (but most errors in a large system are contributed by a small fraction of bad disks). Like disks, once they start getting multiple errors, they will likely get more errors. Like disks, one of their favorite failure modes is totally bricking themselves (often taking the bus down the cliff with them). Unlike disks, they don't have mature and useful SMART implementations (I've only used SCSI SMART, not SATA SMART); and some newer interfaces (such as NVMe) don't seem to have useful SMART at all. Personally, I haven't seen value in watching wear-out and error counters on SSDs, but maybe I just didn't do it right. The performance of SSDs is completely weird, and in spite of some effort, using performance characterization to do predictive fault analysis doesn't seem to work on them (it works pretty well on disks).
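To make the "once errors start, expect more" rule of thumb concrete, here is a minimal Python sketch. It assumes you already collect periodic readings of a cumulative error counter from each drive by whatever means your hardware allows; the `Snapshot` type, the function names, the 30-day window and the threshold of 2 are all hypothetical choices for illustration, not values taken from the Google study or from any tool.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    day: int           # days since monitoring started (hypothetical unit)
    media_errors: int  # cumulative error counter as read from the drive


def errors_in_window(history, window_days):
    """Return how many new errors appeared in the most recent window."""
    if not history:
        return 0
    ordered = sorted(history, key=lambda s: s.day)
    latest = ordered[-1]
    cutoff = latest.day - window_days
    # Baseline: the last snapshot at or before the cutoff,
    # falling back to the oldest snapshot if history is short.
    baseline = ordered[0]
    for snap in ordered:
        if snap.day <= cutoff:
            baseline = snap
    return max(0, latest.media_errors - baseline.media_errors)


def is_suspect(history, window_days=30, threshold=2):
    """Flag a drive whose recent error count reaches the threshold."""
    return errors_in_window(history, window_days) >= threshold


# Made-up example data: one drive with a single old error,
# one drive whose errors are accumulating.
healthy = [Snapshot(0, 0), Snapshot(30, 0), Snapshot(60, 1), Snapshot(90, 1)]
degrading = [Snapshot(0, 0), Snapshot(30, 1), Snapshot(60, 1), Snapshot(75, 4)]

print(is_suspect(healthy))    # False
print(is_suspect(degrading))  # True
```

The sketch looks at the recent change in the counter rather than its lifetime total, so a drive with one old error is not flagged forever, while a drive that has just picked up several errors is.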
On my server at home, I have two SSDs (both were very cheap, consumer-grade); I use one as the boot and root disk (the /home data is on separate devices), and the other is a backup that is updated occasionally. So if my main SSD croaks, I can be up and running within an hour (just diagnose the problem, remove the dead SSD, tell the BIOS to boot from the other one). But because my backup to the other SSD is rare, after such a failure, the root file system might be a day or a week out of date. That's when a whole evening of restoring backups would start, and hand-merging directories and files. Not fun, but given that the one SSD is unlikely to die, and given that everything outside the /home file system can be restored by performing a new install and redoing my customization, that's a risk I'm willing to take.