How can hardware be killed by a test that does not write, if there is not a latent hardware fault prior to reading?
Because it's not hardware that got killed.
The hardware is perfectly fine, but it's bricked.
Step 1:
I created a script that runs an extended selftest every few weeks, and put the script into
/etc/periodic/daily
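A minimal sketch of such a periodic job might look like this. The device list, state-file path, and round-robin scheme are my assumptions for illustration, not the author's exact script:

```shell
#!/bin/sh
# Hypothetical periodic(8) job: start a SMART extended selftest on one
# disk per run, round-robin, so each disk gets tested every few weeks.

SMARTCTL=${SMARTCTL:-smartctl}               # from sysutils/smartmontools
DISKS=${DISKS:-"ada0 ada1"}                  # disks to cycle through (example names)
STATE=${STATE:-/var/db/smart_selftest.idx}   # remembers whose turn it is

run_one_selftest() {
    idx=$(cat "$STATE" 2>/dev/null)
    idx=${idx:-0}
    set -- $DISKS
    n=$#
    shift $((idx % n))
    disk=$1
    echo $(( (idx + 1) % n )) > "$STATE"
    # -t long starts the extended selftest in the drive's own background
    "$SMARTCTL" -t long "/dev/$disk"
}

# Dry run: substitute echo for smartctl so no real disk is touched.
SMARTCTL=echo
STATE=$(mktemp)
run_one_selftest
```

Dropped into /etc/periodic/daily, this fires once per daily run (around 3AM by default), which matches the timing described below.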
Step 2:
At 3AM, as expected, the script grabbed one or two disks and started the selftest.
Step 3:
The first SSD was dead.
Step 4:
I replaced it with a different brand.
Step 5:
The moment the selftest started on the new SSD, it died as well (replaced under warranty).
That was enough for me to start researching. The first victim was a Kingston A400.
smartctl says it is Phison-based. It is not: at that time it was a rebranded Silicon Motion SM2258XT. The second was an HP S700, which resources on the web also identify as an SM2258XT. That explains things.
What happened: that machine runs about ten nodes. At 3AM, hell breaks loose: there are lots of
find(1) jobs running, lots of database VACUUMs running, etc. The devices get hammered with I/O.
Whatever the extended selftest is supposed to do on an SSD, apparently nobody had tested it under such conditions.
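In hindsight, the kick-off script could refuse to start a selftest while the disk is under load. A hedged sketch, assuming FreeBSD's gstat for the busy metric; the column parsing and the 10% threshold are my guesses, not anything the author ran:

```shell
#!/bin/sh
# Guard sketch: only start the extended selftest when the target disk
# looks idle, to avoid the selftest-under-load collision described above.

SMARTCTL=${SMARTCTL:-smartctl}
THRESHOLD=${THRESHOLD:-10}    # max %busy at which we still start the test

# %busy for one device from a single gstat batch sample
# (assumption: device name is the last field, %busy the one before it)
get_busy() {
    gstat -b -I 1s | awk -v d="$1" '$NF == d { print int($(NF-1)) }'
}

maybe_selftest() {
    disk=$1
    busy=$(get_busy "$disk")
    busy=${busy:-0}
    if [ "$busy" -le "$THRESHOLD" ]; then
        echo "starting extended selftest on /dev/$disk"
        "$SMARTCTL" -t long "/dev/$disk"
    else
        echo "skipping /dev/$disk, ${busy}% busy"
    fi
}

# Dry run with stubbed-out pieces so nothing real is touched:
get_busy() { echo 3; }    # pretend the disk is 3% busy
SMARTCTL=echo
maybe_selftest ada0
```

Of course, this only papers over the real problem: firmware that corrupts its own flash-resident state when a selftest races foreground I/O.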
These SSDs have only one kind of persistent memory: the flash cells. The configuration data (what kind of flash memory is installed, how it is to be treated, how it is mapped) is stored inside the flash cells. The running organizational data, the so-called flash translation layer (FTL), is also inside the flash cells. All of this together is one big messy mesh.
I opened the device, shorted the factory-mode-enable pins, and indeed it then reports itself as some cryptic device with 64 MB of inaccessible storage space. At that point one could download a new configuration into it.
Further details are at usbdev.ru. That one is the real freakshow, because there is big money to be made with this: there are people who suffer such failures and do not have a backup. They want their data back, and they are willing to pay.
I for my part do have backups, but I would love to get that piece back into working order, if only for the sport of it in my zoo of machines.
But then, reaching a configuration that allows one to read out the stored payload data and try to reconstruct it is one thing; reaching a fresh configuration that would work reliably for continued operation is quite another.