Other: Might a smartctl selftest be harmful to SSDs?

PMc · Jan 3, 2023

I just found an interesting coincidence, concerning my failed SSDs.
This is the cronlog showing the last minute of system activity:

Code:

Jan  3 03:07:00 <cron.info> edge /usr/sbin/cron[87083]: (root) CMD (/usr/libexec/atrun)
Jan  3 03:08:00 <cron.info> edge /usr/sbin/cron[88520]: (root) CMD (/usr/libexec/atrun)

So between 03:08 and 03:09 the machine died, most likely because of swapspace-gone-fishing, and afterwards ada3, which holds the root and the swapspace, didn't report back anymore.

Now this is my status file for the selftests:

Code:

# ls -laT /var/db/smart_selftest.db
-rw-r--r--  1 root  wheel  234 Jan  3 03:07:51 2023 /var/db/smart_selftest.db
# head /var/db/smart_selftest.db
ada3 1672741346
ada4 1672638283
ada7 1672575384
ada5 1672464886
ada8 1672298710
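
A small sketch for reading a status file like this one. It assumes each line is "<device> <Unix epoch seconds>"; the script that writes /var/db/smart_selftest.db is not shown in the post, so what exactly the second column records is a guess. The helper name parse_selftest_db is made up for illustration, and times are converted to UTC because the machine's timezone isn't stated.

```python
from datetime import datetime, timezone

def parse_selftest_db(text):
    """Parse '<device> <epoch-seconds>' lines into (device, UTC datetime) pairs."""
    entries = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        device, stamp = parts
        when = datetime.fromtimestamp(int(stamp), tz=timezone.utc)
        entries.append((device, when))
    return entries

# Contents copied from the listing above.
SAMPLE = """\
ada3 1672741346
ada4 1672638283
ada7 1672575384
ada5 1672464886
ada8 1672298710
"""

entries = parse_selftest_db(SAMPLE)
for device, when in entries:
    print(device, when.isoformat())
```

When comparing these values with the cron and kernel log timestamps, keep in mind that the logs are in local time while the sketch prints UTC.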

It has ada3 at the first position, so the selftest for ada3 was started at 03:07:51. And this is in my daily/periodic report:

Code:

Stopping gstopd.
Waiting for PIDS: 48829.
Running disk selftest on ada3, waiting 10 + scrub 474 minutes
Starting gstopd.

That is, 10 minutes is what the disk reports as the duration of the selftest, and 474 minutes is what zfs reports as the estimated duration of the current scrubs. (The scrub time is surprisingly long; probably it runs on another array where this disk only serves as l2arc. That value is relevant only for sysutils/gstopd, because gstopd must be disabled for that timespan - and gstopd won't get to a system disk anyway.)

But that means, there was also a scrub running somewhere, plus the port-build, plus some database vacuums from daily/periodic, plus whatever else.

Interestingly, the previous SSD in that slot, which died at the end of November, died under similar circumstances. A SMART selftest had been initiated, and while it did complete successfully, a controller timeout was reported during that time:

Code:

Nov 27 03:46:05 <kern.crit> edge kernel: [810799] ahcich3: Timeout on slot 25 port 0
Nov 27 03:46:05 <kern.crit> edge kernel: [810799] ahcich3: is 00000000 cs fc00007f ss fe00007f rs fe00007f tfd c0 serr 00000000 cmd 0804da17
Nov 27 03:46:05 <kern.crit> edge kernel: [810799] (ada3:ahcich3:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 70 28 8d bc 40 08 00 00 00 00 00
Nov 27 03:46:05 <kern.crit> edge kernel: [810799] (ada3:ahcich3:0:0:0): CAM status: Command timeout
Nov 27 03:46:05 <kern.crit> edge kernel: [810799] (ada3:ahcich3:0:0:0): Retrying command, 3 more tries remain
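
A rough sketch for picking apart the register line in such a timeout report. It assumes that cs, ss and rs are 32-bit per-slot bitmaps (command issue, SATA active, and the driver's own running-slot mask); that reading is not confirmed anywhere in this thread. parse_registers and slots are made-up helper names.

```python
import re

def parse_registers(line):
    """Return the name -> integer values printed on an ahcich timeout line."""
    pairs = re.findall(r"\b(is|cs|ss|rs|tfd|serr|cmd) ([0-9a-f]+)\b", line)
    return {name: int(value, 16) for name, value in pairs}

def slots(mask):
    """List the slot numbers (0..31) whose bit is set in a 32-bit mask."""
    return [bit for bit in range(32) if (mask >> bit) & 1]

# Register line copied from the log above.
line = ("ahcich3: is 00000000 cs fc00007f ss fe00007f rs fe00007f "
        "tfd c0 serr 00000000 cmd 0804da17")

regs = parse_registers(line)
only_in_ss = sorted(set(slots(regs["ss"])) - set(slots(regs["cs"])))

print("cs slots:", slots(regs["cs"]))
print("ss slots:", slots(regs["ss"]))
print("set in ss but not in cs:", only_in_ss)
```

Under that assumption, the only slot set in ss but not in cs is 25, the same slot the "Timeout on slot 25" line names.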

I noticed this, I wanted to have a closer look and tried to run the selftest once more, and that was the end of the disk:

Code:

Nov 27 04:44:34 <kern.crit> edge kernel: [813775] ahcich3: Timeout on slot 18 port 0
Nov 27 04:44:34 <kern.crit> edge kernel: [813775] ahcich3: is 00000000 cs 00000000 ss 007c0000 rs 007c0000 tfd 40 serr 00000000 cmd 0804d617
Nov 27 04:44:34 <kern.crit> edge kernel: [813775] (ada3:ahcich3:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 28 70 24 27 40 05 00 00 00 00 00
Nov 27 04:44:34 <kern.crit> edge kernel: [813775] (ada3:ahcich3:0:0:0): CAM status: Command timeout
Nov 27 04:44:34 <kern.crit> edge kernel: [813775] (ada3:ahcich3:0:0:0): Retrying command, 3 more tries remain
Nov 27 04:44:34 <kern.crit> edge kernel: [813806] ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080)
...
Nov 27 04:44:34 <kern.crit> edge kernel: [813958] ahcich3: is 00000000 cs 02000000 ss 00000000 rs 02000000 tfd 80 serr 00000000 cmd 0804d917
Nov 27 04:44:34 <kern.crit> edge kernel: [813958] (aprobe0:ahcich3:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Nov 27 04:44:34 <kern.crit> edge kernel: [813958] (aprobe0:ahcich3:0:0:0): CAM status: Command timeout
Nov 27 04:44:34 <kern.crit> edge kernel: [813958] (aprobe0:ahcich3:0:0:0): Error 5, Retry was blocked
Nov 27 04:44:34 <kern.crit> edge kernel: [813958] ada3 at ahcich3 bus 0 scbus5 target 0 lun 0
Nov 27 04:44:34 <kern.crit> edge kernel: ada3: <HP SSD S700 250GB S0704A1> s/n ************** detached
Nov 27 04:44:34 <kern.crit> edge kernel: [813958] GEOM_ELI: g_eli_read_done() failed (error=6) ada3p9.eli[READ(offset=21992349696, length=8192)]
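
The tfd values in these two logs go c0, then 40, then 80. A minimal sketch for decoding them, assuming the low byte of tfd is the ATA status register with the usual bit assignments (BSY 0x80, DRDY 0x40, DF 0x20, DRQ 0x08, ERR 0x01). decode_status is a made-up helper name.

```python
# Assumed ATA status register bits, highest first.
ATA_STATUS_BITS = [
    (0x80, "BSY"),   # device busy
    (0x40, "DRDY"),  # device ready
    (0x20, "DF"),    # device fault
    (0x08, "DRQ"),   # data request
    (0x01, "ERR"),   # error
]

def decode_status(tfd):
    """Name the status bits set in the low byte of a tfd value."""
    status = tfd & 0xFF
    return [name for mask, name in ATA_STATUS_BITS if status & mask]

# Values taken from the timeout and reset lines above.
for value in (0xC0, 0x40, 0x80):
    print(f"tfd {value:02x}: {' '.join(decode_status(value))}")
```

Read that way, the last value (80) is BSY alone, which would fit the "device not ready after 31000ms" message: the drive never left the busy state after the reset.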

ralphbsz · Jan 3, 2023

It could be a coincidence, it could be a real cause-and-effect relationship.

But one should not generalize it to other SSDs. There are at least half a dozen different vendors of the firmware inside the SSD (that's called the FTL, or Flash Translation Layer). It's possible that you've found a bug in this particular firmware, where running SMART commands causes some internal corruption which disables the SSD. But it's very unlikely that the same bug exists in other firmwares.

Obnoxious side remark: Most amateurs (me included at home!) buy a disk (or SSD) drive, install it, and then ignore it. If they feel very generous, they may monitor the disk's health using SMART. But another thing one *should* do is to update the firmware on the drive regularly. Disk firmware is very complex, and is regularly updated. If you work directly with disk manufacturers, you can report bugs and see the status of bug fixes. In enterprise settings, drive firmware is regularly obtained from the manufacturer, and loaded into the disks when it changes. However, for amateurs, this is impractical. To begin with, the tools required to update drive firmware are either bizarrely primitive (they only run under DOS!), or undocumented and not distributed to users (you need an NDA to find out), or highly complex (hundreds of lines of complex SCSI code are required), or only supported on a limited set of OSes (for example Seagate's "SeaTools" only exists for Windows and Linux). And as far as I've seen, the disk vendors don't usually distribute firmware updates in public, perhaps because the risk of destroying drives by failed firmware upgrades is too high (been there, done that).

PMc · Jan 3, 2023

ralphbsz said:
It could be a coincidence, it could be a real cause-and-effect relationship.

But one should not generalize it to other SSDs. There are at least half a dozen different vendors of the firmware inside the SSD (that's called the FTL, or Flash Translation Layer). It's possible that you've found a bug in this particular firmware, where running SMART commands causes some internal corruption which disables the SSD. But it's very unlikely that the same bug exists in other firmwares.

Yeah, but they weren't the same devices. One has a Phison controller, the other Silicon Motion - so certainly not the same firmware.

PMc · Jan 4, 2023

There is more to this:

#1241 (BUG: Dead ssd during Extended test (KingDian S200)) – smartmontools

www.smartmontools.org

PMc · Jan 6, 2023

PMc said:
One has a Phison controller,

That was wrong information on the web. I opened the device, and in fact both are the SMI2258XT, and

PMc said:
#1241 (BUG: Dead ssd during Extended test (KingDian S200)) – smartmontools

www.smartmontools.org

this one is supposedly also SMI2258XT.

This controller appears to have a problem that can be triggered by issuing a SMART extended offline selftest while it is busy serving I/O.

Such problems are not uncommon: I found reports that the SandForce controller has a known problem when resuming from hibernate, and there is even a funny article about how to repair that.

From there it becomes clear that these devices aren't broken, they are "bricked": They have only suffered a software crash, and a reset would fix it - but due to a gross misconception in the general design of SSD drives (and USB sticks, for that matter) such an option is technically not provided.

ralphbsz said:
But one should not generalize it to other SSDs.

In fact one should. Here is a general design flaw that results in lots of devices having to be thrown away at the first software crash, thereby increasing sales and adding to the toxic waste piles which suffocate the planet.

ralphbsz said:
There are at least half a dozen different vendors of the firmware inside the SSD (that's called the FTL, or Flash Translation Layer). It's possible that you've found a bug in this particular firmware, where running SMART commands causes some internal corruption which disables the SSD. But it's very unlikely that the same bug exists in other firmwares.

There are always bugs in software, we should know that. But, where is that "FTL" (which is simply a mapping table) stored? Right, within the flash memory itself!

What we have here is a laqueus clausus, a self-referencing loop. If anything goes slightly wrong in that process, for whatever reason, then the entire thing fails to operate.
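
To make the self-reference concrete, here is a toy model, not how any real controller works: the logical-to-physical table is kept in one of the very pages it describes. ToyFlash, MAP_PAGE and the page layout are all invented for illustration.

```python
class ToyFlash:
    """Toy device whose mapping table lives in the same flash as the data."""

    MAP_PAGE = 0  # hypothetical: physical page reserved for the mapping table

    def __init__(self, pages=8):
        self.pages = [None] * pages
        self.pages[self.MAP_PAGE] = {}  # the table itself is stored in flash
        self.next_free = 1

    def _load_map(self):
        table = self.pages[self.MAP_PAGE]
        if not isinstance(table, dict):
            raise RuntimeError("mapping table unreadable, cannot locate any data")
        return table

    def write(self, lba, data):
        table = self._load_map()
        self.pages[self.next_free] = data  # data is written out of place
        table[lba] = self.next_free
        self.next_free += 1
        self.pages[self.MAP_PAGE] = table  # persist the updated table

    def read(self, lba):
        return self.pages[self._load_map()[lba]]


flash = ToyFlash()
flash.write(7, b"root filesystem")
print(flash.read(7))

# Simulate a crash that garbles only the table page.
flash.pages[ToyFlash.MAP_PAGE] = b"\xff\xff garbage"
try:
    flash.read(7)
except RuntimeError as err:
    print("device is bricked:", err)

# The payload is still physically present, just unreachable through the table.
print(b"root filesystem" in flash.pages)
```

In this toy, one damaged page makes every read fail even though the data pages are untouched, which is the situation described above.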

ralphbsz said:
But another thing one *should* do is to update the firmware on the drive regularly.

No, one should NOT. There are reports that exactly this might also trigger device failure (specifically in the SMI2258XT). Here is one of them.

There are actually people calling themselves the "recovery industry", who are, for outrageous prices, trading the necessary firmware (which must either be reverse-engineered or somehow obtained from the manufacturers) to recover the devices and optionally even recover (some of) the stored data - and I very much doubt those would sign an NDA.

And this is not new either. Back in the day I had a Nokia 6310i cellphone, and one day that also suffered a failure of its internal software, so it was no longer usable. My employer told me straightaway to throw it away and ordered me a new one - but I did like that piece, so what did I do: I went to the nearest barber shop.
Barber shops for men are run by Turks, and Turks know how to fix such things. It took them ten minutes, and I got my phone back in working condition. It still works today.

richardtoohey2 · Jan 6, 2023

PMc said:
Here is a general design flaw

I didn’t read too deeply but the link you posted seemed to be saying it was a bug in the Silicon Motion firmware, not a general design flaw.

Or is the perceived flaw the fact that there doesn’t seem to be an easy way to restart/reset the device?

Argentum · Jan 6, 2023

PMc said:
Such problems are not uncommon: I found reports that the SandForce controller has a known problem when resuming from hibernate, and there is even a funny article about how to repair that.

From there it becomes clear that these devices aren't broken, they are "bricked": They have only suffered a software crash, and a reset would fix it - but due to a gross misconception in the general design of SSD drives (and USB sticks, for that matter) such an option is technically not provided.

This is really funny. I mean the unbricking instructions, and also the fact that I have had a similar experience: after starting a short self test on an SSD with (now I know) a SandForce controller chip, the drive started to give errors and became inaccessible. However, it came back to life after a power cycle. Never tried this again...

PMc · Jan 6, 2023

richardtoohey2 said:
I didn’t read too deeply but the link you posted seemed to be saying it was a bug in the Silicon Motion firmware, not a general design flaw.

Or is the perceived flaw the fact that there doesn’t seem to be an easy way to restart/reset the device?

Certainly the second.
The problem could be avoided if the device had something like an independent rom (that cannot get garbled), and then the controller could get to some point where it can no longer allow read/write, but can still allow a format or factory-reset.

It would need public interest to change that, but that is doubtful. The industry is certainly not motivated, and the so called "recovery industry" even less.
Only from the viewpoint of ethical engineering is it a construction defect.
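
The idea of an independent ROM with a reduced mode could be sketched like this. It is purely a thought experiment: no real SSD command set is modelled, and Mode, ToyController and the command names are invented.

```python
from enum import Enum

class Mode(Enum):
    NORMAL = "normal"
    SAFE = "safe"  # entered from ROM when the state kept in flash is unusable

class ToyController:
    """Toy controller that falls back to a ROM-resident safe mode."""

    def __init__(self, metadata_ok):
        # The ROM code decides at power-on whether the flash metadata is usable.
        self.mode = Mode.NORMAL if metadata_ok else Mode.SAFE

    def handle(self, command):
        if command == "factory_reset":
            # Always available: rebuild the metadata, give up the user data.
            self.mode = Mode.NORMAL
            return "ok: metadata rebuilt, user data discarded"
        if self.mode is Mode.SAFE:
            return "refused: safe mode, only factory_reset is accepted"
        return f"ok: {command}"


drive = ToyController(metadata_ok=False)
print(drive.handle("read"))
print(drive.handle("factory_reset"))
print(drive.handle("read"))
```

The point of the sketch is only that read/write can be refused while a reset path stays reachable, so a firmware crash costs the data but not the device.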

smithi · Jan 7, 2023

PMc said:
It would need public interest to change that, but that is doubtful. The industry is certainly not motivated, and the so called "recovery industry" even less.
Only from the viewpoint of ethical engineering is it a construction defect.

Ethical engineering needs public support, which can only happen when people are well and truly informed.

I'm already much better informed by this thread, thanks, but we appear to be in the minority regarding ethics vs raw profitability.

Only when this becomes a major issue both inside and outside 'the industry' might there be pressure on dodgy manufacturers and dealers, eventually from standards and consumer protection authorities.

I've been quite amazed that particularly in the EU, people don't seem to even expect warranties to be honoured - where in Australia they generally do - from recent various 'SSD fail' threads.

Not wanting to take the discussion too far from the technical but ethics matters, as well as faith in storage over a reasonable term, say 10 years.

And what do Turkish barbers know that we don't?

PMc · Jan 7, 2023

smithi said:
And what do Turkish barbers know that we don't?

Good point. What do Russians know about flash memory that we don't? Or, spinning back in time, what do witches know about healing that doctors don't?

I think the key term is "deviant people". People at the borders of society, fringe groups. As they are not so much backed by core society, they develop strong knowledge-sharing bonds, and they also treat resources more carefully because they don't have them in such abundance.
While we believe in "scientific proof", they figure out what practically works, unlimited by convention or moral rules. They're the witchers.

Society doesn't change only because we use computers.

This is also why the whole political-correctness and all-are-equal stuff, when taken too far, becomes nonsense. There will always be fringe groups, because society needs them. Perception as a whole does only work by noticing that something is different to something else.

Argentum · Jan 7, 2023

PMc said:
Society doesn't change only because we use computers.

This is interesting! I entered this as a question to the AI (LLM). Here is the answer:

Code:

Yes, computers and the Internet have had a significant impact on society and have played a role in shaping many aspects of modern life. The widespread adoption of computers and the Internet has led to changes in communication, work, education, and entertainment, among other things.

One example of how computers and the Internet have changed society is the way we communicate. The Internet has made it possible for people to connect and communicate with each other instantly, regardless of their location. This has led to the development of social media and messaging platforms, which have changed the way people interact and communicate with one another.

Computers and the Internet have also had a major impact on the way we work. Many jobs that were previously done manually or with traditional tools can now be done more efficiently and accurately with the help of computers and software. In addition, the Internet has made it possible for people to work remotely and collaborate with others from different locations.

The education system has also been impacted by computers and the Internet. Many schools and universities now use computers and online learning platforms as a way to teach students and provide them with access to a wealth of information and resources.

Finally, the entertainment industry has been transformed by computers and the Internet. It is now possible to access movies, music, and other forms of entertainment online, and many people stream content rather than purchasing physical copies.

Overall, computers and the Internet have had a significant impact on society and have changed many aspects of modern life.

My point is that the computer already thinks it has changed the society

dgmm · Jan 7, 2023

smithi said:
I've been quite amazed that particularly in the EU, people don't seem to even expect warranties to be honoured - where in Australia they generally do - from recent various 'SSD fail' threads.

I think the person talking about not RMAing under warranty was concerned about being restrained from sending an SSD back with secure or personal data on it which may be covered by various laws. One solution would be to use encryption but even then there may be legal restrictions on allowing that data to leave the premises. I deal with at least one customer who, while getting the free labour for on-site warranty repairs still has to pay for failed SSDs because they refuse to allow them to be exchanged, preferring to shred them. Most consumers in the EU would not be worrying about this and will fully expect a warranty to be honored and they almost always are.

PMc · Jan 7, 2023

Argentum said:
This is interesting! I entered this as a question to the AI (LLM). Here is the answer:

Just trying to figure out what that is...

Argentum said:
My point is that the computer already thinks it has changed the society

Thank You! Wonderful wording!

That is why I say, AI will not work in the long run. It will tell you only commonsense which smart people could easily figure out by themselves.
But the more important point is that this ...

Code:

The Internet has made it possible for people to connect and communicate with each other instantly,

... is basically wrong: they do NOT talk to each other, first and foremost they talk to machines.
And what impact this has on society, we have already discussed in this thread.

So I think we should not go deeper into that matter here. I would rather enjoy finding somebody who speaks enough Russian to figure out how this here is supposed to work - because apparently it can repair my broken device: https://www.usbdev.ru/f/index.php?topic=9577

PMc · Jan 7, 2023

Even a machine translation of that stuff is utter kickass:

"Guys, here is what happened: two years ago I bought a Kingston SA400S37/240G SSD."

Yep, that's exactly mine, too. Even the mem-chip labelling matches.

"Back then I foolishly thought I could flash it without losing the data; I found what seemed to be the right firmware and flashed it somehow"

So he flashed it. And he wanted to recover the data - Sir Dmitry, probably 27 years old. (I don't want to recover data, I just want to get my thing back working, because throwing away stuff without need is bad.) How does he do that? Is that normal in Russia? What do the Russians know that we don't?
But it gets further than that:

"After that I ordered a bare SSD board with just an SM2258XT G AB controller and 4 footprints for memory chips; I desoldered the memory chips from the old board (there are only 2 of them, Kingston FB12808UCM1-62) and soldered them onto the new board."

Let me show you what that looks like:

These are BGA; the PCBs are mass-produced by robots. Modifying them manually is, well, let's say ambitious. Except in Russia, there it is just the next thing you do when the device doesn't respond.

That's why I have utmost respect of Russia. I should have migrated there twenty years ago...