I just found an interesting coincidence concerning my failed SSDs.
This is the cronlog showing the last minute of system activity:
Code:
Jan 3 03:07:00 <cron.info> edge /usr/sbin/cron[87083]: (root) CMD (/usr/libexec/atrun)
Jan 3 03:08:00 <cron.info> edge /usr/sbin/cron[88520]: (root) CMD (/usr/libexec/atrun)
So the machine died between 03:08 and 03:09, most likely because of swapspace-gone-fishing, and afterwards ada3, which holds the root and the swapspace, didn't report back anymore.
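The window above comes from reading the last cron entry by hand. A minimal sketch for pulling that timestamp out of a log, assuming the syslog layout shown in this post (month, day and time as the first three fields). The function name `last_cron_time` and the log path in the usage comment are made up for illustration and are not part of any tool:

```shell
# Hypothetical helper: print the time field of the last cron entry on stdin.
# Assumes the layout shown above: month, day, time as the first three fields.
last_cron_time() {
    awk '/cron\[/ { t = $3 } END { if (t != "") print t }'
}

# Usage (the path is an assumption; it depends on your syslog.conf):
# last_cron_time < /var/log/cron
```

Since cron runs atrun once per minute here, the machine most likely stopped within a minute after the printed time.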
Now this is my status file for the selftests:
Code:
# ls -laT /var/db/smart_selftest.db
-rw-r--r-- 1 root wheel 234 Jan 3 03:07:51 2023 /var/db/smart_selftest.db
# head /var/db/smart_selftest.db
ada3 1672741346
ada4 1672638283
ada7 1672575384
ada5 1672464886
ada8 1672298710
It has ada3 in the first position and the file was last written at 03:07:51, so the selftest for ada3 was started at 03:07:51. And this is in my daily/periodic report:
Code:
Stopping gstopd.
Waiting for PIDS: 48829.
Running disk selftest on ada3, waiting 10 + scrub 474 minutes
Starting gstopd.
That says 10 minutes is what the disk reports as the duration of the selftest, and 474 minutes is what ZFS reports as the estimated duration of current scrubs. (The scrub time is surprisingly long; probably it runs on another array where this disk only serves as L2ARC. That value is relevant only for sysutils/gstopd, because gstopd must be disabled for that timespan, and gstopd won't get to a system disk anyway.)
But that means there was also a scrub running somewhere, plus the port build, plus some database vacuums from daily/periodic, plus whatever else.
Interestingly, the former SSD in that slot, which died at the end of November, died under similar circumstances. A SMART selftest was initiated, and while it did complete successfully, a controller timeout was reported during that time:
Code:
Nov 27 03:46:05 <kern.crit> edge kernel: [810799] ahcich3: Timeout on slot 25 port 0
Nov 27 03:46:05 <kern.crit> edge kernel: [810799] ahcich3: is 00000000 cs fc00007f ss fe00007f rs fe00007f tfd c0 serr 00000000 cmd 0804da17
Nov 27 03:46:05 <kern.crit> edge kernel: [810799] (ada3:ahcich3:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 70 28 8d bc 40 08 00 00 00 00 00
Nov 27 03:46:05 <kern.crit> edge kernel: [810799] (ada3:ahcich3:0:0:0): CAM status: Command timeout
Nov 27 03:46:05 <kern.crit> edge kernel: [810799] (ada3:ahcich3:0:0:0): Retrying command, 3 more tries remain
I noticed this, wanted to have a closer look, and tried to run the selftest once more, and that was the end of the disk:
Code:
Nov 27 04:44:34 <kern.crit> edge kernel: [813775] ahcich3: Timeout on slot 18 port 0
Nov 27 04:44:34 <kern.crit> edge kernel: [813775] ahcich3: is 00000000 cs 00000000 ss 007c0000 rs 007c0000 tfd 40 serr 00000000 cmd 0804d617
Nov 27 04:44:34 <kern.crit> edge kernel: [813775] (ada3:ahcich3:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 28 70 24 27 40 05 00 00 00 00 00
Nov 27 04:44:34 <kern.crit> edge kernel: [813775] (ada3:ahcich3:0:0:0): CAM status: Command timeout
Nov 27 04:44:34 <kern.crit> edge kernel: [813775] (ada3:ahcich3:0:0:0): Retrying command, 3 more tries remain
Nov 27 04:44:34 <kern.crit> edge kernel: [813806] ahcich3: AHCI reset: device not ready after 31000ms (tfd = 00000080)
...
Nov 27 04:44:34 <kern.crit> edge kernel: [813958] ahcich3: is 00000000 cs 02000000 ss 00000000 rs 02000000 tfd 80 serr 00000000 cmd 0804d917
Nov 27 04:44:34 <kern.crit> edge kernel: [813958] (aprobe0:ahcich3:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
Nov 27 04:44:34 <kern.crit> edge kernel: [813958] (aprobe0:ahcich3:0:0:0): CAM status: Command timeout
Nov 27 04:44:34 <kern.crit> edge kernel: [813958] (aprobe0:ahcich3:0:0:0): Error 5, Retry was blocked
Nov 27 04:44:34 <kern.crit> edge kernel: [813958] ada3 at ahcich3 bus 0 scbus5 target 0 lun 0
Nov 27 04:44:34 <kern.crit> edge kernel: ada3: <HP SSD S700 250GB S0704A1> s/n ************** detached
Nov 27 04:44:34 <kern.crit> edge kernel: [813958] GEOM_ELI: g_eli_read_done() failed (error=6) ada3p9.eli[READ(offset=21992349696, length=8192)]
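The November disk logged a command timeout about an hour before it detached, so it may be worth watching for these messages. A rough sketch that counts "Timeout on slot" lines per AHCI channel, assuming the kernel message format shown above. The function name `count_ahci_timeouts` and the log path in the usage comment are made up for illustration:

```shell
# Hypothetical helper: count "Timeout on slot" kernel messages per AHCI channel.
# Reads log lines on stdin; assumes the message format shown above.
count_ahci_timeouts() {
    awk '/Timeout on slot/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^ahcich[0-9]+:$/) {
                ch = $i
                sub(/:$/, "", ch)   # strip the trailing colon
                n[ch]++
            }
    }
    END { for (ch in n) print ch, n[ch] }'
}

# Usage (the path is an assumption; it depends on your syslog.conf):
# count_ahci_timeouts < /var/log/messages
```

Each output line is a channel name followed by its timeout count. With more than one channel the output order is unspecified, so pipe it through sort if that matters.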