[20/07 3:47] iceland # zpool list -v
NAME                         SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
mypool                       928G   871G  56.8G        -         -    75%    93%  1.00x    ONLINE  -
  mirror                     928G   871G  56.8G        -         -    75%  93.9%      -    ONLINE
    9214606650531292110         -      -      -        -         -      -      -      -   UNAVAIL
    ada3p1                      -      -      -        -         -      -      -      -    ONLINE
[20/07 3:47] iceland # gpart show
=> 40 1953525088 ada1 GPT (932G)
40 1953525088 1 freebsd-zfs (932G)
=> 40 1953525088 ada3 GPT (932G)
40 1953525088 1 freebsd-zfs (932G)
[20/07 3:47] iceland # zpool replace mypool 9214606650531292110 ada1p1
[20/07 3:49] iceland # zpool list -v
NAME                         SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
mypool                       928G   872G  56.4G        -         -    75%    93%  1.00x  DEGRADED  -
  mirror                     928G   872G  56.4G        -         -    75%  93.9%      -  DEGRADED
    replacing                   -      -      -        -         -      -      -      -  DEGRADED
      9214606650531292110       -      -      -        -         -      -      -      -   UNAVAIL
      ada1p1                    -      -      -        -         -      -      -      -    ONLINE
    ada3p1                      -      -      -        -         -      -      -      -    ONLINE
[20/07 3:49] iceland # zpool status -v
pool: mypool
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Tue Jul 20 15:48:40 2021
163G scanned at 3.07G/s, 2.90G issued at 56.1M/s, 871G total
2.99G resilvered, 0.33% done, 04:24:08 to go
config:
NAME                       STATE     READ WRITE CKSUM
mypool                     DEGRADED     0     0     0
  mirror-0                 DEGRADED     0     0     0
    replacing-0            DEGRADED     0     0     0
      9214606650531292110  UNAVAIL      0     0     0  was /dev/ada1p1/old
      ada1p1               ONLINE       0     0     0  (resilvering)
    ada3p1                 ONLINE       0     0     0
errors: No known data errors
[20/07 3:50] iceland # camcontrol devlist
<Seagate IronWolf ZA1000NM10002-2ZG102 SU3SC011> at scbus3 target 0 lun 0 (ada1,pass1)
<CT1000MX500SSD1 M3CR032> at scbus5 target 0 lun 0 (ada3,pass3)
<AHCI SGPIO Enclosure 2.00 0001> at scbus6 target 0 lun 0 (ses0,pass4)
[20/07 3:50] iceland #
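The replace-and-resilver sequence in the transcript above can be wrapped in a small helper that blocks until the resilver finishes. A minimal sketch, assuming OpenZFS 2.0 or later for zpool wait (FreeBSD 13.0 ships it); the function name is hypothetical and the example arguments are the pool, GUID and partition from the transcript:

```shell
# Hypothetical helper: replace a failed mirror member, wait for the
# resilver to complete, then show the resulting pool status.
# Requires `zpool wait` (OpenZFS 2.0+).
replace_and_wait() {
    pool=$1
    old=$2      # GUID or name of the failed device
    new=$3      # replacement partition, e.g. ada1p1
    zpool replace "$pool" "$old" "$new" || return 1
    # Blocks until no resilver is running on the pool
    zpool wait -t resilver "$pool" || return 1
    zpool status -v "$pool"
}

# Example, as root:
# replace_and_wait mypool 9214606650531292110 ada1p1
```

With a pool at 93% capacity the resilver estimate above was over four hours, so running this inside tmux or screen is sensible.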
I got it. Thanks

I recommend periodically logging the output of smartctl -x (and the partitioning data) from /etc/monthly or /etc/weekly[1]. (Here is a sample script; adapt it to your needs.) Write the output into /var/backup with the month or week number in the filename, so the files rotate annually.
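A minimal sketch of such a job, assuming a POSIX sh function called from /etc/weekly.local or a script under /etc/periodic/weekly/ (that wiring, the function name, the file naming and the use of gpart backup for the partitioning data are all assumptions, not the sample script mentioned above):

```shell
# Hypothetical helper for a weekly periodic job. For each disk it saves
# `smartctl -x` output and the partition table (`gpart backup`) into a
# backup directory, tagged with the ISO week number, so week N of next
# year overwrites week N of this year (annual rotation).
log_smart_snapshots() {
    backup_dir=$1
    shift
    week=$(date +%V)                    # ISO 8601 week number, 01..53
    for disk in "$@"; do
        # Full SMART report: attributes, device statistics, error logs
        smartctl -x "/dev/$disk" > "$backup_dir/smart.$disk.week$week" 2>&1
        # Partition table in a format `gpart restore` can read back
        gpart backup "$disk" > "$backup_dir/gpart.$disk.week$week" 2>&1
    done
}

# Example, as root (disk names from the devlist above):
# log_smart_snapshots /var/backup ada1 ada3
```

Comparing two snapshots a few weeks apart (for example with diff) then shows how fast the wear attributes are moving.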
Thank you, sir. The problem is that I cannot even see the drive, let alone run smartctl on it. I can run it on the other devices. I am used to running smartctl on all of them but have not been paying much attention to its reports. This thread (https://forums.FreeBSD.org/threads/scrub-task-best-practice.78802/post-493837) provided some more insight.

Let me be a little more direct here. If scrubbing is what you do, and scrubbing is "better than smartctl", then scrubbing should have predicted the failure of this drive, which it did not. Scrubbing is not going to tell you if you're approaching the TBW (Terabytes Written) limit for the drive. Scrubbing is just going to tell you that you ARE screwed, not that you're GOING to be screwed. You're going to have to get this from the drive using smartctl or the vendor's software. I suggest using both to cross-verify.
Once the drive exceeds the TBW limit and/or runs out of over-provisioned cells, sudden and unexplained failure is possible. As far as I can tell, you haven't ruled out this possibility by running smartctl: you've posted zpool statuses, motherboard manuals, and partition tables, all of which are the wrong level to look at when predicting a hardware failure of the other SSDs, which you can get a smartctl report from.
Simply put, by checking the other drives you can tell whether you're running them too hard and approaching the wear limit. If that is the case, your missing disk is most likely completely dead from wear, not because of some other hardware failure.
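A minimal sketch of that check for the drives that are still visible. It reads smartctl -A output on stdin and prints only wear-related attributes. The attribute list is an assumption: it covers only the Crucial MX500 and Samsung 830 names that appear in this thread, and other vendors use different names.

```shell
# Print wear-related SMART attributes from `smartctl -A` output on stdin.
# Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
wear_summary() {
    awk '
        $2 ~ /^(Percent_Lifetime_Remain|Wear_Leveling_Count|Ave_Block-Erase_Count|Total_LBAs_Written|Unused_Reserve_NAND_Blk|Used_Rsvd_Blk_Cnt_Tot)$/ {
            # Normalized value, failure threshold and raw value
            printf "%s value=%s thresh=%s raw=%s\n", $2, $4, $6, $10
        }'
}

# Example, as root:
# for d in ada1 ada3; do echo "== $d"; smartctl -A "/dev/$d" | wear_summary; done
```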
# smartctl -a /dev/ada3
smartctl 7.2 2020-12-30 r5155 [FreeBSD 13.0-RELEASE-p1 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Crucial/Micron Client SSDs
Device Model: CT1000MX500SSD1
Serial Number: 2022E2A60E94
LU WWN Device Id: 5 00a075 1e2a60e94
Firmware Version: M3CR032
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Thu Jul 22 07:49:56 2021 AWST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 30) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x0031) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 5966
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 178
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 011 011 000 Old_age Always - 5188
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 116
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 26
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 062 033 000 Old_age Always - 38 (Min/Max 0/67)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_ECC_Cnt 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 200
202 Percent_Lifetime_Remain 0x0030 011 011 001 Old_age Offline - 89
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 3365800522078
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 31328909475
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 24324258093
SMART Error Log Version: 1
Invalid Error Log index = 0x11 (T13/1321D rev 1c Section 8.41.6.8.2.2 gives valid range from 1 to 5)
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 5961 -
# 2 Short offline Completed without error 00% 5945 -
# 3 Short offline Completed without error 00% 5925 -
# 4 Short offline Completed without error 00% 5909 -
# 5 Short offline Completed without error 00% 5893 -
# 6 Extended offline Completed without error 00% 5880 -
# 7 Short offline Completed without error 00% 5879 -
# 8 Short offline Completed without error 00% 5862 -
# 9 Short offline Completed without error 00% 5843 -
#10 Short offline Completed without error 00% 5824 -
#11 Short offline Completed without error 00% 5804 -
#12 Short offline Completed without error 00% 5784 -
#13 Short offline Completed without error 00% 5764 -
#14 Extended offline Completed without error 00% 5744 -
#15 Short offline Completed without error 00% 5743 -
#16 Short offline Completed without error 00% 5723 -
#17 Short offline Completed without error 00% 5703 -
#18 Short offline Completed without error 00% 5682 -
#19 Short offline Completed without error 00% 5664 -
#20 Short offline Completed without error 00% 5647 -
#21 Short offline Completed without error 00% 5632 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
I shall now be using industrial-grade/NAS-grade (hoping this Seagate lasts)/spinning rust.

Thanks for finally posting the smartctl data. This drive is in the last 11% of its lifespan (attribute 202: normalized value 011, raw value 89, i.e. 89% used). This confirms my suspicion that the other drive died from exceeding its lifespan.
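To put the 11% in perspective, a rough and hypothetical estimate: if the drive keeps wearing at its average rate so far (89% used in 5966 power-on hours), the remaining 11% lasts about 5966 × 11 / 89 ≈ 737 hours, roughly a month of continuous operation. The linear-wear assumption is a strong one, since workloads change. As a helper (name hypothetical):

```shell
# Rough estimate of power-on hours left before rated wear is reached,
# assuming wear continues at the average rate observed since new.
#   $1 = percent of rated life used (Crucial attribute 202 raw value)
#   $2 = Power_On_Hours (attribute 9 raw value)
hours_left() {
    awk -v used="$1" -v hours="$2" 'BEGIN {
        if (used <= 0)   { print "no measurable wear yet"; exit }
        if (used >= 100) { print 0; exit }
        # hours per percent of wear, times the percent still remaining
        printf "%d\n", hours * (100 - used) / used
    }'
}

# Example with the MX500 above:
# hours_left 89 5966
```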
You can probably blame ZFS write amplification because you put it in a RAID. Use spinning rust instead of SSDs.
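Part of the amplification can be measured on the drive itself. Crucial/Micron drives expose Host_Program_Page_Count (247) and FTL_Program_Page_Count (248), and a commonly cited formula for the controller's write amplification factor is (host + FTL) / host. This only covers amplification inside the SSD (garbage collection, wear levelling), not extra writes issued by ZFS, so it is a partial picture at best. A minimal sketch (function name hypothetical):

```shell
# Controller-level write amplification from `smartctl -A` output on stdin,
# using WAF = (Host_Program_Page_Count + FTL_Program_Page_Count) / Host_Program_Page_Count.
drive_waf() {
    awk '
        $2 == "Host_Program_Page_Count" { host = $10 }
        $2 == "FTL_Program_Page_Count"  { ftl  = $10 }
        END {
            if (host > 0) printf "%.2f\n", (host + ftl) / host
            else { print "attributes 247/248 not found" > "/dev/stderr"; exit 1 }
        }'
}

# Example, as root:
# smartctl -A /dev/ada3 | drive_waf
```

For the MX500 above this gives (31328909475 + 24324258093) / 31328909475 ≈ 1.78.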
Most likely! I saw Micron, Intel, etc. drives of datacenter class.

Aye, then datacenter class is probably right for you...
If a drive is not detected at all, the first thing to do is to check the cables. SATA cables sometimes tend to loosen themselves slowly (e.g. from fan vibration, or when the PC is moved). If in doubt, replace the cable. Be sure to use SATA-III-specified cables with clips; these won’t come loose as easily.

There should be four CTxxxx but only three come up. The other CT1000MX is still attached to the MoBo yet not coming up.
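When a drive that is physically attached doesn't show up, CAM can be told to reset and rescan a bus without rebooting. A minimal sketch, assuming you have already found the right bus number (camcontrol devlist -v should list every scbusN, including ones with nothing detected); the helper name is hypothetical:

```shell
# Reset one CAM bus, rescan it for devices, then list what is attached.
reset_and_rescan() {
    bus=$1                          # numeric bus id, e.g. 1 for scbus1
    camcontrol reset "$bus" || return 1
    camcontrol rescan "$bus" || return 1
    camcontrol devlist
}

# Example, as root:
# reset_and_rescan 1
```

If the drive still does not appear after a reset and rescan on a known-good cable and port, a dead drive becomes the most likely explanation.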
Try sending a camcontrol reset command to the bus the device is connected to (should be scbus1 or scbus2 in your case, I’m not sure), followed by camcontrol rescan to scan the bus for devices that have newly appeared.

=== START OF INFORMATION SECTION ===
Model Family: Samsung based SSDs
Device Model: SAMSUNG SSD 830 Series
Serial Number: S0Z4NEBC808907
LU WWN Device Id: 5 002538 043584d30
Firmware Version: CXM03B1Q
User Capacity: 256,060,514,304 bytes [256 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
TRIM Command: Available
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 T13/2015-D revision 2
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sun Jul 25 18:06:10 2021 AEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x02) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 1020) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 17) minutes.
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
9 Power_On_Hours 0x0032 096 096 000 Old_age Always - 85917
12 Power_Cycle_Count 0x0032 089 089 000 Old_age Always - 10846
177 Wear_Leveling_Count 0x0013 096 096 000 Pre-fail Always - 121
179 Used_Rsvd_Blk_Cnt_Tot 0x0013 100 100 010 Pre-fail Always - 0
181 Program_Fail_Cnt_Total 0x0032 100 100 010 Old_age Always - 0
182 Erase_Fail_Count_Total 0x0032 100 100 010 Old_age Always - 0
183 Runtime_Bad_Block 0x0013 100 100 010 Pre-fail Always - 0
187 Uncorrectable_Error_Cnt 0x0032 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0032 058 051 000 Old_age Always - 42
195 ECC_Error_Rate 0x001a 200 200 000 Old_age Always - 0
199 CRC_Error_Count 0x003e 253 253 000 Old_age Always - 0
235 POR_Recovery_Count 0x0012 099 099 000 Old_age Always - 10810
241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 24060017288
SMART Error Log Version: 1
No Errors Logged
I've been running FreeBSD on a Samsung 830 256G SSD since 2011 and it shows no sign of going away. smartctl data above.

241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 24060017288

According to that number, your drive has only about 12.3 TB (11.2 TiB) written so far (the raw value counts LBAs of 512 bytes). I think the 830 256G is specified for 100 TBW, so it isn’t anywhere near the end of its lifespan yet, at least as far as wear on the flash cells is concerned.
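The conversion above as a small helper, assuming 512-byte LBAs as stated in the reply. The 100 TBW rating passed in the example is the poster's recollection, not a verified specification, and the function name is hypothetical:

```shell
# Convert a Total_LBAs_Written raw value to data written and, optionally,
# compare it with an endurance rating in TBW (decimal terabytes).
lbas_written() {
    awk -v lbas="$1" -v tbw="$2" 'BEGIN {
        bytes = lbas * 512              # assumes 512-byte LBAs
        tb  = bytes / 1e12              # decimal TB, the usual unit in drive specs
        tib = bytes / (1024 ^ 4)        # binary TiB
        if (tbw > 0)
            printf "%.1f TB (%.1f TiB), %.1f%% of %d TBW\n", tb, tib, 100 * tb / tbw, tbw
        else
            printf "%.1f TB (%.1f TiB)\n", tb, tib
    }'
}

# Example with the Samsung 830 above:
# lbas_written 24060017288 100
```

For the Samsung 830 this prints 12.3 TB (11.2 TiB), 12.3% of 100 TBW.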