[Solved] Recommendation for a Reliable SSD

You can use iostat -Ix and look at the kw/i column.
Divide that by the uptime in days and by 1E6 (or 2^20) and you get the average number of GB you write per day.
This won't be an accurate figure for the total GB written to the TLC flash, because it does not account for
the SSD controller's optimizations or the drive's write amplification factor.
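Here's a rough sketch of that calculation as a script. It's only an approximation, and the kw/i field position ($5) and the device name (ada1) are assumptions; check the header of your own iostat -Ix output and adjust the awk field.
Code:
#!/bin/sh
# Estimate the average host-side GB written per day for one disk.
# Device name (ada1) and the awk field for kw/i ($5) are assumptions.
boot=$(sysctl -n kern.boottime | sed 's/.*sec = \([0-9]*\),.*/\1/')
days=$(( ($(date +%s) - boot) / 86400 ))
kw=$(iostat -Ix | awk '$1 == "ada1" { print $5 }')   # total KB written since boot
[ "$days" -gt 0 ] && echo "scale=2; $kw / 1000000 / $days" | bc   # ~GB/day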
 
Many thanks again to everyone. The disk is not being detected; apart from that, the system does not see a GEOM/partition table on it. I may have to relocate it again, e.g. from /dev/ada3 to /dev/ada2, and it may then come back online as UNAVAIL or FAULTED with a CKSUM count of 3M or more.

The other disk was relabelled /dev/ses1, like this one, and later I began to see "label not found" errors. I have looked through dmesg and gdisk, among other tools: no GEOM, and no information about it other than the "online but faulted" state reported by zpool.

I have ordered a Seagate IronWolf as a replacement for now, though data-centre-grade disks remain my preference.
 
Sirs,
I am still puzzled by how the drive (9214606650531292110) came to be assumed damaged.
Please see below.

Code:
[20/07 3:47] iceland # zpool list -v
NAME                      SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
mypool                 928G   871G  56.8G        -         -    75%    93%  1.00x    ONLINE  -
  mirror                  928G   871G  56.8G        -         -    75%  93.9%      -  ONLINE  
    9214606650531292110      -      -      -        -         -      -      -      -  UNAVAIL 
    ada3p1                   -      -      -        -         -      -      -      -  ONLINE  
[20/07 3:47] iceland # gpart show
=>        40  1953525088  ada1  GPT  (932G)
          40  1953525088     1  freebsd-zfs  (932G)

=>        40  1953525088  ada3  GPT  (932G)
          40  1953525088     1  freebsd-zfs  (932G)

[20/07 3:47] iceland # zpool replace mypool 9214606650531292110 ada1p1
[20/07 3:49] iceland # zpool list -v                                     
NAME                        SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
mypool                   928G   872G  56.4G        -         -    75%    93%  1.00x  DEGRADED  -
  mirror                    928G   872G  56.4G        -         -    75%  93.9%      -  DEGRADED
    replacing                  -      -      -        -         -      -      -      -  DEGRADED
      9214606650531292110      -      -      -        -         -      -      -      -  UNAVAIL 
      ada1p1                   -      -      -        -         -      -      -      -  ONLINE  
    ada3p1                     -      -      -        -         -      -      -      -  ONLINE  
[20/07 3:49] iceland # zpool status -v
  pool: mypool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Tue Jul 20 15:48:40 2021
        163G scanned at 3.07G/s, 2.90G issued at 56.1M/s, 871G total
        2.99G resilvered, 0.33% done, 04:24:08 to go
config:

        NAME                       STATE     READ WRITE CKSUM
        mypool                  DEGRADED     0     0     0
          mirror-0                 DEGRADED     0     0     0
            replacing-0            DEGRADED     0     0     0
              9214606650531292110  UNAVAIL      0     0     0  was /dev/ada1p1/old
              ada1p1               ONLINE       0     0     0  (resilvering)
            ada3p1                 ONLINE       0     0     0

errors: No known data errors

[20/07 3:50] iceland # camcontrol devlist
<Seagate IronWolf ZA1000NM10002-2ZG102 SU3SC011>  at scbus3 target 0 lun 0 (ada1,pass1)
<CT1000MX500SSD1 M3CR032>          at scbus5 target 0 lun 0 (ada3,pass3)
<AHCI SGPIO Enclosure 2.00 0001>   at scbus6 target 0 lun 0 (ses0,pass4)
[20/07 3:50] iceland #
There should be two CT1000MX... (Crucial) devices; well, technically four - the other two, now excluded, are fine. I replaced one of them not long ago.
This is not a PEBKAC situation. The system and its pool were fine until a restart or thereabouts.

Screen Shot 2021-07-20 at 4.11.07 pm.png
The image above shows the motherboard. /dev/ada[0,2,3,4] are detected in the BIOS setup. The device now called /dev/ada1 is on an additional SATA port (with power connected) and is currently replacing 9214606650531292110. That port was unused until now, yet 9214606650531292110 was previously working (perhaps on ada1). Hence I can confirm that there are FIVE SSDs on the board, between A and F in the image. For the commands above, I have excluded the other mirrored pool - zroot - which uses ada0/ada2.

This is the second time I am replacing a disk in less than four months. Perhaps they are still working. I am tempted to change AHCI to RAID in the BIOS setup if that would make the disk detectable. It is a lame thought, but what is killing the disks? I shall now be paying more attention to the smartctl reports.
 
There should be four CTxxxx drives, but only three come up. The other CT1000MX is still attached to the MoBo yet not coming up. The casing for this machine is riveted, so opening it is not for the faint-hearted. That's the reason for my nagging.
IMG_20210720_151858.jpg
 
You need to check the TBW for the drives that you can see and/or run the manufacturer health utility/lifespan tool.
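For instance, here is a rough way to pull a TB-written figure out of smartctl for a Crucial MX500-class drive. The attribute number (246, Total_LBAs_Written) and the 512-byte LBA unit are vendor-specific assumptions, and the device name is just an example; compare the result against the TBW rating in the drive's datasheet.
Code:
# Rough host-writes estimate from SMART attribute 246 (Total_LBAs_Written).
# Attribute ID and 512-byte LBA size are vendor-specific assumptions.
lbas=$(smartctl -A /dev/ada3 | awk '$1 == 246 { print $10 }')
echo "scale=2; $lbas * 512 / 10^12" | bc   # TB written by the host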
 
Thank you, Sir, for this information. I got some more useful information from a thread on scrubbing best practices. It appears that scrubbing, which I do, is better than smartctl alone. I can now always be prepared to replace one or more of the SSDs whenever I see checksum errors. On those two occasions, zpool status had reported the errors, some of them permanent.

I can still run smartctl. If anyone could point me to a script for it that can email the report, that would be good. I used to have one.

I shall now close this thread. Thank you for your replies.
 
Let me be a little more direct here. If scrubbing is what you do and scrubbing is "better than smartctl" then scrubbing should have predicted the failure of this drive, which it did not. Scrubbing is not going to tell you if you're approaching the TBW (Terabytes Written) limit for the drive. Scrubbing is just going to tell you that you ARE screwed, not that you're GOING to be screwed. You're going to have to get this from the drive using smartctl or the vendor's software. I suggest using both to cross verify.

Once the drive exceeds the TBW limit and/or runs out of over-provisioned cells, inexplicable and immediate failure is possible. As far as I can tell, you haven't ruled out this possibility by running smartctl: you've posted zpool statuses, motherboard manuals, and partition tables, all of which are completely the wrong level at which to predict a hardware failure. The other SSDs, at least, are ones you can still get a smartctl report from.

Put simply: by checking the other drives, you can tell whether you're running them too hard and approaching the wear limit. If that is the case, your failed disk is most likely completely dead from wear, and not because of any other hardware failure.
 
I recommend doing periodic logging of smartctl -x (and partitioning data) from /etc/periodic/monthly or /etc/periodic/weekly[1]. (Here is a sample script; adapt it to your needs.) Write the output into /var/backups with the month or week number in the filename, so the files rotate annually.
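A minimal sketch of such a weekly script follows. The device list, script path, and output directory are assumptions to adjust for your system, and the optional mail line (addressing the earlier request for emailed reports) assumes mail(1) is set up.
Code:
#!/bin/sh
# Hypothetical weekly script, e.g. /etc/periodic/weekly/510.smart-log.
# Device list and output directory are assumptions; adjust to your system.
week=$(date +%U)                              # week of year, so files rotate annually
out="/var/backups/smart-week-${week}.log"
: > "$out"
for dev in ada0 ada1 ada2 ada3; do
    echo "=== /dev/${dev} $(date) ===" >> "$out"
    smartctl -x "/dev/${dev}" >> "$out" 2>&1   # full SMART data
    gpart show "${dev}"       >> "$out" 2>&1   # partitioning data
done
# Optionally email the report (assumes mail(1) is configured):
# mail -s "SMART report $(hostname)" root < "$out"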
I got it. Thanks
 
Thank you, Sir. The problem is that I cannot see the failed drive, let alone run smartctl on it. I can run it on the other devices. I am used to running smartctl on all of them, but have not been paying much attention to its reports. This thread - https://forums.FreeBSD.org/threads/scrub-task-best-practice.78802/post-493837 - provided some more insight.

I am afraid I cannot dump the smartctl data for the damaged device, since it is not recognised or found by camcontrol devlist.

Here is the smartctl output for the second device in the mirror:
Code:
# smartctl -a /dev/ada3
smartctl 7.2 2020-12-30 r5155 [FreeBSD 13.0-RELEASE-p1 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Crucial/Micron Client SSDs
Device Model:     CT1000MX500SSD1
Serial Number:    2022E2A60E94
LU WWN Device Id: 5 00a075 1e2a60e94
Firmware Version: M3CR032
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Jul 22 07:49:56 2021 AWST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled


Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  30) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x0031) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
  5 Reallocate_NAND_Blk_Cnt 0x0032   100   100   010    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       5966
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       178
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
173 Ave_Block-Erase_Count   0x0032   011   011   000    Old_age   Always       -       5188
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       116
180 Unused_Reserve_NAND_Blk 0x0033   000   000   000    Pre-fail  Always       -       26
183 SATA_Interfac_Downshift 0x0032   100   100   000    Old_age   Always       -       0
184 Error_Correction_Count  0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   062   033   000    Old_age   Always       -       38 (Min/Max 0/67)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_ECC_Cnt 0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       200
202 Percent_Lifetime_Remain 0x0030   011   011   001    Old_age   Offline      -       89
206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
210 Success_RAIN_Recov_Cnt  0x0032   100   100   000    Old_age   Always       -       0
246 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       3365800522078
247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       31328909475
248 FTL_Program_Page_Count  0x0032   100   100   000    Old_age   Always       -       24324258093

SMART Error Log Version: 1
Invalid Error Log index = 0x11 (T13/1321D rev 1c Section 8.41.6.8.2.2 gives valid range from 1 to 5)

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      5961         -
# 2  Short offline       Completed without error       00%      5945         -
# 3  Short offline       Completed without error       00%      5925         -
# 4  Short offline       Completed without error       00%      5909         -
# 5  Short offline       Completed without error       00%      5893         -
# 6  Extended offline    Completed without error       00%      5880         -
# 7  Short offline       Completed without error       00%      5879         -
# 8  Short offline       Completed without error       00%      5862         -
# 9  Short offline       Completed without error       00%      5843         -
#10  Short offline       Completed without error       00%      5824         -
#11  Short offline       Completed without error       00%      5804         -
#12  Short offline       Completed without error       00%      5784         -
#13  Short offline       Completed without error       00%      5764         -
#14  Extended offline    Completed without error       00%      5744         -
#15  Short offline       Completed without error       00%      5743         -
#16  Short offline       Completed without error       00%      5723         -
#17  Short offline       Completed without error       00%      5703         -
#18  Short offline       Completed without error       00%      5682         -
#19  Short offline       Completed without error       00%      5664         -
#20  Short offline       Completed without error       00%      5647         -
#21  Short offline       Completed without error       00%      5632         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
 
Thanks for finally posting the smartctl data. This drive is on the last 11% of its lifespan. This confirms my suspicion that the other drive died from exceeding its lifespan.

You can probably blame ZFS write amplification because you put it in a RAID. Use spinning rust instead of SSDs.
 
I shall now be using industrial-grade / NAS-grade drives (hoping this Seagate lasts), or spinning rust.

How did you calculate the lifespan left on it?
 
S.M.A.R.T. attribute 202. The "VALUE" column is the % lifetime remaining; the raw value is the % lifetime used. Check your other SSDs. I would consider them "dead" at 10% remaining, since you're writing them to death.
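A quick way to eyeball that attribute across the drives you can see; the device list is just an example, and attribute 202 (Percent_Lifetime_Remain) is specific to Crucial/Micron drives, so other vendors will need a different attribute.
Code:
# Print the normalised VALUE of attribute 202 (Percent_Lifetime_Remain)
# for each visible drive; device names are examples, adjust to your system.
for dev in ada0 ada1 ada2 ada3; do
    printf '%s: ' "${dev}"
    smartctl -A "/dev/${dev}" | awk '$1 == 202 { print $4 "% lifetime remaining" }'
done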
 
There should be four CTxxxx drives, but only three come up. The other CT1000MX is still attached to the MoBo yet not coming up.
If a drive is not detected at all, the first thing to do is to check the cables. SATA cables sometimes have a tendency to loosen themselves slowly (e.g. caused by vibration from fans, or when moving the PC). If in doubt, replace the cable. Be sure to use SATA-III-specified cables with clips; these won’t come loose as easily.

You could try issuing a camcontrol reset command to the bus which the device is connected to (should be scbus1 or scbus2 in your case, I’m not sure), followed by camcontrol rescan to scan the bus for new devices that have appeared.

If all of that doesn’t help, I guess either the controller (unlikely) or the device (more likely) is pushing up the daisies.
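For example (the bus number here is only a guess; check camcontrol devlist -v to see which scbusN the missing drive should be on):
Code:
camcontrol devlist -v    # lists devices together with their scbusN
camcontrol reset 2       # reset the bus the missing drive should hang off (assumed scbus2)
camcontrol rescan 2      # rescan that bus for newly appeared devices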
 
I've been running FreeBSD on a Samsung 830 256G SSD since 2011 and it shows no sign of going away. smartctl data below:

Code:
=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     SAMSUNG SSD 830 Series
Serial Number:    S0Z4NEBC808907
LU WWN Device Id: 5 002538 043584d30
Firmware Version: CXM03B1Q
User Capacity:    256,060,514,304 bytes [256 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
TRIM Command:     Available
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 T13/2015-D revision 2
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Jul 25 18:06:10 2021 AEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x02) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                ( 1020) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  17) minutes.
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       85917
 12 Power_Cycle_Count       0x0032   089   089   000    Old_age   Always       -       10846
177 Wear_Leveling_Count     0x0013   096   096   000    Pre-fail  Always       -       121
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   058   051   000    Old_age   Always       -       42
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   253   253   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       10810
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       24060017288

SMART Error Log Version: 1
No Errors Logged
 
The workload running on the drive determines its longevity. We have tonnes of jails chunking big data, daily poudrière builds, and much more.
 
I've been running FreeBSD on a Samsung 830 256G SSD since 2011 and it shows no sign of going away. smartctl data below:
Code:
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       24060017288
According to that number, your drive has only about 12.3 TB (11.2 TiB) written so far (the raw value is in 512-byte LBA units). I think the 830 256G is specified for around 100 TBW, so it isn’t anywhere near the end of its lifespan yet, at least as far as the wear of the flash cells is concerned.
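For reference, the arithmetic is simply the raw Total_LBAs_Written count multiplied by the 512-byte sector size:
Code:
# Raw LBA count x 512-byte sectors, expressed in TB and TiB
echo "24060017288 * 512 / 10^12" | bc -l   # ~12.3 TB
echo "24060017288 * 512 / 2^40"  | bc -l   # ~11.2 TiB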
 