Solved: Are my hard drives dying or is the motherboard going bad?

I came back from vacation to find error messages in my logs pertaining to a ZFS pool used to store media. This manifested itself as CAM errors and suspended I/O operations. I don't have the precise logs for some reason; I'm not sure if they got lost from power cycling or what.

I *think* I was able to back up the drive in question by using dd.

Why would dd work with no issues, while attempting to online another vdev in the mirror causes the failures and, subsequently, the I/O operations to be suspended?

When I attempted the exact same ZFS operations on a similar computer, I was able to reproduce the CAM/suspend errors, so I suspect the hard drive itself is going bad and not the motherboard or controller.

After making an image of the drive, I used mdconfig to attach it as a memory disk and then used zpool import as I normally would. Then I onlined another device to resilver it and essentially mirror it.
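Roughly, the sequence looked like this (the device and pool names below are placeholders, not the actual ones I used):
Code:
# Placeholder names throughout; adjust to the real devices and pool.
dd if=/dev/ada1 of=/backup/media.img bs=1m conv=noerror,sync   # image the suspect disk
mdconfig -a -t vnode -f /backup/media.img                      # attach the image; prints e.g. md0
zpool import media                                             # import the pool from the md device
zpool online media ada2                                        # bring the other mirror member online; resilver starts
zpool status media                                             # watch the resilver progress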

As a result of this apparent failure, I decided I should rebuild my system. The rebuild runs on a separate computer with separate disks, but it pulls ZFS snapshots from the active machine and thus places a considerable amount of I/O load on it. Long story short, I triggered more CAM errors and an I/O suspend, this time on the main system drive.

I had rebuilt my system from scratch a few days earlier and my system disk did not complain or error out then. Would you suspect the main system drive is also going bad?

Is there a good way to check?
 
I can't quite picture what you describe. Can you edit a little, maybe use things such as A, B, C, 1, 2, etc. to distinguish between e.g. the disks and systems? Thanks
 
... cam error and io operations suspended...
... reproduce the cam / suspend errors, ...
It sounds to me like you're talking about two different disks. The probability of both going bad at the same time is very small, unless there is a bizarre common cause.

To debug this, I think the best thing would be to know exactly what the CAM errors are. There is a huge difference between communication errors (ECC or parity on the SATA bus) and read/write errors. Also, it would help if you could show us the output of smartctl -a.
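Something along these lines for each suspect disk (the ada0/ada1 device names are just examples; substitute your actual device nodes):
Code:
smartctl -a /dev/ada0   # system disk (example device name)
smartctl -a /dev/ada1   # media disk (example device name)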
 
Yes, I'm talking about 2 different disks.

The first thing I noticed was my media drive apparently dying: I saw the CAM errors and the ZFS pool was suspended. However, I do not see that in the logs (after having rebooted and power cycled the machine). I don't know how I lost that information; one would think it would be written to the system logs. When I attempted to reboot, the system hung because it could not export the ZFS pool while I/O was suspended, so each time I rebooted I had to power cycle the machine, until I realized the media drive was the issue. When I did, I unplugged it and removed it from /etc/rc.conf.

Next, since my media drive was offline, I decided to dump my pictures to my system drive on another ZFS volume (separate from the main ZFS volumes: ROOT, usr, home, etc.). I was able to dump the pictures there, but later, while rebuilding my system on another computer, the rebuild pulled ZFS snapshots from my active system. At that point the system hung, suspending I/O operations on the system drive, and I couldn't interact with the system at all unless I powered it down. I then realized that my system drive also appears to have an issue, and after a day or so I narrowed it down to reading or writing files on one specific ZFS volume. The other ZFS volumes seem fine for read/write operations, but this one is problematic. ZFS operations like sending a snapshot to a clone also cause problems.
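To be concrete, an operation along these lines is enough to trigger it (the dataset and snapshot names here are made up for illustration):
Code:
# Illustrative names only.
zfs snapshot zroot/pictures@backup
zfs send zroot/pictures@backup | zfs receive backup/pictures   # sustained reads on the problem dataset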

As I mentioned in my earlier post, I tried the original media disk in a separate, similarly specced machine and it reproduced the exact same errors and behavior. I have since replaced that drive with another similarly sized drive, after having successfully dd'd the failing drive and then mirroring onto the replacement. Since replacing the drive, I am no longer getting any errors on the ZFS pool or disk, so I am fairly confident that the drive itself was failing and not the SATA port.

That said, here is the smartctl output from my system disk. I will do the same for the media drive.
Code:
smartctl 7.4 2023-08-01 r5530 [FreeBSD 14.0-RELEASE-p5 amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Blue
Device Model:     WDC WD5000AAKX-00ERMA0
Serial Number:    WD-WCC2EF576369
LU WWN Device Id: 5 0014ee 20776986f
Firmware Version: 15.01H15
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database 7.3/5528
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Apr  7 22:43:40 2024 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         ( 8760) seconds.
Offline data collection
capabilities:             (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:     (   2) minutes.
Extended self-test routine
recommended polling time:     (  89) minutes.
Conveyance self-test routine
recommended polling time:     (   5) minutes.
SCT capabilities:           (0x3037)    SCT Status supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       121
  3 Spin_Up_Time            0x0027   142   138   021    Pre-fail  Always       -       3891
  4 Start_Stop_Count        0x0032   093   093   000    Old_age   Always       -       7687
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   082   082   000    Old_age   Always       -       13804
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   093   093   000    Old_age   Always       -       7041
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       119
193 Load_Cycle_Count        0x0032   198   198   000    Old_age   Always       -       7573
194 Temperature_Celsius     0x0022   117   092   000    Old_age   Always       -       26
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   197   197   000    Old_age   Always       -       306
198 Offline_Uncorrectable   0x0030   197   197   000    Old_age   Offline      -       306
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   199   199   000    Old_age   Offline      -       332

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The above only provides legacy SMART information - try 'smartctl -x' for more
 
Here is the smartctl output from my media drive, the one that originally appeared to be failing:

Code:
smartctl 7.4 2023-08-01 r5530 [FreeBSD 14.0-RELEASE-p5 amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.12
Device Model:     ST31000528AS
Serial Number:    6VPD48MX
LU WWN Device Id: 5 000c50 036c1e5d4
Firmware Version: CC46
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database 7.3/5528
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Sun Apr  7 22:59:29 2024 EDT

==> WARNING: A firmware update for this drive may be available,
see the following Seagate web pages:
http://knowledge.seagate.com/articles/en_US/FAQ/207931en
http://knowledge.seagate.com/articles/en_US/FAQ/213891en

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:         (  609) seconds.
Offline data collection
capabilities:             (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:     (   1) minutes.
Extended self-test routine
recommended polling time:     ( 176) minutes.
Conveyance self-test routine
recommended polling time:     (   2) minutes.
SCT capabilities:           (0x103f)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   114   099   006    Pre-fail  Always       -       77185527
  3 Spin_Up_Time            0x0003   094   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1369
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   087   060   030    Pre-fail  Always       -       4948635329
  9 Power_On_Hours          0x0032   038   038   000    Old_age   Always       -       55175
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       532
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   084   084   000    Old_age   Always       -       16
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       4295032836
189 High_Fly_Writes         0x003a   076   076   000    Old_age   Always       -       24
190 Airflow_Temperature_Cel 0x0022   084   048   045    Old_age   Always       -       16 (Min/Max 16/16)
194 Temperature_Celsius     0x0022   016   052   000    Old_age   Always       -       16 (0 14 0 0 0)
195 Hardware_ECC_Recovered  0x001a   052   019   000    Old_age   Always       -       77185527
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       3
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       56623 (136 61 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       2282680202
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       2664697217

SMART Error Log Version: 1
ATA Error Count: 15 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 15 occurred at disk power-on lifetime: 55175 hours (2298 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 f8 03 31 0c  Error: UNC at LBA = 0x0c3103f8 = 204538872

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 c8 68 51 31 4c 00      00:09:02.693  READ FPDMA QUEUED
  60 00 18 68 40 31 4c 00      00:09:02.691  READ FPDMA QUEUED
  60 00 08 90 33 31 4c 00      00:09:02.686  READ FPDMA QUEUED
  60 00 10 b0 31 31 4c 00      00:09:02.682  READ FPDMA QUEUED
  60 00 88 88 03 31 4c 00      00:09:02.679  READ FPDMA QUEUED

Error 14 occurred at disk power-on lifetime: 55168 hours (2298 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 68 1e 31 0c  Error: UNC at LBA = 0x0c311e68 = 204545640

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 42 80 00 1e 31 4c 00      00:52:59.806  READ DMA EXT
  25 42 80 80 1d 31 4c 00      00:52:59.803  READ DMA EXT
  25 42 80 00 1d 31 4c 00      00:52:59.802  READ DMA EXT
  25 42 80 80 1c 31 4c 00      00:52:59.800  READ DMA EXT
  25 42 80 00 1c 31 4c 00      00:52:59.798  READ DMA EXT

Error 13 occurred at disk power-on lifetime: 55168 hours (2298 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 f8 03 31 0c  Error: UNC at LBA = 0x0c3103f8 = 204538872

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 42 80 80 03 31 4c 00      00:52:56.826  READ DMA EXT
  25 42 80 00 03 31 4c 00      00:52:56.824  READ DMA EXT
  25 42 80 80 02 31 4c 00      00:52:56.822  READ DMA EXT
  25 42 80 00 02 31 4c 00      00:52:56.820  READ DMA EXT
  25 42 80 80 01 31 4c 00      00:52:56.818  READ DMA EXT

Error 12 occurred at disk power-on lifetime: 55168 hours (2298 days + 16 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 d0 fb 30 0c  Error: UNC at LBA = 0x0c30fbd0 = 204536784

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 42 80 80 fb 30 4c 00      00:52:53.990  READ DMA EXT
  25 42 80 00 fb 30 4c 00      00:52:53.987  READ DMA EXT
  25 42 80 80 fa 30 4c 00      00:52:53.986  READ DMA EXT
  25 42 80 00 fa 30 4c 00      00:52:53.984  READ DMA EXT
  25 42 80 80 f9 30 4c 00      00:52:53.982  READ DMA EXT

Error 11 occurred at disk power-on lifetime: 55167 hours (2298 days + 15 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 f8 03 31 0c  Error: UNC at LBA = 0x0c3103f8 = 204538872

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 c8 68 51 31 4c 00      00:07:12.783  READ FPDMA QUEUED
  60 00 08 90 33 31 4c 00      00:07:12.781  READ FPDMA QUEUED
  60 00 88 88 03 31 4c 00      00:07:12.777  READ FPDMA QUEUED
  60 00 b0 10 d3 30 4c 00      00:07:12.776  READ FPDMA QUEUED
  60 00 38 d0 92 30 4c 00      00:07:12.767  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     21129         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The above only provides legacy SMART information - try 'smartctl -x' for more

It appears to me that the device does have some errors in its logs.
 
It sounds to me like you're talking about two different disks. The probability of both going bad at the same time is very small, unless there is a bizarre common cause.
It might be that they are from the same batch. You can check the serial numbers and see how far apart they are. For example, I have disks in an array that are xxx4 and xxx6 in their serial numbers; clearly the same batch, and under the same load they are prone to fail at the same time. The worst moment is when one drive dies and the second fails when the resilver comes around.
 
Error 15 occurred at disk power-on lifetime: 55175 hours (2298 days + 23 hours)
Nice old disk; it has been running for more than 6 years now. I'm betting it's on its last legs. Mine typically last 4 years (turned on 24/7), then start showing DMA errors and/or CAM timeouts and retries. They usually stop working entirely not long after that. You're lucky you managed to get a backup from it.
 
Ok, I keep my stuff on 24x7 too. Perhaps I could turn things off at night or put them into a lower power state to reduce wear and tear.

So, the million-dollar question then: you don't see any reason to suspect any other issues?
 
The second disk has 3 errors right now (categories 197/198), and has had 16 over its lifetime. It is also a 1 TB Seagate Barracuda, and there were several generations of Barracudas that were exceedingly failure prone. I personally went through 3 of them (two failed), a colleague through 5 (he needed 4, got one replaced early on by Seagate warranty).

Not only has the disk been powered on for about 6 years, I think the 1TB Barracudas were made and sold about 12 years ago. I can roughly date mine: It was bought when "Fry's" electronics superstore still operated the giant store in a suburb of Portland OR.

I suspect the 500 GB WD drive is even older, even though it has only 1-1/2 years of power up time. So both are very long in the tooth.

Now the question is: How long do disk drives typically last? That's a super tough thing, and has many factors. The real AFR (annual failure rate) in good conditions is somewhere between 1.5% and 0.5%, which would mean that statistically they should on average last between 60 and 200 years (*). But that is a terrible extrapolation: the data that go into these measurements are all taken on relatively young disks (at most 5-7 years), so estimating what happens after 10 or 12 is tough. The big cloud companies (which all run MILLIONS of drives) are now depreciating them financially over 5 to 7 years. But in most cases, the older drives that are being thrown away aren't dead, they are just too low in capacity, and it is cheaper to replace them (since physical space and power cost money too). So I disagree with SirDice: From a reliability point of view, a 6- or 12-year old disk doesn't always have to be dead. But the risk of them dying soon is increasing. In particular since desktop systems, amateurs, and household environments are much harsher on disks than giant data centers with good temperature, humidity and vibration control.

In your case though, both disks are either tired, sick, or outright dying. I would replace them. Do I see any other issues? No, just two ancient disks, which are getting errors.

(*) Footnote: The oldest disk in the world is the IBM RAMAC, about the size of two large refrigerators, and with 5 MB capacity. I used to walk past serial number 3 every morning and evening on the way to/from work. One of the RAMACs is still "functioning", in the sense that it is in a museum (either the Computer History Museum or at Santa Clara University), where it regularly gets powered up and exercised, and bits, bytes and tracks can be read from the original disk. That disk is from 1956, making it about 66 years old.
 
Yes, that is what I was thinking too that power cycles are perhaps more destructive than leaving them on 24x7.
 
That's another super complex question. Historically, the answer has changed back and forth. Modern disks (made in the last few years) probably can handle starting/stopping lots of times; for example, Amazon's "glacier" storage is thought to be some combination of powered-down disks and tapes. On the other hand, 10 or 15 years ago, the proposed "MAID" (massive arrays of idle disks) technology and startup died because disks did not actually survive being spun up and down every few hours.
 
So I disagree with SirDice: From a reliability point of view, a 6- or 12-year old disk doesn't always have to be dead. But the risk of them dying soon is increasing. In particular since desktop systems, amateurs, and household environments are much harsher on disks than giant data centers with good temperature, humidity and vibration control.
Oh, I've had 10-year-old disks working fine. I've also had 2-year-old drives just die on me; 4 to 5 years is the average over the past 20 or so years I've had my home lab. I should probably also mention it gets hot during summer. Every summer, or just after summer, there's a disk that needs replacing. My ambient "room" temperature during summer is usually 35C or more, and I have bad airflow in the cases. Cheap consumer-grade disks just don't last that long in my environment. But they're cheap, and they're in a RAID-Z, so I'm not really that worried about it :D
 
I have been able to clear Current_Pending_Sector (197) and Offline_Uncorrectable (198) errors in the past by re-silvering the disk (which allows the firmware to relocate bad blocks that were in use by the O/S). This does require redundant storage...

However, my experience is that any persistence or return of 197 and 198 errors (especially when numerous) generally portends death of the drive.

If these were my disks, I would immediately assess the recovery and replacement strategies, and acquire replacement media.

For the uninitiated, Seagate drives store two numbers reported as a single integer in ID# 1, 7, 188, and 195. There's a note [UPDATE (1 Nov 2020)] covering this in the excellent TrueNAS Hard Drive Troubleshooting Guide.
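If I understand the convention correctly (it is community folklore rather than anything Seagate documents, so treat this as an assumption), the raw value is a 48-bit field whose upper 16 bits are the error count and whose lower 32 bits are the operation count, which you can split with shell arithmetic:
Code:
# Assumed split: upper 16 bits = errors, lower 32 bits = total operations.
raw=4948635329                              # Seek_Error_Rate raw value from the output above
echo "errors:     $(( raw >> 32 ))"         # -> 1
echo "operations: $(( raw & 0xFFFFFFFF ))"  # -> 653668033, i.e. one seek error in ~654 million seeks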
 
IMO, these disks are on the way out.
You can change them proactively on your terms, or reactively on their terms.

Seagate and Western Digital provide bootable software tools that will read the SMART data and make a determination for you.
They probably won't work on a ZFS formatted drive for track reading purposes, but I have never tested this.

I change my client disks at 26,280 hours, or 3 years of 24x7 operation.
None of my systems stay up on a 24x7 basis anymore... no need to put that amount of wear on spinning devices like disks and fans.

My Server 2012 machine crapped out at power on a week ago.
The Corsair 450w PSU popped a cap, blew the PSU and took one G.Skill memory stick with it.
Drat... only got 12 years out of that rig.
 
Good points.

I have already got a new media drive and was extremely lucky. I was able to recover the media by running dd, then attaching the image to a memory disk and importing the pool, then resilvering to the other 2 devices in the mirror. After that completed, I created a new pool from the snapshot and am in the process of adding devices to that pool to mirror the storage. When that is done, I will essentially have 3 drives in a mirrored configuration. Again, 1 drive is online 24x7, and the other 2 are offline mirrors. This has worked well for me, but perhaps, to someone's point, I should be powering down the drive when it isn't in use.
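The attach step for each extra device is roughly this (pool and device names are placeholders):
Code:
# Placeholder names; attaching a new device alongside an existing one extends the mirror.
zpool attach media ada1 ada3   # third mirror member; a resilver runs automatically
zpool status media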

Regarding my system drives, I have not used smartmontools in the past, so I have no baseline, but all of my equipment is old; the newest stuff was purchased second-hand in 2016, excluding the memory I got new and any replacement hard drives :). When I run smartmontools on the other drives, they look just as bleak. I guess the takeaways are:

1. have a cold spare or 2 ready (even if the drive is older and in a prefail state)
2. have a bootable FreeBSD image or 2 such as nomadbsd
3. have a new system drive on the way to replace failing disks
4. use smartmontools to warn of a possible drive failure; I believe periodic has a CAM check job, is there a similar one for smartmontools I can use?

Lastly, what I used to do with Gentoo / Funtoo was to boot the system from a USB squashfs image and run it entirely from memory. I got persistence by plugging the USB drive back in and writing out the changes. Would something like that help to reduce wear and tear?
 
My personal opinion is that spinning drives up and down repeatedly is a bad idea unless they are specifically designed to operate in such a heavy duty cycle.

I would, instead, look to the operational environment of the drives. Both ambient temperature and airflow really matter for durability, as does a quality power supply. Is your server in the coolest place it can be? Is it ever touched by the sun? Does each disk drive have good separation from the next, and have a fan blowing directly onto it? Do you have a UPS? The smartmon tools can raise an alarm if the temperature goes too high.
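For example, something like the following smartd.conf line should do it, if I remember the directive correctly (the device name, thresholds, and ports path are assumptions to adapt):
Code:
# /usr/local/etc/smartd.conf (typical location when installed from ports/packages)
# -W DIFF,INFO,CRIT: track temperature changes of DIFF degrees, log at INFO C, warn at CRIT C.
/dev/ada0 -a -W 5,40,45 -m root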
 
Running dd normally means you are reading data sequentially, which can be more reliable than making the drive swing the heads to seek to a sector. Sometimes you can also get differing results depending on how the drive is physically oriented. I'd replace a drive showing either symptom.

The first disk was at 7041 power cycles (on average one every 1.96 hours of use) while the second disk you listed had 532 (on average one every 103.7 hours of use); how much you 'used' them, and when, changes what you can spare for unneeded uptime. There are stresses in powering on circuitry, like the light bulb question, which is a real thing: charging capacitors causes a (small) surge of current, limited by resistors and other resistances; a motor receiving power at a standstill draws its highest current (and its lowest at maximum speed); the torque of starting the motor is a moment of increased friction; and heating cooler components (the source of heat doesn't matter) causes materials to expand (not all at the same rate) while cooling causes contraction, which can lead to differences in friction/wear and physical stress on solder joints and components overall. Leaving it running is not stress free either: hotter components generally have a shorter life, electrical power can slowly degrade components, moving components still wear, etc. Which is worse always gets debated back and forth, but for longer unused periods I've always seen it recommended to turn things off for durability and power savings. Regular restarts have shown problems: some green drives being spun up just after they were spun down had short lifespans/rapid deaths and got a firmware update for it. Similarly, cars that turn off their engine at stoplights put much higher wear on components like the starter, so such components had a much shorter life than on an engine not doing that.

Running smartctl with -x instead of -a can sometimes give more details; I think that is also what gsmartcontrol (which grahamperrin asked for results from) uses, though that tool adds light and dark red coloring depending on whether it thinks it has detected a minor or a major problem.

You can add periodic jobs for SMART tests to be run and reported. I think there is also a daemon that can monitor basic attributes nearly live and alert the moment it sees a change. Running a full test takes longer but does a full surface check. The short test normally covers no more than 10% of the disk surface (and often much less), which is why you can have a fast/conveyance test pass and a long test fail when it is only one or a few bad sectors. If you truly question a drive, a quick test passing always means you should continue with a more thorough test.
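If memory serves, the smartmontools port gives you both pieces; a rough sketch (the exact periodic knob, paths, and device names are from memory and need checking against your install):
Code:
# /etc/rc.conf -- run the smartd daemon from sysutils/smartmontools
smartd_enable="YES"

# /etc/periodic.conf -- daily SMART status report, if the port's periodic script is installed
daily_status_smart_devices="/dev/ada0 /dev/ada1"

# /usr/local/etc/smartd.conf -- short test nightly at 02:00, long test Saturdays at 03:00
/dev/ada0 -a -s (S/../.././02|L/../../6/03) -m root

# Or kick off a test by hand and read the result later:
smartctl -t long /dev/ada0
smartctl -l selftest /dev/ada0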

Reported temperatures looked okay, but I have reliably caused bad sectors by overheating a drive, and those sectors always reliably took new data afterwards. Backblaze reported the best results with magnetic drives kept between 35C and 45C, if I recall. A bad power supply, a bad connection, and dusty/dirty drive circuitry can also be culprits for bad sectors and other issues. Data cables that are damaged, poorly connected, or bent too much can cause data issues, but those won't show up as bad sectors.

Vibrations from outside and inside the computer may reach the drive in detrimental ways; adding padding can help if the case sits on hard feet against a vibration-prone surface. Inside the case, you should repair (yes, sometimes possible), replace, or disable/remove fans that do not spin properly due to dirt/wear/damage, as they can give everything a good shake. Some cases offer vibration-dampening mounting for drives and fans (usually rubber or silicone grommets/pads), which can help keep things quiet and keep vibrations from one component from reaching others. Vibrations can sometimes show up as intermittent drive issues, and they can lead to shorter drive life and sudden failure that could have been avoided.

If you improve issues with temperature, dirt, power, connections, and vibration for a drive, you can possibly get many years out of a seemingly questionable drive. Occasionally but "rarely" getting a bad sector, or finding one after, for example, a power interruption, wouldn't scare me too much; but any sector problems = data problems, so if it keeps happening or is reproducible, make sure you don't use such a drive for anything you need a guarantee of being able to read.

USB sticks are usually slow and not very durable for tasks like running an operating system from them. Using a RAM filesystem that you later sync to disk can minimize writing, so it can be more durable. I did a bit of shopping to find sticks that were significantly faster to boot from, but the best speed and durability will come from using a hard drive (magnetic or solid state) over USB instead of a USB stick or memory card.
 
My personal opinion is that spinning drives up and down repeatedly is a bad idea unless they are specifically designed to operate in such a heavy duty cycle.

I would, instead, look to the operational environment of the drives. Both ambient temperature and airflow really matter for durability, as does a quality power supply. Is your server in the coolest place it can be? Is it ever touched by the sun? Does each disk drive have good separation from the next, and have a fan blowing directly onto it? Do you have a UPS? The smartmon tools can raise an alarm if the temperature goes too high.
My computer is in the basement where it is fairly cool, even in the summer. Nope, it is out of the sun.

Yes, this case does have a fan blowing directly over the hard drive; however, my media drive does not, that was the one that failed first albeit not by much. The media drive is sitting in a 5.25-in bay meant for a DVD burner with a 3.5-in adapter, not ideal.

They're vertically oriented as this is a slim tower ... I am looking to go back to a full-size tower once these boxes die.

I will look to set up smartmontools to monitor the temperature and check for differences between the 2 drives.
 
My personal opinion is that spinning drives up and down repeatedly is a bad idea unless they are specifically designed to operate in such a heavy duty cycle.
Do you keep the same opinion about a device that spins at 7,200 rpm for every second it is powered up?

All night while you are asleep.
Every day while you are away at work.
All week when you are away on vacation?

The bearings are tiny, and lubrication is required in those bearings.
The slightest wear can induce wobble and disc misalignment and a potential crash.
The heads fly at a height less than that of a human hair... tight tolerances.

On the flip side, I can agree with the detrimental effect of new vehicles that shut off the engine at every traffic light.
This is a severe duty cycle for the starter motor, plus the constant wear resulting from an engine starting repeatedly without full oil pressure.
 
Do you keep the same opinion about a device that spins at 7,200 rpm for every second it is powered up?
Yes, because they are designed, tested, and warranted to do that,
On the flip side, I can agree with the detrimental effect of new vehicles that shut off the engine at every traffic light.
This is a severe duty cycle for the starter motor, plus the constant wear resulting from an engine starting repeatedly without full oil pressure.
and I can see that you understand the sorts of issues at play...
 
All night while you are asleep. ...
On the flip side, ...
That's the kind of tradeoff that the engineering people at disk manufacturers make. And they are VERY good at it, assuming that (a) they know what the customer wants and how the customer will use the drive, and (b) the customer uses the drive the way it was intended to be used. As you can see, the problem here is the communication between (a) and (b). A drive that dies if you use it 24x365 is not badly engineered; it is being used incorrectly. And a drive that dies if you spin it up every hour or every day for a few minutes is also not badly engineered; it, too, is being used incorrectly.

For 90% or 99% of all disk drives, this is not a problem: Most drives made are sold to a handful of customers (Amazon, Microsoft, Google, ...), and these companies have large engineering teams that are constantly in contact with the engineering teams at the drive makers. I've been there, done that, got the T-shirt. If a big company wants to use disk drives in an unusual way (like spinning them down for long periods, or running them at unusual environmental conditions), they consult with the drive maker to get advice. And they test and measure, usually on thousands of drives for many months. I've worked with databases that contain the failure rate of several million drives, broken down exactly by things like workload, temperature, when and where the drive was made, and so on.

For amateur consumers, or pro-sumers, or hobbyists (and FreeBSD is pretty much only used in those markets), the best one can do is read the spec sheet of the drive. Pay attention to the rated number of hours, start/stops, annual write or read traffic, and the AFR. Pay extra attention to the warranty conditions. For example, if it says "5 year warranty, but only for 1000 start/stops per year, and no more than 550 TB/year of write traffic", that pretty much tells you where the edge of the drive's reliability is going to be.

If you have really old drives (older than 5-7 years), then you're just running on borrowed time. Every day of use you get out of your drive is a little gift. Cherish it.
 