ZFS System freeze possibly caused by bad disk?

Is it normal for a FreeBSD system to freeze when an HDD in a RAIDZ2 pool starts producing errors?

I have an 11.2-RELEASE-p5 system running two pools:
ZPOOL - mirror - 2 SSDs: boot and root partitions
DPOOL - RAIDZ2 - 6 WD Red HDDs: personal data

Within the DPOOL I have had a spate of disks failing over the last few years. Initially I see CDB errors in the log pertaining to a particular disk. Then the system will start to freeze. It will respond to pings but not SSH. Upon reboot everything is fine. The ZFS pools are fine. Then it will freeze with increasing regularity. No indications in the log at the time of the freeze.

When I replace the DPOOL disk that was logging the CDB errors, the freezes go away - until another disk starts failing. (I think I got a bad batch of WD Reds - they last about 2 years.)

So, is it normal for a WD Red NAS drive in a RAIDZ2 pool to be able to hang a system??? I assumed the drive would time out, ZFS would flag the error, and the system would continue operating.
 
So, is it normal for a WD Red NAS drive in a RAIDZ2 pool to be able to hang a system???
Normal, no. But it is possible; I've had it happen a couple of times: one broken disk in a ZFS pool stalling the entire pool. Sometimes disks get bad enough that they start showing time-outs or other errors, and this can unfortunately hang up the entire bus, which in turn causes the whole system to stall on disk access. I typically just remove the disk, even if I don't have a replacement for it yet. The pool will obviously be in a degraded state but will continue to work without stalling the system.
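For reference, a minimal sketch of that step (pool and device names below are placeholders; check zpool status first, since pool members are often listed by GPT label or gptid rather than by raw device name):
Code:
# see how the pool currently names its members
zpool status dpool

# take the suspect member offline; a RAIDZ2 vdev stays usable, just DEGRADED
zpool offline dpool ada2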
 
Exactly. I had one a few years ago at home where the SATA disk completely hung the system, to the point that it wouldn't boot (not even a BIOS screen), and when you plugged it in the system would come to a complete freeze. In such a situation, there is nothing the OS can do.

In theory, all IO operations should time out (typically after somewhere between one second and five minutes), the OS should be able to keep doing operations against other devices in the meantime, and it should resume normal operation once the sick or dead device is disabled. In practice, this doesn't work on commodity hardware. To get to that point, you need to very carefully align all the software and firmware versions, fix all bugs in the software stack (from the application through the OS and HBA firmware down to the device), and make sure you only run configurations that have been fully debugged and tested. Been there, done that; it takes many man-years. That's why reliable enterprise storage costs big money.
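(If you want to see what your release exposes for this, the CAM retry/timeout knobs are visible via sysctl; exact OID names vary between versions, so the grep below is deliberately loose:)
Code:
sysctl kern.cam | grep -E 'retry|timeout'
# e.g. kern.cam.ada.retry_count is how many times CAM retries a failed
# ATA command before the error is reported upward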
 
Okay. So I really want to continue using ZFS on FreeBSD but I can't justify the effort if my system regularly freezes.

To elaborate further, in the weeks leading up to the system freezes I see error messages like this:
Code:
Oct 25 02:32:59 medusa kernel: (ada2:ahcich3:0:0:0): READ_FPDMA_QUEUED. ACB: 60 04 87 9c da 40 58 01 00 00 00 00
Oct 25 02:32:59 medusa kernel: (ada2:ahcich3:0:0:0): CAM status: Uncorrectable parity/CRC error
Oct 25 02:32:59 medusa kernel: (ada2:ahcich3:0:0:0): Retrying command
Once the system starts freezing I replace the disk that is producing the CRC errors and the system is stable until the next disk dies. I have replaced 4 out of 6 disks over a 5 year period.

I thought I was using reasonable hardware:
  • Supermicro X10SL7-F Motherboard
  • Intel Xeon Processor E3-1231V3B
  • Crucial 16GB (8GBx2) DDR3 ECC Server Memory
Are there any suggestions to solve this problem?

My current plan is to slowly replace the WD disks with Seagate IronWolf disks. Should I look at a more robust PCIe disk controller?
 
My sense is that ZFS is quite possibly saving you from even greater harm.

WD Reds are not at all good for reliability. The 3 TB drives are probably the worst. But losing 4 out of 6 drives seems excessive. I would not rule out a bad batch or firmware issues, but my first instinct would be to check for cooling or vibration issues.

Is ZFS not reporting a degraded status for the disk concerned, from zpool status?

Are you running the SMART daemon, smartd(8)? If not, I recommend you install it (sudo pkg install smartmontools) and enable it (in /etc/rc.conf: smartd_enable="YES").
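A minimal smartd.conf to go with that (the schedule below is just one reasonable choice; smartd.conf(5) documents the -s regex):
Code:
# /usr/local/etc/smartd.conf
# Monitor all disks and attributes, run a short self-test daily at 02:00
# and a long self-test on Saturdays at 03:00, and mail warnings to root.
DEVICESCAN -a -o on -S on -s (S/../.././02|L/../../6/03) -m root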

Showing us the output of smartctl -H -a /dev/ada2 would help with the diagnosis.

The Backblaze Hard Drive Stats are worth checking before you make a decision on replacements.
 
I have never got to the point where ZFS fails the drive or even reports any errors. I get the aforementioned retries in the log and then the system freezes.

When I replace the disk, ZFS happily resilvers and the system freezes stop.
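(For anyone reading along, the replacement step I use is roughly this; dpool and ada2 are placeholders for my pool and the failing member:)
Code:
# after physically swapping the drive into the same bay
zpool replace dpool ada2

# watch the resilver progress
zpool status dpool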

I do have smartd (from smartmontools) enabled. I got a report of errors once, but the other three disks and the currently failing disk report nothing.

As requested, here are the diagnostics for ada2:
Code:
root@medusa:~#  smartctl -H -a /dev/ada2
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.2-RELEASE-p5 amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD30EFRX-68N32N0
Serial Number:    WD-WCC7K3xxxxxx
LU WWN Device Id: 5 0014ee 20e86b5f8
Firmware Version: 82.00A82
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu Nov 21 17:57:34 2019 AEDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (29340) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 313) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.
SCT capabilities:            (0x303d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   177   167   021    Pre-fail  Always       -       6108
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       21
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   071   071   000    Old_age   Always       -       21300
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       21
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       13
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       636
194 Temperature_Celsius     0x0022   114   107   000    Old_age   Always       -       36
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     21212         -
# 2  Short offline       Completed without error       00%     21045         -
# 3  Short offline       Completed without error       00%     20877         -
# 4  Short offline       Completed without error       00%     20715         -
# 5  Short offline       Completed without error       00%     20548         -
# 6  Short offline       Completed without error       00%     20380         -
# 7  Short offline       Completed without error       00%     20212         -
# 8  Short offline       Completed without error       00%     20045         -
# 9  Short offline       Completed without error       00%     19880         -
#10  Short offline       Completed without error       00%     19721         -
#11  Short offline       Completed without error       00%     19554         -
#12  Short offline       Completed without error       00%     19386         -
#13  Short offline       Completed without error       00%     19224         -
#14  Short offline       Completed without error       00%     19056         -
#15  Short offline       Completed without error       00%     18888         -
#16  Short offline       Completed without error       00%     18720         -
#17  Short offline       Completed without error       00%     18553         -
#18  Short offline       Completed without error       00%     18385         -
#19  Short offline       Completed without error       00%     18217         -
#20  Short offline       Completed without error       00%     18049         -
#21  Short offline       Completed without error       00%     17881         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

You can see that it has operated for 2.5 years with a handful of power cycles, no sector errors, and self-tests at regular intervals without error.

Thoughts?
 
It seems that this has nothing to do with ZFS. The underlying IO layer gets errors, and freezes. What do you expect ZFS to do when the system freezes? Defrost it magically?

It may even have nothing to do with FreeBSD, and instead be a problem in the IO layer underneath it. As I said above, it's perfectly possible for a bad disk and HBA (such as a SATA interface) to freeze the system at a level where the OS doesn't even run any longer. Actually, your description (ping still works, everything else hangs) indicates that normal processing has stalled, but interrupt handlers still run.

While SMART data can be helpful in diagnosing disk problems, it isn't always capable of doing so, and in your particular case it is unlikely to help. Here's why: the IO errors you are seeing are parity/CRC errors. Those are not internal errors inside the disk drive, but errors on the communication link between the disk drive and the host. That also matches the observation that the whole system freezes: probably the error handling in your IO interface is very badly implemented (common on commodity hardware).
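That said, most drives do count interface CRC errors in SMART attribute 199 (UDMA_CRC_Error_Count), so comparing that counter across all the pool members can still help localise a bad cable or port. A rough sketch, assuming the members show up as ada devices:
Code:
for d in /dev/ada?; do
    echo "== $d"
    smartctl -A "$d" | grep -E 'UDMA_CRC_Error_Count|Current_Pending_Sector|Reallocated_Sector_Ct'
done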

My educated guess: you have a serious hardware problem with your disk interfaces. Judging by your description, those are probably just SATA cables running directly from the motherboard to the disks. I would look at the power supply and the SATA cabling. To begin with, figure out whether the problem is specific to one particular SATA cable or not. Also check whether your power supply is being overloaded. Buying a PCIe disk controller or replacing the disks (WD -> Seagate) would probably fix the problem too, but that is overkill; this problem should be solvable with your existing hardware.
 
Your WD Reds are the same model as mine, but somewhat newer (more recent firmware). Your hardware choices look well considered. The SMART report looks good (temperature OK and no indication of anything anomalous).

As ralphbsz suggests, power supply, cables, and SATA interfaces are probably next on the suspect list.

Does the power supply have sufficient capacity for the base system plus the disks? Each WD30EFRX disk draws up to 1.73 amps (about 21 watts) from the 12 V rail.
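As a rough worked check using that figure (worst case, with all six drives spinning up simultaneously):
Code:
6 drives x 1.73 A ≈ 10.4 A on the 12 V rail at spin-up
10.4 A x 12 V     ≈ 125 W peak for the drives alone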

Is it a quality branded power supply (the generics supplied with many cases tend to be poor, sometimes abysmal)?

If quality or capacity of the PSU is suspect, replacing the power supply with something highly regarded like a Seasonic should not break the bank.

I would also shut down, re-seat all the power cables, re-seat and tag all the SATA cables (at both source and destination), and keep a diary to see whether anything follows a cable, controller port, or controller.

I have a 620 W Seasonic power supply on hardware similar to yours. I also have an LSI 9211-8i SAS2008 disk controller. There are 5 x WD Reds on the LSI controller for ZFS, and 2 x WD VelociRaptors on motherboard SATA for a gmirror UFS dual boot. One WD Red was lost early on. No problems since.

A quick check with Google suggests that LSI 9211-8i SAS2008 cards are very affordable too, but I would canvass this list for recommendations before buying.

Lastly, please consider the quality of the inbound power. Is anything else electronic playing up?
 
The power supply is a Corsair CS650M 650 W 80+ Gold, installed in March 2018, so I am assuming that is not the source of the issue. The 12 V rail has a capacity of 51 A (612 W). With no graphics cards or other PCI cards, only the CPU, motherboard, and stock fan cooler are drawing current. That should leave plenty for the disks.

I will take the advice to re-seat the power cables and to treat the SATA cables as suspects. Starting a log is a good idea. I have enough spare SATA cables, so once they are tagged for tracking, I will replace half the cables and see if anything changes.

The bad/good news is that the system has failed overnight for each of the past three days. Nothing in the log files, just a frozen system in the morning. I've set up better monitoring from another system to determine the exact failure time, so that I can see whether any tasks are correlated with the freeze.

gpw928 Can I ask why you chose a disk controller rather than using motherboard SATA ports? I am only using motherboard SATA ports and wonder whether the motherboard just doesn't cope with so many ports in use.
 
You are almost at the point where a hardware engineer would start switching out components, one at a time. Trouble is they have the spare parts...

I would re-seat the memory too. Also run with half of it at a time, if that's possible.

Your motherboard has three different types of SATA/SAS ports. I'd be tracking each fault back to the header.

Is ada2, above, one of your boot disks? Are your DPOOL disks on your LSI 2308 SAS Controller? Do the DPOOL disks show up as da[0-5]?
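You can see that mapping directly with camcontrol(8); ada devices hang off the on-board AHCI SATA ports, while disks behind the LSI SAS controller appear as da devices (the output line below is illustrative only):
Code:
camcontrol devlist
# <WDC WD30EFRX-68N32N0 82.00A82>  at scbus3 target 0 lun 0 (ada2,pass2)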

Is it only the ada disks that play up?

I bought the LSI disk controller purely for capacity. There were not enough on-board SATA ports. But it does mean that my ZFS disks are using a different driver (da, not ada).

The frequent failures are actually helpful for diagnosis. If the same disk is being identified on the console, you can try switching the power and SATA cables attached to it.
 
The bad/good news is that the system has failed overnight for each of the past three days. Nothing in the log files, just a frozen system in the morning. I've set up better monitoring from another system to determine the exact failure time, so that I can see whether any tasks are correlated with the freeze.
That's actually a really good idea. Write a little script that records as many things as you can think of, and does it relatively frequently: CPU load (vmstat or something like that), IO workload (iostat), temperature (motherboard, and perhaps from SMART), number of active tasks. Perhaps every minute or so. Append to a file on another disk, and sync that file right away.
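A rough sketch of such a script (log path, interval, and device names are all placeholders to adjust; ideally the log lives on a disk outside the suspect pool):
Code:
#!/bin/sh
LOG=/var/log/freeze-monitor.log

while :; do
    {
        echo "==== $(date)"
        uptime                                    # load averages
        vmstat 1 2 | tail -1                      # CPU/memory snapshot (second sample)
        iostat -x -w 1 -c 2                       # per-device I/O; second report shows current activity
        echo "tasks: $(ps ax | wc -l)"            # number of active tasks
        sysctl dev.cpu.0.temperature 2>/dev/null  # CPU temperature, if coretemp(4) is loaded
        for d in /dev/ada?; do                    # drive temperatures from SMART
            t=$(smartctl -A "$d" | awk '/Temperature_Celsius/ {print $10}')
            echo "$d: ${t}C"
        done
    } >> "$LOG"
    sync                                          # push the log out to disk right away
    sleep 60
done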

A few years ago, I used an external backup disk connected via USB 2.0, and it regularly gave me trouble: the ZFS file system on that disk was unhappy, lots of IO errors, and all that. It usually happened in the middle of the night. Through a really simple script I discovered that the problem was the big nightly job of ZFS scrubbing the backup disk; I presume the intense workload of a scrub that runs for an hour or two caused something to overheat. I turned scrubbing off for a few days, and the problem went away. Then I changed the hardware around (first USB 2.0 -> eSATA, then USB 3.0 once FreeBSD supported it), and the problem never came back. But the real clue to the random IO errors was knowing what was happening at the time.

gpw928 Can I ask why you chose a disk controller rather than using motherboard SATA ports? I am only using motherboard SATA ports and wonder whether the motherboard just doesn't cope with so many ports in use.
In my opinion, the motherboard should cope with all SATA ports running at full speed all the time, or with any other workload. BUT: bugs exist, and intense workloads like to tickle bugs more.

One more debugging suggestion: Get some extra fans (could even be an external floor fan), and cool your system much better for a few days. That can help identify heat-related problems.
 
gpw928 Great find!!

Release 12.1 contains revision 352735, which has a bunch of fixes - half of which could be relevant. I will wait and try the SATA cables first, but it is tempting to move to 12.1 to capture those "stability fixes", which include a fix for "mpr/mps crash badly" (rev 352741).

I'll start with the cable/port tracking, and swap some cables.
 
Just a word of caution. I don't believe ada2, shown in your post above with uncorrectable parity/CRC errors, uses the mps driver. But your motherboard has an integrated LSI 2308 SAS Controller which would use it.

My FreeBSD ZFS server ran OK on FreeBSD 10.2. But I skipped 11.x and upgraded straight to 12.0.
 
Just an update for gpw928 and ralphbsz who were kind enough to help me with this issue.

Your "first instinct" to check cooling appears to be correct!

One of the two non-CPU heatsinks on the motherboard has always been very hot to the touch. I placed an 80 mm fan above the heatsink to direct air onto it. I have not had a single problem since adding the fan.

I was making changes one step at a time. Previously I tried changing SATA cables and that did not have any effect. Adding this one fan has stopped the overnight system hangs. I haven't tried removing the fan for a scientific confirmation....

I guess I have discovered why extra care should be taken when running server motherboards inside consumer desktop cases. I have now also added an extra case fan.

Thanks for your help.
 