ZFS drives going unavailable

Hello all,
For some time now, I've been struggling with hard drives in my raidz pool becoming unavailable under heavy usage. The first time it happened, I assumed a faulty disk and began replacing the drives with higher-capacity ones, as I'd been meaning to do that anyway.

Replacing and resilvering the first disk went well, but while I was replacing the second disk, the first disk suddenly went offline and the job failed, as there was no longer enough redundancy to keep the pool going. I searched around on various forums and saw folks with somewhat similar issues, but ultimately found no definite answers.

Rebooting brought the failed device back with all data intact, so I started a scrub in an attempt to conclude replacing the second drive. A couple of failed attempts later, for no discernible reason, the job finished successfully and the pool was back to full health.

At this point, I updated FreeBSD, noticed the ZFS pool version went from 13 to 14, and upgraded the pool to 14 before moving on.
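
For reference, the pool upgrade itself amounts to something like this (pool name as used throughout this post):

Code:
# zpool upgrade -v      (lists the pool versions this system supports)
# zpool upgrade tank    (upgrades the pool to the newest supported version)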

Replacing drive #3 went without a hitch; perhaps the issue had been addressed in the last FreeBSD update.

It had not. I'm now stuck on replacing drive #4, a process that has failed numerous times at this point. I've noticed that, so far, it's always been ad4 or ad6 that goes offline.
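
For context, each swap has followed roughly the standard in-place replace sequence below (ad10 picked as an example; the exact steps varied a little between attempts):

Code:
# zpool offline tank ad10      (take the old disk out of the pool)
  ... power down, swap the physical disk, boot ...
# zpool replace tank ad10      (start resilvering onto the new disk)
# zpool status tank            (watch the resilver progress)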

# atacontrol list
Code:
ATA channel 0:
    Master:      no device present
    Slave:       no device present
ATA channel 2:
    Master:  ad4 <WDC WD20EARS-00MVWB0/50.0AB50> SATA revision 2.x
    Slave:       no device present
ATA channel 3:
    Master:  ad6 <WDC WD20EARS-00MVWB0/50.0AB50> SATA revision 2.x
    Slave:       no device present
ATA channel 4:
    Master:  ad8 <WDC WD20EARS-00MVWB0/51.0AB51> SATA revision 2.x
    Slave:       no device present
ATA channel 5:
    Master: ad10 <WDC WD20EARS-00MVWB0/51.0AB51> SATA revision 2.x
    Slave:       no device present
ATA channel 6:
    Master: ad12 <ST9320421AS/SD13> SATA revision 2.x
    Slave:       no device present
ATA channel 7:
    Master:      no device present
    Slave:       no device present

I find it suspicious that it's always a drive with the older firmware that's acting up. I've contacted WD customer support in hopes they'll provide a firmware upgrade, so that I can at least eliminate it as a possible cause. Nothing so far, but I don't expect a response over the weekend.

At the time of this post, the zpool status looks like this. Ugly and scary, but the errors are due to the drives spontaneously vanishing. At least it's no longer 8000 errors. :)

# zpool status
Code:
  pool: tank
 state: DEGRADED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://www.sun.com/msg/ZFS-8000-HC
 scrub: resilver in progress for 6h13m, 22.97% done, 20h53m to go
config:

        NAME            STATE     READ WRITE CKSUM
        tank            DEGRADED     5     0     0
          raidz1        DEGRADED    12     3     0
            ad4         UNAVAIL     76  184K 2.64K  experienced I/O failures
            ad6         ONLINE      11     3     0  162M resilvered
            ad8         ONLINE       0     0     0  159M resilvered
            replacing   DEGRADED     0     0     0
              ad10/old  UNAVAIL      0  218K     0  cannot open
              ad10      ONLINE       0     0     0  171G resilvered

errors: 2 data errors, use '-v' for a list

I've issued a stop command to the scrub process, but for the time being it's ignoring me.
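
(The stop command being the usual one below; my guess is it's ignored because what's actually running is a resilver rather than a scrub:)

Code:
# zpool scrub -s tank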

Once a drive starts misbehaving, it fills up /var/log/messages with entries like these.

Code:
ad4: FAILURE - READ_DMA48 timed out LBA=380864299
ata2: SIGNATURE: ffffffff
ata2: timeout waiting to issue command
ata2: error issuing SETFEATURES SET TRANSFER MODE command
ata2: timeout waiting to issue command
ata2: error issuing SETFEATURES ENABLE RCACHE command
ata2: timeout waiting to issue command
ata2: error issuing SETFEATURES ENABLE WCACHE command
ata2: timeout waiting to issue command
ata2: error issuing SET_MULTI command
ad4: TIMEOUT - READ_DMA48 retrying (1 retry left) LBA=380864342
ata2: SIGNATURE: ffffffff
ata2: timeout waiting to issue command
ata2: error issuing SETFEATURES SET TRANSFER MODE command
ata2: timeout waiting to issue command
ata2: error issuing SETFEATURES ENABLE RCACHE command
ata2: timeout waiting to issue command
ata2: error issuing SETFEATURES ENABLE WCACHE command
ata2: timeout waiting to issue command
ata2: error issuing SET_MULTI command
ad4: TIMEOUT - READ_DMA48 retrying (1 retry left) LBA=380864470

The LBA is never the same; it keeps decreasing and eventually hits 0.
I've tried replacing the SATA controller, but with the same make/model (a Promise PDC40718 SATA300 controller), so it may still be a problem with that particular model.
Reading the FreeNAS forums suggested I try setting ATA_REQUEST_TIMEOUT in the kernel, which I did, trying different values (5, 15, 30), but ultimately it just made the drives go offline faster while letting the scrub run much, much longer before giving up.
I realize there are many threads about issues with the sector size of the WD20EARS, but they all seem to be related to performance, not outright failures.
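
For completeness, the ATA_REQUEST_TIMEOUT change mentioned above boils down to a line like this in the custom kernel config, followed by a rebuild (15 being one of the values I tried):

Code:
options         ATA_REQUEST_TIMEOUT=15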

So, before I forget why I started this post; has anyone here had similar issues, suggestions or simply an idea about what could be happening?
 
I have that same controller. It works perfectly here (for rsync) with the bwlimit parameter (you can search the forum for bwlimit), throttling the data transfer rate. Running it at the same speed as the sending side often corrupts the bsdlabel on the target drive, making it unmountable without the -force option (those are the effects I recall, anyway). There are probably known reasons why, but I don't recall them exactly, and I don't wish to disparage the purchase for those using it for rsync with bwlimit, for which it works perfectly... as I'm unsure there are better options for the PCI (not PCI-e) bus. Reading your post, however, I wonder whether enough memory is installed, and whether that might be the real factor instead.
 
tingo said:
Have you tried the simple things first? Like changing the sata cables?

Well, no, but I don't really think the issue is defective cables, since it affects two drives. I've tried running # dd if=/dev/ad4 of=/dev/null bs=1m on both drives to see if I could provoke the error, but both complete successfully, averaging ~90 MB/s throughout.

Also, the case is a compact Shuttle-like enclosure, and to get to the back of the drive bays I'd have to take the entire thing apart. Putting it together in the first place was a weekend-consuming trial in making the most of the available space.

I suppose I can try switching the drives around and see if the problem stays with the drive or the bay, but I'm not sure how ZFS would react to that.
 
"under heavy usage", your first post... using bwlimit, I limit the to-tx4 data rate to "1000" while the sending drive is sending at "10000". I can do four rsyncs at once (4000) and still have the controller send data reliably. No chance of limiting the data sent to/from the disks by that controller to half of what "heavy usage" is?
 
jb_fvwm2 said:
I have that same controller. It works perfectly here (for rsync) with the bwlimit parameter (you can search the forum for bwlimit), throttling the data transfer rate. Running it at the same speed as the sending side often corrupts the bsdlabel on the target drive, making it unmountable without the -force option (those are the effects I recall, anyway). There are probably known reasons why, but I don't recall them exactly, and I don't wish to disparage the purchase for those using it for rsync with bwlimit, for which it works perfectly... as I'm unsure there are better options for the PCI (not PCI-e) bus. Reading your post, however, I wonder whether enough memory is installed, and whether that might be the real factor instead.

The system has 2 GB of memory installed. While it's not much, I believe I read in the FreeBSD ZFS tuning guide that 1 GB was sufficient.

I looked up bwlimit, but I'm not sure how I would apply it to ZFS operations like resilvering. I'm going to look for a list of ZFS tunables, and see if there's something that'll make it ease up on the controller.
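
For starters, I'm thinking of something along these lines: list what's there, and - assuming the tunable still goes by this name on 8.1 - lower the per-vdev queue depth via /boot/loader.conf:

Code:
# sysctl vfs.zfs | less

and then in /boot/loader.conf (the value is a guess, not a recommendation):

Code:
vfs.zfs.vdev.max_pending="4"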

I forgot to attach the dmesg.boot log to my initial post, so I'm doing it here in case someone wishes to take a glance at it and point out any glaring discrepancies.
 

Attachments

  • dmesg.boot.txt
    9.3 KB
jb_fvwm2 said:
"under heavy usage", your first post... using bwlimit, I limit the to-tx4 data rate to "1000" while the sending drive is sending at "10000". I can do four rsyncs at once (4000) and still have the controller send data reliably. No chance of limiting the data sent to/from the disks by that controller to half of what "heavy usage" is?

I reread my initial post and realized I hadn't been clear on defining "heavy usage". What I meant was intensive ZFS operations like scrubbing and resilvering. I haven't noticed any issues during regular use, but of course it could still have happened without anyone noticing, as long as three drives kept running. Data on the zpool is accessed mostly by Apache and Samba. The rsync command is new to me, and reading the manual doesn't suggest it would immediately lend itself to the way the server is used, unless there's some way of applying the limitation it provides to other processes?

As for limiting the controller: I keep seeing posts reiterating that there's no way to limit SATA controllers to speeds below 150 MB/s, which is still a fair way above what my drives are willing to deliver. Perhaps there's a way of throttling ZFS, but so far it eludes me.
 
It still looks like you may have a cabling/contact problem somewhere. If you cannot replace the cables, at least try to unplug the cable at the drive end and the motherboard end and plug it back in again.

These are apparently SATA II (300 MB/s) drives. You may want to limit them to SATA I (150 MB/s) to work around any cable/controller issues.

You could also install sysutils/smartmontools from ports and check your drives with smartctl. It is possible that one or more of the drives are experiencing (internal) errors and time out while trying to recover your data.

But.. first check cables :)
 
danbi said:
It still looks like you may have a cabling/contact problem somewhere. If you cannot replace the cables, at least try to unplug the cable at the drive end and the motherboard end and plug it back in again.

These are apparently SATA II (300 MB/s) drives. You may want to limit them to SATA I (150 MB/s) to work around any cable/controller issues.

You could also install sysutils/smartmontools from ports and check your drives with smartctl. It is possible that one or more of the drives are experiencing (internal) errors and time out while trying to recover your data.

But.. first check cables :)

Well, yes, I can't rule out bad cables, even if I find it unlikely; it just seems like the symptoms would be more immediate. If it's stalled again by tomorrow, I'll go over the cables. I was very particular when attaching them to the controller when I replaced it a little while back, as I find it very tiresome to have to take the thing apart more than once per decade. :p

I installed sysutils/smartmontools and ran a short test on all drives, saving the extended test for later.
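
(For anyone unfamiliar, a short test is started per drive and then read back from the self-test log, roughly like this:)

Code:
# smartctl -t short /dev/ad4
# smartctl -l selftest /dev/ad4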

# smartctl -a /dev/ad4
Code:
smartctl 5.39.1 2010-01-28 r3054 [FreeBSD 8.1-RELEASE amd64] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WMAZA0503965
Firmware Version: 50.0AB50
User Capacity:    2,000,398,934,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue Jan 18 00:32:03 2011 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (36600) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   206   206   021    Pre-fail  Always       -       4700
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       6
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3357
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       5
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       4
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       3645
194 Temperature_Celsius     0x0022   117   114   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      3347         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I'll spare you the output of the remaining drives, as they were pretty much identical, save for the firmware version and a slightly improved spin-up time on the two with the newer firmware.

I restarted the scrub process after making a few changes.
  • Upgraded the motherboard firmware, after realizing it was several versions out of date.
  • Limited the controller to 150 MB/s.
  • Left the case open, in case the controller is overheating.
  • And it just lost a drive as I wrote this; ad6 this time.

I'll go over the cables tomorrow.
 
Aren't those the specific 2 TB drives that need some sort of modification prior to use in a RAID (a BIOS setting, a smartctl setting, or similar)? IIRC threads exist in several forums about that issue. Maybe search for ( WD20EARS raid offline forum ) for an answer. (Apologies if the search terms don't work right away; they're probably missing a word or two, or have an extra one.)
 
jb_fvwm2 said:
Aren't those the specific 2 TB drives that need some sort of modification prior to use in a RAID (a BIOS setting, a smartctl setting, or similar)? IIRC threads exist in several forums about that issue. Maybe search for ( WD20EARS raid offline forum ) for an answer. (Apologies if the search terms don't work right away; they're probably missing a word or two, or have an extra one.)

They are indeed, but sadly they were purchased and in use before I realized there would be a problem. That said, I have already read many of these threads, and the issues seem to boil down to sector misalignment, "Advanced Format", LCC (head parking) and TLER.

The first two might be one problem under two names. There were many numbers involved, anyway. This mostly seemed to affect performance, which I can live with. Besides, doing anything about it would require partitioning the disks before use, and I opted to give ZFS full rein of the original disks when I first set it up - I get the impression that I'd have to destroy the pool to make use of the partitioned disks.

LCC is the idle head parking the drives handle entirely on their own, triggering after 8 seconds of inactivity and showing up as a rising Load_Cycle_Count in SMART. I haven't noticed these drives going idle at all, though, and certainly not while resilvering. In any case, WD provides a tool to disable it.

TLER sounds like a possible culprit. Without it, a drive might spend as much as two minutes trying to recover from an error - plenty of time for most hardware RAID controllers to give up on the drive and detach it. However, software RAIDs are supposedly more patient, and one poster seemed very adamant that ZFS would have no problem with this. I found a thread about using smartctl to set the SCT ERC properties to 7 seconds, but my drives claim not to recognize it. Others, whose drives did accept the change, claim the disks just ignore the setting. It does sound like a possible contender for what I'm experiencing, except I haven't been able to find any errors on any of the drives that the system could be choking on.
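
For reference, the smartctl invocation from that thread looks something like this (values are in tenths of a second, so 70 means 7 seconds); on my drives it only produces a warning that the command isn't supported:

Code:
# smartctl -l scterc,70,70 /dev/ad4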

So the WD20EARS disks weren't the wisest choice, but using them in a ZFS pool doesn't seem completely out of the question.

I didn't get around to checking the cables tonight, but there's something I'd like to try first anyway: switching ad4<->ad8 and ad6<->ad10 to see if the problem stays with the drive or the bay. I don't know if that's possible, though, so there's some research to do before trying.
 
You can power off the system, swap the drive positions, and boot without issues. So long as all the drives are detected, ZFS will figure out which drive is which/where it is in the pool.

If you have / on UFS, you can also export the pool, then swap the drive, and import the pool.
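
Roughly (using the pool name from earlier in the thread):

Code:
# zpool export tank
  ... power off, swap the drives around, power on ...
# zpool import tank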
 
Slight update: I managed to replace the last drive. While this sudden success may well be a coincidence, it also follows an emerging pattern in how at least two of the other drives were replaced. I don't remember the first drive, but drives two, three and now four sat idle in the server for 12-36 hours before resilvering was initiated, most likely because I'd given up for the day.

I realized this while I was gathering up enough old hard drives to back up the contents of the pool, either to recreate it or to shrink it to something that would finish resilvering before drive detachment was likely to occur (this was in preparation for switching the drives around - I didn't want to mess with it too much while the last drive wasn't fully replaced). So, I rebooted the machine to get all drives back to working order and left it idle for about 24 hours before issuing the scrub command. Fourteen hours later, I get this:

# zpool status
Code:
  pool: tank
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: resilver completed after 13h48m with 0 errors on Sat Jan 22 03:37:34 2011
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad4     ONLINE       0     0     0  44K resilvered
            ad6     ONLINE       0     0     0  154K resilvered
            ad8     ONLINE       0     0     0
            ad10    ONLINE       0     0     0  778G resilvered

errors: No known data errors

Yes, it may be entirely unrelated to my observation, but the 15 or so prior failed attempts to replace ad10 suggest there might be something to it. The pool has now been running for 4 days, during which I've put a lot more stress on it than there would usually be, including copying everything to /dev/null. Not only has there not been a single hardware-related line in /var/log/messages (which is back to being filled by optimistic brute-force fools getting blocked by sshguard), but overall performance seems to have increased, at least as far as network access goes, from ~12 MB/s to ~30 MB/s for transfers through Samba.

I can hardly consider the issue solved, as I still have no idea what caused it, and I have no doubt it would show up again if I were to replace a disk, reboot, and try to resilver right away. But it seems I can get around the problem with a bit of patience, which will have to do until I can build a bigger server.

Thanks for the help and suggestions, everyone. :)
 
Successful use of WDC WD20EARS drives for ZFS on FreeBSD

It's true that these "Advanced Format" "Green" drives should be avoided for home server RAID-based systems, but for those who have purchased them without knowing this, there are ways to make them perform reliably in a FreeBSD ZFS configuration.

1) Use the "wdidle3" DOS utility to change the head parking interval from the default of 8 seconds to as high as 5 minutes:

Code:
DOS> wdidle3 /S300

This is to avoid premature failure of the drive due to a high rate of "Load/Unload Cycle Count" (as seen in SMART reports).

2) Some say these drives are unsuitable for RAID systems due to their inability to support TLER (Time Limited Error Recovery); however, sub.mesa seems to think that this is not an issue with ZFS in the following post: http://www.allquests.com/question/4087478/WD20EARS-Safe-To-Use-in-RAID.html and I have not had a single problem of this nature.

3) Align to 4 KiB boundaries... As these are Advanced Format drives, you'll get the best performance if all accesses are 4 KiB aligned. I dealt with this by having my NAS boot from different media and using these 2 TB drives for data only, with the ZFS pool created directly on the entire disks (without any MBR or GPT partition table) using gnop devices set to a 4 KiB block size.

ZFS remembers that the pool was created in this way and continues to use 4 KiB block sizes for access throughout its life:

Create the 4 KiB sector size gnop devices and build the pool on top of them:

Code:
# gnop create -S 4096 /dev/ad6
# gnop create -S 4096 /dev/ad8
# gnop create -S 4096 /dev/ad10
# gnop create -S 4096 /dev/ad12

# zpool create zroot raidz1 /dev/ad6.nop /dev/ad8.nop /dev/ad10.nop /dev/ad12.nop

We can show that the newly created pool is using 4 KiB blocks with the following command returning "12" instead of "9" (even after a reboot, when the gnop devices no longer exist):

Code:
# zdb | grep ashift
                ashift=12

Other tips for successful FreeBSD ZFS joy:

1) Use 64-bit FreeBSD with enough RAM (I opted for 8 GB in 2 x 4 GB sticks, but I'm also running virtual machines and other Java-based systems on my NAS).
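
For machines with less RAM than that, the knobs people usually pin down in /boot/loader.conf are along these lines (the figures below are only placeholders - size them to your own system):

Code:
vm.kmem_size="1536M"
vfs.zfs.arc_max="512M"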

2) If using Samba (as I am), build Samba to take advantage of Asynchronous I/O (AIO_SUPPORT) and have that kernel module loaded at boot time (in /boot/loader.conf):

Code:
aio_load="YES"
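
(Assuming the stock parameter names, the share-level settings in smb.conf that actually make use of it look something like this; the share name is just an example:)

Code:
[share]
    aio read size = 16384
    aio write size = 16384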

3) Use the modern SATA support in FreeBSD with
Code:
ahci_load="YES"
in /boot/loader.conf. (NOTE: I had no problems enabling this after the initial ZFS system was built, even though the ad6, ad8, ad10 & ad12 devices all became ada0 -> ada3.)

I typically average over 45 MB/s writing large files to my ZFS NAS from a Windows 7 machine.
 