ZFS zpool clear doesn't affect FAULTED disk

Hello.

I've had a home server running FreeBSD since forever. After I upgraded to FreeBSD 10.1 (by reinstalling everything on mostly new hardware) 1.5 years ago and reshuffled my storage volumes, I got an unrecoverable error in one file (I had the storage pool running without redundancy for a while by then). I removed the file (it wasn't valuable to me) and moved to a new raidz1 setup on 3 disks. Half a year later I upgraded to 10.2 and haven't upgraded any further yet.

Recently I had a power glitch (I think) and ZFS began complaining that the drive that used to hold that deleted file is FAULTED, and the file suddenly became available again in the '--head--' snapshot. I've deleted the snapshot, scrubbed all the data, and now everything is fine except that this drive is still in the FAULTED state. I know ZFS is trying to warn me about possible future damage this drive could do to my files, but I'm sure it was my own fault that I didn't do everything right during the migration to the new system back in 2015. So I want to clear the FAULTED state from the disk and see my zpool ONLINE, not DEGRADED.

I tried zpool clear POOL DRIVE; that zeroed the READ and WRITE columns in zpool status, but the disk is still in the FAULTED state.
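Concretely, it was something like this (with my pool and device names, roughly from memory):
Code:
% sudo zpool clear stuff ada2p3
% zpool status stuff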

What can I do to trick ZFS into believing that this disk is OK?

Relevant zpool status:
Code:
  pool: stuff
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0 in 1h27m with 0 errors on Fri Aug 19 13:44:22 2016
config:

        NAME                                            STATE     READ WRITE CKSUM
        stuff                                           DEGRADED     0     0     0
          raidz1-0                                      DEGRADED     0     0     0
            gptid/54e55c16-5275-11e5-bf1a-10c37b9dc3be  ONLINE       0     0     0
            ada2p3                                      FAULTED      0     0     0  too many errors
            ada1p8                                      ONLINE       0     0     0
        logs
          gptid/92809934-5276-11e5-bf1a-10c37b9dc3be    ONLINE       0     0     0
          ada2p2                                        FAULTED      0     0     0  too many errors
          ada1p7                                        ONLINE       0     0     0
 
Check with sysutils/smartmontools to see if the disk isn't really broken. If SMART says there are no errors, you can try replacing the disk with itself: zpool replace stuff /dev/ada2p3 /dev/ada2p3.
 
Unrelated to your issue, but there's no point using partitions on the pool disks as log devices; it just increases the risk of causing more problems.
 
Check with sysutils/smartmontools to see if the disk isn't really broken.
Ok, smartctl shows that it can't collect data:
Code:
% sudo smartctl -a /dev/ada2
smartctl 6.4 2015-06-04 r4109 [FreeBSD 10.2-RELEASE amd64] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org

Read Device Identity failed: Input/output error

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
Does ZFS rely on it though?
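(smartctl itself suggests forcing it with its permissive mode; something like the line below might still coax partial data out of it, though I don't expect much from a drive in this state:)
Code:
% sudo smartctl -a -T permissive /dev/ada2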
 
Unrelated to your issue, but there's no point using partitions on the pool disks as log devices; it just increases the risk of causing more problems.
What problems can it cause? I just saw this in some guide somewhere that suggested using a separate partition for logs. I'm using different partitions for different pools anyway, so the GPT overhead is already there.
 
you can try replacing the disk with itself: zpool replace stuff /dev/ada2p3 /dev/ada2p3.
I tried that, but it can't do it unless I explicitly detach the disk from the pool first, I guess:
Code:
% sudo zpool replace -f stuff ada2p3 ada2p3
cannot replace ada2p3 with ada2p3: one or more devices is currently unavailable
It seems I'll have to live with a DEGRADED pool: even though the drive is definitely crossing the Styx, I can use it until ZFS tells me it can't. Once data errors become unbearable, I'll go shopping for a new drive.
 
Your drive is dead, your pool is now running in a degraded, non-redundant mode, and the longer you persist in running like this, the greater the likelihood you'll lose everything.

Remove the dead disk from the system. Replace it with another disk, and get the pool resilvered ASAP.

To do anything else is madness!
 
Your drive is dead, your pool is now running in a degraded, non-redundant mode, and the longer you persist in running like this, the greater the likelihood you'll lose everything.

Remove the dead disk from the system. Replace it with another disk, and get the pool resilvered ASAP.

To do anything else is madness!
When I ran zpool scrub, it did work and even cleared some errors from the zpool status output. But yeah, I see the madness in the mirror now. Thanks!
 
Pool disks are used for the intent log anyway. There's no benefit to adding separate partitions on the same disks as log devices. By the look of it there's no redundancy in the log devices, so a crash/panic/etc could leave you having to forcefully import/rewind the pool if it thinks one of those log devices has failed.

The only reason to add log devices is if the devices you are going to use have much higher write throughput/iops than the pool.
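If you decide to drop them, log vdevs can be removed from a live pool. Roughly, using the device names from your status output (a sketch only, so double-check against your setup before running it):
Code:
% sudo zpool remove stuff gptid/92809934-5276-11e5-bf1a-10c37b9dc3be ada2p2 ada1p7
% zpool status stuff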
 
Ok, smartctl shows that it can't collect data:
Code:
% sudo smartctl -a /dev/ada2
smartctl 6.4 2015-06-04 r4109 [FreeBSD 10.2-RELEASE amd64] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org

Read Device Identity failed: Input/output error

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
Does ZFS rely on it though?
No, but ZFS does rely on a functional drive. And judging by the refusal of the drive to provide even the most basic SMART data I'd say this drive is anything but functional. It's broken, which is why you can't clear the error or do anything else with the drive. Time to replace it.

I'd replace it ASAP though. RAID-Z1 will protect against data loss with one broken disk, but if a second drive breaks, the whole pool (including its data) will be gone.
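The replacement itself is straightforward. A rough sketch, assuming the new disk shows up as ada3 and you partition it so that a suitable ada3p3 exists (adjust the names to your layout):
Code:
% sudo zpool replace stuff ada2p3 ada3p3
% zpool status stuff
The log partition on the dead disk (ada2p2) will need to be replaced or removed as well.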
 
Thanks everybody for the help, I've replaced the faulty drive with a new one. At last I can enjoy having identical drives in the home server and one storage ZFS pool across all of them instead of recycling leftovers in a separate pool.
 