ZFS drive allegedly failing, but which one?

I get information from zfs that says the first drive in my smaller 3-way mirror has too many r/w errors, and it offers me a diskid to identify it. But walking through the various drives using sysutils/smartmontools and gpart, I can't determine which one it is. SMART says they're okay, and gpart claims several drives have no geom (I presume that's a side effect of zfs).

Is there a way to identify the unhappy drive without repeatedly bringing the server down to disconnect all the drives one after another?

(Both mirrors are controlled by the same LSI 9207-8i, so I'm also going to swap the fanout cables just in case the real problem is a defective cable)
 
Assuming the hard drives have LED activity lights: if you offline the problem drive, it should be the one that isn't blinking when you read from or write to the mirror.
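
A minimal sketch of that test, assuming a pool named tank and a suspect da2 (substitute whatever zpool status actually reports):

Code:
# take the suspect out of the mirror, generate I/O, watch the LEDs
$ zpool offline tank da2
$ dd if=/dev/random of=/tank/testfile bs=1m count=1024
$ rm /tank/testfile
$ zpool online tank da2   # the brief delta gets resilvered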
 
diskinfo -v and diskinfo -s display the disk ID. See the diskinfo(8) manual page for details.

Code:
$ diskinfo -v /dev/ada? | egrep '/dev|ident'
/dev/ada0
        S0NFJ1DPB04336  # Disk ident.
/dev/ada1
        S0NFJ1DPB04338  # Disk ident.
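
For scripting, diskinfo -s prints just the ident (same drive as in the example above):

Code:
$ diskinfo -s /dev/ada0
S0NFJ1DPB04336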


By the way, if ZFS claims there are many R/W errors, then there should be some indication in the SMART values of that drive, too. Otherwise I would replace the drives (I mean all of them, if they're the same model) because the firmware seems to be unreliable.
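
For example, the counters that usually correlate with R/W errors can be pulled out like this (a sketch; exact attribute names vary by vendor):

Code:
$ smartctl -A /dev/ada0 | egrep -i 'realloc|pending|uncorrect|crc'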
 
I get information from zfs that says the first drive in my smaller 3-way mirror has too many r/w errors, and it offers me a diskid to identify it.
That diskid is probably a hex or decimal number with about two dozen digits, right? Seen it before.

Sadly, the diskid from ZFS has little to do with the serial number that diskinfo reports (and which can also be found in dmesg, /var/log/messages, and the output of smartctl).

Unfortunately, the only way I can think of is to do what Jeckt said: look for the disk that is idle. You can do that with indicator lights, or with the iostat tool. If you use iostat, you will quickly find the name of the disk (like /dev/ada3), which you can then use to identify the model and serial number with diskinfo or smartctl.
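
A rough sketch of the iostat approach (pool and device names are placeholders): start a scrub, or any sustained read, then watch per-device activity; the mirror member that stays idle is the suspect.

Code:
$ zpool scrub tank
$ iostat -x -w 1 da0 da1 da2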

SMART says they're okay,
That's unfortunately common. SMART does not catch all failures, nor is every failure it reports real. The good news is that SMART failures are correlated with real failures, so one should listen to it, even though it isn't perfect.

In this particular case, it would be good if you posted the actual messages from ZFS. It is possible that your problem is all checksum or CRC errors, which might mean the disk hardware is actually perfectly OK, but something has been scribbling on it; ZFS then detects the data corruption (duh, obviously) and repairs it from the redundant copies.

gpart claims several drives have no geom (I presume that's a side effect of zfs).
No, it means that someone set the pool up on the raw disks, without any partitioning. That works fine, but it is bad practice. The advantage of using gpart and having a perfectly normal partition table on the drive is that it is easier to administer. For example, I give all my ZFS partitions human-readable labels: "HD16_home" means the Hitachi drive I bought in 2016, and the partition that is used in the home zpool or file system. If ZFS reported errors on that disk, I would immediately know which drive to physically pull.
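
A sketch of that setup for a fresh disk (da9 and the label are placeholders):

Code:
$ gpart create -s gpt da9
$ gpart add -t freebsd-zfs -l HD16_home da9
$ zpool create home /dev/gpt/HD16_home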

Is there a way to identify the unhappy drive without repeatedly bringing the server down to disconnect all the drives one after another?
That is even more brutal than looking at LEDs, but if all else fails, you may end up having to do that. Painful!

Here is another idea: can you identify the drive by exclusion? For example, if you have four drives and you can identify the other three (say the output of "zpool status" shows /dev/ada0, ada1, and ada3), then the bad one pretty much must be ada2.

(Both mirrors are controlled by the same LSI 9207-8i, so I'm also going to swap the fanout cables just in case the real problem is a defective cable)
Cabling and contact problems are the #2 source of storage-system problems, so checking and replacing them is a good idea. Unfortunately, the #1 problem is humans. Which leads to the old joke: the correct way to administer a computer is to hire a man and a dog. The man is to feed the dog. The dog is to bite the man when he tries to touch the computer.

Good luck! Post more logs or information, and maybe we can help more.
 
Cabling and contact problems are the #2 source of storage-system problems, so checking and replacing them is a good idea.
Absolutely!
I once had intermittent problems with one SATA drive (CRC errors). It only occurred when the machine had a certain load, so I suspected a bug in the OS at first, i.e. the driver having a timing problem, a race condition or similar. Also, the problem was alleviated (but not completely gone) when I reduced the SATA speed. On the other hand, the other SATA drives in the same machine did not have any problems at all; it was only that one drive that had an issue.

Upon closer inspection it turned out that the SATA cable of the first drive was tied to the cable of the CPU fan. That obviously caused some kind of interference when the fan had a certain rpm speed (dependent on processor load). I separated the cables, and the problem was gone.
 
Unfortunately, the drives (WD SATA) don't seem to have connectors for telltale lights. I wish they did, because they were a very good 0th-cut diagnostic tool back in the day.

I did also consider interference from other sources, but reckoned that since both pools hang off the same controller, I should see errors in both if it were an outside-interference problem.

But here's my zpool status:
Code:
      mirror-0  ONLINE       0     0     0
        da3     ONLINE       0     0     0
        da4     ONLINE       0     0     0
        da5     ONLINE       0     0     0

errors: No known data errors

  pool: files
state: DEGRADED
status: One or more devices has been removed by the administrator.
    Sufficient replicas exist for the pool to continue functioning in a
    degraded state.
action: Online the device using 'zpool online' or replace the device with
    'zpool replace'.
  scan: resilvered 114K in 0h0m with 0 errors on Tue Aug 20 13:44:01 2019
config:

    NAME                                            STATE     READ WRITE CKSUM
    files                                           DEGRADED     0     0     0
      mirror-0                                      DEGRADED     0     0     0
        17164376090346451623                        REMOVED      0     0     0  was /dev/diskid/DISK-WD-WMC6M0H25LMRp2
        gptid/5cb49dec-ec53-11e5-a523-0cc47a796c56  ONLINE       0     0     0
        gptid/65242035-ec53-11e5-a523-0cc47a796c56  ONLINE       0     0     0

I thought I had followed SirD's suggestion and labeled them when setting up the pools, but either zfs ignores the labels when reporting status, or my memory is faulty.
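
zpool status shows whatever name each vdev happened to be attached under (gptid, diskid, or plain device node), so your labels may simply not have been used at attach time. To map the gptid names back to daX providers, glabel(8) should do it; the Components column shows the underlying device:

Code:
$ glabel status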

And here's the output from diskinfo -v. Unless my eyes are skipping over it, I can't see "WMC6".

Code:
9:39 Wed, 21 Aug [momcat:root]~> diskinfo -v /dev/da*
/dev/da0
    512             # sectorsize
    1000204886016    # mediasize in bytes (932G)
    1953525168      # mediasize in sectors
    4096            # stripesize
    0               # stripeoffset
    121601          # Cylinders according to firmware.
    255             # Heads according to firmware.
    63              # Sectors according to firmware.
    WD-WCC3F5142221    # Disk ident.
    Not_Zoned       # Zone Mode

/dev/da0s1
    512             # sectorsize
    1000204853760    # mediasize in bytes (932G)
    1953525105      # mediasize in sectors
    4096            # stripesize
    3584            # stripeoffset
    121601          # Cylinders according to firmware.
    255             # Heads according to firmware.
    63              # Sectors according to firmware.
    WD-WCC3F5142221    # Disk ident.
    Not_Zoned       # Zone Mode

/dev/da0s1b
    512             # sectorsize
    25769803776     # mediasize in bytes (24G)
    50331648        # mediasize in sectors
    4096            # stripesize
    3584            # stripeoffset
    3133            # Cylinders according to firmware.
    255             # Heads according to firmware.
    63              # Sectors according to firmware.
    WD-WCC3F5142221    # Disk ident.
    Not_Zoned       # Zone Mode

/dev/da0s1d
    512             # sectorsize
    974435049984    # mediasize in bytes (908G)
    1903193457      # mediasize in sectors
    4096            # stripesize
    3584            # stripeoffset
    118468          # Cylinders according to firmware.
    255             # Heads according to firmware.
    63              # Sectors according to firmware.
    WD-WCC3F5142221    # Disk ident.
    Not_Zoned       # Zone Mode

/dev/da1
    512             # sectorsize
    1000204886016    # mediasize in bytes (932G)
    1953525168      # mediasize in sectors
    0               # stripesize
    0               # stripeoffset
    121601          # Cylinders according to firmware.
    255             # Heads according to firmware.
    63              # Sectors according to firmware.
    WD-WCAW3M5AX61S    # Disk ident.
    Not_Zoned       # Zone Mode

/dev/da1p1
    512             # sectorsize
    17179869184     # mediasize in bytes (16G)
    33554432        # mediasize in sectors
    0               # stripesize
    17408           # stripeoffset
    2088            # Cylinders according to firmware.
    255             # Heads according to firmware.
    63              # Sectors according to firmware.
    WD-WCAW3M5AX61S    # Disk ident.
    Not_Zoned       # Zone Mode

/dev/da1p2
    512             # sectorsize
    966367641600    # mediasize in bytes (900G)
    1887436800      # mediasize in sectors
    0               # stripesize
    17408           # stripeoffset
    117487          # Cylinders according to firmware.
    255             # Heads according to firmware.
    63              # Sectors according to firmware.
    WD-WCAW3M5AX61S    # Disk ident.
    Not_Zoned       # Zone Mode

/dev/da2
    512             # sectorsize
    1000204886016    # mediasize in bytes (932G)
    1953525168      # mediasize in sectors
    0               # stripesize
    0               # stripeoffset
    121601          # Cylinders according to firmware.
    255             # Heads according to firmware.
    63              # Sectors according to firmware.
    WD-WCAW34RYCH31    # Disk ident.
    Not_Zoned       # Zone Mode

/dev/da2p1
    512             # sectorsize
    17179869184     # mediasize in bytes (16G)
    33554432        # mediasize in sectors
    0               # stripesize
    17408           # stripeoffset
    2088            # Cylinders according to firmware.
    255             # Heads according to firmware.
    63              # Sectors according to firmware.
    WD-WCAW34RYCH31    # Disk ident.
    Not_Zoned       # Zone Mode

/dev/da2p2
    512             # sectorsize
    966367641600    # mediasize in bytes (900G)
    1887436800      # mediasize in sectors
    0               # stripesize
    17408           # stripeoffset
    117487          # Cylinders according to firmware.
    255             # Heads according to firmware.
    63              # Sectors according to firmware.
    WD-WCAW34RYCH31    # Disk ident.
    Not_Zoned       # Zone Mode

/dev/da3
    512             # sectorsize
    4000787030016    # mediasize in bytes (3.6T)
    7814037168      # mediasize in sectors
    0               # stripesize
    0               # stripeoffset
    486401          # Cylinders according to firmware.
    255             # Heads according to firmware.
    63              # Sectors according to firmware.
    K4HE26JB                # Disk ident.
    Not_Zoned       # Zone Mode

/dev/da4
    512             # sectorsize
    4000787030016    # mediasize in bytes (3.6T)
    7814037168      # mediasize in sectors
    0               # stripesize
    0               # stripeoffset
    486401          # Cylinders according to firmware.
    255             # Heads according to firmware.
    63              # Sectors according to firmware.
    K4GLKN8B                # Disk ident.
    Not_Zoned       # Zone Mode

/dev/da5
    512             # sectorsize
    4000787030016    # mediasize in bytes (3.6T)
    7814037168      # mediasize in sectors
    0               # stripesize
    0               # stripeoffset
    486401          # Cylinders according to firmware.
    255             # Heads according to firmware.
    63              # Sectors according to firmware.
    WD-WMC130F5SLXM    # Disk ident.
    Not_Zoned       # Zone Mode

/dev/da7
    512             # sectorsize
    2000398934016    # mediasize in bytes (1.8T)
    3907029168      # mediasize in sectors
    4096            # stripesize
    0               # stripeoffset
    243201          # Cylinders according to firmware.
    255             # Heads according to firmware.
    63              # Sectors according to firmware.
    WD-WCC4M5YX0R26    # Disk ident.
    Not_Zoned       # Zone Mode
 
And here's the output from diskinfo -v. Unless my eyes are skipping over it, I can't see "WMC6".
Interesting. Have you tried grep -i WMC6 /var/run/dmesg.boot?
Another thing to try is camcontrol identify /dev/da0 | grep serial (repeat for all disks).
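
Something like this loop covers all of them at once (Bourne-shell syntax, untested sketch):

Code:
$ for d in /dev/da?; do echo "== $d"; camcontrol identify $d | grep -i serial; done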
 
PS: Maybe the disk was detached by the driver? In that case, the device node in /dev is gone, of course. I notice there's a gap in your diskinfo output: /dev/da5 is followed by /dev/da7, but /dev/da6 is missing. So maybe /dev/da6 is the faulty disk?

In that case, you should see a detach message in the kernel output (use dmesg(8) or look at /var/log/messages). Also, the disk should still be visible in the kernel output from the last boot when it was attached, i.e. grep -i WMC6 /var/run/dmesg.boot should reveal it.
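
Concretely (da6 per the gap noted above; adjust if your numbering differs):

Code:
$ grep -i WMC6 /var/run/dmesg.boot
$ grep da6 /var/log/messages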
 
It must indeed have been da6, because da6 no longer shows up in ls /dev/da* (or in dmesg.boot).
 
Well, if da6 is gone, you have a problem: it's really hard to find a disk that isn't there any longer. You may have to resort to physically identifying all the other disks. Crazy suggestion: write down the model and serial number of every surviving disk on a piece of paper. Take the server apart and inspect all the disks; the serial numbers are printed on a paper label on each one. The bad disk is the one whose serial is missing from your piece of paper.
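
A quick way to produce that list (Bourne-shell sketch; depending on the HBA, smartctl may need -d sat):

Code:
$ for d in /dev/da?; do echo "== $d"; smartctl -i $d | egrep 'Model|Serial'; done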

At least your server is still booting. About 5 years ago, I had a disk fail so hard, the server would neither run nor boot with the disk plugged in. That was diagnosed by disconnecting all disks, plugging them in one at a time (resetting every time), and doing combinations and permutations until I found the offending (offensive?) disk.
 