I get information from zfs that says the first drive in my smaller 3-way mirror has too many r/w errors, and it offers me a diskid to identify it.
That diskid is probably a hex or decimal number with about two dozen digits, right? Seen it before.
Sadly, the diskid from ZFS has little to do with the serial number that diskinfo reports (and which can also be found in dmesg, /var/log/messages, and the output of smartctl).
Unfortunately, the only way I can think of is to do what Jeckt said: look for the disk that is idle. You can do that with indicator lights, or with the iostat tool. If you use iostat, you will quickly find the name of the disk (like /dev/ada3), which you can then use to identify the model and serial number with diskinfo or smartctl.
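For example, something along these lines (the device names here are placeholders; substitute whatever your system actually has):

  # watch per-device activity once per second while the pool is busy,
  # e.g. during a scrub or a large sequential read
  iostat -x -w 1 ada0 ada1 ada2 ada3

  # the member that sits at or near zero while its siblings are busy is
  # the one ZFS has stopped using; then look up its identity
  diskinfo -v /dev/ada2
  smartctl -i /dev/ada2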
That's unfortunately common. SMART does not catch all failures, nor are all failures that SMART catches real. The good news is that SMART failures are correlated with real failures, so one should listen to it, even if it isn't perfect.
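If you want to see what SMART itself thinks, the usual starting point is something like this (again, ada2 is just a placeholder):

  # overall self-assessment, then the full attribute table
  smartctl -H /dev/ada2
  smartctl -A /dev/ada2

  # the attributes most worth watching are Reallocated_Sector_Ct,
  # Current_Pending_Sector and Offline_Uncorrectable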
In this particular case, it would be good if you posted the actual messages from ZFS. It is possible that your problem is all checksum or CRC errors on the disk, which might mean that the disk is actually perfectly OK in hardware, but someone has been scribbling on it. That causes ZFS to detect data corruption errors (duh, obviously), which it then recovers from redundant copies.
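The interesting part is the per-device counters: zpool status prints separate READ, WRITE and CKSUM columns, and which of those is nonzero tells you whether you are looking at hardware I/O errors or at silent corruption. Assuming your pool is called tank:

  # -v additionally lists any files with unrecoverable errors
  zpool status -v tank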
gpart claims several drives have no geom (I presume that's a side effect of zfs).
No, it means that someone set it up without geom. That is a bad practice, although it works fine. The advantage of using gpart and having a perfectly normal geometry on the drive is that it is easier to administer. For example, I name all my ZFS partitions with human-readable names: "HD16_home" means the Hitachi data drive I bought in 2016, in the partition that is used for the /home zpool or file system. If ZFS found errors on that disk, I would immediately know which drive to physically pull.
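For completeness, here is roughly what such a setup looks like on a fresh disk. The device name and label are placeholders, and gpart create on a disk that holds data is obviously destructive, so treat this as a sketch:

  # put a normal GPT on the disk and create one labeled ZFS partition
  gpart create -s gpt ada2
  gpart add -t freebsd-zfs -l HD16_home ada2

  # the partition now shows up as /dev/gpt/HD16_home, and that name is
  # what you hand to zpool create/attach/replace instead of the raw ada2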
Is there a way to identify the unhappy drive without repeatedly bringing the server down to disconnect all the drives one after another?
That is even more brutal than looking at LEDs, but if all else fails, you may end up having to do that. Painful!
Here is another idea. Can you identify the drive by exclusion? For example, if you have four drives, and you can identify the other three (say, in the output of "zpool status" you see /dev/ada0, ada1, and ada3), then you pretty much know that the bad one must be ada2.
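One way to do that comparison, sketched with made-up device names:

  # every disk the kernel sees, with model information
  camcontrol devlist

  # or just the short list of device names
  sysctl kern.disks

  # compare against the members zpool reports; a disk that shows up
  # above but matches nothing in the pool listing is your suspect
  zpool status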
(Both mirrors are controlled by the same LSI 9207-8i, so I'm also going to swap the fanout cables, just in case the real problem is a defective cable.)
Cabling and contact problems are the #2 source of storage system problems, so checking and replacing them is a good idea. Unfortunately, the #1 problem is humans. Which leads to the old joke: the correct way to administer a computer is to hire a man and a dog. The man is there to feed the dog. The dog is there to bite the man when he tries to touch the computer.
Good luck! Post more logs or information, and maybe we can help more.