Other Disklabels usable for SmartCtl

Is it possible to label disks AND use those labels in smartctl?

Long story
I use nagios to monitor drive temps on my FreeBSD ZFS file server.

Yesterday, an SSD failed and all comms to it are timing out. I power cycled the server to check for a transient problem. The SSD did not come back, and FreeBSD treated it as not connected. As expected, the ada devices shuffled their numeric labels.

The problem is that smartctl is now reporting temps for different disks. Worse, the last disk in the ada list doesn't exist anymore, and this causes a Nagios alarm. (I have ada0 to ada3. Disk ada2 failed, so after the reboot ada3 moved to ada2. Now ada3 doesn't exist and smartctl reports an error.)

Of course, the ZFS file system is fine because it uses labels.

TL;DR
How can I label ZFS-formatted disks so that smartctl can use device labels that don't change?

(Any help would be appreciated.)
 
I spent a few hours on this a year ago, when first installing 11.1. My answer was: It can't be done. Smartctl (or smartd) is not smart enough (<- obvious pun) to know that /dev/ada2p1 is a partition on device /dev/ada2, and that /dev/gpt/my_home_fs is an alias for /dev/ada2p1.

For smartctl (which is run from the command line, for one disk at a time), I came up with two potential solutions, but implemented neither. One is to grab the source code for the smart package, and start editing it to give it that capability. Too much work. The second one is to write a small shell script, which does the translation from /dev/gpt/my_home_fs to /dev/ada2p1 to /dev/ada2, and then runs smartctl on the result of that shell script. Unfortunately, that's not a trivial job, because every partition or slice and every named GPT partition is a character device of its own, and it would require some inspecting of kernel data structures to figure out that character devices 0/195, 0/171 and 0/198 all share the same underlying device.
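One possible shortcut, not tried in this thread: `glabel status` prints which provider each label sits on, so the translation might be done by parsing that text instead of inspecting kernel data structures. A minimal Python sketch, assuming the usual three-column Name/Status/Components output and partition names of the form ada2p1 or da0s1a. The function names are made up for illustration.

```python
import re


def parse_glabel_status(text):
    """Return {label: provider} from `glabel status` output (assumed layout)."""
    mapping = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the header row and anything that is not Name/Status/Components.
        if len(fields) != 3 or fields[0] == "Name":
            continue
        mapping[fields[0]] = fields[2]
    return mapping


def whole_disk(provider):
    """Strip GPT (p1) or MBR/BSD (s1, s1a) suffixes: ada2p1 -> ada2."""
    m = re.match(r"^([a-z]+\d+)(?:p\d+|s\d+[a-z]?)*$", provider)
    # Unrecognised names are returned unchanged rather than guessed at.
    return m.group(1) if m else provider


def label_to_disk(label, status_text):
    """Translate /dev/gpt/my_home_fs (or gpt/my_home_fs) to /dev/ada2."""
    name = label[len("/dev/"):] if label.startswith("/dev/") else label
    provider = parse_glabel_status(status_text)[name]  # KeyError if unknown
    return "/dev/" + whole_disk(provider)
```

To use it, the text of `glabel status` would be captured (for example with `subprocess.run(["glabel", "status"], capture_output=True, text=True).stdout`) and the result of `label_to_disk()` handed to smartctl as the device argument.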

For smartd, it's even worse: It has a configuration file, which has the disk names explicitly listed as /dev/ada2 and so on. So a similar script would have to rewrite the config file /usr/local/etc/smartd.conf after every boot, from a template. Solving this requires solving the above script (and device translation) problem, plus doing file editing from inside a script. Yuck yuck yuck, way too much work. So I gave up on it; since disk administration is so rarely necessary, I'll just do things by hand.
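The template idea could look roughly like the Python sketch below. It assumes a hypothetical template file in which each device path is written as a `{{label}}` placeholder, and a label-to-disk mapping that has already been worked out by some other means. Lines whose label cannot be resolved are commented out, so a missing disk does not leave a stale device name in the generated file.

```python
import re

# Hypothetical placeholder syntax: {{gpt/my_home_fs}}
PLACEHOLDER = re.compile(r"\{\{([^{}]+)\}\}")


def render_smartd_conf(template, label_to_disk):
    """Fill in {{label}} placeholders; comment out lines with unknown labels."""
    out = []
    for line in template.splitlines():
        labels = PLACEHOLDER.findall(line)
        if any(label not in label_to_disk for label in labels):
            # The disk was not found at boot: disable its entry.
            out.append("# disabled, label not found at boot: " + line)
            continue
        out.append(PLACEHOLDER.sub(lambda m: label_to_disk[m.group(1)], line))
    return "\n".join(out) + "\n"


template = "{{gpt/my_home_fs}} -a -W 4,45,55\n{{gpt/scratch}} -a\n"
print(render_smartd_conf(template, {"gpt/my_home_fs": "/dev/ada2"}), end="")
```

The output would then be written to /usr/local/etc/smartd.conf before smartd starts. The directives in the sample template (`-a`, `-W 4,45,55`) are only there to make the lines look realistic.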

Exact same complaint for GPT IDs, by the way.

By the way, there is another place that would really benefit from this: Many functions of camcontrol apply to the whole device, and camcontrol should usually auto-translate GPT labels or IDs to the underlying devices.

TL;DR: To my knowledge, it is an unsolved problem.
 
Not unsolved. Use smartd(8)'s -A /var/log/smartd/ option to save an attribute log. The files will be named MODEL-SERIAL.(ata|scsi).csv. You'll have to scrape the log (awk works great) for the parameters of interest, but it will always be for the same physical device. (And you get a nice log you can go back to if you ever want to track down any other trends.)
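For anyone who would rather scrape in Python than awk, here is a rough sketch. It assumes each log line is a timestamp followed by semicolon-separated id;normalized;raw triplets, and that the drive reports its temperature as attribute 194. Both should be checked against a real log file, since some drives use attribute 190 or pack extra data into the raw value.

```python
def parse_attrlog_line(line):
    """Return (timestamp, {attr_id: (normalized, raw)}) for one log line."""
    fields = [f.strip() for f in line.strip().split(";")]
    if fields and fields[-1] == "":
        fields.pop()  # drop the empty field after the trailing ';'
    timestamp, rest = fields[0], fields[1:]
    attrs = {}
    # Walk the remaining fields three at a time: id, normalized, raw.
    for i in range(0, len(rest) - 2, 3):
        attrs[int(rest[i])] = (int(rest[i + 1]), int(rest[i + 2]))
    return timestamp, attrs


def latest_raw_value(lines, attr_id=194):
    """Raw value of attr_id from the last line that contains it, else None."""
    value = None
    for line in lines:
        if not line.strip():
            continue
        _, attrs = parse_attrlog_line(line)
        if attr_id in attrs:
            value = attrs[attr_id][1]
    return value
```

A Nagios check could open the one MODEL-SERIAL csv file for the disk it cares about and pass the file object to `latest_raw_value()`, so the result never depends on which adaN number the disk received at boot.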
 
Use the smartd(5)’s -A /var/log/smartd/ option to save a log file.
Very cool! That solves half the problem: getting log output that's split by real disk identity. I just put it on my to-do list to implement this weekend.

Still doesn't solve the other half of the problem: My smartd.conf file has disk-specific settings; instead of saying "for /dev/ada3, use these parameters", I would love to give it a symbolic name.

For now, I just live with it; sometimes, smartd outputs weird stuff and error messages, and that tells me that some of the disks must have been missing at boot time, and everything got renumbered.

Speaking of disk IDs: Today, every disk that is made has a WWN, or world-wide unique name. One would think that this would make it easy for operating systems to identify disks. Unfortunately, it is not perfect enough to rely on exclusively. Older disks (parallel SCSI for example) have no unique ID or WWN. A disk can be so broken that you know it's there, but you can't get the inquiry or identify operation to work to get its ID. I've heard stories of two disks having the same ID; the only time I've personally seen that happen is in a prototype lab, where dozens of disks all had serial number = ID = zero. But I have seen two Ethernet cards that had exactly the same MAC address, including the sticker (and these weren't even prototypes or directly from the manufacturer, they were store-bought).
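If someone does want to key on serial numbers anyway, those failure modes are easy to check for. A small Python sketch, assuming the text of `smartctl -i` has already been collected per device and that it contains a "Serial Number:" line (ATA) or "Serial number:" line (SCSI). The function name is made up for illustration.

```python
import re

# Matches "Serial Number:" (ATA) and "Serial number:" (SCSI).
SERIAL_RE = re.compile(r"^Serial number:[ \t]*(\S.*?)[ \t]*$",
                       re.IGNORECASE | re.MULTILINE)


def index_by_serial(info_by_device):
    """Map serial -> device, reporting missing and duplicate serials."""
    by_serial, problems = {}, []
    for dev in sorted(info_by_device):
        m = SERIAL_RE.search(info_by_device[dev])
        if not m:
            # Broken or very old disk: no usable identity.
            problems.append(f"{dev}: no serial number reported")
            continue
        serial = m.group(1)
        if serial in by_serial:
            # Two disks claim the same identity; trust neither blindly.
            problems.append(
                f"{dev}: serial {serial} also reported by {by_serial[serial]}")
            continue
        by_serial[serial] = dev
    return by_serial, problems
```

Anything that lands in `problems` would have to be handled by hand, which matches the point above: the ID is useful but cannot be relied on exclusively.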
 