Solved disk lost when going multiuser (diskid problem?)

The machine worked correctly with two disks until I added a third one. The new disk appears as ada0, and the old ones have changed names (from ada0 and ada1 to ada1 and ada2; the appropriate adjustments in /etc/fstab etc. have been made).

The new ada0 disk is not meant to be used; it was added for a burn-in test, and it must sit on a lower SATA port because it can do SATA 600.

But now the following problem appears. When booting singleuser, everything is fine and looks as expected:
Code:
root@# ls /dev/ada*
/dev/ada0       /dev/ada1s1a    /dev/ada1s1e    /dev/ada1s2     /dev/ada2s1a    /dev/ada2s1e    /dev/ada2s2
/dev/ada1       /dev/ada1s1b    /dev/ada1s1f    /dev/ada2       /dev/ada2s1b    /dev/ada2s1f    /dev/ada2s2a
/dev/ada1s1     /dev/ada1s1d    /dev/ada1s1g    /dev/ada2s1     /dev/ada2s1d    /dev/ada2s1g
root@# gvinum ld
2 drives:
D a10                   State: up       /dev/ada1s1f    A: 38068/59592 MB (63%)
D a11                   State: up       /dev/ada2s1f    A: 38068/59592 MB (63%)

But after going multiuser, ada1 is lost:
Code:
root@# ls /dev/ada*
/dev/ada0       /dev/ada2       /dev/ada2s1a    /dev/ada2s1d    /dev/ada2s1f    /dev/ada2s2
/dev/ada1       /dev/ada2s1     /dev/ada2s1b    /dev/ada2s1e    /dev/ada2s1g
root@# gvinum ld
2 drives:
D a11                   State: up       /dev/ada2s1f    A: 38068/59592 MB (63%)
D a10                   State: down     /dev/???        A: 0/0 MB (0%)

I have not yet found a way to get that disk back in multiuser mode, so gvinum currently runs on broken mirrors.
The ada1 disk itself works and is accessible with dd.
The trouble is triggered by /etc/rc.d/zvol, or by any zfs command: during the first zfs invocation, a bunch of errors from "g_access" appears, with error code 6 (presumably ENXIO).
ZFS itself finds its data, but now uses different paths:
Code:
        NAME                                STATE     READ WRITE CKSUM
        build                               ONLINE       0     0     0
          mirror-0                          ONLINE       0     0     0
            diskid/DISK-WD-WCASY7821919s1g  ONLINE       0     0     0
            ada2s1g                         ONLINE       0     0     0

Investigating further, I found weird things in the output of "gpart show".
In singleuser mode it shows:
Code:
root@# gpart show
=>       63  976773105  ada1  MBR  (466G)
         63  242769933     1  freebsd  (116G)
  242769996  734003172     2  !191  (350G)

=>        0  242769933  ada1s1  BSD  (116G)
          0         16          - free -  (8.0K)
         16    1200000       1  freebsd-ufs  (586M)
    1200016    2000000       4  freebsd-ufs  (977M)
    3200016     200000       5  freebsd-ufs  (98M)
    3400016  122045271       6  freebsd-vinum  (58G)
  125445287   10485760       2  freebsd-swap  (5.0G)
  135931047  106838886       7  !10  (51G)

=>       63  976773105  diskid/DISK-WD-WCASY7821919  MBR  (466G)
         63  242769933                            1  freebsd  (116G)
  242769996  734003172                            2  !191  (350G)

=>       63  976773105  ada2  MBR  (466G)
         63  242769933     1  freebsd  [active]  (116G)
  242769996  734003172     2  !191  (350G)

=>        0  242769933  diskid/DISK-WD-WCASY7821919s1  BSD  (116G)
          0         16                                 - free -  (8.0K)
         16    1200000                              1  freebsd-ufs  (586M)
    1200016    2000000                              4  freebsd-ufs  (977M)
    3200016     200000                              5  freebsd-ufs  (98M)
    3400016  122045271                              6  freebsd-vinum  (58G)
  125445287   10485760                              2  freebsd-swap  (5.0G)
  135931047  106838886                              7  !10  (51G)

=>        0  242769933  ada2s1  BSD  (116G)
          0         16          - free -  (8.0K)
         16    1200000       1  freebsd-ufs  (586M)
    1200016    2000000       4  freebsd-ufs  (977M)
    3200016     200000       5  freebsd-ufs  (98M)
    3400016  122045271       6  freebsd-vinum  (58G)
  125445287   10485760       2  freebsd-swap  (5.0G)
  135931047  106838886       7  !10  (51G)

=>        0  734003172  ada2s2  BSD  (350G)
          0         16          - free -  (8.0K)
         16  734003156       1  !0  (350G)

After going multiuser (or after any zfs command) this changes to:
Code:
root@# gpart show
=>       63  976773105  diskid/DISK-WD-WCASY7821919  MBR  (466G)
         63  242769933                            1  freebsd  (116G)
  242769996  734003172                            2  !191  (350G)

=>       63  976773105  ada2  MBR  (466G)
         63  242769933     1  freebsd  [active]  (116G)
  242769996  734003172     2  !191  (350G)

=>        0  242769933  diskid/DISK-WD-WCASY7821919s1  BSD  (116G)
          0         16                                 - free -  (8.0K)
         16    1200000                              1  freebsd-ufs  (586M)
    1200016    2000000                              4  freebsd-ufs  (977M)
    3200016     200000                              5  freebsd-ufs  (98M)
    3400016  122045271                              6  freebsd-vinum  (58G)
  125445287   10485760                              2  freebsd-swap  (5.0G)
  135931047  106838886                              7  !10  (51G)

=>        0  242769933  ada2s1  BSD  (116G)
          0         16          - free -  (8.0K)
         16    1200000       1  freebsd-ufs  (586M)
    1200016    2000000       4  freebsd-ufs  (977M)
    3200016     200000       5  freebsd-ufs  (98M)
    3400016  122045271       6  freebsd-vinum  (58G)
  125445287   10485760       2  freebsd-swap  (5.0G)
  135931047  106838886       7  !10  (51G)

It seems disks can appear by name, by diskid, or by both, but I do not currently understand what governs which form appears.

Probably the most effective approach would be to zero out both disks and rebuild the partitioning scheme from scratch. But I do not like that approach; I would prefer to understand what is wrong and to fix precisely the offending bytes.
The system currently runs Release 11.1, but was originally installed with Release 2.1 and has been upgraded piecewise ever since.

Is there some kind of paper/documentation that might be helpful in understanding the secrets of the diskid scheme and how it is supposed to work?
 
Workaround:
Adding kern.geom.label.disk_ident.enable=0 to /boot/loader.conf makes the diskid devices disappear completely, and all signs of the problem disappear with them. So far I see no unwanted side effects.
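For reference, the workaround is a single loader tunable; it goes into /boot/loader.conf and takes effect at the next boot:

```
# /boot/loader.conf -- disable generation of /dev/diskid/* label nodes
kern.geom.label.disk_ident.enable=0
```

The value currently in effect can be inspected at runtime with sysctl kern.geom.label.disk_ident.enable.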

Forensics:
It seems that nowadays a disk magically acquires a diskid identifier when it is treated with "fdisk -i" for the first time. This was not yet the case when ada2 was added, but it was when ada1 was added.

Conclusion:
Under certain circumstances this machinery seems to behave strangely. I would consider the behaviour incorrect, but further investigation may be necessary.
It is still unclear to me what the most correct ("best practice") way of handling this would be. Suggestions are welcome.
 
The "best" way is to label the filesystems and use those labels in /etc/fstab. Then the actual order of the drives becomes irrelevant.
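A minimal sketch of that approach (the label names below are made up): a generic GEOM label can be written with glabel(8), e.g. glabel label mydata ada1s1f, and a UFS filesystem can alternatively carry a label set with tunefs -L, which shows up under /dev/ufs/. /etc/fstab then refers to the stable label paths instead of the ada* names:

```
# /etc/fstab fragment using labels instead of device names
# (hypothetical label names "mydata" and "varfs")
# Device              Mountpoint  FStype  Options  Dump  Pass#
/dev/label/mydata     /data       ufs     rw       2     2
/dev/ufs/varfs        /var        ufs     rw       2     2
```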

Sorry, but this is something different!

The problem here is with disk ID labels, not with filesystem labels. And the problem (a disappearing drive) happens before /etc/fstab comes into play: by the time mounting starts, the drive is already gone.

These disk ID labels are mentioned in glabel(8), but there is not much of a clue there:

Generic disk ID strings are exported as labels in the format
/dev/diskid/GEOM_CLASS-ident e.g. /dev/diskid/DISK-6QG3Z026.



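In other words, the node name is purely mechanical: the GEOM class name plus the ident string reported by the device. A sketch of the construction, using the ident value from the man page example:

```shell
# Build the diskid path exactly as glabel(8) describes it:
#   /dev/diskid/<GEOM_CLASS>-<ident>
geom_class="DISK"
ident="6QG3Z026"                            # serial number reported by the drive
echo "/dev/diskid/${geom_class}-${ident}"   # -> /dev/diskid/DISK-6QG3Z026
```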
Further findings:
  1. The disk ID label does not appear to be stored on the drive itself. It seems to be constructed from the serial number reported by the drive.
  2. This behaviour can be switched off via kern.geom.label.disk_ident.enable=0, but it seems to depend on other circumstances as well: another system here does not show disk ID labels at all.
  3. The automated scan for partitions (performed by zfs on its first invocation) does not cope well with this feature: I could provoke every combination of disappearing disks (one, the other, both), including failures caused by active filesystems disappearing.
It seems that the device names tend to be handled in an exclusive fashion: either /dev/ada* or /dev/diskid/DISK-*. So, if the disk ID labels are not disabled via the tunable, and if the disk is not currently open for writing via /dev/ada*, the system may remove the classical device nodes.
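To watch this exclusive behaviour, a small helper that reports which flavour of node currently exists can be handy. The device paths in the comment are examples from this thread; the demonstration line uses generic paths so the snippet runs anywhere:

```shell
# Report which of the given device nodes are currently present.
check_nodes() {
    for d in "$@"; do
        if [ -e "$d" ]; then
            echo "present: $d"
        else
            echo "absent:  $d"
        fi
    done
}

# On the affected machine one would run, e.g.:
#   check_nodes /dev/ada1 /dev/diskid/DISK-WD-WCASY7821919
check_nodes /dev/null /dev/nonexistent-node   # portable demonstration
```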
 