Really banging my head against the wall with this one... I have a Supermicro X10DRH-iT that was neglected for some time and was running, I think, 12.0 or 12.1 without any real issues. I upgraded to 12.4, and although my memory is a bit fuzzy at this point (it was months ago), I think it booted OK. Some time later I got some SMART alerts that a drive was failing. Since then, the server has rebooted on its own for no obvious reason (I wasn't at the console, so I have no idea whether it was a panic or not), and I thought perhaps the bad drive was tickling some kind of undiscovered bug. Each time it rebooted on its own, the last thing in the log was that particular drive timing out, resetting, and then timing out again.
Now when it rebooted, it appeared to hang. I would get this far in the boot process (BTX loader, then a list of BIOS drives):
And if I waited for a few hours, I'd start seeing this (same, but then a "ZFS: can't find vdev details"):
Now if this sits for like 4 hours, it does eventually boot. But 4 hours is a really long time.
So I thought perhaps something in the loader is "tasting" all the disks for ZFS stuff and getting lost on the bad drive. So I had someone pull the two spinny drives that were getting a bit long in the tooth and were not currently in use for anything critical anyhow. Same result - a hang, and then if you wait long enough, it boots.
Today I was poking around in the BIOS and noticed a boot device I didn't recognize. I looked closer and it turns out we'd plugged in a USB-SATA bridge and an old drive ("Drive A" above, seen as a floppy?), likely for some kind of recovery. I thought "ah ha!", this must be what's hanging the loader. Even once the system is booted, that drive seems a mess - it has a partition on it, but I can't mount it, can't fsck it, etc. On top of that, it was showing some timeouts/resets.
So we just pulled that drive out and... same thing, hangs forever.
Internally I have two decent, small Intel SSDs with gmirror as my boot drives. Then 4 enterprise Intel SSDs for the main storage (these are ZFS, two mirrors), and two WD 6TBs also in a ZFS mirror as scratch space and a temp backup area.
What could be going on here?
I tried booting a 12.4 DVD and got the same results. I tried booting a 12.1 DVD and no issues AT ALL. Bug? Incompatibility? I updated the BIOS to Supermicro's latest last night, no improvement.
I can order new hardware if needed, but this is the first time I've seen anything like this and I have a TON of Supermicro under my belt.
Is there any chance my (now odd) combo of UFS boot drives and ZFS pools on the remaining drives is a problem?
Looking at the mainboard health logs, no issues there aside from one entry about the (now gone) drive that had SMART errors.