NVMe hot swap working?

I'm about to buy a few more servers and it seems like NVMe is the way things are going, and the pricing is decent enough. The "U.2" standard seems here to stay and there are plenty of servers with NVMe "hot swap" bays for "U.2" format drives.

I've done a fair amount of googling and looking through mailing list archives, and I'm not really seeing anything definitive on whether FreeBSD supports hot swap on NVMe drives. I imagine it's a tricky mess: it's not like pulling a drive off the SATA or SAS bus; you're basically removing a PCIe device without powering the host off.

We'd probably be looking at FreeBSD 12.1 as the OS by the time we have everything in place...

Anyone with practical experience who can weigh in on whether hot swap exists and, more importantly, how well it works?
 
NVMe is new, very new. So, with regard to FreeBSD, you are asking a question whose answer may still be some time in the future.
Questions more suited for today would be things like:
- how good is NVMe support in FreeBSD today?
- how stable is it? (production-ready, usable for testing, or "do not put data on it that you can't afford to lose")
- are all NVMe controllers / chips supported?
and so on. No, I do not know the answers to those questions myself, but this is the feeling I get from reading some of FreeBSD's mailing lists on the subject.
 
I came here hoping for those answers. :)

Based on how little there is on the forums and mailing lists, I'm kind of leaning towards avoiding NVMe until one of the big FreeBSD sponsors starts pouring money/hardware into improving support.
 
No, NVMe isn't that new. FreeBSD has supported it for more than five years. NVMe support in FreeBSD is excellent, it's rock stable, and I have never come across a controller that was not supported.

I've got a 1TB Samsung 970 PRO (not the cheaper EVO variant) in a high-end Asus mainboard (M.2 slot), and it's by far the fastest single disk I have ever used to date. You can create a RAID of several of those babies to make it even faster.
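A minimal sketch of such a setup with ZFS, striping across mirrored pairs (the pool name and nvd device names are illustrative):
Code:
# zpool create fastpool mirror nvd0 nvd1 mirror nvd2 nvd3
# zpool status fastpool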

However, I can confirm that hot-swap is not yet supported for NVMe devices. According to a recent message on one of the mailing lists, work is under way to implement it, but it's probably much too early to hold one's breath.
 
Netflix has been using NVMe to serve video in our OCA platform for 3 years now. It's quite solid and able to deliver full PCIe bandwidth if the NVMe drive supports that. I've been running a NUC as a personal desktop machine for a while now, too, now that graphics is supported on it.

Hot swap hasn't been too important to Netflix, however, so I've not spent much time with it. Parts of it work, parts not so much. A lot depends on the hot plug controller that's being used (which means you need a custom kernel with options PCI_HP at least). The nvme driver has a detach method, so if that's called manually, the hot plug will work (assuming you've cleaned up all references to it up the stack). Surprise unplug likely doesn't work too well. I have no hot-plug gear, so I can't easily test this stuff.
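A minimal sketch of such a custom kernel config, assuming an amd64 GENERIC base (the name MYKERNEL is just an example):
Code:
# /usr/src/sys/amd64/conf/MYKERNEL
include GENERIC
ident   MYKERNEL
options PCI_HP    # PCIe native HotPlug support
Build it with make buildkernel KERNCONF=MYKERNEL and make installkernel KERNCONF=MYKERNEL from /usr/src, then reboot.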

Warner
 
Yeah, I think for server use, not having hot-swap kind of pulls it from consideration. Supermicro (and I assume others) makes backplanes that have SAS/SATA/NVMe connections, and the U.2 form factor slips right in like any other drive. The idea that we'd have to power stuff down to swap, or, in the case of M.2 drives, de-rack and disassemble to replace, is a show-stopper.

Any realistic thoughts on when FreeBSD gets up to speed on hot-swap? I also wonder if perhaps the server market is going the Apple way: a big disposable box with few moving parts other than some fans, making the U.2 format and hot-swap sort of dead on arrival?

(I know, I know, we're just supposed to move everything to the Cloud) :)
 
You don't have to power down. Hot-plug (meaning plugging in) works. It's basically just hot-unplug (detaching) that's not automatic. The procedure is roughly this (a command-level sketch follows the list):

1. Unmount the filesystem (UFS) or offline the drive (ZFS).
2. Use camcontrol(8) or nvmecontrol(8) to detach the device from the bus manually.
3. Physically remove the drive.
4. Physically plug in the new drive, whereupon the controller detects it and adds it normally.
5. Start using the device.
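A command-level sketch of the above with ZFS; I'm using devctl(8) here as the generic detach tool (the pool name, device names, and PCI address are all illustrative):
Code:
# zpool offline tank nvd0p1    # or umount(8) the filesystem for UFS
# devctl detach nvme0          # manually detach the controller
(physically swap the drive)
# devctl attach pci0:2:0:0     # re-attach by PCI address
# zpool replace tank nvd0p1    # resilver onto the new drive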
 
I wanted to chime in here. I have some U.2 Samsung PM983 drives for testing and it seems hot swapping does not work.

There is no mechanism in nvmecontrol(8) to take down a drive, so I went to the next subsystem down and its tool, devctl(8).
It acts as if it takes down the device, but fragments linger.
Here is what I used after unmounting:
Code:
# devctl detach -f nvme0
nvme0 detached

So that seemed to work, but when I checked /dev I could still see the disk device nvd0.
The namespace was detached, as was the parent device nvme0.

I tried ejecting the drive and then reinserting it. Nothing came across the console the way it would with direct-attached devices.
Then I tried devctl rescan nvme0, but that won't work because nvme0 is no longer present.
I also tried rescanning by bus address, since NVMe drives are PCIe bus devices. No luck:
Code:
# devctl rescan pci0:2:0:0

nvmecontrol(8) has a reset mechanism, but with no NVMe device showing, that doesn't work.
Rebooting the machine does reattach it.
I feel like you could probably use devctl to take down the disk device too, but the real issue is reattaching.
 
I encountered the same problem yesterday: removing the device works, but inserting an NVMe drive does not. Nothing happens, and the device is not detected.
 
There is no mechanism in nvmecontrol(8) to take down a drive, so I went to the next subsystem down and its tool, devctl(8).
It acts as if it takes down the device, but fragments linger.
Here is what I used after unmounting:
Code:
# devctl detach -f nvme0
nvme0 detached

So that seemed to work, but when I checked /dev I could still see the disk device nvd0.
The namespace was detached, as was the parent device nvme0.
The problem (or rather, one of the problems) is that nvd(4) does not support hot swapping. Also, I'm not sure if detaching nvme0 on the PCI level is the right thing to do.

I recommend using nda(4) instead of nvd(4) (*) because it uses the standard CAM infrastructure and supports hot swapping via camcontrol(8). So, after unmounting, you can detach the disk device. After that you can detach the namespace (/dev/nvme0ns1, for example) from nvme0 with nvmecontrol(8).

Do not try to detach the nvme0 controller. Instead, to attach the new device, reset the controller (nvmecontrol reset). I haven't tried that myself yet, so I'm not sure if it attaches the new namespace automatically; if not, you can do that with nvmecontrol(8), too. Afterwards you should be able to rescan the CAM bus with camcontrol(8) for the new nda(4) device.
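A sketch of that attach side (untested, as I said; the controller name is illustrative):
Code:
# nvmecontrol reset nvme0    # reset the controller after inserting the new drive
# camcontrol rescan all      # rescan the CAM buses
# camcontrol devlist         # the new nda(4) device should show up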

(*) PS: FreeBSD 13-current is in the process of migrating towards nda(4); I think it's already the default on some architectures, and nvd(4) is going to be declared legacy and will probably go away at some point in the future. Therefore it's a good idea to make the switch. I've been using nda(4) on stable/12 for several months without any problems. Being able to use standard tools like camcontrol(8) is a good thing. Of course, smartctl(8) works, too (admittedly I don't remember whether it worked with nvd(4) as well).
 
Thank you, olli!

I replaced the nvd(4) driver with the nda(4) driver by adding the hw.nvme.use_nvd=0 loader tunable to /boot/loader.conf and rebooting.
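For reference, the exact entry:
Code:
# /boot/loader.conf
hw.nvme.use_nvd="0"    # attach disks via nda(4) instead of nvd(4)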
After reboot I had /dev/ndaX devices and I could see them with camcontrol devlist and nvmecontrol devlist.

I tested detaching and attaching from the command line without physically removing the device (I used the nvme8 device for testing), and it seems to work:

First I have to get the PCI address of the NVMe device, because I'll need it later when I re-attach the device; I can't use the device name, because the device no longer exists after detaching:

Code:
# pciconf -l | grep nvme8
nvme8@pci0:131:0:0:     class=0x010802 card=0xa801144d chip=0xa808144d rev=0x00 hdr=0x00

Detaching the device and checking the results:

Code:
# devctl detach nvme8
# dmesg
pass10 at nvme8 bus 0 scbus30 target 0 lun 1
pass10: <SAMSUNG MZQLB960HAJR-00007 EDA5302Q S437NE0N102890>
 detached
nda8 at nvme8 bus 0 scbus30 target 0 lun 1
nda8: <SAMSUNG MZQLB960HAJR-00007 EDA5302Q S437NE0N102890>
 detached
(pass10:nvme8:0:0:1): Periph destroyed
(nda8:nvme8:0:0:1): Periph destroyed
nvme8: detached
# camcontrol devlist | grep nda8
<nothing>

Attaching the device using the address obtained above:

Code:
# devctl attach pci0:131:0:0
# dmesg
nvme8: using IRQs 681-713 for MSI-X
nvme8: attempting to allocate 33 MSI-X vectors (33 supported)
nvme8: using IRQs 681-713 for MSI-X
nda8 at nvme8 bus 0 scbus30 target 0 lun 1
GEOM: new disk nda8
nda8: <SAMSUNG MZQLB960HAJR-00007 EDA5302Q S437NE0N102890>
nda8: nvme version 1.2 x4 (max x4) lanes PCIe Gen3 (max Gen3) link
nda8: 915715MB (1875385008 512 byte sectors)
pass10 at nvme8 bus 0 scbus30 target 0 lun 1
pass10: <SAMSUNG MZQLB960HAJR-00007 EDA5302Q S437NE0N102890>
pass10: nvme version 1.2 x4 (max x4) lanes PCIe Gen3 (max Gen3) link
# camcontrol devlist | grep nda8
<SAMSUNG MZQLB960HAJR-00007 EDA5302Q>  at scbus30 target 0 lun 1 (pass10,nda8)

I'll try physical removal of the device next week, too, and report the results here.
 
No luck. I couldn't re-attach the removed NVMe device with the nda driver either.
After physically removing the device, dmesg showed:
Code:
[362750] nvme0: Resetting controller due to a timeout and possible hot unplug.
[362750] nvme0: resetting controller
[362750] nvme0: failing outstanding i/o
[362750] nvme0: IDENTIFY (06) sqid:0 cid:15 nsid:0 cdw10:00000001 cdw11:00000000
[362750] nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0
[362750] nda0 at nvme0 bus 0 scbus22 target 0 lun 1
[362750] nda0: <SAMSUNG MZQLB960HAJR-00007 EDA5302Q S437NE0N101431>
[362750] detached
[362750] pass2 at nvme0 bus 0 scbus22 target 0 lun 1
[362750] pass2: <SAMSUNG MZQLB960HAJR-00007 EDA5302Q S437NE0N101431>
[362750] detached
[362750] (pass2:nvme0:0:0:1): Periph destroyed
[362750] (nda0:nvme0:0:0:1): Periph destroyed

Executing nvmecontrol devlist hung for quite a while. I did devctl detach <address> and then devctl attach <address>, but the device did not re-appear. Unfortunately I didn't save the command outputs; I'll probably try again at some point.
 
I have gotten manual hot-swap of U.2 NVMe working on FreeBSD 12.2-RELEASE.

First, you need hardware that supports NVMe hot-swap.
Plug in the U.2 NVMe drive, then boot up to get the PCIe address and PCI bus ID:
Code:
# pciconf -lv |grep nvme
nvme0@pci0:199:0:0:     class=0x010802 card=0x2263126f chip=0x91001e95 rev=0x03 hdr=0x00

# sysctl dev.nvme.0.%parent
dev.nvme.0.%parent: pci7

Case 1: Hot Plug-in
Plug in the U.2 NVMe drive after the system has finished booting.
At first there is no PCI config space for it:
Code:
# pciconf -r pci0:199:0:0 0:128
pciconf: ioctl(PCIOCREAD): Operation not supported by device
Rescan the PCI bus; it will probe the NVMe device automatically:
Code:
# devctl rescan pci7
# nvmecontrol devlist
   nvme0: NVMe CL1-8D128
   nvme0ns1 (122104MB)
Case 2: Hot Plug-out
Remove the U.2 NVMe drive directly; its PCIe config space is gone:
Code:
# pciconf -r pci0:199:0:0 0:128
ffffffff ffffffff ffffffff ffffffff
ffffffff ffffffff ffffffff ffffffff
ffffffff ffffffff ffffffff ffffffff
ffffffff ffffffff ffffffff ffffffff
ffffffff ffffffff ffffffff ffffffff
ffffffff ffffffff ffffffff ffffffff
ffffffff ffffffff ffffffff ffffffff
ffffffff ffffffff ffffffff ffffffff
ffffffff
Manually detach the NVMe device:
Code:
# devctl detach nvme0
Case 3: Replug-in
Check that the PCI config space is available again:
Code:
# pciconf -r pci0:199:0:0 0:128
91001e95 00100000 01080203 00000000
00000004 00000000 00000000 00000000
00000000 00000000 00000000 2263126f
00000000 00000040 00000000 000001ff
00035001 00000008 00000000 00000000
01867005 00000000 00000000 00000000
00000000 00000000 00000000 00000000
0002b010 192c8fc0 00102010 0045c843
10430000
Manually probe the device:
Code:
# devctl reset pci0:199:0:0
# devctl rescan pci0:199:0:0
# devctl set driver -f pci0:199:0:0 nvme
# devctl attach pci0:199:0:0
# nvmecontrol devlist
nvme0: NVMe CL1-8D128
    nvme0ns1 (122104MB)
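Putting the replug steps together, a rough sketch as a script (the PCI address here is specific to my slot; adjust it for your system):
Code:
#!/bin/sh
# re-probe a replugged U.2 NVMe at a known PCI address
ADDR=pci0:199:0:0
devctl reset $ADDR
devctl rescan $ADDR
devctl set driver -f $ADDR nvme
devctl attach $ADDR
nvmecontrol devlist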
 