ZFS: Switched NVMe driver from nvd to nda, and now ZFS reports a non-native block size error

Hi,
I was checking to see if switching to the nda driver (by using hw.nvme.use_nvd=0 in loader.conf) would increase performance. After rebooting, ZFS is now throwing this message for the pool:

Code:
  pool: zroot
 state: ONLINE
status: One or more devices are configured to use a non-native block size.
    Expect reduced performance.
action: Replace affected devices with devices that support the
    configured block size, or migrate data to a properly configured
    pool.
  scan: scrub repaired 0B in 00:04:56 with 0 errors on Sun Jan 15 00:19:56 2023
config:

    NAME        STATE     READ WRITE CKSUM
    zroot       ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        nda0p4  ONLINE       0     0     0  block size: 4096B configured, 16384B native
        nda1p4  ONLINE       0     0     0  block size: 4096B configured, 16384B native

errors: No known data errors

I used 4k sectors (ashift=12) when creating the pool and wasn't aware of 16k sectors being a thing, so why are the drives now reporting 16k sectors, and how do I fix this? Both drives are Samsung SSD 980 1TB.
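In case it helps, the pool's ashift can be double-checked with something like this (the ashift pool property may read 0 if it was left at auto-detect, in which case zdb shows the actual per-vdev value):
Code:
# zpool get ashift zroot
# zdb -C zroot | grep ashift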

Thanks!
 
I don't know. I'm just telling you that you can't change the block size without recreating (destroying and creating) the ZFS pool.
 
Yeah, that looks like the only option if I want to stick with nda, but I'm just not sure why switching from nvd to nda would have caused this. I haven't switched back to nvd yet to see if the error goes away, but it wasn't happening while I was using nvd, so I'm guessing it will. Performance with nda seems fine, even better than nvd, so I'm not sure whether this is just a bug in what ZFS is seeing or whether it's really affecting performance.
 
If I understand the man pages correctly, NDA (the direct access device driver) implements direct disk access, unlike NVD, which implements the API.
So, to my mind, NDA is better than NVD.
 
That's my understanding as well -- better performance with nda. But if nda is seeing a 4k / 16k block size mismatch, and indicates there's going to be a performance penalty, then I'm not sure what that performance penalty is.

Just trying to figure out whether this is actually a bug, since it wasn't happening when I was using nvd, or why nda sees this as an issue when the Samsung 980 NVMe drives are 4k-sector drives.
 
I switched back to the nvd driver and the error/warning is gone -- zpool status is back to normal.

Anybody else run into this?
 
ZFS talks to GEOM. GEOM talks to the hardware driver.
What does GEOM have to say about the matter?
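For example (device name assumed), something like:
Code:
# geom disk list nda0
or geom -p nda0 should print the Sectorsize and Stripesize GEOM sees for that provider.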
 
I don't know. I'm just telling you that you can't change the block size without recreating (destroying and creating) the ZFS pool.

But he could reformat the namespace if the firmware supports different block sizes.

Check smartctl -a /dev/nvmeN for this section:
Code:
Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

If it supports a 4k block size, try this: detach ONE of the drives from the pool, back up your GPT table, and use nvmecontrol(8) format to reformat the namespace. Then restore your GPT table (and maybe reinstall the EFI boot code) and reattach the partition to the pool. Rinse and repeat with the second drive after the resilver.
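A rough sketch of that procedure, assuming the device/partition names from the zpool status above (nda1/nda1p4, which should map to namespace device nvme1ns1), a backup path of /root/nda1.gpt, and that -f 1 selects the 4096-byte entry from the LBA format table -- double-check the option against nvmecontrol(8) on your version before running anything, since the format step erases the whole drive, including the EFI and swap partitions, which ZFS will not resilver back:
Code:
# zpool detach zroot nda1p4                 # drop one side of the mirror
# gpart backup nda1 > /root/nda1.gpt        # save the partition table
# nvmecontrol format -f 1 nvme1ns1          # reformat namespace 1 to the 4096-byte LBA format (ERASES the namespace)
# gpart restore -F nda1 < /root/nda1.gpt    # recreate the partitions
#                                           # (recreate/copy the EFI partition contents and boot code here if needed)
# zpool attach zroot nda0p4 nda1p4          # reattach and let it resilver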
 
ZFS talks to GEOM. GEOM talks to the hardware driver.
What does GEOM have to say about the matter?
With nvd:
Code:
1. Name: nvd0
   Mediasize: 1000204886016 (932G)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0

With nda:
Code:
1. Name: nda0
   Mediasize: 1000204886016 (932G)
   Sectorsize: 512
   Stripesize: 16384
   Stripeoffset: 0

So for whatever reason, using nda shows the drives as a 16k stripesize, but nvd shows it (correctly) as 4k.
 
But he could reformat the namespace if the firmware supports different block sizes.

Check smartctl -a /dev/nvmeN for this section:
Code:
Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

If it supports a 4k block size, try this: detach ONE of the drives from the pool, back up your GPT table, and use nvmecontrol(8) format to reformat the namespace. Then restore your GPT table (and maybe reinstall the EFI boot code) and reattach the partition to the pool. Rinse and repeat with the second drive after the resilver.

Looks like these drives only support 512:

Code:
Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0
 
With nda:
Code:
1. Name: nda0
   Mediasize: 1000204886016 (932G)
   Sectorsize: 512
   Stripesize: 16384
   Stripeoffset: 0

So for whatever reason, using nda shows the drives as a 16k stripesize, but nvd shows it (correctly) as 4k.

That's interesting. I don't see Stripesize on (any) native disks:

Code:
Consumers:
1. Name: nda0
   Mediasize: 256060514304 (238G)
   Sectorsize: 4096

I see such entries only on ZFS-provided images (zvols):
Code:
1. Name: zvol/build/base.pole
   Mediasize: 45097156608 (42G)
   Sectorsize: 512
   Stripesize: 32768
   Stripeoffset: 0

Now, guessing a bit here, it appears to me that your consumer would be something virtual that has yet another layer below it.
And the only thing that comes to my mind would then be this one (but I don't have devices supporting that, so I don't know any details):

Code:
# nvmecontrol ns create help
[...]
 -L, --flbas=<NUM>             - Namespace formatted logical block size setting

Update: I finally found a few of these "stripesized" disks, and they are either in 512e mode ...
Sector Sizes: 512 bytes logical, 4096 bytes physical
... or SATA via USB3.

So the most likely explanation is that the namespace formatting did configure some kind of 512e mode. I don't know how this is usually done; on my laptop I had to do it myself (and that was probably when I chose 4096-native). Here is some further information: https://unix.stackexchange.com/a/520256
 
Thanks, PMc. These are Samsung 980 NVMe drives directly connected to the host, and I installed FreeBSD using a 4k sector size -- no virtualization whatsoever. smartctl shows:

Code:
Namespace 1 Formatted LBA Size:     512

Not sure where to go next.
 
So then, that means that the warning from ZFS is quite meaningless, because it is 512 bytes down below in either case.

The basic background is that NVMe devices have a kind of extra partitioning inside their controller, which is called "namespaces". So you could have multiple namespaces with different LBA sizes on the same device, and since these are handled inside the device controller, they can use the best possible performance for a given LBA size.

Usual desktop/laptop NVMe devices support only a single namespace. But it should be possible to delete and recreate that one. So you could delete it (beware: that would erase the entire device) and recreate it with an LBA size of 4096.
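If you want to see what the namespace currently reports before wiping anything, nvmecontrol can show that as well (device name assumed; the output lists the supported LBA formats and which one is currently in use):
Code:
# nvmecontrol identify nvme0ns1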
 
Very helpful, PMc!

Just to make sure I understand what's going on: when the namespace/drive was created/formatted/whatever it's called in NVMe terms, namespace 1 was created with 512-byte sectors. ZFS will send data to the drive in (up to) 4k blocks, but the drive then divides each of those into 8 pieces to fit the 512-byte sectors it is actually using. Is that right?

So to me it sounds like detaching one of the drives, deleting the namespace, and re-creating it with 4k sectors would help with performance (but who knows how much -- I'm sure it depends on the workload). Do you know the commands to do that? Question: Can I do that, attach the device back to the pool, let it resilver, and then detach the other drive and do the same thing?
 
Very helpful, PMc!

Just to make sure I understand what's going on: when the namespace/drive was created/formatted/whatever it's called in NVMe terms, namespace 1 was created with 512-byte sectors. ZFS will send data to the drive in (up to) 4k blocks, but the drive then divides each of those into 8 pieces to fit the 512-byte sectors it is actually using. Is that right?
That seems the most likely assumption.

So to me it sounds like detaching one of the drives, deleting the namespace, and re-creating it with 4k sectors would help with performance (but who knows how much -- I'm sure it depends on the workload). Do you know the commands to do that?
I don't remember the details, but I'm sure I did it with nvmecontrol.
There are two commands, nvmecontrol ns delete and nvmecontrol ns create. There are a bunch of strange options and the documentation is abysmal. :(
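From memory, and with the caveat that the option letters here are taken from the built-in help and should be checked against nvmecontrol(8) on your version (the device name, the namespace size placeholders, and LBA format index 1 for a 4096-byte format, if the drive actually lists one, are all assumptions), the sequence would be roughly:
Code:
# nvmecontrol ns delete -n 1 nvme1                                # destroys everything in namespace 1
# nvmecontrol ns create -s <size_in_blocks> -c <size_in_blocks> -L 1 nvme1
# nvmecontrol ns attach -n 1 -c <ctrlr-id> nvme1                  # re-attach the new namespace to the controller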

Question: Can I do that, attach the device back to the pool, let it resilver, and then detach the other drive and do the same thing?

That's a good question, because there can be issues with a pool consisting of devices with different block sizes. But I don't know the details; this needs investigation.
 
Hi,
was there any solution for the namespace thing with the Samsung NVMe? I'm facing the same problem:
Since the 14.0 release will switch to nda(4) by default, I tried changing the NVMe driver from nvd(4) to nda(4) before upgrading from 13.2-RELEASE-p5.
I have 3 NVMe disks in my workstation: 2 WD Black 1 TB in a ZFS mirror hosting data, my /home directory, VMs, etc. Then I have a ZFS stripe (zroot) on a 250GB Samsung 980 SSD where the system runs.
After switching to nda(4), ZFS gives the same message for the Samsung SSD (nda0):
Code:
 pool: zroot 
state: ONLINE
status: One or more devices are configured to use a non-native block size.
        Expect reduced performance.
action: Replace affected devices with devices that support the
        configured block size, or migrate data to a properly configured
        pool.
There are no namespace problems with the WD SSDs (nda1, nda2). Here is what geom(8) says:
Code:
# geom -p nda0
Name: nda0
   Mediasize: 250059350016 (233G)
   Sectorsize: 512
   Stripesize: 16384
   Stripeoffset: 0
   Mode: r3w3e7
   descr: Samsung SSD 980 250GB

# geom -p nda1
Name: nda1
   Mediasize: 1000204886016 (932G)
   Sectorsize: 512
   Mode: r1w1e3
   descr: WD_BLACK SN770 1TB
Unfortunately, nvmecontrol(8) reports that namespace management is not supported on the Samsung drive:
Code:
# nvmecontrol identify nvme0
Controller Capabilities/Features
================================
Vendor ID:                   144d
Subsystem Vendor ID:         144d
Serial Number:               S64BNJ0R304509K
Model Number:                Samsung SSD 980 250GB
Firmware Version:            3B4QFXO7
Recommended Arb Burst:       2
IEEE OUI Identifier:         00 25 38
Multi-Path I/O Capabilities: Not Supported
Max Data Transfer Size:      2097152 bytes
Controller ID:               0x0005
Version:                     1.4.0

Admin Command Set Attributes
============================
Security Send/Receive:       Supported
Format NVM:                  Supported
Firmware Activate/Download:  Supported
Namespace Management:        Not Supported
Device Self-test:            Supported
Directives:                  Not Supported
NVMe-MI Send/Receive:        Not Supported
Virtualization Management:   Not Supported
Doorbell Buffer Config:      Not Supported
Get LBA Status:              Not Supported
I'm grateful for any advice on what to do about it. I didn't intend to replace the drive, so I will keep the nvd(4) driver for now.
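For what it's worth, keeping nvd(4) after the upgrade should just be a matter of the same loader tunable mentioned at the top of the thread, set the other way around (assuming nvd(4) is still present in the kernel you run), e.g. in /boot/loader.conf:
Code:
hw.nvme.use_nvd="1"    # keep using nvd(4) instead of nda(4)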
 
Unfortunately nvmecontrol(8) reports that namespace management is not supported on the Samsung drive

Samsung doesn't support that (and, more and more often, not even reformatting to different sector sizes) on consumer drives. After cheaping out on everything hardware-related, it seems they have now turned to the firmware to further maximize their profits...
Best advice I can give: avoid Samsung. There are FAR better options on the market, especially in the "premium" price segment.
 
on consumer drives.
You hit the nail on the head. You don't see this on their mid-range enterprise PM983 or high-end PM1723.
Selectable sector size, as shown in smartctl.

I have a lot invested in Samsung in my servers. So far, so good. I would buy them again.
I recently bought a new-old-stock Intel 2TB NVMe AIC. I figure it's good to diversify.
Samsung's speeds really impress. Toshiba/Kioxia enterprise NVMe are close.
Intel's seem to lag. Data center drives, they say.
 
Sell me. I consider Samsung, as much as I don't trust them, top of the heap.
Micron, Kioxia, and most brands that use their flash. Contrary to Samsung, their endurance ratings are going up, while Samsung has reduced, and sometimes even halved, them with every new generation over the last few years...
Sandisk has been a solid contender in the enterprise segment, but I have absolutely no experience with their consumer drives.

You don't see this on their mid-range enterprise PM983 or high-end PM1723.
Selectable sector size, as shown in smartctl.
True, but in the enterprise segment pretty much all vendors support proper namespace management and different sector sizes. Given the shitshow we saw with Samsung firmware (consumer AND enterprise) lately, I'd also rather invest my money in Micron or Kioxia there...
Our SAS SSDs are mainly Sandisk (because we got a *really* good deal on them) and they also have been absolutely reliable for several years now.

Depending on the use case, WD is also quite good. We've been using WD Blue NVMe since the 500 series and they have always been reliable and extremely power-efficient, and hence perfect for low-power and/or fanless desktops or embedded systems. The Red NVMe (M.2) are OK given their price point, but run quite hot. I run 4 of those in my home server (jail/VM and poudriere pool), but in hindsight I'd rather go with Micron 7450s if I had to choose again.
 
Micron, Kioxia, and most brands that use their flash.
Early SSDs and SandForce chipsets have taught me that simply making good flash doesn't mean you can make a good flash controller chipset. I view Kioxia as a flash vendor, not a reliable controller vendor. Apple uses them, but provides their own controller. I don't know anybody else I would consider reliable. Even Samsung and Micron have screwups, but the key is issuing updates quickly. Micron I'll agree with, but their stuff is merely... ok. Their product lines have become what I view as difficult to understand, which seems ominous.

Anyone who merely packages "good" flash with their own controller doesn't get a pass as "good" from me.
 