ZFS 4k disk sectors cost 33% disk space - indeed?!?

Today I had to dump and restore a postgres database. On restore I wondered why the database had become so big, so I took a closer look:

Code:
 ls -l 1387899*
-rw-------  1 postgres  postgres  1073741824 Jun 12 20:12 1387899
-rw-------  1 postgres  postgres  1073741824 Jun 12 18:52 1387899.1
-rw-------  1 postgres  postgres  1073741824 Jun 12 19:14 1387899.2
-rw-------  1 postgres  postgres  1073741824 Jun 12 19:29 1387899.3
-rw-------  1 postgres  postgres  1073741824 Jun 12 19:47 1387899.4
root@edge:/var/db/postgres/tblspc2/PG_10_201707211/1387448 # du -sk 1387899*
1409034 1387899
1409013 1387899.1
1408970 1387899.2
1408981 1387899.3
1409013 1387899.4

These are standard postgres table files, each 1 GB in size. But on disk they take quite precisely 1.33 GB each (1409034 kB / 1048576 kB ≈ 1.34, with slight variations). No compression, no copies.
The same thing is also visible from the zfs statistics:
Code:
gr/pgsql/tblspc2  used                  11.9G                     -
gr/pgsql/tblspc2  referenced            11.9G                     -
gr/pgsql/tblspc2  compressratio         1.00x                     -
gr/pgsql/tblspc2  written               11.9G                     -
gr/pgsql/tblspc2  logicalused           8.93G                     -
gr/pgsql/tblspc2  logicalreferenced     8.93G                     -

I was quite certain that this had not been the case earlier, so I checked the backups, and indeed it had not:
Code:
gr/pgsql/tblspc2  used                  9.65G                              -
gr/pgsql/tblspc2  referenced            9.65G                              -
gr/pgsql/tblspc2  compressratio         1.00x                              -
gr/pgsql/tblspc2  written               9.65G                              -
gr/pgsql/tblspc2  logicalused           9.54G                              -
gr/pgsql/tblspc2  logicalreferenced     9.54G                              -

I compared the pool and filesystem options with the backup; they are all identical, nothing changed. I had only recreated the pool and moved it onto an encrypted partition.

Then, while already writing this posting, I figured out what else I had changed at that time: I had set vfs.zfs.min_auto_ashift to 12 in order to accommodate newer disks.
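For reference, this is how one can check and set it, and verify what an existing pool actually uses (on newer FreeBSD the sysctl is named vfs.zfs.vdev.min_auto_ashift; pool name taken from the listings above):
Code:
# current value, and forcing 4k alignment for newly created vdevs:
sysctl vfs.zfs.min_auto_ashift
sysctl vfs.zfs.min_auto_ashift=12
# ashift actually used by an existing pool:
zdb -C gr | grep ashift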
 
ZFS uses block sub-allocation, so 4K blocks don't have as much slack as you think. Besides that, slack really only comes into play when you have, for example, thousands of 3K files. Assuming there were no sub-allocation, that would indeed leave thousands of 1K gaps open. But with a few large files there's no such slack. The 4K blocks are all filled up except the last block of the file. A 10K file, for example, would use three 4K blocks, with the last block only half filled.
 
ZFS uses block sub-allocation, so 4K blocks don't have as much slack as you think. Besides that, slack really only comes into play when you have, for example, thousands of 3K files.

Nope! It's not what I think, it's what I measure; and I don't have 3k files, I have 1 GB files.

And these 1 GB files grow by 33% after switching from 512b to 4k ashift.
That's the point here, and I didn't find this mentioned anywhere.

There are lots of strange calculations around, none consistent with the others, but they tend to speak of some 3-5% loss from stripe allocation.
This one here is more interesting: it talks about up to 60% loss from stripe allocation, but it references a document from Delphix that doesn't exist anymore.

But all these docs talk about stripe allocation for checksumming, and checksumming should be accounted for at the pool level, i.e. these losses should show up as the difference between the figures from zfs(8) and from zpool(8). (Maybe I'm wrong about this; I didn't find it precisely described anywhere.)
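I.e., I would expect the loss to show up when comparing the two views, e.g.:
Code:
zpool list -o name,size,allocated gr
zfs list -o name,used,logicalused gr/pgsql/tblspc2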

But what I'm perceiving is a loss at the file level: out of the actually usable space (as reported by zfs(8)), the written file takes a huge amount more than its real size.

Yes, this only happens with raidz.
Yes, this only happens with database files (recordsize 8k).

Assuming there were no sub-allocation, that would indeed leave thousands of 1K gaps open. But with a few large files there's no such slack.

It seems to do exactly that. And it seems the files can be as big as they want; that is irrelevant - the thing that matters is the recordsize (as configured for the individual filesystem). And with postgres, the recordsize must be 8k (because otherwise the thing will happily read 128k from disk for each and every 8k block requested).
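For reference, that is set per filesystem and only affects newly written blocks, e.g.:
Code:
zfs set recordsize=8k gr/pgsql/tblspc2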

The 4K blocks are all filled up except the last block of the file. A 10K file, for example, would use three 4K blocks, with the last block only half filled.

Then why do my 1048576 kB files use 350,000+ 4K blocks instead of 262,144?

It is not a matter of the file size. Files don't get written all at once. Database files in particular are only partially rewritten - a database update does not take the I/O to write the whole gigabyte every time, and the copy-on-write scheme must take care of this.

I suppose the allocation happens at the recordsize level: each chunk of recordsize is allocated individually.
And while it would seem that a 4K sector size/ashift, an 8K recordsize, and a 2+1 raidz should match up nicely and without loss, in fact this isn't the case.
 
Nope! It's not what I think, it's what I measure; and I don't have 3k files, I have 1 GB files.
That's exactly my point. Because you only have large files, there's very little slack even if there were no sub-block allocation.

Then why do my 1048576 kB files use 350,000+ 4K blocks instead of 262,144?
Not sure, but slack certainly isn't the issue. One thing that came to mind, though: maybe one is a sparse file and the other is not. If a sparse file is converted to a non-sparse file, it's going to take up more space, and the difference depends on the amount of "empty" space in the data. Another option could be compression, but your data shows 1:1 in both cases, so that rules out compression.
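On FreeBSD one can check for sparseness by comparing the apparent size (du -A) with the actual usage, e.g. on the files from the first post:
Code:
du -k  1387899     # allocated size in kB
du -Ak 1387899     # apparent size in kB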

 
Okay, let's put it to the test:
Code:
#! /bin/sh
# Measure how much disk space files of various sizes actually allocate.
# Run on a ZFS filesystem with compression=off.

F=_A

# wait one second longer than the txg timeout, so the data has
# actually been flushed to disk before measuring
ST=`sysctl -n vfs.zfs.txg.timeout`
ST=`expr $ST + 1`

# empty file first
S=0
touch $F
sleep $ST
A=`du -k $F | awk '{print $1}'`
rm $F
echo "$S K size -> $A K alloc"

for S in 1 2 3 4 5 6 7 8 9 11 12 13 15 16 17 23 24 25 31 32 33 63 \
        64 65 127 128 129 254 255 256 257; do
    dd if=/dev/zero bs=1k count=$S of=$F 2> /dev/null
    sleep $ST
    A=`du -k $F | awk '{print $1}'`
    rm $F
    echo "$S K size -> $A K alloc"
done

Run this on a ZFS filesystem with compression=off.
Interesting parameters: recordsize of the filesystem, ashift of the pool, layout of the vdev.
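For example, to test an 8k recordsize (dataset name hypothetical, default mountpoint assumed):
Code:
zfs create -o compression=off -o recordsize=8k gr/rstest
cd /gr/rstest
sh /tmp/alloctest.sh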
 
recordsize=128K, ashift=12, mirror:
Code:
0 K size -> 1 K alloc
1 K size -> 5 K alloc
2 K size -> 5 K alloc
3 K size -> 5 K alloc
4 K size -> 5 K alloc
5 K size -> 9 K alloc
6 K size -> 9 K alloc
7 K size -> 9 K alloc
8 K size -> 9 K alloc
9 K size -> 13 K alloc
11 K size -> 13 K alloc
12 K size -> 13 K alloc
13 K size -> 17 K alloc
15 K size -> 17 K alloc
16 K size -> 17 K alloc
17 K size -> 21 K alloc
23 K size -> 25 K alloc
24 K size -> 25 K alloc
25 K size -> 29 K alloc
31 K size -> 33 K alloc
32 K size -> 33 K alloc
33 K size -> 37 K alloc
63 K size -> 65 K alloc
64 K size -> 65 K alloc
65 K size -> 69 K alloc
127 K size -> 129 K alloc
128 K size -> 129 K alloc
129 K size -> 265 K alloc
254 K size -> 265 K alloc
255 K size -> 265 K alloc
256 K size -> 265 K alloc
257 K size -> 393 K alloc

recordsize=8K, ashift=12, mirror:
Code:
0 K size -> 1 K alloc
1 K size -> 5 K alloc
2 K size -> 5 K alloc
3 K size -> 5 K alloc
4 K size -> 5 K alloc
5 K size -> 9 K alloc
6 K size -> 9 K alloc
7 K size -> 9 K alloc
8 K size -> 9 K alloc
9 K size -> 25 K alloc
11 K size -> 25 K alloc
12 K size -> 25 K alloc
13 K size -> 25 K alloc
15 K size -> 25 K alloc
16 K size -> 25 K alloc
17 K size -> 33 K alloc
23 K size -> 33 K alloc
24 K size -> 33 K alloc
25 K size -> 41 K alloc
31 K size -> 41 K alloc
32 K size -> 41 K alloc
33 K size -> 49 K alloc
63 K size -> 73 K alloc
64 K size -> 73 K alloc
65 K size -> 81 K alloc
127 K size -> 137 K alloc
128 K size -> 137 K alloc
129 K size -> 145 K alloc
254 K size -> 265 K alloc
255 K size -> 265 K alloc
256 K size -> 265 K alloc
257 K size -> 273 K alloc

recordsize=2K, ashift=12, mirror:
Code:
0 K size -> 1 K alloc
1 K size -> 5 K alloc
2 K size -> 5 K alloc
3 K size -> 17 K alloc
4 K size -> 17 K alloc
5 K size -> 21 K alloc
6 K size -> 21 K alloc
7 K size -> 25 K alloc
8 K size -> 25 K alloc
9 K size -> 29 K alloc
11 K size -> 33 K alloc
12 K size -> 33 K alloc
13 K size -> 37 K alloc
15 K size -> 41 K alloc
16 K size -> 41 K alloc
17 K size -> 45 K alloc
23 K size -> 57 K alloc
24 K size -> 57 K alloc
25 K size -> 61 K alloc
31 K size -> 73 K alloc
32 K size -> 73 K alloc
33 K size -> 77 K alloc
63 K size -> 137 K alloc
64 K size -> 137 K alloc
65 K size -> 141 K alloc
127 K size -> 265 K alloc
128 K size -> 265 K alloc
129 K size -> 269 K alloc
254 K size -> 517 K alloc
255 K size -> 521 K alloc
256 K size -> 521 K alloc
257 K size -> 525 K alloc

recordsize=128K, ashift=12, raidz 2+1:
Code:
0 K size -> 1 K alloc
1 K size -> 6 K alloc
2 K size -> 6 K alloc
3 K size -> 6 K alloc
4 K size -> 6 K alloc
5 K size -> 11 K alloc
6 K size -> 11 K alloc
7 K size -> 11 K alloc
8 K size -> 11 K alloc
9 K size -> 17 K alloc
11 K size -> 17 K alloc
12 K size -> 17 K alloc
13 K size -> 17 K alloc
15 K size -> 17 K alloc
16 K size -> 17 K alloc
17 K size -> 22 K alloc
23 K size -> 27 K alloc
24 K size -> 27 K alloc
25 K size -> 33 K alloc
31 K size -> 33 K alloc
32 K size -> 33 K alloc
33 K size -> 38 K alloc
63 K size -> 65 K alloc
64 K size -> 65 K alloc
65 K size -> 70 K alloc
127 K size -> 129 K alloc
128 K size -> 129 K alloc
129 K size -> 267 K alloc
254 K size -> 267 K alloc
255 K size -> 267 K alloc
256 K size -> 267 K alloc
257 K size -> 395 K alloc

recordsize=8K, ashift=12, raidz 2+1:
Code:
0 K size -> 1 K alloc
1 K size -> 6 K alloc
2 K size -> 6 K alloc
3 K size -> 6 K alloc
4 K size -> 6 K alloc
5 K size -> 11 K alloc
6 K size -> 11 K alloc
7 K size -> 11 K alloc
8 K size -> 11 K alloc
9 K size -> 33 K alloc
11 K size -> 33 K alloc
12 K size -> 33 K alloc
13 K size -> 33 K alloc
15 K size -> 33 K alloc
16 K size -> 33 K alloc
17 K size -> 43 K alloc
23 K size -> 43 K alloc
24 K size -> 43 K alloc
25 K size -> 54 K alloc
31 K size -> 54 K alloc
32 K size -> 54 K alloc
33 K size -> 65 K alloc
63 K size -> 97 K alloc
64 K size -> 97 K alloc
65 K size -> 107 K alloc
127 K size -> 182 K alloc
128 K size -> 182 K alloc
129 K size -> 193 K alloc
254 K size -> 352 K alloc
255 K size -> 352 K alloc
256 K size -> 352 K alloc
257 K size -> 363 K alloc

recordsize=2K, ashift=12, raidz 2+1:
Code:
0 K size -> 1 K alloc
1 K size -> 6 K alloc
2 K size -> 6 K alloc
3 K size -> 22 K alloc
4 K size -> 22 K alloc
5 K size -> 27 K alloc
6 K size -> 27 K alloc
7 K size -> 33 K alloc
8 K size -> 33 K alloc
9 K size -> 38 K alloc
11 K size -> 43 K alloc
12 K size -> 43 K alloc
13 K size -> 49 K alloc
15 K size -> 54 K alloc
16 K size -> 54 K alloc
17 K size -> 59 K alloc
23 K size -> 75 K alloc
24 K size -> 75 K alloc
25 K size -> 81 K alloc
31 K size -> 97 K alloc
32 K size -> 97 K alloc
33 K size -> 102 K alloc
63 K size -> 182 K alloc
64 K size -> 182 K alloc
65 K size -> 187 K alloc
127 K size -> 352 K alloc
128 K size -> 352 K alloc
129 K size -> 358 K alloc
254 K size -> 688 K alloc
255 K size -> 693 K alloc
256 K size -> 693 K alloc
257 K size -> 699 K alloc
 
Let's have a closer look:
  • each file gets 1 kB of metadata on creation.
  • as soon as there is content in the file, it also allocates one disk sector (2^ashift).
  • when that disk sector is full, another one is allocated, and so on, until the recordsize is reached.
  • on reaching the recordsize, the file occupies exactly 1 recordsize + 1 kB metadata.
  • when the file grows beyond 1 recordsize, two things happen:
    • an additional 8 kB of metadata is allocated.
    • an additional recordsize is allocated: the allocation is now no longer per sector/ashift, but per recordsize. So, in the default config (128k recordsize), when the file grows beyond 128k, the allocation is 1 kB initial metadata + 2 x 128 kB records + 8 kB extra metadata = 265 kB. The file can then grow up to 256 kB without further allocation, and then the next recordsize is allocated.
  • there is no difference between 8K and 128K recordsize, except that the additional 8 kB of metadata is already allocated for files > 8 kB. A file of 256 kB will always use 256 + 9 kB.

  • when recordsize < sector size (2^ashift), each record will nevertheless occupy one full sector. So, with ashift=12 and recordsize=2K, everything allocates about twice the space it needs, no matter the file size. (A small model reproducing the mirror numbers follows below.)
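For what it's worth, these mirror numbers can be reproduced with a trivial model (a sketch of my reading of the tables; the 8 kB of "extra metadata" is simply what the measurements show, presumably indirect blocks):
Code:
#! /bin/sh
# model of the mirror measurements above; all sizes in kB
RS=128     # recordsize
SECT=4     # sector size (2^ashift)
alloc() {
    S=$1
    if [ "$S" -eq 0 ]; then
        echo 1                                          # metadata only
    elif [ "$S" -le "$RS" ]; then
        echo $(( 1 + (S + SECT - 1) / SECT * SECT ))    # per-sector allocation
    else
        echo $(( 1 + 8 + (S + RS - 1) / RS * RS ))      # per-record + 8 kB extra
    fi
}
for S in 0 1 8 127 128 129 256 257; do
    echo "$S K size -> `alloc $S` K alloc"
done

With RS=8 it likewise reproduces the 8k-recordsize mirror table (e.g. 9 K -> 25 K, 254 K -> 265 K).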

With raidz things are a bit different:
  • an empty file still uses 1 kB of metadata.
  • the first sector now allocates 5 kB (instead of 4), the second likewise, and the third sector allocates 6 kB. The 4th sector then allocates nothing at all, so a 16 kB file still fits into 17 kB. Likewise a 64 kB file fits into 65 kB, and a 128 kB file into 129 kB (so it is still 1 recordsize + 1 kB metadata).
  • only when surpassing the recordsize, the additional metadata is not 8 kB but 10 kB: the metadata is allocated as a separate entity, and analogously its first and second sector each need 5 kB.
  • over all, this is only a moderate overhead of a small single-digit percentage.
But then, when changing recordsize to 8 kB:
  • an 8 kB file does still occupy 11 kB: 1 kB metadata + 5 kB first sector + 5 kB second sector.
  • beyond 8 kB we need:
    • the additional 8 kB of metadata,
    • an additional record of 8 kB,
    • and together these (8 kB + 8 kB) allocate 22 kB - for whatever reason.
  • when growing beyond 16 kB, another 8 kB record is needed, which again allocates 10 kB.
  • and so on: each further 8 kB record allocates either 10 or 11 kB. Whereas before, with recordsize=128k, the allocation lined up at every 16 kB, those 16 kB are now never reached: each 8 kB record seems to be treated as a separate entity, and each one over-consumes 2 or 3 kB.
  • taken together, 2-3 kB extra per 8 kB is precisely the observed 33%.
 
Proof of concept: since it appears that this particular raidz stripe allocation scheme lines up after every 16 kB, the over-consumption should disappear with recordsize=16k.

recordsize=16K, ashift=12, raidz 2+1:
Code:
0 K size -> 1 K alloc
1 K size -> 6 K alloc
2 K size -> 6 K alloc
3 K size -> 6 K alloc
4 K size -> 6 K alloc
5 K size -> 11 K alloc
6 K size -> 11 K alloc
7 K size -> 11 K alloc
8 K size -> 11 K alloc
9 K size -> 17 K alloc
11 K size -> 17 K alloc
12 K size -> 17 K alloc
13 K size -> 17 K alloc
15 K size -> 17 K alloc
16 K size -> 17 K alloc
17 K size -> 43 K alloc
23 K size -> 43 K alloc
24 K size -> 43 K alloc
25 K size -> 43 K alloc
31 K size -> 43 K alloc
32 K size -> 43 K alloc
33 K size -> 59 K alloc
63 K size -> 75 K alloc
64 K size -> 75 K alloc
65 K size -> 91 K alloc
127 K size -> 139 K alloc
128 K size -> 139 K alloc
129 K size -> 155 K alloc
254 K size -> 267 K alloc
255 K size -> 267 K alloc
256 K size -> 267 K alloc
257 K size -> 283 K alloc

As we can see, a 256 kB file now consumes 267 kB of space, just as with recordsize=128k. So this change saves the 33% of disk space!

One could now think about compiling postgres with a 16 kB block size - but I'm not sure whether that is feasible, or possible at all.
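In principle the block size is a compile-time option of PostgreSQL, so a source build along these lines should do it (untested here; note the resulting cluster is on-disk incompatible with 8k builds, so it needs a dump/reload):
Code:
cd postgresql-10.x/      # version placeholder
./configure --with-blocksize=16
make && make install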
But then, this behaviour may be entirely different on a different raidz geometry.
I would be interested in such measurements (it is not much effort to create a ZFS filesystem, set it to a different recordsize, and run the script from above).
 
Raidz-n allocations are required to be multiples of (n+1) sectors (where the sector size is what ashift sets, so ashift=12 == 4k sectors). For raidz2, each allocation must be a multiple of (2+1) x 4k = 12k.
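Checking that against the 2+1 raidz1 measurements above (my arithmetic, not from the blog post): an 8 kB record needs 2 data sectors + 1 parity sector = 3 sectors; padded up to a multiple of (1+1) = 2 sectors, that becomes 4 sectors = 16 kB raw. As far as I understand, ZFS accounting deflates raidz allocations by the ideal data:total ratio (2/3 here), so du reports 16 kB x 2/3 ≈ 10.7 kB per 8 kB record - the observed 10-11 kB, i.e. the 33%.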

See this blog post.
 
See also this spreadsheet. Where things line up to be efficient depends on the sector (ashift) size, the redundancy level, the size of the records being saved, and the width of the RAIDZ stripe.

The net effect (and overhead over traditional RAID levels) varies depending on the width and size of files you are creating. ZVOLs and filesystems with small block/record sizes (or small files) see the largest impact, and it can be very large. Compression obviously helps to balance things out if the data is compressible, but for very small volblocksize/recordsize/filesize, there is essentially nothing to be done. Avoid very small settings on RAIDZ. (Use at least 32k.)
 
Raidz-n allocations are required to be multiples of (n+1) sectors (where the sector size is what ashift sets, so ashift=12 == 4k sectors).
And how do we determine the allocations? What ZFS pleases to allocate is internal to it. One can figure it out by reading the source or with a script like the one above, but there's nothing to be done about it.

RAID-Z spreads each logical block across all the devices

That's the important point. The "logical block" is the recordsize. As soon as the recordsize comes near the sector size, we're in hell.

Bottom line: database volumes (recordsize=8k) plus 4k disks plus raidz do not work well together and should be avoided - drop the raidz, or drop the 4k, or drop the database (or pay the price in space overhead).

See also this spreadsheet. Where things line up to be efficient depends on the sector (ashift) size, the redundancy level, the size of the records being saved, and the width of the RAIDZ stripe.

That one appears to be accurate. We need AT LEAST recordsize=16k to avoid the loss (and that only works with a 6-disk raidz2 or a 3-disk raidz1).

The net effect (and overhead over traditional RAID levels) varies depending on the width and size of files you are creating. ZVOLs and filesystems with small block/record sizes (or small files) see the largest impact, and it can be very large. Compression obviously helps to balance things out if the data is compressible, but for very small volblocksize/recordsize/filesize, there is essentially nothing to be done. Avoid very small settings on RAIDZ. (Use at least 32k.)

Not an option. Recordsize must be 8k for database. Otherwise: agreed.
 
Recordsize must be 8k for database.

I would suggest that it is should, not must. As in "you should (for performance reasons) use a recordsize that matches the database's I/Os".

The simple fact is that you have hit a corner where ZFS doesn't solve all the world's problems. While RAIDZ-n is selected (over stripes of mirrors) for increased storage efficiency at the cost of IOPS, in the small-record-vs.-sector-size arena it falls down on the efficiency goal.

* If you need high IOPS, move to a stripe of mirrors. You won't have any unexpected storage inflation, and you'll get much better IOPS.
* If you need high storage efficiency (RAIDZn), move to at least 32k record sizes and use lz4 compression (example below). (You may find less of an impact than you expect if you're using spinning rust; random reading/writing of 8k or 32k is not all that different in terms of absolute latency.) Also consider bumping down your DB cache size and allowing ARC to grow instead, to try to reduce the impact.
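E.g. (dataset name from the earlier listings):
Code:
zfs set recordsize=32k gr/pgsql/tblspc2
zfs set compression=lz4 gr/pgsql/tblspc2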

Just my 2c.
 
I would suggest that it is should, not must. As in "you should (for performance reasons) use a recordsize that matches the database's I/Os".

I'm not certain about the impact of that. I have seen ZFS on occasion do some very creepy read amplification with the default recordsize=128k (but I didn't investigate that further).

The simple fact is that you have hit a corner where ZFS doesn't solve all the world's problems.

Oh well, but it should! *biggrin*
Mostly I was surprised by the amount of the loss: if 1 GB takes 1.33 GB on disk, then in some cases there is no space gain at all from raidz, and one can use mirroring right away.

The other thing is that a general recommendation is to switch everything to 4k ashift - and there may be cases where this is not the best advice; at least it may need some additional consideration of the side effects.
 
I know just enough of both ZFS and PostgreSQL to get into trouble, but I seem to recall that the database has a "fillfactor" setting, user-settable (10-100%), which, when loading a table, leaves some free space in each page to allow future insertions or edits of data in the same page. It is used for indexes too. I have never seen the need to touch it. There's a default setting in place, and this might be combining with your table's data and ZFS's block size to vary the size of the resultant table after pouring in the dump. So perhaps it's not exclusively a ZFS thing. Concurrence not being causation and all that, but databases can be a black art - although 1:1.33 is quite a jump. Good luck.
 
Recordsize must be 8k for database.

Does it really matter, considering that it's just a maximum? Smaller files can very well have a record size equal to the size of the file, not less than 4096 and not more than the recordsize. At least this is what the st_blksize field from stat(1) implies.
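One can look at what ZFS reports per file, e.g. with stat(1) on FreeBSD (%k prints st_blksize; filename taken from the first post):
Code:
stat -f "st_blksize: %k" 1387899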

I'm currently assessing the possibility of using:
vfs.zfs.vdev.min_auto_ashift=12
on this drive:
nvd0: <SAMSUNG MZQL2960HCJR-00A07> NVMe namespace
nvd0: 915715MB (1875385008 512 byte sectors)
for running a PostgreSQL database.

It's probably lying about the 512 byte sectors. What are the pros and cons of forcing 4096-byte disk blocks, if the goal is to increase speed, minimize disk writes, and not negatively impact reported file sizes? Any input would be much appreciated.
 
I've just temporarily disabled ZFS compression, created a 100 MB file using dd if=/dev/urandom on an ashift=12 zpool, and its size in du -sh, in du -shA, and in df -h of the filesystem it's on was exactly 100M. Good.
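Roughly like this (reconstructed; dataset name hypothetical):
Code:
zfs set compression=off zroot/tmp
dd if=/dev/urandom of=testfile bs=1m count=100
du -sh testfile; du -shA testfile; df -h .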
 
Oh, damn, I forgot that it's an existing pool with the root FS on it, and it was created with ashift=0 (meaning: query the disk for its block size) when installing FreeBSD, so I've got no options here. But it would still be nice to know the pros and cons. Thanks)
 
Have you dug into the details of your device?
Query the namespace with smartmontools.

There's no such information, it seems.

Code:
$ sudo smartctl -a /dev/nvme0
smartctl 7.3 2022-02-28 r5338 [FreeBSD 13.2-RELEASE amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SAMSUNG MZQL2960HCJR-00A07
Serial Number:                      S64FNT0W301256
Firmware Version:                   GDC5902Q
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 960,197,124,096 [960 GB]
Unallocated NVM Capacity:           0
Controller ID:                      6
NVMe Version:                       1.4
Number of Namespaces:               32
Local Time is:                      Mon Jun  5 05:26:57 2023 UTC
Firmware Updates (0x17):            3 Slots, Slot 1 R/O, no Reset required
Optional Admin Commands (0x005f):   Security Format Frmw_DL NS_Mngmt Self_Test MI_Snd/Rec
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x0e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     80 Celsius
Critical Comp. Temp. Threshold:     83 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +    25.00W   14.00W       -    0  0  0  0       70      70
 1 +     8.00W    8.00W       -    1  1  1  1       70      70

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        38 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    7,893,827 [4.04 TB]
Data Units Written:                 7,689,582 [3.93 TB]
Host Read Commands:                 205,377,513
Host Write Commands:                207,790,371
Controller Busy Time:               118
Power Cycles:                       41
Power On Hours:                     863
Unsafe Shutdowns:                   39
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               38 Celsius
Temperature Sensor 2:               48 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged
 
It should be up there with the power states.
On a Micron 2450:
Code:
Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

Code:
=== START OF INFORMATION SECTION ===
Model Number:                       Micron 2450 NVMe 256GB
Serial Number:                      22433D$$$$$
Firmware Version:                   24500007
PCI Vendor/Subsystem ID:            0x1344
IEEE OUI Identifier:                0x00a075
Controller ID:                      0
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          256,060,514,304 [256 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            00a075 013d6c9e2f
Local Time is:                      Mon Jun  5 23:03:35 2023 EDT
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x00d7):     Comp Wr_Unc DS_Mngmt Sav/Sel_Feat Timestmp Verify
Log Page Attributes (0x1e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg
Maximum Data Transfer Size:         64 Pages
Warning  Comp. Temp. Threshold:     83 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     4.30W       -        -    0  0  0  0        0       0
 1 +     2.40W       -        -    1  1  1  1        0       0
 2 +     1.92W       -        -    2  2  2  2        0       0
 3 -   0.0700W       -        -    3  3  3  3     1000    1000
 4 -   0.0050W       -        -    4  4  4  4    10000   40000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        56 Celsius
Available Spare:                    100%
Available Spare Threshold:          50%
Percentage Used:                    0%
Data Units Read:                    176,602 [90.4 GB]
Data Units Written:                 265,762 [136 GB]
Host Read Commands:                 1,409,212
Host Write Commands:                3,736,703
Controller Busy Time:               20
Power Cycles:                       18
Power On Hours:                     1
Unsafe Shutdowns:                   15
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               56 Celsius

Warning: NVMe Get Log truncated to 0x200 bytes, 0x200 bytes zero filled
Error Information (NVMe Log 0x01, 16 of 63 entries)
No Errors Logged
 
I see that it is an enterprise drive. Surprising how limited the firmware seems to be, and somewhat limited power states.
I have the prior-gen PM983 in all configurations (U.2 and M.2), and they all had not only multiple sector sizes but the ability to choose.
Notice the "Fmt" column with the "+" symbol: on drives with 2048 as well as 512, the "+" would not necessarily be on the 512 entry (depending on how you set it up; it is selectable with nvmecontrol).
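E.g. (the format index is drive-specific, and reformatting destroys all data on the namespace):
Code:
nvmecontrol identify nvme0ns1        # lists the supported LBA formats
nvmecontrol format -f 1 nvme0ns1     # hypothetical index for the 4k format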
 
It's probably lying about 512 byte sectors. What are the pros and cons of forcing 4096 disk blocks if the goal is to increase speed, minimize disk writes, not impact on reported file size negatively. Any input would be much appreciated.
You're probably right:
https://forums.FreeBSD.org/threads/best-ashift-for-samsung-mzvlb512hajq-pm981.75333/post-528982

I did some digging on three different (but only two different models) NVMe drives here:
 
Yeah, if most files (or their final parts) are much smaller than 4096 bytes, it's obviously that much wasted space per entry, unless OS/HW do some tricks to minimize that. So I've decided to play it safe and stay with 512 (ashift=9). Which is quite natural, considering I no longer have a choice))
 