ZFS on AF drives today and partitioning

I've found a lot of info about using ZFS with 4k-sector drives. Mine is a Seagate ST2000DL003 (2TB "Green", 5900RPM). I think it has 4k physical sectors, and I think it pretends to be 512 bytes (like the WD "EARS" drives I've read a lot of bad things about).

My understanding is that best practice is to use gnop to fool ZFS into treating the apparently-512-byte drives as 4k (which they really are anyway ;) ), and from there it's basically smooth: ashift gets set to 12, and everyone is happy. (As an aside, I wonder whether ashift can be made a tunable parameter at zpool create time. I don't think Solaris has this option, but I've heard that the LLNL Linux port allows it. That way the gnop hack wouldn't be necessary.)
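From what I've read (untested by me, so treat the exact syntax as my assumption), the Linux port takes it as a property at pool creation time, something like:

# zpool create -o ashift=12 poolX /dev/sdb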

OK, so that leads me to my question, which I am not clear on. Initially I would have been happy giving ZFS the whole disk, as is standard best practice on Solaris. However, I recently had an incident at work that made me start thinking more (basically, I had to migrate data from one SAN storage unit to another; the new SAN unit had slightly smaller LUNs; bad news). I'm thinking it would be best to shave 1GB or so off the 2TB (enough to be safe, but small enough not to notice) in case I ever need to replace or mirror to a slightly smaller drive.

The best/only way I can think of to do this is with partitioning. Obviously the partitions should then definitely be laid out on 4k boundaries. What partitioning scheme is best for this (GPT, BSD, DOS-style FDISK, EFI, SMI, etc.)? Ideally it would be something portable, but at the very least I'd like to understand how it works, so that if I did have to change I could create another partition scheme pointing to the same place. Does this relate at all to a given labelling scheme? And most importantly: do I do the gnop trick on the whole drive and then partition, or do I partition the drive and then do the gnop trick on only the partition?

Thanks for any advice!
 
@ctengel

# gpart create -s gpt daX
# gpart add -t freebsd-zfs -l diskX -b 2048 -a 4k daX
("-a" value is available from 8-STABLE and up)
# gnop -S 4096 gpt/diskX
# zpool create poolX gpt/diskX.nop
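Once the pool is created the ashift is stored in it, so the .nop provider can be dropped again:
# zpool export poolX
# gnop destroy gpt/diskX.nop
# zpool import poolX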

/Sebulon
 
@Sebulon,
I take it you consider GPT the way to go, and that the gnop trick should be done on the partition, not the whole disk.
Am I correct in understanding that this would create a partition starting at the 8MiB point, with a 4KiB block size?
To limit the size (in my case, 1999GB would probably be safe), I guess I would add a -s 488037109 (=1999*1000^3/4096).
Just out of curiosity though, why have it not start until 8MiB? Is that where the bootloader customarily goes? Or the GPT info? (Or wait, I thought that was at the end...)
Thanks,
Chris
 
@ctengel

GPT is the way of the future. MBR is deprecated because of its many limitations; the 2TB maximum partition size, just to name one.

-b 2048 does set the partition start boundary, but not at 8MiB:
Code:
2048 * 512 / 1048576 = 1MiB
You have to calculate with the old 512-byte block size because the drives still present themselves that way to the OS, even though that isn't their real sector size.

-a 4k sets the end boundary of the partition so that its size is evenly divisible by 4096, keeping your writes aligned all the way through to the end.

There is no such thing as a partition with a 4k block size. Block size is something the filesystem decides. In ZFS, for example, the block size is set by the ashift value:
Code:
ashift=9 : 2^9 =512  (for drives with 512b physical sectors)
ashift=12: 2^12=4096 (for drives with 4k physical sectors)
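You can verify what a pool ended up with using zdb(8):
Code:
# zdb | grep ashift
            ashift: 12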

/Sebulon
 
@Sebulon,

quoting from the gpart(8) man page on FreeBSD 9-STABLE r232156:

Code:
     Create a 15GB-sized freebsd-ufs partition to contain a UFS filesystem and
     aligned on 4KB boundaries:

           /sbin/gpart add -s 15G -t freebsd-ufs -a 4k da0

So, maybe -b 2048 is not necessary anymore?
 
That example from the man page is for when you only want alignment on 4k (8-sector) boundaries. -b 2048 is one way of creating a slightly smaller partition than what is available on the disk, to prepare for the situation where a replacement disk is smaller than the current one. It also happens to have the same effect on alignment, because 2048 is divisible by 8. I would probably do the opposite: let the partition start from the earliest 4k boundary and limit its size with -s to a suitable value.
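For example (the label and size here are only placeholders; pick your own margin), on a 2TB disk something like:
# gpart add -t freebsd-zfs -a 4k -s 1860G -l disk0 da0
starts the partition at the first 4k-aligned sector and leaves a couple of GB unused at the end.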

Don't worry about the GPT partition metadata; it's properly hidden from the user and you don't have to reserve space for it manually. The bootloader, if one is used, is installed on a separate partition of type freebsd-boot.
 
OK, so I ended up doing this.
I took 2TB, then subtracted 10MiB: 1MiB for stuff at the beginning, 1MiB for stuff at the end, plus 8MiB for the mysterious partition 8 (I have no idea what this actually is, but on Solaris, if you give ZFS the whole disk it always makes one, and I figured I'd leave room for it in case I ever want to move to Solaris).
Then I rounded that number down to the nearest multiple of 256MiB (just so I'd get a round number; I don't know why).
Anyway, I ended up with a ~1.8TiB partition, which is about what I started with (2.0TB is about 1.8TiB).
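For the record, the arithmetic went roughly like this with bc(1), taking "2TB" as the nominal 2*10^12 bytes (the first line is the size in MiB; the second subtracts the 10MiB and rounds down to a multiple of 256MiB):
Code:
# echo "2 * 10^12 / 2^20" | bc
1907348
# echo "(1907348 - 10) / 256 * 256" | bc
1907200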

I created it with something like:
Code:
# gpart add -t freebsd-zfs -a 2048 -s 1907200M -l fzf000 ada2
and got
Code:
# gpart show ada2
=>        34  3907029101  ada2  GPT  (1.8T)
          34        2014        - free -  (1M)
        2048  3905945600     1  freebsd-zfs  (1.8T)
  3905947648     1081487        - free -  (528M)

OK, so far so good. I did the gnop trick and created my zpool (using the GPT-labelled vdev names, like "gpt/fzf000.nop") and verified that ashift=12.
Then I exported the pool and destroyed the gnop devices.

When I went to re-import it, though, the vdevs came in under their device names (like "ada2"). This actually makes perfect sense to me: the original names, which would have been cached, no longer exist, but upon issuing the import command it's clever enough to find the disks by their metadata.

However, I'm wondering if there is a way to "force" it to import them with the GPT label names. I tried importing with "-d /dev/gpt", and that worked once, but after exporting and importing again they came in under the ada names. Any ideas?

(It would be nice, but it wouldn't be a major issue if I couldn't. The main reason I didn't just do ZFS on the whole disk was the issue of moving to smaller disks. I picked GPT because I realized GPT is the same thing as the EFI label that is best practice for ZFS on Solaris (except on boot/root disks, where you need SMI/VTOC), and it seems pretty good. I know glabel can also be used for labelling, but my understanding is that it tends to overwrite things at the end of the disk.)
 
ZFS on FreeBSD does not create any "mysterious" partitions; if you give zpool(8) a whole disk, it will use it only for ZFS, with no partitions of any kind. That also means there is no need to reserve space at the beginning or the end when using partitions, other than shifting the start of the partition to a 4k boundary, or leaving the partition slightly smaller than the disk as you have done.

This behaviour of forgetting the GPT labels after an export/import smells like a real bug to me. I just tested creating a pool using GPT labels on a virtual machine, and I saw exactly what you saw: the device names revert to ada*.
 
Nice to see more people noticing this other than myself. My workaround is as follows:

# gpart create -s gpt ada0
# gpart create -s gpt ada1
# gpart create -s gpt ada2
# gpart create -s gpt ada3
# gpart add -t freebsd-zfs -l disk0 -b 2048 -a 4k ada0
# gpart add -t freebsd-zfs -l disk1 -b 2048 -a 4k ada1
# gpart add -t freebsd-zfs -l disk2 -b 2048 -a 4k ada2
# gpart add -t freebsd-zfs -l disk3 -b 2048 -a 4k ada3

For annoying reasons, you need to create "fake" drives to build your pool with. You shouldn't have to do this, really, but it's the only workaround I've found that makes the gnop trick work and the labels show up. The recipe for dd is to seek forward to the total size of your drives in MiB (1048576 bytes, matching dd's bs=1m), minus a couple, so the sparse files come out slightly smaller than the real partitions. Use diskinfo to see the size of one of your drives in bytes:
Code:
[CMD="#"]diskinfo -v ada0[/CMD]
	512         	# sectorsize
	2000398934016	# mediasize in bytes (1.8T)
	3907029168  	# mediasize in sectors
        ...

[CMD="#"]echo "2000398934016 / 1024000 - 1" | bc[/CMD]
1953513

# dd if=/dev/zero of=/tmp/tmpdsk0 bs=1m seek=1907727 count=1
# dd if=/dev/zero of=/tmp/tmpdsk1 bs=1m seek=1907727 count=1
# dd if=/dev/zero of=/tmp/tmpdsk2 bs=1m seek=1907727 count=1
# dd if=/dev/zero of=/tmp/tmpdsk3 bs=1m seek=1907727 count=1

# mdconfig -a -t vnode -f /tmp/tmpdsk0 md0
# mdconfig -a -t vnode -f /tmp/tmpdsk1 md1
# mdconfig -a -t vnode -f /tmp/tmpdsk2 md2
# mdconfig -a -t vnode -f /tmp/tmpdsk3 md3

# gnop create -S 4096 md0
# zpool create pool raidz md0.nop md1 md2 md3
# zpool export pool
# gnop destroy md0.nop
# zpool import pool

# zpool offline pool md0
# mdconfig -d -u 0
# rm /tmp/tmpdsk0
# zpool replace pool md0 gpt/disk0

# zpool offline pool md1
# mdconfig -d -u 1
# rm /tmp/tmpdsk1
# zpool replace pool md1 gpt/disk1

# zpool offline pool md2
# mdconfig -d -u 2
# rm /tmp/tmpdsk2
# zpool replace pool md2 gpt/disk2

# zpool offline pool md3
# mdconfig -d -u 3
# rm /tmp/tmpdsk3
# zpool replace pool md3 gpt/disk3

/Sebulon
 
@Sebulon: So basically you're creating the pool with fake disks, using gnop to get ashift set correctly, and then replacing them with the actual labelled disks so that those labels get stored in zpool.cache. I'm wondering if you could "replace" the ada names with the labelled disks...
 
Yes you can.

[CMD=""]#gnop create -S 4096 /dev/gpt/disk0[/CMD]
[CMD=""]#gnop create -S 4096 /dev/gpt/disk1[/CMD]
[CMD=""]#gnop create -S 4096 /dev/gpt/disk0[/CMD]

[CMD=""]#zpool create pool raidz1 /dev/gpt/disk0.nop /dev/gpt/disk1.nop /dev/gpt/disk2.nop[/CMD]

[CMD=""]#zpool export pool[/CMD]

[CMD=""]#gnop destroy /dev/gpt/disk0.nop[/CMD]
[CMD=""]#gnop destroy /dev/gpt/disk1.nop[/CMD]
[CMD=""]#gnop destroy /dev/gpt/disk2.nop[/CMD]

[CMD=""]#zpool import pool[/CMD]
 
Is the 4k issue ENTIRELY solved with the gnop procedure? I.e., no performance or other issues apart from the naming of the disks? (I can live with that, as long as there are no other problems.)

Would it be just as easy using geli?
 
@gkontos: When I did exactly that (on FreeBSD 9-RELEASE), the import went smoothly, but zpool status shows adaX instead of gpt/label-name. What I'm trying to see is whether there's a straightforward way to do this (Sebulon provided a method that appears workable, but quite long), but I'm also beginning to think I don't really need labels.

@naguz: Aside from gnop, you also need to lay the partitions out along 4k boundaries (with gpart on GPT you can use -a), but yes, that seems to do it. Use zdb to verify ashift=12. I don't know about geli. I'm guessing performance when dealing with files between 1 and 3584 bytes might not be as good in general on 4k drives, but on ZFS, as long as you have things set up as discussed above, you should see no performance issues.
 
naguz said:
Would it be just as easy using geli?

As far as I know, you don't need to worry about all this when using geli.
I just encrypted the following disks using geli with a geli blocksize of 4096:
Code:
camcontrol devlist
<SAMSUNG HD103SJ 1AJ100E5>         at scbus0 target 0 lun 0 (ada0,pass0)
<SAMSUNG HD103SJ 1AJ10001>         at scbus1 target 0 lun 0 (ada1,pass1)
<SAMSUNG HD103UJ 1AA01118>         at scbus2 target 0 lun 0 (ada2,pass2)
<Corsair Force 3 SSD 1.3>          at scbus4 target 0 lun 0 (ada3,pass3)
<HL-DT-ST RW/DVD GCC-4481B E106>   at scbus6 target 0 lun 0 (pass4,cd0)
<SAMSUNG HD103UJ 1AA01118>         at scbus8 target 0 lun 0 (ada4,pass5)
<SAMSUNG HD103SJ 1AJ10001>         at scbus9 target 0 lun 0 (ada5,pass6)
<SAMSUNG HD103SJ 1AJ10001>         at scbus10 target 0 lun 0 (ada6,pass7)
<SAMSUNG HD103SJ 1AJ10001>         at scbus11 target 0 lun 0 (ada7,pass8)
<SAMSUNG HD103SJ 1AJ10001>         at scbus12 target 0 lun 0 (ada8,pass9)
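For reference, each disk was initialised roughly like this; geli init -s 4096 sets the 4096-byte provider sector size (passphrase/key options omitted for brevity, see geli(8)), and the pool is then built on the .eli devices, one mirror pair at a time:
Code:
# geli init -s 4096 /dev/ada0
# geli attach /dev/ada0
# zpool create tank mirror ada0.eli ada1.eli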

Code:
>zpool status
  pool: tank
 state: ONLINE
  scan: resilvered 18.4M in 0h0m with 0 errors on Fri Mar  2 14:53:48 2012
config:

        NAME          STATE     READ WRITE CKSUM
        tank          ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            ada0.eli  ONLINE       0     0     0
            ada1.eli  ONLINE       0     0     0
          mirror-1    ONLINE       0     0     0
            ada5.eli  ONLINE       0     0     0
            ada6.eli  ONLINE       0     0     0
          mirror-2    ONLINE       0     0     0
            ada7.eli  ONLINE       0     0     0
            ada8.eli  ONLINE       0     0     0
          mirror-3    ONLINE       0     0     0
            ada2.eli  ONLINE       0     0     0
            ada4.eli  ONLINE       0     0     0

errors: No known data errors

Code:
zdb |grep ashift
            ashift: 12
            ashift: 12
            ashift: 12
            ashift: 12

I used neither gpart nor gnop, and it seemed to work out of the box.
 
ctengel said:
@Sebulon: So basically you're creating it with fake disks, using gnop to get ashift set correctly, and then replacing with the actual labelled disks so that those labels get stored in the zpool.cache.
Yep, that's about it.

ctengel said:
I'm wondering if you could "replace" the ada names with the labelled disks...
No, because ZFS complains that "the disk you are trying to replace is an active part of <pool>" bullcrap. It's as if it senses that the label points to the partition and the partition points to the label. That's why I had to create the "fake" disks to build the pool with.

@kpa

I have tried to do it "normally" with glabel(8) labels instead, but with the same result; ZFS reverts back to device names after the re-import.
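(That is, something along these lines, with the pool then built on the label; the names here are just examples:)
# glabel label disk0 ada0
# zpool create pool label/disk0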

/Sebulon
 
ctengel said:
@gkontos: When I did exactly that (on FreeBSD 9-RELEASE), the import went smoothly, but zpool status shows adaX instead of gpt/label-name. What I'm trying to see is whether there's a straightforward way to do this (Sebulon provided a method that appears workable, but quite long), but I'm also beginning to think I don't really need labels.

No, you need labels!

Code:
gkontos@mail:~> zpool status zroot
  pool: zroot
 state: ONLINE
 scan: scrub repaired 0 in 0h8m with 0 errors on Tue Feb 28 03:11:46 2012
config:

	NAME        STATE     READ WRITE CKSUM
	zroot       ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    ada0p2  ONLINE       0     0     0
	    ada1p2  ONLINE       0     0     0

If you check with the gpart(8) command you will see that the labels are there!

Code:
gkontos@mail:~> gpart show -l
=>        34  1953525101  ada0  GPT  (931G)
          34         128     1  (null)  (64k)
         162  1953524973     2  disk0  (931G)

=>        34  1953525101  ada1  GPT  (931G)
          34         128     1  (null)  (64k)
         162  1953524973     2  disk1  (931G)
 
Sebulon said:
No, because ZFS complains that "the disk you are trying to replace is an active part of <pool>" bullcrap. It's as if it senses that the label points to the partition and the partition points to the label. That's why I had to create the "fake" disks to build the pool with.

You can always gnop the labels like in my example above. And in theory, only the first disk needs to be aligned; the remaining disks added to the pool should pick up the 4k alignment.
 
This seems to work to keep my disks labeled (mirrored pool in a virtual environment):

Code:
# gpart create -s gpt ada1
# gpart create -s gpt ada2

# gpart add -t freebsd-zfs -l zdisk1 -b 2048 -a 4k ada1
# gpart add -t freebsd-zfs -l zdisk2 -b 2048 -a 4k ada2

# gnop create -S 4096 gpt/zdisk2

# zpool create tank mirror gpt/zdisk1 gpt/zdisk2.nop

# zpool detach tank gpt/zdisk2.nop

# gnop destroy gpt/zdisk2.nop

# zpool attach tank gpt/zdisk1 gpt/zdisk2
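The attach works because the new disk joins an existing top-level vdev and so inherits the ashift fixed at pool creation; zdb should still report 12 afterwards:
Code:
# zdb | grep ashift
            ashift: 12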
 
gkontos said:
If you check with gpart(8)() command you will see that labels are there!

Code:
gkontos@mail:~> gpart show -l
=>        34  1953525101  ada0  GPT  (931G)
          34         128     1  (null)  (64k)
         162  1953524973     2  disk0  (931G)

=>        34  1953525101  ada1  GPT  (931G)
          34         128     1  (null)  (64k)
         162  1953524973     2  disk1  (931G)

Yes, when you check what gpart thinks, it looks good, but if you:
# ls -lah /dev/gpt
*Poof* Disappeared! Can't use them if they aren't there. :) It's as if ZFS has lain on top of them, covering them. If you export the pool and ls again, they are back; strangest thing...

I noticed this magic a while back on 8-STABLE:
Labels "disappear" after zpool import

/Sebulon
 
Sebulon said:
Yes, when you check what gpart thinks, it looks good, but if you:
# ls -lah /dev/gpt
*Poof* Disappeared! Can't use them if they aren't there. :) It's as if ZFS has lain on top of them, covering them. If you export the pool and ls again, they are back; strangest thing...

I noticed this magic a while back on 8-STABLE:
Labels "disappear" after zpool import

/Sebulon

Yes, I know this is the default behavior now. The good thing, though, is that even if your ada1 becomes ada3 (meaning you changed a SATA port), the pool will be imported back with no problems.

Also, in installations with many disks, what I like to do is literally label each disk with a sticker.
 
Sebulon said:
No, because ZFS complains that "the disk you are trying to replace is an active part of <pool>" bullcrap. It's as if it senses that the label points to the partition and the partition points to the label. That's why I had to create the "fake" disks to build the pool with.
If it's anything like Solaris, it actually is just looking at the block device itself and sees metadata that seems to indicate it's part of the already-mounted pool.

gkontos said:
No, you need labels!
Why?
I know I'm playing "devil's advocate" here, but I really had never heard of people always using labels to identify disks like this until I came to investigate the BSD world.
It's a good idea, but what makes it necessary?

gkontos said:
If you check with the gpart(8) command you will see that the labels are there!
Good point, although they don't show up in zpool status.

KrusT said:
This seems to work to keep my disks labeled (mirrored pool in a virtual environment):
That's certainly fewer steps and looks like it would work.

Sebulon said:
*Poof* Disappeared! Can't use them if they aren't there. :) It's as if ZFS has lain on top of them, covering them. If you export the pool and ls again, they are back; strangest thing...
I don't claim to be an expert on this, but it looks like that is just GEOM doing some magic. Apparently, if one of its providers/consumers (I forget which) under a certain name is in use, it removes the special files for the other names. This is done to protect you against, I guess, things like mounting the same filesystem twice. ZFS protects you in a sense too, by the trick I mentioned above.

gkontos said:
Yes, I know this is the default behavior now. The good thing, though, is that even if your ada1 becomes ada3 (meaning you changed a SATA port), the pool will be imported back with no problems.
Wouldn't ZFS be able to figure this out anyway by scanning for metadata? I've exported a zpool on SAN LUNs on one system, provisioned them to another, and been able to import it just by saying zpool import name. (EDIT: forgot to mention the important detail that this was on Solaris. Still a n00b to FreeBSD.)
 
ctengel said:
If it's anything like Solaris, it actually is just looking at the block device itself and sees metadata that seems to indicate it's part of the already-mounted pool.

Yes, ZFS adds its own metadata to drives, so you can import pools regardless of where the drives are physically located in a system (meaning you can export a pool, re-arrange all the cables, and still import it with just "zpool import poolname").

Why?
I know I'm playing "devil's advocate" here, but I really had never heard of people always using labels to identify disks like this until I came to investigate the BSD world.
It's a good idea, but what makes it necessary?

It's not mandatory, and it's not strictly necessary. However, once you get beyond a handful of drives in a system (like a 12-, 24-, or 48-bay chassis), it's nice to have labelled devices. It makes figuring out which drive is dead, and which physical drive to pull out of the system, a whole lot simpler. Especially if the box dies/reboots for any reason after you've pulled the old drive but haven't added the new one yet, and everything has been renumbered behind your back. Or you've replaced controllers, or updated the BIOS, and now things are numbered in reverse. Etc.

Seeing "gpt/disk10", or "label/chassis01-disk05", or "gpt/disk-a4" makes it so much simpler to figure out which drive to remove than "da23", or "ada15", or "c0t3d2" (or whatever the Solaris syntax is).
 
phoenix said:
Yes, ZFS adds its own metadata to drives, so you can import pools regardless of where the drives are physically located in a system (meaning you can export a pool, re-arrange all the cables, and still import it with just "zpool import poolname").

Are you absolutely, 100% sure of this?
 
Yep.

Depending on how big a change you make to the drive layout (as in, completely rewiring everything so that zpool.cache is completely wrong), you may have to point zpool at the devices, but it will import:
# zpool import -d /dev/gpt/ <poolname>
# zpool import -d /dev <poolname>
The first will force it to search via GPT labels. The second will force it to search the raw device nodes.

# zpool import poolname
This should work; it's very rare for it to fail. The two variations above are needed if you want the devices listed in "zpool status" output to better match how they were named before.
 