ZFS - whole disk vs GPT slice

  • Whole disk

    Votes: 7 41.2%
  • Glabeled whole disk

    Votes: 0 0.0%
  • A single GPT slice on a single-sliced disk (no glabel)

    Votes: 4 23.5%
  • A single GPT slice on a single-sliced glabeled disk

    Votes: 1 5.9%
  • A single GPT slice on a multi-sliced disk (no glabel)

    Votes: 4 23.5%
  • A single BSD partition inside a GPT slice on a single-sliced disk (glabeled or not)

    Votes: 1 5.9%

  • Total voters
    17
Recently I came across two posts

https://forums.freebsd.org/threads/61643/#post-362839
https://forums.freebsd.org/threads/62475/#post-360834

by a very respected member of this forum, ralphbsz, whose highly competent answers to various storage-related questions made me rethink my previously held point of view. Namely, I am one of those people who used to provision entire raw disks to a ZFS pool. I am no longer sure that my approach, which was originally recommended by the Sun Solaris ZFS documentation (IIRC Solaris originally didn't support GPT slices, but there were also other reasons for Sun to make that recommendation), is the correct one. It is true that one can create a ZFS pool out of raw disks (unlabeled or glabeled), GPT slices (most famously root on ZFS, where we have boot, swap, and / slices, and also FreeNAS, which slices every storage disk into two slices, zfs-swap and zfs-data), or even a BSD partition inside a GPT slice (I have not seen that one for a long time). It is the year 2017 and the above question seems to be the object of many holy Internet wars

https://serverfault.com/questions/628632/should-i-create-zfs-zpools-with-whole-disks-or-partitions
https://www.freebsddiary.org/zfs-with-gpart.php
https://forums.freenas.org/index.php?threads/zfs-on-partitioned-disks.37079/
https://www.reddit.com/r/zfs/comments/45bp65/zfs_on_a_partition_vs_a_whole_disk/
http://www.unix.com/solaris/185141-zfs-whole-disk-vs-slice.html
https://github.com/zfsonlinux/zfs/issues/94

So what would be the final recommendation, and what would be the reasoning behind it? Michael Lucas' FreeBSD Mastery series seems to recommend the same slicing scheme as FreeNAS (I refuse to use the word partition for a GPT slice to accommodate Linux fanboy ignorance of BSD partitions)?
 
Oko

What may be a problem¹ with this poll is that it mixes votes from people (me included) without the technical knowledge to say (or properly argue) with any authority which solution is better, and who therefore vote basically on personal preference, with votes from people who actually have that knowledge.

Other than that it is a nice initiative. :)

And yes, I was the first to vote! :D

¹ the result of the poll could give a misleading impression to someone who finds this thread later and does not read the entire discussion.
 
Using raw disks in Solaris could be managed because the name of the disk also revealed its location if you knew how the HBAs and JBODs were wired: c0t0d0s0 is Controller 0, Target (SCSI ID) 0, Disk (LUN) 0, and Slice 0.

But there is nothing stopping you from GPT-partitioning hard drives and labelling them, either with straight-up GPT labels (which I like) or with glabel (although why use glabel when labelling is "built in" to gpart?).

The reason you use labels is so that, if you have a disk called e.g. disk-03-21, you know it is the twenty-first disk in the third JBOD, and therefore which hard drive to physically pull when one of them has died. The thing to take extra care of when partitioning is alignment, so that writes are evenly aligned with the "physical" sectors of the hard drive. Typically, I start my freebsd-zfs partition at sector 2048 (1 MiB) so that it aligns with just about anything from that point of the drive onwards.
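For what it's worth, a minimal sketch of what that looks like with gpart (the device name da3 and the label disk-03-21 are just made-up examples):
Code:
# create a GPT table and one freebsd-zfs partition,
# 1 MiB-aligned and carrying a human-readable GPT label
gpart create -s gpt da3
gpart add -t freebsd-zfs -a 1m -l disk-03-21 da3
# the labelled partition then appears as /dev/gpt/disk-03-21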

/Sebulon
 
I'm using a single GPT-slice for physical disks and full "disk" when using volumes over iSCSI or FC.
Besides the reason Sebulon already mentioned (meaningful labels to locate disks), there is also always the discrepancy of slightly varying disk sizes between different vendors and even between different models or revisions from the same vendor. By using GPT-slices and leaving a few dozen MB unused at the end, this problem is circumvented.

On volumes shared via iSCSI or FC I usually use the whole "disk" - this makes it much quicker and easier to grow a pool (especially with autoexpand=on set) by simply resizing the zvol on the target machine and reloading the CAM target layer.
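As a rough sketch of that procedure (pool, zvol and device names are made up, and it assumes ctld/CTL is serving the zvol on the target side):
Code:
# on the iSCSI/FC target: grow the backing zvol and reload ctld
zfs set volsize=4T tank/vols/pool1-disk0
service ctld reload

# on the initiator: let the pool grow into the new space
zpool set autoexpand=on pool1
zpool online -e pool1 da5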
 
I will usually use a GPT labelled partition for ZFS. It could be the only partition or there may be more than one depending on the situation. The labels allow me to give a useful name to the device, and it's configured using a standard method supported by more than just FreeBSD.

I don't use bsdlabel/slices anymore as that's the "old" method and GPT provides lots of benefits. I also don't use glabel, as it makes no sense to use a FreeBSD-specific labelling system when I can just use GPT labels, especially when you end up with two devices on the system (/dev/daXpY and /dev/label/xyz) pointing to the same partition, with only one of them exposing the label as part of the usable device name.
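As a quick illustration of the difference (device and label names are invented), the GPT label shows up directly under /dev/gpt, whereas glabel would add a second alias for the same partition:
Code:
gpart show -l da4        # lists the partitions with their GPT labels
ls /dev/gpt/             # e.g. /dev/gpt/zdata04 is the same partition as /dev/da4p1
glabel status            # a glabel'ed partition would additionally show up as
                         # /dev/label/<name>, duplicating the device node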
 
Hi sko,

I'm using a single GPT-slice for physical disks . . . there is also always the discrepancy of slightly varying disk sizes between different vendors and even between different models or revisions from the same vendor. By using GPT-slices and leaving a few dozen MB unused at the end, this problem is circumvented.

This is what I have been doing. But I always wonder how much to leave unused. I left "a few dozen MB" unused from the smallest hard disk. However, this is not very systematic; what if the next disk I buy is the smallest one? Thus, I wonder if there is a more systematic approach, e.g. a percentage, or fraction thereof, of the total capacity. Any input?

Kindest regards,

M
 
Hi sko,

This is what I have been doing. But I always wonder how much to leave unused. I left "a few dozen MB" unused from the smallest hard disk. However, this is not very systematic; what if the next disk I buy is the smallest one? Thus, I wonder if there is a more systematic approach, e.g. a percentage, or fraction thereof, of the total capacity. Any input?

Kindest regards,

M

I usually round down to the next full 100M, or more if this gives a "nice round number". I know this is not really sophisticated, but it has worked so far. Variation on the 4TB disks I've deployed so far was somewhere around ~20-30MB, so I think subtracting 50MB from the total size and then rounding down to the next 100M step should be safe even for larger variations.
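A rough /bin/sh sketch of that calculation (diskinfo's third field is the media size in bytes; the 50MB slack, the label and the device name are just examples, and the GPT scheme is assumed to already exist on the disk):
Code:
#!/bin/sh
disk=da3
bytes=$(diskinfo ${disk} | awk '{print $3}')   # media size in bytes
mb=$((bytes / 1048576))
size=$(( (mb - 50) / 100 * 100 ))              # subtract ~50MB, round down to 100M
gpart add -t freebsd-zfs -a 1m -s ${size}m -l disk-03-21 ${disk}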
 
The reason the Solaris docs recommend full-disks for ZFS is due to their disk caching sub-system only enabling the write-cache on drives when passed a raw disk. If you use a partition, then Solaris disables the write-cache on the disk, severely impacting performance. FreeBSD's GEOM system has always allowed the drive's write-cache to be enabled regardless of how the drive is accessed, which made this a non-issue on FreeBSD.

In the early days of ZFS, every device in a vdev had to have the exact same number of sectors. If they were off by even a single sector, any zpool add or zpool create commands would fail. Since 2 disks of the same size could have different numbers of sectors, the recommendation became "use a partition of a set size to work around this issue". Eventually, ZFS started to handle this automatically, internally, by reserving up to 1 MB of space at the end of the device for "slack" to make all the devices use the same number of sectors.

Nowadays, it's more a matter of convenience to be able to get nice, human-readable information in zpool list -v or zpool status output. Makes it much easier to locate failed/problem drives when it tells you exactly where to look for it. :)
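For example, with GPT labels the zpool status output might look something like this (an illustrative excerpt with made-up names):
Code:
  pool: tank
config:
        NAME                    STATE     READ WRITE CKSUM
        tank                    ONLINE       0     0     0
          raidz2-0              ONLINE       0     0     0
            gpt/disk-01-01      ONLINE       0     0     0
            gpt/disk-01-02      ONLINE       0     0     0
            gpt/disk-01-03      ONLINE       0     0     0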
 
Hi sko,

Thank you for the answer. I would argue that if your observation is based on a statistically significant sample, adding an additional reserve of about 50% and further "rounding" as you did is a good strategy.

As an aside, I wonder if it is a "great minds think alike" case or just some unconscious preference for round numbers, but I had subtracted 20MB and then rounded up to a multiple of 1024.

Kindest regards,

M
 
[...] if your observation has been over a statistically significant sample [...]

Unfortunately I can't exactly call it "statistically significant", given the vast number of vendors/brands, models and drive/firmware versions. I'm working on a small-ish network/infrastructure, so I've only deployed ~5 or 6 different variants (vendor / model) of "4TB" drives. Mainly WD, some HGST and a few Seagate. In total those were ~25-30 drives over the last few years, I think.
I usually only get the budget for expansion/replacement of drives "on urgent demand" (i.e. when something is already on fire), so I mostly order and deploy single drives or 2-3 at a time and end up with a "colorful" variety of drives in our systems :rolleyes:
 
My concerns are really about the manageability of the system, in particular with multiple humans, and multiple operating systems. And the fact that the average IQ of human sys admins is "room temperature in Fahrenheit", and the average IQ of operating systems is "room temperature in Celsius". More data is lost due to user error than all other causes, and making systems clear and understandable prevents stupid mistakes.

To begin with: ZFS, like many other modern storage systems, will label the disks internally: it is capable of recognizing that a disk is a certain device in its pools. I don't actually know the internals of how it does that, but the biggest component of that has to be that ZFS writes a label onto the disk itself. One of the problems with this type of labelling is: if one gets a second physical disk of the same size, and copies the original disk to the second one bit-by-bit (for example with dd), ZFS will see two instances of the same vdev. I don't know how it resolves this contradiction. And I also don't know whether ZFS takes hardware characteristics of the devices into account. For example, in the real world of SAS and SATA disks, and of SCSI standards, all disks have worldwide-unique names (typically 64-bit numbers called "WWNs"), and in theory a storage system like ZFS could just use those, ignoring everything else. In practice, such an approach is not perfect: disk vendors can screw up (ship two disks with the same ID number), during prototyping phases one tends to see lots of disks whose serial number is "0" or "-1", and reading these unique names is quite OS- and hardware-specific. Where it gets really ugly is simulated disks. What is the serial number or WWN of a virtual disk on a VM client, if the disk is really being emulated by the VM host, and is in reality stored using RAID and served via a protocol like NFS or iSCSI? How do you make sure all these virtual disks remain worldwide unique? These are very hard problems, and there is no single unified solution for them. But to a single system like ZFS that doesn't matter: it has its disk labels, and as long as the vdevs are reachable, it will recognize them.

Where the problem comes in is stupid humans, and stupid other operating systems. Say you carefully partition (or slice) your disks, and then create exactly the suitable ZFS vdev's on those slices, and then you get run over by a bus. The next administrator comes in, reboots the machine, and sees a crazy mix of /dev/ada4p3 and so on, and has no clue what is which. Sure, he can spend an hour making lists, matching things, and decoding it. He might even get it right (probably but not guaranteed), and perhaps he can even guess the intent of the original designer. Maintaining and modifying the system will be hard for the next guy. To make his life easier, it's nice to leave short hints in the form of text strings. For example, the /home file system on my server at home is stored on two Hitachi disks that were bought in 2014 and 2016, and the partitions for that mirrored ZFS pool are called /dev/gpt/hd14_home and /dev/gpt/hd16_home. My boot disk is an intel ssd (I have two of them, the second as an emergency backup boot device that's occasionally updated, and used during upgrades to switch between OS versions), and the partitions are called things like intel1_boot and intel2_var_log. In the system administration logs, there are notes explaining the naming convention, and the type and use of the various partitions. So if something gets unplugged and all the /dev/adaX numbers get scrambled, I can still find things by looking in /dev/gpt. And even if I forget half the naming convention, I can reconstruct it quickly by looking at what I find in /dev/gpt.

Another open problem is geographic labelling. Sebulon hinted at that: In the old days, certain SysV derived OSes used geographic names like c0d1t2s3, and by tracing boards in slots and then cables from the boards one could physically find the disk. First off, today operating systems no longer do this consistently (in Linux it can be done by using the /sys/ file system, but requires heuristics and trickery, and lacks atomicity). Second, on a modern SAN there is no geographic location any longer; saying "the 137th disks connected to this Brocade switch" is not really useful. What we really need is a system to say "in the second rack from the left, find the third JBOD from the bottom, then go to the upper drawer, middle row, 7th disk from the left". That is completely OS- and device-specific, and implementing systems to locate and manage physical locations of disks at that level in systems that handle tens of thousands of disks is a massive task (been there, got the T-shirt). We have no general solution that comes with free software to find out where a particular disk is. And even if we had one, with emulated (virtual) disks, this doesn't always make sense.

Where labelling things gets really important is when other OSes come into play. Say you have some form of SAN, which may be as simple as a small JBOD with a half-dozen disks, connected to both your FreeBSD machine and to a Windows box. If you don't use any partition tables, the Windows machine will see the FreeBSD ZFS disks as "unused" (no partition table!), and will helpfully offer to label them and use them as NTFS volumes. That's because neither Windows nor Linux bothers to read the ZFS labels that are stored *inside* the disks (nor the GFS, GPFS, GFS, Ceph, and GFS labels; there are various file systems all called GFS), and will consider a disk without a partition table to be blank and free for the taking. I used to work on a storage system that often used dozens to thousands of disk drives on large SANs, and we frequently found disks that we considered "lost" (because we couldn't find any piece of hardware that had our label on it) to have been overwritten with ReiserFS, because some helpful but clueless Linux administrator had decided to take a free disk on the SAN and put a file system on it. Obviously, this leads to cheap jokes about Hans Reiser murdering more than just his wife, but also the competitors' file systems.

So the best way to make the disks manageable, in the presence of massive stupidity (among human sys admins, storage system designers, and operating systems) is to leave as much information around as possible. That means to take every disk, and put a standard-conforming partition table on it; today that means GPT. And then put in the GPT table information that's as clear as possible, and that says: this partition is in use, and its purpose is to be such and such. And put that information both in computer-readable binary form (for example, the GPT will contain a long magic number that says "this is a FreeBSD ZFS disk"), and a human-readable string (like this is part of the RAID-Z2 vdev for the pool "tank").

Long story short: this is why I strongly advocate for putting everything in a GPT partition, and then labeling the partitions.
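In practice that boils down to something like the following minimal sketch (the device names, labels and pool name here are illustrative, not the exact commands used on my server):
Code:
# GPT table plus a labelled freebsd-zfs partition on each disk
gpart create -s gpt ada4
gpart add -t freebsd-zfs -a 1m -l hd14_home ada4
gpart create -s gpt ada5
gpart add -t freebsd-zfs -a 1m -l hd16_home ada5

# build the mirrored pool from the labelled partitions
zpool create home mirror gpt/hd14_home gpt/hd16_home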
 
What we really need is a system to say "in the second rack from the left, find the third JBOD from the bottom, then go to the upper drawer, middle row, 7th disk from the left".

Agreed on everything except this; I believe you are taking it a bit too far, so let me explain why.

ZFS is a scale-up system, so the JBODs aren't likely to be more than 10-15 meters away - I don't think SAS cables can be much longer - so you don't have to look too far to find the specific JBOD in which the disk is located, usually somewhere directly above or below. So in practice you only need to label internally to the system you are looking at, e.g. on the server itself you put a sticker called "ZFS1", for the system head, then on each JBOD you put "ZFS1_JBOD1(2,3,4,5,etc)". Stickers are then put on each drive carriage to reflect that, with e.g. "disk-01-01" for the first disk in the first JBOD. System -> JBOD -> Disk. Then make the labeling inside the pool match the physical labels. The next system may be called "ZFS2" and have "ZFS2_JBOD1(2,3,4,5,etc)", but the disks can internally still be called "disk-<NUM>-<NUM>"; they can't "collide".

For the physical mapping of these systems and JBODs, there are other documentation systems, like RackTables, that help you keep track of what's what.

Then in scale-out systems like Ceph, the way you name (label) the OSDs lets you dig down to where a disk actually is, no matter how big the data center is, or how many there are for that matter :)

ceph osd tree
Code:
# id    weight  type name       up/down reweight
-1      3       pool default
-3      3               rack mainrack
-2      3                       host osd-host
0       1                               osd.0   up      1
1       1                               osd.1   up      1
2       1                               osd.2   up      1

As a prime example, this is how the guys at CERN have done it with their petabyte Ceph cluster:

Extremely interesting and well worth watching the entire thing!

P.S. I wanted the video to start at 7:22, but the embed didn't let me :)

/Sebulon
 
ZFS is a scale-up system, so the JBODs aren't likely to be more than 10-15 meters away - I don't think SAS cables can be much longer - so you don't have to look too far to find the specific JBOD in which the disk is located, usually somewhere directly above or below.

With NVMe over Fabrics the disk shelf "could" be as far as a few km away from the server. Not that it would make sense in any way, but as soon as company politics or beancounters get involved, "everything is possible"™



The scenarios ralphbsz described could also be boiled down to (THE) one crucial rule for everyone managing more than his single private machine: document _everything_ you do and draw up guidelines for everything that is done more than once - this is *especially* important for everything related to core infrastructure.
In the case of our "disk problem", a short step-by-step list for disk deployment and replacement would be sufficient - define how the disks/slices are named; how they can be physically located; what physical labels can be expected on the JBOD and the disk/caddy; and what commands have to be run before/after replacement of the disk. Write this so your boss can understand it (unless that would involve using crayons and interpretive dancing...), or at least in a way that some random MCSE the company might ask for help could perform these tasks without burning everything to the ground.
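Such a list can be as short as a handful of commands, for example (the pool name, label and device here are placeholders):
Code:
# 1. take the failed disk out of service
zpool offline tank gpt/disk-03-21
# 2. physically pull the disk in JBOD 3, slot 21 (check the caddy sticker!)
# 3. partition and label the replacement exactly like the old one
gpart create -s gpt da17
gpart add -t freebsd-zfs -a 1m -l disk-03-21 da17
# 4. start the resilver onto the new, identically named partition
zpool replace tank gpt/disk-03-21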
It doesn't matter if you put all this in a sophisticated wiki with colorful graphics or just plain text files with some ASCII art. The important thing is: keep it safe - these are your crown jewels! But have copies in the right (safe!) places, where the right people have access to them and know there is something there that is invaluably helpful if disaster strikes.
 
Sebulon: This is not the correct thread to argue whether automated systems are better or worse than applying stickers to devices. And the reality is that the best managed systems use a combination of tools: factory stickers that show in which order the numbering works (front row is disks 1-12, second row 13-24 and so on), user-applied stickers, the user taping a hand-drawn map and list to the side of the rack, combined with automated systems, which in the best case include history logs (this disk is here now, but until a week ago it was there, and it moved at the same time that field service came to replace a faulty cable). To a user with a single disk, this may all seem ridiculous. And to a user with 5 disks it may seem like complete overkill. But for a system with 10,000 disks, being really well organized is necessary if one wants to survive. And systems of that scale are not uncommon today. Today, a PB isn't very much any more: with redundancy, it's about 120 disks, which easily fit in two JBODs; the largest JBODs available today handle 84 and 98 disks in 4U and 5U enclosures.

Thank you for the CERN video. The CERN people are incredibly competent, and very friendly (although their situation is very different from other research and commercial data centers, so some of their techniques don't translate well to other sites). I've spent some time in that data center, including once sitting on the "bench" of their Cray X-MP.
 
It doesn't matter if you put all this in a sophisticated wiki with colorful graphics or just plain text files with some ASCII art.
Many years ago (about 20!), at a previous employer, we started using inline documentation in Java source code (early Javadoc). Since the documentation can be written in HTML, we decided to add links to pictures which show class diagrams (this was when we still used clouds, before UML was invented). Initially, we used Visio to draw the diagrams, but that turned out to be way too much work (and Rational Rose was a complete piece of crap in those days, only useful to crash your computer). So instead our group bought a digital camera (those were still expensive and unusual), and we drew diagrams on whiteboards, photographed them, and checked the pictures into the source code control system. It turns out that hand-drawn sketches are a very suitable tool for documenting high-tech systems.
 
I've spent some time in that data center, including once sitting on the "bench" of their Cray X-MP.

Oh, cool! I'd love to go there and see what their data center looks like. Talk about people with, let's say, challenges unlike most :)

/Sebulon
 
So instead our group bought a digital camera (those were still expensive and unusual), and we drew diagrams on whiteboards, photographed them, and checked the pictures into the source code control system. It turns out that hand-drawn sketches are a very suitable tool for documenting high-tech systems.


I really like https://www.draw.io and am also a fan of hand-drawn sketches for explaining/documenting complicated systems or visualizing the context of systems and services. I've never used these directly for documentation though - mainly because everything I produce in handwriting would most likely be confused with some ancient inscription in an unknown language :rolleyes:
To produce raw braindumps for myself, however, I always keep some paper and a few pens in various colors at my desk and in the machine room, as nothing to date can beat the efficiency of this.

For documentation I like/try to condense or break down descriptions of complicated systems into simpler parts, for which some simple ASCII diagrams are often easily sufficient. A lot of the time this procedure _really_ helps to get a deeper understanding of the system (even if you built it yourself), and often during this phase of documentation I find parts of those systems that could be simplified or optimized, just by trying to make the explanations as clear, precise and simple as possible.
 
On my FreeBSD box I use GPT slices in rpool so I can separate boot, ZFS and swap partitions. But for the data itself I'm using raw disks in pools.

I don't manage big FreeBSD systems, only my home ones. But I do manage big commercial Unix systems -- the majority of them HP-UX servers. There, all disks come from storage arrays, either P9500, XP or cheaper 3PARs, all on a SAN. Some are close to each other, some are spread 100+ km apart. The disks used there are not sliced but used directly for either LVM or VxVM.

Identifying a disk is easy -- it always boils down to the WWN. And as a second check you have the HW path to the device, so you can verify you are connecting to the correct storage. 11.31's agile DSFs and better native multipathing deal perfectly with moving storage around -- a changed HW path doesn't change the disk name (e.g. it is still /dev/rdisk/diskN).

I personally don't like labels. In my opinion, being able to query the disk and verify the HW path is more than enough to avoid any mistakes.
 
If your OS has a database of what disks it has seen before, and knows how to identify them (ideally, by their WWN), that's great. This is the reason why people buy HP-UX, AIX, and such operating systems. But Linux and FreeBSD don't have such a mechanism. They also don't have a hardware path (in Linux a hardware path can be constructed fairly easily out of the information in the /sys/ file system, but the OS doesn't give you one by default). There, something may be /dev/sdax or /dev/ada5 today, and /dev/sdby or /dev/ada9 tomorrow.
 