ZFS FAQ

I thought it would be useful to start a generic FAQ to help explain some common questions about ZFS in a somewhat non-technical way.
Feel free to add your own questions or add insight to the thread.


Q: Who should use ZFS?

A: The development focus for ZFS is performance, capacity, and data integrity. The project itself is built around one core concept: the importance of your DATA. However, ZFS is not for EVERY workload. For example, a UFS RAID stripe will almost ALWAYS be faster than a ZFS stripe, simply because there is no overhead to ensure integrity, no copy-on-write, and no ZFS permissions and such. So if raw throughput is all you need, perhaps UFS is a better choice.

To find out if zfs is right for you, you only really need to ask yourself one question.

"Is my data important?" if you answer yes than you probably will benefit from a zpool.

Some examples of where ZFS shines:

high availability across multiple machines/locations
improved syncing with zfs send/receive (see the sketch after this list)
snapshots and recovery
replication
automated interaction with jails and bhyve instances
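
A rough sketch of that snapshot/replication workflow (the pool, dataset, and host names below are made up for illustration):

Code:
# take a snapshot of a dataset
zfs snapshot tank/data@2020-04-01
# full send of that snapshot to another host
zfs send tank/data@2020-04-01 | ssh backuphost zfs receive backup/data
# later, send only the changes since the previous snapshot
zfs send -i tank/data@2020-04-01 tank/data@2020-04-08 | ssh backuphost zfs receive backup/data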


Q: I have 12x8TB drives in a pool.. should I not have 96TB?

A: No. Drives in a zpool are still based on the formatted capacity of the raw drive. For example, an 8TB drive's formatted capacity, regardless of the file system, is about 7.27TB. The total zpool capacity also factors in the pool type (e.g. raidz2 or raidz3), which uses 2 or 3 drives' worth of space per vdev for redundancy.

Example:

Code:
root@abyss:/usr/local/sbin # zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ  FRAG    CAP  DEDUP  HEALTH  ALTROOT
abyss  87.2T  37.7T  49.6T        -         -    1%    43%  1.00x  ONLINE  -

12 x 7.27 = 87.2T ... less 7.27 x 3 (for raidz3) = 65.45T of net usable space after the "tax"
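
If you want to double-check the math yourself, both tools accept -p to print exact byte counts instead of rounded, human-readable values:

Code:
# exact sizes in bytes, easier to compare against the drive's advertised capacity
zpool list -p
zfs list -p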


Q: Why is my free space different between zpool list and zfs list?

A: zpool list shows the total raw capacity of the drives INCLUDING the required overhead: your redundancy is accounted for, as well as ALL of the pointer data associated with snapshots. zfs list, on the other hand, shows the free space that is actually available to the pool.

Example:

Code:
zfs list
NAME    USED  AVAIL  REFER  MOUNTPOINT
abyss  23.8T  37.6T   256K  /zroot

When you add 23.8 + 37.6 you get 61.4T, which is the space left from the original 65.45T once ALL of the raw data required for the ENTIRE pool is accounted for (so in my case there is about 4T of snapshot and file system information).

vs

Code:
root@abyss:/usr/local/sbin # zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ  FRAG    CAP  DEDUP  HEALTH  ALTROOT
abyss  87.2T  32.8T  54.5T        -         -    1%    37%  1.00x  ONLINE  -

The zpool list command reports the total formatted capacity of the entire pool. The allocated column includes all of the filesystem overhead (snapshot data, for example), and what is left over is the space marked as free.

It is best to always refer to zfs list for the most accurate, human-readable free space.
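
If you want to see exactly where the difference goes, zfs list can break usage down per category (space used by snapshots, by the dataset itself, by children, and by refreservations):

Code:
# per-dataset breakdown of used space
zfs list -o space
# or for a single dataset, e.g. (dataset name is just an example):
zfs list -o space abyss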


Q: I deleted a file, but no space was recovered

A: ZFS never actually deletes files; it deletes pointers and marks the space as free. That being said, if the file exists within a snapshot, the file still consumes space until that snapshot is destroyed.

Example:

before
Code:
root@abyss:/usr/local/sbin # zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ  FRAG    CAP  DEDUP  HEALTH  ALTROOT
zpool  87.2T  42.6T  44.7T        -         -    1%    48%  1.00x  ONLINE  -

root@abyss:/usr/local/sbin # zfs list -H -o name -t snapshot | grep -i '2020-03' | xargs -n1 zfs destroy
i.e. destroy all snapshots for 2020-03 (all March 2020 snapshots)

after
Code:
root@abyss:/usr/local/sbin # zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ  FRAG    CAP  DEDUP  HEALTH  ALTROOT
abyss  87.2T  37.7T  49.6T        -         -    1%    43%  1.00x  ONLINE  -

As you can see, 4.9T of space was released (aka marked as free) when the snapshots were actually merged/deleted.
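
Before destroying anything for real, zfs destroy accepts -n (dry run) and -v (verbose), which print how much space would be reclaimed without touching the pool. The snapshot names below are just examples:

Code:
# preview the space a single snapshot destroy would free
zfs destroy -nv abyss@auto-2020-03-01
# a range of snapshots works too
zfs destroy -nv abyss@auto-2020-03-01%auto-2020-03-31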


Q: What is the deal with memory?
A: As a rule, the more memory the better. This is because ZFS uses a variety of caching methods, all of which can take advantage of system resources. The most common one is the ARC (Adaptive Replacement Cache).

Code:
ARC: 89G Total, 10G MFU, 78G MRU, 600K Anon, 193M Header, 28M Other
85G Compressed, 86G Uncompressed, 1.01:1 Ratio

ARC on its own deserves an entire book. It's confusing and often misunderstood. In short, we care about two values:

MFU - Most Frequently Used
MRU - Most Recently Used

Whenever your pool accesses data, it creates pointers to locations on disk. This cache is updated all the time and allows the system to use the pool efficiently. It is also important to ensure you have enough memory.
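
If you want to look at the current ARC numbers outside of top, FreeBSD exposes them as sysctls. The OIDs below are what I would expect on a stock install; exact names can vary slightly between versions:

Code:
# current ARC size and its configured maximum
sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.c_max
# the MFU and MRU portions
sysctl kstat.zfs.misc.arcstats.mfu_size kstat.zfs.misc.arcstats.mru_size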

For example, ZFS batches writes to disk in the largest chunks it can. So if you have 8GB of RAM, by default it will write in roughly 7GB chunks; with 128GB your continuous writes can be drastically larger (on the order of 112GB). This has many benefits, such as better contiguous writes.

Do not work on 40GB RED video files on a zpool with 8GB of RAM and expect good performance. If you are working on large files, make sure you have plenty of RAM.


Q: Do I need ECC Memory?

A: The short answer is "yes and no". There are several threads that discuss this topic in depth. The long and the short of it: ECC is always preferred, but non-ECC RAM will work in most cases. As stated above, it is far more important to have MORE RAM.

ECC RAM helps protect against memory bit flips and other issues that could arise from unchecked values changing in memory. ZFS uses those in-memory values, for example as pointers in the ARC, but when data is written it is self-contained and all data is hashed independently of what is in memory. As some people have noted, bit flips that change a cache pointer are EXCEEDINGLY rare.

I would generally live by this rule: if your pool is a 24/7 production environment, you should be using ECC RAM. If it is for a home user, a NAS, or a Plex server, chances are you will never notice any difference between the two.


Q: HELP!!! ARC Cache is using ALL of my memory!

A: Relax! That's by design. By default the ARC will consume as much system memory as possible, leaving (I believe) about 5% for the system at all times. As your system requests memory, the ARC will release it as needed. For the most part you do not want to mess with the settings, but if you want to specifically limit it, see https://docs.freebsd.org/en/books/handbook/zfs/#zfs-advanced
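
If you do decide to cap it, the usual approach on FreeBSD is to set the tunable in /boot/loader.conf. The 4GiB value below is only an example, pick whatever suits your workload:

Code:
# /boot/loader.conf
# cap the ARC at 4GiB (value is in bytes)
vfs.zfs.arc_max="4294967296"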


Hopefully this helps..
stay tuned for an update on understanding pool types and workloads in plain English.
 
Some questions I have:
Is the OpenZFS port useful?
Is ZFS useful on only one drive?
Is ZFS useful on a partition of only 100MB?
Which blocksize should you choose for your root filesystem, for bittorrent storage, for a MySQL database, for a PostgreSQL database?
Do the ARC cache and TMPFS not fight for memory?
 
Is ZFS useful on only one drive?
Good thread at the iX forum regarding that.
 
Is ZFS useful on a partition of only 100MB?
Size doesn't matter :)
Which blocksize should you choose for your root filesystem, for bittorrent storage, for a MySQL database, for a PostgreSQL database?
This is more a question about pool types, design, and configuration between ZFS and the application (I'm still writing that addon). In regards to SQL:
my approach here is to make a zvol with the same block size as the database, i.e. you don't want to mix an 8k database with a 4k file system. Zvols are a great solution to that problem.

If you're not familiar with zvols: essentially a zvol is a dedicated allotment of blocks that you can mount as a device or logical drive, e.g. in a VM or whatnot. The good part is that you can create them with any blocksize you like. So you could have 2 drives at 4k, mount those in a bhyve VM as the "root" drive, then create 4 8k zvols and mount them as separate drives in the VM, and configure PostgreSQL to use 8k sectors.

Or mount them on the ZFS host, where they will essentially act like a normal dataset, and you're off and running.
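
A minimal sketch of that idea (the pool and volume names are made up, and note that volblocksize has to be chosen when the zvol is created):

Code:
# create a 20G zvol with an 8k block size for an 8k-page database
zfs create -V 20G -o volblocksize=8k tank/pgvol
# the block device shows up under /dev/zvol/ and can be handed to bhyve
ls /dev/zvol/tank/pgvol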

Do the ARC cache and TMPFS not fight for memory?
The short answer is that ARC will always purge and release memory by default to ensure the host does not run out of memory. If you monkey around with the sysctl settings, you're asking for trouble. In addition, apply best practices such as setting the maximum RAM for your VMs to ensure the memory is reserved when the system starts up.

ARC is designed to be left alone, so unless it's necessary, don't poke the bear :p
 
ECC is always preferred, but non-ECC RAM will work in most cases.
Non-ECC RAM will always work just fine. You can put it in a very simple way: ZFS does a lot of things to ensure data integrity. In a nutshell, the only "loophole" left open is corruption happening in RAM. ECC RAM will protect against this. But even without ECC RAM, ZFS offers better protection than most other filesystems.
For the most part you do not want to mess with the settings, but if you want to specifically limit it, see https://www.freebsd.org/doc/handbook/zfs-advanced.html
On a system with only 8GB of RAM or less, at least if it is to be used interactively (e.g. as a desktop), I'd recommend limiting the ARC size. In my experience, the system becomes unresponsive over time (doing a lot of swapping) if you don't. E.g. on my 8GB desktop, I limit the ARC size to 3GB:
Code:
vfs.zfs.arc_max=3221225472
 
On a system with only 8GB of RAM or less, at least if it is to be used interactively (e.g. as a desktop), I'd recommend limiting the ARC size. In my experience, the system becomes unresponsive over time (doing a lot of swapping) if you don't. E.g. on my 8GB desktop, I limit the ARC size to 3GB:
For sure, that makes sense, especially considering a laptop is generally not going to have a giant multi-vdev pool on it that would benefit from sucking all your RAM up into the ARC. So limiting it would be a great solution in that case.
 
One issue is RAM use for a DB (any, really) running on ZFS. Both will do a lot of caching. Isn't some of that cache redundant?
 
Note 1: I checked the blocksize of my bhyve ZFS volume. The volblocksize is 8K. I don't know what is special about 8K, i.e. why not 4K or 16K; I think there is a reason.
Note 2: Currently I use a conservative 2.5GB max ARC size on an 8GB PC. The actual ARC size in use drops down to 500MB when I only surf the internet.
 
I wonder that myself. Some people used that for a SYSVOL for samba ad-dc, because of some problems with ACLs on ZFS, but I think that's solved (at least my SYSVOL works fine directly on ZFS).

Sometimes you have a "special" UFS file system. Let's say you create a 100GB zvol, then mount it with bhyve and install Windows 10 on it. During the install process you select NTFS as normal; as far as the Windows volume is concerned, it's a normal NTFS drive.

On the host, you can snapshot it, zfs send it to another server or site, and when the Windows VM gets hit with ransomware you simply roll back and reboot the VM for instant recovery.
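
A rough sketch of that recovery flow (all pool, volume, and host names here are made up):

Code:
# snapshot the zvol backing the VM
zfs snapshot tank/vm/win10@nightly
# replicate it off-site
zfs send tank/vm/win10@nightly | ssh offsite zfs receive backup/vm/win10
# after a ransomware hit: stop the VM, roll back, boot it again
zfs rollback tank/vm/win10@nightly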
 
Note 1: I checked the blocksize of my bhyve ZFS volume. The volblocksize is 8K. I don't know what is special about 8K, i.e. why not 4K or 16K; I think there is a reason.

This is totally a DBA question, although I understand the basics.

In short, you don't want to have a database reading/writing 8k blocks on top of a 4k file system, or vice versa.

They should match. The problem with a default set at 4k is that the entire pool defaults to 4k. To get around that, a zvol allows you to create a segment of blocks that can be mounted as a device or dataset with a different block size.

Zvols get around that problem by allowing you to have different blocksizes on the same pool.
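
For regular datasets (as opposed to zvols) the analogous knob is the per-dataset recordsize property, which can also be tuned per workload without touching the rest of the pool. The dataset names below are just examples; 8K pages for PostgreSQL and 16K for InnoDB are the commonly cited defaults, but check your database's documentation:

Code:
# a dataset tuned for a PostgreSQL data directory (8K pages)
zfs create -o recordsize=8K tank/pgdata
# a dataset for large sequential files such as torrents
zfs create -o recordsize=1M tank/torrents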
 
Note 2: Currently I use a conservative 2.5GB max ARC size on an 8GB PC. The actual ARC size in use drops down to 500MB when I only surf the internet.

On a single vdev you're not really going to notice any difference, especially on an SSD. When you have multiple rusty vdevs, then it really helps.

The only real issue with limiting ARC on a single vdev is that, like you pointed out, your cache size is variable. If you, say, limit it to 3GB, then it's always going to be capped at 3GB.

That's the double-edged sword.

Personally, on my laptop I have it set to 1GB on a 480GB SSD. As you can see from the quotes, my server with 12 rust buckets and 100TB has 100 gigs of RAM dedicated to ARC; performance-wise it helps a lot. The other 148 gigs of RAM on it are allotted to VMs and system RAM.
 
Sometimes you have a "special" UFS file system. Let's say you create a 100GB zvol, then mount it with bhyve and install Windows 10 on it. During the install process you select NTFS as normal; as far as the Windows volume is concerned, it's a normal NTFS drive.
Sure, but that's NTFS, not UFS. I'd say the use cases for UFS on a ZFS vdev are very limited.
 
Q: I have 12x8TB drives in a pool.. should I not have 96TB?

Actually you will get 96TB. That is 96,000,000,000,000 bytes, since disks as sold normally have their size measured in powers of 1000 and not 1024 (like RAM). However, the ZFS tools print sizes using base 1024, just to confuse things. (Sigh).

> % zpool list
> NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ  FRAG    CAP  DEDUP  HEALTH  ALTROOT
> DATA  9.06T  59.0G  9.00T        -         -    0%     0%  1.00x  ONLINE  -
> % dc
> 2 k
> 10 1000 * 1000 * 1000 * 1000 * 1024 / 1024 / 1024 / 1024 / p
> 9.09

DATA is a HGST 10TB drive.
 
However, the ZFS tools print sizes using base 1024, just to confuse things. (Sigh).
All the others confused things for the purpose of getting higher numbers for the marketing departments. I prefer the technically correct approach. But I agree that it is unlucky to have different ways of speaking.
 
All the others confused things for the purpose of getting higher numbers for the marketing departments. I prefer the technically correct approach. But I agree that it is unlucky to have different ways of speaking.

Well, disks have always (well, at least since the early '80s; before that I don't remember :) been measured in SI units (1000) and not in 1024-based units.

Case in point, from one of my all-time favourite drives, the venerable Fujitsu M2351 "Eagle" (473MB in a 6U 19" rack box) that sounds like a vacuum cleaner when you power it up. I actually still have a couple of them sitting here gathering dust :)


474.21MB: 28160 bytes/track, 20 heads, and 842 tracks = 28160*20*842 = 474,214,400 bytes. Usable capacity is a bit less since that's the unformatted capacity (formatting overhead is about 65 bytes per 512-byte sector, or about 100 bytes per 4096-byte sector on modern 4K drives).

(Or 452 MiB:s :)

1.8MB/s (also 1000-based :) transfer rate. Wrooom. Btw network transfer rates are also 1000-based.

Ah well, back to the real topic - the ZFS FAQ :)
 
All the others confused things for the purpose of getting higher numbers for the marketing departments.
It has to do with certifications, not marketing. If you don't use SI units your device simply doesn't get certified and you're not allowed to sell it.
 
Is zfs usefull on only one drive ?

Good thread at the iX forum regarding that.

Circling back to this, I notice the iX forum posts are 2015. Is this info still valid about single drives given updates to ZFS?

I really like the convenience that the FreeBSD installer includes 1-click geli encryption when using ZFS whereas I think for UFS it's still done manually.

But the message I'm seeing is that running ZFS on an older laptop's (for me, T430s and T61s) internal drive is at best pointless and at worst detrimental. Can anyone shed light?
 
Circling back to this, I notice the iX forum posts are 2015. Is this info still valid about single drives given updates to ZFS?

I really like the convenience that the FreeBSD installer includes 1-click geli encryption when using ZFS whereas I think for UFS it's still done manually.

But the message I'm seeing is that running ZFS on an older laptop's (for me, T430s and T61s) internal drive is at best pointless and at worst detrimental. Can anyone shed light?

I use ZFS on single drives all the time (well, if I can I use mirrors, but where that isn't an option I use single disks). Sure, you won't get any redundancy (you can use "zfs set copies=2" (or 3) to get multiple copies of user data, but that still doesn't help with complete disk failures), but you'll get the other benefits: strong checksums of stored data, ACLs, snapshots/clones, "easy" filesystems (no need to repartition disks), boot environments, etc.
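
For reference, that property is set per dataset; the dataset name below is just an example, and it only applies to data written after the property is changed:

Code:
# keep two copies of every block of user data, even on a single disk
zfs set copies=2 tank/home
zfs get copies tank/home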

Currently running Ubuntu 20 on an old Intel NUC with a single internal SSD with ZFS, for example. Sure, there is a little bit more overhead compared to UFS, but really, it's not noticeable for normal users.
 
Circling back to this, I notice the iX forum posts are 2015. Is this info still valid about single drives given updates to ZFS?
I don't think anything major has changed since that 2015 thread.
This crucial question went unanswered in that thread:
"If checksumming is enabled, what happens if it detects an error? What do you do?"
 
What happens if it detects an error? You know that you have lost data. I think that's better than the data being silently corrupted, and you getting wrong data. It is also better than your file system crashing or acting weird, if the error was in metadata.

What do you do? Depends on the situation. I think in most cases you restore from backup. If you don't have a backup, then you have made a deliberate choice to tolerate data loss, and live with it. Or use "informal" backups; for example you might have e-mailed an earlier copy of the same file to a friend, so you ask your friend to send a copy back to you.
 
I use ZFS on my desktop for all data I can lose. :) In my case on /usr/ports & /usr/src.
On a single disk a backup is advised, as there is no fsck, and if you run into a ZFS bug or metadata corruption you have problems.
 