bhyve - How to set up storage for a bhyve hypervisor? ZFS or UFS for the host? For the guest?

Hello,

I am considering migrating a few VMs to a new host and would love to make it a FreeBSD one.

I am not sure however of the best way to set up storage. From the Handbook I read that using ZFS on the host and ZFS volumes for the guests would be optimal. But a blog post from vermaden tells me simple files offer better performance.

ZFS for the host seems to be the easiest way to set up RAID. But to be honest I have trouble understanding what actually happens if I use ZFS on both sides. Which machine does what? Does the guest benefit from the RAID that ZFS on the host provides over its drives?

Or should I just use UFS for guests to make them lighter, if they already benefit from the host's ZFS layer underneath anyway?

Could you please help me understand my options here and their consequences?
 
ZFS is a huge advantage regarding the safety and consistency of your data. Furthermore, ZFS snapshots are awesome when you have to upgrade a VM - using them is easy and straightforward. I suggest using ZFS on the host and UFS in the guest. The host system then takes care of the ZFS cache. If you have ZFS in each VM it will probably eat quite a bit more RAM, because every VM then does caching for itself. While you can also run ZFS inside the VMs, be aware that this dramatically increases your write load (~4x) and thus decreases responsiveness/increases latency - read up on copy-on-write filesystems if you want to know what's behind this.
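For illustration, a host-side snapshot before a guest upgrade (and a rollback if it goes wrong) could look roughly like this; the dataset name zroot/vms/vm1 is just a made-up example:

Code:
# snapshot the VM's backing dataset on the host before upgrading the guest
zfs snapshot zroot/vms/vm1@pre-upgrade
# if the upgrade goes wrong, shut the guest down and roll back
zfs rollback zroot/vms/vm1@pre-upgrade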
 
You have most of these choices: (host: zfs-volume, zfs-file, zfs-filesystem, ufs, RAID, rawdisk) x (guest: ufs, zfs, nfs) each with its own pros and cons. zfs-file is where you use a single file on the host zfs. zfs-filesystem is really a single file on a zfs filesystem (so that you can snapshot it as zfs doesn't allow snapshot of a single file). Obviously nfs on guest can't be used with RAID or rawdisk. (zfs-filesystem,zfs) is the most flexible & featureful but not the most performant. If you are willing to snapshot on the host, use (zfs-filesystem, ufs). zfs-volume may be faster than what I call zfs-filesystem but AFAIK snapshots on zfs volumes are more restricted. With zfs-filesystem, you can even mount snapshots as read-only filesystems from the same or other guests (theoretically -- I haven't actually tried this). Almost as good as plan9's cached WORM filesystem but faster & clunkier!
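To make that last point concrete, here is a sketch assuming a per-VM dataset zroot/vms/vm1 holding a single disk0.img and a snapshot named 2024-07-17 (all names made up):

Code:
# the snapshot's copy of the image is reachable read-only via the .zfs directory
ls -l /zroot/vms/vm1/.zfs/snapshot/2024-07-17/disk0.img
# or clone the snapshot into a writable dataset for a second, experimental guest
zfs clone zroot/vms/vm1@2024-07-17 zroot/vms/vm1-test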
 
We recently updated the handbook on zfs use with bhyve in the virtualization chapter. If you're using zfs for the host as well as for the guest, you should limit caching on the host to metadata only:

Code:
zfs set primarycache=metadata <name>

In theory, zvols should be faster, though YMMV. I too am using file based backing storage on zfs and have the impression it is delivering more stable I/O performance. I have not measured it, so it may just be subjective opinion...
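For comparison, the zvol-backed alternative mentioned above would be created along these lines (dataset name and size are made-up examples):

Code:
# create a 40 GB volume to use as the guest's virtual disk
zfs create -V 40G zroot/vms/vm1-disk0
# the block device then appears under /dev/zvol/ and can be handed to bhyve
ls -l /dev/zvol/zroot/vms/vm1-disk0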
 
Thanks a lot for your answers!

We recently updated the handbook on zfs use with bhyve in the virtualization chapter. If you're using zfs for the host as well as for the guest, you should limit caching on the host to metadata only:

Code:
zfs set primarycache=metadata <name>

In theory, zvols should be faster, though YMMV. I too am using file based backing storage on zfs and have the impression it is delivering more stable I/O performance. I have not measured it, so it may just be subjective opinion...

Yes, I read that, thank you. But after reading about zvols and snapshots (meaning snapshots are limited by the size of the zvol, if I understand correctly), I think doing what bakul recommends suits me better:

You have most of these choices: (host: zfs-volume, zfs-file, zfs-filesystem, ufs, RAID, rawdisk) x (guest: ufs, zfs, nfs) each with its own pros and cons. zfs-file is where you use a single file on the host zfs. zfs-filesystem is really a single file on a zfs filesystem (so that you can snapshot it as zfs doesn't allow snapshot of a single file). Obviously nfs on guest can't be used with RAID or rawdisk. (zfs-filesystem,zfs) is the most flexible & featureful but not the most performant. If you are willing to snapshot on the host, use (zfs-filesystem, ufs). zfs-volume may be faster than what I call zfs-filesystem but AFAIK snapshots on zfs volumes are more restricted. With zfs-filesystem, you can even mount snapshots as read-only filesystems from the same or other guests (theoretically -- I haven't actually tried this). Almost as good as plan9's cached WORM filesystem but faster & clunkier!

So what you mean by zfs-filesystem is:
  1. create a ZFS filesystem dataset for each VM
  2. following cmoerz's and the manual's advice, set primarycache to metadata for that dataset to avoid the performance drag
  3. inside it, create the disk image file (.img) - roughly the commands sketched below
  4. use ZFS or not in my guest VMs depending on whether I need ZFS features (e.g. snapshots) there
Is that right?

Is that #2 step still needed and appropriate if I don't use ZFS inside the guest (e.g. a UFS or Linux guest)?
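The commands I have in mind for steps 1-3 look roughly like this (pool name, VM name, image size and the bhyve slot are all just examples):

Code:
# 1. one dataset per VM
zfs create zroot/vms/vm1
# 2. cache only metadata on the host for this dataset
zfs set primarycache=metadata zroot/vms/vm1
# 3. a sparse disk image inside it
truncate -s 40G /zroot/vms/vm1/disk0.img
# the image is later handed to bhyve, e.g. as a virtio-blk device:
#   bhyve ... -s 4,virtio-blk,/zroot/vms/vm1/disk0.img ...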
 
Is that right?
Right! Though note that I am unclear on the effect of #2 (caching only the metadata) as I haven't used it nor thought much about it. [Edit: as always, you should do your own benchmarking. Never blindly believe random people on the Internet like me!]
 
I am not sure however of the best way to set up storage. From the Handbook I read that using ZFS on the host and ZFS volumes for the guests would be optimal. But a blog post from vermaden tells me simple files offer better performance.
Do you plan to enter this machine in Formula 1 races? If not - then it mostly does not matter - use something that fits your workflow and what seems more natural to you. When I ran those benchmarks the 'flat' files were faster - but that may have been fixed and it's the same now (or even faster). The same goes for UFS inside VMs - for example I use ZFS on the host and ZFS inside the machines, as that gives me more flexibility - including ZFS Boot Environments inside the VMs that I can send between these VMs if needed.
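As a rough sketch of that last bit, sending a boot environment from one guest to another could look something like this (the BE name, dataset layout and the vm2 host name are made-up examples; both guests run ZFS):

Code:
# inside guest vm1: snapshot the current boot environment's dataset
zfs snapshot zroot/ROOT/default@copy
# stream it to another guest over ssh and receive it there as a new BE
zfs send zroot/ROOT/default@copy | ssh vm2 zfs receive zroot/ROOT/from-vm1
# on vm2 the received BE should then show up in, and can be activated with, bectl(8)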
 
We recently updated the handbook on zfs use with bhyve in the virtualization chapter. If you're using zfs for the host as well as for the guest, you should limit caching on the host to metadata only:

Code:
zfs set primarycache=metadata <name>

In theory, zvols should be faster, though YMMV. I too am using file based backing storage on zfs and have the impression it is delivering more stable I/O performance. I have not measured it, so it may just be subjective opinion...
Hey I was just reading the Handbook, specifically that part:

"If you are using ZFS for the host as well as inside a guest, keep in mind the competing memory pressure of both systems caching the virtual machine’s contents. To alleviate this, consider setting the host’s ZFS filesystems to use metadata-only cache."

In my case I use zvols for VMs; most of my VMs are Ubuntu with ext4 filesystems, so non-ZFS.

The above explanation doesn't really make sense, or rather it doesn't really mean anything, to me anyway. "Pressure of both systems caching" - I guess 'pressure' would be the pressure on the I/O scheduler to put the cache in the queue. I'm not saying I'm not grateful to whoever wrote this - I am, and if I were in that situation I would do as advised and wouldn't worry about what it means, but I'm not. So the advice says to consider caching metadata only on the host, while inside the VM both metadata and data will be cached. Why is that? Why not completely turn off the cache on the host?

If you know the answers to any of these questions, please do elaborate.

A while back I heard people say not to use ZFS in VMs because of the write overhead. OK, so I don't. But in the advice above we keep the metadata cache - doesn't that create overhead too? I'm not worried about how much RAM there is. What worries me is that a while back I also read that the host's ZFS doesn't really know what's in the VM's RAM - would that be relevant to my case? I know the host's ZFS knows what's on the zvol once it has been written to it.

I mean, I could test these things - I might soon - but it would be nice to understand what ZFS knows from inside a VM with an ext4 filesystem, for example, and whether disabling caching would help. I'm very limited in my knowledge of what ZFS actually does, and this Handbook part really explains nothing. It just says what to do and that there could be a double cache for the same files (pressure), but doesn't say why that's a bad thing. If the cache is done both inside the VM and on the host, RAM is wasted; I don't care much about that, I do care about latency.

Yeah, this is very complicated for me. I'm reading FreeBSD Mastery: Advanced ZFS and I'm on the 'Performance' chapter. Still gotta finish the book so that I can understand it better.
 
ZFS has its own cache in addition to the filesystem buffer cache, and de-duplicating between the two doesn't work for some operations, such as writes through mmap(2). So if you have both host and guest on ZFS you can end up with each cached filesystem location sitting in RAM four times.
 
Hey I was just reading the Handbook, specifically that part:

"If you are using ZFS for the host as well as inside a guest, keep in mind the competing memory pressure of both systems caching the virtual machine’s contents. To alleviate this, consider setting the host’s ZFS filesystems to use metadata-only cache."

In my case I use zvols for VMs; most of my VMs are Ubuntu with ext4 filesystems, so non-ZFS.
Then this is of not so much concern, as there is only one ZFS in the game.

So the advice says to consider caching metadata only on the host, while inside the VM both metadata and data will be cached. Why is that? Why not completely turn off the cache on the host?
Turning off the cache in ZFS is not a good idea. If you do that and then read 512 bytes from a file, ZFS will read the corresponding 128k record, verify its checksum, deliver your 512 bytes and then throw away the record because caching is disabled. If you then read the next adjacent 512 bytes, it will again read that 128k record, verify the checksum, deliver your 512 bytes - and so on, you get the point. Not funny.

Restricting the ZFS cache to metadata makes sense where the consumer has their own cache, so that this pathological pattern will not happen.

But the really interesting thing here is if you create a ZFS volume as the virtual disk for your guest, and then inside the guest again create a zpool on that volume. Because then there are questions: which side should do the caching? Which side should do the compression? (that depends very much on the block sizes in use) Which side should do dedup? (obviously the outside, as it doesn't make any sense on the inside) And then also, does dedup even work properly if you disable caching?

So there are many delicate questions, and no, I don't have the answers. But somebody who is really bored could spend quite a while benchmarking...
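If somebody wants a starting point, something along these lines could be used to compare primarycache settings on the dataset backing a guest's disk image (benchmarks/fio from ports; dataset and path are made up, and this only exercises the host side):

Code:
# 4k random reads against the VM's dataset, once per cache setting
zfs set primarycache=all zroot/vms/vm1
fio --name=randread --directory=/zroot/vms/vm1 --rw=randread --bs=4k \
    --size=2g --runtime=60 --time_based --ioengine=posixaio
zfs set primarycache=metadata zroot/vms/vm1
fio --name=randread --directory=/zroot/vms/vm1 --rw=randread --bs=4k \
    --size=2g --runtime=60 --time_based --ioengine=posixaio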


It just says what to do and that there could be a double cache for the same files (pressure), but doesn't say why that's a bad thing.
It's a waste of resources - daisy-chaining two caches with exactly the same characteristics makes no sense.
 
"If you are using ZFS for the host as well as inside a guest, keep in mind the competing memory pressure of both systems caching the virtual machine’s contents. To alleviate this, consider setting the host’s ZFS filesystems to use metadata-only cache."
Anyone know why the recommendation isn't the exact opposite: disable the cache in the jail altogether? Aside from having to do it multiple times, I'd say it allows for more flexibility. Possibly even a shared cache in the case of null-mounts? It means the jail is a lot less memory-hungry and more predictable, no?
I have never tested these assumptions, but I guess they should hold.
edit: way too late to ask this; I confused VMs & jails
 
Anyone know why the recommendation isn't the exact opposite: disable the cache in the jail altogether?
Sure - shorter path. The cache in the guest is accessed directly. A cache hit on the host has to go through the ARC in the guest, through the virtual disk driver, through the bhyve emulator, and only then reaches the ARC on the host.

BTW, it's not about jails, it's about bhyve guests. Jails use the filesystem layer of the host anyway.

But then again, there is no universal truth. One might say: we have a dozen identical guests and significant dedup, so why cache the same data a dozen times in the guests instead of only once on the host?
 
Then this is of not so much concern, as there is only one ZFS in the game.


So OK, thanks - I don't need to do anything regarding this. I would still like to do some kind of tuning; maybe I'll have time to do all the benchmarking at some point.

They should have made season 3 of Altered Carbon.
 
So I did some beginner testing, which might not mean much.
I used CrystalDiskMark inside a Windows VM that sits on an SSD datastore.

This is not a zvol; it's a .img file, because it's Windows. I haven't tried Linux yet.

As I interpret this, if caching all is enabled the VM thinks its disk is very fast, as it shows 1400 MB/s.
Of course that doesn't happen on the underlying storage: zpool iostat shows a constant ~25 MB/s every second, plus about 150-250 MB/s for one second roughly every 20 seconds, then it drops back to 25 MB/s.
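For reference, I was watching the pool with something like this (the pool name is just an example):

Code:
# per-vdev I/O statistics, refreshed every second
zpool iostat -v ssdpool 1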

I also tried to disable Windows disk caching but it wouldn't let me, so I did something that isn't intuitive - it looks like the options are mutually exclusive, or opposites.
I saw people talking about iSCSI and disabling the disk cache on Windows here ... https://github.com/openzfs/zfs/issues/7897 .. so I thought I'd try it and see if something happens. It did, but just a bit.

So here are the screenshots:


Caching metadata only. The VM thinks it's 200 MB/s fast. (screenshot)



Caching all. The VM thinks it's 1500 MB/s fast. (screenshot)



Caching all, with Windows write-cache buffer flushing to the device turned off. (screenshot)
- This one was a bit different. Apart from it being 10 better than the last one, the VM didn't freeze during the testing and the test only took about a minute, whereas the other ones took over 5 minutes. I didn't time it exactly, but I had to wait a long time.




This setting wouldn't disable, maybe because the Windows Pro install isn't registered. (screenshot)



I know these aren't DTrace tests. I'll do those when I learn how.

And I don't know what these tests actually mean; I can only interpret them as best I can.

I can say with certainty that the last option, with ZFS caching everything and Windows buffer flushing disabled, makes the VM feel more responsive, apart from it not freezing during the tests. The RDP connection got disconnected so I had to switch over to VNC, but even VNC feels more responsive now.

Edit:
Yeah this Windows VM is basically flying after turning off the buffer flushing.
 
Hey I was just reading the Handbook, specifically that part:
The Handbook is mainly intended as a source of practical guidance for users; you won't find deep technical explanations there. The same holds for the two specialised ZFS books, though they hint at some of the technical underpinnings of ZFS.

With specialised usage such as bhyve VMs backed by ZFS you quickly enter the area of bhyve and ZFS tuning. Both are complex, with lots of tuning knobs. You would likely need a lot more technical guidance & knowledge of where to do what and, last but not least, a lot of performance testing to verify the best tuning parameters for a specific type of load.

As it seems, you're looking for more technical information. For ZFS (and a small bit in connection with bhyve), I suggest you have a look at:
  1. The Design and Implementation of the FreeBSD Operating System, 2nd Edition
    - by Marshall Kirk McKusick, George V. Neville-Neil, Robert N.M. Watson
    The ZFS chapter is probably the sole detailed description of ZFS' internal structures and inner workings in one place, informed by the source: Marshall Kirk McKusick consulted Matthew Ahrens.
  2. ZFS Internals Overview by Kirk McKusick - OpenZFS Developer Summit 2015 - slides
  3. ELI5: ZFS Caching by Allan Jude - FOSDEM 2019 - video and slides.
"If you are using ZFS for the host as well as inside a guest, keep in mind the competing memory pressure of both systems caching the virtual machine’s contents. To alleviate this, consider setting the host’s ZFS filesystems to use metadata-only cache."

In my case I use zvols for VMs; most of my VMs are Ubuntu with ext4 filesystems, so non-ZFS.
Not exactly the structure as described in the Handbook, but you have a stack like:
  1. guest layer -> Linux ext4 filesystems on ZVOLs (instead of ZFS)
  2. VM sub-layer
  3. ZFS - host
That still leaves two competing caching systems: ZFS versus multiple ext4 guests. That warrants much the same considerations as when you have, for example, a database's cache competing with the ZFS ARC. Initially the thinking was that a DB knows best how to cache its internal data, therefore giving much latitude to the DB cache and minimising the ARC. With the introduction of the compressed ARC that view changed, as Allan Jude discusses; he also (briefly) discusses caching in the context of bhyve VMs (slide #19 of the linked talk).

In the situation without VMs, you "just" have the ZFS ARC, which takes all the RAM it can lay its hands on but gives in quickly* to every other application's request for memory. There you have a sort of indirect communication: a program requests more memory and the ARC frees it. That is different in a VM setting with competing caching systems (host versus guest), where you have the "VM sub-layer" in between.

It just says what to do and that there could be a double cache for the same files (pressure), but doesn't say why that's a bad thing. If the cache is done both inside the VM and on the host, RAM is wasted; I don't care much about that, I do care about latency.
Latency is (very) much correlated with cache memory efficiency, so I think you should care about not wasting memory resources. BTW, unless you have a severely underutilised server load, not considering memory a scarce resource would be a first.

The Design and Implementation of the FreeBSD Operating System, 2nd Edition - 6.10 The Pager Interface
Historically, the BSD kernel had separate caches for the filesystems and the virtual memory. FreeBSD has eliminated the filesystem buffer cache by replacing it with the virtual-memory cache. [...]

The ZFS filesystem integrated from OpenSolaris is the one exception to the integrated buffer cache. ZFS has its own set of memory that it manages by itself. Files that are mmap’ed from ZFS must be copied to the virtual-memory managed memory. In addition to requiring two copies of the file in memory, extra copying occurs every time an mmap’ed ZFS file is being accessed through the read and write interfaces. As detailed in Section 10.5, ZFS would require extensive restructuring to integrate its buffer cache into the virtual-memory infrastructure.

The Design and Implementation of the FreeBSD Operating System, 2nd Edition - 10.5 ZFS Design Tradeoffs
Integrating ZFS’s ARC into the unified-memory cache would require massive changes. The problem is easily seen in Figure 10.1. The unified-memory cache operates at the vnode interface level and the ARC operates at the physical block level. [...]
Having the dual caching structures of ZFS and the existing (pre-ZFS) virtual-memory cache of FreeBSD in parallel is a weak point, but, as stated, a design tradeoff. You can see these pictured in the Kernel I/O structure figure as discussed in Gunion(8): a new GEOM utility in the FreeBSD Kernel by Marshall Kirk McKusick - BSDCan 2023. In a VM setting this dual-caching problem is multiplied, as cracauer stated.
___
* When you dive deeper into the inner workings of the ZFS ARC: there are situations where the freeing of memory by the ARC does not happen quickly enough and restriction of the size of the ARC is called for.
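A common way to do that on a FreeBSD host is a loader tunable; the value below is only an example and has to be sized for the actual machine and workload:

Code:
# /boot/loader.conf
# cap the host's ARC at 8 GB (8 * 1024^3 bytes), example value only
vfs.zfs.arc_max="8589934592"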



Edit: If you are interested, I suggest you also have a look at the freebsd-virtualization mailing list. Fairly recent: bhyve disk performance issue
 
Hey, thanks - it's really nice that you took the time to reply in detail, explaining concepts and suggesting books. It's much clearer now. I appreciate it.

I'm not going to reply to anything; there's not much I could add anyway until I read and understand the books you mentioned.

In the short term I do have a DB to think about on one of those VMs, but in the long term I hope to read those.

ZFS is for many personal reasons exciting and I respect it, probably because I respect quality among other things.
 