UPDATE: I have enabled SMP support for the FreeBSD guest OS, virtualized with qemu-bhyve, and I've run some benchmarks.
System:
CPU : Intel Core i9-9900K (8C/16T, base 3.6 GHz)
Host OS : FreeBSD 16.0-CURRENT (amd64)
Guest OS: FreeBSD 15.0-RELEASE (amd64)
VM cfg : QEMU + bhyve accelerator, 8 vCPU, 4 GB RAM, virtio-blk disk
Date : 2026-06-09
==============================================================================
ANALYSIS AND NOTES
==============================================================================
1. CPU INTEGER (+8%)
The overhead is minimal because integer instructions execute directly on
hardware without hypervisor interception. VT-x/EPT does not introduce
overhead for normal user/kernel instructions. The slight slowdown comes
from additional context switching by the host scheduler, which multiplexes
guest vCPUs onto physical hardware threads.
2. SYSCALL getpid() (~4x)
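Under VT-x the SYSCALL/SYSRET instructions do not by themselves cause a VM
exit: the guest kernel services getpid() directly, without entering the
hypervisor. A mandatory VM exit per syscall therefore cannot be the
explanation for the slowdown. For scale, one VM exit/entry round-trip costs
approximately 100-200 ns on modern hardware (exit + VMRESUME, including
context save/restore), so exits that do occur during the measurement loop
(timer interrupts, IPIs) are visible at this granularity.
On the native host, getpid() is one of the cheapest syscalls (~46 ns).
FreeBSD does not serve getpid() from a vDSO-style shared page, so both host
and guest measure a real kernel entry.
Result: 46 ns -> 182 ns (+136 ns per call). This test alone does not isolate
where the extra time is spent.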
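The arithmetic behind the getpid() result can be re-checked with a short
Python sketch. The function name slowdown() is made up for illustration; the
two input values are the ones reported above.

```python
def slowdown(native, guest):
    """Return (ratio, absolute difference) between a guest and a native timing."""
    return guest / native, guest - native


# getpid() latencies in nanoseconds, as reported in this section
ratio, extra_ns = slowdown(46, 182)
print(f"{ratio:.2f}x slower, {extra_ns} ns extra per call")
# prints: 3.96x slower, 136 ns extra per call
```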
3. FORK()+WAIT (~66x)
The most striking result. Fork inside a VM is expensive because:
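a) The fork() syscall must copy the process page tables and mark all pages
as Copy-on-Write, which requires TLB invalidation. With EPT, guest-local
invalidations (INVLPG, CR3 reloads) run without a VM exit; INVEPT/INVVPID
are issued by the hypervisor, not by the guest. On an SMP guest, however,
a TLB shootdown can require inter-processor interrupts to other vCPUs, and
sending an IPI from a guest typically costs a VM exit unless the hardware
virtualizes the APIC.
b) The newly created child process must be scheduled on a guest vCPU by the
guest scheduler, and that vCPU must in turn be scheduled by the host
scheduler. This two-level scheduling introduces significant latency.
c) Process creation inside the guest manipulates kernel structures (pmap,
vmspace) and can trigger EPT violations (the nested-paging counterpart of
a page fault) when the guest-physical pages involved are not yet mapped
by the host.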
87 us (native) -> 5728 us (VM) reflects the real cost of virtualization
for process-intensive workloads.
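A fork()+wait measurement of this kind can be sketched in Python as follows.
This is not the benchmark used for the numbers above: the function name and
iteration count are assumptions, and the Python interpreter adds its own
overhead, so absolute values will be higher than those of a C test. It is
only meant for comparing host and guest with the same script.

```python
import os
import time


def fork_wait_latency_us(iterations):
    """Average wall-clock time of one fork() + waitpid() pair, in microseconds."""
    start = time.perf_counter()
    for _ in range(iterations):
        pid = os.fork()
        if pid == 0:
            # Child: exit immediately, skipping Python-level cleanup.
            os._exit(0)
        # Parent: block until the child has exited.
        os.waitpid(pid, 0)
    elapsed = time.perf_counter() - start
    return elapsed / iterations * 1e6


if __name__ == "__main__":
    # 1000 iterations is an arbitrary choice for illustration.
    print(f"fork+wait: {fork_wait_latency_us(1000):.1f} us per iteration")
```

Running the same script on host and guest and dividing the two results gives
a ratio comparable in spirit to the ~66x reported here.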
4. MEMORY: dd /dev/zero (same)
Host and guest bandwidth are nearly equal because dd /dev/zero -> /dev/null
measures how fast the kernel fills memory buffers with zeros (memset speed).
256 MB far exceeds the L3 cache (16 MB on i9-9900K), so this measures real
DRAM bandwidth. The guest accesses its own physical RAM (actual DRAM mapped
by EPT) via the same hardware path as the host for sequential accesses.
EPT overhead for sequential access is negligible because each TLB miss is
amortized over a whole page of consecutive accesses, so nested page walks
are rare relative to the amount of data moved.
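The dd test can be approximated with the Python sketch below, which copies
from /dev/zero to /dev/null and reports throughput. The function name and the
1 MiB block size are assumptions, since the dd block size is not stated here.
Note that the read buffer is reused for every block, so with a block smaller
than the L3 cache the buffer itself can stay cache-resident.

```python
import time


def zero_copy_bandwidth(total_bytes, block_bytes):
    """Copy total_bytes from /dev/zero to /dev/null; return MB/s (1 MB = 10**6 bytes)."""
    buf = bytearray(block_bytes)
    view = memoryview(buf)
    copied = 0
    with open("/dev/zero", "rb", buffering=0) as src, \
         open("/dev/null", "wb", buffering=0) as dst:
        start = time.perf_counter()
        while copied < total_bytes:
            want = min(block_bytes, total_bytes - copied)
            got = src.readinto(view[:want])  # kernel zero-fills the buffer
            if not got:
                break
            dst.write(view[:got])            # discarded by /dev/null
            copied += got
        elapsed = time.perf_counter() - start
    return copied / elapsed / 1e6


if __name__ == "__main__":
    # 256 MiB total, matching the size mentioned above; block size is assumed.
    mbps = zero_copy_bandwidth(256 * 1024 * 1024, 1024 * 1024)
    print(f"/dev/zero -> /dev/null: {mbps:.0f} MB/s")
```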
5. MEMORY: sysbench (-36% write, -48% read, -41/-59% at 8 threads)
sysbench memory uses malloc()+memmove() in a tight loop with many
random-ish accesses. This stresses the TLB at two levels simultaneously:
- Guest page table (guest virtual -> guest physical)
- EPT (guest physical -> host physical)
A TLB miss in the guest requires a two-dimensional (nested) page walk that
can touch up to 24 memory addresses instead of the 4 required by a native
page walk: each of the 4 guest levels needs 4 EPT references to locate the
guest table plus 1 to read its entry (4 x 5 = 20), and the final
guest-physical address needs 4 more EPT references (20 + 4 = 24).
Read overhead is higher than write, plausibly because memmove reads before
writing, amplifying the miss penalty.
With 8 threads the percentage overhead is larger (-59% read), which suggests
that contention grows with the number of active vCPUs.
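The figure of 24 memory references follows from the structure of a nested
page walk and can be reproduced with a small Python sketch. The function
names are made up for illustration; the model assumes 4-level guest page
tables, 4-level EPT, and no help from paging-structure caches, which is the
worst case.

```python
def native_walk_refs(levels):
    """Memory references for a native page walk: one per page-table level."""
    return levels


def nested_walk_refs(guest_levels, ept_levels):
    """Worst-case memory references for a guest TLB miss under nested paging."""
    # Each guest level: translate the guest table's address through the EPT
    # (ept_levels references), then read the guest entry itself (1 reference).
    per_guest_level = ept_levels + 1
    # The final guest-physical address of the data needs one more EPT walk.
    final_translation = ept_levels
    return guest_levels * per_guest_level + final_translation


print(native_walk_refs(4))     # 4
print(nested_walk_refs(4, 4))  # 24
```

This is the same quantity as (guest_levels + 1) x (ept_levels + 1) - 1.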
6. DISK: write+fsync (~19x)
Large but expected overhead for virtual I/O:
- Each guest write() generates a virtio-blk request
- The host-side device model (QEMU in this configuration) processes the
request and writes to the image file
- Guest fsync() translates to fdatasync() on the host image file
- Each operation requires multiple VM exit/entry round-trips
The guest disk is a raw image file on a host UFS filesystem. On the native
host, fsync() goes through UFS straight to the NVMe controller. Inside the
guest the path is:
guest fsync -> virtio-blk -> QEMU -> host UFS -> NVMe.
413 MB/s (native) vs 22 MB/s (VM) shows the full cost of I/O layering.
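A write+fsync test of this shape can be sketched in Python as below. The
function name, block size, total size, and the choice to call fsync() after
every block are all assumptions made for illustration; the tool and
parameters behind the numbers above are not specified here.

```python
import os
import tempfile
import time


def write_fsync_throughput(path, total_bytes, block_bytes):
    """Write total_bytes in blocks, fsync() after each block; return MB/s."""
    block = b"\0" * block_bytes
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    written = 0
    try:
        start = time.perf_counter()
        while written < total_bytes:
            written += os.write(fd, block)
            os.fsync(fd)  # force the data to stable storage before continuing
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return written / elapsed / 1e6


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "tf")
        # 16 MiB in 1 MiB blocks: arbitrary values for a quick run.
        mbps = write_fsync_throughput(target, 16 * 1024 * 1024, 1024 * 1024)
        print(f"write+fsync: {mbps:.1f} MB/s")
```

Calling fsync() once at the end instead of per block would exercise far fewer
synchronous round-trips and is expected to narrow the host/guest gap.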
7. DISK: cached read (guest faster)
The guest reads /tmp/tf from its own page cache (guest RAM, which is already
host RAM). The host reads the same file through the host UFS page cache.
Both are fully cached, but the guest read path is shorter: guest VFS ->
guest page cache -> done, without traversing the virtio layer because the
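data is already in the buffer cache. This may explain the slightly higher
throughput on the guest side. Note also that the host runs 16.0-CURRENT,
whose default GENERIC kernel enables debugging options (INVARIANTS, WITNESS)
that a RELEASE kernel does not; if that default kernel is in use, it can
penalize the host side of the comparison.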
==============================================================================
CONCLUSIONS
==============================================================================
Workloads with negligible virtualization overhead:
- Pure integer compute: ~8% slower, essentially transparent
- Large sequential memory: same as native
Workloads with moderate overhead:
- Syscall-heavy code: ~4x (~136 ns extra per getpid() call)
- Memory bandwidth (alloc): -36% to -48% (EPT TLB miss penalty)
- Memory bandwidth (MT): -41% to -59%
Workloads with severe overhead (avoid in VM for performance-critical use):
- fork() / process creation: ~66x (TLB invalidation + double scheduling)
- Synchronous disk I/O (fsync): ~19x (virtio layering + VM exit per I/O)
Summary: QEMU+bhyve is nearly transparent for pure compute but shows
significant measured overhead for frequent syscalls, process creation,
synchronous disk I/O, and high-frequency memory allocation patterns. The
ideal workload for this configuration is compute-bound with sequential
memory access and asynchronous or in-RAM I/O.
==============================================================================