bhyve Booted Debian and FreeBSD 15 using qemu accelerated with bhyve/vmm for the first time. This is an epic milestone for the community!

Hello to everyone.

After 3 full days of work, we have booted Debian and FreeBSD with qemu accelerated with bhyve/vmm for the first time. This is an epic milestone.


Istantanea_2026-05-17_21-47-27.jpg



Istantanea_2026-05-17_21-48-50.jpg



Further development is needed, a lot of it, but this is still a historic moment: we can now use another hypervisor, this time in cooperation with the long-established and mature qemu. FreeBSD is second to none.

This success was made possible by the competence of @Abhinav Chavali, who started this project for GSoC 2025; thanks, bro.

If anyone wants to help with the development, tell me. We have shared the code on GitHub so the work can continue.
 
Is this the same as https://github.com/dumrich/qemu or built on top of it or something else?

It's built on top of dumrich's work. Specifically:

- QEMU side: We started from https://github.com/dumrich/qemu branch accel-vmm (his GSoC 2025 code)
and applied 8 patches on top — restructured the meson build, rewrote bhyve-all.c with proper VMX
segment descriptor conversion, added MMIO userspace fallback, fixed i8259/IOAPIC interrupt
delivery, etc.

- Kernel side: We started from dumrich's FreeBSD 16.0-CURRENT fork (with the vmm.ko QEMU support)
and applied 4 kernel patches — IOAPIC MMIO routed to userspace (so QEMU's own IOAPIC model handles
it), HLT returns to userspace, and debug printfs in vmx_inject_interrupts/vlapic_pending_intr
that accidentally fixed a timer race condition.

The original dumrich code could enter VMX and run SeaBIOS but would hang or crash before booting a
real OS. Our patches fix the critical bugs (NULL deref, ENAMETOOLONG, segment descriptor sync,
interrupt delivery) that blocked a full guest boot. With all patches applied, Debian 13 boots to a
login shell in ~3 seconds with -accel bhyve.

# QEMU bhyve Accelerator — SMP8 Fix Report

**Date**: May 25–26, 2026
**Platform**: FreeBSD 16.0-CURRENT, QEMU 9.x with bhyve accelerator
**Guest**: Debian 13 (trixie), kernel 6.12.86+deb13-amd64
**Hardware**: Intel Core i9-9900K (8C/16T)

---

## Executive Summary

Over a two-day debugging session, the QEMU bhyve accelerator was brought from
"SMP2 works, SMP8 stalls at disk I/O" to **full SMP8 boot to login prompt with
zero errors**. Eight distinct bugs were identified and fixed, all in QEMU
userspace — no kernel (vmm.ko) modifications were required for the final
solution. SMP1, SMP2, and SMP8 all boot Debian to the login prompt reliably.

---

## Bugs Fixed

### Day 1 (May 25): SMP2 stabilization and initial SMP8 work

#### 1. MSR cleanup — spurious #GP on AP boot
**Symptom**: Application Processors (APs) triple-faulted during SMP startup.
**Root cause**: QEMU's MSR emulation returned errors for benign MSR accesses
during AP initialization, causing #GP exceptions in the guest.
**Fix**: Cleaned up MSR handling in `bhyve-all.c` to silently ignore
platform-specific MSRs that the host doesn't support (e.g., architectural
performance counters, platform info).
**Backup**: `bhyve-all.c.bak-before-msr-cleanup` (08:56)

#### 2. INOUT loop — infinite re-execution of port I/O instructions
**Symptom**: Guest froze in tight loops executing the same IN/OUT instruction.
**Root cause**: After processing an INOUT VM exit, QEMU's `env->eip` was stale
(pre-exit RIP). When the "dirty" flag caused register writeback, the VMCS
guest RIP was overwritten with the stale value, re-executing the instruction.
**Fix**: Added RIP sync in `bhyve_vcpu_post_run()` — always read
`VM_REG_GUEST_RIP` from VMCS after `vm_run` to keep `env->eip` current.
Also synced RAX for IN instructions (port reads update RAX in VMCS but not
in QEMU's register file).
**Backup**: `bhyve-all.c.bak-before-inout-loop-fix` (09:47)
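
The stale-RIP sequence can be reproduced in a toy model. The Python sketch below only illustrates the failure mode described above; it is not the code in `bhyve-all.c`. The `Vmcs` and `Env` classes, the 2-byte instruction length, and the `sync` switch are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Vmcs:
    """Guest state as the kernel sees it (hypothetical stand-in)."""
    rip: int = 0
    rax: int = 0


@dataclass
class Env:
    """QEMU-side register file (hypothetical stand-in for env->eip etc.)."""
    eip: int = 0
    rax: int = 0
    dirty: bool = False


def post_run(env: Env, vmcs: Vmcs, sync: bool) -> None:
    """After vm_run returns: optionally refresh QEMU's copy from the VMCS."""
    if sync:
        env.eip = vmcs.rip
        env.rax = vmcs.rax


def pre_run(env: Env, vmcs: Vmcs) -> None:
    """Before re-entry: write QEMU's registers back if they are marked dirty."""
    if env.dirty:
        vmcs.rip = env.eip
        vmcs.rax = env.rax
        env.dirty = False


def run_one_out(sync: bool) -> int:
    """Simulate one OUT exit at 0x1000 and return the RIP used for re-entry."""
    vmcs = Vmcs(rip=0x1000)
    env = Env(eip=0x1000)
    vmcs.rip += 2              # exit handling steps past the 2-byte OUT
    post_run(env, vmcs, sync)
    env.dirty = True           # a later state change requests a writeback
    pre_run(env, vmcs)
    return vmcs.rip
```

Without the sync, `run_one_out(False)` re-enters at the old address and the guest executes the same OUT again; with it, re-entry happens at the next instruction.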

#### 3. SMP2 boot — INIT/SIPI delivery and AP startup sequence
**Symptom**: SMP2 guest detected 2 CPUs but only booted 1. APs never started.
**Root cause**: Multiple interacting issues:
- ICR writes (INIT/SIPI) were not intercepted from the MMIO APIC path
- AP vCPUs were not properly activated via `vm_activate_cpu()`
- The SIPI vector was not applied to the AP's CS:RIP (trampoline address)
- Stale IRR vectors from firmware caused trap 30 on AP first entry
**Fix**: Full APIC ICR write handler in the MMIO path: detects INIT (resets AP
state), SIPI (sets CS:RIP to trampoline), and Fixed/LowPri IPI delivery.
Added `scrub_lapic_bad_vectors()` to clear stale firmware vectors from IRR
before AP enters guest mode. Added `smp_rearm_mask` grace period to let IDT
handlers be installed before unmasking interrupts.
**Backup**: `bhyve-all.c.bak.smp2fix` (13:01)
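
For reference, the SIPI-to-trampoline mapping follows the standard x86 rule: an 8-bit SIPI vector `VV` starts the AP in real mode at physical address `VV000h` (CS selector `VV00h`, IP 0). The Python sketch below shows only that arithmetic; the function names are hypothetical and do not come from the patch.

```python
def sipi_start_state(vector: int) -> dict:
    """Real-mode entry state of an AP after a Startup IPI with this vector."""
    if not 0 <= vector <= 0xFF:
        raise ValueError("SIPI vector must fit in 8 bits")
    return {
        "cs_selector": vector << 8,   # selector VV00h
        "cs_base": vector << 12,      # segment base VV000h
        "rip": 0,                     # execution starts at offset 0
    }


def trampoline_address(vector: int) -> int:
    """Physical address of the first instruction the AP fetches."""
    state = sipi_start_state(vector)
    return state["cs_base"] + state["rip"]
```

For example, a SIPI with vector `0x9A` makes the AP start fetching at `0x9A000`.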

#### 4. BQL starvation — AP spin-wait locks out BSP
**Symptom**: With SMP2+, the BSP stalled for seconds at a time. Timer
interrupts were delayed, causing boot timeouts.
**Root cause**: APs in spin-wait loops (PAUSE exits) acquired BQL on every
`pre_run` call. At thousands of PAUSE exits/sec, the AP monopolized BQL,
starving the BSP which needed BQL for interrupt processing.
**Fix**: Added a fast-path check in `bhyve_vcpu_pre_run()`: if
`interrupt_request == 0` and no PIC/IOAPIC pending bits are set, return
immediately without taking BQL. This eliminated 99%+ of unnecessary BQL
acquisitions from AP spin loops.
**Backup**: `bhyve-all.c.bak.pre-smp8opt` (14:16)
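
The shape of the fast path can be sketched as follows. This is a simplified Python model, not the C code in `bhyve_vcpu_pre_run()`; the `Vcpu` fields, the per-vCPU pending masks, and the acquisition counter are assumptions introduced for the illustration.

```python
import threading
from dataclasses import dataclass

bql = threading.Lock()  # stand-in for QEMU's Big QEMU Lock


@dataclass
class Vcpu:
    interrupt_request: int = 0
    pic_pending: int = 0        # hypothetical pending bitmask
    ioapic_pending: int = 0     # hypothetical pending bitmask
    bql_acquisitions: int = 0   # instrumentation for the example only


def has_pending_work(vcpu: Vcpu) -> bool:
    """True if anything requires interrupt processing under the lock."""
    return bool(vcpu.interrupt_request or vcpu.pic_pending or vcpu.ioapic_pending)


def pre_run(vcpu: Vcpu) -> None:
    """Called before every guest entry."""
    if not has_pending_work(vcpu):
        return                  # fast path: nothing to do, never touch the lock
    with bql:
        vcpu.bql_acquisitions += 1
        # interrupt injection would happen here
        vcpu.interrupt_request = 0
        vcpu.pic_pending = 0
        vcpu.ioapic_pending = 0
```

In this model an AP spinning through a thousand PAUSE exits never takes the lock. The unlocked check is deliberately optimistic: a bit that is set just after the check is picked up on the next `pre_run` call.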

#### 5. PIC handler always targeting BSP — interrupt misrouting
**Symptom**: With SMP8, Linux assigned ATA IRQ 14 to CPU 6 via the IOAPIC RTE
(`vec=0x20, dest=6`). But interrupts were always delivered to CPU 0 (BSP).
**Root cause**: `bhyve_pic_set_irq()` always called
`bhyve_inject_lapic_irq(first_cpu, ...)` — hardcoded to BSP regardless of
IOAPIC routing.
**Fix**: Added IOAPIC destination lookup in the PIC handler. When the IOAPIC
RTE for a pin has been programmed, the PIC handler reads the destination
APIC ID from `bhyve_ioapic_destinations[]` and targets the correct CPU.
**Backup**: `bhyve-i8259.c.bak.pre-smp8-routing` (16:37)

#### 6. PIC pending starvation — IRQ 0 (timer) blocks IRQ 14 (ATA)
**Symptom**: ATA DMA completions were signaled (IRQ 14 asserted) but never
consumed by `pre_run`, causing "lost interrupt" timeouts.
**Root cause**: The `pre_run` PIC pending path used `__builtin_ctz()` to
pick the lowest-numbered pending IRQ, process it, and put the rest back.
IRQ 0 (PIT timer, pin 2) fires at high frequency and always wins over
IRQ 14 (ATA), effectively starving it forever.
**Fix**: Changed from single-IRQ processing to a `while(pending)` loop that
processes ALL pending PIC interrupts in one `pre_run` call. Also added
per-IRQ destination routing so each interrupt goes to the correct CPU.
**File**: `bhyve-all.c`, pre_run PIC pending path
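
The difference between the two strategies can be shown with a small Python model. It is a sketch under assumptions: the pending set is a plain integer bitmask, the destination cache is a dictionary keyed by IRQ number, and unrouted IRQs fall back to CPU 0. None of these names come from the patch.

```python
def lowest_pending(mask: int) -> int:
    """Index of the lowest set bit (what __builtin_ctz gives for mask != 0)."""
    return (mask & -mask).bit_length() - 1


def pick_one(pending: int) -> tuple[int, int]:
    """Old behaviour: take the lowest pending IRQ, hand back the rest.

    Requires pending != 0.
    """
    irq = lowest_pending(pending)
    return irq, pending & ~(1 << irq)


def drain_all(pending: int, destinations: dict[int, int], bsp: int = 0) -> list[tuple[int, int]]:
    """New behaviour: deliver every pending IRQ, each to its own target CPU."""
    delivered = []
    while pending:
        irq = lowest_pending(pending)
        pending &= pending - 1          # clear the bit just handled
        delivered.append((irq, destinations.get(irq, bsp)))
    return delivered
```

With IRQ 0 and IRQ 14 both pending (`0x4001`), `pick_one` services only IRQ 0; if the timer fires again before the next call, IRQ 14 loses again. `drain_all` delivers both in one pass, and IRQ 14 goes to the CPU recorded for it.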

---

### Day 2 (May 26 evening): Final SMP8 fixes

#### 7. vm_lapic_irq cross-CPU deadlock — interrupt injection blocks indefinitely
**Symptom**: Despite all routing fixes, ATA interrupts still occasionally
failed to reach the guest. Kernel diagnostic instrumentation showed
**zero** `VM_LAPIC_IRQ` ioctls received by the kernel, even though QEMU
returned success.
**Root cause**: `vm_lapic_irq()` uses the `VMMDEV_IOCTL_LOCK_ONE_VCPU`
dispatch flag, which calls `vcpu_lock_one()` in the kernel. This function
waits for the target vCPU to transition to `VCPU_IDLE` state, setting
`reqidle` to force it out of `vm_run`. When called from `pre_run` (which
holds BQL) targeting a non-self vCPU, this blocks the calling thread while
the target vCPU may need BQL to process its own VM exit — a potential
deadlock or indefinite stall.
**Fix**: Replaced **all** `vm_lapic_irq()` calls with `vm_lapic_msi()`.
The MSI path uses `ioctl(ctx->fd, VM_LAPIC_MSI, &vmmsi)` with **no vCPU
locking** (kernel dispatch flag = 0). The kernel's `lapic_intr_msi()` →
`vlapic_deliver_intr()` → `lapic_set_intr()` sets the IRR atomically and
calls `vcpu_notify_event()` to wake sleeping vCPUs. This is the same
mechanism used internally by the bhyve kernel for HPET and PCI passthrough.

The MSI address/data encoding:
```
MSI addr = 0xFEE00000 | (dest_apic_id << 12) // physical delivery
MSI data = vector & 0xFF // fixed mode, edge trigger
```
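
As a worked example of the encoding above, the following Python sketch computes the address and data words for a given destination and vector. The helper names are hypothetical; only the bit layout is taken from the formula.

```python
MSI_BASE = 0xFEE00000  # fixed upper bits of an x86 MSI address


def msi_address(dest_apic_id: int) -> int:
    """MSI address for physical delivery to one APIC ID (bits 19:12)."""
    if not 0 <= dest_apic_id <= 0xFF:
        raise ValueError("xAPIC destination ID must fit in 8 bits")
    return MSI_BASE | (dest_apic_id << 12)


def msi_data(vector: int) -> int:
    """MSI data word: fixed delivery mode, edge triggered, low 8 bits = vector."""
    return vector & 0xFF
```

For the ATA case from bug 5 (vector `0x20`, destination CPU 6) this yields address `0xFEE06000` and data `0x20`.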

**Affected paths** (3 call sites replaced):
- PIC handler direct injection (`bhyve-i8259.c`)
- pre_run PIC + IOAPIC pending injection (`bhyve-all.c`)
- IPI delivery for Fixed/LowPri mode (`bhyve-all.c`)

#### 8. IRR scrubber eating live ATA interrupts
**Symptom**: After the MSI fix, SMP8 booted much further (disk detected,
initrd loaded, systemd started) but still hit occasional "ata1: lost
interrupt" errors causing 30-second stalls and ATA speed degradation
(DMA → PIO4 → PIO3).
**Root cause**: `scrub_lapic_bad_vectors()` — originally written to prevent
stale firmware vectors from causing trap 30 during FreeBSD kernel boot —
unconditionally cleared vector 32 (0x20) from the IRR of every vCPU on
**every** `vm_run` entry. But Linux assigned vector 0x20 to ATA IRQ 14
via the IOAPIC RTE. The scrubber was erasing legitimate ATA interrupts
from the IRR before `vmx_inject_interrupts` could deliver them.
**Fix**: Before scrubbing, build a bitmask of "live" vectors from
`bhyve_ioapic_vectors[]` (the QEMU-side IOAPIC RTE cache). Only scrub
vectors that are **not** actively assigned to any IOAPIC pin. This
preserves ATA, network, and all other guest-programmed interrupt vectors
while still scrubbing stale firmware vectors during early boot.
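
The live-vector mask can be sketched as plain bit arithmetic. The Python model below is an illustration only: it treats the IRR as one 256-bit integer, assumes an RTE cache entry of 0 means "pin not programmed", and takes the list of scrub candidates as a parameter because the report does not list them in full.

```python
def vector_mask(vectors) -> int:
    """Bitmask with one bit set per vector number."""
    mask = 0
    for vec in vectors:
        mask |= 1 << vec
    return mask


def live_vector_mask(ioapic_vectors) -> int:
    """Vectors currently programmed into IOAPIC RTEs (0 = unprogrammed, assumed)."""
    return vector_mask(v for v in ioapic_vectors if v != 0)


def scrub_irr(irr: int, scrub_candidates, ioapic_vectors) -> int:
    """Clear candidate vectors from the IRR unless an IOAPIC pin uses them."""
    stale = vector_mask(scrub_candidates) & ~live_vector_mask(ioapic_vectors)
    return irr & ~stale
```

With pin 14 programmed to vector `0x20`, a pending `0x20` survives the scrub while an unassigned candidate is still cleared; before any RTE is programmed, every candidate is scrubbed as before.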

---

## Files Modified (Final State)

All changes are in QEMU userspace. No kernel (vmm.ko) modifications in the
final build.

| File | Changes |
|------|---------|
| `accel/bhyve/bhyve-all.c` | MSI-based `bhyve_inject_lapic_irq()`, pre_run PIC+IOAPIC injection via `vm_lapic_msi()`, IPI delivery via MSI, smart IRR scrubber with live-vector mask, post_run RIP+RAX sync, BQL fast-path, full APIC ICR handler, SMP rearm logic |
| `accel/bhyve/bhyve-i8259.c` | PIC handler routes to correct CPU via IOAPIC destination cache, uses `bhyve_inject_lapic_irq()` (now MSI-based) |
| `accel/bhyve/bhyve-ioapic.c` | IOAPIC RTE write handler caches vectors + destinations, IOAPIC set_irq wakes correct target CPU |
| `accel/bhyve/bhyve-internal.h` | Declarations for pending bitmasks, IOAPIC vector/destination caches |
| `hw/ide/core.c` | nIEN bypass (temporary — force-raises IRQ regardless of interrupt enable flag) |

---

## Test Results

| Configuration | Result | Boot Time to Login | Lost Interrupts |
|--------------|--------|-------------------|-----------------|
| SMP1 | PASS | ~40s | 0 |
| SMP2 | PASS | ~90s | 0 |
| SMP8 | PASS | ~160s | 0 |

Guest: Debian 13, kernel 6.12.86, full systemd boot with networking, D-Bus,
AppArmor, serial getty, filesystem mount+fsck.

---

## New Features Unlocked

### 1. Proper SMP Linux kernels
The guest kernel uses all CPUs for scheduling, interrupt distribution, and RCU
callbacks.
- **Better interrupt distribution**: Linux spreads IRQs across CPUs via IOAPIC
routing, reducing the BSP bottleneck

### 2. Cross-CPU interrupt delivery
The `vm_lapic_msi()` mechanism provides correct, deadlock-free interrupt
injection to any vCPU from any thread context. This is essential for:
- **IOAPIC-routed device interrupts** (network, disk, USB) targeting non-BSP CPUs
- **Inter-Processor Interrupts (IPIs)** — fixed/lowpri mode for Linux scheduler
cross-CPU wakeups, TLB shootdowns, function call IPIs
- **Future MSI/MSI-X device support** — the MSI injection path is already
compatible with PCI MSI interrupts

### 3. Reliable ATA/IDE disk I/O under SMP
The IRR scrubber fix ensures disk interrupts are never silently dropped. With
SMP2 this was masked (vec 0x22 was not scrubbed), but SMP8's vec 0x20
assignment exposed the bug. All IOAPIC-assigned vectors are now protected.

### 4. BQL-scalable vCPU execution
The fast-path optimization in `pre_run` allows idle/spinning vCPUs to re-enter
`vm_run` without touching BQL. This prevents AP spin-wait loops from starving
the BSP's interrupt processing, enabling clean SMP boot without artificial
delays.

---

## Remaining Cleanup (TODO)

1. **Remove diagnostic fprintf** from `bhyve-i8259.c` (ATA-IRQ traces),
`bhyve-ioapic.c` (IOAPIC-ATA, IOAPIC-RTE-WRITE traces), and
`bhyve-all.c` (LAPIC-MSI-ERR traces)
2. **Restore proper `ide_bus_set_irq`** with nIEN check in `hw/ide/core.c`
(currently force-bypassed) — investigate why nIEN stays set with SMP8
3. **Remove kernel diagnostic printf** from `vmm_dev_machdep.c` and
`vmm_lapic.c` (VM_LAPIC_IRQ and LAPIC_SET_INTR traces), rebuild clean
vmm.ko
4. **Test SMP4** explicitly (should work given SMP8 works)
5. **Stress test** — sustained I/O under SMP8 (large file copy, package
install) to verify no residual interrupt delivery issues
 
What we have achieved between yesterday and today:

1. SMP up to 8 CPUs — previously only 1 CPU worked, now the Debian VM runs with 2, 4 or 8 processors
2. Fast boot — previously it took 30+ seconds per systemd service line, now the full boot takes ~4 seconds
3. Working interactive login — previously the VM reached the login prompt but you couldn't type anything.
Now you can log in, use the shell, run apt update, etc.
4. Keyboard input fix — discovered that glib (the library QEMU uses to read input) stops working with the
bhyve accelerator. Created a workaround that reads directly from the keyboard every 5ms
5. stdin fix with sudo — discovered that echo password | sudo leaves stdin dead for QEMU. Fixed with exec
0</dev/tty in the start script
6. Proper multi-CPU handling — implemented the INIT-SIPI-SIPI protocol that the BIOS uses to bring up
additional processors
7. Working networking — the VM can access the internet, run apt update
8. Improved start script — automatic cleanup of zombie VMs, optimized parameters
 
What's the performance loss if you do this with both FreeBSD host and VM guest? And what's left of the graphics inside the VM?
 


Istantanea_2026-05-19_23-56-17.jpg


Summary: Pure CPU-bound code runs with ~15% overhead — this is hardware-accelerated via VT-x/EPT, the guest runs natively on the CPU. The real cost is in syscalls (3-4x slower due to VM exits) and memory bandwidth (EPT double-translation). Disk I/O is reasonable (~22% overhead on writes). Fork is ~30% slower. These numbers are comparable to what you'd expect from KVM on Linux.

Graphics in the VM:

The guest sees a QEMU stdvga (Bochs VBE, vendor 0x1234:0x1111), a basic framebuffer with 16MB VRAM. FreeBSD loads the vgapci driver and runs in VT(vga) text 80x25 mode. There is no GPU acceleration: no DRI, no /dev/dri, no DRM driver.

Available GPU options in QEMU (all emulated, none accelerated):

- stdvga/bochs-display (current): basic framebuffer, works for console and X11 with scfb or vesa driver
- virtio-gpu: paravirtualized, needs the virtio_gpu driver (available in FreeBSD 14+), better performance for 2D
- qxl: Spice protocol, good for remote desktop scenarios
- vmware-svga: VMware SVGA II, FreeBSD has the vmwgfx driver

None of these provide 3D hardware acceleration inside the VM. For that you'd need either GPU passthrough (VFIO/PCI passthrough of a real GPU) or virgl (OpenGL forwarding, not yet available with bhyve). The VM is suitable for server workloads, console use, and basic X11/Wayland desktop — but not for GPU-intensive tasks like gaming or 3D rendering.
 
Any chance of getting Chavali's (& later) vmm.ko merged in to support accelerated qemu? I noticed that his (dumrich) fork of freebsd-src is over 5700 commits behind freebsd's repo.
 
Performance Test Methodology

Setup

- Host: i9-9900K, FreeBSD 16.0-CURRENT, ~64GB RAM
- Guest: FreeBSD 15.0-RELEASE, QEMU + bhyve accelerator (-accel bhyve), 1 vCPU, 2GB RAM
- Guest disk: 6GB qcow2 on host UFS
- Access: SSH into guest via port forwarding (host:2222 → guest:22)

Benchmarks used:

No sysbench — all custom. Two rounds:

Round 1 — shell-based (dd):

dd if=/dev/zero of=/dev/null bs=1m count=256 # memory bandwidth
dd if=/dev/zero of=/tmp/bench.dat bs=1m count=64 conv=sync # disk write
time sh -c 'i=0; while [ $i -lt 100 ]; do /bin/true; i=$((i+1)); done' # fork

Round 2 — compiled C program (cc -O2), run on both host and guest:

- CPU integer: 100M volatile additions with clock_gettime(CLOCK_MONOTONIC)
- fork+wait: 200x fork()/waitpid() cycle
- Syscall latency: 1M getpid() calls
- Disk I/O: 128MB sequential write (1MB chunks) + fsync(), then sequential read
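
The measurement pattern behind the syscall test (time a fixed number of calls with a monotonic clock, then divide) can be sketched as below. This is a Python illustration of the method only, not the C program that produced the numbers; interpreter overhead dominates here, so its absolute results are not comparable with the figures in this post. The function names are hypothetical.

```python
import os
import time


def per_call_ns(fn, iterations: int) -> float:
    """Average wall-clock cost of one call to fn, in nanoseconds."""
    start = time.monotonic_ns()
    for _ in range(iterations):
        fn()
    elapsed = time.monotonic_ns() - start
    return elapsed / iterations


def overhead_percent(host_ns: float, guest_ns: float) -> float:
    """Relative slowdown of the guest versus the host, in percent."""
    return (guest_ns - host_ns) * 100.0 / host_ns
```

Running `per_call_ns(os.getpid, 1_000_000)` on host and guest and feeding both results to `overhead_percent` gives the kind of guest-vs-host ratio reported here; a guest time of 115 against a host time of 100 corresponds to 15% overhead.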

Raw numbers (Round 2 — the precise one)

Istantanea_2026-05-19_23-56-17.jpg


Notes:

- Memory bandwidth number is misleading — it's EPT double-translation penalty on a kernel zero-copy loop, not real application memory bandwidth
- Disk read showed guest faster than host (13.1 vs 9.5 GB/s) because the file was hot in host page cache
- CPU runs natively on VT-x/EPT, 15% overhead is from EPT page walks + occasional VM exits

Comparison with native bhyve

We did not benchmark native bhyve. All comparisons are guest-vs-host.

However, the overhead profile should be very similar to native bhyve because:

1. The QEMU bhyve accelerator uses the same vmm.ko kernel module and the same VT-x/EPT hardware path as native bhyve
2. The vm_run() ioctl is the same — guest vCPUs execute natively on hardware in both cases
3. The main difference is device emulation: QEMU emulates devices in its own userspace (e1000, IDE, etc.) while native bhyve uses its own device models. This affects I/O-heavy workloads but not CPU/syscall benchmarks

The areas where QEMU+bhyve might differ from native bhyve:

- Disk I/O: QEMU's block layer (qcow2 + AIO) vs bhyve's direct block backend — could go either way
- Network I/O: QEMU's e1000/virtio-net emulation vs bhyve's — similar
- VM exit handling: QEMU's accelerator loop has slightly more userspace overhead per exit than bhyve's tighter loop — this explains most of the syscall/fork overhead

A proper head-to-head comparison would require running the same guest image under both native bhyve and qemu -accel bhyve with equivalent device configurations.
 
Any chance of getting Chavali's (& later) vmm.ko merged in to support accelerated qemu? I noticed that his (dumrich) fork of freebsd-src is over 5700 commits behind freebsd's repo.

Are Chavali and dumrich the same person? Or is dumrich a friend of Chavali's? It seems there are two source code locations with different patches applied.
 
Bhyve is nice. Qemu accelerated with bhyve is nice too, and useful. Qemu is older, more tested and more structured than bhyve. Why not use all the features it offers? In the future I want to build passthru of PCI devices / GPUs.
 
Is this even possible? It would already be great to support games up to, say, 2006 that still depend on DirectX 9 and an old Nvidia driver. I have 200+ XP CD games. I believe we actually have to abandon hardware 3D acceleration totally. A grid of CISC cores with the right intercommunication interface can defeat this indefinitely. It's a fake market based on single-unit polygon processing speed.
 
I'm also nostalgic. I've been enjoying Linux + qemu/kvm since the beginning. I cut my teeth on it. Virtualization has always been my favorite thing to try, first on Linux, then on FreeBSD. The Nvidia GPU passthru feature is also partly my doing: I worked on it behind the scenes with Corvin.
 