bhyve with more than 8 vCPUs slows down the startup of a Windows guest

Hi there,

I tried to set up a Windows 10 guest on a FreeBSD workstation with two AMD EPYC 7642 48-core processors.

When I just set the vCPU count with `-c 16`, or anything larger than 2 without specifying sockets, cores, and threads, the Windows 10 guest starts up quickly but displays only 2 vCPUs.

If I set up the vCPUs as any of:
Code:
-c 8,sockets=1,cores=4,threads=2
-c 8,sockets=1,cores=8,threads=1
-c 8,sockets=2,cores=2,threads=2
-c 8,sockets=2,cores=4,threads=1
the Windows 10 guest starts up quickly and displays the correct vCPU count.
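Since the guest is managed with vm-bhyve, the same topology can also be expressed in the guest's .conf file instead of raw bhyve flags. A minimal sketch, assuming the vm-bhyve option names from sysutils/vm-bhyve (the path and values are just examples):

```shell
# win10.conf — illustrative vm-bhyve guest config fragment.
# cpu is the total vCPU count; the three topology options below map
# onto the sockets/cores/threads fields of bhyve's -c option.
cpu=8
cpu_sockets=1
cpu_cores=4
cpu_threads=2
```

vm-bhyve then builds the equivalent `-c 8,sockets=1,cores=4,threads=2` argument when starting the guest.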

If I set the vCPU count to more than 8 by specifying sockets, cores, and threads, the Windows 10 guest starts up slowly, and the more vCPUs, the slower the startup. After a long time, the Windows 10 guest does start up and display the correct vCPU count. However, it should not be so slow during startup.
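For anyone reproducing this: the sockets, cores, and threads values must multiply to the total vCPU count given to `-c`. A quick sketch (illustrative only, not part of bhyve) that enumerates the valid triples for a given count:

```python
# Enumerate (sockets, cores, threads) triples whose product equals the
# requested vCPU count, matching what bhyve's -c option expects.
def topologies(ncpu):
    out = []
    for sockets in range(1, ncpu + 1):
        for cores in range(1, ncpu + 1):
            for threads in range(1, ncpu + 1):
                if sockets * cores * threads == ncpu:
                    out.append((sockets, cores, threads))
    return out

# Print every valid -c line for 16 vCPUs.
for s, c, t in topologies(16):
    print(f"-c 16,sockets={s},cores={c},threads={t}")
```

This makes it easy to test whether the slowdown depends on the total vCPU count or on a particular sockets/cores/threads split.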

My FreeBSD is the latest RELEASE, 14.3-RELEASE-p3, with all packages updated. I use vm-bhyve to manage the Windows 10 guest.

My Windows 10 guests are Windows 10 Pro and Windows 10 Pro for Workstations (22H2).

Any hints? Thanks in advance.
 
The above post is based on observations of a Windows 10 Pro guest.

Windows 10 Pro for Workstations behaves slightly differently.

If I only set `-c 16`, the Windows 10 guest displays 4 sockets and 4 vCPUs.

With Windows 10 Pro, the guest displays 2 sockets and 2 vCPUs.
 
Note that specifying just a CPU count will simply appear as X single-core processors. Windows 10 will only accept 2 physical processors, and 10 Pro for Workstations apparently allows 4, which is why it displays the "wrong" count. These desktop editions are intentionally designed to enable only that number of CPUs.

One of my bhyve guests with 4 CPUs, specified using just the basic total CPU count to bhyve:
Code:
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
FreeBSD/SMP: 4 package(s) x 1 core(s)

I can't comment on why it's slower when going above 8 logical processors. I've heard mentions that Windows on bhyve often seems to perform worse with higher CPU counts, but I'm not sure how widespread or drastic the slowdowns are.
 
Thank you for clarifying my confusion.

I want to add some additional observations. When the number of vCPUs exceeds 8, the time required to boot the Windows 10 guest increases significantly. I tested scenarios with 12 and 16 vCPUs (configured via sockets, cores, and threads), where Windows 10 guest boot exceeded 10 minutes. During this period, the Windows 10 guest's CPU load remained around 100% (observed via the host's top command; when vCPUs were set to 12, top -I showed bhyve CPU usage at approximately 1200%). After boot completes, vCPU load returns to normal levels, and the guest appears to function correctly.

However, upon shutdown and subsequent restart, this process repeats.

I do not know why this behavior occurs.
 
Interesting...

I am running a Windows 11 Pro guest with 16 vCPUs (1 socket, 16 cores, 1 thread). The guest takes only a few seconds to boot.
 
What CPU type does your host have? Some clues I've seen (on Linux hosts) seem to indicate that AMD CPUs might have similar issues, especially the EPYC series I'm using. However, I still haven't figured out the exact cause.

Now I am running the Windows guest with 8 vCPUs.
 
I'm running a Windows Server 2019 (i.e. Windows 10) VM with a 1x12 core configuration and also don't have any issues with slow booting. The host is a dual-socket Intel Xeon E5 system.

AMD had (has) various problems related to their Infinity Fabric which needed fixes/workarounds via firmware (and many even at the OS/software level). Did you check for BIOS updates?
Also, in general the way the IF and cache are built in those CPUs is often 'sub-optimal': if AMD specs the CPU at 64MB of L3 cache, that number is divided across the actual CCX units, so it actually has 8x8MB caches. If code runs on multiple CCXs, it is often replicated across all the associated caches. 1MB of cache on each CCX is shared with the others (reducing those 8MB to 7MB available exclusively to the CCX), but L3-to-L3 latency between CCXs is abysmal due to the way it is implemented...
This is all on a single socket - if you bring in a second socket, those problems scale up. As said: a lot of this needed mitigation via Firmware fixes and software workarounds.
I wouldn't be surprised if you trigger one or more of the IF scaling problems when assigning more cores than are physically present in one CCX. Assign even more cores and it may get even worse...

(there's this comment on HN also explaining those problems inherent to all zen-architecture AMD CPUs: https://news.ycombinator.com/item?id=17518844)
 