Intel Xeon 6 with a high number of threads: kernel panic

Hi,

We have two brand-new servers, each with 2x Intel 6787P (86 cores / 172 threads per CPU), for a total of 344 threads per server.
We would like to install FreeBSD, but during initialization we end up with a kernel panic. It seems to be related to the high number of threads and/or to msi_map being unsupported.

We tried both 14.3 and 15.0.

14.3 :

2025.08.29.10.18.01.png


2025.08.29.10.18.35.png


15.0 :
2025.08.29.10.11.01.png


2025.08.29.10.11.26.png


Are there any parameters to tweak in the BIOS or in the FreeBSD kernel options?
I have seen that there has recently been some work to support a larger number of CPUs, so I hope we'll have some luck, but we haven't found a way to make it work so far...

Many thanks
 
Looks more like an error while attaching PCIe network adapters.

You could try disabling them in the BIOS for a test.
 
Originally I had a 2-port 10GbE Base-T card that I was suspicious of, and I removed it (it was still connected in the 14.3 pictures).
But I got the same error. There is another 4-port Gigabit card that I can try to disable remotely through the IPMI.
I will report my findings; thanks for your suggestion.
 
I tried to disable the 3x PCI slots but got the same result. (I don't see any other options, and I am not 100% sure it really disabled the right one.)
image.png


When I am back in the datacenter next week, I will physically remove the card to test.
 
blt2b Thank you for providing plenty of useful information.

There are at least two problems here.

The first problem is that there is a bug in the code of if_em and/or iflib which after a certain rare failure can lead to iflib_irq_free() being called twice on the same IRQ resource (once by em_if_msix_intr_assign() and once by em_free_pci_resources()).

The second problem is that an attempt is made to use APIC IDs larger than 255. With 2 CPUs, each having 86 cores and each core having 2 threads, the number of CPUs as seen by FreeBSD becomes quite large. With each CPU having its own APIC, APIC IDs larger than 255 have to be used. There are some problems reported with that. See <https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=288122>, <https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=287492> and <https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=273022>.
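
To make the arithmetic concrete, here is a minimal Python sketch. It assumes the classic xAPIC mode, where an APIC ID is 8 bits wide, and it assumes IDs are handed out densely starting at 0, which real firmware often does not do. The function names are made up for illustration.

```python
# Assumption: in xAPIC mode an APIC ID is stored in 8 bits, so 255 is the
# largest value that can be represented. x2APIC mode uses 32-bit IDs.
XAPIC_MAX_ID = 255


def logical_cpus(sockets, cores_per_socket, threads_per_core):
    """Number of CPUs the OS sees (hypothetical helper)."""
    return sockets * cores_per_socket * threads_per_core


def fits_in_xapic(cpu_count):
    """True if cpu_count CPUs fit in 8-bit IDs, assuming dense IDs 0..n-1."""
    highest_id = cpu_count - 1
    return highest_id <= XAPIC_MAX_ID


if __name__ == "__main__":
    for threads in (2, 1):  # Hyperthreading enabled, then disabled
        n = logical_cpus(sockets=2, cores_per_socket=86, threads_per_core=threads)
        print(n, "CPUs, fits in 8-bit APIC IDs:", fits_in_xapic(n))
```

Under the dense-numbering assumption, 172 CPUs would still fit, but firmware commonly leaves gaps in the APIC ID space, so a machine may still use IDs above 255 with Hyperthreading disabled. Whether that is the case on this platform is not established in this thread.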

I suggest the following three independent experiments:
  1. Disable hyperthreading from the BIOS. This should bring down the number of CPUs to a more manageable number. If disabling the second physical CPU is allowed, you can try that too.
  2. Disable attaching the igb0 device by running set hint.igb.0.disabled=1 at the loader prompt. This should allow the boot to succeed.
  3. Disable MSI-X interrupts for igb0 by running set dev.igb.0.iflib.disable_msix=1 at the loader prompt.
After you try these, I recommend creating a PR and also posting to the freebsd-current@ mailing list. When creating the PR, attach the output of acpidump -dt. I also recommend using a serial console to capture all the messages from the kernel in text form. FreeBSD 15.0 will be released soon so there is no time to waste.
 
JordanG many thanks for your suggestions and your analysis!

I will work on 1., 2. and 3. during the weekend, as I have remote access to the IPMI, and post my findings here. I will also reference this thread in the Bugzilla ticket.
I should also be able to post the output of acpidump -dt if I am able to boot. For the serial console I would need physical access, so I may have to add that to the ticket later.
 
All the tests were made with 15.0 (that is the USB stick currently connected to the unit).

1. Disabling Hyperthreading (BIOS → Advanced → Hyper-Threading (ALL) Disabled) lowers the number of CPUs detected by FreeBSD from 344 to 172. The behavior is the same, with the kernel panic at the end (probably due to the igb0 issue). I am not sure whether I can disable a physical CPU; the closest option I see in the BIOS is “CPU2 Core Disable Bitmap”. Its default value is 0 and the help text says: “Core Disable Bitmap(Hex) For every bit : set to Disable or clear to Enable. NOTE : Any use core disable will force static SST-PP. At least one core per CPU must be enabled. Disabling all cores is an invalid configuration.”
If you think it is the right option, I can try.

For both 2. and 3. I ran show to check that the options were set correctly before booting.

2. set hint.igb.0.disabled=1 at the boot loader, then boot: it ends the same way, with a kernel panic:
2025.08.30.16.32.25.png


As a test, I tried disabling Hyperthreading AND applying this hint, but got the same result with the same kernel panic.
I am surprised, as this is supposed to disable the entire card, especially given my result below.

3. set dev.igb.0.iflib.disable_msix=1 throws a bunch of msi_map errors, but on nvme1... I am still not sure whether the drive is fully functional or not:
2025.08.30.17.03.24.png


But I got the system to boot! (igb0 is gone from the system, so it is not very convenient to use :)

I was curious to see whether it makes a difference to apply 3. with or without Hyperthreading disabled; I was able to boot in both cases.

Any more ideas to pursue/focus the troubleshooting before I open a ticket?
Many thanks again for your help, it is very much appreciated.
 
All the tests were made with 15.0 (that is the USB stick currently connected to the unit).

1. Disabling Hyperthreading (BIOS → Advanced → Hyper-Threading (ALL) Disabled) lowers the number of CPUs detected by FreeBSD from 344 to 172. The behavior is the same, with the kernel panic at the end (probably due to the igb0 issue). I am not sure whether I can disable a physical CPU; the closest option I see in the BIOS is “CPU2 Core Disable Bitmap”. Its default value is 0 and the help text says: “Core Disable Bitmap(Hex) For every bit : set to Disable or clear to Enable. NOTE : Any use core disable will force static SST-PP. At least one core per CPU must be enabled. Disabling all cores is an invalid configuration.”
If you think it is the right option, I can try.

Don't play with the "CPU2 Core Disable Bitmap" in the BIOS. It may have some effect but in my opinion it will not be worth the time.

2. set hint.igb.0.disabled=1 at the boot loader, then boot: it ends the same way, with a kernel panic:

As a test, I tried disabling Hyperthreading AND applying this hint, but got the same result with the same kernel panic.
I am surprised, as this is supposed to disable the entire card, especially given my result below.

Instead of just hint.igb.0.disabled=1 try these four together: hint.igb.0.disabled=1, hint.igb_if.0.disabled=1, hint.em.0.disabled=1 and hint.em_if.0.disabled=1.
The source code file sys/dev/e1000/if_em.c logically contains four drivers and maybe igb is not the correct one. An alternative explanation is that you've got more than one network adapter.

3. set dev.igb.0.iflib.disable_msix=1 throws a bunch of msi_map errors, but on nvme1... I am still not sure whether the drive is fully functional or not:

But I got the system to boot! (igb0 is gone from the system, so it is not very convenient to use :)

Great!

Any more ideas to pursue/focus the troubleshooting before I open a ticket?

Try setting both hw.pci.enable_msi=0 and hw.pci.enable_msix=0 at the loader prompt.
Also separately try setting kern.smp.disabled=1.

When filing the PR don't forget to attach the output of acpidump -dt for the case where Hyperthreading is enabled.
 
Don't play with the "CPU2 Core Disable Bitmap" in the BIOS. It may have some effect but in my opinion it will not be worth the time.
OK 👍 for not playing with CPU2 Core Disable Bitmap.

Instead of just hint.igb.0.disabled=1 try these four together: hint.igb.0.disabled=1, hint.igb_if.0.disabled=1, hint.em.0.disabled=1 and hint.em_if.0.disabled=1.
The source code file sys/dev/e1000/if_em.c logically contains four drivers and maybe igb is not the correct one. An alternative explanation is that you've got more than one network adapter.
I tried all four together: same kernel panic. I only have one 4-port Gigabit card (one PCI slot). This week I will add a 2-port 10GbE card and will also test another 4-port Gigabit card; it is almost the same, though not the exact same reference, and I know it works with FreeBSD in another server.

Try setting both hw.pci.enable_msi=0 and hw.pci.enable_msix=0 at the loader prompt.
The boot never finishes: nvme0 and nvme1 complain about MSI, and I get plenty of USB_ERR_TIMEOUT errors.
2025.09.01.15.13.45.png


Also separately try setting kern.smp.disabled=1.
2025.09.01.15.59.46.png


I still don't know whether this is also due to the igb0 issue, but as you asked me to do it separately, I didn't try the four hints together with this; I did it independently.

When filing the PR don't forget to attach the output of acpidump -dt for the case where Hyperthreading is enabled.
I used set dev.igb.0.iflib.disable_msix=1 to boot with Hyperthreading enabled and ran acpidump -dt. I don't have any functioning network card or serial console right now, so I exported the output to a file; I hope I can connect a USB drive to export the (long) output for the PR.
 
JordanG, to give you some news, as I am in the datacenter today:

I removed the 4-port Gigabit card and replaced it with a 4x10GbE card, and it boots without any issue and without any changes at the loader prompt.

I now have the acpidump -dt output and I am opening a ticket in the FreeBSD Bugzilla for the large number of threads, which I think is the most urgent issue before 15.0, as more people could be affected, especially with the new generation of servers/CPUs available on the market.
 
You created a problem report in FreeBSD's bug database, however it is a very empty one. This forum thread contains 20 times more information than the problem report. And you decided to cite me but only partially, omitting two thirds of my analysis. You need to do better.

When you use the "4x10GbE" NIC do you get messages such as these:
Code:
nvme0: System interrupt issues?
usbd_setup_device_desc: getting device descriptor at addr 1 failed, USB_ERR_TIMEOUT
nvme1: System interrupt issues?
usbd_setup_device_desc: getting device descriptor at addr 1 failed, USB_ERR_TIMEOUT
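
If you capture the boot messages as text (for example over a serial console), a small script can count how often these two symptoms appear, which makes it easier to compare boots with different NICs or loader settings. This is only a sketch in Python: the pattern names and the function name are hypothetical, and the patterns match just the two messages quoted above.

```python
import re
from collections import Counter

# Hypothetical labels mapped to the two kernel messages quoted in this thread.
PATTERNS = {
    "nvme_interrupt": re.compile(r"^nvme\d+: System interrupt issues\?"),
    "usb_timeout": re.compile(r"USB_ERR_TIMEOUT"),
}


def count_symptoms(log_text):
    """Count lines of a captured boot log that match each known symptom."""
    counts = Counter()
    for line in log_text.splitlines():
        stripped = line.strip()
        for name, pattern in PATTERNS.items():
            if pattern.search(stripped):
                counts[name] += 1
    return dict(counts)
```

Calling count_symptoms() on the text of a saved console capture returns a dictionary with one counter per symptom that occurred; a symptom that never appears is simply absent from the result.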

How is the "4x10GbE" NIC described in dmesg?

I see that there are two different panics. You should report them both.
 