Report on running LLM inference on FreeBSD

This weekend, I installed FreeBSD on one of my homelabs and tried to run llama.cpp inference on it.

TL;DR: Almost there! In fact, with a recent GPU, it may already be usable.

My Hardware

I have two homelabs with strictly identical hardware, which makes comparing with Linux easy:

- MZ01-CE1 mobo
- AMD EPYC 7601 processor
- 4x Nvidia P40 GPUs
- 64GB DDR4 RAM
- 1TB M.2 SSD

My Software

My tests were run with a Llama-3.3 70B Instruct model, quantized as IQ4_NL, in GGUF format.

More importantly, I'm using llama.cpp as the inference engine. It is a stable C++ engine, which lets me avoid dealing with the ML Python ecosystem and its dependency graphs breaking every other day.

The Linux box runs Gentoo, with nvidia-drivers-580.95.05 and nvidia-cuda-toolkit-12.9.0.

On FreeBSD, I'm using the same model, and the same inference engine, but built with Vulkan support instead of CUDA.

The Nvidia driver is nvidia-driver-580.119.02_1, and the Vulkan libs are vulkan-headers-1.4.336, vulkan-loader-1.4.336 and shaderc-2025.5_1. The only other specific thing I had to do was build llama.cpp with -DGGML_VULKAN=1 instead of -DGGML_CUDA=1.

The Result

It works! I could run inference on pure FreeBSD. Inference was a bit slower, averaging 5.1 tokens per second against 5.7 tokens per second on Linux/CUDA with the same prompts, so about 10% slower. I could live with that.

… but the deal breaker was prompt processing. It's about 10 times slower with Vulkan. This is not going to work for me, because one of my most common tasks is feeding the model legal documents, RFCs, D&D campaign logs, etc. and asking it questions about them. There, it grinds so slowly that it's not practical.

Asked about it, Gemini tells me it's a Pascal problem (the architecture of the P40 cards), and that more recent cards don't have it. I won't take its word for it, but I guess we'll see when I'm rich enough to buy something else.

Still, it means that if you don't need big prompts and/or have more recent hardware, pure FreeBSD inference may just work.

For me… it's time to learn about the Linuxulator. Apparently, I'm going to install an Ubuntu. Which is quite a fail for me, having migrated from Gentoo to FreeBSD to get a less agitated system. 😂

PS: not sure in which category of the forum I should put this post; none really matches… Server would be the closest, I guess, since homelabs usually provide inference for other machines on the local network. But then none of the subcategories matches. So I put it in off-topic.

Edit: Oh wow. So the Linuxulator is a translation layer implementing the Linux syscall API and converting the calls to FreeBSD kernel syscalls, am I reading that correctly? That's… some WINE-level amount of work, both initially and at each kernel update, I imagine. Thanks a lot to the FreeBSD developers for doing it.

Also, I'm totally installing a Gentoo in that compat directory. :P The chroot method is basically begging for it. But let's install an Ubuntu first to figure out everything there is to figure out in the documented way.
 
Let us know how it goes. It has been some time since somebody had CUDA running through the Linuxulator.

Good thinking about Vulkan instead of CUDA. I'll have to try that.
 
That's… some WINE-level amount of work

No, not exactly. Wine reconciles two completely incompatible OSes.
The Linuxulator is one of many Unix-on-Unix compatibility layers. Early FreeBSD had System V Release 4 compatibility, as far as I remember. Linux also had a number of binary-compatibility layers, as did the commercial Unixes.

For example, all Unix OSes use system calls, but they don't fully agree on the naming or the syscall numbering, so the layer translates between them.

Wine is the biggest-scoped reimplementation ever; I think nothing comes close.

Some time ago I picked up a boxed copy of an old SCO OpenDesktop system that claims Windows compatibility. The hardware support for then-standard PC devices I found staggering: video, sound, networking, it's all supported. They ship a third-party DOS-on-Unix emulator and a ton of scaffolding around it to enable running Windows 95 apps on the same X desktop as native applications, all on a "user friendly" Motif GUI. The graphics capability of their DOS box I found excellent for software of the age: it was able to run graphics as well as an early 486 on a Pentium host.

Take into account all the code SCO and third parties wrote for that purpose, and add all the code of DOS and Windows 95.
I don't think it touches even 10% of the scope of the Wine project.
 
Maybe this will be of interest to the OP (I saw it here https://vermaden.wordpress.com/2026/02/23/valuable-news-2026-02-23/):

Very useful, thank you. I was still figuring out whether my best shot would be running llama.cpp from the Linux chroot and trying to have it use the FreeBSD Nvidia driver through the Linuxulator, or going the opposite direction: running llama.cpp on the host FreeBSD and trying to make it access CUDA libs installed in the chroot. I went the first way first because it sounds way less convoluted, but so far I could not get llama.cpp (or nvidia-smi) to see the GPUs. This doc is doing the same thing, so it should give me a big shortcut for debugging my problem, thanks. 👍
 
I have used "ollama". Works fine. But it did not recognise my NVIDIA hardware to run on the GPU.
Yes, running llama.cpp or similar on CPU (or on GPU using Vulkan) is not an issue. The problem is when you want to run CUDA, which sadly is by far the most efficient way to run inference. Running on GPU is several orders of magnitude faster than running on CPU (provided you have enough VRAM to fit the model), and as I found above, prompt processing is vastly faster on GPU with CUDA than on GPU with Vulkan (and inference is still noticeably faster as well). Sadly, there is no CUDA support on FreeBSD, so it has to go through Linux emulation.

EDIT: by the way, if you run on CPU and have a GPU, you should try building ollama with Vulkan support and offloading a few layers to your GPU. Even with little VRAM, it often makes a big difference.
 
Here is my final report on this attempt.

In the end, I decided against going with FreeBSD for those homelabs. While it is indeed possible to run GPU-accelerated inference on FreeBSD with Vulkan, and probably possible with CUDA given a lot of tinkering, it was not going to work with my very specific set of constraints.

I'm using Nvidia Tesla P40 cards, built on the Pascal architecture, an old architecture that Nvidia dropped from its drivers starting with version 590, which is now reaching the package managers of the various OSes and distros. Time is running out for me: soon, only old drivers will support my cards, themselves depending on old dependencies and breaking with more recent versions, which in turn have their own dependencies. So to keep using my cards in the future, I need to be able to freeze my OS. That part is easy: just don't update it. Except I also want to be able to reproduce it from scratch should a drive die or get corrupted.

So I need something that can be installed through a purely offline install, using tarballs I have kept of the working versions of every piece of software on the system, without ever depending on the internet. This could have worked with FreeBSD had Vulkan support been enough for me. Sadly, on the P40, prompt processing on Vulkan is especially slow, 10x slower than CUDA, so it's not an option. FreeBSD plus a Linux distro through the Linuxulator would mean having to freeze *two* OSes, which would only make things more fragile. Plus, taking on such a challenging setup with an OS I barely know, rather than the Linux distro I've been using for more than 20 years, sounded unwise indeed. I can make my Gentoo work without the web, and I can remember "how things were set up back then" should web documentation no longer mention it when the time comes to fix something. If I tried that with an OS I'm only starting with, it would be a recipe for disaster, not because of the OS, but because I would lack the skills needed to take it to such an extreme.

In the end, I did back up a copy of the current Gentoo stage3 (the base system that you uncompress into your root partition), a copy of all distfiles for the programs currently installed, Portage at its current state, various configuration files (including the kernel .config), the llama.cpp sources at the exact commit I know was working, and the model files I'm currently using. Then I used all that to make an install from scratch on the homelab on which I had installed FreeBSD, doing it all offline, from disk partitioning to running LLM inference. It worked. My process is now failproof; I know I can use that hardware forever.

Does that mean you can't run LLM inference on FreeBSD?

Absolutely not! I had a very specific set of constraints with my P40 cards (in case you wonder why I use such an old architecture: they are 24GB VRAM cards, and four of them gave me 96GB of VRAM for €1,000 total; a recent RTX 6000 Pro Blackwell 96GB is about €10,000. Can't wait for the wave of AI business bankruptcies to start so I can get those cheap :P ). But:

  • if you want to run inference on CPU only, it works out of the box.
  • if you want to run inference on GPU, Vulkan works quite well, if a bit slower than CUDA (especially at prompt processing), but that may be good enough, and it's basically as easy as building for CPU: just add -DGGML_VULKAN=1 to the config command and install the dependencies (vulkan-headers, vulkan-loader and shaderc).
  • if you want to run inference on GPU with CUDA, some people have managed to do it.

I could not make CUDA work myself, but that should not be taken to mean it can't be done. First, some people did it recently, as shown in the Gist posted by AlfredoLlaquet above. Second, I had more or less already decided to go with Gentoo for my frozen system during my last attempt to make CUDA work on FreeBSD, and I had a second homelab with the exact same specs working on Gentoo, so my drive to get it working on FreeBSD was not especially strong. Someone ready to dedicate more time to it may have more success.

Also, it's worth noting that if you "just want to try it to see how it works", it's practical nowadays to play with small models using CPU only, even if you have a GPU. You don't need to bother setting up GPU inference to get started, now that you can do near-perfect OCR with a Qwen-3.5 9B model taking 5GB of RAM and running at 6 tokens/sec on a decade-old CPU.

My laptop is an old EliteBook G6 with an i7-8665U CPU and 32GB of RAM. I've been running a couple of those models through llama-server as init.d services, and I've started writing programs using llama-server's API, including a mail-filter program that routes my mail by asking questions about its content. It works great, and I'm having fun. No need for homelabs and GPUs for that.
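To give an idea of the mail-routing approach, here is a minimal Python sketch of how such a filter could talk to a local llama-server. The /v1/chat/completions endpoint is llama-server's OpenAI-compatible API; the folder names, function names and prompt wording are all made up for illustration, not taken from my actual program.

```python
import json
import urllib.request

# Hypothetical folder list for the illustration.
FOLDERS = ["invoices", "newsletters", "personal", "spam"]

def build_routing_prompt(subject, body, folders=FOLDERS):
    """Build a constrained classification prompt for the model."""
    return (
        "Route this email into exactly one of these folders: "
        + ", ".join(folders) + ".\n"
        "Answer with the folder name only.\n\n"
        f"Subject: {subject}\n\n{body[:2000]}"  # truncate long mails
    )

def parse_folder(answer, folders=FOLDERS):
    """Map the model's free-form answer back to a known folder."""
    answer = answer.strip().lower()
    for folder in folders:
        if folder in answer:
            return folder
    return "inbox"  # fall back when the model answers something unexpected

def route_email(subject, body, url="http://localhost:8080/v1/chat/completions"):
    """Ask the local llama-server which folder this email belongs in."""
    payload = {
        "messages": [
            {"role": "user", "content": build_routing_prompt(subject, body)}
        ],
        "temperature": 0.0,  # keep classification answers stable
    }
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return parse_folder(data["choices"][0]["message"]["content"])
```

The parsing step matters in practice: small models sometimes answer with a full sentence instead of the bare folder name, so matching the answer against the known folder list keeps the filter robust.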

If you want to try something like that:

  1. download either Qwen3.5-9B-UD-Q4_K_XL.gguf or Qwen3.5-4B-UD-Q4_K_XL.gguf from the Unsloth account on huggingface.co (the first is the 9B model, bigger than the second, which is a 4B model; this is the number of parameters of the model, and fewer parameters mean smaller size, faster execution and poorer capabilities)
  2. download llama.cpp from their GitHub repo
  3. build llama.cpp: cd llama.cpp && cmake -B build && cmake --build build --config Release -j $(nproc)
  4. run build/bin/llama-server -m <model_path> --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 --port 8080
  5. point your web browser to http://localhost:8080

There is, of course, way more to it if you then want to tune for optimal performance, but the instructions above are about as easy as it gets.
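If you'd rather script against the server than use the web UI, the same sampling settings from step 4 can be sent over llama-server's OpenAI-compatible API. A hedged sketch, assuming the /v1/chat/completions endpoint; top_k and min_p are llama.cpp-specific extensions to the request schema:

```python
import json
import urllib.request

def build_payload(prompt):
    """Mirror the llama-server command-line flags from step 4 as API fields."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,  # --temp 0.6
        "top_k": 20,         # --top-k 20
        "top_p": 0.95,       # --top-p 0.95
        "min_p": 0.0,        # --min-p 0
    }

def ask(prompt, url="http://localhost:8080/v1/chat/completions"):
    """Send one chat request to a local llama-server and return the answer text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Parameters passed in the request override the server's defaults per call, which is handy when one server feeds several programs with different needs.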

Alright, now I move to trying to run FreeBSD on my Raspberry Pi home server! Once I get that Ethernet/USB adapter. 😅
 