Technical aspects of running local LLMs on FreeBSD

cracauer@

This is a continuation of my local LLM thread, which is hard to reply to. This one might be easier.



I wanted to waffle a bit more about the hardware involved. You basically have two approaches:
  • Get a fat GPU with a certain amount of VRAM. You will be able to run LLMs up to the VRAM size fast. Models that overflow into system RAM (up to the machine's RAM size) will run slowly, but not necessarily catastrophically so if a good chunk still sits in VRAM.
  • Get a machine with integrated GPU that shares RAM with the GPU. This gets you much more memory for the GPU than you can afford with a dedicated GPU. It will run models up to RAM size at mediocre to OK speed.
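The overflow case from the first bullet is what llama.cpp's layer offloading handles: you choose how many transformer layers live in VRAM and the rest run on the CPU. A sketch (model path and layer count are placeholders, not a recommendation):

```shell
# llama.cpp: -ngl / --n-gpu-layers sets how many transformer layers are
# placed in VRAM; the remaining layers are computed on the CPU from RAM.
# The model path and the value 40 are illustrative placeholders.
llama-server --model /models/example-70b-Q4_K_M.gguf \
    --n-gpu-layers 40 \
    --host 127.0.0.1 --port 8080
```

Raising `--n-gpu-layers` until you run out of VRAM is the usual way to find the sweet spot on a partially-offloaded model.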

Cost-wise it is a wash. For $3300-$3500 you can pick between:
  • A used NVidia 5090 with 32 GB VRAM
  • 3x NVidia 3090 and a really fat power supply
  • An AMD Strix Halo (Ryzen AI 395) with 128 GB RAM (shared GPU memory)
  • An Apple Mac with a M4 Max and 128 GB RAM (shared GPU memory)
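The speed difference between these options comes down mostly to memory bandwidth: generating one token reads roughly the whole model from memory once. A back-of-the-envelope sketch (bandwidth figures are approximate spec-sheet numbers; the 20 GB model size is a placeholder):

```shell
# Rough generation speed: memory bandwidth / bytes read per token,
# where bytes per token is about the model file size.
# Bandwidths (GB/s, approximate): 5090, 3090, M4 Max, Strix Halo.
model_gb=20
for bw in 1792 936 546 256; do
    echo "$bw GB/s -> ~$(( bw / model_gb )) tok/s"
done
```

This ignores compute limits and prompt processing, but it explains why the same model is "very fast" on the dedicated card and "mediocre to OK" on shared-memory machines.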

The Apple Mac doesn't run all models well. Some models are marked as optimized for Metal (Apple's equivalent to Vulkan).

I have been told that Strix Halo prefers to run Mixture of Experts models and doesn't do so well on dense models. I don't have one of those suckers so I can't comment. There's also the matter of first getting FreeBSD to run on the thing at all (we have recently seen a failure) and then getting Vulkan up.

I have the single NVidia card, which runs out of the box for FreeBSD. But alas everything bigger than 32 GB is slow. There are many interesting models between 32 and 128 GB. But on the other hand, the models that do fit are very fast. You can also use the performance advantage to increase the context size (although that costs further VRAM). I also plan to do extensive experiments with post-training and agents. Speed is more important in that case than when just running a chat through the web browser.
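How much VRAM that extra context costs can be estimated: the KV cache grows linearly with context length. A rough sketch, with all model dimensions being hypothetical placeholders rather than any specific model's values:

```shell
# Rough KV-cache size:
#   2 (K and V) * layers * kv_heads * head_dim * bytes_per_element * context
# All dimensions below are hypothetical placeholders.
layers=32; kv_heads=8; head_dim=128; bytes=2; ctx=32768
kv_mib=$(( 2 * layers * kv_heads * head_dim * bytes * ctx / 1024 / 1024 ))
echo "KV cache: ${kv_mib} MiB"
```

With these numbers a 32k context eats about 4 GiB on top of the model weights, which is why context size matters so much on a 32 GB card.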

I do not know whether multi-GPU works for one of the llamas on FreeBSD. On paper 3x 3090 looks really attractive since it gives you 72 GB of VRAM for the same price. And functions as heating in winter. Just for starters it might be that multi-GPU in llama.cpp only works with CUDA, not Vulkan.
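For reference, llama.cpp's multi-GPU flags look like this, whether or not they work with Vulkan on FreeBSD (a sketch; the model path and the equal 1,1,1 weighting are illustrative):

```shell
# --split-mode layer distributes whole layers across devices;
# --tensor-split gives the relative share per GPU (three equal cards here).
llama-server --model /models/example-Q4_K_M.gguf \
    -ngl 99 \
    --split-mode layer \
    --tensor-split 1,1,1
```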

Theoretically the NVidia GPUs have another advantage: you can run all that GPU software that only has CUDA backends. At the time of this writing CUDA does not work on FreeBSD through Linuxulator, though.

Finally just a word on commercial LLMs: cost-wise it is clearly best to just use the $20/month plans. They are heavily subsidized. Buying your own hardware can't compete on price. But Anthropic might or might not have kicked Claude Code out of that plan.
 
First attempt on my Raptor Lake with UHD730 graphics that should work:
Code:
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Graphics (RPL-S) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
Illegal instruction (core dumped)
Do I need a proprietary Intel driver? Never seen it...
Next attempt can be a Ryzen 7 with GTX1050.
 

Not sure. What does `vulkaninfo --summary` from the vulkan-tools port report? Maybe you only have the Mesa software renderer?
 
Code:
==========
VULKANINFO
==========

Vulkan Instance Version: 1.4.336


Instance Extensions: count = 24
-------------------------------
VK_EXT_acquire_drm_display             : extension revision 1
VK_EXT_acquire_xlib_display            : extension revision 1
VK_EXT_debug_report                    : extension revision 10
VK_EXT_debug_utils                     : extension revision 2
VK_EXT_direct_mode_display             : extension revision 1
VK_EXT_display_surface_counter         : extension revision 1
VK_EXT_headless_surface                : extension revision 1
VK_EXT_surface_maintenance1            : extension revision 1
VK_EXT_swapchain_colorspace            : extension revision 4
VK_KHR_device_group_creation           : extension revision 1
VK_KHR_display                         : extension revision 23
VK_KHR_external_fence_capabilities     : extension revision 1
VK_KHR_external_memory_capabilities    : extension revision 1
VK_KHR_external_semaphore_capabilities : extension revision 1
VK_KHR_get_display_properties2         : extension revision 1
VK_KHR_get_physical_device_properties2 : extension revision 2
VK_KHR_get_surface_capabilities2       : extension revision 1
VK_KHR_portability_enumeration         : extension revision 1
VK_KHR_surface                         : extension revision 25
VK_KHR_surface_protected_capabilities  : extension revision 1
VK_KHR_wayland_surface                 : extension revision 6
VK_KHR_xcb_surface                     : extension revision 6
VK_KHR_xlib_surface                    : extension revision 6
VK_LUNARG_direct_driver_loading        : extension revision 1

Instance Layers:
----------------

Devices:
========
GPU0:
    apiVersion         = 1.3.278
    driverVersion      = 24.1.7
    vendorID           = 0x8086
    deviceID           = 0xa780
    deviceType         = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
    deviceName         = Intel(R) Graphics (RPL-S)
    driverID           = DRIVER_ID_INTEL_OPEN_SOURCE_MESA
    driverName         = Intel open-source Mesa driver
    driverInfo         = Mesa 24.1.7
    conformanceVersion = 1.3.6.0
    deviceUUID         = 868080a7-0400-0000-0002-000000000000
    driverUUID         = f303ef53-4163-bd95-7437-925e093696ce
GPU1:
    apiVersion         = 1.3.278
    driverVersion      = 0.0.1
    vendorID           = 0x10005
    deviceID           = 0x0000
    deviceType         = PHYSICAL_DEVICE_TYPE_CPU
    deviceName         = llvmpipe (LLVM 15.0.7, 256 bits)
    driverID           = DRIVER_ID_MESA_LLVMPIPE
    driverName         = llvmpipe
    driverInfo         = Mesa 24.1.7 (LLVM 15.0.7)
    conformanceVersion = 1.3.1.1
    deviceUUID         = 6d657361-3234-2e31-2e37-000000000000
    driverUUID         = 6c6c766d-7069-7065-5555-494400000000
 
You have the hardware renderer.

Did you install llama.cpp and ggml from packages? Maybe the binary was built on a host whose native CPU was detected during compilation, so it uses instructions your CPU lacks? Although Raptor Lake should not be missing much except AVX-512.
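One way to check where the binary came from and to rebuild it against the local CPU (the port origin misc/llama-cpp is an assumption here; verify it with pkg first):

```shell
# Which package installed the binary, and that package's port origin:
pkg which /usr/local/bin/llama-server
pkg info -o llama-cpp
# Rebuild from ports so any CPU-feature detection runs on this machine
# (origin misc/llama-cpp is assumed; check the output above):
cd /usr/ports/misc/llama-cpp && make reinstall clean
```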
 
That's possible. I'm not sure about the origin of everything.
It's also not a GENERIC 15 kernel; some things are changed. I don't think that can be related, but it may be a problem. The system was still copying, so I didn't want to reboot. I'll try a clean kernel and world in a few hours.
 
Qwen 3.6 is out as a dense model (that means that all parameters are active, as opposed to a Mixture of Experts model).

Here is how to find the file to download. Visit the model's page on Hugging Face:

Can't hurt to read some of the drivel on there about the care and feeding of this particular model.

Click on the tab "files and versions":

Pick a quantization and size that fits your computer. Copy the link to the gguf file.

fetch https://huggingface.co/unsloth/Qwen3.6-27B-GGUF/resolve/main/Qwen3.6-27B-Q6_K.gguf

Then start llama.cpp with --model <thatfileondisk>.
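A full invocation for the file fetched above might look like this (the context size and offload count are illustrative, not recommendations for this model):

```shell
# -c sets the context size; -ngl 99 offloads all layers if VRAM allows.
llama-server --model ./Qwen3.6-27B-Q6_K.gguf \
    -c 16384 \
    -ngl 99 \
    --host 127.0.0.1 --port 8080
# llama-server then serves a built-in chat web UI on that address.
```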

There is also an actual homepage for the model:

This particular homepage is not very good. It doesn't have recommendations for parameters like temperature, KV cache configuration and so on.
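When the model card gives no recommendations, sampling parameters can still be set by hand on llama.cpp's command line. The values below are generic starting points, not this model's official settings:

```shell
# Illustrative sampling settings; tune per model and taste.
llama-cli --model ./Qwen3.6-27B-Q6_K.gguf \
    --temp 0.7 --top-p 0.9 --top-k 40 \
    -p "Explain KV cache in one paragraph."
```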
 