Running a GPU-accelerated LLM on FreeBSD (2-line howto)

cracauer@

Code:
pkg install llama-cpp
llama-server \
        --host `hostname` \
        --port 8080 \
        --ctx-size $((64 * 1024)) \
        --temp 0.6 \
        --top-p 0.95 \
        --top-k 20 \
        -hf bartowski/Qwen_Qwen3.5-27B-GGUF:Q6_K_L

This is assuming you already have a working graphics card driver with Vulkan support. You can test with `vulkaninfo --summary` from the vulkan-tools package. Working Vulkan is available in the binary NVidia drivers and in the AMD drivers (including CPU-integrated GPUs). Dunno about Intel GPUs.
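The check from the paragraph above as commands (run the install as root):

```shell
# install the Vulkan diagnostic tools and confirm a GPU device shows up
pkg install vulkan-tools
vulkaninfo --summary
```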

This will install llama.cpp, an LLM runner of intermediate setup difficulty. It provides a web server on port 8080 (or whatever you give on the command line). The FreeBSD port and package are compiled with Vulkan by default. The second command downloads the model I like for coding from huggingface.co and starts the server with it. You can use it in a web browser like a commercial LLM on the web. The same web server also provides an API, so you can use it from e.g. gptel, an Emacs interface to LLMs. I set the temperature pretty low (less creativity and fantasizing) because that is what the model makers recommend for coding. For chit-chat you want to set it to 1.0.
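As a sketch of the API side: llama-server exposes an OpenAI-compatible endpoint at /v1/chat/completions, so a one-shot request from the shell looks like this (host, port, and the prompt are just the values from the command above; adjust to yours):

```shell
# one-shot chat completion against the server started above
curl -s "http://$(hostname):8080/v1/chat/completions" \
    -H 'Content-Type: application/json' \
    -d '{
          "messages": [
            {"role": "user", "content": "Write hello world in C."}
          ],
          "temperature": 0.6
        }'
```

The response comes back as JSON; the same endpoint is what clients like gptel talk to under the hood.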

This is a 28 GB model, which fits on my 32 GB NVidia graphics card. There are hundreds of models of all sizes available on huggingface; let me know what your hardware is and what you want to do, and I can make a recommendation.

FreeBSD/Vulkan runs this about 4% slower than Linux/Vulkan on the same hardware, which in turn is about 8% slower than Linux/CUDA. So far I have not succeeded in running CUDA through the Linuxulator.
 
Here is how to run the uncensored model that loveydovey was talking about.

This time with a direct download instead of using llama.cpp's internal cache, so you can store the model file anywhere you want.

Code:
# fetch writes to the current directory; run this in $HOME so the --model path below matches
fetch https://huggingface.co/HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive/resolve/main/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q6_K_P.gguf
llama-server \
        --host `hostname` \
        --port 8080 \
        --temp 0.6 \
        --ctx-size $((256 * 1024)) \
        --model ~/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q6_K_P.gguf

On the surface this is a very similar model of comparable on-disk size to the one in the first post, with the Qwen version bumped from 3.5 to 3.6. But it is actually quite different: it is a mixture-of-experts (MoE) model with a higher number of base parameters but a lower number of active parameters, whereas the first post used a dense model with all parameters active. It has also been uncensored, as the name implies.

The MoE model is less stable than the dense one; I get pretty noticeable variations in output for the same input. Overall I find the dense model more useful to me.

The MoE model is also more than 3x faster, though. I use most of that speed advantage to bump the context size up to 256k (which is very high for a local model).
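The --ctx-size values in both commands are ordinary shell arithmetic expansions; the token counts they work out to are:

```shell
# context sizes as passed to llama-server in the two posts above
echo $((64 * 1024))     # first post: 65536-token context
echo $((256 * 1024))    # this post: 262144-token context
```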
 