Running a GPU-accelerated LLM on FreeBSD (2-line howto)

cracauer@

Code:
pkg install llama-cpp
llama-server \
        --host `hostname` \
        --port 8080 \
        --ctx-size $((64 * 1024)) \
        --temp 0.6 \
        --top-p 0.95 \
        --top-k 20 \
        -hf bartowski/Qwen_Qwen3.5-27B-GGUF:Q6_K_L

This is assuming you already have a working graphics card driver with Vulkan support. You can test with `vulkaninfo --summary` from the vulkan-tools package. Working Vulkan is available in the binary NVidia drivers and in the AMD drivers (including CPU-integrated GPUs). Dunno about Intel GPUs.
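The check from the paragraph above as commands (run the install as root):

```shell
# install the Vulkan diagnostic tools and confirm a GPU device shows up
pkg install vulkan-tools
vulkaninfo --summary
```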

This will install llama.cpp, an LLM runner of intermediate setup difficulty. It provides a web server on port 8080 (or whatever you give on the command line). The FreeBSD port and package are compiled with Vulkan by default. The second command downloads the model I like for coding from huggingface.co and starts the server with it. You can use it in a web browser like a commercial LLM on the web. The same web server also provides an API, so you can use it from e.g. gptel, an Emacs interface to LLMs. I set the temperature pretty low (less creativity and fantasizing) because that is what the model makers recommend for coding. For chit-chat you want to set it to 1.0.
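As a sketch of the API side: llama-server exposes an OpenAI-compatible endpoint at /v1/chat/completions, so a one-shot request from the shell looks like this (host, port, and the prompt are just the values from the command above; adjust to yours):

```shell
# one-shot chat completion against the server started above
curl -s "http://$(hostname):8080/v1/chat/completions" \
    -H 'Content-Type: application/json' \
    -d '{
          "messages": [
            {"role": "user", "content": "Write hello world in C."}
          ],
          "temperature": 0.6
        }'
```

The response comes back as JSON; the same endpoint is what clients like gptel talk to under the hood.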

This is a 28 GB model, which fits on my 32 GB NVidia graphics card. There are hundreds of models of all sizes available on huggingface; let me know what your hardware is and what you want to do, and I can make a recommendation.

FreeBSD/Vulkan runs this about 4% slower than Linux/Vulkan on the same hardware, which in turn is about 8% slower than Linux/CUDA. So far I have not succeeded in running CUDA through the Linuxulator.
 
Here is how to run the uncensored model that loveydovey was talking about.

This time with a direct download instead of using llama.cpp's internal cache, so you can store the model file anywhere you want.

Code:
# fetch writes to the current directory; run this in $HOME so the --model path below matches
fetch https://huggingface.co/HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive/resolve/main/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q6_K_P.gguf
llama-server \
        --host `hostname` \
        --port 8080 \
        --temp 0.6 \
        --ctx-size $((256 * 1024)) \
        --model ~/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q6_K_P.gguf

On the surface this is a very similar model of comparable on-disk size to the one in the first post, with the Qwen version bumped from 3.5 to 3.6. But it is actually quite different: it is a mixture-of-experts (MoE) model with a higher number of base parameters but a lower number of active parameters, whereas the first post used a dense model with all parameters active. It has also been uncensored, as the name implies.

The MoE model is less stable than the dense one; I get pretty noticeable variations in output for the same input. Overall I find the dense model more useful to me.

The MoE model is also more than 3x faster, though. I use most of that speed advantage to bump the context size up to 256k (which is very high for a local model).
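The --ctx-size values in both commands are ordinary shell arithmetic expansions; the token counts they work out to are:

```shell
# context sizes as passed to llama-server in the two posts above
echo $((64 * 1024))     # first post: 65536-token context
echo $((256 * 1024))    # this post: 262144-token context
```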
 