Running a GPU-accelerated LLM on FreeBSD (2-line howto)

cracauer@

Code:
pkg install llama-cpp
llama-server \
        --host `hostname` \
        --port 8080 \
        --ctx-size $((64 * 1024)) \
        --temp 0.6 \
        --top-p 0.95 \
        --top-k 20 \
        -hf bartowski/Qwen_Qwen3.5-27B-GGUF:Q6_K_L

This assumes you already have a working graphics card driver with Vulkan support. You can test with `vulkaninfo --summary` from the vulkan-tools package. Working Vulkan is available in the binary NVIDIA drivers and in the AMD drivers (including CPU-integrated GPUs). Dunno about Intel GPUs.
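Copy-paste version of that check (run as root, or via sudo/doas):

```shell
# Install the Vulkan diagnostic tools and confirm a usable device shows up.
# If vulkaninfo lists no physical devices, fix the GPU driver first.
pkg install vulkan-tools
vulkaninfo --summary
```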

This installs the llama.cpp LLM runner, a server of intermediate difficulty. It provides a web server on port 8080 (or whatever port you give on the command line). The FreeBSD port and package are built with Vulkan support by default. The second command downloads the model I like for coding from huggingface.co and starts the server with it. You can use it in a web browser like a commercial LLM on the web. The web server also exposes an API, so you can use it from e.g. gptel, an Emacs interface to LLMs. I set the temperature pretty low (less creativity and fantasizing) because that is what the model makers recommend for coding. For chit-chat you want it higher, around 1.0.
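To use the API rather than the browser UI, you can talk to llama-server's OpenAI-compatible endpoint with curl. A minimal sketch, assuming the server is up on port 8080 as started above:

```shell
# Ask the local llama-server a question via its OpenAI-compatible
# chat endpoint. The hostname and port must match what llama-server
# was started with; the prompt text is just an example.
curl -s "http://$(hostname):8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "Write a C function that reverses a string."}
        ],
        "temperature": 0.6
      }'
```

The reply comes back as JSON; the generated text is in `choices[0].message.content`.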

This is a 28 GB model, which fits my 32 GB NVIDIA graphics card. There are hundreds of models of all sizes available on Hugging Face; let me know what your hardware is and what you want to do, and I can recommend one.
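If you have less VRAM, the usual move is to pick a smaller quantization of the same model from the same repository. A sketch, assuming the repo offers a Q4_K_M file (check the repo's file list on Hugging Face for the tags actually available):

```shell
# Same server, smaller quantization: roughly 4-bit weights instead of
# 6-bit, trading some quality for a much smaller VRAM footprint.
# The Q4_K_M tag is an assumption; verify it exists in the repo.
llama-server \
        --host `hostname` \
        --port 8080 \
        --ctx-size $((64 * 1024)) \
        -hf bartowski/Qwen_Qwen3.5-27B-GGUF:Q4_K_M
```

Shrinking `--ctx-size` also saves memory, since the KV cache for a long context can take several GB on top of the weights.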

FreeBSD/Vulkan runs this about 4% slower than Linux/Vulkan on the same hardware, which in turn is about 8% slower than Linux/CUDA. So far I have not succeeded in running CUDA through the Linuxulator.
 