Technical aspects of running local LLMs on FreeBSD

cracauer@

This is a continuation of my local LLM thread, which is hard to reply to. This one might be easier.



I wanted to waffle a bit more about the hardware involved. You basically have two approaches:
  • Get a fat GPU with a certain amount of VRAM. You will be able to run LLMs up to the VRAM size fast. Models that overflow into system RAM (up to the machine's RAM size) will run slowly, but not necessarily catastrophically so if a good chunk still sits in VRAM.
  • Get a machine with integrated GPU that shares RAM with the GPU. This gets you much more memory for the GPU than you can afford with a dedicated GPU. It will run models up to RAM size at mediocre to OK speed.
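The overflow case from the first bullet is what llama.cpp's layer offloading handles: you choose how many transformer layers live in VRAM and the rest run on the CPU. A sketch (model path and layer count are placeholders, not a recommendation):

```shell
# llama.cpp: -ngl / --n-gpu-layers sets how many transformer layers are
# placed in VRAM; the remaining layers are computed on the CPU from RAM.
# The model path and the value 40 are illustrative placeholders.
llama-server --model /models/example-70b-Q4_K_M.gguf \
    --n-gpu-layers 40 \
    --host 127.0.0.1 --port 8080
```

Raising `--n-gpu-layers` until you run out of VRAM is the usual way to find the sweet spot on a partially-offloaded model.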

Cost-wise it is a wash. For $3300-$3500 you can pick between:
  • A used NVidia 5090 with 32 GB VRAM
  • 3x NVidia 3090 and a really fat power supply
  • An AMD Strix Halo (Ryzen AI 395) with 128 GB RAM (shared GPU memory)
  • An Apple Mac with a M4 Max and 128 GB RAM (shared GPU memory)
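The speed difference between these options comes down mostly to memory bandwidth: generating one token reads roughly the whole model from memory once. A back-of-the-envelope sketch (bandwidth figures are approximate spec-sheet numbers; the 20 GB model size is a placeholder):

```shell
# Rough generation speed: memory bandwidth / bytes read per token,
# where bytes per token is about the model file size.
# Bandwidths (GB/s, approximate): 5090, 3090, M4 Max, Strix Halo.
model_gb=20
for bw in 1792 936 546 256; do
    echo "$bw GB/s -> ~$(( bw / model_gb )) tok/s"
done
```

This ignores compute limits and prompt processing, but it explains why the same model is "very fast" on the dedicated card and "mediocre to OK" on shared-memory machines.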

The Apple Mac doesn't run all models well. Some models are marked as optimized for Metal (Apple's equivalent to Vulkan).

I have been told that Strix Halo prefers to run Mixture of Experts models and doesn't do so well on dense models. I don't have one of those suckers so I can't comment. There's also the matter of first getting FreeBSD to run on the thing at all (we have recently seen a failure) and then getting Vulkan up.

I have the single NVidia card, which runs out of the box for FreeBSD. But alas everything bigger than 32 GB is slow. There are many interesting models between 32 and 128 GB. But on the other hand, the models that do fit are very fast. You can also use the performance advantage to increase the context size (although that costs further VRAM). I also plan to do extensive experiments with post-training and agents. Speed is more important in that case than when just running a chat through the web browser.
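How much VRAM that extra context costs can be estimated: the KV cache grows linearly with context length. A rough sketch, with all model dimensions being hypothetical placeholders rather than any specific model's values:

```shell
# Rough KV-cache size:
#   2 (K and V) * layers * kv_heads * head_dim * bytes_per_element * context
# All dimensions below are hypothetical placeholders.
layers=32; kv_heads=8; head_dim=128; bytes=2; ctx=32768
kv_mib=$(( 2 * layers * kv_heads * head_dim * bytes * ctx / 1024 / 1024 ))
echo "KV cache: ${kv_mib} MiB"
```

With these numbers a 32k context eats about 4 GiB on top of the model weights, which is why context size matters so much on a 32 GB card.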

I do not know whether multi-GPU works for one of the llamas on FreeBSD. On paper 3x 3090 looks really attractive since it gives you 72 GB of VRAM for the same price. And functions as heating in winter. Just for starters it might be that multi-GPU in llama.cpp only works with CUDA, not Vulkan.
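For reference, llama.cpp's multi-GPU flags look like this, whether or not they work with Vulkan on FreeBSD (a sketch; the model path and the equal 1,1,1 weighting are illustrative):

```shell
# --split-mode layer distributes whole layers across devices;
# --tensor-split gives the relative share per GPU (three equal cards here).
llama-server --model /models/example-Q4_K_M.gguf \
    -ngl 99 \
    --split-mode layer \
    --tensor-split 1,1,1
```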

Theoretically the NVidia GPUs have another advantage: you can run all that GPU software that only has CUDA backends. At the time of this writing CUDA does not work on FreeBSD through Linuxulator, though.

Finally just a word on commercial LLMs: cost-wise it is clearly best to just use the $20/month plans. They are heavily subsidized. Buying your own hardware can't compete on price. But Anthropic might or might not have kicked Claude Code out of that plan.
 
First attempt on my Raptor Lake with UHD730 graphics that should work:
Code:
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Graphics (RPL-S) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
Illegal instruction (core dumped)
Do I need a proprietary Intel driver? Never seen it...
Next attempt can be a Ryzen 7 with GTX1050.
 

Not sure. What does `vulkaninfo --summary` from the vulkan-tools port report? Maybe you only have the Mesa software renderer?
 
Code:
==========
VULKANINFO
==========

Vulkan Instance Version: 1.4.336


Instance Extensions: count = 24
-------------------------------
VK_EXT_acquire_drm_display             : extension revision 1
VK_EXT_acquire_xlib_display            : extension revision 1
VK_EXT_debug_report                    : extension revision 10
VK_EXT_debug_utils                     : extension revision 2
VK_EXT_direct_mode_display             : extension revision 1
VK_EXT_display_surface_counter         : extension revision 1
VK_EXT_headless_surface                : extension revision 1
VK_EXT_surface_maintenance1            : extension revision 1
VK_EXT_swapchain_colorspace            : extension revision 4
VK_KHR_device_group_creation           : extension revision 1
VK_KHR_display                         : extension revision 23
VK_KHR_external_fence_capabilities     : extension revision 1
VK_KHR_external_memory_capabilities    : extension revision 1
VK_KHR_external_semaphore_capabilities : extension revision 1
VK_KHR_get_display_properties2         : extension revision 1
VK_KHR_get_physical_device_properties2 : extension revision 2
VK_KHR_get_surface_capabilities2       : extension revision 1
VK_KHR_portability_enumeration         : extension revision 1
VK_KHR_surface                         : extension revision 25
VK_KHR_surface_protected_capabilities  : extension revision 1
VK_KHR_wayland_surface                 : extension revision 6
VK_KHR_xcb_surface                     : extension revision 6
VK_KHR_xlib_surface                    : extension revision 6
VK_LUNARG_direct_driver_loading        : extension revision 1

Instance Layers:
----------------

Devices:
========
GPU0:
    apiVersion         = 1.3.278
    driverVersion      = 24.1.7
    vendorID           = 0x8086
    deviceID           = 0xa780
    deviceType         = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
    deviceName         = Intel(R) Graphics (RPL-S)
    driverID           = DRIVER_ID_INTEL_OPEN_SOURCE_MESA
    driverName         = Intel open-source Mesa driver
    driverInfo         = Mesa 24.1.7
    conformanceVersion = 1.3.6.0
    deviceUUID         = 868080a7-0400-0000-0002-000000000000
    driverUUID         = f303ef53-4163-bd95-7437-925e093696ce
GPU1:
    apiVersion         = 1.3.278
    driverVersion      = 0.0.1
    vendorID           = 0x10005
    deviceID           = 0x0000
    deviceType         = PHYSICAL_DEVICE_TYPE_CPU
    deviceName         = llvmpipe (LLVM 15.0.7, 256 bits)
    driverID           = DRIVER_ID_MESA_LLVMPIPE
    driverName         = llvmpipe
    driverInfo         = Mesa 24.1.7 (LLVM 15.0.7)
    conformanceVersion = 1.3.1.1
    deviceUUID         = 6d657361-3234-2e31-2e37-000000000000
    driverUUID         = 6c6c766d-7069-7065-5555-494400000000
 
You have the hardware renderer.

Did you install llama.cpp and ggml from packages? Maybe the binary was built on a host whose native CPU was detected during compilation, so it uses instructions your CPU lacks? Although Raptor Lake should not be missing much except AVX-512.
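One way to check where the binary came from and to rebuild it against the local CPU (the port origin misc/llama-cpp is an assumption here; verify it with pkg first):

```shell
# Which package installed the binary, and that package's port origin:
pkg which /usr/local/bin/llama-server
pkg info -o llama-cpp
# Rebuild from ports so any CPU-feature detection runs on this machine
# (origin misc/llama-cpp is assumed; check the output above):
cd /usr/ports/misc/llama-cpp && make reinstall clean
```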
 
That's possible. I'm not sure about the origin of everything.
It's also not a GENERIC 15 kernel; some things are changed. I don't think that can be related, but it may be a problem. The system was still copying, so I didn't want to reboot. I'll try a clean kernel and world in a few hours.
 
Qwen 3.6 is out as a dense model (that means that all parameters are active, as opposed to a Mixture of Experts model).

Here is how to find the file to download. Visit the model's page on Hugging Face:

Can't hurt to read some of the drivel on there about the care and feeding of this particular model.

Click on the tab "files and versions":

Pick a quantization and size that fits your computer. Copy the link to the gguf file.

fetch https://huggingface.co/unsloth/Qwen3.6-27B-GGUF/resolve/main/Qwen3.6-27B-Q6_K.gguf

Then start llama.cpp with --model <thatfileondisk>.
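A full invocation for the file fetched above might look like this (the context size and offload count are illustrative, not recommendations for this model):

```shell
# -c sets the context size; -ngl 99 offloads all layers if VRAM allows.
llama-server --model ./Qwen3.6-27B-Q6_K.gguf \
    -c 16384 \
    -ngl 99 \
    --host 127.0.0.1 --port 8080
# llama-server then serves a built-in chat web UI on that address.
```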

There is also an actual homepage for the model:

This particular homepage is not very good. It doesn't have recommendations for parameters like temperature, KV cache configuration and so on.
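When the model card gives no recommendations, sampling parameters can still be set by hand on llama.cpp's command line. The values below are generic starting points, not this model's official settings:

```shell
# Illustrative sampling settings; tune per model and taste.
llama-cli --model ./Qwen3.6-27B-Q6_K.gguf \
    --temp 0.7 --top-p 0.9 --top-k 40 \
    -p "Explain KV cache in one paragraph."
```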
 