cracauer@
Developer
This is a continuation on my local LLM thread, which is hard to reply to. This one might be easier.
Code:
pkg install llama-cpp
llama-server \
--host `hostname` \
--port 8080 \
--ctx-size $((64 * 1024)) \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
-hf bartowski/Qwen_Qwen3.5-27B-GGUF:Q6_K_L
This is assuming you already have a working graphics card driver with Vulkan. You can test with `vulkaninfo --summary` from pkg vulkan-tools. Working Vulkan is in the binary NVidia drivers and the AMD drivers (including CPU-integrated GPU). Dunno about Intel GPUs.
This will install the llama.cpp LLM runner, which is of intermediate difficulty...
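Once the server above is up, you can talk to it with plain curl: llama.cpp's server exposes a `/health` endpoint and an OpenAI-compatible `/v1/chat/completions` endpoint. A minimal smoke test, assuming llama-server is listening on `hostname`:8080 as configured above:

```shell
# Assumes the llama-server from above is already running on $(hostname):8080.
HOST="$(hostname)"

# Health check: returns a small JSON status object when the model is loaded.
curl -s "http://${HOST}:8080/health"

# Minimal chat request against the OpenAI-compatible endpoint.
PAYLOAD='{"messages":[{"role":"user","content":"Say hello in one short sentence."}]}'
curl -s "http://${HOST}:8080/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "${PAYLOAD}"
```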
- cracauer@
- Replies: 1
- Forum: Howtos and FAQs (Moderated)
I wanted to waffle a bit more about the hardware involved. You basically have two approaches:
- Get a fat GPU with a certain amount of VRAM. You will be able to run LLMs up to the VRAM size fast. Models that overflow into system RAM will run slowly, but not catastrophically so as long as a good chunk still fits in VRAM.
- Get a machine with integrated GPU that shares RAM with the GPU. This gets you much more memory for the GPU than you can afford with a dedicated GPU. It will run models up to RAM size at mediocre to OK speed.
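The speed gap between the two approaches mostly comes down to memory bandwidth: generating one token on a dense model streams roughly the whole model from memory once, so tokens/s is capped at bandwidth divided by model size. A back-of-the-envelope sketch (the bandwidth figures are rough public specs and the 18 GB model size is a made-up example):

```shell
# Rough decode-speed ceiling: tokens/s ~= memory bandwidth / model size,
# since each generated token reads roughly the whole model from memory.
# Bandwidth figures are approximate public specs, for illustration only.
bw_5090=1792    # GB/s, RTX 5090 GDDR7 (approx.)
bw_strix=256    # GB/s, Strix Halo LPDDR5X (approx.)
model_gb=18     # hypothetical ~27B model quantized down to ~18 GB

echo "5090:       ~$((bw_5090 / model_gb)) tok/s ceiling"
echo "Strix Halo: ~$((bw_strix / model_gb)) tok/s ceiling"
```

Real throughput lands below these ceilings, but the ratio between the two machines is about right, and it is why the dedicated GPU feels an order of magnitude faster on anything that fits.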
Cost-wise it is a wash. For $3300-$3500 you can pick between:
- A used NVidia 5090 with 32 GB VRAM
- 3x NVidia 3090 and a really fat power supply
- An AMD Strix Halo (Ryzen AI 395) with 128 GB RAM (shared GPU memory)
- An Apple Mac with a M4 Max and 128 GB RAM (shared GPU memory)
The Apple Mac doesn't run all models well. Some models are marked as optimized for Metal (Apple's equivalent to Vulkan).
I have been told that Strix Halo prefers to run Mixture of Experts models and doesn't do so well on dense models. I don't have one of those suckers so I can't comment. There's also the aspect of getting FreeBSD to run on the thing in the first place (we have recently seen a failure) and getting Vulkan up.
I have the single NVidia card, which runs out of the box on FreeBSD. But alas, any model bigger than 32 GB runs slowly, and there are many interesting models between 32 and 128 GB. On the other hand, the models that do fit are very fast. You can also spend that performance advantage on a larger context size (although that costs further VRAM). I also plan to do extensive experiments with post-training and agents; speed matters more there than when just running a chat through the web browser.
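To put a number on "context costs further VRAM": the KV cache of a dense transformer is roughly 2 (K and V) × layers × KV heads × head dim × bytes per element × context length. For a hypothetical model shape (48 layers, 8 KV heads, head dim 128, f16 cache — all made-up illustrative values) at the 64K context from the command above:

```shell
# KV-cache size estimate for a hypothetical dense model at 64K context.
# All model dimensions below are illustrative, not any specific model.
ctx=$((64 * 1024))
layers=48; kv_heads=8; head_dim=128; bytes=2   # f16 cache entries

# Factor of 2 for the separate K and V tensors.
kv_bytes=$((2 * layers * kv_heads * head_dim * bytes * ctx))
echo "KV cache: $((kv_bytes / 1024 / 1024)) MiB"
```

That works out to about 12 GiB on top of the model weights, which is why doubling `--ctx-size` is not free.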
I do not know whether multi-GPU works for one of the llamas on FreeBSD. On paper 3x 3090 looks really attractive since it gives you 72 GB of VRAM for the same price. And functions as heating in winter. Just for starters it might be that multi-GPU in llama.cpp only works with CUDA, not Vulkan.
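For what it's worth, llama.cpp does grow the knobs one would experiment with here, namely `--split-mode` and `--tensor-split`; whether the Vulkan backend honors them on FreeBSD is exactly the open question. A hypothetical 3x-3090 invocation (the model name is a placeholder, and this is untested on FreeBSD):

```shell
# Hypothetical multi-GPU invocation; --split-mode / --tensor-split are
# llama.cpp's multi-GPU flags. Untested on FreeBSD/Vulkan; placeholder model.
llama-server \
    --host "$(hostname)" \
    --port 8080 \
    --split-mode layer \
    --tensor-split 1,1,1 \
    -hf bartowski/SomeModel-GGUF:Q6_K_L
```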
Theoretically the NVidia GPUs have another advantage: you can run all that GPU software that only has CUDA backends. At the time of this writing CUDA does not work on FreeBSD through Linuxulator, though.
Finally, just a word on commercial LLMs: cost-wise it is clearly best to get as much as you can out of the $20/month plans. They are heavily subsidized, and buying your own hardware can't compete on price. But Anthropic might or might not have kicked Claude Code out of that plan:
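To make "can't compete price-wise" concrete (ignoring electricity, and ignoring that the hosted models are far larger than anything that fits locally):

```shell
# Naive break-even of $3500 of hardware against a $20/month plan.
hw=3500; plan=20
months=$((hw / plan))
echo "${months} months (~$((months / 12)) years) to break even"
```

That's 175 months, roughly 14 years, before the hardware pays for itself against a single subscription.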
[UPDATED] News: Anthropic (Briefly) Removes Claude Code From $20-A-Month "Pro" Subscription Plan For New Users
Executive Summary: * In the later afternoon of April 21 2026, Anthropic removed access to Claude Code for its $20-a-month "Pro" Plans on various pricing pages. * Current Pro users appeared to still have access via the Claude web app. * Claude Code support documents, for a brief period...
www.wheresyoured.at

