Local LLMs

Hi mates!

Has anyone tried to deploy an LLM (Large Language Model) server on a FreeBSD box? Share your experience and how-tos, please.
 
I just came here to ask the same question and I found your post. I have a local Manjaro desktop machine running Open WebUI with several LLMs installed and it works really well. But I'd love to have it running on one of my webservers, which are all FreeBSD.

I could easily set up Open WebUI with one of my domains and connect from anywhere over HTTPS.

I wonder what kind of hardware resources I'd need though. My desktop has an NVIDIA 3070 but I don't put GPUs in the webservers.
 
This is actually not that hard. You could use llama.cpp:

Running 14.0-RELEASE-p6 on a pi4:

- install gmake
- git clone https://github.com/ggerganov/llama.cpp
- cd llama.cpp; gmake  # use -j <number of cores>
- get a model from Hugging Face whose RAM requirements match your machine; I used phi-2.Q4_K_M
- place the model file in the models/ subdirectory of llama.cpp

Use this shell script (saved as run-phi2.sh and marked executable, as in the example below) to launch it:

Bash:
#!/usr/local/bin/bash
PROMPT="Instruct: $@\nOutput:\n"
./main -m models/phi-2.Q4_K_M.gguf --color --temp 0.7 --repeat_penalty 1.1 -n -1 -p "$PROMPT" -e

Example:

Code:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp/
doas pkg install gmake
gmake -j4
mv ~/phi-2.Q4_K_M.gguf models/
./run-phi2.sh "Tell me something about FreeBSD"

It's not very fast here but works:
Code:
... initialization output omitted...

Instruct: Tell me something about FreeBSD
Output:
- FreeBSD is an open source, distributed operating system for Unix-like devices.
- It was created in 1995 and is known for its stability, security, and scalability.
- It is used in a variety of settings, from small enterprises to large organizations.
- It has a number of different distributions, each tailored for different tasks and needs.
- It allows for the customization of the operating system, allowing users to modify and improve it.
- It features a strong password policy and advanced security measures.
<|endoftext|> [end of text]


llama_print_timings:        load time =    1187.23 ms
llama_print_timings:      sample time =     121.36 ms /   108 runs   (    1.12 ms per token,   889.94 tokens per second)
llama_print_timings: prompt eval time =    3147.98 ms /    11 tokens (  286.18 ms per token,     3.49 tokens per second)
llama_print_timings:        eval time =   54504.98 ms /   107 runs   (  509.39 ms per token,     1.96 tokens per second)
llama_print_timings:       total time =   57837.63 ms /   118 tokens
Log end

 
Beautiful, and there actually already is a port/package for that, so no need to compile it yourself:
misc/llama-cpp

Thanks a lot, didn't know about llama-cpp. Will try it as soon as possible myself.
 
I was not aware that there is a port/package, but actually llama.cpp gets updated so frequently (sometimes multiple times per day) that it could make sense to pull from the repo.
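For example, tracking upstream is just a pull and a rebuild (assuming the source checkout from the steps above):

Code:
cd llama.cpp && git pull && gmake -j4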

It also includes a server so you can use the LLM via a REST API, and more. Be sure to check out the README in the git repo.
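As a rough sketch (the binary is named server in older trees and llama-server in newer ones, so adjust accordingly), starting the HTTP server and querying its /completion endpoint looks something like this:

Code:
# start the built-in HTTP server on localhost (binary name varies by version)
./server -m models/phi-2.Q4_K_M.gguf --host 127.0.0.1 --port 8080 &
# then query the REST API, e.g. the /completion endpoint:
curl http://127.0.0.1:8080/completion -H 'Content-Type: application/json' \
  -d '{"prompt": "Instruct: Tell me something about FreeBSD\nOutput:\n", "n_predict": 128}'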
 
Now there is misc/ollama too. After watching a video from the latest Valuable News post, I got a bit interested. And now I have questions:
  • Will AMD graphics cards work on FreeBSD (I see that ollama recently got support for AMD)?
  • How much memory should the graphics card have? Will 12 GB be enough, or should I put in more money and get one with 16 GB? (As I understand it, the amount of memory limits how much of the LLM can be held in memory at once.)
  • Is the CPU in the machine important too (if I run the LLM on the graphics card), or can I run this on an old machine with, say, a cheap AMD CPU and 16 or 32 GB of RAM?
 
I have no experience with using graphics cards besides the GPU in my M1 MacBook Pro.

But a rough way to estimate the memory requirements is to take the parameter count of the model and factor in the quantization.

Usually llama.cpp (and likely Ollama) uses quantized models. For an 8-bit quantization, that's roughly the number of parameters of the model in bytes, plus some overhead.

For 4-bit quantizations, divide the parameter count by two, and so on.

If you use llama.cpp, it outputs the actual size of the model and the memory used. For example, llama-3-8B-instruct uses:

Code:
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 7.95 GiB (8.50 BPW)
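As a worked example of the rule of thumb above (parameters × bits-per-weight / 8, plus some overhead for the context/KV cache):

Code:
# 8.03e9 params at 8.50 bits per weight:
echo "8.03 * 8.5 / 8" | bc -l    # ≈ 8.53 GB (decimal), i.e. ~7.95 GiB -- matches the output above
# the same model at a roughly 4.5 bits-per-weight 4-bit quant would need about:
echo "8.03 * 4.5 / 8" | bc -l    # ≈ 4.5 GB, plus KV-cache/context overhead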

Whether you need to go with 12 GB vs. 16 GB depends on the availability of models in that range.

If there are no models available that benefit from the extra 4 GB of memory, it would not make sense.

Hope this helps a bit.
 
Very interesting. My FreeBSD desktop has an Nvidia 2080 Super with 8 GB of VRAM. It's not such a strong video card for LLMs, but I am interested in trying it out at least.

I know next to nothing about running a local LLM. Experimenting would pay off, of course, and it would be interesting for me to run a local model to help me create code templates and influence it with models I find useful.

If I ever try this out, I will post an update.
 
Another GPU question, just in case someone here knows the answer:
- some graphics cards are labeled "LHR" (low hash rate), which I understand makes them unsuitable for crypto mining. Now, does the "LHR" make them unsuitable for running LLMs too?
 
I do this stuff for work and pleasure, so here's some insight.

1. Yes, it's possible on FreeBSD with CUDA using a specific Nvidia driver and select generations of their GPUs (reasons of firmware and hardware design), but it's not a simple 1-2-3 set of commands, at least not as of right now.

2. CUDA support is not official; rather, it is in PoC stages of development, and the Linuxulator is used in the integration stage, so that's... fun.

3. CUDA support is essential for well-performing LLMs at present, and only Nvidia cards have CUDA support. It's an industry problem and it's not being resolved any time soon, nor is it a FreeBSD-vs-whatever problem. A longer conversation is possible, but it tends to devolve into pissing matches between fanboy groups.

4. What about non-Nvidia? There's a WIP called ZLUDA which endeavors to facilitate CUDA translation to AMD and Intel GPUs; it was recently funded and is in its third iteration: https://github.com/vosen/ZLUDA

5. What about CPUs? Yes, but prepare to have fun with a nearly unusable experience. For the same reason that BTC went from CPU -> GPU -> ASIC, the majority of AI/ML workflows require GPU compute for reasons of hardware optimization and core design. That said, the CPU model and its capabilities are not to be ignored on a compute host that has the appropriate GPUs; e.g., modern CPUs have extensions which accelerate certain operations (AVX-512, BF16, AMX, etc.), and those are critical for performance on dataset operations irrespective of the GPUs being used (a quick way to check what a FreeBSD host advertises is sketched after the links below). So a modern Xeon Scalable generation CPU of similar core/clock will immediately be more useful than even the E5-2600 v4 series at their best, without touching on improvements like L1+L2 cache sizes or changes to NUMA, QPI, P+E cores, et cetera.

6. Choosing hardware: should GPUs with more RAM or more CUDA cores be the priority for the spec? Here's an example: I have several A4000 16GB and several T40 24GB, and while one may think the 24GB SKUs would be immediately better (they're also double slot and require 2x the TDP), well, it's not that simple. Some models require more RAM to be paged in/out to/from the CPU and NVMe and RDMA/Infiniband, but then we get into things like PCIe bandwidth contention and non-blocking blah blah NVLink etc. Many times over, the A4000 will have higher performance in parallel for things like Ollama-based models, but slower operations on image-based models like Stable Diffusion, which love to hold larger working data in VRAM; the 24GB cards benefit from that, as opposed to benefiting from the higher CUDA core count on the A4000. So... "it depends", and that's a common answer for a lot of hardware architecture concerns; there is no single best answer, and not everything needs the A100 or GH series just because they're more of X/Y/Z/whatever.

Some fun!
- https://github.com/intel/intel-extension-for-pytorch
- https://bigscience.huggingface.co/blog/which-hardware-to-train-a-176b-parameters-model
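On the CPU-extension point (5) above: a quick, hypothetical way to check which of those flags a FreeBSD host advertises; llama.cpp also prints a system_info line at startup listing the instruction sets it was built with.

Code:
# look for AVX/AVX2/AVX-512/AMX flags in the boot messages
grep -iE 'avx|amx' /var/run/dmesg.boot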
 
Yes, it's possible on FreeBSD with CUDA using a specific Nvidia driver and select generations of their GPUs (reasons of firmware and hardware design), but it's not a simple 1-2-3 set of commands, at least not as of right now.
Aw, such a shame that it doesn't just work on FreeBSD yet, but it is understandable, as FreeBSD focuses on other aspects of computing. I guess it's Linux for now.

As time goes on, I am getting slightly more interested in running a local LLM and training it to help me write some sh scripts for my own FreeBSD usage. I would really appreciate it if you could recommend some links or resources you found most helpful on how to set up an LLM and sort of make it "focus" only on code, and not other speech or image generation.

I found it a bit hard to find useful information online; it seems searching for things about LLMs returns results that were made by other AIs :)
 
Another GPU question, just in case someone here knows the answer:
- some graphics cards are labeled "LHR" (low hash rate), which I understand makes them unsuitable for crypto mining. Now, does the "LHR" make them unsuitable for running LLMs too?
No, you can use an LHR card without problems for AI/ML.

AMD support in FreeBSD is only for the Vulkan and OpenCL (CLBlast) backends; ROCm isn't supported.
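For the record, here is a rough sketch of building llama.cpp with the Vulkan backend on FreeBSD. The cmake flag has changed names across versions (GGML_VULKAN in recent trees, LLAMA_VULKAN in older ones) and the package names below are assumptions, so check your ports tree:

Code:
# dependencies (names are assumptions; adjust to what pkg search shows)
pkg install cmake vulkan-loader vulkan-headers glslang
cmake -B build -DGGML_VULKAN=ON
cmake --build build -j 4
# -ngl 99 requests offloading all layers to the GPU
./build/bin/llama-cli -m models/phi-2.Q4_K_M.gguf -p "hello" -ngl 99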
 
As time goes on, I am getting slightly more interested in running a local LLM and training it to help me write some sh scripts for my own FreeBSD usage. I would really appreciate it if you could recommend some links or resources you found most helpful on how to set up an LLM and sort of make it "focus" only on code, and not other speech or image generation.

From my limited experience I would say: forget about LLM training. I once followed Andrej Karpathy's YouTube videos on implementing GPT-2 from scratch.

This works and you can actually train it, but training takes a long time, even on my MBP M1 Max where it uses the GPU. As the model is also quite old, the inference results are funny but not really usable. There must be a reason companies use data centers full of GPUs to train usable models.

For inference, you can actually run smaller models on the CPU; I once ran phi-3 on a Raspberry Pi. It was usable.
 
mstiller: Creating your own LLMs which perform like ChatGPT 3/4 and above is for sure not doable when you factor in the computing power and data required.

What's doable, though, on available LLM models is using LoRA to fine-tune them.

 
Yes, it's possible on FreeBSD with CUDA using a specific Nvidia driver and select generations of their GPUs (reasons of firmware and hardware design), but it's not a simple 1-2-3 set of commands, at least not as of right now.
Is there a how-to?
I'm using LLMs, but running on Linux and using Oatmeal to connect remotely to them; if I could use them on FreeBSD with a GPU, it would be great.
 
Well, ollama (the package) works - sort of.
Code:
tingo@locaal:~ $ ollama list
NAME              ID              SIZE      MODIFIED          
llama3.2:1b       baf6a787fdff    1.3 GB    About an hour ago    
mistral:latest    f974a74358d6    4.1 GB    About an hour ago    
tingo@locaal:~ $ ollama run mistral
Error: Post "http://127.0.0.1:11434/api/chat": EOF
Just try again:
Code:
tingo@locaal:~ $ ollama run mistral
>>> what is FreeBSD
 FreeBSD is a free and open-source Unix-like operating system based on BSD (Berkeley Software Distribution) versions of the Unix source code. It is known for its 
high-performance, stability, security, and compatibility with a wide range of software and hardware platforms. FreeBSD was initially developed at the University of 
California, Berkeley, as an extension of Research Unix, which served as the basis for many commercial Unix systems in the 1970s and 1980s. Since becoming open-source in 
1993, FreeBSD has grown into a popular choice among server administrators, developers, and hobbyists who value its flexibility, reliability, and customization options. It 
provides a variety of features such as a modular kernel, support for the ZFS file system, built-in virtualization with jails, and the Ports Collection, which allows users to 
easily install thousands of third-party applications. FreeBSD is licensed under the BSD License, which permits unrestricted use, modification, and distribution of the source 
code as long as copyright notices are preserved and modifications are clearly marked.

>>>
but it fails with that Post error quite often. This is on
Code:
root@locaal:~ # freebsd-version -ku
13.4-RELEASE-p1
13.4-RELEASE-p2
and ollama installed with pkg
Code:
root@locaal:~ # pkg -vv | grep url
    url             : "pkg+http://pkg.FreeBSD.org/FreeBSD:13:amd64/quarterly",
root@locaal:~ # pkg info ollama\*
ollama-0.3.6_1
interesting stuff.
 
Any experience with Llamafile here?

I'd like to use a local LLM to get some help creating XSLT stylesheets based on an XLS/CSV input and XML output; I have a converter that does all the hard work, so I would only need to create the stylesheet.

I hope this won't require buying a dedicated computer just for it...

Thanks... 🙏
 
I have tried both Llamafile and Ollama...

Ollama has better graphics card support; for example, in my case Llamafile does not use my GPUs, so it's super slow in tokens per second...

With Ollama, it uses 3 of my GPUs, and you can monitor which one is being used with
Code:
nvidia-smi --loop=1
by watching the RAM usage (I didn't look into the details, but apparently Hugging Face Transformers built with PyTorch does the balancing of the pipeline workload between the cards)... My next setup step is wiring up Open WebUI so I can have a local GUI... Currently I use Ollama directly in the CLI...

With Open WebUI you can pipe RAG documentation/libraries directly into the GUI.

There is a bug installing it on FreeBSD, but there is always Linux under bhyve for such obstacles.
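For reference, a hypothetical sketch of that setup: run Open WebUI inside the Linux guest and point it at the Ollama API on the FreeBSD host (set OLLAMA_HOST=0.0.0.0 on the host so Ollama listens beyond loopback; the 10.0.0.1 address and volume name below are placeholders):

Code:
# inside the Linux guest:
docker run -d -p 3000:8080 \
  -e OLLAMA_BASE_URL=http://10.0.0.1:11434 \
  -v open-webui:/app/backend/data \
  --name open-webui ghcr.io/open-webui/open-webui:main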


tingo: the Post error means your cards or CPU are not enough to run the model. Run a smaller model. If the model is 4 GB, you need at MINIMUM one card with that much RAM; if not, you're going to face issues with the Post error. How do I know? Because when I try running Mistral-Nemo 12B or the Qwen2.5 32B parameter model, the same issue happens to me... But with smaller models like mistral:latest I never face this issue, and the local LLM can maintain context for longer, like 24+ hours (depending on the number of prompts)...

Code:
root@Secure_Ollama:/ # ollama run context-mistral-nemo:latest
Error: Post "http://127.0.0.1:11434/api/chat": EOF
root@Secure_Ollama:/ # ollama run context-mistral-nemo:latest

👆👆
I run this in a FreeBSD jail and pass the GPUs to it through devfs.rules.
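For anyone curious, a hypothetical sketch of what that can look like (the ruleset number 30 and the ruleset name are placeholders; adjust to your own devfs.rules layout):

Code:
# append a ruleset to /etc/devfs.rules that unhides the NVIDIA device nodes
cat >> /etc/devfs.rules <<'EOF'
[devfsrules_jail_nvidia=30]
add include $devfsrules_hide_all
add include $devfsrules_unhide_basic
add include $devfsrules_unhide_login
add path 'nvidia*' unhide
add path nvidiactl unhide
add path nvidia-uvm unhide
EOF
# then reference it from the jail's definition in /etc/jail.conf:
#   devfs_ruleset = 30;
service devfs restart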



You can also look at the Ollama log and you'll see why the Post error happens. In my case, even with over 32 GB of RAM, I face it constantly when running big-parameter models on my hardware (now, if I had an H100 or A100 with 40 or 80 GB of RAM, I doubt the Post error would happen). You lose the context and you will need to start the model again when the EOF error happens.


Code:
time=2024-12-11T23:14:28.040Z level=WARN source=sched.go:642 msg="gpu VRAM usage didn't recover within timeout" seconds=8.543873626 model=/root/.ollama/models/blobs/sha256-b559938ab7a0392fc9ea9675b82280f2a15669ec3e0e0fc491c9cb0a7681cf94
time=2024-12-11T23:14:28.166Z level=WARN source=sched.go:642 msg="gpu VRAM usage didn't recover within timeout" seconds=8.669209817 model=/root/.ollama/models/blobs/sha256-b559938ab7a0392fc9ea9675b82280f2a15669ec3e0e0fc491c9cb0a7681cf94
time=2024-12-11T23:14:28.446Z level=WARN source=sched.go:642 msg="gpu VRAM usage didn't recover within timeout" seconds=8.949622423 model=/root/.ollama/models/blobs/sha256-b559938ab7a0392fc9ea9675b82280f2a15669ec3e0e0fc491c9cb0a7681cf94

I also edited the context (to 15,000 tokens); the default is set to 2048 tokens, and some models allow you to go up to 100,000 tokens of context...
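As a sketch, a custom-context model like the context-mistral-nemo above can be built from a small Modelfile (num_ctx is the Ollama parameter for context length; the FROM line here is an assumption about which base model was used):

Code:
cat > Modelfile <<'EOF'
FROM mistral-nemo
PARAMETER num_ctx 15000
EOF
ollama create context-mistral-nemo -f Modelfile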

https://llm.extractum.io/list/ <== for model spec info.

With Ollama you can limit the number of layers offloaded, so the GPUs can run bigger models... When you do this, Ollama runs some of the layers on the CPU, resulting in fewer tokens per second, BUT at least you can run a bigger multi-billion-parameter model at the expense of speed.
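A hypothetical example of that via the API: the num_gpu option caps how many layers are offloaded to the GPU(s), with the remainder running on the CPU (the layer count here is arbitrary):

Code:
curl http://127.0.0.1:11434/api/generate -d '{
  "model": "mistral-nemo",
  "prompt": "What is FreeBSD?",
  "options": { "num_gpu": 20 }
}'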

:beer:

stay curious....
 