Local LLMs

Hi mates!

Has anyone tried to deploy an LLM (Large Language Model) server on a FreeBSD box? Please share your experience and how-tos.
 
I just came here to ask the same question and found your post. I have a local Manjaro desktop machine running Open WebUI with several LLMs installed, and it works really well. But I'd love to have it running on one of my webservers, which are all FreeBSD.

I could easily set up Open WebUI with one of my domains and connect from anywhere over HTTPS.
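
For what it's worth, a rough sketch of what that could look like on FreeBSD, assuming Open WebUI listens on 127.0.0.1:8080 and nginx terminates TLS in front of it (package name, paths and port are assumptions, not a tested recipe):

Code:
# hypothetical sketch: TLS-terminating reverse proxy in front of Open WebUI
doas pkg install nginx
# add an HTTPS server block to /usr/local/etc/nginx/nginx.conf that
# proxy_passes to http://127.0.0.1:8080 (the assumed Open WebUI port), then:
doas sysrc nginx_enable=YES
doas service nginx start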

I wonder what kind of hardware resources I'd need, though. My desktop has an NVIDIA RTX 3070, but I don't put GPUs in the webservers.
 
This is actually not that hard. You could use llama.cpp:

Running 14.0-RELEASE-p6 on a Pi 4:

- install gmake
- git clone https://github.com/ggerganov/llama.cpp
- cd llama.cpp; gmake # use -j <n_cores>
- get a model from Hugging Face whose RAM requirements match your machine; I used phi-2.Q4_K_M (see the download sketch after this list)
- place the model file into the models/ subdir of llama.cpp
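
For example, to fetch the quantized phi-2 model used here (the exact Hugging Face repo and file name are an assumption; pick any GGUF whose size fits your RAM):

Code:
# hypothetical download; phi-2.Q4_K_M.gguf is roughly 1.8 GB
fetch -o models/phi-2.Q4_K_M.gguf \
    https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf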

Save this shell script (e.g. as run-phi2.sh inside the llama.cpp directory) and use it to launch the model:

Bash:
#!/usr/local/bin/bash
# Wrap the arguments in phi-2's instruct prompt format; -e below expands the \n escapes
PROMPT="Instruct: $*\nOutput:\n"
./main -m models/phi-2.Q4_K_M.gguf --color --temp 0.7 --repeat_penalty 1.1 -n -1 -p "$PROMPT" -e

Example:

Code:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp/
doas pkg install gmake
gmake -j4
mv ~/phi-2.Q4_K_M.gguf models/
./run-phi2.sh "Tell me something about FreeBSD"

It's not very fast here, but it works:
Code:
... initialization output omitted...

Instruct: Tell me something about FreeBSD
Output:
- FreeBSD is an open source, distributed operating system for Unix-like devices.
- It was created in 1995 and is known for its stability, security, and scalability.
- It is used in a variety of settings, from small enterprises to large organizations.
- It has a number of different distributions, each tailored for different tasks and needs.
- It allows for the customization of the operating system, allowing users to modify and improve it.
- It features a strong password policy and advanced security measures.
<|endoftext|> [end of text]


llama_print_timings:        load time =    1187.23 ms
llama_print_timings:      sample time =     121.36 ms /   108 runs   (    1.12 ms per token,   889.94 tokens per second)
llama_print_timings: prompt eval time =    3147.98 ms /    11 tokens (  286.18 ms per token,     3.49 tokens per second)
llama_print_timings:        eval time =   54504.98 ms /   107 runs   (  509.39 ms per token,     1.96 tokens per second)
llama_print_timings:       total time =   57837.63 ms /   118 tokens
Log end

 
Beautiful, and there is actually already a port/package for that, so there's no need to compile it yourself:
misc/llama-cpp
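
An untested sketch of the package route (binary names depend on which llama.cpp version got packaged; newer releases ship llama-cli instead of main):

Code:
doas pkg install llama-cpp
# point it at a downloaded GGUF model, e.g.:
llama-cli -m ~/models/phi-2.Q4_K_M.gguf --temp 0.7 -e \
    -p "Instruct: Tell me something about FreeBSD\nOutput:\n"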

Thanks a lot, I didn't know about llama-cpp. I'll try it myself as soon as possible.
 
I was not aware that there is a port/package, but llama.cpp gets updated so frequently (sometimes multiple times per day) that it can make sense to pull directly from the repo.
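
Roughly, staying current is just a pull and rebuild of the tree checked out earlier (assuming it lives in ~/llama.cpp):

Code:
cd ~/llama.cpp
git pull
gmake -j4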

It also includes a server for using the LLM via a REST API, and more. Be sure to check out the README in the git repo.
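
A hedged example, assuming the ./server binary built above (newer versions call it llama-server) and its /completion endpoint:

Code:
# start the bundled HTTP server on localhost
./server -m models/phi-2.Q4_K_M.gguf --host 127.0.0.1 --port 8080

# then query it from another shell
curl -s http://127.0.0.1:8080/completion \
    -H 'Content-Type: application/json' \
    -d '{"prompt": "Instruct: Tell me something about FreeBSD\nOutput:\n", "n_predict": 128}'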
 