TLDR for using llama.cpp to run models locally.
Run a browser-based AI chat
To chat with a model in your web browser (like ChatGPT, duck.ai, etc.):
llama-server -hf ggml-org/Qwen3-4B-GGUF
Then visit http://127.0.0.1:8080
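llama-server also exposes an OpenAI-compatible HTTP API on the same port, so you can script against it instead of using the web UI. A minimal sketch with curl, assuming the default port and leaving all sampling options at their defaults:

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "user", "content": "Say hello in one sentence." }
    ]
  }'

The response comes back as a standard chat-completion JSON object.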
Recognize images
Use a model with vision support:
llama-server -hf ggml-org/gemma-3-12b-it-GGUF
TIP: once you've downloaded the model the first time, pass --offline to llama-server to skip the Hugging Face network check.
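With a vision model loaded, images can also be sent through the same OpenAI-compatible endpoint (or attached directly in the web UI). A rough sketch with curl and a local photo.jpg: the file name, the prompt, and the default port are assumptions, and it requires a server build with multimodal support.

IMG_B64=$(base64 -w0 photo.jpg)   # base64-encode the image (GNU coreutils syntax)
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Describe this image." },
        { "type": "image_url", "image_url": { "url": "data:image/jpeg;base64,${IMG_B64}" } }
      ]
    }
  ]
}
EOF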
Run an interactive CLI chat
Same thing, but in your terminal:
llama-cli -hf ggml-org/Qwen3-4B-GGUF
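llama-cli can also do a one-shot, non-interactive generation, which is handy for scripting. A sketch, assuming a recent build (-no-cnv disables the interactive chat loop, -n caps the number of generated tokens, and the prompt is a placeholder; check llama-cli -h if your build differs):

llama-cli -hf ggml-org/Qwen3-4B-GGUF -p "Write a haiku about llamas." -n 128 -no-cnv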
Find models
Find more officially supported models at https://huggingface.co/ggml-org
In general, though, you can find plenty of models already published in GGUF format.
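If you already have a .gguf file on disk, you can point any of the tools at it with -m instead of downloading via -hf (the path below is a placeholder):

llama-server -m ./models/Qwen3-4B-Q4_K_M.gguf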
Build from source
For Ubuntu 24:
sudo apt update -y
sudo apt install -y build-essential git ccache cmake libopenblas-dev pkg-config libcurl4-openssl-dev
git clone --depth 1 https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build --config Release
Build artifacts are in build/bin:
./build/bin/llama-server -h
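If you have an NVIDIA GPU and the CUDA toolkit installed, you can build the CUDA backend instead of OpenBLAS (a sketch; other backends such as Metal, Vulkan, and HIP have their own flags, so check the llama.cpp build docs):

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)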