TLDR for using llama.cpp to run models locally.
Run a browser-based AI chat
To chat with a model in your web browser (like ChatGPT, duck.ai, etc.):
llama-server -hf ggml-org/Qwen3-4B-GGUF
Then visit http://127.0.0.1:8080
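llama-server also exposes an OpenAI-compatible HTTP API on the same port, so you can script against it instead of using the web UI. A minimal sketch with curl, assuming the default port and leaving all sampling options at their defaults:

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "user", "content": "Say hello in one sentence." }
    ]
  }'

The response comes back as a standard chat-completion JSON object.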
Recognize images
Use a model with vision support:
llama-server -hf ggml-org/gemma-3-12b-it-GGUF
TIP: once you've downloaded the model the first time, pass --offline to llama-server to skip the Hugging Face network check.
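With a vision model loaded, images can also be sent through the same OpenAI-compatible endpoint (or attached directly in the web UI). A rough sketch with curl and a local photo.jpg: the file name, the prompt, and the default port are assumptions, and it requires a server build with multimodal support.

IMG_B64=$(base64 -w0 photo.jpg)   # base64-encode the image (GNU coreutils syntax)
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Describe this image." },
        { "type": "image_url", "image_url": { "url": "data:image/jpeg;base64,${IMG_B64}" } }
      ]
    }
  ]
}
EOF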
Run an interactive CLI chat
Same thing, but in your terminal:
llama-cli -hf ggml-org/Qwen3-4B-GGUF
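llama-cli can also do a one-shot, non-interactive generation, which is handy for scripting. A sketch, assuming a recent build (-no-cnv disables the interactive chat loop, -n caps the number of generated tokens, and the prompt is a placeholder; check llama-cli -h if your build differs):

llama-cli -hf ggml-org/Qwen3-4B-GGUF -p "Write a haiku about llamas." -n 128 -no-cnv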
Find models
Find more officially supported models at https://huggingface.co/ggml-org
In general, though, you can find plenty of models already published in GGUF format.
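If you already have a .gguf file on disk, you can point any of the tools at it with -m instead of downloading via -hf (the path below is a placeholder):

llama-server -m ./models/Qwen3-4B-Q4_K_M.gguf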
Build from source
For Ubuntu 24:
sudo apt update -y
sudo apt install -y build-essential git ccache cmake libopenblas-dev pkg-config libcurl4-openssl-dev
git clone --depth 1 https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
cmake --build build --config Release
Build artifacts are in build/bin:
./build/bin/llama-server -h
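If you have an NVIDIA GPU and the CUDA toolkit installed, you can build the CUDA backend instead of OpenBLAS (a sketch; other backends such as Metal, Vulkan, and HIP have their own flags, so check the llama.cpp build docs):

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)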