Greg’s blog
  • Blog

Table of Contents

  • The GPU That Started It All
  • The Idle Period
  • From Browsing to Trying
  • From Ollama to llama.cpp: A Three-Stage Journey
    • Stage 1: Ollama for Exploration
    • Stage 2: Ollama as an OpenAI-Compatible Endpoint
    • Stage 3: llama.cpp as My Personal LLM Endpoint
  • The Model Zoo: Testing What Fits (Qwen, Gemma, and GaMS)
    • Quantization: The Art of Compromise (Context)
  • Appendix
    • Example llama-fit-params Output

Down the Rabbit Hole: Setting Up a Local LLM with llama.cpp

tech
LLM
hardware
Author

Gregor Cerar

Published

2026-06-07

Modified

2026-06-07

The GPU That Started It All

My local LLM journey began with a graphics card, specifically an NVIDIA GeForce RTX 3090 with 24 GB of VRAM. I did not buy it from a computer store or unbox it fresh from a retail package. Instead, I got it from a former crypto miner who went late and blindly into crypto and burned himself when Ethereum changed the algorithm. Prices for any decent GPU were still high at the time, as the crypto craze had transitioned to the AI craze.

He had several of these cards, and I picked up one at a price that felt almost charitable. Looking back, I regret not buying the second one he had available.

That single RTX 3090 joined my setup in May 2023, replacing an RTX 2080 Ti I had bought second-hand in November 2020 to work around limited lab access to GPUs. The RTX 2080 Ti carried me through most of my PhD research (I completed my doctorate in August 2021), and now sits retired in a closet. It is a quiet reminder of where this journey started. The 3090, by contrast, mostly saw gaming duty during my postdoc years; I simply did not have time for more research after September 2021.

The GPU stayed the same through all the years that followed. Everything else around it changed, going from an Intel Core i5-6500 to a Ryzen 5 3600, and finally to a Ryzen 7 5800X. But the RTX 3090 remained the constant heart of my workstation.

The Idle Period

After finishing my PhD and entering a two-year-long period as a full-time postdoc, I did not have much time to meaningfully use that GPU beyond casual gaming. It sat there, with this powerful accelerator built for massive parallel computation, rendering game frames and not much else.

For work, I used ChatGPT with subscription the way many people do these days. It became my go-to assistant for generating reports, drafting text from bullet points, summarizing papers, and a dozen other tasks where a capable language model saves hours of effort. It was convenient, powerful, and completely remote. I never thought much about what happened behind that API endpoint.

I did not jump directly into local LLMs though. While I was a full-time postdoc, I tested Galactica and Meta’s Llama models using PyTorch during scarce pockets of free time. I even tried to implement a simple RAG system. But I thought to myself at the time that we are not there yet.

Then came the LLM hype cycle (again), followed by the agent hype. And somewhere in the noise, I started wondering: what if I could run something like this locally?

From Browsing to Trying

I started casually browsing Reddit communities like r/LocalLLM and r/LocalLLaMA. Scrolling through posts of people running impressive setups, where quantized models squeezed through consumer GPUs and custom prompts turned LLMs into code reviewers and research assistants, gradually seeded a question that would not go away: could I do this too?

The barrier to entry felt lower than I expected. People were discussing GGUF formats, quantization schemes, and model architectures like it was a hobby anyone could pick up. The community posts made it look accessible enough that I figured: why not?

From Ollama to llama.cpp: A Three-Stage Journey

Stage 1: Ollama for Exploration

I started with Ollama, largely because it is the path of least resistance. I just pulled the Docker container using a single command, and pulling a model was a simple copy-paste of its name into the OpenWebUI frontend. So, having something running locally with a working chat interface took about five minutes total. This was exactly what I needed at this stage: a low-friction way to get a feel for what local models are capable of.

I experimented with a handful of models, typed in basic prompts, and formed initial impressions. Could a local model actually help with coding tasks? Was the quality anywhere near ChatGPT level? Ollama gave me answers to these questions without requiring any technical deep-dives.

Stage 2: Ollama as an OpenAI-Compatible Endpoint

After getting comfortable with the basics, I wanted to test integrations. Could I plug a local model into tools that expect an OpenAI API? Fortunately, Ollama serves an OpenAI-compliant endpoint out of the box (/v1/chat/completions), so I used it as a drop-in replacement for testing various client applications and scripts.

This stage was short-lived but informative. It confirmed that local models could indeed function as API backends, opening up possibilities beyond just chatting in a web interface. But I also hit the natural limits of this approach: limited control over inference parameters, no visibility into what was happening under the hood, and growing curiosity about whether I could get better performance.

Stage 3: llama.cpp as My Personal LLM Endpoint

I substituted Ollama with llama.cpp, which has evolved significantly since its early days. It now includes its own web-based UI for quick testing and token generation inspection, which effectively bridges the usability gap that Ollama originally filled. But more importantly, llama.cpp offers far greater control over inference: GPU layer offloading, custom batch sizes, KV cache configuration, and a multitude of quantization options.

Today, I use llama.cpp as my personal LLM endpoint. I am the only user. It serves requests to agentic tools like Cline, provides the backend for local code review sessions, and runs experiments whenever I want to test a new model or quantization. The llama.cpp server mode (server binary with --host and --port flags) gives me a reliable API endpoint that any client can talk to, while the built-in web UI at /chat lets me quickly test prompts and observe token generation without writing any code.

The Model Zoo: Testing What Fits (Qwen, Gemma, and GaMS)

Once I had a working inference pipeline, the real experimentation began: which model should I actually run? I tested many models beyond these, but I highlight Qwen, Gemma, and GaMS here for different reasons.

Qwen is Alibaba’s open-weight model family, and it has become one of the default names in local-LLM discussions. The variants that kept showing up in my searches were the 27B dense model and the 35B-A3B Mixture-of-Experts model: large enough to be interesting, but still realistic to attempt on a 24 GB card with quantization.

Gemma is Google’s open-weight model family. Gemma 4 was especially relevant for this setup because it offers a spread of sizes and architectures, from smaller edge-oriented models to larger dense and MoE variants, making it a useful comparison point for both quality and hardware fit.

GaMS (Generative Model for Slovene) is a family, specialized for Slovene language trained within the PoVeJMo research program. The latest iteration GaMS3 is built on a Gemma 3 backbone, and it was quite an exciting project in our local community.

I went for those families because they’re popular. Popular means collective problem detection and resolution. It also means for me less time wasting (re)exploring and dealing with issues.

I include GaMS for context, but the benchmarks below focus on Qwen and Gemma because those were the model families I tested systematically.

Quantization: The Art of Compromise (Context)

For local LLM work, one of the most important practical limits is the effective context window: how many tokens the model can keep “in mind” at once while still running at a useful speed.

Note

There is a small naming trap here. People often use 128k, 131k, 256k, 262k, 131072, and 262144 almost interchangeably when talking about context length. The confusion comes from mixing decimal kilo (1000) and binary kibi (1024) units. For example, the Hugging Face model card for Qwen lists qwen35moe.context_length as 262144. Dividing that by 1024 gives 256, so I will refer to it as a 256K-token context window throughout this post.

The models I tested sit in this long-context range. Qwen 3.6 advertises a 256K-token context window, while Gemma 4 spans 128K for the smaller variants and 256K for the larger ones. Those advertised lengths are important, but they are not the same thing as “this will fit comfortably on my GPU.”

An advertised context length is best understood as the supported operating range. It is not merely a soft recommendation, but it is also not a hard physical wall. The model was trained or tuned to handle sequences up to that length, its positional encoding (RoPE) is calibrated around that range, and attention behavior should remain reasonably stable inside it. You can push beyond it, much like overclocking hardware, but then you are outside the specification. Techniques such as YaRN can extend context further, even toward 1M tokens, but that is outside the scope of this post.

Longer context is useful, especially for code, documents, and agent workflows, but it is not free. The context is stored in the KV cache, and the KV cache consumes memory. On a local setup, that memory is usually scarce GPU memory. Offloading to system RAM is possible, but it quickly becomes a performance tradeoff. For comparison, my RTX 3090 FE has a theoretical memory bandwidth of 936.2 GB/s, while dual-channel DDR4 at 3600 MHz on the Ryzen system reaches 57.6 GB/s. That gap is large enough to feel in practice.

This is where quantization enters. Quantization reduces the precision used to store model weights, and in some cases the KV cache, so the same model can fit into less memory. The image analogy is pixelation: you keep the overall structure, but you throw away detail. Lower precision usually means lower memory use and sometimes higher speed, but it can also reduce output quality, weaken multi-step reasoning, or make long-context behavior less stable.

The rest of this section looks at the tradeoff from a few angles: model size, runnable context, KV-cache precision, CPU/GPU placement, and long-context generation speed. I tweak one subset of parameters at a time so the compromises are easier to see.

Note

By default, llama.cpp uses float16/f16 for the KV cache. For Qwen 3.6, Unsloth recommends bfloat16/bf16 instead. Since Qwen 3.6 was trained with bf16, and NVIDIA Ampere cards such as the RTX 3090 support bf16 natively, I set the KV cache type explicitly in the benchmarks. The difference between F16 and BF16 is explained here.

How I Estimate Fit and Measure Speed

I split the benchmarking into two separate questions.

First, I ask a sizing question: how much context can this model theoretically fit into my available VRAM? For that, I use llama-fit-params. It estimates the memory needed for model weights, KV cache, compute buffers, and backend overhead, then reports either the maximum context size that should fit or the offload parameters needed to reach a target context size.

Second, I ask a performance question: how fast does it actually run once loaded? For that, I use llama-bench, because fitting into memory and being pleasant to use are not the same thing. A configuration can technically support a large context window while becoming too slow for interactive work.

The workflow is:

  1. Use llama-fit-params to estimate whether the model and KV cache fit in VRAM.
  2. If the full target context does not fit, let llama-fit-params suggest CPU/GPU offload parameters.
  3. For MoE models, test special placement options such as -cmoe, which offloads expert layers to CPU.
  4. Run llama-bench on the resulting configuration to measure generation speed at increasing context depth.
  5. Compare the tradeoff between context length, quantization, offloading, and throughput.

For the maximum-context experiment, the input is only the model and quantization choice:

flowchart LR

models@{ shape: db, label: "(model, quant.)" }
llama["llama-fit-params"]
out["projected max.<br>context size"]

models --> llama --> out

The way to find maximum window.

For the hybrid/offload experiment, I fix the target context length first and let llama-fit-params search for a placement strategy:

flowchart LR

models@{ shape: db, label: "(model, quant., ctx size)" }
llama["llama-fit-params"]
out["offload tune<br>parameters"]

models --> llama --> out

The way to find what offload tuning parameters to reach target context length.

Model Sizes

Before context length enters the picture, the first constraint is simply the model size. The quantized weights need to fit in VRAM with enough room left for the KV cache and compute buffers. On a 24 GB GPU, this already narrows the field.

GGUF model sizes reported by llama-bench in results-b9519.json; quantization labels are preserved as reported.
Model Type Quantization Size (MiB)
unsloth/Qwen3.6-35B-A3B-GGUF MoE UD-Q6_K_XL 30,358
unsloth/Qwen3.6-35B-A3B-GGUF MoE Q5_K_XL 25,350
unsloth/Qwen3.6-35B-A3B-GGUF MoE UD-Q4_K_XL 21,314
unsloth/Qwen3.6-27B-GGUF Dense UD-Q4_K_XL 16,786
unsloth/gemma-4-26B-A4B-it-GGUF MoE UD-Q6_K_XL 22,201
unsloth/gemma-4-26B-A4B-it-GGUF MoE UD-Q5_K_XL 20,220
unsloth/gemma-4-26B-A4B-it-GGUF MoE UD-Q4_K_XL 16,208
unsloth/gemma-4-12B-it-GGUF Dense UD-Q8_K_XL 12,990

These sizes are not the full runtime memory requirement, but they are a useful first filter. If the model weights already fill most of the card, a long context window will require KV-cache quantization, CPU/GPU offloading, or both.

Estimating Runnable Context

After checking model sizes, the next question is not simply whether the model can load, but how much context can remain on the GPU with it. I used llama-fit-params for this step instead of calculating the memory budget by hand.

The tool inspects the selected model, quantization, KV-cache type, available GPU memory, and llama.cpp runtime buffers. It then either reduces the projected context size to something that fits, or suggests CPU/GPU placement parameters when I ask for a specific target context.

These results are specific to my RTX 3090, my llama.cpp build, and the amount of free VRAM at the time of testing. I therefore treat them as sizing results, not reusable command-line recipes. The exact offload pattern may change on another machine.

The important pattern is simple: model weights and KV cache compete for the same VRAM. Higher-precision KV cache keeps more numerical detail but leaves less room for long context. Quantized KV cache gives up some precision, but it can make much larger context windows fit.

For reproducibility, one example llama-fit-params invocation and output is included in the appendix.

Benchmarking Long-Context Generation

These benchmarks focus on decode speed at different context depths. In llama-bench, I set n_prompt = 0, kept n_gen at the default 128, and increased n_depth across powers of two. This means the benchmark does not measure prompt processing speed, time to first token, or full request latency. It measures how quickly the model can continue generating once the KV cache is already populated.

That is the behavior I care about for long-running chats, code-review sessions, and agent workflows. In those cases, the model may already be carrying tens or hundreds of thousands of tokens, and the practical question becomes: how much does generation slow down as the context grows?

The benchmark machine was:

  • Ryzen 7 5800X (8C/16T)
  • ASUS TUF Gaming B550M-Plus WiFi II, PCIe 4.0 x16
  • 32 GB DDR4 at 3600 MHz
  • NVIDIA RTX 3090 FE, 24 GB VRAM
  • llama.cpp build 9519 (7fe2ae45a)

The figures below are generated from results-b9519.json. Each curve represents a runnable configuration after the fit step: model quantization, KV-cache type, target context, and any CPU/GPU placement selected by llama-fit-params.

These plots should not be read as general model-quality rankings or full end-to-end serving benchmarks. They isolate one performance question: generation throughput while carrying a large context.

The figures are grouped by model family and quantization. The x-axis is existing context depth, while the y-axis is generation throughput.

Qwen 3.6

For Qwen, I tested both the 35B-A3B MoE variants and the 27B dense variant. This makes the comparison useful for separating model-size pressure from architecture differences.

Gemma 4

For Gemma, I tested the 12B dense model and the 26B-A4B MoE variants. This gives a second model family for checking whether the same context-depth patterns hold beyond Qwen.

The pattern is more useful than any single number. Larger quantizations consume more memory before generation even begins. Higher-precision KV cache leaves less room for context. CPU/GPU hybrid placement can make otherwise impossible contexts load, but it may reduce throughput sharply. Quantized KV cache is often the more attractive compromise when the goal is to keep long context on the GPU.

Appendix

Example llama-fit-params Output

This is one representative fit run. The exact output is hardware-specific, but it shows the kind of decision llama-fit-params makes before benchmarking.

./llama-fit-params \
  -hf unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_M \
  -fit off \
  -fa on \
  -ctk bf16 \
  -ctv bf16 \
  -fitt 128

The relevant output is:

projected to use 26570 MiB of device memory vs. 23784 MiB of free device memory
cannot meet free memory target of 128 MiB, need to reduce device memory by 2914 MiB
context size reduced from 262144 to 121344
main: printing fitted CLI arguments to stdout...
-c 121344 -ngl -1

In this case, the model could stay fully on the GPU only by reducing the context window from 256K tokens to about 118K tokens.

Reuse

CC BY-NC-SA 4.0
 

© Copyright 2021, Gregor Cerar