Local AI

Running local LLMs without the rituals

A small wrapper around llama.cpp that I wrote because I kept rebuilding the same shell scripts, and one I think is finally worth keeping.

Ramazan Yavuz

Every few months I would try to run a local language model on my own hardware. The motivation was always the same: I do not want my prompts going to a vendor I cannot control, I do not want to pay per token for things I am still iterating on, and I am genuinely curious about what the open models can do on the machines I already own. The intention was always real. The follow-through, less so.

Each attempt followed a similar shape. Pick a runtime (llama.cpp, Ollama, text-generation-webui, vLLM, take your pick). Pick a model file, then pick a quantization (Q4_K_M, Q5_K_S, IQ3_XXS, and a dozen other names that imply some difference you should already know). Find a GGUF that downloads anonymously. Skim three blog posts to figure out which flags matter for your hardware. Realize the binary you built a few weeks ago is already two versions out of date. Start over. After running through this loop a few times and writing essentially the same glue script each time, I stopped and wrote it properly.

The result is hydra-llm. It is a small CLI plus an optional KDE Plasma 6 widget. Underneath it is just llama.cpp running in Docker, with a curated catalog of community-quantized GGUF files (Bartowski, lmstudio-community, mradermacher) that download without a Hugging Face account. The CLI does the four things I actually do at the prompt: list-online, download, chat, and api. The Plasma widget adds glanceable status for the people who, like me, have the panel always visible.
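In practice the whole loop is a handful of commands. The sketch below uses a made-up alias for illustration; the real ids are whatever list-online prints for your machine.

```
# Illustrative first run; "qwen2.5-7b-q4" stands in for a real catalog id.
hydra-llm list-online
hydra-llm download qwen2.5-7b-q4
hydra-llm chat qwen2.5-7b-q4

# Or serve it as an API endpoint instead of chatting in the terminal.
hydra-llm api qwen2.5-7b-q4
```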


Existing options sit at two extremes. Some are too magical: they hide what is running, the binary calls home, the catalog is curated by someone whose interests are not aligned with yours, and there is no escape hatch when you want one. Others are too raw: you write the Docker compose, you mount the volume, you pick the port, you tune the flags, you pin the image, you do it again next month. I wanted something in the middle. Real Docker containers I can see in docker ps. One config file I can read end to end. No telemetry. No vendor account.

Docker carries the cross-distro Linux packaging burden for me, which is a tax I am not willing to pay for a side project. llama.cpp moves fast enough that any binary I built last month is already stale, and rebuilding for half a dozen distros is not how I want to spend my evenings. The cost is that the user has to have Docker. For the audience I care about, that is already true.


The first real design decision was what to do about hardware variance. A Phi-3-mini runs comfortably on a fanless mini-PC. A Llama-3.3-70B at Q4 needs serious memory or a workstation GPU. Ship one flat catalog and the new user picks the biggest model they have heard of, then watches their machine swap itself into a coma.

So hydra-llm doctor looks at CPU, RAM, GPU, and VRAM, classifies the machine into a tier, and list-online filters the catalog accordingly. The tiers are tiny, laptop, halo, workstation, and server. You can override, but the default is the one that does not destroy your day.
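If you want to see what you were assigned before trusting the filtered listing, the check is two commands:

```
# Inspect what was detected (CPU, RAM, GPU, VRAM) and the tier it maps to.
hydra-llm doctor

# The catalog listing is filtered to that tier unless you override it.
hydra-llm list-online
```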

The halo tier is named after AMD's Strix Halo silicon: laptops with 64-128 GB of unified memory and a Strix iGPU. That tier sits in a gap that did not really exist a year ago. You can run a Gemma-3-27B at Q4 or a Qwen-2.5-32B at Q4 on a laptop now, with no discrete GPU, without sounding like a hairdryer. I wrote a separate review of the laptop that pushed me into building this tier specifically; the short version is that unified memory at this scale changes the answer to "what should I run locally" for a lot of people.


The temptation with configuration is always to invent a beautiful nested format. I resisted, mostly. The result is three flat layers, narrowest wins. Personas live at ~/.config/hydra-llm/personas/<name>.md, are markdown with optional YAML front matter, and apply across models. Per-alias system prompts live at ~/.config/hydra-llm/prompts/<alias>.txt and apply to that catalog id unless a persona overrides them. Per-alias sampling parameters live at ~/.config/hydra-llm/params/<alias>.json and cover temperature, top_p, top_k, repeat_penalty, max_tokens, and seed.
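As a concrete sketch, here is what the per-alias layers look like on disk. The alias and the values are illustrative; the keys are the ones listed above.

```
# Per-alias sampling parameters ("qwen2.5-7b-q4" is an illustrative alias).
mkdir -p ~/.config/hydra-llm/params ~/.config/hydra-llm/prompts
cat > ~/.config/hydra-llm/params/qwen2.5-7b-q4.json <<'EOF'
{
  "temperature": 0.4,
  "top_p": 0.9,
  "top_k": 40,
  "repeat_penalty": 1.1,
  "max_tokens": 1024,
  "seed": 42
}
EOF

# Per-alias system prompt, used unless a persona overrides it.
echo "You are a concise assistant. Prefer short answers." \
  > ~/.config/hydra-llm/prompts/qwen2.5-7b-q4.txt
```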

Inside the chat REPL, /params shows what is active, /set temperature 0.4 changes one for the current session, /reset clears history but keeps the system prompt, and /thoughts on toggles reasoning output for models that emit it. The whole surface is meant to be discoverable from /help and forgettable when you are not configuring.


I run KDE Plasma. The CLI is fine, but I wanted to see at a glance whether a model is up, what it is doing, and how loaded the machine is. The widget has the obvious controls (start, stop, console, logs, configure) and one detail I am embarrassed to admit how much I like: a HAL-eye indicator that breathes faster as utilization climbs, shows a yellow scanning ring while a container is loading, and turns solid red once at least one model is healthy. The widget ships as a separate package (hydra-llm-plasma); on non-Plasma desktops the CLI works fully, and a native UI for GNOME and XFCE is on the roadmap.


The catalog only references community-quantized GGUFs that download without a Hugging Face account. This was deliberate, and it shaped which models made the cut. If you want gated weights (the official meta-llama/* or google/gemma-* repos), set HF_TOKEN in your environment and the CLI passes it through. The CLI never prompts for one, never stores one, and never sends it anywhere except Hugging Face when you have asked it to. There is no telemetry, no analytics, and no auto-update calls. Sessions are saved as JSON in ~/.local/state/hydra-llm/sessions/, and you can delete them whenever.
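For the gated case the whole dance is one environment variable. The token and the alias below are placeholders; spell the id however it appears in your catalog or on Hugging Face.

```
# Only needed for gated weights; the token is forwarded to Hugging Face on
# download and is never stored. Both values here are placeholders.
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx
hydra-llm download llama-3.3-70b-instruct

# Sessions live here as plain JSON; delete whatever you do not want to keep.
ls ~/.local/state/hydra-llm/sessions/
```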


The friction I had been hitting was never technical. Local LLM stacks have been runnable for two years. The friction lived in the rituals: which file, which flag, which runtime, which quantization. Once those choices are pre-made for the 80% case, the experience is genuinely good.

I also kept the scope small on purpose. hydra-llm does not fine-tune. It does not do RAG. It does not embed. It picks a model, downloads it, and lets you talk to it. Every time I was tempted to add a feature I asked whether it sat on the path of "I want to chat with a local model right now," and most of the time the honest answer was no. Side projects die from feature creep more often than from bad code.

If your local-LLM ritual still involves more than two commands, the runtime is probably not the thing to fix.