Local AI

lillycoder: a local-first coder REPL with permission gates

A small Python CLI that turns any OpenAI-compatible /v1 endpoint into a coding agent in your current directory. No cloud, no API key, every mutation gated.

Ramazan Yavuz

Cloud coding agents are good. They are also a bundle of trade-offs you cannot opt out of: the model lives on someone else's machine, the company sets the rate limits, your code is shipped through their API, and the price per token is whatever they say it is this quarter. Most of that you can live with. The one that bothers me is the lock-in. If I want to point a coding agent at a different model tomorrow, or run it on a plane, or pair it with something I am tinkering with locally, I cannot. lillycoder is the version of that workflow that works the other way around. The agent runs locally, talks to whatever LLM server you already have, and treats the model on the other end as interchangeable.


The shape of the tool is small on purpose. You run lillycoder in a project directory and it drops you into a chat REPL. You type something. The model on the other end picks tools to do it: read a file, write a file, edit a file, run a shell command, install a package, grep the project, list a directory. Each tool call comes back as a structured action, the agent executes it inside your current working directory, and the result feeds back into the conversation. From the outside, this is the same loop every coding agent runs. From the inside, the difference is what is doing the picking.
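
In code, that loop is small. The sketch below is illustrative rather than lillycoder's actual source; the helper names are made up, but the message and tool-call shapes are the standard OpenAI ones the next paragraph leans on.

import json

def agent_turn(client, tools, execute_tool, history, user_input):
    # One REPL turn: send the conversation, run any tool calls the model picks,
    # feed the results back, repeat until the model answers in plain text.
    # client: an OpenAI-style client pointed at the local /v1 endpoint.
    history.append({"role": "user", "content": user_input})
    while True:
        reply = client.chat.completions.create(
            model="local",        # whatever the server has loaded
            messages=history,
            tools=tools,          # read_file, write_file, edit_file, bash, ...
        )
        msg = reply.choices[0].message
        if not msg.tool_calls:    # plain text: the turn is done
            history.append({"role": "assistant", "content": msg.content})
            return msg.content
        history.append(msg)
        for call in msg.tool_calls:
            result = execute_tool(call.function.name,
                                  json.loads(call.function.arguments))
            history.append({"role": "tool",
                            "tool_call_id": call.id,
                            "content": result})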

What is doing the picking is whatever you want. lillycoder speaks the OpenAI /v1/chat/completions shape, which is by now the lingua franca of local LLM servers. llama.cpp publishes it. ollama publishes it on its compatibility surface. LM Studio publishes it. hydra-llm publishes it. So the agent does not care which one is running, and on first launch it scans common ports and offers to use whatever it finds. If nothing is running, you point it at a URL with --api. The model is not part of the package; it is whatever you started.
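
The first-launch scan is nothing exotic: probe a handful of default ports for a /v1/models endpoint and offer the first one that answers. A sketch of that idea; the port list is my guess at common defaults, not the tool's exact probe.

import urllib.request

COMMON_PORTS = [8080, 11434, 1234, 18087]   # llama.cpp, ollama, LM Studio, hydra-llm (assumed defaults)

def find_local_endpoint(host="localhost", timeout=0.5):
    # Probe each port's /v1/models; the first one that answers 200 wins.
    for port in COMMON_PORTS:
        try:
            with urllib.request.urlopen(f"http://{host}:{port}/v1/models",
                                        timeout=timeout) as resp:
                if resp.status == 200:
                    return f"http://{host}:{port}/v1"
        except OSError:
            continue
    return None   # nothing listening: fall back to asking for --api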


Local-first means a different threat model than the cloud version. Up there, the worst case is mostly economic: the model burns tokens, you pay for them. Down here, the model can write to your disk and execute shell commands. That is more powerful and more dangerous, and pretending otherwise would be irresponsible. So the permission system is the part of the tool I spent the most time on.

Every mutating action is gated by a prompt before it runs. When the model decides to write a file, you see the path and the size; when it wants to run a command, you see the command. You answer with one of four options:

🦊 lilly wants to: write_file("src/index.js", 142 chars)
   [y]es  [n]o  [a]lways for this tool  [p]ath: always for this exact target
   >

The four options are not decoration. y is the obvious one. n rejects this single call and lets the model retry or change tack. a grants a session-scoped pass for the tool itself, which is what you want once you trust the model to read files but still want to look at every bash. p grants a pass for this exact path, which is what you want when the model is iterating on a single file and you do not want to keep approving each save. Both passes are session-scoped: they vanish when the REPL exits.
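
The mechanics behind that prompt fit in a dozen lines. This is a sketch under obvious assumptions, not the real implementation; the prompt text matches the example above, and the grants live only as long as the process.

# Session-scoped grants: cleared when the REPL process exits.
SESSION_TOOL_GRANTS: set[str] = set()
SESSION_PATH_GRANTS: set[tuple[str, str]] = set()

def ask_permission(tool: str, target: str) -> bool:
    # Gate one mutating call; returns True if it may run.
    if tool in SESSION_TOOL_GRANTS or (tool, target) in SESSION_PATH_GRANTS:
        return True
    answer = input(
        f"🦊 lilly wants to: {tool}({target!r})\n"
        "   [y]es  [n]o  [a]lways for this tool  [p]ath: always for this exact target\n   > "
    ).strip().lower()
    if answer == "a":
        SESSION_TOOL_GRANTS.add(tool)              # trust every future call of this tool
        return True
    if answer == "p":
        SESSION_PATH_GRANTS.add((tool, target))    # trust this tool on this exact target
        return True
    return answer == "y"                           # anything else is a refusal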


Above the permission prompt sits a hard-deny classifier. There is a list of commands that simply will not run, no matter what you answer at the prompt and no matter what flag you pass. sudo, rm -rf /, mkfs, dd of=/dev/*, fork bombs, recursive chmod or chown against / or ~. These are refused before they reach the executor. The --bypass-permissions flag (which is there for headless or scripted use) skips the per-call prompt, but it does not skip the safety classifier. That is by design; bypass should be for tedium, never for danger.

The classifier is small and dumb on purpose. It is a string-and-pattern checker, not a sandbox. A malicious shell pipeline that does the same damage in a clever way will get past it; sandboxing the entire tool is a bigger project than this one is. What the classifier buys you is protection against the kind of mistake that comes from an LLM hallucinating a path or copy-pasting a destructive command from training data. That is the realistic failure mode, and the thing the classifier exists to catch.
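
A sketch of what such a checker looks like; the patterns below paraphrase the list above and are not lillycoder's exact rules.

import re

HARD_DENY = [
    r"^\s*sudo\b",                       # no privilege escalation, ever
    r"\brm\s+(-[a-zA-Z]*r[a-zA-Z]*f|-[a-zA-Z]*f[a-zA-Z]*r)\s+/(\s|$)",  # rm -rf /
    r"\bmkfs(\.|\b)",                    # formatting filesystems
    r"\bdd\b.*\bof=/dev/",               # writing raw devices
    r":\(\)\s*\{\s*:\|:&\s*\}\s*;\s*:",  # the classic fork bomb
    r"\b(chmod|chown)\b.*-R.*\s(/|~)(\s|$)",  # recursive chmod/chown on / or ~
]

def is_hard_denied(command: str) -> bool:
    # Refused before execution; --bypass-permissions never skips this check.
    return any(re.search(pattern, command) for pattern in HARD_DENY)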


For experimenting safely, the repository ships a docker-compose.yml that mounts a single WORKINGDIR/ into a container with lillycoder already installed. You edit on the host, run the agent inside the container, and any damage stays inside that one mounted folder. It is the same idea as a chroot but with less ceremony, and it is the configuration I use whenever I want to give the model latitude on a fresh codebase without watching every prompt.
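
The compose file is roughly this shape. This is a sketch rather than the repository's actual file, so the service name, build source, and host-gateway mapping are assumptions:

services:
  lillycoder:
    build: .                        # or an image with lillycoder preinstalled
    working_dir: /workspace
    volumes:
      - ./WORKINGDIR:/workspace     # the single host folder the agent can touch
    extra_hosts:
      - "host.docker.internal:host-gateway"   # reach a host-side LLM server from inside
    stdin_open: true
    tty: true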

docker compose up -d
docker compose exec lillycoder bash
# inside the container, in /workspace:
lillycoder --api http://host.docker.internal:11434/v1

Inside that sandbox the --bypass-permissions flag becomes a lot more reasonable. The hard-deny list still runs. The model still cannot escape the mount. You get to watch what an autonomous agent on a 14B local model actually does without the per-keystroke approval loop, and you can throw the whole thing away with one docker compose down -v.


The model on the other end matters. Tool-calling reliability is not free; it requires a model that was trained or fine-tuned to emit structured tool calls without going off the rails. The models the community rates as "good at tools" are not the same ones it rates as "good at chat," and a model that is excellent at writing prose can be useless at picking the right tool with the right arguments. lillycoder keeps a small allowlist of model families that have, in my testing, proven reliable enough: Qwen 2.5 and 3, Gemma 3, Llama 3.1, Mistral Small 3, Dolphin 3 R1. If you point the tool at a model outside that list, it warns you. --force silences the warning when you know what you are doing.
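
The check itself is shallow, something like the sketch below, where the family strings paraphrase that list and the matching rule is my assumption rather than the tool's.

KNOWN_GOOD_FAMILIES = ("qwen2.5", "qwen3", "gemma3", "gemma-3",
                       "llama3.1", "llama-3.1", "mistral-small", "dolphin3")

def warn_if_untested(model_id: str, force: bool = False) -> None:
    # Warn, do not block: an unknown model may still call tools fine.
    name = model_id.lower()
    if force or any(family in name for family in KNOWN_GOOD_FAMILIES):
        return
    print(f"warning: {model_id} is not on the tested tool-calling allowlist; "
          "pass --force to silence this.")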

For the actual coding loop, the size that earns its keep on consumer hardware is roughly 14B to 32B at Q4_K_M quantization. That fits on a 16 to 24 GB GPU and produces tool calls that are, in my use, indistinguishable from the cloud agents on small-to-medium tasks. The cloud still wins for cross-file refactors at scale, multi-step reasoning over very long contexts, and anything where the SOTA quality gap shows up. For the day-to-day "edit this file, run the tests, fix the failure," the local loop has caught up.
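
The arithmetic behind that sizing is easy to check. Q4_K_M lands at roughly 4.85 bits per weight; the overhead allowance below for KV cache and buffers is a rough assumption, not a measurement.

BITS_PER_WEIGHT_Q4_K_M = 4.85

def rough_vram_gb(params_billion: float, overhead_gb: float = 2.0) -> float:
    # Weights at Q4_K_M plus a rough allowance for KV cache and runtime buffers.
    weights_gb = params_billion * BITS_PER_WEIGHT_Q4_K_M / 8
    return weights_gb + overhead_gb

print(rough_vram_gb(14))   # ~10.5 GB: fits a 16 GB card
print(rough_vram_gb(32))   # ~21.4 GB: tight but workable on a 24 GB card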


There is a sibling project called hydra-llm that I built before this one. It manages local LLM servers: download a model, start it on a port, get an OpenAI-compatible endpoint back. lillycoder talks that exact shape. So the two compose into a fully local coding stack with two commands:

# in hydra-llm:
hydra-llm start qwen2.5-32b
hydra-llm api   qwen2.5-32b           # prints the URL

# in your project directory:
lillycoder --api http://localhost:18087/v1

That is the whole thing. hydra-llm handles the model lifecycle. lillycoder is the agent on top. They are two halves of one workflow, and either half is replaceable. If you already have ollama running, skip hydra-llm. If you have a different agent you like better, point it at hydra-llm's endpoint instead. The split is intentional; tools that try to be both the runtime and the agent always end up worse at one of those jobs.


Installation is the boring part, which is how it should be. On Debian and Ubuntu there is a signed apt repository, so it is a normal package after a one-time keyring setup:

sudo install -d -m 0755 /etc/apt/keyrings
curl -fsSL https://ra-yavuz.github.io/apt/pubkey.gpg \
  | sudo tee /etc/apt/keyrings/ra-yavuz.gpg >/dev/null
echo "deb [signed-by=/etc/apt/keyrings/ra-yavuz.gpg] https://ra-yavuz.github.io/apt stable main" \
  | sudo tee /etc/apt/sources.list.d/ra-yavuz.list
sudo apt update
sudo apt install lillycoder

From source on any Linux it is the usual routine: git clone the repository, then pip install --user -e . inside the clone. After that, cd into a project, run lillycoder, and start typing. If a local LLM server is already listening on a common port, the tool finds it. If not, pass --api with the URL.


The bundled persona is a kid-coder voice, deliberately playful, because I find it pleasant to work with and the alternative was yet another terse-junior-engineer voice that the world has plenty of. If that is not your taste, drop a markdown file at ~/.config/lillycoder/personas/<name>.md and run with --persona <name>. The persona is the system prompt; everything else, the tools, the gates, the loop, is unchanged.

That is also a useful place to encode project conventions. A persona that says "always run the test suite after edits, never modify files outside src/, prefer pure functions" turns into a behavior the model leans on across the whole session. It is not a sandbox or a guarantee, but it does shape what the model picks before it gets to the permission prompt, which means fewer prompts to refuse and more to accept on autopilot.


The bigger point behind all of this is that local agentic coding is no longer a research project. The pieces are there: capable models, an open endpoint shape, fast enough hardware, and the realisation that "agent" and "model server" are two different concerns that should not be welded together. lillycoder is one specific composition of those pieces. It is opinionated about safety (gates by default, hard-deny on top), uninterested in being the model runtime (point it at one), and small enough that the entire codebase is readable in one sitting if you want to know what runs on your machine.

If you want to try it, the install lines above are real and the source is on GitHub. If something local is already listening, just running lillycoder in a folder will pick it up. The first thing to type, the same thing I type, is "what files are in this folder?" From the answer you can already tell whether the model on the other end is going to be a useful collaborator, and that is the only thing that really matters about which one you picked.