Claude Code has a /voice mode. You hold a key, you talk, it transcribes. Speech in. What it does not have is speech out: the replies still arrive as text you have to read. That asymmetry is the whole idea behind claude-can-speak. The model should be able to talk back, on its own initiative for the things worth hearing, or on every reply when you want a running narration. Turn it on and it is a conversation in both directions. Turn it off and it is silent again. The rest of the work was making that sentence true without anything feeling bolted on, and one of the choices I got wrong on the first pass was assuming the existing /voice switch could carry both directions. It could not, and that is the most instructive part of the story.
The first real decision was the voice, and it was less obvious than I expected. My instinct was to reach for the most natural, most modern neural text-to-speech I could run locally. There are several. The honest framing is not "AI voice versus robotic voice", because all the serious local options are now neural. The real axis is naturalness against latency on a CPU, and it matters here because this thing speaks after every reply. A model that takes four seconds per sentence is unusable when it fires constantly.
I looked at three. Kokoro, an 82-million-parameter model, is the most natural of the small ones and runs fast on a CPU. Piper is a touch less natural but multilingual across thirty-odd languages and very fast. XTTS version two is the most natural of all and can clone voices, but on a CPU it is several seconds per reply, and its licence is non-commercial, which is a problem for something I want to publish openly. XTTS lost on both counts.
That left a genuine tension between Kokoro and Piper, and I made the mistake of assuming Kokoro could cover the languages I wanted. It cannot. I had wanted English plus German, ideally Turkish, so the same setup could be reused for future projects. Kokoro's own voice list settles it: American and British English, Japanese, Mandarin, Spanish, French, Hindi, Italian, Brazilian Portuguese. No German. No Turkish. Checking that before building, rather than after, saved me from shipping the wrong engine.
So I built both, with one as the default and the other a switch away. Piper is the multilingual path: German through the Thorsten voice, Turkish through the dfki voice, and the rest of its catalogue available by name. Kokoro is the default, because for the primary use, English voice-out, it simply sounds better. I confirmed that by ear rather than by spec sheet. I synthesised the same sentence through every candidate female voice, the Piper options and the Kokoro ones, and played them back to back. Kokoro's af_heart won cleanly. A listening test is worth more than a benchmark when the metric is "does this sound like a person".
Both engines run inside a single Docker container, so they never touch the host's Python environment. The container is persistent: it starts once and stays warm, so the cost of importing the model runtime is paid a single time, not on every reply. The host hands it text, it hands back a WAV, and the host plays it. Only audio crosses the boundary. Warm, a Piper sentence comes back in about a second and a Kokoro one in a little over two, both fine for something that speaks while you keep working.
Wiring it into Claude Code is a Stop hook. When a reply finishes, Claude Code runs the hook and hands it the finished message. The hook checks whether the firehose is on, and if it is not, it exits silently and nothing speaks. If it is on, it strips markdown and drops fenced code blocks, because reading a code block aloud is unlistenable, then sends the cleaned text to the container and plays the result. The hook returns immediately and the speaking happens in a detached background process, so it never delays your next turn. What that "is the firehose on" check reads turned out to be the one decision I had to make twice, and I will come back to it.
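The cleaning step is the easiest part of the hook to show in isolation. What follows is a minimal Python sketch of the idea, not the hook's actual code: the function name clean_for_speech and the exact set of stripping rules are assumptions chosen for illustration, and a real implementation may handle more markdown than this.

```python
import re

FENCE = "`" * 3  # three backticks, the marker that opens and closes a code fence

# A complete fenced block: an opening fence line through the next closing fence line.
_FENCED_BLOCK = re.compile(
    rf"^[ \t]*{FENCE}.*?^[ \t]*{FENCE}[ \t]*$", re.MULTILINE | re.DOTALL
)
# A fence that is never closed: drop everything from it to the end of the text.
_OPEN_FENCE = re.compile(rf"^[ \t]*{FENCE}.*\Z", re.MULTILINE | re.DOTALL)


def clean_for_speech(markdown: str) -> str:
    """Return plain text suitable for a TTS engine (illustrative rules only)."""
    text = _FENCED_BLOCK.sub("", markdown)
    text = _OPEN_FENCE.sub("", text)
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)  # [label](url) -> label
    text = re.sub(r"`([^`]*)`", r"\1", text)  # inline code -> its contents
    # Leading heading, blockquote, and bullet markers on each line.
    text = re.sub(r"(?m)^[ \t]*(?:#{1,6}[ \t]+|>[ \t]?|[-*+][ \t]+)", "", text)
    text = re.sub(r"\*\*|__|\*", "", text)  # emphasis markers
    return re.sub(r"\s+", " ", text).strip()
```

Single underscores are left alone on purpose so that snake_case names survive. The trade-off is that a bare asterisk in ordinary prose is also dropped, which seems acceptable for text that is only ever heard.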
One subtlety bit me here. The WAV encoders need to seek backward to write the file header, and a pipe to standard output is not seekable, so the first version produced a truncated, unplayable file. The fix is to synthesise to a temporary file inside the container and then stream that out. Obvious in hindsight, invisible until you hit it.
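The shape of that fix can be sketched in a few lines of Python. This is an illustration under assumptions, not the container's code: Python's standard wave module stands in for the engines' own encoders, the function name is hypothetical, and the audio is assumed to be 16-bit mono PCM.

```python
import shutil
import struct
import tempfile
import wave


def write_wav_via_tempfile(samples, sample_rate, out):
    """Encode 16-bit mono PCM to a seekable temp file, then stream it to `out`.

    `out` only needs a write() method, so it can be a pipe or sys.stdout.buffer.
    """
    with tempfile.TemporaryFile() as tmp:
        with wave.open(tmp, "wb") as wav:
            wav.setnchannels(1)
            wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
            wav.setframerate(sample_rate)
            wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))
        # Closing the writer has seeked back and patched the header sizes.
        # The file is complete now, so it can be copied out front to back.
        tmp.seek(0)
        shutil.copyfileobj(tmp, out)
```

The header carries the total length of the audio, which is only known once the last frame is written. That is why the encoder wants to seek, and why finishing the file before streaming it sidesteps the problem.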
Reading every reply aloud is a blunt instrument, and partway through I realised it should not be the only mode. So there is a second one: a Claude Code skill called speak that hands the model a deliberate "say this out loud" capability. Instead of narrating everything, Claude can choose to voice only what is worth hearing: a spoken "the build is done and tests passed" when you have stepped away, a heads-up that a deploy needs confirmation, a short callout you asked for. The skill's description tells the model when to use it and, more importantly, when not to: never for routine replies, never for code or file paths, only for things genuinely worth interrupting your ears. The two modes are independent. You can run the firehose, the deliberate skill, both, or neither.
Speaking after every reply also demands a way to stop. A long answer you do not want to hear out is worse than no audio at all. There are three ways to interrupt: run a stop command; send your next message, which a second hook treats as the signal to silence the previous reply; or let a new reply supersede the old one. The mechanism underneath records the speaking process group the instant it starts, before synthesis even finishes, so an interrupt lands whether the model is still synthesising or already mid-sentence. Getting that ordering right was the difference between a stop button that works and one that only works sometimes.
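The ordering is easier to see in code. The sketch below is a POSIX-only Python illustration of the pattern, not the tool's implementation: the function names and the idea of a plain-text pgid file are assumptions, and the command passed in stands for whatever process does the synthesis and playback.

```python
import os
import signal
import subprocess
from pathlib import Path


def start_speaking(argv, pgid_file):
    """Launch the speaking pipeline in its own process group and record it."""
    # start_new_session=True makes the child a session and group leader,
    # so its process group id is its own pid.
    proc = subprocess.Popen(argv, start_new_session=True)
    # Record the group immediately, before synthesis or playback has finished,
    # so an interrupt can land at any stage.
    Path(pgid_file).write_text(str(proc.pid))
    return proc


def stop_speaking(pgid_file):
    """Signal the recorded group. Return True if something was still running."""
    path = Path(pgid_file)
    try:
        pgid = int(path.read_text())
    except (FileNotFoundError, ValueError):
        return False  # nothing recorded, so nothing to stop
    try:
        os.killpg(pgid, signal.SIGTERM)  # reaches every process in the group
        stopped = True
    except ProcessLookupError:
        stopped = False  # the group had already exited
    path.unlink(missing_ok=True)
    return stopped
```

Signalling the group rather than a single pid matters because the synthesiser and the audio player are typically separate processes. A production version would also want to guard against a stale file whose id has since been reused by an unrelated process.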
The last decision was about what to ship, and it turned out to be a licensing decision in disguise. The clean instinct was: do not bundle the models at all. Piper's voices carry a patchwork of per-voice licences, its phonemizer is GPL, and Kokoro's voice pack terms are not spelled out. Redistributing that mix inside my own MIT package would be a mess. Not shipping them is both simpler and more correct: the package contains only my own code, and each model is downloaded on first use, straight from its official source, under its own licence, into a local cache. I redistribute nothing. A short third-party notes file lists every engine and model with its licence and origin.
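The download-on-first-use pattern is small enough to sketch. The Python below is a hedged illustration only: ensure_model is a hypothetical name, and the fetch callable is a placeholder for whatever actually downloads a model from its official source, so no real URL or cache path is implied.

```python
import os
from pathlib import Path


def ensure_model(name, cache_dir, fetch):
    """Return the cached path for `name`, downloading it only if it is missing.

    `fetch` is a caller-supplied function that takes the model name and
    returns its bytes (hypothetical; stands in for the real download).
    """
    target = Path(cache_dir) / name
    if target.exists():
        return target  # already cached: no network access, nothing re-shipped
    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a sibling file and rename, so an interrupted download
    # never leaves a half-written model that later looks valid.
    partial = target.with_name(target.name + ".part")
    partial.write_bytes(fetch(name))
    os.replace(partial, target)
    return target
```

Because the package itself only ever contains this kind of logic and never the model bytes, the licence of each model stays attached to the place it was downloaded from.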
Distribution followed the same "fit the thing to its nature" logic. My usual default is a Debian package, but this is a generic Claude Code extension, not a Linux system tool, and a .deb would gate it to Debian and Ubuntu for no good reason. It is mostly per-user Claude configuration plus a small CLI. npm fits that far better and reaches macOS, Arch, Fedora, and WSL alike. So it ships as npm install -g claude-can-speak, with Docker as a runtime requirement the installer checks for and explains rather than a packaging dependency.
The most instructive bug showed up after release, and it was my own clever idea biting back. The original plan was elegant on paper: gate the speak-everything firehose on Claude Code's built-in /voice, so one switch would control both directions, you talk to it and it talks back. I shipped it that way. Then I turned /voice off and the replies kept being spoken. The assumption was wrong in two places at once. First, /voice is speech-IN, the dictation toggle, and its live state is not something a hook can reliably read from the settings file; the field I was reading meant "dictation is configured", which was permanently true. Second, a finishing-reply hook fires on every reply regardless of voice state, so there was nothing actually gating it. The feature appeared to work only because, by coincidence, the field I read happened to be true.
The fix was to stop being clever and give the tool its own honest switch. The firehose now reads a single state file that claude-can-speak on and claude-can-speak off write, defaulting to off. It is no longer coupled to /voice at all, because the two things were never really the same concern: one is how you dictate prompts, the other is whether you want answers read back. Conflating them produced a switch that could not be switched off. The lesson is an old one that I relearn regularly: a feature that depends on reverse-engineering another system's internal state is a feature waiting to break, and "it works on my machine right now" is not the same as "it works". Verifying that with a trace, watching whether the hook even fired, was what turned a confusing "why is it still talking" into a one-line root cause.
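A switch like that fits in a dozen lines. The following Python sketch shows the shape of it under stated assumptions: the function names and the "on"/"off" file contents are illustrative, not the tool's actual file format or location.

```python
from pathlib import Path


def set_firehose(state_file, enabled):
    """Persist the switch. This is the only place the state is ever written."""
    path = Path(state_file)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("on\n" if enabled else "off\n")


def firehose_enabled(state_file):
    """Read the switch. A missing, unreadable, or unexpected file means off."""
    try:
        return Path(state_file).read_text().strip() == "on"
    except OSError:
        return False
```

The property worth noticing is that every failure mode resolves to off. A hook built on this check can only speak when something has explicitly written "on", which is the opposite of the original bug, where a field that happened to be true kept the voice running.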
The honest switch fixed the behaviour, but it left a quieter problem that only surfaced when someone used the tool fresh. The firehose now lived behind a terminal command, claude-can-speak on. That is correct, but it is not where a Claude Code user looks. The instinct, mine included, is to type a slash in Claude Code and scan the menu for the toggle. There was nothing there, because the tool deliberately does not touch /voice, and so the natural reaction was "is this even installed". The feature worked perfectly and felt broken, purely because the control was in the wrong place for the person reaching for it.
The fix was to put the switch where the hand reaches. A small /ccs slash command now ships with the tool and is installed by setup, so /ccs on, /ccs off, and /ccs status work from inside Claude Code, next to /voice where you expected them. The terminal command still exists; the slash command is just a thin wrapper over it. There was one more subtlety underneath, and it is worth stating plainly because it is the kind of thing that reads as a bug. Claude Code reads its list of hooks once, when a session starts. Flipping the firehose on mid-session updates the state file the hook reads, but if the session began before the hook was ever registered, there is no hook in that session to read it. So the first time, you start one fresh session after setup, and from then on every reply speaks. The hook's own code and the on/off state both update live; only the initial registration is read at start. Naming that out loud in the documentation was the actual fix, because the software was already doing the right thing and only the explanation was missing.
None of these pieces is a clever algorithm. The interesting work was the sequence of small, honest choices: latency over raw naturalness, then naturalness over breadth once the languages were verified, two modes instead of one, an interrupt that fires during synthesis and not just playback, models fetched instead of shipped, npm instead of a Debian package, an explicit on/off switch instead of a clever one borrowed from another feature, and a slash command that puts that switch where the hand reaches for it. Each one came from taking a real constraint seriously rather than from a clever trick. The result is quiet by default, and when you turn it on, Claude Code talks back, on its own initiative through the skill or on every reply through the firehose. That is the entire feature, and getting it to feel simple took every one of those decisions, including the one I had to make twice.
