April 27, 2026 · 5 min read · Erik Pavelka

Let's Chat Coach now talks. And listens.

How we built voice on Let's Chat Coach: the 33-combination voice test, what we got right, and what we're still working on.

We just shipped voice on Let's Chat Coach. You can speak your question instead of typing it, and the coach speaks back in real time, while the response is still being generated.

We built this because coaching is a conversation, and sometimes you'd rather speak it than type it. When you're wrestling with a tough call about your team, your career, or a conversation you've been avoiding, you may want to speak to, and listen to, another voice.

What changed

Voice input. Press the microphone, talk, and your words become the prompt. Same quality as typing. You don't need to be careful or formal. We handle the pauses, the "ums," the restarts.

Voice output. The coach responds in a natural voice with a professional-coach tone: direct and grounded, firm when the content calls for it, no pep-talk energy. You can read along or just listen.

Streaming responses. Whether voice is on or off, responses now stream in as they're generated instead of landing as a single block after a delay. On a typical coaching reply, you see the first words within about half a second instead of waiting for the whole thing.

Why this voice

We tested eleven voices against three different tone prompts — 33 combinations — before shipping anything. Most of them sounded wrong for coaching: too bright, too performative, too generic, or subtly salesy in ways we couldn't quite articulate until we heard the alternative.

Coach's voice is paired with a "seasoned executive coach" instruction prompt: conversational, not hurried, comfortable with silence, kind in baseline but willing to push back. The instruction turned out to matter as much as the voice. The same voice with a cheerful-assistant prompt sounds like a completely different person.

What we got right

Sentence-level audio. We wait for a complete sentence before synthesizing speech. Word-by-word or phrase-by-phrase synthesis sounds robotic because text-to-speech (TTS) engines need a full clause to land pitch and emphasis naturally.
In-order playback. We synthesize sentences in parallel but play them back strictly in order, so a short sentence that finishes synthesizing first can't skip ahead of a longer one in the audio queue.
Markdown-aware cleaning. Asterisks don't get read as "asterisk asterisk bold asterisk asterisk." URLs don't get spelled out. Emoji don't get named. The on-screen text still shows the formatting; only the spoken version is cleaned.
Honest failure modes. When a sentence's synthesis hits a rate limit or a transient error, we retry once and then skip rather than replay stale audio or stall the queue.
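
For readers who want to see how these four pieces fit together, here is a minimal Python sketch. It is not our production code. Every name in it (clean_for_speech, split_sentences, speak, and the synthesize and play callbacks) is hypothetical, the regexes cover only common cases, and the chunks are a plain iterable where a real stream would arrive asynchronously.

```python
import asyncio
import re

# A sentence ends at ., ! or ? followed by whitespace (a simplification).
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def clean_for_speech(text):
    """Strip markdown marks, URLs, and emoji from the spoken copy only."""
    text = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", text)  # [label](url) -> label
    text = re.sub(r"https?://\S+", "", text)               # bare URLs
    text = re.sub(r"[*_`#]", "", text)                     # crude: also hits snake_case
    text = re.sub("[\U0001F300-\U0001FAFF\u2600-\u27BF]", "", text)  # common emoji
    return re.sub(r"\s+", " ", text).strip()

def split_sentences(chunks):
    """Buffer streamed text and yield only complete sentences."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = SENTENCE_END.split(buffer)
        for sentence in parts[:-1]:   # all but the last part are complete
            yield sentence
        buffer = parts[-1]            # possibly unfinished; keep buffering
    if buffer.strip():
        yield buffer.strip()          # flush whatever is left at end of stream

async def synthesize_with_retry(synthesize, sentence):
    """Two attempts in total (one retry); None tells the caller to skip."""
    for _attempt in range(2):
        try:
            return await synthesize(sentence)
        except Exception:
            continue
    return None

async def speak(chunks, synthesize, play):
    """Synthesize sentences in parallel, play them in original order."""
    spoken = [clean_for_speech(s) for s in split_sentences(chunks)]
    tasks = [
        asyncio.create_task(synthesize_with_retry(synthesize, s))
        for s in spoken if s          # nothing left to say after cleaning
    ]
    for task in tasks:                # awaiting in list order serializes playback
        audio = await task
        if audio is not None:         # failed twice: skip, don't stall the queue
            await play(audio)
```

The ordering guarantee comes from awaiting the tasks in the order they were created rather than in the order they finish. A sentence whose synthesis fails twice is dropped, and the next one plays.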

What we are working on and thinking about

A 1-3 second gap between text showing up on screen and the first audio playing. The first sentence has to finish generating and then finish synthesizing before you hear anything. We added a small "Preparing voice…" indicator to close the feedback loop, but the gap is still real.
Voice preference isn't user-settable yet. Everyone hears the same voice with the executive-coach tone. We may add a picker later.
English-only for now.
Listening for silence. Real conversations have natural pauses while you think. A live coach reads those pauses and waits; ours doesn't yet. It doesn't handle interruptions either: today, if you start talking while the coach is speaking, the coach doesn't stop. We're holding off on shipping pause detection and interruption handling until the underlying tech is good enough that it won't accidentally cut you off mid-thought. We need our coach to do what a live coach does: see that you're still thinking, and wait.

Voice is live as of April 27, 2026 at letschatcoach.com. Turn on the Mic and Speaker toggles in your next session. It's a different experience than we expected.