ADR-014: Voice Pipeline (Streaming STT/TTS)
- Status: Accepted
- Date: 2026-02-04
Context
Voice tutoring is a core differentiator but has higher cost and higher abuse risk than text.
Constraints:
- Voice processing must be server-side only: clients never call the STT/TTS vendor directly.
- Voice is not available to anonymous users.
- STT and TTS vendor: ElevenLabs.
- Live session coordination uses Durable Objects (ADR-013).
We want low-latency streaming behavior with predictable persistence and retention:
- Do not store raw audio by default.
- Persist transcripts and structured metadata in D1 as session events.
Decision
Architecture
- Web client connects to the session Durable Object (DO) via WebSocket.
- Client streams microphone audio frames to the DO.
- DO brokers streaming STT with ElevenLabs and emits transcript events.
- Assistant responses are generated as text first; TTS is then synthesized and streamed back to the client.
Audio format requirements:
- Client captures audio at a 16 kHz sample rate, mono, 16-bit PCM.
- Audio is chunked into frames (e.g., 100-500 ms) and streamed over WebSocket.
- DO may implement server-side Voice Activity Detection (VAD) to reduce noise.
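The frame sizes implied by these requirements can be worked out directly. The sketch below is illustrative only: the constant names and the `frameSizeBytes` and `chunkPcm` helpers are hypothetical and not part of this ADR. It assumes raw PCM with no container or header bytes.

```typescript
// Capture parameters taken from the audio format requirements above.
const SAMPLE_RATE_HZ = 16000;
const CHANNELS = 1; // mono
const BYTES_PER_SAMPLE = 2; // 16-bit PCM

/** Bytes of raw PCM needed to hold `frameMs` milliseconds of audio. */
function frameSizeBytes(frameMs: number): number {
  return (SAMPLE_RATE_HZ * CHANNELS * BYTES_PER_SAMPLE * frameMs) / 1000;
}

/** Split a PCM buffer into fixed-duration frames; the last frame may be shorter. */
function chunkPcm(pcm: Uint8Array, frameMs: number): Uint8Array[] {
  const size = frameSizeBytes(frameMs);
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError(`frameMs=${frameMs} does not map to a whole byte count`);
  }
  const frames: Uint8Array[] = [];
  for (let offset = 0; offset < pcm.length; offset += size) {
    // subarray creates a view, so no audio bytes are copied here.
    frames.push(pcm.subarray(offset, offset + size));
  }
  return frames;
}

frameSizeBytes(100); // 3200 bytes per 100 ms frame
frameSizeBytes(500); // 16000 bytes per 500 ms frame
```

At these settings one second of audio is 32,000 bytes, so the 100-500 ms range corresponds to WebSocket payloads of roughly 3 to 16 KB per frame.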
Data persistence
- Raw audio is treated as ephemeral and is not written to D1.
- The following are persisted as session events:
  - transcript partials/finals (with timestamps)
  - assistant text output
  - voice generation metadata (voice id, duration, cost/usage counters)
Event schema for transcripts:
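type ULID = string; // ULID-encoded identifier
type RFC3339 = string; // RFC 3339 timestamp string
interface TranscriptEvent {
  eventType: "transcript_partial" | "transcript_final";
  sessionId: ULID;
  sequence: number; // ordering of events within the session
  timestamp: RFC3339;
  role: "user" | "assistant";
  text: string;
  audioOffsetMs: number; // position in audio stream
}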
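As an illustration of how such events might be produced, the sketch below stamps each event with the next sequence number for a session. The `createTranscriptEmitter` factory is hypothetical, the session id is the example value from the ULID specification, and treating `ULID` and `RFC3339` as plain string aliases is an assumption. The schema is restated so the block compiles on its own.

```typescript
// Restated from the schema above so the sketch compiles on its own;
// ULID and RFC3339 are assumed to be plain string aliases.
type ULID = string;
type RFC3339 = string;

interface TranscriptEvent {
  eventType: "transcript_partial" | "transcript_final";
  sessionId: ULID;
  sequence: number;
  timestamp: RFC3339;
  role: "user" | "assistant";
  text: string;
  audioOffsetMs: number;
}

// Hypothetical factory: returns a function that builds events for one
// session, assigning sequence numbers 0, 1, 2, ... in call order.
function createTranscriptEmitter(
  sessionId: ULID,
  now: () => Date = () => new Date(),
) {
  let nextSequence = 0;
  return (
    eventType: TranscriptEvent["eventType"],
    role: TranscriptEvent["role"],
    text: string,
    audioOffsetMs: number,
  ): TranscriptEvent => ({
    eventType,
    sessionId,
    sequence: nextSequence++,
    timestamp: now().toISOString(), // ISO 8601 in UTC, which is also valid RFC 3339
    role,
    text,
    audioOffsetMs,
  });
}

const emit = createTranscriptEmitter("01ARZ3NDEKTSV4RRFFQ69G5FAV");
const partialEvent = emit("transcript_partial", "user", "what is a deriv", 1200);
const finalEvent = emit("transcript_final", "user", "What is a derivative?", 1800);
```

Injecting the clock through `now` keeps the factory deterministic under test; whether partials and finals share one sequence counter is a design choice this ADR does not settle.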
Quotas and gating
- Voice requires a signed-in user.
- Quota checks occur before accepting sustained audio streaming.
- Usage accounting records:
  - number of prompts
  - voice seconds/minutes (tracked separately per session)
- Voice quota resets on the same schedule as text prompts.
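The gating rules above could be expressed as a single check that runs before the DO accepts sustained audio. This is a minimal sketch under stated assumptions: the `VoiceCaller` shape, the `gateVoiceStream` function, and the reason strings are all hypothetical, and the quota is modeled in seconds.

```typescript
// Hypothetical view of the caller as seen by the session DO.
interface VoiceCaller {
  userId: string | null; // null models an anonymous caller
  usedVoiceSeconds: number; // usage in the current quota period
  voiceLimitSeconds: number; // limit for the current quota period
}

type GateDecision =
  | { allowed: true; remainingSeconds: number }
  | { allowed: false; reason: "anonymous" | "quota_exhausted" };

/** Decide whether to accept an audio stream, before any audio is processed. */
function gateVoiceStream(caller: VoiceCaller): GateDecision {
  if (caller.userId === null) {
    return { allowed: false, reason: "anonymous" };
  }
  const remainingSeconds = caller.voiceLimitSeconds - caller.usedVoiceSeconds;
  if (remainingSeconds <= 0) {
    return { allowed: false, reason: "quota_exhausted" };
  }
  // The caller can use remainingSeconds to cap how long the stream stays open.
  return { allowed: true, remainingSeconds };
}
```

Returning the remaining budget lets the DO close a long-lived connection once the budget is spent, instead of checking only at connection time.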
Failure behavior
- If the STT stream fails, the DO notifies the client and falls back to text input.
- If TTS fails, the DO returns a text-only response.
- Client may retry voice input after a brief backoff.
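One way to keep these fallbacks consistent is to map each failure to a single client-facing message. The sketch below is an assumption, not a defined protocol: the message type, the field names, and the 2000 ms default retry hint are all hypothetical.

```typescript
type VoiceFailure = "stt_stream_failed" | "tts_failed";

// Hypothetical message the DO would send to the client over the WebSocket.
interface VoiceFallbackMessage {
  type: "voice_fallback";
  failure: VoiceFailure;
  mode: "text_input" | "text_only_response";
  retryAfterMs: number; // hint for when the client may retry voice input
}

/** Map a pipeline failure to the fallback described in the list above. */
function fallbackFor(
  failure: VoiceFailure,
  retryAfterMs: number = 2000, // assumed default, not specified by this ADR
): VoiceFallbackMessage {
  const mode = failure === "stt_stream_failed" ? "text_input" : "text_only_response";
  return { type: "voice_fallback", failure, mode, retryAfterMs };
}
```

Carrying the retry hint in the message keeps the backoff policy on the server, so it can be tuned without a client release.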
Voice selection
- Users may select a preferred voice from available ElevenLabs voices.
- Voice preference is stored per-user in D1 (tn-sessions).
- A default voice is used if no user preference is set.
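Voice resolution can be sketched as a small pure function. The `resolveVoiceId` name and its parameters are hypothetical. Falling back to the default when a stored preference is no longer in the available list is an assumption that goes beyond the rules above.

```typescript
/**
 * Pick the voice to use for TTS.
 * `preferred` is the stored per-user preference, or null when none is set.
 */
function resolveVoiceId(
  preferred: string | null,
  available: readonly string[],
  defaultVoiceId: string,
): string {
  // Assumption: a preference for a voice that is no longer offered is ignored.
  if (preferred !== null && available.indexOf(preferred) !== -1) {
    return preferred;
  }
  return defaultVoiceId;
}
```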
Consequences
- Session DO implementation must support bidirectional streaming.
- We need clear event types for transcript segments and voice metadata.
- Quota enforcement must account for sustained connections.
- Additional load on ElevenLabs API requires careful cost monitoring.
Alternatives considered
- Client-side direct-to-ElevenLabs: quotas are harder to enforce, vendor boundaries leak to the client, and abuse control is weaker.
- Storing raw audio: higher privacy risk and storage costs.
- Batch transcription instead of streaming: higher latency, worse UX.
Implementation notes
- Use the ElevenLabs streaming API when available; otherwise, fall back to sending chunked audio.
- Implement exponential backoff for API rate limits.
- Log errors without including any audio content.
- Voice support in the CLI is optional for the MVP and can be added in Phase 2.
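For the exponential backoff noted above, one common shape is capped exponential delay with full jitter. The sketch below uses that variant as an assumption: the function name, the 250 ms base, and the 8000 ms cap are hypothetical values, not settings chosen by this ADR.

```typescript
/**
 * Delay before retry number `attempt` (0-based), using capped exponential
 * backoff with full jitter: a random value in [0, min(maxMs, baseMs * 2^attempt)).
 */
function backoffDelayMs(
  attempt: number,
  baseMs: number = 250, // assumed base delay
  maxMs: number = 8000, // assumed cap
  random: () => number = Math.random, // injectable for deterministic tests
): number {
  const ceiling = Math.min(maxMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}
```

Jitter spreads retries from many sessions over time, which matters when a rate limit affects all of them at once; the cap keeps a long failure streak from producing unbounded waits.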