Architecture
Four goroutines, four bounded channels, one voice loop.
Each session runs an isolated pipeline. No shared mutable state, no unbounded queues. The server is written in Go with Pion WebRTC for predictable low-latency performance.
Session Pipeline
Browser: Mic / Speaker
→ Reader: RTP → Opus decode
→ Inbound: STT + VAD
→ Agent: LLM + Tools + TTS
→ Sender: Opus encode → RTP
→ Browser: Playback
How a voice turn works
WHIP Signaling
The browser gathers ICE candidates, creates a DataChannel, and sends a single SDP offer via POST to /whip. The server responds with an SDP answer and a session UUID. No WebSocket is needed.
RTP Decode & VAD
Opus packets arrive over WebRTC. The Reader goroutine decodes them into 20ms PCM frames and pushes them onto a bounded channel (50 frames). Inbound feeds the frames to STT and runs energy-based VAD for barge-in detection.
STT + LLM + Tools
Deepgram or OpenAI Whisper transcribes in real time. The Agent goroutine streams transcripts to the LLM with conversation history and invokes plugins via function calling when the LLM requests a tool.
TTS + RTP Encode
The LLM response is synthesized sentence-by-sentence through Cartesia, Deepgram, or ElevenLabs. The Sender goroutine encodes PCM to Opus and writes RTP with wall-clock pacing.
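Sentence-by-sentence synthesis implies some way of cutting the streamed LLM text at sentence boundaries. A minimal sketch follows, assuming a simple rule (a `.`, `!`, or `?` followed by a space ends a sentence). The `sentenceSplitter` type is hypothetical, and a naive rule like this will also split after abbreviations such as "e.g.".

```go
package main

import (
	"fmt"
	"strings"
)

// sentenceSplitter buffers streamed text and emits complete sentences.
type sentenceSplitter struct {
	buf strings.Builder
}

// Push appends a chunk and returns the sentences it completed, if any.
func (s *sentenceSplitter) Push(chunk string) []string {
	s.buf.WriteString(chunk)
	text := s.buf.String()

	var out []string
	start := 0
	// Stop one byte early: a terminator only counts once the next byte is known.
	for i := 0; i < len(text)-1; i++ {
		c := text[i]
		if (c == '.' || c == '!' || c == '?') && text[i+1] == ' ' {
			if sent := strings.TrimSpace(text[start : i+1]); sent != "" {
				out = append(out, sent)
			}
			start = i + 1
		}
	}

	// Keep the unfinished tail for the next chunk.
	s.buf.Reset()
	s.buf.WriteString(text[start:])
	return out
}

// Flush returns whatever text remains when the stream ends.
func (s *sentenceSplitter) Flush() string {
	rest := strings.TrimSpace(s.buf.String())
	s.buf.Reset()
	return rest
}

func main() {
	var s sentenceSplitter
	for _, chunk := range []string{"Let me check", " that for you. It", " is sunny!"} {
		for _, sent := range s.Push(chunk) {
			fmt.Printf("synthesize: %q\n", sent)
		}
	}
	if rest := s.Flush(); rest != "" {
		fmt.Printf("synthesize: %q\n", rest)
	}
}
```

Emitting each sentence as soon as it completes lets TTS start on the first sentence while the LLM is still generating the rest.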
Goroutine Detail
Each goroutine owns one stage of the pipeline. They communicate exclusively through bounded channels — no locks, no shared mutable state.
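As a rough sketch of how the four channels might be declared, the struct below uses the channel names and capacities given in this section. The field types and the capacity of `interruptCh` are assumptions, since the text does not state them.

```go
package main

import "fmt"

// PCMFrame and TranscriptEvent are simplified stand-ins for the real types.
type PCMFrame struct {
	Samples []int16 // 960 samples = 20 ms at 48 kHz mono
}

type TranscriptEvent struct {
	Text  string
	Final bool
}

// Pipeline holds the bounded channels that connect the four goroutines.
type Pipeline struct {
	inPCMCh      chan PCMFrame        // Reader  -> Inbound
	transcriptCh chan TranscriptEvent // Inbound -> Agent
	outPCMCh     chan PCMFrame        // Agent   -> Sender
	interruptCh  chan struct{}        // Inbound -> Agent (barge-in signal)
}

// New allocates every channel with a fixed capacity, so no queue can grow without bound.
func New() *Pipeline {
	return &Pipeline{
		inPCMCh:      make(chan PCMFrame, 50),
		transcriptCh: make(chan TranscriptEvent, 10),
		outPCMCh:     make(chan PCMFrame, 100),
		interruptCh:  make(chan struct{}, 1), // capacity 1 is an assumption
	}
}

func main() {
	p := New()
	fmt.Println(cap(p.inPCMCh), cap(p.transcriptCh), cap(p.outPCMCh), cap(p.interruptCh))
}
```

Each channel has exactly one producing stage and one consuming stage, which is what lets the stages avoid locks.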
runReader() in inbound.go → inPCMCh (50 frames)
Reads RTP packets from the WebRTC track, decodes Opus to 20ms PCM frames, and pushes them onto the inbound PCM channel.
runInbound() in inbound.go → transcriptCh (10 items)
Consumes PCM frames, feeds them to the STT provider, and runs energy-based VAD. When speech ends, emits a TranscriptEvent. Detects barge-in when the user speaks while the agent is talking.
runAgent() in agent.go → outPCMCh (100 frames)
Receives transcripts, streams them to the LLM with conversation history, invokes plugins via function calling, then synthesizes the response sentence-by-sentence through TTS. Handles barge-in cancellation.
runSender() in outbound.go → WebRTC local track
Consumes outbound PCM frames, encodes to Opus, and writes RTP packets with wall-clock pacing. Marks new talkspurts for proper audio playback.
Barge-in: Interrupt the agent mid-sentence
Energy-based VAD detects speech while the agent is talking. When the user interrupts:
1. VAD detects speech onset (RMS > 500 for 60ms)
2. Inbound sends signal on interruptCh
3. Agent cancels the current LLM/TTS context
4. Agent drains outPCMCh to stop playback
5. Pipeline resumes listening for the new utterance
VAD Parameters
Tuned for natural conversation.
Minimum energy to trigger speech detection
Energy must exceed threshold for this long
Silence must persist before confirming end
WHIP Signaling
StreamCoreAI uses the WHIP standard (RFC 9725) for signaling. The client sends an SDP offer via HTTP POST, the server responds with an SDP answer. ICE, DTLS, and SRTP are negotiated automatically. A DataChannel carries transcripts and events bidirectionally.
POST /whip HTTP/1.1
Content-Type: application/sdp
v=0
o=- 0 0 IN IP4 0.0.0.0
s=-
t=0 0
a=group:BUNDLE 0 1
m=audio 9 UDP/TLS/RTP/SAVPF 111
a=rtpmap:111 opus/48000/2
...
→ 201 Created
Content-Type: application/sdp
Location: /whip/{session-uuid}
DataChannel Events
The server sends structured JSON events over the WebRTC DataChannel:
// Real-time transcript (partial or final)
{"type": "transcript", "text": "What's the weather?", "final": true}
// LLM response text as it streams
{"type": "response", "text": "Let me check that for you."}
// Pipeline latency metrics
{"type": "timing", "stage": "llm_first_token", "ms": 312}
// Error events
{"type": "error", "message": "TTS provider timeout"}
Key Source Files
pipeline/pipeline.go: Pipeline struct, channels, New(), Start(), Stop()
pipeline/inbound.go: runReader + runInbound goroutines
pipeline/agent.go: runAgent, respond, synthesizeSentences
pipeline/types.go: PCMFrame, TranscriptEvent, message types
vad/vad.go: Energy-based VAD (RMS threshold, onset, offset)
peer/peer.go: WebRTC peer, RemoteTrackCh, LocalTrack
session/session.go: Session wiring, AddPeer, pipeline lifecycle
config/config.go: TOML configuration loader
Audio Constants
| Parameter | Value | Notes |
|---|---|---|
| Sample Rate | 48000 Hz | Opus native rate |
| Frame Size | 960 samples | 20ms per frame |
| Channels | 1 (mono) | Mono audio |
| Codec | Opus | WebRTC standard |
| Plugin Timeout | 30 seconds | Per execute call |