Architecture

Four goroutines, four bounded channels, one voice loop.

Each session runs an isolated pipeline. No shared mutable state, no unbounded queues. The server is written in Go with Pion WebRTC for predictable low-latency performance.

Session Pipeline

Browser: Mic / Speaker
Reader: RTP → Opus decode
Inbound: STT + VAD
Agent: LLM + Tools + TTS
Sender: Opus encode → RTP
Browser: Playback

How a voice turn works

01 WHIP Signaling

The browser gathers ICE candidates, creates a DataChannel, and sends its SDP offer in a single POST to /whip. The server responds with an SDP answer and a session UUID. No WebSocket is needed.

02 RTP Decode & VAD

Opus packets arrive over WebRTC. The Reader goroutine decodes them into 20ms PCM frames and pushes them onto a bounded channel (50 frames, about one second of audio). The Inbound goroutine feeds the frames to STT and runs energy-based VAD for barge-in detection.

03 STT + LLM + Tools

Deepgram or OpenAI Whisper transcribes in real time. The Agent goroutine streams transcripts to the LLM with conversation history and invokes plugins via function calling when the LLM requests a tool.

04 TTS + RTP Encode

The LLM response is synthesized sentence-by-sentence through Cartesia, Deepgram, or ElevenLabs. The Sender goroutine encodes PCM to Opus and writes RTP with wall-clock pacing.

Goroutine Detail

Each goroutine owns one stage of the pipeline. They communicate exclusively through bounded channels: no locks, no shared mutable state.
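The wiring described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the channel names and capacities come from this page, but `runPipeline`, the placeholder stage bodies (one transcript per five frames, two output frames per transcript), and the shutdown-by-closing-channels scheme are assumptions made only to keep the example runnable.

```go
package main

import (
	"fmt"
	"sync"
)

// PCMFrame is one 20 ms frame of 48 kHz mono audio (960 samples).
type PCMFrame []int16

// runPipeline wires four stages together with bounded channels and
// returns how many transcripts were produced and frames were sent.
// The stage bodies are placeholders for decode, STT, LLM/TTS and encode.
func runPipeline(nFrames int) (transcripts, sent int) {
	inPCMCh := make(chan PCMFrame, 50)    // Reader  -> Inbound
	transcriptCh := make(chan string, 10) // Inbound -> Agent
	outPCMCh := make(chan PCMFrame, 100)  // Agent   -> Sender

	var wg sync.WaitGroup
	wg.Add(4)

	go func() { // Reader: stands in for RTP read + Opus decode.
		defer wg.Done()
		defer close(inPCMCh)
		for i := 0; i < nFrames; i++ {
			inPCMCh <- make(PCMFrame, 960)
		}
	}()

	go func() { // Inbound: pretends every 5 frames form one utterance.
		defer wg.Done()
		defer close(transcriptCh)
		n := 0
		for range inPCMCh {
			n++
			if n%5 == 0 {
				transcriptCh <- fmt.Sprintf("utterance %d", n/5)
				transcripts++
			}
		}
	}()

	go func() { // Agent: pretends each reply is two frames of TTS audio.
		defer wg.Done()
		defer close(outPCMCh)
		for range transcriptCh {
			for i := 0; i < 2; i++ {
				outPCMCh <- make(PCMFrame, 960)
			}
		}
	}()

	go func() { // Sender: stands in for Opus encode + RTP write.
		defer wg.Done()
		for range outPCMCh {
			sent++
		}
	}()

	wg.Wait()
	return transcripts, sent
}

func main() {
	transcripts, sent := runPipeline(20)
	fmt.Println("transcripts:", transcripts, "frames sent:", sent)
}
```

Because every channel has a fixed capacity, a slow stage applies backpressure to the stage before it instead of letting a queue grow without bound.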

runReader() (inbound.go)

Output: inPCMCh (50 frames)

Reads RTP packets from the WebRTC track, decodes Opus to 20ms PCM frames, and pushes them onto the inbound PCM channel.

runInbound() (inbound.go)

Output: transcriptCh (10 items)

Consumes PCM frames, feeds them to the STT provider, and runs energy-based VAD. When speech ends, emits a TranscriptEvent. Detects barge-in when the user speaks while the agent is talking.

runAgent() (agent.go)

Output: outPCMCh (100 frames)

Receives transcripts, streams them to the LLM with conversation history, invokes plugins via function calling, then synthesizes the response sentence-by-sentence through TTS. Handles barge-in cancellation.
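Sentence-by-sentence synthesis implies cutting the streamed LLM text at sentence boundaries before handing it to TTS. The sketch below shows one naive way to do that. `splitSentences` and `collect` are hypothetical names, and splitting on `.`, `!` and `?` is an assumption: it would also cut at decimals such as "3.5", so a real splitter may be more careful.

```go
package main

import (
	"fmt"
	"strings"
)

// splitSentences reads streamed text chunks and emits each complete
// sentence as soon as its terminating punctuation arrives. Any trailing
// text without punctuation is flushed when the input channel closes.
func splitSentences(chunks <-chan string, sentences chan<- string) {
	defer close(sentences)
	var buf strings.Builder
	for chunk := range chunks {
		for _, r := range chunk {
			buf.WriteRune(r)
			if r == '.' || r == '!' || r == '?' {
				if s := strings.TrimSpace(buf.String()); s != "" {
					sentences <- s
				}
				buf.Reset()
			}
		}
	}
	if s := strings.TrimSpace(buf.String()); s != "" {
		sentences <- s
	}
}

// collect runs splitSentences over a fixed list of chunks and gathers
// the resulting sentences; it exists only to make the sketch testable.
func collect(chunks []string) []string {
	in := make(chan string, len(chunks))
	for _, c := range chunks {
		in <- c
	}
	close(in)

	out := make(chan string, 10)
	go splitSentences(in, out)

	var res []string
	for s := range out {
		res = append(res, s)
	}
	return res
}

func main() {
	for _, s := range collect([]string{"Let me ", "check. It is", " sunny"}) {
		fmt.Printf("TTS <- %q\n", s)
	}
}
```

Emitting the first sentence before the LLM has finished the whole response lets synthesis start early, which is the point of working sentence-by-sentence.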

runSender() (outbound.go)

Output: WebRTC local track

Consumes outbound PCM frames, encodes to Opus, and writes RTP packets with wall-clock pacing. Marks new talkspurts for proper audio playback.

Barge-in: Interrupt the agent mid-sentence

Energy-based VAD detects speech while the agent is talking. When the user interrupts:

  1. VAD detects speech onset (RMS > 500 for 60ms)
  2. Inbound sends signal on interruptCh
  3. Agent cancels the current LLM/TTS context
  4. Agent drains outPCMCh to stop playback
  5. Pipeline resumes listening for the new utterance

VAD Parameters

Tuned for natural conversation.

RMS Threshold: 500

Minimum energy to trigger speech detection

Onset Duration: 60ms

Energy must exceed threshold for this long

Offset Duration: 300ms

Silence must persist before confirming end
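To show how these three parameters interact, here is a minimal energy-based VAD sketch. It is not the contents of vad/vad.go: the `VAD` type, `Process`, and `constFrame` are hypothetical, and the code assumes 16-bit PCM samples and 20ms frames, so onset needs three consecutive loud frames and offset needs fifteen consecutive quiet ones.

```go
package main

import (
	"fmt"
	"math"
)

const (
	rmsThreshold = 500.0 // minimum RMS energy counted as speech
	frameMs      = 20    // duration of one PCM frame
	onsetMs      = 60    // loud time required before speech starts
	offsetMs     = 300   // quiet time required before speech ends
)

// rms returns the root-mean-square amplitude of one PCM frame.
func rms(frame []int16) float64 {
	if len(frame) == 0 {
		return 0
	}
	var sum float64
	for _, s := range frame {
		sum += float64(s) * float64(s)
	}
	return math.Sqrt(sum / float64(len(frame)))
}

// VAD tracks how long the signal has been continuously loud or quiet.
type VAD struct {
	speaking bool
	loudMs   int
	quietMs  int
}

// Process consumes one frame and reports whether speech just started
// or just ended on this frame.
func (v *VAD) Process(frame []int16) (started, ended bool) {
	if rms(frame) > rmsThreshold {
		v.loudMs += frameMs
		v.quietMs = 0
		if !v.speaking && v.loudMs >= onsetMs {
			v.speaking = true
			return true, false
		}
		return false, false
	}
	v.quietMs += frameMs
	v.loudMs = 0
	if v.speaking && v.quietMs >= offsetMs {
		v.speaking = false
		return false, true
	}
	return false, false
}

// constFrame builds a 960-sample frame with a constant amplitude.
func constFrame(amp int16) []int16 {
	f := make([]int16, 960)
	for i := range f {
		f[i] = amp
	}
	return f
}

func main() {
	v := &VAD{}
	// 5 loud frames followed by 20 silent ones.
	for i := 0; i < 25; i++ {
		frame := constFrame(0)
		if i < 5 {
			frame = constFrame(1000)
		}
		started, ended := v.Process(frame)
		if started || ended {
			fmt.Printf("frame %d: started=%t ended=%t\n", i, started, ended)
		}
	}
}
```

The onset window filters out short clicks, while the longer offset window keeps natural pauses inside a sentence from ending the utterance early.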

WHIP Signaling

StreamCoreAI uses the WHIP standard (RFC 9725) for signaling. The client sends an SDP offer via HTTP POST, and the server responds with an SDP answer. ICE, DTLS, and SRTP are negotiated automatically. A DataChannel carries transcripts and events bidirectionally.

POST /whip HTTP/1.1
Content-Type: application/sdp

v=0
o=- 0 0 IN IP4 0.0.0.0
s=-
t=0 0
a=group:BUNDLE 0 1
m=audio 9 UDP/TLS/RTP/SAVPF 111
a=rtpmap:111 opus/48000/2
...

→ 201 Created
   Content-Type: application/sdp
   Location: /whip/{session-uuid}

DataChannel Events

The server sends structured JSON events over the WebRTC DataChannel:

// Real-time transcript (partial or final)
{"type": "transcript", "text": "What's the weather?", "final": true}

// LLM response text as it streams
{"type": "response", "text": "Let me check that for you."}

// Pipeline latency metrics
{"type": "timing", "stage": "llm_first_token", "ms": 312}

// Error events
{"type": "error", "message": "TTS provider timeout"}
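A client written in Go could decode these events into a single struct and switch on the type field. This is a sketch based only on the four examples above: the `Event` struct and `describe` are hypothetical, and the real events may carry fields not shown here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Event covers the fields that appear in the example messages above.
type Event struct {
	Type    string `json:"type"`
	Text    string `json:"text,omitempty"`
	Final   bool   `json:"final,omitempty"`
	Stage   string `json:"stage,omitempty"`
	Ms      int    `json:"ms,omitempty"`
	Message string `json:"message,omitempty"`
}

// describe decodes one DataChannel message and returns a short summary.
func describe(raw []byte) (string, error) {
	var ev Event
	if err := json.Unmarshal(raw, &ev); err != nil {
		return "", err
	}
	switch ev.Type {
	case "transcript":
		return fmt.Sprintf("transcript final=%t %q", ev.Final, ev.Text), nil
	case "response":
		return fmt.Sprintf("response %q", ev.Text), nil
	case "timing":
		return fmt.Sprintf("timing %s=%dms", ev.Stage, ev.Ms), nil
	case "error":
		return "error " + ev.Message, nil
	default:
		return "", fmt.Errorf("unknown event type %q", ev.Type)
	}
}

func main() {
	messages := []string{
		`{"type": "transcript", "text": "What's the weather?", "final": true}`,
		`{"type": "response", "text": "Let me check that for you."}`,
		`{"type": "timing", "stage": "llm_first_token", "ms": 312}`,
		`{"type": "error", "message": "TTS provider timeout"}`,
	}
	for _, m := range messages {
		summary, err := describe([]byte(m))
		if err != nil {
			fmt.Println("decode failed:", err)
			continue
		}
		fmt.Println(summary)
	}
}
```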

Key Source Files

pipeline/pipeline.go

Pipeline struct, channels, New(), Start(), Stop()

pipeline/inbound.go

runReader + runInbound goroutines

pipeline/agent.go

runAgent, respond, synthesizeSentences

pipeline/types.go

PCMFrame, TranscriptEvent, message types

vad/vad.go

Energy-based VAD (RMS threshold, onset, offset)

peer/peer.go

WebRTC peer, RemoteTrackCh, LocalTrack

session/session.go

Session wiring, AddPeer, pipeline lifecycle

config/config.go

TOML configuration loader

Audio Constants

Parameter        Value          Notes
Sample Rate      48000 Hz       Opus native rate
Frame Size       960 samples    20ms per frame
Channels         1 (mono)       Mono audio
Codec            Opus           WebRTC standard
Plugin Timeout   30 seconds     Per execute call