Architecture

Four goroutines, four bounded channels, one voice loop.

Each session runs an isolated pipeline. No shared mutable state, no unbounded queues. The server is written in Go with Pion WebRTC for predictable low-latency performance.

Session Pipeline

Browser: Mic / Speaker
→ Reader: RTP → Opus decode
→ Inbound: STT + VAD
→ Agent: LLM + Tools + TTS
→ Sender: Opus encode → RTP
→ Browser: Playback

How a voice turn works

WHIP Signaling

The browser gathers ICE candidates, creates a DataChannel, and sends a single SDP offer in an HTTP POST to /whip. The server responds with an SDP answer and a session UUID. No WebSocket is needed.

RTP Decode & VAD

Opus packets arrive over WebRTC. The Reader goroutine decodes to 20ms PCM frames on a bounded channel (50 frames). Inbound feeds them to STT and runs energy-based VAD for barge-in detection.

STT + LLM + Tools

Deepgram or OpenAI Whisper transcribes in real time. The Agent goroutine streams transcripts to the LLM with conversation history and invokes plugins via function calling when the LLM requests a tool.
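A tool call that never returns would stall the whole voice turn, so each plugin invocation benefits from a deadline. The sketch below shows one way to bound a call with `context.WithTimeout`, using the 30-second Plugin Timeout listed under Audio Constants. The `Plugin` interface, `callPlugin`, `sleepyPlugin`, and `demo` are hypothetical names for illustration; the actual plugin API may look different.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Plugin is a hypothetical interface used only for this sketch.
type Plugin interface {
	Execute(ctx context.Context, args string) (string, error)
}

// callPlugin runs one Execute call and gives up once timeout has elapsed.
func callPlugin(parent context.Context, p Plugin, args string, timeout time.Duration) (string, error) {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()

	type result struct {
		out string
		err error
	}
	done := make(chan result, 1) // buffered so a late plugin never blocks forever
	go func() {
		out, err := p.Execute(ctx, args)
		done <- result{out, err}
	}()

	select {
	case r := <-done:
		return r.out, r.err
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// sleepyPlugin is a test double that answers after a fixed delay.
type sleepyPlugin struct{ delay time.Duration }

func (s sleepyPlugin) Execute(ctx context.Context, args string) (string, error) {
	select {
	case <-time.After(s.delay):
		return "result for " + args, nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// demo exercises both outcomes: a fast plugin and one that exceeds its deadline.
func demo() (fast string, slowTimedOut bool) {
	fast, _ = callPlugin(context.Background(), sleepyPlugin{time.Millisecond}, "weather", 30*time.Second)
	_, err := callPlugin(context.Background(), sleepyPlugin{time.Second}, "weather", 20*time.Millisecond)
	return fast, errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println(demo())
}
```

Passing the derived context into `Execute` lets a well-behaved plugin stop its own work when the deadline fires, rather than running on after the caller has moved on.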

TTS + RTP Encode

The LLM response is synthesized sentence-by-sentence through Cartesia, Deepgram, or ElevenLabs. The Sender goroutine encodes PCM to Opus and writes RTP with wall-clock pacing.

Goroutine Detail

Each goroutine owns one stage of the pipeline. They communicate exclusively through bounded channels — no locks, no shared mutable state.
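As a rough picture of how the stages connect, the sketch below declares the four channels with the capacities given on this page. The field layouts of `PCMFrame` and `TranscriptEvent`, the constructor name `NewPipeline`, and the capacity of `interruptCh` are assumptions; only the channel names and the 50/10/100 sizes come from the text.

```go
package main

import "fmt"

// PCMFrame and TranscriptEvent reuse type names listed for pipeline/types.go.
// Their fields here are assumptions made for the sketch.
type PCMFrame struct {
	Samples []int16 // assumed 16-bit mono PCM, 960 samples per 20 ms frame
}

type TranscriptEvent struct {
	Text  string
	Final bool
}

// Pipeline holds the channels that connect the four stages.
type Pipeline struct {
	inPCMCh      chan PCMFrame        // Reader  -> Inbound
	transcriptCh chan TranscriptEvent // Inbound -> Agent
	outPCMCh     chan PCMFrame        // Agent   -> Sender
	interruptCh  chan struct{}        // Inbound -> Agent (barge-in signal)
}

// NewPipeline allocates every channel with a fixed capacity.
func NewPipeline() *Pipeline {
	return &Pipeline{
		inPCMCh:      make(chan PCMFrame, 50),
		transcriptCh: make(chan TranscriptEvent, 10),
		outPCMCh:     make(chan PCMFrame, 100),
		interruptCh:  make(chan struct{}, 1), // capacity is an assumption
	}
}

func main() {
	p := NewPipeline()
	fmt.Println(cap(p.inPCMCh), cap(p.transcriptCh), cap(p.outPCMCh), cap(p.interruptCh))
}
```

Because each channel has exactly one producing stage and one consuming stage, ownership of a frame passes along with the send and no mutex is needed around it.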

runReader() in inbound.go

→ inPCMCh (50 frames)

Reads RTP packets from the WebRTC track, decodes Opus to 20ms PCM frames, and pushes them onto the inbound PCM channel.
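A bounded channel forces a decision when the consumer falls behind: block the producer or drop the frame. This page does not say which policy the Reader uses, so the sketch below only shows the non-blocking building block and leaves the choice to the caller. `tryPush` is a hypothetical helper, and the `PCMFrame` fields are assumed.

```go
package main

import "fmt"

// PCMFrame fields are an assumption for this sketch.
type PCMFrame struct{ Samples []int16 }

// tryPush enqueues a frame without blocking. It reports false when the
// channel is full, leaving the drop-or-wait decision to the caller.
func tryPush(ch chan<- PCMFrame, f PCMFrame) bool {
	select {
	case ch <- f:
		return true
	default:
		return false
	}
}

func main() {
	ch := make(chan PCMFrame, 2) // tiny capacity to make the full case visible
	for i := 0; i < 3; i++ {
		fmt.Println(tryPush(ch, PCMFrame{}))
	}
}
```

For live audio, dropping a frame is often preferable to stalling the RTP read loop, since late audio has little value; whether that trade-off applies here is a design decision the source does not state.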

runInbound() in inbound.go

→ transcriptCh (10 items)

Consumes PCM frames, feeds them to the STT provider, and runs energy-based VAD. When speech ends, emits a TranscriptEvent. Detects barge-in when the user speaks while the agent is talking.

runAgent() in agent.go

→ outPCMCh (100 frames)

Receives transcripts, streams them to the LLM with conversation history, invokes plugins via function calling, then synthesizes the response sentence-by-sentence through TTS. Handles barge-in cancellation.
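Sentence-by-sentence synthesis means TTS can start on the first sentence while the LLM is still producing the rest. One way to do that is to buffer streamed tokens and release text whenever a sentence boundary appears. The sketch below uses a deliberately naive rule (a `.`, `!`, or `?` followed by a space); `SentenceBuffer` is a hypothetical type and not necessarily how synthesizeSentences works.

```go
package main

import (
	"fmt"
	"strings"
)

// SentenceBuffer accumulates streamed tokens and releases complete sentences.
type SentenceBuffer struct {
	buf strings.Builder
}

// Push appends a token and returns any sentences it completed. A terminator
// only counts once the following space has arrived, so a trailing "." at the
// end of the buffer is held back until more text (or Flush) settles it.
func (s *SentenceBuffer) Push(token string) []string {
	s.buf.WriteString(token)
	text := s.buf.String()

	var out []string
	start := 0
	for i := 0; i+1 < len(text); i++ {
		if strings.IndexByte(".!?", text[i]) >= 0 && text[i+1] == ' ' {
			if sent := strings.TrimSpace(text[start : i+1]); sent != "" {
				out = append(out, sent)
			}
			start = i + 1
		}
	}

	// Keep only the unfinished tail for the next call.
	s.buf.Reset()
	s.buf.WriteString(text[start:])
	return out
}

// Flush returns whatever text remains when the stream ends.
func (s *SentenceBuffer) Flush() string {
	rest := strings.TrimSpace(s.buf.String())
	s.buf.Reset()
	return rest
}

func main() {
	var sb SentenceBuffer
	for _, tok := range []string{"Let me check", " that.", " It is", " sunny!", " Bye"} {
		for _, sentence := range sb.Push(tok) {
			fmt.Println("synthesize:", sentence)
		}
	}
	fmt.Println("synthesize:", sb.Flush())
}
```

A production splitter would also need to handle abbreviations, decimals, and languages without space-delimited sentences.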

runSender() in outbound.go

→ WebRTC local track

Consumes outbound PCM frames, encodes to Opus, and writes RTP packets with wall-clock pacing. Marks new talkspurts for proper audio playback.
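Wall-clock pacing can be sketched by computing each frame's send deadline from the start of the talkspurt instead of sleeping a fixed 20 ms after every write, which would accumulate drift. The `Pacer` and `Header` types below are hypothetical and only track the RTP fields involved (sequence number, timestamp, marker bit); Opus encoding and the actual track write are left out. As a simplification, the timestamp here does not advance across silent gaps, which a real sender would normally account for.

```go
package main

import (
	"fmt"
	"time"
)

const (
	frameDuration   = 20 * time.Millisecond
	samplesPerFrame = 960 // 20 ms at 48 kHz
)

// Header carries the RTP fields this sketch cares about.
type Header struct {
	Seq       uint16
	Timestamp uint32
	Marker    bool // set on the first packet of a talkspurt
}

// Pacer tracks send deadlines and header fields for one outbound stream.
type Pacer struct {
	start     time.Time // wall-clock start of the current talkspurt
	sent      int       // frames scheduled in the current talkspurt
	seq       uint16
	timestamp uint32
	idle      bool
}

func NewPacer() *Pacer { return &Pacer{idle: true} }

// Next returns the header for the next frame and the time it is due.
func (p *Pacer) Next() (Header, time.Time) {
	if p.idle {
		p.start = time.Now()
		p.sent = 0
	}
	h := Header{Seq: p.seq, Timestamp: p.timestamp, Marker: p.idle}
	due := p.start.Add(time.Duration(p.sent) * frameDuration)

	p.idle = false
	p.sent++
	p.seq++
	p.timestamp += samplesPerFrame
	return h, due
}

// EndTalkspurt marks the stream idle so the next frame sets the marker bit.
func (p *Pacer) EndTalkspurt() { p.idle = true }

func main() {
	p := NewPacer()
	for i := 0; i < 3; i++ {
		h, due := p.Next()
		time.Sleep(time.Until(due)) // returns immediately if already due
		fmt.Printf("seq=%d ts=%d marker=%v\n", h.Seq, h.Timestamp, h.Marker)
	}
}
```

The marker bit tells the receiver's jitter buffer that a new talkspurt has begun, so it can resynchronize playout instead of treating the gap as packet loss.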

Barge-in: Interrupt the agent mid-sentence

Energy-based VAD detects speech while the agent is talking. When the user interrupts:

1. VAD detects speech onset (RMS > 500 for 60ms)
2. Inbound sends signal on interruptCh
3. Agent cancels the current LLM/TTS context
4. Agent drains outPCMCh to stop playback
5. Pipeline resumes listening for the new utterance
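Steps 3 and 4 above can be sketched as a context cancellation followed by a non-blocking drain. `handleInterrupt` and `drain` are hypothetical helper names, and the `PCMFrame` fields are assumed; the real Agent may structure this differently.

```go
package main

import (
	"context"
	"fmt"
)

// PCMFrame fields are an assumption for this sketch.
type PCMFrame struct{ Samples []int16 }

// drain discards every frame currently queued and returns how many it dropped.
func drain(ch chan PCMFrame) int {
	n := 0
	for {
		select {
		case <-ch:
			n++
		default:
			return n // channel is empty; stop without blocking
		}
	}
}

// handleInterrupt cancels in-flight LLM/TTS work, then clears queued audio.
// Cancelling first stops the producer from refilling the channel mid-drain.
func handleInterrupt(cancel context.CancelFunc, outPCMCh chan PCMFrame) int {
	cancel()
	return drain(outPCMCh)
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	out := make(chan PCMFrame, 100)
	for i := 0; i < 5; i++ {
		out <- PCMFrame{}
	}

	dropped := handleInterrupt(cancel, out)
	fmt.Println(dropped, ctx.Err())
}
```

With up to 100 frames of 20 ms queued, skipping the drain could leave as much as two seconds of stale agent speech playing after the user has started talking.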

VAD Parameters

Tuned for natural conversation.

RMS Threshold: 500

Minimum energy to trigger speech detection

Onset Duration: 60ms

Energy must exceed threshold for this long

Offset Duration: 300ms

Silence must persist before confirming end
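With 20 ms frames, these parameters translate to 3 consecutive loud frames for onset and 15 consecutive quiet frames for offset. The sketch below is a minimal energy-based detector built on that arithmetic. It assumes 16-bit PCM samples (which is the scale on which an RMS of 500 is a plausible threshold); the `VAD` type, `Event` values, and `constFrame` helper are hypothetical and not taken from vad/vad.go.

```go
package main

import (
	"fmt"
	"math"
)

const (
	rmsThreshold = 500.0
	frameMs      = 20
	onsetFrames  = 60 / frameMs  // 3 frames above threshold
	offsetFrames = 300 / frameMs // 15 frames at or below threshold
)

type Event int

const (
	None Event = iota
	SpeechStart
	SpeechEnd
)

// rms returns the root-mean-square energy of one frame.
func rms(samples []int16) float64 {
	if len(samples) == 0 {
		return 0
	}
	var sum float64
	for _, s := range samples {
		v := float64(s)
		sum += v * v
	}
	return math.Sqrt(sum / float64(len(samples)))
}

// VAD counts consecutive loud and quiet frames to debounce transitions.
type VAD struct {
	speaking    bool
	loud, quiet int
}

// Process consumes one 20 ms frame and reports a state change, if any.
func (v *VAD) Process(frame []int16) Event {
	if rms(frame) > rmsThreshold {
		v.loud++
		v.quiet = 0
	} else {
		v.quiet++
		v.loud = 0
	}
	switch {
	case !v.speaking && v.loud >= onsetFrames:
		v.speaking = true
		return SpeechStart
	case v.speaking && v.quiet >= offsetFrames:
		v.speaking = false
		return SpeechEnd
	}
	return None
}

// constFrame builds a 960-sample frame filled with one value (test helper).
func constFrame(value int16) []int16 {
	f := make([]int16, 960)
	for i := range f {
		f[i] = value
	}
	return f
}

func main() {
	v := &VAD{}
	for i := 0; i < 3; i++ {
		fmt.Println(v.Process(constFrame(1000)) == SpeechStart)
	}
}
```

The onset requirement filters out clicks and short noise bursts, while the longer offset keeps natural pauses inside a sentence from ending the utterance early.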

WHIP Signaling

StreamCoreAI uses the WHIP standard (RFC 9725) for signaling. The client sends an SDP offer via HTTP POST, and the server responds with an SDP answer. ICE, DTLS, and SRTP are negotiated automatically. A DataChannel carries transcripts and events bidirectionally.

POST /whip HTTP/1.1
Content-Type: application/sdp

v=0
o=- 0 0 IN IP4 0.0.0.0
s=-
t=0 0
a=group:BUNDLE 0 1
m=audio 9 UDP/TLS/RTP/SAVPF 111
a=rtpmap:111 opus/48000/2
...

→ 201 Created
   Content-Type: application/sdp
   Location: /whip/{session-uuid}
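The exchange above can be driven from Go with the standard `net/http` client. The sketch below posts an offer, checks for `201 Created`, and returns the answer body together with the `Location` header. `whipOffer` and `newStubServer` are hypothetical names; the stub stands in for the real server so the example runs on its own, and its fixed answer and session path are placeholders. In a real client the offer would come from a WebRTC stack rather than a literal string.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// whipOffer posts an SDP offer and returns the SDP answer and session location.
func whipOffer(client *http.Client, endpoint, offerSDP string) (answer, location string, err error) {
	resp, err := client.Post(endpoint, "application/sdp", strings.NewReader(offerSDP))
	if err != nil {
		return "", "", err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusCreated {
		return "", "", fmt.Errorf("whip: unexpected status %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", "", err
	}
	return string(body), resp.Header.Get("Location"), nil
}

// newStubServer is a stand-in endpoint that returns a placeholder answer.
func newStubServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost || r.Header.Get("Content-Type") != "application/sdp" {
			http.Error(w, "expected POST with application/sdp", http.StatusBadRequest)
			return
		}
		w.Header().Set("Content-Type", "application/sdp")
		w.Header().Set("Location", "/whip/example-session") // placeholder ID
		w.WriteHeader(http.StatusCreated)
		io.WriteString(w, "v=0\r\n") // placeholder answer, not a usable SDP
	}))
}

func main() {
	srv := newStubServer()
	defer srv.Close()

	answer, location, err := whipOffer(srv.Client(), srv.URL+"/whip", "v=0\r\n")
	fmt.Printf("%q %s %v\n", answer, location, err)
}
```

The `Location` value identifies the session resource, which under WHIP is also the URL a client sends an HTTP DELETE to when it wants to end the session.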

DataChannel Events

The server sends structured JSON events over the WebRTC DataChannel:

// Real-time transcript (partial or final)
{"type": "transcript", "text": "What's the weather?", "final": true}

// LLM response text as it streams
{"type": "response", "text": "Let me check that for you."}

// Pipeline latency metrics
{"type": "timing", "stage": "llm_first_token", "ms": 312}

// Error events
{"type": "error", "message": "TTS provider timeout"}
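A client can decode all four message shapes into one struct and branch on the `type` field. The sketch below does this with `encoding/json`; the `Event` struct, `parseEvent`, and `describe` are hypothetical, and the choice of `int` for `ms` is an assumption based on the single example value shown.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Event covers the four message shapes; fields absent from a message stay zero.
type Event struct {
	Type    string `json:"type"`
	Text    string `json:"text,omitempty"`
	Final   bool   `json:"final,omitempty"`
	Stage   string `json:"stage,omitempty"`
	Ms      int    `json:"ms,omitempty"` // assumed to be a whole number
	Message string `json:"message,omitempty"`
}

// parseEvent decodes one DataChannel message.
func parseEvent(raw []byte) (Event, error) {
	var e Event
	err := json.Unmarshal(raw, &e)
	return e, err
}

// describe renders an event as a log line, branching on its type.
func describe(e Event) string {
	switch e.Type {
	case "transcript":
		if e.Final {
			return "final transcript: " + e.Text
		}
		return "partial transcript: " + e.Text
	case "response":
		return "agent: " + e.Text
	case "timing":
		return fmt.Sprintf("%s took %d ms", e.Stage, e.Ms)
	case "error":
		return "error: " + e.Message
	default:
		return "unknown event type: " + e.Type
	}
}

func main() {
	messages := []string{
		`{"type": "transcript", "text": "What's the weather?", "final": true}`,
		`{"type": "timing", "stage": "llm_first_token", "ms": 312}`,
	}
	for _, m := range messages {
		e, err := parseEvent([]byte(m))
		if err != nil {
			fmt.Println("decode failed:", err)
			continue
		}
		fmt.Println(describe(e))
	}
}
```

Keeping a default branch means a client keeps working if the server later adds event types it does not yet know about.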

Key Source Files

pipeline/pipeline.go

Pipeline struct, channels, New(), Start(), Stop()

pipeline/inbound.go

runReader + runInbound goroutines

pipeline/agent.go

runAgent, respond, synthesizeSentences

pipeline/types.go

PCMFrame, TranscriptEvent, message types

vad/vad.go

Energy-based VAD (RMS threshold, onset, offset)

peer/peer.go

WebRTC peer, RemoteTrackCh, LocalTrack

session/session.go

Session wiring, AddPeer, pipeline lifecycle

config/config.go

TOML configuration loader

Audio Constants

Parameter	Value	Notes
Sample Rate	`48000 Hz`	Opus native rate
Frame Size	`960 samples`	20ms per frame
Channels	`1 (mono)`	Mono audio
Codec	`Opus`	WebRTC standard
Plugin Timeout	`30 seconds`	Per execute call
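The frame figures in the table follow from one another, and they also show how much audio each bounded channel can hold. The sketch below works through that arithmetic. The constant names are hypothetical, and `BytesPerSample = 2` assumes 16-bit PCM, which the table does not state.

```go
package main

import (
	"fmt"
	"time"
)

const (
	SampleRate      = 48000 // Hz
	SamplesPerFrame = 960
	Channels        = 1
	BytesPerSample  = 2 // assumption: 16-bit PCM
)

// frameDuration derives the frame length from sample count and sample rate.
func frameDuration() time.Duration {
	return time.Duration(SamplesPerFrame) * time.Second / SampleRate
}

// bufferedAudio returns how much audio a channel of the given capacity holds.
func bufferedAudio(frames int) time.Duration {
	return time.Duration(frames) * frameDuration()
}

func main() {
	fmt.Println(frameDuration())                             // 20ms
	fmt.Println(SamplesPerFrame * Channels * BytesPerSample) // 1920 bytes per frame
	fmt.Println(bufferedAudio(50), bufferedAudio(100))       // 1s 2s
}
```

So the 50-frame inbound channel buffers at most one second of microphone audio, and the 100-frame outbound channel at most two seconds of agent speech.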
