Self-Hosted Realtime Engine
The realtime engine runs live voice sessions in a self-hosted deployment. It provides an OpenAI-compatible WebSocket API for real-time voice AI.
What The Realtime Engine Does
The realtime engine is responsible for:
- Accepting authenticated realtime WebSocket connections
- Creating and managing live voice sessions
- Coordinating speech-to-text, model inference, and text-to-speech
- Executing custom tools and actions
- Streaming audio and text responses back to clients
- Managing conversation history and session state
WebSocket Endpoint
The primary endpoint for voice sessions:
```
ws://localhost:8787/v1/realtime
```

In production with TLS:

```
wss://your-domain.com/v1/realtime
```

Connection requires:
- A valid ephemeral token in the `Authorization` header or as a query parameter
- WebSocket protocol upgrade support
Example Connection
```bash
# Using wscat (install with: npm install -g wscat)
wscat -c "ws://localhost:8787/v1/realtime" \
  -H "Authorization: Bearer ek_your_ephemeral_token"
```

```javascript
// Using WebSocket from the Node.js `ws` package
// (the browser WebSocket API does not support custom headers)
const ws = new WebSocket(
  'ws://localhost:8787/v1/realtime',
  [],
  {
    headers: {
      'Authorization': 'Bearer ek_your_ephemeral_token'
    }
  }
);
```

Ephemeral Token Flow
1. Request Token from Core
```bash
curl -X POST http://localhost:3000/vowel/api/generateToken \
  -H "Content-Type: application/json" \
  -H "X-API-Key: vkey_your_bootstrap_key" \
  -d '{
    "appId": "default",
    "provider": "vowel-prime"
  }'
```

Response:

```json
{
  "token": "ek_eyJhbGciOiJIUzI1NiIs...",
  "expiresAt": "2024-01-15T10:30:00Z",
  "wsUrl": "ws://localhost:8787/v1/realtime"
}
```

2. Connect with Token

```bash
wscat -c "ws://localhost:8787/v1/realtime" \
  -H "Authorization: Bearer ek_eyJhbGciOiJIUzI1NiIs..."
```

3. Session Established
Upon successful connection, the engine sends:
```json
{
  "type": "session.created",
  "session": {
    "id": "sess_abc123",
    "model": "openai/gpt-oss-20b",
    "voice": "aura-2-thalia-en"
  }
}
```

HTTP Endpoints
Health Check
```
GET http://localhost:8787/health
```

Response:

```json
{
  "status": "ok",
  "timestamp": "2024-01-15T10:30:00Z"
}
```

Runtime Configuration
The engine exposes HTTP endpoints for managing runtime configuration without restarting:
Get current config:
```bash
curl http://localhost:8787/config \
  -H "Authorization: Bearer ${ENGINE_API_KEY}"
```

Update config:

```bash
curl -X PUT http://localhost:8787/config \
  -H "Authorization: Bearer ${ENGINE_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "llm": {
      "provider": "groq",
      "model": "openai/gpt-oss-20b"
    }
  }'
```

Validate config change:

```bash
curl -X POST http://localhost:8787/config/validate \
  -H "Authorization: Bearer ${ENGINE_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"llm": {"provider": "openrouter"}}'
```

Reload config from disk:

```bash
curl -X POST http://localhost:8787/config/reload \
  -H "Authorization: Bearer ${ENGINE_API_KEY}"
```

List available presets:

```bash
curl http://localhost:8787/presets \
  -H "Authorization: Bearer ${ENGINE_API_KEY}"
```

WebSocket Events
Client → Server Events
| Event | Description |
|---|---|
| `session.update` | Update session configuration |
| `input_audio_buffer.append` | Stream audio chunk (PCM16, 24kHz, mono) |
| `input_audio_buffer.commit` | Signal end of speech |
| `conversation.item.create` | Add text message to conversation |
| `response.create` | Request AI response |
| `response.cancel` | Cancel in-progress response |
Server → Client Events
| Event | Description |
|---|---|
| `session.created` | Session initialized |
| `input_audio_buffer.speech_started` | Speech detected |
| `input_audio_buffer.speech_stopped` | Speech ended |
| `conversation.item.created` | New conversation item |
| `response.text.delta` | Streaming text response |
| `response.audio.delta` | Streaming audio response (PCM16) |
| `response.done` | Response complete |
| `error` | Error occurred |
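The server events above are typically handled with a small dispatcher keyed on `type`. A minimal sketch in Node.js (the accumulator shape and the `delta` field name are illustrative; check the payloads your engine version actually emits):

```javascript
// Minimal reducer over server → client events: tracks the session id,
// accumulates streaming text deltas, and flags completion.
function createEventHandler() {
  const state = { sessionId: null, text: '', done: false };

  function handleEvent(event) {
    switch (event.type) {
      case 'session.created':
        state.sessionId = event.session.id;
        break;
      case 'response.text.delta':
        state.text += event.delta; // field name assumed; verify against real payloads
        break;
      case 'response.done':
        state.done = true;
        break;
      case 'error':
        console.error('engine error:', event);
        break;
      // handle input_audio_buffer.speech_started / speech_stopped etc. as needed
    }
    return state;
  }

  return handleEvent;
}

// Wire it to a live socket (Node `ws` package):
//   ws.on('message', (raw) => handleEvent(JSON.parse(raw.toString())));
const handleEvent = createEventHandler();
handleEvent({ type: 'session.created', session: { id: 'sess_abc123' } });
handleEvent({ type: 'response.text.delta', delta: 'Hello' });
const state = handleEvent({ type: 'response.done' });
```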
Example: Update Session
```json
{
  "type": "session.update",
  "session": {
    "instructions": "You are a helpful assistant.",
    "voice": "aura-2-thalia-en",
    "turn_detection": {
      "type": "server_vad",
      "threshold": 0.5,
      "silence_duration_ms": 550
    }
  }
}
```

Example: Send Audio

```json
{
  "type": "input_audio_buffer.append",
  "audio": "base64EncodedPcm16AudioData..."
}
```

How It Fits With Core
Core and the realtime engine have different responsibilities:
- Core prepares and issues short-lived access for session startup (token generation, app management)
- Realtime engine runs the session after the client connects (audio processing, AI inference, TTS)
What Operators Should Care About
Public WebSocket Endpoint Stability
Keep the WebSocket URL stable between deployments. Clients store this URL in their configuration.
Health and Restart Behavior
The engine includes a health check endpoint (`/health`) used by Docker and load balancers. The container restarts automatically on failure (`restart: unless-stopped`).
Upstream Provider Configuration
Provider settings are loaded from:
- Environment variables (bootstrap/fallback)
- Runtime YAML config (`/app/data/config/runtime.yaml`)
- Runtime config HTTP API
Update providers without restarting:
```bash
curl -X PUT http://localhost:8787/config \
  -H "Authorization: Bearer ${ENGINE_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"llm": {"provider": "openrouter", "model": "anthropic/claude-3-sonnet"}}'
```

TLS and Proxy Support
For production deployments behind a reverse proxy:
```nginx
# nginx configuration
location /v1/realtime {
    proxy_pass http://localhost:8787/v1/realtime;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 86400;
    proxy_send_timeout 86400;
}
```

Logs and Metrics
View engine logs:
```bash
# All logs
bun run stack:logs | grep engine

# Errors only
bun run stack:logs | grep engine | grep -i error

# Session events
bun run stack:logs | grep engine | grep -i "session\|speech\|response"
```

Provider Configuration
LLM Providers
Configure the LLM provider via environment variables or runtime config:
Groq (default, fast):

```bash
LLM_PROVIDER=groq
GROQ_API_KEY=gsk_your_key
GROQ_MODEL=openai/gpt-oss-20b
```

OpenRouter (100+ models):

```bash
LLM_PROVIDER=openrouter
OPENROUTER_API_KEY=sk-or-v1-your_key
OPENROUTER_MODEL=anthropic/claude-3-sonnet
```

Speech-to-Text
Default: Deepgram Nova-3
```bash
STT_PROVIDER=deepgram
DEEPGRAM_API_KEY=your_key
DEEPGRAM_STT_MODEL=nova-3
DEEPGRAM_STT_LANGUAGE=en-US
```

Text-to-Speech
Default: Deepgram Aura-2
```bash
TTS_PROVIDER=deepgram
DEEPGRAM_TTS_MODEL=aura-2-thalia-en
```

Voice Activity Detection
Default: Silero VAD
```bash
VAD_PROVIDER=silero
VAD_ENABLED=true
VAD_THRESHOLD=0.5
VAD_MIN_SILENCE_MS=550
```

Runtime Config Ownership
The engine persists its runtime configuration as YAML at `/app/data/config/runtime.yaml` on the `engine-data` Docker volume. Environment variables act as bootstrap defaults and fallbacks.
Config hierarchy (highest priority first):
- Runtime config HTTP API updates
- Runtime YAML file (`/app/data/config/runtime.yaml`)
- Environment variables
This allows updating configuration without rebuilding or restarting containers.
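The hierarchy above amounts to a layered merge in which higher-priority sources override lower ones. An illustrative sketch (the layer names and flat key shape are hypothetical, not the engine's actual internals):

```javascript
// Resolve an effective config from three layers. Spreading lowest
// priority first means higher-priority keys overwrite on conflict.
function resolveConfig({ env = {}, yamlFile = {}, apiUpdates = {} }) {
  return { ...env, ...yamlFile, ...apiUpdates };
}

const effective = resolveConfig({
  env: { llmProvider: 'groq', llmModel: 'openai/gpt-oss-20b' },
  yamlFile: { llmProvider: 'openrouter' },
  apiUpdates: { llmModel: 'anthropic/claude-3-sonnet' },
});
// effective.llmProvider === 'openrouter'           (YAML file beats env)
// effective.llmModel === 'anthropic/claude-3-sonnet' (API update beats both)
```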
Typical Session Flow
1. Token Acquisition: Client requests token from Core
2. WebSocket Connection: Client connects to engine with token
3. Session Creation: Engine validates token and creates session
4. Audio Streaming: Client sends audio chunks
5. Speech Detection: VAD detects speech start/end
6. Transcription: STT converts speech to text
7. AI Processing: LLM generates response
8. Synthesis: TTS converts text to audio
9. Streaming: Audio/text streamed back to client
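The client side of this flow can be sketched in Node.js. The helpers below build the client → server events documented earlier; the commented connection outline assumes the Node `ws` package, the local URLs from this page, and an illustrative `VOWEL_API_KEY` environment variable for the Core bootstrap key:

```javascript
// Build client → server events for the realtime WebSocket protocol.
// Audio must be PCM16, 24kHz, mono, base64-encoded in the append event.
function buildAppendEvent(pcm16Buffer) {
  return {
    type: 'input_audio_buffer.append',
    audio: Buffer.from(pcm16Buffer).toString('base64'),
  };
}

function buildCommitEvent() {
  return { type: 'input_audio_buffer.commit' };
}

// Connection outline (requires the `ws` package and a running stack):
//
//   const res = await fetch('http://localhost:3000/vowel/api/generateToken', {
//     method: 'POST',
//     headers: {
//       'Content-Type': 'application/json',
//       'X-API-Key': process.env.VOWEL_API_KEY, // illustrative env var name
//     },
//     body: JSON.stringify({ appId: 'default', provider: 'vowel-prime' }),
//   });
//   const { token, wsUrl } = await res.json();
//   const ws = new WebSocket(wsUrl, [], {
//     headers: { Authorization: `Bearer ${token}` },
//   });
//   ws.on('open', () => {
//     ws.send(JSON.stringify(buildAppendEvent(audioChunk)));
//     ws.send(JSON.stringify(buildCommitEvent()));
//   });

const event = buildAppendEvent(new Int16Array([0, 256, -256]).buffer);
```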
Source Repository
The realtime engine is open source at github.com/usevowel/engine.
Related Docs
- Architecture - How the engine fits in the stack
- Core - Token service and control plane
- Configuration - Environment variables and config options
- Deployment - Deploying the full stack
- Troubleshooting - Debugging engine issues