Self-Hosted Realtime Engine

The realtime engine runs live voice sessions in a self-hosted deployment. It provides an OpenAI-compatible WebSocket API for real-time voice AI.

What The Realtime Engine Does

The realtime engine is responsible for:

  • Accepting authenticated realtime WebSocket connections
  • Creating and managing live voice sessions
  • Coordinating speech-to-text, model inference, and text-to-speech
  • Executing custom tools and actions
  • Streaming audio and text responses back to clients
  • Managing conversation history and session state

WebSocket Endpoint

The primary endpoint for voice sessions:

ws://localhost:8787/v1/realtime

In production with TLS:

wss://your-domain.com/v1/realtime

Connection requires:

  • Valid ephemeral token in Authorization header or query parameter
  • WebSocket protocol upgrade support

Example Connection

bash
# Using wscat (install with: npm install -g wscat)
wscat -c "ws://localhost:8787/v1/realtime" \
  -H "Authorization: Bearer ek_your_ephemeral_token"
javascript
// Using the ws package in Node.js. Note: browsers cannot set custom
// headers on a WebSocket; in a browser, pass the token as a query
// parameter instead.
const ws = new WebSocket(
  'ws://localhost:8787/v1/realtime',
  [],
  {
    headers: {
      'Authorization': 'Bearer ek_your_ephemeral_token'
    }
  }
);
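For browser clients, which cannot attach custom headers to a WebSocket, the token goes in the query string (the endpoint accepts either, per the connection requirements above). A minimal sketch; the `token` parameter name is an assumption, so confirm it against your deployment:

```javascript
// Build the connection URL with the ephemeral token as a query parameter.
// NOTE: the "token" parameter name is an assumption -- this page only
// states that a query parameter is accepted.
function realtimeUrl(base, token) {
  // encodeURIComponent guards against characters that would break the URL
  return `${base}?token=${encodeURIComponent(token)}`;
}

// In a browser (or Node >= 22, which ships a global WebSocket):
function connect(token) {
  const ws = new WebSocket(realtimeUrl('ws://localhost:8787/v1/realtime', token));
  ws.addEventListener('message', (event) => {
    const msg = JSON.parse(event.data);
    if (msg.type === 'session.created') {
      console.log('session started:', msg.session.id);
    }
  });
  return ws;
}
```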

Ephemeral Token Flow

1. Request Token from Core

bash
curl -X POST http://localhost:3000/vowel/api/generateToken \
  -H "Content-Type: application/json" \
  -H "X-API-Key: vkey_your_bootstrap_key" \
  -d '{
    "appId": "default",
    "provider": "vowel-prime"
  }'

Response:

json
{
  "token": "ek_eyJhbGciOiJIUzI1NiIs...",
  "expiresAt": "2024-01-15T10:30:00Z",
  "wsUrl": "ws://localhost:8787/v1/realtime"
}

2. Connect with Token

bash
wscat -c "ws://localhost:8787/v1/realtime" \
  -H "Authorization: Bearer ek_eyJhbGciOiJIUzI1NiIs..."

3. Session Established

Upon successful connection, the engine sends:

json
{
  "type": "session.created",
  "session": {
    "id": "sess_abc123",
    "model": "openai/gpt-oss-20b",
    "voice": "aura-2-thalia-en"
  }
}
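Steps 1 through 3 can be combined into one client helper. A sketch using the endpoints and field names shown above; the Core URL and keys are placeholders:

```javascript
// Build the token request once so it can be inspected separately from
// the network call. Field names match the curl example above.
function buildTokenRequest(bootstrapKey) {
  return {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', 'X-API-Key': bootstrapKey },
    body: JSON.stringify({ appId: 'default', provider: 'vowel-prime' }),
  };
}

// Request an ephemeral token from Core.
async function startSession(coreUrl, bootstrapKey) {
  const res = await fetch(`${coreUrl}/vowel/api/generateToken`, buildTokenRequest(bootstrapKey));
  if (!res.ok) throw new Error(`token request failed: ${res.status}`);
  return res.json(); // { token, expiresAt, wsUrl }
}

// Usage (with the ws package, as in the connection example above):
// const { token, wsUrl } = await startSession('http://localhost:3000', 'vkey_...');
// const ws = new WebSocket(wsUrl, [], { headers: { 'Authorization': `Bearer ${token}` } });
```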

HTTP Endpoints

Health Check

bash
GET http://localhost:8787/health

Response:

json
{
  "status": "ok",
  "timestamp": "2024-01-15T10:30:00Z"
}

Runtime Configuration

The engine exposes HTTP endpoints for managing runtime configuration without restarting:

Get current config:

bash
curl http://localhost:8787/config \
  -H "Authorization: Bearer ${ENGINE_API_KEY}"

Update config:

bash
curl -X PUT http://localhost:8787/config \
  -H "Authorization: Bearer ${ENGINE_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "llm": {
      "provider": "groq",
      "model": "openai/gpt-oss-20b"
    }
  }'

Validate config change:

bash
curl -X POST http://localhost:8787/config/validate \
  -H "Authorization: Bearer ${ENGINE_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"llm": {"provider": "openrouter"}}'

Reload config from disk:

bash
curl -X POST http://localhost:8787/config/reload \
  -H "Authorization: Bearer ${ENGINE_API_KEY}"

List available presets:

bash
curl http://localhost:8787/presets \
  -H "Authorization: Bearer ${ENGINE_API_KEY}"
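A safe operator pattern is to validate a change before applying it. A sketch combining the two endpoints above; `configHeaders` is a helper introduced here for illustration:

```javascript
// Shared headers for the runtime config endpoints shown above.
function configHeaders(apiKey) {
  return {
    'Authorization': `Bearer ${apiKey}`,
    'Content-Type': 'application/json',
  };
}

// Validate the change first; apply it only if validation passes.
async function applyConfig(engineUrl, apiKey, patch) {
  const headers = configHeaders(apiKey);
  const body = JSON.stringify(patch);

  const check = await fetch(`${engineUrl}/config/validate`, { method: 'POST', headers, body });
  if (!check.ok) throw new Error(`config rejected: ${check.status}`);

  const res = await fetch(`${engineUrl}/config`, { method: 'PUT', headers, body });
  if (!res.ok) throw new Error(`config update failed: ${res.status}`);
  return res.json();
}

// Usage: await applyConfig('http://localhost:8787', process.env.ENGINE_API_KEY,
//                          { llm: { provider: 'groq', model: 'openai/gpt-oss-20b' } });
```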

WebSocket Events

Client → Server Events

Event                       Description
session.update              Update session configuration
input_audio_buffer.append   Stream audio chunk (PCM16, 24kHz, mono)
input_audio_buffer.commit   Signal end of speech
conversation.item.create    Add text message to conversation
response.create             Request AI response
response.cancel             Cancel in-progress response

Server → Client Events

Event                              Description
session.created                    Session initialized
input_audio_buffer.speech_started  Speech detected
input_audio_buffer.speech_stopped  Speech ended
conversation.item.created          New conversation item
response.text.delta                Streaming text response
response.audio.delta               Streaming audio response (PCM16)
response.done                      Response complete
error                              Error occurred

Example: Update Session

json
{
  "type": "session.update",
  "session": {
    "instructions": "You are a helpful assistant.",
    "voice": "aura-2-thalia-en",
    "turn_detection": {
      "type": "server_vad",
      "threshold": 0.5,
      "silence_duration_ms": 550
    }
  }
}

Example: Send Audio

json
{
  "type": "input_audio_buffer.append",
  "audio": "base64EncodedPcm16AudioData..."
}
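Producing the `audio` payload means base64-encoding raw PCM16 samples (24kHz mono, per the event table above). A Node.js sketch; browsers would use `btoa` over the same bytes:

```javascript
// Encode a PCM16 chunk as base64 for input_audio_buffer.append.
// Int16Array is platform-endian (little-endian on all common hardware).
function pcm16ToBase64(samples /* Int16Array */) {
  const bytes = new Uint8Array(samples.buffer, samples.byteOffset, samples.byteLength);
  return Buffer.from(bytes).toString('base64');
}

function appendAudioEvent(samples) {
  return JSON.stringify({
    type: 'input_audio_buffer.append',
    audio: pcm16ToBase64(samples),
  });
}

// Usage: ws.send(appendAudioEvent(chunk));
//        ws.send(JSON.stringify({ type: 'input_audio_buffer.commit' }));
```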

How It Fits With Core

Core and the realtime engine have different responsibilities:

  • Core prepares and issues short-lived access for session startup (token generation, app management)
  • Realtime engine runs the session after the client connects (audio processing, AI inference, TTS)

What Operators Should Care About

Public WebSocket Endpoint Stability

Keep the WebSocket URL stable between deployments. Clients store this URL in their configuration.

Health and Restart Behavior

The engine includes a health check endpoint (/health) used by Docker and load balancers. The container restarts automatically on failure (restart: unless-stopped).

Upstream Provider Configuration

Provider settings are loaded from the following sources, in increasing order of precedence:

  1. Environment variables (bootstrap/fallback)
  2. Runtime YAML config (/app/data/config/runtime.yaml)
  3. Runtime config HTTP API

Update providers without restarting:

bash
curl -X PUT http://localhost:8787/config \
  -H "Authorization: Bearer ${ENGINE_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"llm": {"provider": "openrouter", "model": "anthropic/claude-3-sonnet"}}'

TLS and Proxy Support

For production deployments behind a reverse proxy:

nginx
# nginx configuration
location /v1/realtime {
    proxy_pass http://localhost:8787/v1/realtime;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 86400;
    proxy_send_timeout 86400;
}

Logs and Metrics

View engine logs:

bash
# All logs
bun run stack:logs | grep engine

# Errors only
bun run stack:logs | grep engine | grep -i error

# Session events
bun run stack:logs | grep engine | grep -i "session\|speech\|response"

Provider Configuration

LLM Providers

Configure the LLM provider via environment variables or runtime config:

Groq (default, fast):

bash
LLM_PROVIDER=groq
GROQ_API_KEY=gsk_your_key
GROQ_MODEL=openai/gpt-oss-20b

OpenRouter (100+ models):

bash
LLM_PROVIDER=openrouter
OPENROUTER_API_KEY=sk-or-v1-your_key
OPENROUTER_MODEL=anthropic/claude-3-sonnet

Speech-to-Text

Default: Deepgram Nova-3

bash
STT_PROVIDER=deepgram
DEEPGRAM_API_KEY=your_key
DEEPGRAM_STT_MODEL=nova-3
DEEPGRAM_STT_LANGUAGE=en-US

Text-to-Speech

Default: Deepgram Aura-2

bash
TTS_PROVIDER=deepgram
DEEPGRAM_TTS_MODEL=aura-2-thalia-en

Voice Activity Detection

Default: Silero VAD

bash
VAD_PROVIDER=silero
VAD_ENABLED=true
VAD_THRESHOLD=0.5
VAD_MIN_SILENCE_MS=550

Runtime Config Ownership

The engine persists its runtime configuration as YAML at /app/data/config/runtime.yaml on the engine-data Docker volume. Environment variables act as bootstrap defaults and fallbacks.

Config hierarchy (highest priority first):

  1. Runtime config HTTP API updates
  2. Runtime YAML file (/app/data/config/runtime.yaml)
  3. Environment variables

This allows updating configuration without rebuilding or restarting containers.
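The precedence order can be illustrated as a layered merge, lowest priority first. This is a sketch of the idea, not the engine's actual implementation (which may deep-merge nested keys):

```javascript
// Later arguments override earlier ones per key, matching the hierarchy
// above: env vars < runtime.yaml < HTTP API updates.
function resolveConfig(envDefaults, yamlConfig, apiUpdates) {
  return Object.assign({}, envDefaults, yamlConfig, apiUpdates);
}

const effective = resolveConfig(
  { llmProvider: 'groq', ttsProvider: 'deepgram' }, // environment variables
  { llmProvider: 'openrouter' },                    // /app/data/config/runtime.yaml
  {}                                                // no HTTP API updates yet
);
// effective.llmProvider comes from the YAML layer;
// effective.ttsProvider falls back to the environment layer.
```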

Typical Session Flow

  1. Token Acquisition: Client requests token from Core
  2. WebSocket Connection: Client connects to engine with token
  3. Session Creation: Engine validates token and creates session
  4. Audio Streaming: Client sends audio chunks
  5. Speech Detection: VAD detects speech start/end
  6. Transcription: STT converts speech to text
  7. AI Processing: LLM generates response
  8. Synthesis: TTS converts text to audio
  9. Streaming: Audio/text streamed back to client
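From the client's side, steps 4 through 9 reduce to routing incoming server events (from the event table above) to handlers. A minimal dispatcher sketch; the `delta` payload field is an assumption based on the OpenAI-compatible event shape, so verify it against your engine version:

```javascript
// Route a raw WebSocket message to the matching handler.
// Returns the event type so callers can log or branch on it.
function handleServerEvent(raw, handlers) {
  const msg = JSON.parse(raw);
  switch (msg.type) {
    case 'session.created':
      handlers.onSession?.(msg.session);
      break;
    case 'response.text.delta':
      handlers.onText?.(msg.delta);
      break;
    case 'response.audio.delta':
      handlers.onAudio?.(msg.delta); // base64 PCM16 chunk
      break;
    case 'response.done':
      handlers.onDone?.();
      break;
    case 'error':
      handlers.onError?.(msg);
      break;
  }
  return msg.type;
}

// Usage: ws.addEventListener('message', (e) => handleServerEvent(e.data, handlers));
```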

Source Repository

The realtime engine is open source at github.com/usevowel/engine.