# Testing
The self-hosted stack includes multiple testing capabilities to verify functionality and validate voice agent behavior.
## Quick Start
Run the built-in smoke test to verify your stack is working:
```bash
# After starting the stack
bun run stack:up

# Wait for services to be healthy, then run the smoke test
bun run stack:test
```

Run the full test suite with the Test Harness:
```bash
# Set your API key and run tests
export API_KEY=vkey_your_key_here
bun test --cwd engine/packages/tester
```

## Smoke Test
The smoke test (`bun run stack:test`) performs a quick health check of your entire stack without requiring complex setup.
### What It Tests
- **Engine Health** - Verifies the engine responds to health checks
- **Core Health** - Verifies the Core token service is running
- **Token Minting** - Tests that Core can generate ephemeral tokens
- **WebSocket Connection** - Validates that tokens work for engine connections
- **Session Creation** - Confirms the engine creates sessions properly
### Running the Smoke Test
```bash
# From the workspace root
bun run stack:test
```

The smoke test will report:
- `✅ Engine health: OK` - Engine container is responsive
- `✅ Core health: OK` - Core container is responsive
- `✅ Token minted: <token_prefix>...` - Token generation works
- `✅ Connected to WebSocket` - WebSocket connection succeeds
- `✅ Session created` - Full flow is working
### Interpreting Results
| Result | Meaning |
|---|---|
| All checks pass | Stack is fully operational |
| Engine/Core health fails | Containers not running or not healthy |
| Token minting fails | Check `.env` configuration and API keys |
| WebSocket/Session fails | Engine configuration or provider issues |
## Test Harness Framework
The stack includes a sophisticated Test Harness in the Engine for automated end-to-end testing of voice agent conversations.
### Architecture
The Test Harness uses an LLM-powered Test Driver that simulates a human user conducting conversations with your voice agent:
Components:
- **TestDriver**: LLM agent that generates realistic user messages and evaluates responses
- **TestHarness**: Orchestrates the conversation, manages the WebSocket connection, and validates tool calls
- **EngineConnection**: Handles WebSocket communication with the engine
- **Scenarios**: Pre-defined test cases with objectives and expected tool calls
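The real orchestration logic lives inside `@vowel/tester`, but as a rough mental model (the types and function names below are illustrative, not the actual API), the conversation loop looks something like this:

```typescript
// Illustrative sketch only -- the real TestHarness/TestDriver internals may differ.
type ToolCall = { name: string; args: Record<string, unknown> };

interface Agent {
  respond(userMessage: string): { text: string; toolCalls: ToolCall[] };
}

interface Driver {
  nextMessage(agentText: string): string; // LLM-generated user turn
  done(agentText: string): boolean;       // has the objective been reached?
}

// Drive a conversation for up to maxTurns, collecting every tool call the
// agent makes so they can be checked against expectedToolCalls afterwards.
function runConversation(driver: Driver, agent: Agent, maxTurns: number): ToolCall[] {
  const observed: ToolCall[] = [];
  let agentText = '';
  for (let turn = 0; turn < maxTurns; turn++) {
    const userMessage = driver.nextMessage(agentText);
    const reply = agent.respond(userMessage);
    observed.push(...reply.toolCalls);
    agentText = reply.text;
    if (driver.done(agentText)) break;
  }
  return observed;
}
```

The key design point this illustrates: the driver and the agent only exchange text, while the harness sits in the middle recording tool calls for later validation.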
### Key Features
- **Automated Conversation Flow**: The TestDriver carries natural conversations toward test objectives
- **Tool Call Validation**: Verifies the agent uses the right tools with correct arguments
- **Mock Tool Results**: Returns realistic mock data so conversations can continue
- **Detailed Logging**: Generates timestamped Markdown logs of every test run
- **Timeout Handling**: Gracefully handles slow responses and connection issues
- **Retry Logic**: Automatic retry with exponential backoff for rate-limited LLM calls
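The retry behavior is built in, but if you need the same pattern around your own LLM calls, a minimal retry-with-exponential-backoff helper might look like this (a sketch, not the tester's actual implementation):

```typescript
// Sketch of retry with exponential backoff; not the tester's internal code.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 4,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Wait 500ms, 1000ms, 2000ms, ... before the next attempt.
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

For rate-limited APIs you may also want to add jitter to the delay so parallel tests don't retry in lockstep.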
### Demo Test Scenarios
The Test Harness includes four built-in scenarios matching the demo application's tools:
#### 1. Weather Tool Test
Tests that the agent correctly uses the `get_weather` tool.
```typescript
const weatherScenario = {
  name: 'Weather Tool Test',
  driver: {
    objective: 'Test the weather lookup tool by asking for weather in New York',
    personality: 'curious user interested in weather',
    maxTurns: 4,
  },
  expectedToolCalls: [{
    name: 'get_weather',
    required: true,
    validate: (args) => args.location.toLowerCase().includes('new york'),
    mockResult: {
      location: 'New York, NY',
      temperature: '72°F',
      condition: 'Sunny',
    },
  }],
};
```

#### 2. Calculator Tool Test
Tests that the agent uses the `calculate` tool for math queries.
```typescript
const calculatorScenario = {
  name: 'Calculator Tool Test',
  driver: {
    objective: 'Test the calculator tool by asking to calculate 15 * 24',
    personality: 'user doing math homework',
    maxTurns: 3,
  },
  expectedToolCalls: [{
    name: 'calculate',
    required: true,
    validate: (args) => args.expression.includes('15') && args.expression.includes('24'),
    mockResult: { expression: '15 * 24', result: 360 },
  }],
};
```

#### 3. Multi-Tool Conversation Test
Tests multiple tool usage in a single conversation flow.
```typescript
const multiToolScenario = {
  name: 'Multi-Tool Conversation Test',
  driver: {
    objective: 'First ask for weather in Paris, then ask to calculate hours in 3 days',
    personality: 'traveler planning a trip',
    maxTurns: 6,
  },
  expectedToolCalls: [
    { name: 'get_weather', required: true, mockResult: {...} },
    { name: 'calculate', required: true, mockResult: {...} },
  ],
};
```

#### 4. Context Retention Test
Tests that the agent remembers context across conversation turns.
```typescript
const contextScenario = {
  name: 'Context Retention Test',
  driver: {
    objective: 'First ask "What is the weather in London?" Then ask "What about Paris?"',
    personality: 'casual conversationalist',
    maxTurns: 5,
  },
  // Expects get_weather to be called appropriately based on context
};
```

## Running Tests
### Prerequisites
1. **Stack must be running:**

   ```bash
   bun run stack:up
   ```

2. **API key configured** - you need a valid publishable API key:

   ```bash
   # Create key in Core UI or use bootstrap key
   export API_KEY=vkey_your_key_here
   ```

3. **Install tester dependencies:**

   ```bash
   bun install --cwd engine/packages/tester
   ```
### Running All Tests
```bash
export API_KEY=vkey_your_key_here
bun test --cwd engine/packages/tester
```

### Running a Single Test
```bash
export API_KEY=vkey_your_key_here
bun test --cwd engine/packages/tester --test-name-pattern="Weather Tool Test"
```

### Running with Different LLM Providers
The TestDriver supports Groq and OpenRouter for its own LLM calls:
```bash
# Use Groq (faster, often free tier available)
export TEST_DRIVER_PROVIDER=groq
export GROQ_API_KEY=your_groq_key
bun test --cwd engine/packages/tester

# Use OpenRouter (more model options, free tier available)
export TEST_DRIVER_PROVIDER=openrouter
export OPENROUTER_API_KEY=your_openrouter_key
bun test --cwd engine/packages/tester

# Prefer free models (recommended for cost control)
export TEST_DRIVER_MODEL=arcee-ai/trinity-large-preview:free
bun test --cwd engine/packages/tester
```

### Configuring Test Endpoints
Override the default endpoints if your stack runs on non-standard ports:
```bash
# Point to custom stack location
export TEST_BASE_URL=http://localhost:8787
export TEST_MODEL=openai/gpt-oss-20b
bun test --cwd engine/packages/tester
```

## Creating Custom Test Scenarios
### Basic Scenario Structure
```typescript
import { TestScenario } from '@vowel/tester';

const myScenario: TestScenario = {
  name: 'My Custom Test',
  driver: {
    objective: 'What the test should accomplish',
    personality: 'Type of user to simulate',
    maxTurns: 5,
    temperature: 0.3, // Lower = more deterministic testing
  },
  connection: {
    baseUrl: 'http://localhost:8787',
    model: 'openai/gpt-oss-20b',
    voice: 'Ashley',
    instructions: 'Agent system prompt for this test',
    tools: [
      {
        type: 'function',
        name: 'my_tool',
        description: 'What this tool does',
        parameters: { /* JSON schema */ },
      },
    ],
  },
  expectedToolCalls: [
    {
      name: 'my_tool',
      required: true,
      validate: (args) => true, // replace with your validation logic
      mockResult: { /* mock response data */ },
    },
  ],
  timeout: 30000, // milliseconds
};
```

### Step-by-Step Custom Test Creation
1. **Create a scenario file** (`my-scenarios.ts`):
```typescript
import { TestScenario } from '@vowel/tester';

export const bookingScenario: TestScenario = {
  name: 'Restaurant Booking Test',
  driver: {
    objective: 'Book a table for 4 people at 7pm on Friday',
    personality: 'busy professional making a reservation',
    maxTurns: 6,
  },
  connection: {
    baseUrl: process.env.TEST_BASE_URL || 'http://localhost:8787',
    model: process.env.TEST_MODEL || 'openai/gpt-oss-20b',
    voice: 'Ashley',
    instructions: 'You are a restaurant booking assistant. Use the book_table tool when customers want to make reservations.',
    tools: [
      {
        type: 'function',
        name: 'book_table',
        description: 'Book a restaurant table',
        parameters: {
          type: 'object',
          properties: {
            party_size: { type: 'number' },
            date: { type: 'string', description: 'YYYY-MM-DD' },
            time: { type: 'string', description: 'HH:MM' },
          },
          required: ['party_size', 'date', 'time'],
        },
      },
    ],
  },
  expectedToolCalls: [
    {
      name: 'book_table',
      required: true,
      validate: (args) => {
        return args.party_size === 4 && args.time === '19:00';
      },
      mockResult: {
        booking_id: 'BK12345',
        status: 'confirmed',
        message: 'Table booked for Friday at 7pm',
      },
    },
  ],
  timeout: 30000,
};
```

2. **Create a test file** (`my-test.test.ts`):
```typescript
import { describe, test, expect } from 'bun:test';
import { TestHarness } from '@vowel/tester';
import { bookingScenario } from './my-scenarios';

const API_KEY = process.env.API_KEY || '';
const runTests = API_KEY ? describe : describe.skip;

runTests('Restaurant Tests', () => {
  const harness = new TestHarness(API_KEY, './logs');

  test('Booking Flow', async () => {
    const result = await harness.runScenario(bookingScenario);

    console.log('\n📊 Results:');
    console.log(`  Passed: ${result.passed}`);
    console.log(`  Duration: ${result.duration}ms`);
    console.log(`  Evaluation: ${result.evaluation}`);

    expect(result.passed).toBe(true);
  }, { timeout: 60000 });
});
```

3. **Run your custom test:**
```bash
export API_KEY=vkey_your_key
bun test my-test.test.ts --cwd engine/packages/tester
```

## Troubleshooting Tests
### Common Issues
**"API_KEY not set"**
```bash
# Set the API key from your Core UI
export API_KEY=vkey_your_publishable_key_here
```

**"Connection refused" / "ECONNREFUSED"**
```bash
# Check if stack is running
docker compose ps

# Start the stack if needed
bun run stack:up

# Verify the base URL
export TEST_BASE_URL=http://localhost:8787
```

**"Token generation failed" / 401 errors**
- Verify your API key is valid and has the `mint_ephemeral` scope
- Check that the app is configured in the Core UI
- Ensure the key matches `CORE_BOOTSTRAP_PUBLISHABLE_KEY` if you are using the bootstrap key
**Tests timeout frequently**
```bash
# Increase timeout for slower environments
export TEST_TIMEOUT=60000  # 60 seconds
```

Or set it per scenario: `timeout: 60000`.

**Tool calls not detected**
- Check engine logs: `docker logs vowel-engine | grep -i tool`
- Verify tool names match exactly between the scenario and the agent
- Ensure mock results return valid JSON
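A related pitfall is a `validate` function that throws when an expected argument is missing (for example, calling `.toLowerCase()` on `undefined`), which can make a tool call look like a failure. A defensive validator avoids this; the sketch below assumes a weather-style tool, so adjust the field names to your own schema:

```typescript
// Sketch of a defensive validate function; adapt to your tool's argument schema.
type ToolArgs = Record<string, unknown>;

function validateWeatherArgs(args: ToolArgs): boolean {
  // Guard the type before calling string methods, so a missing or
  // non-string `location` fails the check instead of throwing.
  const location = args.location;
  if (typeof location !== 'string') return false;
  return location.toLowerCase().includes('new york');
}
```

Returning `false` on malformed arguments gives you a clean test failure with a useful log entry instead of an unhandled exception.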
**"Rate limit exceeded"**
The TestDriver has built-in retry logic, but you can also:
```bash
# Use a different model with higher limits
export TEST_DRIVER_MODEL=arcee-ai/trinity-large-preview:free

# Add delay between tests
export TEST_DELAY=2000  # milliseconds between tests
```

### Debug Mode
Enable verbose logging to see detailed WebSocket traffic:
```bash
export TEST_LOG_LEVEL=debug
bun test --cwd engine/packages/tester 2>&1 | tee test-debug.log
```

### Log Files
The Test Harness automatically generates detailed Markdown logs in `./logs/`:
```bash
# View the most recently written log
ls -t logs/*.md | head -1 | xargs cat
```

All logs include:

- Timestamped event stream
- Full conversation transcript
- Tool call details with arguments
- Pass/fail status and evaluation

## Manual WebSocket Testing
For direct WebSocket testing without the Test Harness:
```bash
# Using the built-in connection test
curl http://localhost:3000/vowel/api/generateToken \
  -X POST \
  -H "Authorization: Bearer vkey_your_key" \
  -d '{"appId":"default","config":{"provider":"engine","voiceConfig":{"model":"openai/gpt-oss-20b"}}}'
```

## Advanced Usage
### Parallel Test Execution
```bash
# Run tests in parallel (Bun handles this automatically)
bun test --cwd engine/packages/tester --parallel

# Limit concurrency for rate-limited providers
export TEST_CONCURRENCY=2
bun test --cwd engine/packages/tester
```

### CI/CD Integration
```yaml
# Example GitHub Actions workflow step
- name: Run Vowel Stack Tests
  run: |
    bun run stack:up
    sleep 30  # Wait for services to become healthy
    bun run stack:test
    export API_KEY=${{ secrets.VOWEL_API_KEY }}
    bun test --cwd engine/packages/tester
```

### Performance Benchmarking
```typescript
// Add to your test for performance checks
test('Performance Benchmark', async () => {
  const result = await harness.runScenario(scenario);

  // Assert performance requirements
  expect(result.duration).toBeLessThan(5000); // 5 seconds max
  expect(result.turns).toBeGreaterThan(2); // At least 2 turns
});
```

## Summary
| Test Type | Command | Purpose |
|---|---|---|
| Smoke Test | `bun run stack:test` | Quick health check |
| E2E Tests | `bun test --cwd engine/packages/tester` | Full conversation validation |
| Custom Tests | Write scenarios using `TestHarness` | Validate your specific use cases |