Testing

The self-hosted stack provides two levels of testing: a smoke test that checks the stack is up, and a Test Harness that validates voice agent behavior end to end.

Quick Start

Run the built-in smoke test to verify your stack is working:

bash

# After starting the stack
bun run stack:up

# Wait for services to be healthy, then run the smoke test
bun run stack:test

Run the full test suite with the Test Harness:

bash

# Set your API key and run tests
export API_KEY=vkey_your_key_here
bun test --cwd engine/packages/tester

Smoke Test

The smoke test (bun run stack:test) performs a quick health check of your entire stack without requiring complex setup.

What It Tests

Engine Health - Verifies the engine responds to health checks
Core Health - Verifies the Core token service is running
Token Minting - Tests that Core can generate ephemeral tokens
WebSocket Connection - Validates tokens work for engine connections
Session Creation - Confirms the engine creates sessions properly

Running the Smoke Test

bash

# From the workspace root
bun run stack:test

The smoke test will report:

✅ Engine health: OK - Engine container is responsive
✅ Core health: OK - Core container is responsive
✅ Token minted: <token_prefix>... - Token generation works
✅ Connected to WebSocket - WebSocket connection succeeds
✅ Session created - Full flow is working

Interpreting Results

Result	Meaning
All checks pass	Stack is fully operational
Engine/Core health fails	Containers not running or not healthy
Token minting fails	Check `.env` configuration and API keys
WebSocket/Session fails	Engine configuration or provider issues
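
The table above can be read as a lookup from the first failing check to a likely cause. The following is a minimal sketch of that lookup, not the smoke test's actual code: the check identifiers (`engineHealth`, `tokenMint`, and so on) are hypothetical names chosen for illustration, and only the hint strings come from the table. It assumes each check builds on the ones before it, so the first failure is the one worth diagnosing.

```typescript
// Hypothetical identifiers for the five smoke-test checks (not the real output format).
type CheckName = 'engineHealth' | 'coreHealth' | 'tokenMint' | 'websocket' | 'session';

interface CheckResult {
  name: CheckName;
  ok: boolean;
}

// Hints copied from the "Interpreting Results" table.
const HINTS: Record<CheckName, string> = {
  engineHealth: 'Containers not running or not healthy',
  coreHealth: 'Containers not running or not healthy',
  tokenMint: 'Check .env configuration and API keys',
  websocket: 'Engine configuration or provider issues',
  session: 'Engine configuration or provider issues',
};

// Return the hint for the first failing check, or the all-clear message.
function diagnose(results: CheckResult[]): string {
  const firstFailure = results.find((r) => !r.ok);
  return firstFailure ? HINTS[firstFailure.name] : 'Stack is fully operational';
}

const sample: CheckResult[] = [
  { name: 'engineHealth', ok: true },
  { name: 'coreHealth', ok: true },
  { name: 'tokenMint', ok: false },
  { name: 'websocket', ok: false },
  { name: 'session', ok: false },
];

console.log(diagnose(sample)); // Check .env configuration and API keys
```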

Test Harness Framework

The stack includes a Test Harness in the Engine for automated end-to-end testing of voice agent conversations.

Architecture

The Test Harness uses an LLM-powered TestDriver that simulates a human user conducting conversations with your voice agent.

Components:

TestDriver: LLM agent that generates realistic user messages and evaluates responses
TestHarness: Orchestrates the conversation, manages WebSocket connection, validates tool calls
EngineConnection: Handles WebSocket communication with the engine
Scenarios: Pre-defined test cases with objectives and expected tool calls

Key Features

Automated Conversation Flow: The TestDriver steers a natural conversation toward the test objective
Tool Call Validation: Verifies the agent uses the right tools with correct arguments
Mock Tool Results: Returns realistic mock data so conversations can continue
Detailed Logging: Generates timestamped Markdown logs of every test run
Timeout Handling: Gracefully handles slow responses or connection issues
Retry Logic: Automatic retry with exponential backoff for rate-limited LLM calls
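
To make the retry behavior concrete, here is a minimal sketch of retry with exponential backoff. It is not the TestDriver's implementation: the base delay, cap, retry count, and the rule that an error with `status === 429` means "rate limited" are all assumptions made for illustration.

```typescript
// Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped at capMs.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 8000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Hypothetical error shape: any error object carrying status 429 counts as rate-limited.
function isRateLimited(err: unknown): boolean {
  return typeof err === 'object' && err !== null && (err as { status?: number }).status === 429;
}

// Call `call` and retry only on rate-limit errors, waiting longer before each retry.
async function withRetry<T>(
  call: () => Promise<T>,
  maxRetries = 3,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      // Give up on other errors, or once the retry budget is spent.
      if (!isRateLimited(err) || attempt >= maxRetries) throw err;
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

With these defaults the waits are 500 ms, 1000 ms, and 2000 ms before the third and final retry. Passing `sleep` as a parameter keeps the function easy to unit test without real delays.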

Demo Test Scenarios

The Test Harness includes four built-in scenarios matching the demo application's tools:

1. Weather Tool Test

Tests that the agent correctly uses the get_weather tool.

typescript

const weatherScenario = {
  name: 'Weather Tool Test',
  driver: {
    objective: 'Test the weather lookup tool by asking for weather in New York',
    personality: 'curious user interested in weather',
    maxTurns: 4,
  },
  expectedToolCalls: [{
    name: 'get_weather',
    required: true,
    validate: (args) => args.location.toLowerCase().includes('new york'),
    mockResult: {
      location: 'New York, NY',
      temperature: '72°F',
      condition: 'Sunny',
    },
  }],
};

2. Calculator Tool Test

Tests that the agent uses the calculate tool for math queries.

typescript

const calculatorScenario = {
  name: 'Calculator Tool Test',
  driver: {
    objective: 'Test the calculator tool by asking to calculate 15 * 24',
    personality: 'user doing math homework',
    maxTurns: 3,
  },
  expectedToolCalls: [{
    name: 'calculate',
    required: true,
    validate: (args) => args.expression.includes('15') && args.expression.includes('24'),
    mockResult: { expression: '15 * 24', result: 360 },
  }],
};

3. Multi-Tool Conversation Test

Tests multiple tool usage in a single conversation flow.

typescript

const multiToolScenario = {
  name: 'Multi-Tool Conversation Test',
  driver: {
    objective: 'First ask for weather in Paris, then ask to calculate hours in 3 days',
    personality: 'traveler planning a trip',
    maxTurns: 6,
  },
  expectedToolCalls: [
    { name: 'get_weather', required: true, mockResult: { /* ... */ } },
    { name: 'calculate', required: true, mockResult: { /* ... */ } },
  ],
};

4. Context Retention Test

Tests that the agent remembers context across conversation turns.

typescript

const contextScenario = {
  name: 'Context Retention Test',
  driver: {
    objective: 'First ask "What is the weather in London?" Then ask "What about Paris?"',
    personality: 'casual conversationalist',
    maxTurns: 5,
  },
  // Expects get_weather to be called appropriately based on context
};
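
The scenarios above declare expectations through `name`, `required`, and `validate`. The sketch below shows one way such expectations could be checked against the tool calls observed during a conversation. It is an illustration only: `checkToolCalls` and the `ObservedToolCall` shape are hypothetical, and the real TestHarness may apply different rules (for example, about call order).

```typescript
type ToolArgs = Record<string, unknown>;

interface ExpectedToolCall {
  name: string;
  required?: boolean;
  validate?: (args: ToolArgs) => boolean;
}

// Hypothetical shape for a tool call captured during a test run.
interface ObservedToolCall {
  name: string;
  args: ToolArgs;
}

// Returns human-readable problems; an empty list means every expectation was met.
function checkToolCalls(expected: ExpectedToolCall[], observed: ObservedToolCall[]): string[] {
  const problems: string[] = [];
  for (const exp of expected) {
    const calls = observed.filter((c) => c.name === exp.name);
    if (calls.length === 0) {
      if (exp.required) problems.push(`missing required tool call: ${exp.name}`);
      continue;
    }
    // One call with acceptable arguments is enough to satisfy the expectation.
    const validate = exp.validate;
    if (validate && !calls.some((c) => validate(c.args))) {
      problems.push(`arguments rejected for tool call: ${exp.name}`);
    }
  }
  return problems;
}

const expectedCalls: ExpectedToolCall[] = [
  {
    name: 'get_weather',
    required: true,
    validate: (args) => {
      const location = args.location;
      return typeof location === 'string' && location.toLowerCase().includes('new york');
    },
  },
  { name: 'calculate', required: true },
];

const observedCalls: ObservedToolCall[] = [
  { name: 'get_weather', args: { location: 'New York, NY' } },
];

// Only the missing 'calculate' call is reported.
console.log(checkToolCalls(expectedCalls, observedCalls));
```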

Running Tests

Prerequisites

Stack must be running:

bash

bun run stack:up

API Key configured: You need a valid publishable API key

bash

# Create key in Core UI or use bootstrap key
export API_KEY=vkey_your_key_here

Install tester dependencies:

bash

bun install --cwd engine/packages/tester

Running All Tests

bash

export API_KEY=vkey_your_key_here
bun test --cwd engine/packages/tester

Running a Single Test

bash

export API_KEY=vkey_your_key_here
bun test --cwd engine/packages/tester --test-name-pattern="Weather Tool Test"

Running with Different LLM Providers

The TestDriver supports Groq and OpenRouter for its own LLM calls:

bash

# Use Groq (faster, often free tier available)
export TEST_DRIVER_PROVIDER=groq
export GROQ_API_KEY=your_groq_key
bun test --cwd engine/packages/tester

# Use OpenRouter (more model options, free tier available)
export TEST_DRIVER_PROVIDER=openrouter
export OPENROUTER_API_KEY=your_openrouter_key
bun test --cwd engine/packages/tester

# Prefer free models (recommended for cost control)
export TEST_DRIVER_MODEL=arcee-ai/trinity-large-preview:free
bun test --cwd engine/packages/tester
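
The environment variables above could be resolved into a driver configuration roughly as follows. This is a sketch under assumptions, not the tester's actual code: the variable names come from the commands above, but `resolveDriverConfig`, the `DriverConfig` shape, and the fallback to `groq` when no provider is set are hypothetical.

```typescript
type Provider = 'groq' | 'openrouter';

interface DriverConfig {
  provider: Provider;
  apiKey: string;
  model?: string;
}

// Environment variable holding the API key for each provider.
const KEY_VARS: Record<Provider, string> = {
  groq: 'GROQ_API_KEY',
  openrouter: 'OPENROUTER_API_KEY',
};

function isProvider(value: string): value is Provider {
  return value === 'groq' || value === 'openrouter';
}

function resolveDriverConfig(env: Record<string, string | undefined>): DriverConfig {
  // Assumption: default to 'groq' when TEST_DRIVER_PROVIDER is unset; the real default may differ.
  const provider = env.TEST_DRIVER_PROVIDER ?? 'groq';
  if (!isProvider(provider)) {
    throw new Error(`Unsupported TEST_DRIVER_PROVIDER: ${provider}`);
  }
  const keyVar = KEY_VARS[provider];
  const apiKey = env[keyVar];
  if (!apiKey) {
    throw new Error(`${keyVar} must be set when TEST_DRIVER_PROVIDER=${provider}`);
  }
  // TEST_DRIVER_MODEL is optional; undefined means "use the provider's default model".
  return { provider, apiKey, model: env.TEST_DRIVER_MODEL };
}
```

In a test setup file this would be called as `resolveDriverConfig(process.env)`, so a missing key fails fast with a clear message instead of surfacing later as an authentication error.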

Configuring Test Endpoints

Override the default base URL and model if your stack runs on a non-standard port or serves a different model:

bash

# Point to custom stack location
export TEST_BASE_URL=http://localhost:8787
export TEST_MODEL=openai/gpt-oss-20b
bun test --cwd engine/packages/tester

Creating Custom Test Scenarios

Basic Scenario Structure

typescript

import type { TestScenario } from '@vowel/tester';

const myScenario: TestScenario = {
  name: 'My Custom Test',
  driver: {
    objective: 'What the test should accomplish',
    personality: 'Type of user to simulate',
    maxTurns: 5,
    temperature: 0.3,  // Lower = more deterministic testing
  },
  connection: {
    baseUrl: 'http://localhost:8787',
    model: 'openai/gpt-oss-20b',
    voice: 'Ashley',
    instructions: 'Agent system prompt for this test',
    tools: [
      {
        type: 'function',
        name: 'my_tool',
        description: 'What this tool does',
        parameters: { /* JSON schema */ },
      },
    ],
  },
  expectedToolCalls: [
    {
      name: 'my_tool',
      required: true,
      validate: (args) => true,  // replace with your validation logic
      mockResult: { /* mock response data */ },
    },
  ],
  timeout: 30000,  // milliseconds
};

Step-by-Step Custom Test Creation

Create a scenario file (my-scenarios.ts):

typescript

import type { TestScenario } from '@vowel/tester';

export const bookingScenario: TestScenario = {
  name: 'Restaurant Booking Test',
  driver: {
    objective: 'Book a table for 4 people at 7pm on Friday',
    personality: 'busy professional making a reservation',
    maxTurns: 6,
  },
  connection: {
    baseUrl: process.env.TEST_BASE_URL || 'http://localhost:8787',
    model: process.env.TEST_MODEL || 'openai/gpt-oss-20b',
    voice: 'Ashley',
    instructions: 'You are a restaurant booking assistant. Use the book_table tool when customers want to make reservations.',
    tools: [
      {
        type: 'function',
        name: 'book_table',
        description: 'Book a restaurant table',
        parameters: {
          type: 'object',
          properties: {
            party_size: { type: 'number' },
            date: { type: 'string', description: 'YYYY-MM-DD' },
            time: { type: 'string', description: 'HH:MM' },
          },
          required: ['party_size', 'date', 'time'],
        },
      },
    ],
  },
  expectedToolCalls: [
    {
      name: 'book_table',
      required: true,
      validate: (args) => {
        return args.party_size === 4 && args.time === '19:00';
      },
      mockResult: {
        booking_id: 'BK12345',
        status: 'confirmed',
        message: 'Table booked for Friday at 7pm',
      },
    },
  ],
  timeout: 30000,
};

Create a test file (my-test.test.ts):

typescript

import { describe, test, expect } from 'bun:test';
import { TestHarness } from '@vowel/tester';
import { bookingScenario } from './my-scenarios';

const API_KEY = process.env.API_KEY || '';
const runTests = API_KEY ? describe : describe.skip;

runTests('Restaurant Tests', () => {
  const harness = new TestHarness(API_KEY, './logs');

  test('Booking Flow', async () => {
    const result = await harness.runScenario(bookingScenario);

    console.log('\n📊 Results:');
    console.log(`   Passed: ${result.passed}`);
    console.log(`   Duration: ${result.duration}ms`);
    console.log(`   Evaluation: ${result.evaluation}`);

    expect(result.passed).toBe(true);
  }, { timeout: 60000 });
});

Run your custom test:

bash

export API_KEY=vkey_your_key
bun test my-test.test.ts --cwd engine/packages/tester
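
The `validate` callback in the booking scenario checks party size and time but not the date, even though the objective asks for Friday. A stricter validator could look like the sketch below. `isFriday` and `validateBooking` are hypothetical helpers, and the sketch assumes the agent passes `date` as `YYYY-MM-DD`, as the tool description requests.

```typescript
// Report whether a 'YYYY-MM-DD' string, read as a UTC date, falls on a Friday.
function isFriday(date: unknown): boolean {
  if (typeof date !== 'string' || !/^\d{4}-\d{2}-\d{2}$/.test(date)) return false;
  const parsed = new Date(`${date}T00:00:00Z`);
  // getUTCDay() returns 0 for Sunday through 6 for Saturday, so Friday is 5.
  return !Number.isNaN(parsed.getTime()) && parsed.getUTCDay() === 5;
}

// Same checks as the scenario's validator, plus the day-of-week check.
function validateBooking(args: Record<string, unknown>): boolean {
  return args.party_size === 4 && args.time === '19:00' && isFriday(args.date);
}

console.log(validateBooking({ party_size: 4, time: '19:00', date: '2024-06-07' })); // true
console.log(validateBooking({ party_size: 4, time: '19:00', date: '2024-06-08' })); // false
```

Checking the weekday rather than an exact date keeps the test stable regardless of when it runs.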

Troubleshooting Tests

Common Issues

"API_KEY not set"

bash

# Set the API key from your Core UI
export API_KEY=vkey_your_publishable_key_here

"Connection refused" / "ECONNREFUSED"

bash

# Check if stack is running
docker compose ps

# Start the stack if needed
bun run stack:up

# Verify the base URL
export TEST_BASE_URL=http://localhost:8787

"Token generation failed" / 401 errors

Verify your API key is valid and has mint_ephemeral scope
Check that the app is configured in Core UI
Ensure the key matches CORE_BOOTSTRAP_PUBLISHABLE_KEY if using bootstrap

Tests timeout frequently

bash

# Increase timeout for slower environments
export TEST_TIMEOUT=60000  # 60 seconds

# Or set it in the scenario config (TypeScript):
#   timeout: 60000,
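
For reference, a timeout like this is commonly enforced by racing the work against a timer. The sketch below shows the general pattern. `withTimeout` is a hypothetical helper and is not taken from the Test Harness, whose timeout handling may differ.

```typescript
// Reject with a descriptive error if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  // Clear the timer either way so a finished test does not keep the process alive.
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}
```

Note that this pattern only stops waiting. It does not cancel the underlying work, so a connection opened by a timed-out scenario would still need to be closed explicitly.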

Tool calls not detected

Check engine logs: docker logs vowel-engine | grep -i tool
Verify tool names match exactly between scenario and agent
Ensure mock results return valid JSON
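
A name mismatch between the scenario's declared tools and its expectations can be caught before the test runs. The sketch below is a hypothetical pre-flight check. `ScenarioLike` mirrors only the `connection.tools` and `expectedToolCalls` fields shown in the scenario examples, and `undeclaredTools` is not part of the tester package.

```typescript
// Minimal view of a scenario: only the fields this check needs.
interface ScenarioLike {
  connection: { tools: { name: string }[] };
  expectedToolCalls: { name: string }[];
}

// Return expected tool names that the scenario never declares to the agent.
function undeclaredTools(scenario: ScenarioLike): string[] {
  const declared = new Set(scenario.connection.tools.map((t) => t.name));
  return scenario.expectedToolCalls
    .map((e) => e.name)
    .filter((name) => !declared.has(name));
}

const mismatched: ScenarioLike = {
  connection: { tools: [{ name: 'book_table' }] },
  expectedToolCalls: [{ name: 'book_table' }, { name: 'bookTable' }],
};

// Only 'bookTable' is reported, because names are compared exactly (case-sensitive).
console.log(undeclaredTools(mismatched));
```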

"Rate limit exceeded"

The TestDriver has built-in retry logic, but you can also:

bash

# Use a different model with higher limits
export TEST_DRIVER_MODEL=arcee-ai/trinity-large-preview:free

# Add delay between tests
export TEST_DELAY=2000  # milliseconds between tests

Debug Mode

Enable verbose logging to see detailed WebSocket traffic:

bash

export TEST_LOG_LEVEL=debug
bun test --cwd engine/packages/tester 2>&1 | tee test-debug.log

Log Files

The Test Harness automatically generates detailed Markdown logs in ./logs/:

bash

# View latest log
ls -t logs/*.md | head -1 | xargs cat

# All logs include:
# - Timestamped event stream
# - Full conversation transcript
# - Tool call details with arguments
# - Pass/fail status and evaluation

Manual WebSocket Testing

To test manually without the Test Harness, first mint an ephemeral token from Core, which you can then use for a WebSocket connection to the engine:

bash

# Mint an ephemeral token from Core
curl http://localhost:3000/vowel/api/generateToken \
  -X POST \
  -H "Authorization: Bearer vkey_your_key" \
  -H "Content-Type: application/json" \
  -d '{"appId":"default","config":{"provider":"engine","voiceConfig":{"model":"openai/gpt-oss-20b"}}}'
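
The same token request can be issued from TypeScript, which is convenient when scripting manual checks. This sketch copies the URL, headers, and payload from the curl command. `buildTokenRequest` and `mintToken` are hypothetical helpers, the `Content-Type` header is included because the body is JSON, and the response is returned as raw text because its schema is not documented here.

```typescript
interface TokenRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Build the token request; the default Core URL and payload mirror the curl command.
function buildTokenRequest(apiKey: string, coreUrl = 'http://localhost:3000'): TokenRequest {
  return {
    url: `${coreUrl}/vowel/api/generateToken`,
    init: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${apiKey}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        appId: 'default',
        config: { provider: 'engine', voiceConfig: { model: 'openai/gpt-oss-20b' } },
      }),
    },
  };
}

// Send the request and return the raw response body (no response schema is assumed).
async function mintToken(apiKey: string): Promise<string> {
  const { url, init } = buildTokenRequest(apiKey);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`Token request failed with HTTP ${res.status}`);
  return res.text();
}
```

Calling `await mintToken(process.env.API_KEY ?? '')` against a running stack prints whatever Core returns, which is enough to confirm that the key is accepted.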

Advanced Usage

Parallel Test Execution

bash

# Run tests in parallel (Bun handles this automatically)
bun test --cwd engine/packages/tester --parallel

# Limit concurrency for rate-limited providers
export TEST_CONCURRENCY=2
bun test --cwd engine/packages/tester

CI/CD Integration

yaml

# Example GitHub Actions workflow
- name: Run Vowel Stack Tests
  run: |
    bun run stack:up
    sleep 30  # Wait for healthy
    bun run stack:test
    export API_KEY=${{ secrets.VOWEL_API_KEY }}
    bun test --cwd engine/packages/tester

Performance Benchmarking

typescript

// Add to your test for performance checks
test('Performance Benchmark', async () => {
  const result = await harness.runScenario(scenario);

  // Assert performance requirements
  expect(result.duration).toBeLessThan(5000);       // 5 seconds max
  expect(result.turns).toBeGreaterThanOrEqual(2);   // At least 2 turns
});
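
A single run can be noisy, since LLM latency varies from call to call. One option is to run the scenario several times and assert on a percentile of the durations instead. The sketch below implements a nearest-rank percentile. The sample durations are made-up numbers for illustration, not measurements.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('percentile requires at least one sample');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Illustrative durations in milliseconds (not real measurements).
const durations = [1800, 2100, 1950, 4200, 2000];

console.log(percentile(durations, 50)); // 2000
console.log(percentile(durations, 95)); // 4200
```

In a test, `durations` would be filled from `result.duration` over repeated `harness.runScenario` calls, with an assertion such as `expect(percentile(durations, 50)).toBeLessThan(5000)`.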

Summary

Test Type	Command	Purpose
Smoke Test	`bun run stack:test`	Quick health check
E2E Tests	`bun test --cwd engine/packages/tester`	Full conversation validation
Custom Tests	Write scenarios using `TestHarness`	Validate your specific use cases
