# Voice S2S Platform: Production-Grade Speech-to-Speech AI
TL;DR: A full-duplex speech-to-speech platform built on Amazon Nova Sonic with two selectable AI brain modes, a JSON-driven tool system routable through AWS AgentCore, automatic dialect detection with mid-conversation voice switching, and a phantom action watcher that catches LLMs when they promise but don't deliver.
Stack: Node.js • TypeScript • Amazon Nova Sonic • AWS Bedrock • Next.js 15 • WebSocket • AWS Connect • Langfuse
## Features
- **Full-Duplex Voice** - PCM16 audio streams in both directions on a single WebSocket; mic captures at 16 kHz, playback at 24 kHz
- **Dual Brain Modes** - Nova Sonic Direct (~200–500 ms) for low-latency interactions; Bedrock Agent (~1–3 s) for multi-step orchestrated reasoning
- **JSON-Driven Tools** - Drop a JSON file into `/tools/` to expose a new capability; no code changes required
- **Dialect Detection & Voice Switching** - AWS Transcribe identifies language mid-conversation; the system generates a natural transition phrase and switches voice automatically
- **Phantom Action Watcher** - Detects when the LLM verbally commits to an action (balance check, dispute filing, transaction lookup) but skips the tool call, then reprompts silently
- **Workflow Graphs** - JSON-defined conversation state machines injected into system prompts and visualised in the frontend `WorkflowDesigner`
- **AWS Connect Integration** - Lambda functions bridge Amazon Connect telephony calls into the same Nova Sonic pipeline via Kinesis Video Streams
- **Simulated E2E Testing** - Claude Haiku auto-generates user-role messages for scripted conversation tests; outcomes persisted as JSON
## Architecture
```mermaid
graph TB
    Phone[AWS Connect Call] -->|KVS stream| Lambda[Lambda: process-turn.js]
    Browser[Browser] -->|PCM16 binary frames| WS[WebSocket Server\nserver.ts :8080]
    Lambda --> WS
    WS --> Transcribe[AWS Transcribe Streaming\ntranscribe-client.ts]
    Transcribe -->|transcript| ModeRouter{Brain Mode}
    ModeRouter -->|Nova Sonic Direct| Sonic[sonic-client.ts\nBidirectional Nova Sonic stream]
    ModeRouter -->|Bedrock Agent| Agent[bedrock-agent-client.ts\nAgentCore orchestration]
    Sonic --> ToolMgr[tool-manager.ts\nDynamic tool dispatch]
    Agent --> ToolMgr
    ToolMgr -->|has gatewayTarget| Gateway[agentcore-gateway-client.ts\nSigV4 MCP requests]
    ToolMgr -->|local| Local[Local tool execution\nserver.ts]
    Gateway --> AgentCore[AWS AgentCore Gateway]
    Sonic -->|binary audio| WS
    WS --> Browser
    subgraph "Supporting Services"
        Dialect[dialect-detector.ts\nLocale → voice ID]
        Phantom[phantom-action-watcher.ts\nDetects broken promises]
        Prompt[prompt-service.ts\nLangfuse versioned prompts]
        Sim[simulation-service.ts\nClaude Haiku test user]
    end
    Sonic --> Dialect
    Sonic --> Phantom
    Sonic --> Prompt
    subgraph "Frontend (Next.js)"
        AppCtx[AppContext.tsx\nGlobal state]
        WsHook[useWebSocket]
        AudioHook[useAudioProcessor]
        WorkflowUI[WorkflowDesigner\nVisualises workflow graphs]
    end
    WS -->|JSON control messages| AppCtx
```
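The diagram's single WebSocket connection carries two frame types: binary frames are PCM16 audio, text frames are JSON control messages. A minimal sketch of how such a frame discriminator could look — the `ControlMessage` variants and `routeFrame` name are illustrative assumptions, not the actual `server.ts` implementation:

```typescript
// Hypothetical control-message shapes for the JSON side of the protocol.
type ControlMessage =
  | { type: "sessionStart"; brainMode: "sonic" | "agent" }
  | { type: "toolEvent"; name: string; payload: unknown }
  | { type: "transcript"; text: string; final: boolean };

interface Routed {
  kind: "audio" | "control";
  audio?: Uint8Array;
  control?: ControlMessage;
}

// Classify an incoming WebSocket frame: binary payloads carry PCM16 audio,
// string payloads carry JSON control messages.
function routeFrame(frame: Uint8Array | string): Routed {
  if (typeof frame !== "string") {
    return { kind: "audio", audio: frame };
  }
  return { kind: "control", control: JSON.parse(frame) as ControlMessage };
}
```

Keeping both channels on one connection is what lets the browser, the Connect Lambda bridge, and the Nova Sonic stream share a single session lifecycle.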
## What Makes This Special

### Phantom Action Watcher
LLMs frequently say "I'll check your balance now" and then return a text response instead of calling the tool. `phantom-action-watcher.ts` monitors every turn for high-confidence verbal commitments (balance check, transaction lookup, dispute filing) and, if no corresponding tool call appears in the same response, issues a silent reprompt. Users never see the failure; the model self-corrects.
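The core check can be sketched as a comparison between promised actions and the tool calls actually emitted in the same turn. The patterns and tool names below are illustrative assumptions, not the real `phantom-action-watcher.ts`:

```typescript
// Hypothetical commitment patterns mapped to the tool each one implies.
const COMMITMENTS: Array<{ pattern: RegExp; tool: string }> = [
  { pattern: /check(ing)? your balance/i, tool: "get_balance" },
  { pattern: /look(ing)? up (that|the) transaction/i, tool: "lookup_transaction" },
  { pattern: /fil(e|ing) a dispute/i, tool: "file_dispute" },
];

// Returns the tools the model verbally promised but never called this turn.
function findPhantomActions(assistantText: string, toolCalls: string[]): string[] {
  return COMMITMENTS
    .filter(c => c.pattern.test(assistantText))
    .filter(c => !toolCalls.includes(c.tool))
    .map(c => c.tool);
}

// For each phantom action, a silent reprompt would then be injected, e.g.
// "You said you would check the balance but did not call get_balance. Call it now."
```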
### Dual Brain, One Interface
The same frontend, same WebSocket protocol, and same tool definitions work across both brain modes. Nova Sonic Direct runs tools natively within the model stream for minimal latency. Bedrock Agent mode hands off to AWS AgentCore for complex multi-step reasoning. Switching modes is a runtime setting; no redeployment is required.
### Tool System as Configuration
Every tool is a JSON file: `name`, `description`, `input_schema`, `instruction`, `category`, and an optional `gatewayTarget`. Setting `gatewayTarget` routes the call through `AgentCoreGatewayClient` using AWS SigV4-signed MCP requests. Tools without `gatewayTarget` run locally. Adding a capability means dropping in a file; no TypeScript changes.
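A hypothetical `/tools/` file expressed as a typed object, together with the routing rule just described. The field names follow the text above; the concrete values and the `routeTool` helper are invented for illustration:

```typescript
interface ToolDefinition {
  name: string;
  description: string;
  input_schema: object;
  instruction: string;
  category: string;
  gatewayTarget?: string; // presence flips routing from local to AgentCore
}

// Hypothetical tool file, e.g. /tools/get_balance.json.
const getBalance: ToolDefinition = {
  name: "get_balance",
  description: "Return the current balance for the caller's account",
  input_schema: { type: "object", properties: { accountId: { type: "string" } } },
  instruction: "Call when the user asks for their balance.",
  category: "banking",
  gatewayTarget: "banking-gateway", // remove this field to run the tool locally
};

// Dispatch rule: gatewayTarget present → SigV4-signed MCP request, else local.
function routeTool(tool: ToolDefinition): "gateway" | "local" {
  return tool.gatewayTarget ? "gateway" : "local";
}
```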
### Dialect-Aware Conversations
AWS Transcribe Streaming returns language-identification confidence scores alongside transcripts. `dialect-detector.ts` maps locale codes (`en-GB`, `fr-FR`, etc.) to voice IDs. When confidence crosses a threshold, `transition-handler.ts` calls Bedrock to generate a natural "let me switch to your language" phrase, then swaps the active voice mid-session.
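A minimal sketch of the switch decision under stated assumptions: the locale-to-voice table, the voice names, and the 0.85 threshold are all invented for illustration, not taken from `dialect-detector.ts`:

```typescript
// Hypothetical locale → voice ID table.
const VOICE_BY_LOCALE: Record<string, string> = {
  "en-US": "matthew",
  "en-GB": "amy",
  "fr-FR": "lea",
};

const SWITCH_THRESHOLD = 0.85; // assumed confidence floor for a voice swap

// Decide whether the detected locale justifies swapping the active voice.
function pickVoice(
  locale: string,
  confidence: number,
  currentVoice: string,
): { voice: string; switched: boolean } {
  const candidate = VOICE_BY_LOCALE[locale];
  if (!candidate || candidate === currentVoice || confidence < SWITCH_THRESHOLD) {
    return { voice: currentVoice, switched: false };
  }
  // In the real flow, transition-handler.ts would first ask Bedrock for a
  // natural hand-over phrase before the swap takes effect.
  return { voice: candidate, switched: true };
}
```

Gating on both confidence and a changed voice avoids flapping when Transcribe briefly wavers between similar locales.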
## Technical Highlights

### Backend (Node.js + TypeScript)
- WebSocket server (`server.ts`): single entry point handling session lifecycle, binary audio frames, and JSON control messages on one connection
- Nova Sonic (`sonic-client.ts`): bidirectional stream; system prompt assembled at startup from modular `.txt` files in `backend/prompts/`
- Prompt composition: `core-system_default.txt` + `core-guardrails.txt` + `core-tool_access_assistant.txt` + persona file + silently injected dialect detection prompt
- Langfuse integration: `prompt-service.ts` fetches production-labelled prompt versions at runtime, so prompts can be updated without redeploying
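The prompt assembly described above amounts to concatenating modular fragments in a fixed order, with the dialect prompt injected silently at the end. A self-contained sketch using in-memory strings instead of reading `backend/prompts/`; the fragment contents and the persona filename are invented:

```typescript
// Stand-ins for the modular .txt fragments in backend/prompts/.
const fragments: Record<string, string> = {
  "core-system_default.txt": "You are a helpful voice assistant.",
  "core-guardrails.txt": "Never reveal internal tooling.",
  "core-tool_access_assistant.txt": "Use tools for any account action.",
  "persona-banking.txt": "You work for a retail bank.", // hypothetical persona file
};

// The dialect-detection prompt is appended without appearing in the ordering.
const DIALECT_PROMPT = "If the caller switches language, follow them.";

// Join the named fragments in order, then silently inject the dialect prompt.
function composeSystemPrompt(order: string[]): string {
  const parts = order.map(name => fragments[name]);
  return [...parts, DIALECT_PROMPT].join("\n\n");
}
```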
### Frontend (Next.js 15)
- Static export served directly from the backend in production (`output: 'export'`)
- `AppContext.tsx`: central state for session, transcripts, tool events, token usage, and workflow step tracking
- `WorkflowDesigner`: visualises embedded JSON workflow graphs (`workflow-banking.json`, `workflow-disputes.json`, etc.)
- Credentials can be set at runtime via the Settings panel (stored in `sessionStorage`)
### AWS Infrastructure
- Amazon Connect: `start-session.js` Lambda creates a DynamoDB record on call arrival; `process-turn.js` Lambda processes each KVS turn
- CloudFormation: `aws/cloudformation.yaml` + `aws/banking-data-layer.yaml` define supporting infrastructure
- Simulated testing: `simulation-service.ts` uses Claude Haiku on Bedrock to generate realistic user turns; results persisted to `tests/test_logs/`
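The simulated-testing loop can be sketched as a harness where a model plays the user role and the outcome is recorded as pass/fail/unknown JSON. Everything here is illustrative: `simulateUserTurn` stands in for the Claude Haiku call, and the pass criterion is a placeholder, not the logic in `simulation-service.ts`:

```typescript
type Outcome = "pass" | "fail" | "unknown";

interface TestLog {
  scenario: string;
  turns: Array<{ role: "user" | "assistant"; text: string }>;
  outcome: Outcome;
}

// Run one scripted conversation: each step yields a simulated user turn,
// the assistant replies, and the transcript plus outcome are collected.
function runScenario(
  scenario: string,
  script: string[],                           // expected assistant behaviours
  assistant: (userText: string) => string,    // system under test
  simulateUserTurn: (step: string) => string, // stand-in for the Haiku user role
): TestLog {
  const turns: TestLog["turns"] = [];
  let outcome: Outcome = "pass";
  for (const step of script) {
    const userText = simulateUserTurn(step);
    const reply = assistant(userText);
    turns.push({ role: "user", text: userText }, { role: "assistant", text: reply });
    // Placeholder check; the real service applies its own scripted criteria.
    if (!reply.toLowerCase().includes(step.toLowerCase())) outcome = "unknown";
  }
  return { scenario, turns, outcome }; // persisted as JSON under tests/test_logs/
}
```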
## Key Metrics
- Latency: ~200–500 ms (Nova Sonic Direct), ~1–3 s (Bedrock Agent)
- Tool loading: dynamic at startup, so capabilities can be added with zero downtime
- Test coverage: scripted E2E conversations with pass/fail/unknown outcomes logged as JSON
- Deployment: a single `npm start` serves both frontend and backend from port 8080
This project demonstrates production-level voice AI engineering: real-time audio pipelines, multi-modal AWS service orchestration, self-healing LLM behaviour, and a configuration-driven architecture that scales without code changes.