# Introduction
Creating voice-enabled AI agents that feel truly conversational is challenging. Traditional setups suffer from high latency in transcription, reasoning, and synthesis, leading to unnatural pauses and poor user experience. This guide solves that by integrating Claude's powerful tool-calling capabilities with Deepgram's streaming audio processing for sub-second end-to-end latency.
We'll build a real-time voice bot using Node.js, where:
- Microphone audio streams to Deepgram for live transcription.
- Partial transcripts feed into Claude (via Anthropic SDK) for agentic reasoning with tools.
- Claude's response streams to Deepgram TTS for instant audio playback.
This setup suits customer support bots, virtual assistants, and interactive demos. With Claude 3.5 Sonnet the target is under 500ms round-trip, though actual latency depends on your network, model load, and response length.
## Why Deepgram + Claude?
**Deepgram excels in:**
- **Streaming STT (Nova-2)**: Roughly 95%+ accuracy and 300ms latency, handles interruptions.
- **Streaming TTS (Aura)**: Natural voices, roughly 250ms time-to-first-audio.
**Claude shines in:**
- Tool calling for agentic behavior (e.g., query APIs, manage state).
- Streaming responses for low-latency partial outputs.
- Constitutional AI for safe, reliable interactions.
Together, they can be a faster alternative to GPT + Whisper/TTS pipelines while giving you Claude's tool use, though results depend on your workload, so benchmark before committing.
## Architecture Overview
```
Browser Mic → WebSocket → Deepgram STT (stream) → Claude Agent (tools/stream) → Deepgram TTS (stream) → WebSocket → Speakers
                                                        ↑
                                          Conversation State (Redis/Memory)
```
- **Bidirectional WebSocket**: Carries mic audio up to the server and synthesized audio back to the browser.
- **Claude Tools**: Example tools for weather lookup and math solver.
- **State Management**: Simple in-memory for demo; scale with Redis.
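The in-memory state mentioned above could be as simple as a `Map` keyed by session. The following is a minimal sketch, not part of either SDK: `SessionStore` and its method names are hypothetical, and the turn cap is an arbitrary assumption. A Redis-backed version could keep the same interface.

```javascript
// Hypothetical in-memory conversation store (illustration only).
class SessionStore {
  constructor(maxTurns = 20) {
    this.sessions = new Map(); // sessionId -> array of { role, content }
    this.maxTurns = maxTurns;  // cap history so prompts stay small
  }

  // Returns the history for a session, creating it on first access
  history(sessionId) {
    if (!this.sessions.has(sessionId)) this.sessions.set(sessionId, []);
    return this.sessions.get(sessionId);
  }

  append(sessionId, role, content) {
    const turns = this.history(sessionId);
    turns.push({ role, content });
    // Drop the oldest turns once the cap is exceeded
    if (turns.length > this.maxTurns) turns.splice(0, turns.length - this.maxTurns);
    return turns;
  }

  end(sessionId) {
    this.sessions.delete(sessionId);
  }
}
```

A real implementation would trim in a way that keeps user/assistant pairs (and any tool results) together; the simple cap here ignores that.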
## Prerequisites
- Node.js 18+
- Accounts: [Anthropic API](https://console.anthropic.com) (Claude 3.5 Sonnet), [Deepgram](https://console.deepgram.com) (STT + TTS)
- API Keys: `ANTHROPIC_API_KEY`, `DEEPGRAM_API_KEY`
- Basic WebSocket knowledge
## Step 1: Project Setup
Create a new directory and initialize:
```bash
mkdir claude-voice-agent
cd claude-voice-agent
npm init -y
npm pkg set type=module   # the code below uses ES module imports
npm install @anthropic-ai/sdk @deepgram/sdk ws dotenv
```
Create `.env`:
```env
ANTHROPIC_API_KEY=your_key
DEEPGRAM_API_KEY=your_key
```
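A missing key tends to surface later as a confusing authentication error, so it can help to check the environment at startup. This is a small sketch; `missingEnv` and `assertEnv` are hypothetical helper names, not part of `dotenv` or either SDK.

```javascript
// Hypothetical startup check: fail fast when a required key is missing.
function missingEnv(names, env = process.env) {
  // A variable counts as missing when it is unset or blank
  return names.filter((name) => !env[name] || env[name].trim() === '');
}

function assertEnv(names, env = process.env) {
  const missing = missingEnv(names, env);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }
}
```

Calling `assertEnv(['ANTHROPIC_API_KEY', 'DEEPGRAM_API_KEY'])` right after the `.env` file is loaded would stop the server before any client connects.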
## Step 2: Streaming Transcription with Deepgram
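Deepgram's WebSocket API handles live audio. The browser client in Step 6 sends WebM chunks from `MediaRecorder`, which Deepgram detects automatically; if you send raw 16kHz PCM instead, add `encoding: 'linear16'` and `sample_rate: 16000` to the options.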
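```javascript
// transcription.js
import { createClient, LiveTranscriptionEvents } from '@deepgram/sdk';

export async function startTranscription(socket) {
  // Create the client here so the key is read after dotenv has loaded
  const deepgram = createClient(process.env.DEEPGRAM_API_KEY);
  const dgConnection = deepgram.listen.live({
    punctuate: true,
    interim_results: true,
    language: 'en-US',
    model: 'nova-2',
  });

  dgConnection.on(LiveTranscriptionEvents.Open, () => console.log('STT connected'));
  dgConnection.on(LiveTranscriptionEvents.Error, (err) => console.error('STT error', err));

  dgConnection.on(LiveTranscriptionEvents.Transcript, (data) => {
    const transcript = data.channel.alternatives[0].transcript;
    // Forward only finalized segments so Claude is not called on every interim result.
    // `ws` sockets are EventEmitters, so this raises a local event for server.js.
    if (transcript && data.is_final) {
      socket.emit('partialTranscript', transcript);
    }
  });

  // Receive binary audio frames from the browser and forward them to Deepgram
  socket.on('message', (audioBuffer, isBinary) => {
    if (isBinary) dgConnection.send(audioBuffer);
  });

  return dgConnection;
}
```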
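With `interim_results` enabled, one spoken sentence can arrive as several finalized segments. A possible way to hand Claude a whole utterance is to collect segments until Deepgram signals the end of speech. The sketch below assumes the `is_final` and `speech_final` flags behave as described in Deepgram's docs; `UtteranceBuffer` is a hypothetical helper, not an SDK class.

```javascript
// Hypothetical helper: collect finalized segments into one utterance.
class UtteranceBuffer {
  constructor() {
    this.segments = [];
  }

  // Takes a Deepgram-style result; returns the full utterance once
  // speech has ended, otherwise null.
  push(result) {
    const text = result.channel?.alternatives?.[0]?.transcript ?? '';
    if (result.is_final && text) this.segments.push(text);
    if (!result.speech_final || this.segments.length === 0) return null;
    const utterance = this.segments.join(' ');
    this.segments = [];
    return utterance;
  }
}
```

Inside the transcript handler, a non-null return value from `push(data)` would be the text to send on to the agent.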
## Step 3: Claude Agent with Tool Calling
Use Anthropic SDK for streaming + tools. Define tools for agentic flow.
```javascript
// claudeAgent.js
import Anthropic from '@anthropic-ai/sdk';

let anthropic; // Created on first use so the key is read after dotenv has loaded

// Tools: Weather and Calculator (expand as needed)
const tools = [
  {
    name: 'get_weather',
    description: 'Get current weather for a city',
    input_schema: {
      type: 'object',
      properties: { city: { type: 'string' } },
      required: ['city'],
    },
  },
  {
    name: 'calculator',
    description: 'Solve math expressions',
    input_schema: {
      type: 'object',
      properties: { expression: { type: 'string' } },
      required: ['expression'],
    },
  },
];

export async function processWithClaude(transcript, conversationHistory = []) {
  anthropic ??= new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });
  conversationHistory.push({ role: 'user', content: transcript });
  let fullResponse = '';

  // Repeat until Claude stops asking for tools (capped to avoid runaway loops)
  for (let turn = 0; turn < 5; turn++) {
    const stream = anthropic.messages.stream({
      model: 'claude-3-5-sonnet-20240620',
      max_tokens: 1024,
      tools,
      messages: conversationHistory,
    });
    stream.on('text', (text) => {
      fullResponse += text;
      process.stdout.write(text); // Stream to console
    });

    const message = await stream.finalMessage();
    conversationHistory.push({ role: 'assistant', content: message.content });
    if (message.stop_reason !== 'tool_use') break;

    // Run each requested tool and send the results back as tool_result blocks
    const toolResults = [];
    for (const block of message.content) {
      if (block.type !== 'tool_use') continue;
      const result = await executeTool(block);
      toolResults.push({
        type: 'tool_result',
        tool_use_id: block.id,
        content: JSON.stringify(result),
      });
    }
    conversationHistory.push({ role: 'user', content: toolResults });
    if (fullResponse) fullResponse += ' ';
  }
  return fullResponse;
}

async function executeTool({ name, input }) {
  if (name === 'get_weather') {
    // Mock API call
    return { temperature: '72°F', condition: 'Sunny' };
  }
  if (name === 'calculator') {
    // Allow only arithmetic characters before evaluating; a dedicated parser is safer still
    if (!/^[\d\s+\-*/().]+$/.test(input.expression)) {
      return { error: 'Unsupported expression' };
    }
    try {
      return { result: Function(`"use strict"; return (${input.expression});`)() };
    } catch {
      return { error: 'Invalid expression' };
    }
  }
  return { error: `Unknown tool: ${name}` };
}
```
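**Note**: A complete agent loop returns each tool result to Claude as a `tool_result` block and repeats the request until `stop_reason` is no longer `tool_use`; see the Anthropic tool-use docs. Streaming the text as it arrives lets synthesis start before the full reply is ready.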
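The `calculator` tool runs an expression chosen by the model, so it is worth avoiding `eval` entirely. One option is a small recursive-descent evaluator that only understands numbers, `+ - * /`, and parentheses. This is a sketch under that assumption; `evaluateExpression` is a hypothetical name, and a maintained math library may be a better fit in production.

```javascript
// Hypothetical replacement for eval(): a tiny arithmetic evaluator.
function evaluateExpression(source) {
  const tokens = source.match(/\d+(?:\.\d+)?|[-+*/()]/g) ?? [];
  // Reject input containing anything other than numbers, operators, and spaces
  if (tokens.join('') !== source.replace(/\s+/g, '')) {
    throw new Error('Unsupported characters in expression');
  }
  let pos = 0;
  const peek = () => tokens[pos];
  const next = () => tokens[pos++];

  function parseExpression() { // handles + and -
    let value = parseTerm();
    while (peek() === '+' || peek() === '-') {
      const op = next();
      const rhs = parseTerm();
      value = op === '+' ? value + rhs : value - rhs;
    }
    return value;
  }

  function parseTerm() { // handles * and /
    let value = parseFactor();
    while (peek() === '*' || peek() === '/') {
      const op = next();
      const rhs = parseFactor();
      value = op === '*' ? value * rhs : value / rhs;
    }
    return value;
  }

  function parseFactor() { // numbers, parentheses, unary minus
    const token = next();
    if (token === undefined) throw new Error('Unexpected end of expression');
    if (token === '-') return -parseFactor();
    if (token === '(') {
      const value = parseExpression();
      if (next() !== ')') throw new Error('Expected closing parenthesis');
      return value;
    }
    if (/^\d/.test(token)) return Number(token);
    throw new Error(`Unexpected token: ${token}`);
  }

  const result = parseExpression();
  if (pos !== tokens.length) throw new Error('Unexpected trailing input');
  return result;
}
```

For the test phrase used later in this guide, `evaluateExpression('15*23')` returns `345`, while input such as `process.exit()` is rejected with an error instead of being executed.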
## Step 4: Streaming TTS with Deepgram
Convert Claude's text response to speech instantly.
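```javascript
// synthesis.js
import { Readable } from 'node:stream';
import { createClient } from '@deepgram/sdk';

export async function synthesizeSpeech(text) {
  // Create the client here so the key is read after dotenv has loaded
  const deepgramTTS = createClient(process.env.DEEPGRAM_API_KEY);
  const response = await deepgramTTS.speak.request(
    { text },
    { model: 'aura-asteria-en' }, // Aura voice; see the Deepgram docs for others
  );
  const webStream = await response.getStream();
  if (!webStream) throw new Error('Deepgram returned no audio');
  return Readable.fromWeb(webStream); // Node Readable, so .on('data') works in server.js
}
```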
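To get audio out sooner, one approach is to synthesize each sentence as soon as Claude finishes it rather than waiting for the full reply. The sketch below splits streamed text deltas on sentence-ending punctuation; `SentenceChunker` is a hypothetical helper, and its boundary rule is a simplifying assumption.

```javascript
// Hypothetical helper: turn streamed text deltas into complete sentences.
class SentenceChunker {
  constructor() {
    this.pending = '';
  }

  // Add a text delta; returns any sentences completed so far.
  push(delta) {
    this.pending += delta;
    const sentences = [];
    // A sentence ends at ., !, or ? followed by whitespace
    const boundary = /[.!?]+\s+/g;
    let start = 0;
    let match;
    while ((match = boundary.exec(this.pending)) !== null) {
      const end = match.index + match[0].length;
      sentences.push(this.pending.slice(start, end).trim());
      start = end;
    }
    this.pending = this.pending.slice(start);
    return sentences;
  }

  // Call once the stream ends to get whatever text is left.
  flush() {
    const rest = this.pending.trim();
    this.pending = '';
    return rest ? [rest] : [];
  }
}
```

Each returned sentence could be passed to `synthesizeSpeech` in order. The rule will split on abbreviations such as "Dr. Smith", which is usually tolerable for speech but worth knowing about.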
## Step 5: WebSocket Server
Tie it all together in a single server file.
```javascript
// server.js
import 'dotenv/config'; // Load .env before the other modules read process.env
import { WebSocketServer } from 'ws';
import { startTranscription } from './transcription.js';
import { processWithClaude } from './claudeAgent.js';
import { synthesizeSpeech } from './synthesis.js';

const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', async (ws) => {
  console.log('Client connected');
  const conversationHistory = []; // Per connection, so sessions do not share context
  const dgConn = await startTranscription(ws);

  // Local event emitted by transcription.js (not a message from the browser)
  ws.on('partialTranscript', async (transcript) => {
    if (!transcript.trim()) return;
    try {
      const response = await processWithClaude(transcript, conversationHistory);
      const audioStream = await synthesizeSpeech(response);
      audioStream.on('data', (chunk) => {
        ws.send(chunk); // Stream audio back
      });
    } catch (err) {
      console.error('Turn failed', err);
    }
  });

  ws.on('close', () => {
    dgConn.finish(); // Deepgram SDK v3; check your SDK version for the preferred close call
    console.log('Client disconnected');
  });
});
console.log('Server running on ws://localhost:8080');
```
Run with `node server.js`.
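If the user speaks again while a reply is still being generated, two turns can run at once and their audio will interleave. One way to avoid that is to process turns strictly in order per connection. This is a minimal sketch; `createSerialQueue` is a hypothetical helper, not part of `ws`.

```javascript
// Hypothetical helper: run async jobs one at a time, in arrival order.
function createSerialQueue() {
  let tail = Promise.resolve();
  return function enqueue(job) {
    // Start this job only after the previous one has settled
    const run = tail.then(() => job());
    // Keep the chain alive even if a job rejects
    tail = run.catch(() => {});
    return run;
  };
}
```

In the connection handler you might create one queue per socket and wrap each turn as `enqueue(() => handleTurn(transcript))`, where `handleTurn` is whatever function runs the Claude and TTS steps.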
## Step 6: Browser Client
Save this minimal page as `client.html`; it captures mic input and plays back the audio the server returns.
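```html
<!DOCTYPE html>
<html>
<head><title>Claude Voice Agent</title></head>
<body>
<button id="start">Start Talking</button>
<script>
const ws = new WebSocket('ws://localhost:8080');
ws.binaryType = 'arraybuffer'; // decodeAudioData needs an ArrayBuffer, not a Blob
let mediaRecorder;
document.getElementById('start').onclick = async () => {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  mediaRecorder = new MediaRecorder(stream, { mimeType: 'audio/webm' });
  mediaRecorder.ondataavailable = (e) => {
    if (e.data.size > 0 && ws.readyState === WebSocket.OPEN) {
      ws.send(e.data); // Send audio chunks
    }
  };
  mediaRecorder.start(250); // 250ms chunks for low latency
  // Play incoming audio chunks back to back rather than on top of each other
  const audioCtx = new AudioContext();
  let nextStart = 0;               // When the previously scheduled chunk ends
  let playback = Promise.resolve(); // Decode chunks in arrival order
  ws.onmessage = (e) => {
    playback = playback
      .then(() => audioCtx.decodeAudioData(e.data))
      .then((buffer) => {
        const source = audioCtx.createBufferSource();
        source.buffer = buffer;
        source.connect(audioCtx.destination);
        const startAt = Math.max(audioCtx.currentTime, nextStart);
        source.start(startAt);
        nextStart = startAt + buffer.duration;
      })
      // A chunk that is not independently decodable is skipped, not fatal
      .catch((err) => console.warn('Could not decode audio chunk', err));
  };
};
</script>
</body>
</html>
```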
## Testing and Optimization
1. **Run**: `node server.js`, open `client.html`.
2. **Test Tools**: Say "What's the weather in NYC?" or "Calculate 15*23".
3. **Latency Tips**:
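   - A smaller model such as Claude 3 Haiku typically responds faster; measure time-to-first-token for your own prompts.
   - Buffer partial transcripts and finalize on silence. Deepgram's `endpointing` option sets the silence threshold in milliseconds; a 3s timeout works as a fallback but adds noticeable delay.
   - Host the server close to your users and the API regions. The `ws` server here needs a long-lived Node process, which some serverless edge platforms do not provide.
4. **Metrics**: Log `Date.now()` at each step and compare the total against the sub-500ms end-to-end target.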
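The timing advice above can be sketched as a small per-turn timer that records a timestamp at each stage and reports the gaps. `createTurnTimer` and the stage labels are hypothetical; the clock is passed in so the helper can be tested without waiting.

```javascript
// Hypothetical per-turn timer; the clock is injectable for testing.
function createTurnTimer(now = Date.now) {
  const marks = [];
  return {
    // Record the current time under a stage label
    mark(label) {
      marks.push({ label, at: now() });
    },
    // Milliseconds between consecutive marks, plus the overall total
    report() {
      const stages = {};
      for (let i = 1; i < marks.length; i++) {
        const key = `${marks[i - 1].label}->${marks[i].label}`;
        stages[key] = marks[i].at - marks[i - 1].at;
      }
      const total = marks.length > 1 ? marks[marks.length - 1].at - marks[0].at : 0;
      return { stages, total };
    },
  };
}
```

Marking points such as `speech_end`, `transcript`, `first_token`, and `first_audio` would show which stage dominates a slow turn.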
**Scaling**:
- Redis for multi-session history.
- Twilio Media Streams for phone integration.
- MCP servers for advanced Claude tools.
## Common Pitfalls
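- **Audio Format**: Containerized audio such as WebM is detected automatically; for raw PCM, send 16kHz mono and set `encoding` and `sample_rate` explicitly.
- **Tool Loops**: Claude may call tools several times in one turn. Tool calls arrive as structured `tool_use` content blocks (no XML parsing needed), so keep looping until `stop_reason` is no longer `tool_use`.
- **Interruptions**: Use voice activity detection (for example, one built on the Web Audio API) to detect barge-in and stop playback.
- **Rate Limits**: Limits depend on your plan. Anthropic's entry tier allows about 50 requests per minute, and Deepgram's limits are more generous; check each console for current values.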
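For the barge-in case, a very simple form of voice activity detection is an energy threshold: treat the user as speaking once several consecutive audio frames are loud enough. The sketch below assumes 16-bit PCM frames; `rms`, `createBargeInDetector`, and the default threshold values are hypothetical and would need tuning. A dedicated VAD is more robust against background noise.

```javascript
// Hypothetical energy-based speech detector for 16-bit PCM frames.
function rms(samples) {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (const s of samples) {
    const x = s / 32768; // normalize to [-1, 1)
    sumSquares += x * x;
  }
  return Math.sqrt(sumSquares / samples.length);
}

function createBargeInDetector({ threshold = 0.05, minFrames = 3 } = {}) {
  let loudFrames = 0;
  // Returns true once `minFrames` consecutive frames exceed the threshold
  return function onFrame(samples) {
    loudFrames = rms(samples) > threshold ? loudFrames + 1 : 0;
    return loudFrames >= minFrames;
  };
}
```

When the detector returns `true` during playback, the client could stop the current audio and let the new utterance through.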
## Conclusion
You've now built a working voice agent prototype with Claude and Deepgram. Before production use, add persistent session state, error handling, and authentication. Extend it with n8n for workflows or Claude Code for dev tools. Share your builds in the comments!
**Resources**:
- [Anthropic Tools Docs](https://docs.anthropic.com)
- [Deepgram Live Transcription](https://developers.deepgram.com)
- GitHub Repo: [Link to your fork]