Build an AI Voice Agent with ElevenLabs, Whisper and Twilio (Complete Guide)
AI voice agents handle inbound and outbound calls, qualify leads, book appointments, and push data to your CRM — 24/7, at a fraction of human agent costs. This guide builds a complete voice agent from scratch.
What an AI Voice Agent Can Do
A well-built AI voice agent sounds human, thinks fast, and works around the clock. It can answer your phone, understand what the caller needs, ask follow-up questions, look up information in real time, and take action — booking a meeting, updating a CRM record, sending a confirmation email — before ending the call.
For businesses, this replaces or augments a human call centre agent at a cost of $0.05–$0.15 per minute (versus $1–$3 per minute for human agents), with zero sick days, zero training time, and instant scalability from 1 call to 1,000 simultaneous calls.
The stack: Twilio for phone number and call handling, OpenAI Whisper for speech-to-text transcription, GPT-4o for conversation logic and decision making, and ElevenLabs for text-to-speech with ultra-realistic voices. The application layer runs on FastAPI (Python) with a WebSocket server handling real-time audio streaming.
Architecture Overview
The real-time voice pipeline works in a loop measured in milliseconds:
- Caller speaks → Twilio streams audio bytes via WebSocket to your server
- Whisper transcribes audio chunks in near real-time (100–200ms latency)
- Voice Activity Detection (VAD) detects end of utterance (silence threshold)
- Complete transcript sent to GPT-4o with conversation history and tool definitions
- GPT-4o responds (with tool calls if needed — CRM lookup, calendar check, etc.)
- ElevenLabs converts response text to audio (200–400ms)
- Audio streamed back to Twilio, played to caller
- Repeat from step 1
Total round-trip latency: 600ms–1.2s. At the lower end, this feels like a natural conversational pause. At the higher end, it's noticeable but acceptable for business calls.
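As a rough sanity check, the per-stage budgets can be added up. The figures below are illustrative mid-range values taken from the latency ranges quoted in this article; real numbers vary with models, region, and network:

```python
# Mid-range per-stage latency budgets (ms) from the pipeline figures above
budget_ms = {
    "twilio_transport_in": 50,
    "whisper_stt": 150,
    "vad_endpointing": 100,
    "gpt4o_response": 300,
    "elevenlabs_tts": 250,
    "twilio_transport_out": 50,
}

total = sum(budget_ms.values())
print(f"Estimated round trip: ~{total} ms")  # ~900 ms, inside the 600ms-1.2s range
```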
Setting Up Twilio Media Streams
Twilio's Media Streams API streams raw audio from a call to your WebSocket server in real time. Configure a TwiML response when an inbound call arrives:
from fastapi import FastAPI, Request
from fastapi.responses import Response

app = FastAPI()

@app.post("/incoming-call")
async def incoming_call(request: Request):
    """Twilio calls this webhook when a call arrives."""
    twiml = """<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="wss://yourserver.com/media-stream">
      <Parameter name="systemPrompt"
                 value="You are an AI receptionist for Acme Corp. Qualify the caller, ask for name, company, and reason for calling. If they want to book a demo, collect their preferred date/time."/>
    </Stream>
  </Connect>
</Response>"""
    return Response(content=twiml, media_type="application/xml")
The WebSocket Handler: Receiving and Sending Audio
import json, base64
from fastapi import WebSocket
from openai import AsyncOpenAI

openai_client = AsyncOpenAI()

@app.websocket("/media-stream")
async def media_stream(websocket: WebSocket):
    await websocket.accept()
    # Buffer for accumulating audio chunks
    audio_buffer = bytearray()
    conversation_history = []
    stream_sid = None
    async for message in websocket.iter_text():
        data = json.loads(message)
        event = data.get("event")
        if event == "start":
            stream_sid = data["start"]["streamSid"]
            system_prompt = data["start"]["customParameters"]["systemPrompt"]
            conversation_history = [{"role": "system", "content": system_prompt}]
        elif event == "media":
            # Twilio sends audio as base64-encoded mulaw 8kHz
            audio_chunk = base64.b64decode(data["media"]["payload"])
            audio_buffer.extend(audio_chunk)
            # Detect end of utterance (simple silence detection)
            if is_end_of_utterance(audio_buffer):
                user_text = await transcribe(bytes(audio_buffer))
                audio_buffer.clear()
                if user_text.strip():
                    conversation_history.append({"role": "user", "content": user_text})
                    response_text, tool_calls = await get_ai_response(conversation_history)
                    if tool_calls:
                        tool_results = await execute_tools(tool_calls)
                        conversation_history.extend(tool_results)
                        response_text, _ = await get_ai_response(conversation_history)
                    conversation_history.append({"role": "assistant", "content": response_text})
                    audio_response = await synthesize_speech(response_text)
                    await send_audio_to_caller(websocket, stream_sid, audio_response)
        elif event == "stop":
            await save_call_transcript(conversation_history)
            break
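The handler above references helpers that aren't shown. Here are minimal sketches of two of them: a crude RMS-based `is_end_of_utterance` and `send_audio_to_caller`, which writes Twilio's outbound media-stream message format. The silence thresholds are assumptions you'll need to tune per deployment:

```python
import base64, json, math

SILENCE_RMS = 500    # assumed RMS threshold for "quiet"; tune per phone line
SILENCE_MS = 700     # assumed pause length that ends an utterance
SAMPLE_RATE = 8000   # Twilio mu-law audio: 8 kHz, 1 byte per sample

def ulaw_to_pcm16(byte: int) -> int:
    """Decode one G.711 mu-law byte to a 16-bit linear sample."""
    byte = ~byte & 0xFF
    magnitude = (((byte & 0x0F) << 3) + 0x84) << ((byte >> 4) & 0x07)
    return -(magnitude - 0x84) if byte & 0x80 else magnitude - 0x84

def is_end_of_utterance(buffer: bytearray) -> bool:
    """Crude VAD: the caller has finished speaking when the last
    SILENCE_MS of audio stays below the RMS silence threshold."""
    tail = SAMPLE_RATE * SILENCE_MS // 1000
    if len(buffer) < tail:
        return False
    samples = [ulaw_to_pcm16(b) for b in buffer[-tail:]]
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms < SILENCE_RMS

async def send_audio_to_caller(websocket, stream_sid: str, audio: bytes) -> None:
    """Send mu-law audio back over the Media Stream in 160-byte (20 ms)
    frames, the frame size Twilio itself uses."""
    for i in range(0, len(audio), 160):
        await websocket.send_text(json.dumps({
            "event": "media",
            "streamSid": stream_sid,
            "media": {"payload": base64.b64encode(audio[i:i + 160]).decode()},
        }))
```

In production, replace the RMS heuristic with a trained VAD such as Silero VAD or webrtcvad — energy thresholds misfire on background noise and soft-spoken callers.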
Speech-to-Text with Whisper
import io
import wave
import audioop  # stdlib; deprecated since Python 3.11, removed in 3.13

async def transcribe(audio_bytes: bytes) -> str:
    """Transcribe mu-law audio to text using Whisper."""
    # Convert mu-law 8kHz to 16-bit PCM, then upsample to 16kHz
    pcm_audio = audioop.ulaw2lin(audio_bytes, 2)
    pcm_audio, _ = audioop.ratecv(pcm_audio, 2, 1, 8000, 16000, None)
    # Wrap the raw PCM in a WAV container in memory
    wav_buffer = io.BytesIO()
    with wave.open(wav_buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(16000)
        wav.writeframes(pcm_audio)
    wav_buffer.seek(0)
    response = await openai_client.audio.transcriptions.create(
        model="whisper-1",
        file=("audio.wav", wav_buffer, "audio/wav"),
        language="en"
    )
    return response.text
For lower latency than Whisper, Deepgram offers real-time streaming transcription with 50–100ms latency and supports streaming the transcript word-by-word as the caller speaks — enabling faster interruption detection and response generation.
Text-to-Speech with ElevenLabs
import os
from elevenlabs.client import AsyncElevenLabs

elevenlabs = AsyncElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])

async def synthesize_speech(text: str) -> bytes:
    """Convert text to ultra-realistic speech audio."""
    audio_stream = await elevenlabs.generate(
        text=text,
        voice="Rachel",              # Professional, warm female voice
        model="eleven_turbo_v2",     # Fastest model, ~200ms latency
        voice_settings={
            "stability": 0.75,
            "similarity_boost": 0.75,
            "style": 0.0,
            "use_speaker_boost": True
        },
        output_format="ulaw_8000"    # Twilio-compatible format directly
    )
    return b"".join([chunk async for chunk in audio_stream])
eleven_turbo_v2 is ElevenLabs' lowest-latency model — 200–300ms for short sentences. For the best voice quality (indistinguishable from human), eleven_multilingual_v2 adds ~100ms but produces noticeably more natural prosody and emotion.
Tool Integration: CRM, Calendar, Real-Time Data
tools = [
    {
        "type": "function",
        "function": {
            "name": "lookup_contact",
            "description": "Look up a contact in the CRM by phone number. Call this at the start of every conversation.",
            "parameters": {
                "type": "object",
                "properties": {
                    "phone": {"type": "string", "description": "Caller phone number in E.164 format"}
                },
                "required": ["phone"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "book_appointment",
            "description": "Book a demo or consultation appointment in the calendar.",
            "parameters": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "email": {"type": "string"},
                    "date": {"type": "string", "description": "ISO 8601 datetime"},
                    "purpose": {"type": "string", "enum": ["demo", "consultation", "support"]}
                },
                "required": ["name", "email", "date", "purpose"]
            }
        }
    }
]
When GPT-4o decides to book an appointment, it outputs a structured tool call. Your executor hits the Google Calendar API (or Calendly), creates the event, emails a confirmation, and updates the CRM — all before the caller hangs up.
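A minimal executor sketch for that flow is below. The `lookup_contact` and `book_appointment` handlers are placeholder stubs (swap in real CRM and calendar calls); the message shapes follow the OpenAI Chat Completions tool-calling format, which expects the assistant turn that made the calls followed by one "tool" result message per call:

```python
import json

async def lookup_contact(phone: str) -> dict:
    # Placeholder: replace with a real CRM query (HubSpot, Salesforce, ...)
    return {"found": False, "phone": phone}

async def book_appointment(name: str, email: str, date: str, purpose: str) -> dict:
    # Placeholder: replace with a Google Calendar / Calendly API call
    return {"status": "booked", "date": date}

HANDLERS = {"lookup_contact": lookup_contact, "book_appointment": book_appointment}

async def execute_tools(tool_calls) -> list[dict]:
    """Run each requested tool and return the messages the Chat
    Completions API expects next in the conversation history."""
    messages = [{
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": c.id,
            "type": "function",
            "function": {"name": c.function.name, "arguments": c.function.arguments},
        } for c in tool_calls],
    }]
    for call in tool_calls:
        handler = HANDLERS[call.function.name]
        result = await handler(**json.loads(call.function.arguments))
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result),
        })
    return messages
```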
Reducing Latency: The Critical UX Factor
Latency is the primary UX challenge in voice AI. Human conversations have 200–500ms natural pauses. At 1.2s+ response time, the agent starts feeling robotic. Tactics to reduce latency:
- Stream ElevenLabs audio: Start sending audio to Twilio as the first chunk arrives, not after the full response is generated. Reduces perceived latency by 200–400ms.
- Use Deepgram instead of Whisper: 50ms streaming transcription vs 300ms batch Whisper.
- Use GPT-4o mini for simple turns: Route filler responses ("Let me check that for you") to a fast, cheap model. Route complex reasoning to GPT-4o.
- Optimistic response generation: Start generating the TTS response as the LLM streams tokens, before the full response is complete.
- Regional deployment: Deploy your WebSocket server in the same AWS region as your target callers. Twilio → your server latency drops from 80ms to 15ms.
With all optimisations, a production voice agent can achieve 700–900ms end-to-end latency — feeling natural in conversation.
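The first optimisation — streaming audio as it arrives — can be sketched as a small pump that replaces the buffer-then-send pattern used earlier. `chunks` is assumed to be any async iterator of mu-law bytes, such as the ElevenLabs stream before it is joined into one blob:

```python
import base64, json

async def pump_audio(websocket, stream_sid: str, chunks) -> None:
    """Forward TTS audio to Twilio as each chunk arrives instead of
    buffering the full response, cutting perceived latency by the
    remaining synthesis time."""
    async for chunk in chunks:
        await websocket.send_text(json.dumps({
            "event": "media",
            "streamSid": stream_sid,
            "media": {"payload": base64.b64encode(chunk).decode()},
        }))
```

The same pump works for any streaming TTS provider — only the producer side changes.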