
Build an AI Voice Agent with ElevenLabs, Whisper and Twilio (Complete Guide)

AI voice agents handle inbound and outbound calls, qualify leads, book appointments, and push data to your CRM — 24/7, at a fraction of human agent costs. This guide builds a complete voice agent from scratch.

Gurpreet Singh
March 27, 2026

What an AI Voice Agent Can Do

A well-built AI voice agent sounds human, thinks fast, and works around the clock. It can answer your phone, understand what the caller needs, ask follow-up questions, look up information in real time, and take action — booking a meeting, updating a CRM record, sending a confirmation email — before ending the call.

For businesses, this replaces or augments a human call centre agent at a cost of $0.05–$0.15 per minute (versus $1–$3 per minute for human agents), with zero sick days, zero training time, and instant scalability from 1 call to 1,000 simultaneous calls.

The stack: Twilio for phone number and call handling, OpenAI Whisper for speech-to-text transcription, GPT-4o for conversation logic and decision making, and ElevenLabs for text-to-speech with ultra-realistic voices. The application layer runs on FastAPI (Python) with a WebSocket server handling real-time audio streaming.

Architecture Overview

The real-time voice pipeline works in a loop measured in milliseconds:

  1. Caller speaks → Twilio streams audio bytes via WebSocket to your server
  2. Whisper transcribes audio chunks in near real-time (100–200ms latency)
  3. Voice Activity Detection (VAD) detects end of utterance (silence threshold)
  4. Complete transcript sent to GPT-4o with conversation history and tool definitions
  5. GPT-4o responds (with tool calls if needed — CRM lookup, calendar check, etc.)
  6. ElevenLabs converts response text to audio (200–400ms)
  7. Audio streamed back to Twilio, played to caller
  8. Repeat from step 1

Total round-trip latency: 600ms–1.2s. At the lower end, this feels like a natural conversational pause. At the higher end, it's noticeable but acceptable for business calls.

Setting Up Twilio Media Streams

Twilio's Media Streams API streams raw audio from a call to your WebSocket server in real time. Configure a TwiML response when an inbound call arrives:

from fastapi import FastAPI, WebSocket, Request
from fastapi.responses import Response

app = FastAPI()

@app.post("/incoming-call")
async def incoming_call(request: Request):
    """Twilio calls this webhook when a call arrives."""
    twiml = """
    <Response>
        <Connect>
            <Stream url="wss://yourserver.com/media-stream">
                <Parameter name="systemPrompt"
                           value="You are an AI receptionist for Acme Corp. Qualify the caller, ask for name, company, and reason for calling. If they want to book a demo, collect their preferred date/time."/>
            </Stream>
        </Connect>
    </Response>
    """
    return Response(content=twiml, media_type="application/xml")
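Hand-writing TwiML in a string works until the system prompt contains a quote or an ampersand, which breaks the XML attribute. As an alternative, a small helper built on the standard library handles the escaping automatically — a minimal sketch (the function name is illustrative):

```python
import xml.etree.ElementTree as ET

def build_stream_twiml(ws_url: str, system_prompt: str) -> str:
    """Build a <Response><Connect><Stream> TwiML document, escaping
    attribute values (quotes, ampersands, angle brackets) automatically."""
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    stream = ET.SubElement(connect, "Stream", url=ws_url)
    ET.SubElement(stream, "Parameter", name="systemPrompt", value=system_prompt)
    return ET.tostring(response, encoding="unicode")
```

The webhook can then return `Response(content=build_stream_twiml(...), media_type="application/xml")` exactly as before.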

The WebSocket Handler: Receiving and Sending Audio

import base64
import json

from fastapi import WebSocket
from openai import AsyncOpenAI

openai_client = AsyncOpenAI()

@app.websocket("/media-stream")
async def media_stream(websocket: WebSocket):
    await websocket.accept()

    # Buffer for accumulating audio chunks
    audio_buffer = bytearray()
    conversation_history = []
    stream_sid = None

    async for message in websocket.iter_text():
        data = json.loads(message)
        event = data.get("event")

        if event == "start":
            stream_sid = data["start"]["streamSid"]
            system_prompt = data["start"]["customParameters"]["systemPrompt"]
            conversation_history = [{"role": "system", "content": system_prompt}]

        elif event == "media":
            # Twilio sends audio as base64-encoded mulaw 8kHz
            audio_chunk = base64.b64decode(data["media"]["payload"])
            audio_buffer.extend(audio_chunk)

            # Detect end of utterance (simple silence detection)
            if is_end_of_utterance(audio_buffer):
                user_text = await transcribe(bytes(audio_buffer))
                audio_buffer.clear()

                if user_text.strip():
                    conversation_history.append({"role": "user", "content": user_text})
                    response_text, tool_calls = await get_ai_response(conversation_history)

                    if tool_calls:
                        # get_ai_response must also have appended its assistant
                        # message (the one carrying tool_calls) to the history --
                        # OpenAI requires that message to precede the tool results
                        tool_results = await execute_tools(tool_calls)
                        conversation_history.extend(tool_results)
                        response_text, _ = await get_ai_response(conversation_history)

                    conversation_history.append({"role": "assistant", "content": response_text})

                    audio_response = await synthesize_speech(response_text)
                    await send_audio_to_caller(websocket, stream_sid, audio_response)

        elif event == "stop":
            await save_call_transcript(conversation_history)
            break
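The handler above calls an `is_end_of_utterance` helper that isn't defined anywhere. A minimal sketch using an RMS-energy silence check over the trailing audio — the 700ms window and threshold of 500 are illustrative starting points, not tuned values:

```python
import math

ULAW_BIAS = 0x84  # standard mu-law decoding bias (132)

def ulaw_byte_to_pcm(b: int) -> int:
    """Decode one 8-bit mu-law byte to a 16-bit linear PCM sample."""
    b = ~b & 0xFF
    sign = b & 0x80
    exponent = (b >> 4) & 0x07
    mantissa = b & 0x0F
    magnitude = (((mantissa << 3) + ULAW_BIAS) << exponent) - ULAW_BIAS
    return -magnitude if sign else magnitude

def is_end_of_utterance(buffer: bytes,
                        silence_ms: int = 700,
                        rms_threshold: int = 500,
                        sample_rate: int = 8000) -> bool:
    """True when the trailing `silence_ms` of mulaw audio is quiet."""
    tail_len = int(sample_rate * silence_ms / 1000)
    if len(buffer) < tail_len:
        return False  # not enough audio accumulated to judge yet
    tail = buffer[-tail_len:]
    rms = math.sqrt(sum(ulaw_byte_to_pcm(b) ** 2 for b in tail) / len(tail))
    return rms < rms_threshold
```

Production systems usually swap this for a proper VAD (Silero VAD or WebRTC VAD), which is far more robust to background noise than a raw energy threshold.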

Speech-to-Text with Whisper

import io
import wave
import audioop  # deprecated; removed in Python 3.13 (install audioop-lts there)

async def transcribe(audio_bytes: bytes) -> str:
    """Transcribe mulaw audio to text using Whisper."""
    # Convert mulaw 8kHz to 16-bit PCM at 16kHz (the rate Whisper works best at)
    pcm_audio = audioop.ulaw2lin(audio_bytes, 2)
    pcm_audio = audioop.ratecv(pcm_audio, 2, 1, 8000, 16000, None)[0]

    # Wrap the raw PCM in a WAV container in memory
    wav_buffer = io.BytesIO()
    with wave.open(wav_buffer, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(16000)
        wav.writeframes(pcm_audio)
    wav_buffer.seek(0)  # rewind before upload

    response = await openai_client.audio.transcriptions.create(
        model="whisper-1",
        file=("audio.wav", wav_buffer, "audio/wav"),
        language="en"
    )
    return response.text

For lower latency than Whisper, Deepgram offers real-time streaming transcription with 50–100ms latency and supports streaming the transcript word-by-word as the caller speaks — enabling faster interruption detection and response generation.

Text-to-Speech with ElevenLabs

import os

from elevenlabs.client import AsyncElevenLabs

elevenlabs = AsyncElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])

async def synthesize_speech(text: str) -> bytes:
    """Convert text to ultra-realistic speech audio."""
    audio_stream = await elevenlabs.generate(
        text=text,
        voice="Rachel",           # Professional, warm female voice
        model="eleven_turbo_v2",  # Fastest model, ~200ms latency
        voice_settings={
            "stability": 0.75,
            "similarity_boost": 0.75,
            "style": 0.0,
            "use_speaker_boost": True
        },
        output_format="ulaw_8000"  # Twilio-compatible format directly
    )
    return b"".join([chunk async for chunk in audio_stream])

eleven_turbo_v2 is ElevenLabs' lowest-latency model — 200–300ms for short sentences. For the best voice quality (indistinguishable from human), eleven_multilingual_v2 adds ~100ms but produces noticeably more natural prosody and emotion.
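The WebSocket handler also references a `send_audio_to_caller` helper. Because the TTS output is already `ulaw_8000`, sending it back is just a matter of wrapping the bytes in Twilio's outbound media-event envelope — a minimal sketch (the helper names are illustrative):

```python
import base64
import json

def build_media_message(stream_sid: str, audio: bytes) -> str:
    """Wrap mulaw-8000 audio bytes in Twilio's media-event JSON envelope."""
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": base64.b64encode(audio).decode("ascii")},
    })

async def send_audio_to_caller(websocket, stream_sid: str, audio: bytes) -> None:
    """Send synthesized speech back down the Twilio media stream."""
    await websocket.send_text(build_media_message(stream_sid, audio))
```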

Tool Integration: CRM, Calendar, Real-Time Data

tools = [
    {
        "type": "function",
        "function": {
            "name": "lookup_contact",
            "description": "Look up a contact in the CRM by phone number. Call this at the start of every conversation.",
            "parameters": {
                "type": "object",
                "properties": {
                    "phone": {"type": "string", "description": "Caller phone number in E.164 format"}
                },
                "required": ["phone"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "book_appointment",
            "description": "Book a demo or consultation appointment in the calendar.",
            "parameters": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "email": {"type": "string"},
                    "date": {"type": "string", "description": "ISO 8601 datetime"},
                    "purpose": {"type": "string", "enum": ["demo", "consultation", "support"]}
                },
                "required": ["name", "email", "date", "purpose"]
            }
        }
    }
]

When GPT-4o decides to book an appointment, it outputs a structured tool call. Your executor hits the Google Calendar API (or Calendly), creates the event, emails a confirmation, and updates the CRM — all before the caller hangs up.
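A minimal `execute_tools` dispatcher for those two tools might look like this. The handler bodies are hypothetical stubs (a real version would call your CRM and calendar APIs), and the sketch assumes tool calls have been normalised to plain dicts — the OpenAI SDK returns objects you would convert first:

```python
import asyncio
import json

async def lookup_contact(phone: str) -> dict:
    # Hypothetical stub -- replace with a real CRM lookup
    return {"found": False, "phone": phone}

async def book_appointment(name: str, email: str, date: str, purpose: str) -> dict:
    # Hypothetical stub -- replace with a calendar API call
    return {"booked": True, "date": date}

TOOL_HANDLERS = {"lookup_contact": lookup_contact, "book_appointment": book_appointment}

async def execute_tools(tool_calls: list[dict]) -> list[dict]:
    """Run each requested tool and return role='tool' messages,
    matched to their originating call by tool_call_id."""
    results = []
    for call in tool_calls:
        handler = TOOL_HANDLERS[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])
        output = await handler(**args)
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(output),
        })
    return results
```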

Reducing Latency: The Critical UX Factor

Latency is the primary UX challenge in voice AI. Human conversations have 200–500ms natural pauses. At 1.2s+ response time, the agent starts feeling robotic. Tactics to reduce latency:

  • Stream ElevenLabs audio: Start sending audio to Twilio as the first chunk arrives, not after the full response is generated. Reduces perceived latency by 200–400ms.
  • Use Deepgram instead of Whisper: 50–100ms streaming transcription versus 100–200ms or more for batched Whisper requests.
  • Use GPT-4o mini for simple turns: Route filler responses ("Let me check that for you") to a fast, cheap model. Route complex reasoning to GPT-4o.
  • Optimistic response generation: Start generating the TTS response as the LLM streams tokens, before the full response is complete.
  • Regional deployment: Deploy your WebSocket server in the same AWS region as your target callers. Twilio → your server latency drops from 80ms to 15ms.

With all optimisations, a production voice agent can achieve 700–900ms end-to-end latency — feeling natural in conversation.
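The optimistic-generation tactic hinges on knowing when a streamed sentence is complete so it can be handed to TTS early. A minimal sketch of that chunking step — the regex and function name are illustrative, not from any SDK:

```python
import re

# Split after sentence-ending punctuation followed by whitespace
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def sentence_chunks(buffer: str) -> tuple[list[str], str]:
    """Split accumulated LLM output into complete sentences ready for TTS.
    Returns (complete_sentences, remainder_still_streaming); the remainder
    is flushed to TTS once the LLM stream ends."""
    parts = SENTENCE_END.split(buffer)
    if len(parts) <= 1:
        return [], buffer
    return parts[:-1], parts[-1]
```

In the LLM streaming loop, each completed sentence is sent to ElevenLabs immediately while the model keeps generating, so the caller hears the first sentence while the rest of the reply is still being written.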

#Voice Agent #ElevenLabs #Twilio #Whisper #GPT-4o #FastAPI #AI Automation #Python
Gurpreet Singh

Senior Full Stack Developer — Laravel, Vue.js, Nuxt.js & AI. Available for freelance projects.

Hire Me for Your Project
