
What Is a Voice Agent?
A voice agent is a system that enables real-time, human-like conversations through speech, powered by the reasoning capabilities of modern foundation models. It does not just transcribe or respond. It listens, understands intent, reasons, and speaks back naturally.
As users move away from screens, voice is emerging as the most intuitive interface. It removes friction, lowers cognitive load, and fits seamlessly into how humans already communicate.
This shift is also where capital is flowing. Investors are betting heavily on:
Hyper-personalized experiences across entertainment, education, and healthcare
Scalable developer platforms that make voice agents easy to build and deploy
Enterprise workflows and customer support, where voice can automate high-volume interactions
Voice-based security, including authentication and fraud detection
A few startups signal where this is headed:
Liberate Voice AI, modernizing insurance call centers with AI-driven conversations.
Suki AI, reducing clinician burnout by automating medical documentation.
Voiceitt, enabling accessibility by recognizing non-standard speech in healthcare settings.
Voice agents are not just another interface layer. They represent a shift toward software that finally speaks the user’s language.
Anatomy of Voice Agents
Most voice agents today follow one of two architectural patterns.
1. Speech-to-Speech Real-Time APIs
These are end-to-end models that process raw audio input and directly produce audio output. Examples include the OpenAI Realtime API and Cartesia.
They offer simpler integration, lower setup complexity, and latency optimized out of the box. The trade-off is limited control. You cannot easily intervene in intermediate steps, deeply customize prompting, insert tools or business logic, or fine-tune behavior at each stage.
This approach works well for fast prototyping and straightforward conversational use cases.
2. Modular Pipeline Architecture
This design separates the system into explicit stages:
Speech-to-Text → LLM → Text-to-Speech.
Each component can be independently selected, tuned, and replaced. You can swap providers, introduce function calling and tools, inject domain-specific logic, and optimize latency or quality at a per-stage level.
This is the preferred architecture for production-grade agents where control, observability, and customization matter.
Component importance varies by use case: accurate ASR is critical when the domain involves specialized vocabulary, such as medical terminology, while a stronger LLM matters most when the agent has to perform complex reasoning.
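As a rough illustration of that modularity, here is a minimal Python sketch in which each stage sits behind a small interface so providers can be swapped without touching the rest of the pipeline. The class and method names are hypothetical placeholders, not any specific vendor SDK.

```python
# Minimal sketch of a modular STT -> LLM -> TTS pipeline.
# The concrete provider classes you plug in are interchangeable as long as
# they implement these small interfaces.
from dataclasses import dataclass
from typing import Protocol


class STT(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class LLM(Protocol):
    def complete(self, prompt: str) -> str: ...

class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...


@dataclass
class VoicePipeline:
    stt: STT
    llm: LLM
    tts: TTS

    def respond(self, audio_in: bytes) -> bytes:
        transcript = self.stt.transcribe(audio_in)   # stage 1: speech -> text
        reply = self.llm.complete(transcript)        # stage 2: reasoning, tools, business logic
        return self.tts.synthesize(reply)            # stage 3: text -> speech
```

Because each stage is just an interface, swapping Deepgram for Whisper, or one TTS vendor for another, is a one-line change rather than a rewrite.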
Audio Signals Basics
Automatic Speech Recognition operates on one-dimensional, continuous audio signals that are first digitized. A common sampling rate is 16 kHz, which turns the continuous signal x(t) into a discrete sequence x[n].
Humans do not perceive sound as raw waveforms. We perceive how frequency content evolves over time. To capture this, ASR systems transform audio from the time domain into the frequency domain using the Short-Time Fourier Transform. The result is a two-dimensional spectrogram with time on one axis and frequency on the other. Structurally, it resembles an image.
This representation explains why convolutional layers combined with transformers work so well for speech. Convolutions capture local time-frequency patterns, while transformers model long-range temporal dependencies.
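To make the transformation concrete, here is a minimal NumPy sketch of the STFT step: frame the 16 kHz signal into short overlapping windows and take the FFT of each frame. The 25 ms window and 10 ms hop are common defaults rather than requirements, and production front ends typically add a mel filterbank on top.

```python
import numpy as np

def log_spectrogram(x: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Log power spectrogram of a 16 kHz signal x[n].

    frame_len=400 and hop=160 correspond to 25 ms windows with a 10 ms
    stride at 16 kHz, a common choice for ASR front ends.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectrum = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # power per time-frequency bin
    return np.log(spectrum + 1e-10)                        # shape: (time, frequency)

# 1 second of a 440 Hz tone sampled at 16 kHz
t = np.arange(16000) / 16000
spec = log_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # roughly (98, 201): ~98 time steps x 201 frequency bins
```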
Modern ASR systems are dominated by encoder-heavy speech transformers that predict time-aligned tokens directly from spectrogram features. Older systems relied on Hidden Markov Models paired with Gaussian Mixture Models, with explicit alignment and handcrafted features.
The shift from HMM-GMM pipelines to end-to-end neural encoders is what unlocked today’s accuracy and robustness in real-world audio.
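For a feel of what running an end-to-end neural recognizer looks like in practice, here is a minimal sketch using the open-source openai-whisper package; the audio filename is a placeholder.

```python
# Minimal sketch of end-to-end neural ASR, assuming the open-source
# `openai-whisper` package is installed (pip install openai-whisper).
import whisper

model = whisper.load_model("base")              # end-to-end speech transformer
result = model.transcribe("clinic_call.wav")    # hypothetical audio file
print(result["text"])
```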
Key Challenges: Latency
Voice agents live or die by latency. If a response takes longer than 200 to 300 milliseconds, the interaction starts to feel artificial. Humans are extremely sensitive to conversational timing, and even small delays break the illusion of intelligence.
Production systems achieve sub-300 millisecond responsiveness through aggressive optimization. Common strategies include running components closer to the edge, using faster inference-optimized language models, and eliminating unnecessary hops in the pipeline. Companies like AssemblyAI and ElevenLabs demonstrate what is possible when latency is treated as a first-class constraint.
A major source of delay comes from turn detection. Voice Activity Detection and End-of-Utterance detection introduce latency through multiple stages. Audio buffering typically adds 20 to 50 milliseconds. Signal processing adds another 10 to 30 milliseconds. The final decision layer, which determines whether the user has finished speaking, can add 50 to 100 milliseconds. Together, that is roughly 80 to 180 milliseconds of overhead before the model pipeline even starts.
These systems face hard trade-offs. They must distinguish between natural pauses and true sentence endings, handle background noise and diverse accents, and balance speed against transcription accuracy.
Effective solutions focus on speed and adaptability. Lightweight models such as Silero VAD can run in 1 to 5 milliseconds. Adaptive timeouts, typically in the 300 to 800 millisecond range and adjusted based on conversational context, reduce false cutoffs. Custom training on noisy, real-world audio further improves robustness. When done well, these techniques enable end-to-end responses in under one second, as seen in production systems like Sierra AI.
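Here is a rough sketch of how VAD-gated end-of-utterance detection can look with Silero VAD loaded via torch.hub. The chunk size, the 0.5 speech-probability threshold, and the fixed silence timeout are illustrative assumptions rather than production-tuned values.

```python
# Rough sketch of end-of-utterance detection with Silero VAD
# (repo: snakers4/silero-vad, loaded via torch.hub).
import torch

vad_model, _ = torch.hub.load("snakers4/silero-vad", "silero_vad")

SAMPLE_RATE = 16000
CHUNK = 512                      # ~32 ms of audio at 16 kHz
SILENCE_TIMEOUT_S = 0.5          # adaptive range is typically 300-800 ms

def end_of_utterance(stream) -> bool:
    """Return True once the speaker has been silent for SILENCE_TIMEOUT_S.

    `stream` is any iterable of float32 NumPy chunks of CHUNK samples.
    """
    silent_for = 0.0
    for chunk in stream:
        prob = vad_model(torch.from_numpy(chunk), SAMPLE_RATE).item()
        if prob > 0.5:                       # speech detected, reset the silence timer
            silent_for = 0.0
        else:
            silent_for += CHUNK / SAMPLE_RATE
            if silent_for >= SILENCE_TIMEOUT_S:
                return True
    return True
```

In practice the timeout would be adjusted per turn based on conversational context, which is exactly the adaptive behavior described above.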
Real Time Communication (RTC) Protocols
Real-time voice agents depend on communication protocols that minimize latency, handle streaming data efficiently, and remain reliable under variable network conditions. Several approaches exist, but not all are equally suited for conversational audio.
WebSockets
WebSockets provide a persistent, full-duplex connection between client and server. They are simple to integrate and widely supported, making them a common choice for streaming audio chunks and partial responses. However, they lack built-in media handling, congestion control, and peer optimization.
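A minimal sketch of that pattern with the Python websockets package, assuming a hypothetical transcription endpoint; real providers define their own framing, authentication, and end-of-stream conventions.

```python
# Sketch of streaming audio chunks over a WebSocket with the `websockets`
# package. The endpoint URL and the 20 ms PCM chunking are hypothetical.
import asyncio
import websockets

async def stream_audio(chunks):
    async with websockets.connect("wss://example.com/stt") as ws:
        for chunk in chunks:          # e.g. 20 ms PCM frames
            await ws.send(chunk)      # binary audio upstream
        await ws.send(b"")            # end-of-stream marker (convention varies by provider)
        async for message in ws:      # partial transcripts arrive downstream
            print(message)

# asyncio.run(stream_audio(pcm_frames))
```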
Server-Sent Events (SSE)
SSE allows servers to push continuous, one-way updates to clients over a single long-lived HTTP connection. It works well for streaming text tokens or status updates, but it is fundamentally unsuitable for real-time audio because it does not support bidirectional communication.
Long Polling
Long polling simulates real-time updates by holding HTTP requests open until data is available. It is inefficient, introduces unnecessary latency, and scales poorly. For voice agents, it is largely obsolete.
MQTT
MQTT is a lightweight publish-subscribe protocol designed for IoT and unreliable networks. While it excels at low-bandwidth messaging, it is not optimized for high-frequency, low-latency audio streaming required by conversational voice systems.
WebRTC
WebRTC is the standard for real-time audio, video, and data streaming in browsers and mobile applications. It enables direct peer-to-peer communication, includes built-in echo cancellation, jitter buffering, and congestion control, and is optimized for sub-second latency.
For production-grade voice agents, WebRTC is the dominant choice. Other protocols can support control signals or token streaming, but real-time speech lives and performs best on WebRTC.
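For illustration, here is a rough sketch of the offer side of a WebRTC connection using the Python aiortc library. The microphone device and format are platform-specific assumptions, and the signaling exchange is left to the application.

```python
# Rough sketch of the WebRTC offer/answer flow with `aiortc`.
import asyncio
from aiortc import RTCPeerConnection
from aiortc.contrib.media import MediaPlayer

async def start_call():
    pc = RTCPeerConnection()

    # Capture microphone audio (device name and format are platform-specific).
    mic = MediaPlayer("default", format="pulse")
    pc.addTrack(mic.audio)

    # Create and apply a local SDP offer, then exchange it with the remote
    # peer over whatever signaling channel the application uses.
    offer = await pc.createOffer()
    await pc.setLocalDescription(offer)
    print(pc.localDescription.sdp)   # send this SDP to the remote peer

# asyncio.run(start_call())
```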
How You Can Implement Real-Time Voice Agents
End-to-end low-latency voice pipelines are notoriously hard to build from scratch. You are dealing with WebRTC media servers, NAT traversal, adaptive jitter buffers, interruption handling, and horizontal scaling, all at once. Rebuilding this stack in-house is rarely worth it.
The pragmatic choice is to use a battle-tested framework like LiveKit, which is open source, production-grade, and designed specifically for real-time media systems.
Below is what a modern, high-performance voice agent stack looks like in production.
Modern Production Stack
Voice Activity Detection (VAD)
Use fast, lightweight models such as Silero VAD or LiveKit’s built-in WebRTC VAD. These typically run in 1 to 5 milliseconds and allow the system to react almost instantly when speech begins.
Turn Detection and Interruption Handling
LiveKit’s Agents framework manages turn-taking automatically. It handles conversational signals like short acknowledgements, supports barge-in interruptions, and prevents the agent from talking over the user. This layer is critical for natural conversations and is often underestimated.
Speech-to-Text (STT)
Streaming transcription is triggered only when voice activity is detected. Common providers include Deepgram, AssemblyAI, Whisper.cpp, and the OpenAI Realtime API. This avoids wasted compute and reduces overall latency.
LLM Layer
This is usually the single biggest contributor to latency. Streaming-capable models are essential. Common choices include OpenAI Realtime models, Claude 3.5, Groq-hosted models, and Together.ai.
The key metric here is Time to First Token. For example, GPT-4o-mini returns the first token in roughly 350 milliseconds, while GPT-4o takes closer to 700 milliseconds on the same prompt. That difference is immediately noticeable in voice interactions.
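A quick way to see this metric for yourself is to time a streaming completion and note when the first content token arrives. The model name and prompt below are illustrative, and the numbers will vary with prompt length and network conditions.

```python
# Measure Time to First Token with the OpenAI Python SDK's streaming interface.
import time
from openai import OpenAI

client = OpenAI()
start = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What are your clinic's opening hours?"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"TTFT: {(time.perf_counter() - start) * 1000:.0f} ms")
        break
```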
Text-to-Speech (TTS)
Audio should be streamed as soon as tokens are available, without waiting for the full response. Providers like ElevenLabs Turbo, OpenAI streaming TTS, and Cartesia support chunked synthesis. The key metric is Time to First Byte, with top providers now delivering audio in under 150 milliseconds.
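Putting the pieces together, here is a rough sketch of the full pipeline using the LiveKit Agents Python SDK. Exact class names, plugin packages, and options differ between SDK versions, so treat this as the overall shape rather than a drop-in implementation.

```python
# Rough sketch of a VAD + STT + LLM + TTS agent with the LiveKit Agents
# Python SDK (follows the 1.x AgentSession shape; APIs vary by version).
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import deepgram, elevenlabs, openai, silero

async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()

    session = AgentSession(
        vad=silero.VAD.load(),                 # fast local voice activity detection
        stt=deepgram.STT(),                    # streaming transcription
        llm=openai.LLM(model="gpt-4o-mini"),   # low time-to-first-token model
        tts=elevenlabs.TTS(),                  # chunked, streaming synthesis
    )

    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You answer clinic FAQs and handle appointments."),
    )

if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```

The framework handles turn-taking, barge-in interruptions, and media transport, so the application code stays focused on choosing providers and writing instructions.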
Example Demo
To make this concrete, I built a voice agent using the stack described above, integrated into a React Native Expo app. The agent handles common clinic FAQs and appointment-related interactions in real time.
You can watch the demo here:
https://vimeo.com/1149576721?share=copy&fl=sv&fe=ci