The short version.
LiveKit is the realtime audio and video layer that many modern voice-AI products sit on top of. Instead of stitching together WebRTC, STUN/TURN, codecs, jitter buffers and reconnection logic yourself, you point a client SDK at a LiveKit server (cloud or self-hosted) and get a low-latency audio pipe between a user's microphone and your backend. Client SDKs cover web, iOS, Android, React Native, Flutter and Unity; server SDKs cover Node, Python, Go, Rust and Ruby.
It is the same stack OpenAI uses for ChatGPT's voice mode. That single fact tells you most of what you need to know about its production posture.
Four nouns to remember.
Room
A session. Everyone who joins the same room can publish and subscribe to each other's tracks.
Participant
A user, a bot, or an AI agent inside a room. Each has an identity and joins with an access token (a JWT) issued by your backend.
Track
An audio or video stream a participant publishes. Other participants subscribe to consume it.
Agent
A server-side participant that listens (STT), thinks (LLM), and speaks back (TTS), built with the Agents SDK.
Authentication is JWT-based: your backend mints a short-lived access token with the participant's identity and the room they're allowed to join, the client connects over wss://, and LiveKit takes care of the rest.
Voice AI, without the duct tape.
LiveKit Agents is the part most operators actually want. It is a server-side framework (Python and Node) that wires together the four pieces a voice assistant needs:
- Voice Activity Detection (VAD) — knows when the user starts and stops speaking.
- Speech-to-Text (STT) — Deepgram, OpenAI, Google, Azure, whatever you plug in.
- LLM — Claude, GPT, Gemini, or any provider with a streaming chat API.
- Text-to-Speech (TTS) — ElevenLabs, Cartesia, OpenAI, Azure, others.
The framework handles turn-taking, interruptions ("barge-in"), partial transcripts, tool calls, and reconnection. You write the agent's behaviour; you don't write the audio loop.
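As an illustration of what "barge-in" means in practice, here is a toy asyncio sketch, not LiveKit's implementation. It assumes a stand-in `speak` coroutine in place of real TTS playback and an `asyncio.Event` in place of a real VAD signal; all names are hypothetical. The idea it shows is simply that playback and "user started speaking" race each other, and whichever finishes first decides the turn.

```python
import asyncio


async def speak(text: str, spoken: list[str], word_delay: float = 0.01) -> None:
    """Stand-in for TTS playback: emits one word at a time."""
    for word in text.split():
        await asyncio.sleep(word_delay)
        spoken.append(word)


async def run_turn(reply: str, user_started_speaking: asyncio.Event,
                   spoken: list[str]) -> bool:
    """Play `reply`; stop early if the user starts talking.

    Returns True when the reply was interrupted.
    """
    playback = asyncio.create_task(speak(reply, spoken))
    barge_in = asyncio.create_task(user_started_speaking.wait())
    done, _ = await asyncio.wait({playback, barge_in},
                                 return_when=asyncio.FIRST_COMPLETED)
    interrupted = barge_in in done and playback not in done
    # Cancel whichever side lost the race, then let both settle.
    for task in (playback, barge_in):
        if not task.done():
            task.cancel()
    await asyncio.gather(playback, barge_in, return_exceptions=True)
    return interrupted


async def demo() -> tuple[bool, list[str]]:
    spoken: list[str] = []
    vad_event = asyncio.Event()
    # Pretend the VAD fires 35 ms into the agent's reply.
    asyncio.get_running_loop().call_later(0.035, vad_event.set)
    interrupted = await run_turn(
        "Our showroom is open from nine until six every day",
        vad_event, spoken)
    return interrupted, spoken
```

Running `asyncio.run(demo())` reports an interrupted turn with only the first few words spoken. A real framework also has to truncate the conversation history to what was actually heard and hand the new user audio to STT, which is the part you get to skip writing.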
A minimal Python agent
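from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import deepgram, elevenlabs, openai, silero


async def entrypoint(ctx: agents.JobContext):
    # One session wires the four pieces together; each plugin is swappable.
    session = AgentSession(
        vad=silero.VAD.load(),
        stt=deepgram.STT(),
        llm=openai.LLM(model="gpt-4o-mini"),
        tts=elevenlabs.TTS(),
    )
    # Join the room this job was dispatched to and start the voice loop.
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a friendly UAE concierge."),
    )


if __name__ == "__main__":
    # Worker entry point as in the 1.x quickstart; check the docs for your version.
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))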
That's the whole shape of a voice agent. The plugins are swappable; the loop is not your problem.
Use cases worth shipping.
- Phone-style concierge on a website — a "press to talk" button on a landing page that takes a 60-second voice brief and emails it to sales.
- WhatsApp voice replacement for FAQs — a web agent that answers the same 20 questions a salesperson answers 50 times a day, in Arabic or English.
- Voice-driven kiosk — a tablet at a restaurant, clinic or showroom that takes orders or triages enquiries by voice and writes to a CRM via tool calls.
- Outbound call agents — bridged through a SIP gateway, a LiveKit agent can place or receive phone calls, not just browser sessions.
- Live transcription & meeting notes — an agent that joins a sales call as a silent participant and produces structured notes.
The unlock is not the technology — STT, LLMs and TTS are commodities now. The unlock is the round trip: getting a user's audio to the LLM and back as speech with sub-second latency, on any network, without the call breaking. That round trip is what LiveKit sells.
Two ways to run it.
LiveKit Cloud is the managed offering: you get a wss:// URL, an API key and secret, and a generous free tier. Its media servers are globally distributed, so a user in Dubai and an agent process in Frankfurt still get low latency.
Self-hosted is the open-source server (livekit-server) that you run on your own infra — one Docker container plus a TURN setup for users behind strict NATs. Good for data-residency requirements, but you own the SRE.
For an SME shipping its first voice agent, Cloud is almost always the right answer. Move to self-host only when a customer contract or a regulator forces the question.
How a build usually starts.
- Sign up at livekit.io, create a project, copy the URL + API key + secret.
- Pick an SDK on the client (web is fastest) and on the server (Python if you're doing AI work, Node if you're already in the JS stack).
- Stand up a token endpoint: a single backend route that takes a user identity and returns a JWT signed with your API secret.
- Wire the client to fetch a token, connect to the room, publish the microphone track.
- Start an Agents process locally that joins the same room and runs the VAD → STT → LLM → TTS loop.
- Talk to it. Iterate on the system prompt and tool calls until it sounds like the brand.
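The token endpoint from the steps above can be sketched as a tiny WSGI app using only the standard library. This is an illustration under assumptions, not a prescribed design: the `/token` path, the `identity` query parameter, the response shape and the `make_token_app` name are all hypothetical, and `mint` stands for whatever function signs the JWT with your API secret (typically the official server SDK).

```python
import json
from urllib.parse import parse_qs


def make_token_app(mint, server_url: str, room: str):
    """Build a WSGI app serving GET /token?identity=<id>.

    `mint(identity, room)` must return a signed JWT string.
    """
    def app(environ, start_response):
        def reply(status: str, body: dict):
            payload = json.dumps(body).encode("utf-8")
            start_response(status, [
                ("Content-Type", "application/json"),
                ("Content-Length", str(len(payload))),
            ])
            return [payload]

        if environ.get("PATH_INFO") != "/token":
            return reply("404 Not Found", {"error": "not found"})

        query = parse_qs(environ.get("QUERY_STRING", ""))
        identity = query.get("identity", [""])[0].strip()
        if not identity:
            return reply("400 Bad Request", {"error": "identity is required"})

        # The client needs both the server URL and a token to connect.
        return reply("200 OK", {"url": server_url,
                                "token": mint(identity, room)})

    return app
```

For local testing the app can be served with `wsgiref.simple_server.make_server("", 8000, app).serve_forever()`. In a real deployment the identity should come from your own authenticated session rather than a query parameter, otherwise anyone can request a token under any name.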
Most teams get to a working prototype in a long afternoon. The expensive part — the realtime audio plumbing — is solved.
Where to read next.
Primary documentation
- docs.livekit.io — the canonical reference for server, client SDKs, and the Agents framework.
- docs.livekit.io/agents — the Agents framework, Python and Node guides, plugin catalogue.
- github.com/livekit — the open-source repos: server, SDKs, agents, examples.
- livekit.io — the company site, Cloud pricing, sample apps.
This brief is a snapshot. Plugin lists, free-tier limits, SDK names and pricing change. Treat the official docs as the source of truth and use this page as the orientation map.