
Building Dilr Voice on Pipecat: the open-source voice AI framework powering thousands of production calls

Dilr Voice handles thousands of production calls daily on the Pipecat open-source framework. A deep-dive into the architecture: multi-agent routing, real-time telephony, knowledge base RAG, and the six systems we engineered on top.


When we started building Dilr Voice in late 2025, the landscape for voice AI was fragmented. You could stitch together Twilio, a speech-to-text API, an LLM, and a text-to-speech engine — but the glue code between them was always custom, always brittle, and always the thing that broke at 2am. Then we found Pipecat.

This post is a technical deep-dive into how Dilr Voice is built on Pipecat — what the framework gives us for free, what we had to engineer on top, and what the production architecture looks like when you're handling thousands of concurrent calls across 30+ languages.

What is Pipecat?

Pipecat is an open-source Python framework for building voice and multimodal AI applications. It provides a pipeline abstraction — a directed graph of processors that transform frames of data (audio, text, control signals) as they flow through the system.

Audio comes in from a phone call. Voice Activity Detection detects when someone is speaking. STT converts speech to text. The LLM generates a response. TTS converts it back to speech. Pipecat handles the frame routing, async execution, backpressure, and interruption logic. We handle everything else.
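The frame-pipeline model is easy to picture with a toy. The sketch below is illustrative only: Pipecat's real frame classes (`AudioRawFrame`, `TextFrame`, ...) and processor APIs differ, and every name here is made up for the example.

```python
import asyncio
from dataclasses import dataclass

# Toy frame type -- a stand-in for Pipecat's real frame classes.
@dataclass
class Frame:
    payload: object

class Processor:
    """A pipeline stage: receives a frame, emits zero or more frames."""
    async def process(self, frame: Frame) -> list[Frame]:
        return [frame]

class UppercaseSTT(Processor):
    # Stand-in for a real STT stage: pretend the audio payload is text.
    async def process(self, frame: Frame) -> list[Frame]:
        return [Frame(str(frame.payload).upper())]

async def run_pipeline(stages: list[Processor], frame: Frame) -> list[Frame]:
    # Frames flow stage by stage; each stage transforms or filters them.
    frames = [frame]
    for stage in stages:
        out: list[Frame] = []
        for f in frames:
            out.extend(await stage.process(f))
        frames = out
    return frames

result = asyncio.run(run_pipeline([UppercaseSTT()], Frame("hello caller")))
print(result[0].payload)  # HELLO CALLER
```

The real framework adds what the toy omits: concurrent stages, backpressure, and control frames for interruption.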

What Pipecat gives us

🔌 Provider abstraction

Swap Deepgram for Whisper, OpenAI for Claude, ElevenLabs for Azure — without rewriting the pipeline. Common interface across all providers.

🗣️ Interruption handling

Caller talks over the agent? Pipecat stops TTS, flushes the buffer, captures the new utterance, resumes. Deceptively hard to build from scratch.

📦 Frame-based architecture

Everything is a frame — AudioRawFrame, TextFrame, LLMMessagesFrame. Insert middleware anywhere: logging, metrics, guardrails, transformations.

⚡ Async pipeline

Fully async Python. STT and TTS run concurrently. No blocking on I/O. Handles 50+ concurrent calls per pod without thread contention.

Our production stack

| Component | Provider | Why we chose it |
| --- | --- | --- |
| VAD | Silero | Best accuracy/latency tradeoff; runs locally, no network hop |
| STT | Deepgram Nova-3 | Lowest-latency streaming STT; best accent handling across 30+ languages |
| LLM | GPT-4.1-mini | Best cost/quality for real-time voice; fast enough for sub-second responses |
| TTS | ElevenLabs Flash v2.5 | Most natural voice; lowest latency in its class; streaming output |
| Telephony | Twilio + Vobiz | Twilio for US/UK/global, Vobiz for India (+91); both configurable per agent |

All configurable per-agent in the Dilr Voice platform. No code changes needed — just update the Smart Agent config.

What we built on top

Pipecat is a framework, not a product. To turn it into Dilr Voice, we engineered six major systems on top.

1. Multi-agent routing (Agent Flows)

Single-agent conversations hit a ceiling fast. A receptionist agent can't also be a billing expert and a technical support specialist. We built Agent Flows — a multi-agent system where specialist agents handle different parts of a call.

  • Fast path (~0ms): keyword matching. "book", "cancel", "schedule" → Action Agent. Keywords are pre-defined per agent type, with Hindi supported.
  • Handoff signals (~1ms): regex-based detection on the agent's own reply. "Let me look that up" → Knowledge Agent; "let me book that" → Action Agent.
  • Fallback (default): no keyword match and no handoff signal → stay on the current agent. The Marketing agent is the fallback default.

The system prompt is swapped in-place in the LLM context — no new context created, full conversation history preserved across agent switches. The caller never notices.
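The three routing tiers and the in-place prompt swap can be sketched in a few lines. The keyword sets and regexes below are invented for the example; the real per-agent configuration lives in the Dilr Voice platform.

```python
import re

# Illustrative routing config -- not the production keyword/regex sets.
ACTION_KEYWORDS = {"book", "cancel", "schedule"}
HANDOFF_PATTERNS = {
    "knowledge": re.compile(r"let me look that up", re.I),
    "action": re.compile(r"let me book that", re.I),
}

def route(caller_text: str, agent_text: str, current: str) -> str:
    # Tier 1, fast path: keyword match on the caller's utterance (~0ms).
    words = set(re.findall(r"\w+", caller_text.lower()))
    if words & ACTION_KEYWORDS:
        return "action"
    # Tier 2, handoff signals: regex over the agent's own reply (~1ms).
    for target, pattern in HANDOFF_PATTERNS.items():
        if pattern.search(agent_text):
            return target
    # Tier 3, fallback: stay on the current agent.
    return current

def switch_agent(context: list[dict], new_prompt: str) -> None:
    # Swap the system prompt in place; the history after it is untouched.
    context[0] = {"role": "system", "content": new_prompt}
```

Because only `context[0]` changes, the full conversation history survives every agent switch, which is why the caller never notices.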

2. Dynamic system prompt construction

Every call starts by building a system prompt from seven layers:

  1. Base prompt: personality · role · goal · instructions
  2. Language: input/output language pair
  3. Organisation: business name · industry · hours
  4. Call direction: INBOUND vs OUTBOUND
  5. Custom context: per-call data from API
  6. Lead context: history · sentiment · last call summary
  7. Tools schema: book · search KB · SMS · email · transfer

Two calls to the same agent produce completely different system prompts — because the caller's context is different. A returning caller with negative sentiment gets a more empathetic opening.
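Layered construction reduces to ordered concatenation with empty layers skipped. A minimal sketch; the layer names mirror the list above, but the join format is an assumption, as the real template is internal to Dilr Voice.

```python
def build_system_prompt(layers: dict) -> str:
    """Concatenate the seven layers in order, skipping any that are empty."""
    order = [
        "base", "language", "organisation", "call_direction",
        "custom_context", "lead_context", "tools_schema",
    ]
    parts = [layers[k] for k in order if layers.get(k)]
    return "\n\n".join(parts)
```

Two calls differ only in the layers passed in, typically `custom_context` and `lead_context`, so the same agent speaks with different context every call.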

3. Knowledge Base (RAG)

Callers ask questions the LLM doesn't know — your pricing, your hours, your return policy. We built hybrid retrieval:

  • Vector search: text-embedding-3-small embeddings in PostgreSQL + pgvector (1536-dim)
  • BM25 full-text: PostgreSQL tsvector for keyword queries
  • Reciprocal Rank Fusion: merges both result sets

Upload PDFs, paste URLs (we crawl), or type text — all in the platform.

# Hybrid search pipeline
vector_results = pgvector.search(embedding, top_k=10)
bm25_results = postgres.fts(query, top_k=10)
# Reciprocal Rank Fusion merges both ranked lists
final = rrf(vector_results, bm25_results)  # returns top 5 chunks
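The fusion step itself is tiny. A sketch of Reciprocal Rank Fusion using the standard formula, score(d) = Σ 1/(k + rank) over each ranked list, with k = 60 as in the original RRF paper and top_n = 5 to match the chunk count above:

```python
def rrf(*rankings: list, k: int = 60, top_n: int = 5) -> list:
    """Merge ranked result lists: score(d) = sum of 1/(k + rank)."""
    scores: dict = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties keep first-seen order (stable sort).
    return sorted(scores, key=lambda d: scores[d], reverse=True)[:top_n]
```

A document that appears near the top of both lists outscores one that tops only a single list, which is exactly the behaviour hybrid retrieval wants.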

4. Production telephony

The API server handles config, billing, and post-call processing. The Pipecat Runner handles real-time voice. They communicate via gRPC. This separation means we scale runners independently — more pods when volume spikes, without touching the API.

5. AI tools the agent can use

  • 📅 Book appointment: Google Calendar
  • 🔍 Search KB: hybrid RAG
  • 💬 Send SMS: confirmation / follow-up
  • 📧 Send email: summary / receipt
  • 👤 Update lead: name · intent · notes
  • 📞 Transfer call: live human handoff

Tools are configurable per agent in Agent Flows. The Action agent gets book_appointment + send_sms. The Knowledge agent gets search_knowledge_base. The Greeter has none — it just talks.
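Concretely, each tool is exposed to the LLM as a function-calling schema. A sketch in the OpenAI tools format; the parameter fields here are illustrative, not the production schema.

```python
# One tool definition in OpenAI function-calling format.
# Field names (date, time, name) are made up for the example.
BOOK_APPOINTMENT = {
    "type": "function",
    "function": {
        "name": "book_appointment",
        "description": "Book a slot on the business's Google Calendar.",
        "parameters": {
            "type": "object",
            "properties": {
                "date": {"type": "string", "description": "YYYY-MM-DD"},
                "time": {"type": "string", "description": "HH:MM, 24h"},
                "name": {"type": "string"},
            },
            "required": ["date", "time", "name"],
        },
    },
}

# Per-agent tool sets, as described above.
AGENT_TOOLS = {
    "action": ["book_appointment", "send_sms"],
    "knowledge": ["search_knowledge_base"],
    "greeter": [],
}
```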

6. Infrastructure

Runtime
  • Platform: GKE · europe-west2
  • API: FastAPI + gRPC
  • Voice: Pipecat Runner
  • Workers: Celery + Redis
  • Database: Cloud SQL + pgvector

Optimisations
  • STT/TTS pools: no cold-start
  • LLM warmup: on WS connect
  • Codec: μ-law 8kHz pre-negotiated
  • Agent switch: ~1ms in-place
  • Concurrency: ~50 calls/pod
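The "no cold-start" item is just a pre-warmed client pool: connections are created at startup so the first caller never pays the handshake cost. A minimal asyncio sketch; `make_client` stands in for a real STT/TTS client constructor, and this is not Dilr's actual pool code.

```python
import asyncio

class WarmPool:
    """Pre-warmed pool of clients, handed out and returned per call."""

    def __init__(self, make_client, size: int):
        self._make_client = make_client
        self._size = size
        self._pool: asyncio.Queue = asyncio.Queue()

    async def start(self) -> None:
        # Create every client up front, before the first call arrives.
        for _ in range(self._size):
            await self._pool.put(self._make_client())

    async def acquire(self):
        # Blocks if all clients are in use -- natural backpressure.
        return await self._pool.get()

    async def release(self, client) -> None:
        await self._pool.put(client)
```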

The numbers

After 6 months in production on Dilr Voice:

  • 1.2s average first response
  • 78% containment rate
  • 50+ calls per pod
  • 30+ languages
  • 99.7% uptime (90 days)

Try it

Dilr Voice is live and free to try — $20 in credits on signup, no credit card required. Build an agent in the visual flow builder, attach a phone number, and call it.
