Voice AI RAG: knowledge bases that work on live calls

A voice agent that does not know your specifics is a liability. Ask it about a customer's policy excess, a branch's Saturday hours, a product's return window or a tariff's exit fee, and a model running on its training data alone will do one of two things: confidently invent an answer, or escalate everything to a human and erase the business case. Retrieval-augmented generation — RAG — is the architecture that fixes both failure modes by grounding the agent's answers in your own knowledge base. It is also the single most under-engineered layer in enterprise voice deployments.

The reason is simple. Most teams port a RAG pattern they built for a chatbot straight onto the phone, and it falls apart. A chatbot can think for three seconds, show its working, and let the user re-read a wrong answer and re-ask. A voice agent has none of those affordances. It has a sub-second budget before the silence becomes rude, it gets exactly one shot at the answer because the caller cannot scroll back, and a single confabulated number on a recorded line is a compliance event. Live-call RAG is a different engineering problem, and this guide sets out the architecture that makes it production-viable for regulated UK and EMEA enterprises.

This guide is shipped by the team behind Dilr Voice — enterprise voice AI live in 40+ countries with retrieval grounding built into the call path. If you are mapping where a grounded agent fits your stack first, our Voice AI agents page and the DATS five-stage methodology are the place to start.

The stakes are not abstract. McKinsey's State of AI 2025 found roughly 88% of enterprises now use AI but only around 6% capture material EBIT impact — and the gap is overwhelmingly an execution gap, not a model gap. Grounding is where a voice programme either converts capability into a containable, trustworthy call or leaks it into hallucinations and over-escalation. This is the layer that the enterprise guide to AI voice agents treats as a single line and that this post unpacks end to end.

Why live-call RAG is a different problem to chatbot RAG

Start with the constraints, because they dictate every architectural choice that follows. A text RAG system optimises for answer quality with latency as a secondary concern. A voice RAG system inverts that: latency is a hard wall, and quality has to be achieved inside it. Four constraints separate the two.

The latency budget is brutal and non-negotiable. Human conversation tolerates roughly 200–500ms of silence before a pause reads as a hesitation or a dropped line. Your entire round trip — automatic speech recognition (ASR) finalising the caller's words, query formulation, retrieval, reranking, generation and the first byte of text-to-speech (TTS) — has to fit inside that window, or feel like it does. Retrieval cannot take the 1–2 seconds a chatbot quietly absorbs. It gets a slice, typically under 200ms, of an end-to-end budget under 800ms. Every millisecond the knowledge layer spends is a millisecond stolen from the model and the voice. The same physics governs the broader pipeline covered in our work on voice agent latency benchmarks.

There is no re-prompt. A chatbot user who gets a vague answer types "no, the other policy." A caller hears a wrong answer, believes it, acts on it, and you find out at the complaint stage. The agent has to be right the first time or honestly say it does not know — there is no graceful middle.

Hallucination tolerance is zero, not low. On a recorded line, in a regulated sector, an invented figure is not a quality blemish; it is a mis-statement that may breach FCA Consumer Duty, mislead a patient, or create a contractual representation. We argue elsewhere that hallucination is a procurement gate, not a tuning detail — and retrieval grounding is the primary control that closes it.

The answer has to be speakable. A chatbot can return a table, a bulleted list and three links. A voice agent has to compress the retrieved knowledge into one or two sentences a person can absorb by ear, in the right order, without a wall of caveats. Retrieval that returns the right document but the wrong shape of content still fails on the phone.

Retrieval slice of the budget: <200ms
End-to-end target, P95: <800ms
Attempts the caller gives you: 1
Ungrounded claims tolerated: 0

These constraints are why the choice between a tightly-grounded retrieval design and a loosely-prompted model is not stylistic. It maps directly onto the architecture trade-off we set out in LLM-driven versus scripted voice agents: the more your agent must speak verified facts, the more retrieval has to carry the load and the less you can leave to the model's parametric memory.

The retrieval pipeline, stage by stage

A production voice RAG path is not "embed the question, fetch the nearest chunk, prompt the model." That naïve pattern is exactly what fails under load. The working architecture is a sequence of fast, bounded stages, each with a latency ceiling and a fallback.

Query formulation. The caller rarely speaks a clean query. They say "yeah so I moved house last month and I'm not sure if my cover still, you know, works." Before retrieval you need a crisp search query plus structured filters — account type, region, product line, entitlement — pulled from the call context and CRM, not just the raw transcript. This is where the agent's session memory earns its place; carrying entities across turns and calls, as we cover in our work on the broader tool-calling architecture for voice agents, lets you scope retrieval to the right shelf of the knowledge base instead of the whole warehouse.
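As a minimal sketch of what the formulation step might hand to retrieval, assume a hypothetical `CallContext` populated from the CRM and session memory. The class names, field names and filler-word list are all illustrative assumptions, and a production system would more likely use a small model than a word list to tidy the utterance.

```python
from dataclasses import dataclass, field

# Hypothetical filler words stripped from the transcript; crude and illustrative only.
FILLERS = {"yeah", "so", "um", "uh", "like", "you", "know"}

@dataclass
class CallContext:
    """Entities carried across turns (assumed to come from CRM and session memory)."""
    account_type: str
    region: str
    product_line: str

@dataclass
class FormulatedQuery:
    text: str
    filters: dict = field(default_factory=dict)

def formulate(utterance: str, ctx: CallContext) -> FormulatedQuery:
    # Lowercase, strip punctuation and drop filler words to get a tighter search string.
    words = [w.strip(",.?!").lower() for w in utterance.split()]
    kept = [w for w in words if w and w not in FILLERS]
    # Filters come from the call context, not from the raw transcript.
    filters = {
        "account_type": ctx.account_type,
        "region": ctx.region,
        "product_line": ctx.product_line,
    }
    return FormulatedQuery(text=" ".join(kept), filters=filters)

ctx = CallContext(account_type="retail", region="UK", product_line="home-insurance")
q = formulate(
    "yeah so I moved house last month and I'm not sure if my cover still, you know, works",
    ctx,
)
print(q.text)
print(q.filters)
```

The point of the shape is that the filters travel with the query as structured data, so the retrieval stage can prune on them rather than hoping the embedding captures "UK retail home insurance".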

Hybrid retrieval. Pure vector search misses exact-match needles — a policy code, an SKU, a branch identifier — that enterprise callers quote verbatim. Pure keyword search misses paraphrase. Production voice RAG runs both: a dense vector lookup for semantic recall and a sparse keyword (BM25-style) lookup for precision, fused into one candidate set. Filters from the formulation step prune the index before the search runs, which is both a relevance win and a latency win — you are searching thousands of vectors, not millions.
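The article does not name a fusion method, so the sketch below assumes reciprocal rank fusion (RRF), one common way to merge a dense and a sparse ranking without having to calibrate their scores against each other. The chunk IDs are invented for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one list.

    Each ID scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so the order is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))

# Hypothetical results from the two first-pass retrievers.
dense_hits = ["chunk-excess-faq", "chunk-moving-home", "chunk-claims"]
sparse_hits = ["chunk-policy-HX204", "chunk-moving-home"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

A chunk that both retrievers agree on rises to the top even if neither ranked it first, which is usually the behaviour you want before the reranker takes over.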

Reranking. The fused candidate set is reordered by a cross-encoder or a lightweight reranker that actually reads the query against each candidate, rather than trusting the first-pass similarity score. On a call you keep this tight — rerank a small top-k, not fifty passages — because every extra candidate costs milliseconds. The reranker's confidence score is also your first quality gate: if the best candidate scores below threshold, you do not generate a confident answer, you trigger a grounded refusal or a handover.
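A minimal sketch of the small top-k rerank and the confidence gate, under loud assumptions: `overlap_score` is a toy stand-in for a cross-encoder, and the `top_k` and `threshold` values are illustrative rather than recommended.

```python
def overlap_score(query: str, passage: str) -> float:
    """Stand-in scorer: fraction of query tokens found in the passage.
    A real deployment would call a cross-encoder or similar reranker here."""
    q_tokens = set(query.lower().split())
    p_tokens = set(passage.lower().split())
    return len(q_tokens & p_tokens) / len(q_tokens) if q_tokens else 0.0

def rerank_and_gate(query, candidates, top_k=5, threshold=0.5):
    """Rerank only the first top_k candidates and apply the confidence gate.

    Returns (passage, score) when the best score clears the threshold,
    otherwise (None, score) so the caller can refuse or hand over.
    """
    scored = [(overlap_score(query, c), c) for c in candidates[:top_k]]
    if not scored:
        return None, 0.0
    best_score, best = max(scored, key=lambda pair: pair[0])
    if best_score < threshold:
        return None, best_score
    return best, best_score

candidates = [
    "headphones have a 30 day return window",
    "our branch opens at nine on saturday",
]
hit = rerank_and_gate("return window for headphones", candidates)
miss = rerank_and_gate("exit fee on my tariff", candidates)
print(hit)
print(miss)
```

The second query has no good passage in the candidate set, so the gate returns `None` and leaves the decision to the refusal or handover path instead of passing a weak passage to the model.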

Grounding and assembly. The surviving passages are assembled into a context window with explicit provenance — which document, which version, which effective date — so the model can cite and so you can audit. This is the step most chatbot ports skip, and it is the one that makes the difference between an agent that sounds authoritative and one that is.
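One way to carry provenance into the prompt is to head every passage with its source, version and effective date. The sketch below assumes a hypothetical `Passage` record and an invented header format; the clause text is made up for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Passage:
    text: str
    doc_id: str
    version: str
    effective_date: str  # ISO date string

def assemble_context(passages):
    """Render passages as numbered blocks, each headed by its provenance."""
    blocks = []
    for i, p in enumerate(passages, start=1):
        header = f"[{i}] source={p.doc_id} version={p.version} effective={p.effective_date}"
        blocks.append(f"{header}\n{p.text}")
    return "\n\n".join(blocks)

# Illustrative content only; not a real policy clause.
passages = [
    Passage("Clause 4.2: cover continues at the new address once we are notified.",
            doc_id="home-policy", version="v8", effective_date="2026-03-01"),
    Passage("Notify us of a change of address within 30 days.",
            doc_id="home-policy-faq", version="v3", effective_date="2025-11-15"),
]
context = assemble_context(passages)
print(context)
```

Because each block is numbered, the model can be asked to cite `[1]` or `[2]`, and the same identifiers can be written to the audit log.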

Generation and streaming TTS. The model writes a short, speakable answer constrained to the retrieved context, and the first clause streams into TTS before the last clause is written. Streaming is not a nicety; it is how you hide the remaining latency behind the start of speech.
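The streaming hand-off can be sketched as a generator that releases each clause the moment a boundary token arrives. This assumes tokens arrive as strings with leading spaces, and the boundary set is a simplification (a figure such as "4.2" split across tokens would need extra care). The spoken figure is invented.

```python
def clauses(token_stream, boundaries=(".", ",", ";", "?", "!")):
    """Yield speakable clauses as soon as a boundary token arrives."""
    buffer = []
    for token in token_stream:
        buffer.append(token)
        if token.rstrip().endswith(boundaries):
            yield "".join(buffer).strip()
            buffer = []
    if buffer:  # flush whatever is left when the stream ends
        yield "".join(buffer).strip()

# Hypothetical model output, token by token.
tokens = ["Your", " excess", " is", " 250", " pounds,", " and", " it",
          " applies", " per", " claim."]
spoken = []
for clause in clauses(tokens):
    spoken.append(clause)  # in a real call path this would go to the TTS engine
print(spoken)
```

The first clause is available for synthesis after five tokens rather than ten, which is the latency that streaming hides.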

The knowledge base itself: chunking, structure and freshness

Retrieval is only as good as what it searches. A knowledge base assembled for human reading — long PDFs, sprawling intranet pages, policy documents written for lawyers — is the wrong substrate for live-call RAG. Three disciplines turn source content into a retrievable knowledge base.

Chunk for the answer, not the document. The unit of retrieval should be the unit of a spoken answer: a self-contained passage that resolves one caller intent, with enough context to stand alone but short enough to fit the latency and speakability budgets. Splitting a 40-page handbook into arbitrary 500-token windows produces chunks that start mid-clause and answer nothing cleanly. Semantic chunking — splitting on meaning boundaries, keeping a heading and its rule together, attaching the effective date and jurisdiction as metadata — is what makes the top result actually answer the question.
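A minimal sketch of heading-aware chunking, assuming the source content marks headings in some detectable way (here a hypothetical `## ` prefix). Real semantic chunking would also split on meaning boundaries inside long sections; this only shows the "keep a heading and its rule together" part.

```python
def chunk_by_heading(lines, is_heading):
    """Group each heading with the body lines that follow it.

    Lines that appear before the first heading are dropped in this sketch.
    """
    chunks, current = [], None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if is_heading(line):
            if current:
                chunks.append(current)
            current = {"heading": line, "body": []}
        elif current is not None:
            current["body"].append(line)
    if current:
        chunks.append(current)
    return [{"heading": c["heading"], "text": " ".join(c["body"])} for c in chunks]

# Illustrative handbook excerpt.
doc = """## Returns
Items can be returned within 30 days.
Proof of purchase is required.

## Saturday hours
Branches open 9am to 1pm on Saturdays.
"""
chunks = chunk_by_heading(doc.splitlines(), lambda l: l.startswith("## "))
for c in chunks:
    print(c)
```

Each chunk now starts at a rule and ends at the next one, so the top result answers one intent rather than beginning mid-clause.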

Make metadata a first-class citizen. Every chunk should carry the fields you filter and cite on: product line, region, customer segment, effective date, source system, and a confidence or authority tag. This metadata is what lets query formulation prune the index to the right scope, and it is what lets the agent say "as of your current tariff" rather than quoting a superseded rate. It is also the backbone of auditability — a regulator asking "where did the agent get that figure" should be answerable from the chunk's provenance.
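To make the metadata idea concrete, the sketch below assumes a hypothetical `Chunk` record carrying a subset of the fields named above, and shows how a filter plus an effective date keeps a superseded or not-yet-live rate out of the candidate set. The tariff figures are invented.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Chunk:
    text: str
    product_line: str
    region: str
    effective_from: date
    source_system: str

def in_scope(chunk, filters, today):
    """True when the chunk matches every filter and is already in effect."""
    matches = all(getattr(chunk, key) == value for key, value in filters.items())
    return matches and chunk.effective_from <= today

def current_chunks(chunks, filters, today):
    """Keep in-scope chunks, newest effective date first."""
    scoped = [c for c in chunks if in_scope(c, filters, today)]
    return sorted(scoped, key=lambda c: c.effective_from, reverse=True)

chunks = [
    Chunk("Exit fee is 30 pounds.", "energy", "UK", date(2025, 1, 1), "pricing"),
    Chunk("Exit fee is 45 pounds.", "energy", "UK", date(2026, 1, 1), "pricing"),
    Chunk("Exit fee is 60 pounds.", "energy", "UK", date(2027, 1, 1), "pricing"),
    Chunk("Returns accepted within 30 days.", "retail", "UK", date(2024, 1, 1), "policy"),
]
scoped = current_chunks(chunks, {"product_line": "energy", "region": "UK"},
                        today=date(2026, 6, 1))
print(scoped[0].text)
```

The same filter both shrinks the search space and decides which rate counts as current, which is why the metadata has to be populated at ingestion rather than inferred at call time.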

Solve freshness explicitly. A knowledge base that is correct at deployment and stale a fortnight later is worse than no knowledge base, because it grounds the agent in confident, wrong answers. You need a sync pipeline from the systems of record — pricing, policy, inventory, scheduling — with cache invalidation tied to source changes, not a nightly rebuild that lags reality. For volatile facts (a live appointment slot, a current account balance) retrieval is the wrong tool entirely; those belong to a real-time tool call against the source system, which is why grounded voice agents always combine RAG for stable knowledge with live tool-calling for transactional state. The discipline of keeping the agent's words tied to verified context is the same one we describe in voice AI prompt engineering for production, where grounded-refusal prompting and retrieval work together.
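The split between stable knowledge and transactional state can be sketched as a simple router. The intent labels below are hypothetical, and in practice the volatile set would be maintained alongside the tool definitions rather than hard-coded.

```python
# Hypothetical intent labels for facts that change minute to minute.
VOLATILE_INTENTS = {"appointment_slot", "account_balance"}

def route(intent: str) -> str:
    """Send volatile facts to a live tool call and everything else to retrieval."""
    return "tool_call" if intent in VOLATILE_INTENTS else "retrieval"

for intent in ("account_balance", "return_window"):
    print(intent, "->", route(intent))
```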

The transcripts and retrieval logs that the call path produces are an asset in their own right. The same real-time transcription data layer that records the call also tells you which questions retrieval handled well and which fell through, and those misses are the raw material for deciding which knowledge gaps to fix next.

Grounding and refusal: keeping the agent honest

Grounding is the discipline that makes retrieval trustworthy rather than merely present. Three controls do the work.

Constrain generation to retrieved context. The system prompt and the decoding strategy should bias the model hard toward answering only from the assembled passages, and toward explicit attribution. If the answer is not in the context, the correct output is not a plausible guess — it is a refusal or a handover. This is the behavioural contract that separates a grounded agent from a fluent one.

Design the refusal, do not leave it to chance. "I don't have that to hand, let me put you through to someone who does" is a feature, not a failure, when it is calibrated. The threshold is a business decision: in a low-risk informational call you can let the model answer at moderate retrieval confidence; in a regulated advice context you set the bar high and escalate readily. A grounded refusal that routes to a human with full context preserves trust where a confident wrong answer destroys it, and it connects directly to the handover quality we treat in the broader voice AI playbook.
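A calibrated threshold per call type might look like the sketch below. The call-type labels and threshold values are illustrative assumptions, not recommended settings; the one design choice worth noting is that an unrecognised call type falls back to the strictest bar.

```python
# Illustrative thresholds; real values are a business decision per call type.
THRESHOLDS = {"informational": 0.55, "regulated_advice": 0.85}

def decide(call_type: str, confidence: float) -> str:
    """Return 'answer' when retrieval confidence clears the bar, else 'handover'."""
    threshold = THRESHOLDS.get(call_type, max(THRESHOLDS.values()))
    return "answer" if confidence >= threshold else "handover"

print(decide("informational", 0.6))
print(decide("regulated_advice", 0.6))
```

The same retrieval confidence produces an answer on an informational call and a handover on a regulated one, which is the calibration the paragraph above describes.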

Make every answer auditable. Because grounding carries provenance, every factual claim the agent makes can be traced to a source chunk, its version and its effective date. That is what turns an opaque model output into an evidence trail — the difference between "the AI said it" and "the AI quoted clause 4.2 of the v8 policy, effective March 2026." For regulated deployers this is not optional; it is the artefact a DPIA and an FCA file are built on. It also intersects with data governance: what sits in the knowledge base, how long it is retained, and where it lives are decisions governed by the same rules we set out in voice AI data retention under GDPR, and they belong in the operating model, not bolted on afterward. Teams that formalise this early — as part of an AI operating model with clear ownership of the knowledge layer — avoid the procurement stalls that catch teams who treat the KB as an engineering afterthought.

Latency engineering: making RAG fast enough to speak

Getting retrieval under 200ms is an engineering problem with known levers. None of them is exotic; the discipline is applying them together and measuring at P95, not at the demo.

Cache aggressively and semantically. A large share of calls ask a small set of questions. A semantic cache keyed on the formulated query (not the raw transcript) returns the grounded context for a repeat question in single-digit milliseconds and takes the index out of the path entirely. The cache is invalidated by source changes, not by time, so an entry is dropped as soon as the fact behind it changes rather than lingering until a timer expires.
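A simplified sketch of a cache invalidated by source change rather than by time. The class and method names are hypothetical, and the matching is deliberately reduced to exact keys after normalisation; a genuinely semantic cache would match near-duplicate queries by embedding similarity.

```python
class GroundedContextCache:
    """Cache keyed on the formulated query, invalidated by source change.

    Simplification: keys match exactly after normalisation. A semantic cache
    would match near-duplicate queries by embedding similarity instead.
    """

    def __init__(self):
        self._entries = {}  # key -> (context, frozenset of source IDs)

    @staticmethod
    def _key(query_text, filters):
        normalised = " ".join(query_text.lower().split())
        return (normalised, tuple(sorted(filters.items())))

    def get(self, query_text, filters):
        entry = self._entries.get(self._key(query_text, filters))
        return entry[0] if entry else None

    def put(self, query_text, filters, context, source_ids):
        self._entries[self._key(query_text, filters)] = (context, frozenset(source_ids))

    def invalidate_source(self, source_id):
        """Drop every entry built from the changed source; return how many."""
        stale = [k for k, (_, sources) in self._entries.items() if source_id in sources]
        for k in stale:
            del self._entries[k]
        return len(stale)

cache = GroundedContextCache()
filters = {"region": "UK", "product_line": "energy"}
cache.put("exit fee tariff", filters, "Exit fee is 45 pounds.", ["pricing-db"])
warm = cache.get("Exit  fee TARIFF", filters)   # same query, different spacing and case
dropped = cache.invalidate_source("pricing-db")  # the pricing system published a change
cold = cache.get("exit fee tariff", filters)
print(warm, dropped, cold)
```

Recording which sources each entry was built from is what makes change-driven invalidation possible; without it the only safe option is a time-to-live.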

Pre-fetch on intent, not on completion. You do not have to wait for the caller to finish the sentence. As ASR streams partial transcripts and intent classification fires, speculatively retrieve against the likely query so the candidate set is warm by the time the caller stops speaking. Wasted speculative fetches are cheap; saved latency is not.
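Speculative retrieval can be sketched with `asyncio`: start the fetch on the likely query while the caller is still speaking, then keep or discard it once the final query is known. The `retrieve` coroutine is a stand-in that sleeps to simulate index latency, and the timings are arbitrary.

```python
import asyncio

async def retrieve(query: str) -> list:
    """Stand-in for the retrieval call; sleeps to simulate index latency."""
    await asyncio.sleep(0.05)
    return [f"passage for: {query}"]

async def handle_turn(partial_guess: str, final_query: str) -> list:
    # Start retrieving against the likely query while the caller is still talking.
    speculative = asyncio.create_task(retrieve(partial_guess))
    await asyncio.sleep(0.05)  # simulated remainder of the caller's sentence
    if final_query == partial_guess:
        return await speculative  # warm: the result is ready or nearly ready
    speculative.cancel()          # wrong guess: discard it and fetch properly
    return await retrieve(final_query)

right_guess = asyncio.run(handle_turn("exit fee tariff", "exit fee tariff"))
wrong_guess = asyncio.run(handle_turn("exit fee tariff", "standing charge"))
print(right_guess)
print(wrong_guess)
```

When the guess is right the retrieval time overlaps with the caller's speech; when it is wrong the cost is one cancelled task.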

Scope the index before you search it. Metadata filters that prune to the caller's product, region and segment turn a million-vector search into a thousand-vector search. Smaller search space, faster search, more relevant results — the rare lever that improves speed and quality at once.

Right-size every model in the path. The reranker, the embedder and the generator do not all need to be the largest model available. A small, fast reranker that runs in 20ms often beats a large one that runs in 120ms once you account for the latency it steals from the voice. The same calculus governs the generation model — a smaller grounded model frequently outperforms a larger ungrounded one on a call, because the retrieval is doing the knowing.

Indicative latency budget · grounded voice turn

Stage	Indicative latency
ASR finalisation (after endpoint)	~100–150ms
Query formulation + filters	~20–40ms
Hybrid retrieve + rerank	~80–150ms
Generation to first token	~150–300ms
TTS first byte (streaming)	~80–150ms

Indicative · representative of engagements. A semantic cache hit collapses the middle three rows to near zero. Measure your own P95, not the median, and not the demo.
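As a quick arithmetic check, the indicative ranges above can be summed against the 800ms end-to-end target. The figures are copied from the budget; the cache-hit case follows the note's assumption that the middle three rows drop to roughly zero.

```python
# (low, high) in milliseconds, taken from the indicative budget above.
BUDGET_MS = {
    "ASR finalisation": (100, 150),
    "Query formulation + filters": (20, 40),
    "Hybrid retrieve + rerank": (80, 150),
    "Generation to first token": (150, 300),
    "TTS first byte": (80, 150),
}
TARGET_MS = 800

best = sum(low for low, _ in BUDGET_MS.values())
worst = sum(high for _, high in BUDGET_MS.values())

# On a cache hit only the first and last rows remain, per the note above.
cache_hit = tuple(
    BUDGET_MS["ASR finalisation"][i] + BUDGET_MS["TTS first byte"][i] for i in (0, 1)
)

print(best, worst, worst < TARGET_MS)  # 430 790 True
print(cache_hit)                       # (180, 300)
```

Even the upper end of every range leaves only about 10ms of headroom, which is why the tail, not the median, is the number to watch.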

This is also where the build-versus-buy question becomes concrete. Assembling and maintaining this pipeline — the index, the reranker, the cache, the sync jobs, the eval harness — is a standing engineering commitment, not a one-off integration. The trade-off is the same one we model in the in-house versus vendor operating model: a managed voice platform that ships retrieval grounding in the call path removes most of this surface area, which is precisely what Dilr Voice's voice automation platform is built to do.

Governance and compliance of the knowledge layer

The knowledge base is not just an engineering artefact; it is a regulated data asset, and enterprise procurement treats it as one. Four questions decide whether your retrieval design survives legal review.

What is in it, and is it allowed to be? Special-category data, customer PII and confidential commercial terms can all end up in a knowledge base by accident. The KB needs the same data-minimisation and lawful-basis discipline as any other processing surface, scoped in a DPIA before go-live rather than discovered in an audit.

Who can retrieve what? Access control belongs at the retrieval layer, not just the UI. A voice agent serving a retail customer must not be able to surface an internal pricing-strategy document or another customer's record, and the filter that enforces that has to be part of the index, enforced on every query.
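A minimal sketch of access control applied inside retrieval rather than after it. The `audience` and `customer_id` fields are hypothetical, and the keyword match stands in for the real search; the point is only that disallowed chunks are removed before anything is scored.

```python
def allowed(chunk, caller):
    """A chunk is visible only if its audience matches and, when it belongs
    to a specific customer, that customer is the caller."""
    if chunk["audience"] != caller["audience"]:
        return False
    owner = chunk.get("customer_id")
    return owner is None or owner == caller["customer_id"]

def secure_search(index, query_terms, caller):
    # Access control runs before scoring, so a disallowed chunk is never a candidate.
    visible = [c for c in index if allowed(c, caller)]
    return [c for c in visible if any(t in c["text"].lower() for t in query_terms)]

# Illustrative index entries.
index = [
    {"text": "Standard pricing for the saver tariff.", "audience": "retail"},
    {"text": "Internal pricing strategy for next year.", "audience": "internal"},
    {"text": "Pricing agreed for this account.", "audience": "retail", "customer_id": "C-2"},
]
caller = {"audience": "retail", "customer_id": "C-1"}
results = secure_search(index, ["pricing"], caller)
print([r["text"] for r in results])
```

All three chunks match the keyword, but only the first is returned: the internal document and the other customer's record never reach the ranking stage.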

How fresh, and how provable? Regulators increasingly expect that an automated answer can be reconstructed: what did the agent know, from which source, at what version, at the moment of the call. Provenance-carrying chunks plus retrieval logs give you that reconstruction; a black-box embedding store does not.
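The reconstruction described above depends on what gets logged at retrieval time. A sketch of one possible log record follows; the field names and the example values are assumptions, not a standard schema.

```python
import json
from datetime import datetime, timezone

def retrieval_log_record(call_id, query, passages, now=None):
    """Serialise what the agent knew, from which source and version, and when."""
    now = now or datetime.now(timezone.utc)
    record = {
        "call_id": call_id,
        "timestamp": now.isoformat(),
        "query": query,
        "sources": [
            {
                "doc_id": p["doc_id"],
                "version": p["version"],
                "effective_date": p["effective_date"],
                "score": p["score"],
            }
            for p in passages
        ],
    }
    return json.dumps(record, sort_keys=True)

line = retrieval_log_record(
    "call-0001",
    "cover after moving house",
    [{"doc_id": "home-policy", "version": "v8",
      "effective_date": "2026-03-01", "score": 0.91}],
    now=datetime(2026, 6, 1, 9, 30, tzinfo=timezone.utc),
)
print(line)
```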

Where does it live? Knowledge-base storage, embeddings and the call data that flows through retrieval are all subject to residency and transfer obligations. For UK and EMEA deployers these decisions sit alongside the broader data-governance posture, and they belong in the same procurement conversation as the rest of the stack — the kind of structured evaluation our DATS five-stage methodology and our deployment approach are built around. Getting the knowledge layer's governance right early is far cheaper than retrofitting it, which is the recurring lesson across every regulated voice deployment we run.

Evaluating live-call RAG: the metrics that matter

You cannot improve what you do not measure, and voice RAG needs metrics that a chatbot evaluation does not surface. Five are non-negotiable.

Metric	What it catches	Target signal
Retrieval recall@k	Whether the right passage was even fetched	High and stable across intents
Groundedness	Whether the answer is supported by retrieved context	Near-total; ungrounded claims trend to zero
Answer correctness	Whether the spoken answer is actually right	Graded against a held-out answer set
Retrieval latency P95	Tail slowness that breaks the conversation	Under the slice budget, not just the median
Containment with quality floor	Resolution that does not sacrifice correctness	High containment AND high correctness together

The trap to avoid is optimising containment alone. An agent that answers everything confidently will post a beautiful containment rate and a terrible correctness rate, and you will not know until the complaints arrive. Groundedness and correctness are the guardrails that keep containment honest. This is the same "measure the thing that actually matters, not the flattering proxy" discipline we set out in evaluating voice AI accuracy beyond WER — word error rate tells you the ASR heard the caller; only groundedness tells you the agent told the truth. Stanford's AI Index 2026 notes that fewer than one in ten enterprises has fully scaled AI in any single function — and in voice, the evaluation discipline above is most of what separates the ones that scale from the ones stuck in pilot.
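Three of the metrics above reduce to short calculations, sketched below with invented sample data. Groundedness and answer correctness are left out because they usually need a graded answer set or a judge, not a formula. The percentile uses the nearest-rank method, and the 0.95 correctness floor is an illustrative value.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Share of relevant passages that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def p95(latencies_ms):
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    if not latencies_ms:
        raise ValueError("no latencies recorded")
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]

def containment_with_floor(calls, floor=0.95):
    """Return (containment, correctness among contained calls, floor met?)."""
    contained = [c for c in calls if c["contained"]]
    containment = len(contained) / len(calls)
    correctness = (sum(c["correct"] for c in contained) / len(contained)
                   if contained else 0.0)
    return containment, correctness, correctness >= floor

latencies = [120, 90, 110, 95, 400, 100, 105, 98, 102, 130]
calls = [
    {"contained": True, "correct": True},
    {"contained": True, "correct": False},
    {"contained": True, "correct": True},
    {"contained": False, "correct": False},
]
print(recall_at_k(["a", "b", "c"], ["b", "d"], k=2))
print(p95(latencies))
print(containment_with_floor(calls))
```

In the sample data the typical retrieval takes about 100ms while the P95 is 400ms, and containment is 75% while a third of the contained calls were wrong, which is exactly the pattern a containment-only dashboard hides.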

Want to see grounded retrieval in production? Try Dilr Voice live with $20 of credits, book an AI placement diagnostic to scope your knowledge layer, or read how we structure delivery in the AI execution office.

Frequently asked questions

Is RAG the same as fine-tuning for a voice agent?

No. Fine-tuning bakes patterns into the model's weights; RAG keeps facts in an external, updatable knowledge base the agent reads at call time. For enterprise voice you want both jobs separated: fine-tuning (or careful prompting) for tone, format and refusal behaviour, and retrieval for the facts that change — pricing, policy, hours, entitlements. Retraining a model every time a tariff changes is neither fast nor auditable; updating a chunk is both.

Can we just use a bigger model instead of RAG?

A bigger model knows more of the public internet, not more about your business. It still cannot tell a caller their specific renewal date or your branch's bank-holiday hours, and a larger ungrounded model often hallucinates those with more fluency, which is worse on a recorded line. Retrieval is how the agent knows your facts; model size is a separate dial.

How is live-call RAG different from RAG in a chatbot?

Four ways: a sub-200ms retrieval budget instead of a tolerant one, no opportunity for the caller to re-ask a vague answer, zero tolerance for an ungrounded claim on a recorded, often regulated, line, and an answer that has to be short enough to absorb by ear. The pipeline shape is similar; the engineering (caching, pre-fetch, reranker sizing, streaming) is tuned for latency and for speakability in a way a chatbot never has to be.

What happens when retrieval returns nothing good?

It triggers a designed grounded refusal or a warm escalation to a human with full call context — never a confident guess. The reranker's confidence score is the gate, and the threshold is a business decision calibrated to the risk of the call type. A clean "I don't have that to hand, let me put you through" preserves trust; a fabricated answer destroys it.

Is our knowledge base ready for retrieval as-is?

Usually not without work. Content written for human reading — long PDFs, legalese, sprawling intranet pages — needs semantic chunking, metadata enrichment (effective date, region, product, authority) and a freshness pipeline before it retrieves cleanly on a call. The knowledge base, not the model, is where most voice RAG projects succeed or fail.


Written by the Dilr.ai engineering team, practitioners who ship retrieval-grounded voice agents in production for regulated enterprise.
