A voice agent that quotes the wrong premium on a motor insurance call is no longer a UX problem. It is a misrepresentation under FCA Consumer Duty, an inaccurate-data event under UK GDPR Article 5(1)(d), and — if the caller is in California — a $5,000-per-violation breach of SB 942. The same hallucination, three regulatory triggers, one procurement failure.
This is the shift the market has not internalised. Voice agents have moved from quoting hours and reading FAQs to executing payments, updating patient records, resetting credentials, and confirming policy terms. The output of the model is now the legal act of the enterprise. When the model invents, the enterprise has acted falsely — and the regulator does not care that the lie came from a transformer.
The procurement teams we work with still treat hallucination as a demo question: "Can you show me it doesn't make things up?" That question gets a confident demo answer and tells you nothing. The right question is structural: what is your hallucination containment architecture, and can you evidence it under audit? If a vendor cannot answer in the first thirty minutes, they should not be on the shortlist.
This is the framework we use when enterprise procurement teams ask us to evidence containment on Dilr Voice deployments under FCA, ICO and EU AI Act regimes. For the broader governance picture, see the AI operating model, which turns the architecture below into RACI, controls and audit evidence.
Hallucination is not a model property to test for. It is an architecture to evidence. Procurement that treats it as a feature question buys a regulatory liability; procurement that treats it as a containment system buys a defensible deployment.
Why "show me it doesn't hallucinate" is the wrong question. Vendors love this question because they cannot lose it. Pick a happy-path script, run it three times in a controlled demo, point at the clean transcripts. The buyer leaves reassured. Six months later the same agent invents a refund policy at 23:47 on a Saturday and the compliance team is on the phone to legal.
Stanford's HELM benchmarks and the Vectara Hallucination Leaderboard both put base hallucination rates on factual summarisation tasks between 3% and 27% across frontier models — and that is in optimised, latency-tolerant conditions. Drop a model into a 600-millisecond voice loop with interrupted speech, partial ASR transcripts and a knowledge base that updates weekly, and the upper bound is materially higher. No demo can falsify this. Only architecture can contain it.
The regulatory frame is converging on this point faster than most procurement functions realise. California's SB 942 attaches a per-violation civil penalty to undisclosed or misleading AI output. The EU AI Act Article 50 places transparency and disclosure obligations on operators of conversational AI, with enforcement penalties up to €15M or 3% of global turnover. The UK ICO's automated decision-making guidance under UK GDPR Article 22, combined with FCA Consumer Duty in financial services, treats a confidently spoken inaccurate statement as a live data and conduct event — not a software defect.
Three regulators, one architectural question. The companies that answer it will deploy. The companies that hand-wave it will be told "no" by their own legal teams, and they will blame the vendor.
The four-layer hallucination containment architecture
A defensible deployment has four containment layers, each with its own audit artefact. Treat any layer as optional and the chain breaks under regulatory scrutiny.
Layer 1 — Retrieval grounding
The model does not answer from parameters. It answers from a typed, versioned, source-of-truth retrieval over the enterprise's own knowledge base — pricing, policy, clinical pathways, product specs. Every utterance is tied to a retrieval ID, a document version, and a timestamp. If the knowledge base is wrong, that is a content-governance issue. If the model invents around the retrieval, the system blocks it at Layer 3.
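To make Layer 1 concrete, here is a minimal sketch of a per-turn grounding record in Python. The `GroundedPassage` shape and the `kb.search` backend are illustrative assumptions, not the Dilr Voice implementation; the point is that every passage the model may speak from carries a retrieval ID, a document version and a timestamp.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class GroundedPassage:
    """One KB passage the model may speak from, pinned to an auditable version."""
    retrieval_id: str        # stable ID for this retrieval event
    document_id: str         # source document in the knowledge base
    document_version: str    # KB document version at retrieval time
    text: str                # the evidenced content for this turn
    retrieved_at: datetime   # when the retrieval happened

def ground_turn(query: str, kb) -> list[GroundedPassage]:
    """Answer from versioned KB passages, never from model parameters.

    `kb` is a stand-in for whatever retrieval backend the deployment uses.
    """
    return [
        GroundedPassage(
            retrieval_id=hit.retrieval_id,
            document_id=hit.doc_id,
            document_version=hit.doc_version,
            text=hit.text,
            retrieved_at=datetime.now(timezone.utc),
        )
        for hit in kb.search(query, top_k=3)
    ]
```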
Layer 2 — Constrained generation
Free-text generation is the wrong default for high-stakes turns. Numbers, identifiers, policy terms, dates and named entities should be slot-filled from structured retrieval, not generated. The model's job is to compose grammatically — not to invent values. This is the same principle that has kept structured workflow voice agents safe for a decade; the new generation has forgotten it in favour of "let the LLM speak freely". For regulated turns, that is the wrong trade.
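A minimal sketch of slot-filled composition, using Python's standard `string.Template`; the template names and slots are hypothetical, not a vendor API. The property that matters: slot values come from structured retrieval, and a missing slot raises an error instead of being invented.

```python
from string import Template

# Committal values are slot-filled from structured retrieval, never generated.
APPROVED_TEMPLATES = {
    "quote_premium": Template(
        "Your ${product} premium is ${currency}${amount} per month, "
        "fixed until ${renewal_date}."
    ),
}

def speak_quote(slots: dict[str, str]) -> str:
    """Compose a regulated turn from an approved template.

    Template.substitute raises KeyError on a missing slot, so the agent
    can never speak a quote containing an absent value.
    """
    return APPROVED_TEMPLATES["quote_premium"].substitute(slots)

# Slot values sourced from the turn's grounded retrieval, not from the LLM:
print(speak_quote({
    "product": "motor insurance",
    "currency": "£",
    "amount": "42.10",
    "renewal_date": "1 March 2026",
}))
```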
Layer 3 — Pre-utterance validation
Before any committal sentence is sent to TTS, a second pass validates: does the proposed utterance contain only entities, numbers and claims drawn from this turn's retrieval set? If yes, speak. If no — deflect, hand to a human, or fall back to a safe scripted line. This is the layer most vendors do not have, and it is the one regulators care about most. It is the difference between "the model usually doesn't hallucinate" and "the system cannot speak an unevidenced claim."
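A stripped-down sketch of the pre-utterance gate, checking only numeric tokens for brevity; a production validator would also cover named entities, policy terms and dates. Function names are ours, for illustration.

```python
import re

NUMBER = re.compile(r"\d+(?:[.,]\d+)?")

def numbers_in(text: str) -> set[str]:
    """Numeric tokens: amounts, dates, percentages."""
    return set(NUMBER.findall(text))

def validate_utterance(proposed: str, retrieval_texts: list[str]) -> bool:
    """Pass only if every number in the proposed utterance appears
    somewhere in this turn's retrieval set."""
    evidenced: set[str] = set()
    for passage in retrieval_texts:
        evidenced |= numbers_in(passage)
    return numbers_in(proposed) <= evidenced

def gate_to_tts(proposed: str, retrieval_texts: list[str]) -> str:
    """Speak validated text; otherwise fall back to a safe scripted line."""
    if validate_utterance(proposed, retrieval_texts):
        return proposed
    return "Let me put you through to a colleague who can confirm that."
```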
Layer 4 — Post-call audit trail
For each call, persist: full transcript, retrieval IDs hit per turn, model version, prompt version, validation pass/fail flags, and any deflection events. This is the evidence pack a regulator will ask for. Under ICO automated decision-making guidance and EU AI Act Article 12 logging requirements, the burden of proof sits with the deployer. "We trusted the model" is not a defence. "Here is the audit trail showing the system blocked 11,400 unevidenced utterances last quarter" is.
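As a sketch, the per-call evidence pack can be as simple as an append-only record like the one below. Field names are illustrative, but they map one-to-one onto the artefacts listed above.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class TurnAudit:
    turn_index: int
    transcript: str
    retrieval_ids: list[str]   # ties the turn back to Layer 1 evidence
    validation_passed: bool    # Layer 3 pass/fail flag
    deflected: bool            # True if the turn fell back or handed off

@dataclass
class CallAudit:
    call_id: str
    model_version: str
    prompt_version: str
    turns: list[TurnAudit] = field(default_factory=list)

    def to_json(self) -> str:
        """One append-only record per call; this is the evidence pack."""
        return json.dumps(asdict(self), ensure_ascii=False)
```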
This containment thinking is the same architectural logic we describe in the EU AI Act Article 50 voice AI disclosure post — disclosure and grounding are two halves of the same transparency obligation. If you are early in the business case, the enterprise voice AI vendor checklist translates this architecture into RFP scoring criteria.
What this means for the procurement gate
The implication for buyers is concrete. Hallucination containment is not a question on page 14 of the security questionnaire. It is a Stage 1 gate — pass or fail, before commercial conversations begin.
| Procurement stage | Old question | Procurement-gate question | Vendor evidence required |
|---|---|---|---|
| Stage 1 — Shortlist | "Does it hallucinate?" | "Describe your four-layer containment architecture in 30 minutes." | Architecture diagram + named control owner per layer |
| Stage 2 — Diligence | "Can you show a clean demo?" | "Show 100 randomly sampled production transcripts with retrieval IDs and validation flags." | Live audit dashboard, not curated examples |
| Stage 3 — Legal review | "Are you GDPR-compliant?" | "Map your containment to UK GDPR Art. 22 (ICO guidance), EU AI Act Art. 50 + Art. 12, and SB 942 §22757.1." | Written compliance map, signed by vendor's DPO |
| Stage 4 — Pilot exit | "What's the accuracy?" | "What is the rate of blocked unevidenced utterances, and is it falling?" | Quarterly trend, with root-cause categorisation |
A vendor that cannot pass Stage 1 in thirty minutes does not get to Stage 2. This is not gatekeeping for sport — it is the only way a Head of Risk can sign off a deployment that touches a regulated workflow.
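The Stage 4 metric is worth making concrete, because it is the number a pilot-exit review turns on and it falls straight out of the Layer 4 audit records. A sketch, assuming the `TurnAudit` shape from the Layer 4 example plus a hypothetical `block_reason` label on blocked turns:

```python
from collections import Counter

def blocked_rate(turns: list) -> float:
    """Share of turns where Layer 3 blocked an unevidenced utterance."""
    if not turns:
        return 0.0
    return sum(1 for t in turns if not t.validation_passed) / len(turns)

def quarterly_trend(by_quarter: dict[str, list]) -> dict[str, float]:
    """Blocked-utterance rate per quarter; a falling trend is the exit signal."""
    return {q: blocked_rate(ts) for q, ts in by_quarter.items()}

def root_causes(turns: list) -> Counter:
    """Tally block reasons for root-cause categorisation.

    Assumes each blocked turn carries a `block_reason` label,
    e.g. 'number_not_in_retrieval' or 'unknown_entity'.
    """
    return Counter(t.block_reason for t in turns if not t.validation_passed)
```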
The contrarian point worth naming: the vendors most likely to fail this gate are the ones with the loudest "agentic" marketing. The more autonomy you grant the model, the more containment you must evidence. The companies running narrow, structured, retrieval-grounded voice flows often have stronger containment architectures than the ones promising end-to-end "AI employees". Procurement should reward the boring architecture, not the impressive demo. The same logic applies in adjacent areas — see how we frame the voice AI orchestration vs platform decision for buyers in regulated sectors.
What "good" looks like in your RFP. Three concrete additions to any voice AI RFP, lifted from deployments that have survived legal review:
- Containment architecture document — required as part of the response, not on request. Names the four layers, the controls per layer, the audit artefact per layer.
- Production transcript sampling rights — the buyer has the right to randomly sample 1% of production transcripts for the first 90 days, with retrieval IDs and validation flags exposed.
- Regulator notification clause — the vendor commits to notify the buyer within 24 hours of any regulator inquiry into hallucination-related events on the platform, across the vendor's full customer base.
These three clauses do more to de-risk a £250k+ voice AI deployment than any amount of demo time. They are also the clauses that separate vendors built for enterprise from vendors built for the demo.
Working through your own procurement gate? Try Dilr Voice with the four-layer architecture live, book an AI placement diagnostic, or read about our approach to placing voice AI inside regulated enterprise workflows.
Make hallucination containment a procurement gate.
30-min scoping call. We map your containment architecture to ICO, FCA, EU AI Act and SB 942 obligations — and tell you whether your shortlist passes Stage 1.
Written by the Dilr.ai engineering team — practitioners who ship enterprise AI in production. Follow us on LinkedIn for shipping notes, or subscribe via the RSS feed.