AI voice agent QA and testing: an enterprise framework

Enterprise voice deployments in 2026 are no longer rare. Production voice agent implementations have grown roughly 340% year-over-year across hundreds of large organisations, and 67% of Fortune 500 companies now run production voice AI in at least one customer surface. The problem is what those teams are using to validate quality before — and after — go-live. Traditional contact-centre QA teams review 1–2% of calls by hand. When an AI voice agent is handling thousands of conversations a day, that sampling rate is a liability, not a methodology.

The mistake we see most often inside enterprise procurement is treating voice agent QA as a slight variation of human-agent QA: more dashboards, the same scorecards. It is not. An AI agent fails in ways a human never does — hallucination, prompt injection, latency cliffs under load, silent regression after a model update — and it succeeds at the things human QA used to catch by chance. Buyers who deploy without a deliberate testing framework end up running risk experiments on live customers. This guide is the framework we use with enterprise AI voice agents before they ever take a real call, and the testing scaffolding is now embedded into every Dilr Voice deployment we ship — alongside the AI execution office that owns the test plan post go-live.

This framework is shipped by the team behind Dilr Voice, enterprise voice AI deployed across regulated UK and EU surfaces. For a fixed-fee assessment before any voice deployment commitment, see our AI placement diagnostic.

Key takeaway

You cannot QA an AI voice agent the same way you QA human agents. Production-grade voice AI needs four parallel test layers (unit/scenario, adversarial red-teaming, shadow/canary, and 100% production scoring) running continuously, not at release gates. Buyers who skip Layers 2 and 3 are the ones who end up stuck in AI voice pilot purgatory.

1–2%: calls reviewed under legacy human QA
100%: target coverage on production AI voice
50+: adversarial vulnerability classes to test
340%: YoY growth in production voice deployments

The four-layer QA stack

Treat AI voice QA as four parallel disciplines, each catching failure modes the others miss. Skipping a layer does not save time — it pushes the same defects into a worse part of the customer cycle.

Layer 1 — Unit and scenario tests (deterministic)

The basics: every intent, every slot, every API call hit by a scripted test before code merges. Use synthetic transcripts and synthetic audio for ASR + NLU paths. Coverage targets we hold internally: 100% of named intents, 100% of fulfilment endpoints, ≥95% slot recall on golden-set audio. This is the layer that catches the silly stuff — regressions after a model swap, broken DTMF handling, dropped CRM writes. It is also where you validate voice AI accuracy beyond WER on the four-dimension accuracy scorecard.
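One way to hold those coverage targets is to make them a merge gate. The sketch below is illustrative only: `ScenarioResult`, `slot_recall`, `coverage_gate` and the field names are hypothetical, not part of any Dilr Voice API, and it assumes each scripted scenario records the intent it exercised, the fulfilment endpoint it hit, and the expected versus extracted slots.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScenarioResult:
    """Outcome of one scripted test against golden-set audio (hypothetical shape)."""
    intent: str                      # named intent the scenario exercises
    endpoint: Optional[str]          # fulfilment endpoint the scenario hit, if any
    expected_slots: dict             # slot name -> expected value
    extracted_slots: dict            # slot name -> value the agent extracted


def slot_recall(results):
    """Fraction of expected slots that were extracted with the correct value."""
    expected = sum(len(r.expected_slots) for r in results)
    correct = sum(
        1
        for r in results
        for name, value in r.expected_slots.items()
        if r.extracted_slots.get(name) == value
    )
    return correct / expected if expected else 1.0


def coverage_gate(results, named_intents, endpoints, min_slot_recall=0.95):
    """Return a list of gate failures; an empty list means the merge may proceed."""
    failures = []
    missing_intents = set(named_intents) - {r.intent for r in results}
    missing_endpoints = set(endpoints) - {r.endpoint for r in results if r.endpoint}
    if missing_intents:
        failures.append(f"untested intents: {sorted(missing_intents)}")
    if missing_endpoints:
        failures.append(f"untested endpoints: {sorted(missing_endpoints)}")
    recall = slot_recall(results)
    if recall < min_slot_recall:
        failures.append(f"slot recall {recall:.2%} below {min_slot_recall:.0%}")
    return failures
```

Returning a list of failures rather than a single boolean keeps the CI output readable: the merge is blocked with the specific intents, endpoints or recall shortfall named.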

Layer 2 — Adversarial and red-team tests

This is the layer most enterprise buyers do not even know to ask about, and it is where the largest reputational risks sit. The Aegis framework for voice agents formalises a taxonomy of adversarial scenarios: authentication bypass, resource abuse, privilege escalation, data poisoning, prompt injection, privacy leakage, and audio-specific attacks (background-speech injection, ultrasonic carriers, transcoded payloads). ASAPP's May 2026 launch of continuous red-teaming, screening more than 50 vulnerability classes per release, is now the floor — not the ceiling — of what a credible vendor offers. If your evaluation does not include a documented adversarial battery, you are not testing the agent; you are testing your own optimism.
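A documented adversarial battery can start very small. The sketch below is an assumption-laden illustration, not the Aegis framework or any vendor's implementation: the `AttackClass` names simply mirror the categories listed above, and `AdversarialCase`, `uncovered_classes` and `run_battery` are hypothetical helpers. The agent is modelled as any text-in, text-out callable, so audio-specific attacks would need a richer harness.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class AttackClass(Enum):
    """Adversarial categories named in the text; a real battery would subdivide these."""
    AUTH_BYPASS = "authentication bypass"
    RESOURCE_ABUSE = "resource abuse"
    PRIVILEGE_ESCALATION = "privilege escalation"
    DATA_POISONING = "data poisoning"
    PROMPT_INJECTION = "prompt injection"
    PRIVACY_LEAKAGE = "privacy leakage"
    AUDIO_ATTACK = "audio-specific attack"


@dataclass
class AdversarialCase:
    attack_class: AttackClass
    caller_turn: str                   # what the simulated attacker says
    is_safe: Callable[[str], bool]     # True if the agent's reply is acceptable


def uncovered_classes(battery):
    """Attack classes that have no documented test case in the battery."""
    return set(AttackClass) - {case.attack_class for case in battery}


def run_battery(agent, battery):
    """Count failures per attack class for a text-in, text-out agent callable."""
    failures = Counter()
    for case in battery:
        if not case.is_safe(agent(case.caller_turn)):
            failures[case.attack_class] += 1
    return failures
```

The useful procurement question falls out of `uncovered_classes`: any category it returns is a category the vendor has not shown a test for.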

The same red-team logic underpins our AI operating model consulting — the layer where governance, RACI and the test plan are jointly owned.

Layer 3 — Shadow and canary calls

Once Layers 1 and 2 are green, the agent runs in shadow against a percentage of real traffic: listening, predicting, logging, but not speaking. You compare predicted action against the human agent's action across thousands of calls. Then canary release: 1% of live traffic, then 5%, then 25%, with automatic rollback if KPI deltas exceed pre-agreed thresholds (containment, escalation rate, AHT, CSAT). This layer is what separates teams who pass through pilot purgatory from teams who get stuck in it, and it links directly into the AI voice program KPIs you will report to the board.
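The canary ramp with automatic rollback can be sketched as a small decision function. Everything numeric in `MAX_DELTAS` is a hypothetical placeholder for the thresholds you would agree in advance; only the 1% / 5% / 25% stages and the four KPI names come from the text. `breached_kpis` and `next_traffic_share` are illustrative names, not an existing API.

```python
RAMP_STAGES = [0.01, 0.05, 0.25]  # share of live traffic, as described above

# Hypothetical tolerances: worst acceptable change versus the human baseline.
# A negative limit means the KPI may not fall by more than that amount;
# a positive limit means it may not rise by more than that amount.
MAX_DELTAS = {
    "containment": -0.03,      # containment rate may drop at most 3 points
    "escalation_rate": 0.02,   # escalation rate may rise at most 2 points
    "aht_seconds": 20.0,       # average handle time may rise at most 20 seconds
    "csat": -0.10,             # CSAT may drop at most 0.10 on its own scale
}


def breached_kpis(baseline, canary):
    """Names of KPIs whose canary-vs-baseline delta exceeds the agreed tolerance."""
    breached = []
    for kpi, limit in MAX_DELTAS.items():
        delta = canary[kpi] - baseline[kpi]
        worse = delta < limit if limit < 0 else delta > limit
        if worse:
            breached.append(kpi)
    return breached


def next_traffic_share(current, baseline, canary):
    """Advance one ramp stage if every KPI holds; roll back to 0.0 on any breach."""
    if breached_kpis(baseline, canary):
        return 0.0
    later = [stage for stage in RAMP_STAGES if stage > current]
    return later[0] if later else current
```

Encoding the thresholds as data rather than judgement is the point: the rollback decision is made before the ramp starts, not argued about during an incident.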

What production QA actually looks like at scale

Layer 4, production scoring, is where most vendors over-promise and under-deliver. Production QA at scale means 100% of live calls auto-scored, not sampled, across a multi-dimensional rubric. The legacy "1–2% sampled by humans" model breaks at the first 10,000-call month. PolyAI, ASAPP, and the better orchestration platforms have all converged on the same shape: an embedded scoring model that runs on every call, plus a feedback loop that pushes flagged calls to humans for adjudication and back into the eval set.

The minimum scoring dimensions we look for are below. Anything less, and you are running a dashboard, not a QA system.

Dimension	What it catches	Sample frequency	Owner
Conversational fidelity	Did the agent stay on script intent, handle barge-in, recover from interruption	100% of calls	Voice ops
Compliance and disclosure	EU AI Act Article 50 disclosure, GDPR/PECR consent capture, retention notice	100% of calls	Compliance
Outcome accuracy	Did the booking write, did the payment process, did the CRM update	100% of calls	Engineering
Hallucination / fabrication	Did the agent invent a policy, price, or fact not in the knowledge base	100% of calls	AI governance
Safety / red-team replays	Replay of last week's adversarial findings against this week's model	Weekly batch	Security

Two notes on this table. First, the same scoring stack is what regulated buyers should be asking for at procurement — and we cover the procurement read in voice AI vendor evaluation. Second, this scoring rubric is also how you catch voice AI hallucination at the procurement gate before it becomes a board-level incident.

The loop is the point. Most enterprise teams ship a one-off test plan at go-live and never close the feedback ring. Without the loop from production back into the eval set, you accumulate drift you cannot see — and the first signal you get is a CSAT drop or a regulator email. The closed-loop scoring inside our Dilr Voice agents writes every flagged call back into the next-day eval batch automatically — no human re-keying required.
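In code, the loop is small; the discipline is in running it on every call. The following is a minimal sketch under stated assumptions, not the Dilr Voice implementation: calls are plain dicts with `call_id` and `transcript` keys, each rubric dimension is a hypothetical scorer that returns True when the call passes, and the human adjudication step is omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class EvalSet:
    """Next-day eval batch, keyed by call ID so a call is only stored once."""
    cases: dict = field(default_factory=dict)

    def add(self, call, failed):
        self.cases[call["call_id"]] = {
            "transcript": call["transcript"],
            "failed": failed,
        }


def score_call(call, scorers):
    """Run every scorer on one call and return the dimensions it failed."""
    return [name for name, passes in scorers.items() if not passes(call)]


def close_the_loop(calls, scorers, eval_set):
    """Score all calls (no sampling) and write flagged ones into the eval set.

    Returns the number of calls flagged on at least one dimension.
    """
    flagged = 0
    for call in calls:
        failed = score_call(call, scorers)
        if failed:
            eval_set.add(call, failed)
            flagged += 1
    return flagged
```

A call that fails any dimension becomes a regression test for the next model version, which is how drift becomes visible before CSAT moves.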

Script deviation detection — a category most buyers miss

Voice agents are increasingly LLM-driven, which means the literal text the agent says is generated, not selected from a fixed library. That is a feature, but it is also a QA problem: a model update can change the agent's phrasing in ways that pass all the deterministic tests and still violate brand guidelines, compliance scripts, or regulated disclosures. Script-deviation detection is a separate evaluator that compares actual utterances against required content — Article 50 disclosure language, FCA Consumer Duty phrasing, recording notice — and flags any call where required language is paraphrased away. This is one of the differences between LLM vs scripted voice agents at the architecture layer, and it is the single biggest gap in vendor demos we see during diagnostics.
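The simplest form of a script-deviation evaluator is a strict literal check, sketched below. The wording in `REQUIRED_PHRASES` is invented for illustration; real required text would come from your compliance team. This version deliberately treats any paraphrase as a deviation, so a production evaluator would probably need a list of approved variants or a semantic comparison on top.

```python
import re

# Hypothetical required wording, for illustration only.
REQUIRED_PHRASES = {
    "ai_disclosure": "you are speaking with an ai assistant",
    "recording_notice": "this call is recorded",
}


def normalise(text):
    """Lowercase and collapse punctuation and whitespace so formatting is ignored."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def missing_disclosures(agent_utterances):
    """Return the required phrases that never appear verbatim after normalisation."""
    spoken = normalise(" ".join(agent_utterances))
    return [
        name
        for name, phrase in REQUIRED_PHRASES.items()
        if normalise(phrase) not in spoken
    ]
```

For example, an agent that says "Hi, I'm a virtual helper" instead of the required disclosure would be flagged on `ai_disclosure`, even though a deterministic intent test would still pass.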

Speak to our team about embedding script-deviation tests in your evaluation — book a 30-minute scoping call via contact and we will walk through the regulator-facing rubric we use with UK financial-services clients.

Ongoing performance monitoring after go-live

QA is not a release-gate exercise. Once the agent is live, the model can drift, the upstream LLM can change, the ASR can degrade under new accents, and the knowledge base can fall out of sync with the source-of-truth systems. The teams who treat ongoing monitoring as part of the product — not part of compliance — are the same teams who graduate out of pilot. The teams who treat it as paperwork are the ones whose KPIs decay quietly until a quarterly review.

Three monitoring artefacts every enterprise voice programme should produce on a weekly cadence:

Drift dashboard — model version, eval-set pass rate this week vs last, hallucination flag rate, top 10 net-new failure modes.
Red-team replay report — last week's adversarial findings, re-run against this week's model, with a documented pass/fail decision per finding.
Regulator-facing artefact file — call samples, scoring rubric, model version, disclosure audit. This is the file the FCA, ICO or auditor will ask for; produce it before they ask. The architecture is the same one we describe in AI voice governance, and it is the artefact our DATS five-stage methodology hands over to the steering committee.
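The drift dashboard in particular reduces to a comparison of two weekly snapshots. The sketch below assumes a hypothetical snapshot shape (a dict with `model_version`, `eval_passed`, `eval_total`, `calls`, `hallucination_flags`, and a `failure_modes` list with one label per failed call); none of these names come from an existing tool.

```python
from collections import Counter


def drift_summary(last_week, this_week, top_n=10):
    """Compare two weekly snapshots and return the drift dashboard fields."""

    def pass_rate(week):
        return week["eval_passed"] / week["eval_total"]

    # A failure mode is "net-new" if it did not occur at all last week.
    seen_before = set(last_week["failure_modes"])
    net_new = Counter(
        mode for mode in this_week["failure_modes"] if mode not in seen_before
    )
    return {
        "model_version": this_week["model_version"],
        "pass_rate": pass_rate(this_week),
        "pass_rate_delta": pass_rate(this_week) - pass_rate(last_week),
        "hallucination_flag_rate": this_week["hallucination_flags"] / this_week["calls"],
        "net_new_failure_modes": net_new.most_common(top_n),
    }
```

A negative `pass_rate_delta` alongside a changed `model_version` is the pattern to watch for: it is the signature of a silent regression after a model update.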

If you are scoping this for a regulated deployment, the next step is concrete. Try Dilr Voice with the testing stack pre-wired, book an AI placement diagnostic, see our DATS methodology, or read about our approach to placing AI inside enterprise systems.

Most buyers procure on a feature list. The ones who get to production EBIT impact procure on a test plan. According to McKinsey's State of AI 2025, ~88% of enterprises now use AI but only ~6% capture material EBIT impact, and the gap is rarely about model choice. It is about the discipline of evaluation around the model. The Stanford AI Index 2026 confirms the same: fewer than 10% of enterprises have fully scaled AI in any single function. Voice is one of the few functions where the path to production is well understood, provided you build the test stack first and the voice AI agent quality scoring loop second.


Written by the Dilr.ai engineering team, practitioners who ship enterprise AI in production.
