
AI voice pilot purgatory: Why 70% of programmes stall

AI voice pilot purgatory traps 70% of enterprise programmes. Four pre-pilot decisions (success metrics, integration scope, exit criteria and operating ownership) decide whether scale happens.

[Pilot funnel: pilots launched 100% → reach production ~30% → scale across the business 7%]

In late 2025, McKinsey reported that 88% of enterprises now use AI in some form — yet only 6% have achieved material EBIT impact. The gap between the two numbers is the single most expensive line item in the 2026 boardroom. It is the cost of pilot purgatory, and voice AI is one of its largest tenants.

The pattern is consistent. A respected vendor demos well. Procurement signs off on a contained pilot — one queue, one outbound use case, three months. The pilot delivers a number that looks promising. Then it sits there. Eighteen months later, the original sponsor has moved roles, the integration was never built past Phase 1, and a new vendor is being evaluated to "do it properly this time". The capital is gone. The board is irritated. The use case is no closer to production.

This post argues that AI voice pilot purgatory is rarely a budget problem or a technology problem. It is a structural problem caused by four pre-pilot decisions that almost no enterprise gets right the first time. Fix those four decisions before the pilot starts and scale becomes a sequencing question. Skip them and the pilot is dead before the first call connects.

This piece is shipped by the team behind Dilr Voice — enterprise voice AI live across regulated UK and EU operators. If your organisation is staring at a stalled programme, start with an AI placement diagnostic, our 4–6 week, fixed-fee assessment of where the EBIT actually moves.

Key takeaway

Pilot purgatory is structural, not technical. The four decisions that determine whether a voice AI pilot graduates to scale — success metric design, integration scope, exit criteria, and operating ownership — are all made before a single call is placed. If any one of them is wrong, no amount of model performance will rescue the programme.

The scale of the stall — and why voice AI is the worst-affected use case

Aggregate the most recent enterprise data and a clear shape emerges. McKinsey's State of AI 2025 puts 88% of enterprises in some form of AI use, but only 33% have AI in production and just 6% are AI-mature. Stanford's AI Index 2026 finds that fewer than 10% of organisations have fully scaled AI in any single function. ServiceNow's Enterprise AI Maturity Index 2026 segments the field as 25% Exploring, 35% Implementing, 25% Scaling, 12% Optimising and 3% Leading — meaning 60% of enterprises have not yet escaped the build phase. BCG's Widening AI Value Gap labels 60% of organisations as Laggards, capturing little to no value from their spend.

Voice AI sits inside this same funnel, but with three structural disadvantages that make pilot purgatory more likely:


First, voice cuts across more functional owners than any other AI workload. A voice agent touches contact centre operations, IT, telephony, CRM, compliance, training and the front-line union. No single executive owns every dependency, and absent owners do not unblock integrations.

Second, the success metric is contested. Operations measures average handle time. Sales measures booked meetings. Compliance measures consent-capture rates. The CFO measures cost per resolved interaction. A pilot that improves the wrong one of these by 20% will still be shut down by the executive who measures a different one.

Third, voice carries explicit regulatory exposure that other AI use cases avoid. The UK ICO's AI Code of Practice (effective May 2026) and the EU AI Act's Article 50 disclosure obligations both apply to voice. A pilot scoped without compliance representation in the room cannot graduate without a re-architecture.

The Solix analysis published in April 2026 — The Bill Comes Due — captures the boardroom consequence. After 24 months of capital deployed, FY26 budget reviews are now demanding production economics, not pilot anecdotes. The CFOs and COOs reading those papers are the same buyers DILR.AI works with. They are not patient.

The four pre-pilot decisions

Before scoping a pilot, four decisions need explicit, documented answers — signed off by named owners.

1. The success metric must be the production metric. A pilot measured on "AI accuracy" will pass and never deploy. A pilot measured on the same KPI the production line of business is held to — cost per resolved interaction, qualified meetings booked, collections recovered, appointments confirmed — passes only if the rest of the operating model is ready to absorb it.

2. Integration scope must include the systems that block scale, not just the systems that make the demo work. A pilot that connects to the CRM but skips the workforce management system, the dialler licensing model, the telephony provider's session limits, and the compliance recording archive is a demo with a phone number. Real production sits behind the integration items teams want to defer to Phase 2. Phase 2 is where pilots die.

3. Exit criteria must be defined before the pilot starts. Most pilots have entry criteria — what triggers go-live — but no documented exit criteria for what triggers the production decision. Without them, "promising results" become a permanent state. Exit criteria should specify the threshold metric (in production units), the volume sample required to reach statistical significance, the latest acceptable latency at p95, the maximum acceptable hallucination rate per 1,000 calls, and the named approver for the production go-decision. A minimal sketch of such a gate follows this list.

4. The operating model owner must exist before the pilot, not after. Who runs this once it is live? Whose P&L absorbs the cost? Who owns the QA loop, the prompt updates, the version control? If those answers do not exist on day one, the pilot is an orphan project the moment its sponsor moves on. We cover the operating-model dimension in detail in our voice AI orchestration vs platform breakdown — both architectures fail when nobody owns them.
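
To make decision 3 concrete, here is a minimal sketch of what a documented exit gate can look like in code. Everything in it is illustrative: the class and field names are our assumptions for this post, not a Dilr Voice API, and the thresholds are placeholders a programme would set with its own CFO and compliance owner. What it demonstrates is structural: every gate is a number, the sample size is explicit, and a failed gate names what blocked it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExitCriteria:
    # Thresholds signed off before the pilot starts; all values are set per programme.
    metric_threshold: float            # pass level for the production KPI, e.g. GBP per resolved interaction
    min_calls: int                     # sample size required for a statistically meaningful read
    max_p95_latency_ms: int            # latest acceptable response latency at the 95th percentile
    max_hallucinations_per_1k: float   # defect budget per 1,000 calls
    approver: str                      # named owner of the production go-decision

@dataclass(frozen=True)
class PilotResults:
    total_cost_gbp: float
    resolved_interactions: int
    calls_handled: int
    p95_latency_ms: int
    hallucination_count: int

def production_go(c: ExitCriteria, r: PilotResults) -> tuple[bool, list[str]]:
    """Evaluate the exit gate; returns (go, failed_gates) so a 'no' is always explained."""
    failed: list[str] = []
    if r.calls_handled < c.min_calls:
        failed.append(f"sample: {r.calls_handled} calls < {c.min_calls} required")
    cost_per_resolution = r.total_cost_gbp / max(r.resolved_interactions, 1)
    if cost_per_resolution > c.metric_threshold:
        failed.append(f"cost per resolution: {cost_per_resolution:.2f} > {c.metric_threshold:.2f}")
    if r.p95_latency_ms > c.max_p95_latency_ms:
        failed.append(f"p95 latency: {r.p95_latency_ms}ms > {c.max_p95_latency_ms}ms")
    per_1k = 1000 * r.hallucination_count / max(r.calls_handled, 1)
    if per_1k > c.max_hallucinations_per_1k:
        failed.append(f"hallucinations/1k calls: {per_1k:.1f} > {c.max_hallucinations_per_1k}")
    return (not failed, failed)
```

If the gate passes, the named approver makes the production call on a calendar date; if it fails, the failed-gate list is the re-scoping agenda rather than a reason to extend the pilot.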

The decision tree that separates pilots that scale from pilots that don't

The structural fixes above are not abstract. They translate into a single pre-pilot decision tree. If a programme cannot answer "yes" at every node, it is not yet ready to start the pilot — it is ready to start the diagnostic that precedes the pilot.
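
As a sketch of that tree (the gate names and routing messages here are ours, written for this post rather than taken from a formal DILR.AI artefact), the four decisions reduce to ordered yes/no gates, where the first "no" ends the walk and names the work that precedes any pilot:

```python
def pre_pilot_readiness(answers: dict[str, bool]) -> str:
    """Walk the four pre-pilot gates in order; any 'no' routes to a diagnostic, not a pilot."""
    gates = [
        ("success_metric_is_production_kpi",
         "redesign the success metric as the production KPI"),
        ("phase_1_covers_blocking_integrations",
         "re-scope Phase 1 to include WFM, telephony and the compliance archive"),
        ("exit_criteria_documented_and_signed",
         "document exit thresholds and the named approver"),
        ("operating_owner_named_with_budget",
         "name the line-of-business owner whose P&L absorbs the run cost"),
    ]
    for gate, remedy in gates:
        if not answers.get(gate, False):
            return f"Not pilot-ready. Start with a diagnostic; first blocker: {remedy}."
    return "Pilot-ready: run a time-boxed pilot against the documented exit gate."

# Example: the metric is right, but integration scope stops at the CRM.
print(pre_pilot_readiness({"success_metric_is_production_kpi": True}))
# Not pilot-ready. Start with a diagnostic; first blocker: re-scope Phase 1 to
# include WFM, telephony and the compliance archive.
```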

This sequencing aligns with our wider DATS approach to placing AI inside enterprise systems: diagnose before building, place before scaling, govern before optimising.

Pilot purgatory vs. production-ready: a direct comparison

Most enterprises cannot tell from the inside whether their pilot is structurally sound or structurally doomed. The differences are visible from the outside in five concrete dimensions:

| Dimension | Pilot purgatory pattern | Production-ready pattern | Cost of getting this wrong |
| --- | --- | --- | --- |
| Success metric | "AI accuracy" or "containment rate", measured by IT | Cost per resolved interaction or booked meeting, owned by the business line P&L | Pilot passes, never deploys — typically £150k–£400k written off |
| Integration scope | CRM only; WFM, telephony, compliance archive deferred to Phase 2 | All blocking integrations in Phase 1, even if some are read-only | 6–9 months of re-scoping; usually triggers a vendor reset |
| Exit criteria | "Promising results" — open-ended | Documented thresholds: metric, volume, p95 latency, hallucination rate, named approver | Pilot becomes a permanent state; sponsor turnover kills it |
| Operating ownership | Unowned, run by the vendor or a temporary task force | Named line-of-business owner with budget absorption from day one | Programme orphaned within 12 months of go-live |
| Compliance involvement | Reviewed at the end | Embedded from scoping, with ICO/EU AI Act obligations mapped | Re-architecture required for production; pilot data unusable |

For the cost-side argument that makes the business case for fixing these gaps before they compound, see our AI voice cost per call analysis — the unit economics shift dramatically once the structural issues above are resolved.

What changes when these four decisions are made well

When all four pre-pilot decisions are made explicitly, the pilot's role changes. It is no longer a proof-of-concept hoping to graduate. It is a controlled test of a production system that has already been designed for scale. The exit gate becomes a calendar date, not a hope. The integration backlog is half the size because Phase 1 already covered the blockers. The line-of-business owner is already running the workforce-planning conversation.

LangChain's State of AI Agents 2026 finds that 57% of respondents have agents in production — but quality (33%) and latency (20%) remain the dominant blockers. Both of those blockers are downstream of the four pre-pilot decisions. A pilot whose success metric is the production metric has already optimised for quality at the dimension that matters. A pilot whose Phase 1 integration includes telephony and compliance archive has already optimised for latency on the production stack, not a sandbox.

If your enterprise voice AI programme is stuck in evaluation or stalled mid-pilot, the fastest unlock is a structured intervention — not another vendor demo. Try Dilr Voice against a real workflow, review your AI operating model, or read about our approach to placing AI where the P&L actually moves.

The contrarian point worth ending on is this: vendors have an interest in long pilots, because pilots generate consulting revenue without committing to production unit economics. Buyers have an interest in short, structurally sound pilots that either graduate or die. The market is shifting in the buyer's favour. The 2026 boardroom is no longer paying for indefinite optionality — it is paying for the production line that pays back. The four decisions above are the price of admission.

The architecture that prevents pilot purgatory is the same architecture that makes scale possible. The four decisions are the foundation; everything else is execution.

Related: AI Execution Office (service) · AI Operating Model (service) · Dilr Voice (product)

Sources cited: McKinsey — The State of AI 2025 (November 2025); Stanford — AI Index 2026; ServiceNow — Enterprise AI Maturity Index 2026; BCG — The Widening AI Value Gap; LangChain — State of AI Agents 2026; Solix — The Bill Comes Due (April 2026).

Talk to the operators

Move your voice AI programme out of pilot purgatory.

A 30-minute call. No deck. We diagnose which of the four pre-pilot decisions is blocking your programme — and the shortest credible route to production economics.

Written by the Dilr.ai engineering team — practitioners who ship enterprise AI in production. Follow us on LinkedIn for shipping notes, or subscribe via the RSS feed.

Tags: AI voice pilot purgatory enterprise · enterprise voice AI pilot to production · AI pilot scaling barriers · voice AI exit criteria · enterprise AI strategy · AI voice operating model · 2026 enterprise AI deployment

Related articles

AI voice utilities: cut call costs without breaking Ofgem

One email, once a month. No hype. Just what we learned shipping.