LLM vs scripted voice agents: enterprise guide

Enterprise voice AI demos in 2026 almost always look the same. A vendor connects you to a slick LLM-driven agent that handles tangents, jokes about the weather, and recovers gracefully from interruptions. You leave the call convinced you have seen the future. Then you ask the procurement question: what happens when this agent calls a customer whose mortgage payment is forty-three days overdue, or processes a claim on a regulated insurance policy, or schedules a hospital appointment for a vulnerable patient?

The vendors who answer "the LLM handles it" tend not to make it past Stage 2 in regulated procurement. The vendors who answer "scripted flow handles those branches, the LLM handles the conversation around them" tend to win the contract.

This is the architecture decision that sits underneath every enterprise voice AI evaluation. It is rarely framed honestly. Pure LLM camps oversell flexibility. Pure scripted camps oversell predictability. The production answer in 2026 is almost always a hybrid, but the shape of the hybrid is what separates a pilot that scales from one that stalls. If you want the full picture across architecture, latency, compliance and procurement, start with our enterprise AI voice agents guide — this post drills into one decision inside it.

This guide is shipped by the team behind Dilr Voice — enterprise voice AI deployed in regulated UK contact centres without an engineering team. Or see DATS, our senior-led methodology for placing AI inside enterprise systems.

Key takeaway

Pure LLM voice agents demo brilliantly and break unpredictably. Pure scripted flows survive audits but lose customers at the first off-script question. The enterprises winning in production use scripted flows for the regulated, fixed-outcome branches and LLM reasoning for the spaces in between — with a hallucination gate that refuses to answer rather than guess.

The data backs the hybrid view. Across 1,200 test calls benchmarked by independent reviewers in early 2026, Vapi (LLM orchestrator with guardrails) measured 500–600ms latency, Retell AI averaged 580–620ms, Bland AI (path-constrained) averaged about 800ms, and PolyAI (heavy pre-trained vertical scripts) sat between 700 and 900ms. The fastest stacks were neither pure LLM nor pure scripted: they were LLM reasoning constrained by scripted skeletons.

340%: YoY growth in production voice agents (2025)
6%: enterprises capturing material EBIT (McKinsey 2025)
40%: Global 2000 collaborating with AI agents by 2026
2.5×: EBIT lift for AI-mature enterprises (BCG 2025)

What each architecture actually is

The terms get used loosely. Before going further, pin them down.

Scripted flow builders

A scripted voice agent moves through a state machine. Each state has a finite set of valid inputs, a deterministic next state, and a fixed utterance to play. Some scripted builders use an LLM only to paraphrase the utterance in context — but the path through the conversation is determined by code or a visual flow editor. Bland AI's conversational pathways and PolyAI's industry-trained assistants both lean heavily on this pattern. Off-path inputs trigger a fallback or escalation. There is no inference about what the customer meant; only matching against allowed branches.

The strengths are obvious. Auditability is high because every path is enumerable. Per-turn latency is predictable because there is no open-ended reasoning step. Compliance teams can read the conversation map and sign it off. Where utterances are fixed rather than LLM-paraphrased, hallucinations are structurally impossible because nothing is generated outside the allowed utterance set. For the FCA governance work we covered separately, scripted flows clear the bar with the least friction.

The weakness is also obvious. Real customers do not respect the script. They ask three questions in one breath, change topic mid-sentence, and use natural language the flow builder never anticipated. Coverage gaps become customer experience holes.
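A minimal sketch of the state-machine pattern described above may make both the strength and the weakness concrete. Everything in it (state names, utterances, allowed inputs, the `escalate` fallback) is an illustrative assumption, not any vendor's actual flow format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    utterance: str     # fixed text the agent plays in this state
    transitions: dict  # allowed customer input -> next state name


# Hypothetical flow: names, utterances and inputs are illustrative only.
FLOW = {
    "greet": State("Is this about a payment or an appointment?",
                   {"payment": "payment", "appointment": "appointment"}),
    "payment": State("I can take a card payment now. Shall I continue?",
                     {"yes": "done", "no": "escalate"}),
    "appointment": State("The next slot is Tuesday at 10am. Shall I book it?",
                         {"yes": "done", "no": "escalate"}),
    "done": State("Thank you. Goodbye.", {}),
    "escalate": State("Let me transfer you to a colleague.", {}),
}


def step(state_name: str, customer_input: str) -> str:
    """Match the input against the allowed branches; anything else escalates."""
    allowed = FLOW[state_name].transitions
    return allowed.get(customer_input.strip().lower(), "escalate")


def run(inputs: list, start: str = "greet") -> list:
    """Replay a call and return the sequence of states visited."""
    path = [start]
    for text in inputs:
        if not FLOW[path[-1]].transitions:  # terminal state reached
            break
        path.append(step(path[-1], text))
    return path


def enumerate_paths(start: str = "greet") -> list:
    """List every possible path, including the implicit fallback to 'escalate'.

    Works because this example flow has no cycles.
    """
    if not FLOW[start].transitions:
        return [[start]]
    targets = sorted(set(FLOW[start].transitions.values()) | {"escalate"})
    return [[start] + rest
            for target in targets
            for rest in enumerate_paths(target)]
```

Calling `enumerate_paths()` returns the complete list of paths a compliance reviewer would need to read, which is the auditability argument in miniature. Calling `run(["What about my refund?"])` shows the coverage gap: the input matches no allowed branch, so the call goes straight to escalation.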

LLM-driven agents

An LLM-driven voice agent uses a large language model to generate the next response — sometimes the next action — at every turn. The agent has tools (lookup customer, schedule appointment, transfer to human) and the model decides which to call. There is no enumerable state machine; behaviour is shaped by system prompts, RAG, and tool definitions. Vapi and Retell sit closer to this end of the spectrum, with bolted-on guardrails to constrain drift.

The strengths are coverage and naturalness. The agent handles topic switches, multi-intent utterances, and unexpected phrasing without falling back. The weakness is that LLMs do not know when they are wrong. They will confidently answer questions the system never sanctioned them to answer. Hallucination is not a bug — it is a property of how the model generates tokens. In regulated industries, that is a procurement gate, which is why we treat it as one in our voice AI hallucination procurement post.
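The tool-calling loop described above can be sketched roughly as follows. The model is replaced by a keyword-matching stub (`stub_model`) so the example runs without any external service; the tool names mirror the ones mentioned in the text, but their signatures and return values are assumptions for illustration.

```python
def lookup_customer(customer_id: str) -> dict:
    # Hypothetical tool: a real one would query a CRM.
    return {"customer_id": customer_id, "status": "active"}


def schedule_appointment(slot: str) -> dict:
    # Hypothetical tool: a real one would call a booking system.
    return {"slot": slot, "booked": True}


def transfer_to_human(reason: str) -> dict:
    # Hypothetical tool: a real one would trigger a telephony transfer.
    return {"transferred": True, "reason": reason}


TOOLS = {
    "lookup_customer": lookup_customer,
    "schedule_appointment": schedule_appointment,
    "transfer_to_human": transfer_to_human,
}


def stub_model(utterance: str) -> dict:
    """Stand-in for an LLM call: returns the kind of decision a model might emit."""
    text = utterance.lower()
    if "appointment" in text:
        return {"tool": "schedule_appointment", "args": {"slot": "next available"}}
    if "account" in text:
        return {"tool": "lookup_customer", "args": {"customer_id": "demo-001"}}
    # No tool chosen: the model answers in free text. Nothing here constrains
    # what it says, which is where unsanctioned answers can enter.
    return {"tool": None, "say": "free-form reply generated by the model"}


def handle_turn(utterance: str, model=stub_model) -> dict:
    """One conversational turn: the model decides, the runtime executes."""
    decision = model(utterance)
    name = decision.get("tool")
    if name is None:
        return {"type": "speech", "text": decision["say"]}
    if name not in TOOLS:
        # The model asked for a tool that does not exist: hand over.
        return {"type": "tool_result", "tool": "transfer_to_human",
                "result": transfer_to_human("unknown tool requested")}
    return {"type": "tool_result", "tool": name,
            "result": TOOLS[name](**decision["args"])}
```

Note that there is no table of states to enumerate here. The only fixed elements are the tool definitions, and the free-text branch is unconstrained by design.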

The same architecture choice underpins how teams shape the wider operating model — we cover that across our DATS methodology, and specifically inside an AI operating model consulting engagement where the LLM-vs-scripted split is mapped to RACI, audit, and quality scoring layers before anything ships. For pre-procurement reviews, the same logic feeds our AI placement diagnostic, which scopes the architecture before any commitment.

The hybrid the production teams actually ship

The interesting answer is what wins in regulated UK production: a scripted skeleton with LLM reasoning inside the joints. The script enforces the audit-critical path. The LLM handles the conversational fluency around it. A separate guardrail layer refuses to answer rather than guess when intent falls outside the sanctioned set.

Here is the decision tree we walk clients through during an AI execution office engagement when they are stuck between architectures.

The pattern looks obvious once you see it, but most vendors do not ship it cleanly. They ship "LLM with prompt guardrails" (still hallucinates under load) or "scripted with LLM paraphrasing" (still gets stuck on natural utterances). The discipline is in where you draw the line, and what happens when the LLM is uncertain. If you want to talk through your specific call patterns, book a call with the operators — we run this exercise inside every diagnostic.
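One way to picture where the line gets drawn is a small routing function that sits in front of both the script and the LLM. This is a sketch under stated assumptions: the intent names, the two intent sets, and the 0.75 confidence floor are all hypothetical, and a real deployment would take intent and confidence from its own classifier.

```python
from dataclasses import dataclass

# Assumed intent sets and threshold, for illustration only.
REGULATED_INTENTS = {"arrears_payment", "insurance_claim"}
SANCTIONED_LLM_INTENTS = {"opening_hours", "reschedule", "small_talk"}
CONFIDENCE_FLOOR = 0.75


@dataclass(frozen=True)
class Route:
    handler: str  # "script", "llm" or "refuse"
    reason: str   # recorded for the audit trail


def route(intent: str, confidence: float) -> Route:
    """Decide who handles the turn; uncertainty always resolves to refusal."""
    if confidence < CONFIDENCE_FLOOR:
        return Route("refuse", "intent confidence below floor")
    if intent in REGULATED_INTENTS:
        return Route("script", "regulated fixed-outcome path")
    if intent in SANCTIONED_LLM_INTENTS:
        return Route("llm", "sanctioned conversational intent")
    return Route("refuse", "intent outside sanctioned set")


def handle(intent: str, confidence: float, audit_log: list) -> str:
    """Route one turn and log the decision with its reason."""
    decision = route(intent, confidence)
    audit_log.append({
        "intent": intent,
        "confidence": confidence,
        "handler": decision.handler,
        "reason": decision.reason,
    })
    return decision.handler
```

The design choice worth noticing is the ordering: the confidence check runs first, so even a regulated intent is refused and handed over when the classifier is unsure, rather than being forced down a scripted path on a guess.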

Dimension	Pure scripted	Pure LLM	Hybrid (production)
Hallucination risk	None — no generation outside script	High — model decides what to say	Contained — LLM only fills sanctioned slots
Off-script coverage	Poor — falls back to escalation	Excellent — handles novel phrasing	Excellent — LLM reasons within scripted skeleton
Audit trail	Fully enumerable paths	Probabilistic, hard to reproduce	Enumerable critical path + logged LLM turns
Latency (typical)	600–900ms (PolyAI/Bland band)	500–700ms (Vapi/Retell band)	500–650ms with caching
Build effort	High — exhaustive path mapping	Low — prompt + tools	Medium — script the regulated 30%
Compliance fit (FCA, ICO, EU AI Act)	Strongest	Weakest without retrofit	Strongest when designed in
Failure mode	Customer hits a wall	Customer hears a confident lie	Agent refuses and hands over

The hybrid wins for the same reason regulated workflows generally win: it makes the dangerous parts deterministic and the conversational parts flexible. The same logic shows up in the enterprise voice AI vendor checklist — buyers who ask about the hybrid line item separate serious vendors from demo-driven ones.

There is a UK-specific point worth flagging. Stanford's AI Index 2026 notes that fewer than 10% of enterprises have fully scaled AI in any function, and McKinsey's State of AI 2025 puts the share of enterprises capturing material EBIT impact at only 6%. The architecture choice partly explains that gap. Pure LLM pilots stall at compliance review. Pure scripted pilots stall at customer experience. The hybrid is what gets into production, and into the P&L. The wider economics around that transition sit in our voice AI TCO post, which makes the cost line items explicit.

How to procure the right shape of hybrid

Three procurement questions separate vendors who understand this architecture from vendors who do not.

One — show the refusal path. Ask the vendor to demonstrate what the agent does when a customer asks a question outside its sanctioned scope. If the answer is "the LLM handles it gracefully", they are selling you a hallucination engine with marketing on top. The right answer is "the agent declines, logs, and routes to a human within X seconds, with a recorded reason." Hallucination must be a hard gate, not a softened risk.

Two — show the audit map. Ask for the enumerable list of regulated paths and the LLM-decisioned spaces between them. A serious vendor will pull up the conversation map and walk you through which states are scripted and which are LLM-handled. A weaker vendor will hand-wave with "we have guardrails."

Three — show the latency budget under load. LLM reasoning is expensive in tokens and time. Under concurrency, latency curves diverge sharply. Ask for p95 latency at the call volume you intend to deploy, not the demo number. The difference between 600ms and 1.2s is the difference between conversation and discomfort. Our voice AI procurement framework drills into the May 2026 vendor map if you want a wider read.
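For the third question, it helps to be precise about what "p95 at volume" means when a vendor hands over raw numbers. The sketch below computes a nearest-rank percentile from per-turn latency samples and checks it against a budget. The sample data and the budget value are made up for illustration; only the 600ms and 1.2s reference points come from the text.

```python
from statistics import mean


def percentile(samples_ms: list, pct: int = 95) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[max(rank, 1) - 1]


def within_budget(samples_ms: list, budget_ms: float, pct: int = 95) -> bool:
    """True if the chosen percentile stays inside the latency budget."""
    return percentile(samples_ms, pct) <= budget_ms


# Hypothetical load test: 90% of turns at 500ms, 10% at 1200ms.
load_test = [500] * 90 + [1200] * 10
average_ms = mean(load_test)
p95_ms = percentile(load_test)
```

On this invented sample the average is 570ms, which looks like a demo number, while the p95 is 1200ms, so `within_budget(load_test, 800)` is false. That gap between mean and tail is the reason to ask for the percentile rather than the average.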


A note on the contrarian read. Some buyers want a pure scripted stack because procurement is easier. They are wrong on the customer experience side but right on the audit side. The fastest way through this trade-off is to script the regulated 30% of paths exhaustively, let the LLM reason over the other 70% with a refusal gate, and measure with AI voice quality scoring on every call. The pure-scripted buyer ends up with a hybrid anyway — they just refused to call it one. The same conclusion sits behind why voice AI orchestration vs platform is rarely a binary in production.


External corroboration: the McKinsey "State of AI 2025" report confirms the 6% EBIT-impact figure and the saturation gap; the Stanford "AI Index 2026" backs the under-10% scaled-in-function reading. Both point to architecture choice as one of the load-bearing reasons why pilots stall.

Talk to the operators

Pick the architecture that survives procurement.

30-min scoping call · No deck · Confidential. We'll walk through your call patterns and tell you where to draw the LLM-vs-scripted line — and what to refuse rather than guess.


Written by the Dilr.ai engineering team — practitioners who ship enterprise AI in production. Follow us on LinkedIn for shipping notes, or subscribe via the RSS feed.
