Voice AI Conversation Design: Scripting That Converts

You can have a technically perfect voice agent that still fails on the phone. The transcription is accurate, the language model reasons well, the integrations fire on time — and the caller still hangs up confused, or repeats themselves four times, or asks for a human before you have captured a single useful field. The model was never the problem. The conversation was.

Conversation design is the discipline that sits between a working model and a call that actually completes. It is not prompt engineering, and it is not the old IVR decision tree dressed up in a large language model. It is the turn-by-turn craft of how an agent opens, confirms, asks, recovers, and closes — the sequence of moves that decides whether the person on the other end of the line gets what they called for. Get it right and a deployment lifts containment, shortens handle time, and produces clean structured data. Get it wrong and you ship an agent that demos beautifully and converts terribly.

This guide sets out the enterprise framework we use to design conversations that convert: the five moments of a call, the patterns that hold each one together, and the governance that keeps a deployed script from drifting. It is written for the people who own the outcome — operations leaders, CX heads, and the teams running enterprise AI voice agents in production, not in a sandbox.

This guide is shipped by the team behind Dilr Voice — enterprise voice AI live in 40+ countries. Or see DATS, our five-stage AI consulting system for placing AI where the P&L actually moves.

What conversation design actually is — and what it is not

The fastest way to understand conversation design is to separate it from the two things it gets confused with.

It is not prompt engineering. Prompt engineering is the system-prompt craft — the persona, the constraints, the tool definitions, the guardrails that shape how the model behaves across every turn. It is upstream and global. We have written separately about treating the voice AI prompt as a production asset, and that work matters. But a flawless system prompt does not tell you what the agent should say in the first three seconds, how it confirms a postcode without sounding robotic, or what it does when the caller says "no, the other one." That is conversation design, and it is local, sequential, and turn-by-turn.

It is not an IVR flow. The classic interactive voice response tree was a rigid graph: press 1, press 2, dead-ends everywhere. Modern voice agents can hold an open conversation, which tempts teams to throw away structure entirely and "let the model handle it." That is the opposite error. The architecture choice between LLM-driven and scripted agents is real, but conversation design is not about picking a side — it is the layer that gives a flexible model a spine: a defined set of states it must move through, with freedom inside each state and discipline about the transitions between them.

The one-line definition

Conversation design is the deliberate shaping of an agent's turns — its openings, confirmations, asks, repairs, and closings — so that a call reliably reaches a defined outcome. Prompt engineering decides who the agent is. Conversation design decides what the call does.

The distinction matters commercially because the two failure modes look completely different in your analytics. A prompt problem shows up as the agent being off-brand, unsafe, or inconsistent across calls. A conversation-design problem shows up as a specific, repeatable cliff: callers drop at the same turn, the same field never gets captured, the same misunderstanding recurs. If your containment rate is good in testing and poor in production, the gap is almost always conversation design, not the model.

The five moments of a converting call

Every successful voice call — human or AI — moves through five moments. Most teams design exactly one of them (usually the middle: the task) and let the model improvise the rest. The agents that convert are the ones where all five are designed on purpose.

The five are the opening, confirmation, the task, repair, and the closing. Each moment has a job, a failure mode, and a set of patterns that make it reliable. The rest of this guide walks through them in order, because order is the point. A call that nails the task but fumbles the opening loses the caller before the task begins. A call that completes the task but botches the closing leaves the customer unsure whether anything happened at all.
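One way to picture "freedom inside each state and discipline about the transitions" is a small state machine. The sketch below is illustrative only: the `Moment` names mirror the five moments in this guide, but the `ALLOWED` transition table and the `CallState` class are assumptions made for the example, not a prescribed implementation.

```python
from enum import Enum, auto


class Moment(Enum):
    OPENING = auto()
    CONFIRMATION = auto()
    TASK = auto()
    REPAIR = auto()
    CLOSING = auto()


# Hypothetical transition table: which moments may follow which.
# Repair can be entered from any live moment and returns to any of them.
ALLOWED = {
    Moment.OPENING: {Moment.TASK, Moment.REPAIR, Moment.CLOSING},
    Moment.TASK: {Moment.CONFIRMATION, Moment.REPAIR, Moment.CLOSING},
    Moment.CONFIRMATION: {Moment.TASK, Moment.REPAIR, Moment.CLOSING},
    Moment.REPAIR: {Moment.OPENING, Moment.TASK, Moment.CONFIRMATION, Moment.CLOSING},
    Moment.CLOSING: set(),  # nothing follows the closing
}


class CallState:
    """Tracks which moment a call is in and rejects undesigned jumps."""

    def __init__(self) -> None:
        self.current = Moment.OPENING
        self.history = [Moment.OPENING]

    def advance(self, target: Moment) -> None:
        if target not in ALLOWED[self.current]:
            raise ValueError(f"{self.current.name} -> {target.name} is not a designed transition")
        self.current = target
        self.history.append(target)


call = CallState()
call.advance(Moment.TASK)
call.advance(Moment.CONFIRMATION)
call.advance(Moment.CLOSING)
```

What the model says inside each moment stays flexible; the table only constrains which moment can come next.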

Moment one: the opening — the first eight seconds decide the call

The opening is the highest-leverage eight seconds in the entire interaction, and it is the moment teams most often leave to chance. Four things have to happen, fast, and in the right order: the agent must disclose that it is an AI, establish identity (who is calling and on whose behalf), state purpose (why), and hand the caller control (what happens next, and how they can steer).

Disclosure is increasingly a legal requirement rather than a courtesy. Under the EU AI Act's Article 50 transparency rules, a person interacting with an AI system must be told so no later than the first interaction, unless it is already obvious from context. That means your conversation design has to carry the disclosure as a designed line, not a buried disclaimer. The craft is doing it without killing momentum. "Hi, this is an AI assistant calling from [Brand] about your recent order. Is now a good moment?" discloses, identifies, states purpose, and offers control in one breath.
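As a rough sketch, the four opening elements can be treated as required fields so that an opening cannot ship with one missing. The `Opening` class and its field names are hypothetical, chosen only to mirror the four jobs described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Opening:
    disclosure: str  # the agent says it is an AI
    identity: str    # who is calling, and on whose behalf
    purpose: str     # why the call is happening
    control: str     # hands the caller the next move

    def render(self) -> str:
        # Refuse to render an opening that skips any of the four elements.
        for name in ("disclosure", "identity", "purpose", "control"):
            if not getattr(self, name).strip():
                raise ValueError(f"opening is missing its {name} element")
        return f"{self.disclosure} {self.identity} {self.purpose}. {self.control}"


opening = Opening(
    disclosure="Hi, this is an AI assistant",
    identity="calling from Acme",  # "Acme" is a placeholder brand
    purpose="about your recent order",
    control="Is now a good moment?",
)
line = opening.render()
```

Here `line` is a single sentence pair that discloses, identifies, states purpose, and offers control, in that order.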

Opening anti-patterns we see in production

The cold open. Jumping straight into a question before the caller knows who is speaking. Recognition accuracy on the first answer collapses because the caller is still orienting.
The monologue. A 25-second scripted preamble before the caller can say anything. Inbound callers in particular abandon — they called with a goal, not to be read to.
The buried disclosure. Disclosing the agent is AI only when asked, or at the end. A compliance risk and a trust risk in one.
No control handle. Never telling the caller they can interrupt, ask for a human, or steer. Callers who do not know they can interrupt will fight the agent.

The opening is also where you set the interaction contract — implicitly teaching the caller how this conversation works. If the agent pauses naturally and handles being interrupted gracefully, the caller relaxes into a normal conversation. This is why barge-in handling is a conversation-design concern, not just a latency spec: the opening is where the caller learns, in the first turn, whether interrupting this agent works or breaks it.

Moment two: confirmation — grounding the facts without sounding robotic

Confirmation is how the agent verifies it heard correctly before acting. It is the difference between a clean structured record and a costly downstream error — a wrong appointment, a misrouted claim, a payment against the wrong account. And it is the moment where naive design produces the most robotic-sounding agents, because the obvious approach ("You said 14 Maple Street, is that correct? You said the 3rd of July, is that correct?") turns every call into an interrogation.

Good confirmation design is graded: the level of explicit confirmation should match the cost of getting it wrong.

| Confirmation level | When to use it | Example |
|---|---|---|
| **Implicit** | Low-stakes facts, easily corrected later | "Great, booking you in for Tuesday, and which time works?" (Tuesday is confirmed by being used, not by a yes/no) |
| **Explicit (light)** | Medium-stakes, single field | "That's the 3rd of July, got it." (states back, invites correction, does not demand a yes) |
| **Explicit (hard)** | High-stakes, irreversible, financial, or identity | "Let me read that back: account ending 4471, paying £240 today. Shall I go ahead?" (requires an active yes) |

The skill is reserving hard confirmation for the moments that genuinely need it and letting implicit confirmation carry the rest. An agent that hard-confirms everything feels like filling in a form by voice; an agent that confirms nothing produces a clean-sounding call and a dirty database. The quality of the data your agent writes back is itself a metric — it is part of how we argue you should evaluate voice AI accuracy beyond word error rate, and confirmation design is the lever that moves it.
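The grading rule can be sketched as a small function that picks a confirmation level from the cost of an error. The three flags and the function names are assumptions for illustration; a real deployment would derive them from its own field definitions.

```python
from enum import Enum


class Level(Enum):
    IMPLICIT = "implicit"
    LIGHT = "explicit_light"
    HARD = "explicit_hard"


def confirmation_level(
    *,
    irreversible: bool = False,
    financial_or_identity: bool = False,
    easily_corrected: bool = False,
) -> Level:
    """Match the confirmation level to the cost of getting the field wrong."""
    if irreversible or financial_or_identity:
        return Level.HARD      # read back and wait for an active yes
    if easily_corrected:
        return Level.IMPLICIT  # confirm by using the value in the next turn
    return Level.LIGHT         # state back, invite correction, move on


def requires_active_yes(level: Level) -> bool:
    return level is Level.HARD


payment = confirmation_level(irreversible=True, financial_or_identity=True)
weekday = confirmation_level(easily_corrected=True)
date_field = confirmation_level()
```

In this sketch a payment resolves to hard confirmation, a weekday to implicit, and an unflagged single field to light.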

Confirmation also depends on having a reliable record of what was actually said. This is where the real-time transcription data layer becomes a design input: the agent's confirmations should reference what the transcription captured, and your post-call review should be able to replay any confirmation that went wrong.

Moment three: the task — the reason for the call

The task is the part most teams do design, so we will be brief on what it is and longer on where it goes wrong. The task is the core transaction: book the appointment, take the payment, qualify the lead, capture the claim, answer the question. Conversation design for the task is mostly about sequencing and slot-filling discipline — asking for one thing at a time, in an order that matches how the caller thinks, and not asking for information you can already infer or retrieve.

The dominant failure mode is the over-stuffed turn: asking for three pieces of information in one breath ("Can I take your full name, date of birth, and policy number?"). Callers answer one, forget the others, and the agent either re-asks (annoying) or proceeds with gaps (dangerous). One slot per turn is the safe default; combine only when the fields are naturally spoken together (a first and last name).

The second failure mode is asking for what you should already know. If the call is outbound to a known customer, the agent should not ask for the account number it dialled. Conversation design here is inseparable from your tool and data architecture — the agent's turns should be shaped by what it can retrieve through tool calls so it asks the human only for what the systems genuinely cannot supply. The best task design feels short because the agent did its homework before it spoke.
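A minimal slot-filling sketch, assuming a hypothetical `known` dictionary that has been pre-filled from tool lookups before the agent speaks. It asks for one slot per turn and skips anything the systems already supplied. The slot names are placeholders.

```python
from typing import Dict, List, Optional

# Hypothetical slot order, arranged the way a caller would think about a booking.
REQUIRED_SLOTS: List[str] = ["account_number", "appointment_date", "appointment_time"]


def next_slot(required: List[str], known: Dict[str, Optional[str]]) -> Optional[str]:
    """Return the single next slot to ask for, or None when nothing is missing."""
    for slot in required:
        if known.get(slot) is None:
            return slot
    return None


# On an outbound call to a known customer, the account number comes from the
# dialler or CRM (assumed here), so the agent never asks for it.
known = {"account_number": "4471"}
first_ask = next_slot(REQUIRED_SLOTS, known)

known["appointment_date"] = "Tuesday"
second_ask = next_slot(REQUIRED_SLOTS, known)
```

Because the function returns at most one slot, the over-stuffed turn cannot happen by construction.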

1 per turn: slots requested per turn, the safe default

3 tries: repair attempts before a designed escalation

5 moments: designed states every converting call moves through

Moment four: repair — the discipline that separates a demo from production

Repair is what the agent does when something goes wrong: the caller said nothing, the agent misheard, the caller went off-script, two people talked at once, or the request is something the agent cannot do. Repair is where demos and production diverge most violently, because demos are run by people who know how to talk to the agent and production is run by everyone else.

There are five repair situations worth designing explicitly, each with its own pattern:

| Situation | What happened | Design pattern |
|---|---|---|
| **No input** | Caller said nothing (silence, thinking, distracted) | Re-prompt once, more specifically; then offer a concrete option. Never repeat the identical sentence twice. |
| **No match** | Agent could not parse the answer | Acknowledge, narrow the question, give an example of a valid answer. |
| **Misrecognition** | Agent heard the wrong thing and acted on it | Make correction frictionless ("Sorry, let me fix that"); never make the caller fight to undo. |
| **Off-script** | Caller asked something outside the task | Answer briefly if you can, then bridge back: "Happy to note that. First, let's finish the booking so you don't lose it." |
| **Out of scope** | Request the agent genuinely cannot handle | Escalate cleanly, with context, before the caller has to demand it. |

Two rules carry most of the value. First, never repeat yourself identically. A human who is not understood rephrases; an agent that says the exact same sentence twice signals to the caller that no one is home, and the call dies. Each repair attempt should escalate in specificity. Second, escalate before frustration, not after. The single most damaging pattern in production voice AI is the agent that loops — three, four, five failed attempts — until the caller is furious, then transfers to a human who inherits an angry customer and no context. The fix is a designed repair budget (commonly three attempts) and a clean escalation and human handover that passes the full conversation context across, so the human starts informed.
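Both rules can be sketched together: a repair budget equal to the number of prepared prompts, prompts that must all differ, and an escalation that carries what the caller said. `RepairTracker` and the shape of the returned dictionary are assumptions for illustration, not a specific platform's API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RepairTracker:
    prompts: List[str]  # ordered from least to most specific
    attempts: int = 0
    heard: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Rule one: never repeat yourself identically.
        if len(set(self.prompts)) != len(self.prompts):
            raise ValueError("repair prompts must all differ")

    @property
    def budget(self) -> int:
        return len(self.prompts)

    def next_move(self, caller_said: str) -> Dict[str, object]:
        self.heard.append(caller_said)
        # Rule two: once the budget is spent, escalate with context.
        if self.attempts >= self.budget:
            return {"action": "escalate", "context": list(self.heard)}
        say = self.prompts[self.attempts]
        self.attempts += 1
        return {"action": "reprompt", "say": say}


tracker = RepairTracker(prompts=[
    "Sorry, what date works for you?",
    "Which day next week suits you, for example Tuesday?",
    "I can do Tuesday at 2pm or Wednesday at 10am. Which do you prefer?",
])
moves = [tracker.next_move(said) for said in ["", "um", "the next one", "what?"]]
```

The first three moves are re-prompts of increasing specificity; the fourth is an escalation whose `context` holds all four caller turns, so the human starts informed.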

Why repair is a conversion metric, not a UX nicety

Repair quality is directly legible in your numbers. Calls that enter a repair loop and never recover are lost containment and, often, lost revenue. Calls that repair gracefully and continue are saved. Because every repair is recorded, repair is also the richest source of design feedback you have — your worst repairs are your next sprint's backlog. Pair repair analysis with call sentiment analysis and the loops that damage CSAT light up immediately.

Moment five: the closing — confirm, commit, and leave no ambiguity

The closing is the moment teams design last and rush most, and it quietly determines whether the call felt successful even when the task technically completed. A good closing does three things: it confirms what just happened ("You're booked for Tuesday at 2pm"), it states the next step ("You'll get a text confirmation in the next few minutes"), and it closes the loop on anything deferred ("And I've noted your question about parking — someone will cover that in the confirmation").
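The three jobs of a closing can be sketched as one function that refuses to produce a recap without both an outcome and a next step. `build_closing` and its parameters are hypothetical names for the example.

```python
from typing import Iterable


def build_closing(outcome: str, next_step: str, deferred: Iterable[str] = ()) -> str:
    """Compose a closing: what happened, what comes next, then deferred items."""
    if not outcome.strip() or not next_step.strip():
        raise ValueError("a closing needs both a recap and a next step")
    return " ".join([outcome, next_step, *deferred])


closing = build_closing(
    "You're booked for Tuesday at 2pm.",
    "You'll get a text confirmation in the next few minutes.",
    deferred=["I've noted your question about parking."],
)
```

Deferred items are optional, but the recap and the next step are not, which is the point of treating the closing as a designed moment.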

The failure mode is the abrupt end: the task completes and the agent simply stops, or thanks the caller and hangs up without a recap. The caller is left wondering whether the booking actually went through, whether they will get a confirmation, whether they need to do anything. That uncertainty drives the most expensive follow-up call type there is — the "did my thing actually happen?" call — which lands right back in your queue and erases the efficiency the agent just created.

A closing should also be where any cross-channel handoff is set up cleanly: if the caller needs a link, a document, or a confirmation, the closing is where the agent commits to sending it and tells the caller to expect it. Designing the closing as a deliberate moment, rather than wherever the task happens to end, is one of the cheapest conversion improvements available.

Designing to an outcome, not to a transcript

Everything above is in service of one principle that separates enterprise conversation design from hobbyist scripting: you design backwards from a measurable outcome, not forwards from a nice-sounding transcript.

Before you write a single line of dialogue, you should be able to state the call's success metric — completed booking, captured claim with all required fields, qualified lead with consent, contained query — and the constraints the conversation must respect (disclosure, consent capture, data minimisation). Every design decision then resolves against that target. Should the agent hard-confirm the postcode? Depends whether a wrong postcode fails the outcome. Should it allow a long off-script tangent? Depends whether the tangent threatens completion. The transcript is the byproduct; the outcome is the design brief.

This is also what makes conversation design governable. When the success metric is explicit, you can A/B two openings and know which converts. You can watch containment rate against an 80% benchmark and trace a dip to the specific turn that introduced it. You can hand the script to QA with a definition of "good" that is measurable rather than aesthetic. Conversation design without a target is decoration; conversation design with a target is an optimisation loop.
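To make "trace a dip to the specific turn" concrete, here is a minimal sketch that computes containment rate and counts where uncontained calls ended. The call-record fields (`contained`, `last_turn`) and the turn labels are assumptions; real call logs will have their own schema.

```python
from collections import Counter
from typing import Dict, List

CONTAINMENT_BENCHMARK = 0.80  # the 80% benchmark mentioned above


def containment_rate(calls: List[Dict]) -> float:
    """Share of calls resolved without a human."""
    if not calls:
        return 0.0
    return sum(1 for call in calls if call["contained"]) / len(calls)


def drop_off_by_turn(calls: List[Dict]) -> Counter:
    """Count the last designed turn reached by each uncontained call."""
    return Counter(call["last_turn"] for call in calls if not call["contained"])


# Hypothetical sample of five calls.
calls = [
    {"contained": True, "last_turn": "closing"},
    {"contained": True, "last_turn": "closing"},
    {"contained": True, "last_turn": "closing"},
    {"contained": False, "last_turn": "confirm_postcode"},
    {"contained": False, "last_turn": "confirm_postcode"},
]
rate = containment_rate(calls)
below_benchmark = rate < CONTAINMENT_BENCHMARK
worst_turn, worst_count = drop_off_by_turn(calls).most_common(1)[0]
```

In this sample the rate falls below the benchmark and every lost call ended at the same turn, which is the "specific, repeatable cliff" a conversation-design problem produces.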

| Outcome you are designing for | Primary metric | Conversation-design priority |
|---|---|---|
| **Inbound containment** | % calls resolved without human | Fast opening, ruthless task focus, clean escalation |
| **Outbound qualification** | Qualified leads with consent captured | Disclosure + consent first, graded confirmation on key fields |
| **Transactional (payment, booking)** | Completed transactions, data accuracy | Hard confirmation on irreversible steps, explicit closing |
| **Servicing / information** | First-contact resolution, CSAT | Off-script tolerance, warm tone, deferred-item closing |

Governing conversation design over time

A conversation design is not a document you write once. It is a living artefact that drifts, decays, and improves — and like any production asset it needs versioning, testing, and a review cadence. Three practices keep it healthy.

Version it like code. Every change to the conversation — a reworded opening, a new repair branch, a tightened confirmation — should be a tracked change with a reason and a date. When a metric moves, you want to know which script change moved it. Treating the conversation as a versioned asset is part of what we mean by running a real AI operating model rather than a perpetual pilot.
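A tracked change can be as simple as a record with a version, a date, the moment it touched, and a reason. The sketch below assumes a hypothetical `ScriptChange` record and a seven-day look-back window; both are illustrative choices, not part of any particular tool.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass(frozen=True)
class ScriptChange:
    version: str
    changed_on: date
    moment: str  # which of the five moments was edited
    reason: str


def changes_before(log: List[ScriptChange], moved_on: date, window_days: int = 7) -> List[ScriptChange]:
    """Return script changes made in the window leading up to a metric move."""
    start = moved_on - timedelta(days=window_days)
    return [change for change in log if start <= change.changed_on <= moved_on]


# Hypothetical change log.
log = [
    ScriptChange("1.3.0", date(2025, 3, 3), "opening", "Shortened preamble"),
    ScriptChange("1.4.0", date(2025, 3, 18), "repair", "Added no-input branch"),
    ScriptChange("1.4.1", date(2025, 3, 20), "confirmation", "Hard-confirm postcode"),
]
suspects = changes_before(log, moved_on=date(2025, 3, 21))
```

When a metric moves on a given day, the returned list is the short set of script changes worth inspecting first.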

Test it adversarially before it ships. The people who designed the conversation are the worst testers of it, because they know how to talk to it. Production callers interrupt, mumble, go off-script, and answer the wrong question. A proper voice agent QA and testing framework stress-tests the repair branches, the edge cases, and the escalation paths — not just the happy path that demos so well.

Review it on a cadence. The richest backlog for the next iteration is sitting in last week's calls: the turns where callers dropped, the confirmations that went wrong, the repairs that looped. A weekly review of the worst calls, fed straight back into the script, is the engine that turns a deployed agent from adequate to excellent. This is the operating-rhythm work that the broader DATS methodology and our approach build into every engagement, so the conversation keeps improving after go-live instead of quietly rotting.

The wider context is worth keeping in view. McKinsey's State of AI 2025 found that roughly 88% of enterprises now use AI but only about 6% capture material EBIT impact, with AI leaders earning 2.5× more EBIT than peers, and the gap is rarely the model. It is the operational discipline around it. In voice specifically, the discipline that most often decides whether a deployment joins that 6% or stays with the rest is whether the conversation was designed to convert or left to improvise.


Frequently asked questions

Is conversation design the same as prompt engineering?

No. Prompt engineering is the system-prompt craft — persona, constraints, tool definitions, and guardrails that shape the agent globally across every turn. Conversation design is the turn-by-turn craft — how the agent opens, confirms, asks, repairs, and closes. A perfect system prompt still leaves every sequential decision in the call undesigned. The two are complementary: prompt engineering decides who the agent is; conversation design decides what the call does.

Do LLM-driven agents make conversation design unnecessary?

The opposite. A flexible model without designed structure improvises the opening, the confirmations, and the repairs — which is exactly where production calls fail. Conversation design gives the model a spine: a defined set of states it must move through, with freedom inside each state and discipline about the transitions. The architecture choice between LLM-driven and scripted agents is real, but conversation design sits above both.

What is the single biggest conversation-design mistake in production?

The unbounded repair loop. An agent that fails to understand, repeats itself, fails again, and keeps looping until the caller is furious — then escalates to a human with no context — destroys both containment and CSAT in one pattern. The fix is a designed repair budget (commonly three attempts), repair turns that escalate in specificity rather than repeating, and a warm escalation that passes full context to the human.

How do we measure whether a conversation design is good?

Design backwards from a measurable outcome — completed booking, captured claim, qualified lead, contained query — and instrument it. Track containment rate, completion rate, data accuracy on captured fields, repair-loop frequency, and the specific turns where callers drop. A good conversation design is an optimisation loop against a target, not an aesthetic judgement about a transcript.

Where does compliance fit into conversation design?

Directly into the dialogue. AI disclosure (required at first interaction under the EU AI Act's Article 50) and consent capture are designed lines in the opening, not buried disclaimers. Data minimisation shapes the task — the agent should ask only for what the systems genuinely cannot retrieve. Designing compliance as part of the conversation, rather than bolting it on, is what keeps a converting agent inside the lines.


Written by the Dilr.ai engineering team, practitioners who ship enterprise AI in production.
