Voice AI Incident Response: The Runbook for When It Breaks

It is 14:31 on a Tuesday. Your enterprise voice agent — the one handling thousands of calls a day across renewals, support, and outbound follow-up — has started doing something it should not. Maybe it is quoting a price that is £40 too low on every renewal. Maybe it has begun confidently citing a policy that was retired last quarter. Maybe a model provider pushed a silent update overnight and the agent now talks over callers, or stops mid-sentence, or fails to recognise a request for a human. The first you hear of it is a frontline manager forwarding three angry voicemails and asking, "is the AI broken?"

Here is the question that decides everything that happens next: who do you call, and what do you do first? In most enterprise voice AI programmes, there is no written answer. There is a Slack channel, a vendor support email, and a scramble. The agent stays live while people argue about whether it is really broken, who owns the decision to turn it off, and whether turning it off is even allowed. By the time someone with authority makes a call, the agent has handled another four hundred conversations. That gap — between an incident detected and an incident contained — is the difference between a twenty-minute operational hiccup and a regulatory notification, a remediation campaign, and a board conversation you do not want to have.

This runbook is shipped by the team behind Dilr Voice — enterprise voice AI live in 40+ countries with incident tooling built in, not bolted on. For the wider delivery system around it, see DATS, our five-stage AI consulting methodology.

The reliability discipline that separates programmes that scale from programmes that stall is rarely glamorous, and it is almost never in the demo. It is the answer to "what happens when it breaks" — written down, owned by a named person, and rehearsed before the day it is needed. This post is that answer: a six-phase voice AI incident runbook — detect, triage, contain, diagnose and roll back, notify, post-mortem — plus the severity model, the metrics, and the thirty-day plan to stand it up. It is the operational layer that turns an AI agent in production into one you can actually trust in production.

88% of enterprises use AI in some form (McKinsey, 2025)

33% have AI genuinely in production (McKinsey, 2025)

6% capture material EBIT impact (McKinsey, 2025)

2.5× more EBIT for AI leaders vs peers (BCG, 2025)

Read those four numbers as one story. Adoption is near-universal, but only a third of organisations have anything genuinely in production, and only about one in sixteen capture material EBIT from it. The cliff between "we use AI" and "AI moves the P&L" is the move into production, and an agent in production is, by definition, an agent that can fail in production, on a live call, in front of a customer. The 6% who capture value are not the ones with the cleverest prompts. They are the ones who treated reliability as a first-class discipline, so that an incident is a contained event with a known response rather than an existential surprise. Incident readiness is not a tax on the ROI case; it protects it. The value your CFO eventually signs off only survives if a single bad week does not erase a quarter of it.

Why a voice AI incident is not a normal outage

Enterprise IT has decades of incident-management practice, and most of it transfers. But four things make a voice AI incident genuinely different from a web outage or a database failure, and a runbook copied wholesale from your platform team will quietly miss all four.

It is live and irreversible. When a website goes down, the user sees an error and comes back later. When a voice agent goes wrong, it has already said the wrong thing to a real person on a real call. You cannot un-say it. The remediation surface is not "restore service" — it is "find everyone the agent spoke to during the bad window and put it right." That is a fundamentally different and larger job, and it is why detection speed matters more here than almost anywhere else.

The failure is often probabilistic, not binary. A crashed server is unambiguous. A voice agent that gives the correct answer 96% of the time and a dangerous one 4% of the time is "working" by every uptime dashboard and broken in every way that matters. The non-determinism that makes LLM-driven agents feel natural also means your incidents will frequently be statistical drifts rather than clean breaks — which is exactly why customer-reported detection is too slow and why you have to watch the right signals.

The blast radius scales with your success. The whole point of voice AI is that one agent handles the volume of fifty humans. That leverage runs in both directions. A single bad prompt change or a silent model update does not affect one agent's calls — it affects every call, simultaneously, at machine speed. An incident that a human team would have contained to a handful of conversations propagates across your entire estate in minutes.

It is frequently regulated. A wrong answer from a voice agent in financial services, healthcare, or any regulated context is not just a service failure; it can be a reportable event. The clock that starts when you become aware of a personal-data breach, or of a breach of the duties that FCA AI governance now extends to AI-assisted communications, keeps running whether or not your investigation has finished. Your runbook has to account for obligations that a generic IT runbook never contemplated, including the disclosure duties from the ICO AI Code of Practice and, for EU-facing deployments, EU AI Act Article 50 transparency obligations.

The implication is uncomfortable but clarifying: you cannot wait for an incident to design your incident response. The decisions that matter — what counts as severe, who can pull the kill switch, what you tell whom — are decisions you make calmly in advance or badly under pressure. There is no third option.

First, classify: the voice AI severity model

Before any runbook can function, everyone has to agree on what "broken" means and how bad this particular break is. Without a shared severity language, every incident becomes a negotiation, and negotiation costs minutes you do not have. A simple three-tier model, calibrated to voice AI specifically, removes the argument.

| Severity | What it means for a voice agent | Examples | Response |
|---|---|---|---|
| **SEV1: Critical** | Active harm, regulatory breach, or estate-wide failure happening now | Wrong regulated information given (pricing, eligibility, medical, financial); failure to handle a vulnerable caller or a request for a human; mass mis-routing; a data exposure; the agent breaching its disclosure obligation | Immediate containment. Wake people up. Notification clock assumed running. |
| **SEV2: Major** | Material degradation affecting many calls, but not active harm | Containment rate collapses; latency spikes break the conversation; one major flow (e.g. payments) failing; tool-calling double-executing actions | Contain within the hour. Senior on-call engaged. Roll back the suspected change. |
| **SEV3: Minor** | Localised or cosmetic; quality below target but safe | One edge-case intent mishandled; an awkward but harmless phrasing; a single integration timing out intermittently | Ticket it, fix in the normal release cycle, monitor for escalation. |

The discipline here is not the table. It is agreeing the SEV1 triggers in advance and writing them down. For a financial-services deployment, "quotes a rate it is not authorised to quote" is a SEV1, full stop, because it is potentially FCA-reportable. For healthcare, "fails to escalate a caller in distress" is a SEV1 because it is a patient-safety event. The list of what counts as a SEV1 in your deployment should be specific, short, and owned by your risk function, not improvised by whoever happens to be online at 14:31. This is the same logic that drives treating hallucination as a procurement gate: you decide what is unacceptable while you are calm, so the live decision is a lookup, not a debate.

The six-phase voice AI incident runbook

With severity defined, the runbook itself is a sequence. Each phase has one job, a named owner, and a clear hand-off to the next. The structure below is non-negotiable in its order — you contain before you diagnose, you notify on a clock that does not wait for root cause — but every step is sized to your deployment.

The DILR voice AI incident runbook — six phases

01 Detect: know it is broken before the customer tells you. Monitoring on the signals that actually move; alert thresholds set in advance.
02 Triage & severity: is this real, and how bad? Assign SEV1/2/3 against the pre-agreed triggers. Name the incident lead.
03 Contain: stop the bleeding. Kill switch or graceful fallback; scope the blast radius. This phase saves the most damage.
04 Diagnose & roll back: what broke, and return to a known-good state. Roll back first, understand fully later.
05 Notify: who needs to know, and when. Internal tree, affected customers, and the regulator if the clock is running.
06 Post-mortem & prevent: blameless review; the fix that stops recurrence and the test that proves it.

The single most important property of this sequence is that earlier phases gate later ones. You cannot triage what you have not detected. You must contain before you indulge the engineer's instinct to diagnose — every minute spent understanding root cause while the agent is still live is another batch of calls going wrong. And you notify on a legal clock that does not pause for your investigation. Programmes that get this wrong almost always get it wrong by inverting it: they diagnose first, because diagnosis feels like progress, while the incident keeps running underneath them.

Phase 01 — Detect

The cardinal sin of voice AI operations is letting customers be your monitoring. By the time three voicemails reach a manager, the agent has handled hundreds of calls. Detection has to be automatic and it has to watch the signals that actually correlate with things going wrong — not just "is the service up."

The signals worth alerting on fall into three groups. Behavioural drift: a sudden change in containment rate, escalation rate, average handle time, or the rate at which callers ask for a human — any sharp move is a tell. Quality collapse: a drop in your automated agent quality score, a spike in negative sentiment, or transcription confidence falling off a cliff. Infrastructure: latency P95 breaching budget, tool-call error rates climbing, or telephony fault rates rising. The richest of these signals come straight from your real-time transcription layer, which is why that capability is infrastructure, not analytics garnish.

Two practices sharpen detection. The first is the synthetic canary call: a scripted test conversation run automatically every few minutes that exercises the critical flows and alerts the instant the agent's response deviates from the expected envelope. A canary catches a bad model update at 02:00 before a single real customer hits it. The second is alerting on rate-of-change, not just absolute thresholds — a containment rate of 78% might be normal, but a containment rate that fell from 84% to 78% in fifteen minutes is an incident, and only the derivative tells you that.
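As a rough illustration of rate-of-change alerting, the sketch below keeps a sliding window of containment-rate samples and fires when the latest value has dropped too far from the recent peak. The class name, the fifteen-minute window, and the five-point tolerance are all illustrative assumptions, not settings from any particular monitoring product.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Optional, Tuple


@dataclass
class RateOfChangeAlert:
    """Fires when a metric falls more than `max_drop` within `window_s` seconds."""

    window_s: float = 15 * 60   # assumed look-back window: fifteen minutes
    max_drop: float = 0.05      # assumed tolerance: five percentage points

    def __post_init__(self) -> None:
        # Each sample is (timestamp in seconds, metric value between 0 and 1).
        self._samples: Deque[Tuple[float, float]] = deque()

    def observe(self, timestamp: float, value: float) -> Optional[str]:
        """Record a sample; return an alert message if the drop is too steep."""
        self._samples.append((timestamp, value))
        # Discard samples that have aged out of the look-back window.
        while timestamp - self._samples[0][0] > self.window_s:
            self._samples.popleft()
        peak = max(v for _, v in self._samples)
        drop = peak - value
        if drop > self.max_drop:
            return (
                f"containment fell {drop:.0%} inside the window "
                f"(peak {peak:.0%}, now {value:.0%})"
            )
        return None
```

Fed 84% at minute zero and 78% at minute fifteen, `observe` returns a message; fed a steady 78%, it stays silent, which is the distinction an absolute threshold cannot make.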

Phase 02 — Triage and severity

An alert is not an incident. The triage phase answers two questions fast: is this real, and how bad? The "is it real" check protects you from alert fatigue — a single noisy metric or a known maintenance window should not scramble a team. The "how bad" check is where your pre-agreed SEV1/2/3 triggers earn their keep: the on-call responder looks up the symptom against the list, assigns a severity, and — critically — names the incident lead. One person owns the response. In a SEV1, that person has standing authority to contain without seeking further approval; the time to debate who can pull the switch is not while the agent is live.
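The "lookup, not a debate" idea can be sketched as a plain dictionary from symptom to severity. The symptom keys below are hypothetical placeholders loosely based on the severity table; the real list belongs to your risk function. Treating an unrecognised symptom as SEV2 is also an assumption, chosen so that unknowns get a human look rather than a quiet ticket.

```python
from enum import IntEnum
from typing import Dict, List


class Severity(IntEnum):
    SEV1 = 1  # critical
    SEV2 = 2  # major
    SEV3 = 3  # minor


# Hypothetical trigger list; replace with the one your risk function signs off.
TRIGGERS: Dict[str, Severity] = {
    "wrong_regulated_information": Severity.SEV1,
    "missed_request_for_human": Severity.SEV1,
    "data_exposure": Severity.SEV1,
    "containment_rate_collapse": Severity.SEV2,
    "tool_call_double_execution": Severity.SEV2,
    "edge_case_intent_mishandled": Severity.SEV3,
}


def triage(symptoms: List[str]) -> Severity:
    """Return the most severe tier matched by any observed symptom."""
    if not symptoms:
        raise ValueError("triage needs at least one observed symptom")
    # Lower number means more severe, so min() picks the worst case.
    # Unknown symptoms default to SEV2 (an assumption, see the lead-in).
    return min(TRIGGERS.get(symptom, Severity.SEV2) for symptom in symptoms)
```

For example, `triage(["containment_rate_collapse", "data_exposure"])` resolves to `Severity.SEV1`, because the worst matching symptom decides the tier.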

Triage also opens the incident record — a single timestamped log of what was seen, what was decided, and when. This is not bureaucracy; it is the raw material for both the post-mortem and any regulatory account you may later have to give, and it leans directly on your auditability and explainability foundations. If you cannot reconstruct what the agent decided and why, you cannot run a credible incident response.
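A minimal sketch of such an incident record follows, assuming an in-memory, append-only list of timestamped entries that can be exported as JSON. The field names and the three entry kinds (`observed`, `decided`, `action`) are illustrative choices; a production record would live in durable storage.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional

ALLOWED_KINDS = {"observed", "decided", "action"}  # assumed entry types


@dataclass
class IncidentRecord:
    incident_id: str
    lead: str
    entries: List[Dict[str, str]] = field(default_factory=list)

    def log(self, kind: str, note: str, at: Optional[datetime] = None) -> None:
        """Append one timestamped entry; entries are never edited or removed."""
        if kind not in ALLOWED_KINDS:
            raise ValueError(f"unknown entry kind: {kind!r}")
        stamp = (at or datetime.now(timezone.utc)).isoformat()
        self.entries.append({"at": stamp, "kind": kind, "note": note})

    def to_json(self) -> str:
        """Serialise the whole record for the post-mortem or a regulator."""
        return json.dumps(asdict(self), indent=2)
```

Recording UTC timestamps at the moment of each decision means the post-mortem timeline is read from the record rather than reconstructed from memory.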

Phase 03 — Contain

Containment is the phase that saves the most damage, and it is the phase most programmes are least prepared for. Its job is singular: stop the agent doing harm, now, before anyone understands why. There are three containment tools, in order of bluntness.

The first is the kill switch — a single, tested control that takes the agent offline and routes traffic to a human queue or a holding message. The hard requirements are that it is one action, that the on-call lead can trigger it alone, and that it has been tested in production conditions. A kill switch that lives in a config file only one engineer understands, or that has never actually been pulled, is not a kill switch — it is a hope. The second tool is graceful fallback: rather than going dark, the agent degrades to a safe mode — a simpler scripted flow, or an immediate handover. A well-designed human handover path is what makes containment feel like good service rather than an outage to the caller. The third is blast-radius scoping: if the problem is confined to one flow — say, renewals — you contain that flow and leave the rest running, rather than taking the whole estate offline. Partial containment requires that your agent is architected in isolatable segments; if everything is one monolith, your only option is the blunt instrument.
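The three containment tools can be pictured as a single routing decision made at the start of every call. The sketch below assumes flows are identified by simple string names and that containment state is held in one small object; `ContainmentState`, `Route`, and `route_call` are hypothetical names, not part of any specific platform.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Set


class Route(Enum):
    AGENT = "agent"              # normal operation
    SAFE_MODE = "safe_mode"      # graceful fallback to a simpler scripted flow
    HUMAN_QUEUE = "human_queue"  # agent bypassed entirely


@dataclass
class ContainmentState:
    kill_switch: bool = False                               # whole estate offline
    isolated_flows: Set[str] = field(default_factory=set)   # blast-radius scoping
    safe_mode_flows: Set[str] = field(default_factory=set)  # degraded but live


def route_call(flow: str, state: ContainmentState) -> Route:
    """Decide where an incoming call goes, bluntest control first."""
    if state.kill_switch or flow in state.isolated_flows:
        return Route.HUMAN_QUEUE
    if flow in state.safe_mode_flows:
        return Route.SAFE_MODE
    return Route.AGENT
```

With `isolated_flows={"renewals"}`, renewals calls go to the human queue while support calls still reach the agent; setting `kill_switch = True` is the single action that overrides everything else.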

The contain decision involves a real trade-off — every minute the agent runs is more potential harm, but turning it off has its own cost in abandoned calls and customer friction. The severity model resolves this cleanly: for a SEV1, you contain first and count the cost of containment second. The asymmetry is the point. The cost of an over-cautious shutdown is some inconvenienced callers; the cost of leaving an actively harmful agent live is unbounded.

Phase 04 — Diagnose and roll back

Only once the bleeding has stopped do you work out what happened. The discipline here is roll back first, understand fully later. If the incident began shortly after a change — a prompt edit, a model version bump, a new tool integration, a knowledge-base update — your fastest safe path is almost always to revert to the last known-good version and restore service in safe mode, then diagnose at leisure. This is only possible if your configuration is versioned and your deployments are reversible, which is why treating the prompt as a managed, version-controlled asset is an operational requirement, not a nicety. A prompt you cannot roll back is a prompt you cannot safely deploy.
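To make "revert to the last known-good version" concrete, here is a minimal in-memory sketch of a versioned prompt store. It assumes versions are numbered from 1 in deployment order and that someone explicitly marks a version as known-good after it has run cleanly; `PromptStore` and its methods are hypothetical, and a real setup would more likely sit on top of your existing version control.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class PromptStore:
    versions: List[str] = field(default_factory=list)  # index + 1 = version number
    known_good: Set[int] = field(default_factory=set)
    live: Optional[int] = None

    def deploy(self, text: str) -> int:
        """Store a new prompt version and make it live."""
        self.versions.append(text)
        self.live = len(self.versions)
        return self.live

    def mark_known_good(self, version: int) -> None:
        if not 1 <= version <= len(self.versions):
            raise ValueError(f"no such version: {version}")
        self.known_good.add(version)

    def roll_back(self) -> int:
        """Make the most recent known-good version before the live one live."""
        candidates = [v for v in self.known_good
                      if self.live is None or v < self.live]
        if not candidates:
            raise RuntimeError("no known-good version to roll back to")
        self.live = max(candidates)
        return self.live

    def live_text(self) -> str:
        if self.live is None:
            raise RuntimeError("nothing is deployed")
        return self.versions[self.live - 1]
```

The `RuntimeError` in `roll_back` is the code-level version of the paragraph's last sentence: if no version was ever marked known-good, there is nothing safe to return to.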

Diagnosis itself follows the layer map. A voice AI failure lives in one of a handful of layers, and naming them speeds the hunt: the prompt/policy layer (instructions changed or were misread), the model layer (a provider update changed behaviour silently), the grounding/knowledge layer (the agent is citing stale or wrong source data), the tool-calling layer (an action executed wrongly, double-executed, or failed — the failure mode covered in depth in our tool-calling architecture guide), the telephony layer (audio, latency, barge-in), or the integration layer (a system of record returned bad data). Most real incidents are a change in one layer interacting badly with an assumption in another — which is exactly what a disciplined QA and testing framework is supposed to catch before production, and the post-mortem will ask why it did not.

Phase 05 — Notify

Notification runs on three separate clocks, and confusing them is how programmes turn a contained incident into a compliance failure. The internal clock is about coordination: a pre-built communication tree so that the incident lead, the business owner, legal, and the executive sponsor learn what they need to, when they need to, without the lead stopping to compose bespoke updates. The customer clock is about trust and remediation: if the agent gave wrong information to a defined set of callers, those callers need to be identified and put right — a proactive re-contact is almost always cheaper than the alternative, and it is only possible because your transcription and logging let you enumerate exactly who was affected.
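Enumerating who was affected is, at its simplest, a filter over call logs by flow and time window. The sketch below assumes each log entry carries a caller identifier, a flow name, and a start time; the `CallLog` fields are illustrative, and real logs would usually need transcript-level checks on top of this.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class CallLog:
    call_id: str
    caller: str       # assumed stable caller identifier
    flow: str         # e.g. "renewals"
    started_at: datetime


def affected_callers(
    calls: Iterable[CallLog],
    flow: str,
    window_start: datetime,
    window_end: datetime,
) -> List[str]:
    """Unique callers on `flow` whose call started inside the bad window.

    Callers are returned in the order they were first seen.
    """
    seen: Dict[str, str] = {}  # caller -> first affected call_id
    for call in calls:
        if call.flow == flow and window_start <= call.started_at <= window_end:
            seen.setdefault(call.caller, call.call_id)
    return list(seen)
```

The output is the re-contact list: each caller appears once, however many times they rang during the bad window.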

The regulatory clock is the one that does not wait for you. If the incident involved personal data, UK GDPR's 72-hour breach-notification window to the ICO may already be running from the moment of detection. If you operate under FCA expectations, a serious failure in customer communications may be separately reportable. The runbook's job is not to make the legal call in the moment — it is to ensure the right people are notified fast enough that the legal call can be made inside the window. The decision tree for "is this notifiable, and to whom" should be written in advance with your legal and compliance teams; the same data-protection thinking that shapes your retention policy determines what you are obliged to report when call data is involved.
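The arithmetic of the 72-hour window is simple enough to automate, which keeps the deadline visible on the incident record while legal decides whether the event is notifiable. This sketch only computes dates from the moment of detection; it makes no judgement about whether a notification is required, and the function names are illustrative.

```python
from datetime import datetime, timedelta, timezone

NOTIFICATION_WINDOW = timedelta(hours=72)  # the 72-hour window described above


def notification_deadline(aware_at: datetime) -> datetime:
    """Latest time a notification can be made, counted from detection."""
    if aware_at.tzinfo is None:
        # Naive timestamps are ambiguous across time zones and DST changes.
        raise ValueError("aware_at must be timezone-aware")
    return aware_at + NOTIFICATION_WINDOW


def time_remaining(aware_at: datetime, now: datetime) -> timedelta:
    """Time left on the clock; negative once the deadline has passed."""
    return notification_deadline(aware_at) - now
```

Rejecting naive timestamps is a deliberate choice here: a deadline computed from an ambiguous local time is the kind of small error that only surfaces when it matters.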

Phase 06 — Post-mortem and prevent

An incident you do not learn from, you will have again. The post-mortem closes the loop, and it has two rules. It is blameless — the question is "what about our system let this happen," never "whose fault was it" — because the moment people fear blame, they stop reporting near-misses, and near-misses are your cheapest source of learning. And it is actionable — every post-mortem ends with a short list of changes, each with a named owner and a date, and at least one of them is a new automated test or guardrail that would have caught this class of failure earlier. The fix is not "we'll be more careful." The fix is a canary that checks the thing that broke, a threshold that alerts on the signal you missed, or a guardrail in the prompt that makes the bad output impossible. This is the feedback loop that connects incident response back to your governance framework: incidents are the highest-signal input your governance has, and a programme that metabolises them gets more reliable with every one.
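The "actionable" rule can be checked mechanically before a post-mortem is closed. The sketch below assumes each action is recorded with a description, an owner, a due date, and a flag for whether it adds an automated test or guardrail; the `Action` shape and the wording of the gap messages are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Action:
    description: str
    owner: str
    due: date
    adds_automated_check: bool = False  # new test, canary, alert, or guardrail


def postmortem_gaps(actions: List[Action]) -> List[str]:
    """Return the reasons a post-mortem is not yet actionable (empty if none)."""
    gaps: List[str] = []
    if not actions:
        gaps.append("no actions recorded")
    for action in actions:
        if not action.owner.strip():
            gaps.append(f"'{action.description}' has no named owner")
    if not any(action.adds_automated_check for action in actions):
        gaps.append("no action adds an automated test or guardrail")
    return gaps
```

An action list consisting only of "be more careful", even with an owner and a date, still comes back with a gap, because nothing in it adds an automated check.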

How to stand up an incident runbook in 30 days

You do not need a six-month programme to get from "no runbook" to "credible runbook." The following five steps, run over a month, get you to a state where a 14:31 incident has a written, owned, rehearsed answer. Work them in order: each builds on the one before, although the day ranges overlap where a step can start before the previous one has fully finished.

Step 01 — Define severity (days 1–5). Sit your risk, compliance, and operations leads in a room and write the SEV1 trigger list for your deployment. Be specific to your regulated obligations and your highest-harm flows. Agree the three-tier model and who owns the severity call. Output: a one-page severity definition everyone has signed.

Step 02 — Build detection and alerting (days 6–14). Instrument the behavioural, quality, and infrastructure signals. Set rate-of-change alerts, not just thresholds. Stand up at least one synthetic canary call on your most critical flow. Output: an alert that fires before a customer would have called.

Step 03 — Build and test the kill switch (days 10–18). Implement the containment controls — kill switch, graceful fallback, and, if your architecture allows, flow-level isolation. Then actually pull it in a controlled window. An untested kill switch does not count. Output: a containment control the on-call lead has personally triggered.

Step 04 — Write the notification tree (days 15–22). With legal and compliance, build the decision tree for internal, customer, and regulatory notification, including the breach-clock logic. Pre-draft the templates so nobody is composing under pressure. Output: a notification flowchart and message templates.

Step 05 — Run a game day (days 23–30). Simulate an incident — inject a deliberately bad prompt into a staging agent, or table-top a SEV1 scenario — and run the whole team through the runbook end to end. Time it. The first game day always exposes a missing owner, a control that does not work, or a hand-off that breaks. Fix what it surfaces. Output: a rehearsed team and a runbook with the gaps closed.

A programme that completes these five steps has done something most of its competitors have not: it has made reliability a practice rather than an aspiration. This is also precisely the discipline that gets a programme out of pilot purgatory — the inability to answer "what happens when it breaks" is one of the most common reasons a successful pilot never gets the green light to scale, a transition we map in full in our guide to designing the pilot-to-scale path.

Sector calibration: what makes a SEV1 in your world

The runbook structure is universal; the SEV1 trigger list is not. What constitutes a critical incident, and therefore what your detection must watch for and what your notification clock responds to, is shaped entirely by your sector's regulatory and harm profile. The table below maps the single highest-stakes incident type per sector and points to the related guide that goes deeper.

| Sector | The incident that is always a SEV1 | What it forces into the runbook |
|---|---|---|
| **Financial services** | Agent gives unauthorised or wrong regulated information; fails a vulnerable caller | FCA-reportable assessment; vulnerability-cue detection; tight breach clock. See fintech collections and KYC. |
| **Healthcare** | Failure to escalate a caller in distress; wrong clinical or appointment information | Patient-safety event handling; clinical sign-off in the post-mortem. See HIPAA-grade voice automation. |
| **Insurance** | Mis-captured claim detail or wrong eligibility statement at first notice of loss | Claim-integrity remediation; re-contact of affected claimants. See insurance claims intake. |
| **Public sector** | Mass mis-routing or wrong entitlement information to citizens | Accessibility duty; transparent public notification. See AI voice for UK councils. |
| **Outbound sales** | Calls placed outside consent or to suppressed numbers; wrong pricing quoted | Immediate outbound halt; consent-state audit. See outbound enterprise sales. |
| **Legal** | Privileged or wrong intake information mishandled | Conflict and privilege review; firm-risk escalation. See law firm client intake. |

The pattern across every row is the same: the sector decides what is unacceptable, and the runbook makes the unacceptable a detected, contained, and notified event rather than a discovered-later disaster. This is why a generic platform that treats all calls identically is a liability in a regulated context — the incident profile is sector-specific, and your tooling has to be, too, a point we develop in our guide to platform selection criteria.

What to measure: incident metrics that matter

If you do not measure your incident response, you cannot improve it, and you cannot prove to a sceptical board or a procurement team that your programme is reliable. Six metrics tell the story.

Mean time to detect (MTTD): how long from the first bad call to the alert firing. The single most important number, because everything downstream depends on it; if MTTD is measured in hours, you have a monitoring problem, not an incident-response problem.

Mean time to contain (MTTC): from detection to the agent no longer doing harm. This is the number the kill switch directly improves.

Mean time to recover (MTTR): from detection to service fully restored in a good state.

Incident rate per 10,000 calls, trended over time: a healthy programme drives this down as post-mortems compound.

Percentage auto-detected versus customer-reported: you want this climbing toward 100%; every customer-reported incident is a monitoring gap.

Affected-call remediation completeness: of the callers an incident touched, what fraction did you successfully put right.
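Most of these metrics fall out of four timestamps per incident. The sketch below assumes each incident record stores when the first bad call happened, when it was detected, contained, and recovered, plus whether detection was automatic; the `Incident` fields and the `summarise` function are illustrative. Remediation completeness is left out because it needs per-caller follow-up data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Incident:
    first_bad_call: datetime
    detected: datetime
    contained: datetime
    recovered: datetime
    auto_detected: bool


def _mean_minutes(deltas: Iterable[timedelta]) -> float:
    return mean(d.total_seconds() for d in deltas) / 60


def summarise(incidents: List[Incident], total_calls: int) -> Dict[str, float]:
    """Compute the headline incident metrics for a reporting period."""
    if not incidents or total_calls <= 0:
        raise ValueError("need at least one incident and a positive call count")
    return {
        # MTTD: first bad call to alert firing.
        "mttd_min": _mean_minutes(i.detected - i.first_bad_call for i in incidents),
        # MTTC: detection to the agent no longer doing harm.
        "mttc_min": _mean_minutes(i.contained - i.detected for i in incidents),
        # MTTR: detection to service fully restored.
        "mttr_min": _mean_minutes(i.recovered - i.detected for i in incidents),
        "incidents_per_10k_calls": len(incidents) * 10_000 / total_calls,
        "auto_detected_pct": 100 * sum(i.auto_detected for i in incidents) / len(incidents),
    }
```

Storing the first-bad-call time separately from the detection time is what makes MTTD measurable at all; it is usually filled in during the post-mortem, once the bad window has been established.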

These sit alongside, not instead of, your standing programme metrics. The incident metrics measure the resilience of the system; your broader programme KPIs measure its value. A board reviewing the programme wants both on one page: this much value, captured this reliably, with incidents detected this fast and contained this tightly. That combination is what turns "we deployed an AI agent" into "we run an AI agent we can trust" — and it is the unglamorous discipline behind the total cost of ownership that vendors rarely show you, because on-call, monitoring, and incident tooling are real line items the demo never mentions.

Want to pressure-test your own readiness? See how this works in production with Dilr Voice, book an AI placement diagnostic to find where reliability risk actually sits, or read about our approach to placing AI inside enterprise systems safely.

Where the runbook sits in the wider programme

An incident runbook is not a standalone document; it is one organ in a healthy operating system. It draws its detection signals from your monitoring and quality layers, its containment from your architecture, its notification logic from your compliance posture, and its prevention loop from your governance framework. It is also, increasingly, a procurement question: the buyers who do this seriously now ask vendors to demonstrate their incident tooling, their rollback story, and their support response times, and they write the answers into the contract — which is why incident and service-level obligations feature prominently in the MSA clauses enterprise legal demands. The reliability of your programme is partly your discipline and partly your vendor's, and the contract is where the two meet.

The deeper point is cultural. A blameless, well-rehearsed incident practice changes how an organisation relates to its AI. It moves the conversation from "can we trust this thing" — a question of faith — to "how fast do we detect, contain, and learn" — a question of measurable competence. That shift is most of what separates the organisations capturing real value from the ones still stuck in cautious pilots, and it is inseparable from the change-management work of building an on-call culture that takes a misbehaving agent as seriously as it would a misbehaving human team. The runbook is the artefact; the practice is the point.

How is a voice AI incident different from a normal IT outage?

Four ways: it is live and irreversible (the agent already said the wrong thing to a real caller), it is often probabilistic rather than a clean crash (96% right and 4% dangerous looks "up" on a dashboard), its blast radius scales with your volume because one agent handles the load of many humans, and it is frequently regulated, so a wrong answer can be a reportable event. A runbook copied from your platform team will miss all four.

What counts as a SEV1 for a voice agent?

Active harm, a regulatory breach, or an estate-wide failure happening now: wrong regulated information (pricing, eligibility, clinical, financial), failure to handle a vulnerable caller or a request for a human, mass mis-routing, a data exposure, or a breach of disclosure obligations. The exact list should be written in advance by your risk function and calibrated to your sector. For financial services, an unauthorised quote is always a SEV1; for healthcare, failing to escalate a distressed caller is.

Do we have to notify a regulator when the agent gets something wrong?

Sometimes, and the clock may already be running. If personal data was exposed, UK GDPR's 72-hour ICO notification window can start at detection. Under FCA expectations, a serious customer-communications failure may be separately reportable. The runbook's job is not to make the legal call in the heat of the moment — it is to notify the right people fast enough that the call can be made inside the window. Build the decision tree with legal in advance.

Isn't a kill switch enough? Why do we need a whole runbook?

A kill switch is one phase of six. It contains, but it does not detect (you still need to know to pull it), triage (how bad, who owns it), diagnose and roll back (what broke, returning to known-good), notify (customers and regulators on their own clocks), or prevent (the test that stops it recurring). A kill switch with no runbook around it means you find out late, turn everything off bluntly, and learn nothing. The runbook makes the switch part of a controlled response.

How do we test the runbook before a real incident?

Run a game day. Inject a deliberately bad prompt into a staging agent, or table-top a SEV1 scenario, and walk the whole team through the runbook end to end (detect, triage, contain, diagnose, notify, post-mortem), and time it. The first game day always exposes a missing owner, an untested control, or a broken hand-off. Fix what it surfaces. A rehearsal is how you find out that your kill switch does not work on a quiet afternoon of your choosing rather than at 14:31 during a real incident.


Written by the Dilr.ai engineering team, practitioners who ship enterprise AI in production and run the on-call rota that keeps it there. This article is operational guidance, not legal advice; confirm your specific notification obligations with your compliance function.
