Voice AI SLAs: the service levels that actually bind

Every enterprise voice AI contract has a service-level agreement attached. Almost none of them protect the buyer the way the buyer assumes. The reason is not that the numbers are wrong. It is that the metric in the SLA is undefined, unmeasurable, or measured by the very party it is meant to bind — so when the service degrades, the breach is never recorded, the credit never fires, and the operational runbook never triggers. The SLA looks like protection on paper and behaves like a wish in production.

This guide is about the layer underneath the contract: how to design the service-level schedule as a measurement instrument. We have argued elsewhere that for voice agents latency, not uptime, is the operational service level that matters, and that the latency benchmarks that hold in production look nothing like the ones in a demo. This post sits one level deeper. It assumes you have already decided which metrics matter and asks the harder question: how do you define and measure each one so that the SLA actually binds — and so the service credit you negotiate has something real to attach to?

This guide is shipped by the team behind Dilr Voice — enterprise voice AI live in 40+ countries, instrumented so the service levels are measured, not asserted. Or see DATS, our five-stage AI consulting system for placing AI where the P&L moves.

8.76 hrs: permitted downtime a year at 99.9% (arithmetic, not protection)
P95: the percentile that bites (averages hide the calls that break trust)
Source of truth the SLA must name: not the vendor's own dashboard
~6%: of enterprises capture material EBIT from AI (the rest measure the wrong things)

Why the uptime headline is the wrong headline

Start with the metric everyone reaches for, because understanding why it fails tells you how to build the ones that work. Availability — the famous string of nines — measures whether a service responds. A voice agent that answers every call within milliseconds, never drops a connection, and returns a clean HTTP 200 on every request can be "99.99% available" while delivering an experience that loses customers on every call. It can mishear the account number, take three seconds to respond, refuse to escalate, and hallucinate a policy that does not exist. None of that registers against an uptime SLA, because the system was, in the strict sense, up.
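The arithmetic behind the nines is worth seeing once, because it shows how little the headline actually says. The following is a minimal sketch in Python; the function name and the 8,760-hour year (no leap day) are our assumptions for illustration, not part of any vendor's tooling.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a non-leap year


def permitted_downtime_hours(availability_pct: float,
                             period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime an availability target still allows over the period."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period_hours * (1 - availability_pct / 100)


# Print the downtime budget for a few common targets.
for target in (99.0, 99.9, 99.99):
    hours = permitted_downtime_hours(target)
    print(f"{target}% available -> {hours:.2f} hours of permitted downtime a year")
```

At 99.9% the budget comes out at 8.76 hours a year. Nothing in that figure says whether the calls answered during the other 8,751 hours were any good.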

This is not a subtle point and it is not new to anyone who has run a voice programme. In 2026, with roughly 88% of enterprises using AI but only around 6% capturing material EBIT impact (McKinsey's State of AI 2025), the gap between deployment and value is rarely a model problem. It is a measurement problem. Teams measure what is easy to measure — availability — and leave the service levels that actually determine value undefined. The result is a contract that is technically honoured and operationally broken.

The fix is not to discard availability. It is to demote it to one line in a schedule of service levels, redefine it so it means "successful call completion" rather than "the API responds", and then build the metrics that genuinely bind around it. Those metrics (latency, accuracy-under-load, containment floor, escalation availability, and time-to-restore) share a single property: each one is only as strong as its definition. Get the definition loose and the vendor will hit the target while the service degrades. Get it tight and the SLA becomes the instrument that runs your programme. The contract clause language that carries these into a binding agreement is a separate exercise: the eleven MSA clauses enterprise legal teams should demand cover the wording; here we are designing what the wording has to point at.

The service-level schedule: the metrics that bind

A real voice AI SLA is not a single uptime number. It is a schedule — a small table of distinct service levels, each with its own definition, measurement method, and target. The table below is the shape of that schedule. Treat the target bands as illustrative of what we see in enterprise engagements, not as a market standard to quote: the value you negotiate matters far less than whether the metric is defined so it can be measured and whether the breach can be detected by you.

Service level	What it actually measures	The definition trap	Source of truth
Availability	Successful call completion, not API uptime	"Up" defined as HTTP 200, so a hallucinating agent counts as available	Customer-side call outcome log
Latency	Time from end of caller speech to start of agent response, at P95/P99	Measured as average, or as server compute time only, hiding turn-level lag	Instrumented turn timestamps, ideally customer-observed
Accuracy under load	Correctness of captured data and agent statements at peak concurrency	Measured on a curated sample in quiet hours, not on live peak traffic	Sampled human-graded call review
Containment floor	Calls resolved without human handover at an acceptable quality	Containment counted even when the caller hung up in frustration	Outcome-tagged transcripts
Escalation availability	Whether a human handover succeeds when the agent reaches its limit	"Escalation offered" counted as success even when no human was reachable	Handover completion log
Time to restore	How fast a degraded service returns to target after a breach	Clock starts when vendor acknowledges, not when degradation began	Incident timeline, jointly reconciled
Data freshness	How current the knowledge the agent speaks from is	No metric at all — stale knowledge bases silently break accuracy	Knowledge-base version + sync log

Read down the third column and the pattern is unmistakable. Every one of these service levels can be gamed by a loose definition. The vendor does not need to act in bad faith; a definition that was never tightened simply drifts toward the reading that is easiest to honour. The discipline of SLA design is closing each of those gaps before signature.
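The time-to-restore row shows how much a single definitional choice moves the number. Here is a small Python sketch of the two clocks; the `Incident` fields, the function name, and the timestamps are hypothetical, chosen only to make the gap visible.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    degradation_began: datetime    # first breach visible in the agreed source of truth
    vendor_acknowledged: datetime  # when the vendor opened the incident
    restored: datetime             # when the metric returned to target


def time_to_restore(incident: Incident, clock: str = "degradation") -> timedelta:
    """Restore time under the chosen clock: 'degradation' or 'acknowledgement'."""
    if clock == "degradation":
        start = incident.degradation_began
    elif clock == "acknowledgement":
        start = incident.vendor_acknowledged
    else:
        raise ValueError("clock must be 'degradation' or 'acknowledgement'")
    return incident.restored - start


# Hypothetical incident: degraded at 09:00, acknowledged at 10:30, restored at 11:00.
incident = Incident(
    degradation_began=datetime(2026, 1, 12, 9, 0),
    vendor_acknowledged=datetime(2026, 1, 12, 10, 30),
    restored=datetime(2026, 1, 12, 11, 0),
)
print(time_to_restore(incident))                     # 2:00:00
print(time_to_restore(incident, "acknowledgement"))  # 0:30:00
```

Same incident, two hours on one clock and thirty minutes on the other. Which clock applies is a sentence in the schedule, not a property of the service.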

Three of these deserve their own published treatment, and you should hold the SLA to the standard those posts set rather than re-deriving it here: the 80% containment benchmark and how to read it for the containment floor, the accuracy evaluation that goes beyond word error rate for accuracy-under-load, and the human-handover pattern that decides ROI for escalation availability. The SLA is where those operational standards become contractual commitments — and where a breach has to wire into your voice AI incident-response runbook rather than sit in a quarterly review nobody reads.

The definitions problem: where SLAs quietly die

A service level is a sentence before it is a number. If the sentence is loose, no number can save it. This is the single most common reason a voice AI SLA fails to protect the buyer, and it is invisible until the first dispute. Walk through the four metrics that matter most and the trap becomes concrete.

Availability. Define "available" as the agent answered, understood, and completed or correctly escalated the call, measured from the customer's own call-outcome log. Defined that way, a stretch of calls where the agent answered but spoke from a knowledge base that had gone stale counts as unavailability, because the calls did not complete correctly. Define it as "the endpoint returned 200" and the same stretch counts as perfect uptime. Same service, opposite SLA outcome — the difference is one sentence.
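The two definitions can be put side by side over the same call-outcome log. This is a sketch under stated assumptions: the `CallOutcome` fields and both function names are ours for illustration, and a real log would need its own agreed schema.

```python
from dataclasses import dataclass


@dataclass
class CallOutcome:
    http_status: int
    answered: bool
    understood: bool
    completed: bool
    correctly_escalated: bool


def endpoint_availability(calls: list[CallOutcome]) -> float:
    """Loose definition: share of calls where the endpoint returned 200."""
    if not calls:
        raise ValueError("no calls in the measurement window")
    return sum(c.http_status == 200 for c in calls) / len(calls)


def completion_availability(calls: list[CallOutcome]) -> float:
    """Tight definition: answered, understood, and completed or correctly escalated."""
    if not calls:
        raise ValueError("no calls in the measurement window")
    ok = sum(
        c.answered and c.understood and (c.completed or c.correctly_escalated)
        for c in calls
    )
    return ok / len(calls)


# Hypothetical window: 90 good calls, 10 answered from a stale knowledge base.
good = CallOutcome(200, True, True, True, False)
stale = CallOutcome(200, True, True, False, False)
calls = [good] * 90 + [stale] * 10

print(endpoint_availability(calls))    # 1.0
print(completion_availability(calls))  # 0.9
```

The log is identical in both cases; only the predicate changes, and with it whether the stale-knowledge stretch counts as a breach.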

Latency. The physics of voice latency is unforgiving: humans perceive a pause beyond roughly 800 milliseconds as awkward and beyond 1.5 seconds as broken. So the definition has to pin three things: the clock (end of caller utterance to start of agent audio, not server-side compute time), the percentile (P95 and P99, never the average, because the average hides exactly the tail of calls that breaks trust), and the window (a rolling period, not a calendar month that lets a bad week be diluted by a quiet one). Miss any of the three and a vendor can report a flattering average while every fifth caller waits two seconds.
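A short Python sketch makes the average-versus-percentile point concrete. The sample data is invented for illustration, and the nearest-rank percentile used here is one common convention among several; the SLA should name whichever method both parties will use.

```python
import math
from statistics import mean


def turn_latency_ms(caller_speech_end_ms: int, agent_audio_start_ms: int) -> int:
    """The clock: end of caller utterance to start of agent audio."""
    return agent_audio_start_ms - caller_speech_end_ms


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not values:
        raise ValueError("no samples in the measurement window")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical window: four in five turns take 500 ms, one in five takes 2,000 ms.
turn_latencies_ms = [500] * 80 + [2000] * 20

print(mean(turn_latencies_ms))            # 800
print(percentile(turn_latencies_ms, 95))  # 2000
print(percentile(turn_latencies_ms, 99))  # 2000
```

The average lands on 800 ms and looks acceptable; P95 and P99 both report the two-second wait. In practice the sample list would be drawn from the rolling window the SLA names, which this sketch leaves out.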

Accuracy under load. Accuracy measured on a curated test set during quiet hours is a marketing number. The SLA has to specify that accuracy is sampled from live traffic, including peak concurrency, and graded against the metrics that matter operationally — was the data captured correct, was the summary right, did the agent avoid asserting something untrue. Pair the SLA threshold with a hallucination procurement gate so you know what number you are negotiating against, and hold the grading to the accuracy standard beyond WER. The definitional trap here is the sample: who draws it, how large, and judged by whom.

Containment. Containment is the most gameable metric in the entire schedule, because the obvious definition — "share of calls resolved without a human" — rewards the worst outcome. A caller who gives up and hangs up has been "contained." The containment rate that survives scrutiny is qualified containment: resolved without handover and without the caller abandoning and without a repeat call within a defined window. The SLA must encode that qualification or it will pay the vendor for frustrating your customers.
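The gap between naive and qualified containment is easy to show on a tagged call list. This Python sketch assumes three outcome tags per call; the field names and the sample mix are hypothetical, and the repeat-call window itself is whatever the SLA defines.

```python
from dataclasses import dataclass


@dataclass
class Call:
    handed_to_human: bool
    caller_abandoned: bool
    repeat_call_within_window: bool


def naive_containment(calls: list[Call]) -> float:
    """Share of calls with no human handover, whatever happened to the caller."""
    if not calls:
        raise ValueError("no calls in the measurement window")
    return sum(not c.handed_to_human for c in calls) / len(calls)


def qualified_containment(calls: list[Call]) -> float:
    """No handover, no abandonment, and no repeat call inside the defined window."""
    if not calls:
        raise ValueError("no calls in the measurement window")
    ok = sum(
        not c.handed_to_human
        and not c.caller_abandoned
        and not c.repeat_call_within_window
        for c in calls
    )
    return ok / len(calls)


# Hypothetical mix: 60 clean, 20 handed over, 15 abandoned, 5 called back.
calls = (
    [Call(False, False, False)] * 60
    + [Call(True, False, False)] * 20
    + [Call(False, True, False)] * 15
    + [Call(False, False, True)] * 5
)

print(naive_containment(calls))      # 0.8
print(qualified_containment(calls))  # 0.6
```

On this made-up mix the naive figure is 80% and the qualified figure is 60%. The twenty-point difference is made up entirely of callers who gave up or had to call again.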

The through-line is simple and worth stating plainly: in a voice AI SLA, the definition is the control. The number is downstream. A team that spends its negotiation energy haggling the latency target from 800ms to 750ms while leaving "latency" defined as a server-side average has optimised the wrong variable entirely.

Who measures? The source-of-truth problem

Here is the question that decides whether any of the above is real: when the SLA says latency was within target last month, who says so? If the answer is "the vendor's dashboard," you do not have an SLA. You have a vendor marketing surface with a number on it, and a process by which the party being measured grades its own homework.

Every service level in the schedule needs a named source of truth, and the contract has to name it. There are four broad options, in ascending order of buyer protection:

Vendor telemetry, self-reported. The default, and the weakest. Accept it only for low-criticality tiers, and only with an audit right attached.
Vendor telemetry, customer-auditable. The vendor measures, but you have a contractual right to inspect the raw event data — the call logs, the turn timestamps, the outcome tags — and to reconcile it against your own records. This is where the auditability and explainability that most vendors fail at procurement becomes the load-bearing requirement: if the underlying events are not logged in a form you can inspect, the audit right is decorative.
Customer-side instrumentation. You measure independently — latency from your telephony edge, outcomes from your own CRM writeback, containment from your own transcript tagging. The vendor's number and yours are reconciled on a cadence. This is the strongest practical model for the metrics that matter, and it is increasingly the norm in regulated deployments where data retention and governance obligations already require you to hold the underlying records.
Third-party or shared measurement. A neutral monitoring layer both parties trust. Rare, expensive, reserved for the highest-criticality programmes.

Naming the source of truth also forces three sub-decisions the SLA must settle: the measurement window (rolling 30-day beats calendar-month, which can be gamed by timing maintenance and incidents), the reconciliation process (what happens when your number and the vendor's disagree — and they will), and the dispute window (how long either party has to challenge a reported figure before it is final). An SLA that names a metric but not a source of truth, a window, and a reconciliation path has specified an aspiration, not an obligation.
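The reconciliation path and the dispute window can be sketched as a small decision function. Everything named here is an assumption for illustration: the `ReportedFigure` fields, the tolerance, the 30-day default, and the status strings would all be set by the contract, not by this code.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class ReportedFigure:
    metric: str
    vendor_value: float
    customer_value: float
    reported_on: date


def reconcile(figure: ReportedFigure, tolerance: float, today: date,
              dispute_window_days: int = 30) -> str:
    """Classify a reported figure as agreed, disputed, or final."""
    if abs(figure.vendor_value - figure.customer_value) <= tolerance:
        return "agreed"
    if today > figure.reported_on + timedelta(days=dispute_window_days):
        # Nobody challenged in time, so the reported figure stands.
        return "final"
    return "disputed"


# Hypothetical month: vendor reports P95 of 780 ms, customer edge measures 910 ms.
figure = ReportedFigure("p95_latency_ms", 780, 910, date(2026, 1, 31))

print(reconcile(figure, tolerance=50, today=date(2026, 2, 10)))  # disputed
print(reconcile(figure, tolerance=50, today=date(2026, 3, 15)))  # final
```

The second call is the cautionary one: a 130 ms disagreement that nobody raised inside the window becomes the number of record.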

The exclusions trap: how a strong SLA gets hollowed out

You can define every metric perfectly and name a clean source of truth and still end up with an SLA that never pays out — because the exclusions clause quietly removes everything that would have triggered it. Exclusions are where vendors recover at the bottom of the contract what they conceded at the top.

Watch for four in particular. Maintenance windows that are unbounded ("scheduled maintenance is excluded from availability") let a vendor schedule degradation out of the metric; cap the total excluded hours and require advance notice. Force majeure drafted broadly enough to cover a routine cloud-provider blip turns ordinary infrastructure risk into excused downtime; tie it to genuinely extraordinary events. Third-party dependency carve-outs are the most dangerous for voice specifically — a vendor that excludes "degradation caused by upstream model providers or telephony carriers" has excluded the two most common failure modes of a voice agent, because the model and the carrier are the product. The carve-out is reasonable only if paired with a vendor obligation to manage those dependencies and a fallback design. And "caller behaviour" exclusions — degradation attributed to accents, background noise, or unusual phrasing — excuse the vendor from exactly the real-world conditions the agent exists to handle.
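To see what a cap on maintenance exclusions does to the number, here is a rough Python sketch. The formula is one plausible reading (excluded hours are removed from the measured period, and maintenance beyond the cap counts as downtime); contracts differ, so treat the function, its parameters, and the sample hours as assumptions.

```python
from typing import Optional


def availability_pct(period_hours: float, unplanned_down_hours: float,
                     maintenance_hours: float,
                     exclusion_cap_hours: Optional[float] = None) -> float:
    """Availability after maintenance exclusions, with an optional cap on them."""
    if exclusion_cap_hours is None:
        excluded = maintenance_hours  # unbounded exclusion: all maintenance drops out
    else:
        excluded = min(maintenance_hours, exclusion_cap_hours)
    # Maintenance beyond the cap is treated as ordinary downtime.
    counted_downtime = unplanned_down_hours + (maintenance_hours - excluded)
    measured_hours = period_hours - excluded
    return 100 * (1 - counted_downtime / measured_hours)


# Hypothetical 30-day window: 2 h unplanned downtime, 10 h of "scheduled maintenance".
uncapped = availability_pct(720, 2, 10)
capped = availability_pct(720, 2, 10, exclusion_cap_hours=4)
print(f"unbounded exclusion: {uncapped:.2f}%")
print(f"4-hour cap:          {capped:.2f}%")
```

With these made-up hours the unbounded exclusion reports about 99.72%, while a 4-hour cap reports about 98.88% for the same month.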

The exclusions test

Before you sign, run every exclusion through one question: does this carve-out remove a failure mode the vendor controls?

If yes, it does not belong in exclusions — it belongs in the SLA, with a credit attached.
If it removes a failure mode genuinely outside the vendor's control (a customer-side system, a true external catastrophe), it is fair — but it must be specific, bounded, and paired with a fallback obligation.
An unbounded, unspecified exclusion is a liability cap wearing a different hat. Treat it as one.

Tier the SLA to business criticality — don't gold-plate everything

Not every workflow deserves the same service level, and pretending otherwise is how SLA design becomes unaffordable theatre. A voice agent confirming appointment times and one handling regulated financial servicing carry different consequences when they degrade, and the SLA — and the price you pay for it — should reflect that. Map your call types to a small number of tiers by business criticality, and buy the strict service levels only where the P&L or the regulatory exposure justifies them.

The discipline here is the same one that governs voice AI ROI attribution: you protect most heavily the calls that move the most value or carry the most risk. A gold tier — tightest latency, highest containment-quality floor, fastest time-to-restore, customer-side measurement — for revenue-critical or compliance-critical journeys. A standard tier for high-volume, lower-stakes traffic. The mistake is uniform gold-plating, which inflates cost without changing outcomes on the calls that do not need it, and which shows up later as one of the hidden costs of voice AI total cost of ownership that nobody modelled. Tiering is also the answer to the CFO's procurement questions on voice AI: you are not buying one expensive SLA, you are buying differentiated protection matched to differentiated value.
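A tiered schedule is, structurally, a lookup from call type to a set of service levels. The sketch below shows that shape in Python; the tier contents, the latency figures, and the call-type names are placeholders we invented, not recommended targets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    p95_latency_ms: int  # placeholder target, not a benchmark
    measured_by: str     # the named source of truth for this tier


# Hypothetical tiers and call-type mapping, for illustration only.
TIERS = {
    "gold": Tier("gold", 800, "customer-side instrumentation"),
    "standard": Tier("standard", 1200, "vendor telemetry, customer-auditable"),
}

CALL_TYPE_TIER = {
    "regulated_financial_servicing": "gold",
    "appointment_confirmation": "standard",
}


def tier_for(call_type: str) -> Tier:
    """Look up the tier for a call type; unmapped call types fall back to standard."""
    return TIERS[CALL_TYPE_TIER.get(call_type, "standard")]


print(tier_for("regulated_financial_servicing").measured_by)
print(tier_for("appointment_confirmation").p95_latency_ms)
```

Defaulting unmapped call types to the standard tier is a design choice worth making deliberately: it keeps new journeys from silently inheriting gold-tier cost, at the price of having to promote them explicitly.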

Make the credit fire — the design view, not the clause view

Service credits are where SLA design hands off to contract drafting, and it is worth being precise about the boundary. The mechanics of the credit (how large it must be to change vendor behaviour, why "sole and exclusive remedy" framing must be rejected, when chronic breach should trigger termination) are clause-drafting questions, and they are covered in the MSA clause set enterprise legal demands. We do not re-litigate them here. The SLA-design question is narrower and prior: will the credit ever fire at all?

A credit only triggers if three things are true, and all three are decided in the schedule, not the clause. First, the metric is defined measurably: if "available" means HTTP 200, the credit for an unavailability that was actually a hallucination outbreak never triggers, no matter how large the credit. Second, the breach is detectable by you: if the only measurement is vendor self-report, the breach is detected by the party that owes the credit, which is to say it is not reliably detected at all. Third, the measurement window, the reconciliation process, and the dispute window let a breach be recorded, reconciled, and made final, rather than diluted or timed out. A 25% credit attached to an undefined, vendor-measured, calendar-month metric is worth nothing; a modest credit attached to a tightly defined, customer-measured, rolling-window metric is worth what it says. Design decides whether the clause has anything to grip.

Breach is a governance event, not just a billing event

When a service level breaches, the credit is the least interesting consequence. The interesting consequences are operational and governance. A well-designed SLA wires a breach into three things automatically. It triggers the incident-response runbook — degradation against a service level is an incident, and time-to-restore only means something if a defined response actually starts. It surfaces in the metrics directors actually want in board reporting — service-level adherence is a board-grade signal of programme health, not an operational footnote. And it feeds the KPI framework that runs the programme — a breach is a data point that should change script logic, escalation thresholds, or capacity, not just generate a credit note.

This is the difference between an SLA as a billing instrument and an SLA as a management instrument. The billing version asks "do we get money back?" The management version asks "what does this breach tell us to change, and did our response work?" The schedule you design determines which one you end up with, because only tightly defined, well-measured service levels produce signal clean enough to manage on.

Your side of the SLA: reciprocal obligations and OLAs

A vendor SLA binds the vendor only to the extent that you hold up your end, and the well-drafted agreement says so explicitly. The vendor cannot guarantee accuracy if you feed the agent a knowledge base that has not been updated in three months. It cannot guarantee latency if your telephony edge is the bottleneck. It cannot guarantee containment if you change the call flows without agreeing them. These reciprocal obligations are the operating-level agreements (OLAs) that sit underneath the SLA — the internal commitments that make the external commitment achievable.

Name them: your obligations to keep the knowledge base current, to route through agreed infrastructure, to manage change through a defined process rather than ad hoc, and to staff the human side of escalation. The in-house versus vendor operating model decision determines how many of these you own directly, but you own some of them no matter what you buy. And the human side (the OLAs that depend on your own teams behaving differently) is exactly where change management for voice AI deployment decides whether the SLA is achievable or aspirational. An SLA that names only the vendor's duties is half a contract; the other half is the discipline you commit to yourself.

A 90-day plan to put a real SLA in place

Designing a service-level schedule that binds is a project, not a paragraph. Here is the sequence we run with enterprise buyers, in day-bands, so the metrics are baselined before they are committed and the measurement exists before the credit is owed.

Step 01 — Days 0–15: Baseline current performance. Before you can commit to a service level you have to know what the service currently does. Instrument the metrics that matter on your existing traffic — real latency at P95/P99, real qualified containment, real accuracy-under-load on live peak calls. You cannot negotiate a target you have not measured.

Step 02 — Days 15–30: Define the metrics and the source of truth. Write each service level as a sentence first, then a number. For every metric, settle the definition, the percentile or denominator, the measurement window, and — the decisive one — who measures and from which system. This is the deliverable that determines whether the SLA binds.
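One way to keep this step honest is to treat each schedule entry as a record and refuse to call it done while any element is blank. The following Python sketch assumes a simple six-field entry; the field names and the example wording are ours, and a real schedule may need more.

```python
from dataclasses import dataclass, fields


@dataclass
class ServiceLevel:
    name: str
    definition: str = ""       # the sentence, written before the number
    statistic: str = ""        # percentile or denominator
    window: str = ""           # measurement window
    source_of_truth: str = ""  # who measures, and from which system
    target: str = ""           # the number, settled last


def missing_elements(level: ServiceLevel) -> list[str]:
    """Names of the fields still blank; an empty list means the entry is complete."""
    return [f.name for f in fields(level) if not getattr(level, f.name).strip()]


# A number without a sentence, versus a fully specified entry (wording is illustrative).
loose = ServiceLevel(name="latency", target="800 ms")
tight = ServiceLevel(
    name="latency",
    definition="end of caller utterance to start of agent audio",
    statistic="P95 and P99 of turns",
    window="rolling 30 days",
    source_of_truth="customer telephony edge, reconciled with vendor telemetry",
    target="800 ms",
)

print(missing_elements(loose))  # ['definition', 'statistic', 'window', 'source_of_truth']
print(missing_elements(tight))  # []
```

The loose entry is what most SLAs ship with: a target and nothing for it to attach to.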

Step 03 — Days 30–45: Draft the schedule and the tiers. Map call types to criticality tiers, assign service levels per tier, and decide where you pay for gold and where standard is sufficient. Produce the schedule as a table the vendor can respond to clause by clause.

Step 04 — Days 45–60: Negotiate — and hand the clause wording to legal. Take the schedule into the contract. The credit mechanics, sole-remedy rejection, and chronic-breach termination are drafted against the MSA clause set; your job in this band is to defend the definitions and the source-of-truth, because that is where vendors push back hardest and where the SLA lives or dies.

Step 05 — Days 60–75: Instrument the measurement. Stand up the customer-side instrumentation the SLA names, before go-live. If the SLA says you measure latency from your edge, the edge measurement has to exist and be reconciled against the vendor's on day one, not discovered missing at the first dispute.

Step 06 — Days 75–90: Run the first review and a dispute dry-run. Hold the first service-level review, reconcile both parties' numbers, and deliberately walk a hypothetical breach through the whole chain — detection, credit, runbook, board report. The dry-run finds the gaps while they are cheap to fix.

How the measurement layer fits together

The diagram is the whole argument in one picture. The service credit on the bottom-left is the consequence everyone negotiates. The reconciliation box in the middle is the part that decides whether the credit ever fires — and it is the part most SLAs never specify. Design from the middle out, not from the credit back.

The shift that makes SLAs work

The enterprises that get value from voice AI SLAs are not the ones that negotiate the hardest numbers. They are the ones that treat the SLA as a measurement instrument first and a contract second. They define each service level as a sentence before a number. They name a source of truth that is not the vendor's own dashboard. They close the exclusions that would otherwise hollow the whole thing out. They tier protection to value instead of gold-plating everything. And they wire a breach into a runbook and a board report, not just a credit note.

Do that, and the SLA stops being a document filed after signature and starts being the instrument that runs the programme — the thing that tells you, every month, in numbers you trust, whether the service you bought is the service you are getting. That is the difference between an SLA that looks like protection and one that is.

Frequently asked questions

What is the difference between an uptime SLA and a latency SLA for voice AI?

An uptime (availability) SLA measures whether the service responds — the API is up, calls connect. A latency SLA measures how fast the agent responds within a call, which for voice is the difference between a usable and an unusable experience. A voice agent can be 99.99% "up" while every call lags two seconds and feels broken. For voice specifically, availability should be redefined as successful call completion, and latency — measured at P95/P99 from end-of-caller-speech to start-of-agent-response — is the service level that actually reflects quality. See our deeper treatment of voice agent latency benchmarks.

Who should measure the SLA — the vendor or the buyer?

An SLA measured solely by the vendor's own dashboard is the party being measured grading its own homework. At minimum, demand customer-auditable vendor telemetry with a contractual right to inspect raw event data. For the metrics that matter most — latency, containment, accuracy — customer-side instrumentation reconciled against the vendor's figures on a rolling window is the strongest practical model. The contract must name the source of truth, the measurement window, and the reconciliation and dispute process explicitly.

What service credit is "enough"?

Credit size is a contract-drafting question covered in the MSA clause set — broadly, large enough to change vendor behaviour and never framed as the sole and exclusive remedy. But credit size is downstream of design. A large credit attached to an undefined, vendor-measured metric is worth nothing, because the breach never registers. Fix the definition and the source of truth first; the credit only has value once the metric can actually trigger it.

Should every voice AI workflow have the same SLA?

No. Uniform gold-plating inflates cost without improving outcomes on calls that do not need it. Map call types to a small number of criticality tiers and buy the strictest service levels — tightest latency, highest containment-quality floor, customer-side measurement — only where the revenue or regulatory exposure justifies them. Tiering matches protection to value and is usually the more defensible answer to finance.

How is an SLA different from the MSA?

The MSA is the master agreement — the full set of legal clauses governing data ownership, indemnity, liability, exit, and the contractual mechanics of service credits and termination. The SLA is the schedule of service levels the agreement commits to: the metrics, definitions, targets, measurement methods, and tiers. This post designs the SLA schedule as a measurement instrument; the 11 MSA clauses enterprise legal demands covers the surrounding contract. You need both, and the SLA only binds if the MSA carries it correctly.

Written by the Dilr.ai engineering team — practitioners who ship enterprise AI in production and instrument the service levels that prove it. Follow us on LinkedIn for shipping notes, or subscribe via the RSS feed.