Bland AI published its Fluent transcription benchmark in April 2026, posting a 5.9% Word Error Rate across 250+ hours of production audio. Inside a fortnight, every voice AI procurement deck we saw at upper-mid-market clients had a new line item: "WER under X%". One CTO at a London insurer asked us, plainly, what number he should write next to it.
The honest answer is: that line item is the wrong line item. WER is a transcription metric. Enterprise voice AI is an operations product. The gap between those two ideas is where most pilots quietly fail — not because the model heard the wrong words, but because the system wrote the wrong field into the CRM, summarised the wrong intent into the case note, or escalated the wrong call to the wrong human.
This is the operational buyer's guide to voice AI accuracy. Four dimensions. Four metrics. Four pass bars an enterprise procurement team can defend at audit.
This guide is shipped by the team behind Dilr Voice — the enterprise voice AI platform we run for regulated UK and EMEA buyers. For the productised deployment surface our DATS clients use, see voice AI agents.
The same operational lens sits underneath the enterprise AI voice agents complete guide — the pillar piece this article extends. If you have not read that yet, read it after this one. It frames the architecture; this piece frames the evaluation.
WER measures whether the system heard the customer correctly. It does not measure whether anything useful happened next. Enterprises rolling out voice AI in 2026 should score four accuracy dimensions — transcription, CRM write, summary, and escalation — and only the first is WER.
Why WER alone misleads enterprise buyers
WER is the percentage of words a speech-to-text system gets wrong, calculated as (substitutions + deletions + insertions) divided by reference words. It is a useful, decades-old metric for benchmarking transcription models in isolation. On clean datasets like Common Voice, top systems land at 3–7% WER; on long-form domain audio like earnings calls, the same systems run 9–14% (AssemblyAI, 2026). Bland's Fluent posted 5.9% on a 250-hour mixed corpus.
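For buyers who want the arithmetic on their own golden set rather than a vendor's corpus, WER is simple enough to compute in-house. A minimal word-level edit-distance sketch, leaving out the text normalisation (casing, punctuation, number formats) a production harness would also need:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (S + D + I) / N via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution or match
    return dp[-1][-1] / max(len(ref), 1)

# One substitution in four reference words:
# wer("pay five hundred pounds", "pay nine hundred pounds") == 0.25
```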
These are the numbers vendors put on their landing pages. They are not the numbers your operations director cares about.
An enterprise voice AI deployment is a five-stage pipeline: audio captured, words transcribed, intent extracted, action taken (a CRM update, a calendar booking, a workflow handoff), and a record written. WER measures stage two. Stages three through five are where money is made or lost. A system can post a 4% WER and still write the wrong account number into Salesforce because the slot-filling logic confused two adjacent digits under noise. A system can post an 8% WER and still hit 97% CRM field accuracy because its post-processing logic catches and normalises the errors that matter.
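The post-processing that separates those two systems is worth seeing concretely. Below is a minimal guard for one numeric slot; the function name and the confidence floor are our illustrative choices, not any vendor's API, though the 8-digit rule is the real UK account-number format:

```python
import re
from typing import Optional

def normalise_uk_account_number(raw: str, asr_confidence: float,
                                min_confidence: float = 0.85) -> Optional[str]:
    """Post-processing guard for a numeric CRM slot: strip non-digit noise,
    enforce the 8-digit UK account-number format, and refuse to write
    anything below an ASR-confidence floor (0.85 is illustrative)."""
    digits = re.sub(r"\D", "", raw)  # spelled-out digits resolved upstream
    if len(digits) != 8 or asr_confidence < min_confidence:
        return None  # route to human review instead of writing bad data
    return digits
```

The design point is the `None` branch: a guarded system turns a transcription error into a review task, while an unguarded one turns it into a corrupted record.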
Buyers who anchor procurement to WER alone are optimising for a number that lives upstream of every outcome that hits their P&L. The right approach — and the one that survives an audit at an FCA-regulated insurer or NHS trust — is to evaluate accuracy at the layer the business actually consumes. That is the discipline behind our AI placement diagnostic, and the reason we have moved every regulated client onto a four-metric scorecard before any production traffic flips.
What the operational stack actually does
Inside Dilr Voice, the audio-to-action pipeline runs through five layers: real-time transcription, intent classification, slot filling, downstream system write, and post-call summary plus escalation routing. Each layer has its own error mode. You can read the architecture in our note on real-time transcription as the data layer — that piece explains why transcription confidence scores feed every layer above it.
WER, as multiple speech engineering teams have noted publicly through 2026, is "broken" for evaluating modern voice agents because it cannot distinguish a substitution that flips meaning from a substitution that does not. Semantic WER frameworks — using a reasoning model to judge meaning preservation — are emerging, but no enterprise buyer should wait for a perfect upstream metric when the downstream metrics are already measurable today.
The four accuracy dimensions enterprises must score
Below is the scorecard we run for clients before any production traffic flips. Each dimension has a definition, a test method, a pass bar, and an enterprise consequence when it fails. Read it as a procurement spec, not a feature list.
| Dimension | What it measures | How to test | Pass bar | Failure cost |
|---|---|---|---|---|
| Transcription (WER) | Words correctly transcribed from production audio | 50+ call golden set, accent and noise stratified | < 8% on real prod audio | Every downstream layer inherits the error |
| CRM field-write accuracy | Structured slots written correctly to system of record | Compare AI writes vs human annotator on 500 calls | > 96% on numeric/ID slots | Bad data poisons reporting, billing, KYC |
| Summary correctness | Call summary faithful to what actually happened | LLM-as-judge plus 10% human spot check | > 92% factually faithful | Wrong case notes, wrong handoff context |
| Escalation trigger F1 | Correct routing of calls that need a human | Confusion matrix on labelled outcome set | F1 > 0.90, miss-rate < 3% | Vulnerable customer reaches the wrong team |
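If you want that table as a machine-checkable gate rather than a slide, it collapses to a few lines. A sketch using the pass bars exactly as stated above; the field names are ours:

```python
from dataclasses import dataclass

@dataclass
class AccuracyScorecard:
    """The four pass bars from the table above, as a machine-checkable gate."""
    wer: float                  # fraction, on the production golden set
    crm_field_accuracy: float   # fraction, numeric/ID slots
    summary_faithfulness: float # fraction judged factually faithful
    escalation_f1: float
    escalation_miss_rate: float

    def passes(self) -> bool:
        return (self.wer < 0.08
                and self.crm_field_accuracy > 0.96
                and self.summary_faithfulness > 0.92
                and self.escalation_f1 > 0.90
                and self.escalation_miss_rate < 0.03)
```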
The same scorecard underpins our voice AI agent quality scoring work — that piece details the automated QA infrastructure clients run in production to keep these four numbers within bar.
Dimension one: transcription accuracy
WER still belongs on the scorecard — but only as one of four. The test that matters is WER on your production audio, not on the vendor's curated corpus. A 50-call golden set, stratified by accent, channel quality, and background noise, will tell you in an afternoon what the vendor's marketing page cannot. UK and EMEA buyers should add a layer specifically for non-native English and code-switching — Speechmatics' 2025 enterprise review noted that buyers consistently underweight regional accent performance until production volumes expose it.
Dimension two: CRM field-write accuracy
This is the metric most enterprise buyers do not measure and most operations directors lose sleep over. Voice AI's commercial value is the structured data it puts into Salesforce, Dynamics, Veeva, or whatever the system of record is — appointment time, account number, opt-in flag, callback preference. Field-write accuracy is not WER; it is whether the right value landed in the right field. Test it by sampling 500 calls, having a human annotator score the structured writes, and comparing to the AI's writes. The pass bar is higher than the transcription bar because the downstream cost is higher: every mis-written account number is a billing complaint.
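Scoring it is deliberately unglamorous: exact match over aligned structured writes. A sketch, assuming both sides have been normalised to the same value formats before comparison:

```python
def field_write_accuracy(ai_writes: list[dict], human_writes: list[dict],
                         slots: list[str]) -> float:
    """Exact-match accuracy over structured slots, AI vs human annotator.
    Assumes the two lists are aligned call-by-call and pre-normalised."""
    total = correct = 0
    for ai, human in zip(ai_writes, human_writes):
        for slot in slots:
            total += 1
            correct += ai.get(slot) == human.get(slot)
    return correct / max(total, 1)

# e.g. field_write_accuracy(ai, gold, slots=["account_number", "callback_time"])
```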
Dimension three: summary correctness
Post-call summaries are how voice AI feeds the rest of the operations stack — handoff notes, case files, manager review, audit log. They are also where hallucination shows up most aggressively. The test is LLM-as-judge with a 10% human spot check: an evaluator model scores summaries against the call transcript, and humans validate the evaluator's judgements on a sample. We have written separately on voice AI hallucination as a procurement gate — the short version is that summaries should be the gate, because they are where fabrication has the most operational consequence.
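The harness has two moving parts: an evaluator over (transcript, summary) pairs, and a random sample routed to humans to validate the evaluator itself. A sketch with the judge left pluggable, since the evaluator model is a choice, not a given:

```python
import random

def score_summaries(calls, judge_fn, spot_check_rate: float = 0.10):
    """LLM-as-judge over (transcript, summary) pairs, plus a 10% random
    sample queued for human validation of the judge's own verdicts.
    `judge_fn` is whatever evaluator you run; it returns True if the
    summary is factually faithful to the transcript."""
    faithful, spot_check_queue = 0, []
    for transcript, summary in calls:
        if judge_fn(transcript, summary):
            faithful += 1
        if random.random() < spot_check_rate:
            spot_check_queue.append((transcript, summary))
    return faithful / max(len(calls), 1), spot_check_queue
```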
Dimension four: escalation trigger accuracy
This is the safety dimension. Voice AI must hand the call to a human when the customer is vulnerable, when the request exceeds the agent's scope, or when the conversation is going sideways. Measure it with a labelled outcome set and an F1 score, weighted toward recall — a missed escalation is far worse than a false-positive one. The same logic appears in our note on AI voice sentiment analysis, which is one of the inputs into the escalation classifier.
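In scoring terms, "weighted toward recall" means F-beta with beta above 1 rather than plain F1. A sketch from raw confusion-matrix counts; beta = 2 is our illustrative weighting, not a standard:

```python
def escalation_fbeta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta on the escalation trigger. beta > 1 weights recall over
    precision, matching the asymmetry: a missed escalation (fn) costs
    far more than a false hand-off to a human (fp)."""
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    if precision + recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# Miss-rate bar from the table: fn / (tp + fn) must stay below 0.03
```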
The same diagnostic logic underpins our AI operating model consulting — a structured assessment of where AI fits inside the enterprise system before any rollout commitment. We use the four-dimension scorecard as the entry condition for production deployments.
Mapping the four dimensions to the QA test stack
Each pass bar should be re-evaluated quarterly against the most recent 30 days of production traffic. Drift is the silent failure mode of voice AI — a system that passes at week one and drifts by week twelve will not be caught by anyone who is not measuring continuously. This sits inside the broader AI voice program KPI framework we run with operating-office clients.
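Operationally, that is a rolling re-run of the same gate. A sketch building on the AccuracyScorecard above; the unweighted daily average is a simplification, and a production harness would weight by call volume:

```python
from datetime import date, timedelta

def rolling_gate_check(daily_scores: dict[date, AccuracyScorecard],
                       as_of: date) -> bool:
    """Re-run the four-bar gate over the most recent 30 days of
    production traffic. A failing window is the drift signal."""
    window = [s for d, s in daily_scores.items()
              if as_of - timedelta(days=30) <= d <= as_of]
    if not window:
        return False  # no traffic scored means no evidence of passing
    n = len(window)
    mean = AccuracyScorecard(
        wer=sum(s.wer for s in window) / n,
        crm_field_accuracy=sum(s.crm_field_accuracy for s in window) / n,
        summary_faithfulness=sum(s.summary_faithfulness for s in window) / n,
        escalation_f1=sum(s.escalation_f1 for s in window) / n,
        escalation_miss_rate=sum(s.escalation_miss_rate for s in window) / n,
    )
    return mean.passes()
```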
Why UK and EMEA buyers should weight CRM accuracy higher
In our deployments across UK financial services and EU regulated industries, CRM field-write accuracy is the dimension with the largest gap between vendor benchmark and production reality. Vendors benchmark on US English, single-speaker, low-noise audio; UK enterprise audio is multi-accent, often cellular, and frequently includes regulated terminology (FCA suitability language, ICO consent phrasing) that generic models mishandle. A 4% relative gap on field-write accuracy translates directly to FCA-reportable data quality issues if the call concerns a financial product. The FCA AI governance 2026 framework treats this as an explicit accountability vector. The ICO AI Code of Practice reinforces it from a data protection angle.
If your procurement RFP does not specifically require the vendor to demonstrate field-write accuracy on UK English production audio with regulated-product slot definitions, it is not a UK procurement RFP. The same point holds for the enterprise voice AI vendor checklist we recommend buyers run.
How this plays out in practice
A UK challenger bank ran a 30-day evaluation across three voice AI vendors for inbound card-services calls. Vendor A posted the best WER (4.1%). Vendor B posted 6.8%. Vendor C posted 7.4%. Procurement instinct said Vendor A. When the team scored the four-dimension scorecard on 500 production calls, Vendor B won decisively: Vendor A's 4.1% WER translated to only 91% CRM field-write accuracy because its slot-filling logic over-confidently wrote noisy phone-number fields; Vendor B's higher WER but stronger post-processing landed 97.2% field-write accuracy and a 0.94 escalation F1. The bank picked Vendor B. Six months in, the production scorecard remains within bar. Vendor A would have shipped a billing-data-quality problem the bank's risk function would have caught in quarter two.
Want to see this in production? Try Dilr Voice live on your own audio, book an AI placement diagnostic, see our DATS methodology, or read about our approach to placing AI inside regulated enterprise systems.
If you would prefer a conversation before the diagnostic, the fastest path is to book a scoping call with the team. Thirty minutes, no deck. We will tell you whether the four-metric scorecard fits your situation and where the EBIT actually moves.
Score voice AI accuracy on the metrics that move your P&L.
30-min scoping call. No deck. We will run the four-dimension scorecard against your shortlist and tell you which vendor actually clears the bar.
Written by the Dilr.ai engineering team — practitioners who ship enterprise voice AI in production for UK and EMEA regulated buyers. Follow us on LinkedIn for shipping notes, or subscribe via the RSS feed.