Vapi processes more than 62 million calls a month. Retell, Bland and ElevenLabs sit close behind. Between them, they have made one decision feel inevitable for any enterprise evaluating voice AI: pick an orchestration layer, plug in your speech-to-text, your model, your voice, your telephony, and ship.
That framing is wrong for most enterprise buyers. The question is not which orchestrator wins the developer benchmark. The question is whether you want to operate orchestration at all — or buy a platform that has already done it, priced the outcome, and put one accountable vendor on the contract.
This is the build-vs-buy decision wearing a different hat. And in 2026, with Vapi pricing teardowns showing $0.05/min orchestration on top of provider costs that often double or triple the visible per-minute number, the maths increasingly favours platforms — not because orchestration is bad, but because operating it is a job most enterprises were never staffed to do.
This guide is shipped by the team behind Dilr Voice, an enterprise voice AI platform live across 40+ countries with compliance, QA and integrations bundled. See also Voice AI agents, the platform layer purpose-built for £10k+ ACV deployments.
Orchestrators like Vapi, Retell and Bland sell raw call infrastructure. Platforms sell working voice agents with compliance, QA, integrations and accountability priced in. Pick orchestration if you have a permanent voice engineering team of three or more and a multi-year roadmap. Pick a platform if you want production-grade outbound and inbound automation live in 60 days, and a single throat to choke when something breaks at 02:00.
For deeper context on the category, the pillar enterprise AI voice agents guide covers the full architecture stack. This post focuses on the layer-of-the-stack question that procurement teams keep getting wrong.
What an orchestrator actually sells
Strip the marketing back and Vapi, Retell and Bland are doing the same essential job: they sit between your application and the public switched telephone network, and they coordinate three ML systems (speech-to-text, an LLM, text-to-speech) plus a media pipeline that has to keep round-trip latency under roughly 800ms or the conversation feels broken.
The Vapi vs Retell vs Bland infrastructure analysis measured these systems across 1,200+ test calls. Retell averaged 580–620ms, Vapi hit 500–600ms with optimised provider pairings, and Bland averaged ~800ms with its proprietary in-house stack. These are real engineering wins. They are also wins on a slice of the problem most enterprise buyers do not actually own.
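To make those figures concrete, here is a minimal Python sketch that compares each reported range against the roughly 800ms conversational budget. The latency numbers are the ones quoted above. The `headroom_ms` helper, the data layout and the choice to model Bland's "~800ms" as a single point are illustrative assumptions, not any vendor's API.

```python
# Approximate round-trip budget from the text, in milliseconds.
BUDGET_MS = 800

# Reported average ranges (low, high) in ms, as quoted above.
# Assumption: Bland's "~800ms" is modelled as a single point.
REPORTED_MS = {
    "Retell": (580, 620),
    "Vapi": (500, 600),
    "Bland": (800, 800),
}


def headroom_ms(latency_range, budget=BUDGET_MS):
    """Return the spare milliseconds at the slow end of a (low, high) range."""
    _, high = latency_range
    return budget - high


for vendor, latency in REPORTED_MS.items():
    print(f"{vendor}: {headroom_ms(latency)} ms of headroom against {BUDGET_MS} ms")
```

Measured against the slow end of each range, Retell has 180ms of headroom, Vapi has 200ms and Bland has none. Anything your own integration adds on top (CRM lookups, tool calls) has to fit inside that margin.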
Here is what an orchestrator does not give you, even on the best day.
What stays on your plate
- Prompt engineering, flow design and red-teaming. The orchestrator runs the call. You write the agent.
- Compliance posture. GDPR lawful basis, PECR consent for outbound, EU AI Act Article 50 disclosure, TCPA written consent for the US, call recording retention. The orchestrator is a processor; you are the controller.
- Quality assurance. Sampling calls, scoring sentiment, catching hallucinations, retraining when an upstream model drifts. None of that ships in the box.
- Integration with the systems of record. CRM writes, calendar bookings, ticketing, billing — all custom work.
- Telephony economics. Carrier rates, number provisioning, DNC scrubbing, fraud handling, international routing.
- Vendor management. When the TTS provider degrades and your orchestration vendor blames the LLM provider and the LLM provider blames the carrier, you are the one chasing all three.
Why the per-minute number lies
The visible $0.05/min orchestration fee is only one line item in the actual cost stack, and rarely the largest. Layer in TTS at $0.07–0.18/min, STT at $0.02–0.04/min, telephony at $0.01–0.03/min, LLM tokens that grow with conversation length, and the engineering FTEs to operate it. The metered items alone come to $0.15–0.30/min before LLM and staffing, so the realistic loaded cost lands closer to $0.20–0.40/min for a non-trivial enterprise deployment. We covered this in detail in voice AI total cost of ownership; read it before you sign anything.
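The arithmetic behind that range can be sketched in a few lines of Python. The per-minute ranges are the ones quoted above. The function names are hypothetical, and the LLM and engineering figures in the usage lines are placeholders to replace with your own numbers, since the text gives no per-minute value for either.

```python
from decimal import Decimal

# Per-minute ranges in USD (low, high), taken from the paragraph above.
COST_STACK = {
    "orchestration": (Decimal("0.05"), Decimal("0.05")),
    "tts": (Decimal("0.07"), Decimal("0.18")),
    "stt": (Decimal("0.02"), Decimal("0.04")),
    "telephony": (Decimal("0.01"), Decimal("0.03")),
}


def metered_range(stack):
    """Sum the low and high ends of every metered line item."""
    low = sum((lo for lo, _ in stack.values()), Decimal("0"))
    high = sum((hi for _, hi in stack.values()), Decimal("0"))
    return low, high


def loaded_range(stack, llm_per_min, engineering_per_min):
    """Add LLM and amortised engineering cost (both caller-supplied) to each end."""
    low, high = metered_range(stack)
    extra = llm_per_min + engineering_per_min
    return low + extra, high + extra


low, high = metered_range(COST_STACK)
print(f"Metered stack: ${low}-${high} per minute")

# Placeholder inputs only: substitute your own LLM and staffing estimates.
print(loaded_range(COST_STACK, Decimal("0.03"), Decimal("0.05")))
```

On these figures the metered stack comes to $0.15–0.30/min, so the headline $0.05 fee is between one sixth and one third of the metered total before LLM tokens and staffing are counted.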
The enterprise decision framework
The honest answer is not "platforms always". Orchestration is the right call in specific, narrow conditions. Here is the decision tree we walk enterprise buyers through.
Three conditions where orchestration is the right call:
- You have voice AI engineering as a permanent function. Not a project. A team of three or more, with telephony, ML and compliance experience, who will be there in three years.
- Voice is product, not operations. If you are an outbound dialler company or a contact-centre-as-a-service vendor, you must own the stack. If you are a SaaS company automating internal call workflows, you almost certainly should not.
- You have a multi-vendor risk appetite. Orchestration is by definition a multi-vendor architecture. Each provider has its own SLA, pricing model and failure mode. Your procurement team has to be comfortable with that surface area.
Three conditions where a platform is the right call:
- You want outcomes priced, not minutes. Enterprise procurement runs on ACV and payback period. Orchestration runs on usage variance.
- You need compliance baked in, not bolted on. UK and EU buyers in regulated verticals — finance, healthcare, property, insurance — cannot make consent capture, data residency and disclosure into engineering tickets. They need them shipped.
- You want one vendor accountable for the call working. When the agent hangs up mid-conversation, you want one number to call, not five.
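The two checklists above can be captured as a small screening function. This is a sketch under stated assumptions, not a formal scoring model: the field names are hypothetical, and the rule that orchestration needs all three of its conditions while any single platform condition tips the other way is one interpretation of "specific, narrow conditions".

```python
from dataclasses import dataclass


@dataclass
class BuyerProfile:
    # Conditions that favour orchestration.
    permanent_voice_team: bool
    voice_is_product: bool
    multi_vendor_appetite: bool
    # Conditions that favour a platform.
    wants_outcome_pricing: bool
    needs_bundled_compliance: bool
    wants_single_accountable_vendor: bool


def recommend(p):
    """Screen a buyer against both checklists (the combination rule is an assumption)."""
    build = (p.permanent_voice_team, p.voice_is_product, p.multi_vendor_appetite)
    buy = (
        p.wants_outcome_pricing,
        p.needs_bundled_compliance,
        p.wants_single_accountable_vendor,
    )
    if all(build) and not any(buy):
        return "orchestration"
    if any(buy) and not all(build):
        return "platform"
    return "mixed signals: review case by case"


# Example: a regulated buyer with engineers in place but no product-level voice stack.
regulated_buyer = BuyerProfile(True, False, False, True, True, True)
print(recommend(regulated_buyer))
```

The third outcome matters: a buyer who meets every orchestration condition and still wants bundled compliance is not a clean fit for either list, and the function says so instead of forcing an answer.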
Where the UK and EMEA buyer lands differently
US-centric orchestrator marketing assumes a high-tolerance buyer who will trade complexity for control. UK and EMEA enterprise procurement does not work that way. Three pressures push buyers toward platforms:
- GDPR plus the EU AI Act is two compliance regimes, not one. Data residency, lawful basis, special-category data treatment for voice biometrics, and Article 50 disclosure all need to land in the same architecture. Doing this on top of a multi-vendor orchestration stack is theoretically possible and operationally brutal.
- Procurement cycles reward fewer suppliers. Vendor consolidation is a 2026 board-level theme across UK enterprise. A platform replaces five line items on the supplier register; an orchestrator adds them.
- Internal AI maturity is low. Per the ServiceNow Enterprise AI Maturity Index 2026, only 15% of enterprises sit in the "Optimizing" or "Leading" tiers. The other 85% are unlikely to have the in-house capability to run an orchestration stack to a standard their compliance officer will sign off on.
Orchestration vs platform — capability by capability:
| Capability | Orchestration (Vapi / Retell / Bland) | Platform (Dilr Voice) | Enterprise impact |
|---|---|---|---|
| Pricing model | Per-minute orchestration + provider pass-through | ACV with outcome SLAs | Predictable budget vs usage variance — finance prefers ACV |
| Time to first production call | 8–16 weeks with internal team | 2–6 weeks | Payback period 3–4× faster on platform |
| Compliance scope | Customer is controller; vendor is processor only | GDPR, PECR, EU AI Act, TCPA shipped in product | One contract covers regulatory posture |
| QA + analytics | Bring your own | Sentiment, transcript review, drift alerts in-platform | QA team headcount avoided |
| Integrations (CRM, calendar, telephony) | Custom engineering | Pre-built connectors | 6–9 months saved per major integration |
| Vendor management surface | 5–7 vendors | 1 vendor | One throat to choke at 02:00 |
| Best fit | Voice-AI-as-product companies, R&D-heavy teams | Enterprise operations automating £1M+ in call workflows | Different buyer, different choice |
How this plays out in practice
A UK financial-services client came to us last quarter with a Vapi pilot already running. Three engineers had been on it for four months. They had latency under 600ms — a real engineering achievement. They did not have: PECR consent capture for outbound, retention controls that satisfied their DPO, a sentiment QA pipeline, or a CRM write-back that worked for more than 90% of calls.
The pilot worked technically and failed commercially. Their executive sponsor could not approve production rollout because the compliance and QA gaps were a 6-month rebuild on top of the engineering already spent. The honest version of their TCO was £400k+ to get to the place a platform would have started.
We replaced the pilot with Dilr Voice in 5 weeks. They went live across two business lines, with PECR-compliant consent, in-platform QA scoring, and CRM integration shipped. Net Year 1 cost: roughly 60% of the rebuild estimate, with a hard ACV instead of a usage curve.
That is not a slight on Vapi. Vapi is excellent at what it does. It is the wrong layer of the stack for a regulated UK enterprise that wants outcomes — and the right layer for a developer building a voice-first product.
If you are mid-evaluation on an orchestrator and the compliance and QA scope is creeping, that is the signal. Try Dilr Voice against your real use case, book an AI placement diagnostic, or read about our approach to voice AI placement inside enterprise systems.
For the procurement-side checklist on either path, the enterprise voice AI vendor checklist covers the questions to ask before signing.
Pick the layer of the stack you actually want to operate.
30-min scoping call. We'll tell you whether orchestration or a platform fits — and if it's a platform, whether Dilr Voice is the right one. No deck. Confidential.
Written by the Dilr.ai engineering team, practitioners who ship enterprise AI in production.