Why voice AI costs spiral faster than expected
Voice AI costs are layered across independent metered systems, and the interaction between those layers is where budgets break. Twilio charges per minute of call time — $0.0085/min inbound, $0.014/min outbound at list price. ElevenLabs charges per 1,000 characters of text synthesised — approximately $0.30/1K chars for the Turbo tier (eleven_turbo_v2_5) and roughly $0.50/1K for the Standard multilingual tier. Your voice AI orchestration platform charges per minute of agent time or per call. A contact centre layer adds another metered variable on top.
Where the compounding happens: your AI agent's response length is controlled by an LLM that is optimising for helpfulness, not for character count. An agent answering a question about appointment availability might generate a 60-character response on a straightforward query and a 600-character response on an ambiguous one. The 10x response length difference drives a 10x TTS cost difference on that turn. Across hundreds of calls per day, this variance creates cost profiles that look random in aggregate but are entirely predictable at the individual configuration level.
Teams that discover their actual per-call cost breakdown for the first time — often by accident, triggered by an unexpectedly large invoice — almost universally find one or two agent configurations consuming 40–60% of their TTS budget while representing 20–30% of their call volume. These configurations were never audited because the cost was invisible in the aggregate. They persist for months before anyone looks at the right data.
The difference between aggregate spend reports and per-call cost tracking
Monthly spend reports are what your finance team needs for budget reconciliation. They are nearly useless for operational cost management. A €12,000 monthly ElevenLabs bill tells you that you used a lot of TTS. It does not tell you which agent was responsible, which call types were most expensive, or whether the spend corresponds to any measurable business outcome.
Per-call cost tracking answers the questions the aggregate cannot. Which specific agent configurations are generating the highest TTS character counts per call? What is the correlation between response length and conversion rate — are the most expensive calls also the most effective, or are you paying more for worse results? What percentage of your monthly TTS spend went on calls that ended in under 30 seconds before any value was delivered?
The practical decision-making difference: with aggregate spend, you can decide to talk to ElevenLabs about your contract. With per-call cost data, you can decide to add one short instruction to your agent system prompt — 'be concise, under 80 words' — and watch your ElevenLabs bill drop 35% in the following billing cycle. The former is a negotiation. The latter is a configuration change deployable in five minutes.
Building a per-call cost calculation
Per-call cost tracking requires three data points per call: Twilio call duration (available from the status callback payload as CallDuration in seconds), ElevenLabs character count for all TTS generations on that call (available from the ElevenLabs history API), and the call outcome (converted, not converted, abandoned — from your CRM or call disposition data).
Direct cost per call = (CallDuration/60 × Twilio per-minute rate) + (total_characters/1000 × ElevenLabs per-1K-character rate) + any fixed per-call orchestration fees. For a 2-minute call with a 250-character total TTS input: (2 × $0.0085) + (0.25 × $0.30) = $0.017 + $0.075 = $0.092 direct cost.
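The formula above can be sketched in a few lines of Python. The rates are the list prices quoted in this article; substitute your own contract rates, and pass any fixed orchestration fee your platform charges per call.

```python
TWILIO_INBOUND_PER_MIN = 0.0085   # USD per minute, Twilio inbound list price
ELEVENLABS_PER_1K_CHARS = 0.30    # USD per 1,000 characters, Turbo tier

def direct_cost_per_call(call_duration_s: int, tts_characters: int,
                         per_call_orchestration_fee: float = 0.0) -> float:
    """Direct cost of one call: telephony + TTS + fixed orchestration fees."""
    telephony = (call_duration_s / 60) * TWILIO_INBOUND_PER_MIN
    tts = (tts_characters / 1000) * ELEVENLABS_PER_1K_CHARS
    return telephony + tts + per_call_orchestration_fee

# The worked example from the text: a 2-minute call, 250 TTS characters.
cost = direct_cost_per_call(call_duration_s=120, tts_characters=250)
print(f"${cost:.3f}")  # → $0.092
```

The duration comes from the CallDuration field of the Twilio status callback; the character count is summed across every TTS generation on the call.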
Cost per converted call = sum of direct costs for all calls in a cohort / number of calls that converted. This is the metric that forces the honest evaluation of whether your voice AI investment is producing returns. A cohort with 18% conversion and $0.092 average cost per call has a cost per converted call of $0.51. A cohort with 8% conversion and $0.08 average cost per call has a cost per converted call of $1.00. The cheaper-per-call cohort is twice as expensive per conversion — a fact invisible in the per-call metric.
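The cohort comparison above can be reproduced directly. This is a minimal sketch of the metric, using the two cohorts from the text (100 calls each, at their stated average costs and conversion rates):

```python
def cost_per_converted_call(call_costs: list[float], conversions: int) -> float:
    """Total direct cost of a cohort divided by the number of calls that converted."""
    if conversions == 0:
        return float("inf")  # no conversions: every euro spent delivered no return
    return sum(call_costs) / conversions

# Cohort A: 18% conversion at $0.092/call. Cohort B: 8% conversion at $0.08/call.
cohort_a = cost_per_converted_call([0.092] * 100, conversions=18)
cohort_b = cost_per_converted_call([0.08] * 100, conversions=8)
print(f"A: ${cohort_a:.2f}  B: ${cohort_b:.2f}")  # → A: $0.51  B: $1.00
```

The cheaper-per-call cohort B comes out twice as expensive per conversion, exactly as described above.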
The five cost levers every voice AI operator should pull first
Once you have per-call cost data, five levers consistently produce the largest cost reductions in the shortest time.
Response length capping is lever one: adding 'keep all responses under 80 words' to your agent system prompt reduces average TTS character count by 40–60% in most deployments. This is the single highest-return cost change available and costs nothing to implement.
Model tier alignment is lever two: audit whether each agent configuration actually requires the TTS quality it is paying for. eleven_turbo_v2_5 is 40% cheaper than eleven_multilingual_v2 on most plans. On telephony (8kHz G.711 encoding), the audio quality difference is imperceptible to most callers. Defaulting new configurations to the Turbo tier and upgrading only when quality issues are specifically reported saves significant cost at scale.
Failed-call exclusion is lever three: identify calls that failed within the first 30 seconds before any value exchange and characterise their cost. These are calls you paid full price for that delivered nothing. If they represent 10% of your volume, they represent 10% of your costs with 0% of your conversions. Fix the failure mode driving them and you cut costs and improve conversion simultaneously.
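Characterising that failed-call cost is a one-pass aggregation over your per-call records. A sketch, assuming you have joined duration and cost per call; the dict field names here are illustrative, not any provider's schema:

```python
FAILURE_THRESHOLD_S = 30  # calls ending before this delivered no value

def failed_call_cost_share(calls: list[dict]) -> float:
    """Fraction of total spend consumed by calls under the failure threshold."""
    failed = sum(c["cost"] for c in calls if c["duration_s"] < FAILURE_THRESHOLD_S)
    total = sum(c["cost"] for c in calls)
    return failed / total if total else 0.0

calls = [
    {"duration_s": 120, "cost": 0.092},
    {"duration_s": 15,  "cost": 0.010},   # dropped in the first 30 seconds
    {"duration_s": 180, "cost": 0.140},
]
print(f"{failed_call_cost_share(calls):.1%} of spend went to failed calls")
```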
Geographic routing optimisation is lever four: calls being routed through distant regions add call duration from network overhead without adding conversation value. Auditing your Twilio number geography against your calling patterns and consolidating to the correct regions reduces both call duration and ElevenLabs latency (reducing retry-induced character consumption).
Idle session billing is lever five: confirm that your voice AI orchestration layer is not billing for session time during periods where the caller is on hold or in a queue. Some orchestration platforms meter from session start regardless of active AI usage — sessions waiting for a human transfer can accumulate costs without any AI activity.
Building a cost monitoring practice that actually works
A cost monitoring practice that actually prevents budget overruns has three components running at different cadences.
Real-time: an alert that fires immediately when any individual call exceeds your defined cost threshold (typically 3–4x your average call cost). This catches the pathological cases — the 20-minute call that somehow stayed connected, the agent configuration that generated a 5,000-character response on a single turn — before they accumulate into a billing problem.
Daily: a Slack summary of the top 5 most expensive agent configurations from the prior day, with cost per call and conversion rate side by side. This is the report that surfaces the expensive but low-converting configurations that a cost-only view misses. Review takes 90 seconds. Action rate is high because the data is immediately actionable.
Weekly: a cost trend report comparing current week to prior week, broken down by provider and agent configuration, with projected end-of-month spend at current burn rate. This is the report that gives finance the predictability they need and gives engineering the lead time to make configuration changes before an overage occurs.
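The projected end-of-month figure in that weekly report is a simple linear extrapolation of the current burn rate. A sketch:

```python
import calendar
from datetime import date

def projected_month_end_spend(spend_to_date: float, as_of: date) -> float:
    """Linear projection of monthly spend from the current daily burn rate."""
    days_in_month = calendar.monthrange(as_of.year, as_of.month)[1]
    daily_burn = spend_to_date / as_of.day
    return daily_burn * days_in_month

# Example: €4,200 spent by the 10th of a 30-day month.
print(projected_month_end_spend(4200.0, date(2025, 6, 10)))  # → 12600.0
```

A linear projection is deliberately naive: if your call volume is weekday-heavy, project from the trailing seven days instead of the calendar month-to-date.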
Most teams that implement all three components report cutting costs 20–40% within the first month — not through provider renegotiation, but through configuration changes surfaced by the per-call data. The providers were charging correctly all along. The configurations were the problem. The monitoring made them visible.
Frequently asked questions
What does a typical voice AI deployment cost per call?
A typical voice AI call using Twilio for telephony and ElevenLabs for TTS costs between $0.05 and $0.40 per call, depending on call duration, TTS character count, and model tier. A 2-minute call with Twilio inbound ($0.0085/min) + ElevenLabs Turbo-tier TTS for a 200-character response ($0.06) + orchestration costs totals approximately $0.077. An agent with verbose responses (800 characters average) on the same call costs $0.257. The 3.3x cost multiplier from verbosity is the most common driver of cost overruns.
How do you set up per-call cost tracking for a voice AI deployment?
Per-call cost tracking requires storing three data points for each call: Twilio call duration (from status callback), ElevenLabs character count (from generation response or usage API), and the call outcome (converted / not converted). Multiply duration by your Twilio per-minute rate and character count by your ElevenLabs per-character rate to get the direct cost per call. Join this to the outcome data to get cost per converted call. Most voice AI observability tools can automate this calculation and surface it in a daily Slack summary.
What is a reasonable voice AI cost per converted call benchmark?
A reasonable target for B2B voice AI is under €2.00 cost per converted call. Consumer voice AI deployments with shorter conversations and higher volume often achieve €0.50–€1.00. Teams consistently above €3.00 per converted call typically have one of three issues: agent verbosity generating excessive TTS costs, an expensive TTS model tier being used where a cheaper tier would suffice, or a low conversion rate inflating the per-conversion metric. Fixing any one of these typically cuts cost per converted call by 30–60%.
How do I prevent ElevenLabs character budget overruns?
Three-step approach: (1) calculate your expected monthly character usage based on average response length × daily call volume × 30 days, and confirm your ElevenLabs subscription tier covers this with 30% headroom; (2) set up a daily character usage monitoring alert that fires in Slack when usage exceeds 70% of the monthly limit; (3) add a response length cap to your AI agent system prompt ('keep all responses under 80 words') to control the main variable driving character consumption.
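Steps (1) and (2) above reduce to two small functions. A sketch, following the formula in the text (average characters per call × daily call volume × 30 days); the example figures are illustrative:

```python
def required_character_quota(avg_chars_per_call: int, daily_calls: int,
                             headroom: float = 0.30) -> int:
    """Monthly ElevenLabs character need, padded by the stated 30% headroom."""
    monthly_usage = avg_chars_per_call * daily_calls * 30
    return int(monthly_usage * (1 + headroom))

def usage_alert(chars_used: int, plan_limit: int,
                alert_fraction: float = 0.70) -> bool:
    """True when usage crosses the alert line (70% of the monthly limit)."""
    return chars_used >= plan_limit * alert_fraction

# Example: 600 TTS characters per call, 300 calls/day.
print(required_character_quota(600, 300))  # → 7020000 (7.02M characters)
```

Run the `usage_alert` check from a daily cron against the usage figure your ElevenLabs dashboard or API reports, and post breaches to the same Slack channel as your cost alerts.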
Ready to investigate your own calls?
Connect Sherlock to your voice providers in under 2 minutes. Free to start — 100 credits, no credit card.