Self-hosting a voice agent: what it costs and why it's roughly 3× cheaper
A practical look at the economics of owning your ASR + LLM + TTS stack versus paying per-API markups.
The single biggest lever on voice-AI margins is whether you own the model stack or rent it. Orchestrators pass through the cost of third-party ASR, LLM and TTS APIs — and add a markup. Self-hosting flips that.
- Self-hosted voice AI reaches roughly $0.035/min at scale.
- Typical effective orchestrator rates land around $0.09–0.12/min.
- The crossover is around 50k minutes/month; above it, self-hosting wins decisively.
- Owning the stack also improves latency and data residency, not just cost.
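One way to reason about the crossover is a simple linear model: self-hosting carries a fixed monthly cost (GPUs, ops) plus a small marginal cost per minute, while an orchestrator charges a flat per-minute rate. The sketch below is a minimal illustration under that assumption. The function name and the fixed and marginal figures are hypothetical, picked only so that the result lands near the ~50k minutes/month mentioned above; they are not measured numbers.

```python
def break_even_minutes(fixed_monthly_usd, self_hosted_marginal, orchestrator_rate):
    """Monthly minutes at which both options cost the same.

    Assumes a linear model:
      self-hosted  = fixed_monthly_usd + self_hosted_marginal * minutes
      orchestrator = orchestrator_rate * minutes
    """
    if orchestrator_rate <= self_hosted_marginal:
        raise ValueError("no crossover: orchestrator rate must exceed marginal cost")
    return fixed_monthly_usd / (orchestrator_rate - self_hosted_marginal)


# Hypothetical inputs, for illustration only (not from the article):
FIXED_MONTHLY_USD = 4_500.0   # assumed GPU + ops cost per month
MARGINAL_PER_MIN = 0.02       # assumed self-hosted cost per extra minute
ORCHESTRATOR_RATE = 0.11      # midpoint of the $0.09-0.12/min range

crossover = break_even_minutes(FIXED_MONTHLY_USD, MARGINAL_PER_MIN, ORCHESTRATOR_RATE)
print(f"break-even at about {crossover:,.0f} minutes/month")
```

Under this model the effective self-hosted rate keeps falling toward the marginal cost as volume grows, which is why the gap widens above the crossover.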
Where the money goes
A voice agent runs three models per turn: speech-to-text, a language model, and text-to-speech. Rent all three through APIs and you pay per second, three times over, plus telephony. Serve them yourself on GPUs you already pay for, and marginal cost collapses.
self-hosted      100,000 min × $0.035 =  $3,500
orchestrator     100,000 min × $0.11  = $11,000
                                        ───────
monthly saving                           $7,500 (~68%)
The trade-off is operational complexity: GPU capacity, streaming, diarization, PII redaction. That's the work Arivox does for you, with a self-hosted option when you want to run it in-house.
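The 100,000-minute comparison above can be reproduced, and rerun across the quoted orchestrator range, with a few lines of Python. This is a rough sketch that uses only the per-minute rates stated in this article; the function name `compare` and its return keys are illustrative, and it treats each rate as an all-in per-minute cost.

```python
def compare(minutes, self_hosted_rate, orchestrator_rate):
    """Return monthly costs, saving, and cost ratio for a given volume."""
    self_hosted = minutes * self_hosted_rate
    orchestrator = minutes * orchestrator_rate
    saving = orchestrator - self_hosted
    return {
        "self_hosted": self_hosted,
        "orchestrator": orchestrator,
        "saving": saving,
        "saving_pct": saving / orchestrator * 100,
        "ratio": orchestrator / self_hosted,
    }


MINUTES = 100_000
SELF_HOSTED_RATE = 0.035  # $/min, the article's at-scale figure

# Low end, midpoint used in the table, and high end of the orchestrator range.
for rate in (0.09, 0.11, 0.12):
    r = compare(MINUTES, SELF_HOSTED_RATE, rate)
    print(
        f"orchestrator ${rate:.2f}/min: "
        f"saving ${r['saving']:,.0f} ({r['saving_pct']:.0f}%), "
        f"{r['ratio']:.1f}x the self-hosted cost"
    )
```

Swapping in your own volume and rates shows how sensitive the saving is to where your effective orchestrator rate falls in that range.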