Show HN: Ava – AI Voice Agent for Traditional Phone Systems (Python + Asterisk/ARI)
My repo was shared here once before by someone else, so I wanted to follow up with the progress since then.
https://news.ycombinator.com/item?id=46380399
I've been working with Asterisk/FreePBX systems for years. I wanted to add AI voice capabilities to legacy phone systems without paying per-minute SaaS fees or ripping out the entire telephony stack.
So I built AVA, a self-hosted AI voice agent that integrates with any traditional phone system. While most solutions demand an expensive migration to a cloud-only provider, AVA connects AI agents to your existing phone system while keeping call data private and lowering operational costs.
AVA is a Dockerized Python app that sits alongside your Asterisk server. It connects via ARI (Asterisk REST Interface) and routes call audio to AI providers — OpenAI Realtime, Deepgram, Google Live API, ElevenLabs, Telnyx, or fully local models (Vosk + llama.cpp + Piper). You can mix and match STT/LLM/TTS in a modular pipeline, or use a single provider end-to-end.
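As a sketch of what "mix and match in a modular pipeline" means in practice (the class and provider strings below are illustrative, not the repo's actual identifiers):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pipeline:
    """One pluggable slot per stage; an end-to-end provider fills all three."""
    stt: str   # speech-to-text provider
    llm: str   # language model
    tts: str   # text-to-speech provider

# Fully local stack vs. a single realtime provider end-to-end.
local_hybrid = Pipeline(stt="vosk", llm="llama.cpp", tts="piper")
openai_e2e = Pipeline(stt="openai-realtime",
                      llm="openai-realtime",
                      tts="openai-realtime")
```

The point is that each stage is addressed independently, so swapping Deepgram STT in front of a local LLM is a config change, not a code change.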
Two audio transport paths: We support both AudioSocket (low-latency TCP with TLV framing) and ExternalMedia RTP (UDP, better for NAT). A transport orchestrator auto-negotiates sample rates and codecs between what Asterisk sends on the wire and what each AI provider expects — so you can run 8kHz ulaw from Asterisk into a provider that wants 24kHz linear16 without manual config.
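To make the 8kHz-ulaw-to-24kHz-linear16 leg concrete, here's a minimal pure-Python sketch: standard G.711 mu-law expansion followed by naive 3x sample repetition (a real resampler would low-pass filter; function names are mine, not the repo's):

```python
import struct

def ulaw_byte_to_pcm16(u: int) -> int:
    """G.711 mu-law expansion of one encoded byte to a 16-bit sample."""
    u = ~u & 0xFF
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample

def ulaw8k_to_linear16_24k(frame: bytes) -> bytes:
    """Decode a ulaw frame, then upsample 8kHz -> 24kHz by repetition."""
    out = bytearray()
    for b in frame:
        pcm = ulaw_byte_to_pcm16(b)
        out += struct.pack("<h", pcm) * 3   # 3x zero-order hold
    return bytes(out)

# A 20ms frame: 160 ulaw bytes in -> 480 samples (960 bytes) out.
converted = ulaw8k_to_linear16_24k(b"\xff" * 160)
```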
Session lifecycle: A typed session store tracks every call from StasisStart through hangup — audio diagnostics, barge-in counts, provider state, conversation turns. Every call is fully observable and debuggable after the fact.
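A sketch of what such a typed session record might look like (field and state names are my guesses, not the repo's):

```python
from dataclasses import dataclass, field
from enum import Enum

class CallState(Enum):
    STASIS_START = "stasis_start"
    IN_CONVERSATION = "in_conversation"
    HUNG_UP = "hung_up"

@dataclass
class CallSession:
    channel_id: str
    state: CallState = CallState.STASIS_START
    barge_ins: int = 0
    turns: list[str] = field(default_factory=list)

    def record_turn(self, text: str) -> None:
        self.state = CallState.IN_CONVERSATION
        self.turns.append(text)

session = CallSession(channel_id="1234.56")
session.record_turn("caller: hi")
session.barge_ins += 1
```

Because every field is typed and retained after hangup, a post-mortem on a bad call is a matter of reading one record rather than replaying logs.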
Barge-in and VAD were the hardest problems. We use a dual-mode VAD — WebRTC VAD combined with energy-based RMS detection, scored into a single confidence value (40% WebRTC weight, 40% energy ratio, 20% agreement bonus). Frame smoothing prevents single-frame glitches from triggering false interrupts. When barge-in fires, we kill active playback (both streaming and file-based) via ARI, flush provider audio buffers, release conversation gating tokens, and optionally suppress provider output for a configurable window to prevent pre-barge audio from re-queuing. The system supports three interrupt sources: local VAD, Asterisk's native talk detection events, and provider-side interruption signals.
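The scoring itself is simple enough to sketch. The 40/40/20 weights are from the description above; the 0.5 agreement threshold and the 3-frame smoothing window below are illustrative values, not the tuned ones in src/core/vad_manager.py:

```python
def vad_confidence(webrtc_speech: bool, energy_ratio: float) -> float:
    """Fuse WebRTC VAD's binary verdict with an energy ratio in [0, 1]."""
    energy = min(max(energy_ratio, 0.0), 1.0)
    score = 0.4 * float(webrtc_speech) + 0.4 * energy
    if webrtc_speech and energy >= 0.5:   # agreement bonus
        score += 0.2
    return score

class FrameSmoother:
    """Fire only after N consecutive frames clear the threshold,
    so a single glitchy frame can't trigger a false barge-in."""
    def __init__(self, threshold: float = 0.6, window: int = 3):
        self.threshold, self.window, self._run = threshold, window, 0

    def update(self, confidence: float) -> bool:
        self._run = self._run + 1 if confidence >= self.threshold else 0
        return self._run >= self.window
```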
The hardest latency challenge was bridging legacy SIP/RTP with modern WebSocket streams. We use a two-container architecture: a lightweight orchestrator for ARI state management and an optional heavier container for local model inference. There are 6 pre-validated golden baseline configs if you just want something working out of the box, plus an Admin UI for visual setup.
Try the live demo at (925) 736-6718: Option 5 for Google, 6 for Deepgram, 7 for OpenAI Realtime, 8 for local hybrid, and 9 for ElevenLabs.
Code is MIT. I'd love feedback on the transport layer (src/core/transport_orchestrator.py) and the VAD tuning (src/core/vad_manager.py).