Digital Workforce Platform — Voice AI, RAG, and real-time telephony
Multi-tenant SaaS that lets any business deploy a Voice AI receptionist on their own number in minutes.

The problem
The agency had a strong agent product, but every new client was a custom integration:
- 2–3 weeks of telephony plumbing per onboard
- Knowledge base hand-built per business
- No way for clients to update their own agent
They wanted a product, not a service line: the same agent quality, but self-serve for tenants and with one-day onboarding.
The approach
The architecture has three layers:
Why Gemini Live (instead of OpenAI Realtime)
For this client, two reasons:
- Multimodal pricing. Vertex billing rolled into existing GCP credits.
- Latency from EU/US. Sub-300ms audio round-trip held in our tests.
I'd reach for OpenAI Realtime instead when:
- The client already lives in the OpenAI ecosystem
- They want voice cloning via parallel ElevenLabs hookups
- They need OpenAI's stronger function-calling reliability for complex tool chains
The RAG layer
Each tenant gets:
- An ingestion pipeline (PDF / website / docs) that chunks + embeds into pgvector with a tenant_id filter.
- A retrieval prompt that runs before every model turn, scoped strictly to that tenant.
- A tenant_id-scoped tool API the agent can call (look up an order, book an appointment).
Every retrieval and every tool call is bounded by tenant_id. We treat tenant data isolation the same way we'd treat row-level security in Postgres — defense in depth, not just a query filter.
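As a rough illustration of the query-filter layer, here is a minimal sketch of a tenant-scoped pgvector lookup. The table and column names (kb_chunks, tenant_id, content, embedding) and the function name buildRetrievalQuery are assumptions made for this example, not the project's actual schema. The only pgvector feature it relies on is the `<=>` cosine-distance operator.

```typescript
// Hypothetical schema: kb_chunks(tenant_id text, content text, embedding vector).
interface RetrievalQuery {
  text: string;                    // parameterized SQL for a Postgres client
  values: Array<string | number>;  // positional parameters for $1, $2, $3
}

function buildRetrievalQuery(
  tenantId: string,
  queryEmbedding: number[],
  topK: number = 5,
): RetrievalQuery {
  // Fail closed: a missing tenant must never become an unscoped search.
  if (tenantId.trim() === "") {
    throw new Error("tenantId is required for retrieval");
  }
  if (queryEmbedding.length === 0) {
    throw new Error("queryEmbedding must not be empty");
  }
  if (topK < 1 || Math.floor(topK) !== topK) {
    throw new Error("topK must be a positive integer");
  }

  // pgvector accepts a text literal such as '[0.1,0.2,0.3]' cast to vector.
  const vectorLiteral = "[" + queryEmbedding.join(",") + "]";

  const text = [
    "SELECT content, embedding <=> $2::vector AS distance",
    "FROM kb_chunks",
    "WHERE tenant_id = $1",
    "ORDER BY embedding <=> $2::vector",
    "LIMIT $3",
  ].join("\n");

  return { text, values: [tenantId, vectorLiteral, topK] };
}
```

The tenant ID travels as a bound parameter rather than being concatenated into the SQL. A row-level security policy on the same table would be the natural second layer behind this filter.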
Tech decisions in detail
Why Xano as the control plane
A NestJS service would have given more flexibility, but the agency's existing team was Xano-fluent. Putting tenant config + routing in Xano meant non-engineers could ship tenant-specific tweaks without a deploy.
Where Xano was the wrong tool: the call runtime itself. Streaming audio + tool orchestration belongs in a proper service. So we kept Xano on the slow path (config, webhooks, billing) and put a small NestJS service in front of Twilio for the fast path.
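To make the fast path concrete, here is a minimal sketch of what the service in front of Twilio might return from the voice webhook: TwiML that connects the call to a WebSocket media stream and passes the tenant along as a custom parameter. That the project used Twilio Media Streams is an assumption. The function names and the example URL are hypothetical. `<Connect>`, `<Stream>`, and `<Parameter>` are standard TwiML elements.

```typescript
// Escape the five XML special characters so values are safe inside attributes.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Build the TwiML response for an incoming call (hypothetical helper).
// streamUrl is the WebSocket endpoint of the call runtime; tenantId is
// assumed to have been resolved already from the dialed number.
function buildStreamTwiml(streamUrl: string, tenantId: string): string {
  if (streamUrl.indexOf("wss://") !== 0) {
    throw new Error("streamUrl must be a wss:// URL");
  }
  if (tenantId.trim() === "") {
    throw new Error("tenantId is required");
  }
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<Response>",
    "  <Connect>",
    `    <Stream url="${escapeXml(streamUrl)}">`,
    `      <Parameter name="tenant_id" value="${escapeXml(tenantId)}" />`,
    "    </Stream>",
    "  </Connect>",
    "</Response>",
  ].join("\n");
}
```

In this sketch, the slow path only has to answer one question per call (which tenant owns this number), and everything after the stream opens stays inside the real-time service.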
Twilio number provisioning
Each tenant provisions their own number through the console. Twilio's number search → purchase → webhook bind is a three-call flow; I wrapped it in a Xano function stack so the UI just calls provision_number(tenant_id, area_code).
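The real flow lives in a Xano function stack, but the same three steps can be sketched in TypeScript. The NumberProvider interface, its method names, and the webhook URL layout below are hypothetical stand-ins for this example. They are not Twilio SDK calls.

```typescript
// Hypothetical port over the telephony provider. A real implementation
// would wrap the provider's REST API behind these three methods.
interface NumberProvider {
  searchAvailable(areaCode: string): Promise<string[]>; // E.164 numbers
  purchase(phoneNumber: string): Promise<{ sid: string }>;
  bindVoiceWebhook(sid: string, webhookUrl: string): Promise<void>;
}

interface ProvisionedNumber {
  tenantId: string;
  phoneNumber: string;
  sid: string;
}

async function provisionNumber(
  provider: NumberProvider,
  tenantId: string,
  areaCode: string,
  webhookBaseUrl: string,
): Promise<ProvisionedNumber> {
  // Call 1: search for available numbers in the requested area code.
  const candidates = await provider.searchAvailable(areaCode);
  const phoneNumber = candidates[0];
  if (phoneNumber === undefined) {
    throw new Error(`no numbers available in area code ${areaCode}`);
  }

  // Call 2: purchase the first candidate.
  const { sid } = await provider.purchase(phoneNumber);

  // Call 3: point the number's voice webhook at a tenant-specific URL.
  const webhookUrl = `${webhookBaseUrl}/voice/${encodeURIComponent(tenantId)}`;
  await provider.bindVoiceWebhook(sid, webhookUrl);

  return { tenantId, phoneNumber, sid };
}
```

One thing this sketch leaves out is compensation: if the webhook bind fails after the purchase succeeds, the number is paid for but unrouted, so a production version would want to retry the bind or release the number.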
Outcomes
Lessons
- Two backends are sometimes the right call. Don't force one tool to do both slow-path config and real-time orchestration.
- Debugging streaming audio needs first-class tooling. I built a Loom-style transcript replay early on, and it saved a week of round-tripping with the client.
- Tenant isolation is policy, not query syntax. Every retrieval, every tool, every webhook needs to assert it.
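One way to turn the last lesson into code is a wrapper that checks every tool result against the caller's tenant, so a handler that forgets its own filter still cannot leak data. This is a minimal sketch: the names ToolContext, TenantOwned, tenantScoped, and the in-memory order list are all hypothetical, and the handlers are synchronous only to keep the example short.

```typescript
interface ToolContext { tenantId: string; }  // tenant resolved from the call session
interface TenantOwned { tenantId: string; }  // any record that belongs to a tenant

function assertSameTenant(ctx: ToolContext, resource: TenantOwned): void {
  if (ctx.tenantId === "" || resource.tenantId !== ctx.tenantId) {
    throw new Error("tenant isolation violation");
  }
}

// Wrap a tool handler so its result is checked before it reaches the model.
function tenantScoped<A, R extends TenantOwned>(
  handler: (ctx: ToolContext, args: A) => R,
): (ctx: ToolContext, args: A) => R {
  return (ctx, args) => {
    const result = handler(ctx, args);
    assertSameTenant(ctx, result);
    return result;
  };
}

// Example data and a deliberately naive handler that ignores the tenant.
interface Order extends TenantOwned { id: string; status: string; }

const orders: Order[] = [
  { id: "o1", tenantId: "tenant_a", status: "shipped" },
  { id: "o2", tenantId: "tenant_b", status: "pending" },
];

const lookupOrder = tenantScoped(
  (_ctx: ToolContext, args: { orderId: string }): Order => {
    const match = orders.filter((o) => o.id === args.orderId)[0];
    if (match === undefined) {
      throw new Error("order not found");
    }
    return match;
  },
);
```

Calling lookupOrder as tenant_a for order o1 returns the order, while the same caller asking for o2 gets an error instead of tenant_b's data. The guard complements the tenant filter inside each handler and does not replace it.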
Building something similar?
Send a quick note — happy to compare notes on the architecture.