Rebuilding Alexis on Claude's API: What I Actually Learned

I rebuilt Alexis, my healthcare intake voice agent, on Claude’s API. Here’s what I actually learned: not the marketing version, the real version.

Why I switched

The honest answer: tool use consistency.

In a healthcare context, structured data collection isn’t optional. I needed Claude to reliably call collect_patient_info and validate_address at exactly the right moments in a conversation: not occasionally, not mostly, but every single time across a 10-step intake flow. GPT-4o’s function calling worked well in testing. Under realistic conversational variance (patients who ramble, give incomplete addresses, or jump ahead in the flow), it was less predictable than I needed. One missed function call in a healthcare intake isn’t a UX issue. It’s a data integrity issue. I wanted to test whether Claude’s tool use would be more consistent under that kind of pressure.
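To make the "missed function call" problem concrete, here is a minimal sketch of how the two tools might be declared and how a turn could be checked for a missing call. The tool names come from the flow above; every field name, plus the helpers extract_tool_calls and missing_tool_calls, is an assumption made up for illustration and not the schema in the repo. It also assumes the response content has been converted to plain dicts, as in the raw JSON response.

```python
from typing import Any

# Tool declarations in the Messages API "tools" shape (name, description,
# input_schema). All property names are hypothetical examples.
TOOLS: list[dict[str, Any]] = [
    {
        "name": "collect_patient_info",
        "description": "Record intake fields the patient has just provided.",
        "input_schema": {
            "type": "object",
            "properties": {
                "first_name": {"type": "string"},
                "last_name": {"type": "string"},
                "date_of_birth": {"type": "string", "description": "ISO 8601 date"},
            },
            "required": [],
        },
    },
    {
        "name": "validate_address",
        "description": "Validate a mailing address the patient has spoken.",
        "input_schema": {
            "type": "object",
            "properties": {
                "street": {"type": "string"},
                "city": {"type": "string"},
                "state": {"type": "string"},
                "zip_code": {"type": "string"},
            },
            "required": ["street", "city", "state", "zip_code"],
        },
    },
]


def extract_tool_calls(content_blocks: list[dict[str, Any]]) -> list[tuple[str, dict]]:
    """Return (tool_name, tool_input) for each tool_use block in a response."""
    return [
        (block["name"], block["input"])
        for block in content_blocks
        if block.get("type") == "tool_use"
    ]


def missing_tool_calls(expected: list[str], content_blocks: list[dict[str, Any]]) -> list[str]:
    """List the expected tools that the model did not call on this turn."""
    called = {name for name, _ in extract_tool_calls(content_blocks)}
    return [name for name in expected if name not in called]
```

A check like missing_tool_calls can run on every turn of a scripted conversation, so a skipped call shows up as a test failure rather than as a silent gap in the patient record.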

What I changed architecturally

The core pipeline stayed the same: Pipecat for the real-time voice framework, Deepgram Nova-3 for STT, ElevenLabs for TTS, Twilio for phone handling. What changed was the inference layer and how I managed context.

Model routing: Not every turn in a conversation requires the same reasoning depth. Collecting a date of birth is a simple extraction task. Handling an ambiguous insurance response requires real judgment. I implemented routing between Claude Haiku (simple field collection) and Claude Sonnet (complex reasoning, edge cases, address disambiguation). This alone dropped per-turn latency significantly.

Prompt caching: The system prompt in a healthcare voice agent is long: conversation flow instructions, field definitions, validation rules, fallback behaviors. With prompt caching, that context gets cached after the first turn. The difference was measurable: Claude per-turn latency dropped from 3.6s to ~0.8s.

Context compression: A full healthcare intake runs 30–40 turns. Without compression, context windows balloon and costs scale linearly. I implemented structured turn summaries: after each completed section of the intake, the raw turns collapse into a structured summary. This kept context under 10K tokens per call and cut cost per 40-turn call from $0.78 to $0.12.
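A rough sketch of how routing, prompt caching, and section summaries could come together in one request builder. This is not the code from the repo: the model IDs are placeholders, the SIMPLE_STEPS set and the summary format are invented for the example, and placing summaries in a second, uncached system block is just one possible layout. The cache_control marker on the static system block is the part that opts that prefix into prompt caching.

```python
from typing import Any, Optional

# Placeholder model identifiers: substitute the real IDs for your account.
HAIKU_MODEL = "haiku-model-id"
SONNET_MODEL = "sonnet-model-id"

# Hypothetical intake steps treated as simple field extraction.
SIMPLE_STEPS = {"name", "date_of_birth", "phone"}


def route_model(step: str, needs_judgment: bool = False) -> str:
    """Send simple extraction steps to the small model, everything else to the large one."""
    if step in SIMPLE_STEPS and not needs_judgment:
        return HAIKU_MODEL
    return SONNET_MODEL


def summarize_section(section: str, fields: dict[str, str]) -> str:
    """Collapse a completed section into one structured line (illustrative format)."""
    pairs = "; ".join(f"{key}={value}" for key, value in sorted(fields.items()))
    return f"{section}: {pairs}"


def build_request(
    system_prompt: str,
    live_turns: list[dict[str, Any]],
    step: str,
    needs_judgment: bool = False,
    summaries: Optional[list[str]] = None,
) -> dict[str, Any]:
    """Assemble keyword arguments for a Messages API call."""
    system: list[dict[str, Any]] = [
        {
            "type": "text",
            "text": system_prompt,
            # Marks the long, static prompt as a cacheable prefix.
            "cache_control": {"type": "ephemeral"},
        }
    ]
    if summaries:
        # Summaries change during the call, so they sit after the cached block.
        system.append(
            {"type": "text", "text": "Completed sections:\n" + "\n".join(summaries)}
        )
    return {
        "model": route_model(step, needs_judgment),
        "max_tokens": 512,
        "system": system,
        # Only the turns of the section in progress are sent verbatim.
        "messages": live_turns,
    }
```

With the Anthropic Python SDK, the resulting dict could be passed along as client.messages.create(**request). The design choice worth noting is that only the current section's raw turns travel with each request, which is what keeps the context small as the call gets longer.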

What surprised me

Two things I didn’t expect.

First: Claude’s tool use in conversational context is genuinely more consistent than I anticipated. The structured data collection across a 10-step intake flow worked reliably even when patients gave fragmented or out-of-order information. I haven’t fully stress-tested this at scale, but the early results are cleaner than what I was getting before.

Second: the latency profile is different in a way that matters for voice specifically. The perceived end-to-end latency, from when the patient stops speaking to when Alexis starts responding, sits around 2.5s across STT, inference, and TTS. That’s within the range where conversations feel natural. Above 3s, patients start to wonder if the call dropped.

Getting to 2.5s required every optimization stacked together: prompt caching, model routing, end-of-turn signaling to Deepgram, and streaming TTS. No single one of them was enough on its own. The combination is what made it work.

What I’d do differently

The benchmarking suite I built, measuring p50/p99 inference latency, token throughput, and per-call cost across simulated enterprise call volumes, should have been built first, not after. I was optimizing based on feel before I had real numbers. Build the eval harness before you build the product.

Also: address validation in a voice context is harder than it looks. SmartyStreets catches invalid addresses, but when a patient gives an address that’s technically valid but slightly wrong, the agent has to handle the correction gracefully without making the patient feel interrogated. That UX problem took more iteration than any of the technical problems.
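On the eval-harness point, the core measurements are small enough to write on day one. Below is a minimal sketch, not the suite from the repo: it uses the nearest-rank definition of a percentile (one of several common definitions), and the per-million-token prices are function arguments you supply, not real pricing.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def latency_summary(latencies_s: list[float]) -> dict[str, float]:
    """Summarize per-turn inference latencies (in seconds) for one benchmark run."""
    return {
        "n": len(latencies_s),
        "p50": percentile(latencies_s, 50),
        "p99": percentile(latencies_s, 99),
    }


def call_cost_usd(
    input_tokens: int,
    output_tokens: int,
    usd_per_mtok_in: float,
    usd_per_mtok_out: float,
) -> float:
    """Cost of one call given token counts and caller-supplied prices per million tokens."""
    return (input_tokens * usd_per_mtok_in + output_tokens * usd_per_mtok_out) / 1_000_000
```

Feeding latency_summary the recorded per-turn timings from a simulated call gives the p50/p99 pair to track before and after each optimization, instead of judging by feel.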

The broader takeaway

If you’re building voice agents for enterprise use cases — healthcare, financial services, anything with structured data collection requirements — the choice of inference provider isn’t just about benchmark numbers. It’s about how the model behaves at the edges of your conversation flow.

The code is open source: github.com/sumtzehern/health-twilio-agent

Happy to talk architecture with anyone building in this space.