Building VoxGate AI — Part 2: What I Learned Building It

May 18, 2026 · 27 min read

ai · llm · agents · lessons-learned · engineering · typescript

Engineering challenges, failed approaches, and a build-vs-buy verdict from inside the project.

In Part 1 I described VoxGate AI: an LLM agent embedded inside a multi-tenant CRM for car-detailing shops, with a backend-orchestrated streaming UI protocol and a service layer the agent shares with the manual UI. This is what shipping it actually taught me — including the architectural mistake that almost killed it.

5. The Biggest Engineering Challenges

Four fights stand out. None look dramatic in the commit log; all ate days.

1. The form-card shape drift bug

A user types "Create a customer named Jordan Miles with phone 555-219-8841 and email jordan.miles@example.com." The chat renders a client form-card. Every field blank.

The cause was two card types with two prop shapes and one prompt. Quote and job cards used { quote: { ... } }. Client cards used { firstName, lastName, phone, email } at the top level. The "active card" examples in the system prompt were written entirely against the nested shape because quote cards shipped first. When the LLM hit a client card, it dutifully emitted { client: { firstName: "Jordan", ... } } — a reasonable response to a prompt that had only ever shown nested examples. The update_card handler did a flat merge, found no top-level firstName, and wrote nothing.

The fix was two-part: branch the prompt examples on cardType, and make update_card tolerate the wrong shape as a fallback. The second part felt like defensive code papering over a prompt failure. I shipped it anyway. Six lines of defensive merge cost less than an empty card in front of a customer.
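A minimal sketch of what that defensive merge could look like. The real update_card handler is not shown in this article, so the `mergeClientCardProps` helper and the exact field list here are assumptions for illustration. It tries the flat shape first and only unwraps a nested object when no top-level field is present.

```typescript
// Hypothetical flat prop shape for a client card; field names follow the example above.
type ClientCardProps = {
  firstName?: string;
  lastName?: string;
  phone?: string;
  email?: string;
};

const CLIENT_FIELDS = ["firstName", "lastName", "phone", "email"] as const;

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Flat merge first. If the model wrapped the fields (e.g. { client: { ... } }),
// fall back to unwrapping the single nested object.
function mergeClientCardProps(
  current: ClientCardProps,
  patch: Record<string, unknown>,
): ClientCardProps {
  let source = patch;
  const hasFlatField = CLIENT_FIELDS.some((field) => field in patch);
  if (!hasFlatField) {
    const nested = Object.values(patch).filter(isRecord);
    const [only] = nested;
    if (nested.length === 1 && only !== undefined) source = only;
  }
  const next: ClientCardProps = { ...current };
  for (const field of CLIENT_FIELDS) {
    const value = source[field];
    if (typeof value === "string") next[field] = value; // ignore non-string junk
  }
  return next;
}
```

The fallback is intentionally narrow: it only unwraps when exactly one nested object exists, so an unrelated payload still results in no write rather than a wrong write.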

Lesson: Tool-call payloads are an API the LLM consumes. The LLM uses the wrong shape if your docs are inconsistent. Your docs are your prompt examples. Audit them like an OpenAPI spec.

2. The in-chat form was a parallel app

The one I'd structure the whole article around.

Bugs that should have been one-line fixes were taking days because they had to be fixed twice. A new quote field? Add it to the quote form and the in-chat quote-form-card. A validation rule? Two places. An optimistic update? Two places. Mobile got a third variant.

The chat-embedded form-cards (client-form-card, quote-form-card, job-form-card) were different React components from the real side-panel forms. Same backend, two front-ends. When a user said "create a quote for Jordan," two paths existed — agent emits a card in the feed, or user clicks "New Quote" and the panel opens. Same destination. Different code, validation, optimistic state, mobile rendering.

The misread that kept me in the trap: the recurring user complaint "why didn't it just open the form?" sounded like a feature request. It wasn't. It was telling me I'd built a second UI when the user wanted the one they already knew.

The fix for create-client: stop emitting an in-chat form-card. Emit a panel-open-command event with whatever fields a regex extracted from the user's message. The client intercepts the event, opens the real side panel, and prefills it through the same initialValues path React Hook Form already exposed.

One UI. One form. One validation pipeline. The agent's job became "open the right thing pre-filled" instead of "be a second UI."
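To make the routing idea concrete, here is a rough sketch of turning a user message into a panel-open-command. Only the event name comes from this article; the `panel` and `initialValues` field names, the `buildClientPanelCommand` helper, and the regexes are assumptions, and the patterns are deliberately naive.

```typescript
// Hypothetical event shape; only the "panel-open-command" name is from the article.
type ClientPrefill = { firstName?: string; lastName?: string; phone?: string; email?: string };
type PanelOpenCommand = {
  type: "panel-open-command";
  panel: "client";
  initialValues: ClientPrefill;
};

// Illustrative patterns only; production extraction would need to be more forgiving.
const EMAIL_RE = /[\w.+-]+@[\w-]+(?:\.[\w-]+)+/;
const PHONE_RE = /\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b/;
const NAME_RE = /\bnamed\s+([A-Z][a-z]+)\s+([A-Z][a-z]+)/;

function buildClientPanelCommand(message: string): PanelOpenCommand {
  const initialValues: ClientPrefill = {};

  const nameMatch: string[] = message.match(NAME_RE) ?? [];
  const firstName = nameMatch[1];
  const lastName = nameMatch[2];
  if (firstName && lastName) {
    initialValues.firstName = firstName;
    initialValues.lastName = lastName;
  }

  const phone = message.match(PHONE_RE)?.[0];
  if (phone) initialValues.phone = phone;

  const email = message.match(EMAIL_RE)?.[0];
  if (email) initialValues.email = email;

  // Anything the regexes miss is simply left blank for the user to fill in.
  return { type: "panel-open-command", panel: "client", initialValues };
}
```

A missed field costs the user a few keystrokes in a form they already know, which is a much cheaper failure than a second form drifting out of sync.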

Lesson: A chat surface is a routing layer, not a UI surface. The agent figures out intent and opens the existing UI pre-filled. The moment you build "agent components" that duplicate existing app components, stop and reconsider.

3. Panel-mode coordination

Two writers, one piece of shared state, no protocol.

Panel-mode gives the agent patch_quote_field / patch_job_field tools when a side panel is open. The user can talk to the agent while editing the form by hand. That makes the UX feel collaborative — and means the agent's prompt context is a snapshot, the user can edit anything before the turn finishes, and conflicting writes are inevitable.

First attempt: lock the form during agent turns. Strong consistency, clean. It felt awful. A typical turn took 2–4 seconds. Users hated the lockout and stopped using panel-mode within days.

What I landed on:

Last-write-wins, with the user as default winner. Agent field patches apply as setValue with shouldDirty: false. If the user has touched the field, the agent's patch is dropped.
The client pushes manual edits back into the session. A debounced panel.field.update event flows up the socket and into panelContext. The next agent turn sees the new values. Per-turn staleness is fine; cross-turn staleness isn't.
panel.close and goal.complete are separate events. Panel close is a UI fact; goal complete is a domain fact. They fire independently.
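The user-wins rule can be sketched as a pure function over form state. This is a simplified stand-in, not the actual panel code: `PanelForm`, `applyUserEdit`, and `applyAgentPatches` are hypothetical names, and in a React Hook Form setup the "touched by user" set would presumably be derived from the form's own dirty-field tracking instead.

```typescript
type FieldPatch = { field: string; value: string };

// Minimal stand-in for form state: current values plus the fields the user has edited.
type PanelForm = { values: Record<string, string>; touchedByUser: Set<string> };

// A manual edit always applies and marks the field as user-owned.
function applyUserEdit(form: PanelForm, patch: FieldPatch): PanelForm {
  const touchedByUser = new Set(form.touchedByUser);
  touchedByUser.add(patch.field);
  return { values: { ...form.values, [patch.field]: patch.value }, touchedByUser };
}

// Agent patches apply without marking the field as user-edited,
// and are dropped outright for any field the user has already touched.
function applyAgentPatches(
  form: PanelForm,
  patches: FieldPatch[],
): { form: PanelForm; dropped: FieldPatch[] } {
  const values = { ...form.values };
  const dropped: FieldPatch[] = [];
  for (const patch of patches) {
    if (form.touchedByUser.has(patch.field)) {
      dropped.push(patch);
      continue;
    }
    values[patch.field] = patch.value;
  }
  return { form: { values, touchedByUser: form.touchedByUser }, dropped };
}
```

Returning the dropped patches keeps the decision observable, which helps when a user asks why the agent "ignored" a field.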

The mental shift that helped: think of the agent and the user as two collaborators on the same Google Doc, not as one app updating one form.

Lesson: Two writers on one piece of state need a coordination protocol whether you wanted one or not. Pick it explicitly. Locks borrowed from systems programming assume a participant that doesn't get bored. Humans do.

4. Prompt drift

Adding a new intent breaks an old one. Tightening one example loosens another. The agent worked yesterday and is hallucinating tool calls today, with no code change.

A specific incident: someone tightened the client_mutation examples from "do NOT call create_quote / create_job" to "do NOT call the create_* tool." The new wording was cleaner and more general, but the LLM also applied it to create_client itself, which broke the entire client-creation flow.

The fix is what every codebase eventually applies to its config: treat the prompt as code.

Snapshot tests on the rendered prompt across a matrix of (state, intent, active card).
Per-intent unit tests asserting each INTENT_EXAMPLES.* block contains the phrases that make it work and lacks the phrases known to confuse the model.
End-to-end happy-path tests per intent against a recorded model fixture.
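The per-intent check is small enough to sketch without a test framework. The `PhraseRule` type, the `checkIntentBlock` helper, and the rule contents below are assumptions for illustration; the real INTENT_EXAMPLES blocks are not reproduced in this article.

```typescript
type PhraseRule = { mustContain: string[]; mustNotContain: string[] };

// Returns a list of human-readable violations; an empty list means the block passes.
function checkIntentBlock(text: string, rule: PhraseRule): string[] {
  const violations: string[] = [];
  for (const phrase of rule.mustContain) {
    if (!text.includes(phrase)) violations.push(`missing required phrase: ${phrase}`);
  }
  for (const phrase of rule.mustNotContain) {
    if (text.includes(phrase)) violations.push(`contains forbidden phrase: ${phrase}`);
  }
  return violations;
}

// Hypothetical rule encoding the incident described above.
const clientMutationRule: PhraseRule = {
  mustContain: ["do NOT call create_quote / create_job"],
  mustNotContain: ["create_*"],
};
```

Run against the "more general" wording, this returns two violations, so the regression shows up as a failing unit test instead of a broken flow.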

The highest-leverage testing infrastructure for an LLM agent isn't LLM evals. It's snapshot tests on the input to the LLM. The model is the variable you don't control; the prompt is the one you do.

Lesson: Your prompt is your most important config file. Version it, snapshot it, require review for changes — especially the cute ones. "Make the wording more general" is a one-line edit that can break six features.

6. Failed Approaches

Four detours worth writing down. The first one ate most of the build and is the one I'd most want another team to avoid.

1. The chat as a full UI surface (most of the build)

I touched this in challenge 2; here is the full arc, because the reasoning that got me into it is the part another team is most likely to repeat.

Why I did it. The pitch was clean: keep the user in flow. Customer's on the phone, user is in the chat, every action happens in the chat. No context switches. The user types, the agent renders a form-card right there in the feed, the user fills it in, the form-card animates into a confirmation. Reviewers loved the demo. I loved the demo.

The first crack: form-cards started feeling like they were lying. The chat-rendered quote-form-card showed five service line items. The real quote, once saved, had four — because the chat form-card had a different validator that allowed an empty row the real form rejected. A user on a call talked through the fifth service with a customer, hit save, watched the row vanish. I patched the validator. Then I patched it in three more places. Then I wrote a shared validator package — and realized I now had a shared validation library between two forms that should have been the same form.

The second crack: mobile. The web in-chat form-card was a portal-anchored React component. On mobile (Expo) there are no portals. The mobile chat-card had to be reimplemented as a bottom sheet with different keyboard handling, different field focus behavior, and a different animation. I now had three implementations of every form: side panel (web), chat-card (web), chat-card (mobile). A field rename was a four-PR ritual.

The third crack — the one that broke my patience: the optimistic state diverged from the real state. The chat-card showed what the user typed. The side panel, if the user happened to open it, showed what the backend had. Sometimes they disagreed for a couple of seconds. Sometimes they disagreed permanently because the chat-card's "save" was queued behind an agent turn that was still streaming. Users saw two versions of the same quote on screen and started not trusting either one.

The things I tried before the revert.

Sharing components between chat-card and side panel. Worked structurally but didn't solve the protocol problem — the agent still emitted a card; the chat-card still had its own optimistic state separate from the panel's; the divergence just moved one layer deeper.
Making the chat-card "preview-only" and forcing save to happen in the side panel. Felt dishonest. Why is the card editable if pressing save doesn't actually save?
Auto-opening the side panel as soon as the chat-card was emitted, so they showed the same data. Briefly worked; then the user had two open surfaces editing the same record and the panel-mode coordination problems from challenge 3 metastasized.

What worked. Stop emitting the card. Open the real side panel pre-filled. The chat does one thing now: route the user to the right existing UI, with the right fields populated, ready to save. The chat doesn't render forms anymore.

What I'd tell another team. The seductive thing about putting forms in the chat is that the demo is unbeatable. The reason it doesn't survive contact with a real product is that you already built the forms once. The instant the agent renders its own form, you've forked your front-end. Every domain change costs twice. Every platform costs twice again. The cost is invisible on day three and unaffordable by the end of week two.

Rule I now apply: the chat surface is for routing, not rendering. If the agent needs the user to fill in fields, the agent opens the form the user already knows how to use.

2. Locking the panel form during agent turns (a weekend)

The first attempt at panel-mode coordination. Mentioned briefly in challenge 3; worth its own callout because the speed with which it failed is instructive.

When I realized the agent and the user could write the same form simultaneously, the first instinct was to serialize: disable the form during agent turns. Re-enable it when the turn finishes. Strong consistency. Clean.

It survived internal testing because internal testing is patient. It survived the first day with two real customers. By day three the support inbox had three tickets saying "the form goes greyed out and I can't type."

The numbers were brutal:

Average agent turn: 2.4 seconds.
Average user typing speed: faster than that.
Users were holding fingers down on a key waiting for the form to come back, and characters were getting eaten.

The fundamental error was treating the user like a system component that could wait on a lock. Real users on real calls can't wait. I reverted in a single afternoon and replaced it with the last-write-wins-with-user-priority model from challenge 3.

The lesson: Locks borrowed from systems programming feel intuitive but assume a participant that doesn't get bored. Humans do.

3. Full conversation replay in the prompt (the first sprint)

The default agent loop in every tutorial: every turn, send the entire message history. The model has perfect memory by virtue of seeing everything.

It worked at first. It was simple. It was easy to debug. It was sending 800-token prompts.

A few days into real traffic, sessions were routinely 50+ turns long. Customers on hour-long calls weren't unusual. Prompts hit 30,000 tokens. Single turns cost dollars, latency crossed five seconds, and the quality got worse — the model started fixating on irrelevant turns from forty messages ago.

The intuition that bought time was wrong: more context ≠ better answers. Past a certain point, more context is worse — the model spends attention on noise.

What I tried that didn't work: sliding-window truncation (kept dropping the turn that established the goal); top-K relevance retrieval (good in theory, the relevance scoring became its own tuning problem).

What worked: an explicit summarization pass at session boundaries. Once a goal completes (goal.complete fires), the message history collapses into a structured summary in the session store. The next turn sees the summary, not the raw history. The summary is generated by a cheap model with a tight prompt.
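A sketch of the collapse step, under the assumption that the summary text has already been produced by the cheap model elsewhere. `Session`, `collapseOnGoalComplete`, and `buildModelContext` are hypothetical names, and the session store itself is left out.

```typescript
type Message = { role: "user" | "assistant" | "system"; content: string };
type Session = { summary: string | null; messages: Message[] };

// Called when goal.complete fires: keep the summary, drop the raw turns.
function collapseOnGoalComplete(session: Session, newSummary: string): Session {
  return { ...session, summary: newSummary, messages: [] };
}

// What the next turn sends to the model: the summary (if any) plus
// only the raw messages accumulated since the last collapse.
function buildModelContext(session: Session): Message[] {
  const context: Message[] = [];
  if (session.summary !== null) {
    context.push({
      role: "system",
      content: `Summary of completed goals: ${session.summary}`,
    });
  }
  return context.concat(session.messages);
}
```

The prompt size is then bounded by the summary plus the current goal's turns, rather than by the length of the call.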

Costs dropped roughly 7×. Latency improved. Answer quality went up. The agent stopped confusing itself with old turns.

The lesson: Conversation memory is not "send everything." It's a state-compression problem with a budget. Spend the budget on a real summarizer.

4. Adopting an agent framework (a week)

I evaluated a couple of the obvious agent frameworks early — names I'll leave out because the maintainers are doing fine work and the issue is fit, not quality. I prototyped enough to feel the shape of one and walked away after about a week.

The specific things that didn't fit:

Abstractions optimized for autonomy. The frameworks were built around "give the agent tools and let it loop until done." My domain wanted the opposite — one tool call per turn, deterministic dispatch, human confirmation before mutation. Every framework-level convenience I used I then had to fight against.
The streaming protocol was their protocol, not mine. I needed semantic UI events (ui.append, ui.replace, panel-open-command) flowing over my socket. The framework wanted to own the wire format. Wrapping it cost more than skipping it.
Debugging the agent loop through a framework is harder than debugging your own. When a turn went wrong, I wanted a stack trace and a Sentry breadcrumb. I got framework-internal abstraction layers.

I did keep two ideas: the shape of tool definitions (Anthropic-style function schemas, just rendered from Zod) and the idea of a "session" abstraction (just a Redis-backed object in my case).

What I'd tell another team. If your agent is doing general tasks for general users, the frameworks earn their weight. If your agent's job is to manipulate one specific domain through a well-defined service layer, the framework is mostly in the way. The line is roughly: is the model the product, or is the model a feature of the product? Frameworks are great for the first; for the second, they were in my way.

7. What Worked Well

Most of the wins are unsexy. They're the patterns I'd start with on day one if I did this again — and the patterns I'd recognize immediately in another team's codebase as "they've already learned the lessons."

1. Exactly one path into the database

Of every architectural decision in this project, this is the one I'd defend hardest.

The rule: the agent uses the same service methods the manual UI uses. There is no AgentQuotesService next to QuotesService. There is no "for the agent" route, fast-path, or back-door. When the agent calls create_quote, the tool handler invokes the same quotesService.create() that the form's submit handler invokes. Same validation. Same authorization guards. Same audit log. Same database transaction.

The payoffs compound:

Multi-tenant safety is free. The UserContextGuard that protects the manual UI protects the agent path too. I never had to write a "is this org allowed to do this" check inside an agent tool — that check already exists, one layer down.
Business logic changes once. A new validation rule, a new side effect, a status transition — written once, observed by both paths.
Bugs surface from either entry point. A bad input from the agent fails the same validator a bad form submission would, with the same error message. Debugging stays uniform.

The discipline this requires is small but constant: every time someone writes a new tool handler, the temptation is to inline "just one quick query" or "skip this check, the agent is trusted." The answer is always no. The agent is not trusted. It's a different keyboard.
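In code, the rule is almost boring. The sketch below uses an in-memory stand-in for QuotesService and inlines the org check that the article attributes to UserContextGuard; the types and function names other than `quotesService.create` are assumptions.

```typescript
type UserContext = { orgId: string; userId: string };
type QuoteInput = { orgId: string; clientName: string; total: number };
type Quote = QuoteInput & { id: string };

// In-memory stand-in for the real service. The org check here plays the role
// of the guard layer; in the real app it lives one layer down, not inline.
class QuotesService {
  private readonly quotes: Quote[] = [];

  create(ctx: UserContext, input: QuoteInput): Quote {
    if (input.orgId !== ctx.orgId) throw new Error("forbidden: wrong organization");
    if (!(input.total >= 0)) throw new Error("invalid: total must be non-negative");
    const quote: Quote = { ...input, id: `quote-${this.quotes.length + 1}` };
    this.quotes.push(quote);
    return quote;
  }
}

const quotesService = new QuotesService();

// Manual UI path: the form's submit handler.
function submitQuoteForm(ctx: UserContext, form: QuoteInput): Quote {
  return quotesService.create(ctx, form);
}

// Agent path: the create_quote tool handler. No extra checks, no shortcuts.
function createQuoteTool(ctx: UserContext, args: QuoteInput): Quote {
  return quotesService.create(ctx, args);
}
```

Both entry points are one-liners on purpose. If a tool handler grows logic the form handler lacks, that logic probably belongs in the service.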

2. Stream semantic UI events, not tokens

Almost every LLM tutorial streams text. The model emits tokens; the client appends them to a buffer; the user watches the buffer fill. This works fine when the product is the text.

My product isn't the text. It's a CRM. The user doesn't want to read what the agent is "thinking" — they want a form to open, a field to fill, a quote to draft. Streaming tokens to the chat surface would have made the chat the main UI again, which is exactly the failure mode I'd just fought my way out of.

The protocol streams semantic UI operations instead:

ui.append  { type: 'quote-form-card', props: {} }
ui.replace { key: 'quote-form-card', props: { clientName: 'Jordan' } }
ui.replace { key: 'quote-form-card', props: { ..., services: [...] } }
ui.remove  { key: 'assistant-thinking' }

The client doesn't parse. It dispatches. Each op is a tiny structured update with a key and a payload. The same op renders as a portal panel on web and a bottom sheet on mobile — the protocol is UI-agnostic.
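The dispatch step can be sketched as a reducer over the feed. The op names come from the protocol above; the `op` discriminator field, the `UiNode` shape, and the choice to default a node's key to its type when ui.append carries no key are assumptions made so the example is self-contained.

```typescript
type Props = Record<string, unknown>;
type UiNode = { key: string; type: string; props: Props };

type UiOp =
  | { op: "ui.append"; type: string; key?: string; props: Props }
  | { op: "ui.replace"; key: string; props: Props }
  | { op: "ui.remove"; key: string };

// Pure reducer: each op produces the next feed without touching rendering.
function applyUiOp(feed: UiNode[], uiOp: UiOp): UiNode[] {
  switch (uiOp.op) {
    case "ui.append": {
      const { type, props } = uiOp;
      const key = uiOp.key ?? type; // assumption: key defaults to the node type
      return [...feed, { key, type, props }];
    }
    case "ui.replace": {
      const { key, props } = uiOp;
      return feed.map((node) => (node.key === key ? { ...node, props } : node));
    }
    case "ui.remove": {
      const { key } = uiOp;
      return feed.filter((node) => node.key !== key);
    }
  }
}
```

Because the reducer knows nothing about portals or bottom sheets, the web and mobile clients can share it and differ only in how they render a `UiNode`.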

The win I didn't see coming: extending the protocol became zero-friction. When I needed a way to tell the client "open the side panel pre-filled," I didn't add a new wire op. I added a new node type (panel-open-command), let the client intercept it, and let the agentui-react library null-render it as a fallback. No breaking change. No client redeploy. The same ui.append op carried a new semantic meaning, and the protocol absorbed it.

If you take one wire-protocol idea from this article, take this one: stream structured updates, not bytes. Your reasoning layer changes faster than your UI; your UI changes faster than your transport. Pin the transport, version the structure, and let everything above move.

3. The LLM is a reasoning layer, never a system actor

Stated as a rule: the model can propose anything; it cannot do anything. Every mutation goes through a backend tool handler I wrote, which goes through the existing service layer, which goes through the existing guards.

This sounds defensive. It is. The reasons it was the right default:

It's the only way multi-tenancy stays safe. A model that has direct DB access has to be perfectly prompted forever. A model that can only call tools that already check org membership is structurally safe.
It bounds the blast radius of model mistakes. Worst case: the LLM calls a tool with bad arguments. The tool's validator rejects it. The model gets a structured error back and tries again or gives up. There is no "the agent ran a query and exposed another org's data" failure mode because there is no path for that to happen.
It makes the LLM swappable. I've changed model providers and model versions during this project. The tool layer didn't move. The contract the agent sees is "here are your tools, here are their schemas" — implementation underneath is mine.
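One way to picture the "propose, then dispose" boundary is a dispatcher that never throws at the model and never runs a tool it does not know. This is a sketch with assumed names (`Tool`, `ToolResult`, `dispatchToolCall`, and the error codes); in the real system each `run` would call into the shared service layer.

```typescript
type ToolError = {
  code: "unknown_tool" | "invalid_arguments" | "tool_failed";
  message: string;
};
type ToolResult = { ok: true; data: unknown } | { ok: false; error: ToolError };

type Tool = {
  validate: (args: unknown) => string | null; // null means the arguments are valid
  run: (args: unknown) => unknown; // would delegate to the service layer
};

// The model only ever proposes (name, args). Everything else happens here.
function dispatchToolCall(tools: Map<string, Tool>, name: string, args: unknown): ToolResult {
  const tool = tools.get(name);
  if (!tool) {
    return { ok: false, error: { code: "unknown_tool", message: `no tool named ${name}` } };
  }
  const problem = tool.validate(args);
  if (problem !== null) {
    return { ok: false, error: { code: "invalid_arguments", message: problem } };
  }
  try {
    return { ok: true, data: tool.run(args) };
  } catch (err) {
    // Service-layer rejections (guards, validators) come back as structured errors too.
    const message = err instanceof Error ? err.message : String(err);
    return { ok: false, error: { code: "tool_failed", message } };
  }
}
```

A hallucinated tool name and a guard rejection both come back as data the model can react to, and neither can reach the database by another route.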

The phrase that became internal shorthand: "the model proposes, the service layer disposes." Every time someone proposes letting the agent do something more "directly," that's the test. Does the proposal preserve the service layer as the dispose step? If not, no.

4. Prompt-as-code

The prompt was the most-edited file in the agent module and, for most of the build, the least-tested. Anyone could change it. Nobody reviewed it as code. Behavioral regressions traced back to one-line wording changes nobody had flagged as risky.

The shift was treating prompt-packer.ts and INTENT_EXAMPLES like a config file you'd never deploy unreviewed:

Snapshot test on the rendered prompt. For a representative matrix of (session state, intent, active card, panel mode), assert the exact prompt the model will see. Any prompt edit shows up as a diff in code review. Surprising changes get caught before they ship.
Per-intent unit tests. Each INTENT_EXAMPLES.* block has a unit test asserting it contains the phrases that make it work and lacks the phrases known to confuse the model. "Doesn't say create_*" has been a literal assertion ever since a wording change broke the entire client-creation flow.
End-to-end per-intent tests against a recorded model fixture. Hit a stubbed model with a deterministic response, verify the right tool gets called with the right shape. This catches shape drift (challenge 1) and refusals induced by prompt wording.

The non-obvious payoff: the highest-leverage testing infrastructure for an LLM agent is not LLM evals. It's snapshot tests on the input to the LLM. The model is the variable you don't control. The prompt is the one you do. Lock the prompt down with the same rigor you lock down any other config, and you can actually debug model behavior. Don't, and every regression looks like the model "just acting weird."

The four-line summary of how to set this up well, if you take nothing else from this section:

Render the prompt deterministically from data.
Snapshot the rendered output.
Test each intent block's contents.
Require code review on prompt changes.

That's the whole pattern. It's not novel. It's just that for some reason nobody reaches for it on day one with prompts the way they would with, say, a feature flag config.
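For the first two steps, a sketch of what "deterministic from data" can mean in practice. The `PromptInput` fields and the output layout are invented for illustration and are not the format prompt-packer.ts actually produces. The point is that the same data always renders the same string, so a snapshot comparison is a plain equality check.

```typescript
// Hypothetical input shape; the real prompt packer's inputs are not shown here.
type PromptInput = {
  intent: string;
  activeCard: string | null;
  panelMode: boolean;
  examples: Record<string, string>;
};

// No timestamps, no random ids, and keys are sorted: identical data in,
// identical prompt out, regardless of object insertion order.
function renderPrompt(input: PromptInput): string {
  const exampleLines = Object.keys(input.examples)
    .sort()
    .map((name) => `- ${name}: ${input.examples[name]}`);
  return [
    `intent: ${input.intent}`,
    `active_card: ${input.activeCard ?? "none"}`,
    `panel_mode: ${input.panelMode ? "on" : "off"}`,
    "examples:",
    ...exampleLines,
  ].join("\n");
}
```

With rendering pinned down like this, a snapshot test runner only has to store the rendered string per matrix cell and diff it on every change.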

8. Lessons Learned

Five things I'd put on a single page and tape to the wall before starting the next project. In the order they'd save the most pain.

1. The chat is a routing layer, not a UI surface

The single most expensive mistake I made was treating the chat panel as a place where forms live. It demos beautifully and it forks your front-end forever. Every domain change costs twice. Every platform costs twice again. Within weeks you have three implementations of every form and a validation library that exists only to keep them in sync.

The rule I wish I'd adopted on day one:

The chat figures out the user's intent and opens the right existing UI pre-filled. It does not render forms.

If you find yourself building "agent components" that duplicate components your app already has, stop and ask why those components need to exist twice. Usually the answer is that they don't.

2. The LLM is a reasoning layer, never a system actor

Tool calls only. No direct database access. No raw SQL. No "trust the model with admin permissions because it's mostly right." Every mutation flows through the same service layer your manual UI uses, with the same guards and the same validators.

The payoffs are unglamorous and load-bearing: multi-tenant safety is structural rather than prompted, model providers are swappable, and the blast radius of any model mistake is bounded by your existing API surface. The internal shorthand that helped most:

The model proposes, the service layer disposes.

Every "let's just let the agent do it directly" proposal gets tested against this. The answer is almost always no.

3. The prompt is your most important config file. Test it like one.

The single most useful piece of testing infrastructure I built for the agent is not an LLM eval. It's snapshot tests on the input to the LLM, plus per-intent unit tests on each example block, plus end-to-end tests against a recorded model fixture.

The model is the variable you don't control. The prompt is the one you do. Lock the prompt down with the same rigor you'd apply to any other production config — code review, snapshots, semantic assertions — and a class of regressions that used to look like "the model is acting weird" turns into normal, diff-shaped engineering work.

The cute one-line prompt edit that "makes the wording more general" is the bug-shaped one. Catch it in review, not in production.

4. Most "agent" problems are orchestration problems

The instinct on a new LLM project is to reach for maximum model intelligence on every turn. Reading a phone number out of a sentence does not require maximum model intelligence. Picking a Twilio template does not require maximum model intelligence. Validating that a year is between 1980 and 2027 does not require maximum model intelligence.

Reserve the model for the genuinely hard turns — disambiguation, multi-entity reasoning, intent that doesn't fit a deterministic schema. Route everything else through code you wrote and can debug. The architecture that emerges is cheaper, faster, more reliable, and easier to staff — your team can fix the deterministic parts without an LLM specialist on the ticket.

The reframing that helped: minimize how much work the model needs to do, rather than maximize what it's allowed to do. This sounds like a cost optimization. It's actually a correctness optimization. The smaller the LLM's surface, the smaller the surface where it can go wrong.
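A small sketch of that split, using the year check mentioned above. The `Resolution` type and `resolveYear` helper are hypothetical; the idea is that code handles the clean case and the clearly wrong case, and only input it cannot parse is escalated to the model.

```typescript
type Resolution =
  | { by: "code"; year: number } // parsed and validated deterministically
  | { by: "rejected"; error: string } // parsed, but out of range
  | { by: "model" }; // not parseable by code; needs reasoning

const MIN_YEAR = 1980;
const MAX_YEAR = 2027;

function resolveYear(raw: string): Resolution {
  const trimmed = raw.trim();
  // Anything that is not a bare four-digit number goes to the model,
  // e.g. "same year as her last one".
  if (!/^\d{4}$/.test(trimmed)) return { by: "model" };
  const year = Number(trimmed);
  if (year < MIN_YEAR || year > MAX_YEAR) {
    return { by: "rejected", error: `year must be between ${MIN_YEAR} and ${MAX_YEAR}` };
  }
  return { by: "code", year };
}
```

Two of the three outcomes never touch the model, and both of those are trivially debuggable.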

5. Build vs. buy: be honest about what kind of agent you're building

The advice I'd give a founder or CTO making this exact decision, in a single sentence:

If you're building a chatbot, buy. If you're building a workflow layer on a domain you already own, build — but only if you can keep scope tight enough to ship it in weeks, not quarters.

The teams that succeed at building in-house are the ones whose agent's job is to manipulate a system of record they already control. The CRM, the scheduling system, the inventory database. You already own the domain, the data model, the permissions, the audit log, the validation rules. The agent extends an existing app by giving it a new entry point. That work is unglamorous, expensive, and worth it because it compounds with everything else you ship.

The teams that fail are the ones who set out to build "an AI assistant" with no specific domain to manipulate. That problem space has well-funded competitors, fast-moving infrastructure, and almost no defensibility. Buy.

The honest version of the cost: with ruthless scoping and the discipline to reuse your existing service layer instead of rebuilding it for the agent, you can ship something genuinely useful in a few weeks — I did. Without those constraints, the same project will eat quarters and still not feel done. The variable that decides between the two isn't model quality. It's how much architecture you're willing to not invent.

The trap to avoid in either direction: treating "should we build or buy an AI agent" as a marketing-feature decision. It isn't. It's an architectural decision about whether the model is the product (buy) or whether the model is a feature embedded inside the product (build — carefully).

9. Future Improvements

Three near-term bets. None of them is research. All of them are paying off existing debts or extending patterns I already trust.

1. Finish the side-panel migration for quotes and jobs

Create-client now opens the real side panel pre-filled. Create-quote and create-job still emit in-chat form-cards. This is the most expensive piece of unfinished work in the codebase.

The reason it's slower to migrate is that quotes and jobs are also the entities with panel-mode collaboration — the agent can patch fields live while the panel is open. The chat-card path and the panel-patch path currently both work and the migration has to keep both working through the transition. The plan:

Move the chat-card path behind a feature flag, default off for new orgs.
Wire create-quote and create-job to emit panel-open-command with regex-extracted prefill, the same shape I use for clients.
Keep panel-mode patching exactly as-is — it's already the right pattern; it just needs to be the only edit path once the side panel is open.

The end state I'm aiming for: the chat never renders forms. It opens forms, fills forms, and patches fields in forms the user is already looking at. That's the whole job.

2. Multi-step autonomy with a real undo model

The thing I deliberately cut from v1. "Create the quote, send the confirmation SMS, book the slot — go." Users have asked for it from day one. Every prototype I built failed the same way: the agent committed to step 2 before step 1's error was visible, and the user couldn't undo cleanly.

What's needed before this can ship:

A real transaction envelope around a multi-step plan. The agent declares the plan up front; the steps execute as a single logical unit; partial failure rolls back the visible state.
Per-step undo with a human-readable trail. Not just "revert the database write" — "unsend the SMS if it hasn't been delivered yet; restore the calendar to what it was before; surface what couldn't be undone."
An explicit confirmation gate for irreversible steps. SMS to a customer is irreversible. Calendar mutation is mostly reversible. The agent has to know the difference and pause for confirmation on the ones it can't take back.
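None of this exists yet, so the following is only a sketch of the shape such an envelope might take. Every name here (`Step`, `PlanResult`, `executePlan`) is hypothetical, steps are synchronous to keep the example short, and real steps would be async calls into the service layer.

```typescript
type Step = {
  name: string;
  irreversible: boolean;
  run: () => void; // throws on failure
  undo?: () => void; // absent when the step cannot be taken back
};

type PlanResult =
  | { status: "needs_confirmation"; steps: string[] }
  | { status: "completed"; trail: string[] }
  | { status: "rolled_back"; failedStep: string; trail: string[] };

function executePlan(plan: Step[], confirmed: Set<string>): PlanResult {
  // Gate: nothing runs until every irreversible step has been confirmed up front.
  const unconfirmed = plan
    .filter((step) => step.irreversible && !confirmed.has(step.name))
    .map((step) => step.name);
  if (unconfirmed.length > 0) return { status: "needs_confirmation", steps: unconfirmed };

  const done: Step[] = [];
  const trail: string[] = [];
  for (const step of plan) {
    try {
      step.run();
      done.push(step);
      trail.push(`ran ${step.name}`);
    } catch {
      trail.push(`failed ${step.name}`);
      // Undo completed steps in reverse order and record what could not be undone.
      for (const prev of done.slice().reverse()) {
        if (prev.undo) {
          prev.undo();
          trail.push(`undid ${prev.name}`);
        } else {
          trail.push(`could not undo ${prev.name}`);
        }
      }
      return { status: "rolled_back", failedStep: step.name, trail };
    }
  }
  return { status: "completed", trail };
}
```

The `trail` array is the human-readable record: it is what would be surfaced to the user after a partial failure.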

The temptation is to ship the demo-shaped version of this — chain three tool calls, hope for the best, apologize when something breaks. I'm explicitly not doing that. Multi-step autonomy without an undo model is a worse product than no multi-step autonomy at all.

3. Voice in the loop

The piece I cut from v1, now closer to ready. Not full voice-AI (I'm not building a phone agent), but transcription-assisted form-filling during the call itself.

The shape: the user puts the call on speaker; AssemblyAI transcribes it in real time; the transcript streams into the same agent pipeline as typed input; the agent drafts the quote and the form-card as the conversation happens; the user confirms.

Most of the infrastructure already exists. The streaming UI protocol already handles incremental updates. The agent's intent classification doesn't care whether the input came from a keyboard or a microphone. The new pieces are bounded:

The transcription pipeline (AssemblyAI is already in the stack for other features).
A speaker-segmented prompt so the agent can distinguish the customer's words from the user's.
A consent and recording-disclosure flow — the legally non-negotiable piece.
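For the speaker-segmented prompt, a sketch of the formatting step only. It assumes the transcription stage has already labelled each segment as coming from the user or the customer; the `Segment` type and `renderTranscript` helper are hypothetical and do not reflect any transcription vendor's API.

```typescript
type Speaker = "user" | "customer";
type Segment = { speaker: Speaker; text: string };

// Merge consecutive segments from the same speaker and label each turn,
// so the agent can tell the customer's words from the user's.
function renderTranscript(segments: Segment[]): string {
  const merged: Segment[] = [];
  for (const segment of segments) {
    const text = segment.text.trim();
    if (text === "") continue; // skip empty partials
    const last = merged[merged.length - 1];
    if (last && last.speaker === segment.speaker) {
      last.text = `${last.text} ${text}`;
    } else {
      merged.push({ speaker: segment.speaker, text });
    }
  }
  return merged.map((s) => `[${s.speaker}] ${s.text}`).join("\n");
}
```

The output is ordinary text, so it can flow into the same agent pipeline as typed input without a new intent path.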

If I ship this and it works, it does the thing the original problem statement asked for: the user stays in the customer conversation while the CRM fills itself in behind them. Everything in the article up to this point is what made it possible to even consider shipping it without rewriting the agent.

Closing

None of the above is novel. The interesting work isn't in any one of these features — it's in the fact that I can ship them by extending the same orchestration pipeline, the same stream protocol, and the same service layer I've been describing for the rest of this article. The architecture earns its keep when the next feature is straightforward to add.

Not long ago I would have called this article's worth of work "the AI feature." Today I think of it as the plumbing — the layer that turns the rest of the product into something a model can usefully manipulate. The feature is what gets built on top.

If you're a founder or CTO evaluating whether to add an agent to your product: the most useful question to ask isn't "which model?" or "which framework?" It's "do we own a domain rich enough that letting users manipulate it through language is genuinely useful?" If yes, build — carefully, with the patterns above, and with a realistic budget. If no, buy something off the shelf and spend your engineering time on the part of the product that's actually defensible.

The model is the easy part. Everything around it is the work.
