Building VoxGate AI — Part 1: How I Built It

May 18, 2026 · 20 min read

ai · llm · agents · architecture · nestjs · websockets · typescript

Architecture of an LLM agent inside a multi-tenant CRM.

1. Introduction

A few weeks ago I built VoxGate AI — an LLM agent embedded inside my SaaS for car-detailing businesses. The promise was the one everyone makes: type "create a quote for Jordan Miles, full detail next Tuesday at 2" and the system does the rest. Quotes, jobs, clients, scheduling — all reachable from one input box.

What I actually built, twice, was a parallel UI inside the chat that the rest of the app didn't know existed.

The first version rendered domain forms — client, quote, job — directly in the chat feed. The LLM would emit "form cards" that the user could fill in, save, and watch the agent populate field-by-field. It demoed beautifully. In production it produced empty cards, half-filled cards, cards that disagreed with the real side panel a click away, and a steady drip of "why didn't it just open the form?" tickets.

The realization, when it came, was uncomfortable: I had built two clients for the same backend. The chat had its own form components, its own validation, its own optimistic state, its own "save" buttons — none of which the side panel, list views, or mobile app had ever heard of. Every domain change now meant two implementations. The agent wasn't extending the app; it was forking it.

This article is what I did about that, and what I'd tell another team about to make the same trade. It's aimed at founders and CTOs deciding whether to build an in-house agent on top of a CRM or buy one — because the answer isn't "build" or "buy," it's "understand what you're actually committing to before you start."

The short version of that commitment: an LLM agent inside a CRM is not primarily an LLM project. It's a streaming UI protocol, a state-coordination problem between agent and app, a prompt-curation discipline, and a long argument with yourself about where the chat panel's responsibility ends. The model is the easy part.

2. The Problem I Wanted to Solve

The problem wasn't "I need an AI feature." It was "my users spend more time clicking through the CRM than talking to their customer."

One quote from a shop owner during a feedback call kept getting repeated internally:

"I spend more time clicking through the CRM than talking to the customer."

That sentence sat behind every decision I made.

Who is doing the clicking

The primary users are shop owners and front-desk staff at car-detailing and window-tinting businesses. In a small shop the owner is often the one answering the phone between bays — wet hands, phone wedged in the shoulder, walking to a desk to look something up. Larger shops have dedicated front-desk staff, but the pattern is the same: the user is always doing something else at the same time as using the CRM.

Most of the work happens on a phone. Managers occasionally retreat to a desktop for schedule review and reporting, but the live customer call — the moment the agent had to win — is a thumb on a phone screen.

The functional constraint that fell out of this: if the AI cost the user more than a few seconds, or required them to think about a new mental model, they would abandon it and go back to clicks. The bar wasn't "does it work." The bar was "does it work faster than the existing UI you already know."

The workflow I was trying to compress

A typical new customer call, pre-agent, looked something like this:

Customer calls. Staff asks the same qualifying questions every time: vehicle, service, package, timing, contact info.
Staff opens the customer lookup screen. Searches. Doesn't find them. Switches to the quote builder.
Builds the line items. Switches to the calendar to find a slot. Switches back.
Creates the customer record. Saves the quote. Books the appointment. Goes back to send a confirmation.

Realistically: 15–30 clicks, four to six screen transitions, and a running conversation the user is trying not to drop. Quote complexity made it worse — every custom request meant more navigation, more pricing lookups, more typing.

The agent's job was not "answer questions." It was to act as an operational copilot — extract the structured fields out of the customer's words, pre-fill the quote, suggest a slot, draft the SMS — so the human stayed in the conversation and the CRM stayed out of the way.

Business goal, stated honestly

The primary goal was reducing time-to-quote during a live call. That was the metric every product decision was eventually tested against.

A clear secondary goal was differentiation. Most CRMs in the detailing/tinting space have scheduling and invoicing; very few have anything that resembles an operational copilot. AI in this segment in 2025 was overwhelmingly cosmetic — autogenerated review replies, marketing SMS — and I had a strong internal conviction that the more defensible position was using the LLM to remove friction inside the existing workflow rather than slap another chatbot on the marketing page.

What I explicitly decided not to do

Scoping cuts mattered as much as the features. In the first releases the agent could prepare actions but not commit them autonomously. It could draft a quote, suggest a slot, draft a confirmation SMS — but the human pressed the button.

Off the table for v1:

Automatic booking without confirmation.
Automatic customer messaging.
Direct calendar mutations.
Multi-step autonomous workflows ("create the quote, send the SMS, book the slot — go").
Voice / phone integration.

Voice was the most attractive cut to make and the most painful. Voice in a shop environment introduces transcription noise, latency, interruption handling, and infrastructure cost — and crucially, I hadn't yet validated that users would trust the agent at all. Building voice first would have been optimizing the wrong end of the funnel.

The cut I kept second-guessing was multi-step autonomy. It was the most demo-able feature on the roadmap and the one users asked for most. But every version I prototyped failed the same way: the LLM would commit to step 2 before step 1's error was visible, and the user couldn't undo cleanly. I deferred it to a later phase with a real undo model, which I'll come back to in Part 2.

3. Technical Requirements

The non-functional requirements did more to shape the architecture than the feature list did. Most of them surfaced within the first two weeks of prototyping — and several of them invalidated my initial design.

Latency: perceived, not absolute

I never set a numerical SLA, and I don't recommend starting with one. What mattered was the psychological threshold: if anything happened on screen immediately, the user trusted the system; if the UI sat still for even a second or two, they assumed it was frozen and switched back to clicking through the form.

Concretely this meant:

Streaming had to start visibly within a few hundred milliseconds.
Structured output (a quote, a slot suggestion) had to land within a couple of seconds.
Partial, streaming responses with progressive UI updates beat fully-formed responses that arrived a second later — even when the streamed version finished slower in absolute time.

The single most useful framing I landed on: stop optimizing for completion time, start optimizing for time-to-first-evidence-the-system-heard-you.

Where the LLM lives

The architecture is backend-centric. All LLM calls go through the NestJS server. Clients never talk to a model provider directly. This gave me four things in one stroke:

API keys never touch the browser or the mobile binary.
Org-scoped permission enforcement happens at the same layer as the rest of the CRM.
CRM context (the user's services, their org's pricing, their recent customers) can be injected safely.
Tool calls execute against the existing service layer with the same authorization the manual UI uses.

I deliberately did not adopt a heavyweight agent framework. I looked at the obvious candidates and walked away — not because they're bad, but because CRM operations are highly deterministic and permission-sensitive. A generic agent abstraction tends to optimize for autonomy and flexibility, which is exactly the wrong default when every action is also a write to a customer-facing system of record. My orchestration code is custom and lives in server/src/agent/ — a few thousand lines of NestJS module code that I own end to end and can debug like any other backend.

The architecture that emerged is a hybrid: structured internal tools for deterministic operations, LLM reasoning for conversational handling and field extraction. The model proposes; the service layer disposes.

Multi-tenancy: the LLM is not a trusted actor

The single biggest non-negotiable was tenant isolation. The CRM is per-organization; nothing the agent does can ever leak across orgs. I enforced this at four layers:

Authenticated org scoping at the request edge (the same UserContextGuard the manual UI uses).
Tool execution scoped to the calling org's context.
Org-aware queries at the data layer.
Strict request-context injection into the orchestration pipeline — the LLM never sees an org ID it could lie about; the server attaches identity out-of-band.

The architectural principle I kept reaching back to: treat the LLM as a reasoning layer, not a system actor. It can suggest an action; it cannot perform one. Every CRM mutation goes through a validated backend service that already enforces tenant authorization. The agent path and the manual UI path converge on the same service methods. If a manual user can't do something, the agent can't either.

This sounds obvious in writing. In practice it constantly fights the temptation to "just let the model do the SQL." Don't.
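A minimal sketch of that rule, assuming hypothetical names (RequestContext, QuoteService, createQuoteTool) rather than the real service layer: the tool handler receives the model's proposed arguments and the server-owned identity as two separate inputs, and only whitelisted fields are copied from the model.

```typescript
// Hypothetical types for illustration; not the production service layer.
interface RequestContext {
  orgId: string; // attached by the server from the authenticated session
  userId: string;
}

interface QuoteDraft {
  clientName: string;
  services: string[];
}

interface Quote extends QuoteDraft {
  id: string;
  orgId: string;
}

// In-memory stand-in for the CRM service that the manual UI also calls.
class QuoteService {
  private quotes: Quote[] = [];

  create(ctx: RequestContext, draft: QuoteDraft): Quote {
    const quote: Quote = { ...draft, id: `q${this.quotes.length + 1}`, orgId: ctx.orgId };
    this.quotes.push(quote);
    return quote;
  }

  list(ctx: RequestContext): Quote[] {
    // Org-aware query: a caller only ever sees rows for its own org.
    return this.quotes.filter((quote) => quote.orgId === ctx.orgId);
  }
}

function createQuoteTool(
  service: QuoteService,
  ctx: RequestContext,
  modelArgs: Record<string, unknown>,
): Quote {
  // Copy only whitelisted fields. Anything else the model emits,
  // including an orgId, is dropped here.
  const rawServices = modelArgs["services"];
  const draft: QuoteDraft = {
    clientName: String(modelArgs["clientName"] ?? ""),
    services: Array.isArray(rawServices) ? rawServices.map(String) : [],
  };
  return service.create(ctx, draft);
}
```

Even if the model emits an orgId in its arguments, the created quote carries the org from the server context, because the handler never reads identity from model output.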

Streaming was non-negotiable within a week

The first prototype did request/response. The agent waited until it had a complete answer, then returned it. It worked technically. Operationally it felt broken — the UI would freeze for the duration of generation, and a user on a live call would tap repeatedly, assume the page was stuck, and reload it.

I rebuilt around a custom stream protocol over WebSockets, consumed on the client via @kibadist/agentui-react. The protocol carries semantic UI ops (ui.append, ui.replace, ui.remove, ui.toast) rather than raw text, so the server can stream not just words but structural updates: open a form-card, replace a field, dismiss a toast, append a clarification prompt.

The mental shift this forced was bigger than I expected. I stopped thinking of the agent as "a function that returns an answer" and started thinking of it as "a process that continuously updates operational state on a client." That reframing changed everything downstream — error handling, state reconciliation, how I modelled tool calls, how the UI handled interruption.

Mobile parity from day one

The CRM has a Next.js web app and an Expo mobile app. The agent had to work in both, identically, from v1. There was no "ship web, follow up with mobile" path — too many of my highest-value users are on phones during the workflow I was trying to compress.

Mobile parity dictated several decisions that look strange in isolation:

No hover-dependent interactions.
Modal complexity kept to a minimum — Dialog on mobile, portal-side-panel on desktop, but the same component tree.
Keyboard interruption handled as a first-class state (the chat bar collapses, the keyboard takes over, the agent's streaming continues underneath).
No deeply nested flows — every interaction had to be one or two taps.

Counter-intuitively, mobile users were less patient with latency but more forgiving of incomplete streaming states. Years of chat-app conditioning. Desktop users, used to forms, found partial states more disorienting.

The protocol absorbed this by being UI-agnostic: the server emits semantic events; each client decides how to render them. The same ui.append { type: 'quote-form-card', props: {...} } becomes a portal panel on web and a bottom-sheet card on mobile.
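As an illustration of "each client decides how to render," here is a small sketch. The registry shape, the presentationFor helper, and the "inline-card" fallback are assumptions made for this example, not the agentui-react API.

```typescript
type Platform = "web" | "mobile";

interface UiAppendOp {
  op: "ui.append";
  type: string;
  props: Record<string, unknown>;
}

// Each client owns its registry; the server never learns which one is in use.
const presentationByPlatform: Record<Platform, Record<string, string>> = {
  web: { "quote-form-card": "portal-panel" },
  mobile: { "quote-form-card": "bottom-sheet" },
};

function presentationFor(platform: Platform, event: UiAppendOp): string {
  // Unknown node types fall back to a plain inline card instead of throwing.
  return presentationByPlatform[platform][event.type] ?? "inline-card";
}
```

The same event object goes to both clients; only the lookup table differs per platform.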

Reliability: never let the AI break the CRM

The single biggest product risk was making the core CRM feel less reliable because of the agent. The principle I landed on:

The AI layer must never prevent the user from completing the core task manually.

When generation fails:

User input is preserved.
Manual CRM controls remain available and visible.
The system degrades gracefully back to the deterministic workflow.

I deliberately avoided aggressive retry chains. Repeated retries against a flaky model call usually hurt more than they helped — they extended the perceived latency for the user without meaningfully increasing success rate. Better to fail fast, surface a small inline notice, and let the user fall through to the form they already know.
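A sketch of "fail fast, keep the input," under assumptions: attemptOnce, afterAttempt, and the notice text are hypothetical, and the deadline value is whatever the caller chooses. The point is a single attempt with a deadline and no retry loop.

```typescript
type AgentOutcome<T> =
  | { ok: true; value: T }
  | { ok: false; reason: "timeout" | "error" };

// One attempt, bounded by a deadline. No retries.
async function attemptOnce<T>(
  call: () => Promise<T>,
  deadlineMs: number,
): Promise<AgentOutcome<T>> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<AgentOutcome<T>>((resolve) => {
    timer = setTimeout(() => resolve({ ok: false, reason: "timeout" }), deadlineMs);
  });
  // Promise.resolve().then(call) also captures synchronous throws from call.
  const attempt = Promise.resolve()
    .then(call)
    .then(
      (value): AgentOutcome<T> => ({ ok: true, value }),
      (): AgentOutcome<T> => ({ ok: false, reason: "error" }),
    );
  try {
    return await Promise.race([attempt, deadline]);
  } finally {
    clearTimeout(timer);
  }
}

interface ComposerState {
  inputText: string;
  notice?: string;
}

// On failure the typed input is preserved and a small notice is attached.
function afterAttempt<T>(composer: ComposerState, outcome: AgentOutcome<T>): ComposerState {
  if (outcome.ok) return { inputText: "" };
  return { inputText: composer.inputText, notice: "Assistant unavailable. The form still works." };
}
```

Nothing here touches the manual form: whatever the outcome, the user can keep working in the UI they already know.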

Users tolerated AI failure surprisingly well. They tolerated UI failure not at all. That asymmetry shaped most of my error-handling decisions.

Cost: a structural constraint, not a knob

Operational CRM workflows generate frequent, short-lived interactions, which means prompt inefficiency compounds fast. I didn't set a hard per-org cap; I set a discipline:

No full-conversation replay on every turn — the session store carries compacted state.
Aggressive prompt size minimization.
Structured tool outputs over verbose natural-language responses.
Context injection scoped to what the current intent needs, not the full org snapshot.

The realization that shifted my cost trajectory the most: most CRM interactions are orchestration problems, not reasoning problems. Once I accepted that, I could route simple operations through smaller, cheaper models and reserve larger ones for the genuinely hard cases (vehicle disambiguation, multi-entity quote construction). Over time, the architecture evolved toward minimizing how much work the model needed to do rather than maximizing what it was allowed to do.
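One way such routing could look, as a sketch only: the signals (entityCount, ambiguousVehicle) and the two-tier split are assumptions for illustration, and the tiers are deliberately not tied to any specific model name.

```typescript
type ModelTier = "small" | "large";

interface RoutingSignal {
  currentGoal?: string; // e.g. "create_quote", carried over from the session
  entityCount: number; // how many clients, vehicles, or services the message mentions
  ambiguousVehicle: boolean; // e.g. set by deterministic extraction before the LLM call
}

function pickModelTier(signal: RoutingSignal): ModelTier {
  // The genuinely hard cases go to the larger model.
  if (signal.ambiguousVehicle) return "large";
  if (signal.currentGoal === "create_quote" && signal.entityCount > 1) return "large";
  // Everything else is treated as orchestration and stays on the cheaper tier.
  return "small";
}
```

Because the decision uses only signals available before the model call, routing adds no extra round-trip.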

4. System Architecture

Most of the architecture is unremarkable NestJS. The interesting parts are at three seams: how the orchestrator turns a user message into a plan, how the server pushes UI updates into a running client session, and how the agent stays coordinated with a side panel the user might be editing manually at the same time.

The module map

Everything agent-related lives under server/src/agent/. The four directories that matter:

server/src/agent/
  orchestration/       # prompt assembly, intent routing, the turn loop
    prompt-packer.ts   # composes the system prompt + INTENT_EXAMPLES + active card context
    agent.service.ts   # the turn loop: classify → call → emit
  cards/               # operations that mutate visible UI on the client
    open-card.ts       # emit an in-chat form-card
    update-card.ts     # merge field updates into an open card
    open-side-panel.ts # tell the client to open a real side panel pre-filled
  nlp/                 # deterministic extraction outside the LLM
    client-fields.ts   # regex extraction for phone/email/name
    vehicle-fields.ts  # year/make/model parsing with clarification contract
  state/
    session-store.ts   # per-session memory: history, currentGoal, panelContext
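
For a sense of what the nlp/ entries in the tree above might contain, here is a hedged sketch of deterministic contact extraction. The regular expressions and the extractContactFields helper are assumptions for this example (the phone pattern only covers North American formats), not the contents of client-fields.ts.

```typescript
interface ContactFields {
  email?: string;
  phone?: string;
}

const EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/;
// North American style numbers only; an assumption for this sketch.
const PHONE_RE = /(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}/;

function extractContactFields(text: string): ContactFields {
  const fields: ContactFields = {};
  const email = text.match(EMAIL_RE);
  if (email) fields.email = email[0].toLowerCase();
  const phone = text.match(PHONE_RE);
  if (phone) {
    // Keep the last ten digits so "+1 (555) 010-1234" and "555.010.1234" normalize alike.
    fields.phone = phone[0].replace(/\D/g, "").slice(-10);
  }
  return fields;
}
```

Fields pulled out this way never need to pass through the model, which keeps them cheap and predictable.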

An incoming user message flows: AgentService receives it → looks up SessionStore state → prompt-packer composes the system prompt with the active card / panel context attached → LLM call with tools → tool dispatch (cards, side-panel, domain services) → emitUI pushes events onto the open stream → SessionStore is updated with the new state for the next turn.

The thing to notice: the LLM never does anything. It only emits structured tool calls. Every actual mutation — to the database, to the UI, to the session state — is executed by code I wrote and can step through with a debugger.
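
The flow above can be sketched as a single function. This is a simplified, synchronous illustration with a stubbed model; runTurn, LlmStub, ToolOutcome, and the prompt format are hypothetical and much smaller than a real turn loop.

```typescript
interface ToolCall {
  tool: string;
  args: Record<string, unknown>;
}

interface UiEvent {
  op: "ui.append" | "ui.replace" | "ui.remove" | "ui.toast";
  type?: string;
  key?: string;
  props?: Record<string, unknown>;
}

interface TurnSession {
  currentGoal?: string;
  turns: string[];
}

interface ToolOutcome {
  events: UiEvent[];
  currentGoal?: string;
}

type LlmStub = (promptText: string) => ToolCall[];
type ToolHandler = (args: Record<string, unknown>) => ToolOutcome;

function runTurn(
  message: string,
  session: TurnSession,
  llm: LlmStub,
  handlers: Map<string, ToolHandler>,
  emitUI: (event: UiEvent) => void,
): TurnSession {
  // 1. Compose the prompt from the message plus session state.
  const promptText = `goal=${session.currentGoal ?? "none"}\nuser: ${message}`;
  // 2. The model only proposes tool calls; it executes nothing.
  const calls = llm(promptText);
  let currentGoal = session.currentGoal;
  // 3. Server code runs each known tool and pushes its UI events onto the stream.
  for (const call of calls) {
    const handler = handlers.get(call.tool);
    if (handler === undefined) continue; // unknown tools are skipped, never executed
    const outcome = handler(call.args);
    for (const event of outcome.events) emitUI(event);
    if (outcome.currentGoal !== undefined) currentGoal = outcome.currentGoal;
  }
  // 4. Return the updated state to be stored for the next turn.
  return { currentGoal, turns: [...session.turns, message] };
}
```

Every side effect sits in a handler that ordinary code owns, which is what makes the turn debuggable step by step.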

The session/state model

SessionStore holds, per session:

Message history — but compacted. I don't replay full conversations; older turns are summarized.
currentGoal — a string like create_quote, create_client, edit_job. Set when the agent commits to an intent; consulted by the next turn so context survives mid-flow interruptions.
Open card references — if the agent emitted a client-form-card, the session remembers it's open and which fields are populated, so a follow-up update_card call can do a partial merge instead of a clobbering rewrite.
panelContext — when the side panel is open, this records which entity is being edited (quote:abc123, job:def456). This is what unlocks panel-mode (more below).

Sessions live in Redis with a TTL. I rejected Postgres for this — too much write pressure, and the data is recoverable from message history if it ever matters. In-memory was the original prototype and lasted exactly as long as it took to deploy a second server instance.
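
A sketch of the session shape and the partial-merge behavior described above. The field names beyond currentGoal and panelContext are assumptions, and the TtlSessionStore class is an in-memory stand-in used only to show expiry semantics; a Redis-backed store would rely on key expiry instead.

```typescript
interface OpenCard {
  key: string;
  fields: Record<string, unknown>;
}

interface SessionState {
  summary: string; // compacted older turns
  recentTurns: string[];
  currentGoal?: string;
  openCards: OpenCard[];
  panelContext?: string; // e.g. "quote:abc123"
}

// Partial merge: fields not mentioned in the patch keep their values.
function mergeCardFields(
  state: SessionState,
  key: string,
  patch: Record<string, unknown>,
): SessionState {
  return {
    ...state,
    openCards: state.openCards.map((card) =>
      card.key === key ? { ...card, fields: { ...card.fields, ...patch } } : card,
    ),
  };
}

// In-memory stand-in for a TTL-based store, with an injectable clock for testing.
class TtlSessionStore {
  private entries = new Map<string, { state: SessionState; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  set(sessionId: string, state: SessionState): void {
    this.entries.set(sessionId, { state, expiresAt: this.now() + this.ttlMs });
  }

  get(sessionId: string): SessionState | undefined {
    const entry = this.entries.get(sessionId);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(sessionId);
      return undefined;
    }
    return entry.state;
  }
}
```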

Intent classification

No separate classifier model. Intent is inferred inside the same LLM call that does the work, via a layered prompt:

The system prompt enumerates the intents the agent understands.
INTENT_EXAMPLES provides a few-shot block per intent (client_mutation, quote_mutation, job_mutation, query, clarification_response, chitchat).
The LLM picks the intent by which tool it calls (or by replying conversationally if no tool fits).

This was the right call for my scale. A second classifier round-trip would have doubled latency for what is, in practice, a tool-selection problem the main model already solves. The cost is that the prompt is long — every turn carries every intent's examples — but the prompt cache absorbs most of that, and I'd rather pay tokens than a round-trip.
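
A sketch of how a prompt could be laid out so the long, unchanging part stays identical across turns, which is what a prefix-based prompt cache needs. Only the INTENT_EXAMPLES name comes from this article; the example contents, the packPrompt signature, and the prefix/suffix split are assumptions.

```typescript
interface PackedPrompt {
  stablePrefix: string; // identical on every turn, so it can be cached
  volatileSuffix: string; // changes with the active card or panel
}

// Invented example content for illustration.
const INTENT_EXAMPLES: Record<string, string[]> = {
  quote_mutation: ['"quote Jordan for a full detail" -> create_quote'],
  client_mutation: ['"add a client named Jordan Miles" -> create_client'],
  chitchat: ['"thanks!" -> reply'],
};

function packPrompt(
  basePrompt: string,
  examples: Record<string, string[]>,
  activeContext: string | undefined,
): PackedPrompt {
  // Sort intents so the prefix is byte-for-byte stable regardless of key order.
  const exampleBlock = Object.keys(examples)
    .sort()
    .map((intent) => `## ${intent}\n${(examples[intent] ?? []).join("\n")}`)
    .join("\n\n");
  return {
    stablePrefix: `${basePrompt}\n\n${exampleBlock}`,
    volatileSuffix: `Active context: ${activeContext ?? "none"}`,
  };
}
```

Keeping per-turn context out of the prefix is the design choice that matters here: one changed character early in the prompt would defeat prefix matching for everything after it.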

Ambiguity is handled explicitly: tools can return needs_clarification with a structured prompt, which the agent surfaces and the session remembers as the current goal. The clarificationCount counter on agent.tool events lets me watch this metric in production — if it climbs, the prompt examples need work.
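
The clarification contract can be sketched as a discriminated union. This is a simplification under assumptions: the ToolResult shape and applyToolResult are hypothetical, and the counter is kept on the session here for brevity rather than on agent.tool events.

```typescript
type ToolResult =
  | { status: "ok"; summary: string }
  | { status: "needs_clarification"; question: string; goal: string };

interface ClarifySession {
  currentGoal?: string;
  clarificationCount: number;
}

function applyToolResult(
  session: ClarifySession,
  result: ToolResult,
): { session: ClarifySession; reply: string } {
  if (result.status === "needs_clarification") {
    // Remember the goal so the user's answer resumes the same flow,
    // and count the clarification so the rate can be monitored.
    return {
      session: { currentGoal: result.goal, clarificationCount: session.clarificationCount + 1 },
      reply: result.question,
    };
  }
  return { session, reply: result.summary };
}
```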

Tool calling shape

Tools are Anthropic-style function schemas, defined as Zod schemas server-side and serialized to the tool-definition format the SDK expects. The full set is small — under twenty — and groups into four kinds:

Kind	Examples	Effect
Domain mutations	`create_quote`, `create_client`, `create_job`, `update_quote_status`	Hit the existing CRM service layer. Same code path as the manual UI.
Card ops	`open_card`, `update_card`	Render or mutate an in-chat form-card. UI-only.
Panel ops	`open_side_panel`, `patch_quote_field`, `patch_job_field`	Open the real side panel, or (when one is open) edit its live form.
Conversational	`ask_clarification`, `reply`	Surface text without a domain action.

The most important design rule: a tool either mutates data or mutates UI, never both implicitly. When the agent wants to create a quote and show a form-card to confirm it, that's two tool calls. This made the turn replayable for debugging — every state change is an explicit, logged event.
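
One way to make that rule mechanical, sketched with hypothetical names (TOOL_EFFECTS, planTurn): each tool declares exactly one kind of effect, and a turn is recorded as an ordered log of explicit entries.

```typescript
type ToolEffect = "data" | "ui" | "none";

// Each tool declares exactly one effect kind; tool names are the ones from the table above.
const TOOL_EFFECTS = new Map<string, ToolEffect>([
  ["create_quote", "data"],
  ["create_client", "data"],
  ["open_card", "ui"],
  ["update_card", "ui"],
  ["open_side_panel", "ui"],
  ["ask_clarification", "none"],
  ["reply", "none"],
]);

interface ProposedCall {
  toolName: string;
  args: Record<string, unknown>;
}

interface TurnLogEntry {
  seq: number;
  toolName: string;
  effect: ToolEffect;
}

// Builds the ordered, replayable log for one turn and rejects unknown tools.
function planTurn(calls: ProposedCall[]): TurnLogEntry[] {
  return calls.map((call, index) => {
    const effect = TOOL_EFFECTS.get(call.toolName);
    if (effect === undefined) throw new Error(`Unknown tool: ${call.toolName}`);
    return { seq: index + 1, toolName: call.toolName, effect };
  });
}
```

"Create a quote and show a card" therefore shows up as two log entries, one with a data effect and one with a UI effect, in the order they ran.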

The stream protocol on the wire

WebSockets, plain. I evaluated SSE and rejected it because I needed bidirectional traffic for user-cancel and for the (deferred) panel-state-up channel. Each session opens one socket; the server multiplexes UI events down it.

A typical event sequence for "create a quote for Jordan, full detail tomorrow at 2":

1. ui.append  { type: 'user-message', props: { text: '...' } }
2. ui.append  { type: 'assistant-thinking' }         # immediate, perceived latency win
3. ui.append  { type: 'quote-form-card', props: {} } # empty card opens
4. ui.replace { key: 'quote-form-card', props: { clientName: 'Jordan' } }
5. ui.replace { key: 'quote-form-card', props: { ..., services: ['full-detail'] } }
6. ui.replace { key: 'quote-form-card', props: { ..., scheduledAt: '...' } }
7. ui.remove  { key: 'assistant-thinking' }
8. ui.append  { type: 'assistant-message', props: { text: 'Quote drafted...' } }

The client just renders whatever it's told. There is no client-side state machine deciding what should happen next; the server decides, the client paints. This was load-bearing for cross-platform consistency: web and mobile share the same protocol, and neither has to know about the agent's internal state.

Hidden node types (panel-open-command, session-meta, optimistic-patch) are filtered by a client-side TRANSIENT_NODE_TYPES set — they're protocol events, not rendered content. This let me add new agent → client behaviors (like "open the side panel") without ever introducing a new wire op.
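
A sketch of what "the server decides, the client paints" can look like on the client: a pure function that applies each op to a feed, plus the transient filter. The UiNode and UiOp shapes are assumptions for this example (a key is carried on every appended node), not the wire format of @kibadist/agentui-react.

```typescript
interface UiNode {
  key: string;
  type: string;
  props: Record<string, unknown>;
}

type UiOp =
  | { op: "ui.append"; node: UiNode }
  | { op: "ui.replace"; key: string; props: Record<string, unknown> }
  | { op: "ui.remove"; key: string }
  | { op: "ui.toast"; text: string };

// Protocol-only node types that are never rendered in the feed.
const TRANSIENT_NODE_TYPES = new Set(["panel-open-command", "session-meta", "optimistic-patch"]);

// Pure reducer: the client holds no logic about what should happen next.
function applyOp(feed: UiNode[], op: UiOp): UiNode[] {
  switch (op.op) {
    case "ui.append":
      return [...feed, op.node];
    case "ui.replace":
      return feed.map((node) => (node.key === op.key ? { ...node, props: op.props } : node));
    case "ui.remove":
      return feed.filter((node) => node.key !== op.key);
    case "ui.toast":
      return feed; // toasts are shown outside the feed in this sketch
  }
}

function visibleNodes(feed: UiNode[]): UiNode[] {
  return feed.filter((node) => !TRANSIENT_NODE_TYPES.has(node.type));
}
```

Replaying a recorded op sequence through applyOp reproduces the final feed, which is useful for debugging a session after the fact.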

Panel-mode collaboration

The most useful thing in the architecture, and the most fiddly to get right.

When the user opens a side panel — manually, by clicking "New Quote," or via the agent, by saying "create a quote" — the client emits a panel.open event up the same socket. The server records the panel target in panelContext. From that moment, the next user turn gets a different tool set: instead of create_quote, the agent sees patch_quote_field and patch_job_field. A message like "actually make that ceramic coating instead" routes to a field patch, not a new quote.
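
The tool-set switch can be sketched in a few lines. The PanelEvent shape and the function names are hypothetical; the tool names and the "quote:abc123" style of panel target come from this article.

```typescript
const CONVERSATIONAL_TOOLS = ["ask_clarification", "reply"];
const CREATION_TOOLS = ["create_quote", "create_client", "create_job", "open_side_panel"];
const PANEL_TOOLS = ["patch_quote_field", "patch_job_field"];

type PanelEvent =
  | { type: "panel.open"; target: string } // e.g. "quote:abc123"
  | { type: "panel.close" };

// The server records the panel target when the client reports an open panel.
function panelContextAfter(event: PanelEvent): string | undefined {
  return event.type === "panel.open" ? event.target : undefined;
}

// With a panel open, creation tools are swapped for field patches.
function toolsForTurn(panelContext: string | undefined): string[] {
  const primary = panelContext === undefined ? CREATION_TOOLS : PANEL_TOOLS;
  return [...primary, ...CONVERSATIONAL_TOOLS];
}
```

Because create_quote is simply absent from the tool list while a quote panel is open, "make that ceramic coating instead" has nowhere to go except a patch.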

Three details that took longer than expected to get right:

The agent's view can drift. If the user manually edits a field while the agent is mid-turn, the agent's prompt context is stale. The client pushes field-level updates back into the panel context so the next prompt is correct; the current turn just has to tolerate occasional stale reads.
Panel close is not "done." Closing the panel doesn't mean the goal is complete; it might mean the user gave up. I separate panel.close (UI event) from goal.complete (agent decides based on whether the entity was saved).
Not every entity gets panel-mode. PanelMode in packages/validation/src/panel-context.ts only covers quotes and jobs. Clients deliberately don't have a panel-mode contract — they're simple enough that opening a fresh side panel pre-filled, and letting the form's native validation drive, was strictly better than building a patch_client_field path.
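
The panel.close versus goal.complete split might look like this in code. It is a sketch with assumed names (GoalState, handlePanelClose), and keeping the goal alive after an unsaved close is my illustration of the idea, not a statement of the exact production behavior.

```typescript
interface GoalState {
  currentGoal?: string;
  panelContext?: string;
}

interface CloseResult {
  state: GoalState;
  emitted: string[]; // agent-level events produced by this close
}

function handlePanelClose(state: GoalState, entitySaved: boolean): CloseResult {
  if (entitySaved) {
    // Saved entity: the goal is done, so both goal and panel context are cleared.
    return { state: {}, emitted: ["goal.complete"] };
  }
  // Unsaved close: the panel is gone, but the goal stays so a follow-up can resume it.
  return { state: { currentGoal: state.currentGoal }, emitted: [] };
}
```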

The one diagram

           ┌─────────────────┐         ┌─────────────────┐
           │   Web (Next.js) │         │  Mobile (Expo)  │
           │  agentui-react  │         │  agentui-react  │
           └────────┬────────┘         └────────┬────────┘
                    │   WebSocket: ui.* events  │
                    └─────────────┬─────────────┘
                                  │
                         ┌────────▼────────┐
                         │  AgentGateway   │  (NestJS, Fastify ws)
                         └────────┬────────┘
                                  │
               ┌──────────────────▼──────────────────┐
               │           AgentService              │
               │  ┌──────────────┐  ┌─────────────┐  │
               │  │ prompt-packer│  │ SessionStore│──┼──► Redis
               │  └──────┬───────┘  └─────────────┘  │
               │         │                           │
               │    LLM call (Anthropic)             │
               │         │                           │
               │  ┌──────▼───────┐                   │
               │  │ Tool dispatch│                   │
               │  └──┬──┬──┬──┬──┘                   │
               └─────┼──┼──┼──┼──────────────────────┘
                     │  │  │  │
                     │  │  │  └──► cards/open-side-panel   ── emitUI ──┐
                     │  │  └─────► cards/update-card       ── emitUI ──┤
                     │  └────────► nlp/* (deterministic, no LLM)       │
                     │                                                  │
                     └───────────► Domain services (Quotes/Clients/    │
                                   Jobs/Services) — same code path as ──┘
                                   the manual UI ─► Postgres via Prisma

Two things to call out on this diagram:

There is exactly one path into the database. The agent uses the same services the manual UI uses. I never wrote a "for the agent" alternative.
Everything the user sees comes through emitUI. Whether a tool's job was "render a card" or "create a quote and show a card," the visible side-effect is always a UI event on the same stream. This is what let me carry out the chat → side panel migration without touching domain code.

Bridge to Part 2

The architecture above is what I ship today. It is not what I shipped first.

In Part 2, I walk through the four engineering fights that reshaped it, the approaches I tried and abandoned, the patterns that earned their keep, and what I'd tell a founder or CTO making the same build-vs-buy call today.
