How We Built an AI Support System That Resolved 73% of Tickets Automatically
A technical deep dive into the AI customer support platform we built for a US enterprise client.
Last year, we built an AI-powered customer support system for a mid-size US SaaS company processing around 3,000 support tickets per month. After six months in production, the system is automatically resolving 73 percent of incoming tickets without human intervention. Here is how we built it.
I want to share the technical details because most articles about AI in customer support are either press releases full of vague promises or vendor comparisons that do not touch the actual engineering. This is the real story — what worked, what did not, and what we would do differently if we started over today.
Why We Did Not Use an Off-the-Shelf Solution
The client evaluated several existing AI support platforms before coming to us — Intercom Fin, Zendesk AI agents, and Ada. They all work reasonably well for generic support scenarios. The problem was that this client's product is a specialized B2B platform for managing clinical trial data, and the support tickets are highly technical and domain-specific. Off-the-shelf AI support tools were resolving only about 15 percent of tickets accurately because they did not understand the product-specific terminology, workflows, and edge cases.
That 15 percent resolution rate was actually worse than having no AI at all, because the inaccurate answers among the remaining 85 percent created real frustration. Customers were getting confidently incorrect responses about data validation rules, compliance requirements, and API integration steps. The client pulled the plug on the off-the-shelf solution after six weeks.
The Architecture
The system has three layers. The first is an intent classification model that analyzes incoming tickets and categorizes them into one of forty-two predefined categories. This model was trained on eighteen months of historical ticket data — approximately 54,000 labeled examples.
We debated between building a custom classification model and using a large language model for classification. We went with a fine-tuned model for several reasons: it is faster (classification takes under 50 milliseconds, versus 1-2 seconds for an LLM API call), it is cheaper at our volume of 3,000 tickets per month, and it is more predictable, since the fine-tuned model outputs consistent category labels while LLMs occasionally hallucinate new categories.
The second layer is a retrieval-augmented generation system that pulls relevant documentation, previous ticket resolutions, and knowledge base articles to generate a contextual response. We use vector embeddings stored in Supabase with pgvector for semantic search.
The RAG implementation deserves more detail because this is where the magic happens. We embedded the client's entire knowledge base — 450 help articles, 12,000 resolved tickets with agent notes, API documentation, and internal runbooks — into vector space. When a new ticket comes in, we generate an embedding for the ticket text and perform a similarity search to find the most relevant documents.
But simple similarity search was not enough. We found that returning the five most similar documents often included redundant information or missed critical context. So we implemented a re-ranking step using a cross-encoder model that evaluates the relevance of each retrieved document against the specific ticket. The cross-encoder runs on only the top twenty candidates so the total latency is acceptable — about 200 milliseconds for the full retrieval and re-ranking pipeline.
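The two-stage shape of that pipeline can be sketched in a few lines. This is an illustrative stand-in, not our production code: the document names and the `cross_score` callable are hypothetical, the first stage runs against an in-memory list rather than pgvector, and a real cross-encoder model replaces the stub scorer.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve_and_rerank(ticket_emb, ticket_text, index, cross_score,
                        k_candidates=20, k_final=5):
    """Two-stage retrieval: vector similarity, then cross-encoder re-ranking.

    index: list of {"embedding": [...], "text": ...} documents.
    cross_score: callable scoring (ticket_text, doc) relevance; in production
    this would be a cross-encoder model, stubbed here for illustration.
    """
    # Stage 1: nearest-neighbour search (pgvector handles this in production)
    candidates = sorted(index,
                        key=lambda d: cosine(ticket_emb, d["embedding"]),
                        reverse=True)[:k_candidates]
    # Stage 2: re-rank only the top candidates to keep latency bounded
    candidates.sort(key=lambda d: cross_score(ticket_text, d), reverse=True)
    return candidates[:k_final]
```

The key design point is that the expensive scorer only ever sees the top candidates from the cheap similarity search, which is why the full pipeline stays around 200 milliseconds.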
The third layer is a confidence scoring system. Every AI-generated response receives a confidence score. Tickets with scores above 0.85 are auto-responded. Tickets between 0.6 and 0.85 are sent to agents with a suggested response. Tickets below 0.6 go directly to human agents.
The confidence score is not just the model's internal probability estimate — those tend to be poorly calibrated. We built a separate confidence model that considers multiple signals: the intent classification probability, the semantic similarity between the ticket and the retrieved documents, a coverage score indicating how much of the ticket content is addressed by the generated response, and a novelty score that flags tickets significantly different from the training distribution.
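The routing logic described above reduces to a small amount of code. The weights below are hand-picked for illustration only; as noted, the production system trains a separate calibration model on these signals rather than blending them by hand.

```python
def confidence_score(intent_prob, similarity, coverage, novelty):
    """Blend routing signals into one score (illustrative weights only)."""
    base = 0.4 * intent_prob + 0.3 * similarity + 0.3 * coverage
    # Novelty penalizes tickets far from the training distribution
    return max(0.0, min(1.0, base - 0.2 * novelty))

def route_ticket(confidence, auto=0.85, suggest=0.60):
    """Three-way routing on the calibrated confidence score."""
    if confidence >= auto:
        return "auto_respond"
    if confidence >= suggest:
        return "suggest_to_agent"
    return "escalate_to_human"
```

Keeping the thresholds as parameters matters in practice: as described later, we tuned them for weeks, and hard-coding them would have made that iteration painful.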
The Data Pipeline
The most underappreciated part of this system is the data pipeline that keeps it improving over time. Every interaction generates training signal.
When a ticket is auto-resolved and the customer marks it as helpful, that is a positive training example. When a customer marks an auto-response as unhelpful or follows up with additional questions, that is a negative example. When a human agent receives an AI suggestion and uses it without modification, that is a strong positive signal. When an agent modifies the suggestion before sending, the modification is captured as a correction.
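The mapping from interaction outcomes to training examples can be sketched as a single function. The event shapes and field names here are hypothetical; the real pipeline reads these events from Supabase change streams.

```python
def feedback_label(event):
    """Turn one support interaction outcome into a training example (sketch).

    Event fields are illustrative, not the production schema.
    """
    kind = event["type"]
    if kind == "auto_resolved":
        # The customer's rating decides the sign of the example
        if event.get("customer_rating") == "helpful":
            return {"label": "positive", "weight": 1.0}
        return {"label": "negative", "weight": 1.0}
    if kind == "agent_reviewed":
        if not event.get("modified"):
            # Agent sent the AI suggestion verbatim: strong positive signal
            return {"label": "positive", "weight": 2.0}
        # Agent edits are captured as corrections for later fine-tuning
        return {"label": "correction", "weight": 1.0,
                "corrected_text": event["final_text"]}
    return None  # e.g. tickets the AI never touched
```

The weights are an assumption on our part, expressing the idea from the text that an unmodified agent send is a stronger signal than a customer thumbs-up.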
We built this feedback pipeline using Supabase real-time subscriptions and a background processing queue. Every week, a batch process retrains the confidence model and updates the RAG index with new resolved tickets. Every month, we evaluate whether the intent classification model needs retraining based on accuracy metrics.
The feedback loop is why the system keeps getting better. The resolution rate improved from 41 percent in month one to 73 percent in month six, not because we made dramatic architectural changes, but because the system learned from thousands of interactions.
Handling Multi-Turn Conversations
Support tickets are rarely one-shot interactions. A customer might submit a ticket, receive an AI response, and then follow up with clarifying questions. We maintain a conversation context window that includes the full ticket history, all previous responses, and any actions taken such as account lookups or configuration changes.
For conversations that require account-specific actions — looking up configuration, checking subscription status, or reviewing API usage — we built tool-use functions that the AI can invoke. The AI generates a structured function call, the system executes it against the client's database, and the results are incorporated into the response. This allows personalized answers rather than generic instructions.
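A minimal sketch of that tool-use dispatch might look like the following. The tool name, registry, and stubbed database lookup are all hypothetical; the real functions run against the client's database, and the structured call is what the model's tool-use output gets normalized into.

```python
TOOLS = {}

def tool(name):
    """Register a function the model is allowed to call."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("check_subscription")
def check_subscription(account_id):
    # Production queries the client's database; stubbed for illustration
    return {"account_id": account_id, "plan": "enterprise", "status": "active"}

def execute_tool_call(call):
    """Run a structured function call emitted by the model.

    call: {"name": ..., "arguments": {...}}
    """
    fn = TOOLS.get(call["name"])
    if fn is None:
        raise ValueError(f"unknown tool: {call['name']}")
    return fn(**call["arguments"])
```

Rejecting unknown tool names is the important safety property here: the model can only ever invoke functions that were explicitly registered.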
The Results
After the first month: 41 percent auto-resolution rate. After three months: 62 percent. After six months: 73 percent. The improvement came from continuous fine-tuning based on agent feedback — every time a human agent modified an AI suggestion, that correction was fed back into the training pipeline.
Average response time dropped from 4.2 hours to 8 minutes for auto-resolved tickets. Customer satisfaction scores actually improved by twelve percent, because speed matters more than perfect prose in support interactions.
Some additional metrics that tell the story. The support team went from twelve agents to eight through natural attrition. The remaining agents handle more complex, higher-value tickets and report higher job satisfaction. The cost per ticket dropped from $12.40 to $4.80. And the client estimates the system paid for its development cost within four months.
What We Got Wrong
No project is perfect. First, we over-engineered the initial intent classification system. We could have started with an LLM-based classifier and only moved to a custom model once we validated the categories. Second, we underestimated the importance of the customer-facing UI. Our first version had a chatbot-style interface that customers found impersonal. We redesigned it to look like a traditional email response with a subtle AI-assisted badge, and satisfaction scores improved significantly. Third, we should have implemented multi-language support earlier — when we added Spanish support, it immediately covered an additional 8 percent of tickets.
Key Lessons
First, do not try to build a general-purpose AI. Build a specialist. Our model only handles support tickets for one specific product. It knows nothing about the world — and that is why it is accurate.
Second, the confidence threshold matters enormously. Setting it too low creates bad customer experiences; setting it too high means the AI barely resolves anything. We spent three weeks calibrating the threshold using A/B testing. We started at 0.9, which gave us a 28 percent resolution rate with 97 percent accuracy, then gradually lowered it to 0.85, which gave a better resolution rate at 94 percent accuracy.
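The trade-off in that calibration can be measured offline before touching live traffic. A sketch, assuming you have a validation set of (confidence, was-the-answer-correct) pairs scored against known-good resolutions:

```python
def evaluate_threshold(validation, threshold):
    """Resolution rate and accuracy if we auto-respond above `threshold`.

    validation: list of (confidence, was_correct) pairs scored offline.
    """
    auto = [correct for conf, correct in validation if conf >= threshold]
    resolution_rate = len(auto) / len(validation)
    accuracy = sum(auto) / len(auto) if auto else 0.0
    return resolution_rate, accuracy
```

Sweeping this over candidate thresholds produces the resolution-versus-accuracy curve, and the business picks the point on it that it can live with.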
Third, always have a human escape hatch. Every AI response includes a clear option to connect with a human agent. This is not just good UX — it is essential for the edge cases where AI confidently gives wrong answers. About 4 percent of auto-resolved tickets are escalated to humans after the initial AI response.
Fourth, invest in the feedback loop from day one. The difference between a static AI system and one that improves over time is the quality of your feedback pipeline. Every customer interaction is a training opportunity. Every agent correction is a lesson.
The Technology Stack
For teams considering building something similar, here is what we used. Supabase with pgvector for the vector database and application backend. Next.js for the agent dashboard. Claude API for response generation in the RAG pipeline. A fine-tuned classification model hosted on an edge function. Resend for email-format response delivery. And a custom evaluation pipeline with Python for weekly model assessment. Total infrastructure cost in production: approximately $340 per month for 3,000 tickets.
Is This Right for Your Company?
AI customer support makes sense when you have at least six months of historical ticket data to train on, your tickets follow repeatable patterns, your product is complex enough that generic AI tools cannot handle the domain specificity, and your ticket volume is high enough that the cost savings justify the investment. If you process fewer than 500 tickets per month, use an off-the-shelf solution. If you process more than 1,000 with a specialized product, a custom system can deliver significant ROI within six months.
We are happy to evaluate whether AI support makes sense for your specific situation. The analysis takes about two days and involves reviewing your ticket data, categorizing resolution patterns, and estimating the potential automation rate before any engineering work begins.