CloudAnts

Adding AI to Your Product: What Actually Works

A practical guide to adding AI to your product: where it earns its keep, where it is theatre, the non-negotiables for production, and how to start small.

Arvin Chu · Jun 14, 2026 · 10 min read

Half the "add AI to your product" requests we get are good ideas. The other half are a board slide. Someone heard a competitor shipped a copilot, and now there is pressure to bolt a chat bubble onto a working app so the next investor deck has the word in it. We do this work for a living, and we will still talk a client out of it when the use case is theatre. A feature that exists to be mentioned, not used, is a maintenance cost with marketing upside that fades in a quarter.

This post is the version of the conversation we have across a table with a founder or PM who wants AI in their product but has been burned by hype. It covers where AI genuinely earns its keep today, where it does not, the things that separate a production feature from a demo, and how to start small enough that you find out cheaply.

Where AI actually earns its keep today

The pattern that works is narrow and unglamorous: AI is good at turning messy human input into structured action or structured answers, inside a bounded domain you control. Five categories deliver real value right now.

Assistants that take actions against live data. A support or receptionist bot that does not just answer but completes the job: books the appointment, files the ticket, updates the record. The action is the value. A bot that only talks is a brochure with a typing indicator.

RAG question-answering over your own documents, with citations. Point a model at your policies, your manuals, your knowledge base, and let staff or customers ask in plain language. The non-negotiable is the citation: every answer links to the source paragraph so a human can verify it. Answers without sources are confident guesses, and confident guesses over your own docs are worse than no feature.

Classification and extraction. Routing tickets by topic, tagging messages by sentiment, pulling fields out of invoices and emails and PDFs into structured rows. This is some of the highest-ROI, lowest-drama AI work there is. It runs in the background, it is easy to measure, and nobody has to be charmed by a chat UI.

Drafting and summarizing. First-draft replies, meeting summaries, release notes, long-thread digests. The model does the boring 80%; a person edits the last 20%. The win is real because the human stays in the loop by design.

Semantic search and recommendations. Search that understands "something for back pain" without the exact keyword, or "more like this" that actually feels like this. Embeddings plus a vector store turn fuzzy intent into relevant results.
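To make "embeddings plus a vector store" concrete, here is a minimal sketch of the ranking step on its own. The three-dimensional vectors, the catalog items, and the function names are all made up for illustration; in a real build the vectors would come from an embedding model and the nearest-neighbour lookup would run inside a vector store rather than a Python loop.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, catalog, k=2):
    """Return the k catalog item names most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in catalog.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

# Hypothetical toy embeddings (assumption: real ones have hundreds of dimensions).
CATALOG = {
    "lumbar support cushion": [0.9, 0.1, 0.0],
    "ergonomic office chair": [0.8, 0.3, 0.1],
    "espresso machine": [0.0, 0.1, 0.9],
}

# Pretend embedding of the query "something for back pain".
QUERY = [1.0, 0.2, 0.0]

results = top_k(QUERY, CATALOG, k=2)
print(results)
```

The query shares no keyword with either of the top results; the match comes entirely from the vectors pointing in a similar direction.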

What unites all five: a clear success criterion, a human nearby, and a bounded domain. That is the whole shortlist.

Where it is theatre, and where it is dangerous

The bad fits split into two groups, and they fail differently.

The first group is theatre: AI slapped on for marketing. A chat widget on a five-page brochure site. A "smart" feature that wraps a model around a task a dropdown already did better. These do not hurt anyone; they just cost you build time and maintenance to ship something users ignore. The tell is that nobody can describe the job the feature does, only the category it belongs to.

The second group is dangerous: anything that needs a guaranteed-correct answer with no human in the loop. Math that must reconcile. Legal advice. Medical diagnosis. Tax numbers. A final dollar figure on an invoice. Language models are fluent, not correct, and the gap between those two is exactly where you get sued. We do not let a model produce the number that has to be right. We let it draft, extract, and route, and we let deterministic code and a person own anything that must be exact.

The non-negotiables that separate production from a demo

A demo is easy. You can wire a model to a chat box in an afternoon and it will dazzle in a controlled walkthrough. Production is a different animal, because production means it runs unattended, with real users, real data, and a real bill. Five things make the difference, and a feature missing any of them is a demo wearing a deadline.

It takes a real action or cites a real source. Not just chat. Either it does something in your system, or it answers with a verifiable reference. Chat-for-the-sake-of-chat plateaus at "slightly faster FAQ" no matter how good the model is.

Guardrails: the model never touches your database directly. This is the one we are most stubborn about. The model proposes; deterministic code disposes. It requests an action, and your own validated code checks that request against live data and business rules before anything is written. The model is a suggestion engine, not a privileged user. Wire it straight to your DB and you have handed write access to a system that hallucinates.

Human handoff. When the model is unsure, or the topic is sensitive, or a user asks, it must escalate to a person cleanly and stay out of the way once it does. Users forgive "let me get someone for you." They do not forgive fighting a bot to reach a human.

Evals and measurement. You cannot ship what you cannot measure. Before launch you need a set of real cases with known-good answers and a way to score the model against them, so you can tell whether a prompt change made things better or just different. "It felt smarter" is not a release gate.

Cost controls. Every model call is logged with its token usage and cost, and you set a spending ceiling that alerts before you hit it. An AI feature without spend monitoring is a bar tab someone else is running up in your name. We log per-call spend and cap it, with alerts at 80% and 100% of budget, on every AI build.
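A rough sketch of what per-call logging with a cap can look like. The class name, field names, and dollar figures are illustrative assumptions, not a description of any particular billing API; the cost of each call is passed in rather than computed, since pricing differs by provider and model.

```python
from dataclasses import dataclass, field

@dataclass
class SpendTracker:
    """Hypothetical in-memory spend log with alerts at 80% and 100% of budget."""
    budget_usd: float
    spent_usd: float = 0.0
    calls: list = field(default_factory=list)
    alerts: list = field(default_factory=list)

    def record(self, feature, input_tokens, output_tokens, cost_usd):
        """Log one model call and raise an alert when a threshold is crossed."""
        self.calls.append({
            "feature": feature,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "cost_usd": cost_usd,
        })
        before = self.spent_usd
        self.spent_usd += cost_usd
        for percent in (80, 100):
            limit = self.budget_usd * percent / 100
            # Fire once, at the call that carries spend across the threshold.
            if before < limit <= self.spent_usd:
                self.alerts.append(f"{percent}% of budget reached")

    def allow_call(self):
        """Deterministic gate: refuse further calls once the budget is spent."""
        return self.spent_usd < self.budget_usd

tracker = SpendTracker(budget_usd=10.0)
tracker.record("receptionist", 1200, 300, 5.0)
tracker.record("receptionist", 900, 250, 3.5)
tracker.record("receptionist", 700, 200, 2.0)
print(tracker.alerts, tracker.allow_call())
```

In production the log would go to durable storage and the alerts to a pager or chat channel, but the shape is the same: every call is recorded, and the decision to keep calling is made by code, not by the model.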

Build vs. buy: usually start by wrapping a strong model

There are three real options for the engine, and the right starting point is almost always the simplest one.

Wrap a strong general model behind an API. Claude or OpenAI, called through your own thin service layer. This is where to begin for the overwhelming majority of products. The frontier models are good enough that prompt design, your data, and your guardrails matter far more than which model you picked. Keep the provider switchable behind one interface; it keeps you honest on cost and quality and gives you a fallback when one provider has a bad week.

Add a framework when orchestration gets real. Tools like LangChain, a vector store such as pgvector, and an eval harness earn their place once you have multi-step retrieval, several tools the model can call, or knowledge that has to stay current. Reach for them when wrapping a model alone starts to creak, not on day one because a tutorial used them.

Fine-tune last, and rarely. Fine-tuning is for narrow, high-volume, stable tasks where prompting genuinely is not enough, and it is a real commitment: you own a dataset, a training loop, and a model that drifts from the base over time. We have shipped a lot of production AI without it. If a vendor leads with fine-tuning before they have tried a good prompt and good retrieval, be suspicious.
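The "one interface, switchable provider" advice from the first option can be sketched as follows. Everything here is a stand-in: `FakeProvider` takes the place of a thin wrapper around a real vendor SDK, and no actual API is called. The point is only that the rest of the product talks to `complete_with_fallback` and never to a vendor directly.

```python
from typing import Protocol

class ChatProvider(Protocol):
    """The only surface the rest of the product is allowed to depend on."""
    name: str

    def complete(self, prompt: str) -> str: ...

class FakeProvider:
    """Hypothetical stand-in for a wrapper around a real provider SDK."""

    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail

    def complete(self, prompt):
        if self.fail:
            raise RuntimeError(f"{self.name} is unavailable")
        return f"[{self.name}] reply to: {prompt}"

def complete_with_fallback(providers, prompt):
    """Try providers in order; return (provider_name, reply) from the first that works."""
    errors = []
    for provider in providers:
        try:
            return provider.name, provider.complete(prompt)
        except Exception as exc:
            errors.append(f"{provider.name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))

primary = FakeProvider("primary", fail=True)   # simulating a bad week
secondary = FakeProvider("secondary")
used, reply = complete_with_fallback([primary, secondary], "What are your hours?")
print(used, reply)
```

Swapping the order of the list, or replacing one wrapper with another, changes the provider without touching any calling code, which is what makes cost and quality comparisons cheap to run.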

A worked example: our Denti receptionist

The clearest proof we have is one we built and run ourselves. Denti is our own multi-tenant dental SaaS, and it ships with an AI receptionist that lives in Facebook Messenger, books real appointments against live schedules, and runs in production today. It is a tidy illustration of every non-negotiable above.

Conversation memory lives in Redis with a 24-hour window per person, so a reply makes sense in context without yesterday's chat bleeding into today's. A language model (Claude or OpenAI, behind one switchable interface) reads the conversation and decides one of three things: answer from clinic data, take an action, or hand off to a human. Actions run through an executor with guardrails, and the model never touches the database. It requests an action, such as "check this slot" or "book that appointment", and deterministic code validates that request against live data, scoped to the one clinic, before anything is written.
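The executor pattern is easier to see in code than in prose. This is a simplified sketch, not Denti's actual implementation: the in-memory `SCHEDULE` dictionary stands in for live clinic data, and the action names, slot format, and `Result` type are assumptions made for the example. What matters is that the model's output is treated as an untrusted request, and the clinic ID comes from the session, never from the model.

```python
from dataclasses import dataclass

# Hypothetical stand-in for live schedule data, keyed by clinic.
SCHEDULE = {
    "clinic-a": {"2026-07-01T09:00": None, "2026-07-01T10:00": "patient-7"},
    "clinic-b": {"2026-07-01T09:00": None},
}

# Allowlist: the model can only ask for actions named here.
ALLOWED_ACTIONS = {"check_slot", "book_appointment"}

@dataclass
class Result:
    ok: bool
    message: str

def execute(clinic_id, patient_id, proposal):
    """Validate a model-proposed action against live data, then apply it."""
    action = proposal.get("action")
    slot = proposal.get("slot")
    if action not in ALLOWED_ACTIONS:
        return Result(False, "unknown action")
    # Scope every lookup to the clinic taken from the session.
    slots = SCHEDULE.get(clinic_id)
    if slots is None or slot not in slots:
        return Result(False, "no such slot")
    if action == "check_slot":
        return Result(True, "free" if slots[slot] is None else "taken")
    # action == "book_appointment": re-check availability at write time.
    if slots[slot] is not None:
        return Result(False, "slot already taken")
    slots[slot] = patient_id
    return Result(True, "booked")

print(execute("clinic-a", "patient-9", {"action": "book_appointment", "slot": "2026-07-01T09:00"}))
print(execute("clinic-a", "patient-9", {"action": "drop_tables", "slot": "2026-07-01T09:00"}))
```

A hallucinated action name or a slot that does not exist fails validation and nothing is written. The same request against `clinic-b` would touch only that clinic's schedule.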

The safety rails are the part casual demos skip. Emergency-phrase detection runs before the model and routes straight to a person; we do not let a model triage bleeding. Staff can take over any conversation instantly. When staff edit a bot reply, that correction is stored and re-injected into future conversations as a few-shot example, so the bot picks up clinic-specific answers without anyone retraining a model. And every call is logged and capped. You can read the full field report in our writeup on building an AI receptionist that takes real actions, and the engineering underneath it in how we built a multi-tenant dental SaaS. There is a live version of the Messenger booking running on our homepage if you would rather watch one work than read about it.
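Two of those rails, the emergency check that runs before the model and the staff corrections re-injected as few-shot examples, fit in a short sketch. The phrase list, prompt wording, and function names below are illustrative assumptions rather than the production code, and plain substring matching is the simplest possible detector; a real list would be reviewed by clinic staff.

```python
# Hypothetical phrase list for illustration only.
EMERGENCY_PHRASES = ("bleeding", "can't breathe", "knocked out", "severe pain")

def is_emergency(message):
    """Cheap deterministic check that runs before any model call."""
    lowered = message.lower()
    return any(phrase in lowered for phrase in EMERGENCY_PHRASES)

def build_prompt(instructions, corrections, message, max_examples=3):
    """Prepend the most recent staff corrections as few-shot examples."""
    lines = [instructions]
    for question, staff_answer in corrections[-max_examples:]:
        lines.append(f"Example question: {question}")
        lines.append(f"Example answer: {staff_answer}")
    lines.append(f"Patient: {message}")
    return "\n".join(lines)

def route(message, corrections, call_model):
    """Return (handler, reply). Emergencies never reach the model."""
    if is_emergency(message):
        return "human", "Connecting you with our staff right away."
    prompt = build_prompt("You are a dental clinic receptionist.", corrections, message)
    return "bot", call_model(prompt)

# Stored (question, corrected answer) pairs from earlier staff edits.
corrections = [("Do you take walk-ins?", "Yes, before 3pm on weekdays.")]

def echo_model(prompt):
    """Stand-in for a real model call; returns the prompt so it can be inspected."""
    return prompt

print(route("My gum is BLEEDING a lot", corrections, echo_model))
print(route("Can I walk in today?", corrections, echo_model)[1])
```

Because the corrections ride along in the prompt, a staff edit changes the bot's behaviour on the next conversation with no retraining step.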

The point for your product is not the dentistry. It is the shape: memory, a model that decides, an executor that does the work behind validation, a clean path to a human, and a meter on the bill. Change the domain and the shape holds.

Honest limits, and how to start small

Three limits worth saying out loud. Language models are fluent before they are correct, so anything that must be exact stays behind deterministic code and a person. They drift as providers ship new versions, which is why evals are not optional; your golden cases are how you notice a quiet regression. And they cost money per call, so a feature that is cheap at demo scale can surprise you at real traffic if nobody set a cap.
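A golden-case check does not need a framework to get started. The sketch below scores a routing decision (answer, action, or handoff) against three hand-written cases; the cases, the 0.9 threshold, and `keyword_decider`, which stands in for a real model call, are all assumptions for illustration.

```python
# Hypothetical golden cases: real ones come from actual conversations.
GOLDEN_CASES = [
    {"input": "I want to book a cleaning on Friday", "expected": "action"},
    {"input": "What are your opening hours?", "expected": "answer"},
    {"input": "I need to speak to a person", "expected": "handoff"},
]

def run_evals(decide, cases):
    """Score a decision function against known-good answers."""
    failures = [c for c in cases if decide(c["input"]) != c["expected"]]
    score = (len(cases) - len(failures)) / len(cases)
    return score, failures

def passes_gate(score, threshold=0.9):
    """Release gate: a prompt or model change ships only at or above threshold."""
    return score >= threshold

def keyword_decider(text):
    """Stand-in for the model: a crude rule so the example runs offline."""
    lowered = text.lower()
    if "person" in lowered or "human" in lowered:
        return "handoff"
    if "book" in lowered:
        return "action"
    return "answer"

score, failures = run_evals(keyword_decider, GOLDEN_CASES)
print(score, passes_gate(score), failures)
```

Run the same cases after every prompt edit and every provider version bump. A drop in the score is the quiet regression showing up before a user finds it.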

Given all that, start small and measure. Pick one narrow job with a clear success criterion: the most repetitive, most bounded task you have. Build the thin version: wrap a strong model, put your guardrails and human handoff around it, write a dozen eval cases, and turn on cost logging. Then watch three numbers. Resolution rate: the share of cases the feature completes without a human. Task completion: whether it actually did the thing or just talked about it. And cost per resolved task, which should be a small fraction of what doing it by hand costs. Baseline those before you launch, or you will never know what the feature changed.
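Those three numbers fall out of a very plain log. The sketch below assumes each case records whether the task was completed, whether a human had to step in, and what the model calls cost; the field names and figures are invented for the example, and the exact definitions are one reasonable reading of the metrics rather than a standard.

```python
def summarize(cases):
    """Compute resolution rate, task completion, and cost per resolved task."""
    total = len(cases)
    completed = sum(1 for c in cases if c["completed"])
    # Resolved: the task was completed and no human had to step in.
    resolved = sum(1 for c in cases if c["completed"] and not c["handed_off"])
    total_cost = sum(c["cost_usd"] for c in cases)
    return {
        "resolution_rate": resolved / total,
        "task_completion": completed / total,
        "cost_per_resolved_task": total_cost / resolved if resolved else None,
    }

# Hypothetical log of four conversations.
CASES = [
    {"completed": True, "handed_off": False, "cost_usd": 0.25},
    {"completed": True, "handed_off": False, "cost_usd": 0.25},
    {"completed": True, "handed_off": True, "cost_usd": 0.25},
    {"completed": False, "handed_off": True, "cost_usd": 0.25},
]

print(summarize(CASES))
```

Note that the cost of unresolved cases still counts toward cost per resolved task: conversations that end in a handoff are not free, and hiding them flatters the number.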

We scope AI features as fixed-scope projects, typically 4 to 10 weeks for a conversational assistant or a RAG system, depending on how many real actions it has to take and how much of your data it has to reach safely. That includes the unglamorous half nobody demos: the guardrails, the evals, the cost caps, and the handoff. The full breakdown of how we scope and price this work is on our services page.

If you want a frank read on whether AI fits your product, including when the honest answer is "not yet," start a conversation with us. We would rather tell you that across a table now than build you a feature you turn off in a quarter.
