Why You Should Not Put Everything into an LLM
LLMs have a dangerous charm: they make demos look amazing. You send a question; it answers correctly, naturally, completely. Ten minutes and you have a prototype. Everyone thinks: "let's just put everything in here."
I thought the same. Then I went and built something real.
The gap between demo and production is where most AI products die. A demo does not have 100K users calling at the same time. A demo does not pay for every API call. A demo does not need to explain why it recommends different products every time you refresh. A demo does not get sued when it hallucinates wrong ingredients for a skincare product.
The problem is not that AI is bad. The problem is that when you turn AI into your architecture instead of your tool, you lose control — over cost, over quality, over your ability to debug. And you only realize it when it is too late to change.
I built a skincare recommendation system. 58,000 products. Four core features: personalized recommendations, finding cheaper alternatives, product lookup from photos, and natural language search. The kind of product where "just use AI" sounds like the obvious answer.
I tried. Here is what happened.
Each recommendation call costs about $0.01 to $0.05. Sounds cheap. But users do not make one call per day. They retry, change filters, compare options — five or more calls per session on average. At 100K users, that is 500K or more calls, $5K to $25K per month just for AI. That number is larger than the revenue of most startups at this stage.
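The arithmetic is worth making explicit. A quick sanity check, assuming one session per user per month (the real number is likely higher):

```python
# Back-of-envelope AI cost at scale. The per-call cost and calls-per-session
# figures come from the text; one session per user per month is an assumption.
USERS = 100_000                 # monthly active users
CALLS_PER_SESSION = 5           # retries, filter changes, comparisons
COST_PER_CALL = (0.01, 0.05)    # USD, low and high estimates

calls = USERS * CALLS_PER_SESSION
low, high = (calls * c for c in COST_PER_CALL)
print(f"{calls:,} calls -> ${low:,.0f} to ${high:,.0f} per month")
# prints "500,000 calls -> $5,000 to $25,000 per month"
```

Change any one constant and the whole cost picture shifts — which is exactly why per-request AI pricing is so easy to underestimate in a demo.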
Your startup looks profitable on paper but bleeds money on infrastructure. You die not because you have no users — you die because you have too many.
In a regular chatbot, hallucination is annoying. In skincare, it is dangerous. The LLM recommends products that do not exist in your database. Or worse: it recommends the right product but gets the ingredients wrong — says "retinol-free" for a product that contains retinol, to a pregnant woman.
Trust lost once does not come back. Users do not give you a second chance when their health is involved. And here is the thing: you need a verification layer outside the LLM anyway. So why not let that layer do the core work?
Same input, different output every time. Even temperature zero does not fully guarantee consistency. What do you write unit tests for when the output keeps changing? How does your QA engineer verify anything?
The result: your team loses confidence in deployments. Every release is a gamble. Velocity slows down because nobody dares to merge without manual checking.
An LLM call takes one to three seconds on average. For recommendations — something users expect to happen instantly — three seconds is forever. Users do not know and do not care that you are calling AI behind the scenes. They only see: slow.
The result: bounce rate goes up, engagement goes down, then you spend more money on marketing to bring users back — a downward spiral.
After hitting all four walls, I arrived at one principle: separate "understanding" from "deciding."
AI is great at understanding — reading natural language, recognizing images, parsing intent from vague questions. That is its irreplaceable strength. But AI is bad at deciding — which product to pick, how to rank results, what to filter out. These things need to be deterministic: same input, same output, fast, testable, debuggable.
When you throw both into an LLM, you force AI to do what it is not good at. When you separate them, each part gets solved by the right tool.
This is not theory. Here are four engineering patterns I used, and they apply to any AI product — not just skincare.
Instead of one LLM call that returns the final answer, you build a pipeline with clear stages: filter, score, adjust, rank. Each stage has defined inputs and outputs. Each can be tested alone. Each can be replaced without touching the others.
In the skincare system, it works like this: a Hard Filter removes unsafe products (pregnancy risks, allergies). A Scorer calculates points based on skin profile and ingredients. A Weather Adjuster shifts scores based on real-time humidity and UV. A Ranker picks the top results. The whole thing runs in under 100 milliseconds. Costs zero. Fully deterministic.
Why this is better than an LLM: when the result is wrong, you know exactly which stage failed. With an LLM, your only tool is prompt engineering and hope.
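Here is a minimal sketch of that pipeline. The four stage names match the system described above; the product fields, boost rules, and thresholds are simplified stand-ins, not the real scoring logic:

```python
from dataclasses import dataclass

@dataclass
class Product:
    name: str
    ingredients: set
    score: float = 0.0

def hard_filter(products, allergies):
    # Stage 1: remove anything unsafe for this user. No AI, no exceptions.
    return [p for p in products if not (p.ingredients & allergies)]

def score(products, skin_type):
    # Stage 2: deterministic scoring from the skin profile.
    # (Illustrative rule: count ingredients that suit this skin type.)
    boost = {"oily": {"niacinamide"}, "dry": {"ceramide"}}.get(skin_type, set())
    for p in products:
        p.score = len(p.ingredients & boost)
    return products

def adjust_for_weather(products, humidity):
    # Stage 3: shift scores based on real-time context.
    for p in products:
        if humidity > 0.7 and "hyaluronic acid" in p.ingredients:
            p.score += 1
    return products

def rank(products, top_n=10):
    # Stage 4: pick the top results.
    return sorted(products, key=lambda p: p.score, reverse=True)[:top_n]

def recommend(products, allergies, skin_type, humidity):
    # The whole pipeline: each stage testable alone, replaceable alone.
    filtered = hard_filter(products, allergies)
    return rank(adjust_for_weather(score(filtered, skin_type), humidity))
```

Each function is a plain unit-test target: feed it a fixture, assert the output, and a failing test tells you which stage broke.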
AI does not touch the decision logic. It only shows up at the edge — where user input is fuzzy and needs to become structured data.
The key: force AI to return JSON with a fixed schema. This is the contract between the fuzzy part (AI) and the deterministic part (your code). A user types "lightweight sunscreen for oily skin in summer." AI parses it into { skinType: "oily", category: "sunscreen", season: "summer" }. From there, the pipeline handles everything with plain code.
When AI makes a mistake at the boundary, you catch it immediately — schema mismatch, invalid values, easy to detect. When AI makes a mistake at the core, you only find out when a user complains.
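A minimal boundary validator, stdlib only. The allowed value sets here are hypothetical; the point is the shape of the contract — reject anything outside it before it reaches the pipeline:

```python
import json

# The contract: the only shape the deterministic side will accept.
# (Value sets are illustrative, not the full production vocabulary.)
ALLOWED = {
    "skinType": {"oily", "dry", "combination", "sensitive", None},
    "category": {"sunscreen", "cleanser", "moisturizer", "serum", None},
    "season": {"summer", "winter", "spring", "fall", None},
}

def parse_intent(llm_output: str) -> dict:
    """Validate LLM JSON at the boundary. Fail loudly, never guess."""
    data = json.loads(llm_output)          # malformed JSON -> immediate error
    intent = {}
    for key, allowed in ALLOWED.items():
        value = data.get(key)              # missing key -> None, which is allowed
        if value not in allowed:
            raise ValueError(f"schema violation: {key}={value!r}")
        intent[key] = value
    return intent
```

A valid response passes straight through to the pipeline; an invalid one raises here, in a place you log and monitor, instead of surfacing as a bad recommendation.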
Ask yourself: does this output change in real-time, or can it be computed ahead of time?
Many things that seem to need real-time AI can actually be pre-computed. Finding cheaper product alternatives? I compute similarity between 58K products every night using a batch job — TF-IDF vectors, cosine similarity, ingredient matching. Store the top 50 similar products for each item. Result: less than 5 milliseconds per query, zero dollars, and no dependency on any AI model.
Embeddings work the same way. Compute once when a new product enters the database, store the vector, use pgvector for similarity search. No AI call per request.
This pattern applies broadly: recommendation scores, content classification, fraud signals — anything that can be batched should be batched.
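A condensed version of that nightly job. In practice the similarity computation has to be chunked — a dense 58K × 58K float matrix needs tens of gigabytes — and the real job also weights price and category, but the skeleton looks like this:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_alternatives(products: dict, top_n: int = 50) -> dict:
    """Nightly batch: precompute each product's most similar alternatives
    from its ingredient text. products maps product_id -> ingredient string."""
    ids = list(products)
    vectors = TfidfVectorizer().fit_transform([products[i] for i in ids])
    sim = cosine_similarity(vectors)   # (n, n) pairwise similarity
    np.fill_diagonal(sim, -1)          # never suggest a product as its own alternative
    return {
        ids[row]: [ids[j] for j in np.argsort(sim[row])[::-1][:top_n]]
        for row in range(len(ids))
    }

# At request time, "find alternatives" is a single key lookup:
# no model call, no latency, no per-request cost.
```

Store the resulting lists in your database keyed by product ID, and the online path never touches the vectorizer at all.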
Not every AI call needs the most powerful model. Parsing search intent into structured JSON? A lightweight model at $0.10 per million tokens is enough — it only needs to output valid JSON. Reading product labels from photos? A mid-tier model with vision capability at $0.50 per million tokens. Never use a premium model for a task that a lite model can handle.
The general rule: map each AI task to the cheapest model that meets your accuracy threshold. Measure accuracy at each tier. Only upgrade when the numbers say you must.
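That rule can live in code as a routing table rather than in someone's head. The model names, prices, and thresholds below are illustrative placeholders, not a recommendation:

```python
# Route each AI task to the cheapest tier that meets its measured accuracy bar.
# All model names, prices, and thresholds here are hypothetical examples.
TIERS = {
    "parse_intent": {"model": "lite-model",   "usd_per_mtok": 0.10, "min_accuracy": 0.95},
    "read_label":   {"model": "vision-model", "usd_per_mtok": 0.50, "min_accuracy": 0.90},
}

def pick_model(task: str, measured_accuracy: dict) -> str:
    """Return the assigned model, but refuse to route silently if the
    tier's measured accuracy has dropped below its threshold."""
    tier = TIERS[task]
    if measured_accuracy.get(task, 0.0) < tier["min_accuracy"]:
        raise RuntimeError(f"{task}: accuracy below threshold, evaluate the next tier up")
    return tier["model"]
```

The gate forces the upgrade decision to be data-driven: you only move a task to a pricier model when its accuracy numbers say the cheap one no longer clears the bar.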
Now look at the four problems again. All solved — by engineering, not by hope.
Cost: Most traffic hits the deterministic pipeline. Zero dollars. AI only at the boundary, with model tiering and pre-computation. AI cost drops by more than 90%.
Hallucination: AI has no power to "make up" results. It only parses input. The pipeline decides output from verified data in the database.
Non-determinism: The pipeline is fully deterministic. Same input, same output. Unit tests pass. CI is green. Ship with confidence.
Latency: Under 100 milliseconds for recommendations. Under 5 milliseconds for finding alternatives. AI calls only happen for photo recognition and semantic search — places where users expect to wait.
| Scale | AI calls/month | AI cost/month |
|---|---|---|
| 0 users (dev) | ~0 | ~$0 |
| 1K users | ~50K | ~$2 |
| 10K users | ~500K | ~$15 |
| 100K users | ~5M | ~$100 |
$100 per month at 100K users. Compare that to the $5K–$25K you would spend if every request hit an LLM. The architecture stays the same from 10 users to 100K — no re-architecture, no panic scaling.
These days, whenever someone on my team suggests "let's use AI for this," my first reaction is: can we not? Not because I dislike AI — I clearly use it. But the default should be deterministic code. AI should be the exception you justify, not the starting point.
Most of the time, plain rules kill the AI call before it starts. And the product is better for it — faster, cheaper, testable. Users see something that feels intelligent. They have no idea that 80% of the "intelligence" is just well-engineered code doing exactly what it is told.
The smartest thing I did was not picking a better model. It was knowing where to stop using one.