Why Traditional Test Scripts Become Useless When Building AI Agents

Building traditional software is usually straightforward. If the input is 1, the machine spits out 1. The QA team can sleep well if their test suite shines a bright green "Pass". But if you are a CTO, Tech Lead, or Founder diving into LLMs to build AI Agents, welcome to the land of uncertainty.

All the trouble started with a very real problem our Cyberk team recently ran into: the "Skin Agent" project.

The "Black Box Testing" Gamble

Our client handed us a dermatology consulting system holding 160,000 products and thousands of conflicting chemical ingredients. The life-and-death question hanging over our team was: How do we guarantee 100% that this AI won't casually prescribe a Retinol cream to a pregnant woman?

I remember the early days when we first jumped in. The whole QA team sat there manually typing hundreds of test cases for half a day. We role-played all kinds of users, throwing tricky questions to see if the AI would fail. By late afternoon, the test board was perfectly green, and every piece of advice was accurate. Test passed. Everyone breathed a sigh of relief and went home.

But the next morning, the Dev team quietly switched to a cheaper Model to optimize production costs, tweaking a few lines of the System Prompt right before deployment. And then... boom. Yesterday's test scripts blew up completely. The report was bleeding red. The obedient AI from yesterday did a complete 180. It started recommending some unknown brand of cream and hallucinating links to non-existent articles. Worse, when I played the role of a user deliberately trying to "twist" a few details, it happily leaked our internal workflows without a second thought.

The whole team scrambled for another agonizing day trying to patch things up. But patching one hole just opened another. Especially when stitching multiple Agents together (Multi-Agent). Agent A spits out garbage data, and Agent B happily swallows that garbage as its next input. A chaotic domino effect erupted, and the QA team just looked at each other, not knowing which AI to grab by the collar.

At that moment, the feeling of powerlessness hit me hard: running after an AI to force it to memorize rules is completely useless. Trying to force the rigid binary right/wrong yardstick onto a linguistic entity that thinks fluidly means we are the ones who will lose.

Escaping the Matrix: Building the "Metric System"

Cyberk realized that we cannot force a language machine to memorize deep medical knowledge and then let it operate freely. The future of the system relied on building an absolute boundary. It was time to strip the AI black box of its right to "free reasoning."

Instead of testing the AI's output, we flipped the board: We built a core reasoning system (Metric System).

This is an independent, cold backend that operates on mathematical logic and strict medical data. In this system, hard rules are nailed down (Example: Ingredient X + Pregnancy = Veto). The clever AI Agent is now demoted; it is absolutely forbidden from making medical conclusions on its own. Its sole duty is to act as a language wrapper (UX Layer)—taking the hard data spat out by the Metric System and using the LLM's smooth phrasing to translate it into friendly, easy-to-understand advice for the User.
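To make the idea concrete, here is a minimal sketch of what such a hard-coded veto rule could look like. Everything in it is hypothetical — the `check_safety` function, the `VETO_RULES` table, and the ingredient names are illustrations, not Cyberk's actual implementation:

```python
from dataclasses import dataclass

# Hypothetical hard rules: (ingredient, condition) pairs that always veto.
VETO_RULES = {
    ("retinol", "pregnancy"),
    ("hydroquinone", "pregnancy"),
}

@dataclass
class Verdict:
    allowed: bool
    reason: str

def check_safety(ingredients: list[str], conditions: list[str]) -> Verdict:
    """Deterministic core: the LLM never gets to override this result."""
    for ing in ingredients:
        for cond in conditions:
            if (ing.lower(), cond.lower()) in VETO_RULES:
                return Verdict(False, f"{ing} is vetoed for {cond}")
    return Verdict(True, "no conflicting rules found")

# The LLM only rephrases the verdict downstream; it cannot flip `allowed`.
verdict = check_safety(["Retinol", "Glycerin"], ["Pregnancy"])
print(verdict.reason)  # Retinol is vetoed for Pregnancy
```

The point of the design is that the boolean comes out of pure logic; the language model only decides how to phrase it, never what it is.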

At that exact moment, the testing problem elegantly solved itself. Instead of endlessly checking whether the AI's chat aligns with medical facts, the QA team pivoted, focusing completely on writing Automation Tests to hammer the core Metric system continuously. As long as the Metric structure nails down the absolute logical result for all 160,000 cosmetic products, the surface AI will never have a chance to prescribe dangerous nonsense.
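Because the metric core is deterministic, it can be pounded with ordinary unit tests — no prompts, no sampling, no flaky LLM output to grade. A self-contained sketch of that testing style, with a toy rule table standing in for the real catalog (all names hypothetical):

```python
# Toy deterministic rule check plus the kind of exhaustive assertions
# QA can run continuously against it. Names are hypothetical.
VETO_PAIRS = {("retinol", "pregnancy"), ("hydroquinone", "pregnancy")}

def is_safe(ingredient: str, condition: str) -> bool:
    return (ingredient.lower(), condition.lower()) not in VETO_PAIRS

def test_pregnancy_vetoes():
    # Same input, same output, every run — a test suite that stays green
    # even if the model or prompt above it changes.
    assert not is_safe("Retinol", "Pregnancy")
    assert not is_safe("Hydroquinone", "pregnancy")
    assert is_safe("Glycerin", "Pregnancy")

test_pregnancy_vetoes()
print("all metric-core checks passed")
```

In a real project these assertions would be generated across the full product catalog, which is exactly what makes "all 160,000 products" a testable claim rather than a hope.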

[Figure: Testing the AI System and UX Layer]

Agent-in-the-loop: Fighting Fire with Fire

The core logic was secured by the Metric System. But what about the risks at the communication layer? How do we stop the Agent from rambling into off-topic territory while chatting with the customer? The answer lies in a trick: Using AI itself as the supervisor.

Instead of burning human effort reviewing boring log snippets to see if the AI is "hallucinating" or going off-track, we pinned essential criteria onto an observability platform (like Langfuse). At the same time, we stood up an independent Evaluator Agent.

For every action the Skin Agent (Product Agent) takes to serve a customer, the "judge" Evaluator Agent quietly runs an evaluation behind the scenes. It scans the Langfuse logs to catch hallucination bugs, scores language safety, and cross-checks its own AI colleagues. This internal pincer movement helped the team cut 80% of the manual review burden. We were ready to scale up to massive requests while the QA team could finally relax with a cup of coffee.
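The shape of that Evaluator Agent can be sketched as an LLM-as-judge loop. The judge call is stubbed here, and the prompt, thresholds, and function names are all assumptions — in production the verdict would come from a real model and the score would be logged back to an observability platform such as Langfuse:

```python
import json

# Hypothetical judge prompt; in practice this is sent to an LLM.
JUDGE_PROMPT = """You are a strict reviewer. Given the user question and the
agent's answer, return JSON: {"hallucination": bool, "safety": 0-10}."""

def call_judge_llm(question: str, answer: str) -> str:
    # Stub: a real system would call an LLM with JUDGE_PROMPT and attach
    # the score to the corresponding trace in the observability platform.
    risky = "http://" in answer or "https://" in answer
    return json.dumps({"hallucination": risky, "safety": 2 if risky else 9})

def evaluate_trace(question: str, answer: str) -> dict:
    """Runs quietly behind every Product Agent response."""
    verdict = json.loads(call_judge_llm(question, answer))
    if verdict["hallucination"] or verdict["safety"] < 5:
        verdict["flag_for_human_review"] = True  # only these reach QA
    return verdict

print(evaluate_trace("Is retinol safe?", "Yes! See https://fake.example/post"))
```

The 80% reduction in manual review comes from that last branch: humans only ever see the traces the judge flags, not the full firehose of logs.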

Looking Back at the "Scars"

Through those exhausting KPI-driven days, I truly learned that the journey to build an AI-first product is never just about calling a few APIs and clapping hands. Behind every smooth virtual assistant that knows when to stop—instead of blindly catering to every silly user request—are the sweat-soaked architectural struggles of the entire engineering team.

Optimizing Model costs is an obvious economic equation every team will face, but the price you pay for a cheap AI system is sometimes the customer's trust vanishing in an instant.

Having gone through all those bruises, our team gathered and quietly agreed on one core principle: Do not try to teach an AI to become a perfect god that never speaks falsely. The most practical solution is to build it an extremely harsh, invisible cage (the Metric System), and assign a devoted guard (the Evaluator Agent) to watch over it. Only when the machine is forced to bend to these real-world limits can it truly deliver a safe experience for humans.