The Testing Problem with AI Agents

The Testing Problem with AI Agents

You can't unit test an AI agent. I mean, you can — you can test individual functions, mock API responses, verify that the right prompts are constructed. But you can't test the thing that actually matters: what the agent decides to do when it encounters something you didn't anticipate.

This is the fundamental testing problem with autonomous AI, and after building several agents for client projects, I don't think the traditional testing playbook applies anymore.

Why Traditional Testing Breaks

Traditional software follows a script. Given input X, produce output Y. You test edge cases, boundary conditions, and error states. The space of possible behaviours is large but bounded.

Agentic AI doesn't follow a script. It's given a broad objective and must figure out how to meet it. The range of possible actions isn't just large — it's unbounded. Your agent might call APIs you didn't know existed, interpret instructions in ways you didn't consider, or compose multi-step plans that are internally consistent but wrong in ways that no individual step reveals.

Three specific challenges make this hard:

Underspecification. "Handle customer support emails" sounds like a clear objective until you encounter an email that's half complaint, half feature request, sent from a personal email that doesn't match any customer record. Every vague goal creates a decision surface the agent must navigate without rules.

Emergent behaviour. Small changes in input can drastically alter agent behaviour. A slightly different phrasing in a customer email might route it to a completely different workflow. This isn't a bug — it's how language models work. But it makes test coverage essentially impossible in the traditional sense.

Long-term dependencies. An agent's choice today might cause a problem weeks later. If it categorises a customer as "low priority" based on limited information, every subsequent interaction with that customer might be underserviced. By the time anyone notices, the cause is buried in a decision chain from three weeks ago.

What Works Instead

After shipping agents that handle real-world tasks — email triage, research synthesis, content generation, ticket management — I've converged on a few approaches that actually work.

Scenario-based testing over unit testing. Instead of testing individual functions, I test complete workflows with realistic scenarios. "A customer emails saying their login doesn't work, they're angry, and they cc their boss." Run the scenario. Check every output and intermediate decision. Repeat with variations.

Golden set monitoring. I maintain a set of known-good inputs with expected outputs. These run periodically against the production agent — not as gates, but as canaries. If the agent's responses to known inputs start drifting, something has changed.

Human review sampling. For agents that interact with users, I sample 5-10% of interactions for human review. Not as a quality gate for individual responses, but as a trend signal. Are the agent's decisions getting better or worse over time?

Adversarial testing. I deliberately try to break the agent before deployment. Contradictory instructions. Ambiguous inputs. Edge cases the training data definitely didn't cover. The goal isn't to make it handle everything perfectly — it's to understand where it breaks and ensure those failure modes are safe.

The Dashboard as Safety Net

The most valuable testing tool I've built isn't a test suite. It's a monitoring dashboard that shows the agent's decision patterns in real time.

Traditional analytics tell you what happened. Agent dashboards need to show you why things happened — the reasoning chain, the information retrieved, the decision points. When something goes wrong, you need to trace the full path from input to output.

This isn't an afterthought. For any autonomous system, the monitoring layer is as important as the capability layer.

Where I Land

Testing AI agents is fundamentally different from testing traditional software. The sooner teams accept that and adapt their practices, the sooner they'll ship agents that actually work in production.

The goal isn't eliminating mistakes. It's making mistakes visible, traceable, and recoverable. An agent that makes a bad decision and gets caught is fine. An agent that makes a bad decision and nobody notices for three weeks is dangerous.

Build for visibility. Test for failure modes. And accept that "comprehensive test coverage" is a concept that doesn't transfer cleanly to autonomous systems.