Contents

Testing AI Agents Is a Problem Nobody Wants to Talk About

Testing AI Agents Is a Problem Nobody Wants to Talk About#

Everyone is building agents. Almost nobody is testing them properly.

Here’s something that’s been bugging me for a while: in classical software development, testing is non-negotiable. Nobody would ship a production application without at least some level of automated testing, code reviews, and quality gates. It’s fundamental. And yet, when it comes to AI agents, testing is often completely neglected. It’s like we collectively decided that the rules don’t apply anymore.

I think it’s time we talk about this.

Why Agent Testing Falls Through the Cracks#

The main reason is actually straightforward: the people building agents today aren’t necessarily developers. Microsoft has done an incredible job democratizing agent development — from Agent Builder to Copilot Studio, anyone can build an agent without writing a single line of code. And that’s genuinely great for innovation.

But here’s the flip side: many of these builders have never been exposed to the discipline of software testing. They don’t know what a test plan looks like. They’ve never written a test case. Not because they’re not smart — they absolutely are — but because testing was never part of their world. When a business analyst builds an agent in Copilot Studio, their definition of “done” is usually “it works when I try it.” And that’s not the same as “it’s been properly tested.”

Deterministic Testing Doesn’t Work Here#

Even if you do come from a development background, you’ll quickly realize that classical testing approaches don’t translate well to AI agents. In traditional software, you test deterministically: given input X, you expect output Y. If the output matches, the test passes. Simple.

With AI agents, that model breaks down completely. Ask the same agent the same question twice, and you might get two different answers — both of which could be perfectly correct. The underlying language model is non-deterministic by nature. So if you try to apply unit test thinking to agent testing, you’ll either go crazy or give up. Neither is helpful.

What we need is a fundamentally new testing mindset.

New Categories for a New Paradigm#

Instead of testing for exact outputs, I think we need to focus on qualitative evaluation categories that actually matter for agents:

CategoryWhat It Tests
GroundednessAre the agent’s answers based on the provided knowledge sources, or is it hallucinating?
Tool UsageDoes the agent call the right tools and APIs for a given request?
Semantic SimilarityIs the answer semantically correct, even if the wording differs from the expected response?
RelevanceDoes the agent actually answer the question that was asked?
CoherenceIs the response logically structured and consistent?
SafetyDoes the agent resist adversarial prompts and stay within its defined boundaries?

These categories shift the focus from “is the output identical?” to “is the output good?” — and that’s exactly the shift we need.

What State-of-the-Art Looks Like Today#

The good news is that tooling is catching up. Here’s what I see as the current state-of-the-art for agent testing:

Evaluation frameworks like Azure AI Evaluation SDK or DeepEval can automatically score agent responses on metrics like groundedness, relevance, and coherence. Essentially, you use one LLM to evaluate the output of another. It’s not perfect, but it scales.

Golden datasets — curated sets of question-answer pairs that serve as benchmarks. The key difference to classical test data: you don’t check for exact matches but for semantic similarity. The agent doesn’t need to produce the same words, just the same meaning.

Tool call assertions — verifying that the agent invokes the correct tools for a given input, regardless of the textual response. If someone asks “What’s my leave balance?”, the agent should call the HR API, not the finance API. This is actually quite testable.

Red teaming — deliberately trying to break the agent. Can you make it hallucinate? Can you trick it into revealing information it shouldn’t? Can you push it off-topic? This adversarial approach catches issues that happy-path testing never will.

Human-in-the-loop evaluation — running example conversations and having humans manually assess the quality. This is labor-intensive but catches nuances that automated metrics miss.

The Platform Gap#

Here’s something worth noting: not all Microsoft platforms are equal when it comes to testing support. Microsoft Foundry already offers a solid evaluation toolkit — you can run evaluations against datasets, measure quality metrics, and integrate this into your development workflow.

Copilot Studio and Agent Builder? Not so much. If you’re building agents in these platforms, you’re largely on your own when it comes to structured testing. I hope this gap closes soon, because these are exactly the platforms where citizen developers build agents — the same people who need testing guidance the most.

Make Testing Non-Optional#

If there’s one practical takeaway from this post, it’s this: don’t make agent testing a recommendation — make it a requirement.

Organizations need to establish an ALM (Application Lifecycle Management) process for agents that explicitly includes testing as a mandatory step. Not a “nice to have.” Not a “we’ll add testing later.” A hard gate that every agent must pass before it reaches production.

This means:

  • Define minimum testing criteria for every agent before it gets deployed
  • Provide testing templates — example conversations, evaluation rubrics, tool call checklists — so that citizen developers have a starting point
  • Build testing into the approval workflow — no test results, no deployment

Yes, this adds friction. But it’s the same kind of friction that prevents broken software from reaching production in every other part of your technology stack.

The Bottom Line#

Agent testing is where software testing was 20 years ago — everyone knows they should do it, but many organizations are still figuring out how. The difference is that we’re building agents at a pace that’s far faster than we ever built traditional applications, which means the gap between what’s being shipped and what’s been properly tested is growing quickly.

I believe agent testing will eventually become as natural as unit testing is today. The tools will improve, the platforms will integrate better evaluation capabilities, and organizations will develop testing muscle through experience. But we can’t just wait for that to happen. We need to start now — with the tools we have, with the frameworks that exist, and with the mindset that agents deserve the same quality standards as any other piece of software we put in front of our users.

How does your organization handle agent testing today — and do you have a process in place, or is it still the wild west?