AI & Machine Learning

Testing in the Dark: How AI Is Breaking Traditional Software Verification

Breaking: Traditional Code Testing Fails Against AI-Generated, Non-Deterministic Software

Traditional software testing methodologies are crumbling in the face of AI-generated code and non-deterministic agents, according to a leading industry expert. Fitz Nowlan, Vice President of AI and Architecture at SmartBear, said on a recent podcast that developers can no longer rely on knowing the contents of their codebase to verify that it works.

Source: stackoverflow.blog

“We are moving away from old assumptions that you must see the source to test it,” Nowlan said. “When you don’t know what’s in your code, you need entirely new verification strategies.”

The Core Challenge: Non‑Deterministic MCP Servers

The rise of Large Language Model (LLM)-driven agents has introduced non-determinism into software behavior. Model Context Protocol (MCP) servers, which connect AI agents to external tools and data, produce outputs that vary even with identical inputs, breaking traditional test oracles.

“Non-determinism means you can’t write a simple assertion like ‘output equals expected,’” Nowlan explained. “You have to accept a range of possible valid outputs, changing the entire testing paradigm.”

Background: The Shift from Known Code to Invisible Logic

Historically, developers tested against a known, human-written codebase. With LLM-generated code and AI agents, the source becomes opaque or is generated in real-time. Companies now deploy systems without full control over their internal logic.

MCP servers act as bridges between AI agents and external tools, but their behavior is inherently probabilistic. This forces testers to focus on outcome constraints rather than step‑by‑step verification.

What This Means: Data Locality and Construction Take Center Stage

When source code is cheap to generate through AI, the value shifts to data quality and locality. Nowlan argues that constructing high-quality, diverse datasets becomes more critical than static code analysis.

“If you can generate code instantly, testing becomes about the data your system consumes and produces,” he said. “Data locality—having the right data near your test environment—is now strategic.”


Testers must build rich, representative datasets that trigger the full range of AI outputs, then use statistical or property-based tests to validate behavior. This marks a move from functional correctness to behavioral reliability.

Industry Implications: Urgent Need for New Tools and Mindsets

Engineering teams are urged to adopt test harnesses that can measure probabilistic outcomes, such as those used in reinforcement learning. Traditional pass/fail metrics need to be replaced by confidence scores and tolerance ranges.

“Organizations that fail to update their testing approach risk deploying AI systems they cannot verify,” Nowlan warned. “That’s a recipe for unpredictable failures in production.”


What Comes Next: A Call for Research and Pragmatism

Nowlan urged the testing community to share case studies and build open frameworks for non-deterministic testing. He noted that SmartBear is actively researching tools that shift focus from code inspection to data construction and behavior monitoring.

“We are at an inflection point—the future of reliable software depends on solving this,” he concluded. “The next five years will define how AI is tested at scale.”
