The testing workflow at most AI teams looks the same. Launch the agent. Watch it run. Notice it drifted somewhere around step four. Adjust the prompt. Launch again. Catch a different failure. Fix that. Run it again.

This is craftsmanship. It’s not engineering.

Manufacturing looked like this before Toyota. Workers stood at the line, catching defects when they could see them. Toyota’s insight wasn’t to inspect more carefully. It was to build the system that does the inspecting: sensors detecting defects automatically, an andon cord that stops the line, an engineer who fixes the root cause, a restart. Defect patterns accumulate. The system improves. The engineer’s job becomes designing what the line detects, not standing at it.

Most AI teams haven’t made this shift yet.

What breaks that you won’t see in a single run

Agent failure modes are structurally different from traditional bugs. A function that returns the wrong value fails the same way every time. Agent failures are statistical.

Context drift shows up in long chains. The agent loses the thread around step eight, after steps one through seven looked fine. You won’t catch it in the run you watched.

Tool cascade failures appear when one tool returns unexpected output. The agent doesn’t catch it and proceeds on bad data. Say the tool behaves normally in 80% of runs. The other 20% is where production incidents come from.

Instruction ambiguity surfaces on edge cases. The prompt worked for the ten scenarios you tested. It fails on the scenarios you haven’t imagined yet.

State corruption is invisible without full tracing. The agent completes a step correctly but leaves a side effect that breaks the next one. The downstream failure looks unrelated to what caused it.

Retry loops only emerge under specific conditions. The agent gets stuck cycling through self-corrections, making things worse each pass. You’ve seen it once or twice. You can’t reproduce it on demand.

None of these reliably appear in a single manual run. They show up statistically: in the fifteenth run, in the input combination you didn’t try, in the edge case you didn’t imagine.
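To put a number on "statistically", here is a minimal sketch of the arithmetic. It assumes runs are independent and borrows the 20% figure from the tool cascade example above purely as an illustration. The function names are hypothetical.

```python
import math


def chance_of_seeing_failure(failure_rate: float, runs: int) -> float:
    """Probability that at least one of `runs` independent runs fails."""
    return 1.0 - (1.0 - failure_rate) ** runs


def runs_needed(failure_rate: float, confidence: float = 0.95) -> int:
    """Smallest number of independent runs that surfaces the failure
    at least once with the given confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


if __name__ == "__main__":
    # Assumed failure rate of 20%, taken from the illustration in the text.
    for runs in (1, 5, 15):
        print(runs, round(chance_of_seeing_failure(0.20, runs), 3))
    print("runs for 95% confidence:", runs_needed(0.20))
```

Under those assumptions, a single watched run misses a 20% failure four times out of five, and it takes 14 runs to be 95% sure of seeing it at least once. Rarer failures need far more runs, which is the case for automating them.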

What the test loop looks like

The Toyota equivalent for agents has five pieces.

A runner that launches your agent against a scenario set automatically - not when you remember to run it.

A logger that records every step: the input, the output, the tool calls, any failures - with full trace.

A failure detector that defines what “failure” means for your specific agent. Not just exceptions. Wrong outputs count. Incomplete tasks count. Silent wrong outputs especially count.

A pattern aggregator that groups failures by type, frequency, and failure point. Not a log dump, a pattern. For example: context drift is happening at step eight, in 18% of runs, in scenarios with more than three documents.

A reporter that surfaces the summary: here are the three places the system is breaking. You fix them. The harness runs again.
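The five pieces can be sketched in a few dozen lines. This is a toy under stated assumptions, not how LangSmith, Braintrust, or Langfuse work: the agent is any callable that returns a list of steps, the "DONE" completion marker is a made-up convention, and every name here (`Step`, `Trace`, `run_suite`, `detect`, `aggregate`, `report`, `make_flaky_agent`) is hypothetical. The detector only checks errors, empty outputs, and unfinished tasks. Catching silent wrong outputs would need a comparison against expected results for your specific agent.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Step:
    index: int
    input: str
    output: str
    tool_calls: list[str] = field(default_factory=list)
    error: Optional[str] = None


@dataclass
class Trace:
    scenario: str
    steps: list[Step]


# Assumption: an agent is any callable that maps a scenario to its steps.
Agent = Callable[[str], list[Step]]


def run_suite(agent: Agent, scenarios: list[str], runs_per_scenario: int) -> list[Trace]:
    """Runner and logger: run every scenario repeatedly, keep the full trace."""
    return [Trace(s, agent(s)) for s in scenarios for _ in range(runs_per_scenario)]


def detect(trace: Trace) -> Optional[tuple[str, int]]:
    """Failure detector: returns (failure_type, step_index), or None if clean."""
    for step in trace.steps:
        if step.error is not None:
            return ("exception", step.index)
        if step.output.strip() == "":
            return ("empty_output", step.index)
    # Hypothetical convention: a finished task ends with "DONE".
    if not trace.steps or "DONE" not in trace.steps[-1].output:
        return ("incomplete_task", len(trace.steps))
    return None


def aggregate(traces: list[Trace]) -> Counter:
    """Pattern aggregator: count failures by (type, step)."""
    counts: Counter = Counter()
    for trace in traces:
        failure = detect(trace)
        if failure is not None:
            counts[failure] += 1
    return counts


def report(counts: Counter, total_runs: int, top: int = 3) -> list[str]:
    """Reporter: the few places the system breaks most often."""
    return [
        f"{kind} at step {step}: {n}/{total_runs} runs ({n / total_runs:.0%})"
        for (kind, step), n in counts.most_common(top)
    ]


def make_flaky_agent() -> Agent:
    """Stand-in agent for the demo: every fifth call returns an empty step 2."""
    calls = {"n": 0}

    def agent(scenario: str) -> list[Step]:
        calls["n"] += 1
        second = "" if calls["n"] % 5 == 0 else "DONE"
        return [Step(1, scenario, "plan ready"), Step(2, "plan ready", second)]

    return agent


if __name__ == "__main__":
    traces = run_suite(make_flaky_agent(), ["invoice", "refund"], runs_per_scenario=5)
    for line in report(aggregate(traces), len(traces)):
        print(line)
```

The point of the shape is that the detector is the only piece specific to your agent. The runner, aggregator, and reporter stay the same as the definition of failure gets sharper.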

Tools like LangSmith, Braintrust, and Langfuse have been building this infrastructure for two years. The tooling exists.

The adoption hasn’t followed. 79% of teams have agents in production; 37% run systematic online evals (Zapier, State of Agentic AI, 2026). For teams still in early prototype mode, a full harness is probably premature: it makes sense once you have a body of scenarios worth running, not before. But for teams shipping production agents to real users, the distance between those two numbers is the gap.

The mental model, not the tools

At Finsi, the main work stopped being “launch the agent and watch it.” It became: build what runs the agents, logs where they fail, and surfaces what to fix. The agents improve because the feedback loop improves.

The definition of done changed with it. Not “I ran it and it looked fine.” Something closer to: N runs across scenario coverage, failure patterns reviewed, nothing above threshold. The kind of done you can hand off to someone who wasn’t watching.
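That definition of done can be written down as a check. A minimal sketch, assuming failure counts have already been grouped by type: the function name `is_done`, the 50-run minimum, and the 5% threshold are all hypothetical placeholders, not values from Finsi. Scenario coverage is left out here and would need its own check.

```python
def is_done(
    failures_by_type: dict[str, int],
    total_runs: int,
    min_runs: int = 50,        # assumed value for N
    threshold: float = 0.05,   # assumed maximum failure rate per pattern
) -> tuple[bool, list[str]]:
    """Return (done, reasons). Done means enough runs and no failure
    pattern above the threshold."""
    reasons = []
    if total_runs < min_runs:
        reasons.append(f"only {total_runs} runs, need at least {min_runs}")
    for kind, count in sorted(failures_by_type.items()):
        # With zero runs, treat any recorded pattern as failing.
        rate = count / total_runs if total_runs else 1.0
        if rate > threshold:
            reasons.append(f"{kind} fails in {rate:.0%} of runs, above {threshold:.0%}")
    return (not reasons, reasons)


if __name__ == "__main__":
    done, reasons = is_done({"context_drift": 18, "retry_loop": 2}, total_runs=100)
    print(done, reasons)
```

The list of reasons is the handoff: someone who wasn’t watching can read why the answer is no.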

That’s the shift. You stop being the operator at the line. You become the engineer who designed what the line detects.