Silent Failures - Why AI breaks when no one is watching

1. A trace we didn't check

At Vera we've spent five years building AI products for security and compliance, working with hundreds of customers.

Over the last year and a half we built a multi-agent system to answer security and due-diligence questionnaires on their behalf.

In late 2025, one of our customers' chatbots was asked, on a vendor's security questionnaire, whether the company had undergone any investigations or lawsuits in the last ten years. It answered "No" with confidence. Nothing in our logs, our metrics, or our trace viewer told us anything was wrong, and traces like it ran through our customers' deployments every day.

Here is what the trace showed:

Question (from the vendor's security questionnaire): "3.3 Has your company been the subject of any investigations, lawsuits, settlements, or regulatory action in the last 10 years?"

What the retrieval agent surfaced: "Based on company records and past responses, we find no indication that the company has been the subject of any investigations or regulatory action... in the last 12 months... in the last 7 years... in the last 3 years."

What the final-response agent shipped: "Our company has not been the subject of any investigations, lawsuits, settlements, or regulatory action in the last 10 years."

The sources covered seven years. The answer claimed ten. The post-processor approved it. The response shipped with an HTTP 200, well-formatted, denying ten years of legal history that the records couldn't verify.
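The gap between seven years and ten is mechanical enough to check in code. Below is a minimal sketch of such a check, assuming look-back windows are written with digits ("last 7 years"). The helper names `windows_in_months` and `overclaims_window` are hypothetical and written for this illustration. They do not describe any production system.

```python
import re

# Hypothetical pattern: matches phrases such as "last 7 years" or "last 12 months".
# It does not handle spelled-out numbers ("ten years"); that is a known gap in this sketch.
_WINDOW = re.compile(r"last\s+(\d+)\s+(year|month)s?", re.IGNORECASE)


def windows_in_months(text: str) -> list[int]:
    """Return every look-back window found in the text, converted to months."""
    found = []
    for number, unit in _WINDOW.findall(text):
        found.append(int(number) * (12 if unit.lower() == "year" else 1))
    return found


def overclaims_window(source_text: str, answer_text: str) -> bool:
    """True when the answer asserts a longer window than any source states."""
    supported = max(windows_in_months(source_text), default=0)
    claimed = max(windows_in_months(answer_text), default=0)
    return claimed > supported


source = (
    "Based on company records and past responses, we find no indication that "
    "the company has been the subject of any investigations or regulatory "
    "action... in the last 12 months... in the last 7 years... in the last 3 years."
)
answer = (
    "Our company has not been the subject of any investigations, lawsuits, "
    "settlements, or regulatory action in the last 10 years."
)

flagged = overclaims_window(source, answer)
print(flagged)
```

On the trace above, the longest window in the sources is 84 months and the answer claims 120, so the check flags it. A check this narrow catches only one shape of overclaim, which is part of why this class of failure is hard to cover by hand.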

That wasn't a polish issue. It was a fabricated denial of legal history in a vendor questionnaire, the kind of claim that becomes part of a procurement record and ends contracts when it's wrong.

The system was made of multiple agent nodes that handed off in a chain: a supervisor that orchestrated the pipeline, a route planner, a retrieval agent, a post-processor, a final-response agent. Customers extended it by adding custom rules and instructions at each of those nodes. Every customer's deployment was effectively running a different program.

We lost our best customers because traces like this one ran through their deployments every day, and we couldn't catch them. We couldn't manually check every trace. And we couldn't predict the failure modes that any particular customer's rules would create.

So we started checking. We began by manually annotating the last 50 production traces, with three developers reviewing each one. The same kind of failure kept showing up: the output looked right, the process behind it didn't, and nothing in our stack flagged it.

We started calling them silent failures. This post is what they look like, why your stack misses them, and what we're building instead.

2. What to look for

A silent failure has three properties. The agent completes. The output is plausible. The steps taken deviate from the plan in ways that are hard to check for.

Every architecture has its own failure surface. CrewAI orchestrators and LangGraph state machines fail differently at the handoffs. RAG pipelines add retrieval and citation failures. Customer-configurable agents fail per-tenant, in configurations you never wrote tests for.

Here are some of the failures we saw in our own production traces:

An agent claims more than its sources support.
A verifier reaches the wrong conclusion.
A verification step that should have run is missing entirely.
An agent silently picks one reading of an ambiguous question.
Information that an upstream agent had is dropped before it reaches the downstream agent that needed it.

None failed an eval. None moved a metric. None even threw an error. To catch any of them we had to walk the trace by hand and read each step against the user's intent and the context of their deployment.
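To make one of these concrete, here is a rough sketch of how "information dropped before it reaches the downstream agent" could be checked on a recorded trace. The `Step` record, the node names, and the example trace are all assumptions made for illustration. Real trace formats differ by framework, and plain substring matching is only a stand-in for a real comparison.

```python
from dataclasses import dataclass


@dataclass
class Step:
    node: str    # which agent node ran (hypothetical names)
    input: str   # what the node received
    output: str  # what the node produced


def dropped_terms(trace: list[Step], upstream: str, downstream: str,
                  terms: list[str]) -> list[str]:
    """Terms the upstream node produced that never reached the downstream node."""
    by_node = {step.node: step for step in trace}
    up_out = by_node[upstream].output.lower()
    down_in = by_node[downstream].input.lower()
    return [t for t in terms if t.lower() in up_out and t.lower() not in down_in]


# Hypothetical trace: the "7 years" qualifier is lost at the handoff.
trace = [
    Step("retrieval", "Question 3.3",
         "No indication of investigations in the last 7 years."),
    Step("final_response", "No indication of investigations.",
         "No investigations in the last 10 years."),
]

dropped = dropped_terms(trace, "retrieval", "final_response", ["7 years"])
print(dropped)
```

The catch is the `terms` argument: someone has to know which qualifiers matter for a given question and deployment before the check can run.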

The first problem is naming the errors and defining what each one means, given how much their shape and context vary from one deployment to the next. A few resources we found useful when annotating our own traces:

MAST: multi-agent system failure taxonomy (Cemri et al., 2025)
TRAIL: trace reasoning and agentic issue localization (Patronus AI, 2025)
RAGAS: retrieval-augmented generation evaluation

3. Why your current stack doesn't catch this

The second problem is your stack. Three layers: evals, traces, observability. Each works for what it was built for. None of them was built for this.

Evals score outputs. They work for closed-domain tasks where there's one right answer: a JSON schema, a classification label, a unit-test pass. They don't work for open-ended agentic work where the output is plausible but the process is wrong. Hamel Husain and Eugene Yan have written the canonical guides on building output-based evals well. Even when done perfectly, output-based evals can't see the gap between what the agent did and what it should have done. And evals are static. The suite you wrote yesterday doesn't know about the rule your customer added to a node this morning.

Traces show what happened. LangSmith, Langfuse, and Arize give you the full timeline of an agent run: what each node received, what it produced, where it called the model. The information is all there. What's missing is judgment: traces tell you the sequence ran, not whether the sequence was right. A trace of a silent failure looks clean. To catch the failure you have to read the trace, and you have to know what you're reading against.

Observability watches for measurable signals. The observability stack (Honeycomb, Datadog, Grafana) was built for distributed software: dashboards, metrics, alerts. These tools work when something measurable changes: rising latency, a climbing error rate, an anomaly in a metric. Silent failures don't change any of those. Latency is normal. The error rate is zero. The dashboard is green. Nothing fires.

Each layer gives you part of the picture. Evals score the answer. Traces show the steps. Observability watches for measurable signals. None of them measures whether the steps the agent took match the steps it was supposed to take.
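The simplest version of that missing measurement is structural: compare the nodes that ran against the nodes the plan called for. The sketch below assumes a trace can be reduced to an ordered list of node names. The names are hypothetical labels for the pipeline described earlier, and both functions are written for illustration only.

```python
def missing_steps(expected: list[str], actual: list[str]) -> list[str]:
    """Planned nodes that never ran, listed in plan order."""
    ran = set(actual)
    return [node for node in expected if node not in ran]


def in_plan_order(expected: list[str], actual: list[str]) -> bool:
    """True when the nodes that did run keep the plan's relative order."""
    positions = [expected.index(node) for node in actual if node in expected]
    return positions == sorted(positions)


# Hypothetical plan and run: the post-processor (the verification step) is skipped.
plan = ["supervisor", "route_planner", "retrieval", "post_processor", "final_response"]
run = ["supervisor", "route_planner", "retrieval", "final_response"]

skipped = missing_steps(plan, run)
ordered = in_plan_order(plan, run)
print(skipped, ordered)
```

This catches a verification step that never ran. It does not catch a verifier that ran and reached the wrong conclusion, as the post-processor did in the trace above. For that, you have to read what each step did against what it was supposed to do.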

4. What we're building

What does a real solution to silent failures look like? It has to do three things.

Detect failure modes deterministically, across architectures. The detection layer has to read the trace and identify what went wrong, whether you built on CrewAI or LangGraph or rolled your own. The catalog of failure shapes doesn't change with customer configuration. This is the part that transfers across deployments.

Evaluate impact against the user's intent and the agent's instructions. A failure mode in isolation is noise. The impact layer has to filter the detections down to the ones that actually mattered for this particular request, in this particular customer's deployment. The same failure mode can be a deviation in one context and irrelevant in another.

Close the loop. Detection without remediation is observability that no one acts on. The solution has to propose a fix: sometimes a prompt change, sometimes a new verification layer, sometimes a restructure of how the system is built. And then it has to generate custom evals shaped to the deviation it found, so the same shape of failure can't quietly slip back into production.

We've been building this. We call it Helix.

5. Join the waitlist

We are building Helix in the open. There will be an open-source repo, a self-hosted version you can run inside your own infrastructure, and a hosted version we manage for you. None of it is ready yet.

If you are building a multi-agent system in production and silent failures sound familiar, join the waitlist and we will let you know when it ships.

You can also reach us at founders@getvera.ai. If you're interested in the space, we'd love to chat!