How to really stop your agents from making the same mistakes

Other · Free · Article · 2mo ago

About

LangChain has raised $160 million. Three years of development. A billion-dollar valuation. LangSmith, their testing platform, is genuinely sophisticated: trajectory evals, trace-to-dataset pipelines, LLM-as-judge, regression suites, unit test frameworks for tools. They have the pieces. Credit where it's due.

But pieces aren't a practice. LangChain gives you testing tools. It never tells you what to test, in what order, or when you're done. There's no opinionated workflow that says, in order:

1. this failure happened
2. now write a skill
3. now write the deterministic code
4. now write unit tests
5. now write LLM evals
6. now add a resolver trigger
7. now eval the resolver
8. now audit for duplicates
9. now smoke test
10. now file correctly

That loop doesn't exist. You have to invent it yourself from scattered primitives. $160 million in funding, and most LangChain users still don't test their agents, because the framework gave them a gym membership without a workout plan.

Most AI agent "reliability" is vibes-based. Prompt tweaks. Bigger system messages. "Please don't hallucinate" incantations. That stuff decays the moment the conversation gets complex. The frameworks that raised hundreds of millions of dollars to solve this gave you monitoring dashboards and unit test helpers and said "good luck."

My agent screwed up twice this week. Neither failure can happen again. Not because I asked nicely. Because I turned each failure into a permanent structural fix: a skill with tests that run every day, forever.

What hundreds of millions of dollars of venture capital couldn't buy you, I am going to give you today, free and open source. I call the practice "skillify." Once you use it, your agents won't keep making the same mistakes. Here's how it works.

Failure 1: The Trip That Was Already in the Database

I asked my OpenClaw about an old business trip, nearly ten years back, buried somewhere in calendar history. Simple question. Should take one second. Instead the agent did this: Called the live calend
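The ordered loop in the excerpt above (failure, skill, deterministic code, unit tests, LLM evals, resolver trigger, resolver eval, duplicate audit, smoke test, filing) can be pictured as a plain checklist that always answers "what comes next" and "am I done". The sketch below is only an illustration of that idea, not the author's implementation: the step identifiers and the `next_step` and `is_done` helpers are hypothetical names chosen here.

```python
from typing import Optional

# Hypothetical identifiers paraphrasing the article's ten ordered steps.
SKILLIFY_STEPS = [
    "record_failure",
    "write_skill",
    "write_deterministic_code",
    "write_unit_tests",
    "write_llm_evals",
    "add_resolver_trigger",
    "eval_resolver",
    "audit_for_duplicates",
    "smoke_test",
    "file_correctly",
]


def next_step(completed: set[str]) -> Optional[str]:
    """Return the first step not yet completed, or None when nothing is left."""
    for step in SKILLIFY_STEPS:
        if step not in completed:
            return step
    return None


def is_done(completed: set[str]) -> bool:
    """A failure counts as handled only once every step has been completed."""
    return next_step(completed) is None


if __name__ == "__main__":
    progress = {"record_failure", "write_skill"}
    print(next_step(progress))  # write_deterministic_code
    print(is_done(progress))    # False
```

Keeping the order in one list is what makes "in what order" and "when you're done" explicit; anything richer (per-step artifacts, daily test runs) would sit on top of this and is not described in the excerpt.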

Why it made the leaderboard

A skeptical look at agent evaluation and observability, using LangSmith's sophistication as a foil, and at what actually stops an agent from repeating failures versus what merely measures them. Read it if you are drowning in traces but the agent keeps making the same mistake.
