[{"content":"Hi, I am Minh, and I am currently working on the safety layer for AI agents - that is, the mechanisms that determine when the model can act independently, when a human has to intervene, and what the system should do after something goes wrong. I studied CS at Hanoi University of Science and Technology.\nIn my experience, AI-agent demos tend to succeed once, flakily, and then quietly fail on the second run. One fails a test on a flaky feature; another never enforces a safety measure that was written as a statement and assumed to be a guard. Most importantly, the failure stays hidden until it comes back to haunt you, and no one can tell you which assumptions were critical. The space between what a system believes and what the documentation says is where I operate.\nSo, if you asked me to describe what I do in two sentences, I would say something like this:\nI build the guardrails and the guardrails on the guardrails; I encode the decisions that must not be delegated as mechanisms, not as habits I hope to remember.\nI’m early in my career by almost every metric, but I am very careful not to blur the human-agent boundary: I design the space that I keep for myself, I make the system’s automatic procedures reversible, and I make the system fail loudly, not silently.\nThat is the claim. The rest of this note is me talking about what I actually did, mostly decisions that I made, including the ones that failed. I will describe myself through action rather than adjectives. And since I am writing this note on the 30th of June, some of these things may have changed by the time you read this, but most will not. 
Please enjoy my ideas; I hope you will like my thoughts.\nYou can take a look into my real work on contributions page and my learning posts on posts page.\ni stopped writing rules for my agent and started building walls At some point while discussing possible moves for my agent, it suggested pushing to main as a result of reasoning through a confusing git state. It concluded that the mess would be best cleaned up by a push, which violated a rule I\u0026rsquo;d written in natural language for the agent to avoid. The agent read the rule, balanced it against the tangle it was in, and decided to push anyway. It wasn\u0026rsquo;t being willful; it was operating as designed within a probabilistic framework. This was my lightbulb moment, and it rewrote how I approached the whole system.\nI hadn\u0026rsquo;t written a rule; I\u0026rsquo;d made a documentation request. A prompt is something you issue to a model. A wall is something you build around it. The model may well push back at a request. It has no equivalent when it encounters a wall because it\u0026rsquo;s built into the model\u0026rsquo;s ability to operate within constraints. That\u0026rsquo;s why I changed my approach to rule-writing. I broke the one rule I\u0026rsquo;d written into three discrete layers, the last of which was a natural language prompt. The push to main would never happen twice, because neither the deny-listing, the blocking rule, nor the prompt could be tripped by a lapse in concentration or a moment of enthusiasm. I\u0026rsquo;d built the kind of defensiveness you do when you\u0026rsquo;re facing a single point of failure that could see you lose money, blow up a server, or get yourself fired. In that situation, you\u0026rsquo;re not looking for a 95% solution; you want something deterministic and absolute that doesn\u0026rsquo;t rely on anyone being sober enough most of the time to handle whatever disaster is unfolding.\nI reserve the right to the same certainty when I\u0026rsquo;m working. 
Any session I run with an agent must pass a set of gates before getting anywhere. What\u0026rsquo;s the goal? Is this the right task to be working on? Do we have a reasonable plan before proceeding? None of these questions accept \u0026quot;I think I should, but I\u0026rsquo;m not sure.\u0026quot; In all three cases, the answer has to be concrete or the agent is kicked back immediately. I\u0026rsquo;ve trained my own fail-closed instincts to apply to the same risks that I\u0026rsquo;ve outlined for the agent, because the worst-case scenario for me is that everything looks good and nothing goes wrong until it does. I don\u0026rsquo;t want to lose sight of the fact that these rules aren\u0026rsquo;t suggestions; they\u0026rsquo;re the absolute minimum when I\u0026rsquo;m working.\nThe deny-list was never really about main for me. It was the first rule I wrote that encapsulated the philosophy behind the rest of the system. My agents get to make all their own little decisions freely inside a box I drew around them, and they have to ask me for permission to cross any barrier I\u0026rsquo;ve put up. Everything they\u0026rsquo;re likely to want to experiment with - branches, commits, running through ideas and scenarios - is available inside their sandbox. Everything that has the potential to irreversibly alter the bigger picture - a merge to main or an action affecting production - is locked behind walls I build individually. This is how I let the agent move quickly while still feeling safe, because the walls stop it from doing anything that could expose me to risk.\ni don\u0026rsquo;t trust the agent - and i don\u0026rsquo;t trust my own guardrails either Then one of them decided to bite me\nI have a second guardrail, cousin to the one that bit me - the agent can\u0026rsquo;t edit files while I\u0026rsquo;m on main. 
The intention was to make me work on feature branches, so I made a feature branch inside a worktree, told the agent to write to that feature branch, and got blocked on the branch it was supposed to be writing to. The intended safeguard had malfunctioned, punishing me for doing exactly what it was designed to incentivize.\nThe reason for this malfunction was simple, if buried: the script looked at the branch of the folder I\u0026rsquo;d initialized my session in, not the branch of the file I was writing. The two are usually the same, which is why the bug manifested rarely and unpredictably - worktrees make it easy to start a session in one branch while writing to another, and the script had no idea which one it should check.\nI had written something correct as English, but wrong as code\nThe rule said not to edit files on main, but the script checked the branch of the process when it should\u0026rsquo;ve checked the branch of the file. The bug was in assuming the two were linked when they weren\u0026rsquo;t.\nThis is the part where I tell you that, when faced with this situation, I didn\u0026rsquo;t turn off the guardrail to bypass the bug. I fixed the bug, because it was an honest mistake: the control plane was looking at the wrong thing. The change to fix it was simple: one line of code, changing the query from one that grabbed the folder\u0026rsquo;s branch to one that grabbed the file\u0026rsquo;s. I learned something far more valuable than a working fix, however: a control that fires at the wrong time is often more dangerous than no control at all, because the absence of a control is obvious, while a control that hides its own failure is much harder to tell apart from one that works.\nThe reason I caught the error in the first place was that I don\u0026rsquo;t trust my control plane to do its job, even when it does do its job. 
I am the one who wrote the safeguards, and so I am the one who writes the things that check if the safeguards are doing their job. At the beginning of every session, some code reads my config, checks every single guardrail I\u0026rsquo;ve written, and reports in red if any of them appear to be missing. Every command the agent issues is written to an audit log, and when one of my guardrails fails, it reminds me to recheck my permissions at the end of the session. There is nothing flashy about any of these practices - on the contrary, there\u0026rsquo;s a certain dryness to them, even a boringness. None of them exist because I asked myself, \u0026ldquo;What could go wrong?\u0026rdquo; and acted on the answer. All of them exist because I asked myself, \u0026ldquo;What did I think would go wrong, that did not in fact go wrong,\u0026rdquo; and acted on the answer instead. I orchestrate the agent and I verify the agent. The day that I started verifying my own verification tools is the day that I stopped being fooled by my own delusions of safety.\nthe part that isn\u0026rsquo;t the agent The question most people who use an agent ask themselves while using it is whether they can trust it on the task. I stopped asking myself this question. It has an answer that is mostly right, but mostly is a bad place to be when working with something you can only partially control. Instead, I begin asking myself a question that I used to have to answer for the agents I used to work with: can this be made reversible? And, if it cannot, do not do it. This is the single question that underlies the rest of this particular mode of operation, which has three parts.\nGated entry. No work proceeds until intent, effort, and a general plan are discussed and pinned down. The gates do not open even to begin work, and they do not care how much I am in a hurry. 
Most of my aborted conversations with agents began with this very step not being completed, and I was never able to reverse an action I had taken after replying affirmatively to one of these questions.\nIsolated work. By working across contexts, I prevent any given investigation from becoming compromised by the conclusions of another. When the agent is deployed, each thread must have its own context, brief, and purpose, and the agent\u0026rsquo;s time spent on any one investigation must be seen as an investment that cannot be recovered. In this way, I treat the attention budget of the agent as a fixed, non-renewable resource.\nStaged handoffs. Work is handed off to the next stage only when the current one is nearing a point of no return, and work is able to proceed at a rapid pace while it can be stopped short at any time. In this model, the reversible steps are those that can proceed without human supervision, and irreversible steps are where humans have to step in. It is not a particular concern of mine to avoid irreversible steps, since that is how human agents work: I can let the system do its work as quickly as possible while I am prepared to stop it when it needs to slow down in order to make a choice.\nThis is the part where I begin working on something else. It is the boundary condition for a human agent, and it is how a team of ten would be able to scale the same system: fast where it can be fast, stopped where it needs to be stopped, and not wasting time or human hours on work that can be done by something else. I am not a better typist with a model attached, but I have designed an operating system for the model that allows it to make judgments that have permanent consequences.\nSo, that\u0026rsquo;s me I work on the bits of agents that no one demos: the control, the boundaries, the failure modes; I am most interested in a system that fails loudly. 
I am in the earlier part of the stack, building in public, where I would rather expose my own guardrails\u0026rsquo; bugs than claim publicly that they are not there.\nIf you build agentic systems and have ever felt the particular existential dread when a safeguard you thought was there is revealed to be absent, I would appreciate your help, and I would like to hear what you think about my work. You can read my code and contributions here: my contributions or see my learning notes here: my learning posts. I prefer you judge for yourself.\nThank you so much for reading this. If you are interested, please reach out to me at my X/Twitter or Reddit\n","permalink":"https://minh2416294.github.io/posts/i-build-the-guardrails-around-ai-agents/","summary":"A self-intro. I work on the controls around AI agents - what they can do, what stays human, and how the system fails when something slips. Here\u0026rsquo;s how I think, shown through the work, including its limits.","title":"Code, Context, and Me"},{"content":"I had a line in my global instructions saying that Claude should never push to main. Claude obeyed. Then, during some reasoning about a situation involving a problematic git history, it suggested a git push origin main to \u0026ldquo;fix\u0026rdquo; a branch mismatch. It had read the instructions and concluded that this situation was an exception.\nThe thing about the instructions file is that it serves as a strong prior for the model, rather than a strict set of constraints. The model processes and applies the CLAUDE.md file on every interaction and mostly follows it to the letter, with one or two caveats. However, \u0026ldquo;mostly\u0026rdquo; is not a reliable foundation for an absolute rule, so any instruction whose violation carries real risk is better enforced elsewhere.\nI split my agent config between the markdown file and actual hooks, with the latter being executed by the harness before any tool execution. 
This way, the model cannot second-guess them and try to \u0026ldquo;get around\u0026rdquo; them since it has not processed them.\nWhat the two layers actually are A CLAUDE.md rule is text the model reads and weights against everything else in context. A hook is a PreToolUse (or PostToolUse, Stop, SessionStart…) script the harness runs around a tool call; it returns deny, and the call never happens. No reasoning, no exception, no \u0026ldquo;this case is different.\u0026rdquo;\nHere\u0026rsquo;s the same rule — don\u0026rsquo;t edit on a protected branch — as a hook rather than a sentence:\n# PreToolUse on Edit|Write|MultiEdit $file = $input_json.tool_input.file_path $branch = git -C (Split-Path -Parent $file) rev-parse --abbrev-ref HEAD 2\u0026gt;$null if (@(\u0026#39;main\u0026#39;, \u0026#39;master\u0026#39;) -contains $branch) { # return permissionDecision = \u0026#34;deny\u0026#34; } The difference from a line in a CLAUDE.md saying \u0026ldquo;don\u0026rsquo;t edit on main\u0026rdquo; is critical. The textual version is something the model is typically trained to obey. This version is a wall; the model suggests the edit, the harness executes the script, the script refuses, and the edit doesn\u0026rsquo;t happen. It is not possible to convince the script otherwise.\nThis is the price you pay, and the value you gain. A hook can\u0026rsquo;t make judgements. It will refuse the edit even in situations where editing on main would have been acceptable. You are trading flexibility for certainty, and you should only do that when certainty has value.\nThe line I draw My rough rule for which layer a rule goes in:\nMarkdown ( CLAUDE.md / path-scoped rules) - preferences and conventions that are generally right 95% of the time and benign if wrong: naming conventions, “explain technical terms”, “no nested ternaries”, “ask before doing destructive things”. 
If the model misses one, I catch it in review and nothing burns down.\nHooks - anything that is catastrophically wrong if done even once: pushing to main, rm -rf, reading .env or ssh keys, making edits on protected branches. Also, anything that needs to be done no matter what the model says.\nThat last one is the important one people miss. Hooks aren’t just about not-doing-things - they’re about doing-things the model may forget. Every time a file is edited, a PostToolUse hook runs to make sure they’re formatted and type-checked:\n# PostToolUse on Edit|Write|MultiEdit — runs unconditionally if ($file -match \u0026#39;\\.py$\u0026#39;) { ruff format $file ruff check $file --fix pyright $file } I could put \u0026ldquo;run ruff after editing Python\u0026rdquo; in CLAUDE.md. It would mostly work. But \u0026ldquo;mostly formats the code\u0026rdquo; is a strictly worse outcome than \u0026ldquo;always formats the code\u0026rdquo;, and there\u0026rsquo;s no judgement involved in running a formatter, so there\u0026rsquo;s no reason to leave it to the model\u0026rsquo;s attention budget. Deterministic work gets put in deterministic code.\nSame goes for the shell injection stuff; I\u0026rsquo;ve got a PreToolUse hook on Bash that hard-denies a short list of patterns - rm -rf, sudo, git push --force, chmod 777, etc etc - and a few prompt injection strings like ignore previous instructions and you are now. The model isn\u0026rsquo;t getting to decide whether rm -rf is a dangerous command or not; it\u0026rsquo;s being blocked before it can be run.\nThe hook that exists because a hook didn\u0026rsquo;t fire The case that taught me the most was a hook I wanted but the harness couldn\u0026rsquo;t reliably deliver.\nI run Claude Code in the desktop app and there\u0026rsquo;s a SessionEnd event. 
The obvious place to snapshot the git state at the end of a session is the SessionEnd hook, but SessionEnd isn\u0026rsquo;t reliably called when you close the desktop window (it\u0026rsquo;s an ungraceful close). At the very moment I needed it most, the lifecycle event wasn\u0026rsquo;t there. So I moved the snapshot to the Stop hook, which fires at the end of every turn, and throttled it so it only does real work once every five minutes:\n# Stop hook — fires every turn, throttled to once / 5 min. # SessionEnd is unreliable on window-close, so snapshot here instead. if ($due) { $status = git status --short $snapshot = @(\u0026#34;Last turn state — $stamp\u0026#34;, \u0026#34;Branch: $branch\u0026#34;, \u0026#34;Uncommitted files: $dirtyCount\u0026#34;) + $status Set-Content -Path $stateFile -Value $snapshot # overwrite, never grows } Now even if the window dies ungracefully, a recent git state is already on disk from the most recent throttled snapshot. The insight that generalizes: a hook is only a guarantee if the event it\u0026rsquo;s attached to is a guarantee. PreToolUse is always called before the tool, every time. Stop is called every turn. But SessionEnd on a GUI app isn\u0026rsquo;t guaranteed to be called at all. Pick the event that actually happens, not the one whose name sounds most appealing.\nThe one cost I didn\u0026rsquo;t anticipate Hooks are silent, and a deterministic guardrail that goes unannounced is a problem of its own. When a PreToolUse hook denies a call, I need to know that it denied it and why, lest the agent act like it never happened or, worse, bend around the rule without me knowing that it was even there.\nThat\u0026rsquo;s why all my hooks loudly declare themselves. 
Blocks emit a systemMessage (🛑 [PreToolUse] branch-guard BLOCKED Edit on protected branch \u0026lsquo;main\u0026rsquo;) and the instruction set tells Claude to report those lines in chat, rather than ignore them. The ability to observe the system is critical to the hook\u0026rsquo;s utility - an unseen guardrail is no different from a bug: the agent does something unexpected, and there is no way to distinguish between the two.\nWhat I\u0026rsquo;m not sure about is calibration - the art of writing rules that are mostly deterministic but occasionally allow exceptions. \u0026ldquo;Don\u0026rsquo;t add a new dependency without asking\u0026rdquo; is too important to leave as markdown, but a hard-coded hook would wrongly penalize those times I actually do want to add a new dependency. The answer is probably a third category of instruction, one that asks rather than blocks, but I\u0026rsquo;m not certain which guardrails should prompt the user instead of denying outright.\nThank you so much for reading this. If you are interested, please reach out to me at my X/Twitter or Reddit\n","permalink":"https://minh2416294.github.io/posts/prompts-ask-hooks-enforce/","summary":"A rule stated in the markdown is typically followed, but when it\u0026rsquo;s not, that rule belongs in code that runs before the tool itself.","title":"Using Hooks as Deterministic Guardrail"},{"content":"I have a rule that Claude can\u0026rsquo;t edit files on my computer while I\u0026rsquo;m on main. It\u0026rsquo;s enforced by a hook, not a prompt: a PowerShell script that runs on every Edit/Write command, checks the branch, and denies the operation if it\u0026rsquo;s protected. The intention was to enforce working in feature branches via git worktrees, which is the only git discipline I care to follow.\nIt prevented me from working in a worktree. 
The guardrail denied the very workflow it was designed to enforce.\nThis incident, along with another one I\u0026rsquo;ll describe later, is what made me stop thinking about my agent config as \u0026ldquo;rules I\u0026rsquo;ve written\u0026rdquo;, and start thinking about it as \u0026ldquo;software that may have bugs\u0026rdquo;. Guardrails are code. Code is prone to errors. My guardrails had two notable ones.\nThe branch-guard that blocked feature branches Here\u0026rsquo;s the hook. It\u0026rsquo;s wired to PreToolUse on Edit|Write|MultiEdit:\n$file = $input_json.tool_input.file_path $branch = git rev-parse --abbrev-ref HEAD 2\u0026gt;$null # the bug $protected = @(\u0026#39;main\u0026#39;, \u0026#39;master\u0026#39;) if ($protected -contains $branch) { # return permissionDecision = \u0026#34;deny\u0026#34; } Read where git rev-parse runs. It is run in the current directory of the session. I always start my code sessions from the main project directory because starting from a worktree subdirectory caused the harness to reask for permissions for tools it would have automatically inherited otherwise. The session cwd is main\nWhen I create a worktree on a branch and ask Claude to write a file into that worktree, the hook does not look at the worktree. It runs git rev-parse in the session cwd and rejects the write because that resolves to main . The file the user is trying to write is on feat/whatever; the hook has no awareness of that.\nThe guard\u0026rsquo;s understanding of what branch it is on is the ambient process state. What the guard is actually guarding is a file. 
Those are two different things once you start using worktrees, which is exactly the workflow this guard was supposed to encourage.\nThe fix is obvious: resolve the branch from the file being written to rather than the session cwd.\n# before — reads the branch of the session\u0026#39;s CWD $branch = git rev-parse --abbrev-ref HEAD 2\u0026gt;$null # after — reads the branch of the file being edited $branch = git -C (Split-Path -Parent $file) rev-parse --abbrev-ref HEAD 2\u0026gt;$null git -C \u0026lt;dir\u0026gt; runs the command as if you were in that directory. Now a write into a worktree on a feature branch resolves to that feature branch (and passes), while a write into the main-rooted tree resolves to main (and still blocks). The guard finally guards the file, and not the process.\nThe lesson is cheap to state and easy to get wrong: the place a guardrail looks must match the unit of control it is trying to guard. The hook guards file edits, and so it must evaluate the branch of the file being edited, and not the branch of the shell. I had written \u0026ldquo;block edits on main\u0026rdquo; and then assumed that the CWD\u0026rsquo;s branch and the file\u0026rsquo;s branch were the same thing. Worktrees are the case where they are not.\nThe permission rule that was never reached The second bug was a silent failure. The permission rule was written, but had no effect.\nMy settings.json defines three permission categories: deny (never run), ask (prompt me), and allow (run silently). For a while I had in-directory file mutations (mv, sed -i) fall into ask, meaning that I would be prompted before Claude made any in-place changes. Seems sensible.\nWhat actually happened is that I was never prompted. The in-directory file writes always ran.\nI found the bug while writing a 25-case permission self-test, a script that runs through a representative sample of commands and builds a report of which were denied, prompted, or run. 21/25 worked as intended. 
The four that failed were all in-directory writes I\u0026rsquo;d put in ask - they were all running silently as if they were in allow.\nThe cause is obvious in hindsight: the permission rules are evaluated in the order the harness applies them, not the order in which I had placed them in my settings.json:\nPreToolUse hook → deny → mode → sandbox → ask → allow\nI run with the sandbox on (enabled: true, autoAllowBashIfSandboxed: true) and filesystem.allowWrite includes \u0026ldquo;.\u0026rdquo; - the current directory. So any command that only writes inside the project directory is auto-allowed by the sandbox gate. Sandbox comes before ask in the pipeline; by the time execution would reach my ask rule, the sandbox had already allowed the command and made the decision. The ask entry was dead config, a guardrail that no longer guarded anything.\nIt\u0026rsquo;s worse than useless, really - at least, worse than not having the rule at all - because the absence of a rule is honest. A dead rule is like a tripwire that wasn\u0026rsquo;t hooked to anything.\nThe real trade-off was between keeping the sandbox on and accepting that all the commands I wanted to be asked about were silently auto-allowed, or turning off autoAllowBashIfSandboxed to restore my guardrail. I chose to keep the sandbox, because the containment it provided was still a better guarantee - those commands physically can\u0026rsquo;t write anywhere outside the project directory, sandbox or no sandbox, because of the separate PreToolUse hook that runs at the front of the pipeline, before any of the prompt-based rules. Turning them off would\u0026rsquo;ve been a silent allow-all for every read-only command, which is most of the commands I use the sandbox for anyway. (Not a deal breaker, but not an improvement either.)\nSo I deleted the dead ask entries rather than trying to un-break the sandbox. 
Reinstalling them would\u0026rsquo;ve been ineffective - they\u0026rsquo;d be running after the sandbox and would only ever trigger on commands that the sandbox didn\u0026rsquo;t already allow. The only way to add in-directory safety checks is to write a PreToolUse hook that returns ask, because hooks run at the front of the pipeline - the fix for \u0026ldquo;my rule is at the wrong level\u0026rdquo; is always to move it to the right level, not to shout harder.\nWhat both bugs have in common Both times, I made the same class of error: I wrote a rule that acted as if it was in a certain place in the system, but it wasn\u0026rsquo;t. The branch-guard assumed its environment (the CWD) was always going to be in a particular place relative to the file it was trying to modify, but that wasn\u0026rsquo;t always the case. The ask rules thought they\u0026rsquo;d be run on every relevant command, but they were overridden and dead. In both cases, I had correct English sentences - \u0026ldquo;don\u0026rsquo;t edit on main branch\u0026rdquo;, \u0026ldquo;ask before mv\u0026rdquo; - that weren\u0026rsquo;t correct code because I hadn\u0026rsquo;t taken into account all the places the rule had to be true.\nWhat made the second bug easier to spot than the first was a test. The branch-guard was discovered the slow way - by stumbling onto it while using the tool. The dead permission rules were found because I\u0026rsquo;d written a self-test that walked through the permission matrix, checking that all the intent rules were actually installed. The self-test is the thing I\u0026rsquo;m advocating for here; when I started working on this, guardrails were features, not separately-security-audited things. 
I think a lot of people make this same class of error when writing prompt-based rules because they can\u0026rsquo;t easily observe all the different execution paths that apply to a rule.\nThe thing I\u0026rsquo;m worrying about is that now I have six phases in my permission pipeline plus eight hooks and six path-scoped rule files, and I can barely keep track of the evaluation order anymore. The next step isn\u0026rsquo;t necessarily more clever rules - it may just be to make that 25-case self-test a thing I commit along with the config, since I would never write an application without also writing tests for it.\nThank you so much for reading this. If you are interested, please reach out to me at my X/Twitter or Reddit\n","permalink":"https://minh2416294.github.io/posts/two-bugs-in-my-own-guardrails/","summary":"A guardrail that blocked the workflow it was meant to protect, and a permission rule that was silently dead. Both were bugs in my own config.","title":"Debugging My Agent Guardrails"},{"content":"Three hours into one particularly lengthy coding agent run, I noticed the agent was behaving strangely. Its tool calls became less sophisticated; certain previously-captured context the agent had explicitly noted wasn\u0026rsquo;t being used in the thinking process. The agent continued to perform its task, but it was no longer doing so as effectively as before. Yet not a single log entry suggested something had gone wrong.\nThat\u0026rsquo;s the instrumentation gap. The agent was producing a wealth of logs, but none of them captured the truly interesting behaviors. Contextual information, reasoning, proper escalation, and the fidelity of information surviving a compaction event were all left out, resulting in an agent that could only be considered as \u0026ldquo;working\u0026rdquo; or \u0026ldquo;broken\u0026rdquo;, with nothing in between.\nThe calibration cascade you should know Multi-step agents compound errors silently. 
For example, if each step has a 75% chance of succeeding (a reasonable estimate for moderately difficult reasoning tasks), a three-step pipeline has 0.75^3 ≈ 42% system reliability. A five-step pipeline would have 0.75^5 ≈ 24%.\ndef system_reliability(per_step_accuracy: float, steps: int) -\u0026gt; float: return per_step_accuracy ** steps # What most people assume they have: system_reliability(0.95, 5) # ≈ 0.77 # What they may actually have: system_reliability(0.75, 5) # ≈ 0.24 The math isn\u0026rsquo;t the insight. The insight is that the calculation isn\u0026rsquo;t run by anyone at design time. Per-step accuracy is assumed to be \u0026ldquo;good enough\u0026rdquo; for each individual step, without considering the product across the whole pipeline. A five-step research agent, with each step accurately performed, is a safe agent. A five-step research agent with 75% accuracy per step has only a 24% chance of getting every step right, and a wrong answer will seem just as correct as a right one, since the agent will have followed through on all steps.\nThe right question to ask in determining an agent\u0026rsquo;s design is \u0026ldquo;What per-step accuracy is needed to get my desired system accuracy, given the number of steps I\u0026rsquo;ve chosen?\u0026rdquo; If you need 85% system reliability, two steps at 92% get you there (0.92^2 ≈ 85%), while two steps at 80% give you only 64%.\nWhat /compact actually destroys A typical longish codebase exploration session ends with the model generating a mental image of the code structure: which files call which, where the state is, what the constraints are, and so on. This takes 30-45 minutes to form and ends up taking 80,000 tokens to represent in the model’s mind.\n/compact shrinks that to 2,000 tokens. But what gets thrown away in the process? The edges, primarily. Particulars. Implementation details. 
That edge case in auth.ts:247 that was never formally documented but was the reason why the session state couldn’t be shared between two auth modules. The reason why the model tried to implement the session storage in Redis three hours ago and then abandoned the idea in favor of something else.\nThe model is unaware of those details and, crucially, of the loss: when asked to perform any task that requires those discarded details, it will make up plausible-sounding but incorrect statements: give the appearance of competence while being unaware of its limitations. Said task will fail, often in ways that are hard to diagnose.\nThe solution, obviously, is not to avoid the /compact step. A three-hour exploration session followed by a 30-minute /compact inevitably loses some information. The solution is to make sure that the information loss is tracked and acknowledged by the model: prior to running the /compact command, generate a ‘state manifest’ in a separate scratchpad file that documents what is being thrown away.\n# Write this before compaction, update as work progresses state_manifest = \u0026#34;\u0026#34;\u0026#34; ## Active investigation Task: Refactor auth middleware to support OAuth alongside session tokens. ## Confirmed facts (do not re-derive) - Session tokens stored in Redis, 24h TTL, key pattern: session:{user_id}:{nonce} - auth.ts:247 has a known race condition on concurrent refresh — do not touch - middleware.ts imports from auth.ts but NOT from session.ts directly - The `is_admin` flag comes from the user record, not the token ## Ruled-out approaches - JWT: rejected because existing mobile clients don\u0026#39;t support it (established in first pass) - Dual-auth flag: creates state machine complexity the team explicitly didn\u0026#39;t want ## Current position Halfway through updating middleware.ts. auth.ts refactor complete and tested. session.ts not yet modified. \u0026#34;\u0026#34;\u0026#34; After being compacted, the agent loads this file first. 
It restores the texture that compaction drops: the particular line number, the ruled-out approaches, and the current position. What’s in the scratchpad file is accurate because the agent wrote it explicitly, in human-readable form, while the details were still in context; what was in the compressed context was a lossy summary, written without knowledge of what would be relevant later.\nSilent failure is a monitoring design problem A paper from 2026 on LLM agent failure modes suggests that silent failure is a given: an inevitable consequence of the system\u0026rsquo;s operation, akin to entropy. It also suggests that failures are gradual and often begin long before the final output is reached. This makes monitoring non-trivial, because the events worth logging are not the final output but the intermediate results that point to an impending failure.\nThe paper\u0026rsquo;s conclusion is that the only way to avoid silent failure is to have monitoring in place that can detect these intermediate failures and intervene before the final output is reached.\nimport json from datetime import datetime class AgentObservabilityLogger: def __init__(self): self.log = [] def record_turn( self, turn: int, tool_calls: int, context_tokens: int, tool_result_tokens: int, reasoning_references_prior_facts: bool, confidence: float | None, ): self.log.append({ \u0026#34;turn\u0026#34;: turn, \u0026#34;ts\u0026#34;: datetime.utcnow().isoformat(), \u0026#34;context_tokens\u0026#34;: context_tokens, \u0026#34;tool_calls_this_turn\u0026#34;: tool_calls, \u0026#34;tool_result_tokens\u0026#34;: tool_result_tokens, \u0026#34;prior_fact_reference\u0026#34;: reasoning_references_prior_facts, \u0026#34;confidence\u0026#34;: confidence, }) def degradation_signal(self) -\u0026gt; str | None: if len(self.log) \u0026lt; 3: return None recent = self.log[-3:] # Flag if context grew \u0026gt;40% in 3 turns with no compaction token_growth = recent[-1][\u0026#34;context_tokens\u0026#34;] / recent[0][\u0026#34;context_tokens\u0026#34;] if 
token_growth \u0026gt; 1.4: return f\u0026#34;context grew {token_growth:.1f}x in 3 turns — approaching noise floor\u0026#34; # Flag if tool result tokens dominate context growth result_fraction = recent[-1][\u0026#34;tool_result_tokens\u0026#34;] / recent[-1][\u0026#34;context_tokens\u0026#34;] if result_fraction \u0026gt; 0.6: return f\u0026#34;tool results are {result_fraction:.0%} of context — trim or summarize\u0026#34; return None Default agent frameworks rarely expose these metrics. They require an instrumentation pass before you run the agent, because you can’t reconstruct them from a post-hoc log of the execution.\nConsider the turn budget. It’s easy to forget that every turn re-reads everything that has been said before, including the results of prior tool calls. A tool returning 600 tokens when it should return 20 not only wastes 580 tokens on the current turn; every future turn has to read and process them too. A PostToolUse hook that transforms verbose tool responses into a consistent summary is an investment that compounds:\n# Without normalization: 600 tokens enter context, re-read every subsequent turn raw_tool_result = { \u0026#34;customer_id\u0026#34;: \u0026#34;C-48291\u0026#34;, \u0026#34;name\u0026#34;: \u0026#34;Alice Chen\u0026#34;, \u0026#34;account_status\u0026#34;: \u0026#34;active\u0026#34;, \u0026#34;created_at\u0026#34;: \u0026#34;2019-03-14T09:22:11Z\u0026#34;, # ... 200 lines of order history, 80 lines of preferences, 150 lines of tickets } # With PostToolUse normalization: 20 tokens, same signal normalized = \u0026#34;customer_id: C-48291 | status: active | open_tickets: 2 | last_order: 2024-10-28\u0026#34; The 580-token difference per call compounds across every turn after it.\nEscalation over-gating is a reliability failure Standard advice for escalation design with human-in-the-loop systems is to escalate when unsure. 
The failure mode of that approach is not under-escalation but over-escalation, which quietly reduces reliability.\nPeople who review escalated items are likely to rubber-stamp them when the volume is high. If every request for approval says \u0026ldquo;I\u0026rsquo;m not sure about this\u0026rdquo;, and there are 40 such requests per hour, the approvals cannot be processed with the required care and attention, so they get rubber-stamped. The real safety check is not happening: humans are \u0026ldquo;in the loop\u0026rdquo; but not actually reviewing the requests.\nHow to resolve this issue? The solution is to treat escalation calibration as an engineering problem, not a UX problem.\ndef escalation_decision( action_category: str, # \u0026#34;read\u0026#34; | \u0026#34;write\u0026#34; | \u0026#34;financial\u0026#34; | \u0026#34;external_api\u0026#34; confidence: float, prior_escalation_rate: float, # fraction of recent actions escalated human_approval_latency_p50: float, # seconds ) -\u0026gt; str: # If \u0026gt;80% of recent actions are escalated and the median approval takes under 5 seconds, # reviewers are likely rubber-stamping: the escalation threshold is too low. if prior_escalation_rate \u0026gt; 0.8 and human_approval_latency_p50 \u0026lt; 5: return \u0026#34;threshold_too_low\u0026#34; risk_floor = {\u0026#34;read\u0026#34;: 0.3, \u0026#34;write\u0026#34;: 0.6, \u0026#34;financial\u0026#34;: 0.85, \u0026#34;external_api\u0026#34;: 0.7} threshold = risk_floor.get(action_category, 0.7) if confidence \u0026lt; threshold: return \u0026#34;escalate\u0026#34; return \u0026#34;proceed\u0026#34; The \u0026ldquo;prior_escalation_rate\u0026rdquo; and \u0026ldquo;human_approval_latency_p50\u0026rdquo; fields are not inherent properties of an action; they have to be logged and calculated from past escalations. 
If these fields are not measured, it becomes impossible to distinguish between a gate that actually provides meaningful oversight and a rubber-stamp queue.\nThe human oversight requirement in EU AI Act Article 14, that people can effectively oversee the system, is not met by a gate whose reviewers approve requests reflexively. Effective oversight requires that humans can actually evaluate each request escalated to them, which in turn requires escalation rates low enough to leave time for that evaluation.\nThe instrumentation checklist Before shipping a long-running agent, these are the events worth logging - not for debugging after failure, but for detecting degradation before it manifests:\nEvent What to log Why it matters Every turn context_tokens, tool_calls, tool_result_tokens Detects noise accumulation before attention degrades Compaction What was in context before, what\u0026rsquo;s in summary after Makes compaction fidelity loss visible State manifest write File path, token count, key facts written Verifies scratchpad covers what was lost Every escalation Action, confidence, human decision, time-to-decision Enables calibration math; detects rubber-stamping Tool errors Category (transient, validation, business, permission), retry outcome Distinguishes fixable from unfixable failures Session end Total turns, peak context, compaction events, escalation rate Session-level reliability summary None of this is particularly exotic. It all comes down to making the decision to instrument before you ship, which few of us do.\nWhat I\u0026rsquo;m still uncertain about is whether fidelity loss during compaction is topic-independent, or whether certain classes of information are more susceptible to loss than others (implicit constraints, ruled-out approaches, specific line numbers). 
If the latter is true, perhaps we can come up with a scratchpad template that captures exactly the topics the compaction misses. I don\u0026rsquo;t know the answer, but the closest thing I could find is in the git context controller paper, where they state that with git, you can track your branching context and merge agent state, which suggests that they have a reason to believe that naive merges are not compositional.\nThank you so much for reading this. If you are interested, please reach out to me on X/Twitter or Reddit.\n","permalink":"https://minh2416294.github.io/posts/your-agent-has-no-ground-truth/","summary":"Context degradation, silent failure, and miscalibrated escalation look like three separate problems. They\u0026rsquo;re all the same instrumentation gap.","title":"Auditing Agent Self-Truth"},{"content":"Most prompt engineering posts are structured as a technique list: here\u0026rsquo;s few-shot, here\u0026rsquo;s chain of thought, here\u0026rsquo;s structured output. The techniques are useful, but what gets left out is the failure mode that makes each one necessary.\nHere\u0026rsquo;s the same material from the other direction.\nFailure 1: the model interprets vague instructions differently each run You ask it to be conservative in the system prompt. It is conservative on Monday, liberal on Wednesday with the same input. You add \u0026ldquo;use your best judgement.\u0026rdquo; Nothing changes.\nThe model isn\u0026rsquo;t aware of what you mean by conservative. It has to make an assumption based on context, which is different every time. Essentially, you\u0026rsquo;re telling it to take a guess.\nThe solution is to stop using vague language. Operationalize your criteria. Instead of saying \u0026lsquo;conservative\u0026rsquo;, list out categories of things it should avoid. Instead of judgement calls, provide rules. 
Describe what you want; don\u0026rsquo;t tell it how to think about what you want.\n# Vague — model guesses what \u0026#34;important\u0026#34; means system = \u0026#34;Summarize this meeting and highlight the important parts.\u0026#34; # Explicit — model has rules, not adjectives system = \u0026#34;\u0026#34;\u0026#34;Extract from this meeting transcript: 1. ACTION ITEMS: Any task assigned to a named person with a deadline. Skip if no assignee is named. Skip if no deadline is mentioned. 2. DECISIONS: Mark as FINAL only if the transcript records a vote or explicit approval. Mark as PROPOSED if discussed but not resolved. 3. Skip the first 5 minutes of casual conversation before the agenda begins.\u0026#34;\u0026#34;\u0026#34; The same goes for severity. Tell it only that \u0026ldquo;critical\u0026rdquo; means \u0026ldquo;it\u0026rsquo;ll crash the system,\u0026rdquo; and the model will still guess. Show it what critical looks like:\nsystem = \u0026#34;\u0026#34;\u0026#34;Classify bugs by severity using these examples: CRITICAL — matches this pattern: query = f\u0026#34;SELECT * FROM users WHERE id = {user_input}\u0026#34; (unsanitized input in a database query) MINOR — matches this pattern: def getUserName(): # should be get_user_name per convention (naming convention violation, no functional impact)\u0026#34;\u0026#34;\u0026#34; When there are false positives – instances where the model raises alerts on acceptable code – the solution is to add examples of what is not a problem. Do not try to raise the confidence threshold; confidence scores are not calibrated, and the model can be 95% confident about a hallucination. Only explicit negative examples show the model what it should leave alone.\nFailure 2: the model produces valid-looking JSON that doesn\u0026rsquo;t parse Prompt-based JSON output fails 5-20% of the time in production. The model simply outputs malformed JSON due to missing brackets or extra commas. 
Not often enough to matter in unit tests, but frequently enough to cause significant issues downstream\nThe fix is to use tool_use with a JSON schema, which turns the unconstrained \u0026ldquo;the model tried its best\u0026rdquo; to the constrained \u0026ldquo;the API will parse this JSON\u0026rdquo;.\nextract_invoice_tool = { \u0026#34;name\u0026#34;: \u0026#34;extract_invoice\u0026#34;, \u0026#34;description\u0026#34;: \u0026#34;Extract structured data from an invoice document.\u0026#34;, \u0026#34;input_schema\u0026#34;: { \u0026#34;type\u0026#34;: \u0026#34;object\u0026#34;, \u0026#34;properties\u0026#34;: { \u0026#34;invoice_number\u0026#34;: {\u0026#34;type\u0026#34;: \u0026#34;string\u0026#34;}, \u0026#34;vendor_name\u0026#34;: {\u0026#34;type\u0026#34;: \u0026#34;string\u0026#34;}, \u0026#34;total_amount\u0026#34;: {\u0026#34;type\u0026#34;: \u0026#34;number\u0026#34;}, \u0026#34;payment_terms\u0026#34;: {\u0026#34;type\u0026#34;: [\u0026#34;string\u0026#34;, \u0026#34;null\u0026#34;]}, # nullable: may not exist \u0026#34;line_items\u0026#34;: { \u0026#34;type\u0026#34;: \u0026#34;array\u0026#34;, \u0026#34;items\u0026#34;: { \u0026#34;type\u0026#34;: \u0026#34;object\u0026#34;, \u0026#34;properties\u0026#34;: { \u0026#34;description\u0026#34;: {\u0026#34;type\u0026#34;: \u0026#34;string\u0026#34;}, \u0026#34;amount\u0026#34;: {\u0026#34;type\u0026#34;: \u0026#34;number\u0026#34;} }, \u0026#34;required\u0026#34;: [\u0026#34;description\u0026#34;, \u0026#34;amount\u0026#34;] } }, \u0026#34;document_type\u0026#34;: { \u0026#34;type\u0026#34;: \u0026#34;string\u0026#34;, \u0026#34;enum\u0026#34;: [\u0026#34;invoice\u0026#34;, \u0026#34;receipt\u0026#34;, \u0026#34;credit_note\u0026#34;, \u0026#34;unclear\u0026#34;, \u0026#34;other\u0026#34;] } }, \u0026#34;required\u0026#34;: [\u0026#34;invoice_number\u0026#34;, \u0026#34;vendor_name\u0026#34;, \u0026#34;total_amount\u0026#34;, \u0026#34;document_type\u0026#34;] } } response = client.messages.create( 
model=\u0026#34;claude-sonnet-4-6\u0026#34;, max_tokens=1024, tool_choice={\u0026#34;type\u0026#34;: \u0026#34;tool\u0026#34;, \u0026#34;name\u0026#34;: \u0026#34;extract_invoice\u0026#34;}, tools=[extract_invoice_tool], messages=[{\u0026#34;role\u0026#34;: \u0026#34;user\u0026#34;, \u0026#34;content\u0026#34;: invoice_text}] ) Two design decisions that are important for this schema are:\nOptional/nullable fields prevent fabrication. If the model believes the payment_terms are required and not present, it must invent some. By making the field nullable, the model is able to return null when there is no value.\n\u0026quot;unclear\u0026quot; and \u0026quot;other\u0026quot; in enums prevent forced classification. By not including them, you\u0026rsquo;re forcing the model to pick the closest possible category when it may not actually be the right one. Giving it an \u0026ldquo;unclear\u0026rdquo; option allows it to state that, while an \u0026ldquo;other\u0026rdquo; option along with a freeform field allows catching things that your categories don\u0026rsquo;t explicitly have.\nWhat tool_use is not addressing: semantic errors. The schema is only a structure check. The model could pluck the wrong number out of a field, or transpose two lines, and the schema would be satisfied. That needs the next fix.\nFailure 3: the output is structurally valid but semantically wrong The math doesn\u0026rsquo;t add up. The dates are out of order. The value from page 3 ends up in the page 1 data field. Your JSON parsed fine, but the data is wrong.\nThese errors are challenging to catch at the prompt stage. 
They\u0026rsquo;re also simple to resolve if you can detect them and return to the model the specific error message along with the original document and the failed extraction:\ndef validate_extraction(extracted: dict, raw_document: str) -\u0026gt; str | None: \u0026#34;\u0026#34;\u0026#34;Returns an error description if validation fails, None if valid.\u0026#34;\u0026#34;\u0026#34; if \u0026#34;line_items\u0026#34; in extracted and \u0026#34;total_amount\u0026#34; in extracted: calculated = sum(item[\u0026#34;amount\u0026#34;] for item in extracted[\u0026#34;line_items\u0026#34;]) stated = extracted[\u0026#34;total_amount\u0026#34;] if abs(calculated - stated) \u0026gt; 0.01: return ( f\u0026#34;Line items sum to {calculated:.2f} but stated total is {stated:.2f}. \u0026#34; f\u0026#34;Either a line item is missing or the total is incorrect.\u0026#34; ) return None def extract_with_retry(document: str, max_retries: int = 2) -\u0026gt; dict: result = extract_invoice(document) for attempt in range(max_retries): error = validate_extraction(result, document) if error is None: return result # Feed three things back: original document, failed extraction, specific error retry_response = client.messages.create( model=\u0026#34;claude-sonnet-4-6\u0026#34;, max_tokens=1024, tool_choice={\u0026#34;type\u0026#34;: \u0026#34;tool\u0026#34;, \u0026#34;name\u0026#34;: \u0026#34;extract_invoice\u0026#34;}, tools=[extract_invoice_tool], messages=[{ \u0026#34;role\u0026#34;: \u0026#34;user\u0026#34;, \u0026#34;content\u0026#34;: ( f\u0026#34;Original document:\\n{document}\\n\\n\u0026#34; f\u0026#34;Your previous extraction:\\n{json.dumps(result, indent=2)}\\n\\n\u0026#34; f\u0026#34;Validation error: {error}\\n\\n\u0026#34; f\u0026#34;Please re-extract, correcting this specific error.\u0026#34; ) }] ) result = extract_tool_result(retry_response) return result All three parts of the retry message are load-bearing. 
\u0026ldquo;That was wrong, try again\u0026rdquo; produces the same wrong answer, because the model has no idea what to fix. Only a message that describes the error tells the model what to change.\nWhat retries are not suitable for is guessing information that is not in the document. If the invoice does not have a purchase order number, no amount of retries will make one appear. For each field, decide beforehand what to do if the information is not present. Return null if possible or send the flagged result to a human for review. Spend fewer tokens by not retrying when you already know that nothing will change.\nA schema that surfaces likely errors as part of the extraction is also good practice. One pattern that helps is asking the model to report both the values it calculated and the values found in the text, then comparing them:\n{ \u0026#34;line_items_sum\u0026#34;: {\u0026#34;type\u0026#34;: \u0026#34;number\u0026#34;, \u0026#34;description\u0026#34;: \u0026#34;Sum of all line item amounts as calculated by you\u0026#34;}, \u0026#34;stated_total\u0026#34;: {\u0026#34;type\u0026#34;: \u0026#34;number\u0026#34;, \u0026#34;description\u0026#34;: \u0026#34;Total amount as written in the document\u0026#34;}, \u0026#34;totals_match\u0026#34;: {\u0026#34;type\u0026#34;: \u0026#34;boolean\u0026#34;}, \u0026#34;conflict_detected\u0026#34;: {\u0026#34;type\u0026#34;: \u0026#34;boolean\u0026#34;, \u0026#34;description\u0026#34;: \u0026#34;True if document contains contradictory information\u0026#34;} } A discrepancy between line_items_sum and stated_total is a concrete signal you can check in code, not a vague self-assessment. It\u0026rsquo;s more reliable than asking the model whether it\u0026rsquo;s confident or not.\nFailure 4: the model misses things in large reviews You send a 14-file PR for review to Claude. It handles the first three accurately. By the 9th file, it gives shallow findings. By the 12th, it overlooks an obvious injection vulnerability. 
It only manages a two-liner on the 14th.\nThis has little to do with the model\u0026rsquo;s raw capability; it comes from dilution of attention, the widely observed tendency of transformer models to perform worse on information that is not at the extremes of the input. Performance on information \u0026ldquo;in the middle\u0026rdquo; is significantly worse than on information at the beginning or the end of the input, even if the information in the middle is more relevant. Bigger context windows do not by themselves fix this; it appears to be a limitation of attention-based architectures.\nThe solution is to review each file independently and in parallel, then do a single cross-file pass over the per-file findings.\nimport asyncio async def review_pr(files: list[dict]) -\u0026gt; dict: # Assumes `client` is an anthropic.AsyncAnthropic instance, so create() returns an awaitable # Pass 1: independent review of each file — full attention on one file at a time per_file_tasks = [ client.messages.create( model=\u0026#34;claude-sonnet-4-6\u0026#34;, max_tokens=1024, messages=[{ \u0026#34;role\u0026#34;: \u0026#34;user\u0026#34;, \u0026#34;content\u0026#34;: f\u0026#34;Review this file for bugs, security issues, and logic errors:\\n\\n{f[\u0026#39;content\u0026#39;]}\u0026#34; }] ) for f in files ] per_file_results = await asyncio.gather(*per_file_tasks) # Pass 2: cross-file integration — one pass that sees only the summaries integration_review = await client.messages.create( model=\u0026#34;claude-sonnet-4-6\u0026#34;, max_tokens=1024, messages=[{ \u0026#34;role\u0026#34;: \u0026#34;user\u0026#34;, \u0026#34;content\u0026#34;: ( \u0026#34;Given these per-file findings, identify cross-file issues:\\n\u0026#34; \u0026#34;- Data flow inconsistencies between modules\\n\u0026#34; \u0026#34;- Contradictory patterns across files\\n\u0026#34; \u0026#34;- API contract violations across service boundaries\\n\\n\u0026#34; f\u0026#34;Findings:\\n{format_findings(per_file_results)}\u0026#34; ) }] ) return {\u0026#34;per_file\u0026#34;: per_file_results, 
\u0026#34;integration\u0026#34;: integration_review} The parallel per-file calls are also where independent instances help. A model that reviews code it produced in the same session retains the reasoning behind that code and is therefore less likely to question its own decisions. A separate invocation judges the code without this possible bias.\nFor findings where the model is unsure, use calibrated confidence thresholds rather than taking the model’s self-reported confidence at face value. Self-reported confidence is most likely not calibrated and tends to run high. You can calibrate it by running examples with known answers through the model and observing which confidence levels actually correspond to correct results.\n{ \u0026#34;finding\u0026#34;: \u0026#34;Potential race condition in order processing\u0026#34;, \u0026#34;severity\u0026#34;: \u0026#34;major\u0026#34;, \u0026#34;confidence\u0026#34;: 0.65, \u0026#34;reasoning\u0026#34;: \u0026#34;Lock acquisition looks correct but unlock timing depends on an async callback whose ordering I cannot fully verify.\u0026#34;, \u0026#34;route\u0026#34;: \u0026#34;human_review\u0026#34; } Failure 5: nightly eval runs cost too much to run at the right frequency A thousand-document evaluation suite run synchronously costs full price and blocks your pipeline for hours. Teams either run evals less frequently than they should or spend more than necessary.\nThe fix is the Message Batches API: 50% cost reduction with up to 24h processing time. 
The right workloads are latency-tolerant: reports, weekly audits, eval suites fit for asynchronous processing.\nimport anthropic import json client = anthropic.Anthropic() def submit_extraction_batch(documents: list[dict]) -\u0026gt; str: requests = [ { \u0026#34;custom_id\u0026#34;: doc[\u0026#34;id\u0026#34;], # used to match results to inputs \u0026#34;params\u0026#34;: { \u0026#34;model\u0026#34;: \u0026#34;claude-sonnet-4-6\u0026#34;, \u0026#34;max_tokens\u0026#34;: 1024, \u0026#34;tool_choice\u0026#34;: {\u0026#34;type\u0026#34;: \u0026#34;tool\u0026#34;, \u0026#34;name\u0026#34;: \u0026#34;extract_invoice\u0026#34;}, \u0026#34;tools\u0026#34;: [extract_invoice_tool], \u0026#34;messages\u0026#34;: [{\u0026#34;role\u0026#34;: \u0026#34;user\u0026#34;, \u0026#34;content\u0026#34;: doc[\u0026#34;text\u0026#34;]}] } } for doc in documents ] batch = client.beta.messages.batches.create(requests=requests) return batch.id def retrieve_batch_results(batch_id: str) -\u0026gt; dict: results = {} for result in client.beta.messages.batches.results(batch_id): if result.result.type == \u0026#34;succeeded\u0026#34;: results[result.custom_id] = extract_tool_result(result.result.message) else: results[result.custom_id] = {\u0026#34;error\u0026#34;: result.result.error} return results The custom_id field is what you use to associate results with inputs, but note that you can\u0026rsquo;t rely on results coming back in the same order as you sent them.\nTwo hard constraints to consider when choosing between batch vs streaming APIs: absence of streaming (results come back as a whole set, not incrementally), and absence of mid-request tool calling (no possibility to define tools externally, get results, and proceed to the next turn). 
So for any agentic-style loop of \u0026ldquo;tool call -\u0026gt; observe -\u0026gt; next step,\u0026rdquo; you\u0026rsquo;d want to use the synchronous API.\nOn scheduling: a batch can take up to 24 hours to complete, with no guarantee that it finishes sooner, so if you have a 30-hour deadline for processing, submit the batch no later than 24 hours before the deadline, and ideally at the start of the window. Submitting at the start leaves you a 6-hour buffer for collection, validation, and other operations.\nCoupling with prompt caching:\nIf you have a large system prompt or tool schema for extraction, prompt caching on top of batch pricing can push savings beyond the advertised 50% batch discount. Design your prompts so that the invariant part (large system message/tool schema) is cached and billed at the lower cached rate, and only the varying part (document contents) is charged at the regular per-token rate.\nThe common theme in all five fixes is repair rather than optimization: each of these techniques was introduced as a minimal viable response to a failure mode rather than an enhancement to an existing model. Explicit criteria fix vague instructions, structured schemas fix syntactic chaos, retry-with-feedback fixes semantic slips, multi-pass architecture fixes inattention, and the batch API fixes evaluation budget waste. There is nothing wrong with any of these approaches per se, but none would be necessary in a well-designed system.\nWhat I\u0026rsquo;m curious about is how much better retry-with-error-feedback is than blind retries with temperature adjustment. There was a paper on arXiv in 2025 that demonstrated blind retries to be competitive on purely syntactic errors like formatting. My hypothesis is that error feedback helps more when the model makes systematic mistakes (wrong field name, incorrect math operation) rather than superficial slips (badly formatted JSON), but I need to benchmark this hypothesis against my own extraction pipelines.\nThank you so much for reading this. 
If you are interested, please reach out to me on X/Twitter or Reddit.\n","permalink":"https://minh2416294.github.io/posts/llm-extraction-failure-modes/","summary":"Every prompt engineering technique exists because a specific failure mode forced it. Here\u0026rsquo;s the failure taxonomy, not the technique list.","title":"Fixing Common Failure Modes in LLM Extraction"},{"content":"Six months after setting up Claude Code, its configuration is already inconsistent. The CLAUDE.md is 400 lines long and few people know which parts are up to date. The skills directory has 3 commands that no one uses. The path-scoped rules refer to file patterns that were reorganized in Q1. And the CI job runs claude -p and sometimes finds the same issue as last week, or none at all.\nThe problem isn\u0026rsquo;t any individual layer. It\u0026rsquo;s that each layer was set up in isolation, and nothing enforces that they stay coherent.\nHere\u0026rsquo;s how the layers fit together and how to keep them from rotting.\nThe four-layer stack Each layer has different capabilities, performance characteristics, and failure modes. Putting something in the wrong layer leads to subtle, hard-to-track-down failures.\nLayer 1: CLAUDE.md — always-loaded, universal standards Layer 2: Path-scoped rules — loaded only when editing matching files Layer 3: Skills — on-demand, invoked by model or developer Layer 4: Headless CI mode — non-interactive, machine-readable output Layer 1 - CLAUDE.md is loaded for all sessions, regardless of the current task. This is where universal conventions are established. It defines how to name files, handle errors, write tests, and which architectural decisions have been made. 
It doesn’t contain instructions for specific tasks (that’s what skills are for) or conventions for specific file types (that’s the job of the path-scoped layer).\nThe three-level hierarchy matters for team environments:\n~/.claude/CLAUDE.md — user-level, personal, not version-controlled, not shared via git .claude/CLAUDE.md or root CLAUDE.md — project-level, shared with everyone who clones the repo Subdirectory CLAUDE.md — directory-level, overrides project-level in that subtree only The failure mode that I\u0026rsquo;ve encountered most frequently with teams is when a new developer clones the repository and finds that Claude Code doesn\u0026rsquo;t work as expected. The issue is usually that the instructions are in ~/.claude/CLAUDE.md rather than .claude/CLAUDE.md. The user-level configuration is not tracked in any repository, so a new developer\u0026rsquo;s setup never receives it. Move team-wide standards into the project-level file.\nWhen CLAUDE.md grows unwieldy, split it with @import:\n# .claude/CLAUDE.md @.claude/rules/testing.md @.claude/rules/api-conventions.md @.claude/rules/deployment.md Each imported file is a single source of truth: update it once, and the change is reflected everywhere it is imported. Without @import you are either stuck with one 400-line file or have repeated rules across files in different directories - both of which are worse.\nLayer 2 — Path-scoped rules in .claude/rules/ are loaded conditionally based on which files you\u0026rsquo;re editing. 
This is the right layer for conventions that apply to a file type spread across many directories.\n--- paths: [\u0026#34;**/*.test.tsx\u0026#34;, \u0026#34;**/*.spec.ts\u0026#34;] --- # Test conventions - Use the factory pattern from test/factories/ for all test data creation - Integration tests connect via test/setup/db.ts — never mock the database - Test public API contracts, not private implementation details - Coverage target: 80% branch coverage for new code The glob pattern **/*.test.tsx matches every test file in the codebase regardless of directory depth. The alternative — a directory-level CLAUDE.md in every directory containing tests — means 50+ copies of the same rules, guaranteed to drift.\nThe critical distinction from root CLAUDE.md: path rules load only when you\u0026rsquo;re editing a matching file. Terraform conventions don\u0026rsquo;t consume tokens when you\u0026rsquo;re editing React components. Root CLAUDE.md loads every session regardless.\nTo verify a rule is actually loading, run /memory in Claude Code. It shows which configuration files are active in the current session. If a rule isn\u0026rsquo;t listed, the glob didn\u0026rsquo;t match — check the pattern. /memory is a diagnostic command; it doesn\u0026rsquo;t trigger loading or refresh stale config.\nLayer 3 - Skills located at .claude/skills/ define on-demand workflows. The description of each skill is always present and visible to Claude, but the body of the skill is only loaded when it is explicitly invoked via /skill-name or implicitly when Claude judges that the request matches the skill\u0026rsquo;s description.\n--- description: Run a security review on the current diff. Use when asked to review, check security, or audit changes. 
context: fork allowed-tools: Read, Grep, Glob --- Review the current git diff for: - OWASP Top 10 vulnerabilities - Hardcoded secrets or credentials - Input validation gaps at system boundaries - SQL injection and XSS vectors Report only confirmed findings with file path, line number, and remediation. context: fork: The skill gets executed in an isolated sub-agent. Whatever verbose results it writes, are hidden away in the subagent. Only the summary comes back in the context window. That way, analysis or brainstorming skills that otherwise would clutter the context with intermediate results, can be used without polluting the context.\nallowed-tools: Read, Grep, Glob: A read only analysis skill that has Write or Bash access is a potential attack vector. It could potentially write arbitrary files, if the model wanted to. Be conservative and only grant permissions the skill actually needs.\nThe skills-vs-CLAUDE.md confusion is common:\nPut in skills Put in CLAUDE.md Task-specific workflows (/review, /deploy-check) Always-on standards (naming, error handling) Steps that only matter for one kind of task Rules that apply to every session Verbose procedures that would bloat always-loaded context Short, universal reference material Layer 4 — Headless CI mode runs Claude Code non-interactively via the -p flag:\n# Hangs in CI — waits for keyboard input that never arrives claude \u0026#34;Review this PR for security issues\u0026#34; # Correct — processes prompt, outputs to stdout, exits claude -p \u0026#34;Review this PR for security issues\u0026#34; For machine-parseable output:\nclaude -p \\ --output-format json \\ \u0026#34;Review this PR. Previous findings: ${PREVIOUS_FINDINGS} Report ONLY: 1. New issues not in previous findings 2. Previous findings still present Do NOT re-report addressed issues.\u0026#34; Not including PREVIOUS_FINDINGS is not an option if you\u0026rsquo;re running on every push. 
The same issue gets reported on every push until developers disable commenting entirely because every notification is noise. The dedupe logic belongs in the prompt.\nOne subtle edge case to be aware of: never use the same Claude session to review code that it authored. Claude builds internal context about why it wrote the code the way it did, so a review in the same thread is not independent. Use a separate invocation for code reviews:\n# Session A: generate claude -p \u0026#34;Implement the authentication middleware\u0026#34; # Session B: review — independent, no shared context claude -p \u0026#34;Review the authentication middleware for security issues and edge cases\u0026#34; Plan mode: the decision is about ambiguity, not difficulty The common framing is \u0026ldquo;use plan mode for difficult tasks, direct execution for easy ones\u0026rdquo;. The better framing is to use plan mode for tasks where there is more than one valid approach, and the choice between them has downstream implications for other files.\nA straightforward but involved bugfix that touches one function and has a clear stack trace is direct execution. A seemingly simple feature request that could be implemented in three fundamentally different ways and touches on many subsystems is plan mode.\nPlan mode does not merely ask you to think before acting; it removes the ability to act until you switch back to direct execution. Edit, Write, and Bash (the three broad classes of file modifications) are not available in plan mode.\nThe hybrid pattern for large changes:\nPlan phase: explore the codebase with the Explore subagent, evaluate approaches, design the strategy. The Explore subagent isolates verbose discovery output from the main context window. Execute phase: switch to direct execution with the strategy decided. File-by-file implementation with no re-planning needed. 
The failure mode is starting in direct execution and switching to plan mode only when complexity emerges. When the task description already states the complexity (\u0026ldquo;restructure the authentication module to support OAuth\u0026rdquo;), plan mode should be chosen immediately, not after the first surprise.\nWhat drifts and how to catch it The stack decays in predictable ways.\nCLAUDE.md grows too big: rules get added for things nobody is sure are needed. The file reaches 400 lines and the model forgets about the rules in the middle. The solution here is to review the file every time the team makes a major architecture change, and to treat CLAUDE.md as a dependency that needs to be updated whenever the code it describes changes.\nThe path rules point to patterns that no longer exist. The rule\u0026rsquo;s glob still targets src/api/, but the API was moved to services/ six months earlier. The rule no longer matches anything, which makes it ineffective. The fix is to run /memory after every move to check whether the path rules still load for the files they used to cover. If a rule no longer matches anything, fix its pattern or remove it.\nSkills accrue dead commands. The team adds a skill to do something once and then never removes it. The skills/ directory bloats, and Claude’s ability to choose among the skills degrades. The solution is to review skills quarterly, removing any that have not been used in three months and do not seem to be needed.\nCI review output becomes noise. Nobody looks at the results because the same problems are always there, or because there are too many false positives to pay attention to real issues. The way to address this is to include previous results in the prompt when launching the CI review job, structure the output to prioritize findings by severity level, and establish a policy that dictates which merges are allowed based on review results. 
Having no policy about CI review results basically allows anyone to merge anything.\nThe configuration stack belongs in code review. Changes to .claude/CLAUDE.md, .claude/rules/, .claude/skills/, and CI workflow YAML should go through the same PR process as application code. If configuration changes aren\u0026rsquo;t reviewed, they\u0026rsquo;re not maintained (they\u0026rsquo;re just accumulated).\nThank you so much for reading this. If you are interested, please reach out to me at my X/Twitter or Reddit.\n","permalink":"https://minh2416294.github.io/posts/the-claude-code-config-stack/","summary":"CLAUDE.md, path rules, skills, hooks, and headless CI each have their own post. Nobody writes about how they compose or what happens when the stack drifts.","title":"Exploring Claude Code Configuration Stack"},{"content":"The agent used the wrong tool, and the standard response is to blame a vague description and tell the team to improve the tool descriptions. This advice may be correct, but it does not say what exactly should be improved: all you know is that the wrong tool was chosen.\nFive types of interface design failure can lead to this mistake, and the right correction depends on which one you have. Without a proper diagnosis, the problem cannot be fixed reliably.\nThe five failure patterns 1. Missing selection scope. The description indicates what the tool does but not when it should be used over a similar one in the list. 
The model has to guess the scope of selection based on the context but it fails to do so.\n# Causes misrouting between get_customer and lookup_order \u0026#34;get_customer\u0026#34;: \u0026#34;Retrieves customer information\u0026#34; \u0026#34;lookup_order\u0026#34;: \u0026#34;Retrieves order information\u0026#34; # Fixed — explicit boundary in both descriptions \u0026#34;get_customer\u0026#34;: ( \u0026#34;Looks up a customer account by email, phone, or customer ID. \u0026#34; \u0026#34;Returns profile: name, contact details, account status, loyalty tier. \u0026#34; \u0026#34;Use this to verify WHO the customer is. \u0026#34; \u0026#34;Do NOT use for order-specific queries — use lookup_order for those.\u0026#34; ) \u0026#34;lookup_order\u0026#34;: ( \u0026#34;Looks up a specific order by order ID or tracking number. \u0026#34; \u0026#34;Returns order status, line items, shipping details, and history. \u0026#34; \u0026#34;Use this when the query is about an ORDER, not the customer account. \u0026#34; \u0026#34;Do NOT use this to verify customer identity — use get_customer first.\u0026#34; ) The fix is mutual: each description must reference the other and state when NOT to use it. One-sided clarification doesn\u0026rsquo;t work because the model compares descriptions simultaneously when choosing.\n2. Undescribed output. The text describes only the inputs to the tools, not the outputs. As a result, the model can\u0026rsquo;t determine whether a tool will provide the needed information for the next step and, therefore, may select an irrelevant tool or call it anyway.\n# Model doesn\u0026#39;t know what it gets back \u0026#34;search_orders\u0026#34;: \u0026#34;Searches orders by customer\u0026#34; # Model knows exactly what to expect \u0026#34;search_orders\u0026#34;: ( \u0026#34;Searches all orders for a customer by customer ID. \u0026#34; \u0026#34;Returns: list of order IDs, statuses, total amounts, and creation dates. 
\u0026#34; \u0026#34;Does NOT return line items or shipping details — use get_order_detail for those.\u0026#34; ) Output descriptions are load-bearing for sequential tool calls. If the model doesn\u0026rsquo;t know what step 1 returns, it can\u0026rsquo;t plan step 2.\n3. Unconstrained parameters. Optional fields with no guidance on when to use them make every call slightly unpredictable. The model fills optional fields by inference, which is inconsistent.\n# Model guesses when to use `include_history` \u0026#34;get_customer\u0026#34;: { \u0026#34;properties\u0026#34;: { \u0026#34;customer_id\u0026#34;: {\u0026#34;type\u0026#34;: \u0026#34;string\u0026#34;}, \u0026#34;include_history\u0026#34;: {\u0026#34;type\u0026#34;: \u0026#34;boolean\u0026#34;} # when? why? } } # Model knows exactly when to set it \u0026#34;get_customer\u0026#34;: { \u0026#34;properties\u0026#34;: { \u0026#34;customer_id\u0026#34;: { \u0026#34;type\u0026#34;: \u0026#34;string\u0026#34;, \u0026#34;description\u0026#34;: \u0026#34;Customer ID in format C-NNNNN (e.g. C-48291)\u0026#34; }, \u0026#34;include_history\u0026#34;: { \u0026#34;type\u0026#34;: \u0026#34;boolean\u0026#34;, \u0026#34;description\u0026#34;: ( \u0026#34;Set true only when the user specifically asks about \u0026#34; \u0026#34;past orders or account history. Defaults to false. \u0026#34; \u0026#34;Setting true increases response size significantly.\u0026#34; ) } } } 4. Cross-tool dependency leakage. The description of tool A suggests that it can perform a task which actually requires invoking tool B first. The model calls A expecting a result that A cannot produce on its own.\n# Implies identity verification is part of the refund call \u0026#34;process_refund\u0026#34;: \u0026#34;Processes a refund for a verified customer\u0026#34; # States the dependency explicitly \u0026#34;process_refund\u0026#34;: ( \u0026#34;Processes a refund for a specific order. 
\u0026#34; \u0026#34;REQUIRES: get_customer must have been called first in this session \u0026#34; \u0026#34;to verify customer identity. Will fail with a permission error if called \u0026#34; \u0026#34;without prior identity verification.\u0026#34; ) 5. System prompt conflicts. A keyword in the system prompt creates an unintended tool association that overrides a well-written description. This one is invisible until you look for it.\nIf the system prompt says something like \u0026ldquo;always check customer details before processing a request\u0026rdquo;, then the word \u0026ldquo;customer\u0026rdquo; becomes a signal for the model to use the get_customer tool. It can happen that you add a new customer-related tool, but the model ignores it and always uses get_customer instead. After updating the tool descriptions, make sure to check the system prompt for any such \u0026ldquo;override\u0026rdquo; terms\nDiagnosing which failure you have When a misroute has occurred in production, the debugging proceeds in the following order:\nExamine the full prompt as it was inputted into the model, including the system prompt, the tool descriptions, and the conversation history. Identify the two tools that the model had to choose between. As the model, try to understand which of the two tools to pick based on their description alone, without the context of the user question. If not possible, you are dealing with failure pattern 1 (selection scope not specified). Otherwise, proceed to determine whether the model would have needed the output of the selected tool to proceed to the next step. If the tool description did not specify this, you are dealing with pattern 2. Check the system prompt for keyword overlap with tool names. If found, pattern 5. If none of the above, examine the optional parameters that were passed to the tool, if any. You are likely dealing with pattern 3. 
Otherwise, if the model used the correct tool but with incorrect assumptions about what it would return, it is probably pattern 4. This process takes about ten minutes. It is faster than adding few-shot examples to the prompt, which only addresses the symptom, not the cause. It is also much faster than building a separate routing classifier, which would add an unnecessary layer of complexity to the system.\nError responses are a contract, not a fallback The second most common failure mode once you\u0026rsquo;ve routed to the right tool is that the agent recovers incorrectly from an error that happened during tool use, causing it to make the wrong decision. The problem is that the error gave the agent no information to work with.\nThere are 4 error categories, and each one implies a different way of recovering from the error. The category needs to be in the response, not as a comment for the developer to read, but as data the agent can read at runtime.\ndef make_error(category: str, message: str, description: str, retryable: bool) -\u0026gt; dict: return { \u0026#34;isError\u0026#34;: True, \u0026#34;content\u0026#34;: [{\u0026#34;type\u0026#34;: \u0026#34;text\u0026#34;, \u0026#34;text\u0026#34;: message}], \u0026#34;errorCategory\u0026#34;: category, # \u0026#34;transient\u0026#34; | \u0026#34;validation\u0026#34; | \u0026#34;business\u0026#34; | \u0026#34;permission\u0026#34; \u0026#34;isRetryable\u0026#34;: retryable, \u0026#34;description\u0026#34;: description, } # Transient — retry after delay make_error( \u0026#34;transient\u0026#34;, \u0026#34;Order database temporarily unavailable\u0026#34;, \u0026#34;High load on order service. Request is valid — retry in 2-3 seconds.\u0026#34;, retryable=True, ) # Validation — fix the input and retry make_error( \u0026#34;validation\u0026#34;, \u0026#34;Invalid order ID format\u0026#34;, \u0026#34;Order ID must be in format #NNNNN (e.g. #12345). 
Received: \u0026#39;order-abc\u0026#39;.\u0026#34;, retryable=True, ) # Business — do NOT retry, take an alternative path make_error( \u0026#34;business\u0026#34;, \u0026#34;Refund exceeds automatic limit\u0026#34;, \u0026#34;Refund of £750 exceeds the £500 automatic limit. Escalate to a human agent.\u0026#34;, retryable=False, ) # Permission — escalate or use different credentials make_error( \u0026#34;permission\u0026#34;, \u0026#34;Access denied to financial records\u0026#34;, \u0026#34;Current service account lacks financial access. Escalate to a senior agent.\u0026#34;, retryable=False, ) The isRetryable field is where the implementation usually branches. Transient and validation errors are retryable: a transient error is likely to succeed when the same request is retried after a delay, and a validation error once the input is fixed. By contrast, business and permission errors are not retryable, because the same request is doomed to fail again. The agent has to choose a different branch of execution based on this information.\nThe failure mode that most often causes confusion is an empty result from a successful query vs. a query that never reached the data. 
Without some additional conventions, these two cases are not different from each other.\n# Valid empty result — NOT an error, agent should stop searching { \u0026#34;isError\u0026#34;: False, \u0026#34;content\u0026#34;: [{\u0026#34;type\u0026#34;: \u0026#34;text\u0026#34;, \u0026#34;text\u0026#34;: \u0026#34;No orders found for customer C-48291 in the last 90 days.\u0026#34;}], \u0026#34;resultCount\u0026#34;: 0 } # Access failure — IS an error, agent should decide whether to retry { \u0026#34;isError\u0026#34;: True, \u0026#34;content\u0026#34;: [{\u0026#34;type\u0026#34;: \u0026#34;text\u0026#34;, \u0026#34;text\u0026#34;: \u0026#34;Could not reach order database\u0026#34;}], \u0026#34;errorCategory\u0026#34;: \u0026#34;transient\u0026#34;, \u0026#34;isRetryable\u0026#34;: True, \u0026#34;description\u0026#34;: \u0026#34;Connection timed out after 5s. Query did not execute.\u0026#34; } If the tool returns an empty list upon a failed access attempt, the agent assumes there are no orders and proceeds. The coordinator then receives an incomplete set of results with no indication of the issue, resulting in silent downstream failures that are harder to detect than outright errors.\nTool scoping: the count matters before the schemas do Before optimizing the descriptions for the tools, adjust their quantity: an agent with 18 tools to choose from has worse selection reliability than one with fewer, even if the descriptions are perfectly accurate, since the model must spend more time thinking about how to choose rather than doing the task.\nAim for 4-5 tools per agent, narrowed down to what makes sense for that agent\u0026rsquo;s specialization. 
For a multi-agent research system, this might look like:\nAgent Tools Web Search search_web, fetch_page, extract_links, save_snippet Document Analysis extract_metadata, extract_data_points, summarize_content, verify_claim Synthesis compile_report, verify_fact, format_citation, assess_coverage Coordinator Agent (spawn subagents), review_output, request_revision Each agent is provided with exactly the capabilities that it needs to fulfill its designated role. The coordinator agent has no access to the search_web tool. The web search agent has no access to the compile_report tool. This is good design, but more importantly, this is a prerequisite for per-agent description optimization. You cannot define selection-scope boundaries cleanly if the same tool is used in six different ways by six different agents.\nWhen a subagent sometimes needs a capability controlled by another role, we can grant the capability in a limited way to the subagent, using a scoped cross-role tool (see fig. 2). A scoped cross-role tool is a cross-role tool that grants capability only to one particular subagent, and only in certain contexts (such as when invoked by the subagent), instead of being generally available to whoever chooses to invoke it.\n# Generic tool — enables misuse, unclear purpose {\u0026#34;name\u0026#34;: \u0026#34;fetch_url\u0026#34;, \u0026#34;description\u0026#34;: \u0026#34;Fetches any URL and returns content\u0026#34;} # Scoped alternative — constrained to legitimate use, purpose is unambiguous { \u0026#34;name\u0026#34;: \u0026#34;load_document\u0026#34;, \u0026#34;description\u0026#34;: ( \u0026#34;Fetches a document from an approved internal URL and returns its text content. \u0026#34; \u0026#34;Only accepts URLs from docs.company.com and wiki.company.com. \u0026#34; \u0026#34;Use this to load source documents for analysis. 
\u0026#34; \u0026#34;Do NOT use for external URLs — use search_web for external content.\u0026#34; ) } MCP configuration: scope at the server boundary first MCP tool scoping takes place at two levels. Most teams configure scoping at the tool level, that is, which agents get access to which tools. Less obvious but much more impactful is the configuration at the server level: which MCP servers are enabled for a given context.\nProject-level configuration goes in the .mcp.json file at a repo\u0026rsquo;s root, under version control and shared by all developers, and looks approximately like this:\n{ \u0026#34;mcpServers\u0026#34;: { \u0026#34;github\u0026#34;: { \u0026#34;command\u0026#34;: \u0026#34;npx\u0026#34;, \u0026#34;args\u0026#34;: [\u0026#34;-y\u0026#34;, \u0026#34;@modelcontextprotocol/server-github\u0026#34;], \u0026#34;env\u0026#34;: {\u0026#34;GITHUB_TOKEN\u0026#34;: \u0026#34;${GITHUB_TOKEN}\u0026#34;} }, \u0026#34;jira\u0026#34;: { \u0026#34;command\u0026#34;: \u0026#34;npx\u0026#34;, \u0026#34;args\u0026#34;: [\u0026#34;-y\u0026#34;, \u0026#34;@community/mcp-server-jira\u0026#34;], \u0026#34;env\u0026#34;: { \u0026#34;JIRA_URL\u0026#34;: \u0026#34;${JIRA_URL}\u0026#34;, \u0026#34;JIRA_TOKEN\u0026#34;: \u0026#34;${JIRA_TOKEN}\u0026#34; } } } } The ${VARIABLE_NAME} syntax keeps credentials in the environment, outside version control. Every developer maintains their own tokens locally, while the committed config file stays free of secrets.\nPersonal or experimental servers can go in ~/.claude.json in your home directory, as this file is neither committed to the repository nor shared between collaborators. This is useful for servers that you are testing locally and may propose for wider adoption later, or that reflect personal preferences.\nWhen an MCP tool has a sparse description, the agent will favor built-in tools over MCP ones, since they are more reliable. 
However, the solution is the same as with custom tools: provide selection scope, output description and explicit boundaries. Whether a tool is built-in or external doesn\u0026rsquo;t change what its description needs to contain.\nThe ordering that saves the most time I was going to put few-shot examples on this list too, but examples are a liability in the first place. They burn tokens, and don\u0026rsquo;t diagnose the real issue. The routing classifier comes lower down on my list because it\u0026rsquo;s an infra cost for an edge case that ought to be diagnosed in ten minutes.\nThe only thing I\u0026rsquo;m not so sure about is the cutoff. Why four or five? The numbers sound right to me as a practitioner, but there\u0026rsquo;s no particular reason to believe they\u0026rsquo;re ideal for any particular model class. I\u0026rsquo;d love to see this explored on modern language models.\nWhen debugging a misbehaving agent you should first check\nthe number of tools per agent (scope creep if it\u0026rsquo;s over 5), absence of selection scope, output description, or parameter constraints, collisions between the system prompt and tool names, improper error returns, the presence of prerequisites in a downstream tool\u0026rsquo;s description. Thank you so much for reading this. If you are interested, please reach out to me at my X/Twitter or Reddit.\n","permalink":"https://minh2416294.github.io/posts/how-to-debug-tool-misrouting/","summary":"Everyone says \u0026lsquo;fix your tool descriptions.\u0026rsquo; Nobody shows how to diagnose which specific failure caused the misroute.","title":"Debugging Tool Misrouting in LLM Agents"},{"content":"A note on the code examples: All the code examples below are written in Python. We will use Python for our agent implementation, but any programming language is a viable option. Every tutorial on writing agents talks about the loop - check if stop_reason is tool_use, call the tool, add the result, repeat. 
If the stop_reason is end_turn, we need to get the text. Here\u0026rsquo;s the full code:\ndef run_agent(client, user_prompt: str) -\u0026gt; str: messages = [{\u0026#34;role\u0026#34;: \u0026#34;user\u0026#34;, \u0026#34;content\u0026#34;: user_prompt}] for _ in range(MAX_ITERATIONS): response = client.messages.create( model=\u0026#34;claude-sonnet-4-6\u0026#34;, max_tokens=1024, tools=TOOLS, messages=messages, ) if response.stop_reason == \u0026#34;tool_use\u0026#34;: messages.append({\u0026#34;role\u0026#34;: \u0026#34;assistant\u0026#34;, \u0026#34;content\u0026#34;: response.content}) messages.append(build_tool_result_message(response)) continue if response.stop_reason == \u0026#34;end_turn\u0026#34;: return extract_text(response) return \u0026#34;Error: exceeded iteration cap.\u0026#34; That\u0026rsquo;s the loop. It\u0026rsquo;s not the hard part.\nThe hard problem is everything you put into messages. Every pattern in agentic systems, from hooks to hub-and-spoke orchestration to programmatic enforcement to session forking, is some kind of answer to the question of how to keep the context window from becoming an unmanageable swamp that the model can\u0026rsquo;t reason about.\nThe four anti-patterns that kill loops from the inside Before diving into advanced patterns, some words about common ways to shoot yourself in the foot. Here are four anti-patterns that come from the outside, not from the architecture itself.\nChecking natural language for completion. If the agent says \u0026ldquo;I\u0026rsquo;m done,\u0026rdquo; that\u0026rsquo;s not a termination signal. It\u0026rsquo;s text. The agent can say \u0026ldquo;I\u0026rsquo;m done\u0026rdquo; and then emit a tool_use block in the same response. stop_reason is the only reliable signal. It\u0026rsquo;s deterministic and unambiguous. Nothing else is.\nUsing an iteration cap as a termination strategy. A cap is an emergency brake for bugs, not a way to end normal work. 
If your agent routinely hits the cap, the loop has a bug. The correct fix is finding and fixing the bug, not raising the cap.\nForcing tool_choice: any. This prevents stop_reason == \u0026quot;end_turn\u0026quot; from ever being reached. The loop never terminates on its own; it runs until it hits the iteration cap. Don\u0026rsquo;t do it.\nChecking response.content[0].type == \u0026quot;text\u0026quot; to detect completion. Claude can return a text block alongside a tool_use block in the same response. This check is wrong in the general case.\nThese anti-patterns are not subtle. They are listed here because I\u0026rsquo;ve seen them in production code written by engineers who knew what they were doing, but didn\u0026rsquo;t read the fine print.\nThe real problem: context quality degrades over turns The loop is stateless. Each call to client.messages.create() transmits the entire context for the model to process. As your threads grow longer, the context that needs to be processed grows linearly. This will always hurt model performance, because its ability to reason on long contexts is bounded and fragile.\nThis is a fundamental limitation of the system, not an implementation detail that can be worked around by \u0026ldquo;improving the prompt.\u0026rdquo; A 2026 study found that for each task, model performance degrades by approximately 4x with each doubling of the time needed to perform the task, and this degradation is uneven depending on where information falls in the context. This makes no sense from the model\u0026rsquo;s perspective - it has no awareness of ordering or priority. But it has a huge impact on real-world performance.\nWhat this means is that every architectural choice in your agentic systems is ultimately a tradeoff of context budget. 
Multi-agent decomposition, hooks, and session management are each ways of improving performance by ensuring that as much relevant information as possible falls within the context window while minimizing distracting information.\nHooks are context surgery, not just guardrails The standard framing of hooks focuses narrowly on safety. Yes, you can use PreToolUse to prevent the model from performing forbidden actions, and PostToolUse to modify tool results. But that framing misses the primary value of PostToolUse hooks.\nPostToolUse hooks are the preferred way to alter the context before the model uses the information. Tool output verbosity is the single largest reason why context budget matters. A tool that outputs 2000 tokens of JSON when 50 tokens of properly formatted summary would suffice is needlessly wasting budget on information that will need to be filtered out by the model on every turn.\n# Without a PostToolUse hook: raw tool result enters context { \u0026#34;customer_id\u0026#34;: \u0026#34;C-48291\u0026#34;, \u0026#34;name\u0026#34;: \u0026#34;Alice Chen\u0026#34;, \u0026#34;account_status\u0026#34;: \u0026#34;active\u0026#34;, \u0026#34;created_at\u0026#34;: \u0026#34;2019-03-14T09:22:11Z\u0026#34;, \u0026#34;last_login\u0026#34;: \u0026#34;2024-11-02T14:37:55Z\u0026#34;, \u0026#34;orders\u0026#34;: [...], # 200 lines of order history \u0026#34;preferences\u0026#34;: {...}, # 80 lines of settings \u0026#34;support_tickets\u0026#34;: [...] # 150 lines of prior tickets } # With a PostToolUse hook: normalized signal enters context customer_id: C-48291 | status: active | open_tickets: 2 | last_order: 2024-10-28 The hook is normalizing the data format, which is a way of preserving downstream attention budget. 
Every step subsequent to this one uses 20 tokens instead of 600, which is a significant saving.\nThe PreToolUse hook has another purpose, which is programmatic enforcement of ordering constraints: the canonical example is financial safety checks, where process_refund cannot be called before verify_identity has run. But the same idea applies to any prerequisite check that blocks the execution of a tool until some condition is met.\ndef pre_tool_use_hook(tool_name: str, tool_input: dict, session_state: dict) -\u0026gt; dict | None: if tool_name == \u0026#34;process_refund\u0026#34;: if not session_state.get(\u0026#34;identity_verified\u0026#34;): return { \u0026#34;error\u0026#34;: \u0026#34;Cannot process refund — customer identity not verified. \u0026#34; \u0026#34;Call verify_identity first.\u0026#34; } return None # allow the call The model receives an error and directs itself toward the prerequisite. It cannot circumvent this by phrasing the request differently or deciding that verification is unnecessary. This is vital in financial, security, or compliance operations where one wrong step can result in tangible losses. Tighter instructions in the system prompt can push the probability of compliance close to 1, or perhaps to 95%, but they do not offer an absolute guarantee. If the remaining 5% is unacceptable because it means a compliance breach or a financial loss, use the hook.\nHub-and-spoke: why subagents exist A subagent serves as a context firewall. 
Its internal work (tool calls, intermediate reasoning steps, dead ends) stays inside the subagent; only a summary of the results reaches the coordinator’s context.\nThis is the whole point of a multi-agent system - not specialization (a single agent is also capable of handling most tasks), and not parallelism (the subagents work sequentially), but isolation: a coordinator that delegates three subtasks to three different agents will have in its context three summaries, not the full transcripts and reasoning steps of each agent.\ndef run_subagent(client, kind: str, subtopic: str, research_goal: str, prior_results: str = \u0026#34;\u0026#34;) -\u0026gt; str: # Fresh messages list — complete isolation from coordinator context messages = [{\u0026#34;role\u0026#34;: \u0026#34;user\u0026#34;, \u0026#34;content\u0026#34;: SUBAGENT_TASK_TEMPLATE.format( research_goal=research_goal, subtopic=subtopic, prior_results=prior_results or \u0026#34;(none)\u0026#34;, )}] for _ in range(MAX_ITERATIONS): response = client.messages.create( model=\u0026#34;claude-sonnet-4-6\u0026#34;, max_tokens=1024, system=SUBAGENT_SYSTEM[kind], # Narrow system prompt, not the coordinator\u0026#39;s tools=TOOLS, messages=messages, ) if response.stop_reason == \u0026#34;tool_use\u0026#34;: messages.append({\u0026#34;role\u0026#34;: \u0026#34;assistant\u0026#34;, \u0026#34;content\u0026#34;: response.content}) messages.append(build_tool_result_message(response)) continue return extract_text(response) Two common pitfalls to avoid here:\nSubagents don’t have access to the same context as the coordinator. Whatever research they do, whatever goals they’re pursuing, whatever constraints they’re operating under - it has to come explicitly in their prompt. Anything that the coordinator knows but doesn’t specify to the subagent, the subagent can’t know about. 
Every call to a subagent is an isolated interaction with no memory of previous calls.\nIf the subagents are doing an incomplete job, the culprit is almost certainly the coordinator’s decomposition, not the subagents’ research. A subagent can only do the research the coordinator asks it to do. If the decomposed task mentions only solar and wind energy, the subagents will necessarily give an answer that doesn’t mention geothermal, tidal, biomass, or nuclear fusion. The solution isn’t to “train” the subagents to think more creatively; it’s to constrain the decomposition prompt so that it specifies the exact information the coordinator wants.\nThe solution is constraining the decomposition prompt, not training the subagents:\nDECOMPOSITION_PROMPT = \u0026#34;\u0026#34;\u0026#34;Decompose the research topic into subtopics. Topic: \u0026#34;{topic}\u0026#34; Requirements: - Produce AT LEAST 6 distinct, non-overlapping subtopics. - Cover the FULL breadth, including emerging and less-common categories. Respond with ONLY JSON: {{\u0026#34;subtopics\u0026#34;: [\u0026#34;subtopic one\u0026#34;, ...]}}\u0026#34;\u0026#34;\u0026#34; Session management is context quality across time The three session options below are not about UX preference, they\u0026rsquo;re about context quality across time - managing the evolution of session context and keeping it relevant to the current state of the files under analysis.\n--resume resumes a session. The history is preserved as is. It should be used when resuming work on files that were not significantly modified since the last session. Do not use it if the files were substantially changed since the last session, as unchanged tool outputs from the prior session will be retained in context, potentially providing false positives if the model was asked to re-analyze the changed files. The model would read the entire transcript including the stale tool outputs. fork_session allows forking a session into a different branch. 
Each branch has its own independent context. It should be used to explore different options based on some common starting point, such as different ways to refactor the same code or structure the same system. It should not be used to address the issue of stale context, as a branch will have the same context as the parent session, including stale tool outputs. A fresh start with summary injection is appropriate when context becomes stale, which happens when files are changed substantially, or when the session has simply grown too long and accumulated too much noise. To address this scenario, a fresh session should be started with a summary of findings from the previous session, highlighting what files were modified, and requesting re-analysis of those files. Prior analysis: three authentication issues found in auth.ts, session.ts, middleware.ts. All three have been fixed. Please re-analyse these three files to verify the fixes and check for new issues introduced by the changes. Prior findings for all other files remain valid and do not need re-analysis. This is not merely a session management technique, but a general pattern akin to the subagent context firewall but applied across time: whatever you put in the context for the purposes of reasoning must be strictly necessary for the current step.\nThe pattern underneath all the patterns Hooks control what tool results look like as input to the model before it thinks. Subagents hide the messiness of their own work from the coordinator so it doesn\u0026rsquo;t appear in context. Similarly, programmatic enforcement takes decisions out of the model\u0026rsquo;s hands entirely if its reasoning could introduce variance. Session forking lets you explore an alternate universe without polluting the original with unrelated detritus. And fresh starts with summaries let you forget everything but the lessons learned so far.\nThese are all techniques for improving context quality. The actual work is the easy part. 
What matters is what you put in the loop.\nThe thing I\u0026rsquo;m least sure about is whether observation masking - showing the model some results but not others while still recording them for the log - is the right primitive or if it creates a whole new set of errors where the model is making decisions based on information that isn\u0026rsquo;t actually there. The technique definitely improves solve rates and reduces costs on these benchmarks, but I\u0026rsquo;m not sure anyone has characterized the failure modes it can introduce.\nThank you so much for reading this. If you are interested, please reach out to me at my X/Twitter or Reddit\n","permalink":"https://minh2416294.github.io/posts/the-agentic-loop-is-not-the-hard-part/","summary":"stop_reason is six lines of code. The real engineering in agentic systems is protecting what goes into the model\u0026rsquo;s context window.","title":"Protecting The Model's Context Window"},{"content":"My RAG pipeline had a faithfulness score of 0.91 - but users were getting incorrect answers.\nIt took me two weeks to realize what the problem was. In short, I was evaluating the wrong thing - the retriever was confidently returning “plausible” chunks of text that didn’t include the answer, and the generator was faithfully summarizing those chunks. Everything looked correct according to standard evaluation metrics - but the system wasn’t actually working correctly.\nBelow are some insights on where RAG pipelines can fail, and how to spot those failures.\nThe metric most teams track, and what it misses Faithfulness measures whether the generated answer is consistent with the retrieved context. It\u0026rsquo;s a generation-side metric, and a good faithfulness score indicates that the model isn\u0026rsquo;t hallucinating compared to the retrieved documents.\nIt says nothing about whether those retrieved documents actually contain the ground-truth answer. 
The metric you actually want to look at is context recall: the ratio of ground-truth answer content that appears in the retrieved chunks. If context recall is low, it doesn\u0026rsquo;t matter how faithful the generation is: your model is faithfully summarizing the wrong documents.\nMost teams measure faithfulness on generation. Few measure context recall on retrieval at all, and the failure mode is often not obvious until users complain.\nWhat most eval panels look like: ✓ Faithfulness (is the answer grounded in context?) ✓ Answer relevance (does the answer address the question?) ✗ Context recall (did retrieval surface the right content?) ✗ Context precision (how much retrieved content was actually useful?) The top two are on the generation side; the bottom two are on the retrieval side. If you just run the top two, you can’t assess if your retriever is working properly.\nWhy retrieval fails silently A retriever can fail in a way that makes it appear to succeed. Imagine asking a question about a contract, for example, \u0026ldquo;What are the termination clauses in the ACME contract?\u0026rdquo;\nYour retriever returns chunks of text that happen to mention termination clauses, but from a different contract or a generic policy file. Your chunks have high relevance to your query in terms of semantic similarity or even keyword matching. Your reranker confidently promotes these false positives.\nYour generator produces a coherent answer about termination clauses using the retrieved documents. Your faithfulness metric is 1.0 because the generator didn\u0026rsquo;t hallucinate. However, the answer is still wrong.\nThis is the retrieval recall versus end-to-end accuracy problem. The evaluation of chunking methods in 2026 demonstrates the issue well. The best method achieves 91.9% retrieval recall but only 54% end-to-end accuracy on the benchmark. 
Even though the retriever surfaced many relevant chunks, the answers were wrong because the chunks were too small to contain a full answer. On average, each retrieved chunk consisted of only 43 tokens.\nIn short, high retrieval recall is not necessarily indicative of high end-to-end accuracy. The two metrics are fundamentally different, and optimization for one does not imply optimization for the other.\nThe layered failure taxonomy Before you can fix a RAG pipeline, you have to know which layer is broken. There are four possible layers, each of which can be broken in different ways.\nLayer 1 — Chunking. The chunks are too small (context rotation: the answer is in another chunk), too big (the retrieved chunk contains the answer but also 800 irrelevant tokens which suppress it) or of the wrong structure (tables embedded in PDFs parsed as text).\nLayer 2 — Retrieval. Dense vectors capture semantic similarity but not verbatim matches, which are critical for queries like “Section 4.2(b)” and which traditional IR models like BM25 handle better than embeddings. (A 2026 prior art study on financial documents found that BM25 outperformed text-embedding-3-large on numeric and citation queries by a statistically significant margin.) Table structure mismatches were the cause of 73% of retrieval failures in that domain.\nLayer 3 — Context assembly. Context is lost around retrieved chunks: if a chunk says “revenue grew by 3% over the previous quarter”, the model has no way of knowing which revenue, which quarter or which comparison it is referring to, since that information was trimmed during chunking. This renders the chunk effectively useless, as the model cannot determine what the actual fact even is.\nLayer 4 — Generation. The model hallucinates, refuses to answer or misconstrues the context. 
This is the easiest layer to blame but should be the last to investigate, since it is rare that the actual problem is in this layer.\nWhat contextual retrieval actually fixes (and what it doesn\u0026rsquo;t) Anthropic\u0026rsquo;s contextual retrieval prepends LLM-generated context to each chunk before embedding and BM25 indexing:\nCONTEXT_PROMPT = \u0026#34;\u0026#34;\u0026#34; \u0026lt;document\u0026gt; {whole_document} \u0026lt;/document\u0026gt; Here is the chunk we want to situate within the whole document: \u0026lt;chunk\u0026gt; {chunk_content} \u0026lt;/chunk\u0026gt; Give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval. Answer only with the context. \u0026#34;\u0026#34;\u0026#34; def contextualize_chunk(document: str, chunk: str, client) -\u0026gt; str: response = client.messages.create( model=\u0026#34;claude-haiku-4-5-20251001\u0026#34;, max_tokens=100, messages=[{ \u0026#34;role\u0026#34;: \u0026#34;user\u0026#34;, \u0026#34;content\u0026#34;: CONTEXT_PROMPT.format( whole_document=document, chunk_content=chunk ) }] ) return f\u0026#34;{response.content[0].text}\\n\\n{chunk}\u0026#34; The resulting chunk becomes:\nThis chunk is from ACME Corp\u0026#39;s Q2 2023 SEC filing. The previous quarter\u0026#39;s revenue was $314 million. The company\u0026#39;s revenue grew by 3% over the previous quarter. Anthropic reports that combining contextual embeddings with contextual BM25 reduces retrieval failures by 49%, and by 67% once reranking is added, on their benchmark. At a one-time cost of about $1.02 per million document tokens with prompt caching, it is affordable for most corpora.\nThis only fixes layer 3. 
Chunks are no longer context-free.\nIt does not address layer 1 (chunking boundaries), layer 2 (domain-specific issues such as table parsing), or layer 4 (generation) – only one type of issue in one layer.\nThe hybrid retrieval floor If you are doing pure dense vector search, you know what you are missing: exact matches, identifiers and numeric queries. In BM25, these things are matched directly via tokens.\nThe production floor is therefore hybrid: dense for semantics, BM25 for identifiers, combined by Reciprocal Rank Fusion:\nfrom rank_bm25 import BM25Okapi def hybrid_search( query: str, chunks: list[str], embeddings: list[list[float]], query_embedding: list[float], top_k: int = 20, semantic_weight: float = 0.7, ) -\u0026gt; list[tuple[str, float]]: # BM25 lexical scores tokenized = [c.split() for c in chunks] bm25 = BM25Okapi(tokenized) bm25_scores = bm25.get_scores(query.split()) # Dense cosine scores import numpy as np emb_matrix = np.array(embeddings) q_vec = np.array(query_embedding) cosine_scores = (emb_matrix @ q_vec) / ( np.linalg.norm(emb_matrix, axis=1) * np.linalg.norm(q_vec) + 1e-9 ) # Reciprocal Rank Fusion def rrf_rank(scores, k=60): ranked = np.argsort(scores)[::-1] return {idx: 1 / (k + rank + 1) for rank, idx in enumerate(ranked)} bm25_rrf = rrf_rank(bm25_scores) cosine_rrf = rrf_rank(cosine_scores) fused = {} for idx in range(len(chunks)): fused[idx] = ( semantic_weight * cosine_rrf.get(idx, 0) + (1 - semantic_weight) * bm25_rrf.get(idx, 0) ) top_indices = sorted(fused, key=fused.get, reverse=True)[:top_k] return [(chunks[i], fused[i]) for i in top_indices] This isn\u0026rsquo;t a silver bullet. On a general set of corpora, hybrid approaches are modestly better than either method alone. 
But on certain classes of documents - tables in financial reports, legal citations, computer code - the improvements are spectacular.\nThe eval panel that actually catches failures The minimum viable eval panel for a RAG system in production:\nfrom ragas import evaluate from ragas.metrics import ( faithfulness, # generation: is answer grounded in context? answer_relevancy, # generation: does answer address the question? context_recall, # retrieval: did we surface the right content? context_precision, # retrieval: how much retrieved content was useful? ) results = evaluate( dataset=eval_dataset, metrics=[faithfulness, answer_relevancy, context_recall, context_precision], ) Thresholds worth targeting before shipping:\nMetric Minimum bar Faithfulness \u0026gt; 0.85 Answer relevancy \u0026gt; 0.80 Context recall \u0026gt; 0.75 Context precision \u0026gt; 0.70 If faithfulness is high but context recall is low, your retriever is broken, not your generator. Fix chunking and retrieval before touching the prompt.\nIf your context precision is low, your retrieved chunks are too noisy. Try a reranker, or smaller chunks, or both.\nWhat I\u0026rsquo;d do differently Start from the eval panel, not the pipeline. Know what \u0026ldquo;working\u0026rdquo; means in terms of all four metrics before writing any retrieval code. The layer you cannot measure will become the layer where you waste weeks chasing ghosts in production.\nAnd really think through chunking as an engineering decision, not a default. The chunk size and strategy will vary wildly depending on the document type. Markdown pages, PDFs of SQL queries, or scanned financial tables will need different treatment at chunking time.\nThe thing I\u0026rsquo;m still not sure of is whether contextual retrieval\u0026rsquo;s LLM-generated context adds noise to the signal when the chunk was already self-contained. 
Say, if the chunk has a single factual answer, the prepended context makes its precision artificially low, but you can\u0026rsquo;t know that at indexing time.\nThank you so much for reading this. If you are interested, please reach out to me at my X/Twitter or Reddit\n","permalink":"https://minh2416294.github.io/posts/your-rag-eval-is-lying/","summary":"A 0.91 faithfulness score doesn\u0026rsquo;t mean your RAG pipeline works. Most eval panels can\u0026rsquo;t see the layer that\u0026rsquo;s actually broken.","title":"Pitfalls in RAG Evaluation"},{"content":"The first question I ask when shown an agent architecture is not \u0026ldquo;does it work?\u0026rdquo; It is \u0026ldquo;what happens when it does not?\u0026rdquo;\nThat question can turn a positive response into a negative one. Most people think of agents as a way to tackle a really hard problem. Complexity is not the litmus test for an agent, however. If the steps are knowable before seeing the input, then you have a workflow, not an agent. A workflow is much easier to build, execute, and debug than an agent.\nHere\u0026rsquo;s how I actually think through it.\nThe one question that does most of the work Can I draw the exact steps and their order before seeing the input?\nIf yes: it\u0026rsquo;s a workflow. Pick the simplest workflow shape that fits. If no: it might need an agent. But not yet — there are cheaper stops first.\nThis seems obvious, but I\u0026rsquo;ve seen teams fall into the \u0026ldquo;multi agent\u0026rdquo; trap for customer support tickets, when a 3-way router would handle 95% of the volume at 1/10th the cost and complexity. 
The task was not open-ended, but that wasn\u0026rsquo;t obvious until we started diagramming.\nThe cost-of-failure lens When choosing an architecture, I do one more reality check: what does a wrong answer cost?\nIrreplaceable work, a regulatory requirement, or something that can\u0026rsquo;t be undone - these are reasons to prefer a controlled process over an agent. A workflow with checkpoints and human review is safer ground if an error in thinking leads to catastrophic consequences.\nOn the other hand, if mistakes are easily undone or have little downside, I\u0026rsquo;m much more tolerant of \u0026ldquo;wrong\u0026rdquo; output from an agent.\nIn short: agents\u0026rsquo; ability to go off-road is much less valuable when wrong steps lead to irreversible outcomes. A workflow failure at a specific step has predictable consequences - bad results at step 3 were caused by some kind of error in step 2, and you can see how step 3\u0026rsquo;s output was produced. An agent can produce invalid output in any number of ways: loops, tangents, hallucinations, or just running out of thinking budget before returning a result that doesn\u0026rsquo;t account for all relevant information. Wrong answers from an agent have more varied consequences, and when those consequences are expensive to repair, I\u0026rsquo;ll favor a workflow with clearer stopping points.\nThe climb, in order I think of LLM systems as a ladder. The rule: go up one rung only when the rung below provably can\u0026rsquo;t do the job.\nTier 0: No LLM — structured input, deterministic rules, sub-10ms latency Tier 1: Augmented LLM — single-turn Q\u0026amp;A, doc summarization, RAG Tier 2: Workflows — prompt chaining, routing, parallelization, orchestrator-workers, evaluator-optimizer Tier 3: Single agent — open-ended task, unknown path, exploration Tier 4: Multi-agent — 2+ distinct expertise domains, parallel exploration, context overflow Most tasks fall in Tier 1 or Tier 2. 
Supporting an FAQ line with grounded responses is T1. A translation pipeline is T2: a prompt chain plus a parallel fan-out for the per-language substeps. Code reviews across multiple dimensions are T2 parallelized (security, performance, style) rather than one big multi-layered prompt each time.\nThe use case for agents is where the path is not apparent at design time: research, debugging, open-ended coding. It\u0026rsquo;s a capability that\u0026rsquo;s needed when there isn\u0026rsquo;t anything lower tier that can handle the task.\nConsider the multi-agent question specifically Multi-agent is where I think people make the most expensive mistakes.\nA 2026 paper (arXiv 2604.02460) finds that, when controlling for total compute (i.e., number of tokens allocated to thinking), single agents perform as well as or better than multi-agent systems on multi-hop reasoning tasks across three model families and five MAS architectures. The reason most benchmarks find MAS useful is precisely because they are allocated more total compute. Controlling for that variable, the advantage disappears.\nThe practical implication is that multi-agent is often primarily a way to distribute compute rather than a capability improvement; it has costs associated with context management at the supervisor level and emergent behaviors that can be exceptionally difficult to reverse engineer.\nMy personal rule of thumb is to use multi-agent only if one or more of the following is true:\nThe task requires multiple genuinely distinct areas of expertise (law + finance + compliance, etc.) rather than just \u0026ldquo;is complicated\u0026rdquo; The task requires genuine parallelization of distinct lines of thought The task requires more context than fits in a single model\u0026rsquo;s context window (and that can\u0026rsquo;t be summarized or trimmed via context editing) MAS is often substantially more expensive (in tokens) than using a single model. 
The ballpark is 10–15x, but it could be higher depending on the task. This is primarily because of the agent loop: every call resends the accumulated history, so the overhead of repeatedly querying the model grows quadratically with the number of hops. A 20-step agent loop (not uncommon) has a 50x overhead vs a single model query once you start including the cost of history in each request.\nWhatever you\u0026rsquo;re trying to accomplish with MAS, first ask yourself: can a single model do it with skills? Because it almost certainly can.\nThe thing nobody says about observability It is no secret that observability is critical for the functioning of large-scale LLM systems. What is less apparent is that it is no longer optional beyond tier 1.\nThe type of generic request latency, error rate, and uptime metrics one usually gets from conventional application performance monitoring solutions offer limited insight into the root causes of issues in an LLM-based system. In order to effectively troubleshoot and optimize such a system, one would need to collect:\nAll prompts provided and responses generated by the system\nWhich branches, tools, or agents were called for a particular prompt and why\nWhat information was retrieved for a given response\nThe token usage per call and per turn\nNon-deterministic behavior is a common source of silent failures. A hallucinating agent will not self-correct, even if it strays miles away from the intended task, and a retrieval step that returns irrelevant documents will not throw an error. Routing classifiers, too, are not perfect, and if a support ticket gets routed down the wrong agent chain, it may be challenging to detect until an audit. These examples highlight the importance of observability in LLM systems - without it, debugging becomes guesswork.\nI\u0026rsquo;ve observed companies choose to forgo investing in observability in order to accelerate product development. 
This can lead to weeks of debugging a productionized tier 2 agent system before identifying a systemic issue in the agent\u0026rsquo;s decision-making logic. A properly instrumented tier 2 agent would substantially outperform a non-observable tier 3 agent in most cases, as the value of the tier 2 agent\u0026rsquo;s responses would be immediately apparent.\nWhere I land The default position for anything that promises to be complex by nature should be scepticism and a search for the simplest practical alternative. Agents are powerful, but they are also costly, risky, and have poor feedback loops. It’s not a question of whether my task is complex enough to warrant an agent, but whether the simplest possible viable solution has been exhausted.\nThe mnemonic that I find helpful here is No LLM -\u0026gt; (Augmented) -\u0026gt; (Chain) -\u0026gt; Route -\u0026gt; (Parallel) -\u0026gt; Orchestrate -\u0026gt; Evaluate -\u0026gt; Agent -\u0026gt; Multi-agent. Each higher step is only attempted if the lower one has provably failed; and the choice should be written down with reasons for all dismissed options.\nThe second point I want to make is that the decision is best formalised and made visible somewhere. It is not typical to later revisit the justification for choosing a single orchestrator-worker combination over an entire agent due to some emergent properties of the subtask decomposition, and even less common to reverse the decision. By writing the reasoning down you get to keep track of the tradeoffs against the emergent properties of the chosen architecture, which may not be obvious at the time of designing the system, but will become apparent later.\nThe third point is that I want to think through my own uncertainties. Where exactly does the line between single-agent and multi-agent solutions lie in terms of task horizon and degree of subtask autonomy, given roughly equal computational budget? 
I think that we are beginning to see the theory that informs such choices, but I do not think that the practical guidance is mature enough yet.\nThank you so much for reading this. If you are interested, please reach out to me at my X/Twitter or Reddit\n","permalink":"https://minh2416294.github.io/posts/reach-for-an-agent-last/","summary":"A decision framework for picking LLM architecture by asking what failure costs first and why agents are the right answer less often than you think.","title":"Picking the Right LLM Architecture"}]