The verdict: I asked three versions of Claude where my own Anthropic API key was being used in my own codebase. Haiku said it was not used at all. Sonnet 4.6 mapped the full surface. I had to escalate to Opus 4.7 with extended thinking after a runtime experiment proved Haiku wrong. And none of them remembered the migration we had made together, weeks earlier, that put the key there in the first place.
I noticed an Anthropic bill I did not expect. Where, in my own repository, was the key being used? Build? Runtime? GitHub Actions? Some old script I forgot about?
I asked Haiku. Haiku said the string ANTHROPIC_API_KEY did not appear anywhere in the codebase. Strictly true. Functionally wrong.
I did not trust the answer. I opened the running app, hit the chat endpoint, and watched the network panel. Anthropic's API was being called. The key was live. The small model had handed me a confident, plausible, partially-correct answer — the worst kind of wrong.
I escalated. Opus 4.7 with extended thinking got me the rest of the way. The @ai-sdk/anthropic package reads the variable implicitly, so a literal grep returns nothing even though the dependency is real. One runtime endpoint, no build use, no CI use, no scripts. A clean forensic map.
But the moment that stayed with me was smaller and sharper. Earlier in this same project, Claude Code had helped me migrate the chat endpoint from OpenAI's gpt-4o-mini to Anthropic's claude-sonnet-4-6. The commit is still in the log: "swap to Claude Sonnet 4.6 with prompt caching." We made that decision together. Weeks later, in a fresh session, the assistant had no memory of it and could not tell me which provider was now in use.
I had to reverse-engineer my own codebase to audit a change the AI itself had helped me make.
The case of the missing string
```
$ grep -rIn "ANTHROPIC_API_KEY" .
# nothing
```

A naive search returns nothing. A naive reasoner concludes: not used here. That conclusion is wrong. The SDK reads the env var implicitly. The string is absent because the abstraction hides it. The dependency is real.
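Concretely, here is a minimal sketch of what such an endpoint usually looks like with the AI SDK. The file path and surrounding details are my assumptions rather than the project's actual code; the point is that the provider falls back to process.env.ANTHROPIC_API_KEY on its own, so the key's name never has to appear in the source.

```typescript
// app/api/chat/route.ts (hypothetical path, reconstructed for illustration)
import { anthropic } from "@ai-sdk/anthropic"; // reads process.env.ANTHROPIC_API_KEY by default
import { streamText } from "ai";

export async function POST(req: Request) {
  const { messages } = await req.json();

  // Nothing here mentions a key or an env var name, yet this call
  // authenticates against Anthropic's API at runtime.
  const result = streamText({
    model: anthropic("claude-sonnet-4-6"),
    messages,
  });

  return result.toTextStreamResponse();
}
```

Passing the key explicitly, for example via createAnthropic({ apiKey: process.env.ANTHROPIC_API_KEY }), would at least make it greppable. The implicit default is convenient, and it is exactly the kind of decision worth writing down somewhere you can find it later.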
Two lessons, one frustration
The post-mortem has two threads, and they are connected.
Lookup vs. forensic. Most AI tasks fall into one of two classes:
- Lookup. Where is X? What does Y mean? How do I do Z? The answer is present in the data. The work is finding it. A small model is fine — often preferable, because it is faster and cheaper.
- Forensic. What is the full surface of X, including where X is not? Is the absence of evidence evidence of absence? The answer is not directly findable. The work is reasoning about what would have to be true.
Operating rule: Lookup rewards search. Forensic rewards reasoning from absence, knowledge of implicit defaults, and synthesis across layers the user did not even mention.
Smaller models handle lookup well. They struggle with forensic work for predictable reasons. Fewer searches before they commit to an answer. Less likely to know that an SDK reads env vars implicitly. Quicker to collapse three axes (build, runtime, CI) into one. Better at "find X" than at "prove the full surface of X."
The forgotten architecture. Forensic work would be rarer if AI assistants remembered the decisions they helped you make. They do not. Every session starts cold. To a new session, the codebase we built together yesterday is just another repo the model has never seen. The smaller the model, the worse this gets: a smaller context window, less ability to reconstruct intent from artifacts, less patience to search across CI and scripts before answering.
So here is the real reason I needed Opus 4.7 with extended thinking. My AI assistant forgot what it had helped me change. The cheap model could not reconstruct the change from the artifacts alone. The escalation was not a luxury. It was the cost of memory loss.
Why this matters for AI budgets
If you are standing up AI for your business, you will hear a story that goes "the cheap model is good enough for most things." That story is true for lookup work and misleading for forensic work, and the longer you let AI evolve your codebase, the more forensic the audit work becomes.
The rule is short:
- Extract, classify, summarize known text, triage. Route to the small model.
- Audit, diagnose, map, prove a negative, reconstruct prior architectural decisions. Route to the larger model. Turn on extended thinking when the answer is not directly findable. A sketch of this routing follows the list.
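In code, the routing rule is small. Here is a minimal sketch, assuming you (or a classifier) label each task before dispatching it; the model names and the shape of the table are illustrative assumptions, not a prescription.

```typescript
// Route a task to a model tier based on whether it is lookup or forensic work.
type TaskClass = "lookup" | "forensic";

interface ModelChoice {
  model: string;             // identifier passed to your provider (names here are placeholders)
  extendedThinking: boolean; // whether to enable extended thinking / reasoning
}

const routes: Record<TaskClass, ModelChoice> = {
  // Extract, classify, summarize known text, triage: cheap and fast is fine.
  lookup: { model: "small-model", extendedThinking: false },
  // Audit, diagnose, map a surface, prove a negative, reconstruct old decisions:
  // pay the premium and let the model think.
  forensic: { model: "large-model", extendedThinking: true },
};

function routeTask(task: TaskClass): ModelChoice {
  return routes[task];
}

// "Where is my API key actually used?" is forensic, not lookup.
console.log(routeTask("forensic")); // { model: "large-model", extendedThinking: true }
```

The hard part is not the table; it is classifying the task honestly before you pick the row.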
You will spend less in the long run paying the larger-model premium on the right ten percent of calls than paying the smaller-model price on every call and absorbing the cost of the wrong answers.
The harder lesson: Model selection is not a cost-optimization problem. It is a task-classification problem. And task classification gets harder as your AI assistant accumulates decisions it does not remember making.
Two operating habits that have already saved me time:
- Verify before trusting a negative. When a small model says X is not in your repo, run a runtime experiment before accepting it. A working app is the only oracle that cannot lie about which API it just called.
- Keep an architecture log. A short append-only file the AI writes to whenever you make a decision together: which model, which provider, which package, which workflow. Cheap to maintain. Turns most forensic questions back into lookup questions. One possible shape for it is sketched below.
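Here is a minimal sketch of what that log could be, assuming a plain Markdown file named DECISIONS.md at the repo root; the name, format, and helper are my own convention, not a standard.

```typescript
// log-decision.ts: append one architectural decision to DECISIONS.md.
// Run it (or have the assistant run it) whenever a provider, model,
// package, or workflow changes.
import { appendFileSync } from "node:fs";

interface Decision {
  date: string;     // ISO date, e.g. "2025-06-01"
  area: string;     // e.g. "chat endpoint", "CI", "build"
  decision: string; // what changed and why
}

function logDecision(d: Decision, file = "DECISIONS.md"): void {
  appendFileSync(file, `- ${d.date} | ${d.area} | ${d.decision}\n`, "utf8");
}

// Example entry for the migration this post keeps coming back to:
logDecision({
  date: new Date().toISOString().slice(0, 10),
  area: "chat endpoint",
  decision:
    "Swapped OpenAI gpt-4o-mini for Anthropic claude-sonnet-4-6 via @ai-sdk/anthropic; " +
    "the API key is read implicitly from ANTHROPIC_API_KEY at runtime.",
});
```

The payoff is the one the bullet describes: the next time you ask which provider the chat endpoint uses, the answer is a lookup in one small file instead of a forensic pass over the whole repo.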
Next time a small model tells you something useful, ask one follow-up: what would have to be true for this answer to be wrong, and did you check for it? If the model cannot answer that, you were doing forensic work with a lookup tool — possibly on an architecture nobody on either side of the keyboard remembers building.
