Part III · Review & Quality
11
Diff Review Loops
Avoiding the looks-fine trap in AI-generated code
Have you ever approved a diff that looked clean, only to find a subtle bug two days later?
The most dangerous moment in AI-assisted development: you're looking at a diff that looks correct. Code is clean, logic seems right, tests pass. You approve it. Two days later, bug in production.
AI-generated code has a specific failure mode — correct for the happy path, wrong for the edge cases. The code looks right because the structure is right. The bug is in the behavior.
The diff review mindset
Reviewing an AI diff requires a different mindset than reviewing a human-written one. With human code, you're checking for mistakes. With AI code, you're checking for plausible-looking mistakes — code that looks right but isn't.
A concrete example: you ask the model to add pagination to a list. It returns this:
```typescript
const start = (page - 1) * pageSize;
const end = start + pageSize;
return items.slice(start, end);
```

Looks right. It works for every page except the last, where end overshoots the array length. slice handles that gracefully, so there's no error, just a silently short result set that breaks any caller expecting exactly pageSize items. The test passes because it uses a full page of data; the bug only surfaces when the last page has fewer items than pageSize.
The code is clean, the logic is almost right, and nothing in the diff signals the problem. That's the failure mode.
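A minimal sketch that makes the failure visible (the paginate wrapper and the sample data are illustrative, not from the original diff):

```typescript
// Hypothetical wrapper around the generated snippet.
function paginate<T>(items: T[], page: number, pageSize: number): T[] {
  const start = (page - 1) * pageSize;
  const end = start + pageSize;
  return items.slice(start, end);
}

const items = [1, 2, 3, 4, 5, 6, 7]; // 7 items, pageSize 3 → last page is short
const fullPage = paginate(items, 1, 3); // [1, 2, 3] — what the test covers
const lastPage = paginate(items, 3, 3); // [7] — length 1, not pageSize
// Any caller that assumes a full page (say, to decide whether to show a
// "next" button) misbehaves here, with no error anywhere in the stack.
```

The point is not that slice is wrong; it's that the diff gives you no reason to run page 3.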
Questions I ask:
- What edge cases does this not handle? The model tends to handle the cases it was asked about and miss the ones it wasn't.
- Does this change the behavior in ways that aren't visible in the diff? Changing a function's return type, modifying shared state, altering execution order.
- Are the tests testing the right things? AI-generated tests often test the happy path and miss the edge cases.
- Does this introduce any implicit dependencies? New imports, global state, environment variables.
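For the third question, here is a sketch of what edge-case coverage looks like for the pagination snippet above (the paginate and assertEqual helpers are illustrative):

```typescript
// Illustrative paginate helper, same logic as the generated snippet.
function paginate<T>(items: T[], page: number, pageSize: number): T[] {
  const start = (page - 1) * pageSize;
  return items.slice(start, start + pageSize);
}

function assertEqual<T>(actual: T[], expected: T[], label: string): void {
  if (JSON.stringify(actual) !== JSON.stringify(expected)) {
    throw new Error(`${label}: got ${JSON.stringify(actual)}`);
  }
}

// The happy-path test an AI suite usually writes:
assertEqual(paginate([1, 2, 3, 4, 5, 6], 1, 3), [1, 2, 3], "full page");

// The cases it usually misses:
assertEqual(paginate([1, 2, 3, 4, 5], 2, 3), [4, 5], "short last page");
assertEqual(paginate([1, 2, 3], 2, 3), [], "page past the end");
assertEqual(paginate([], 1, 3), [], "empty input");
assertEqual(paginate([1, 2, 3], 0, 3), [], "page 0 slips through as empty");
```

Whether an empty array is the right answer for page 0 is a product decision; the point of the test is to force that decision instead of inheriting slice's defaults.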
Ask the model to break it
After the model generates code, ask it to try to break it. "What inputs would cause this to fail? What edge cases does this miss?"
This works because it puts the model in a different mode: looking for problems rather than generating solutions. It often finds issues it missed during generation.
Patterns that indicate risky diffs
Patterns I've learned to scrutinize:
- Simplified conditionals. The model often simplifies complex conditions. Sometimes the simplification is correct; sometimes it removes an edge-case handler.
- Changed function signatures. Even small signature changes can break callers in non-obvious ways.
- New async/await patterns. Async code is where subtle bugs hide. Any change here deserves extra attention.
- Modified error handling. The model tends to add or remove error handling in ways that change behavior silently.
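The last pattern is the easiest to miss, because added error handling reads like an improvement. A hypothetical before/after (both functions are illustrative):

```typescript
// Before: a malformed config is a loud failure at the call site.
function loadConfigBefore(raw: string): Record<string, unknown> {
  return JSON.parse(raw);
}

// After a model edit: the same failure is swallowed. Callers now silently
// run with an empty config instead of failing fast, and every happy-path
// test still passes.
function loadConfigAfter(raw: string): Record<string, unknown> {
  try {
    return JSON.parse(raw);
  } catch {
    return {};
  }
}
```

Nothing in the diff itself looks wrong; the behavior change only shows up when a caller receives an empty object where it used to get an exception.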
Scenario Lab
Real tasks, real model outputs — see which model wins and why
The task: paste a stack trace and get a plain-English explanation with a fix suggestion.

Fast models handle this as well as expensive ones. The answer is either right or obviously wrong, so there's no back-and-forth needed.

Haiku's verdict: fast, correct, no fluff. It identified the root cause immediately, gave a one-line fix, and added no unnecessary explanation. Token cost was $0.0033 per run, negligible at any volume.

The other outputs ranged from correct but slightly verbose, to correct but padded with unrequested refactor suggestions, to thorough but massively over-engineered for the task.
Bottom line
Haiku is fast, cheap, and more than capable for error explanation. Paying for Sonnet or Opus here is waste.