Part I · Understanding the Models
02
Model Personalities
How different models approach the same task, and when to lean into that
Have you ever asked two models the same vague question and gotten one that asks for clarification and one that rewrites half your app?
Ask Claude and Gemini the same thing: "Clean up my user profile component." Gemini formats the file. Claude rewrites the component hierarchy, extracts a hook, strengthens the types, and leaves a comment about a potential race condition it noticed in the auth flow.
Same request. Completely different output. Neither of them is wrong.
This is model personality. Not benchmark scores — those tell you what a model can do. Personality tells you what it will do when you leave things open. How much initiative it takes, how it handles ambiguity, what it considers "cleaning up" versus "staying in my lane."
Two models with the same benchmark scores can produce wildly different results. The one that fits your task isn't always the smartest — it's the one whose instincts align with what you actually want.
Model Match
Gemini 3 Flash: the careful one
- Literal-minded: does exactly what you say
- Risk-averse: picks the safest approach
- Consistent in long sessions
- Best for: production refactors where surprises are costly
Claude Sonnet 4.6: the proactive one
- Genuinely creative: suggests better APIs
- Notices things you didn't ask about
- Best at explaining complex concepts
- Best for: feature design and architecture exploration
Claude Opus 4.6: the deep thinker
- Traces actual logic, not just patterns
- Thinks in systems and abstractions
- Proactive with high-signal observations
- Best for: hard problems, architecture reviews, subtle bugs
Gemini: the careful one
Google · Conservative · Best for: Production refactors
Gemini asks for permission. It's conservative, sticks close to what you asked for, and rarely goes off-script. Ambiguous task? It'll ask a clarifying question rather than assume and run.
It's literal-minded — does what you say, not what you might have meant. Ask it to "clean up this component" and it'll fix formatting. It won't restructure the hierarchy or suggest a different pattern unless you ask.
Multiple valid approaches? Gemini picks the safest, most conventional one. Reliable for production work, but you miss out on solutions that require a judgment call.
In long sessions with large contexts, it stays level. Doesn't drift or get "creative" with your architecture. For marathon refactoring, that steadiness matters.
Where it falls short: it won't push back on your approach, suggest alternatives, or notice you're solving the wrong problem. Reliable executor, not a thought partner.
GPT-5.4: the balanced one
OpenAI · Balanced · Best for: Everyday shipping
GPT sits in the middle — initiative when the situation is clear, deference when it's ambiguous. Output that feels sensible without being surprising. The 5.4 update shifted Codex into a general-purpose agent: 25% faster, fewer tokens, can run autonomously for hours.
It picks the approach most developers would pick. Stack Overflow's accepted answer energy — not the most clever, but the one your team will understand.
Fills in reasonable gaps (default error handling, obvious edge cases, standard patterns) but stops short of architectural decisions. High hit rate of "helpful without overstepping."
It's also good at reading the room. Detailed prompt? GPT stays precise. Loose prompt? It makes reasonable assumptions but flags them.
Format-wise, it's the most consistent. "Respond in JSON" or "only output code" — it complies where other models can't help adding commentary.
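As a sketch, a strict-format instruction might look like this (the wording is illustrative, not a canonical prompt):

```text
You are generating machine-readable output.
Return ONLY a JSON object with the keys "file", "line", and "fix".
No markdown fences, no commentary, no explanation before or after.
If no fix exists, set "fix" to null.
```

The point is to state the format, forbid everything else, and cover the failure case explicitly, so the model has no gap to fill with prose.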
GPT-5.4 also lets you dial in reasoning depth. High reasoning kicks in for ambiguous or multi-step problems — slows down, considers edge cases, reasons before committing. Extra high is full deliberation mode: best for architecture decisions or debugging something gnarly. For simple tasks, leave it off.
Where it falls short: rarely surprises you with insight. Gives you what you asked for, not necessarily what you need. With 5.4's agentic capabilities, autonomy is powerful but guardrails matter.
Claude Sonnet 4.6: the proactive one
Anthropic · Proactive · Best for: Feature design
Claude has opinions and isn't shy about sharing them. Pushes back on your approach, suggests alternatives, notices bugs you didn't ask about, restructures code to be "better" — even when you just wanted a simple change. Sonnet 4.6 scores within 1.2 points of Opus on SWE-bench at one-fifth the cost, with a 1M context window available in Max Mode and automatic compaction for long sessions.
It might suggest a better API surface, flag that your data model will break at scale, or restructure code in a way you hadn't considered. Creative by instinct.
But that proactiveness cuts both ways. While implementing a feature, it'll spot naming inconsistencies, missing error boundaries, potential race conditions in adjacent code — and often fix them without being asked. Magical in short sessions. A 40-file diff in long ones.
It has strong preferences about structure, naming, and patterns — will refactor code to match its taste. Setting explicit constraints ("do not refactor unless I ask") is essential.
On the upside, it gives the clearest explanations of any model. Connects your specific code to the general principle. That's where the personality really shines.
Where it falls short: scope creep. Its instinct to be helpful means it expands tasks. Fix it with explicit constraints in your prompt or CLAUDE.md.
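A scope-limiting CLAUDE.md might look something like this sketch (the exact rules are illustrative; adapt them to your project):

```markdown
# CLAUDE.md (illustrative scope rules)

- Make only the changes I explicitly ask for.
- Do not refactor, rename, or restructure unless requested.
- If you notice an unrelated issue (bug, naming, missing error
  handling), mention it in a note at the end; do not fix it.
- Keep diffs small: prefer several focused changes over one large one.
```

The "mention it, don't fix it" rule keeps the model's observations, which are often valuable, without letting them expand the diff.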
Claude Opus 4.6: the deep thinker
Anthropic · Deep · Best for: Hard problems & architecture
Opus isn't Sonnet with the volume up. Where Sonnet notices things, Opus understands them — traces actual logic rather than pattern-matching, reasons about systems rather than files. Opus 4.6 requires Max Mode on request-based plans and supports up to 1M tokens in Max Mode at standard per-token rates with no long-context surcharge.
It catches bugs three levels of indirection deep, identifies race conditions by simulating concurrent execution, and spots type issues TypeScript misses.
It's also architecturally creative in a way the other models aren't. Suggests different abstractions entirely and explains why your current approach will cause problems two features from now. Thinks in systems, not just code.
The tradeoff: it's thorough to the point of slow. Considers more options, explores more edge cases, gives more complete answers. Overkill for quick tasks — but for hard problems and architecture decisions, the thoroughness pays for itself.
Where Sonnet might fix a naming inconsistency, Opus notices your abstraction is leaking, explains why, and suggests a restructure that fixes the root cause.
Opus supports high and extra high reasoning — and unlike lighter models, it uses that budget meaningfully. High reasoning is worth enabling whenever the task has real depth: a subtle bug, a non-obvious refactor, a design with long-term consequences. Extra high is for the hardest problems, when you genuinely don't know the right answer and need the model to work through the problem space first.
Where it falls short: cost and latency. For routine tasks (scaffolding, simple refactors, boilerplate) you're burning money without proportional value. Save it for problems that actually need it.
Composer 2: the agentic one
Cursor · Agentic · Best for: End-to-end tasks
Composer 2 is Cursor's own model, trained specifically for agentic coding. It doesn't just edit files; it runs terminal commands, reads the output, makes more edits, and loops until the task is done. It's the closest thing to an AI developer that can execute end-to-end.
Give it a task spanning multiple files with verification — "add this feature, make sure tests pass, fix type errors" — and it works through the steps autonomously. Runs the build, reads errors, fixes them, reruns.
It also self-corrects. Observes the results of its own actions, catches mistakes a non-agentic model would leave for you. Sees the TypeScript error, understands it in context, fixes it — no copy-pasting errors back into a prompt.
Not limited to open files either. It navigates the project, finds relevant files, makes coordinated changes across many of them. Draws from the Auto + Composer pool, which includes more usage than the API pool — making it cost-efficient for everyday agentic work.
Where it falls short: autonomy has a cost. It can go down wrong paths and make a lot of changes before you realize it's off track. Short task scopes and frequent checkpoints are essential.
How personality affects prompting
Match your prompt style to the model's instincts. Be explicit with Gemini — it won't infer intent. Set hard scope limits with Claude ("mention issues but don't fix them") or its helpfulness will expand your task. Give Composer 2 a clear success condition and let it run, but check in at breakpoints.
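The same task, phrased for each personality. These prompts are illustrative sketches, not canonical templates, and the function names (`getUser`, `fetchUser`) and file are hypothetical:

```text
# Gemini: spell everything out
Rename getUser to fetchUser in user-service.ts and update all call
sites. Do not change any other code.

# Claude Sonnet: set hard scope limits
Rename getUser to fetchUser and update all call sites. If you notice
other issues, list them at the end but do not fix them.

# Composer 2: give a success condition, then let it run
Rename getUser to fetchUser across the repo. The task is done when
the type checker and the test suite both pass.
```

Notice the shape: Gemini gets an exhaustive spec, Claude gets the spec plus a fence around its helpfulness, and Composer 2 gets a verifiable end state instead of step-by-step instructions.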
The right model isn't the smartest one — it's the one whose personality fits the task.