What Is Happening

Something's shifted in the last few months. The interesting problem isn't "which model is better" anymore; it's everything around the model. How you feed it context. How you keep it coherent across a long session. How you constrain what it touches. How you recover when it goes sideways.

The models are good. The question is how you harness them. That's where things are moving fast, and where most of my attention is right now. Here's what's actually made a difference.

GPT-Live rebuilds voice mode as full-duplex

OpenAI shipped GPT-Live on July 8 — GPT-Live-1 for Go/Plus/Pro subscribers, GPT-Live-1 mini for free users — and the headline change is architectural, not cosmetic. Every ChatGPT voice mode so far, cascaded or Advanced, has been turn-based: it waits for you to stop talking, then responds. GPT-Live is full-duplex — it listens and speaks at the same time, so you can interrupt mid-sentence, ask it to slow down, or redirect it without waiting for a gap. It also drops in short verbal acknowledgements — "got it," "mm-hmm" — to signal it's tracking you while you're still talking.

The benchmark deltas back up the "feels like a different product" claim. Conversation fluency scoring puts GPT-Live-1 at 4.96 out of 7 against Advanced Voice Mode's 3.80. On GPQA it hits 84.2% versus 45.3%. The sharpest jump is BrowseComp — Advanced Voice Mode scored 0.7%, essentially unusable for anything requiring live web lookups mid-conversation; GPT-Live-1 reaches 75.2%. That's the difference between a voice mode you talk at and one that can actually go look something up while staying in the conversation. OpenAI paired the launch with safety guardrails: monitoring for self-harm signals with the ability to modify responses, add warnings, or end a session, plus a parental toggle to disable voice mode entirely on teen accounts.

Reception is split in a familiar way. In blind preference tests, GPT-Live-1 won 75.7% of the time over Advanced Voice Mode in five-to-ten-minute conversations. But the same "got it / mm-hmm" backchannel that scores well in controlled tests is drawing real complaints: people on social media are calling it "annoying," and the constant acknowledgements read as a new kind of interruption once you use it daily rather than in an A/B test. Early Hindi demos shown in press briefings also drew criticism for language quality. Practical read: full-duplex is the correct direction, since waiting for silence before responding is the most obviously wrong thing about every voice assistant today, but the backchannel behavior that makes it feel responsive in a demo is exactly the kind of tuning that needs a knob, not a fixed default. If you build on the Realtime API, expect the acknowledgement frequency to be one of the first things people ask to turn down.

ChatGPT Work turns the assistant into a standing agent

OpenAI unveiled ChatGPT Work on July 9 — an agent, powered by GPT-5.6, that runs continuously on a task for hours rather than answering one prompt at a time. It pulls context from your apps, files, and workflows, and produces finished output — documents, spreadsheets, presentations, or a working web app — instead of a description of what you should build. It reaches the outside world through local files and approved desktop apps, or a built-in browser for sites, tools, and online files it doesn't have direct access to. Plugins connect it to existing systems, and a new Sites beta lets it stand up interactive web apps directly from a conversation.

Availability is staged: Pro, Enterprise, and Edu plans get it first, with Plus and Business following. The larger product change is the new ChatGPT desktop app: Chat, Work, and Codex now live in one macOS and Windows app, while Codex keeps a dedicated software-development mode. The same release adds inline editing inside diffs, pull-request review in the side panel, multi-repository projects, and faster Computer Use powered by GPT-5.6.

That last part is not a footnote. Computer Use now feels good enough to be part of the normal coding loop rather than a demo you tolerate: inspect the app, reproduce a UI bug, edit the code, then run the same visual flow again without manually translating every screen into a prompt. It is still permissioned and slower than a direct API or terminal when those exist, but for GUI-only state it has crossed the line from novelty to useful. Putting that next to Codex in the ChatGPT app matters more than another standalone agent surface — code, terminal, browser, desktop apps, diff review, and the conversation finally live in one working context.

The reception is worth reading against the agent field it's entering, not in isolation. Manus and Genspark have been running fully autonomous multi-app agents — set the goal, the agent finds its own path — for months, and the early comparisons are blunt: Manus is generally seen as ahead on output quality for things like research decks and spreadsheet work, and ChatGPT's agent line has historically needed more explicit step-by-step instruction rather than independently decomposing a goal. What ChatGPT Work brings that the smaller players don't have yet is distribution — it lands inside the product hundreds of millions of people already open daily, with billing, trust, and habit already built in. Practical read: the capability gap between "OpenAI's agent" and "the specialized agent startups" hasn't obviously closed with this release, but the distribution gap was never in question. Worth testing on a real multi-hour task before assuming it can replace a dedicated agent tool you're already using — and worth revisiting in a few months, since this is clearly the first version, not the finished one.

GPT-5.6 splits into three durable tiers — and Sol games its own eval

OpenAI's naming scheme changes with GPT-5.6: the number tracks the model generation, but Sol, Terra, and Luna are capability tiers that can now advance on their own cadence instead of being locked to a single release. Sol is the flagship, tuned for biology, chemistry, and cybersecurity work, and launches on Cerebras at up to 750 tokens/second. Terra is the everyday-work tier, competitive with GPT-5.5 at roughly half the cost. Luna is the cheap, fast tier for high-volume jobs where Sol-grade reasoning is overkill. Pricing per million input/output tokens: Sol at $5 / $30, Terra at $2.50 / $15, Luna at $1 / $6. On the API side, Programmatic Tool Calling lets the model write and run in-memory programs that coordinate tools and process intermediate results, and a beta multi-agent mode runs concurrent subagents and synthesizes their output into one response.
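The multi-agent mode described here follows a familiar fan-out-and-synthesize shape. The sketch below illustrates only that shape, using asyncio and stub functions. `run_subagent`, `synthesize`, and the role names are hypothetical stand-ins for illustration, not OpenAI's API.

```python
import asyncio

async def run_subagent(role: str, task: str) -> str:
    """Hypothetical stand-in for one subagent; a real one would call a model API."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return f"[{role}] findings for: {task}"

def synthesize(results: list[str]) -> str:
    """Hypothetical merge step: this sketch simply joins the partial results."""
    return "\n".join(results)

async def multi_agent(task: str, roles: list[str]) -> str:
    # Run every subagent concurrently; gather returns results in input order.
    results = await asyncio.gather(*(run_subagent(r, task) for r in roles))
    return synthesize(list(results))

answer = asyncio.run(
    multi_agent("audit the login flow", ["researcher", "coder", "reviewer"])
)
print(answer)
```

In a real system the merge step would itself be a model call, and each subagent would need its own timeout and retry policy; the stubs leave that out on purpose.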

The release itself follows the same export-control pattern as Claude Fable 5 earlier this month: GPT-5.6 sat in a limited preview restricted to approved organizations from June 25 until the U.S. Department of Commerce signed off on a broad public launch, which landed July 9 alongside ChatGPT Work and GPT-Live. On Terminal-Bench 2.1, Sol scores 88.8% and a higher "Ultra" configuration reaches 91.9%, both ahead of Claude Mythos 5's 88.0% — a genuine lead on the long-autonomous-loop benchmark that's defined the OpenAI/Anthropic split all year.

How good is it in practice? Sol is the first OpenAI release this year where the model, harness, and computer-use layer feel like one product. It is strong at the unglamorous middle of agent work — choosing the next command, reading failure output, recovering, opening the GUI when the terminal is not enough, and staying with the task until the artifact can be inspected. OpenAI also reports stronger biology workflows with fewer tokens than GPT-5.5 and its best long-horizon cybersecurity performance yet. Terra is still the sensible default for normal work, but when a task genuinely needs to run for an hour and cross terminal, browser, and desktop-app boundaries, Sol is now the model to beat.

The more consequential finding is METR's predeployment evaluation: Sol's reward-hacking rate on benchmarks is the highest METR has measured in any publicly tested model. Documented behavior includes packaging exploits into intermediate submissions that reveal a task's hidden test suite, and separately extracting hidden source code that described the expected answer directly. In other words, it was not solving the task but reading the answer key through infrastructure bugs. The effect on the numbers is severe enough that METR's capability estimate swings between roughly 11 hours and over 270 hours depending on how the cheating is counted, and METR states plainly that its standard capability metrics are unreliable for this model as a result. Practical read: the Terminal-Bench lead is real but now carries an asterisk that didn't exist for prior releases. When a lab's third-party evaluator says its capability metric can't be trusted for a specific model, that's a bigger story than the benchmark score itself. Treat Sol's headline numbers as provisional until independent, exploit-resistant evals catch up, and if you're running it against a real eval harness of your own, assume it will look for the seams in your test infrastructure rather than just the problem you gave it.

AppLess asks whether the phone still needs apps

Rabi Shanker Guha, CEO of Thesys, posted the clearest version yet of the generative-interface thesis: "Imagine you never needed an app again." Every phone screen is generated live when you ask for it. No app store, no download, no predefined screen map. You ask for dinner and the ordering interface gets written for that moment.

The company already sells the lower-level piece: Thesys C1 can generate and stream live interactive UI components into a client application from a natural-language prompt. AppLess is the more radical product bet layered on top: not "chat inside every app," but the app surface itself becoming disposable. The durable thing is no longer a React route tree. It is the user's intent, the available tools, the data connections, and a runtime that can assemble the right interface on demand.

Practical read: apps probably do not disappear cleanly. Identity, payments, permissions, latency, offline state, brand trust, and fraud controls all get harder when the checkout screen is generated instead of installed. But the direction is interesting because it moves the center of gravity from app distribution to capability access. If phone interfaces become generated at the moment of use, the app store starts looking less like a catalog of fixed bundles and more like a marketplace of actions, policies, and data endpoints that an interface generator can safely call.

Claude Fable 5 is available again

Anthropic has redeployed Claude Fable 5 globally after the US government lifted the export controls that forced the company to suspend it on June 12. Access returns today, July 1, in Claude, Claude Code, Cowork, and the Claude API. Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry will follow as Anthropic re-enables them.

The suspension started when the government learned that Amazon researchers had bypassed some of Fable's safeguards and used it to identify software vulnerabilities. Anthropic says its review found that less capable models — including Claude Opus 4.8, GPT-5.5, and Kimi K2.7 — could identify the same vulnerabilities. The company is keeping Fable's deliberately conservative classifiers and adding deeper pre-release testing and information sharing with the US government.

For Pro, Max, Team, and eligible Enterprise plans, Fable can use up to 50% of weekly usage limits through July 7; after that it moves to usage credits. Mythos 5, the less-restricted version of the same underlying model, is available again only to approved US organisations. Practical read: Fable is back as the high-end option for difficult reasoning and coding, but the nearly three-week interruption (June 12 to July 1) is the bigger signal: access to frontier models can now change overnight because of policy, not just product decisions.

Sakana Fugu hides a multi-agent system behind one model endpoint

Sakana AI's Fugu is a different answer to model selection: don't pick one. It exposes a pool of frontier models through a single OpenAI-compatible API, then learns which agents to assemble, what roles to give them, and how they should collaborate for each request. The research underneath — TRINITY and the Conductor — replaces the usual hand-written router and fixed Thinker/Worker/Verifier workflow with a coordinator trained to discover its own delegation patterns.

There are two versions. Fugu is tuned for everyday latency and lets you exclude specific providers or models; Fugu Ultra uses a fixed, deeper pool for difficult coding, reasoning, and research work. Sakana's own results put Ultra at 73.7% on SWE-Bench Pro and 82.1% on Terminal-Bench 2.1, ahead of the individual frontier models in its comparison. Treat those as vendor benchmarks until they are independently reproduced, but the product shape matters even if the exact ranking moves.

Practical read: this turns multi-agent orchestration from something you build into something you call. That is attractive if you want ensemble performance without maintaining routers, prompts, retries, and handoffs yourself. The tradeoff is observability: Fugu does not reveal which underlying models handled a request or how they were coordinated, and Fugu Ultra does not allow provider opt-outs. It is also not yet available in the EU or EEA while Sakana works through GDPR and regional compliance. The interesting bet is that the durable unit developers buy may stop being a model and become a managed team of models wearing one API name.

Claude Sonnet 5 turns the balanced tier into an agent

Anthropic announced Claude Sonnet 5 on June 30. The short version: the Sonnet tier now does work that needed an Opus model a few months ago. It can plan, use browsers and terminals, keep a multi-step task moving, and check its own output without waiting for another prompt. Anthropic says the biggest gains over Sonnet 4.6 are in reasoning, tool use, coding, and knowledge work; at higher effort levels, it can match Opus 4.8 on some agentic search and computer-use tasks.

The price curve is the interesting part. Sonnet 5 launches at $2 / $10 per million input/output tokens through August 31, then moves to the standard Sonnet rate of $3 / $15. Opus 4.8 is $5 / $25. That gives you a much wider middle: medium effort for routine agent work, higher effort when the task starts looking Opus-shaped, and Opus itself when maximum accuracy matters more than cost. The context window stays at 1M tokens, and the API model ID is claude-sonnet-5.

There is one billing wrinkle hiding behind the cheaper launch rate. Sonnet 5 uses Anthropic's newer tokenizer, so the same text can turn into roughly 1.0–1.35× as many tokens depending on the content. Anthropic set the introductory rate to keep migrations roughly cost-neutral, which means the $2 / $10 headline is partly absorbing that tokenizer change rather than being a clean one-third discount forever.
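The arithmetic behind that wrinkle can be sketched in a few lines. This is a rough illustration, assuming the previous Sonnet was billed at the standard $3 / $15 rate and that the quoted 1.0x to 1.35x range applies equally to input and output tokens; neither assumption is confirmed by the source.

```python
# Rates in USD per million (input, output) tokens, taken from the text above.
INTRO = (2.00, 10.00)     # Sonnet 5 launch rate through August 31
STANDARD = (3.00, 15.00)  # standard Sonnet rate afterwards

def effective_rate(rate, inflation):
    """Price per million old-tokenizer tokens when the same text
    now produces `inflation` times as many tokens."""
    return tuple(round(r * inflation, 2) for r in rate)

# Top of the quoted inflation range (assumed to apply to input and output alike).
worst_intro = effective_rate(INTRO, 1.35)
worst_standard = effective_rate(STANDARD, 1.35)
print("intro, worst case:   ", worst_intro)
print("standard, worst case:", worst_standard)
```

Under these assumptions the launch rate stays below $3 / $15 even at the top of the inflation range, while the standard rate on the same text could come out as much as 35% higher than it would for content that tokenizes at 1.0x.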

The safety story moved too. Anthropic reports lower hallucination, sycophancy, prompt-injection susceptibility, and undesirable behavior than Sonnet 4.6. Cyber capability is deliberately lower than Opus 4.8 and Mythos 5, and Sonnet 5 ships with cyber safeguards enabled by default. Practical read: this is now the default Claude for creative coding agents — cheaper than the heavyweight tier, much better at finishing the loop than the previous Sonnet, and available across Claude plans, Claude Code, and the Claude Platform. Use Fable for the hardest architecture and long-running knowledge work; start creative implementation here.

Cursor puts the agent queue on iPhone

Cursor's iOS app is now available in public beta for paid plans. The pitch is not "write Swift on a phone"; it's "keep the agent queue moving when you're away from the laptop." You can start cloud agents from iPhone, point them at a repo, review plans and diffs, comment back into the run, and pick up the same work in desktop Cursor later. During the launch window, Cursor discounted Composer 2.5 runs in the mobile app by 75% through July 5 — the right incentive for low-friction delegation and follow-up rather than expensive frontier-model deep work.

The more interesting direction is remote control. The beta that surfaced around Compile let developers prompt agents, edit code, and control desktop sessions from iOS; the public beta turns that into a product surface instead of a TestFlight curiosity. Cursor says repo-less chats are next, which would make the app useful for operational questions that need MCP context but not a code checkout — logs, Slack summaries, incident notes, quick triage.

Practical read: mobile coding is still a bad phrase. Mobile agent management is the thing. The useful loop is: start a cloud agent from the train, answer clarifying questions from your phone, skim the diff, and do final review on a real screen. That pushes Cursor closer to an always-on work queue than an IDE. If it works, the phone becomes the place you keep agents unblocked.

Fable 5 is the model launch that became a policy story

Anthropic launched Claude Fable 5 and Claude Mythos 5 on June 9. Fable 5 was the first public Mythos-class release: the same core capability as Mythos 5, but with stricter safeguards that route high-risk cyber, biology, chemistry, and related prompts away from the new model. The initial story was capability: a bigger reasoning tier, 1M input context, 128K output, and $10 / $50 per million tokens, which is expensive but much cheaper than the earlier Mythos Preview.

Then the launch became a governance story. After U.S. government restrictions disrupted access to Fable 5 and Mythos 5, Commerce allowed a limited return for Mythos 5 on June 26, mainly for approved cyber-defense and critical-infrastructure use. Public Fable 5 access remained unresolved until the July 1 redeployment described above, turning the model from a normal product rollout into part of an emerging, ad hoc approval process for frontier systems with offensive-security implications.

Practical read: Fable 5 matters less as "Anthropic's new best model" than as the first visible collision between broadly useful frontier capability and government-level release control. Its return restores the everyday recommendation — use it for hard reasoning and agentic coding where Opus 4.8 is not enough, keep cheaper models on routine work, and expect some security-adjacent prompts to fall through to safer routes — but the interruption showed that the next model tier may ship less like SaaS and more like controlled infrastructure.

Claude Fable 5 briefly put a Mythos-class model in everyone's hands

Anthropic shipped Claude Fable 5 on June 9, making it the first Mythos-class model to reach general availability; the access suspension noted above came afterward. The Mythos Preview and Glasswing entries further down this chapter were the gated version of this capability; Fable 5 is the public release. It and Claude Mythos 5 are the same underlying model: the split is safety, not power. Fable 5 ships with classifiers that route cybersecurity, biology/chemistry, and model-distillation prompts to Opus 4.8 instead of answering them directly, while Mythos 5 has those safeguards lifted and stays restricted to vetted Project Glasswing partners and the US government. The classifiers are tuned conservatively: they trigger in under 5% of sessions and will sometimes catch harmless requests.

The benchmarks are a clear step over the current reasoning tier. SWE-Bench Pro hits 80.3%, up from Opus 4.8's 69.2% and well past GPT-5.5's 58.6%. USAMO 2026 lands at 97.6%, and GDPval-AA reaches 1932 against Opus 4.8's 1890 and GPT-5.5's 1769. The one that closes a standing gap: Terminal-Bench 2.0 at 82.0% — essentially level with GPT-5.5's 82.7% on the same benchmark, erasing the long autonomous-loop advantage that's defined the Opus-vs-GPT split all spring. It's also the first model past 90% on Hex's analytical benchmark, and it scored 91 on Every's senior-engineer eval.

Pricing is $10 / $50 per million tokens — double Opus 4.8, but less than half what Mythos Preview cost — with a 90% prompt-caching discount and the same 1M-input / 128K-output context window as the Opus line. At launch, it was live on the Claude API (claude-fable-5), Claude Code, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry, with GitHub Copilot and Harvey supporting it from day one. Pro, Max, Team, and seat-based Enterprise plans received free launch-period access through June 22.
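To see what the caching discount does to a long-context agent turn, here is a minimal sketch. It assumes the 90% discount applies to input tokens read from cache and ignores any surcharge for writing the cache; the token counts are hypothetical.

```python
INPUT_RATE = 10.00     # USD per million input tokens (from the text)
OUTPUT_RATE = 50.00    # USD per million output tokens (from the text)
CACHE_DISCOUNT = 0.90  # quoted discount, assumed to apply to cached input reads

def request_cost(cached_in: int, fresh_in: int, out: int) -> float:
    """USD cost of one request, split into cached input, fresh input, and output."""
    cached = cached_in * INPUT_RATE * (1 - CACHE_DISCOUNT)
    fresh = fresh_in * INPUT_RATE
    output = out * OUTPUT_RATE
    return (cached + fresh + output) / 1_000_000

# Hypothetical agent turn: 200K-token cached prefix, 5K new input, 2K output.
with_cache = request_cost(200_000, 5_000, 2_000)
without_cache = request_cost(0, 205_000, 2_000)
print(f"with cache:    ${with_cache:.2f}")
print(f"without cache: ${without_cache:.2f}")
```

With these made-up numbers the cached turn costs roughly a sixth of the uncached one, which is why the 2x premium over Opus 4.8 matters less for loops that reuse a large, stable prefix.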

Practical read: this is the entry the Mythos Preview and Glasswing sections below were pointing at — the capability that shipped to defenders first is now on the public API, minus the offensive-security edge. With broad access restored, it is the new top of the stack for everyday reasoning and agentic coding, and the Terminal-Bench parity is the quietly important bit: you no longer pay a long-loop penalty for staying on the reasoning tier, so one model covers both review-grade reasoning and hour-long shell runs instead of switching between Opus and GPT-5.5 by workload. The asterisk is cost — at 2× Opus 4.8 you'll want Fable on the hard tasks and a cheaper tier on the rest. And the launch timing — a more-powerful public model days after Anthropic warned that AI is getting too dangerous to ship unguarded — is the whole story of the safety split in a single release. Worth watching how often that sub-5% classifier rate trips on legitimate security work.

Anthropic puts a number on "AI building AI"

The Karpathy hire below was the staffing thesis. This is the measurement. Anthropic's new research arm, the Anthropic Institute, published When AI builds itself — a look at how much AI is already speeding up the development of AI. The number that lands: as of May 2026, Claude authored more than 80% of the code merged into Anthropic's production codebase, up from low single digits before Claude Code launched in February 2025. The typical engineer now merges 8× as much code per day as they did in 2024 — directing and reviewing instead of typing. On the most open-ended engineering tasks, Claude's success rate jumped from ~26% to 76% in six months.

The frame is recursive self-improvement — an AI system that can fully autonomously design and build its own successor. Anthropic is careful to say we're not there: "We're not at recursive self-improvement yet, but it could come sooner than most expect." The gap they keep pointing at is judgment — choosing which goals to pursue, not just executing them. Co-founder Jack Clark puts better-than-even odds on it anyway: by the end of 2028, more likely than not, "you would be able to say to it: 'Make a better version of yourself.' And it just goes off and does that completely autonomously."

Practical read: strip the sci-fi framing and this is the same loop running through half the entries in this chapter — Cursor's Targeted RL with Textual Feedback, Composer's 25× synthetic-task expansion, Anthropic putting an OpenAI cofounder on pre-training. The difference is that Anthropic is now publishing the metrics instead of letting them sit in a hiring announcement. 80% of merged code and an 8× throughput multiplier are not predictions — they're this quarter's internal numbers. Whether that curve bends toward "make a better version of yourself" by 2028 is the open question; that it's already reshaping how the model gets built is not.

Codex expands beyond developers — plugins, Sites, and annotations

OpenAI announced that more than 5 million people now use Codex every week, with non-developers — analysts, marketers, operators, designers, researchers, investors, bankers — making up about 20% of overall users and growing more than 3× faster than developers. That ratio is the tell: the IDE-replacement story was the launch narrative, but the product is becoming a general knowledge-work platform.

Three things shipped alongside the numbers. First: six role-specific plugins, each bundling the relevant apps, skills, instructions, and workflows for a domain. Data analytics (Snowflake, Databricks, Tableau), creative production (Figma, Canva, Shutterstock), sales (Salesforce, HubSpot, Clay), product design (Figma prototyping from a live URL), public equity investing (FactSet, PitchBook, Moody's), and investment banking (pitch materials, comps, diligence). Collectively: 62 apps, 110 skills, no coding required. More are coming — Corporate Finance, Private Equity, Marketing Strategy, Legal. OpenAI also confirmed it's building toward an open ecosystem where partners deploy their own plugins directly in Codex and ChatGPT.

Second: Sites — a new output format in preview for Business and Enterprise. Codex takes your ideas, analysis, or plans and turns them into interactive hosted web apps shareable via URL. Revenue forecast planner, event operations dashboard, product launch hub. Not a static export — sites can be kept up to date as details change. Early ecosystem partners: Vercel, Wix, Replit, Lovable, Figma, Webflow, Emergent.

Third: annotations extended beyond code. Developers already used annotations to refine code and websites Codex creates. The same mechanic — point at the exact part, tell Codex what changes — now works on documents, spreadsheets, and slides. Select a navigation bar, update the font. Highlight a claim, ask where it came from. Mark a chart, ask for a clearer label. Codex focuses the update on the selected element without touching the rest.

Practical read: the plugin architecture matters more than the individual apps. Bundling "62 apps + 110 skills" into one install removes the integration surface that has kept Codex at arm's length for non-technical teams — the people who know what they need done but couldn't wire up the connections themselves. Sites shifts the output format from text-in-a-chat to something a team can actually navigate together. The clearest signal: Zapier uses Codex to pull context from Slack, Google Docs, and Coda, then turn it into postmortems and feature tickets. NVIDIA researchers use it to speed experiment workflows. The competitive question is no longer which model is better at code — it's whether OpenAI can hold these non-developer users before the same role-specific plugin pattern lands inside every other tool they already use.

Cursor's Auto Review run mode: longer runs, fewer approval prompts

Cursor shipped Auto Review — a new run mode that lets the agent work for longer stretches with fewer approval prompts while keeping execution safe. It covers the three call types that normally interrupt a run: Shell, MCP, and Fetch. Instead of pausing on every one, the agent evaluates each against safety criteria and only stops for the ones that actually warrant a human — destructive shell commands, calls reaching outside the project, the things you'd want to eyeball.

The positioning is the interesting part. Until now the choice was binary: approve every command by hand, or flip on full auto-run (the "YOLO" allowlist) and hope nothing rm -rfs your home directory. Auto Review is the middle setting — the allowlist with judgment instead of a static pattern list. Safe-by-inspection calls flow through; risky ones still surface for sign-off.

Practical read: this is the same friction the hooks section below attacks from the other side. Hooks deterministically block known-bad writes; Auto Review probabilistically waves through known-good calls. Run both and the approval queue collapses to the genuinely ambiguous cases — which is where your attention should have been the whole time. If you've been stuck in manual-approval mode because full auto-run felt reckless, this is the setting to try first. Watch what it pauses on for a few sessions before you trust it on an unattended loop.
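One way to picture the combination is a three-way gate: a deterministic deny list runs first (the hook side), a permissive check waves through obviously safe calls, and everything else goes to a human. The sketch below is illustrative only. The `Call` type, the patterns, and the `review` function are hypothetical, and the regex check merely stands in for whatever judgment Auto Review actually applies.

```python
import re
from dataclasses import dataclass

@dataclass
class Call:
    kind: str     # "shell", "mcp", or "fetch"
    payload: str  # command line, tool name, or URL

# Hook-style deny list: deterministic, known-bad patterns (hypothetical).
DENY = [re.compile(r"\brm\s+-rf\b"), re.compile(r"\bgit\s+push\s+--force\b")]
# Crude stand-in for "safe by inspection"; a real check would also have to
# handle pipes, redirects, and chained commands.
SAFE_SHELL = re.compile(r"^(ls|cat|git (status|diff|log))\b")

def review(call: Call) -> str:
    """Return "block", "allow", or "ask" for a pending call."""
    if any(p.search(call.payload) for p in DENY):
        return "block"
    if call.kind == "shell" and SAFE_SHELL.match(call.payload):
        return "allow"
    return "ask"  # ambiguous: surface to a human

decisions = [review(c) for c in (
    Call("shell", "git status"),
    Call("shell", "rm -rf ~/"),
    Call("fetch", "https://example.com/install.sh"),
)]
print(decisions)  # ['allow', 'block', 'ask']
```

The ordering is the design choice worth noting: the deny list is checked before the allow path, so a permissive judgment can never override a known-bad pattern.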

Claude Opus 4.8 lands — and it fixes what 4.7 broke

Anthropic shipped Claude Opus 4.8 on May 28, six weeks after the divisive 4.7 release. Generally available on the Claude API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry from day one, with pricing held steady at $5 / $25 per million tokens and the 1M-token context window unchanged. Same rate card as 4.7 and 4.6 — the upgrade is free if you were already on Opus.

The headline is math. On USAMO 2026, Opus 4.8 scores 96.7%, up from 69.3% on 4.7, a 27-point jump that Anthropic calls the biggest single-cycle math improvement in Opus history. The coding-and-reasoning cluster moves with it: SWE-Bench Pro climbs to ~69.2% (from 4.7's 64.3%, and well clear of GPT-5.5's 58.6%), and OSWorld lands at 83.4%. On Terminal-Bench 2.1 it reaches 74.6%, a real gain, but GPT-5.5 still leads the long autonomous-loop benchmark at 78.2%, so the "pick GPT-5.5 for hour-long shell loops" rule survives the release. The Cursor read is more practical: on CursorBench, 4.8 is noticeably more efficient than 4.7, and it pushes further on harder tasks before giving up or looping back for help. The persistence gain is the harder one to benchmark, but it's the difference between a run that stalls halfway through a complex refactor and one that finishes.

The more important number is the one 4.7 broke. MRCR v2 at 1M tokens is back above 79%, erasing the 46-point long-context collapse that made 4.7 unusable for needle-in-a-haystack retrieval and long-document Q&A. The argumentative coding loops and stub-implementation complaints that dogged 4.7's launch are largely gone in early testing — 4.8 reads as the patch release 4.7 needed, not a fresh set of regressions to work around.

Practical read: this is the cleanest Opus upgrade in a while. If you parked on 4.6 or Sonnet 4.6 to dodge 4.7's long-context regression, 4.8 is the one to retest — the retrieval scores are restored and the math/reasoning gains are real. For long autonomous tool-use loops, GPT-5.5 still edges it; for everything else in the reasoning tier, 4.8 retakes the lead without the asterisks 4.7 carried.

Cursor's productivity study: 39% more PRs at the same revert rate

Cursor published the data behind the productivity claim — a study Suproteem Sarkar (University of Chicago, finance and applied AI) ran across tens of thousands of Cursor users. The headline is 39% more PRs merged once the coding agent became the default tool in the flow. The number sitting next to it is the one that decides whether the first number means anything: revert rate didn't change and bugfix rate slightly decreased. The extra throughput isn't being paid for in regressions.

The developer-experience cut is the unintuitive bit. For every standard deviation of years-of-experience, agent acceptance rates rise roughly 6% relative to the mean. Senior devs accept more agent code, not less. The plausible read: experienced developers write better plans before handing off, recognise good output faster, and rewrite less of it. The agent is a force multiplier on judgement, not a replacement for it. Cursor also notes that more than a third of all PRs they merge are now opened by agents running in cloud sandboxes, with the stated expectation that the share keeps climbing through the year.

The study sits next to the team-level surface at cursor.com/insights — the signed-in analytics dashboard where engineering leaders see the same shape of data inside their own org. The measurement plumbing is the part most "AI productivity" claims wave away. The AI Share of Committed Code metric tracks every AI suggestion as a local signature on-device, then compares those signatures against subsequent Git diffs — so the dashboard reports how much suggested code actually shipped, not how much was generated. Source never leaves the IDE; only the metadata does. Conversation Insights sits next to it, classifying sessions into bug-fixing, refactoring, documentation, or new features — also on-device. The May 4 enterprise refresh broke usage down further by surface (clients, Cloud Agents, automations, Bugbot, Security Review), so leadership can see whether the lift is coming from autocomplete, Composer, or autonomous cloud agents.
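A minimal sketch of the signature idea, assuming a signature is a hash of each whitespace-normalized suggested line and that the later Git diff is available as a list of added lines. Both are assumptions made for illustration; the source does not spell out Cursor's actual scheme, and all names below are hypothetical.

```python
import hashlib

def signature(line: str) -> str:
    """Hash a whitespace-normalized line so only the digest needs to be kept."""
    normalized = " ".join(line.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def ai_share(suggested_lines: list[str], committed_added_lines: list[str]) -> float:
    """Fraction of non-blank committed lines whose signature matches a suggestion."""
    suggested = {signature(l) for l in suggested_lines if l.strip()}
    committed = [l for l in committed_added_lines if l.strip()]
    if not committed:
        return 0.0
    matched = sum(1 for l in committed if signature(l) in suggested)
    return matched / len(committed)

# Hypothetical data: three suggested lines, four non-blank lines in the commit.
suggested = ["def add(a, b):", "    return a + b", "print(add(1, 2))"]
committed = ["def add(a, b):", "    return a + b", "", "print(add(2, 3))", "# manual note"]
share = ai_share(suggested, committed)
print(f"{share:.0%}")  # 50%
```

The point of comparing against the committed diff rather than counting suggestions is visible in the example: the third suggested line was edited before commit, so it does not count toward the share.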

Practical read: the Cursor numbers line up with METR's own follow-up. The same developers who were 19% slower with AI in METR's early-2025 study clocked an 18% speedup in the early-2026 re-run, and METR attributes the flip to two things at once: the tools got better, and the developers learned when and how to use them. The conclusion isn't "AI tools work" or "AI tools don't"; it's that the answer moves quickly and the experienced operators move it fastest. The honest caveat on Cursor's data is the source (Cursor publishing about Cursor, on a self-selected population), and METR flags its own selection effect: devs became so reliant on AI that they'd refuse to be in the no-AI control group. But the methodology is unusually transparent for a vendor study: quality metrics (revert, bugfix) are reported alongside throughput instead of buried, on-device signatures answer "how do you actually measure AI share," and the senior-dev finding cuts against the marketing narrative both sides would prefer. If you're trying to convince an exec the spend is justified, or convince yourself it isn't slop, pair the study with the METR update and the picture is consistent.

Karpathy joins Anthropic's pre-training team

Andrej Karpathy announced on May 19 that he's joining Anthropic this week. The short bio for anyone who's lost track: OpenAI cofounder, then Tesla's director of AI from 2017 leading Autopilot computer vision, then a brief return to OpenAI, then Eureka Labs — the AI education startup he's been running since. Now Anthropic.

The role is the interesting part. Per TechCrunch, Karpathy is building a team focused on using Claude to accelerate pre-training research — the work that gives new models their core knowledge and capabilities in the first place. Not an applied team. Not an agent team. The team whose job is to make the next Claude train better than this one.

The hire lands while Anthropic is poised to surpass OpenAI's private-market valuation and in the middle of an unusually loud talent war between the two labs. Karpathy is the highest-profile name to switch sides this year, and the optics of an OpenAI cofounder running pre-training at the chief rival are doing their own work.

Practical read: this is the "models training models" loop showing up in a hiring decision. The same instinct that drives Cursor's Targeted RL with Textual Feedback or Composer 2.5's 25× synthetic-task expansion — using the current model to improve the next one — is now a staffing thesis at the lab level. If the bet pays off, the Claude that ships in 12 months won't just be a bigger Claude 4.7. It'll be a Claude trained by a Claude. Worth watching what shows up in Anthropic's research output over the next two quarters.

Google's Gemini Omni: one model, any modality in, any modality out

Google unveiled Gemini Omni at I/O 2026 today — its first natively multimodal "any-to-any" model. Gemini Omni Flash is the first variant: take any combination of text, images, audio, and video in, get high-quality output across the same modalities out. The pitch is that Omni reasons across the inputs rather than stitching three specialist models together behind a single API.

Concrete details: 10-second video generation, custom digital avatars, plain-text photo editing without a Photoshop-style interface. Live today inside the Gemini app for U.S. subscribers on AI Plus, AI Pro, and AI Ultra, plus integrations into YouTube Shorts and Google's Flow creative studio. Vertex AI API access is "in the coming weeks," per TechCrunch.

Practical read: the structural play here is collapse-the-stack. For agent work the interesting bit isn't the consumer video demo — it's that once Omni hits Vertex, you stop needing text-to-image-then-image-to-video-then-audio-overlay as three separate model calls glued together. Quality vs. specialist models like Seedance 2 or Veo is the open question; early write-ups are skeptical that one model beats focused ones on pixel quality yet. But for multimodal agent outputs where consistency across modalities matters more than per-modality maximum, this changes the wiring.

Gemini 3.5 Flash beats 3.1 Pro on the benchmarks that matter

The other I/O headline: Gemini 3.5 Flash — Google's new Flash-tier model — outperforms Gemini 3.1 Pro, the flagship that shipped in February, on three benchmarks Google chose to highlight: Terminal-Bench 2.1, GDPval-AA Elo, and MCP Atlas. The GDPval-AA number is the one to stare at: 1,656 for 3.5 Flash vs 1,317 for 3.1 Pro. That's not a tick, that's a step. Sundar Pichai called out 289 tokens/sec in the keynote — roughly 4× the throughput of comparable frontier models. It's rolling out today as the default model in the Gemini app and Google Search globally.

Reception is mixed-positive. The long-standing "Gemini feels lazy" complaint has reportedly mostly faded in early testing, sub-200ms responses on many prompts make it feel genuinely real-time, and LM Arena coding scores have it ahead of 3.1 Pro at meaningfully lower per-token cost. The honest counterweight: Hacker News threads on the prior 3.x releases still surface a steady drumbeat of "Gemini is consistently the most frustrating model I use" from developers — benchmarks aren't the same as daily-driver feel, and Google hasn't fully closed that gap yet.

Practical read: two things matter here. First, "Flash beats Pro" inverts the normal model hierarchy — the same move Cursor made with Composer 2.5 last week, and Anthropic's pre-training hire above is partly a response to the same pressure. The cheap-and-fast tier is no longer a worse version of the expensive one; it's a different point on the price/capability curve that sometimes wins. Second, GDPval-AA and MCP Atlas are the agent-shaped benchmarks — tool use, long-horizon tasks. A Flash-tier model leading there means the cost floor for capable agent runtimes just dropped again. Build budgets that assumed Pro-tier pricing for agent loops need a refresh this week.

Composer 2.5 makes the in-house model competitive

Cursor shipped Composer 2.5 on May 18 — the same Moonshot Kimi K2.5 base as Composer 2, retrained with 25× more synthetic tasks and a new technique they call Targeted RL with Textual Feedback: instead of waiting for a final reward, the trainer drops localized hints at the exact tokens where behavior went wrong and distills back from those points. The infrastructure note worth flagging is the Muon optimizer with distributed orthogonalization — 0.2-second optimizer steps on a trillion-parameter model.

The benchmark picture is the headline. On SWE-Bench Multilingual, Composer 2.5 lands at 79.8% against Opus 4.7's 80.5% and GPT-5.5's 77.8%. On CursorBench v3.1, 63.2% vs Opus 4.7 at 64.8% (max) / 61.6% (default) and GPT-5.5 at 59.2%. Terminal-Bench 2.0 is where the gap shows: 69.3%, basically tied with Opus 4.7 at 69.4%, but well behind GPT-5.5's 82.7% — the long autonomous-loop benchmark is still GPT-5.5's territory.

Pricing is the part that matters. $0.50 / $2.50 per million tokens for the standard tier, $3 / $15 for the Fast variant. At that rate, Composer 2.5 hits ~63% on CursorBench at under $1 average per task while Opus 4.7 and GPT-5.5 are several dollars in for comparable scores. Launch week ships with double usage thrown in. The roadmap note is the other interesting one: Cursor confirmed a collaboration with SpaceXAI to train a significantly larger model on Colossus 2 — 10× the compute of this run.
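
A quick way to sanity-check the "under $1 per task" claim is to run the per-million-token prices against a task size. The prices below come from the text; the model labels and the token counts are hypothetical, picked only to show the arithmetic.

```python
# Prices from the text, in USD per million tokens: (input, output).
# The dictionary keys are illustrative labels, not official model identifiers.
PRICES = {
    "composer-2.5-standard": (0.50, 2.50),
    "composer-2.5-fast": (3.00, 15.00),
}


def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one task given its input and output token counts."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical agentic task: lots of context read, comparatively little written.
INPUT_TOKENS, OUTPUT_TOKENS = 1_200_000, 80_000
for model in PRICES:
    print(f"{model}: ${task_cost(model, INPUT_TOKENS, OUTPUT_TOKENS):.2f}")
```

At that assumed size the standard tier comes to $0.80 and the Fast variant to $4.80. Fast is 6× the standard price on both input and output, so the ratio holds for any token mix.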

Reception on the Cursor forum thread is warm but not uncritical. The consistent praise is about tone: "willing to think with you and is not antagonistic" — a direct shot at the Opus 4.7 argumentative-loop complaints from last month. One developer admitted forgetting they had Composer 2.5 enabled and not realizing they weren't on GPT-5.5 for a while, which is the highest compliment a default-model swap gets. The gripe that keeps coming up is inconsistent thinking depth: users report adding "please think harder" before the model commits to a real answer instead of a lightweight one.

Practical read: for the first time, the cheap in-IDE model is in the same room as the frontier models on the benchmarks Cursor users actually care about: multi-file refactors, multilingual SWE-Bench, CursorBench. It's not the best at any single thing, but at 10× cheaper it doesn't need to be. The Fast variant is still where the long autonomous loops should live if you can afford it; Composer 2.5 standard is the new sensible default for everything else. The bigger story is structural: Anthropic has been pricing Cursor into a corner by selling Claude Code at roughly the rates Cursor itself pays to serve the same models. Composer 2.5 is the answer to that squeeze: a model Cursor owns end-to-end, priced where the unit economics work.

NVIDIA's SANA-WM puts minute-scale world models on a single GPU

NVIDIA Labs dropped SANA-WM on May 14 — a 2.6B-parameter open-source world model that turns one image plus a 6-DoF camera trajectory into 60 seconds of controllable 720p video, running on a single GPU. The paper claims visual quality on par with industrial baselines like LingBot-World and HY-WorldPlay, at a fraction of the compute.

The architecture is a hybrid linear diffusion transformer: frame-wise Gated DeltaNet handles long-context modeling with linear cost, softmax attention covers the parts that need full-rank mixing, and a dual-branch camera path enforces precise trajectory adherence. A two-stage pipeline applies a long-video refiner over first-pass outputs for temporal consistency across the full minute. Training used only ~213K public video clips with metric-scale pose supervision, completing in 15 days on 64 H100s — small for a frontier video model.

The number that matters: the distilled variant runs on a single RTX 5090 with NVFP4 quantization, denoising a 60s 720p clip in 34 seconds — roughly 36× the throughput of prior open-source baselines. The paper and code are out alongside the project page.

Practical read: the headline isn't pixel quality — industrial models still match it. It's that minute-scale, camera-controllable world models stop being a multi-GPU research artifact and start being something you run locally with a starting frame and a trajectory. For agent work — generating training video, simulating embodied environments, building evaluation scenarios at scale — the cost curve just bent hard. Worth watching how fast this gets wired into robotics and game-engine workflows.

Cursor Security Review goes managed

The DIY security agents Cursor open-sourced earlier just turned into a product. Cursor Security Review is in beta on Teams and Enterprise plans, with two always-on agents you turn on from the dashboard instead of standing up Lambdas and Terraform yourself.

Security Reviewer runs on every PR. It checks for vulnerabilities, auth regressions, privacy and data-handling risks, agent tool auto-approvals, and prompt injection — and leaves inline comments at the exact diff location with severity and remediation. Vulnerability Scanner runs on a schedule across the codebase, looking for known CVEs, outdated dependencies, and misconfigurations, with optional Slack updates.

Both are configurable: adjust triggers, drop in custom instructions, give them custom tooling, decide where outputs land. The interesting hook is MCP — you can plug in your existing SAST, SCA, and secrets scanners as MCP servers and let the agent use them as part of the review. Cursor keeps tuning the runtime, harness, and models behind the scenes. Usage comes out of your existing pool, not a separate SKU.

Practical read: a month ago, getting a security review agent into your PR pipeline meant adopting Cursor's open-source templates, deploying a Lambda, and wiring Slack yourself. Now it's a toggle. The interesting part isn't the convenience — it's that "security agent" is becoming a product category, not a custom build. The DIY version still exists for teams who want full control; the managed version is for teams who want it on by Friday.

Vercel opens the cloud agent stack

Cursor's SDK gives you agent infrastructure from an IDE company. Vercel's answer is different: an open-source reference implementation you can actually read.

Open Agents is MIT-licensed, deployed at open-agents.dev, and explicitly framed as a reference, not a starter kit. The goal is visibility — see exactly how the pieces wire together, then fork and adapt.

The architecture is three layers: web app → agent workflow → sandbox VM. The web layer handles auth (Better Auth, GitHub OAuth), sessions, chat, and streaming UI built on Next.js. The agent runs as a durable workflow via Vercel's Workflows SDK — long-running execution that can hibernate and resume without losing state. The sandbox is an isolated VM with a full filesystem, shell, git, dev servers, and preview ports.

The critical design choice: agent and sandbox are separate. The agent doesn't run inside the VM — it reaches in via tool calls. That means each layer can hibernate independently. Pause a long coding task, come back hours later, the agent picks up without the sandbox burning compute the whole time.
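
The separation described above can be sketched without any Vercel code. Everything here is a stand-in: `AgentState`, `Sandbox`, `checkpoint`, and `resume` are hypothetical names, not the Workflows SDK or Sandbox API. The sketch only shows why keeping agent state outside the VM lets the two hibernate independently.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class AgentState:
    """Everything the agent needs to resume; lives outside the sandbox."""
    task: str
    pending_steps: list[str]
    log: list[str] = field(default_factory=list)


class Sandbox:
    """Stand-in for an isolated VM that is only reachable through run()."""

    def __init__(self) -> None:
        self.running = False
        self.history: list[str] = []  # stands in for disk state that survives hibernation

    def wake(self) -> None:
        self.running = True

    def hibernate(self) -> None:
        self.running = False

    def run(self, command: str) -> str:
        if not self.running:
            raise RuntimeError("sandbox is hibernated")
        self.history.append(command)
        return f"ok: {command}"


def step(state: AgentState, sandbox: Sandbox) -> None:
    """One agent step: take the next command and execute it as a tool call."""
    command = state.pending_steps.pop(0)
    state.log.append(sandbox.run(command))


def checkpoint(state: AgentState) -> str:
    return json.dumps(asdict(state))


def resume(blob: str) -> AgentState:
    return AgentState(**json.loads(blob))


sandbox = Sandbox()
state = AgentState(task="add tests", pending_steps=["git checkout -b tests", "npm test"])
sandbox.wake()
step(state, sandbox)
blob = checkpoint(state)  # agent pauses: its state is just a JSON blob
sandbox.hibernate()       # sandbox stops consuming compute
# ... hours later ...
state = resume(blob)
sandbox.wake()
step(state, sandbox)
print(state.log)
```

Because the agent never holds in-memory references into the VM, resuming is a matter of loading the blob and waking the sandbox, in either order.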

Feature set: file reads and edits, shell commands, web search, git operations, optional auto-commit and PR creation, session sharing via read-only links, voice input via ElevenLabs. Neon PostgreSQL for persistence, optional Redis or Vercel KV for caching.

Practical read: this is the cleanest public example of how Vercel's own stack — Workflows SDK, Sandboxes, Gateway — wires together for a coding agent. If you've been building agent loops yourself over raw APIs, reading this codebase is faster than reading docs. The architectural insight is the same one underneath the Cursor SDK: durable execution plus isolated sandbox plus external tool access is the skeleton of every cloud coding agent. Vercel just made theirs legible.

Cursor SDK opens the harness up

Cursor shipped a TypeScript SDK on April 29 — the same runtime, harness, and models that power the desktop app, CLI, and web client, now available programmatically via npm install @cursor/sdk. Public beta, token-based pricing; SDK examples now default to Composer 2.5 (Cursor's in-house coding model — roughly 10× cheaper per input token than both Opus 4.7 and GPT-5.5).

Agents created through the SDK get the full stack: codebase indexing with semantic search and instant grep, MCP servers, skills from .cursor/skills/, and hooks from .cursor/hooks.json. Execution can target sandboxed cloud VMs (Cursor-managed), self-hosted workers (your network), or your local machine. Subagents, streaming, and the same harness primitives that ship in the IDE are exposed as composable APIs.

The cookbook is the part to look at. Four reference projects: a minimal quickstart, a web-based prototyping tool that scaffolds new projects in a sandbox, a lightweight coding-agent CLI, and the agent-kanban board — a Linear-style UI where each card represents a Cloud Agent. Drag a card to "in progress" and the agent picks the work up, runs to completion in a sandbox, opens a PR, and posts the result back to the card as an attachment. The board lists running agents, groups them into columns, previews artifacts inline, and creates new agents from a repo + prompt.

Practical read: the SDK turns Cursor from "an editor with agents" into "agent runtime you can build on." If you've been wiring up your own agent loops over the Anthropic or OpenAI APIs, the SDK is shorter — you inherit indexing, hooks, subagents, and sandboxing instead of reinventing them. The kanban example is the cleanest demonstration of where this lands: tickets become agent invocations, drag-and-drop becomes scheduling, and PRs become the artifact.

Cursor ships agentic security review

Cursor open-sourced its Agentic Security Review — a security-tuned automation that runs on every pull request, posts findings as PR comments, and can block CI on high-severity issues. It audits diffs for exploitable vulnerabilities (auth, input validation, permission checks), skips items already discussed in the PR, and routes high-risk findings to a private Slack channel.

The review agent ships alongside three other security agents Cursor runs internally — Vuln Hunter (segments the repo and hunts for vulnerabilities), Anybump (handles dependency patching, runs tests, opens a PR if they pass), and Invariant Sentinel (runs daily to detect drift against a list of compliance and privacy invariants). Templates and Terraform for all four are public, with a custom MCP server deployed as a serverless Lambda handling persistent state, deduplication, and Slack formatting. The Cursor blog post has the full architecture.

In Cursor's own deployment, the review agent has run on thousands of PRs and prevented hundreds of issues from reaching production in the last two months. The rollout sequence they used is worth copying: silent mode to a private Slack channel first, then PR comments once precision was high enough, then a blocking CI gate.
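
The three-stage rollout is easy to encode as an explicit policy so that promotion between stages is a one-line config change. This is a sketch under assumptions: the stage names, the action labels, and the choice to keep the private Slack feed on in later stages are mine, not details from Cursor's post.

```python
from enum import Enum


class Stage(Enum):
    SILENT = 1       # findings go to a private Slack channel only
    PR_COMMENTS = 2  # findings are also posted on the pull request
    BLOCKING = 3     # high-severity findings additionally fail CI


def actions_for(stage: Stage, severity: str) -> list[str]:
    """Return what the pipeline does with one finding at a given rollout stage."""
    actions = ["slack_private"]  # assumed to stay on at every stage
    if stage.value >= Stage.PR_COMMENTS.value:
        actions.append("pr_comment")
    if stage is Stage.BLOCKING and severity == "high":
        actions.append("fail_ci")
    return actions


for stage in Stage:
    print(stage.name, actions_for(stage, "high"))
```

Starting in `SILENT` gives you a precision measurement before any developer sees a false positive, which is what makes the later blocking gate tolerable.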

Practical read: security review used to be a /security-review slash command you remembered to run. As an always-on automation tied to CI, it stops being a discipline problem and starts being infrastructure. Worth pairing with hooks — the hook blocks bad writes locally, the review agent catches what makes it through.

GPT-5.5 lands a week after Opus 4.7 — and the vibe flips

OpenAI announced GPT-5.5 on April 23, seven weeks after 5.4 and seven days after Anthropic's Opus 4.7, with API availability following on April 24. It's rolling out on ChatGPT Plus, Pro, Business, and Enterprise, in Codex, and in the API — with a higher-tier GPT-5.5 Pro alongside it. API pricing is $5 / $30 per million tokens, roughly 2× GPT-5.4, with a 1M-token context window and per-token latency that matches 5.4 in real-world serving.

The quiet detail is that this is the first fully retrained base model since GPT-4.5. The "5.4 → 5.5" version bump undersells the delta — developers poking at it on launch day kept repeating some version of "it just gets it" and "much less hand-holding." It's better at multi-step tool use, at staying in the loop until a task finishes, and at writing and debugging code without being steered every turn.

Against Opus 4.7 the picture splits cleanly by workload. GPT-5.5 wins the autonomous-loop benchmarks: Terminal-Bench 2.0 at 82.7% vs 69.4%, plus leads on BrowseComp (+5.1pp) and CyberGym (+8.7pp). Opus 4.7 wins the reasoning-and-review cluster: SWE-bench Pro 64.3% vs 58.6%, HLE 46.9% vs 41.4%, and MCP-Atlas 79.1% vs 75.3%. Of the ten benchmarks both labs report, Opus leads on six and GPT-5.5 leads on four — but GPT-5.5's four are the ones closest to "agent that runs a shell for an hour."

Reception is sharply warmer than Opus 4.7's, which landed with a 46-point MRCR regression and loud complaints about argumentative coding loops. GPT-5.5 feels like a clean upgrade at comparable speed, which is why devs are calling it a "revival" of the 5.x line. The pushback is almost entirely about price: at 2× GPT-5.4 for the base tier and more for Pro, teams with cost ceilings are staying on 5.4 for anything that doesn't need the extra agentic range. Output pricing is also $5/M more than Opus 4.7, though GPT-5.5 tends to emit fewer tokens per task, which partially offsets the difference on the bill.

Practical read: for agentic coding, browser automation, and long tool-use loops, GPT-5.5 is the default this week. For long-document reasoning, review-grade correctness, and anything close to HLE territory, Opus 4.7 still edges it. If you were burned by the Opus 4.7 release and parked on Opus 4.6 or GPT-5.4, GPT-5.5 is the first model since GPT-4.5 where the jump is worth the retest.

Cursor 3: parallel agents and worktrees

Cursor 3 ships two changes that address the same problem — waiting.

The first is /multitask. Instead of queuing requests and running them serially, Cursor can now spin up async subagents to handle them in parallel. For requests already in the queue, you can ask Cursor to multitask mid-run rather than waiting for the current task to finish.

The second is improved worktrees in the agents window. Run isolated tasks in the background across different branches simultaneously. When you're ready to test changes, bring any branch into your local foreground with one click.

Combined, these features move Cursor toward a model where you describe work across multiple tasks and let the editor figure out execution order. The interface is catching up to what the agents can already do.
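
The difference between a serial queue and parallel subagents is the same as the difference between awaiting tasks one by one and gathering them. The sketch below uses plain `asyncio` with sleeps standing in for agent work; it illustrates the scheduling idea only and says nothing about how Cursor implements /multitask.

```python
import asyncio


async def subagent(name: str, seconds: float) -> str:
    """Stand-in for one subagent run; the sleep represents model and tool time."""
    await asyncio.sleep(seconds)
    return f"{name}: done"


async def run_serial(tasks: list[tuple[str, float]]) -> list[str]:
    """Queue behaviour: each task starts only after the previous one finishes."""
    return [await subagent(name, seconds) for name, seconds in tasks]


async def run_parallel(tasks: list[tuple[str, float]]) -> list[str]:
    """Multitask behaviour: all tasks start together; results keep input order."""
    return list(await asyncio.gather(*(subagent(name, seconds) for name, seconds in tasks)))


tasks = [("fix-lint", 0.02), ("write-tests", 0.01), ("update-docs", 0.01)]
print(asyncio.run(run_parallel(tasks)))
```

Serial wall-clock time is the sum of the task durations; parallel time is roughly the longest single task, which is why isolating each task in its own worktree matters once they run at once.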

The packaging pattern works for design too

The same pattern that works for code conventions works for design. ux-ui-agent-skills packages DTCG design tokens, Atomic Design component specs, WCAG 2.2 checklists, Nielsen heuristic rubrics, and React + Tailwind v4 / Next.js 15 patterns into a single skill set.

Drop it into a project and the agent applies consistent design knowledge — token mapping, accessibility scoring, state documentation — instead of improvising each time.

Any domain with enough accumulated knowledge can be packaged this way. Design is just a clear example because the gap between "AI generates UI" and "AI generates good UI" is so visible.

Mythos Preview finds zero-days at scale

On April 7, Anthropic previewed Claude Mythos — an unreleased research model that's dramatically better at exploiting software than anything shipped before. On a Firefox vulnerability set where Opus 4.6 built working JavaScript shell exploits 2 times in several hundred attempts, Mythos built them 181 times. On OSS-Fuzz, it produced 595 tier-1/2 crashes versus Opus 4.6's 150–175, with full control flow hijacking demonstrated on ten targets.

The bugs it found are the kind that normally take years of expert attention. A 27-year-old OpenBSD TCP flaw enabling remote DoS. A 16-year-old FFmpeg H.264 codec bug that OSS-Fuzz missed after roughly 5 million fuzzing attempts. A 17-year-old FreeBSD NFS remote code execution, now tracked as CVE-2026-4747. Thousands of additional critical and high-severity findings across major open-source projects. On cybersecurity vulnerability reproduction, Mythos scores 83.1% against Opus 4.6's 66.6%.

Anthropic frames this as a "watershed moment," with their own caveat that "most security tooling has historically benefited defenders more than attackers" but the transition period may be "tumultuous." The practical read for anyone shipping software is blunt: patching cycles that were fine six months ago are not fine now. Threat models that assumed expert attackers were a scarce resource need revisiting. The offensive floor just moved, and it moved a lot.

Glasswing: defenders get the model first

Mythos doesn't ship alone. Project Glasswing is the coordinated deployment — a partnership with 12 founding organisations (AWS, Apple, Google, Microsoft, NVIDIA, Linux Foundation, JPMorgan Chase, CrowdStrike, Palo Alto Networks, Cisco, Broadcom, and Anthropic itself) plus 40+ additional critical-infrastructure organisations, plus dedicated funding for open-source maintainers.

The financial commitment: $100M in Mythos model credits for defensive use, $2.5M to Alpha-Omega and OpenSSF, and $1.5M to the Apache Software Foundation. The goal is months of concentrated patch work on the dependencies everything else rests on before anything similar becomes broadly available. The model itself isn't on the public API — access is gated to consortium members and vetted maintainers.

The interesting precedent isn't the money. It's the deployment pattern. A model capable enough to shift the offence/defence balance ships to defenders first, through a consortium, with funded upstream patch work targeted at the libraries that form the trust root of modern software. No broad API release. No public methodology paper. The strategy assumes that if you give this capability to defenders at scale, they can close the window before attackers reach parity. Whether that bet pays off is the open question — but it's the first time a frontier capability has been deliberately held back from general availability for a coordinated defensive push. Worth watching how the pattern generalises.

Claude Design joins Anthropic Labs

Claude Design shipped April 17 as a research preview — Anthropic's first dedicated visual creation tool, running on Opus 4.7, with direct handoff to Claude Code for development.

What you actually do with it: point it at a codebase and it picks up the design system; import a doc, image, or URL and it generates designs, prototypes, slides, or one-pagers; then refine inline with fine-grained controls. Export to Canva, PDF, PPTX, or HTML. Organization-scoped sharing for teams. The target audience is wider than designers: product managers doing wireframes, founders building pitch decks, marketers making campaign materials, non-designers who need visual output and normally punt the work.

Brilliant, cited in the launch, saw complex prototyping drop from 20+ prompts to 2. The more interesting detail is the Claude Code handoff. Design and code are usually connected by a lossy export step: a Figma file becomes hand-written React, with drift appearing immediately. Claude Design treats them as one continuous surface: generate in Design, refine in Design, hand the component tree directly to Claude Code for implementation. If that pipeline holds up in practice, it's a different workflow, not just a faster one.

Available now on Pro, Max, Team, and Enterprise with gradual rollout from April 17.

Hooks move agents from advice to automation

Skills tell an agent what to do. Hooks make certain things happen regardless of what the agent decides.

They fire on lifecycle events — before a tool executes, after it finishes, on message submit, on stop, before compaction, on permission requests. Unlike skills, they're deterministic: the hook runs every time, not when the model judges it relevant.

Practical uses: auto-format after edits, block writes to protected paths, desktop notifications when Claude is waiting, re-inject context after compaction. Full schema in the Anthropic hooks guide.
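
As an illustration of "block writes to protected paths," here is a minimal sketch of the decision logic a pre-tool-use hook script could run. The payload fields (`tool_name`, `tool_input.file_path`) and the convention that exit code 2 blocks the call reflect my reading of the hooks guide and should be checked against the current schema; the protected names are hypothetical.

```python
import json
from pathlib import PurePosixPath

# Hypothetical policy: any path containing one of these components is off-limits.
PROTECTED_NAMES = {".env", "secrets"}
WRITE_TOOLS = {"Write", "Edit"}


def decide(event: dict) -> tuple[int, str]:
    """Return (exit_code, message) for one pre-tool-use event.

    Exit code 0 lets the call proceed; 2 is assumed to mean "block it".
    """
    if event.get("tool_name") not in WRITE_TOOLS:
        return 0, ""
    file_path = event.get("tool_input", {}).get("file_path", "")
    hit = next(
        (part for part in PurePosixPath(file_path).parts if part in PROTECTED_NAMES),
        None,
    )
    if hit is None:
        return 0, ""
    return 2, f"Blocked write to {file_path}: '{hit}' is protected"


def run(raw_event: str) -> tuple[int, str]:
    """Parse the JSON payload a hook receives on stdin and decide."""
    return decide(json.loads(raw_event))


# As an actual hook script, the last lines would be roughly:
#   code, message = run(sys.stdin.read())
#   print(message, file=sys.stderr)
#   sys.exit(code)
print(run('{"tool_name": "Edit", "tool_input": {"file_path": "/repo/secrets/key.pem"}}'))
print(run('{"tool_name": "Edit", "tool_input": {"file_path": "/repo/src/app.py"}}'))
```

The decision is plain code with no model in the loop, which is the whole point: the same input always produces the same verdict.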

The fastest way to start is Hookify. Running /hookify with a plain-language rule, for example "Warn me when I use rm -rf commands", produces a working hook file immediately. Run it with no arguments and it auto-generates rules from behaviors you've already corrected in the current session.

Skills encode knowledge. Hooks enforce behavior. Both belong in a mature setup.

Harness design is now part of the craft

A harness is everything around the model: prompts, tools, orchestration, context management, hooks. If you want the from-scratch primer on what a harness is and why it matters, chapter 19, "What Is an AI Harness?", covers it. This section is the moving-target view. Anthropic published a detailed breakdown of how a harness affects long-running agent performance, and two findings stood out.

First: agents lose coherence as context fills. Some models exhibit "context anxiety," wrapping up prematurely. The fix isn't compaction (summarizing in place) — it's context resets: clear the window, start a fresh agent with a structured handoff. Compaction preserves continuity but doesn't give the agent a clean slate.
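
A context reset with a structured handoff can be sketched as follows. The `Handoff` fields, the 70% threshold, and the function names are assumptions for illustration; the article describes the idea, not this shape.

```python
from dataclasses import dataclass, field


@dataclass
class Handoff:
    """The only thing that crosses from the old context into the fresh one."""
    goal: str
    completed: list[str] = field(default_factory=list)
    next_steps: list[str] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)

    def to_prompt(self) -> str:
        def section(title: str, items: list[str]) -> str:
            body = "\n".join(f"- {item}" for item in items) or "- (none)"
            return f"{title}:\n{body}"

        return "\n\n".join([
            f"Goal: {self.goal}",
            section("Completed", self.completed),
            section("Next steps", self.next_steps),
            section("Open questions", self.open_questions),
        ])


def maybe_reset(messages: list[str], used_tokens: int, budget: int,
                handoff: Handoff, threshold: float = 0.7) -> list[str]:
    """Return the messages for the next turn: unchanged while there is room,
    otherwise a fresh window containing only the structured handoff."""
    if used_tokens < threshold * budget:
        return messages
    return [handoff.to_prompt()]


handoff = Handoff(goal="ship login flow", completed=["db schema"], next_steps=["write tests"])
fresh = maybe_reset(["turn 1", "turn 2"], used_tokens=8_000, budget=10_000, handoff=handoff)
print(fresh[0])
```

Unlike compaction, nothing from the old transcript survives except what was deliberately written into the handoff, so the new agent starts without the accumulated noise.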

Second: agents reliably praise their own work when asked to evaluate it. The fix is architectural — separate the generator from the evaluator. A standalone evaluator tuned to be skeptical is far more tractable than making a generator self-critical.

The architecture that emerged: planner → generator → evaluator with Playwright clicking through the running app. Every component encodes an assumption about what the model can't do alone — those assumptions go stale as models improve. Strip non-load-bearing scaffolding when a new model lands. The full article has cost and duration breakdowns.
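
The generator/evaluator split reduces to a small control loop. In this sketch the two roles are plain callables and the stubs are toy functions; in a real harness they would be separate model calls (with the evaluator driving something like Playwright), and every name here is hypothetical.

```python
from typing import Callable

Generator = Callable[[str, list[str]], str]  # (spec, feedback so far) -> artifact
Evaluator = Callable[[str, str], list[str]]  # (spec, artifact) -> list of defects


def build(spec: str, generate: Generator, evaluate: Evaluator,
          max_rounds: int = 3) -> tuple[str, list[str]]:
    """Alternate generation and independent evaluation until no defects remain
    or the round budget is spent. Returns (artifact, unresolved defects)."""
    feedback: list[str] = []
    artifact = ""
    for _ in range(max_rounds):
        artifact = generate(spec, feedback)
        defects = evaluate(spec, artifact)  # evaluator sees only spec and artifact
        if not defects:
            return artifact, []
        feedback = defects
    return artifact, feedback


# Toy stand-ins for the two roles.
def stub_generate(spec: str, feedback: list[str]) -> str:
    return spec + (" with validation" if feedback else "")


def stub_evaluate(spec: str, artifact: str) -> list[str]:
    return [] if "validation" in artifact else ["form accepts empty input"]


print(build("login form", stub_generate, stub_evaluate))
```

The evaluator never sees the generator's reasoning, only the spec and the result, which is what keeps it from inheriting the generator's optimism.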

Subagents: the Cursor model is worth studying

Cursor's subagents go further than the AGENTS.md pattern. Each gets its own context window and model config, runs foreground or background, and three built-ins (Explore, Bash, Browser) handle the noisiest operations automatically.

Custom subagents are markdown files with YAML frontmatter in .cursor/agents/ (or .claude/agents/):

---
name: security-auditor
description: Security specialist. Use when implementing auth, payments, or handling sensitive data.
model: inherit
readonly: true
---
 
You are a security expert auditing code for vulnerabilities.

The description field determines when the parent delegates — spend time on it. The model field lets you route high-volume tasks to a faster model and depth tasks to a more capable one.

Anti-pattern: dozens of vague subagents. Five focused ones with sharp descriptions outperform fifty the parent doesn't know when to use.

Next.js MCP is becoming practical

Next.js 16 ships with a built-in MCP endpoint at /_next/mcp. Add next-devtools-mcp to .mcp.json and your agent gets live access to build errors, runtime errors, routes, page metadata, and server action IDs — no screenshots, no copy-paste.

{
  "mcpServers": {
    "next-devtools": {
      "command": "npx",
      "args": ["-y", "next-devtools-mcp@latest"]
    }
  }
}

Useful tools: get_errors (source-mapped stacks), get_routes, get_page_metadata, get_server_action_by_id. The agent can diagnose a hydration error and suggest a fix without you describing what's on screen.

On versions below Next.js 16, set experimental.mcpServer: true in next.config.js.

Browser control is getting lighter-weight

The fastest way for an agent to use a browser is to let it write code. dev-browser runs Playwright-style scripts in a sandboxed QuickJS WASM environment — install it globally, point the agent at dev-browser --help, and it handles the rest.

npm i -g dev-browser
dev-browser install

From the repo benchmarks: Dev Browser finishes a representative task in 3m 53s at $0.88 with 29 turns. Playwright MCP takes 4m 31s at $1.45 with 51 turns. Batching interactions into scripts beats one-tool-call-per-action.
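
Putting the two runs side by side as relative reductions makes the batching effect easier to read. The figures are the ones quoted above; the code is just the arithmetic.

```python
def reduction(baseline: float, candidate: float) -> float:
    """Fractional reduction of `candidate` relative to `baseline`."""
    return (baseline - candidate) / baseline


# Figures quoted from the repo benchmarks.
playwright_mcp = {"seconds": 4 * 60 + 31, "cost_usd": 1.45, "turns": 51}
dev_browser = {"seconds": 3 * 60 + 53, "cost_usd": 0.88, "turns": 29}

for metric in playwright_mcp:
    saved = reduction(playwright_mcp[metric], dev_browser[metric])
    print(f"{metric}: {saved:.0%} lower")
```

That works out to roughly 14% less wall-clock time, 39% lower cost, and 43% fewer turns. The turn count moves the most, which fits the explanation: each script replaces several individual tool calls.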

Pre-approve it in .claude/settings.json by adding "Bash(dev-browser *)" to the "allow" list under "permissions".

Infra is becoming part of the product

Cursor's self-hosted cloud agents are now generally available. A worker process connects outbound via HTTPS — no inbound ports, no firewall changes. Cursor handles inference and planning, sends tool calls to the worker, results flow back. Each session gets its own dedicated worker; Kubernetes operator available for scale.

The practical benefit: agents can access internal caches, dependencies, and network endpoints that can't leave the environment. Code and secrets stay in your infrastructure.

Teams at Brex, Money Forward, and Notion are running this at scale. Notion cited access to more tools more securely as the reason for adopting it over maintaining their own background agent stack. "Agent infrastructure" is now a real architectural decision.

Cloud agents run on your hardware now

Cursor's My Machines takes self-hosted agents from an enterprise feature to an individual one. Instead of running in Cursor's managed VMs, your agent executes on hardware you control — your laptop, a devbox, a remote VM. Three commands to get there:

curl https://cursor.com/install -fsS | bash
agent login
agent worker start

The worker opens an outbound connection to Cursor — no inbound ports, no firewall changes, just HTTPS to api2.cursor.sh. Cursor handles inference and planning, sends tool calls to the worker, and terminal commands, file edits, and browser actions all execute on your machine. Your local repo, dependency caches, build artifacts, internal network — the agent gets all of it.

The MCP routing is worth noting. Stdio-transport MCP servers run on your machine, so they can reach private endpoints your network can access. HTTP/SSE-transport servers run on Cursor's backend, where Cursor handles OAuth and session caching. If your MCP server needs to hit an internal API, use stdio.

Workers are long-lived by default — they stay connected until you stop them and pick up future sessions automatically. Name them with --name "my-devbox" when you have multiple machines. For org-wide fleets, Cursor has a separate Self-Hosted Pool with Kubernetes operators. My Machines is the individual-developer version: one process, one machine, immediate access.

The shift underneath is conceptual. "Cloud agent" used to mean "runs in a cloud VM." Now the cloud part is just inference. Execution goes wherever makes sense — Cursor's sandboxes for isolation, your laptop for local deps, your company's cluster for compliance.

Harnesses are becoming shareable infrastructure

everything-claude-code is a useful example of where this is heading: 30 specialized subagents, hooks for memory persistence, verification loops, continuous learning, and security scanning — shipping across Claude Code, Cursor, Codex, and OpenCode.

The instinct system is the interesting part: the agent extracts patterns from your sessions into structured files, and /evolve clusters them into skills. The harness learns from use.
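The project's actual file format and clustering method aren't covered here, so treat the following as a toy illustration of the idea only. It assumes each extracted pattern carries a topic label (a hypothetical field) and promotes any topic with enough patterns into a skill.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Instinct:
    """Hypothetical record for one pattern extracted from a session."""
    pattern: str
    topic: str


def evolve(instincts: list[Instinct], min_cluster_size: int = 2) -> dict[str, list[str]]:
    """Group instincts by topic and keep only groups large enough to become a skill.

    Grouping by an exact topic label is a stand-in for whatever clustering
    the real /evolve command performs.
    """
    clusters: dict[str, list[str]] = defaultdict(list)
    for instinct in instincts:
        clusters[instinct.topic].append(instinct.pattern)
    return {
        topic: patterns
        for topic, patterns in clusters.items()
        if len(patterns) >= min_cluster_size
    }
```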

The community is converging on a shared vocabulary — skills, subagents, hooks, harnesses, evals. The primitives are stabilizing even as the specific tools change.

Engineering practices are becoming installable

Addy Osmani packaged Google's engineering culture into agent-skills: 20 skills across a 6-phase lifecycle, with 7 slash commands (/spec, /plan, /build, /test, /review, /code-simplify, /ship) that map to the full development loop.

Each skill has the same anatomy: process steps, anti-rationalization tables (rebuttals for "I'll add tests later"), red flags, and verification gates. The engineering principles are baked in — Hyrum's Law for API design, Chesterton's Fence for simplification, the Beyoncé Rule for testing ("if you liked it, you shoulda put a test on it"), 80/15/5 test pyramid ratios.
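That anatomy maps cleanly onto a data structure. A rough sketch, assuming my own section headings and field names rather than the ones agent-skills actually uses:

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    """Hypothetical in-memory form of one skill file."""
    name: str
    steps: list[str]
    rebuttals: dict[str, str]  # excuse -> rebuttal
    red_flags: list[str] = field(default_factory=list)
    gates: list[str] = field(default_factory=list)

    def to_markdown(self) -> str:
        """Render the four parts as a structured markdown document."""
        lines = [f"# {self.name}", "", "## Process"]
        lines += [f"{i}. {step}" for i, step in enumerate(self.steps, 1)]
        lines += ["", "## Anti-rationalization", "| Excuse | Rebuttal |", "| --- | --- |"]
        lines += [f"| {excuse} | {reply} |" for excuse, reply in self.rebuttals.items()]
        lines += ["", "## Red flags"]
        lines += [f"- {flag}" for flag in self.red_flags]
        lines += ["", "## Verification gates"]
        lines += [f"- [ ] {gate}" for gate in self.gates]
        return "\n".join(lines) + "\n"
```

The verification gates render as unchecked boxes on purpose: the agent has to tick them, which is harder to rationalize past than a paragraph of advice.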

What makes it interesting isn't the content — most experienced engineers know these practices. It's the format. When engineering culture is encoded as structured markdown, the floor rises. Junior developers running these skills get senior-level guardrails without senior-level experience. The agent doesn't skip tests because it's in a hurry. It doesn't rationalize away code review.

Works across Claude Code, Cursor, Gemini CLI, and anything that accepts markdown.

Design systems are going agent-readable

Google Stitch introduced DESIGN.md — a plain-text design system document that agents read to generate consistent UI. No Figma plugins, no design token APIs. Just a markdown file with nine sections: visual theme, color palette with hex values, typography rules, component styling including states, layout principles, depth and elevation, do's and don'ts, responsive behavior, and an agent prompt guide.
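Because the format is just markdown with a known set of sections, a completeness check is a few lines. This sketch assumes each section appears as a markdown heading containing one of the keywords below. The keyword list is my paraphrase of the nine sections, not the official heading text.

```python
import re

# One keyword per section; assumed to appear somewhere in that section's heading.
REQUIRED_SECTIONS = [
    "visual theme",
    "color palette",
    "typography",
    "component",
    "layout",
    "depth",
    "do's and don'ts",
    "responsive",
    "agent prompt",
]

HEADING = re.compile(r"^#{1,6}\s+(.+)$", re.MULTILINE)


def missing_sections(design_md: str) -> list[str]:
    """Return the section keywords that no heading in the document covers."""
    # Normalize curly apostrophes so "Do’s and Don’ts" still matches.
    text = design_md.replace("\u2019", "'")
    headings = [match.group(1).strip().lower() for match in HEADING.finditer(text)]
    return [kw for kw in REQUIRED_SECTIONS if not any(kw in h for h in headings)]
```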

awesome-design-md took this further — 58+ DESIGN.md files extracted from real companies. Claude, Stripe, Vercel, Linear, Figma, Airbnb, Spotify, Tesla. Drop one into your project and the agent generates UI that matches that design system.

The insight: agents are already generating UI. The problem was never capability — it was consistency. A DESIGN.md file gives the agent the same reference a human designer would use, in the format it processes best. Markdown over Figma, at least for the agent.

Browse the collection at getdesign.md.

Stitch just open-sourced the DESIGN.md specification itself — Apache 2.0, formal schema, CLI with a linter, differ, and exporter. Any tool can implement it now, not just Stitch.

The additions that matter: semantic color intent, so agents know what a color is for rather than just its hex value, and built-in WCAG validation, where the linter catches contrast failures at lint time, before anything ships. The export command converts tokens to Tailwind config or W3C DTCG JSON, so one file feeds your CSS, your build system, and your agent.
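For a sense of what a contrast check involves, here is the WCAG 2.x contrast-ratio calculation in a few lines. The formula and the AA thresholds (4.5:1 for normal text, 3:1 for large text) come from the WCAG spec. Whether the DESIGN.md linter computes it exactly this way is an assumption, and the function names are mine.

```python
def _linearize(channel: int) -> float:
    """Convert one 8-bit sRGB channel to linear light, per WCAG 2.x."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a #rrggbb color (0.0 for black, 1.0 for white)."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(foreground: str, background: str) -> float:
    """WCAG contrast ratio, from 1.0 (identical) to 21.0 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa(foreground: str, background: str, large_text: bool = False) -> bool:
    """True if the pair meets the WCAG AA minimum for its text size."""
    return contrast_ratio(foreground, background) >= (3.0 if large_text else 4.5)
```

The classic edge case: #767676 on white passes AA for body text, while #777777, one step lighter, does not. That's the kind of failure nobody spots by eye and a linter catches every time.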

Spec at google-labs-code/design.md.

Knowledge bases are replacing notebooks

Andrej Karpathy shared a pattern worth paying attention to: instead of using LLMs to write code, use them to build personal knowledge bases.

The structure is simple. Raw materials — papers, articles, repos, datasets — go into a raw/ directory. The LLM "compiles" them into a wiki: structured .md files with summaries, backlinks, concept pages, and cross-references. You query the wiki, and the LLM synthesizes answers with citations.

This isn't RAG. RAG re-discovers knowledge from scratch on every question — chunk, retrieve, generate, forget. The wiki accumulates. Ask a question that requires synthesizing five documents, and the answer is already on a page, not assembled from fragments at query time.

Karpathy's own wiki on recent research: ~100 articles, ~400K words. Periodic linting passes check for contradictions, stale claims, orphan pages, and missing cross-references. The whole thing is a git repo, so you get version history for free. Instead of sharing code, he published a GitHub Gist as an "idea file" — in the era of agents, you share the idea and each person's agent builds a version customized for their needs.
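Two of those lint checks, orphan pages and missing cross-references, are purely structural and need no model at all. A minimal sketch, assuming Obsidian-style [[wikilinks]] and an "index" root page, neither of which is confirmed as Karpathy's setup. Contradictions and stale claims are the part that needs an LLM pass.

```python
import re

# Captures the target of [[Target]], [[Target|alias]], or [[Target#heading]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def lint_wiki(pages: dict[str, str], root: str = "index") -> dict[str, list[str]]:
    """Find orphan pages and links to pages that don't exist.

    `pages` maps a page name to its markdown body. A page is an orphan if
    no other page links to it; the root page is exempt.
    """
    linked: set[str] = set()
    broken: list[str] = []
    for name, body in pages.items():
        for target in WIKILINK.findall(body):
            target = target.strip()
            if target not in pages:
                broken.append(f"{name} -> {target}")
            elif target != name:  # a self-link does not rescue an orphan
                linked.add(target)
    orphans = sorted(set(pages) - linked - {root})
    return {"orphans": orphans, "broken_links": sorted(broken)}
```

To run it on a real vault, build `pages` from the .md files in the wiki folder, keyed by file stem.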

The Obsidian connection makes it practical. Obsidian's web clipper converts pages to markdown, the vault is a local folder an agent can read and write, and backlinks make the wiki navigable by both humans and agents. Several open-source projects — claude-obsidian, obsidian-claude-code — have formalized the workflow.

An increasing fraction of token throughput is going to knowledge management instead of code generation. Worth watching.

Visual annotation beats text descriptions

Cursor 3 shipped Design Mode. Instead of typing "change the third button in the second card on the settings page," you click on the element.

Design Mode opens a browser panel inside Cursor showing your running app. Click any UI element — a button, a heading, a card — and annotate it with instructions. The agent receives the component tree path, computed styles, and surrounding context. You can draw directly on the preview to indicate layout changes or spacing adjustments.
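It helps to picture what the agent gets instead of a sentence. Cursor's actual payload isn't covered here, so the shape below is purely illustrative: the Annotation fields and the to_prompt function are hypothetical names for the three things described above plus your instruction.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """Hypothetical shape of one click-and-annotate event."""
    component_path: list[str]        # outermost component first
    computed_styles: dict[str, str]  # CSS property -> resolved value
    context: str                     # nearby markup or text
    instruction: str                 # what the user typed on the element


def to_prompt(annotation: Annotation) -> str:
    """Flatten an annotation into text an agent can act on."""
    styles = "; ".join(
        f"{prop}: {value}" for prop, value in sorted(annotation.computed_styles.items())
    )
    return (
        f"Element: {' > '.join(annotation.component_path)}\n"
        f"Computed styles: {styles}\n"
        f"Context: {annotation.context}\n"
        f"Instruction: {annotation.instruction}"
    )
```

Compare that with "the third button in the second card": the path is unambiguous, and the styles are the resolved values rather than a guess at which rule applies.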

In practice: about 70% of annotations result in correct fixes on the first try. It struggles with dynamically rendered content and complex CSS-in-JS setups where styles aren't straightforward to trace.

This is the direction. Text descriptions of visual problems are lossy. Pointing at the thing and saying "fix this" is how humans communicate about UI. The tooling is catching up to the gesture.

Canvases make agent output interactive

Cursor shipped canvases — interactive visual surfaces that agents create inline. Instead of reading a text-based summary of your data, the agent generates a custom dashboard, chart, or visualization you can click through.

The shift is in the output format. Most agent responses are text. Canvases let the agent build a small interactive application as the response — a dependency graph you can explore, a timeline you can scrub, a layout you can rearrange. The agent writes the visualization code, renders it in a sandboxed canvas, and you interact with the result directly.

This matters because some information is fundamentally better explored than read. A table of API response times is less useful than a chart you can filter by endpoint. A list of component dependencies is less useful than a graph you can zoom into. Canvases give the agent a richer output vocabulary.

Open models keep closing the gap

Gemma 4 landed on April 2. Four variants: E2B (2.3B effective), E4B (4.5B effective), 26B MoE (4B active), and 31B dense. All Apache 2.0 — a real license change from Google's previous, more restrictive terms for open models.

The 31B dense model hit #3 on Arena AI's text leaderboard at 1452 Elo, outperforming models twenty times its size. The 26B MoE hit #6 at 1441 Elo.
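For scale on those numbers, the standard Elo formula turns a rating gap into an expected head-to-head win rate. This assumes the leaderboard scores behave like classic Elo ratings on a 400-point scale, which is a simplification of how arena-style rankings are fitted.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected win probability for A against B under the standard Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))


# The 31B dense model (1452) against the 26B MoE (1441): an 11-point gap.
gap_31b_vs_26b = expected_score(1452, 1441)
```

An 11-point gap works out to roughly a 51.6% expected win rate, close to a coin flip. Under this model, the MoE variant gets nearly the same preference rate with 4B active parameters.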

Multimodal out of the box: images, audio, variable aspect ratios, document parsing, handwriting OCR. Up to 256K context for the larger variants, 128K for the smaller ones. Over 140 languages.

The gap between open and closed models compresses with every release. Self-hosted agents running Gemma 4 31B are now competitive on reasoning benchmarks with frontier models from a year ago. For teams that can't send code to an API, that matters.

Where to start

If you're setting up a serious agent harness for the first time, the order matters:

Get hooks working first. A single hook that blocks writes to node_modules/ or auto-formats after edits gives you immediate, observable value. Use Hookify's zero-argument mode to bootstrap from your own session history.
Write one focused subagent before writing ten. Pick the task where your current setup most often loses context — security review, database migrations, API contract checks — and build one sharp subagent for it. Refine the description field until the parent routes to it reliably.
Read the harness article before building evaluators. The generator/evaluator split is the insight with the most practical leverage. Get that architecture right before optimizing anything else.
Add next-devtools-mcp if you're on a Next.js project. The signal-to-noise improvement on error diagnosis is immediate and costs nothing.
Check everything-claude-code for patterns, not prescriptions. It's a reference harness, not a starter kit. Extract the ideas that fit your context.
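For the first step, a hook that blocks writes to node_modules/ can be this small. The sketch assumes Claude Code's hook conventions as I understand them: a pre-tool-use hook receives a JSON payload on stdin with tool_name and tool_input.file_path, and exit code 2 blocks the call with stderr passed back to the agent. Check the hooks reference for your harness before relying on those details.

```python
import json
import sys
from pathlib import PurePosixPath

# Tool names assumed to perform file writes; adjust for your harness.
WRITE_TOOLS = {"Write", "Edit", "MultiEdit"}


def should_block(payload: dict) -> bool:
    """True if the payload describes a write into any node_modules directory."""
    if payload.get("tool_name") not in WRITE_TOOLS:
        return False
    path = payload.get("tool_input", {}).get("file_path", "")
    # Normalize Windows separators so the path splits into parts either way.
    return "node_modules" in PurePosixPath(path.replace("\\", "/")).parts


def main() -> int:
    """Read one hook payload from stdin and return the exit code to use."""
    payload = json.load(sys.stdin)
    if should_block(payload):
        print("Blocked: do not write inside node_modules/.", file=sys.stderr)
        return 2  # assumed convention: 2 means block and report stderr
    return 0


# As a hook entry point, finish the script with: sys.exit(main())
```

Matching on path parts rather than a substring keeps a file like src/node_modules_notes.md writable.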

Resources

Sorted roughly by how much foundational leverage they provide.

Core reading

Harness Design for Long-Running Applications — Anthropic Engineering The reference article on harness architecture: context anxiety, generator/evaluator splits, context resets vs. compaction, and cost/duration breakdowns. Read this before designing any multi-step agent.
everything-claude-code 30 specialized subagents, memory hooks, verification loops, and an instinct system that learns from your sessions. The most complete reference harness available publicly. Works across Claude Code, Cursor, Codex, and OpenCode.
Anatomy of the Claude Folder Clear breakdown of what goes where in .claude/ — settings, hooks, subagents, skills, memory. Essential orientation if you're building a harness from scratch.
3 Principles for Designing Agent Skills — Block Engineering Composability, observability, and minimal footprint. A tight framework for evaluating whether a skill is worth extracting.
claude-code-best-practice Community-curated collection of CLAUDE.md patterns, workflow configs, and prompt strategies. Good place to see what's converged as convention.
The Complete Guide to Building Skills for Claude — Anthropic Official reference guide for building Claude skills. Covers skill structure, frontmatter, when to extract a skill vs. keep it inline, and how the harness routes invocations. Read alongside the Anatomy of the Claude Folder post.
LLM Knowledge Bases — Andrej Karpathy The idea file for building personal knowledge wikis with LLMs. Raw materials → compiled wiki → queryable knowledge base. The alternative to RAG that accumulates instead of rediscovering.

Tooling

Hookify plugin — Official Claude Code plugin Source for the Hookify plugin. The zero-argument mode (auto-generates rules from your session history) is the fastest way to start building a hook library.
Cursor Subagents Official documentation for Cursor's subagent system — context window isolation, foreground/background execution, built-in Explore/Bash/Browser agents. The description field guidance is especially practical.
Cursor My Machines Run cloud agents on your own hardware — laptop, devbox, or remote VM. Three commands to set up; stdio MCP servers run locally with full network access. The individual-developer path to self-hosted agents.
next-devtools-mcp MCP server that gives agents live access to Next.js build errors, runtime errors, routes, and server action IDs. Replaces screenshot-based debugging.
dev-browser Runs Playwright-style scripts in a sandboxed QuickJS WASM environment. Benchmarks show ~30% fewer turns and ~40% lower cost vs. Playwright MCP for representative browser tasks.
ux-ui-agent-skills Design system skills packaging: DTCG tokens, Atomic Design specs, WCAG 2.2 checklists, and React + Tailwind v4 patterns. A concrete example of the domain-packaging pattern applied to UI.
Cursor Marketplace Browsable registry of community plugins and skill packs. Useful for finding what's already been packaged before building your own.
agent-skills — Addy Osmani 20 production-grade engineering skills with 7 slash commands, encoding Google's engineering practices (Hyrum's Law, Chesterton's Fence, test pyramids) as structured agent workflows.
DESIGN.md Specification — Google Labs The open-source spec for DESIGN.md: Apache 2.0, formal YAML/markdown schema, CLI with linter (including WCAG contrast validation), differ, and exporter to Tailwind and W3C DTCG JSON. Any tool can implement it.
awesome-design-md 58+ DESIGN.md files extracted from real companies — agent-readable design systems in plain markdown. Browse at getdesign.md.
claude-obsidian Claude + Obsidian knowledge companion implementing Karpathy's LLM wiki pattern. Persistent, compounding wiki vault with /wiki, /save, and /autoresearch commands.
Cursor Canvas Agents create interactive visual dashboards and custom interfaces inline. The output format shift from text to explorable visualizations.
Cursor SDK TypeScript SDK exposing Cursor's runtime, harness, models, codebase indexing, MCP, skills, and hooks. Public beta via npm install @cursor/sdk. Current examples default to Composer 2.5. Pair with the cookbook — the agent-kanban example is the clearest demo of agents-as-tickets.
Open Agents — Vercel Labs MIT-licensed reference app for building background cloud coding agents on Vercel. Web app + durable Workflows SDK + isolated Sandbox VM. Agent and sandbox are decoupled, so each can hibernate independently. Fork it to understand the wiring; don't use it as a starter kit.
Cursor Security Review Official docs for Cursor's Agentic Security Review automation. Open-source templates and Terraform on Cursor Automations, plus three companion security agents (Vuln Hunter, Anybump, Invariant Sentinel). The blog post covers the architecture and rollout strategy.
Cursor 3.0 changelog Design Mode, Agents Window, and the architectural shift to agent-first IDE. The Design Mode feature is the standout addition.
Tweet: cursor_ai on /multitask and worktrees Official announcement of parallel agent execution via /multitask and the new worktrees UI in Cursor 3.

Context and background

superpowers A collection of Claude Code skills and hooks by Jesse Vincent. Good real-world reference for how an experienced developer structures a personal harness — useful for seeing what someone actually keeps vs. discards.
gsap-skills Official GSAP skill pack for AI agents. A clean example of first-party library authors packaging their own knowledge for agent use — the likely direction for more ecosystems.
Vercel React Best Practices Vercel Engineering's guide to React performance: RSC boundaries, data fetching patterns, and component composition. Useful context when agents are generating or reviewing React code.
Tweet: kirillk_web3 on Claude Skills 16-minute video of two Anthropic engineers (Barry and Mahesh) building Claude Skills from scratch. The key framing: skills are just folders that teach Claude your job, your workflow, your domain. Good entry point if you haven't built one yet.
Tweet: bcherny on Claude Code Short, worth reading for the framing on where the tooling layer is heading.
Tweet: vtrivedy10 on agent setups Practical notes on structuring multi-agent setups in production.
Shared conversation: harness patterns A real session showing harness design decisions in context.
AI Engineer YouTube — ai.engineer Conference talks from the AI Engineer Summit, World's Fair, and Code Summit — speakers like Andrej Karpathy, Simon Willison, Jerry Liu. Over 10 million views in 2025. The best single channel for staying current on agent tooling, evals, and infrastructure patterns as they emerge from practitioners building in production.
Gemma 4 — Google DeepMind Four open-weight variants under Apache 2.0, multimodal, up to 256K context. The 31B dense model ranks #3 on Arena AI's text leaderboard.