Skip to content
AI Field Notes

Part IV · Tools & Infrastructure

18

What Is Happening

The fastest-moving areas in AI agent tooling right now — hooks, harness design, browser automation, and multi-agent orchestration

Something's shifted in the last few months. The interesting problems aren't "which model is better" anymore — it's everything around the model. How you feed it context. How you keep it coherent across a long session. How you constrain what it touches. How you recover when it goes sideways.

The models are good. The question is how you harness them. That's where things are moving fast, and where most of my attention is right now. Here's what's actually made a difference.

Karpathy joins Anthropic's pre-training team

Andrej Karpathy announced on May 19 that he's joining Anthropic this week. The short bio for anyone who's lost track: OpenAI cofounder, then Tesla's director of AI from 2017 leading Autopilot computer vision, then a brief return to OpenAI, then Eureka Labs — the AI education startup he's been running since. Now Anthropic.

The role is the interesting part. Per TechCrunch, Karpathy is building a team focused on using Claude to accelerate pre-training research — the work that gives new models their core knowledge and capabilities in the first place. Not an applied team. Not an agent team. The team whose job is to make the next Claude train better than this one.

The hire lands while Anthropic is poised to surpass OpenAI's private-market valuation and in the middle of an unusually loud talent war between the two labs. Karpathy is the highest-profile name to switch sides this year, and the optics of an OpenAI cofounder running pre-training at the chief rival are doing their own work.

Practical read: this is the "models training models" loop showing up in a hiring decision. The same instinct that drives Cursor's Targeted RL with Textual Feedback or Composer 2.5's 25× synthetic-task expansion — using the current model to improve the next one — is now a staffing thesis at the lab level. If the bet pays off, the Claude that ships in 12 months won't just be a bigger Claude 4.7. It'll be a Claude trained by a Claude. Worth watching what shows up in Anthropic's research output over the next two quarters.

Google's Gemini Omni: one model, any modality in, any modality out

Google unveiled Gemini Omni at I/O 2026 today — its first natively multimodal "any-to-any" model. Gemini Omni Flash is the first variant: take any combination of text, images, audio, and video in, get high-quality output across the same modalities out. The pitch is that Omni reasons across the inputs rather than stitching three specialist models together behind a single API.

Concrete details: 10-second video generation, custom digital avatars, plain-text photo editing without a Photoshop-style interface. Live today inside the Gemini app for U.S. subscribers on AI Plus, AI Pro, and AI Ultra, plus integrations into YouTube Shorts and Google's Flow creative studio. Vertex AI API access is "in the coming weeks," per TechCrunch.

Practical read: the structural play here is collapse-the-stack. For agent work the interesting bit isn't the consumer video demo — it's that once Omni hits Vertex, you stop needing text-to-image-then-image-to-video-then-audio-overlay as three separate model calls glued together. Quality vs. specialist models like Seedance 2 or Veo is the open question; early write-ups are skeptical that one model beats focused ones on pixel quality yet. But for multimodal agent outputs where consistency across modalities matters more than per-modality maximum, this changes the wiring.

Gemini 3.5 Flash beats 3.1 Pro on the benchmarks that matter

The other I/O headline: Gemini 3.5 Flash — Google's new Flash-tier model — outperforms Gemini 3.1 Pro, the flagship that shipped in February, on three benchmarks Google chose to highlight: Terminal-Bench 2.1, GDPval-AA Elo, and MCP Atlas. The GDPval-AA number is the one to stare at: 1,656 for 3.5 Flash vs 1,317 for 3.1 Pro. That's not a tick, that's a step. Sundar Pichai called out 289 tokens/sec in the keynote — roughly 4× the throughput of comparable frontier models. It's rolling out today as the default model in the Gemini app and Google Search globally.

Reception is mixed-positive. The long-standing "Gemini feels lazy" complaint has reportedly mostly faded in early testing, sub-200ms responses on many prompts make it feel genuinely real-time, and LM Arena coding scores have it ahead of 3.1 Pro at meaningfully lower per-token cost. The honest counterweight: Hacker News threads on the prior 3.x releases still surface a steady drumbeat of "Gemini is consistently the most frustrating model I use" from developers — benchmarks aren't the same as daily-driver feel, and Google hasn't fully closed that gap yet.

Practical read: two things matter here. First, "Flash beats Pro" inverts the normal model hierarchy — the same move Cursor made with Composer 2.5 last week, and Anthropic's pre-training hire above is partly a response to the same pressure. The cheap-and-fast tier is no longer a worse version of the expensive one; it's a different point on the price/capability curve that sometimes wins. Second, GDPval-AA and MCP Atlas are the agent-shaped benchmarks — tool use, long-horizon tasks. A Flash-tier model leading there means the cost floor for capable agent runtimes just dropped again. Build budgets that assumed Pro-tier pricing for agent loops need a refresh this week.

Composer 2.5 makes the in-house model competitive

Cursor shipped Composer 2.5 on May 18 — the same Moonshot Kimi K2.5 base as Composer 2, retrained with 25× more synthetic tasks and a new technique they call Targeted RL with Textual Feedback: instead of waiting for a final reward, the trainer drops localized hints at the exact tokens where behavior went wrong and distills back from those points. The infrastructure note worth flagging is the Muon optimizer with distributed orthogonalization — 0.2-second optimizer steps on a trillion-parameter model.

The benchmark picture is the headline. On SWE-Bench Multilingual, Composer 2.5 lands at 79.8% against Opus 4.7's 80.5% and GPT-5.5's 77.8%. On CursorBench v3.1, 63.2% vs Opus 4.7 at 64.8% (max) / 61.6% (default) and GPT-5.5 at 59.2%. Terminal-Bench 2.0 is where the gap shows: 69.3%, basically tied with Opus 4.7 at 69.4%, but well behind GPT-5.5's 82.7% — the long autonomous-loop benchmark is still GPT-5.5's territory.

Pricing is the part that matters. $0.50 / $2.50 per million tokens for the standard tier, $3 / $15 for the Fast variant. At that rate, Composer 2.5 hits ~63% on CursorBench at under $1 average per task while Opus 4.7 and GPT-5.5 are several dollars in for comparable scores. Launch week ships with double usage thrown in. The roadmap note is the other interesting one: Cursor confirmed a collaboration with SpaceXAI to train a significantly larger model on Colossus 2 — 10× the compute of this run.

Reception on the Cursor forum thread is warm but not uncritical. The consistent praise is about tone: "willing to think with you and is not antagonistic" — a direct shot at the Opus 4.7 argumentative-loop complaints from last month. One developer admitted forgetting they had Composer 2.5 enabled and not realizing they weren't on GPT-5.5 for a while, which is the highest compliment a default-model swap gets. The gripe that keeps coming up is inconsistent thinking depth: users report adding "please think harder" before the model commits to a real answer instead of a lightweight one.

Practical read: for the first time, the cheap in-IDE model is in the same room as the frontier models on the benchmarks Cursor users actually care about — multi-file refactors, multilingual SWE-Bench, CursorBench. It's not the best at any single thing, but at 10× cheaper it doesn't need to be. The Fast variant is still where the long autonomous loops should live if you can afford it; Composer 2.5 standard is the new sensible default for everything else. The bigger story is structural — Anthropic has been pricing Cursor into a corner by selling Claude Code at rates Cursor pays to serve. Composer 2.5 is the answer to that squeeze: a model Cursor owns end-to-end, priced where the unit economics work.

NVIDIA's SANA-WM puts minute-scale world models on a single GPU

NVIDIA Labs dropped SANA-WM on May 14 — a 2.6B-parameter open-source world model that turns one image plus a 6-DoF camera trajectory into 60 seconds of controllable 720p video, running on a single GPU. The paper claims visual quality on par with industrial baselines like LingBot-World and HY-WorldPlay, at a fraction of the compute.

The architecture is a hybrid linear diffusion transformer: frame-wise Gated DeltaNet handles long-context modeling with linear cost, softmax attention covers the parts that need full-rank mixing, and a dual-branch camera path enforces precise trajectory adherence. A two-stage pipeline applies a long-video refiner over first-pass outputs for temporal consistency across the full minute. Training used only ~213K public video clips with metric-scale pose supervision, completing in 15 days on 64 H100s — small for a frontier video model.

The number that matters: the distilled variant runs on a single RTX 5090 with NVFP4 quantization, denoising a 60s 720p clip in 34 seconds — roughly 36× the throughput of prior open-source baselines. The paper and code are out alongside the project page.

Practical read: the headline isn't pixel quality — industrial models still match it. It's that minute-scale, camera-controllable world models stop being a multi-GPU research artifact and start being something you run locally with a starting frame and a trajectory. For agent work — generating training video, simulating embodied environments, building evaluation scenarios at scale — the cost curve just bent hard. Worth watching how fast this gets wired into robotics and game-engine workflows.

Cursor Security Review goes managed

The DIY security agents Cursor open-sourced earlier just turned into a product. Cursor Security Review is in beta on Teams and Enterprise plans, with two always-on agents you turn on from the dashboard instead of standing up Lambdas and Terraform yourself.

Security Reviewer runs on every PR. It checks for vulnerabilities, auth regressions, privacy and data-handling risks, agent tool auto-approvals, and prompt injection — and leaves inline comments at the exact diff location with severity and remediation. Vulnerability Scanner runs on a schedule across the codebase, looking for known CVEs, outdated dependencies, and misconfigurations, with optional Slack updates.

Both are configurable: adjust triggers, drop in custom instructions, give them custom tooling, decide where outputs land. The interesting hook is MCP — you can plug in your existing SAST, SCA, and secrets scanners as MCP servers and let the agent use them as part of the review. Cursor keeps tuning the runtime, harness, and models behind the scenes. Usage comes out of your existing pool, not a separate SKU.

Practical read: a month ago, getting a security review agent into your PR pipeline meant adopting Cursor's open-source templates, deploying a Lambda, and wiring Slack yourself. Now it's a toggle. The interesting part isn't the convenience — it's that "security agent" is becoming a product category, not a custom build. The DIY version still exists for teams who want full control; the managed version is for teams who want it on by Friday.

Vercel opens the cloud agent stack

Cursor's SDK gives you agent infrastructure from an IDE company. Vercel's answer is different: an open-source reference implementation you can actually read.

Open Agents is MIT-licensed, deployed at open-agents.dev, and explicitly framed as a reference, not a starter kit. The goal is visibility — see exactly how the pieces wire together, then fork and adapt.

The architecture is three layers: web app → agent workflow → sandbox VM. The web layer handles auth (Better Auth, GitHub OAuth), sessions, chat, and streaming UI built on Next.js. The agent runs as a durable workflow via Vercel's Workflows SDK — long-running execution that can hibernate and resume without losing state. The sandbox is an isolated VM with a full filesystem, shell, git, dev servers, and preview ports.

The critical design choice: agent and sandbox are separate. The agent doesn't run inside the VM — it reaches in via tool calls. That means each layer can hibernate independently. Pause a long coding task, come back hours later, the agent picks up without the sandbox burning compute the whole time.

Feature set: file reads and edits, shell commands, web search, git operations, optional auto-commit and PR creation, session sharing via read-only links, voice input via ElevenLabs. Neon PostgreSQL for persistence, optional Redis or Vercel KV for caching.

Practical read: this is the cleanest public example of how Vercel's own stack — Workflows SDK, Sandboxes, Gateway — wires together for a coding agent. If you've been building agent loops yourself over raw APIs, reading this codebase is faster than reading docs. The architectural insight is the same one underneath the Cursor SDK: durable execution plus isolated sandbox plus external tool access is the skeleton of every cloud coding agent. Vercel just made theirs legible.

Cursor SDK opens the harness up

Cursor shipped a TypeScript SDK on April 29 — the same runtime, harness, and models that power the desktop app, CLI, and web client, now available programmatically via npm install @cursor/sdk. Public beta, token-based pricing; SDK examples now default to Composer 2.5 (Cursor's in-house coding model — roughly 10× cheaper per input token than both Opus 4.7 and GPT-5.5).

Agents created through the SDK get the full stack: codebase indexing with semantic search and instant grep, MCP servers, skills from .cursor/skills/, and hooks from .cursor/hooks.json. Execution can target sandboxed cloud VMs (Cursor-managed), self-hosted workers (your network), or your local machine. Subagents, streaming, and the same harness primitives that ship in the IDE are exposed as composable APIs.

The cookbook is the part to look at. Four reference projects: a minimal quickstart, a web-based prototyping tool that scaffolds new projects in a sandbox, a lightweight coding-agent CLI, and the agent-kanban board — a Linear-style UI where each card represents a Cloud Agent. Drag a card to "in progress" and the agent picks the work up, runs to completion in a sandbox, opens a PR, and posts the result back to the card as an attachment. The board lists running agents, groups them into columns, previews artifacts inline, and creates new agents from a repo + prompt.

Practical read: the SDK turns Cursor from "an editor with agents" into "agent runtime you can build on." If you've been wiring up your own agent loops over the Anthropic or OpenAI APIs, the SDK is shorter — you inherit indexing, hooks, subagents, and sandboxing instead of reinventing them. The kanban example is the cleanest demonstration of where this lands: tickets become agent invocations, drag-and-drop becomes scheduling, and PRs become the artifact.

Cursor ships agentic security review

Cursor open-sourced its Agentic Security Review — a security-tuned automation that runs on every pull request, posts findings as PR comments, and can block CI on high-severity issues. It audits diffs for exploitable vulnerabilities (auth, input validation, permission checks), skips items already discussed in the PR, and routes high-risk findings to a private Slack channel.

The review agent ships alongside three other security agents Cursor runs internally — Vuln Hunter (segments the repo and hunts for vulnerabilities), Anybump (handles dependency patching, runs tests, opens a PR if they pass), and Invariant Sentinel (runs daily to detect drift against a list of compliance and privacy invariants). Templates and Terraform for all four are public, with a custom MCP server deployed as a serverless Lambda handling persistent state, deduplication, and Slack formatting. The Cursor blog post has the full architecture.

In Cursor's own deployment, the review agent has run on thousands of PRs and prevented hundreds of issues from reaching production in the last two months. The rollout sequence they used is worth copying: silent mode to a private Slack channel first, then PR comments once precision was high enough, then a blocking CI gate.

Practical read: security review used to be a /security-review slash command you remembered to run. As an always-on automation tied to CI, it stops being a discipline problem and starts being infrastructure. Worth pairing with hooks — the hook blocks bad writes locally, the review agent catches what makes it through.

GPT-5.5 lands a week after Opus 4.7 — and the vibe flips

OpenAI announced GPT-5.5 on April 23, seven weeks after 5.4 and seven days after Anthropic's Opus 4.7, with API availability following on April 24. It's rolling out on ChatGPT Plus, Pro, Business, and Enterprise, in Codex, and in the API — with a higher-tier GPT-5.5 Pro alongside it. API pricing is $5 / $30 per million tokens, roughly 2× GPT-5.4, with a 1M-token context window and per-token latency that matches 5.4 in real-world serving.

The quiet detail is that this is the first fully retrained base model since GPT-4.5. The "5.4 → 5.5" version bump undersells the delta — developers poking at it on launch day kept repeating some version of "it just gets it" and "much less hand-holding." It's better at multi-step tool use, at staying in the loop until a task finishes, and at writing and debugging code without being steered every turn.

Against Opus 4.7 the picture splits cleanly by workload. GPT-5.5 wins the autonomous-loop benchmarks: Terminal-Bench 2.0 at 82.7% vs 69.4%, plus leads on BrowseComp (+5.1pp) and CyberGym (+8.7pp). Opus 4.7 wins the reasoning-and-review cluster: SWE-bench Pro 64.3% vs 58.6%, HLE 46.9% vs 41.4%, and MCP-Atlas 79.1% vs 75.3%. Of the ten benchmarks both labs report, Opus leads on six and GPT-5.5 leads on four — but GPT-5.5's four are the ones closest to "agent that runs a shell for an hour."

Reception is sharply warmer than Opus 4.7's, which landed with a 46-point MRCR regression and loud complaints about argumentative coding loops. GPT-5.5 feels like a clean upgrade at comparable speed, which is why devs are calling it a "revival" of the 5.x line. The pushback is almost entirely about price: at 2× GPT-5.4 for the base tier and more for Pro, teams with cost ceilings are staying on 5.4 for anything that doesn't need the extra agentic range. Output pricing is also $5/M more than Opus 4.7, though GPT-5.5 tends to emit fewer tokens per task, which partially offsets on the bill.

Practical read: for agentic coding, browser automation, and long tool-use loops, GPT-5.5 is the default this week. For long-document reasoning, review-grade correctness, and anything close to HLE territory, Opus 4.7 still edges it. If you were burned by the Opus 4.7 release and parked on 4.6 or 5.4, 5.5 is the first model since 4.5 where the jump is worth the retest.

Cursor 3: parallel agents and worktrees

Cursor 3 ships two changes that address the same problem — waiting.

The first is /multitask. Instead of queuing requests and running them serially, Cursor can now spin up async subagents to handle them in parallel. For requests already in the queue, you can ask Cursor to multitask mid-run rather than waiting for the current task to finish.

The second is improved worktrees in the agents window. Run isolated tasks in the background across different branches simultaneously. When you're ready to test changes, bring any branch into your local foreground with one click.

Combined, these features move Cursor toward a model where you describe work across multiple tasks and let the editor figure out execution order. The interface is catching up to what the agents can already do.

The packaging pattern works for design too

The same pattern that works for code conventions works for design. ux-ui-agent-skills packages DTCG design tokens, Atomic Design component specs, WCAG 2.2 checklists, Nielsen heuristic rubrics, and React + Tailwind v4 / Next.js 15 patterns into a single skill set.

Drop it into a project and the agent applies consistent design knowledge — token mapping, accessibility scoring, state documentation — instead of improvising each time.

Any domain with enough accumulated knowledge can be packaged this way. Design is just a clear example because the gap between "AI generates UI" and "AI generates good UI" is so visible.

Mythos Preview finds zero-days at scale

On April 7, Anthropic previewed Claude Mythos — an unreleased research model that's dramatically better at exploiting software than anything shipped before. On a Firefox vulnerability set where Opus 4.6 built working JavaScript shell exploits 2 times in several hundred attempts, Mythos built them 181 times. On OSS-Fuzz, it produced 595 tier-1/2 crashes versus Opus 4.6's 150–175, with full control flow hijacking demonstrated on ten targets.

The bugs it found are the kind that normally take years of expert attention. A 27-year-old OpenBSD TCP flaw enabling remote DoS. A 16-year-old FFmpeg H.264 codec bug that OSS-Fuzz missed after roughly 5 million fuzzing attempts. A 17-year-old FreeBSD NFS remote code execution, now tracked as CVE-2026-4747. Thousands of additional critical and high-severity findings across major open-source projects. On cybersecurity vulnerability reproduction, Mythos scores 83.1% against Opus 4.6's 66.6%.

Anthropic frames this as a "watershed moment," with their own caveat that "most security tooling has historically benefited defenders more than attackers" but the transition period may be "tumultuous." The practical read for anyone shipping software is blunt: patching cycles that were fine six months ago are not fine now. Threat models that assumed expert attackers were a scarce resource need revisiting. The offensive floor just moved, and it moved a lot.

Glasswing: defenders get the model first

Mythos doesn't ship alone. Project Glasswing is the coordinated deployment — a partnership with 12 founding organisations (AWS, Apple, Google, Microsoft, NVIDIA, Linux Foundation, JPMorgan Chase, CrowdStrike, Palo Alto Networks, Cisco, Broadcom, and Anthropic itself) plus 40+ additional critical-infrastructure organisations, plus dedicated funding for open-source maintainers.

The financial commitment: $100M in Mythos model credits for defensive use, $2.5M to Alpha-Omega and OpenSSF, and $1.5M to the Apache Software Foundation. The goal is months of concentrated patch work on the dependencies everything else rests on before anything similar becomes broadly available. The model itself isn't on the public API — access is gated to consortium members and vetted maintainers.

The interesting precedent isn't the money. It's the deployment pattern. A model capable enough to shift the offence/defence balance ships to defenders first, through a consortium, with funded upstream patch work targeted at the libraries that form the trust root of modern software. No broad API release. No public methodology paper. The strategy assumes that if you give this capability to defenders at scale, they can close the window before attackers reach parity. Whether that bet pays off is the open question — but it's the first time a frontier capability has been deliberately held back from general availability for a coordinated defensive push. Worth watching how the pattern generalises.

Claude Design joins Anthropic Labs

Claude Design shipped April 17 as a research preview — Anthropic's first dedicated visual creation tool, running on Opus 4.7, with direct handoff to Claude Code for development.

What you actually do with it: point it at a codebase and it picks up the design system, import a doc, image, or URL and it generates designs, prototypes, slides, or one-pagers, then refine inline with fine-grained controls. Export to Canva, PDF, PPTX, or HTML. Organization-scoped sharing for teams. The target audience is wider than designers — product managers doing wireframes, founders building pitch decks, marketers making campaign materials, non-designers who need visual output and normally punt the work.

Brilliant cited in the launch: complex prototyping dropped from 20+ prompts to 2. The more interesting detail is the Claude Code handoff. Design and code are usually connected by a lossy export step — a Figma file becomes hand-written React, with drift appearing immediately. Claude Design treats them as one continuous surface: generate in Design, refine in Design, hand the component tree directly to Claude Code for implementation. If that pipeline holds up in practice, it's a different workflow, not just a faster one.

Available now on Pro, Max, Team, and Enterprise with gradual rollout from April 17.

Claude Opus 4.7 lands — and the reception is split

Anthropic shipped Claude Opus 4.7 on April 16. Generally available on the Claude API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry from day one, with pricing held steady at $5 / $25 per million tokens — unchanged from 4.6.

The headline benchmarks move in the right direction. 70% on CursorBench vs 58% for 4.6. 3× more production tasks resolved on Rakuten-SWE-Bench. 13% gain on a 93-task coding suite, state-of-the-art on GDPval-AA. Vision is the standout: 98.5% on XBOW's autonomous pentest visual-acuity benchmark vs 54.5% for 4.6, with image input now up to ~3.75 megapixels.

But the developer reception is sharper than the numbers suggest. On MRCR v2 at 1M tokens, Opus 4.7 scores 32.2% — down from 78.3% on 4.6. A 46-point collapse in long-context retrieval. BrowseComp dropped from 83.7% to 79.3%. Developers on Hacker News and X are reporting argumentative loops in Claude Code where the model re-raises concerns it already acknowledged as resolved, more frequent hallucinations on long documents, and stub implementations where 4.6 produced complete ones. One independent run had Sonnet 4.6 beat Opus 4.7 on a custom terminal-UI task.

Anthropic's own release notes hint at the tradeoff in softer language: "prompts written for earlier models can sometimes now produce unexpected results" as 4.7 interprets instructions more literally. The Finout pricing breakdown flags the other caveat: a new tokenizer can produce up to ~35% more tokens for the same input. The rate card didn't change, but your real bill per request can still go up.

Practical read: if your work is primarily agentic coding, vision, or document reasoning, upgrade and measure. If you're doing long-document Q&A, needle-in-a-haystack retrieval, or agentic browsing over very long contexts, stay on 4.6 until a patch release. "Narrowly retaking the lead" is an accurate but incomplete summary — on the specific things some teams built their workflows around, 4.7 is a step back.

Hooks move agents from advice to automation

Skills tell an agent what to do. Hooks make certain things happen regardless of what the agent decides.

They fire on lifecycle events — before a tool executes, after it finishes, on message submit, on stop, before compaction, on permission requests. Unlike skills, they're deterministic: the hook runs every time, not when the model judges it relevant.

Practical uses: auto-format after edits, block writes to protected paths, desktop notifications when Claude is waiting, re-inject context after compaction. Full schema in the Anthropic hooks guide.

The fastest way to start: Hookify. /hookify Warn me when I use rm -rf commands produces a working hook file immediately. Run it with no arguments and it auto-generates rules from behaviors you've already corrected in the current session.

Skills encode knowledge. Hooks enforce behavior. Both belong in a mature setup.

Harness design is now part of the craft

A harness is everything around the model: prompts, tools, orchestration, context management, hooks. If you want the from-scratch primer on what a harness is and why it matters, chapter 19 — "What Is an AI Harness?" covers it. This section is the moving-target view. Anthropic published a detailed breakdown of how a harness affects long-running agent performance — two findings stood out.

First: agents lose coherence as context fills. Some models exhibit "context anxiety," wrapping up prematurely. The fix isn't compaction (summarizing in place) — it's context resets: clear the window, start a fresh agent with a structured handoff. Compaction preserves continuity but doesn't give the agent a clean slate.

Second: agents reliably praise their own work when asked to evaluate it. The fix is architectural — separate the generator from the evaluator. A standalone evaluator tuned to be skeptical is far more tractable than making a generator self-critical.

The architecture that emerged: planner → generator → evaluator with Playwright clicking through the running app. Every component encodes an assumption about what the model can't do alone — those assumptions go stale as models improve. Strip non-load-bearing scaffolding when a new model lands. The full article has cost and duration breakdowns.

Subagents: the Cursor model is worth studying

Cursor's subagents go further than the AGENTS.md pattern. Each gets its own context window and model config, runs foreground or background, and three built-ins (Explore, Bash, Browser) handle the noisiest operations automatically.

Custom subagents are markdown files with YAML frontmatter in .cursor/agents/ (or .claude/agents/):

---
name: security-auditor
description: Security specialist. Use when implementing auth, payments, or handling sensitive data.
model: inherit
readonly: true
---
 
You are a security expert auditing code for vulnerabilities.

The description field determines when the parent delegates — spend time on it. The model field lets you route high-volume tasks to a faster model and depth tasks to a more capable one.

Anti-pattern: dozens of vague subagents. Five focused ones with sharp descriptions outperform fifty the parent doesn't know when to use.

Next.js MCP is becoming practical

Next.js 16 ships with a built-in MCP endpoint at /_next/mcp. Add next-devtools-mcp to .mcp.json and your agent gets live access to build errors, runtime errors, routes, page metadata, and server action IDs — no screenshots, no copy-paste.

{
  "mcpServers": {
    "next-devtools": {
      "command": "npx",
      "args": ["-y", "next-devtools-mcp@latest"]
    }
  }
}

Useful tools: get_errors (source-mapped stacks), get_routes, get_page_metadata, get_server_action_by_id. The agent can diagnose a hydration error and suggest a fix without you describing what's on screen.

Below Next.js 16: experimental.mcpServer: true in next.config.js.

Browser control is getting lighter-weight

The fastest way for an agent to use a browser is to let it write code. dev-browser runs Playwright-style scripts in a sandboxed QuickJS WASM environment — install it globally, point the agent at dev-browser --help, and it handles the rest.

npm i -g dev-browser
dev-browser install

From the repo benchmarks: Dev Browser finishes a representative task in 3m 53s at $0.88 with 29 turns. Playwright MCP takes 4m 31s at $1.45 with 51 turns. Batching interactions into scripts beats one-tool-call-per-action.

Pre-approve in .claude/settings.json: "allow": ["Bash(dev-browser *)"].

Infra is becoming part of the product

Cursor's self-hosted cloud agents are now generally available. A worker process connects outbound via HTTPS — no inbound ports, no firewall changes. Cursor handles inference and planning, sends tool calls to the worker, results flow back. Each session gets its own dedicated worker; Kubernetes operator available for scale.

The practical benefit: agents can access internal caches, dependencies, and network endpoints that can't leave the environment. Code and secrets stay in your infrastructure.

Teams at Brex, Money Forward, and Notion are running this at scale. Notion cited access to more tools more securely as the reason for adopting it over maintaining their own background agent stack. "Agent infrastructure" is now a real architectural decision.

Cloud agents run on your hardware now

Cursor's My Machines takes self-hosted agents from an enterprise feature to an individual one. Instead of running in Cursor's managed VMs, your agent executes on hardware you control — your laptop, a devbox, a remote VM. Three commands to get there:

curl https://cursor.com/install -fsS | bash
agent login
agent worker start

The worker opens an outbound connection to Cursor — no inbound ports, no firewall changes, just HTTPS to api2.cursor.sh. Cursor handles inference and planning, sends tool calls to the worker, and terminal commands, file edits, and browser actions all execute on your machine. Your local repo, dependency caches, build artifacts, internal network — the agent gets all of it.

The MCP routing is worth noting. Stdio-transport MCP servers run on your machine, so they can reach private endpoints your network can access. HTTP/SSE-transport servers run on Cursor's backend, where Cursor handles OAuth and session caching. If your MCP server needs to hit an internal API, use stdio.

Workers are long-lived by default — they stay connected until you stop them and pick up future sessions automatically. Name them with --name "my-devbox" when you have multiple machines. For org-wide fleets, Cursor has a separate Self-Hosted Pool with Kubernetes operators. My Machines is the individual-developer version: one process, one machine, immediate access.

The shift underneath is conceptual. "Cloud agent" used to mean "runs in a cloud VM." Now the cloud part is just inference. Execution goes wherever makes sense — Cursor's sandboxes for isolation, your laptop for local deps, your company's cluster for compliance.

Harnesses are becoming shareable infrastructure

everything-claude-code is a useful example of where this is heading: 30 specialized subagents, hooks for memory persistence, verification loops, continuous learning, and security scanning — shipping across Claude Code, Cursor, Codex, and OpenCode.

The instinct system is the interesting part: the agent extracts patterns from your sessions into structured files, and /evolve clusters them into skills. The harness learns from use.

The community is converging on a shared vocabulary — skills, subagents, hooks, harnesses, evals. The primitives are stabilizing even as the specific tools change.

Engineering practices are becoming installable

Addy Osmani packaged Google's engineering culture into agent-skills: 20 skills across a 6-phase lifecycle, with 7 slash commands (/spec, /plan, /build, /test, /review, /code-simplify, /ship) that map to the full development loop.

Each skill has the same anatomy: process steps, anti-rationalization tables (rebuttals for "I'll add tests later"), red flags, and verification gates. The engineering principles are baked in — Hyrum's Law for API design, Chesterton's Fence for simplification, the Beyoncé Rule for testing ("if you liked it, you shoulda put a test on it"), 80/15/5 test pyramid ratios.

What makes it interesting isn't the content — most experienced engineers know these practices. It's the format. When engineering culture is encoded as structured markdown, the floor rises. Junior developers running these skills get senior-level guardrails without senior-level experience. The agent doesn't skip tests because it's in a hurry. It doesn't rationalize away code review.

Works across Claude Code, Cursor, Gemini CLI, and anything that accepts markdown.

Design systems are going agent-readable

Google Stitch introduced DESIGN.md — a plain-text design system document that agents read to generate consistent UI. No Figma plugins, no design token APIs. Just a markdown file with nine sections: visual theme, color palette with hex values, typography rules, component styling including states, layout principles, depth and elevation, do's and don'ts, responsive behavior, and an agent prompt guide.

awesome-design-md took this further — 58+ DESIGN.md files extracted from real companies. Claude, Stripe, Vercel, Linear, Figma, Airbnb, Spotify, Tesla. Drop one into your project and the agent generates UI that matches that design system.

The insight: agents are already generating UI. The problem was never capability — it was consistency. A DESIGN.md file gives the agent the same reference a human designer would use, in the format it processes best. Markdown over Figma, at least for the agent.

Browse the collection at getdesign.md.

Stitch just open-sourced the DESIGN.md specification itself — Apache 2.0, formal schema, CLI with a linter, differ, and exporter. Any tool can implement it now, not just Stitch.

The additions that matter: semantic color intent, so agents know what a color is for rather than just its hex value. And built-in WCAG validation — the linter catches contrast failures at lint time, before anything ships. The export command converts tokens to Tailwind config or W3C DTCG JSON simultaneously, so one file feeds your CSS, your build system, and your agent.

Spec at google-labs-code/design.md.

Knowledge bases are replacing notebooks

Andrej Karpathy shared a pattern worth paying attention to: instead of using LLMs to write code, use them to build personal knowledge bases.

The structure is simple. Raw materials — papers, articles, repos, datasets — go into a raw/ directory. The LLM "compiles" them into a wiki: structured .md files with summaries, backlinks, concept pages, and cross-references. You query the wiki, and the LLM synthesizes answers with citations.

This isn't RAG. RAG re-discovers knowledge from scratch on every question — chunk, retrieve, generate, forget. The wiki accumulates. Ask a question that requires synthesizing five documents, and the answer is already on a page, not assembled from fragments at query time.

Karpathy's own wiki on recent research: ~100 articles, ~400K words. Periodic linting passes check for contradictions, stale claims, orphan pages, and missing cross-references. The whole thing is a git repo, so you get version history for free. Instead of sharing code, he published a GitHub Gist as an "idea file" — in the era of agents, you share the idea and each person's agent builds a version customized for their needs.

The Obsidian connection makes it practical. Obsidian's web clipper converts pages to markdown, the vault is a local folder an agent can read and write, and backlinks make the wiki navigable by both humans and agents. Several open-source projects — claude-obsidian, obsidian-claude-code — have formalized the workflow.

An increasing fraction of token throughput going to knowledge management instead of code generation. Worth watching.

Visual annotation beats text descriptions

Cursor 3 shipped Design Mode. Instead of typing "change the third button in the second card on the settings page," you click on the element.

Design Mode opens a browser panel inside Cursor showing your running app. Click any UI element — a button, a heading, a card — and annotate it with instructions. The agent receives the component tree path, computed styles, and surrounding context. You can draw directly on the preview to indicate layout changes or spacing adjustments.

In practice: about 70% of annotations result in correct fixes on the first try. It struggles with dynamically rendered content and complex CSS-in-JS setups where styles aren't straightforward to trace.

This is the direction. Text descriptions of visual problems are lossy. Pointing at the thing and saying "fix this" is how humans communicate about UI. The tooling is catching up to the gesture.

Canvases make agent output interactive

Cursor shipped canvases — interactive visual surfaces that agents create inline. Instead of reading a text-based summary of your data, the agent generates a custom dashboard, chart, or visualization you can click through.

The shift is in the output format. Most agent responses are text. Canvases let the agent build a small interactive application as the response — a dependency graph you can explore, a timeline you can scrub, a layout you can rearrange. The agent writes the visualization code, renders it in a sandboxed canvas, and you interact with the result directly.

This matters because some information is fundamentally better explored than read. A table of API response times is less useful than a chart you can filter by endpoint. A list of component dependencies is less useful than a graph you can zoom into. Canvases give the agent a richer output vocabulary.

Open models keep closing the gap

Gemma 4 landed on April 2. Four variants: E2B (2.3B effective), E4B (4.5B effective), 26B MoE (4B active), and 31B dense. All Apache 2.0 — a real license change from Google's previous, more restrictive terms for open models.

The 31B dense model hit #3 on Arena AI's text leaderboard at 1452 Elo, outperforming models twenty times its size. The 26B MoE hit #6 at 1441 Elo.

Multimodal out of the box: images, audio, variable aspect ratios, document parsing, handwriting OCR. Up to 256K context for the larger variants, 128K for the smaller ones. Over 140 languages.

The gap between open and closed models compresses with every release. Self-hosted agents running Gemma 4 31B are now competitive on reasoning benchmarks with frontier models from a year ago. For teams that can't send code to an API, that matters.

Where to start

If you're setting up a serious agent harness for the first time, the order matters:

  1. Get hooks working first. A single hook that blocks writes to node_modules/ or auto-formats after edits gives you immediate, observable value. Use Hookify's zero-argument mode to bootstrap from your own session history.
  2. Write one focused subagent before writing ten. Pick the task where your current setup most often loses context — security review, database migrations, API contract checks — and build one sharp subagent for it. Refine the description field until the parent routes to it reliably.
  3. Read the harness article before building evaluators. The generator/evaluator split is the insight with the most practical leverage. Get that architecture right before optimising anything else.
  4. Add next-devtools-mcp if you're on a Next.js project. The signal-to-noise improvement on error diagnosis is immediate and costs nothing.
  5. Check everything-claude-code for patterns, not prescriptions. It's a reference harness, not a starter kit. Extract the ideas that fit your context.

Resources

Sorted roughly by how much foundational leverage they provide.

Core reading

  • Harness Design for Long-Running Applications — Anthropic Engineering The reference article on harness architecture: context anxiety, generator/evaluator splits, context resets vs. compaction, and cost/duration breakdowns. Read this before designing any multi-step agent.

  • everything-claude-code 30 specialised subagents, memory hooks, verification loops, and an instinct system that learns from your sessions. The most complete reference harness available publicly. Works across Claude Code, Cursor, Codex, and OpenCode.

  • Anatomy of the Claude Folder Clear breakdown of what goes where in .claude/ — settings, hooks, subagents, skills, memory. Essential orientation if you're building a harness from scratch.

  • 3 Principles for Designing Agent Skills — Block Engineering Composability, observability, and minimal footprint. A tight framework for evaluating whether a skill is worth extracting.

  • claude-code-best-practice Community-curated collection of CLAUDE.md patterns, workflow configs, and prompt strategies. Good place to see what's converged as convention.

  • The Complete Guide to Building Skills for Claude — Anthropic Official reference guide for building Claude skills. Covers skill structure, frontmatter, when to extract a skill vs. keep it inline, and how the harness routes invocations. Read alongside the Anatomy of the Claude Folder post.

  • LLM Knowledge Bases — Andrej Karpathy The idea file for building personal knowledge wikis with LLMs. Raw materials → compiled wiki → queryable knowledge base. The alternative to RAG that accumulates instead of rediscovering.

Tooling

  • Hookify plugin — Official Claude Code plugin Source for the Hookify plugin. The zero-argument mode (auto-generates rules from your session history) is the fastest way to start building a hook library.

  • Cursor Subagents Official documentation for Cursor's subagent system — context window isolation, foreground/background execution, built-in Explore/Bash/Browser agents. The description field guidance is especially practical.

  • Cursor My Machines Run cloud agents on your own hardware — laptop, devbox, or remote VM. Three commands to set up; stdio MCP servers run locally with full network access. The individual-developer path to self-hosted agents.

  • next-devtools-mcp MCP server that gives agents live access to Next.js build errors, runtime errors, routes, and server action IDs. Replaces screenshot-based debugging.

  • dev-browser Runs Playwright-style scripts in a sandboxed QuickJS WASM environment. Benchmarks show ~30% fewer turns and ~40% lower cost vs. Playwright MCP for representative browser tasks.

  • ux-ui-agent-skills Design system skills packaging: DTCG tokens, Atomic Design specs, WCAG 2.2 checklists, and React + Tailwind v4 patterns. A concrete example of the domain-packaging pattern applied to UI.

  • Cursor Marketplace Browsable registry of community plugins and skill packs. Useful for finding what's already been packaged before building your own.

  • agent-skills — Addy Osmani 20 production-grade engineering skills with 7 slash commands, encoding Google's engineering practices (Hyrum's Law, Chesterton's Fence, test pyramids) as structured agent workflows.

  • DESIGN.md Specification — Google Labs The open-source spec for DESIGN.md: Apache 2.0, formal YAML/markdown schema, CLI with linter (including WCAG contrast validation), differ, and exporter to Tailwind and W3C DTCG JSON. Any tool can implement it.

  • awesome-design-md 58+ DESIGN.md files extracted from real companies — agent-readable design systems in plain markdown. Browse at getdesign.md.

  • claude-obsidian Claude + Obsidian knowledge companion implementing Karpathy's LLM wiki pattern. Persistent, compounding wiki vault with /wiki, /save, and /autoresearch commands.

  • Cursor Canvas Agents create interactive visual dashboards and custom interfaces inline. The output format shift from text to explorable visualizations.

  • Cursor SDK TypeScript SDK exposing Cursor's runtime, harness, models, codebase indexing, MCP, skills, and hooks. Public beta via npm install @cursor/sdk. Default model is Composer 2. Pair with the cookbook — the agent-kanban example is the clearest demo of agents-as-tickets.

  • Open Agents — Vercel Labs MIT-licensed reference app for building background cloud coding agents on Vercel. Web app + durable Workflows SDK + isolated Sandbox VM. Agent and sandbox are decoupled, so each can hibernate independently. Fork it to understand the wiring; don't use it as a starter kit.

  • Cursor Security Review Official docs for Cursor's Agentic Security Review automation. Open-source templates and Terraform on Cursor Automations, plus three companion security agents (Vuln Hunter, Anybump, Invariant Sentinel). The blog post covers the architecture and rollout strategy.

  • Cursor 3.0 changelog Design Mode, Agents Window, and the architectural shift to agent-first IDE. The Design Mode feature is the standout addition.

  • Tweet: cursor_ai on /multitask and worktrees Official announcement of parallel agent execution via /multitask and the new worktrees UI in Cursor 3.

Context and background

  • superpowers A collection of Claude Code skills and hooks by Jesse Vincent. Good real-world reference for how an experienced developer structures a personal harness — useful for seeing what someone actually keeps vs. discards.

  • gsap-skills Official GSAP skill pack for AI agents. A clean example of first-party library authors packaging their own knowledge for agent use — the likely direction for more ecosystems.

  • Vercel React Best Practices Vercel Engineering's guide to React performance: RSC boundaries, data fetching patterns, and component composition. Useful context when agents are generating or reviewing React code.

  • Tweet: kirillk_web3 on Claude Skills 16-minute video of two Anthropic engineers (Barry and Mahesh) building Claude Skills from scratch. The key framing: skills are just folders that teach Claude your job, your workflow, your domain. Good entry point if you haven't built one yet.

  • Tweet: bcherny on Claude Code Short, worth reading for the framing on where the tooling layer is heading.

  • Tweet: vtrivedy10 on agent setups Practical notes on structuring multi-agent setups in production.

  • Shared conversation: harness patterns A real session showing harness design decisions in context.

  • AI Engineer YouTube — ai.engineer Conference talks from the AI Engineer Summit, World's Fair, and Code Summit — speakers like Andrej Karpathy, Simon Willison, Jerry Liu. Over 10 million views in 2025. The best single channel for staying current on agent tooling, evals, and infrastructure patterns as they emerge from practitioners building in production.

  • Gemma 4 — Google DeepMind Four open-weight variants under Apache 2.0, multimodal, up to 256K context. The 31B dense model ranks #3 on Arena AI's text leaderboard.