AI Field Notes

Part IV · Tools & Infrastructure

18

On My Radar

What is moving fast in agent tooling right now

Something's shifted in the last few months. The interesting problem isn't "which model is better" anymore — it's everything around the model. How you feed it context. How you keep it coherent across a long session. How you constrain what it touches. How you recover when it goes sideways.

The models are good. The question is how you harness them. That's where things are moving fast, and where most of my attention is right now. Here's what's actually made a difference.

Hooks move agents from advice to automation

Skills tell an agent what to do. Hooks make certain things happen regardless of what the agent decides.

They fire on lifecycle events — before a tool executes, after it finishes, on message submit, on stop, before compaction, on permission requests. Unlike skills, they're deterministic: the hook runs every time, not when the model judges it relevant.

Practical uses: auto-format after edits, block writes to protected paths, desktop notifications when Claude is waiting, re-inject context after compaction. Full schema in the Anthropic hooks guide.
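As a sketch, an auto-format hook in .claude/settings.json might look like the following. The matcher and command are illustrative (the jq pipeline assumes the hook receives the tool call as JSON on stdin, per the hooks guide); check the official schema before relying on it:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          {
            "type": "command",
            "command": "jq -r '.tool_input.file_path' | xargs -r npx prettier --write"
          }
        ]
      }
    ]
  }
}
```

Because this is a PostToolUse hook, it fires after every matching Edit or Write, whether or not the model thinks formatting is relevant.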

The fastest way to start: Hookify. A prompt like /hookify Warn me when I use rm -rf commands produces a working hook file immediately. Run /hookify with no arguments and it auto-generates rules from behaviors you've already corrected in the current session.

Skills encode knowledge. Hooks enforce behavior. Both belong in a mature setup.

Harness design is now part of the craft

A harness is everything around the model: prompts, tools, orchestration, context management, hooks. Anthropic published a detailed breakdown of how it affects long-running agent performance — two findings stood out.

First: agents lose coherence as context fills. Some models exhibit "context anxiety," wrapping up prematurely. The fix isn't compaction (summarizing in place) — it's context resets: clear the window, start a fresh agent with a structured handoff. Compaction preserves continuity but doesn't give the agent a clean slate.
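The reset-with-handoff loop can be sketched in a few lines. This is illustrative pseudologic, not Anthropic's implementation: the window is a plain list, and make_handoff stands in for the structured brief a fresh agent would start from.

```python
def make_handoff(state):
    """Distill the session into a structured brief a fresh agent can start from."""
    return {
        "goal": state["goal"],
        "done": list(state["done"]),
        "next_steps": list(state["remaining"]),
        "decisions": list(state["decisions"]),
    }

def run_with_resets(steps, context_limit=3):
    """Process steps; when the window fills, reset it to just the handoff."""
    state = {"goal": "demo", "done": [], "remaining": list(steps), "decisions": []}
    window = []  # stand-in for the model's context window
    resets = 0
    while state["remaining"]:
        step = state["remaining"].pop(0)
        window.append(f"result of {step}")   # every step adds tokens
        state["done"].append(step)
        if len(window) >= context_limit:     # window near capacity
            window = [make_handoff(state)]   # clean slate + handoff, not a summary in place
            resets += 1
    return state["done"], resets
```

The key contrast with compaction is the second-to-last line: the window is replaced wholesale rather than summarized inside itself, so the next stretch of work starts from a clean slate.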

Second: agents reliably praise their own work when asked to evaluate it. The fix is architectural — separate the generator from the evaluator. A standalone evaluator tuned to be skeptical is far more tractable than making a generator self-critical.
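Structurally, the split is two differently prompted calls in a loop. The sketch below uses a stubbed call_model so it runs standalone; the prompts and the canned responses are mine, not from the article:

```python
GENERATOR_PROMPT = "You implement the task described by the user."
EVALUATOR_PROMPT = (
    "You are a skeptical reviewer. Assume the work is broken until the "
    "evidence proves otherwise. List concrete failures, not praise."
)

def call_model(system, user):
    # Stand-in for two separately prompted model calls.
    if "skeptical" in system:
        ok = "tests passing" in user
        return {"passed": ok, "issues": [] if ok else ["no test evidence"]}
    return "patch with tests passing" if "retry" in user else "draft patch"

def build_with_review(task, max_rounds=3):
    """Generator produces; a separate evaluator gates acceptance."""
    attempt = call_model(GENERATOR_PROMPT, task)
    for _ in range(max_rounds):
        verdict = call_model(EVALUATOR_PROMPT, attempt)
        if verdict["passed"]:
            return attempt
        # Feed the evaluator's concrete complaints back to the generator.
        attempt = call_model(GENERATOR_PROMPT, f"{task} retry: {'; '.join(verdict['issues'])}")
    return attempt
```

The point of the architecture is that the evaluator never shares context or incentives with the generator, so "looks good to me" can't leak across.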

The architecture that emerged: planner → generator → evaluator with Playwright clicking through the running app. Every component encodes an assumption about what the model can't do alone — those assumptions go stale as models improve. Strip non-load-bearing scaffolding when a new model lands. The full article has cost and duration breakdowns.

Subagents: the Cursor model is worth studying

Cursor's subagents go further than the AGENTS.md pattern. Each gets its own context window and model config, runs foreground or background, and three built-ins (Explore, Bash, Browser) handle the noisiest operations automatically.

Custom subagents are markdown files with YAML frontmatter in .cursor/agents/ (or .claude/agents/):

---
name: security-auditor
description: Security specialist. Use when implementing auth, payments, or handling sensitive data.
model: inherit
readonly: true
---
 
You are a security expert auditing code for vulnerabilities.

The description field determines when the parent delegates — spend time on it. The model field lets you route high-volume tasks to a faster model and depth tasks to a more capable one.

Anti-pattern: dozens of vague subagents. Five focused ones with sharp descriptions outperform fifty the parent doesn't know when to use.

Next.js MCP is becoming practical

Next.js 16 ships with a built-in MCP endpoint at /_next/mcp. Add next-devtools-mcp to .mcp.json and your agent gets live access to build errors, runtime errors, routes, page metadata, and server action IDs — no screenshots, no copy-paste.

{
  "mcpServers": {
    "next-devtools": {
      "command": "npx",
      "args": ["-y", "next-devtools-mcp@latest"]
    }
  }
}

Useful tools: get_errors (source-mapped stacks), get_routes, get_page_metadata, get_server_action_by_id. The agent can diagnose a hydration error and suggest a fix without you describing what's on screen.

Below Next.js 16: experimental.mcpServer: true in next.config.js.
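Spelled out, that pre-16 opt-in sits under the experimental key (a minimal config fragment, assuming the flag name as given above):

```js
// next.config.js — pre-Next.js-16 opt-in for the MCP endpoint
module.exports = {
  experimental: {
    mcpServer: true,
  },
};
```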

Browser control is getting lighter-weight

The fastest way for an agent to use a browser is to let it write code. dev-browser runs Playwright-style scripts in a sandboxed QuickJS WASM environment — install it globally, point the agent at dev-browser --help, and it handles the rest.

npm i -g dev-browser
dev-browser install

From the repo benchmarks: Dev Browser finishes a representative task in 3m 53s at $0.88 with 29 turns. Playwright MCP takes 4m 31s at $1.45 with 51 turns. Batching interactions into scripts beats one-tool-call-per-action.

Pre-approve in .claude/settings.json: "allow": ["Bash(dev-browser *)"].
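In full, that allow entry nests under the standard permissions block of settings.json:

```json
{
  "permissions": {
    "allow": ["Bash(dev-browser *)"]
  }
}
```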

Infra is becoming part of the product

Cursor's self-hosted cloud agents are now generally available. A worker process connects outbound via HTTPS — no inbound ports, no firewall changes. Cursor handles inference and planning, sends tool calls to the worker, results flow back. Each session gets its own dedicated worker; Kubernetes operator available for scale.

The practical benefit: agents can access internal caches, dependencies, and network endpoints that can't leave the environment. Code and secrets stay in your infrastructure.

Teams at Brex, Money Forward, and Notion are running this at scale. Notion cited more secure access to more tools as its reason for adopting it over maintaining its own background agent stack. "Agent infrastructure" is now a real architectural decision.

Cloud agents run on your hardware now

Cursor's My Machines takes self-hosted agents from an enterprise feature to an individual one. Instead of running in Cursor's managed VMs, your agent executes on hardware you control — your laptop, a devbox, a remote VM. Three commands to get there:

curl https://cursor.com/install -fsS | bash
agent login
agent worker start

The worker opens an outbound connection to Cursor — no inbound ports, no firewall changes, just HTTPS to api2.cursor.sh. Cursor handles inference and planning and sends tool calls to the worker; terminal commands, file edits, and browser actions all execute on your machine. Your local repo, dependency caches, build artifacts, internal network — the agent gets all of it.

The MCP routing is worth noting. Stdio-transport MCP servers run on your machine, so they can reach private endpoints your network can access. HTTP/SSE-transport servers run on Cursor's backend, where Cursor handles OAuth and session caching. If your MCP server needs to hit an internal API, use stdio.
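In config terms, the transport is implied by the entry's shape: a command entry is stdio (runs on your machine), a url entry is HTTP/SSE (runs on Cursor's backend). A sketch with hypothetical server names and paths:

```json
{
  "mcpServers": {
    "internal-api": {
      "command": "node",
      "args": ["./mcp/internal-api-server.js"]
    },
    "hosted-docs": {
      "url": "https://example.com/mcp"
    }
  }
}
```

Here internal-api can reach private endpoints on your network; hosted-docs cannot, but gets Cursor-managed OAuth and session caching.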

Workers are long-lived by default — they stay connected until you stop them and pick up future sessions automatically. Name them with --name "my-devbox" when you have multiple machines. For org-wide fleets, Cursor has a separate Self-Hosted Pool with Kubernetes operators. My Machines is the individual-developer version: one process, one machine, immediate access.

The shift underneath is conceptual. "Cloud agent" used to mean "runs in a cloud VM." Now the cloud part is just inference. Execution goes wherever makes sense — Cursor's sandboxes for isolation, your laptop for local deps, your company's cluster for compliance.

Harnesses are becoming shareable infrastructure

everything-claude-code is a useful example of where this is heading: 30 specialized subagents, hooks for memory persistence, verification loops, continuous learning, and security scanning — shipping across Claude Code, Cursor, Codex, and OpenCode.

The instinct system is the interesting part: the agent extracts patterns from your sessions into structured files, and /evolve clusters them into skills. The harness learns from use.

The community is converging on a shared vocabulary — skills, subagents, hooks, harnesses, evals. The primitives are stabilizing even as the specific tools change.

The packaging pattern works for design too

The same pattern that works for code conventions works for design. ux-ui-agent-skills packages DTCG design tokens, Atomic Design component specs, WCAG 2.2 checklists, Nielsen heuristic rubrics, and React + Tailwind v4 / Next.js 15 patterns into a single skill set.

Drop it into a project and the agent applies consistent design knowledge — token mapping, accessibility scoring, state documentation — instead of improvising each time.

Any domain with enough accumulated knowledge can be packaged this way. Design is just a clear example because the gap between "AI generates UI" and "AI generates good UI" is so visible.

Engineering practices are becoming installable

Addy Osmani packaged Google's engineering culture into agent-skills: 20 skills across a 6-phase lifecycle, with 7 slash commands (/spec, /plan, /build, /test, /review, /code-simplify, /ship) that map to the full development loop.

Each skill has the same anatomy: process steps, anti-rationalization tables (rebuttals for "I'll add tests later"), red flags, and verification gates. The engineering principles are baked in — Hyrum's Law for API design, Chesterton's Fence for simplification, the Beyoncé Rule for testing ("if you liked it, you shoulda put a test on it"), 80/15/5 test pyramid ratios.
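A skill file following that anatomy might look roughly like this. The structure and wording below are illustrative, not copied from the repo:

```markdown
# /test — write tests before shipping

## Process
1. Identify the behavior change.
2. Write a failing test that pins it down.
3. Make it pass; refactor under green.

## Anti-rationalization
| Rationalization          | Rebuttal                               |
| ------------------------ | -------------------------------------- |
| "I'll add tests later"   | Later never ships. Test now or revert. |
| "It's too small to test" | Small changes break builds too.        |

## Red flags
- Diff touches logic but no test files changed.

## Verification gate
- Suite passes and covers the new branch.
```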

What makes it interesting isn't the content — most experienced engineers know these practices. It's the format. When engineering culture is encoded as structured markdown, the floor rises. Junior developers running these skills get senior-level guardrails without senior-level experience. The agent doesn't skip tests because it's in a hurry. It doesn't rationalize away code review.

Works across Claude Code, Cursor, Gemini CLI, and anything that accepts markdown.

Design systems are going agent-readable

Google Stitch introduced DESIGN.md — a plain-text design system document that agents read to generate consistent UI. No Figma plugins, no design token APIs. Just a markdown file with nine sections: visual theme, color palette with hex values, typography rules, component styling including states, layout principles, depth and elevation, do's and don'ts, responsive behavior, and an agent prompt guide.

awesome-design-md took this further — 58+ DESIGN.md files extracted from real companies. Claude, Stripe, Vercel, Linear, Figma, Airbnb, Spotify, Tesla. Drop one into your project and the agent generates UI that matches that design system.

The insight: agents are already generating UI. The problem was never capability — it was consistency. A DESIGN.md file gives the agent the same reference a human designer would use, in the format it processes best. Markdown over Figma, at least for the agent.

Browse the collection at getdesign.md.

Knowledge bases are replacing notebooks

Andrej Karpathy shared a pattern worth paying attention to: instead of using LLMs to write code, use them to build personal knowledge bases.

The structure is simple. Raw materials — papers, articles, repos, datasets — go into a raw/ directory. The LLM "compiles" them into a wiki: structured .md files with summaries, backlinks, concept pages, and cross-references. You query the wiki, and the LLM synthesizes answers with citations.
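The compile step is mechanically simple. A minimal sketch, where summarize() is a placeholder for the LLM call and the raw/ and wiki/ directory names follow the pattern above:

```python
from pathlib import Path

def summarize(text):
    # Placeholder for the LLM "compile" call; a real version would
    # produce a structured summary with concept links.
    return text[:200]

def compile_wiki(raw_dir="raw", wiki_dir="wiki"):
    """Compile raw source documents into wiki pages plus an index."""
    Path(wiki_dir).mkdir(exist_ok=True)
    index = []
    for src in sorted(Path(raw_dir).glob("*.md")):
        body = summarize(src.read_text())
        # Backlink to the raw source so claims stay traceable.
        (Path(wiki_dir) / src.name).write_text(
            f"# {src.stem}\n\n{body}\n\n[[raw/{src.name}]]\n"
        )
        index.append(f"- [[{src.stem}]]")
    (Path(wiki_dir) / "index.md").write_text("\n".join(index) + "\n")
```

Because each page carries a backlink to its source, later linting passes (contradictions, stale claims, orphan pages) have something concrete to check against.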

This isn't RAG. RAG re-discovers knowledge from scratch on every question — chunk, retrieve, generate, forget. The wiki accumulates. Ask a question that requires synthesizing five documents, and the answer is already on a page, not assembled from fragments at query time.

Karpathy's own wiki on recent research: ~100 articles, ~400K words. Periodic linting passes check for contradictions, stale claims, orphan pages, and missing cross-references. The whole thing is a git repo, so you get version history for free. Instead of sharing code, he published a GitHub Gist as an "idea file" — in the era of agents, you share the idea and each person's agent builds a version customized for their needs.

The Obsidian connection makes it practical. Obsidian's web clipper converts pages to markdown, the vault is a local folder an agent can read and write, and backlinks make the wiki navigable by both humans and agents. Several open-source projects — claude-obsidian, obsidian-claude-code — have formalized the workflow.

An increasing fraction of token throughput is going to knowledge management instead of code generation. Worth watching.

Visual annotation beats text descriptions

Cursor 3 shipped Design Mode. Instead of typing "change the third button in the second card on the settings page," you click on the element.

Design Mode opens a browser panel inside Cursor showing your running app. Click any UI element — a button, a heading, a card — and annotate it with instructions. The agent receives the component tree path, computed styles, and surrounding context. You can draw directly on the preview to indicate layout changes or spacing adjustments.

In practice: about 70% of annotations result in correct fixes on the first try. It struggles with dynamically rendered content and complex CSS-in-JS setups where styles aren't straightforward to trace.

This is the direction. Text descriptions of visual problems are lossy. Pointing at the thing and saying "fix this" is how humans communicate about UI. The tooling is catching up to the gesture.

Open models keep closing the gap

Gemma 4 landed on April 2. Four variants: E2B (2.3B effective), E4B (4.5B effective), 26B MoE (4B active), and 31B dense. All Apache 2.0 — a real license change from Google's previous, more restrictive terms for open models.

The 31B dense model hit #3 on Arena AI's text leaderboard at 1452 Elo, outperforming models twenty times its size. The 26B MoE hit #6 at 1441 Elo.

Multimodal out of the box: images, audio, variable aspect ratios, document parsing, handwriting OCR. Up to 256K context for the larger variants, 128K for the smaller ones. Over 140 languages.

The gap between open and closed models compresses with every release. Self-hosted agents running Gemma 4 31B are now competitive on reasoning benchmarks with frontier models from a year ago. For teams that can't send code to an API, that matters.

Where to start

If you're setting up a serious agent harness for the first time, the order matters:

  1. Get hooks working first. A single hook that blocks writes to node_modules/ or auto-formats after edits gives you immediate, observable value. Use Hookify's zero-argument mode to bootstrap from your own session history.
  2. Write one focused subagent before writing ten. Pick the task where your current setup most often loses context — security review, database migrations, API contract checks — and build one sharp subagent for it. Refine the description field until the parent routes to it reliably.
  3. Read the harness article before building evaluators. The generator/evaluator split is the insight with the most practical leverage. Get that architecture right before optimizing anything else.
  4. Add next-devtools-mcp if you're on a Next.js project. The signal-to-noise improvement on error diagnosis is immediate and costs nothing.
  5. Check everything-claude-code for patterns, not prescriptions. It's a reference harness, not a starter kit. Extract the ideas that fit your context.

Resources

Sorted roughly by how much foundational leverage they provide.

Core reading

  • Harness Design for Long-Running Applications — Anthropic Engineering. The reference article on harness architecture: context anxiety, generator/evaluator splits, context resets vs. compaction, and cost/duration breakdowns. Read this before designing any multi-step agent.

  • everything-claude-code. 30 specialized subagents, memory hooks, verification loops, and an instinct system that learns from your sessions. The most complete reference harness available publicly. Works across Claude Code, Cursor, Codex, and OpenCode.

  • Anatomy of the Claude Folder. Clear breakdown of what goes where in .claude/ — settings, hooks, subagents, skills, memory. Essential orientation if you're building a harness from scratch.

  • 3 Principles for Designing Agent Skills — Block Engineering. Composability, observability, and minimal footprint. A tight framework for evaluating whether a skill is worth extracting.

  • claude-code-best-practice. Community-curated collection of CLAUDE.md patterns, workflow configs, and prompt strategies. Good place to see what's converged as convention.

  • LLM Knowledge Bases — Andrej Karpathy. The idea file for building personal knowledge wikis with LLMs: raw materials → compiled wiki → queryable knowledge base. The alternative to RAG that accumulates instead of rediscovering.

Tooling

  • Hookify — official Claude Code plugin. The zero-argument mode (auto-generates rules from your session history) is the fastest way to start building a hook library.

  • Cursor Subagents. Official documentation for Cursor's subagent system — context window isolation, foreground/background execution, built-in Explore/Bash/Browser agents. The description field guidance is especially practical.

  • Cursor My Machines. Run cloud agents on your own hardware — laptop, devbox, or remote VM. Three commands to set up; stdio MCP servers run locally with full network access. The individual-developer path to self-hosted agents.

  • next-devtools-mcp. MCP server that gives agents live access to Next.js build errors, runtime errors, routes, and server action IDs. Replaces screenshot-based debugging.

  • dev-browser. Runs Playwright-style scripts in a sandboxed QuickJS WASM environment. Benchmarks show ~40% fewer turns and ~40% lower cost vs. Playwright MCP for representative browser tasks.

  • ux-ui-agent-skills. Design-system skill packaging: DTCG tokens, Atomic Design specs, WCAG 2.2 checklists, and React + Tailwind v4 patterns. A concrete example of the domain-packaging pattern applied to UI.

  • Cursor Marketplace. Browsable registry of community plugins and skill packs. Useful for finding what's already been packaged before building your own.

  • agent-skills — Addy Osmani. 20 production-grade engineering skills with 7 slash commands, encoding Google's engineering practices (Hyrum's Law, Chesterton's Fence, test pyramids) as structured agent workflows.

  • awesome-design-md. 58+ DESIGN.md files extracted from real companies — agent-readable design systems in plain markdown. Browse at getdesign.md.

  • claude-obsidian. Claude + Obsidian knowledge companion implementing Karpathy's LLM wiki pattern. A persistent, compounding wiki vault with /wiki, /save, and /autoresearch commands.

  • Cursor 3.0 changelog. Design Mode, Agents Window, and the architectural shift to an agent-first IDE. Design Mode is the standout addition.

Context and background

  • superpowers. A collection of Claude Code skills and hooks by Jesse Vincent. A good real-world reference for how an experienced developer structures a personal harness — useful for seeing what someone actually keeps vs. discards.

  • gsap-skills. Official GSAP skill pack for AI agents. A clean example of first-party library authors packaging their own knowledge for agent use — the likely direction for more ecosystems.

  • Vercel React Best Practices. Vercel Engineering's guide to React performance: RSC boundaries, data fetching patterns, and component composition. Useful context when agents are generating or reviewing React code.

  • Tweet: bcherny on Claude Code. Short, and worth reading for the framing on where the tooling layer is heading.

  • Tweet: vtrivedy10 on agent setups. Practical notes on structuring multi-agent setups in production.

  • Shared conversation: harness patterns. A real session showing harness design decisions in context.

  • Gemma 4 — Google DeepMind. Four open-weight variants under Apache 2.0, multimodal, up to 256K context. The 31B dense model ranks #3 on Arena AI's text leaderboard.