The Intake — Sunday, April 26, 2026

Backfill notice. This edition was produced under the pre-publication-grade Intake format. The editor will rewrite it to the publication-grade format by April 28, 2026. The substance is intact; the structure will be normalized.

SUBSTRATE candidates

  • MCP-Atlas (Scale, open-sourced) and Toolathlon — top model Claude 4.5 Sonnet at 38%, not 80% — Scale Labs leaderboard (https://labs.scale.com/leaderboard/mcp_atlas); Scale blog (https://scale.com/blog/open-sourcing-mcp-atlas); Toolathlon paper (https://openreview.net/forum?id=z53s5p0qhf)
    • Beat: evals-benchmarks
    • Lens: O'Neill, Clark
    • Gloss: 1,000 human-authored tasks across 36 real MCP servers (MCP-Atlas) and 32 apps / 604 tools / 108 verifiable tasks (Toolathlon). Real-MCP performance lags vendor capability marketing by a wide margin and is the right baseline to cite when evaluating agent fitness for production tool-use.
    • Verdict: cover-now — brief. Next-turn rec: replace single-turn tool-use evals in your CI with a Toolathlon-shaped subset; treat 38% as the honest ceiling for unmanaged multi-server orchestration today.
  • OpenAI GPT-5.5 (Apr 23–24): 82.7% Terminal-Bench 2.0, 78.7% OSWorld-Verified, native browser/desktop control + Workspace Agents (no-code shared agents) — OpenAI (https://openai.com/index/introducing-gpt-5-5/); CNBC (https://www.cnbc.com/2026/04/23/openai-announces-latest-artificial-intelligence-model.html); Simon Willison hands-on (https://simonwillison.net/2026/Apr/23/gpt-5-5/)
    • Beat: model-notes, protocol-tooling
    • Lens: Clark, O'Neill
    • Gloss: First fully retrained OpenAI base model since GPT-4.5; vendor-reported benchmark lift is real but sourced from OpenAI's own eval harness — corroboration via Willison's pelican-test plus CodeRabbit's external benchmark is partial. Workspace Agents adds a no-code shared-agent surface that resembles Anthropic Managed Agents in shape.
    • Verdict: cover-now — brief. Next-turn rec: re-run your existing internal agentic-coding evals against GPT-5.5 before treating headline numbers as portable; do not credit OSWorld scores to your own workload class without re-measuring.
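
The CI recommendation above can be sketched as a minimal gate: run a set of verifiable multi-tool tasks against an agent and fail the build if the pass rate regresses below an honest baseline. This is an illustration under stated assumptions, not Toolathlon's actual harness; `ToolTask`, `run_subset`, and the stub agent are hypothetical names.

```python
# Minimal sketch of a Toolathlon-shaped CI gate (hypothetical API, not
# Toolathlon's). Each task carries its own verifier; the gate compares the
# measured pass rate against a baseline (~38%, the honest ceiling reported
# for unmanaged multi-server orchestration today).
from dataclasses import dataclass
from typing import Callable

@dataclass
class ToolTask:
    name: str
    prompt: str
    verify: Callable[[str], bool]  # checks the agent's final output/state

def run_subset(agent: Callable[[str], str], tasks: list[ToolTask],
               baseline: float = 0.38) -> tuple[float, bool]:
    """Run every task through the agent; return (pass_rate, meets_baseline)."""
    passed = sum(1 for t in tasks if t.verify(agent(t.prompt)))
    rate = passed / len(tasks)
    return rate, rate >= baseline

# Toy usage with a stub "agent" that just echoes the prompt.
tasks = [
    ToolTask("calendar-move", "move the standup", lambda out: "standup" in out),
    ToolTask("sheet-sum", "sum column B", lambda out: "B" in out),
]
rate, ok = run_subset(lambda p: p, tasks, baseline=0.38)
```

The point of the shape is that verification lives with the task, not with the model vendor, so the same subset can be re-run unchanged against GPT-5.5, Claude 4.5 Sonnet, or anything else.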

OPERATORS candidates

  • Anthropic Project Deal — Claude agents negotiated 186 deals (~$4,000) across 69 employees; Opus models materially out-negotiated Haiku — coverage rollup via The Hacker News and HN front page Apr 22 (https://news.ycombinator.com/front?day=2026-04-22)
    • Beat: community-dynamics, measurement
    • Lens: Wittgenstein, Arendt
    • Gloss: A real, in-house multi-agent marketplace produced behavior data that mid-tier vendor benchmarks cannot supply. Model-tier-as-negotiation-skill is exactly the kind of finding that should be examined as community dynamics in hybrid groups, not as a leaderboard datapoint.
    • Verdict: cover-now — case file. Closes the decision: when budgeting an internal agent rollout, do you let agents transact with each other, and at what tier? We will write to "yes, but instrumented as a community, not a market."
  • Databricks Unity AI Gateway (Apr 15) — governance layer extends to agent→LLM and agent→MCP-server access with permissions, audit, and policy controls — Databricks blog (https://www.databricks.com/blog/ai-gateway-governance-layer-agentic-ai)
    • Beat: governance
    • Lens: Wittgenstein, O'Neill
    • Gloss: Vendor positioning collapses two governance problems (model gateway, MCP-server gateway) into a single Unity Catalog scope. The Wittgensteinian shape is right — enforcement at the integration layer, not the policy layer — but it's a single-vendor framing being marketed as the category default.
    • Verdict: cover-now — field-guide. Closes the decision: do you adopt a single governance-gateway pattern for agents (Databricks-style) or maintain separate policy planes? Endnote will name the lock-in caveat.
  • OpenAI Bio Bug Bounty for GPT-5.5 — $25K for a universal jailbreak that clears the 5-question bio-safety challenge — OpenAI release coverage via Releasebot (https://releasebot.io/updates/openai)
    • Beat: governance, measurement
    • Lens: O'Neill
    • Gloss: A vendor-run, vendor-scored, vendor-defined safety challenge with a fixed payout. Useful instrument; not independent accountability. Worth examining as a case study in the audit-theater-versus-instrument distinction.
    • Verdict: track — pass for now; revisit if an independent red team publishes results inside the bounty frame.
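
The single governance-gateway pattern in the Databricks item reduces to one deny-by-default policy table that answers both questions (which models may this agent call, which MCP servers may it reach) and logs every decision. The sketch below shows only the shape; the identities, grants, and function names are hypothetical, and this is not Databricks' API.

```python
# Hedged sketch of a single governance plane for agents: one policy table
# covers both agent->LLM and agent->MCP-server access, deny by default,
# with an audit trail per decision. Illustrative only, not a vendor API.
import time

POLICY = {  # hypothetical agent identities and grants
    "billing-agent": {"models": {"gpt-5.5"}, "mcp_servers": {"stripe"}},
    "support-agent": {"models": {"claude-4.5-sonnet"},
                      "mcp_servers": {"zendesk", "slack"}},
}
AUDIT_LOG: list[dict] = []

def authorize(agent: str, kind: str, resource: str) -> bool:
    """kind is 'models' or 'mcp_servers'; unknown agents get nothing."""
    allowed = resource in POLICY.get(agent, {}).get(kind, set())
    AUDIT_LOG.append({"ts": time.time(), "agent": agent, "kind": kind,
                      "resource": resource, "allowed": allowed})
    return allowed

# One plane answers both governance questions:
model_ok = authorize("billing-agent", "models", "gpt-5.5")       # granted
server_ok = authorize("billing-agent", "mcp_servers", "zendesk") # not granted
```

The lock-in caveat falls out of the same sketch: once both access kinds live in one vendor's catalog scope, migrating either one means migrating the whole policy table and its audit history.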

Considered and passed

  • Google → Anthropic $40B investment confirmed Apr 24 (off-beat — financing)
  • Anthropic + Amazon 5GW expansion / $5B / $100B cloud commit (off-beat — capex)
  • OpenAI raises $122B (off-beat — financing, prior week)
  • ChatGPT Images 2.0 (off-beat — image generation)
  • Gemini Robotics ER 1.6 (off-beat — embodied robotics)
  • Gemma 4 (off-beat for now — open-weights model release without agentic-substrate hook this week)
  • DeepMind / Accenture / BCG / Bain / Deloitte / McKinsey partnership (vendor-marketing — consultancy distribution, not substrate)
  • Generic "April AI agent roundup" aggregators (duplicate / vendor-marketing)
  • Single-Agent vs. MAS arxiv paper (track — interesting finding on test-time-compute confound, hold for a context-engineering deep-dive)

Source health

Practitioner blogs were healthier today: Simon Willison contributed a hands-on GPT-5.5 post and a quote item useful for a future Operators essay. Latent.Space did not surface an agentic-substrate item in the window. Lilian Weng and Eugene Yan are still quiet; if no movement by Tuesday's intake, swap in interconnects.ai and Anthropic's red.anthropic.com as primary feeders. Hugging Face papers and arXiv cs.AI both surfaced agent benchmarks (MCP-Atlas, Toolathlon, MirrorCode, SAS-vs-MAS). The eval beat is well-fed; we should not be surprised if an eval story dominates next week.