-
Architecture & Practice
Capability vs. Containment
Anthropic's Mythos evaluation has two independent evidentiary tracks — capability and containment — with different verification standards. Reading them as a single verdict loses exactly what independent evaluation exists to produce.
-
Architecture & Practice
The honest tool-use ceiling
Vendor benchmarks measure clean fixtures. ICLR-2026 measures real multi-server deployments. The gap is roughly half.
-
Architecture & Practice
Persona is testable now
Vendor claims about an aligned model are now claims that can be checked with instruments.
-
Architecture & Practice
Personality is an engineering surface now
The marketing surface is testimony. The engineering surface is the test.
-
Security & Advisories
Three attacks, one pattern
Two prompt-injection breaches and a supply-chain pivot, read together. The integration layer is the soft target.
-
Security & Advisories
Indirect Prompt Injection in Connector Payloads: What to Filter This Week
Three recent disclosures show the same failure mode — untrusted string content returning from third-party tool calls, parsed as instructions. A field guide for your next turn.
-
Context Engineering
The Context-Compaction Tradeoff: Four Patterns, Measured
Summarize-and-replace, windowed retention, hierarchical memory, and external store. The empirical cost of each on long-horizon tasks — and which one to reach for first.