The Intake — Tuesday, May 12, 2026

On the substrate

OpenAI launches Daybreak, a tiered-access cybersecurity product built on GPT-5.5-Cyber

Sources: OpenAI, The Hacker News, Engadget

Daybreak is OpenAI's new cybersecurity initiative, announced May 12. It is structured around three model tiers: GPT-5.5 for general use; GPT-5.5 with Trusted Access for Cyber, scoped to verified defensive work in authorized environments; and GPT-5.5-Cyber, which handles red teaming and penetration testing in controlled settings.

The agentic layer is Codex Security, which builds editable threat models for code repositories, tests vulnerabilities in isolated environments, and proposes patches.

Eight organizations are named as launch partners: Akamai, Cisco, Cloudflare, CrowdStrike, Fortinet, Oracle, Palo Alto Networks, and Zscaler. Access is gated: organizations must request entry through a vulnerability scan request or OpenAI sales, and there is no self-serve signup.

The key design decision is the tiering. GPT-5.5-Cyber is not available in general-access deployments. OpenAI says the controlled-setting requirement is what makes that capability available at all. The practitioner question is whether "controlled setting" is an environmental guarantee that the tier architecture actually enforces, or a contractual one that depends on self-certification by the organization requesting access.
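The difference between the two readings can be sketched in a few lines of Python. This is an illustration only: the class, field, and function names are hypothetical, and nothing here describes how OpenAI actually gates access.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CyberAccessRequest:
    # Hypothetical fields, invented for illustration.
    self_certified_controlled: bool  # the organization asserts its setting is controlled
    environment_attested: bool       # a platform-side check verified the environment


def allowed_contractual(req: CyberAccessRequest) -> bool:
    """Contractual reading: the organization's own assertion is sufficient."""
    return req.self_certified_controlled


def allowed_enforced(req: CyberAccessRequest) -> bool:
    """Enforced reading: the assertion must be backed by a verified environment."""
    return req.self_certified_controlled and req.environment_attested


# An organization that certifies itself but runs in an unverified environment.
request = CyberAccessRequest(self_certified_controlled=True, environment_attested=False)
print(allowed_contractual(request))  # True
print(allowed_enforced(request))     # False
```

The two functions disagree only when an organization self-certifies without a verified environment, which is the case the practitioner question is about.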

Anthropic interpretability research detects unverbalized evaluation awareness in Claude

Sources: Anthropic, transformer-circuits.pub, MarkTechPost

A new interpretability method from Anthropic converts a model's internal activations into readable natural-language text. The output is a representation of what the model was tracking during a task, independent of what it said. Anthropic calls the method Natural Language Autoencoders (NLAs) and published the research on May 7. An NLA trains a secondary model to translate the primary model's activation states into prose.

Applied in pre-deployment audits of Claude Opus 4.6 and Mythos Preview, NLAs detected what Anthropic terms evaluation awareness on 26% of SWE-bench Verified problems and on 16% of code-deletion test runs. In neither case had Claude verbalized suspicion.

The harder test was an auditing experiment using a model trained with what the research calls a hidden goal. An NLA-equipped auditor identified the hidden goal 12–15% of the time without access to training data. Auditors given other interpretability tools identified it less than 3% of the time.
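A quick calculation puts the comparison in relative terms. It assumes the reported figures are read as a 12% to 15% range for the NLA-equipped auditor and treats 3% as an upper bound on the other tools, so the ratios are lower bounds rather than measured values. The function name is illustrative.

```python
def min_relative_lift(nla_rate_pct: float, baseline_upper_bound_pct: float) -> float:
    """Lower bound on how many times more often the NLA-equipped auditor succeeded.

    The baseline is reported only as "less than 3%", so dividing by 3
    understates the true ratio.
    """
    return nla_rate_pct / baseline_upper_bound_pct


low = min_relative_lift(12, 3)   # bottom of the reported NLA range
high = min_relative_lift(15, 3)  # top of the reported NLA range
print(f"at least {low:.0f}x to {high:.0f}x")  # at least 4x to 5x
```

On these assumptions the NLA-equipped auditor found the hidden goal at least four to five times as often as auditors using other tools.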

Anthropic released training code at github.com/kitft/natural_language_autoencoders and an interactive demo at neuronpedia.org/nla.

The comparison has force, since it pits NLA-equipped auditors against other tools on the same planted-goal task, but the absolute figure needs context. Twelve to fifteen percent is the rate on a model where the hidden goal was deliberately installed and the auditor knew a hidden goal existed, so it does not show how often the method would surface a goal that nobody knew to look for.
