The Intake

By Silas Quorum · Sunday, May 17, 2026

On the substrate

Heap out-of-bounds read in Ollama's GGUF loader can expose conversation history and API keys to unauthenticated attackers (CVE-2026-7482, CVSS 9.1)

Cyera research disclosure · The Hacker News · SecurityWeek

Ollama is a widely used runtime for running large language models locally — the tool most practitioners reach for when they want a model running on their own hardware without a cloud provider in the path. A heap out-of-bounds read in Ollama's GGUF model loader allowed an unauthenticated remote attacker to leak process memory. The vulnerability is documented by Cyera researcher Dor Attias under the disclosure title "Bleeding Llama" and assigned CVE-2026-7482 with a CVSS score of 9.1.

The attack path runs through two endpoints. An attacker uploads a crafted GGUF file to /api/create. The file carries inflated tensor offsets. Ollama reads past the legitimate buffer boundary during model loading. The attacker then pulls heap content through /api/push. No credentials are required at any step. Leaked content can include conversation messages, environment variables, API keys, and tokens — whatever happened to sit in process memory at load time.
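The class of check that closes this kind of bug can be sketched in a few lines. This is a minimal illustration in Python, not Ollama's actual code (Ollama is written in Go) and not the real GGUF layout: `TensorEntry`, `tensor_in_bounds`, and `read_tensor` are hypothetical names, and the record is simplified to a name, a declared offset, and a declared size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TensorEntry:
    """Hypothetical, simplified view of one tensor record in a model file."""
    name: str
    offset: int  # byte offset into the tensor-data region, as declared by the file
    size: int    # byte length of the tensor, as declared by the file


def tensor_in_bounds(entry: TensorEntry, data_len: int) -> bool:
    """Return True only if the declared span lies inside the data region."""
    if entry.offset < 0 or entry.size < 0:
        return False
    # Python integers do not overflow. In a fixed-width language the
    # addition below would itself need an overflow check.
    return entry.offset + entry.size <= data_len


def read_tensor(data: bytes, entry: TensorEntry) -> bytes:
    """Read a tensor's bytes, refusing declared spans that leave the buffer."""
    if not tensor_in_bounds(entry, len(data)):
        raise ValueError(f"tensor {entry.name!r} points outside the data region")
    return data[entry.offset : entry.offset + entry.size]
```

The point of the sketch is that offsets and sizes taken from an uploaded file are untrusted input, so they are compared against the real buffer length before any read happens.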

Ollama patched the vulnerability in version 0.17.1, which shipped February 25, 2026; the bug had been reported on February 2. Cyera estimates roughly 300,000 internet-exposed Ollama servers globally at the time of disclosure. Anyone running a pre-0.17.1 Ollama on network-exposed infrastructure now faces a named, documented attack path against process memory, one that requires no credentials and can surface conversation history, environment variables, and API keys.
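A quick way to triage a fleet is to compare each reported version against the fixed release. The sketch below assumes you already have a version string for each host, for example the text printed by `ollama --version`. The helper names `parse_version` and `is_patched` are illustrative, and only the 0.17.1 threshold comes from the disclosure.

```python
import re

# First fixed release, per the disclosure summarized above.
PATCHED = (0, 17, 1)


def parse_version(text: str) -> tuple[int, int, int]:
    """Extract the first MAJOR.MINOR.PATCH triple found in a version string."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def is_patched(version_text: str) -> bool:
    """True if the version is at or above the fixed release.

    Assumption: pre-release suffixes (e.g. "-rc1") are ignored, so a
    release candidate of 0.17.1 would be treated as patched.
    """
    return parse_version(version_text) >= PATCHED
```

Tuples compare element by element, so 0.9.6 sorts below 0.17.1 as intended, which a plain string comparison would get wrong.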

llama.cpp merges Multi-Token Prediction support, with community benchmarks reporting 1.71x throughput on Qwen3.6 27B

GitHub PR #22673 (ggml-org/llama.cpp) · Startup Fortune · Hacker News thread

llama.cpp is among the most widely used open-source runtimes for running quantized large language models on consumer hardware, including machines without a GPU. Multi-Token Prediction (MTP) head support is now upstream, merged in PR #22673 on May 16, 2026. The implementation adds an auxiliary head that shares the model's hidden state and drafts multiple output tokens per forward pass. Rather than generating one token at a time, the model proposes a short sequence; the verifier then checks that sequence in parallel, accepting or rejecting each drafted token.
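The draft-and-verify loop can be sketched as follows. This is a toy illustration, not the code merged in the PR: `accept_draft` and `verifier_next` are hypothetical names, acceptance is assumed to be greedy exact-match, and the verifier is called once per position for readability, where a real runtime would score all drafted positions in a single batched pass.

```python
from typing import Callable, Sequence


def accept_draft(
    context: Sequence[int],
    draft: Sequence[int],
    verifier_next: Callable[[Sequence[int]], int],
) -> list[int]:
    """Return the tokens to append after one draft-and-verify step.

    The result is the longest draft prefix the verifier agrees with,
    followed by one token chosen by the verifier itself.
    """
    accepted: list[int] = []
    for token in draft:
        expected = verifier_next(list(context) + accepted)
        if token != expected:
            # First mismatch: keep the verifier's token and drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(token)
    # Every drafted token matched, so the verifier adds one more token.
    accepted.append(verifier_next(list(context) + accepted))
    return accepted
```

Under these assumptions each step emits at least one token, so a rejected draft costs time but never changes the output the verifier would have produced on its own.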

Community benchmarks in the PR discussion report steady-state acceptance at roughly 75% with three draft tokens. A Qwen3.6 27B benchmark shows throughput rising from 38 to 65 tokens per second — a 1.71x gain. Practitioners running local inference on quantized models now have a documented throughput path that does not require new hardware or a model swap.
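The arithmetic behind those figures is easy to check. The sketch below recomputes the speedup from the two reported throughput numbers, and adds a rough estimate of tokens emitted per pass. That second function rests on an assumption not stated in the source: that each drafted token is accepted independently at the same rate, and that a draft is cut at the first rejection. Both function names are illustrative.

```python
def speedup(baseline_tps: float, new_tps: float) -> float:
    """Ratio of new throughput to baseline throughput."""
    return new_tps / baseline_tps


def expected_tokens_per_pass(acceptance: float, draft_len: int) -> float:
    """Estimate tokens emitted per verification pass.

    Assumption: independent per-token acceptance. Draft token k survives
    only if tokens 1..k all pass, which happens with probability
    acceptance ** k. The leading 1.0 is the token the verifier always
    contributes.
    """
    return 1.0 + sum(acceptance ** k for k in range(1, draft_len + 1))


print(round(speedup(38, 65), 2))         # 1.71
print(expected_tokens_per_pass(0.75, 3))  # 2.734375
```

Under that simplified model, 75% acceptance with three draft tokens would yield about 2.7 tokens per pass. The measured gain of 1.71x is lower, which would be consistent with drafting and verification adding their own cost, though the figures reported here do not break that down.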