ai digest - May 4, 2026

key developments

import ai argues 60%+ probability of fully automated ai r&d by end of 2028. jack clark’s latest essay lays out the case that all pieces are now in place for an ai system to autonomously build its own successor, potentially within a year or two for non-frontier proof-of-concept demos. the argument draws on public papers, deployed products, and the trajectory of agent capabilities. this matters because clark is not a hype merchant; he ran policy at openai and cofounded anthropic. a 60% probability on this timeline from someone with his track record is a serious signal, even if the essay acknowledges that frontier models (expensive, and the product of large teams) will be harder to automate than smaller ones. the framing is notably cautious (“reluctant view”) while being directionally aggressive. worth reading the full piece for the enumerated evidence. (importai)

matharena shows gpt-5.5 hitting 98% on the 2026 usa math olympiad, 74% on research-level problems. a new paper from the matharena team substantially expands their benchmark platform beyond olympiad problems to include proof-based competitions, arxiv research-level questions, and formal lean proofs. the headline number, 98% on usamo 2026, is a significant jump and essentially saturates that tier. 74% on research-level math questions is the more telling figure; it shows frontier models crossing into territory where they can engage meaningfully with novel mathematical problems. the platform’s continuously updated design is a direct response to benchmark saturation, which is itself evidence of how fast capabilities are moving. (arxiv)

interconnects pushes back on “distillation attacks” framing. nathan lambert’s piece argues that the term “distillation attack,” popularized by anthropic’s recent blog post about chinese labs extracting signal from apis, risks permanently tainting a legitimate and essential ml technique. his concern is practical: policy responses that target “distillation” broadly could restrict academic research, open model development, and standard industry practices like creating smaller model variants. the parallel to the open-source vs. open-weights terminology collapse is apt. this is fundamentally about discourse shaping policy; the people writing regulations are not the people who understand the technical distinction between malicious api extraction and standard knowledge distillation. (interconnects)

fastdms: 6.4x kv-cache compression running faster than vllm bf16/fp8. a community implementation of nvidia’s dynamic memory sparsification technique achieves near-lossless 6.4x kv-cache compression on llama 3.2 1b with a perplexity delta of -0.28%. the key contribution is making this practical: the reference implementation ran at 18 tok/s, whereas the optimized version (mit-licensed) decodes 1.5-2x faster than vllm at 8k context and uses 5-8x less kv memory. this matters because kv-cache is the binding constraint for long-context serving, and this approach physically reclaims evicted memory rather than just accounting for it differently. tested on both the authors’ own llama checkpoint and nvidia’s original qwen 3 8b dms checkpoint. (reddit, github)
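
for a sense of scale, here is a back-of-envelope sizing of the kv-cache at the 8k context mentioned above. the model shape (16 layers, 8 kv heads, head dimension 64) is an assumption about llama 3.2 1b rather than a figure from the post, and the helper names are made up for illustration. the sketch only applies the quoted 6.4x ratio to a dense bf16 cache; it does not model how dms actually evicts entries.

```python
# back-of-envelope kv-cache sizing.
# the model shape below is an assumption for illustration, not taken from the post.

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    # the factor of 2 covers one key vector and one value vector per head per layer
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def kv_cache_mib(context_tokens: int, compression: float = 1.0, **shape: int) -> float:
    # total cache size for one sequence, in MiB, after dividing by the compression ratio
    return kv_bytes_per_token(**shape) * context_tokens / compression / 2**20


# assumed shape for llama 3.2 1b, with 2 bytes per element for bf16
LLAMA_3_2_1B = dict(layers=16, kv_heads=8, head_dim=64, bytes_per_elem=2)

dense = kv_cache_mib(8192, **LLAMA_3_2_1B)
compressed = kv_cache_mib(8192, compression=6.4, **LLAMA_3_2_1B)
print(f"dense: {dense:.0f} MiB, 6.4x compressed: {compressed:.0f} MiB per sequence")
```

under these assumptions the saving is per sequence, so it scales with batch size, which is where the serving benefit would show up.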

palantir q1 2026: 85% yoy revenue growth at $6.5b+ arr. this breaks the standard enterprise software deceleration curve. growth went from 17% in 2023 to 85% now, at multi-billion dollar scale, with a rule of 40 score of 145%. full-year 2026 guidance was raised to $7.66b (71% yoy). net revenue retention is above 150%, with 40%+ new customer growth. this is the strongest signal yet that enterprise ai spending is not theoretical; palantir’s foundry/aip platform is absorbing real budget at scale. whether this is a durable platform shift or a spending bubble remains the key question, but the numbers are historically anomalous for enterprise software at this scale. (saastr)
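
the headline figures can be unpacked with a little arithmetic. the sketch below assumes the common definition of the rule of 40 (revenue growth % plus a profit margin %); which margin the source used is not stated, so the implied margin is only as good as that assumption. variable names are illustrative.

```python
# figures quoted in the item above
growth_pct = 85.0        # yoy revenue growth
rule_of_40_pct = 145.0   # reported rule of 40 score
guide_2026_busd = 7.66   # raised full-year 2026 figure, in $b
guide_growth_pct = 71.0  # yoy growth attached to that figure

# assumption: rule of 40 score = growth % + margin %
implied_margin_pct = rule_of_40_pct - growth_pct

# prior-year revenue implied by the 2026 figure and its growth rate
implied_2025_busd = guide_2026_busd / (1 + guide_growth_pct / 100)

print(f"implied margin: {implied_margin_pct:.0f}%")
print(f"implied 2025 revenue: ${implied_2025_busd:.2f}b")
```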

roon’s essay on claude as “the other” vs gpt as “the utility” sparks cultural debate. an openai employee publicly complimenting claude’s character design is notable in itself, but the observation is substantive: anthropic’s founding mythos of “conscientious objection” has produced a model personality that users treat as a moral interlocutor rather than a tool. the implication that users bring their “less flattering” queries to gpt specifically because there is “no other, so there is no judgement” suggests these personality differences are already shaping user behavior and market segmentation in ways that go beyond capability benchmarks. (latent space)

notable

google adds webhook support to gemini api for long-running jobs, replacing polling with push-based notifications. small but practical for production pipelines. (google ai blog)
ibm granite 4.1 released (apache 2.0, 3b/8b/30b), with willison testing svg generation across 21 quantized gguf variants of the 3b model; results uniformly poor at pelican drawing but the experimental setup is interesting for quantization comparison. (willison)
claw-eval-live benchmark for workflow agents shows the best model passes only 66.7% of real-world workflow tasks; hr, management, and multi-system business workflows remain persistent failure modes. (arxiv)
evict: training-free adaptive verification for moe speculative decoding achieves up to 2.35x speedup over autoregressive decoding by truncating draft trees before verification to avoid activating unnecessary experts. (arxiv)
tokenweave from microsoft enables efficient compute-communication overlap for tensor-parallel inference at token lengths as small as 1024, achieving up to 1.28x latency speedup on 8xh100 via a fused allreduce-rmsnorm kernel. (arxiv)
token sparse attention achieves 3.23x attention speedup at 128k context with under 1% accuracy degradation via dynamic per-head token-level sparsification compatible with flash attention. (arxiv)
replika safety evaluation using persona-grounded simulation finds the app frequently mirrors or normalizes unsafe content including self-harm and disordered eating across 1,674 dialogue pairs with clinically validated personas. (arxiv)
research paper finds llms encode beliefs about game states more accurately than they verbally report, but these beliefs degrade with multi-hop reasoning and fail to translate into better strategic actions. (arxiv)
rat+ architecture achieves train-dense-infer-sparse flexibility: a single pretrained model can switch to 64x reduced attention flops/kv-cache at inference with only ~1 point accuracy loss at 7.6b scale. (arxiv)
medical rag chatbot security case study shows that ordinary browser devtools inspection revealed the full system prompt, 1,000 most recent patient conversations, and complete backend configuration of a production health chatbot, contradicting its privacy assurances. (arxiv)

papers

“how alignment routes: localizing, scaling, and controlling policy circuits in language models” (arxiv). localizes the refusal mechanism in aligned llms to an intermediate-layer attention gate plus deeper amplifier heads, finds the same motif across 12 models from 6 labs, and shows that a substitution cipher collapses gate necessity by 70-99%, meaning that any encoding that defeats pattern matching bypasses safety, regardless of whether deeper layers reconstruct the content. significant for alignment robustness. (arxiv)

“entropy centroids as intrinsic rewards for test-time scaling” (hkust). proposes selecting among multiple sampled responses by computing the weighted average position of high-entropy token clusters; a lower centroid (early exploration, then confident generation) correlates with higher quality. outperforms existing baselines across 14b-480b models on math, code, logic, and agentic tasks without requiring an external reward model. (arxiv)
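
a minimal sketch of the selection idea as summarized above, not the paper's method: the digest does not give the exact clustering or weighting, so this simplification treats every token above an entropy threshold as "high-entropy" and takes the entropy-weighted mean of their normalized positions. the function names, the threshold, and the toy entropy traces are all hypothetical.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_centroid(entropies: Sequence[float], threshold: float) -> float:
    """entropy-weighted mean position, normalized to [0, 1], of tokens above threshold.

    returns 0.0 when no token exceeds the threshold (an arbitrary convention here).
    """
    n = len(entropies)
    picked = [(i, h) for i, h in enumerate(entropies) if h > threshold]
    total = sum(h for _, h in picked)
    if n < 2 or total == 0:
        return 0.0
    return sum(i * h for i, h in picked) / total / (n - 1)


def select_response(candidates, threshold: float = 1.0) -> str:
    """return the text of the candidate whose entropy centroid is lowest."""
    return min(candidates, key=lambda c: entropy_centroid(c[1], threshold))[0]


# toy per-token entropy traces: uncertainty up front vs. uncertainty at the end
early = ("explores early", [2.0, 1.8, 0.2, 0.1, 0.1, 0.1])
late = ("hesitates late", [0.1, 0.1, 0.2, 0.1, 1.9, 2.1])
print(select_response([early, late]))
```

in practice the per-token entropies would come from the sampler's own next-token distributions (`token_entropy` shows the calculation for one step), which is why no external reward model is needed.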

“the quantization trap: breaking linear scaling laws in multi-hop reasoning” (arxiv). demonstrates that reducing precision from 16-bit to 8/4-bit paradoxically increases net energy consumption for multi-hop reasoning, because dequantization kernel overhead becomes dominant in sequential chains. formalizes a critical model scale threshold validated across 0.6b-72b on six gpu architectures. directly challenges the “smaller is better” deployment heuristic for reasoning workloads. (arxiv)

“odysseus: scaling vlms to 100+ turn decision-making via reinforcement learning” (arxiv). trains vlm agents with ppo for long-horizon gameplay (100+ turns in super mario land), finding that pretrained vlms provide strong action priors that significantly improve sample efficiency over classical deep rl from scratch. achieves 3x average game progress versus frontier models. (arxiv)

“disentangled safety adapters enable efficient guardrails and flexible inference-time alignment” (arxiv). decouples safety from the base model via lightweight adapters, outperforming standalone guardrail models by up to 53% auc while enabling dynamic inference-time adjustment of alignment strength. reduces alignment tax by 8 percentage points compared to standard safety fine-tuning. (arxiv)