key developments

openai ships gpt-realtime-2 with 128k context, parallel tool calls, and adjustable reasoning for voice agents. the new realtime api release includes three models covering voice-in, voice-out, and voice-to-voice use cases. the meaningful upgrades are practical, not just quality bumps: context jumps from 32k to 128k tokens, the model can now call multiple tools simultaneously while narrating what it’s doing (“checking your calendar”), and developers get five reasoning effort levels. the +15.2% improvement on big bench audio over realtime-1.5 suggests this is built on newer intelligence than 4o. the most interesting behavioral change is improved turn-taking; the model better detects when the user is speaking to someone else and stops interrupting. this matters because it moves voice agents from demo-quality to production-quality interaction patterns. (latent space)

alphaevolve’s expanding deployment footprint reveals google deepmind’s strategy of using ai-driven code optimization as horizontal infrastructure. the may 7 deepmind blog details alphaevolve results across an unusually broad set of domains: 30% reduction in dna variant detection errors (pacbio), power grid optimization jumping from 14% to 88% feasible solution rate, quantum circuits with 10x lower error on willow, and tpu circuit designs that shipped in silicon. the commercial deployments are equally notable: klarna doubled transformer training speed, fm logistic saved 15k+ km annually, and schrödinger got ~4x speedup in ml force field training. the pattern here is that alphaevolve is becoming google’s internal optimization layer across hardware, infrastructure, and partner deployments. the terence tao collaboration on erdős problems adds credibility at the frontier of mathematical research. this is less about any single result and more about the compounding returns of a general-purpose optimization agent applied at scale. (reddit/mlscaling)

bair survey on adaptive parallel reasoning frames the next inference scaling paradigm. this landscape analysis from berkeley ai research argues that sequential reasoning scaling is hitting diminishing returns due to context rot (performance degradation from accumulated intermediate tokens) and linear latency growth. the proposed alternative: models that autonomously decide when to decompose problems into parallel threads, how many to spawn, and how to coordinate results. the post covers threadweaver and related methods. this matters because it articulates the logical next step after chain-of-thought scaling; if serial reasoning has a context length ceiling, parallelism is the obvious escape hatch. the practical question is whether current architectures can learn reliable decomposition without excessive overhead. (bair)

“rl doesn’t teach reasoning, it just selects for it” claim with a practical rl-free alternative. a new paper finds that reinforcement learning’s effect on llm reasoning is concentrated at just 1-3% of token positions, always promoting tokens already in the base model’s top-5 alternatives. the authors identify these positions using the base model’s own entropy (no rl model needed) and propose reasonmaxxer, which applies contrastive loss only at these entropy-gated decision points. it matches or exceeds full rl performance across three model families, six scales, and six math benchmarks while requiring only tens of problems and minutes of single-gpu training, roughly three orders of magnitude cheaper than standard rl. if this replicates broadly, it suggests rl for reasoning is doing expensive search to find a sparse, predictable correction that simpler methods can approximate directly. (arxiv)
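the entropy gate described above can be sketched in a few lines. this is a minimal illustration, not the paper's implementation: the input format, the function names, and the rule of keeping a fixed top fraction of positions are all assumptions, and the contrastive loss applied at the selected positions is not shown.

```python
import math

def token_entropy(probs):
    """shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_gated_positions(step_probs, fraction=0.02):
    """return the indices of the highest-entropy token positions.

    step_probs: one probability list per token position, taken from the
                base model (hypothetical input format).
    fraction:   share of positions to keep; 0.02 is an assumed value
                inside the 1-3% range quoted above.
    """
    entropies = [token_entropy(p) for p in step_probs]
    k = max(1, round(fraction * len(entropies)))
    ranked = sorted(range(len(entropies)),
                    key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])

# toy sequence: 100 positions, confident everywhere except two forks
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
sequence = [confident] * 100
sequence[10] = uncertain
sequence[57] = uncertain
decision_points = entropy_gated_positions(sequence, fraction=0.02)
```

only the base model's distributions are needed here, which is the point of the claim: no rl-trained model is consulted to find the positions.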

emo: mixture-of-experts pretrained for actual modularity, enabling 75% expert removal with only 1% accuracy drop. standard moes break when you remove experts because routing doesn’t create coherent specialization. emo restricts tokens within a document to select from a shared expert pool while allowing different documents different pools. this simple constraint during pretraining produces experts that specialize at semantic domain levels (math, code) rather than low-level syntax. at 1b active / 14b total parameters on 1t tokens, retaining only 25% of experts costs just 1% accuracy. this is significant for memory-constrained deployment; it means you could ship a math-only or code-only subset of a large sparse model without catastrophic degradation. (arxiv)
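a rough sketch of the document-level constraint follows. the summary above does not say how emo picks each document's pool, so the rule used here (keep the experts with the highest router score summed over the document) is purely an assumption, as are the function name and the score format.

```python
def route_with_document_pool(token_scores, pool_size, top_k):
    """route every token in one document through a shared expert pool.

    token_scores: per-token lists of router scores, one score per expert
                  (hypothetical format).
    pool_size:    number of experts the whole document may use.
    top_k:        experts selected per token, chosen inside the pool.
    """
    n_experts = len(token_scores[0])
    # assumed pool rule: highest total router score across the document
    totals = [sum(tok[e] for tok in token_scores) for e in range(n_experts)]
    pool = sorted(range(n_experts),
                  key=lambda e: totals[e], reverse=True)[:pool_size]
    pool = sorted(pool)
    routes = []
    for tok in token_scores:
        # each token picks its top_k experts, but only from the pool
        best = sorted(pool, key=lambda e: tok[e], reverse=True)[:top_k]
        routes.append(sorted(best))
    return pool, routes

# toy document: 3 tokens, 4 experts
scores = [
    [0.9, 0.1, 0.5, 0.0],
    [0.8, 0.2, 0.6, 0.1],
    [0.1, 0.2, 0.3, 0.9],
]
pool, routes = route_with_document_pool(scores, pool_size=2, top_k=1)
```

in the toy document the last token would prefer expert 3 on its own, but it is forced onto expert 2 because expert 3 is outside the document's pool. a constraint of this kind is what lets a subset of experts be shipped on its own later.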

token superposition training cuts pretraining time up to 2.5x at 10b scale. tst combines contiguous tokens into “bags” trained with multi-hot cross-entropy, then switches to standard next-token training in a second recovery phase. validated across 270m to 10b parameters (including a mixture-of-experts model), it consistently beats the baseline on both loss and downstream evaluations under equal compute. the 2.5x reduction at 10b scale is the headline number. this is a drop-in method requiring no changes to architecture, optimizer, tokenizer, or data pipeline, which dramatically lowers the adoption barrier. (arxiv)
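to make "multi-hot cross-entropy" concrete, here is a minimal sketch of the loss for one bag. it assumes the target spreads equal weight over the tokens in the bag; the paper's exact normalization and how bags are fed to the model on the input side are not described above, so treat the helper names and weighting as illustrative.

```python
import math

def make_bags(token_ids, bag_size):
    """group contiguous tokens into bags (the last bag may be shorter)."""
    return [token_ids[i:i + bag_size]
            for i in range(0, len(token_ids), bag_size)]

def log_softmax(logits):
    """numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def multi_hot_cross_entropy(logits, bag):
    """cross-entropy against a target that puts equal mass on every
    token in the bag (assumed weighting)."""
    log_probs = log_softmax(logits)
    weight = 1.0 / len(bag)
    return -sum(weight * log_probs[t] for t in bag)

bags = make_bags([1, 2, 3, 4, 5], bag_size=2)
uniform_loss = multi_hot_cross_entropy([0.0, 0.0, 0.0, 0.0], [0, 1])
```

with a bag of size one the loss reduces to ordinary next-token cross-entropy, which is consistent with the method recovering through a standard training phase.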

notable

  • simon willison highlights anthropic’s push for html over markdown as llm output format, noting that html enables svg diagrams, interactive widgets, and richer explanation than token-efficient markdown. a subtle but meaningful shift in how people interact with coding agents. link

  • zvi recaps three distinct claude code quality regressions in april (reasoning default changed, idle session bug, system prompt change), all now fixed; anthropic promises broader internal testing before wide deployment going forward. link

  • microsoft research releases open dataset of approximate us power grid transmission topology across 48 states, enabling ac optimal power flow analysis without restricted data; designed to unblock ai-based grid research. link

  • citation accuracy in llm deep research agents drops ~42% as tool calls scale from 2 to 150, per evaluation of 14 models; even frontier models achieve only 39-77% factual accuracy against cited sources despite 94%+ link validity. link

  • models report highest confidence precisely when fabricating: across olmo, llama, qwen, and mistral families, self-reported confidence inversely correlates with accuracy (auc 0.28-0.36); per-token entropy achieves auc 0.757 as a detector. link

  • stacking all five agent scaffolding components (planning, tools, memory, reflection, retrieval) is consistently suboptimal: a full factorial experiment over all 32 subsets on hotpotqa and gsm8k shows a single-tool agent beating the all-in configuration by 32% on hotpotqa; 56.3% of component combinations violate submodularity. link

  • self-consistency (majority voting over multiple samples) shows diminishing returns on modern models: gemini 2.5 on math-500 gains only 1.6% across 20 samples, with performance declining at high sample counts in some configurations. link

  • finrag-12b achieves higher citation grounding than gpt-4.1 at 20-50x lower cost, deployed at 40+ financial institutions with calibrated refusal training on 22% unanswerable examples. link

  • teaching thinking models tool use without degrading text-only reasoning: comprehensive tir recipe applied to qwen3-4b and 30b achieves 96.7% and 99.2% on aime 2025, state-of-the-art among open-source models. link

  • optimizer-model consistency: using the same optimizer for finetuning as pretraining reduces forgetting; muon performs worse than adamw for reasoning sft due to strong memorization tendency. link
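on the scaffolding item above: a submodularity violation means two components together add more than the sum of what each adds alone. the sketch below checks that pairwise condition over every subset. the scores are invented toy values, not numbers from the study, and the study's own way of counting violations may differ.

```python
from itertools import combinations

def submodularity_violations(components, score, tol=1e-12):
    """list every (base set, x, y) where adding x and y together gains
    more than the sum of adding each alone, i.e. where
    f(A+x) + f(A+y) < f(A+x+y) + f(A).

    score maps frozensets of component names to a task score and must
    cover every subset of components.
    """
    comps = list(components)
    violations = []
    for r in range(len(comps) + 1):
        for subset in combinations(comps, r):
            a = frozenset(subset)
            rest = [c for c in comps if c not in a]
            for x, y in combinations(rest, 2):
                separate = score[a | {x}] + score[a | {y}]
                together = score[a | {x, y}] + score[a]
                if separate < together - tol:
                    violations.append((a, x, y))
    return violations

# toy scores (invented): planning and tools overlap, so returns diminish
diminishing = {
    frozenset(): 0.50,
    frozenset({"planning"}): 0.55,
    frozenset({"tools"}): 0.70,
    frozenset({"planning", "tools"}): 0.60,
}
# toy scores (invented): the pair only helps when used together
synergy = {
    frozenset(): 0.0,
    frozenset({"planning"}): 0.1,
    frozenset({"tools"}): 0.1,
    frozenset({"planning", "tools"}): 0.5,
}
ok = submodularity_violations(["planning", "tools"], diminishing)
bad = submodularity_violations(["planning", "tools"], synergy)
```

the first table also shows the "all-in is suboptimal" pattern: tools alone scores 0.70 while the full pair scores 0.60.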

papers

“rethinking rl for llm reasoning: it’s sparse policy selection, not capability learning” proposes reasonmaxxer, showing rl’s effect is concentrated at 1-3% of tokens and recoverable without rl training. arxiv

“emo: pretraining mixture of experts for emergent modularity” achieves modular expert subsetting with minimal accuracy loss by constraining per-document expert selection during pretraining. arxiv

“efficient pre-training with token superposition” (tst) delivers up to 2.5x pretraining speedup at 10b scale via multi-hot cross-entropy on combined token bags. arxiv

“can rl teach long-horizon reasoning to llms? expressiveness is key” introduces scalelogic, finding rl training compute follows power laws with reasoning depth (exponent 1.04 to 2.60 depending on logical expressiveness), and more expressive training transfers better downstream. arxiv
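to get a feel for the two ends of that exponent range, assume the power law has the simple form compute ∝ depth^α (the exact functional form and constants are not given above). doubling reasoning depth then multiplies training compute as follows:

```python
def compute_multiplier(depth_ratio, exponent):
    """factor by which compute grows when reasoning depth grows by
    depth_ratio, assuming compute is proportional to depth ** exponent."""
    return depth_ratio ** exponent

for exponent in (1.04, 2.60):
    print(exponent, round(compute_multiplier(2.0, exponent), 2))
```

under this assumption the low end is close to linear (about 2x compute for 2x depth) while the high end costs about 6x compute for the same doubling.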

“on the implicit reward overfitting and the low-rank dynamics in rlvr” finds rlvr’s reasoning gains concentrate in rank-1 components, with evidence of implicit reward overfitting where test performance can be satisfactory even with low training rewards. arxiv

“how much is one recurrence worth? iso-depth scaling laws for looped language models” measures recurrence-equivalence exponent φ=0.46 (between no-gain and full-equivalence), providing a diagnostic tool; hyperconnections raise it to 0.65. arxiv

“beyond steering vector: flow-based activation steering for inference-time intervention” (flas) is the first learned steering method to consistently outperform prompting on axbench, revealing curved, multi-step, token-varying activation trajectories. arxiv

“lighthouse attention” proposes a subquadratic hierarchical attention wrapper for long-context pretraining with a recovery phase, achieving lower final loss and faster training than full attention. arxiv

“the structural origin of attention sink” traces the phenomenon to variance discrepancy amplified by ffn super neurons, and proposes head-wise rmsnorm to eliminate it during pretraining, accelerating convergence. arxiv