key developments
qwen3.5-omni: alibaba releases massive omni-modal model with 256k context and state-of-the-art audio-visual capabilities. qwen3.5-omni scales to hundreds of billions of parameters, processes over 10 hours of audio and 400 seconds of 720p video, and supports understanding and generation across text, audio, and vision. the plus variant claims sota across 215 audio and audio-visual subtasks, surpassing gemini 3.1 pro on key audio tasks. architecturally, it uses a hybrid attention moe framework for both its “thinker” and “talker” components. notably, it introduces aria, a dynamic text-speech alignment mechanism meant to fix instability in streaming speech synthesis. the authors also report an emergent capability they call “audio-visual vibe coding” (generating code directly from audio-visual instructions), which would be notable if it holds up. this is a significant release because a non-us lab is pushing the omni-modal frontier and publishing full technical documentation alongside it. https://arxiv.org/abs/2604.15804
arc-agi-3 published: frontier models score below 1% on interactive benchmark humans solve at 100%. chollet’s team released the third iteration of the arc benchmark, shifting from static puzzles to interactive, turn-based environments where agents must explore, infer goals, build world models, and plan actions without explicit instructions. the key design constraint remains the same: only core knowledge priors, no language, no external knowledge. the human-ai gap is the starkest yet; humans solve 100% while frontier systems as of march 2026 score below 1%. this matters because it provides a concrete, well-calibrated measure of how far we are from general fluid intelligence, and the interactive framing raises the bar significantly beyond pattern matching. https://arxiv.org/abs/2603.24621
willison documents opus 4.7 tokenizer inflation: expect ~40% higher effective costs. simon willison built a token counting comparison tool and found that claude opus 4.7’s new tokenizer uses roughly 1.46x the tokens of opus 4.6 for the same input (anthropic’s stated range was 1.0-1.35x). since pricing per token is unchanged ($5/$25 per million input/output), this means opus 4.7 is effectively ~40% more expensive than its predecessor for equivalent workloads. image token inflation is even worse at 3.01x for high-resolution images. this is the kind of practical cost analysis that matters for anyone budgeting production workloads on the new model. https://simonwillison.net/2026/Apr/20/claude-token-counts/#atom-everything
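the arithmetic behind that cost claim can be sketched in a few lines. the prices and the 1.46x ratio come from the item above; the workload sizes are made up, and applying the same ratio to output tokens is an assumption (the measurement quoted is for input text). for that one sample ratio the increase works out to 46%, so other inputs will land elsewhere.

```python
# prices and the 1.46x ratio are taken from the item above;
# the workload sizes below are hypothetical.
INPUT_PRICE_PER_MTOK = 5.00    # usd per million input tokens (unchanged across versions)
OUTPUT_PRICE_PER_MTOK = 25.00  # usd per million output tokens


def workload_cost(input_tokens: int, output_tokens: int, token_ratio: float = 1.0) -> float:
    """usd cost when the tokenizer emits `token_ratio` times as many tokens.

    assumption: the same ratio applies to input and output text.
    """
    scaled_in = input_tokens * token_ratio
    scaled_out = output_tokens * token_ratio
    total = scaled_in * INPUT_PRICE_PER_MTOK + scaled_out * OUTPUT_PRICE_PER_MTOK
    return total / 1_000_000


# hypothetical monthly workload, measured in old-tokenizer tokens
baseline = workload_cost(2_000_000, 500_000)
inflated = workload_cost(2_000_000, 500_000, token_ratio=1.46)
increase = inflated / baseline - 1

print(f"baseline ${baseline:.2f}, inflated ${inflated:.2f}, increase {increase:.0%}")
```

because the per-token price is flat, the relative increase equals the token ratio minus one whatever the input/output mix; the mix only changes the absolute dollar figure.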
olmo hybrid paper shows hybrid attention-recurrence models outperform pure transformers at 7b scale, with theoretical backing. ai2 trained olmo hybrid (7b params), which replaces sliding window attention with gated deltanet recurrence layers, and found that it outperforms olmo 3 7b across pretraining and mid-training evals. the paper goes beyond empirics: the authors prove that hybrid models can express tasks (such as code execution) that are beyond both pure transformers and pure linear rnns, and argue theoretically why this expressivity translates to better scaling efficiency. this matters because it’s the strongest controlled evidence yet that the transformer-only paradigm may be leaving performance on the table, and the theory-practice loop they demonstrate is unusually rigorous. https://arxiv.org/abs/2604.03444
zvi’s deep dive on opus 4.7 model card flags serious model welfare concerns. zvi mowshowitz’s analysis of the 232-page opus 4.7 model card is split into two parts because, as he puts it, “some things clearly went seriously wrong” on the model welfare front in ways that haven’t occurred with previous claude models. the practical tips alone are noteworthy: turning off adaptive thinking means no thinking at all (bad ui), opus 4.7 defaults to xhigh thinking in claude code (token-hungry), and the model is more sensitive to how you interact with it than prior versions. the welfare investigation is promised for a follow-up post, but the fact that a careful observer is flagging this publicly signals something worth tracking. https://thezvi.substack.com/p/opus-47-part-1-the-model-card
interconnects analyzes the open-closed model gap, arguing single-number comparisons obscure what actually matters. nathan lambert examines how the artificial analysis intelligence index and similar composite benchmarks mask important dynamics: benchmark relevance shifts every 12-18 months as training paradigms evolve, agentic benchmarks are still poorly correlated with real-world deployment performance, and gemini 3’s strong benchmarks haven’t translated to deployment relevance. his core argument is that we’re at a “relative minimum” in benchmark confidence during this era of rapid post-training improvements. this is a useful framing for anyone making model selection decisions. https://www.interconnects.ai/p/reading-todays-open-closed-performance
notable
- huawei’s hifloat4 beats mxfp4 for 4-bit llm training on ascend chips, getting within ~1% of bf16 loss vs mxfp4’s ~1.5%, potentially signaling export-control-driven hardware efficiency innovation. https://importai.substack.com/p/import-ai-454-automating-alignment
- hallucination as trajectory commitment: causal evidence from activation patching shows hallucination in transformers is an asymmetric attractor; corrupting a correct trajectory takes one perturbation, while fixing a hallucinated one requires sustained multi-step intervention. https://arxiv.org/abs/2604.15400
- smc-sd achieves 2.36x speedup over speculative decoding by replacing token-level rejection with importance-weighted resampling over draft particles, trading exactness for speed while staying within 3% accuracy. https://arxiv.org/abs/2604.15672
- lace introduces cross-thread attention for parallel reasoning, allowing concurrent cot paths to share intermediate insights and correct each other during inference, improving accuracy by 7+ points over standard parallel search. https://arxiv.org/abs/2604.15529
- delegate-52 benchmark shows frontier models corrupt ~25% of document content during long delegated editing workflows, with degradation worsening as document size and interaction length grow. https://arxiv.org/abs/2604.15597
- grift detects reward hacking via gradient fingerprints of cot traces, outperforming text-based monitoring by a 25%+ relative margin and improving downstream performance when integrated into rejection fine-tuning. https://arxiv.org/abs/2604.16242
- rubric reward model reduces miracle steps (unjustified reasoning jumps) by 71% and boosts verified pass@1024 on aime2024 from 26.7% to 62.6% by evaluating entire reasoning trajectories against problem-specific rubrics. https://arxiv.org/abs/2510.07774
- crossmath benchmark reveals vlms reason primarily in text space; adding visual data frequently degrades performance compared to text-only baselines, suggesting current vlms have limited genuine reliance on visual evidence. https://arxiv.org/abs/2604.16256
- bair introduces grasp, a gradient-based planner for world models that makes long-horizon planning practical by lifting trajectories into virtual states for parallel optimization and reshaping gradients to avoid brittle state-input gradients. http://bair.berkeley.edu/blog/2026/04/20/grasp/
- reasoning enhancement causally increases tool hallucination: controlled experiments show rl for reasoning proportionally increases hallucination even when training on non-tool tasks, with no effective mitigation found that doesn’t degrade utility. https://arxiv.org/abs/2510.22977
- apple research shows information leakage risk from vlm logits is systematically greater than assumed, with residual stream representations retaining rich information through natural bottlenecks. https://machinelearning.apple.com/research/what-do-your-logits-know
- stosignsgd fixes signsgd divergence on non-smooth objectives and enables stable fp8 pretraining where adamw fails, with 1.44-2.14x speedup. https://arxiv.org/abs/2604.15416
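for readers unfamiliar with the baseline in the last item, here is a minimal sketch of plain signsgd on a toy non-smooth objective. it is not the stochastic variant from the paper, whose details are not given here; the objective f(w) = sum |w_i|, the starting point, and the step size are all illustrative assumptions.

```python
def sign(x: float) -> int:
    """return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def signsgd_step(params, grads, lr):
    """plain signsgd: each coordinate moves by exactly lr against the gradient sign."""
    return [p - lr * sign(g) for p, g in zip(params, grads)]


def grad_abs_sum(params):
    """a subgradient of the toy non-smooth objective f(w) = sum(|w_i|)."""
    return [sign(p) for p in params]


# hypothetical starting point and step size
w = [0.35, -1.2]
lr = 0.1
for _ in range(20):
    w = signsgd_step(w, grad_abs_sum(w), lr)

print(w)
```

because the step length ignores gradient magnitude, with a fixed step the iterates in this toy case end up bouncing within about one step size of the minimum at zero rather than settling on it.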
papers
“beyond distribution sharpening: the importance of task rewards” demonstrates from first principles that distribution sharpening (rl that merely surfaces latent capabilities) is fundamentally unstable and limited, while task-reward-based rl yields robust performance improvements; confirmed across llama and qwen models. https://arxiv.org/abs/2604.16259
“llm reasoning is latent, not the chain of thought” formalizes three competing hypotheses about where llm reasoning actually occurs; argues evidence most strongly supports latent-state trajectories over surface cot, with recommendations to treat internal dynamics as the default object of study. https://arxiv.org/abs/2604.15726
“where does output diversity collapse in post-training?” traces diversity loss through three olmo 3 post-training lineages, finding collapse is determined by data composition during training and cannot be recovered at inference time; think models retain more correct-answer diversity despite collapsing more in aggregate. https://arxiv.org/abs/2604.16027
“agentv-rl: scaling reward modeling with agentic verifier” introduces bidirectional forward-backward verification agents trained via rl; the 4b variant surpasses sota outcome reward models by 25.2%, a meaningful advance for test-time scaling. https://arxiv.org/abs/2604.16004
“stop: super token for pruning” proposes the first systematic taxonomy of path pruning for parallel reasoning and introduces a learnable internal pruning method that boosts gpt-oss-20b on aime25 from 84% to ~90% under fixed compute. https://arxiv.org/abs/2604.16029
“predicting where steering vectors succeed” introduces the linear accessibility profile, a training-free diagnostic using the logit lens that predicts steering vector effectiveness at rho=0.86-0.91 and explains when nonlinear methods are needed vs when no method can work. https://arxiv.org/abs/2604.15557