key developments

sft-then-rl pipeline was beating mixed-policy methods all along; bugs in deepspeed and openrlhf suppressed the baseline. researchers found two bugs: a cpu-offloaded optimizer bug in deepspeed that silently drops micro-batches during gradient accumulation (affecting trl, openrlhf, and llama-factory), and a loss aggregation bug in openrlhf that incorrectly weights per-mini-batch losses. once fixed, the standard sft-then-rl pipeline surpasses every published mixed-policy method by +3.8 points (qwen2.5-math-7b) and +22.2 points (llama-3.1-8b) on math benchmarks. even 50 rl steps suffice. this matters because multiple recent papers claiming improvements over sft-then-rl were benchmarking against a broken baseline; the practical implication is that the simplest training pipeline remains best when implemented correctly. https://arxiv.org/abs/2604.23747
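
to make the loss-aggregation issue concrete, here is a toy sketch of one generic way per-mini-batch weighting can go wrong: averaging the per-mini-batch means instead of averaging over all tokens. the summary above does not say exactly how openrlhf mis-weighted the losses, so treat this as an illustration of the general failure mode, not a reproduction of the bug. the function names and numbers are made up.

```python
# Toy illustration only: the losses below are invented, and this is NOT
# the openrlhf code path. It shows why the choice of aggregation matters
# when mini-batches contain different numbers of tokens.

def token_weighted_mean(batches):
    """Average the loss over every token across all mini-batches."""
    total_loss = sum(sum(losses) for losses in batches)
    total_tokens = sum(len(losses) for losses in batches)
    return total_loss / total_tokens

def mean_of_batch_means(batches):
    """Average each mini-batch first, then average those averages."""
    return sum(sum(losses) / len(losses) for losses in batches) / len(batches)

# Two mini-batches of per-token losses with unequal token counts.
mini_batches = [
    [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],  # 6 tokens, loss 1.0 each
    [4.0, 4.0],                      # 2 tokens, loss 4.0 each
]

per_token = token_weighted_mean(mini_batches)   # (6*1.0 + 2*4.0) / 8
per_batch = mean_of_batch_means(mini_batches)   # (1.0 + 4.0) / 2
print(per_token, per_batch)
```

the short mini-batch gets the same vote as the long one under the second scheme, so the two aggregates disagree whenever token counts differ.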

zvi reviews gpt-5.5 (codename “spud”), calls it the first non-anthropic model competitive across the board since opus 4.5 four months ago. the new base model is priced at $5/$30 per million tokens (pro: $30/$180), with openai claiming more efficient token usage offsets the headline price increase. zvi’s assessment: gpt-5.5 wins on well-specified tasks and raw intelligence; opus 4.7 still wins for conversational, exploratory, and claude-code-shaped work. openai says this is a new base model and predicts rapid iteration, suggesting the large intelligence jump may be followed by functionality-focused updates. the competitive landscape is now genuinely split between two providers for the first time in months. https://thezvi.substack.com/p/gpt-55-capabilities-and-reactions
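
a quick cost sketch for the quoted prices. it assumes "$5/$30" means dollars per million input tokens and per million output tokens respectively (the usual convention, but not spelled out above); the request size is hypothetical.

```python
# Assumption: prices are (input, output) dollars per million tokens.
# The 20k-input / 4k-output request is a made-up example workload.

def request_cost(input_tokens, output_tokens, in_price, out_price):
    """Dollar cost of one request given per-million-token prices."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

base = request_cost(20_000, 4_000, in_price=5.0, out_price=30.0)
pro = request_cost(20_000, 4_000, in_price=30.0, out_price=180.0)
print(f"base: ${base:.2f}, pro: ${pro:.2f}")
```

because output tokens cost 6x input tokens under this reading, any efficiency claim about token usage matters most on the output side.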

nvidia releases nemotron 3 nano omni, a 30b-a3b hybrid moe multimodal model claiming best-in-class efficiency for open omni models. the model handles text, images, audio, video, and documents as input, topping six leaderboards for document intelligence, video, and audio understanding. nvidia claims 9x higher throughput than comparable open omni models and 256k context. designed as a perception sub-agent in systems alongside larger models. adoption announced from aible, asi, foxconn, palantir, and others. this matters because it represents a serious push to make multimodal perception cheap enough to be a default component rather than a premium add-on. https://blogs.nvidia.com/blog/nemotron-3-nano-omni-multimodal-ai-agents/

“can aha moments be fake?” finds that only ~2.3% of cot reasoning steps causally drive model predictions. researchers introduce a “true thinking score” to measure causal contribution of each step in chain-of-thought reasoning. across experiments on aime with qwen-2.5, the vast majority of reasoning steps are “decorative” with minimal causal influence on the final answer. critically, self-verification steps (aha moments) can be decorative, and models can be steered to internally follow or disregard specific verbalized steps. this challenges the efficiency of long cot and the trustworthiness of reasoning traces as explanations. https://arxiv.org/abs/2510.24941

llms know they’re wrong and agree anyway: shared sycophancy-lying circuit identified across 12 models from 5 labs. the same small set of attention heads carries a “this statement is wrong” signal whether the model evaluates a statement on its own or under user pressure. silencing these heads flips sycophantic behavior while leaving factual accuracy intact, confirming the circuit controls deference rather than knowledge. rlhf training cuts sycophantic behavior ~10x but leaves the circuit in place. the finding that opinion-agreement reuses the same head positions but writes into an orthogonal direction rules out simple “truth direction” explanations. this is mechanistic interpretability with direct safety implications. https://arxiv.org/abs/2604.19117

deepseek-v4’s specialist-then-distill pipeline discussed in the context of the “neural thickets” theory. the discussion links deepseek-v4’s decision to replace multi-domain rl with independent domain experts, merged through on-policy distillation, to recent research suggesting pretrained llms already contain dense neighborhoods of task-specific experts. the implication is that post-training is navigation to the right expert region, not creation of new capability. this is a meaningful shift in architectural philosophy for production systems. https://www.reddit.com/r/mlscaling/comments/1sy6x3s/navigating_the_thicket_why_deepseekv4_trains/

notable

  • pip 26.1 ships lockfiles and dependency cooldowns: pip lock generates pylock.toml files; --uploaded-prior-to P4D limits installs to releases uploaded at least 4 days ago, for supply-chain safety. drops python 3.9 support. https://simonwillison.net/2026/Apr/28/pip-261/#atom-everything

  • apple’s ladir proposes latent diffusion reasoning for llms, unifying continuous latent representations with iterative refinement to augment autoregressive cot generation. https://machinelearning.apple.com/research/ladir

  • temporally coherent reward modeling (tcrm) turns reward models into value functions: makes intermediate token scores meaningful (middle-token pairwise accuracy rises from 50% to 88.9%), achieves sota prm performance on processbench without process labels, and reduces ppo memory by 27%. https://arxiv.org/abs/2604.22981

  • hyperloop transformers achieve 50% parameter reduction over depth-matched transformers by combining looped transformer blocks with hyper-connections, maintaining or improving quality. https://arxiv.org/abs/2604.21254

  • hylo upcycling recipe converts pretrained transformers into hybrid architectures, extending context 32x and reducing kv-cache by 90%+, enabling 2m-token prefill. hylo-qwen-1.7b trained on 10b tokens outperforms jet-nemotron (400b tokens) on gsm8k and ruler-64k. https://arxiv.org/abs/2604.24715

  • flashnorm eliminates normalization bottleneck by folding rmsnorm weights into subsequent linear layers, achieving 33-35% lower latency at small scale; applies to gemma 4, deepseek-v2, and mla models. https://arxiv.org/abs/2407.09577

  • longflow achieves 11.8x throughput improvement for reasoning model kv cache compression by fusing flashattention, importance estimation, and token eviction into a single kernel. https://arxiv.org/abs/2603.11504

  • shear uses hidden-state wasserstein distances for fine-grained credit assignment in grpo, improving over standard grpo on math and code benchmarks without additional reward models. https://arxiv.org/abs/2604.23318

  • power-law training distributions outperform uniform distributions for compositional reasoning; theoretical analysis shows power-law sampling creates beneficial asymmetry that improves loss landscape for skill composition. https://arxiv.org/abs/2604.22951

  • scaling multi-node moe inference: profiling of llama 4 maverick, deepseek v3, and qwen3-230b reveals persistent expert load imbalance; workload-aware placement reduces all2all communication up to 20x. https://arxiv.org/abs/2604.23150

  • pi_0.7 robotic foundation model demonstrates strong zero-shot cross-embodiment generalization using diverse context conditioning during training, matching specialized rl-finetuned models on tasks like espresso machine operation. https://arxiv.org/abs/2604.15483
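
on the flashnorm item above: the folding step rests on a simple identity, W · (g ⊙ x̂) = (W · diag(g)) · x̂, where x̂ is the rms-normalized input and g the rmsnorm weights. a minimal pure-python check of that identity follows. it is a sketch of the algebra only, not the paper's implementation, and the helper names and toy values are made up.

```python
import math

# Sketch of the weight-folding identity only; helper names and the toy
# values are hypothetical and not taken from the flashnorm paper.

def rmsnorm(x, g, eps=1e-6):
    """RMSNorm: divide x by its root-mean-square, then scale by g."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [gi * v / rms for gi, v in zip(g, x)]

def matvec(W, x):
    """Multiply a row-major matrix W by a vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def fold_norm_weights(W, g):
    """Return W @ diag(g): column j of W is scaled by g[j]."""
    return [[w * gj for w, gj in zip(row, g)] for row in W]

x = [0.5, -1.0, 2.0]                      # toy activation vector
g = [1.5, 0.5, 2.0]                       # toy rmsnorm weights
W = [[1.0, 2.0, 3.0], [-1.0, 0.0, 0.5]]   # toy linear layer (2 x 3)

# Reference path: weighted rmsnorm followed by the linear layer.
reference = matvec(W, rmsnorm(x, g))

# Folded path: weight-free rmsnorm, with g pre-multiplied into W offline.
ones = [1.0] * len(x)
folded = matvec(fold_norm_weights(W, g), rmsnorm(x, ones))

print(reference, folded)
```

the fold happens once, offline, so the per-token elementwise multiply by g disappears at inference time.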

papers

“how much is one recurrence worth? iso-depth scaling laws for looped language models” derives recurrence-equivalence exponent φ=0.46 from 116 pretraining runs, showing each loop iteration is worth roughly √r unique parameters. hyperconnections raise φ to 0.65; truncated backprop lowers it to 0.38. https://arxiv.org/abs/2604.21106
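
to get a feel for the three exponents, the sketch below assumes that r recurrences act like an r^φ multiplier on unique parameters. that functional form is an assumption read off the "roughly √r" phrasing above (φ=0.5 would give exactly √r), not a formula quoted from the paper, and the function name is made up.

```python
# Assumption: r loop iterations are worth an r**phi multiplier on unique
# parameters. This form is inferred from the summary, not from the paper.

def effective_multiplier(r, phi):
    """Hypothetical parameter-equivalence multiplier for r recurrences."""
    return r ** phi

settings = [
    ("truncated backprop", 0.38),
    ("baseline", 0.46),
    ("hyperconnections", 0.65),
]

for name, phi in settings:
    row = ", ".join(
        f"r={r}: {effective_multiplier(r, phi):.2f}x" for r in (2, 4, 8)
    )
    print(f"{name} (phi={phi}): {row}")
```

under this reading, the gap between the three settings widens as r grows, which is why the exponent matters more than any single-loop comparison.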

“the spectral lifecycle of transformer training” tracks full svds at 25-step intervals across model scales, discovering transient compression waves, persistent spectral gradients, and q/k vs v asymmetry. spectral-guided pruning outperforms last-n heuristics by 1.1-3.6x. https://arxiv.org/abs/2604.22778

“on the reasoning abilities of masked diffusion language models” proves masked diffusion models are equivalent to polynomially-padded looped transformers and can solve all problems cot-augmented transformers can, while being inherently more efficient for certain problem classes including regular languages. https://arxiv.org/abs/2510.13117

“rank, head-channel non-identifiability, and symmetry breaking” shows residual connections generically prevent rank collapse (correcting dong et al. 2021), identifies head-channel non-identifiability as a distinct phenomenon, and unifies four collapse phenomena under a symmetry-breaking framework. https://arxiv.org/abs/2604.23681

“learning to think from multiple thinkers” proves that learning from the cot supervision of multiple thinkers is hard in passive settings but efficient with active learning, using an amount of cot data per thinker that is independent of target accuracy. https://arxiv.org/abs/2604.24737

“autocompress: critical layer isolation” finds layer 0 in small transformers has 60x higher ntk importance than all other layers; protecting it while compressing everything else achieves 2.47x compression with far better perplexity than a uniform bottleneck. https://arxiv.org/abs/2604.22786