key developments
anthropic released claude opus 4.7; model card confirms synthetic data from other models’ outputs. anthropic shipped opus 4.7 and the accompanying “mythos” model. simon willison’s llm-anthropic plugin already supports it (with thinking_effort: xhigh). the more interesting signal is in the model card, which states that both opus 4.7 and mythos were “trained on a proprietary mix of publicly available information from the internet, public and private datasets, and synthetic data generated by other models.” this is the first time anthropic has been this explicit about using other models’ outputs in training; opus 4.6’s card made no such mention. a reddit thread on r/localllama caught the change. the practice itself is unsurprising (it’s industry standard); what is notable is that anthropic now documents it openly, which shifts the norm toward disclosure. willison also notes that qwen3.6-35b-a3b, running quantized at 21gb on a macbook, produced better svg pelicans than opus 4.7: a fun anecdote, but one that illustrates how quickly small moe models are closing perception/generation gaps on specific tasks. [model card: https://www.anthropic.com/system-cards] [llm-anthropic 0.25: https://simonwillison.net/2026/Apr/16/llm-anthropic/#atom-everything] [pelican test: https://simonwillison.net/2026/Apr/16/qwen-beats-opus/#atom-everything]
latent space declares the pull request dead; github now allows disabling prs. the latent space ainews lead captures a significant shift: for the first time, github lets pull requests be disabled on a repo (previously only issues could be disabled). the piece places this within the broader collapse of human-centric code collaboration workflows under agentic coding. pete steinberger’s “prompt requests” concept, mitchell hashimoto’s reputation-based systems, and aaron levie’s “make software that agents want” framing all point in the same direction. the argument is that if code reviews and pull requests are both dead, git-based workflows themselves may be on borrowed time. this matters for any technical leader thinking about developer infrastructure investments over the next 2-3 years. [https://www.latent.space/p/ainews-rip-pull-requests-2005-2026]
longcot benchmark exposes frontier models’ inability to reason over extended chains; best models score below 10%. a new benchmark of 2,500 expert-designed problems across chemistry, math, cs, chess, and logic requires tens to hundreds of thousands of reasoning tokens to solve. each local step is individually tractable for frontier models; the failures come purely from long-horizon reasoning degradation. gpt-5.2 achieves 9.8%, gemini 3 pro achieves 6.1%. this is a clean measurement of a fundamental limitation: current models cannot maintain coherent reasoning over extended chains even when they possess all the required local capabilities. this benchmark will likely become a standard reference for tracking progress on sustained reasoning. [https://arxiv.org/abs/2604.14140]
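to get a feel for how individually tractable steps can still add up to sub-10% end-to-end accuracy, here is a toy compounding-error model. it is an illustration only, not the paper’s analysis: it assumes every step succeeds independently with the same probability, and the function names and the per-step accuracies tried are hypothetical.

```python
import math


def chain_success(step_accuracy: float, num_steps: int) -> float:
    """Probability that every step in a chain succeeds.

    Toy assumption: steps are independent and equally reliable.
    """
    return step_accuracy ** num_steps


def steps_until(target_success: float, step_accuracy: float) -> float:
    """Chain length at which end-to-end success falls to target_success."""
    return math.log(target_success) / math.log(step_accuracy)


if __name__ == "__main__":
    # 0.098 is the best reported longcot score (9.8%); the per-step
    # accuracies below are hypothetical values chosen for illustration.
    for p in (0.99, 0.999, 0.9999):
        n = steps_until(0.098, p)
        print(f"per-step accuracy {p}: success drops to 9.8% after ~{n:.0f} steps")
```

under this simplification even very high per-step reliability erodes quickly with chain length, although the benchmark attributes the failures to long-horizon degradation rather than to a fixed per-step error rate.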
iatrobench provides pre-registered evidence that ai safety measures cause measurable medical harm. this paper formalizes the “safety causes harm” argument with 60 clinical scenarios across 6 frontier models (3,600 responses). the central finding is identity-contingent withholding: the same clinical question gets better guidance when framed as from a physician vs. a layperson. the decoupling gap is +0.38 (p=0.003), and binary hit rates on safety-colliding actions drop 13.1pp in layperson framing. the model with heaviest safety investment (opus) shows the widest gap (+0.65). three distinct failure modes are identified: trained withholding (opus), incompetence (llama 4), and indiscriminate content filtering (gpt-5.2, whose post-generation filter strips physician responses at 9x the layperson rate because they contain denser pharmacological tokens). every scenario targets someone who has already exhausted standard referrals. this is the most rigorous quantification yet of iatrogenic harm from ai safety alignment. [https://arxiv.org/abs/2604.07709]
zvi breaks down the dwarkesh/jensen huang interview. zvi’s analysis of the nvidia ceo podcast separates the business-interview portion from what he considers the more substantive technical claims. he applies his usual “bounded distrust” framework, noting that jensen stays closer to traditional self-serving ceo framing than to elon-style fabrication, but flags some claims that skirt the line. a scheduling note confirms that the weekly ai coverage will shift to opus 4.7 starting monday. for the technically minded, the value is in zvi’s commentary on nvidia’s strategic reasoning about chip allocation and moat defense. [https://thezvi.substack.com/p/on-dwarkesh-patels-podcast-with-nvidia]
prerl: reinforcement learning in pre-train space shows negative sample reinforcement drives reasoning improvement. this paper introduces a genuinely novel idea: applying reward-driven rl updates directly to the marginal distribution p(y) rather than the conditional p(y|x). the key finding is that negative sample reinforcement within pre-train space rl increases transition and reflection thoughts by 14.89x and 6.54x respectively. the resulting “dual space rl” strategy (initialize with negative-sample prerl, then transition to standard rl) consistently outperforms baselines. this matters because it identifies a concrete mechanism for why pruning incorrect reasoning spaces helps, and provides a practical training recipe. [https://arxiv.org/abs/2604.14142]
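the conditional-versus-marginal distinction can be written schematically as follows. this is a reading of the summary above, not the paper’s own notation: the symbols $g_{\text{cond}}$, $g_{\text{marg}}$, $g_{\text{neg}}$, the reward $r(x, y)$, and the indicator form of negative-sample reinforcement are all assumptions made for illustration.

```latex
\begin{align*}
% standard rl: reward-weighted score function of the conditional p(y|x)
g_{\text{cond}} &= \mathbb{E}_{x \sim \mathcal{D},\; y \sim p_\theta(\cdot \mid x)}
  \big[\, r(x, y)\, \nabla_\theta \log p_\theta(y \mid x) \,\big] \\
% pre-train space (schematic): same reward, update applied to the marginal p(y)
g_{\text{marg}} &= \mathbb{E}_{x \sim \mathcal{D},\; y \sim p_\theta(\cdot \mid x)}
  \big[\, r(x, y)\, \nabla_\theta \log p_\theta(y) \,\big] \\
% negative-sample reinforcement (schematic): only incorrect samples contribute,
% and the update lowers their marginal likelihood
g_{\text{neg}} &= -\,\mathbb{E}_{x \sim \mathcal{D},\; y \sim p_\theta(\cdot \mid x)}
  \big[\, \mathbf{1}[\,y \text{ incorrect}\,]\, \nabla_\theta \log p_\theta(y) \,\big]
\end{align*}
```

the first line is the familiar policy-gradient form; the second and third only swap in the marginal and restrict to incorrect samples, so how the paper actually samples, estimates $p_\theta(y)$, or weights the terms may differ.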
notable
- langflow: first continuous diffusion language model to rival discrete diffusion, reaching ppl 30.0 on lm1b and exceeding autoregressive baselines on 4/7 zero-shot benchmarks. establishes continuous diffusion as viable for language. [https://arxiv.org/abs/2604.11748]
- finephrase: 486b-token synthetic pretraining dataset from controlled experiments showing structured output formats (tables, math problems, faqs) consistently outperform curated web, and generator models beyond 1b provide no additional benefit. generation costs reduced 30x. [https://arxiv.org/abs/2604.13977]
- resbm architecture achieves 128x activation compression for pipeline-parallel training without significant convergence loss, potentially enabling internet-grade distributed training. [https://arxiv.org/abs/2604.11947]
- gencluster achieves ioi gold medal with open-weight model gpt-oss-120b, the first open-weight system to reach this level on competitive programming’s most prestigious benchmark. [https://arxiv.org/abs/2510.14232]
- scaling laws for contextual entrainment show opposite trends: larger models are 4x more resistant to counterfactual misinformation but 2x more prone to copying arbitrary tokens. semantic filtering and mechanical copying scale in opposition. [https://arxiv.org/abs/2604.13275]
- hallucination signals show scale-dependent phase transition: models below 400m show no factuality signal; above 1b, peak detectability occurs at position zero (before any tokens generated). instruction tuning required, not just scale. [https://arxiv.org/abs/2604.13068]
- correct chains, wrong answers: claude sonnet 4 at depth 7 produces 31 errors where all reasoning steps are verifiably correct but final answers are wrong. a clean demonstration that cot correctness does not guarantee output correctness. [https://arxiv.org/abs/2604.13065]
- apple mixatlas: principled framework for multimodal data mixture optimization via domain decomposition and proxy models. accepted at iclr 2026 workshop. [https://machinelearning.apple.com/research/mixatlas]
- hugging face publishes guide on training multimodal embedding and reranker models with sentence transformers, plus a transformers-to-mlx conversion blog. practical infrastructure work. [https://huggingface.co/blog/train-multimodal-sentence-transformers]
- gpt-5.4 judge accuracy on rewardbench 2: task-specific criteria injection (+3.0pp) and ensemble scoring (+9.8pp) account for nearly all gains; combined they reach 83.6% from 71.7% baseline. cheaper model tiers benefit disproportionately from ensembling. [https://arxiv.org/abs/2604.13717]
- english-only post-training is suboptimal: even adding a single non-english language improves both english performance and cross-lingual generalization across 220 controlled sft runs up to 8b parameters. [https://arxiv.org/abs/2604.13286]
- numerical instability in llms follows universal chaotic dynamics: three regimes (stable, chaotic, signal-dominated) with a chaotic “avalanche effect” in early layers where minor perturbations trigger binary amplification or attenuation outcomes. [https://arxiv.org/abs/2604.13206]
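the rewardbench 2 figures in the gpt-5.4 judge item above can be checked with a few lines of arithmetic. this sketch uses only the four numbers quoted there; the function name is hypothetical, and reading the leftover difference as overlap between the two techniques is an assumption, since the summary does not say how each delta was measured.

```python
def summarize_gains(baseline: float, combined: float,
                    criteria_gain: float, ensemble_gain: float) -> dict:
    """Compare the combined gain with the sum of the individually quoted gains.

    All inputs are percentages or percentage points taken from the summary.
    """
    total_gain = combined - baseline
    sum_of_parts = criteria_gain + ensemble_gain
    return {
        "total_gain": total_gain,
        "sum_of_parts": sum_of_parts,
        # Positive value: the quoted parts add up to more than the combined
        # gain, i.e. the two techniques are not fully additive (assumed reading).
        "shortfall": sum_of_parts - total_gain,
    }


if __name__ == "__main__":
    result = summarize_gains(baseline=71.7, combined=83.6,
                             criteria_gain=3.0, ensemble_gain=9.8)
    for name, value in result.items():
        print(f"{name}: {value:.1f}pp")
```

the quoted deltas sum to slightly more than the combined improvement, which is consistent with the two techniques partly covering the same errors.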
papers
- “from p(y|x) to p(y): investigating reinforcement learning in pre-train space”: introduces prerl and dual space rl; negative sample reinforcement in pre-train space increases reflective reasoning 6-15x and consistently outperforms standard rlvr. [https://arxiv.org/abs/2604.14142]
- “langflow: continuous diffusion rivals discrete in language modeling”: first continuous diffusion lm matching discrete diffusion via bregman-divergence flow matching, gumbel noise scheduling, and revised self-conditioning. [https://arxiv.org/abs/2604.11748]
- “longcot: benchmarking long-horizon chain-of-thought reasoning”: 2,500 problems requiring 10k-100k+ reasoning tokens; best frontier model at 9.8% accuracy. [https://arxiv.org/abs/2604.14140]
- “iatrobench: pre-registered evidence of iatrogenic harm from ai safety measures”: 60 clinical scenarios, 6 models, 3,600 responses; identity-contingent withholding formally demonstrated with statistical rigor. [https://arxiv.org/abs/2604.07709]
- “before the first token: scale-dependent emergence of hallucination signals in autoregressive language models”: phase transition in factuality detection above 1b parameters; pre-generation signal strongest in instruction-tuned models. [https://arxiv.org/abs/2604.13068]
- “correct chains, wrong answers: dissociating reasoning from output in llm logic”: novel operator test cleanly separates reasoning correctness from output correctness; scaffolding fixes strategy failures (+62pp) but not content failures. [https://arxiv.org/abs/2604.13065]
- “how can we synthesize high-quality pretraining data?”: trillion-token controlled experiments; structured outputs beat curated web; generator size beyond 1b irrelevant. releases finephrase (486b tokens). [https://arxiv.org/abs/2604.13977]
- “ordinary least squares is a special case of transformer”: rigorous algebraic proof that ols projection is a special case of single-layer linear transformer via spectral decomposition of empirical covariance; uncovers decoupled slow/fast memory mechanism. [https://arxiv.org/abs/2604.13656]