key developments

nvidia releases nemotron 3 super: 120b moe hybrid mamba-transformer, open-sourced with 1m context
nvidia described nemotron 3 super, a 120b-parameter (12b active) mixture-of-experts model that combines mamba layers with attention in a new “latentmoe” architecture optimized for accuracy per flop and accuracy per parameter. the model is pretrained in nvfp4 on 25 trillion tokens, includes multi-token prediction layers for native speculative decoding, and supports up to 1m context length. nvidia claims up to 2.2x higher inference throughput than gpt-oss-120b and 7.5x higher than qwen3.5-122b at comparable accuracy. all checkpoints, datasets, and quantized variants are open-sourced on huggingface. this matters because it is nvidia’s first serious play at the model layer with a genuinely competitive open architecture, and a mamba-attention hybrid with moe is a design point no other lab has shipped at this scale. the throughput claims, if they hold, make this immediately relevant for cost-sensitive inference deployments. https://arxiv.org/abs/2604.12374
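
the mtp-based "native speculative decoding" mentioned above can be sketched as a toy loop: cheap mtp heads draft several future tokens at once, and the full model verifies them, keeping the matching prefix plus one guaranteed-correct token. this is a minimal illustration with stub models standing in for the real architecture; nothing here reflects nemotron's actual implementation.

```python
def full_model(ctx):
    # stand-in for the expensive full forward pass: deterministically
    # continues the string "banana" from the current position
    return "banana"[len(ctx) % 6]

def draft_model(ctx, k):
    # stand-in for the cheap mtp heads: proposes k tokens in one shot,
    # deliberately wrong at the last position to show partial acceptance
    return (list("banana"[len(ctx):len(ctx) + k - 1]) + ["x"])[:k]

def speculative_step(ctx, draft, full, k=3):
    drafts = draft(ctx, k)
    accepted = []
    for tok in drafts:
        # verify each draft token against the full model; a real
        # implementation does this in one batched forward pass
        if tok == full(ctx + accepted):
            accepted.append(tok)   # match: a token for free
        else:
            break                  # first mismatch ends acceptance
    # the verification pass always yields one guaranteed-correct token
    accepted.append(full(ctx + accepted))
    return accepted

# here the draft ["b", "a", "x"] gets "b" and "a" accepted, "x" rejected,
# and the full model supplies "n", so three tokens land in one step
```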

google releases gemini 3.1 flash tts, a prompt-directed text-to-speech model
google launched gemini 3.1 flash tts, a new text-to-speech model accessible through the standard gemini api (model id: gemini-3.1-flash-tts-preview). instead of a simple voice description, the model takes remarkably detailed scene-setting prompts; google’s example includes character backstory, accent direction, pacing notes, and environmental context to shape the output. simon willison tested it extensively and found it produces high-quality audio that responds meaningfully to the prompt direction. this is notable because it moves tts from “pick a voice” to “describe a performance,” a qualitative shift in how speech synthesis is controlled. the prompt-as-directing-notes paradigm is new for a production api. https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-tts/ https://simonwillison.net/2026/Apr/15/gemini-31-flash-tts/#atom-everything
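
the "directing notes" style of prompt can be assembled programmatically; a hypothetical helper might look like the sketch below. the field names (character, accent, pacing, scene) are illustrative assumptions inspired by google's example, not a documented api schema.

```python
def directing_notes(script, *, character="", accent="", pacing="", scene=""):
    # assemble performance direction (hypothetical field names) ahead of
    # the script itself, skipping any fields left empty
    notes = [f"{label}: {value}"
             for label, value in [("character", character), ("accent", accent),
                                  ("pacing", pacing), ("scene", scene)]
             if value]
    return "\n".join(notes + ["", "script:", script])

prompt = directing_notes(
    "well, that's the last of it.",
    character="a retired lighthouse keeper reflecting at dusk",
    accent="soft cornish lilt",
    pacing="slow, with long pauses between clauses",
    scene="wind and distant gulls, recorded outdoors",
)
```

the resulting string would be sent as the text input to the tts model; the point is that the performance, not just the voice, lives in the prompt.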

nathan lambert’s assessment: open models are keeping pace with closed labs, surprisingly
interconnects published a detailed analysis of the open vs closed model capability gap as of mid-2026. the core claim: it is surprising that top closed models have not opened a growing capability margin over open models, despite large compute advantages, especially from the second half of 2025 through today. lambert argues that open model labs are technically very strong at matching established benchmarks, that distillation enables fast-following, and that the supply of open models is dictated by economics rather than demand. this is a meaningful signal because lambert has deep visibility into the training landscape and is not a reflexive open-source advocate. his framing, that the unresolved question is “which business strategies support releasing open models,” suggests the gap question is now more about economics than capability. https://www.interconnects.ai/p/my-bets-on-open-models-mid-2026

latent space editorial: the “turkey problem” for knowledge workers as ai saturates benchmarks
latent space’s editorial frames a provocative argument: with swe-bench saturated (mythos at 78% on swe-bench pro), gdpval rating gpt 5.4 as equal to or better than human experts 83% of the time, and everyone working harder than ever even as agents take on more, knowledge workers may be in a “turkey problem” scenario. the observation that ai is not causing anyone to do less work, combined with agents handling increasing workloads, suggests a phase where ai augmentation increases total output without reducing human effort, which may precede a sharper transition. this is worth reading for the framing rather than any single data point. https://www.latent.space/p/ainews-humanitys-last-gasp

dynamic expert caching in llama.cpp yields 27% faster moe inference on consumer hardware
a community contributor implemented dynamic expert caching for mixture-of-experts models in llama.cpp, achieving 22.67 tok/s vs 17.87 tok/s for layer-based partial offload (a 26.9% improvement) when running qwen3.5-122b-a10b on an rtx 4090 with 96gb system ram. the approach tracks which experts are most frequently routed to and preloads them into vram, betting that routing is skewed enough for the vram hits to outweigh the transfer latency of misses. this is practically significant because it makes 122b moe models meaningfully more usable on single-gpu consumer setups, and the technique generalizes to other moe architectures, including the newly released nemotron 3 super. https://github.com/ParmesanParty/llama.cpp https://www.reddit.com/r/LocalLLaMA/comments/1slue0z/
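
the core idea, frequency-tracked residency with eviction of the coldest expert, can be sketched in a few lines. this is a minimal simulation under the same assumption the patch makes (skewed routing), with illustrative names; it is not the llama.cpp implementation.

```python
from collections import Counter

class ExpertCache:
    def __init__(self, capacity):
        self.capacity = capacity   # how many experts fit in "vram"
        self.counts = Counter()    # routing frequency per expert id
        self.resident = set()      # experts currently resident
        self.transfers = 0         # simulated host-to-device copies

    def route(self, expert_id):
        self.counts[expert_id] += 1
        if expert_id not in self.resident:
            self.transfers += 1    # cache miss: pay one transfer
            if len(self.resident) >= self.capacity:
                # evict the least frequently used resident expert
                coldest = min(self.resident, key=lambda e: self.counts[e])
                self.resident.discard(coldest)
            self.resident.add(expert_id)

# on a skewed trace, hot expert 0 stays resident and most of its
# activations are served without a transfer
cache = ExpertCache(capacity=2)
for expert in [0, 1, 0, 2, 0, 1, 0, 3, 0, 1]:
    cache.route(expert)
```

the bet only pays off when a few experts dominate routing; with uniform routing the transfers would dwarf the savings, which is why the measured speedup is workload-dependent.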

zvi covers claude code auto mode and the practical agentic coding landscape
zvi’s latest deep dive covers claude code’s auto mode (the “skip some permissions safely” middle ground between clicking yes to everything and --dangerously-skip-permissions), plus claude code desktop’s redesign for parallel agent management, full computer use for pro/max plans, and auto-fix in the cloud for ci failures. the practical framing is useful: auto mode addresses the real workflow problem that most users were just clicking “yes” without reading, which was no safer than skipping permissions entirely. the classifier distinguishing safe from dangerous operations is identified as the hard technical problem. https://thezvi.substack.com/p/claude-code-codex-and-agentic-coding
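
to make the classification problem concrete, here is a toy allowlist/denylist sketch of "is this shell command safe to auto-approve." every rule here is an illustrative assumption, not claude code's actual classifier; the point is how quickly the edge cases (subcommands, redirects, chaining) pile up.

```python
import shlex

READ_ONLY = {"ls", "cat", "grep", "git"}           # generally safe verbs
GIT_SAFE = {"status", "diff", "log", "show"}       # read-only git subcommands
DANGEROUS_TOKENS = {">", ">>", "|", ";", "&&", "rm", "curl", "sudo"}

def classify(command):
    tokens = shlex.split(command)
    # anything empty, chained, redirected, or touching a risky verb: ask
    if not tokens or DANGEROUS_TOKENS & set(tokens):
        return "ask"
    verb = tokens[0]
    if verb == "git":
        # only read-only git subcommands run without confirmation
        return "auto" if tokens[1:2] and tokens[1] in GIT_SAFE else "ask"
    return "auto" if verb in READ_ONLY else "ask"
```

a real classifier has to handle argument semantics, environment mutation, and network side effects, which is exactly why zvi flags it as the hard part.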

notable

  • latent planning emerges with scale in llms: study on qwen-3 family (0.6b-14b) shows models develop internal representations that plan future tokens (e.g., choosing “an” because they’ve already planned to output “accountant”), with capability increasing with model size. provides mechanistic evidence for implicit planning. https://arxiv.org/abs/2604.12493

  • multi-token prediction facilitates planning via reverse reasoning: theoretical and empirical work showing mtp induces a two-stage backward reasoning process in transformers, with a gradient decoupling property providing cleaner training signal than next-token prediction. https://arxiv.org/abs/2604.11912

  • lightning opd achieves 69.9% on aime 2024 with 30 gpu hours: proposes offline on-policy distillation that eliminates the need for a live teacher server, achieving 4x speedup over standard opd by enforcing “teacher consistency” (same teacher for sft and distillation). https://arxiv.org/abs/2604.13010

  • capo improves calibration of reasoning llms by 15% without sacrificing accuracy: addresses the problem that grpo-trained models become overconfident; proposes calibration-aware policy optimization using a logistic auc surrogate loss. practically relevant for knowing when to trust model outputs. https://arxiv.org/abs/2604.12632

  • codestruct: ast-based code editing improves swe-bench by 1.2-5% while cutting tokens 12-38%: reframes codebase as structured action space with syntax-validated transformations. gpt-5-nano improves 20.8% as empty-patch failures drop from 46.6% to 7.2%. https://arxiv.org/abs/2604.05407

  • instruction-tuned llm helpfulness collapses under trivial lexical constraints: banning a single punctuation mark or common word causes 14-48% comprehensiveness loss. base models show no such collapse, confirming instruction tuning creates the fragility by coupling competence to narrow templates. https://arxiv.org/abs/2604.13006

  • anypoc discovers 122 real bugs (105 confirmed) across firefox, chromium, llvm, openssl, sqlite: multi-agent framework for generating proof-of-concept tests from bug reports; produces 1.3x more valid pocs than claude code and codex while rejecting 9.8x more false positives. https://arxiv.org/abs/2604.11950

  • lasa reduces cross-lingual attack success rate from 24.7% to 2.8% by anchoring safety at the semantic bottleneck: identifies an intermediate layer where representations are governed by semantics rather than language identity, and aligns safety there. https://arxiv.org/abs/2604.12710

  • notion’s agent-building journey: deep dive into how notion rebuilt custom agents 4-5 times before shipping, covering the evolution from early failed tool-calling experiments in 2022 through progressive tool disclosure and the “software factory” vision. https://www.latent.space/p/notion

  • datasette 1.0a27 drops django-style csrf tokens in favor of modern browser headers (per filippo valsorda’s approach), plus new table rename events and internal database improvements. https://simonwillison.net/2026/Apr/15/datasette/#atom-everything

  • local-splitter measurement study: systematic evaluation of 7 tactics for reducing cloud llm token usage via local model triage; local routing + prompt compression achieves 45-79% cloud token savings on coding workloads. optimal tactic subset is workload-dependent. https://arxiv.org/abs/2604.12301
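
the triage tactic in the local-splitter bullet above reduces to a routing decision per request. a minimal sketch, assuming a crude token estimate and hand-picked hardness markers (both illustrative, not the paper's tactics):

```python
def route(prompt, max_local_tokens=500,
          hard_markers=("prove", "refactor the whole", "multi-file")):
    # crude token estimate: roughly 4 tokens per 3 words
    est_tokens = len(prompt.split()) * 4 // 3
    if est_tokens > max_local_tokens:
        return "cloud"     # long contexts exceed the local model's budget
    if any(m in prompt.lower() for m in hard_markers):
        return "cloud"     # tasks flagged as hard go to the big model
    return "local"         # everything else is triaged locally
```

the study's finding that the optimal tactic subset is workload-dependent shows up even here: the threshold and markers that save tokens on coding workloads would misroute other traffic.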

papers

“latent planning emerges with scale” (qwen-3 family study). defines latent planning as internal representations that cause specific future token generation and shape preceding context. mechanistic evidence that planning grows with model scale, with even 4b-8b models showing nascent planning mechanisms. https://arxiv.org/abs/2604.12493

“how transformers learn to plan via multi-token prediction.” proves mtp induces reverse reasoning in simplified transformers on graph path-finding, with results extending to countdown and boolean satisfiability. the gradient decoupling property of mtp is the key theoretical contribution. https://arxiv.org/abs/2604.11912
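
the mtp training objective itself is simple to state: at each position, k separate heads predict the tokens 1..k steps ahead, and their cross-entropies are summed. a toy numeric sketch, with no claim about the paper's exact setup:

```python
import math

def cross_entropy(logits, target):
    # standard softmax cross-entropy for one position (log-sum-exp trick)
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def mtp_loss(head_logits, future_targets):
    """head_logits[k] are the logits of the head predicting k+1 steps
    ahead; future_targets[k] is the matching ground-truth token id."""
    return sum(cross_entropy(l, t)
               for l, t in zip(head_logits, future_targets))

# two heads over a 3-token vocab, both maximally uncertain: the loss is
# 2 * log(3), since each head pays log(vocab) under uniform logits
loss = mtp_loss([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [1, 2])
```

the gradient decoupling claim is about how these per-head losses backpropagate: each head supplies its own signal for a different horizon, rather than all supervision flowing through the single next-token head.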

“nemotron 3 super: open, efficient mixture-of-experts hybrid mamba-transformer model for agentic reasoning.” nvidia. first model combining mamba-attention hybrid with latentmoe and mtp layers at 120b scale, pretrained in nvfp4 on 25t tokens. https://arxiv.org/abs/2604.12374

“thinking sparks!: emergent attention heads in reasoning models during post training.” circuit analysis showing post-training sparks functionally specialized attention heads; distillation/sft add stable heads while grpo operates in dynamic search mode with iterative activation and pruning. https://arxiv.org/abs/2509.25758

“one token away from collapse: the fragility of instruction-tuned helpfulness.” demonstrates that trivial lexical constraints cause instruction-tuned models (including gpt-4o-mini) to collapse outputs; linear probes predict collapse before generation begins, with base models unaffected. https://arxiv.org/abs/2604.13006

“calibration-aware policy optimization for reasoning llms.” proves grpo’s uncertainty-agnostic advantage estimation inevitably misaligns optimization with calibration. capo-1.5b improves calibration by up to 15% while matching or beating grpo accuracy, with downstream inference-time scaling gains of up to 5%. https://arxiv.org/abs/2604.12632
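
a logistic auc surrogate of the kind capo uses can be sketched generically: for every (correct, incorrect) pair of sampled answers, penalize the logistic loss on the confidence margin, which pushes correct answers above incorrect ones in confidence, the ordering auc measures. the formulation below is a standard pairwise ranking surrogate under assumed per-answer confidence scores and 0/1 correctness labels, not necessarily the paper's exact loss.

```python
import math

def auc_surrogate(confidences, correct):
    pairs, loss = 0, 0.0
    for ci, yi in zip(confidences, correct):
        for cj, yj in zip(confidences, correct):
            if yi == 1 and yj == 0:
                # logistic loss on the margin between a correct and an
                # incorrect answer's confidence
                loss += math.log(1 + math.exp(-(ci - cj)))
                pairs += 1
    return loss / pairs if pairs else 0.0

# well-calibrated confidences (correct answers more confident) score a
# lower surrogate loss than inverted ones
good = auc_surrogate([0.9, 0.8, 0.2], [1, 1, 0])
bad = auc_surrogate([0.2, 0.3, 0.9], [1, 1, 0])
```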

“anypoc: universal proof-of-concept test generation for scalable llm-based bug detection.” multi-agent framework with evidence-grounded verification and knowledge base evolution. 122 new bugs found across major systems (firefox, chromium, llvm, openssl, sqlite, ffmpeg, redis), 105 confirmed, 86 fixed. https://arxiv.org/abs/2604.11950