key developments

mistral released medium 3.5, a 128b dense model with 256k context and integrated reasoning
mistral dropped mistral medium 3.5, a dense 128b-parameter model that combines instruction following, reasoning, and coding in a single set of weights with a 256k context window. reasoning effort is configurable per request, and the model includes a vision encoder trained from scratch for variable image sizes. it replaces both mistral medium 3.1 and magistral, and is released under a modified mit license that requires commercial licensing above a revenue threshold. a dense model at 128b is notable in an era dominated by moe architectures; the configurable reasoning effort toggle (from instant reply to deep reasoning) mirrors what anthropic and others have done, but in a fully open-weights package. the practical question is whether 128b dense can compete with moe models that activate far fewer parameters per token. gguf quantizations are already available via unsloth.
https://huggingface.co/mistralai/Mistral-Medium-3.5-128B
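to put the dense-versus-moe question in rough numbers, here is a back-of-the-envelope sketch. it only counts weight storage (no kv cache, no runtime overhead), uses round bit widths rather than the actual bits-per-weight of any specific gguf quant, and the 20b-active moe used for comparison is a hypothetical, not a model mentioned above.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes); ignores KV cache and overhead."""
    return n_params * bits_per_weight / 8 / 1e9


def active_param_ratio(dense_params: float, moe_active_params: float) -> float:
    """How many times more parameters a dense model touches per token than an MoE."""
    return dense_params / moe_active_params


DENSE_PARAMS = 128e9          # from the announcement: dense, so every weight is active
HYPOTHETICAL_MOE_ACTIVE = 20e9  # ASSUMPTION: illustrative MoE activating 20b per token

# Round bit widths only; real quantization formats differ somewhat.
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(DENSE_PARAMS, bits):.0f} GB of weights")

ratio = active_param_ratio(DENSE_PARAMS, HYPOTHETICAL_MOE_ACTIVE)
print(f"active parameters per token vs hypothetical moe: {ratio:.1f}x")
```

under these assumptions the weights alone need roughly 256 gb at 16-bit and 64 gb at 4-bit, which is why the availability of quantized builds matters for anyone hoping to run this locally.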

simon willison shipped llm 0.32a0, a major architectural refactor of his llm library
the llm python library got its most significant structural change since launch. the core abstraction shifts from simple prompt/response pairs to message sequences as input and typed part streams as output. this matters because the old model (text in, text out) could no longer represent how modern llms are actually used: multi-turn conversations, tool calls, image/audio inputs, structured json output, reasoning traces. the new design makes the input a sequence of messages and the response a stream of differently typed parts, which maps properly onto how every major api works. a quick bugfix release, 0.32a1, addressed tool-calling conversation persistence in sqlite. this is an alpha, but it signals where ecosystem tooling is heading; anyone building on llm should pay attention.
https://simonwillison.net/2026/Apr/29/llm/#atom-everything
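the shift described above (messages in, typed parts out) can be illustrated with a small sketch. this is not the llm library's api: every class and function name here (`Message`, `TextPart`, `ReasoningPart`, `ToolCallPart`, `fake_model`, `visible_text`) is hypothetical and exists only to show the shape of the idea.

```python
from dataclasses import dataclass
from typing import Iterator, List, Union


@dataclass
class Message:
    role: str      # e.g. "system", "user", "assistant", "tool"
    content: str


@dataclass
class TextPart:
    text: str


@dataclass
class ReasoningPart:
    text: str


@dataclass
class ToolCallPart:
    name: str
    arguments: dict


Part = Union[TextPart, ReasoningPart, ToolCallPart]


def fake_model(messages: List[Message]) -> Iterator[Part]:
    """Hypothetical stand-in for a model call: yields a fixed stream of typed parts."""
    last = messages[-1].content
    yield ReasoningPart(text=f"user asked: {last}")
    yield ToolCallPart(name="lookup", arguments={"query": last})
    yield TextPart(text="here is ")
    yield TextPart(text="the answer.")


def visible_text(parts: Iterator[Part]) -> str:
    """Join only the text parts, skipping reasoning traces and tool calls."""
    return "".join(p.text for p in parts if isinstance(p, TextPart))


conversation = [
    Message("system", "be brief"),
    Message("user", "what changed in 0.32?"),
]
print(visible_text(fake_model(conversation)))
```

the point of the typed stream is that a consumer can choose what to do with each kind of part (display text, log reasoning, execute tool calls) instead of parsing everything out of one flat string.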

deepseek v4 pricing continues to reshape cost expectations, vision capability entering grayscale testing
multiple threads on localllama are working through the deepseek v4 pricing math. v4 pro input at $0.145 per million tokens is 34x cheaper than claude opus 4.7; with the promotional 75% discount running through the end of may, that becomes 138x cheaper. cache hits at $0.0036 per million tokens are 173x cheaper than opus cached pricing. deepseek itself acknowledges that v4 is 3 to 6 months behind gpt-5.4 and gemini 3.1 pro on raw capability, so this is not frontier quality at frontier prices; it is near-frontier quality at prices that make agentic loops with heavy cached system prompts essentially free. separately, deepseek has begun grayscale testing (a staged rollout to a subset of users) of vision/multimodal capabilities in its chat interface. the combination of aggressive pricing and expanding modality support is the real competitive pressure here.
https://www.reddit.com/r/LocalLLaMA/comments/1syolyk/deepseek_v4_pricing_is_genuinely_silly_did_the/
https://www.reddit.com/r/LocalLLaMA/comments/1sysj7u/deepseek_has_began_grayscale_testing_for_deepseek/
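the input-price ratios above are easy to sanity-check. the sketch below takes the $0.145 list price and the 75% discount from the text; the $5.00 per million reference price is an assumption, chosen because it is consistent with both quoted ratios, and is not a figure stated in the threads.

```python
V4_PRO_INPUT = 0.145            # USD per million input tokens, from the text
PROMO_DISCOUNT = 0.75           # 75% promotional discount, from the text
ASSUMED_REFERENCE_INPUT = 5.00  # ASSUMPTION: reference price implied by the 34x figure


def times_cheaper(reference_price: float, price: float) -> float:
    """Ratio of a reference price to a cheaper price."""
    return reference_price / price


promo_price = V4_PRO_INPUT * (1 - PROMO_DISCOUNT)
list_ratio = times_cheaper(ASSUMED_REFERENCE_INPUT, V4_PRO_INPUT)
promo_ratio = times_cheaper(ASSUMED_REFERENCE_INPUT, promo_price)

print(f"promo input price: ${promo_price:.5f} per million tokens")
print(f"list price:  {list_ratio:.1f}x cheaper than the assumed reference")
print(f"promo price: {promo_ratio:.1f}x cheaper than the assumed reference")
```

a 75% discount leaves a quarter of the price, so the ratio is simply multiplied by four; the quoted 138x (rather than 4 x 34 = 136x) comes from the unrounded ratio of about 34.5.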

ibm released granite 4.1 family (3b/8b/30b) with detailed training methodology
ibm launched the granite 4.1 model family at three sizes: 3b, 8b, and 30b. a detailed blog post on hugging face describes how the models were built. a separate granite speech 4.1 model was also released. the significance depends on benchmark performance relative to peers at those sizes, which early community discussion is still evaluating.
https://huggingface.co/blog/ibm-granite/granite-4-1

qwen introduced flashqla, high-performance linear attention kernels for edge deployment
qwen released flashqla, built on tilelang, delivering 2 to 3x forward speedup and 2x backward speedup for linear attention models. the design targets agentic ai on personal devices, using gate-driven automatic intra-card context parallelism and hardware-friendly algebraic reformulation. gains are especially pronounced for tensor-parallel setups, small models, and long-context workloads. the backward pass required a 16-stage warp-specialized pipeline under tight on-chip memory constraints. this is infrastructure work that matters if linear attention architectures gain adoption for on-device inference.
https://github.com/QwenLM/FlashQLA

apple research on adaptive thinking: knowing when to use latent space reasoning
apple published research on letting llms decide when chain-of-thought reasoning is actually necessary and when the model can think in latent space instead. they use self-consistency (agreement among multiple reasoning paths) as a proxy for thinking necessity. this addresses compute-optimal inference: how to allocate test-time compute based on query complexity rather than uniformly applying expensive reasoning to every prompt. it matters because the industry is moving toward configurable reasoning budgets, and understanding when reasoning helps versus wastes tokens is a practical efficiency question.
https://machinelearning.apple.com/research/adaptive-thinking
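one plausible reading of "self-consistency as a proxy for thinking necessity" is sketched below: sample several cheap answers, measure how often they agree, and only pay for explicit reasoning when agreement is low. this is an illustration, not apple's method; the function names, the majority-share agreement measure, the 0.8 threshold, and the direction of the rule are all assumptions.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, List


def agreement(answers: List[str]) -> float:
    """Share of samples that match the most common answer (1.0 = unanimous)."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)


def needs_thinking(
    sample_answer: Callable[[str], str],  # hypothetical cheap, no-reasoning sampler
    prompt: str,
    n_samples: int = 5,
    threshold: float = 0.8,               # ASSUMPTION: arbitrary cut-off
) -> bool:
    """Route to explicit reasoning when cheap samples disagree too much."""
    answers = [sample_answer(prompt) for _ in range(n_samples)]
    return agreement(answers) < threshold


# Deterministic stand-ins for a sampled model, for demonstration only.
def easy_sampler(prompt: str) -> str:
    return "4"


_hard_answers = cycle(["a", "b", "a", "c", "b"])


def hard_sampler(prompt: str) -> str:
    return next(_hard_answers)


print(needs_thinking(easy_sampler, "what is 2 + 2?"))
print(needs_thinking(hard_sampler, "a tricky multi-step question"))
```

with the stub samplers, the unanimous case is answered directly and the split case is sent to the expensive path, which is the kind of per-query compute allocation the research is about.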

notable

papers

barriers to universal reasoning with transformers (and how to overcome them) ; shows that under standard positional encodings, transformers with chain-of-thought cannot solve problems beyond tc0 even with arbitrary cot length, but introduces signpost tokens that enable length-generalizable turing machine simulation. https://arxiv.org/abs/2604.25800

why does reinforcement learning generalize? a feature-level mechanistic study of post-training in llms ; finds that sft introduces many specialized features that stabilize early while rl induces restrained, continually evolving changes that preserve base representations; identifies a compact task-agnostic feature set causally responsible for rl generalization. https://arxiv.org/abs/2604.25011

lessismore: cross-head unified sparse attention for reasoning ; training-free sparse attention exploiting the insight that token importance is global and stable across attention heads; matches or improves accuracy while achieving up to 1.6x end-to-end decoding speedup. https://arxiv.org/abs/2508.07101

the surprising universality of llm outputs ; discovers that token rank-frequency distributions from six frontier models converge to the same two-parameter mandelbrot distribution, enabling cpu-only model fingerprinting at 2.6 microseconds per token. https://arxiv.org/abs/2604.25634

mimo-embodied: x-embodied foundation model ; first cross-embodied model achieving sota on both autonomous driving (12 benchmarks) and embodied ai (17 benchmarks), showing strong positive transfer between domains through multi-stage learning and rl fine-tuning. open-sourced. https://arxiv.org/abs/2511.16518

marco-moe: open multilingual mixture-of-expert language models ; 5% parameter activation per token with upcycling from dense models; instruct variants surpass models with 3 to 14x more activated parameters; full training data, recipes, and weights released. https://arxiv.org/abs/2604.25578

odysseys: benchmarking web agents on realistic long horizon tasks ; 200 tasks from real browsing sessions on live internet; best frontier model achieves 44.5% success; introduces trajectory efficiency metric revealing agents achieve only 1.15% rubric score per step. https://arxiv.org/abs/2604.24964