key developments
llama.cpp adds native audio processing via gemma 4 models. llama-server now supports speech-to-text using google’s gemma 4 e2a and e4a models, confirmed by both the localllama community and simon willison’s testing. willison posted a working uv run recipe for transcribing audio on macos using the 10.28gb gemma 4 e2b model with mlx-vlm, noting mostly accurate transcription with minor misinterpretations on a 14-second clip. this matters because it brings multimodal audio capabilities into the most widely used local inference stack without requiring separate stt pipelines. the combination of llama.cpp server support and mlx compatibility means gemma 4’s audio models are now practically usable across both linux and macos local setups. https://simonwillison.net/2026/Apr/12/mlx-audio/#atom-everything https://www.reddit.com/r/LocalLLaMA/comments/1sjhxrw/audio_processing_landed_in_llamaserver_with_gemma4/
speculative decoding with gemma 4 e2b drafting for gemma 4 31b yields +29% average speedup, +50% on code. a detailed benchmark on an rtx 5090 shows gemma 4 e2b (4.65b) as a draft model achieves 52.2% average acceptance rate when paired with gemma 4 31b, pushing generation from 57 t/s baseline to 74 t/s average. code generation hits 86 t/s (+50.5%). the post also documents a practical trap: mismatched gguf metadata (add_bos_token) between early and late quantizations forced token translation mode that destroyed performance. this is useful signal for anyone running gemma 4 locally; the same-family draft model approach works well here, but gguf version compatibility is a real footgun. https://www.reddit.com/r/LocalLLaMA/comments/1sjct6a/speculative_decoding_works_great_for_gemma_4_31b/
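the mechanic behind those numbers is the standard draft-then-verify loop: the small model proposes a few tokens cheaply, the large model checks them in one pass, and mismatches are replaced by the large model's own token. a toy sketch of greedy speculative decoding (not llama.cpp's implementation; both model functions are stand-ins, with the draft agreeing with the target about half the time to mirror the ~52% acceptance rate in the benchmark):

```python
import random

random.seed(0)

VOCAB = list(range(100))

def target_model(ctx):
    # stand-in for the large model: deterministic greedy next token
    return (sum(ctx) * 31 + 7) % 100

def draft_model(ctx):
    # stand-in for the small draft model: agrees with the target about
    # half the time, roughly matching the ~52% acceptance in the post
    tok = target_model(ctx)
    return tok if random.random() < 0.5 else random.choice(VOCAB)

def speculative_decode(ctx, n_tokens, k=4):
    """greedy speculative decoding: draft k tokens, verify against the
    target, keep the accepted prefix plus one target token per pass."""
    out = list(ctx)
    accepted = proposed = 0
    while len(out) - len(ctx) < n_tokens:
        draft = []
        for _ in range(k):
            draft.append(draft_model(out + draft))
        proposed += k
        for tok in draft:
            if tok == target_model(out):
                out.append(tok)          # draft token accepted
                accepted += 1
            else:
                out.append(target_model(out))  # target's correction
                break
        else:
            out.append(target_model(out))      # bonus token: all k accepted
    return out, accepted / proposed

seq, rate = speculative_decode([1, 2, 3], 40)
```

the speedup comes from each verification pass emitting between 1 and k+1 tokens for a single large-model step; the acceptance rate directly bounds how many draft tokens survive, which is why a same-family draft model (sharing the tokenizer and output distribution) works so well here.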
kiv: tiered kv cache replacement enables 1m token context on 12gb vram with no retraining. a new open source project replaces huggingface’s dynamiccache with a system that keeps recent tokens in vram, offloads old k/v to system ram, and retrieves only ~256 relevant v entries per decode step using k vectors as search indices. tested on gemma 4 e2b (4-bit) on an rtx 4070: 1m tokens with 12mb kiv vram overhead, 4.1 tok/s at 1m context, 70/70 needle-in-haystack tests passed up to 32k. the core insight (k vectors are smooth and searchable; v vectors are chaotic and should be retrieved on demand) is architecturally interesting. real limitations exist: two-hop reasoning fails, bounded prefill loses info on dense similar data, and cpu ram scales linearly. but as a drop-in cache replacement requiring zero model modification, this is a notable engineering contribution. https://www.reddit.com/r/MachineLearning/comments/1sjkmwz/kiv_1m_token_context_window_on_a_rtx_4070_12gb/
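the described design reduces to a two-tier cache: a hot window of recent k/v pairs, a cold tier for everything evicted, and a similarity search over cold k vectors to pull back only a handful of v entries per decode step. a minimal numpy sketch of that idea (not the kiv codebase; window sizes and dot-product scoring are illustrative assumptions):

```python
import numpy as np

class TieredKVCache:
    """toy version of the kiv idea: a small 'vram' window of recent k/v
    pairs, older entries offloaded to 'cpu ram', with k vectors used as a
    search index to retrieve only the top matching v entries per step."""

    def __init__(self, dim, vram_window=64, top_k=8):
        self.dim, self.vram_window, self.top_k = dim, vram_window, top_k
        self.hot_k, self.hot_v = [], []    # recent entries ("vram")
        self.cold_k, self.cold_v = [], []  # offloaded entries ("cpu ram")

    def append(self, k, v):
        self.hot_k.append(k); self.hot_v.append(v)
        if len(self.hot_k) > self.vram_window:  # evict oldest to cold tier
            self.cold_k.append(self.hot_k.pop(0))
            self.cold_v.append(self.hot_v.pop(0))

    def gather(self, query):
        """k/v for attention: the full hot window plus the top_k cold
        entries whose keys best match the query (dot-product score)."""
        ks, vs = list(self.hot_k), list(self.hot_v)
        if self.cold_k:
            sims = np.stack(self.cold_k) @ query
            for i in np.argsort(sims)[-self.top_k:]:
                ks.append(self.cold_k[i]); vs.append(self.cold_v[i])
        return np.stack(ks), np.stack(vs)

rng = np.random.default_rng(0)
cache = TieredKVCache(dim=16, vram_window=64, top_k=8)
for _ in range(1000):
    cache.append(rng.standard_normal(16), rng.standard_normal(16))
ks, vs = cache.gather(rng.standard_normal(16))
# attention now runs over 64 hot + 8 retrieved entries instead of all 1000
```

this also makes the stated failure modes legible: two-hop reasoning fails because the second hop's evidence may not score highly against the first query, and cpu ram grows linearly because the cold tier is never compressed, only searched.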
the sequence frames this week’s releases as three divergent frontier strategies. the editorial synthesizes anthropic’s mythos (frontier as security instrument, metered through project glasswing), meta’s muse spark (frontier as consumer attention layer), and z.ai’s glm-5.1 (frontier as long-duration autonomous labor). the framing that “we are leaving the era where every release looks like a slightly smarter chatbot” and the real divergence is now “deployment geometry” is worth internalizing. this is less a development and more an analytical lens, but it captures a structural shift in how labs are differentiating. https://thesequence.substack.com/p/the-sequence-radar-841-three-model
notable
- moss-tts-nano: 0.1b parameter multilingual tts model that runs in realtime on cpu. supports chinese, english, japanese, korean, arabic with streaming inference and voice cloning. practical for lightweight deployment scenarios. https://github.com/OpenMOSS/MOSS-TTS-Nano
- nvidia open-sourced aitune, a toolkit that auto-benchmarks and selects the fastest inference backend (tensorrt, onnx runtime, etc.) for pytorch models. useful for anyone tired of manually comparing backends. https://www.reddit.com/r/LocalLLaMA/comments/1sj4a3p/nvidia_drops_aitune_autoselects_fastest_inference/
- glm-5.1 shows competitive performance against frontier models in a social deduction game benchmark at ~25% of the cost of claude opus 4.6 per game, with a 0% tool error rate. small sample size so far but interesting signal on cost-performance. https://www.reddit.com/r/LocalLLaMA/comments/1sjm407/glm_51_sits_alongside_frontier_models_in_my/
- educational pytorch repo implements distributed training parallelism (dp, fsdp, tp, pp) from scratch with explicit collectives rather than high-level abstractions. good learning resource for anyone wanting to understand the mechanics. https://github.com/shreyansh26/pytorch-distributed-training-from-scratch
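the core mechanic that repo teaches, stripped of frameworks, is data parallelism as an explicit collective: each rank computes gradients on its batch shard, then an all-reduce averages them so every rank applies the same update. a numpy simulation of that step (real code would issue one torch.distributed.all_reduce per gradient; the linear model and shard sizes here are illustrative):

```python
import numpy as np

def all_reduce_mean(tensors):
    """simulated all-reduce: every 'rank' ends up with the mean of all
    inputs. in real code this is a torch.distributed.all_reduce call."""
    mean = sum(tensors) / len(tensors)
    return [mean.copy() for _ in tensors]

# data parallelism: replicate weights, shard the batch, average gradients
rng = np.random.default_rng(0)
w = rng.standard_normal(4)               # shared model weights
x = rng.standard_normal((8, 4))          # global batch of 8 examples
y = x @ np.array([1.0, -2.0, 0.5, 3.0])  # targets from a linear teacher

world_size = 4
shards = np.split(np.arange(8), world_size)  # 2 examples per rank

# each rank computes the mse gradient on its own shard only
local_grads = []
for idx in shards:
    err = x[idx] @ w - y[idx]
    local_grads.append(2 * x[idx].T @ err / len(idx))

# collective step: all ranks now hold the same averaged gradient
synced = all_reduce_mean(local_grads)

# single-device gradient on the full batch, for comparison
full_grad = 2 * x.T @ (x @ w - y) / len(y)
```

the averaged per-shard gradients exactly match the full-batch gradient here because the shards are equal-sized; seeing that equivalence (and where it breaks for uneven shards) is exactly the kind of mechanic the explicit-collectives approach makes visible.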
papers
“neural computers” (meta). trains video models to generate simulations of terminal and desktop environments, exploring whether ai can act as a computer by producing visual simulations of computing interfaces. conceptually interesting but early-stage. https://arxiv.org/abs/2604.06425