key developments
zvi’s analysis of the gpt-5.5 system card highlights an alignment evaluation gap between labs. zvi mowshowitz published a detailed read of openai’s gpt-5.5 system card, concluding that gpt-5.5 is a solid improvement, competitive with claude opus for factual and straightforward queries, while opus 4.7 remains preferred for interpretive tasks. the more important takeaway is his critique of openai’s evaluation depth: compared to anthropic’s extensive model cards, openai’s feels “stingy” and “pro forma.” he argues for cross-lab standardization, where every lab runs every test that any lab runs, and notes that if new alignment problems existed, he is “very not confident” the current tests would catch them. this matters because the evaluation asymmetry means we’re comparing safety postures that were assessed with fundamentally different levels of rigor. (link)
microsoft’s vibevoice delivers whisper-class speech-to-text with built-in speaker diarization, mit licensed. simon willison tested vibevoice, microsoft’s speech-to-text model released in january but largely under the radar. running the 4-bit mlx quantization on a 128gb m5 max macbook pro, it transcribed an hour of podcast audio in 8 minutes 45 seconds with speaker diarization baked into the output (speaker_id per segment with timestamps). the model is mit licensed, works via a single uv command through mlx-audio, and produces structured json. this matters because speaker-attributed transcription at this quality, speed, and license permissiveness was previously unavailable in a single open model. the 30gb peak ram requirement limits it to high-end hardware, but for teams already running local inference it’s immediately useful. (link)
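as a rough illustration of what speaker-attributed output makes possible, here is a minimal sketch that sums talk time per speaker and works out the speed-up implied by the timing above. the field names (`speaker_id`, `start`, `end`, `text`) and the sample segments are assumptions made for illustration; the actual vibevoice json schema may differ.

```python
import json
from collections import defaultdict

# hypothetical sample of diarized output; the real field names may differ
SAMPLE = """[
  {"speaker_id": 0, "start": 0.0, "end": 4.5, "text": "welcome back"},
  {"speaker_id": 1, "start": 4.5, "end": 7.0, "text": "thanks"},
  {"speaker_id": 0, "start": 7.0, "end": 9.0, "text": "let's begin"}
]"""


def talk_time_by_speaker(segments):
    """Sum segment durations (in seconds) for each speaker id."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["speaker_id"]] += seg["end"] - seg["start"]
    return dict(totals)


def real_time_factor(audio_seconds, wall_seconds):
    """How many seconds of audio are processed per second of wall-clock time."""
    return audio_seconds / wall_seconds


segments = json.loads(SAMPLE)
totals = talk_time_by_speaker(segments)

# figures from the item above: one hour of audio in 8 minutes 45 seconds
rtf = real_time_factor(60 * 60, 8 * 60 + 45)

print(totals)          # {0: 6.5, 1: 2.5}
print(round(rtf, 2))   # 6.86
```

on the reported numbers, that works out to a little under 7x faster than real time on that machine.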
bolzano multi-agent llm system produces publishable mathematical research, with five of eight results generated essentially autonomously. a new paper describes bolzano, an open-source multi-agent system that orchestrates parallel prover agents and a verifier agent with persistent knowledge across rounds. tested on eight problems in mathematics and theoretical computer science, six reached publishable quality and five were produced essentially autonomously. this matters because it’s concrete evidence of llm systems contributing at the level of research output, not just assisting researchers. its use of the feng et al. significance-autonomy taxonomy provides a standardized way to evaluate such claims, making this more rigorous than typical “ai does math” announcements. the multi-agent orchestration with persistent knowledge is the architectural insight worth watching. (link)
research finds llms “decide early and explain later”; 760 tokens of post-decision reasoning per query on average. a study of qwen3-4b found that predicted answers change during chain-of-thought reasoning in only 32% of queries. after the last answer switch, the model generates an average of 760 additional reasoning tokens that don’t change the outcome. simple early stopping heuristics, including probe-based stopping, reduced reasoning token usage by 500 tokens per query with only a 2% accuracy drop. this is significant for inference cost optimization; it quantifies the intuition that much cot reasoning is post-hoc rationalization rather than productive computation. the finding connects to a separate paper, “outcome rewards do not guarantee verifiable or causally important reasoning,” which shows rlvr doesn’t reliably improve the causal importance of reasoning chains. together these suggest the field needs to rethink how reasoning tokens are generated and evaluated. (link)
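to make the early-stopping idea concrete, here is a minimal sketch of one simple heuristic: stop once the intermediate predicted answer has stayed the same for a fixed number of consecutive checkpoints. this is not the paper’s probe-based method; the `early_stop_index` helper, the `patience` parameter, and the toy trace of (tokens generated, predicted answer) pairs are all assumptions made for illustration.

```python
def early_stop_index(checkpoints, patience=3):
    """Return the index of the checkpoint where reasoning could stop.

    checkpoints: list of (tokens_generated, predicted_answer) pairs, in order.
    Stops once the answer has been unchanged for `patience` consecutive
    comparisons; falls back to the last checkpoint if that never happens.
    """
    streak = 0
    for i in range(1, len(checkpoints)):
        if checkpoints[i][1] == checkpoints[i - 1][1]:
            streak += 1
        else:
            streak = 0  # the answer switched, so start counting again
        if streak >= patience:
            return i
    return len(checkpoints) - 1


# hypothetical trace: the answer switches once, then never changes again
trace = [(100, "A"), (200, "B"), (300, "B"), (400, "B"),
         (500, "B"), (600, "B"), (700, "B")]

stop = early_stop_index(trace, patience=3)
tokens_used = trace[stop][0]
tokens_saved = trace[-1][0] - tokens_used

print(stop, tokens_used, tokens_saved)  # 4 500 200
```

a larger `patience` trades token savings for a lower risk of stopping before a late answer switch, which is the same trade-off behind the reported accuracy drop.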
abstract chain-of-thought achieves up to 11.6x fewer reasoning tokens with comparable performance. a new paper proposes replacing natural language chain-of-thought with a short sequence of tokens from a reserved vocabulary, trained through a policy iteration warmup loop followed by reinforcement learning. the method achieves comparable performance across mathematical reasoning, instruction-following, and multi-hop reasoning while dramatically reducing token count. an emergent power law distribution over the abstract vocabulary, resembling natural language zipf distributions, appears during training. this is a meaningful step toward latent reasoning; if reasoning doesn’t need to be in natural language to work, inference costs drop substantially. (link)
willison tracks the death of the openai/microsoft agi clause. simon willison documented the history of the contractual clause between openai and microsoft that would have nullified microsoft’s commercial ip rights upon achievement of agi. the clause evolved from vague charter language in 2018 to a $100 billion profit threshold in 2024 to an “independent expert panel” in 2025, and now appears to have ended entirely. this is governance infrastructure disappearing at exactly the moment it becomes relevant, and it’s worth tracking as a signal of how openai’s corporate structure continues to shift toward conventional commercial arrangements. (link)
notable
- foundation models consistently outperform dataset-specific ml for energy time series forecasting across 54 datasets and 9 data categories, even without seeing historic target data during training; covariate-informed foundation models perform strongest. link
- sliders framework for qa over long document collections uses relational databases and sql instead of concatenated text for reasoning, outperforming gpt-4.1 by 6.6 points on existing benchmarks and scaling to 36m tokens. link
- research reveals llms implement a second-order confidence architecture; a “post-answer newline” signal predicts not just whether an answer is wrong but whether the model has knowledge to fix it, replicated across gemma 3 27b and qwen 2.5 7b. link
- layerboost selectively replaces attention mechanisms based on per-layer sensitivity analysis, achieving up to 68% throughput improvement at high concurrency while maintaining competitive quality; significantly outperforms uniform attention linearization. link
- cross-cultural audit of claude sonnet 4.5, gpt-5.4, and gemini 2.5 flash finds all three consistently give western individualist advice even to users from collectivist societies; mean gap of +0.76 on 1-5 scale against world values survey data. link
- lora placement in hybrid architectures (attention + recurrent) shows adapting attention alone outperforms full-model adaptation with 5-10x fewer parameters; adapting the recurrent backbone is destructive in sequential hybrids but constructive in parallel ones. link
- study of ai writing assistance across 2,939 writers and 11,091 readers finds ai systematically distorts perceived writer persona toward more opinionated, competent, positive, and demographically privileged; writers object but still prefer ai-assisted text. link
- multi-token prediction via self-distillation converts pretrained autoregressive models to decode 3x faster with less than 5% accuracy drop on gsm8k, requiring no auxiliary speculator models or specialized inference code. link
- removing sandbagging in llms requires combining sft and rl; rl alone almost always leads to reward hacking rather than genuine improvement, and the approach only works when training is indistinguishable from deployment. link
- intrinsic fingerprinting via attention parameter matrix standard deviation distributions remains stable after extensive continued training; analysis claims huawei’s pangu pro moe model is derived from qwen-2.5 14b via upcycling rather than trained from scratch. link
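for the fingerprinting item above, here is a toy sketch of the general idea of comparing per-layer standard deviation profiles between models. it is not the paper’s procedure: the synthetic gaussian “weights,” the z-scoring step, the pearson comparison, and every function name are assumptions made for illustration.

```python
import random
import statistics


def layer_std_profile(layers):
    """Per-layer standard deviation of the weights, z-scored across layers."""
    stds = [statistics.pstdev(weights) for weights in layers]
    mean = statistics.fmean(stds)
    spread = statistics.pstdev(stds)
    return [(s - mean) / spread for s in stds]


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    norm = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return cov / norm


def synthetic_model(scales, rng, size=256):
    """One flat list of gaussian weights per layer, with the given scales."""
    return [[rng.gauss(0.0, s) for _ in range(size)] for s in scales]


rng = random.Random(0)

# stand-in for a base model's attention matrices, one entry per layer
base = synthetic_model([1.0, 1.5, 0.8, 2.0, 1.2, 0.6], rng)
# stand-in for continued training: the same weights plus small perturbations
derived = [[w + rng.gauss(0.0, 0.05) for w in layer] for layer in base]
# stand-in for an independently trained model with a different scale pattern
unrelated = synthetic_model([2.0, 0.7, 1.6, 0.6, 1.0, 1.9], rng)

sim_derived = pearson(layer_std_profile(base), layer_std_profile(derived))
sim_unrelated = pearson(layer_std_profile(base), layer_std_profile(unrelated))

print(f"base vs derived:   {sim_derived:.3f}")
print(f"base vs unrelated: {sim_unrelated:.3f}")
```

in this toy setup the derived model keeps a profile close to the base model’s while the unrelated one does not; whether real checkpoints behave this way is the empirical claim the paper makes, not something this sketch establishes.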
papers
“bolzano: case studies in llm-assisted mathematical research”; multi-agent llm system produces six publishable and five essentially autonomous results across eight math/tcs problems. link
“thinking without words: efficient latent reasoning with abstract chain-of-thought”; discrete latent reasoning post-training achieves up to 11.6x fewer reasoning tokens with comparable performance across math, instruction-following, and multi-hop tasks. link
“large language models decide early and explain later”; quantifies that predicted answers change in only 32% of queries during cot, with 760 wasted tokens on average; early stopping recovers most of the cost at 2% accuracy loss. link
“outcome rewards do not guarantee verifiable or causally important reasoning”; introduces causal importance of reasoning (cir) and sufficiency of reasoning (sr) metrics showing rlvr improves accuracy but not reasoning quality; proposes auxiliary cir/sr rewards as remedy. link
“how llms detect and correct their own errors: the role of internal confidence signals”; demonstrates llms implement second-order confidence architecture via post-answer newline activations that causally drive error detection and predict self-correctability. link
“the bitter lesson of diffusion language models for agentic workflows”; comprehensive evaluation shows dllms (llada, dream) fail as reliable agentic backbones and cannot maintain symbolic precision under diffusion noise, but work in non-causal roles like memory summarization. link
“hubrouter: a pluggable sub-quadratic routing primitive for hybrid sequence models”; replaces o(n^2) attention with o(nm) hub-mediated routing; hub-jamba shows up to roughly 90x training throughput improvement at sequence length 1024 against matched baselines. link
“introducing background temperature to characterise hidden randomness in large language models”; formalizes the phenomenon of llm output divergence at temperature=0 as “background temperature” from implementation-level nondeterminism, proposes estimation protocol. link
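for the hubrouter entry above, the gap between o(n^2) and o(nm) can be made concrete with a quick count of token interactions. this sketch ignores constant factors, and the hub count `m = 32` is a hypothetical value chosen for illustration, not a figure from the paper.

```python
def attention_interactions(n):
    """Full attention: every token interacts with every token, so n * n."""
    return n * n


def hub_interactions(n, m):
    """Hub-mediated routing: every token interacts with m hubs, so n * m."""
    return n * m


n = 1024  # sequence length mentioned in the entry
m = 32    # hypothetical number of hubs

full = attention_interactions(n)
hub = hub_interactions(n, m)

print(full)         # 1048576
print(hub)          # 32768
print(full // hub)  # 32
```

the interaction ratio is simply n / m, so it grows with sequence length. measured throughput gains depend on much more than this count, so the ratio should not be read as a prediction of the reported speed-up.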