key developments
zvi analyzes claude mythos cybersecurity capabilities and project glasswing; tom’s hardware pushes back on the claims. zvi’s analysis covers anthropic’s decision to restrict claude mythos’s release to cybersecurity partners rather than the public, citing its ability to find and exploit vulnerabilities in major software at scale. the government response includes treasury and fed officials summoning wall street executives over cyber risk. tom’s hardware counters that the “thousands of severe zero-days” claim rests on only 198 manual reviews, calling the framing more sales pitch than substance. the truth likely sits between these poles: the capability is real enough to warrant restricted release, but the specific numbers deserve scrutiny. this matters because it sets a precedent for capability-gated model releases and signals that frontier cyber offense capabilities are arriving faster than most institutions anticipated. zvi analysis | tom’s hardware
simon willison highlights the growing capability gap between openai’s access points, citing karpathy. voice mode still runs on a gpt-4o-era model with an april 2024 knowledge cutoff, while codex can autonomously restructure entire codebases over hour-long sessions. karpathy’s core observation: reinforcement learning works dramatically better on domains with verifiable reward functions (code, math) than on subjective ones (writing, conversation), and because b2b value concentrates in the verifiable domains, teams focus on them disproportionately. this explains why the “ai is amazing” and “ai is disappointing” camps can both be correct simultaneously; they’re using different products. this is important context for anyone evaluating openai’s actual capability frontier versus what most users experience. willison
bytedance introduces in-place test-time training (ttt) for standard transformer llms. the method treats the final projection matrix of each mlp block as adaptable “fast weights” that update during inference, combined with a next-token-prediction-aligned objective and chunk-wise updates compatible with context parallelism. a 4b model achieves superior long-context performance on tasks up to 128k tokens. this is significant because it provides a drop-in mechanism for dynamic weight adaptation without architectural changes or retraining from scratch, potentially bridging the gap between static deployment and continual learning. the practical question is whether the inference overhead is acceptable for production use cases. reddit discussion
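the mechanism can be sketched in miniature: treat one weight matrix as fast weights and take a gradient step per chunk on a next-token-style objective during inference. everything below (the shapes, the learning rate, the squared-error stand-in for the lm loss, the function name) is an illustrative assumption, not the paper’s exact formulation.

```python
import numpy as np

def ttt_chunked(W, X, Y, lr=0.1, chunk=8):
    """test-time training sketch: W stands in for an mlp down-projection
    ("fast weights") and is updated during inference, one gradient step
    per chunk, on a squared-error stand-in for the next-token loss.
    all names and hyperparameters here are illustrative assumptions."""
    W = W.copy()
    chunk_mse = []
    for s in range(0, len(X), chunk):
        xc, yc = X[s:s + chunk], Y[s:s + chunk]
        err = xc @ W - yc                      # predict with current fast weights
        chunk_mse.append(float((err ** 2).mean()))
        W -= lr * xc.T @ err / len(xc)         # in-place chunk-wise update
    return W, chunk_mse
```

on data drawn from a fixed linear map, later chunks see lower error than earlier ones, which is the point: the weights adapt to the sequence as it is being processed, with no retraining or architectural change.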
iatrobench documents identity-contingent medical information withholding across frontier models. sixty pre-registered clinical scenarios show that all five testable models provide better medical guidance to physician-framed queries than layperson-framed ones, with a statistically significant decoupling gap (+0.38, p=0.003). the model with the heaviest safety investment (opus) shows the largest gap (+0.65). three distinct failure modes emerge: trained withholding, incompetence, and indiscriminate content filtering. critically, standard llm judges miss 73% of omission harms that physician evaluators catch. this is a rigorous demonstration that safety training creates measurably worse outcomes for the people most likely to need help, particularly in scenarios where standard referrals are already exhausted. arxiv
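the headline statistic is just a paired mean difference across the same scenarios under two framings. a minimal sketch, assuming scores on a common quality scale (the function and the scale are illustrative, not iatrobench’s actual harness):

```python
def decoupling_gap(physician_scores, layperson_scores):
    """paired mean difference in graded guidance quality between
    physician-framed and layperson-framed versions of the same
    scenarios; a positive gap means the physician framing gets
    better guidance. names and scale are illustrative assumptions."""
    assert len(physician_scores) == len(layperson_scores)
    diffs = [p - l for p, l in zip(physician_scores, layperson_scores)]
    return sum(diffs) / len(diffs)
```

a gap of +0.38 on such a scale means the same clinical question, reworded, reliably gets worse answers.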
atom report documents chinese open models overtaking u.s. counterparts in ecosystem adoption. the analysis covers approximately 1,500 mainline open models, tracking hugging face downloads, model derivatives, inference market share, and performance metrics. the crossover reportedly occurred in summer 2025, with the gap widening since. this is the first comprehensive adoption snapshot attempting to quantify what many have observed anecdotally: that qwen, deepseek, and related families now dominate the open model ecosystem. the policy implications are substantial for anyone advising on sovereign ai strategy or open model selection. arxiv
longwriter-zero achieves state-of-the-art ultra-long generation purely through rl, with no synthetic data. trained from qwen2.5-32b via reinforcement learning from scratch, with no sft on synthetic data, the model outperforms traditional sft pipelines on writingbench and arena-write, surpassing 100b+ models including deepseek r1 and qwen3-235b. the approach uses specialized reward models for length control, writing quality, and structural formatting in an r1-zero-style training setup. this matters because it demonstrates that rl alone can unlock capabilities previously thought to require large-scale supervised data pipelines, extending the r1-zero paradigm beyond math/code to open-ended generation. arxiv
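the reward side can be caricatured as a weighted sum of a length-control term and scores for quality and structure. the weights, the triangular length schedule, and the [0, 1] score scale below are all assumptions; in the actual setup these signals come from specialized learned reward models:

```python
def composite_reward(text, target_len, quality, fmt, w=(0.4, 0.4, 0.2)):
    """toy composite rl reward in the spirit of the paper's setup:
    length control peaks at target_len and decays linearly, while
    `quality` and `fmt` stand in for learned reward-model scores
    in [0, 1]. weights and functional form are illustrative."""
    n = len(text.split())
    length_score = max(0.0, 1.0 - abs(n - target_len) / target_len)
    return w[0] * length_score + w[1] * quality + w[2] * fmt
```

the interesting part is not the arithmetic but that a signal this decomposable, optimized in an r1-zero-style loop, is enough to beat sft pipelines on open-ended writing.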
notable
- wildtoolbench shows no frontier model exceeds 15% accuracy on realistic tool-use patterns. 57 llms tested; the benchmark captures compositional tasks, implicit intent, and instruction transitions that existing benchmarks miss entirely. arxiv
- llms form a coherent evaluative group misaligned with human readers on disinformation assessment. judges agree with each other far more than with humans, penalizing emotional intensity and rewarding logical rigor in ways readers don’t. internal agreement is not validity. arxiv
- blind refusal documented across 18 model configs: 75.4% refuse defeated-rule requests even when models recognize the rule is indefensible. refusal behavior is decoupled from normative reasoning capacity. arxiv
- selfdoubt proposes single-pass uncertainty estimation from reasoning traces via hedge-to-verify ratio. traces with no hedging markers are correct 96% of the time, creating a zero-cost confidence gate. outperforms semantic entropy at 10x lower cost. arxiv
- slip achieves 90-100% jailbreak success across 11 models including gpt-5.1, claude sonnet 4.5, gemini 2.5 pro using only ~7.9 llm calls per attack. the model guides its own compromise via breadth-first tree search; no external red-team model needed. arxiv
- o3 achieves only 17% of optimal collective performance in zero-cost cooperation scenarios while o3-mini reaches 50%. capability does not predict cooperation in multi-agent systems. arxiv
- evo-l2s reduces reasoning trace length by 50%+ while preserving accuracy through evolutionary model merging across 1.5b-14b scales. arxiv
- tracesafe-bench finds guardrail efficacy is driven by structural data competence (json parsing) not safety alignment. correlation with structured-to-text benchmarks is 0.79; near-zero with jailbreak robustness. architecture matters more than scale. arxiv
- glm 5.1 tops code arena rankings for open models per localllama community discussion. reddit
- prompting study across 764 calls finds too much detail kills sub-3b models (78% to 28% pass rate) and filler words are load-bearing for sub-2b models. format preference (xml vs markdown) is a myth across all sizes tested. reddit
- mineru2.5-pro advances document parsing state of the art purely through data engineering with no architectural changes to a 1.2b model, outperforming 200x larger models. scores 95.69 on omnidocbench v1.6. arxiv
papers
- “what do language models learn and when? the implicit curriculum hypothesis” proposes that pretraining follows a compositional, predictable curriculum; emergence orderings are consistent across model families (ρ=.81 across 45 pairs) and composite tasks emerge after components. arxiv
- “learning is forgetting: llm training as lossy compression” shows pre-training produces models approaching the information bottleneck bound, and compression optimality predicts downstream performance across model families. arxiv
- “loop, think, & generalize: implicit reasoning in recurrent-depth transformers” demonstrates that recurrent-depth transformers achieve systematic generalization and depth extrapolation that vanilla transformers cannot, via a three-stage grokking process. arxiv
- “cross-tokenizer llm distillation through a byte-level interface” proposes byte-level distillation as a simple baseline for cross-tokenizer knowledge transfer; competitive with more complex methods across 1b-8b scales. arxiv
- “what drives representation steering?” finds that steering vectors interact primarily with ov circuits (not qk), can be sparsified 90-99% while retaining performance, and different steering methodologies converge on the same important dimensions. arxiv
- “the master key hypothesis” demonstrates training-free cross-model capability transfer via linear subspace alignment; transferring cot from 14b to 7b yields +12.1% on math, and transferring math reasoning from 4b to 14b surpasses the 14b post-trained model. arxiv