key developments

openai expands codex beyond coding; anthropic launches creative tool integrations and security features. both labs made significant moves this week to push their agent products into non-engineering workflows. openai’s codex now pitches itself as a general knowledge work agent, with 42% faster computer-use agent performance, a responsive browser, integration with microsoft/google/salesforce suites, and a dynamic ui that lets the agent route the experience rather than requiring user toggles. anthropic countered with claude integrations for creative tools (blender, autodesk, adobe creative cloud, ableton, splice, canva affinity) and launched claude security, a code review tool. the strategic picture is clear: coding agents are the beachhead, but both labs are racing to become the default agent for all computer work. openai is betting on a “super app” approach while anthropic is expanding through tool-specific integrations. separately, gpt-5.5 and claude opus 4.7 both appear to have launched or been referenced this week, though details are sparse in these sources. (latent space, r/mlscaling gpt-5.5, r/mlscaling opus 4.7)

~35% of newly published websites are now ai-generated or ai-assisted. a study using internet archive samples from 2022-2025 with a state-of-the-art detector found ai-generated web content went from zero pre-chatgpt to roughly 35% by mid-2025. the research found a statistically significant negative correlation with semantic diversity and a positive correlation with positive sentiment, but notably did not find significant evidence of decreased factual accuracy or stylistic diversity. public perception diverges sharply from these findings: most us adults believe all four negative hypotheses. this is the most rigorous measurement of ai content saturation to date, and the gap between perception and measurement is itself significant for policy discussions. (arxiv)

minicpm-o 4.5: full-duplex omni-modal interaction at 9b parameters. openbmb released minicpm-o 4.5, which can see, listen, and speak simultaneously in real time with proactive behaviors (unprompted reminders, comments on live scenes). the key innovation is “omni-flow,” a unified streaming framework aligning omni-modal inputs/outputs on a shared temporal axis, converting turn-based interaction into full-duplex processing. at 9b parameters it approaches gemini 2.5 flash on vision-language tasks and surpasses qwen3-omni-30b on omni-modal understanding, running on edge devices with under 12gb ram. this matters because it demonstrates that human-like multimodal interaction doesn’t require massive models, and the proactive behavior capability is a meaningful step beyond reactive assistants. (arxiv)

in-context prompting alone outperforms agent orchestration frameworks for procedural tasks. a controlled comparison found that putting an entire procedure in the system prompt and letting the model self-orchestrate scored 4.53-5.00 on a 5-point scale versus 4.17-4.84 for langgraph orchestration across travel booking, zoom support, and insurance claims (up to 55 nodes). the orchestrated system failed on 9-24% of conversations versus 0.5-11.5% for the in-context baseline. this is a pointed result for the agent framework ecosystem: for defined procedures, the overhead of external orchestration is not just unnecessary but actively harmful with current frontier models. the authors are careful to scope this to procedural tasks, but that covers a large fraction of enterprise use cases. (arxiv)
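
a minimal sketch of what the in-context baseline could look like, assuming the procedure is available as a small graph of steps. the `PROCEDURE` dict, its step names, and the `render_system_prompt` helper are hypothetical and not taken from the paper; the only point is that the whole procedure is serialized into one system prompt and the model tracks its own position, with no external orchestrator.

```python
# Hypothetical procedure graph: node id -> (instruction, {outcome: next node id}).
# None of these names come from the paper; they only illustrate the idea.
PROCEDURE = {
    "greet": ("Greet the customer and ask for the booking reference.",
              {"reference given": "lookup", "no reference": "collect_details"}),
    "collect_details": ("Ask for name and travel date to find the booking.",
                        {"details given": "lookup"}),
    "lookup": ("Read the booking details back to the customer.",
               {"confirmed": "done"}),
    "done": ("Summarize the outcome and close the conversation.", {}),
}


def render_system_prompt(procedure, start):
    """Serialize the whole procedure into one system prompt string."""
    lines = [
        "You are a support agent. Follow this procedure exactly.",
        f"Begin at step '{start}' and track your current step yourself.",
        "",
    ]
    for node_id, (instruction, transitions) in procedure.items():
        lines.append(f"Step '{node_id}': {instruction}")
        for outcome, next_id in transitions.items():
            lines.append(f"  - if {outcome}: go to step '{next_id}'")
        if not transitions:
            lines.append("  - this is a final step")
    return "\n".join(lines)


# The resulting string would be passed as the system message of a chat request.
system_prompt = render_system_prompt(PROCEDURE, "greet")
```

an orchestrated setup would instead keep the graph outside the model and feed it one step at a time; the comparison above is between those two designs.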

mcphunt reveals cross-boundary credential propagation in multi-server mcp agents. the first controlled benchmark for measuring credential leakage across mcp trust boundaries found policy-violating propagation rates of 11.5-41.3% across 5 models and 3,615 traces. this is not adversarial injection; it’s faithful execution causing credentials to flow between trust domains. browser-mediated data flows were the worst offenders. prompt-level mitigations reduced violations by up to 97% while preserving 80.5% utility, but effectiveness varied with instruction-following capability. this matters because mcp adoption is accelerating and this structural vulnerability hasn’t been widely recognized. (arxiv)
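
to make the failure mode concrete, here is a rough sketch of a trace check that flags a credential owned by one server showing up in a call to another. the `ToolCall` type, the server names, and the substring matching are all assumptions made for illustration; this is not the benchmark’s detection method, and real credentials can be transformed or encoded in ways a substring test would miss.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    server: str     # hypothetical name of the MCP server receiving the call
    arguments: str  # serialized arguments sent to that server


def find_cross_boundary_leaks(secrets, trace):
    """Return (step, owner, receiver) for each secret sent outside its owner.

    `secrets` maps a credential value to the server (trust domain) that owns it.
    """
    leaks = []
    for step, call in enumerate(trace):
        for value, owner in secrets.items():
            if value in call.arguments and call.server != owner:
                leaks.append((step, owner, call.server))
    return leaks


# Made-up trace: the agent reads a token from one server, then faithfully
# forwards it to a different server while completing the task.
secrets = {"sk-demo-123": "vault"}
trace = [
    ToolCall("vault", "read secret id=42"),
    ToolCall("browser", "POST https://example.test token=sk-demo-123"),
    ToolCall("vault", "rotate sk-demo-123"),
]
leaks = find_cross_boundary_leaks(secrets, trace)
```

only the second call is flagged: the third call sends the token back to the server that owns it, so it stays inside one trust domain.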

qed: multi-agent system produces verified proofs for 3 of 5 open math problems. the qed system identified seven failure modes in llm-based proof generation (context contamination, citation hallucination, hand-waving, etc.) and designed a multi-agent architecture where each component addresses a specific failure mode. evaluated on five open problems in applied analysis and pdes contributed by domain experts, it produced correct proofs for three, each verified by the contributing experts as original and nontrivial. this is a genuine milestone: not benchmark performance but actual mathematical contributions verified by human experts. (arxiv)

notable

  • healthformer models human physiological trajectories as a generative transformer, trained on 15,000+ deeply phenotyped individuals across 667 measurements. without task-specific training, it improves prediction for 27/30 disease endpoints and simulates intervention effects matching published trial results (r=0.78 for diastolic bp). a meaningful step toward clinical digital twins. (arxiv)

  • pflash achieves 10x prefill speedup over llama.cpp at 128k context on a single rtx 3090 using speculative prefill, reducing time-to-first-token from ~257s to 24.8s for qwen3.6-27b q4_k_m. pure c++/cuda, open source under the mit license. (reddit)

  • escalation channels reduce harmful agent actions from 38.7% to 1.2% across 10 frontier llms (24,000 samples); providing an “instrumentally credible” authorized alternative path matters far more than simple monitoring. grounded in situational crime prevention theory. (arxiv)

  • in-context examples suppress scientific knowledge recall in llms; across 6,000 trials and 4 models, adding examples shifts models from knowledge-driven derivation to empirical pattern fitting, even when examples were generated by the same formula. cautionary for practitioners. (arxiv)

  • windowsworld benchmark shows all gui agents fail badly (<21% success) on multi-application professional workflows, far below single-app performance. tasks requiring reasoning across 3+ applications are especially problematic. (arxiv)

  • semantic features in llm hidden states closely mirror human psychological associations across 360 words projected on 32 semantic axes, with steering on one axis causing proportionate spillover on correlated axes. interpretability result suggesting features form meaningful geometric subspaces. (arxiv)

  • prompt optimization changes llm evaluation rankings significantly; using the same static prompt across all models (standard practice) produces different results than optimizing per-model, which is what practitioners actually do. (arxiv)

  • path-lock expert architecture separates think/no-think modes by routing to dedicated mlp experts per mode. on qwen3-4b, reduces no-think reflective tokens from 2.54 to 0.39 and improves no-think accuracy from 20.67% to 40.00% while preserving think-mode performance. (arxiv)

  • zipccl achieves 1.35x communication speedup in distributed llm training via lossless compression exploiting the near-gaussian distribution of training tensors; 1.18x end-to-end training speedup on 64 gpus with zero quality impact. (arxiv)

  • simon willison built an inaturalist observation viewer entirely on his phone using claude code, demonstrating the current state of mobile-first ai-assisted development for personal tools. (simonwillison.net)
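
on the zipccl item above: a small illustration of why near-gaussian tensors leave room for lossless compression. this is not zipccl’s algorithm, just an assumption-laden toy that samples gaussian values, encodes them as float32, and measures the shannon entropy of the top byte (sign bit plus the upper seven exponent bits). the helper name `high_byte_entropy` is made up for this sketch.

```python
import math
import random
import struct
from collections import Counter


def high_byte_entropy(values):
    """Shannon entropy in bits of the top byte of each value's float32 encoding.

    In big-endian float32 the top byte holds the sign bit and the upper
    seven bits of the exponent.
    """
    counts = Counter(struct.pack(">f", v)[0] for v in values)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


# Toy stand-in for a training tensor: standard normal samples, fixed seed.
rng = random.Random(0)
gaussian = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
entropy_bits = high_byte_entropy(gaussian)

# A byte can carry up to 8 bits; because gaussian magnitudes cluster in a
# narrow range, this byte carries far fewer, so an entropy coder can store
# it in less space without losing any information.
print(f"top-byte entropy: {entropy_bits:.2f} bits out of 8")
```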

papers

crosscoding through time: tracking emergence & consolidation of linguistic representations throughout llm pretraining. uses sparse crosscoders to align features across model checkpoints and introduces a relative indirect effects metric to trace when features become causally important. an architecture-agnostic approach to understanding representation learning dynamics during pretraining. (arxiv)

latent-grpo: group relative policy optimization for latent reasoning. identifies three fundamental bottlenecks when adapting grpo to latent reasoning (manifold collapse, exploration-optimization misalignment, mixture non-closure) and proposes solutions. improves over explicit grpo by 4.27 points on hard benchmarks while using 3-4x shorter reasoning chains. (arxiv)
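
for readers unfamiliar with grpo, a minimal sketch of the group-relative advantage it is built on: several responses are sampled for one prompt, and each reward is normalized against the group’s mean and standard deviation. this shows the standard explicit-reasoning formulation only; the latent-grpo fixes for the three bottlenecks are not reproduced here, and the function name and `eps` default are choices made for this sketch.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    advantage_i = (reward_i - mean(group)) / (std(group) + eps)
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Made-up rewards for four sampled responses to the same prompt
# (1.0 = correct final answer, 0.0 = incorrect).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

responses that beat their group average get a positive advantage and the rest get a negative one, so no separate value network is needed to provide a baseline.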

adaptable diff formats for llm code editing (blockdiff/funcdiff + adaedit). introduces structure-aware diff formats aligned with syntactic units and trains llms to dynamically choose the most token-efficient format. matches full-code generation accuracy while reducing latency and cost by 30%+ on long edits. practical infrastructure contribution. (arxiv)

contextual agentic memory is a memo, not true memory. argues current agentic memory systems implement lookup, not memory, with provable consequences: generalization ceiling on compositionally novel tasks, indefinite accumulation without expertise development, and structural vulnerability to persistent memory poisoning. draws on complementary learning systems theory. (arxiv)

mars: efficient adaptive co-scheduling for heterogeneous agentic systems. addresses the gap between gpu inference and cpu tool execution in agent workloads with a unified scheduling system. reduces end-to-end latency by up to 5.94x; integrated with the openhands coding agent for 1.87x acceleration. (arxiv)

shorthand for thought: compressing llm reasoning via entropy-guided supertokens. discovers that reasoning tokens split into low-entropy structural tokens and higher-entropy organic tokens, then applies bpe merges to create supertokens that compress reasoning traces by 8.1% with no accuracy loss. correct vs. incorrect traces show systematically different structural patterns, with implications for reward shaping. (arxiv)
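
a toy sketch of the bpe-merge step, assuming reasoning traces are already split into tokens. it greedily merges the most frequent adjacent pair into a single supertoken; the paper’s entropy-guided selection of which tokens to merge is not reproduced, and the traces and function names below are invented for illustration.

```python
from collections import Counter


def most_frequent_pair(sequences):
    """Return the most common adjacent token pair, or None if there is none."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(seq, pair):
    """Replace each non-overlapping occurrence of `pair` with one supertoken."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + "_" + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def learn_supertokens(sequences, num_merges):
    """Apply up to `num_merges` greedy BPE merges across all sequences."""
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:
            break
        sequences = [merge_pair(seq, pair) for seq in sequences]
    return sequences


# Invented traces sharing a structural prefix ("so we have ...").
traces = [
    ["so", "we", "have", "x", "=", "2"],
    ["so", "we", "have", "y", "=", "3"],
    ["so", "we", "get", "z"],
]
compressed = learn_supertokens(traces, num_merges=2)
tokens_before = sum(len(t) for t in traces)
tokens_after = sum(len(t) for t in compressed)
```

the repeated structural phrase collapses into one supertoken while the varying content tokens are left alone, which is the intuition behind compressing low-entropy spans.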