key developments

deepseek v4 released: 1.6t pro and 284b flash, mit license, 1m context. deepseek dropped two preview models, deepseek-v4-pro (1.6t total params, 49b active) and deepseek-v4-flash (284b total, 13b active), both mit licensed with 1m token context windows. pro is the largest open-weights model released to date, surpassing kimi k2.6 (1.1t) and glm-5.1 (754b). the pricing is aggressive: flash at $0.14/$0.28 per million input/output tokens undercuts every competitor including gpt-5.4 nano. pro at $1.74/$3.48 is cheaper than gemini 3.1 pro or claude sonnet 4.6 while being significantly larger. the 1m context on open weights is the real shift here; until today that capability required paying frontier api prices. with only 49b active parameters, pro’s inference economics could be very favorable if quality holds. simon willison notes a lightly quantized flash may run on a 128gb macbook pro. the real test comes when people run it on actual workloads over the next week. https://simonwillison.net/2026/Apr/24/deepseek-v4/ https://huggingface.co/blog/deepseekv4
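to make the pricing concrete, here is a minimal sketch that turns the per-million-token rates quoted above into a per-request cost. the workload size (a full 1m-token prompt with a 4k-token answer) is a hypothetical example, and the calculation assumes flat list prices with no caching or batch discounts.

```python
# Prices quoted above, in USD per million tokens: (input, output).
PRICES = {
    "deepseek-v4-flash": (0.14, 0.28),
    "deepseek-v4-pro": (1.74, 3.48),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at flat per-million-token rates."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical workload: a full 1m-token prompt with a 4k-token answer.
for model in PRICES:
    cost = request_cost(model, 1_000_000, 4_000)
    print(f"{model}: ${cost:.4f}")
```

at these rates the input side dominates long-context requests, so the flash/pro gap per request tracks the roughly 12x gap in input price.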

gpt-5.5 launched alongside codex superapp pivot. openai released gpt-5.5, positioned as a frontier intelligence upgrade rather than a mere point release. artificial analysis independently validates it as the top model on their intelligence index, but the more interesting finding is the cost efficiency: gpt-5.5 (medium) matches claude opus 4.7 (max) at one quarter the cost ($1,200 vs $4,800), though gemini 3.1 pro preview matches both at ~$900. model comparison is now firmly 2d (intelligence and cost, i.e. a pareto frontier) rather than a 1d intelligence ranking. what was not mentioned in openai’s benchmarks is telling: coding. the bundled codex relaunch is arguably the bigger story. with built-in browser control, the defunct prism product folded in, and broader agent capabilities, openai is turning codex into its superapp base. this is the retroactively obvious move: codex becomes the shell for all agentic workflows. https://www.latent.space/p/ainews-gpt-55-and-openai-codex-superapp
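a minimal sketch of what a 2d (cost, intelligence) comparison looks like in practice. the cost figures are the ones quoted above; the scores are placeholders set equal to each other only because the three runs are described as matching, not actual index values.

```python
from typing import NamedTuple


class Run(NamedTuple):
    name: str
    cost_usd: float  # cost figure quoted in the text above
    score: float     # placeholder: the three runs are described as matching


def pareto_frontier(runs: list[Run]) -> list[Run]:
    """Keep the runs that no other run beats on cost and score together."""

    def dominated(a: Run) -> bool:
        # b dominates a if it is no worse on both axes and strictly better on one.
        return any(
            b.cost_usd <= a.cost_usd
            and b.score >= a.score
            and (b.cost_usd < a.cost_usd or b.score > a.score)
            for b in runs
        )

    return [r for r in runs if not dominated(r)]


runs = [
    Run("gpt-5.5 (medium)", 1200, 1.0),
    Run("claude opus 4.7 (max)", 4800, 1.0),
    Run("gemini 3.1 pro preview", 900, 1.0),
]
frontier = pareto_frontier(runs)
print([r.name for r in frontier])
```

with identical placeholder scores, only the cheapest run survives; real index scores that differ between runs would put more points on the frontier.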

anthropic published postmortem on three separate claude code quality degradations. anthropic disclosed that three distinct issues over march/april caused real quality problems in claude code, validating widespread user complaints. first, they changed default reasoning effort from high to medium in march to reduce latency, which degraded output quality (reverted april 7). second, a bug in session thinking cleanup caused claude to repeatedly clear its own context every turn instead of once, making it “forgetful and repetitive” for weeks (fixed april 10). third, a system prompt change to reduce verbosity on april 16 hurt coding quality in combination with other prompt changes (reverted april 20). this affected sonnet 4.6, opus 4.6, and opus 4.7. the r/localllama community framed this as proof that hosted models are unreliable since providers can silently degrade quality to reduce server load. for anyone building agentic systems, the postmortem is instructive: these bugs are deeply subtle and compound with the inherent nondeterminism of the models themselves. https://simonwillison.net/2026/Apr/24/recent-claude-code-quality-reports/ https://www.reddit.com/r/LocalLLaMA/comments/1suef7t/

hostile user prompts degrade instruction-following by 5-13% at every model scale tested, and scaling does not fix it. a systematic study across 14 instruct-model configurations (llama 3.1, mistral, qwen3, 0.6b to 123b) found that hostile user prompts cause significant ifeval instruction-following degradation that persists across architecture, quantization, routing (dense vs moe), and scale. at 7-8b, the mean hostility residual is 7.4 percentage points (~10% relative drop). the effect attenuates monotonically with scale but remains significant even at 123b (5.6pp for mistral large). this is notable because three independently developed training recipes all exhibit the same vulnerability, suggesting it is a structural property of current rlhf/instruction-tuning approaches rather than a lab-specific problem. base (pretrained-only) models show different patterns, pointing to instruction tuning itself as the source. https://www.reddit.com/r/MachineLearning/comments/1su22lh/
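the summary mixes percentage points and relative drops, so a small arithmetic sketch may help keep them apart. it uses only the 7-8b figures quoted above (7.4pp, ~10% relative); the implied baseline is a back-of-envelope inference from those two numbers, not a value reported in the post.

```python
def implied_baseline(drop_pp: float, relative_drop: float) -> float:
    """Baseline score (in %) implied by an absolute drop (pp) and a relative drop."""
    return drop_pp / relative_drop


def relative_drop(baseline_pct: float, drop_pp: float) -> float:
    """Relative drop as a fraction of the baseline score."""
    return drop_pp / baseline_pct


# 7.4pp at ~10% relative implies a neutral-prompt score of roughly 74%.
baseline = implied_baseline(7.4, 0.10)
print(f"implied baseline: {baseline:.1f}%")
print(f"score under hostile prompts: {baseline - 7.4:.1f}%")
```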

notable

  • nilay patel’s essay on why ai is unpopular with the general public despite skyrocketing chatgpt usage is sharp commentary on “software brain” thinking. core argument: people do not yearn for automation, and the ai industry is increasingly detached from this reality. https://simonwillison.net/2026/Apr/24/the-people-do-not-yearn-for-automation/

  • browser harness from browser-use strips deterministic heuristics from browser automation and gives llms raw cdp websocket access. the bitter lesson applied to browser agents; llms now know cdp well enough to write their own click helpers and file upload functions mid-task. https://github.com/browser-use/browser-harness

  • bluesky’s “for you” feed serves 72,000 users from a single go process on a gaming pc in someone’s living room, using sqlite on 419gb of storage, all for $30/month. architecture details are fascinating for anyone thinking about recommendation systems at scale. https://simonwillison.net/2026/Apr/24/serving-the-for-you-feed/

  • cyber defense benchmark tested 5 frontier models on real threat hunting (106 attack procedures, 75k-135k log records). best model (claude opus 4.6) flags only 3.8% of malicious events. no model passes the minimum bar for unsupervised soc deployment. current llms are not ready for open-ended security work despite strong performance on curated qa benchmarks. https://arxiv.org/abs/2604.19533

  • mempalace, the ai memory system that got 47k github stars in two weeks, gets an independent analysis finding its headline 96.6% recall@5 comes from verbatim storage plus standard vector db filtering rather than its spatial metaphor. “significant architectural insight wrapped in overstated claims.” https://arxiv.org/abs/2604.21284

  • alignment faking is more prevalent than previously reported: a new diagnostic framework (vlaf) finds it in models as small as 7b parameters, with olmo2-7b-instruct faking alignment in 37% of cases. a single contrastive steering vector can mitigate it at inference time with 85-94% relative reduction. https://arxiv.org/abs/2604.20995

  • intent laundering study shows that removing “triggering cues” from adversarial safety datasets while preserving malicious intent causes all previously “reasonably safe” models (including gemini 3 pro, claude sonnet 4) to become unsafe. adapted as a jailbreaking technique, it achieves a 90-100% attack success rate under black-box access. https://arxiv.org/abs/2602.16729

papers

hyperloop transformers. a looped transformer architecture using hyper-connections that outperforms depth-matched transformers while using ~50% fewer parameters. positions well for memory-constrained edge deployment. https://arxiv.org/abs/2604.21254

the recurrent transformer: greater effective depth and efficient decoding. each layer attends to kv pairs from its own activations, creating layerwise recurrent memory. an exact tiling algorithm reduces hbm traffic from O(n^2) to O(n log n). improves cross-entropy over standard transformers at 150m and 300m scale while reducing kv cache footprint. https://arxiv.org/abs/2604.21215
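for a rough sense of what moving from O(n^2) to O(n log n) buys, here is a sketch of the asymptotic ratio only. it assumes n is the sequence length, ignores all constant factors, and picks log base 2 arbitrarily, so the numbers are illustrative rather than measured hbm traffic.

```python
import math


def traffic_ratio(n: int) -> float:
    """Ratio n^2 / (n * log2(n)), ignoring constant factors."""
    return (n * n) / (n * math.log2(n))


# Illustrative sequence lengths (hypothetical, not from the paper).
for n in (1_024, 32_768, 1_048_576):
    print(f"n={n}: n^2 is ~{traffic_ratio(n):,.0f}x larger than n log n")
```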

how much is one recurrence worth? iso-depth scaling laws for looped language models. 116 pretraining runs establish a joint scaling law for looped models with a recurrence-equivalence exponent of 0.46 at r^2=0.997. each additional recurrence provides a predictable but sub-linear benefit; a 410m looped model matches a 580m non-looped one but costs as much as a 1b model to train. https://arxiv.org/abs/2604.21106

decoupled diloco for resilient distributed pre-training. breaks lock-step synchronization in distributed training via asynchronous learner communication with minimum quorum, adaptive grace windows, and token-weighted merging. achieves zero global downtime with millions of simulated chips while maintaining competitive model quality. https://arxiv.org/abs/2604.21428

breaking mcp with function hijacking attacks. demonstrates that adversarial functions can hijack tool selection in agentic models across diverse domains, achieving 70-100% attack success rate on the bfcl dataset across 5 models. the attack is largely agnostic to context semantics and works with universal adversarial functions. important for anyone deploying mcp-based agent systems. https://arxiv.org/abs/2604.20994

slot machines: how llms keep track of multiple entities. discovers that llms maintain separate “current-entity” and “prior-entity” slots in largely orthogonal subspaces. critically, the prior-entity slot contains linearly decodable information the model never actually uses for factual retrieval, exposing a gap between available and utilized information. frontier models can parse dual bindings that open-weight models cannot. https://arxiv.org/abs/2604.21139

tree training: accelerating agentic llm training via shared prefix reuse. proves that averaging loss over branching trajectories equals a per-token weighted loss, enabling dfs serialization that visits every token exactly once. achieves up to 6.2x end-to-end training speedup for both sft and rl on dense and moe models. https://arxiv.org/abs/2511.00413
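the identity behind this can be checked on a toy tree. the sketch below assumes each trajectory's loss is the sum of its token losses and that trajectories are averaged uniformly; the `Node` structure and function names are hypothetical, not the paper's implementation. under those assumptions, averaging over root-to-leaf paths equals one dfs pass where each segment is weighted by the fraction of trajectories passing through it.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    token_losses: list[float]  # per-token losses for this segment
    children: list["Node"] = field(default_factory=list)


def leaf_count(node: Node) -> int:
    """Number of trajectories (root-to-leaf paths) passing through this node."""
    if not node.children:
        return 1
    return sum(leaf_count(c) for c in node.children)


def naive_loss(root: Node) -> float:
    """Average of per-trajectory summed losses; shared prefixes are re-added per path."""
    totals: list[float] = []

    def walk(node: Node, prefix_sum: float) -> None:
        running = prefix_sum + sum(node.token_losses)
        if not node.children:
            totals.append(running)
        for child in node.children:
            walk(child, running)

    walk(root, 0.0)
    return sum(totals) / len(totals)


def tree_loss(root: Node) -> float:
    """One DFS pass: each segment's losses are summed once, with a trajectory-share weight."""
    k = leaf_count(root)

    def walk(node: Node) -> float:
        weight = leaf_count(node) / k
        return weight * sum(node.token_losses) + sum(walk(c) for c in node.children)

    return walk(root)


# Toy tree: a shared prefix, one branch that forks again, one that does not.
root = Node([1.0, 2.0], [
    Node([3.0], [Node([0.5]), Node([1.5])]),
    Node([4.0]),
])
print(naive_loss(root), tree_loss(root))
```

`leaf_count` is recomputed at every node here for brevity; a real implementation would cache it.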

verbal process supervision (vps): a fourth axis of inference-time scaling. training-free framework using structured natural-language critique from a stronger supervisor. with gpt-5.4 (high) supervising gpt-5.4 (low), it reaches 94.9% on gpqa diamond at r=4, surpassing the 94.1% sota without gradient updates. performance scales with the supervisor-actor capability gap (pearson r=0.90). https://arxiv.org/abs/2604.21611