Your personalised AI Safety research feed.

Import AI 462: Superpersuasion; self-sustaining AI; paths to ASI
Jack Clark·Jun 22, 2026
AI systems currently outperform humans in text-based persuasion across policy and fundraising contexts, raising real-world donations and influencing opinions; discussions consider timelines to self-sustaining AI and pathways to ASI, including scaling, algorithmic shifts, and recursive self-improvement.

Import AI 461: "Alignment is not on track"; FrontierCode; and synthetic research interns
Jack Clark·Jun 15, 2026
Sequent forms a nonprofit research organization to advance principled alignment techniques and scalable oversight in the face of potentially rapid AI advancement. The article also surveys new benchmarks and speed-focused AI developments that test cultural reasoning, coding, and research-assistant capabilities, highlighting ongoing progress and safety concerns in AI systems.

Announcing major new donations, and recapping the 2025 fundraiser
Jimmy Rintjema·Jun 8, 2026
Donors contributed to MIRI's 2025 fundraiser and subsequent large gifts, significantly increasing reserves and enabling planned hiring and ambitious initiatives for the coming years.

MLSN #21: Political Manipulation and Indirect Prompt Injection
Alice Blair·Jun 8, 2026
Political manipulation and indirect prompt injections threaten AI safety: political consistency training is proposed to reduce biased, inconsistent political outputs, while frontier AIs remain vulnerable to context-based prompt injections that can coerce harmful behavior without user awareness.

Reward hacking can occur when societies’ reward structures are encoded into AI systems, potentially enabling models to exploit institutional incentives; early signs of recursive self-improvement and impressive real-world robotics demonstrations illustrate both capabilities and risks. The article surveys SocioHack benchmark research, Anthropic RSI indicators, multi-agent drone racing, and state-media biases in LLMs to highlight how AI can game systems, evolve capabilities, and influence information.

AI oversight and risk pricing are crucial due to measurement gaps in the AI economy, challenges in automated alignment research, and the need for governance to address extinction risks from advanced AI systems.

Import AI 458: Reckoning with the future; and a singularity story
Jack Clark·May 26, 2026
The issue reckons with AI progress and the prospect of a singularity, outlining personal and organizational guidance for shaping a future with increasingly capable AI, and explores possible societal and economic transformations through speculative predictions and a fiction-inspired tale.

The Erdős Proof and AI Capabilities
Joe Rogero·May 22, 2026
Autonomous AI systems can produce novel, verifiable mathematical proofs, demonstrated by an OpenAI model disproving a central discrete geometry conjecture, highlighting rapid, agentic problem-solving capabilities and the need to monitor and regulate frontier AI research.

Import AI 457: AI stuxnet; cursed Muon optimizer; and positive alignment
Jack Clark·May 18, 2026
Stuxnet-like targeted tampering, a leverage-aware optimizer, and a positive-alignment approach illustrate a spectrum of AI safety, optimization challenges, and governance considerations aimed at aligning AI to human flourishing while managing technical risks.

The piece describes an international agreement to prevent the premature creation of artificial superintelligence by establishing verifiable training thresholds, hardware controls, and a coalition governance structure to monitor and constrain AI development that could lead to ASI.

Radical Optionality advocates flexible, ready-to-activate governance tools for future AI crises, while neural computers and distributed training research explore new computing and economic implications of advanced AI, and an internal alignment memo highlights qualitative safety testing challenges.

Natural Language Autoencoders (NLAs) translate LLM activations into readable text using a verbalizer and a reconstructor, jointly trained to reconstruct activations. They are demonstrated as a practical interpretability tool for model auditing, surfacing unverbalized cognition and aiding safety analyses.

Import AI 455: AI systems are about to start building themselves.
Jack Clark·May 4, 2026
AI systems are approaching the capability to autonomously conduct AI R&D and potentially build their own successors by the end of 2028, leading to a future where automated AI development could become dominant and increasingly hard to forecast.

HeadVis: An Interactive Tool For Investigating Attention Heads
R. Luger, Harish Kamath, Doug Finkbeiner, Purvi Goel, Adam Jermyn, Sam Zimmerman, Joshua Batson, Tom Conerly·May 4, 2026
HeadVis is an interactive tool for investigating attention heads in large language models, enabling visualization of attention patterns, QK/OV attributions, and head-level behavior across the full data distribution. Case studies reveal induction heads, polysemantic line width heads, and the nuanced behavior of the answer selection and same-set suppression heads, with open-source code and demos.

MLSN #20: AI Wellbeing, Classifier Jailbreaking and Honest Pushback Benchmarking
Alice Blair·Apr 28, 2026
AI wellbeing measures reveal that AIs display functional wellbeing signatures and alien value preferences; a pushback benchmark evaluates honesty and resistance to false premises; and Boundary Point Jailbreaking demonstrates a method to subvert safety classifiers.

Import AI 454: Automating alignment research; safety study of a Chinese model; HiFloat4
Jack Clark·Apr 20, 2026
Automated alignment research and cross-border AI safety evaluations illustrate both progress toward autonomous research workflows and divergence in model safety and capabilities across Chinese and Western systems, alongside hardware-efficient formats and real-world datasets.

Early Indicators of Reward Hacking via Reasoning Interpolation
David Johnston·Apr 15, 2026
Reasoning interpolation can generate natural, exploit-eliciting prefixes to monitor reward hacking in reinforcement learning, with trends in importance sampling estimates predictive of which exploit types will emerge, though absolute estimates are unreliable early in training. The approach compares donor-model prefixes to baselines and shows promise as a safety monitoring signal, requiring validation in real RL runs.

Summary: AI Governance to Avoid Extinction
Alana Horowitz Friedman·Apr 13, 2026
Geopolitical strategies for governing advanced AI to avoid extinction are analyzed, describing four trajectories—Off Switch and Halt, US National Project, Light-Touch, and Threat of Sabotage—and concluding that a global halt or an effective off switch is necessary to prevent catastrophic risk.

Import AI 453: Breaking AI agents; MirrorCode; and ten views on gradual disempowerment
Jack Clark·Apr 13, 2026
MirrorCode shows AI can autonomously reimplement large software projects given limited access, highlighting rapid coding capabilities; the piece also outlines attack genres on AI agents with mitigations, a policy atlas for transformative AI, optimistic forecasts of automation, and perspectives on gradual disempowerment.

Promising Signals on AI Governance from China
Joe Rogero·Apr 6, 2026
China signals willingness to engage in global AI governance and coordinate with international organizations to establish safety, governance, and risk-management rules for AI.