Student Projects / Supervision
We offer regular, close supervision on ambitious projects at the frontier of AI capability, safety, and security for senior undergraduates, master's students, and DPhil/PhD students. Our work combines rigorous foundations with high-impact applications, and students are encouraged to engage directly with open research questions that shape the future of autonomous systems. Student research with OWL has led to multiple publications and prizes - including a Best MSc Thesis Award (Tony Hoare Prize), multiple top-tier AI conference publications, and a workshop Best Paper Award - and we regularly submit to and publish at leading venues in machine learning, AI safety, and multi-agent systems. Projects are designed to be innovative, technically deep, and publication-oriented - ideal for students aiming to contribute original research at the highest level. Please find example project directions below.
We also welcome original project ideas from motivated students - if you have a proposal that aligns with the lab’s themes or pushes in an exciting new direction, we are very happy to consider it. We value intellectual initiative and enjoy shaping ambitious ideas into rigorous research projects. If you are interested in working with us, please get in touch at [email protected] with a brief description of your background and research interests, as well as a CV. Please note that, due to the volume of requests, we may not be able to get back to everyone.
Strategic Cyber Agents
We are exploring the next generation of autonomous cyber agents: AI systems capable of operating in complex, adversarial digital environments. Our work spans both offensive and defensive settings, combining optimisation- and verification-based methods to build agents that can reason, plan, and adapt over long horizons. A central focus is developing rich world models for cyber environments, enabling agents to act strategically while remaining stealthy, robust, and aligned. We are particularly interested in “low-and-slow” behaviours, long-term reasoning, and principled evaluation in realistic environments. Possible projects draw on one or more of machine learning capabilities, safety, security, evaluation science, and formal methods - ideal for students excited by high-impact, technically deep research.
Resources:
[1] Open Challenges in Multi-Agent Security: Towards Secure Systems of Interacting AI Agents, C Schroeder de Witt, arXiv:2505.02077 [cs.CR]
[2] Secret Collusion among AI Agents: Multi-Agent Deception via Steganography, SR Motwani, M Baranchuk, M Strohmeier, V Bolina, PHS Torr, L Hammond, C Schroeder de Witt, NeurIPS 2024
[3] MALT: Improving Reasoning with Multi-Agent LLM Training, SR Motwani, C Smith, RJ Das, R Rafailov, I Laptev, PHS Torr, F Pizzati, R Clark, C Schroeder de Witt, COLM 2025
[4] Fundamental Limitations in Pointwise Defences of LLM Finetuning APIs, X Davies, E Winsor, A Souly, T Korbak, R Kirk, C Schroeder de Witt*, Y Gal*, NeurIPS 2025
[5] Defeating Prompt Injections by Design, E Debenedetti, I Shumailov, T Fan, J Hayes, N Carlini, D Fabian, C Kern, C Shi, A Terzis, F Tramèr, arXiv:2503.18813 [cs.CR]
[6] REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites, D Garg, S VanWeelden, D Caples, A Draguns, N Ravi, P Putta, N Garg, T Abraham, M Lara, F Lopez, J Liu, A Gundawar, P Hebbar, Y Joo, J Gu, C London, C Schroeder de Witt, S Motwani, arXiv:2504.11543 [cs.AI]
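To give a concrete flavour of the “low-and-slow” trade-off mentioned above, here is a minimal toy sketch. It is not drawn from the resources listed here: the detector model, thresholds, and rates are all illustrative assumptions. An agent that acts more aggressively finishes its objective sooner, but accumulates detector evidence faster.

```python
# Toy "low-and-slow" trade-off: an attacker picks an action rate, a
# threshold detector accumulates decaying, noisy evidence, and slower
# action rates trade completion time against detection probability.
# All parameters here are illustrative assumptions.
import random

def run_episode(action_rate, horizon=200, threshold=8.0, decay=0.95,
                work_needed=20, noise=0.5, seed=None):
    """Simulate one episode; returns (completed, detected)."""
    rng = random.Random(seed)
    evidence, work_done = 0.0, 0
    for _ in range(horizon):
        act = rng.random() < action_rate            # act this step?
        evidence = decay * evidence                 # old evidence fades
        evidence += (1.0 if act else 0.0) + rng.gauss(0.0, noise)
        evidence = max(evidence, 0.0)
        if evidence > threshold:
            return False, True                      # alarm raised
        if act:
            work_done += 1
        if work_done >= work_needed:
            return True, False                      # objective achieved
    return False, False                             # ran out of time

def estimate(action_rate, n=2000):
    results = [run_episode(action_rate, seed=i) for i in range(n)]
    success = sum(done for done, _ in results) / n
    detected = sum(det for _, det in results) / n
    return success, detected

for rate in (0.9, 0.5, 0.15):                       # fast vs. low-and-slow
    success, detected = estimate(rate)
    print(f"rate={rate:.2f}  success={success:.2f}  detected={detected:.2f}")
```

Actual projects in this theme would replace the hand-coded detector and policy with learned world models, planning, and principled evaluation in realistic environments.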
Undetectable Threats
We investigate the emerging science of undetectable threats in multi-agent AI systems. As autonomous agents proliferate, security challenges are shifting from overt attacks to subtle, covert, and strategically grounded manipulations. Our work draws on game theory, information theory, and modern machine learning to study steganography, covert channels, illusory attacks, and undetectable neural network backdoors. At the same time, we develop principled monitoring tools for detecting out-of-distribution dynamics and hidden coordination. The goal is to anticipate and mitigate the most pressing security risks facing autonomous systems - building theoretical foundations and practical defences for a world of increasingly capable, interacting AI agents.
Resources:
[1] Open Challenges in Multi-Agent Security: Towards Secure Systems of Interacting AI Agents, C Schroeder de Witt, arXiv:2505.02077 [cs.CR]
[2] Perfectly Secure Steganography Using Minimum Entropy Coupling, C Schroeder de Witt*, S Sokota*, JZ Kolter, JN Foerster, M Strohmeier, ICLR 2023
[3] Unelicitable Backdoors via Cryptographic Transformer Circuits, A Draguns, A Gritsevskiy, S Motwani, C Schroeder de Witt, NeurIPS 2024
[4] Rethinking Out-of-Distribution Detection for Reinforcement Learning: Advancing Methods for Evaluation and Detection, L Nasvytis, K Sandbrink, J Foerster, T Franzmeyer, C Schroeder de Witt, AAMAS 2024
[5] Fundamental Limitations in Pointwise Defences of LLM Finetuning APIs, X Davies, E Winsor, A Souly, T Korbak, R Kirk, C Schroeder de Witt*, Y Gal*, NeurIPS 2025
[6] Multi-Agent Common Knowledge Reinforcement Learning, C Schroeder de Witt*, J Foerster*, G Farquhar, P Torr, W Boehmer, S Whiteson, NeurIPS 2019
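For a taste of the machinery behind resource [2], here is a minimal sketch of the greedy minimum entropy coupling heuristic that underlies coupling-based steganography. The message and covertext distributions below are illustrative assumptions; real schemes operate over autoregressive language model distributions.

```python
# Greedily build a low-entropy joint distribution (coupling) whose
# marginals match a message distribution p and a covertext
# distribution q. This is a heuristic: exact minimum entropy coupling
# is intractable in general. Distributions here are toy assumptions.
import numpy as np

def greedy_mec(p, q):
    """Return a joint table with row marginal p and column marginal q,
    built by repeatedly matching the largest remaining masses."""
    p, q = np.array(p, dtype=float), np.array(q, dtype=float)
    joint = np.zeros((len(p), len(q)))
    while p.sum() > 1e-12:
        i, j = p.argmax(), q.argmax()
        m = min(p[i], q[j])
        joint[i, j] += m
        p[i] -= m
        q[j] -= m
    return joint

p = [0.5, 0.5]            # one uniform secret message bit
q = [0.4, 0.3, 0.2, 0.1]  # toy covertext (token) distribution
joint = greedy_mec(p, q)
print(joint)              # coupling of messages and stegotexts
print(joint.sum(axis=0))  # stegotext marginal reproduces q exactly

# Sketch of use: given message m, the sender samples a stegotext s
# with probability joint[m, s] / p[m], so the stegotext marginal is
# exactly q and the channel is statistically indistinguishable from
# innocent covertext; the receiver decodes via the posterior
# joint[:, s]. See resource [2] for the full construction.
```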
Foundations of Interpretability
We study the foundations and limits of interpretability for modern AI systems. As models scale, mechanistic interpretability and automated interpretation (autointerp) face fundamental constraints: human input bottlenecks, causal identifiability limits, information-theoretic barriers, and even forms of program obfuscation within neural networks. Our goal is to rigorously characterise these limits - both theoretical and practical - and develop principled methods to work around them. By combining insights from representation learning, sparse modelling, causality, and information theory, we aim to clarify what interpretability can realistically achieve, and where it must evolve to remain effective for large-scale, safety-critical AI systems.
Resources:
[1] The Dead Salmons of AI Interpretability, M Méloux, G Dirupo, F Portet, M Peyrard, arXiv:2512.18792 [cs.AI]
[2] Efficient Dictionary Learning with Switch Sparse Autoencoders, A Mudide, J Engels, EJ Michaud, M Tegmark, C Schroeder de Witt, ICLR 2025
[3] Unelicitable Backdoors via Cryptographic Transformer Circuits, A Draguns, A Gritsevskiy, S Motwani, C Schroeder de Witt, NeurIPS 2024
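As a starting point for the dictionary-learning thread of this theme (resource [2] studies an efficient variant), here is a minimal vanilla sparse autoencoder sketch; the dimensions, L1 coefficient, and synthetic activations are illustrative assumptions.

```python
# A minimal sparse autoencoder (SAE): learn an overcomplete dictionary
# of sparsely activating features from model activations. Random data
# stands in for real transformer activations in this toy sketch.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=64, d_dict=512):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        return self.decoder(f), f

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                           # sparsity penalty weight

for step in range(200):
    x = torch.randn(256, 64)              # stand-in for activations
    x_hat, f = sae(x)
    # reconstruction error plus L1 sparsity penalty on features
    loss = ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The foundational questions above then ask when the learned dictionary directions correspond to identifiable, causally meaningful features, and when barriers such as superposition or obfuscated circuits make this impossible in principle.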
Multi-Agent Evaluation Science
As AI systems increasingly operate through interacting agents, evaluation itself becomes a core scientific problem. We study how to rigorously measure capability, robustness, and security in multi-agent LLM systems deployed in dynamic, adversarial environments. This includes developing scalable metrics for coordination, strategic reasoning, and multi-agent security, as well as analysing failure modes under distribution shift and emergent dynamics. A central focus is sim-to-real transfer: understanding how behaviours validated in simulated settings generalise to open-world deployment. We aim to build principled evaluation frameworks that meaningfully shape training and enable reliable autonomous agents in complex real-world systems.
Resources:
[1] Open Challenges in Multi-Agent Security: Towards Secure Systems of Interacting AI Agents, C Schroeder de Witt, arXiv:2505.02077 [cs.CR]
[2] Multi-Agent Risks from Advanced AI, L Hammond et al., arXiv:2502.14143 [cs.MA]
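As one concrete example of an evaluation question in this space, the toy sketch below contrasts self-play and cross-play scores to test whether independently trained agents coordinate with unfamiliar partners, not just with their own training partners. The game, the stand-in “training” procedure, and all numbers are illustrative assumptions.

```python
# Toy coordination evaluation: agents trained in isolation each settle
# on an arbitrary convention in a pure matching game. Self-play score
# looks perfect; cross-play across independent runs exposes the gap.
import itertools
import random

N_ACTIONS = 5

def train_policy(seed):
    """Stand-in for an independent training run: each run commits to
    its own (arbitrary) convention."""
    return random.Random(seed).randrange(N_ACTIONS)

def payoff(a, b):
    return 1.0 if a == b else 0.0        # reward only for matching

policies = [train_policy(s) for s in range(10)]

self_play = sum(payoff(p, p) for p in policies) / len(policies)
pairs = list(itertools.permutations(policies, 2))
cross_play = sum(payoff(p, q) for p, q in pairs) / len(pairs)

print(f"self-play:  {self_play:.2f}")    # 1.00 by construction
print(f"cross-play: {cross_play:.2f}")   # far lower: conventions
                                         # do not transfer across runs
```

Projects in this theme scale such diagnostics from toy games to LLM agent systems, where the analogous gaps appear under distribution shift, adversarial pressure, and sim-to-real transfer.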