Adversarial ML, security of deployed models
A topic to study if you care about what deployed models get wrong under pressure, especially around extraction, adversarial behavior, and practical security failures.
Topic
Researchers studying adversarial behavior, model extraction, jailbreaks, robustness, and practical deployment risks.
Start with Nicholas Carlini, Pushmeet Kohli, and Ethan Perez for the clearest first pass through security & robustness as they show up in practice.
This area overlaps heavily with Anthropic, Google DeepMind, and AI21, and the institutions that recur most often across profiles are Anthropic, Google DeepMind, and Google. Recurring starting points include Constitutional AI: Harmlessness from AI Feedback and Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.
Snapshot
Researchers: 34
Related labs: 4
Starting points: 8
Developed dossiers: 13
Useful entry points pulled from the strongest linked researcher dossiers.
Adversarial ML and extraction risks (via Nicholas Carlini)
Applying frontier AI to science and public-interest problems (via Pushmeet Kohli)
Constitutional AI (via Ethan Perez)
Safety evaluation and monitorability (via Amelia Glaese)
Generative adversarial networks (via Ian Goodfellow)
GPT-3 era language models (via Tom Brown)
Frequent institutions showing up across profiles in this area: Anthropic, Google DeepMind, Google.
Papers, project pages, and repositories that recur across this part of the field.
Constitutional AI: Harmlessness from AI Feedback (linked by 7 profiles in this topic)
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (linked by 7 profiles in this topic)
Extracting Training Data from Large Language Models (linked by 6 profiles in this topic)
Many-shot Jailbreaking (linked by 5 profiles in this topic)
Adversarial Examples Are Not Bugs, They Are Features (linked by 4 profiles in this topic)
Red Teaming Language Models with Language Models (linked by 4 profiles in this topic)
Universal and Transferable Adversarial Attacks on Aligned Language Models (linked by 4 profiles in this topic)
Measuring Faithfulness in Chain-of-Thought Reasoning (linked by 3 profiles in this topic)
Source clusters that repeatedly anchor researchers in this area.
Constitutional AI: Harmlessness from AI Feedback (used across 7 researcher pages in this topic)
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (used across 7 researcher pages in this topic)
Extracting Training Data from Large Language Models (used across 5 researcher pages in this topic)
Adversarial Examples Are Not Bugs, They Are Features (used across 4 researcher pages in this topic)
Red Teaming Language Models with Language Models (used across 4 researcher pages in this topic)
Universal and Transferable Adversarial Attacks on Aligned Language Models (used across 4 researcher pages in this topic)
A stronger first pass through security & robustness, ranked by profile depth, evidence, and editorial importance.
Adversarial ML, security of deployed models
One of the most useful people to study if you care about what deployed models get wrong under pressure, especially around extraction, adversarial behavior, and practical security failures.
Robotics, vision, structured prediction
A strong person to follow if you want to understand how frontier AI gets pushed into science, security, and trustworthy deployment rather than staying inside benchmark culture.
Alignment via AI feedback (Constitutional AI)
Important because he sits near the boundary between alignment theory and concrete failure-mode discovery, especially jailbreaks, preference training, and behavior evaluations.
Gemini (multimodal foundation models)
A useful researcher to follow if you care about the bridge between safety evaluation, human data, and how frontier models are turned into practical tools and benchmarks.
GANs, adversarial ML
A foundational researcher in generative modeling and adversarial robustness whose work changed both how models are trained and how their failure modes are studied.
Large-scale language modeling
One of the clearest researchers to study for the GPT-3 era, especially around few-shot learning, scaling behavior, and what larger language models started making possible in practice.
Reasoning, verification, math
A useful person to study if you care about alignment proposals that try to make superhuman systems legible enough for humans to supervise in practice.
NLP systems and evaluation
A strong person to follow for practical language systems. His work sits right at the intersection of pretraining, retrieval, and question answering, where product-grade NLP systems either become robust or fall apart.
Alignment via AI feedback (Constitutional AI)
A strong person to know for the security-first side of AI risk work, especially where practical model behavior, jailbreak removal, and broader catastrophic-risk framing start to overlap.
34 linked profiles.