Adversarial ML, security of deployed models
A topic to study if you care about what deployed models get wrong under pressure, especially around extraction, adversarial behavior, and practical security failures.
Topic
Researchers studying adversarial behavior, model extraction, jailbreaks, robustness, and practical deployment risks.
Start with Nicholas Carlini, Pushmeet Kohli, and Ethan Perez for the clearest first pass through security & robustness as they show up in practice.
This area overlaps heavily with Anthropic, Google DeepMind, and AI21, and the institutions that recur most often across profiles are Anthropic, Google DeepMind, and Google. Recurring starting points include Constitutional AI: Harmlessness from AI Feedback and Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.
Snapshot
Researchers: 34
Related labs: 4
Starting points: 8
Developed dossiers: 13
Useful entry points pulled from the strongest linked researcher dossiers.
Adversarial ML and extraction risks (via Nicholas Carlini)
Applying frontier AI to science and public-interest problems (via Pushmeet Kohli)
Constitutional AI (via Ethan Perez)
Safety evaluation and monitorability (via Amelia Glaese)
Generative adversarial networks (via Ian Goodfellow)
GPT-3 era language models (via Tom Brown)
Frequent institutions showing up across profiles in this area: Anthropic, Google DeepMind, Google.
Papers, project pages, and repositories that recur across this part of the field.
Constitutional AI: Harmlessness from AI Feedback (linked by 7 profiles in this topic)
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (linked by 7 profiles in this topic)
Extracting Training Data from Large Language Models (linked by 6 profiles in this topic)
Many-shot Jailbreaking (linked by 5 profiles in this topic)
Adversarial Examples Are Not Bugs, They Are Features (linked by 4 profiles in this topic)
Red Teaming Language Models with Language Models (linked by 4 profiles in this topic)
Universal and Transferable Adversarial Attacks on Aligned Language Models (linked by 4 profiles in this topic)
Measuring Faithfulness in Chain-of-Thought Reasoning (linked by 3 profiles in this topic)
Source clusters that repeatedly anchor researchers in this area.
Constitutional AI: Harmlessness from AI Feedback (used across 7 researcher pages in this topic)
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (used across 7 researcher pages in this topic)
Extracting Training Data from Large Language Models (used across 5 researcher pages in this topic)
Adversarial Examples Are Not Bugs, They Are Features (used across 4 researcher pages in this topic)
Red Teaming Language Models with Language Models (used across 4 researcher pages in this topic)
Universal and Transferable Adversarial Attacks on Aligned Language Models (used across 4 researcher pages in this topic)
A stronger first pass through security & robustness, ranked by profile depth, evidence, and editorial importance.
Adversarial ML, security of deployed models
One of the most useful people to study if you care about what deployed models get wrong under pressure, especially around extraction, adversarial behavior, and practical security failures.
Robotics, vision, structured prediction
A strong person to follow if you want to understand how frontier AI gets pushed into science, security, and trustworthy deployment rather than staying inside benchmark culture.
Alignment via AI feedback (Constitutional AI)
Important because he sits near the boundary between alignment theory and concrete failure-mode discovery, especially jailbreaks, preference training, and behavior evaluations.
Gemini (multimodal foundation models)
A useful researcher to follow if you care about the bridge between safety evaluation, human data, and how frontier models are turned into practical tools and benchmarks.
GANs, adversarial ML
A foundational researcher in generative modeling and adversarial robustness whose work changed both how models are trained and how their failure modes are studied.
Large-scale language modeling
One of the clearest researchers to study for the GPT-3 era, especially around few-shot learning, scaling behavior, and what larger language models started making possible in practice.
Reasoning, verification, math
A useful person to study if you care about alignment proposals that try to make superhuman systems legible enough for humans to supervise in practice.
NLP systems and evaluation
A strong person to follow for practical language systems. His work sits right at the intersection of pretraining, retrieval, and question answering, where product-grade NLP systems either become robust or fall apart.
Alignment via AI feedback (Constitutional AI)
A strong person to know for the security-first side of AI risk work, especially where practical model behavior, jailbreak removal, and broader catastrophic-risk framing start to overlap.
34 linked profiles.