Topic: Post-training & alignment
Researchers shaping model behavior after pretraining, from instruction tuning and preference learning to scalable oversight.
Start with Chris Olah, Dario Amodei, and Amanda Askell if you want the clearest first pass through post-training & alignment as it shows up in practice.
This area overlaps heavily with Anthropic, OpenAI, and AI21. Common institution signals include Anthropic, OpenAI, and Stanford University. Recurring starting points include "Constitutional AI: Harmlessness from AI Feedback" and "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback".
Snapshot
Researchers: 148
Related labs: 5
Starting points: 8
Developed dossiers: 35
Useful entry points pulled from the strongest linked researcher dossiers.
Feature visualization and interpretability (via Chris Olah)
Frontier-model scaling and deployment tradeoffs (via Dario Amodei)
Behavior shaping in large models (via Amanda Askell)
Adversarial ML and extraction risks (via Nicholas Carlini)
Reward modeling (via Paul Christiano)
Policy optimization and reinforcement learning (via John Schulman)
Frequent institutions showing up across profiles in this area: Anthropic, OpenAI, and Stanford University.
Papers, project pages, and repositories that recur across this part of the field.
Constitutional AI: Harmlessness from AI Feedback (linked by 48 profiles in this topic)
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (linked by 47 profiles in this topic)
Discovering Language Model Behaviors with Model-Written Evaluations (linked by 23 profiles in this topic)
Training language models to follow instructions with human feedback (linked by 17 profiles in this topic)
Challenges in evaluating AI systems (linked by 9 profiles in this topic)
Question Decomposition Improves the Faithfulness of Model-Generated Reasoning (linked by 9 profiles in this topic)
Source clusters that repeatedly anchor researchers in this area.
Constitutional AI: Harmlessness from AI Feedback (used across 48 researcher pages in this topic)
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (used across 47 researcher pages in this topic)
Discovering Language Model Behaviors with Model-Written Evaluations (used across 20 researcher pages in this topic)
Training language models to follow instructions with human feedback (used across 17 researcher pages in this topic)
Self-Instruct: Aligning Language Models with Self-Generated Instructions (used across 7 researcher pages in this topic)
Self-Rewarding Language Models (used across 7 researcher pages in this topic)
A stronger first pass through post-training & alignment, ranked by profile depth, evidence, and editorial importance.
Chris Olah (mechanistic interpretability, visualization): One of the clearest interpreters of neural-network internals, especially in the line of work that turned interpretability into a concrete research agenda rather than a vague aspiration.
Dario Amodei (alignment, post-training, frontier LLMs): A high-signal figure for understanding the frontier-model era because his work sits at the intersection of scaling, post-training, and deployment-risk framing.
Amanda Askell (alignment, behavior shaping, safety): A high-signal researcher for understanding how post-training and behavioral steering become concrete product behavior rather than abstract alignment talk.
Nicholas Carlini (adversarial ML, security of deployed models): One of the most useful people to study if you care about what deployed models get wrong under pressure, especially around extraction, adversarial behavior, and practical security failures.
Paul Christiano (alignment theory, reward modeling): A foundational thinker in oversight, reward modeling, and delegation-style alignment ideas that influenced much of the modern post-training conversation.
John Schulman (reinforcement learning, post-training): A key bridge between reinforcement-learning methodology and the post-training techniques now used to shape assistant behavior.
Alignment research, scalable oversight: One of the clearest public anchors for scalable oversight and alignment research in the frontier-model era.
Alignment via AI feedback (Constitutional AI): High-signal for the seam between machine learning and hardware systems, especially where learned optimization methods begin affecting the actual compute infrastructure underneath frontier models.
Alignment via AI feedback (Constitutional AI): Useful for the seam between Anthropic’s earlier alignment papers and its later audit-oriented safety work, where interpretability and evaluation start feeding into deployment practice.
148 linked profiles.