Topic: Post-training & alignment
Researchers shaping model behavior after pretraining, from instruction tuning and preference learning to scalable oversight.
Start with Chris Olah, Dario Amodei, and Amanda Askell if you want the clearest first pass through post-training & alignment as it shows up in practice.
This area overlaps heavily with Anthropic, OpenAI, and AI21. Common institution signals include Anthropic, OpenAI, and Stanford University. Recurring starting points include "Constitutional AI: Harmlessness from AI Feedback" and "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback".
Snapshot
Researchers: 148
Related labs: 5
Starting points: 8
Developed dossiers: 35
Useful entry points pulled from the strongest linked researcher dossiers.
Feature visualization and interpretability (via Chris Olah)
Frontier-model scaling and deployment tradeoffs (via Dario Amodei)
Behavior shaping in large models (via Amanda Askell)
Adversarial ML and extraction risks (via Nicholas Carlini)
Reward modeling (via Paul Christiano)
Policy optimization and reinforcement learning (via John Schulman)
Frequent institutions showing up across profiles in this area: Anthropic, OpenAI, and Stanford University.
Papers, project pages, and repositories that recur across this part of the field.
Constitutional AI: Harmlessness from AI Feedback (linked by 48 profiles in this topic)
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (linked by 47 profiles in this topic)
Discovering Language Model Behaviors with Model-Written Evaluations (linked by 23 profiles in this topic)
Training language models to follow instructions with human feedback (linked by 17 profiles in this topic)
Challenges in evaluating AI systems (linked by 9 profiles in this topic)
Question Decomposition Improves the Faithfulness of Model-Generated Reasoning (linked by 9 profiles in this topic)
Source clusters that repeatedly anchor researchers in this area.
Constitutional AI: Harmlessness from AI Feedback (used across 48 researcher pages in this topic)
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (used across 47 researcher pages in this topic)
Discovering Language Model Behaviors with Model-Written Evaluations (used across 20 researcher pages in this topic)
Training language models to follow instructions with human feedback (used across 17 researcher pages in this topic)
Self-Instruct: Aligning Language Models with Self-Generated Instructions (used across 7 researcher pages in this topic)
Self-Rewarding Language Models (used across 7 researcher pages in this topic)
A stronger first pass through post-training & alignment, ranked by profile depth, evidence, and editorial importance.
Chris Olah (mechanistic interpretability, visualization): One of the clearest interpreters of neural-network internals, especially in the line of work that turned interpretability into a concrete research agenda rather than a vague aspiration.
Dario Amodei (alignment, post-training, frontier LLMs): A high-signal figure for understanding the frontier-model era because his work sits at the intersection of scaling, post-training, and deployment-risk framing.
Amanda Askell (alignment, behavior shaping, safety): A high-signal researcher for understanding how post-training and behavioral steering become concrete product behavior rather than abstract alignment talk.
Nicholas Carlini (adversarial ML, security of deployed models): One of the most useful people to study if you care about what deployed models get wrong under pressure, especially around extraction, adversarial behavior, and practical security failures.
Paul Christiano (alignment theory, reward modeling): A foundational thinker in oversight, reward modeling, and delegation-style alignment ideas that influenced much of the modern post-training conversation.
John Schulman (reinforcement learning, post-training): A key bridge between reinforcement-learning methodology and the post-training techniques now used to shape assistant behavior.
Alignment research, scalable oversight: One of the clearest public anchors for scalable oversight and alignment research in the frontier-model era.
Alignment via AI feedback (Constitutional AI): High-signal for the seam between machine learning and hardware systems, especially where learned optimization methods begin affecting the actual compute infrastructure underneath frontier models.
Alignment via AI feedback (Constitutional AI): Useful for the seam between Anthropic’s earlier alignment papers and its later audit-oriented safety work, where interpretability and evaluation start feeding into deployment practice.
148 linked profiles.