Mechanistic interpretability, visualization
One of the clearest interpreters of neural-network internals, especially in the line of work that turned interpretability into a concrete research agenda rather than a vague aspiration.
Topic
People trying to open the black box of neural systems and make model internals more legible.
Start with Chris Olah, Dawn Drain, or Catherine Olsson if you want the clearest first pass through interpretability as it shows up in practice.
This area overlaps heavily with Anthropic, AI21, and EleutherAI. Common institution signals include Anthropic, Conjecture, and the Kempner Institute for the Study of Natural and Artificial Intelligence. Recurring starting points include Constitutional AI: Harmlessness from AI Feedback and Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.
Snapshot
Researchers: 8
Related labs: 3
Starting points: 8
Developed dossiers: 3

Related labs
Frequent institutions showing up across profiles in this area.

Starting points
Useful entry points pulled from the strongest linked researcher dossiers: papers, project pages, and repositories that recur across this part of the field.
Constitutional AI: Harmlessness from AI Feedback
Linked by 5 profiles in this topic
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Linked by 5 profiles in this topic
A Mathematical Framework for Transformer Circuits
Linked by 3 profiles in this topic
Scaling Laws and Interpretability of Learning from Repeated Data
Linked by 3 profiles in this topic
In-context Learning and Induction Heads
Linked by 2 profiles in this topic
Toy Models of Superposition
Linked by 2 profiles in this topic
Analysis Methods in Neural Language Processing: A Survey
Linked by 1 profile in this topic
Source clusters that repeatedly anchor researchers in this area.
Constitutional AI: Harmlessness from AI Feedback
Used across 5 researcher pages in this topic
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Used across 5 researcher pages in this topic
Chris Olah (blog)
Used across 1 researcher page in this topic
GPT-NeoX (GitHub)
Used across 1 researcher page in this topic
Jamba-1.5: Hybrid Transformer-Mamba Models at Scale
Used across 1 researcher page in this topic
Jamba: A Hybrid Transformer-Mamba Language Model
Used across 1 researcher page in this topic
A stronger first pass through interpretability, ranked by profile depth, evidence, and editorial importance.
Mechanistic interpretability, visualization
One of the clearest interpreters of neural-network internals, especially in the line of work that turned interpretability into a concrete research agenda rather than a vague aspiration.
Alignment via AI feedback (Constitutional AI)
A useful guide to the seam between Anthropic's earlier alignment papers and its later audit-oriented safety work, where interpretability and evaluation start feeding into deployment practice.
Alignment via AI feedback (Constitutional AI)
One of the clearest people to follow if you want the mechanistic-interpretability thread at Anthropic rather than only its safety-policy surface.
Open-source LLMs, training
A useful anchor for the open-model ecosystem because his path runs from EleutherAI’s training efforts into a more explicit alignment and interpretability agenda at Conjecture.
Alignment via AI feedback (Constitutional AI)
One of the clearest people to follow if you care about scaling laws, training efficiency, and the systems choices that quietly shape frontier-model progress.
Alignment via AI feedback (Constitutional AI)
One of the most important people to follow for mechanistic interpretability and transformer-circuits-style attempts to reverse engineer how large language models work.
Hybrid Transformer–Mamba language models (Jamba)
A high-signal researcher for understanding what large language models represent internally, especially where interpretability, robustness, and multilingual NLP meet.
Alignment via AI feedback (Constitutional AI)
One of the earlier Anthropic contributors worth tracking if you care about the transition from RLHF-style assistant training into scaling and evaluation work.
8 linked profiles.