Mechanistic interpretability, visualization
One of the clearest interpreters of neural-network internals, especially in the line of work that turned interpretability into a concrete research agenda rather than a vague aspiration.
Topic
People trying to open the black box of neural systems and make model internals more legible.
Start with Chris Olah, Dawn Drain, or Catherine Olsson if you want the clearest first pass through interpretability as it shows up in practice.
This area overlaps heavily with Anthropic, AI21, and EleutherAI. Common institution signals include Anthropic, Conjecture, and the Kempner Institute for the Study of Natural and Artificial Intelligence. Recurring starting points include Constitutional AI: Harmlessness from AI Feedback and Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.
Snapshot
Researchers: 8
Related labs: 3
Starting points: 8
Developed dossiers: 3

Related labs
Frequent institutions showing up across profiles in this area.

Starting points
Useful entry points pulled from the strongest linked researcher dossiers: papers, project pages, and repositories that recur across this part of the field.
Constitutional AI: Harmlessness from AI Feedback
Linked by 5 profiles in this topic
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Linked by 5 profiles in this topic
A Mathematical Framework for Transformer Circuits
Linked by 3 profiles in this topic
Scaling Laws and Interpretability of Learning from Repeated Data
Linked by 3 profiles in this topic
In-context Learning and Induction Heads
Linked by 2 profiles in this topic
Toy Models of Superposition
Linked by 2 profiles in this topic
Analysis Methods in Neural Language Processing: A Survey
Linked by 1 profile in this topic
Source clusters that repeatedly anchor researchers in this area.
Constitutional AI: Harmlessness from AI Feedback
Used across 5 researcher pages in this topic
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Used across 5 researcher pages in this topic
Chris Olah (blog)
Used across 1 researcher page in this topic
GPT-NeoX (GitHub)
Used across 1 researcher page in this topic
Jamba-1.5: Hybrid Transformer-Mamba Models at Scale
Used across 1 researcher page in this topic
Jamba: A Hybrid Transformer-Mamba Language Model
Used across 1 researcher page in this topic
A stronger first pass through interpretability, ranked by profile depth, evidence, and editorial importance.
Mechanistic interpretability, visualization
One of the clearest interpreters of neural-network internals, especially in the line of work that turned interpretability into a concrete research agenda rather than a vague aspiration.
Alignment via AI feedback (Constitutional AI)
A useful guide to the seam between Anthropic's earlier alignment papers and its later audit-oriented safety work, where interpretability and evaluation start feeding into deployment practice.
Alignment via AI feedback (Constitutional AI)
One of the clearest people to follow if you want the mechanistic-interpretability thread at Anthropic rather than only its safety-policy surface.
Open-source LLMs, training
A useful anchor for the open-model ecosystem because his path runs from EleutherAI’s training efforts into a more explicit alignment and interpretability agenda at Conjecture.
Alignment via AI feedback (Constitutional AI)
One of the clearest people to follow if you care about scaling laws, training efficiency, and the systems choices that quietly shape frontier-model progress.
Alignment via AI feedback (Constitutional AI)
One of the most important people to follow for mechanistic interpretability and transformer-circuits-style attempts to reverse engineer how large language models work.
Hybrid Transformer–Mamba language models (Jamba)
A high-signal researcher for understanding what large language models represent internally, especially where interpretability, robustness, and multilingual NLP meet.
Alignment via AI feedback (Constitutional AI)
One of the earlier Anthropic contributors worth tracking if you care about the transition from RLHF-style assistant training into scaling and evaluation work.
8 linked profiles.