Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small

AI Safety Fundamentals: Alignment - A podcast by BlueDot Impact

Research in mechanistic interpretability seeks to explain behaviors of machine learning (ML) models in terms of their internal components. However, most previous work either focuses on simple behaviors in small models or describes complicated behaviors in larger models with broad strokes. In this work, we bridge this gap by presenting an explanation for how GPT-2 small performs a natural language task called indirect object identification (IOI). Our explanation encompasses 26 attention heads ...
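To make the task concrete: in an IOI prompt, two names are introduced, one of them is repeated as the subject of a second clause, and the correct next token is the name that was not repeated. The sketch below only states the task as a simple rule. It is not the circuit found in GPT-2 small and does not run the model; the sentence template and the helper names `make_ioi_prompt` and `expected_answer` are illustrative assumptions.

```python
import re
from collections import Counter


def make_ioi_prompt(io_name: str, subject_name: str) -> str:
    """Build one IOI-style prompt (hypothetical template for illustration)."""
    return (
        f"When {io_name} and {subject_name} went to the store, "
        f"{subject_name} gave a drink to"
    )


def expected_answer(prompt: str, names: list[str]) -> str:
    """Return the indirect object: the candidate name that appears exactly once."""
    word_counts = Counter(re.findall(r"[A-Za-z]+", prompt))
    mentioned_once = [name for name in names if word_counts[name] == 1]
    if len(mentioned_once) != 1:
        raise ValueError("prompt does not have exactly one non-repeated name")
    return mentioned_once[0]


prompt = make_ioi_prompt("Mary", "John")
answer = expected_answer(prompt, ["Mary", "John"])
print(f"{prompt} -> {answer}")
```

A language model that handles IOI should assign a higher probability to the non-repeated name than to the repeated one as the next token after such a prompt.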