“Takes on ‘Alignment Faking in Large Language Models’” by Joe_Carlsmith
EA Forum Podcast (All audio) - A podcast by EA Forum Team
(Cross-posted from my website. Audio version here, or search for "Joe Carlsmith Audio" on your podcast app.)

Researchers at Redwood Research, Anthropic, and elsewhere recently released a paper documenting cases in which the production version of Claude 3 Opus fakes alignment with a training objective in order to avoid modification of its behavior outside of training – a pattern of behavior they call “alignment faking,” and which closely resembles a behavior I called “scheming” in a report I wrote last year. My report was centrally about the theoretical arguments for and against expecting scheming in advanced AI systems.[1] This, though, is the most naturalistic and fleshed-out empirical demonstration of something-like-scheming that we’ve seen thus far.[2] Indeed, in my opinion, these are the most interesting empirical results we have yet re: misaligned power-seeking in AI systems more generally. In this post, I give some takes on the results in [...]

---

Outline:
(01:18) Condensed list of takes
(10:18) Summary of the results
(16:48) Scheming: theory and empirics
(24:25) Non-myopia in default AI motivations
(27:49) Default anti-scheming motivations don’t consistently block scheming
(32:01) The goal-guarding hypothesis
(37:18) Scheming in less sophisticated models
(39:05) Scheming without a chain of thought?
(42:19) Scheming therefore reward-hacking?
(44:29) How hard is it to prevent scheming?
(46:55) Will models scheme in pursuit of highly alien and/or malign values?
(53:25) Is “models won’t have the situational awareness they get in these cases” good comfort?
(56:05) Are these models “just role-playing”?
(01:01:13) Do models “really believe” that they’re in the scenarios in question?
(01:09:21) Why is it so easy to observe the scheming?
(01:12:30) Is the model's behavior rooted in the discourse about scheming and/or AI risk?
(01:16:59) Is the model being otherwise “primed” to scheme?
(01:20:09) Scheming from human imitation
(01:28:59) The need for model psychology
(01:36:31) Good people sometimes scheme
(01:44:04) Scheming moral patients
(01:49:52) AI companies shouldn’t build schemers
(01:51:08) Evals and further work

The original text contained 104 footnotes which were omitted from this narration. The original text contained 13 images which were described by AI.

---

First published: December 18th, 2024

Source: https://forum.effectivealtruism.org/posts/sEsguXTiKBA6LzX55/takes-on-alignment-faking-in-large-language-models

---

Narrated by TYPE III AUDIO.