“Announcing Apollo Research” by mariushobbhahn
EA Forum Podcast (Curated & popular) - A podcast by EA Forum Team
Categories:
TL;DRWe are a new AI evals research organization called Apollo Research based in London. We think that strategic AI deception – where a model outwardly seems aligned but is in fact misaligned – is a crucial step in many major catastrophic AI risk scenarios and that detecting deception in real-world models is the most important and tractable step to addressing this problem.Our agenda is split into interpretability and behavioral evals:On the interpretability side, we are currently working on two main research bets toward characterizing neural network cognition. We are also interested in benchmarking interpretability, e.g. testing whether given interpretability tools can meet specific requirements or solve specific challenges.On the behavioral evals side, we are conceptually breaking down ‘deception’ into measurable components in order to build a detailed evaluation suite using prompt- and finetuning-based tests. As an evals research org, we intend to use our research insights and tools directly on frontier models by serving as an external auditor of AGI labs, thus reducing the chance that deceptively misaligned AIs are developed and deployed. We also intend to engage with AI governance efforts, e.g. by working with policymakers and providing technical expertise to aid the drafting of auditing regulations.We have starter funding but estimate a $1.4M funding gap in our first year. We estimate that the maximal amount we could effectively use is $4-6M in addition to current funding levels (reach out if you are interested in donating). We are currently fiscally sponsored by Rethink Priorities. Our starting team consists of 8 researchers and engineers with strong backgrounds in technical alignment research. We are interested in collaborating with both technical and governance researchers. Feel free to reach out at [email protected] intend to hire once our funding gap is closed. If you’d like to stay informed about opportunities, you can fill out our expression of interest form.Research AgendaWe believe that AI deception – where a model outwardly seems aligned but is in fact misaligned and conceals this fact from human oversight – is a crucial component of many catastrophic risk scenarios from AI (see here for more). We also think that detecting/measuring deception is causally upstream of many potential solutions. For example, having good detection tools enables higher quality and safer feedback loops for empirical alignment approaches, enables us to point to concrete failure modes for lawmakers and the wider public, and provides evidence to AGI labs whether the models they are developing or deploying are deceptively misaligned.Ultimately, we aim to develop a holistic and far-ranging suite of deception evals that includes behavioral tests, fine-tuning, and interpretability-based approaches. Unfortunately, we think that interpretability is not yet at the stage where it can be used effectively on state-of-the-art models. Therefore, we have split the agenda into an interpretability research arm and a behavioral evals arm. We aim to eventually combine interpretability and behavioral evals into a comprehensive model evaluation suite.On the interpretability side, we are currently working on a new unsupervised approach and continuing work on an existing approach to attack the problem of superposition. Early experiments have shown promising results, but it [...] --- First published: May 30th, 2023 Source: https://forum.effectivealtruism.org/posts/ysC6crBKhDBGZfob3/announcing-apollo-research --- Narrated by TYPE III AUDIO.