“METR: Measuring AI Ability to Complete Long Tasks” by Ben_West🔸
EA Forum Podcast (All audio) - A podcast by EA Forum Team

Summary: We propose measuring AI performance in terms of the length of tasks AI agents can complete. We show that this metric has been consistently exponentially increasing over the past 6 years, with a doubling time of around 7 months. Extrapolating this trend predicts that, in under a decade, we will see AI agents that can independently complete a large fraction of software tasks that currently take humans days or weeks.

The length of tasks (measured by how long they take human professionals) that generalist frontier model agents can complete autonomously with 50% reliability has been doubling approximately every 7 months for the last 6 years. The shaded region represents 95% CI calculated by hierarchical bootstrap over task families, tasks, and task attempts.

Full paper | GitHub repo | Blogpost | tweet thread

---

First published: March 19th, 2025

Source: https://forum.effectivealtruism.org/posts/YJ7Pk2bwTd3ieimG8/metr-measuring-ai-ability-to-complete-long-tasks

---

Narrated by TYPE III AUDIO.
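The doubling-time claim in the summary can be made concrete with a little arithmetic. The sketch below is not taken from the paper or its repository; it simply assumes the trend is a clean exponential with the roughly 7-month doubling time quoted above. The function names and the 60-minute starting horizon are hypothetical, chosen only for illustration.

```python
import math

# Doubling time quoted in the summary (approximate).
DOUBLING_TIME_MONTHS = 7.0


def horizon_multiplier(months_elapsed, doubling_months=DOUBLING_TIME_MONTHS):
    """Factor by which the task-length horizon grows after `months_elapsed`,
    assuming a pure exponential trend with the given doubling time."""
    return 2.0 ** (months_elapsed / doubling_months)


def months_to_reach(current_minutes, target_minutes,
                    doubling_months=DOUBLING_TIME_MONTHS):
    """Months needed for the horizon to grow from `current_minutes` to
    `target_minutes` under the same assumed exponential trend."""
    if current_minutes <= 0 or target_minutes <= 0:
        raise ValueError("task lengths must be positive")
    return doubling_months * math.log2(target_minutes / current_minutes)


if __name__ == "__main__":
    # Hypothetical starting point: a 60-minute horizon (an assumption,
    # not a figure from the summary).
    start_minutes = 60
    work_week_minutes = 40 * 60
    print(f"Growth over 6 years: x{horizon_multiplier(72):.0f}")
    print(f"Months from 1 hour to a 40-hour week: "
          f"{months_to_reach(start_minutes, work_week_minutes):.1f}")
```

Under these assumptions, going from a one-hour horizon to a 40-hour work week takes about 37 months, since log2(40) is roughly 5.3 doublings. The estimate is only as good as the assumption that the trend continues unchanged.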