AI Excels at Writing Code But Falters When Running It, PaperBench Study Reveals

"Current AI systems can write impressive code, but they're surprisingly inept at running it correctly or analyzing the results," reveals a comprehensive new study examining frontier AI models' abilities to replicate machine learning research. The stark capability gap shows AI systems scoring 35-43% on code development tasks but plummeting to just 1-7% on execution and below 1.5% on results analysis.
End of Miles reports that this disparity was uncovered in the PaperBench study, which tested leading AI models' ability to replicate complex ML research papers from scratch.
Writing vs. Executing: The Critical Gap
The PaperBench analysis found that Claude 3.5 Sonnet achieved the highest code development score among models run with the standard scaffold, at 35.4%, while OpenAI's o1 using an alternative "IterativeAgent" scaffold reached 43.3%. However, these same models scored dramatically lower on execution tasks, with Claude achieving just 1.8% and o1 reaching only 4.5%.
"This suggests that models are good at writing lots of code, but aren't successful at integrating, testing, and successfully running that code to achieve results." PaperBench researchers
The research team broke down each paper into three types of requirements: Code Development (writing correct implementations), Execution (running the code successfully), and Result Match (obtaining the expected outputs). This granular assessment revealed a consistent pattern across all evaluated models.
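As a rough illustration of how such a requirement hierarchy could be rolled up into a single replication score, the sketch below aggregates weighted rubric nodes. The node names, weights, and scores are invented for demonstration and are not drawn from the actual PaperBench rubric or grading code.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative sketch only: the structure, weights, and scores below are
# assumptions for demonstration, not the actual PaperBench rubric or grader.

@dataclass
class RubricNode:
    name: str
    weight: float = 1.0                    # relative importance within its parent
    score: Optional[float] = None          # leaf score in [0, 1], assigned by a grader
    children: List["RubricNode"] = field(default_factory=list)

    def aggregate(self) -> float:
        """Leaves return their graded score; internal nodes return the
        weighted average of their children's aggregated scores."""
        if not self.children:
            return self.score if self.score is not None else 0.0
        total = sum(child.weight for child in self.children)
        return sum(child.weight * child.aggregate() for child in self.children) / total


# Hypothetical paper rubric using the three requirement types described above.
paper = RubricNode("example-paper", children=[
    RubricNode("Code Development", score=0.43),  # implementation written correctly
    RubricNode("Execution",        score=0.05),  # code runs end to end
    RubricNode("Result Match",     score=0.01),  # outputs match the paper's results
])

print(f"Overall replication score: {paper.aggregate():.1%}")
```

However the real rubric is structured, the arithmetic makes the headline finding concrete: low Execution and Result Match scores pull the overall replication score far below the Code Development score.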
Human Researchers Maintain Edge
The capability divide becomes even more evident when comparing AI performance to human ML researchers. On a subset of PaperBench papers, top ML PhDs achieved 72.4% on code development tasks, 20.4% on execution, and 8.9% on results analysis—maintaining a substantial lead over AI systems in all categories.
The researchers note that AI's initial performance advantage quickly plateaus: "o1's scores mostly plateau after the first hour, suggesting that the model is proficient at writing a lot of code quickly at the beginning of the attempt, but fails to effectively work beyond this time horizon to strategize how to improve its submission."
Implications for AI Development Tools
The study's findings have significant implications for how AI coding assistants might be developed and deployed in real-world settings. While current systems excel at generating code rapidly, they struggle with the critical validation and debugging phases that determine whether code actually works as intended.
"We observe that models perform poorly on Execution and Result Match requirement types, while scoring better at Code Development nodes." PaperBench requirement type analysis
This execution gap suggests that human-AI collaboration remains necessary for complex coding tasks, with humans potentially focusing on execution and validation while AI systems handle initial code generation. The PaperBench team emphasizes that their findings represent just a first baseline for these capabilities.
Broader Significance
The data science community has welcomed the assessment because it provides concrete metrics for frameworks that track progress toward autonomous AI research capabilities. The scoring breakdown identifies specific areas where AI systems need improvement rather than treating "coding ability" as a single monolithic skill.
Looking ahead, the research team suggests that future work on agentic scaffolds should focus on improving AI's ability to strategize about code execution and validation rather than simply generating more code. "We expect performance on PaperBench to improve incrementally," the team notes, indicating that this particular capability gap may be addressable through improved system design rather than reflecting a fundamental model limitation.