AI Researchers Can Now "Trace" How Models Think Step-by-Step, Revealing Inner Reasoning

AI systems have begun to surrender their secrets. Researchers can now track the precise chains of thought occurring inside neural networks, tracing how concepts emerge, interact, and combine to generate responses through what Anthropic CEO Dario Amodei calls "circuits", a breakthrough that could fundamentally change our relationship with AI.
These groundbreaking capabilities represent a significant advance in AI transparency, End of Miles reports, potentially addressing one of the field's most persistent challenges: understanding how AI actually "thinks" when solving problems.
Following the AI thought process
For years, leading AI companies have struggled to understand the internal mechanisms of their own creations. The Anthropic CEO explains that modern AI systems have been fundamentally opaque, with their decision-making processes hidden behind "vast matrices of billions of numbers."
"With circuits, we can 'trace' the model's thinking. For example, if you ask the model 'What is the capital of the state containing Dallas?', there is a 'located within' circuit that causes the 'Dallas' feature to trigger the firing of a 'Texas' feature, and then a circuit that causes 'Austin' to fire after 'Texas' and 'capital'." Dario Amodei, Anthropic CEO
This capability represents a significant evolution from earlier interpretability efforts that could only identify isolated concepts in models. The AI researcher describes how his team can now observe complex reasoning chains in action, including how models plan ahead when writing creative content.
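To make the idea concrete, the toy Python sketch below shows roughly what "tracing" a chain of features could mean: it assumes an attribution graph whose edges record how strongly one feature's activation triggers another's on a given prompt. The feature names, edge weights, and tracing logic are illustrative assumptions for explanation only, not Anthropic's actual circuit data or tooling.

```python
# Conceptual sketch only: the feature names, edge weights, and tracing logic are
# illustrative assumptions, not Anthropic's actual circuit data or tools.
from collections import defaultdict

# A toy "attribution graph": each edge records how strongly one feature's activation
# contributes to triggering another feature on the prompt about Dallas.
edges = {
    ("Dallas", "Texas"): 0.82,    # a hypothetical 'located within' circuit
    ("Texas", "Austin"): 0.77,    # a hypothetical 'state capital' circuit
    ("capital", "Austin"): 0.64,
}

def trace(start, end, edges, threshold=0.5):
    """Follow the strongest above-threshold edge out of each feature until `end` is reached."""
    graph = defaultdict(list)
    for (src, dst), weight in edges.items():
        if weight >= threshold:
            graph[src].append((weight, dst))

    path, node, visited = [start], start, {start}
    while node != end and graph[node]:
        _, node = max(graph[node])        # pick the strongest outgoing edge
        if node in visited:               # guard against cycles in the toy graph
            return None
        visited.add(node)
        path.append(node)
    return path if node == end else None

print(trace("Dallas", "Austin", edges))   # ['Dallas', 'Texas', 'Austin']
```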
Seeing poetry in motion
What makes these findings particularly striking is how they reveal the sophisticated planning that occurs inside AI systems. According to Amodei, researchers can now observe how models plan ahead for rhymes when writing poetry and how they share concepts across different languages.
"Even though we've only found a small number of circuits through a manual process, we can already use them to see how a model reasons through problems—for example how it plans ahead for rhymes when writing poetry, and how it shares concepts across languages." Amodei
The Anthropic founder notes that while these advances are impressive, they represent only the beginning. His team is "working on ways to automate the finding of circuits," as they believe "there are millions within a model that interact in complex ways."
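As a purely hypothetical illustration of what automating that search could look like in miniature, the sketch below enumerates every two-step chain of strong feature-to-feature links in the same toy attribution graph. Nothing here reflects Anthropic's actual methods, which would have to scale to the millions of interacting circuits Amodei describes.

```python
# Purely hypothetical sketch: exhaustively list two-step feature chains a -> b -> c
# whose links both clear a strength threshold. Real circuit discovery in large models
# would require far more sophisticated, scalable techniques.
from itertools import product

def find_two_step_circuits(edges, threshold=0.5):
    """Return all chains a -> b -> c where both edges exceed the threshold."""
    strong = {pair: w for pair, w in edges.items() if w >= threshold}
    circuits = []
    for (a, b), (b2, c) in product(strong, repeat=2):
        if b == b2 and a != c:
            circuits.append((a, b, c))
    return circuits

edges = {
    ("Dallas", "Texas"): 0.82,
    ("Texas", "Austin"): 0.77,
    ("capital", "Austin"): 0.64,
}
print(find_two_step_circuits(edges))  # [('Dallas', 'Texas', 'Austin')]
```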
Why this matters now
This breakthrough comes at a critical moment for AI development. The ability to interpret a model's thinking has implications that extend far beyond academic interest, potentially addressing key concerns about AI safety and reliability.
The technology leader describes a vision of using these techniques to perform what he calls a "brain scan" of advanced AI systems, identifying issues including "tendencies to lie or deceive, power-seeking, flaws in jailbreaks, cognitive strengths and weaknesses of the model as a whole."
"Our long-run aspiration is to be able to look at a state-of-the-art model and essentially do a 'brain scan': a checkup that has a high probability of identifying a wide range of issues... This would then be used in tandem with the various techniques for training and aligning models, a bit like how a doctor might do an MRI to diagnose a disease, then prescribe a drug to treat it." Amodei
These advances come as Anthropic and other major AI labs race to develop increasingly capable systems. The company's research efforts suggest that the interpretability field has moved from identifying individual concepts to mapping the complex relationships between them, offering unprecedented visibility into artificial reasoning.