Forget Tokens, AI Needs 'Joint Embedding' Architectures to Understand the Real World

[Image: Abstract neural architecture transforming pixels into latent space, illustrating Yann LeCun's JEPA concept for AI world models beyond token prediction]

While Large Language Models (LLMs) have captured the public imagination, Meta's Chief AI Scientist Yann LeCun believes the real frontier lies beyond simply predicting the next word or token. To build machines that truly understand reality, possess memory, and can reason and plan effectively, he argues, we need fundamentally different approaches, specifically what he calls Joint Embedding Predictive Architectures (JEPA).

LeCun, also a professor at New York University, suggests that the core mechanism of LLMs – predicting discrete tokens from a finite set – hits a wall when dealing with the complexity of the physical world, End of Miles reports. The real world, unlike language, is high-dimensional and continuous, making direct prediction incredibly difficult, if not impossible.

Why Tokens Aren't Enough for Reality

The AI pioneer explains that attempts to build world models – internal representations of how the world works – by training systems to predict raw data like video pixels have consistently fallen short. "Every attempt at trying to get a system to understand the world or build mental models of the world by being trained to predict videos at a pixel level has basically failed," LeCun stated during a conversation with NVIDIA's Bill Dally at GTC 2025.

"If you train a system to predict at a pixel level, it spends all its resources trying to come up with details it just cannot invent. That's a complete waste of resources... It only works if you do it at a representation level." Yann LeCun, Chief AI Scientist at Meta

He elaborated, citing experiments where predicting video using masked autoencoders (similar to LLM pre-training but for pixels) required immense computational power ("boiling a small lake") and ultimately proved unsuccessful ("basically a failure"). The issue, according to the NYU professor, is that the world contains too much unpredictable detail.
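For concreteness, the pixel-level setup LeCun is criticizing looks roughly like a masked autoencoder applied to video patches: hide part of the input and train the model to reconstruct the raw pixels. The sketch below is a minimal, hypothetical illustration (toy module sizes, a simple MLP encoder/decoder), not Meta's actual experiments.

```python
# Minimal sketch (hypothetical shapes and modules) of pixel-level masked prediction:
# mask part of a video clip and train the model to reconstruct the raw pixel values.
import torch
import torch.nn as nn
import torch.nn.functional as F

PATCH_DIM, LATENT_DIM = 768, 256  # toy sizes, assumed for illustration

encoder = nn.Sequential(nn.Linear(PATCH_DIM, LATENT_DIM), nn.GELU())
decoder = nn.Linear(LATENT_DIM, PATCH_DIM)  # must output every pixel value

def pixel_masked_loss(patches: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """patches: (batch, num_patches, PATCH_DIM); mask: boolean (batch, num_patches)."""
    visible = patches * (~mask).unsqueeze(-1).float()   # zero out the masked patches
    latent = encoder(visible)
    recon = decoder(latent)                             # predict raw pixels back
    # The loss charges the model for every unpredictable detail (texture, noise,
    # leaf motion) -- the "details it just cannot invent" LeCun refers to.
    return F.mse_loss(recon[mask], patches[mask])
```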

Predicting in Abstract Space: The JEPA Idea

Instead of predicting raw pixels or tokens, LeCun champions Joint Embedding Predictive Architectures (JEPA). These models work differently: they first encode inputs (like video frames) into an abstract representation space, or latent space. The prediction then happens within this abstract space, not the original input space.

"Take a chunk of video... run it through an encoder, you get a representation," LeCun explained. "Then take the continuation... run it through an encoder as well, and now try to make a prediction in that representation space instead of making it in the input space."

This approach avoids wasting resources on predicting unknowable details. LeCun pointed to projects like V-JEPA (Video Joint Embedding Predictive Architecture), which uses this method to learn representations from video far more efficiently than pixel-level prediction, and has even shown an ability to detect physically implausible events in videos – a step towards learning intuitive physics.
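One way such a model can flag physically implausible events is to treat a large latent prediction error as a "surprise" signal. The snippet below is a hypothetical usage of the jepa_loss sketch above, not a description of V-JEPA's actual evaluation; the threshold is an assumed parameter.

```python
def surprise_score(video_chunk: torch.Tensor, continuation: torch.Tensor,
                   threshold: float = 1.0) -> bool:
    """Return True if the continuation looks 'surprising' under the latent predictor."""
    with torch.no_grad():
        error = jepa_loss(video_chunk, continuation).item()
    # An object teleporting or vanishing produces a continuation the predictor
    # cannot anticipate, so the latent prediction error spikes.
    return error > threshold
```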

Building Minds That Plan

The ultimate goal of developing these architectures, the AI specialist notes, is to enable machines to reason and plan like humans do – not by manipulating language tokens, but by simulating outcomes within their internal world models.

"When we reason, when we think, we do this in some sort of abstract mental state that has nothing to do with language. You don't want to be kicking tokens around; you want to be reasoning in your latent space, not in token space." Yann LeCun

He envisions systems that can observe the state of the world, propose an action, predict the next state of the world within their learned abstract model, and thus plan a sequence of actions to achieve a goal. This, LeCun believes, is the path toward more capable AI, moving beyond the limitations of current token prediction models. "I think that's the future," he stated, referencing JEPA world models as the key architectural shift needed.
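That planning loop can be sketched as a simple search over candidate actions inside a learned latent model: encode the current state, simulate each candidate sequence forward, and keep the one whose predicted outcome lands closest to the goal. Everything below (the world_model, the cost function, the random-shooting search) is a generic illustration of the idea under assumed toy sizes, not an architecture LeCun specified.

```python
# Sketch of planning by simulation in latent space (random-shooting search).
import torch
import torch.nn as nn

LATENT_DIM, ACTION_DIM = 256, 8  # assumed toy sizes

# Learned latent dynamics: s_{t+1} = f(s_t, a_t)
world_model = nn.Linear(LATENT_DIM + ACTION_DIM, LATENT_DIM)

def plan(state: torch.Tensor, goal: torch.Tensor,
         horizon: int = 5, candidates: int = 64) -> torch.Tensor:
    """state, goal: (LATENT_DIM,) latent vectors. Returns the first action of the
    candidate sequence whose predicted final state lands closest to the goal."""
    actions = torch.randn(candidates, horizon, ACTION_DIM)   # sample action sequences
    states = state.expand(candidates, LATENT_DIM)
    for t in range(horizon):
        # Simulate one step forward inside the learned world model.
        states = world_model(torch.cat([states, actions[:, t]], dim=-1))
    costs = ((states - goal) ** 2).sum(dim=-1)                # distance to goal in latent space
    return actions[costs.argmin(), 0]                         # best sequence, first action
```

In a full system, both the state and the goal would themselves come from a JEPA-style encoder, so the entire search happens in the abstract representation space rather than in token or pixel space.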
