I agree with @karpathy's take here. The interview between @RichardSSutton and @dwarkesh_sp was interesting, but I think at times there was a communication gap due to some misunderstandings.
I would say that the current LLM training setup is very similar to the classic model-free RL setup, except that with LLMs:
(1) the policy is warm-started from a supervised model (no de-novo, self-directed learning);
(2) there is a train/test distinction (no continual learning);
(3) most of the observation stream comes from human words, which already "carve nature at its joints", bypassing the harder problem of learning useful abstractions from raw sensorimotor streams;
(4) when using multimodal models, the perceptual encoder is usually pre-trained and frozen, and often relies on a lot of human engineering (e.g., contrastive losses or pixel-prediction losses) to come up with a good set of (soft) tokens.
Most of the interview seemed to focus on issue #1. However, the discussion seemed confused here because LLMs are both a world model, or WM (predict what humans would typically say), and a policy (predict what the agent should do).
Obviously the model from the supervised pretraining stage is not action-conditioned, so Sutton does not want to call it a WM - but it is a predictor of future observations given the past, so it's like a WM that marginalizes over actions (resulting in a mixture).
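To make the "marginalizes over actions" point concrete, here is a rough sketch in my own notation (none of these symbols are from the interview): write $o_{1:t}$ for the observation history, $a_t$ for the (unobserved) action of whoever generated the data, $\pi_{\text{data}}$ for their behavior policy, and $p(o_{t+1} \mid o_{1:t}, a_t)$ for an action-conditioned WM. I'm assuming discrete actions, so the marginal is a sum.

```latex
p(o_{t+1} \mid o_{1:t})
  = \sum_{a_t} \pi_{\text{data}}(a_t \mid o_{1:t}) \; p(o_{t+1} \mid o_{1:t}, a_t)
```

The left-hand side is what next-token pretraining fits. It is a mixture of action-conditioned predictions, weighted by how the data-generating humans tended to act.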
The WM is then converted into a (goal-conditioned) policy using IFT (imitation learning) and then improved with RLFT, which further confuses the discussion. In current practice, the RLFT stage mostly just uses human-provided reasoning tasks, which are bandit problems that do not involve interacting with an environment. But there is a recent move towards true multi-step RL, where LLMs do learn from external environments, as in classic RL. This fact was not emphasized enough in the interview, IMHO.
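A toy sketch of the bandit vs multi-step distinction, in case it helps. Everything here is hypothetical and made up for illustration (`bandit_episode`, `multistep_episode`, `CounterEnv`, and the string-based policy interface); it is not how any particular RLFT stack is implemented, and it only shows the data-collection loop, not the policy update.

```python
from typing import Callable, List, Tuple

Policy = Callable[[str], str]  # hypothetical interface: text in, text out


def bandit_episode(policy: Policy,
                   grader: Callable[[str, str], float],
                   prompt: str) -> float:
    """Bandit setting: one prompt, one response, one scalar reward."""
    response = policy(prompt)
    return grader(prompt, response)


class CounterEnv:
    """Toy external environment: reach `target` using '+1' / '-1' actions."""

    def __init__(self, target: int, max_steps: int = 10) -> None:
        self.target = target
        self.max_steps = max_steps
        self.state = 0
        self.steps = 0

    def _obs(self) -> str:
        return f"state={self.state} target={self.target}"

    def reset(self) -> str:
        self.state, self.steps = 0, 0
        return self._obs()

    def step(self, action: str) -> Tuple[str, float, bool]:
        # Any action other than '+1' is treated as '-1'.
        self.state += 1 if action == "+1" else -1
        self.steps += 1
        reached = self.state == self.target
        done = reached or self.steps >= self.max_steps
        return self._obs(), (1.0 if reached else 0.0), done


def multistep_episode(policy: Policy,
                      env: CounterEnv) -> List[Tuple[str, str, float]]:
    """Multi-step setting: each action changes what the agent observes next."""
    obs = env.reset()
    trajectory: List[Tuple[str, str, float]] = []
    done = False
    while not done:
        action = policy(obs)
        next_obs, reward, done = env.step(action)
        trajectory.append((obs, action, reward))
        obs = next_obs
    return trajectory


if __name__ == "__main__":
    grade = lambda prompt, response: 1.0 if response == "4" else 0.0
    print(bandit_episode(lambda prompt: "4", grade, "2+2="))
    print(multistep_episode(lambda obs: "+1", CounterEnv(target=3)))
```

In the bandit case the reward depends only on a single response. In the multi-step case the reward arrives after a sequence of actions whose consequences feed back into later observations, which is the classic RL setting.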
Andrej argues that warm-starting is a practical alternative to evolution's outer meta-learning loop, and I agree, so I don't have a problem with #1. But I do agree with Sutton's criticisms #2-#4.
In particular, I expect a lot of future progress to come from continual RL applied to multimodal problems (e.g., visual GUI-using agents) in non-stationary multi-agent environments (e.g., e-commerce or embodied AI), where the agent learns its own abstractions over time (e.g., creating tool libraries), learns both a (goal-agnostic) world model and a (goal-conditioned) policy (so it can do decision-time planning), and where both kinds of model become semi-parametric (e.g., combining memories and ICL with gradient-based weight updates).
Future agents will not just be a frozen "omni-transformer", consuming and generating tokens; they will be heterogeneous adaptive systems, with many different specialized modules, more like the brain. (This may make serving hard, but who said intelligence would be easy to reproduce?) I think Sutton will like this new paradigm more :)