If you traveled back to 2015 and had to give a recipe for AGI, you'd need to convey just three insights, deceptively simple in retrospect. Number two was the most surprising and the biggest unlock.
1. Replace RNNs with attention mechanisms operating on fixed context windows. This removes the bottleneck of squeezing the entire sequence history into a single fixed-size hidden state, and it lets every position in the sequence be processed in parallel during training, which makes it practical to train models with hundreds of billions of parameters.
2. Train autoregressive next-token prediction on vast, curated text corpora. While early training produces surface-level pattern matching, beyond a critical scale threshold, general intelligence and in-context learning emerge reliably. The scaling of cross-entropy loss (log-perplexity) with compute, data, and parameters follows clear power laws across many orders of magnitude.
3. Teach reasoning by training models to generate explicit step-by-step solutions. For tasks like mathematics and programming, sample multiple completion attempts, then use reinforcement learning to increase the probability of solution paths that reach correct answers.
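
To make the power-law claim in the second insight concrete, here is a minimal sketch of how such a curve can be fit and extrapolated. It assumes a pure power law of the form loss = a * compute^(-b) with no offset term. The data points and the constants a = 10 and b = 0.05 are made up for illustration and are not measurements from any real model; `fit_power_law` is a hypothetical helper name.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-b), done as a straight line in log-log space."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    cov = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    var = sum((lx - mean_x) ** 2 for lx in log_x)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)

# Synthetic, noise-free points (assumed values, not real measurements):
# loss = 10 * compute**(-0.05) for compute from 1e18 to 1e24.
TRUE_A, TRUE_B = 10.0, 0.05
compute = [10.0 ** k for k in range(18, 25)]
loss = [TRUE_A * c ** (-TRUE_B) for c in compute]

a, b = fit_power_law(compute, loss)

# Extrapolate two orders of magnitude beyond the largest fitted point.
predicted = a * (1e26) ** (-b)
print(f"a={a:.3f}, b={b:.4f}, predicted loss at 1e26: {predicted:.4f}")
```

On a log-log plot a power law is a straight line, which is why a simple linear regression on the logarithms recovers the exponent. It is also why a fit on small runs can be used to forecast the loss of a much larger one, provided the trend really does hold over that range.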