Andrej Karpathy · Feb 20, 2024 · 5:40 PM UTC

@karpathy

New (2h13m 😅) lecture: "Let's build the GPT Tokenizer"

Tokenizers are a completely separate stage of the LLM pipeline: they have their own training set, training algorithm (Byte Pair Encoding), and after training implement two functions: encode() from strings to tokens, and decode() back from tokens to strings. In this lecture we build from scratch the Tokenizer used in the GPT series from OpenAI.
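To make the train / encode() / decode() split concrete, here is a minimal byte-level Byte Pair Encoding sketch. It is an illustration only, not the lecture's code and not the actual GPT tokenizer (which, among other things, pre-splits text with a regex and handles special tokens). The helper names `pair_counts`, `merge`, and `train` are hypothetical, as is the choice to replay merges in training order inside `encode`.

```python
from collections import Counter

def pair_counts(ids):
    # Count how often each adjacent pair of token ids occurs.
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    # Replace every non-overlapping occurrence of `pair` with `new_id`.
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train(text, num_merges):
    # Start from raw UTF-8 bytes (ids 0..255) and repeatedly merge
    # the most frequent adjacent pair into a fresh token id.
    ids = list(text.encode("utf-8"))
    merges = {}  # (id, id) -> new id, in the order the merges were learned
    for _ in range(num_merges):
        counts = pair_counts(ids)
        if not counts:
            break
        pair = max(counts, key=counts.get)
        new_id = 256 + len(merges)
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges

def encode(text, merges):
    # String -> token ids: replay the learned merges in training order.
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge(ids, pair, new_id)
    return ids

def decode(ids, merges):
    # Token ids -> string: expand each id back to its bytes, then decode.
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

merges = train("aaabdaaabac", num_merges=3)
tokens = encode("aaabdaaabac", merges)
assert decode(tokens, merges) == "aaabdaaabac"
```

Note that the tokenizer is trained on its own text, independent of whatever the language model later sees, which is the "separate stage" point made above.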
