New (2h13m ๐
) lecture: "Let's build the GPT Tokenizer"
Tokenizers are a completely separate stage of the LLM pipeline: they have their own training set, training algorithm (Byte Pair Encoding), and after training implement two functions: encode() from strings to tokens, and decode() back from tokens to strings. In this lecture we build from scratch the Tokenizer used in the GPT series from OpenAI.
Feb 20, 2024 ยท 5:40 PM UTC
350
1,837
13,608

