if you start with GRPO you’re cooked
if you want to understand RL, start with the policy gradient theorem. then natural policy gradient (NPG), generalized advantage estimation (GAE), trust region policy optimization (TRPO), proximal policy optimization (PPO), and then group relative policy optimization (GRPO)
If I were you, I'd be studying either RL (starting with GRPO) or PTX (starting with CUDA). If I were a much younger me, I'd be studying my ass off in both subjects, plus MuZero, and training 0.5B models every day on my 4090