We replicated the DeepSeek-R1-Zero and DeepSeek-R1 training on a 7B model with only 8K examples, and the results are surprisingly strong.

🚀 Starting from Qwen2.5-Math-7B (base model), we perform RL on it directly. No SFT, no reward model, just 8K MATH examples for verification. The resulting model achieves (pass@1) 33.3% on AIME, 62.5% on AMC, and 77.2% on MATH, outperforming Qwen2.5-Math-7B-Instruct and being comparable to PRIME and rStar-Math, which use >50x more data and more complicated components.

🚀 Increased CoT length and self-reflection emerge.

We share the details and our findings in the blog: hkust-nlp.notion.site/simple…
Training code and implementation details here: github.com/hkust-nlp/simpleR…

Jan 25, 2025 · 4:00 PM UTC
