We replicated the DeepSeek-R1-Zero and DeepSeek-R1 training on a 7B model with only 8K examples, and the results are surprisingly strong.
Starting from Qwen2.5-Math-7B (base model), we perform RL on it directly. No SFT, no reward model, just 8K MATH examples for verification. The resulting model achieves (pass@1) 33.3% on AIME, 62.5% on AMC, and 77.2% on MATH, outperforming Qwen2.5-Math-7B-Instruct and performing comparably to PRIME and rStar-Math, which use >50x more data and more complicated components.
Increased CoT length and self-reflection emerge.
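"No reward model" means the RL signal can come from simply checking the model's final answer against the known MATH answer. A minimal sketch of such a rule-based check, assuming the final answer is wrapped in \boxed{...} and compared by exact string match. The function names are hypothetical, and the reward actually used in training is described in the blog and code linked in this post and may differ:

```python
from typing import Optional, Sequence


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # braces never closed


def rule_based_reward(response: str, reference: str) -> float:
    """Assumed reward: 1.0 if the boxed answer matches the reference, else 0.0."""
    answer = extract_boxed(response)
    if answer is None:
        return 0.0
    # Whitespace-insensitive exact match; real checkers usually normalize more.
    return 1.0 if answer.replace(" ", "") == reference.replace(" ", "") else 0.0


def pass_at_1(responses: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of problems whose single sampled response is correct."""
    if not responses or len(responses) != len(references):
        raise ValueError("need equally many (and at least one) responses and references")
    rewards = [rule_based_reward(r, ref) for r, ref in zip(responses, references)]
    return sum(rewards) / len(rewards)
```

Exact string matching is the simplest possible verifier; it would mark equivalent forms such as 0.5 and \frac{1}{2} as different, so a practical checker would likely add answer normalization.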
We share the details and our findings in the blog:
hkust-nlp.notion.site/simple…
Training code and implementation details here: github.com/hkust-nlp/simpleR…
Jan 25, 2025 · 4:00 PM UTC


