Junxian He · Jan 25, 2025 · 4:00 PM UTC

We replicated DeepSeek-R1-Zero and DeepSeek-R1 training on a 7B model with only 8K examples, and the results are surprisingly strong.

🚀 Starting from Qwen2.5-Math-7B (base model), we perform RL on it directly. No SFT, no reward model, just 8K MATH examples for verification. The resulting model achieves (pass@1) 33.3% on AIME, 62.5% on AMC, and 77.2% on MATH, outperforming Qwen2.5-Math-7B-Instruct and comparable to PRIME and rStar-Math, which use >50x more data and more complicated components.

🚀 Increased CoT length and self-reflection emerge.

We share the details and our findings in the blog: hkust-nlp.notion.site/simple…

Training code and implementation details here: github.com/hkust-nlp/simpleR…
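For readers wondering what "no reward model, just examples for verification" can look like in practice, here is a minimal sketch of a rule-based reward. It is an illustration under assumptions, not the code from the linked repo: the function names `extract_boxed` and `rule_based_reward`, the `\boxed{...}` answer convention, and the exact-string comparison are all hypothetical choices made for this example.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in text, or None.

    Hypothetical helper: assumes the model is prompted to put its final
    answer inside \\boxed{...}. Nested braces are handled by depth counting.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # braces never closed


def rule_based_reward(response: str, reference: str) -> float:
    """Return 1.0 if the boxed answer matches the reference string, else 0.0.

    Assumption: whitespace-insensitive exact match. A real verifier would
    likely also normalize equivalent math expressions.
    """
    answer = extract_boxed(response)
    if answer is None:
        return 0.0
    same = answer.replace(" ", "") == reference.replace(" ", "")
    return 1.0 if same else 0.0
```

The point of the sketch is that the reward comes from checking the final answer against a known label, so no learned reward model is involved.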