People are racing to push math reasoning performance in #LLMs, but have we really asked why? The common assumption is that improving math reasoning should transfer to broader capabilities in other domains. But is that actually true?
In our study (arxiv.org/pdf/2507.00432), we evaluated over 20 open-weight reasoning models and found that:
➡️Only models trained with RL exhibit broad transfer of math reasoning skills to other tasks.
➡️Models trained with SFT show limited or no transfer—especially to non-reasoning domains.
To quantify this, we introduce the Transferability Index (TI), which measures how much of a model’s gain on math carries over to other task groups. A positive score indicates effective transfer; a negative one suggests a loss of general capability.
We evaluate the models on three benchmark categories:
- Math reasoning: MATH-500, AIME24/25, Olympiad
- Other reasoning: GPQA-D (Science), LiveCodeBench2 (Code), ACPBench (Agent Planning), HeadQA (Medical)
- Non-reasoning: CoQA (Conversational QA), IFEval (Instruction Following), HaluEval (Hallucination), MC-TACO (Commonsense)
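To make the TI idea concrete, here is a minimal sketch under stated assumptions. This post does not spell out the formula, so the code assumes TI is the average relative gain on a target group divided by the average relative gain on math, times 100. The function names and all scores below are hypothetical placeholders, not results from the study; see the paper for the exact definition.

```python
from statistics import mean


def relative_gain(base, tuned):
    """Average relative change over the benchmarks in one group.

    `base` and `tuned` map benchmark name -> score for the base model
    and the fine-tuned model. Base scores are assumed to be nonzero.
    """
    return mean((tuned[name] - base[name]) / base[name] for name in base)


def transferability_index(base, tuned, group, math_group="math"):
    """Assumed TI: gain on `group` relative to gain on math, in percent."""
    math_gain = relative_gain(base[math_group], tuned[math_group])
    if math_gain == 0:
        raise ValueError("no math gain, so the ratio is undefined")
    return 100.0 * relative_gain(base[group], tuned[group]) / math_gain


# Hypothetical scores for illustration only (not taken from the paper).
base_scores = {
    "math": {"MATH-500": 50.0, "AIME24": 10.0},
    "other_reasoning": {"GPQA-D": 40.0},
    "non_reasoning": {"IFEval": 80.0},
}
tuned_scores = {
    "math": {"MATH-500": 60.0, "AIME24": 12.0},
    "other_reasoning": {"GPQA-D": 44.0},
    "non_reasoning": {"IFEval": 72.0},
}

ti_other = transferability_index(base_scores, tuned_scores, "other_reasoning")
ti_non = transferability_index(base_scores, tuned_scores, "non_reasoning")
print(f"TI (other reasoning): {ti_other:.1f}")
print(f"TI (non-reasoning):   {ti_non:.1f}")
```

With these made-up numbers, the other-reasoning group gets a positive TI (part of the math gain carried over), while the non-reasoning group gets a negative TI (the model got worse there while improving on math).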
Our findings challenge the blind pursuit of leaderboard performance in math reasoning via SFT. Simply creating more math-like SFT data may inadvertently harm a model’s broader generalization. Instead, RL appears to be key to developing reasoning that truly transfers.