Understanding AI Benchmarks - GSM8K
@mattshumer_’s Reflection Llama model was released two days ago, with reported scores on several popular benchmarks higher than those of GPT-4o, Claude 3.5 Sonnet, and Llama 3.1 405B. Notably, the post claimed the model reached 99.2% on GSM8K, and that number has been heavily debated since. Some of the comments include:
> “99.2% performance on GSM8k even though GSM8k has more than 1% error rate.” (@gazorp5, nitter.app/gazorp5/status/1831844…)
> “On GSM8K, 98% is better than 99%.” (@kohjingyu, nitter.app/kohjingyu/status/18320…)
> “This is super interesting, but I’m quite surprised to see a GSM8k score of over 99%.” (@hughbzhang, nitter.app/hughbzhang/status/1831…)
So, what do these discussions mean?
Let me start with some background on GSM8K:
GSM8K, short for Grade School Math 8K, is a dataset of about 8,500 high-quality, linguistically diverse math word problems at a grade school level. It was designed to support research on multi-step mathematical reasoning and problem-solving with language models. You can explore the dataset here: huggingface.co/datasets/open…
> Content: The dataset includes word problems that take between 2 and 8 steps to solve. These problems primarily involve basic arithmetic operations: addition, subtraction, multiplication, and division.
> Structure: GSM8K is divided into roughly 7,500 training problems and 1,000 test problems (the released dataset contains 7,473 and 1,319, respectively), so models can be trained on one split and evaluated on the other.
> Educational Level: The problems are intended to be solvable by bright middle school students and require no concepts beyond early algebra.
> Benchmarking: It serves as a benchmark for evaluating the performance of language models in solving math word problems.
> An example from the dataset:
Question: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?
Answer: Natalia sold 48/2 = 24 clips in May. Natalia sold 48+24 = 72 clips altogether in April and May. #### 72
> Issues: It has been discovered that some answers in the dataset are incorrect (github.com/openai/grade-scho…). As a result, a model that solves every problem correctly would still be marked wrong on those items and could not score 100% on this benchmark. I haven’t found a complete list of wrong answers, but some claim that the dataset has an error rate of over 1%, which would put the ceiling for a fully correct model below 99%. This is why people are shocked by Reflection Llama’s reported result of 99.2% on GSM8K.
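To make the ceiling argument concrete, here is a minimal Python sketch. The helper names `extract_final_answer` and `max_score` are my own, made up for illustration, and are not part of any GSM8K tooling. The 1,000-problem test set and the 1% error rate are the rough figures quoted above, not measured values, and the sketch assumes every wrong label costs a truly correct model one point.

```python
import re

# Hypothetical helpers for illustration; not part of any GSM8K tooling.


def extract_final_answer(solution: str) -> str:
    """Return the number after the '####' marker, without thousands separators."""
    match = re.search(r"####\s*(-?[\d,]*\.?\d+)", solution)
    if match is None:
        raise ValueError("no '#### <answer>' marker found")
    return match.group(1).replace(",", "")


def max_score(num_problems: int, num_bad_labels: int) -> float:
    """Best accuracy (in %) for a model whose every answer is truly correct,
    assuming each bad label marks that model's answer as wrong."""
    return 100.0 * (num_problems - num_bad_labels) / num_problems


reference = (
    "Natalia sold 48/2 = 24 clips in May. "
    "Natalia sold 48+24 = 72 clips altogether in April and May. #### 72"
)
print(extract_final_answer(reference))  # 72
print(max_score(1000, 10))              # 99.0
```

Under these assumptions, a score above the ceiling is only reachable by reproducing at least some of the wrong labels, which may be why a 99.2% raises eyebrows.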
So is it possible that Reflection Llama, or any AI model, achieves 99.2% on GSM8K? I think it is too early to tell until the following steps are finished:
> The weights of the Reflection Llama model on HuggingFace were not the correct version (ref. nitter.app/mattshumer_/status/183…). At @hyperbolic_labs, we are currently hosting Reflection Llama based on the weights on HuggingFace. Once the correct version is uploaded, we will update the hosted model and conduct a rigorous evaluation with @ArtificialAnlys to clarify the situation publicly.
> Given the potential issues with GSM8K, it would be helpful to thoroughly review the dataset’s answers using both human experts and state-of-the-art AIs to identify and correct all errors. That would tell us the upper bound of the score on the current benchmark and give us a corrected version on which an AI can aim for 100% accuracy, a critical milestone in quantitative reasoning.
> We always want to support the open-source AI community in pushing the limits of AI. If you have trained or plan to train an AI model that is good at math, feel free to reach out to us, and we will help host the model or collaborate with you!
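The answer review proposed in the second step above could begin with an automated pass that narrows down what humans need to check. Below is a rough Python sketch of one possible approach, not an established procedure: flag a problem when several independent solvers agree with each other but disagree with the dataset label. The function `flag_suspect_labels`, its `min_agreement` parameter, and the sample data are all hypothetical, and answers are assumed to be normalized to strings already.

```python
from collections import Counter

# Hypothetical sketch; the data below is made up for illustration.


def flag_suspect_labels(labels, solver_answers, min_agreement=2):
    """Return indices where at least `min_agreement` solvers agree on an
    answer that differs from the dataset label."""
    suspects = []
    for idx, (label, answers) in enumerate(zip(labels, solver_answers)):
        if not answers:
            continue  # no solver output for this problem, nothing to compare
        top_answer, votes = Counter(answers).most_common(1)[0]
        if votes >= min_agreement and top_answer != label:
            suspects.append(idx)
    return suspects


labels = ["72", "18"]
solver_answers = [
    ["72", "72", "72"],  # solvers match the label
    ["10", "10", "18"],  # two solvers agree on a different answer
]
print(flag_suspect_labels(labels, solver_answers))  # [1]
```

A flagged item is only a candidate for human review. Agreement among models does not prove the label is wrong, since the models may share the same misreading of the question.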
P.S.: In this GitHub issue (github.com/openai/grade-scho…), the author points out the following answer as incorrect:
Question: After scoring 14 points, Erin now has three times more points than Sara, who scored 8. How many points did Erin have before?
Answer (according to the dataset): 18
Both GPT-4o and Claude 3.5 Sonnet believe the correct answer should be 10, reading “three times more” as three times as many: 3 x 8 - 14 = 10.
However, according to Wikipedia (en.wikipedia.org/wiki/Wikipe…), “three times more” actually means four times as many. In that case, should the correct answer be 4 x 8 - 14 = 18, which would mean the answer in the dataset is correct after all?
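The two readings differ only in the multiplier applied to Sara’s 8 points. A tiny Python check makes the arithmetic explicit (the function name `points_before` is mine, purely for illustration):

```python
def points_before(sara_points: int, just_scored: int, multiplier: int) -> int:
    """Erin's points before scoring, if she now has `multiplier` times Sara's points."""
    return multiplier * sara_points - just_scored


print(points_before(8, 14, 3))  # 10, reading "three times more" as 3x
print(points_before(8, 14, 4))  # 18, reading "three times more" as 4x
```

Which label counts as “wrong” therefore depends on how the wording is interpreted, not on the arithmetic.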