MMLU is the standard LM evaluation, but model developers (i) use different prompting strategies and (ii) often do not release their prompts. As a result, third-party researchers often obtain lower scores than those reported 🤯
📢 HELM MMLU uses simple, standardized prompts, resulting in fair, reproducible comparisons of models.
May 2, 2024 · 3:44 AM UTC

