MMLU is the standard LM evaluation but model developers (i) use different prompting strategies and (ii) often do not release prompts. 3rd-party researchers often obtain lower scores 🤯 📢 HELM MMLU uses simple, standardized prompts, resulting in fair, reproducible comparisons of models:

May 2, 2024 · 3:44 AM UTC

11
28
202