Percy Liang · May 2, 2024 · 3:44 AM UTC

@percyliang

MMLU is the standard LM evaluation, but model developers (i) use different prompting strategies and (ii) often do not release prompts. 3rd-party researchers often obtain lower scores 🤯

📢 HELM MMLU uses simple, standardized prompts, resulting in fair, reproducible comparisons of models:
