Reflection Llama 3.1 70B independent eval results: In our independent testing, we have been unable to replicate the claimed eval results and are seeing worse performance than Meta’s Llama 3.1 70B, not better.
These evaluations were conducted using our standard methodology, including our standard system prompt, and accessed the model via DeepInfra’s API, which claims bf16 precision. Our evaluation methodology uses a 0-shot prompt with a "think step by step" instruction.
This is not to say there is no merit in Reflection's prompting approach for achieving higher evaluation results as claimed. We are aware that the Glaive team has been updating the model, and we would be more than happy to test further releases.
We also ran tests comparing our standard system prompt to Glaive’s provided system prompt and we did not observe any differences in the evaluation results on Reflection Llama 3.1 70B, Llama 3.1 70B, GPT-4o or Claude 3.5 Sonnet.
This does not mean the claimed results were not achieved, but we look forward to hearing more about the evaluation approach that led to these results, particularly regarding the exact prompt used and how the evaluation answers were extracted.
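To illustrate why the exact prompt and the answer extraction step matter, here is a minimal sketch, not our actual harness. The function names, the prompt wording, and the `Answer: <letter>` convention are all assumptions made for illustration. It shows a 0-shot "think step by step" prompt for a multiple-choice question and two extraction rules that can score the same response differently.

```python
import re
from typing import List, Optional

LETTERS = "ABCDEFGHIJ"

def build_zero_shot_prompt(question: str, choices: List[str]) -> str:
    """Build a 0-shot prompt (no worked examples) with a step-by-step instruction.

    The wording here is hypothetical, not the prompt used in the evaluation.
    """
    lines = [question]
    lines += [f"{LETTERS[i]}. {choice}" for i, choice in enumerate(choices)]
    lines.append("Think step by step, then finish with 'Answer: <letter>'.")
    return "\n".join(lines)

def extract_strict(response: str) -> Optional[str]:
    """Accept only an explicit 'Answer: X' marker; anything else is unscored."""
    match = re.search(r"Answer:\s*([A-J])\b", response)
    return match.group(1) if match else None

def extract_lenient(response: str) -> Optional[str]:
    """Fall back to the last standalone option letter if no marker is present."""
    strict = extract_strict(response)
    if strict is not None:
        return strict
    letters = re.findall(r"\b([A-J])\b", response)
    return letters[-1] if letters else None

if __name__ == "__main__":
    prompt = build_zero_shot_prompt("What is the capital of France?", ["Berlin", "Paris"])
    print(prompt)
    # A response that reasons correctly but ignores the requested format.
    response = "Paris is the capital of France, so the correct option is B."
    print(extract_strict(response), extract_lenient(response))
```

A response that skips the requested format is unscored under the strict rule and counted under the lenient one, so two evaluators can report different scores for identical model outputs.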
Sep 7, 2024 · 4:35 PM UTC

