We scale Hawk to 7B parameters and Griffin to 14B. Both models exhibit power-law scaling, just like Transformers!
Griffin achieves lower held-out loss than a strong Transformer baseline across all model sizes, while Hawk closes the gap as we scale training FLOPs.
Mar 1, 2024 · 11:02 AM UTC

