We scale Hawk to 7B parameters and Griffin to 14B. Both models exhibit power-law scaling, just like Transformers! Griffin achieves lower held-out loss than a strong Transformer baseline across all model sizes, while Hawk closes the gap as we scale training FLOPs.

Mar 1, 2024 · 11:02 AM UTC
