efficiency
Published: March 29, 2022
Training compute-optimal large language models
By Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford
Research TL;DR
Establishes the "Chinchilla scaling laws" for compute-optimal LLM training. Shows that scaling data tokens is as vital as scaling model parameters.
Abstract
We investigate how best to allocate a fixed compute budget between model size and training tokens when training autoregressive language models. By training over 400 language models, we find that current LLMs are significantly undertrained, and that compute-optimal training scales parameters and tokens in equal proportion: each doubling of model size calls for a doubling of the number of training tokens.
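The equal-proportion claim can be made concrete with a rough sketch. It rests on two assumptions that are not stated in the summary above: the common approximation that training cost is about 6 FLOPs per parameter per token (C ≈ 6·N·D), and a fixed ratio of roughly 20 training tokens per parameter, a rule of thumb often associated with Chinchilla. The function name and both constants are illustrative, not the authors' fitted values.

```python
import math

# Assumption: training FLOPs C ≈ 6 * N * D (N = parameters, D = tokens).
FLOPS_PER_PARAM_TOKEN = 6.0
# Assumption: rule-of-thumb tokens-per-parameter ratio, not a fitted constant.
TOKENS_PER_PARAM = 20.0


def compute_optimal_allocation(compute_flops, tokens_per_param=TOKENS_PER_PARAM):
    """Split a FLOP budget into (n_params, n_tokens).

    Solves C = 6 * N * D together with D = tokens_per_param * N,
    so both N and D grow as the square root of C.
    """
    if compute_flops <= 0 or tokens_per_param <= 0:
        raise ValueError("compute_flops and tokens_per_param must be positive")
    n_params = math.sqrt(
        compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param)
    )
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Quadrupling the budget doubles both the parameter count and the token count.
for budget in (1e21, 4e21):
    n, d = compute_optimal_allocation(budget)
    print(f"C={budget:.0e} FLOPs -> N={n:.3e} params, D={d:.3e} tokens")
```

Under these assumptions both quantities scale as the square root of the compute budget, which is what scaling "in equal proportion" means here. Changing the assumed ratio shifts the absolute sizes but not that square-root behaviour.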