efficiency

Published: March 29, 2022

Training compute-optimal large language models

By Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford

Research TL;DR

Establishes the "Chinchilla scaling laws" for compute-optimal LLM training. Shows that scaling data tokens is as vital as scaling model parameters.

Abstract

We investigate the optimal allocation of a fixed compute budget when training autoregressive language models. By training over 400 language models, we find that current LLMs are severely undertrained for their compute budgets, and that compute-optimal models should scale parameters and training tokens in equal proportions: each doubling of model size calls for a doubling of training tokens.
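To make the equal-proportions claim concrete, here is a minimal sketch of how a FLOP budget might be split between parameters and tokens. It rests on two assumptions that do not appear on this page: the common approximation that training cost is about 6 FLOPs per parameter per token (C ≈ 6·N·D), and a rule-of-thumb ratio of roughly 20 training tokens per parameter. The function name `compute_optimal_allocation` and both constants are illustrative, not the authors' fitted values.

```python
import math

# Assumption: training cost C ≈ 6 * N * D FLOPs (N = parameters, D = tokens).
FLOPS_PER_PARAM_TOKEN = 6.0
# Assumption: rule-of-thumb tokens-per-parameter ratio; not stated on this page.
TOKENS_PER_PARAM = 20.0


def compute_optimal_allocation(compute_flops, tokens_per_param=TOKENS_PER_PARAM):
    """Split a FLOP budget into (params, tokens) with tokens = ratio * params.

    Solving C = 6 * N * (ratio * N) for N gives N = sqrt(C / (6 * ratio)),
    so both N and D grow with the square root of the compute budget.
    """
    if compute_flops <= 0:
        raise ValueError("compute_flops must be positive")
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


if __name__ == "__main__":
    # Each step multiplies the budget by 4, which doubles both N and D.
    for budget in (1e21, 4e21, 1.6e22):
        n, d = compute_optimal_allocation(budget)
        print(f"C={budget:.1e} FLOPs -> N={n:.3e} params, D={d:.3e} tokens")
```

Under these assumptions, a fourfold increase in compute doubles both the parameter count and the token count, which is the "equal proportions" behavior the abstract describes.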

Related Research

FlashAttention: Fast and memory-efficient exact attention with IO-awareness

LoRA: Low-rank adaptation of large language models