Published: June 23, 2026

Can Scale Save Us From Plasticity Loss in Large Language Models?

By J. Fernando Hernandez-Garcia, Tomás Figliolia, Beren Millidge

Research TL;DR

"The onset of plasticity loss in Transformer LLMs follows a scaling law with model size: larger models delay it but do not prevent it. Plasticity loss also appears under stationary multilingual training, challenging the view that it is exclusive to continual learning with abrupt task changes."

Abstract

The loss of plasticity (that is, of a network's ability to learn new information after having already learned older information) is a fundamental challenge in creating artificial neural networks capable of continual learning. Although this phenomenon has been known for decades, it has mostly been studied in older, relatively small architectures and rarely in natural-language domains. To determine whether loss of plasticity remains a problem in the modern Transformer-based LLM paradigm, we study plasticity loss in GPT-style Transformer models trained on a multilingual continual learning problem. Consistent with prior work, we find evidence of plasticity loss across models ranging from 5M to 314M non-embedding parameters, as measured by deterioration on a held-out Vietnamese probing task. We further find that the onset of plasticity loss follows a predictable scaling law, growing sublinearly with model size. These results suggest that larger models may delay the measurable effects of plasticity loss, but that increasing parameter count alone is likely to be insufficient to prevent it completely. We also find evidence of plasticity loss under stationary multilingual training, challenging the view that the phenomenon is exclusive to continual learning with abrupt task changes. Overall, our results suggest that even large Transformer language models trained on natural language will eventually lose the ability to adapt efficiently to new data after sufficiently long training, in both continual and stationary settings.

Read full paper on arXiv →

