Is One Layer Enough? Training A Single Transformer Layer Can Match Full-Parameter RL Training
By Zijian Zhang, Rizhen Hu, Athanasios Glentis, Dawei Li, Chung-Yiu Yau, Hongzhou Lin, Mingyi Hong
"RL adaptation gains in LLMs are concentrated in a single middle transformer layer; training only that layer can recover or surpass full-parameter RL performance."
Abstract
Reinforcement learning (RL) has become a central component of post-training large language models (LLMs), yet little is understood about how RL adaptation is distributed across transformer layers. Existing approaches typically update all model parameters uniformly, implicitly assuming that every layer contributes similarly to the gains obtained during RL post-training. In this work, we challenge this assumption through a systematic layer-wise study of RL training. Surprisingly, we find that training a single transformer layer can recover most of the gains achieved by full-parameter RL training, and in some cases even surpass it. To quantify this phenomenon, we introduce the quantity layer contribution, which measures the fraction of full RL improvement recovered by training a layer in isolation. Across seven models spanning two model families (Qwen3, Qwen2.5), three RL algorithms (GRPO, GiGPO, Dr. GRPO), and multiple task domains including mathematical reasoning, code generation, and agentic decision-making, we observe a remarkably stable pattern: RL gains are highly concentrated in a small subset of, and in many cases even a single, transformer layers. More strikingly, the same structural pattern consistently emerges: high-contribution layers concentrate in the middle of the transformer stack, while layers near the input and output ends contribute substantially less. The resulting layer rankings remain strongly correlated across datasets, tasks, model families, and RL algorithms.
Technical Analysis & Implementation
Technical Breakdown
Core Methodology
The paper investigates how reinforcement learning (RL) post-training affects different transformer layers in large language models (LLMs). The authors propose a metric called layer contribution to quantify the fraction of full-parameter RL improvement recovered by training only a single layer in isolation. Formally, for a given layer $\ell$, the contribution is:
$$ C_\ell = \frac{\text{Score}_{\text{RL-only}(\ell)} - \text{Score}_{\text{base}}}{\text{Score}_{\text{full-RL}} - \text{Score}_{\text{base}}} $$
where $\text{Score}_{\text{base}}$ is the pretrained model score, $\text{Score}_{\text{full-RL}}$ is the score after full-parameter RL, and $\text{Score}_{\text{RL-only}(\ell)}$ is the score after RL training only on layer $\ell$ while freezing all other layers.
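As a quick numerical illustration of this definition, the sketch below computes $C_\ell$ from three scores. The function name `layer_contribution` and the example scores are hypothetical, chosen only for illustration; they are not results from the paper.

```python
def layer_contribution(score_layer: float, score_base: float, score_full: float) -> float:
    """Fraction of the full-parameter RL gain recovered by training one layer."""
    full_gain = score_full - score_base
    if full_gain == 0:
        # The ratio is undefined when full-parameter RL does not move the score.
        raise ValueError("full-parameter RL shows no gain over the base model")
    return (score_layer - score_base) / full_gain


# Hypothetical accuracies in percent, for illustration only.
base, full = 40.0, 60.0
print(layer_contribution(55.0, base, full))  # 0.75: recovers three quarters of the gain
print(layer_contribution(62.0, base, full))  # 1.1: single layer surpasses full RL
print(layer_contribution(40.0, base, full))  # 0.0: no improvement over base
```

A value above 1 corresponds to the case where single-layer training surpasses full-parameter RL, and a value near 0 means the layer recovers almost none of the gain.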
Experimental Setup
The study spans seven models from two families (Qwen3, Qwen2.5), three RL algorithms (GRPO, GiGPO, Dr. GRPO), and tasks in mathematical reasoning, code generation, and agentic decision-making. For each model and task, the authors run a layer-wise sweep: for each layer in turn, they freeze all other parameters and run the standard RL algorithm (e.g., GRPO), updating only that layer's parameters. The training budget (number of RL steps) is kept the same as for the full-parameter baseline.
Key Findings
1. Concentration of gains: In most cases, training a single layer recovers over 80% of the full-parameter improvement, and sometimes even surpasses it.
2. Location of high-contribution layers: The high-contribution layers consistently appear in the middle of the transformer stack (e.g., layers 20-30 out of 72 for Qwen3-32B), while input and output layers contribute minimally.
3. Robustness: The layer ranking (sorted by contribution) remains highly correlated across different datasets, tasks, and RL algorithms (Spearman correlation >0.9).
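The robustness finding is stated in terms of Spearman rank correlation between layer orderings. A minimal plain-Python sketch of that coefficient follows, applied to two hypothetical per-layer contribution vectors. The values are invented for illustration and are not taken from the paper, and the helper names are assumptions.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical contributions of five layers on two tasks (illustration only).
task_a = [0.05, 0.40, 0.95, 0.70, 0.10]
task_b = [0.10, 0.35, 0.90, 0.80, 0.02]
print(spearman(task_a, task_b))  # 0.9
```

In this toy example only the two lowest-contribution layers swap places between tasks, so the rankings stay strongly correlated even though the raw contribution values differ.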
Implementation Details
The training procedure modifies the standard RL loop to update only the selected layer's parameters. Below is a simplified PyTorch-style code snippet for performing RL training on a single transformer layer:
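```python
import torch
import torch.nn as nn


class SingleLayerRLTrainer:
    def __init__(self, model: nn.Module, layer_idx: int, compute_rl_loss, lr: float = 1e-5):
        self.model = model
        self.layer_idx = layer_idx
        # compute_rl_loss(model, batch) -> scalar loss tensor (e.g., a GRPO objective).
        # It is a placeholder that the caller must supply.
        self.compute_rl_loss = compute_rl_loss

        # Freeze all parameters
        for param in model.parameters():
            param.requires_grad = False

        # Unfreeze the selected layer. The attribute path is model-specific:
        # Hugging Face Qwen checkpoints, for example, typically expose the
        # decoder blocks as `model.model.layers` rather than `model.transformer.layers`.
        layer = model.transformer.layers[layer_idx]
        for param in layer.parameters():
            param.requires_grad = True

        # Only the selected layer's parameters are handed to the optimizer
        self.optimizer = torch.optim.AdamW(layer.parameters(), lr=lr)

    def train_step(self, batch) -> float:
        # Standard RL loss (e.g., from GRPO)
        loss = self.compute_rl_loss(self.model, batch)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return loss.item()
```

Deeper Analysis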
The authors offer an intuition: middle layers act as a "bottleneck" for behaviors learned through RL, while early layers capture general linguistic features and later layers specialize in shaping the final output distribution. The finding suggests that full-parameter RL spends much of its update budget on layers that contribute little, and that targeted layer training could be more compute-efficient.
Implications
This work has practical implications for RL post-training: identifying and training only the most impactful layer(s) shrinks the set of parameters that need gradients and optimizer state, which can reduce memory and computation costs while recovering most of the full-parameter gains and, in some cases, exceeding them. It also raises questions about layer-specific learning dynamics in LLMs.
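To make the memory argument concrete, here is a rough back-of-envelope sketch. It assumes a hypothetical decoder with equally sized layers, fp32 gradients, and AdamW keeping two fp32 moment buffers per trainable parameter. It ignores activations, embeddings, and the frozen weights themselves, which must still be held in memory. The parameter counts are invented for illustration and do not describe any specific model.

```python
def trainable_state_bytes(num_trainable_params: int, bytes_per_value: int = 4) -> int:
    """Bytes for gradients plus AdamW's two moment buffers (one value each per parameter)."""
    grads = num_trainable_params * bytes_per_value
    adam_moments = 2 * num_trainable_params * bytes_per_value
    return grads + adam_moments


# Hypothetical model shape, for illustration only.
params_per_layer = 200_000_000
num_layers = 36

full = trainable_state_bytes(params_per_layer * num_layers)
single = trainable_state_bytes(params_per_layer)

print(f"full-parameter training state: {full / 2**30:.1f} GiB")
print(f"single-layer training state:   {single / 2**30:.1f} GiB")
print(f"reduction factor: {full // single}x")  # 36x, one per layer
```

Under these assumptions the gradient and optimizer state shrink by a factor equal to the number of layers. Compute savings are less clear-cut: the forward pass is unchanged, and backpropagation still has to pass through the frozen layers above the trained one, so the saving depends on where that layer sits in the stack.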