Beyond Adam: SOAP and Muon for Faster, Label-Efficient Training of Machine Learning Interatomic Potentials
By Gil Harari, Yoel Zimmermann, Ola Tangen Kulseng, Laura Zichi, Chuin Wei Tan, Marc L. Descoteaux, Boris Kozinsky
"Replace Adam with SOAP or SOAP-Muon optimizers for faster convergence and better accuracy in training MLIPs like NequIP and Allegro, especially with limited force labels."
Abstract
Machine learning interatomic potentials (MLIPs) have become a hallmark of AI for scientific simulation. While efforts on new architectures and datasets have led to increasingly accurate and general models, the choice of optimizer for training has largely remained unexplored, defaulting to Adam and its variants in the community. Here, we implement and systematically compare a class of recently proposed matrix-structured optimizers, including Muon, SOAP, and the hybrid SOAP-Muon, for training NequIP and Allegro MLIP models. We find that these optimizers can substantially outperform Adam in both convergence speed and final accuracy. SOAP and SOAP-Muon emerge as robust and consistently strong methods, while Muon only provides partial gains relative to Adam. The improvements are particularly pronounced under partial force supervision. Our results indicate that optimizer choice is an overlooked yet impactful design axis for MLIPs.
Technical Analysis & Implementation
Beyond Adam: SOAP and Muon for MLIP Training
Core Methodology
The paper systematically evaluates three matrix-structured optimizers—Muon, SOAP, and a hybrid SOAP-Muon—against Adam for training equivariant MLIPs (NequIP and Allegro). The key insight is that leveraging second-order (or pseudo-second-order) information via structured preconditioning accelerates convergence and improves final accuracy, particularly under partial force supervision.
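SOAP (ShampoO with Adam in the Preconditioner's eigenbasis) builds on the Shampoo algorithm. For a parameter matrix $W \in \mathbb{R}^{m \times n}$, it maintains the preconditioner statistics $L$ and $R$ as running averages of the gradient's Gram matrices: $$L_{t+1} = \beta L_t + (1-\beta) G_t G_t^\top, \quad R_{t+1} = \beta R_t + (1-\beta) G_t^\top G_t$$ where $G_t$ is the gradient. Shampoo itself would then apply the update: $$W_{t+1} = W_t - \eta \cdot L_t^{-1/4} G_t R_t^{-1/4}$$ SOAP instead takes the eigenvectors $Q_L$ and $Q_R$ of $L$ and $R$, rotates the gradient into that eigenbasis as $Q_L^\top G_t Q_R$, runs an Adam update on the rotated gradient, and rotates the resulting step back before applying it to $W$. The eigenbases are typically refreshed only every few steps to keep the cost manageable.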
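To make the structure of a SOAP-style update concrete, here is a minimal NumPy sketch for a single 2-D parameter. It is a simplification under stated assumptions, not the paper's implementation: the function names `init_soap_state` and `soap_step` and all hyperparameter values are hypothetical, the eigenbasis is recomputed with a full eigendecomposition at every step, and NumPy stands in for PyTorch to keep the example dependency-light.

```python
import numpy as np

def init_soap_state(shape):
    """Allocate optimizer state for one (m, n) parameter matrix."""
    m, n = shape
    return {"L": np.zeros((m, m)), "R": np.zeros((n, n)),
            "m": np.zeros(shape), "v": np.zeros(shape), "t": 0}

def soap_step(W, G, state, lr=0.01, beta=0.95, beta1=0.9, beta2=0.999, eps=1e-8):
    """One simplified SOAP-style step: Adam run in the eigenbasis of L and R."""
    state["t"] += 1
    # Running averages of the gradient Gram matrices.
    state["L"] = beta * state["L"] + (1 - beta) * (G @ G.T)
    state["R"] = beta * state["R"] + (1 - beta) * (G.T @ G)
    # Eigenvectors of L and R define the rotated coordinate system.
    # (Recomputed every step here; refreshing them rarely is cheaper.)
    _, QL = np.linalg.eigh(state["L"])
    _, QR = np.linalg.eigh(state["R"])
    # First moment is kept in the original coordinates, second moment in the rotated ones.
    state["m"] = beta1 * state["m"] + (1 - beta1) * G
    G_rot = QL.T @ G @ QR
    state["v"] = beta2 * state["v"] + (1 - beta2) * G_rot**2
    # Bias-corrected Adam step, computed in the rotated coordinates.
    m_hat = (QL.T @ state["m"] @ QR) / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    step_rot = m_hat / (np.sqrt(v_hat) + eps)
    # Rotate the step back and apply it.
    return W - lr * (QL @ step_rot @ QR.T)

# Toy problem: minimize 0.5 * ||W - T||^2, whose gradient is W - T.
T = np.array([[2.0, 0.0, 1.0], [0.0, 1.0, -1.0]])
W = np.zeros_like(T)
state = init_soap_state(W.shape)
loss_before = 0.5 * np.sum((W - T) ** 2)
for _ in range(100):
    W = soap_step(W, W - T, state)
loss_after = 0.5 * np.sum((W - T) ** 2)
print(loss_before, loss_after)
```

The toy quadratic only checks that the step is a descent direction; it says nothing about how SOAP compares with Adam on real MLIP training.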
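Muon uses the Newton-Schulz iteration to approximately orthogonalize the gradient. For a gradient matrix $G$, first normalized so that its singular values are at most 1, the cubic form of the iteration is:

```python
X = G / G.norm()           # Frobenius normalization keeps the iteration convergent
I = torch.eye(G.shape[0])  # identity matching the row dimension of G
for _ in range(K):
    X = 0.5 * (3 * I - X @ X.T) @ X  # pushes every singular value toward 1
```

Then $W_{t+1} = W_t - \eta \cdot X$ (after scaling). Since $X$ approximates the orthogonal factor $U V^\top$ of $G = U \Sigma V^\top$, Muon effectively enforces a near-orthogonal update direction. The reference Muon implementation applies the iteration to a momentum buffer rather than the raw gradient and uses a tuned quintic polynomial in place of the cubic one shown here.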
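As a quick sanity check on the orthogonalization idea, the sketch below runs a cubic Newton-Schulz iteration in NumPy on a small rectangular matrix and compares the result with the exact orthogonal factor obtained from an SVD. The function name `newton_schulz_orthogonalize`, the example matrix, and the choice of 10 iterations are assumptions made for illustration; the cubic form is written as $\tfrac{1}{2}(3I - XX^\top)X$ so that it also works for non-square inputs.

```python
import numpy as np

def newton_schulz_orthogonalize(G, num_iters=10):
    """Approximate the orthogonal polar factor of G with a cubic Newton-Schulz iteration."""
    # The Frobenius norm upper-bounds the spectral norm, so after scaling
    # every singular value lies in (0, 1], inside the convergence region.
    X = G / np.linalg.norm(G)
    I = np.eye(G.shape[0])
    for _ in range(num_iters):
        X = 0.5 * (3.0 * I - X @ X.T) @ X
    return X

G = np.array([[2.0, 1.0], [1.0, 3.0], [0.0, 1.0]])  # arbitrary 3x2 stand-in for a gradient
X = newton_schulz_orthogonalize(G)

# Exact orthogonal factor U V^T from the thin SVD, for comparison.
U, _, Vt = np.linalg.svd(G, full_matrices=False)
polar_error = np.abs(X - U @ Vt).max()
orth_error = np.abs(X.T @ X - np.eye(2)).max()
print(polar_error, orth_error)
```

Each iteration maps a singular value $s$ to $1.5s - 0.5s^3$, so small singular values grow by roughly 1.5x per step before converging quickly near 1. Badly conditioned matrices therefore need more iterations than this example does.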
SOAP-Muon Hybrid: Apply Muon to weight matrices (typically square or near-square) and SOAP to biases, embeddings, and other non-matrix parameters. This combines the benefits of both: Muon's fast convergence on large weight matrices and SOAP's robustness on remaining parameters.
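One way to picture the split described above is a routine that sorts parameters into two groups by shape and name. The sketch below is an assumption about how such routing could look, not the paper's code: `split_parameters`, the rule of keying on two dimensions and the substring "embed", and every parameter name and shape are hypothetical.

```python
def split_parameters(named_shapes):
    """Route 2-D weight matrices to one group and all other parameters to another."""
    muon_group, soap_group = [], []
    for name, shape in named_shapes.items():
        # Treat a parameter as a Muon candidate only if it is a 2-D matrix
        # and is not an embedding table (hypothetical naming rule).
        is_weight_matrix = len(shape) == 2 and "embed" not in name
        if is_weight_matrix:
            muon_group.append(name)
        else:
            soap_group.append(name)
    return muon_group, soap_group

# Hypothetical parameter names and shapes, for illustration only.
shapes = {
    "embedding.weight": (100, 64),
    "layers.0.linear.weight": (64, 64),
    "layers.0.linear.bias": (64,),
    "readout.weight": (64, 1),
}
muon_names, soap_names = split_parameters(shapes)
print("Muon:", muon_names)
print("SOAP:", soap_names)
```

In a PyTorch setting the same rule could be applied to `model.named_parameters()`, with each group then passed to its own optimizer or parameter group.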
Implementation Details
The optimizers are implemented on top of the torch.optim API; Muon uses 5 Newton-Schulz iterations. The code snippet below illustrates training with SOAP-Muon:

```python
import torch

# `soap_muon` stands for the SOAP-Muon optimizer class described in the paper.
# It is assumed to be defined elsewhere, as are `model`, `dataloader`, `epochs`,
# and `energy_force_loss`.
optimizer = soap_muon(model.parameters(), lr=0.001, muon_modules=[model.layers])

for epoch in range(epochs):
    for batch in dataloader:
        pred = model(batch['positions'], batch['atomic_numbers'])
        loss = energy_force_loss(pred['energy'], pred['forces'], batch['energy'], batch['forces'])
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()  # clear gradients before the next batch
```

Key Results
- SOAP and SOAP-Muon consistently outperform Adam in both convergence speed (fewer steps to reach target accuracy) and final validation error.
- Muon alone provides partial gains but is less robust than SOAP or the hybrid.
- Under partial force supervision (e.g., only 10% of atoms have force labels), the improvement is even more pronounced: SOAP and SOAP-Muon achieve ~30% lower force errors than Adam.
- Experiments on NequIP and Allegro with the 3BPA dataset (small organic molecules) and a water dataset show consistent benefits.
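The partial force supervision setting mentioned in the results can be pictured as a force loss that only counts atoms carrying a label. The NumPy sketch below is an assumption about how such masking might enter the loss; the text here does not give the paper's exact loss, and the name `masked_force_mse` and the toy arrays are hypothetical.

```python
import numpy as np

def masked_force_mse(pred_forces, true_forces, has_label):
    """Mean squared force error over labeled atoms only.

    pred_forces, true_forces: arrays of shape (n_atoms, 3)
    has_label: boolean array of shape (n_atoms,), True where a force label exists
    """
    n_labeled = int(has_label.sum())
    if n_labeled == 0:
        return 0.0  # no force supervision available for this structure
    sq_err = (pred_forces - true_forces) ** 2
    # Average over the 3 Cartesian components of every labeled atom.
    return float(sq_err[has_label].sum() / (3 * n_labeled))

pred = np.zeros((4, 3))
true = np.array([[1.0, 0.0, 0.0],
                 [0.0, 2.0, 0.0],
                 [9.0, 9.0, 9.0],   # unlabeled atom: must not affect the loss
                 [0.0, 0.0, 0.0]])
mask = np.array([True, True, False, False])
loss = masked_force_mse(pred, true, mask)
print(loss)
```

With fewer labeled atoms, each step's gradient is built from less information, which is one plausible reason the choice of optimizer would matter more in this regime.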
Theoretical Insight
The advantage stems from better conditioning of the optimization landscape. Matrix-structured preconditioning captures gradient correlations across parameters, which is particularly beneficial for equivariant architectures where weight matrices exhibit structured symmetries. Muon's orthogonalization equalizes the singular values of each update, so a few dominant gradient directions cannot crowd out the rest, while SOAP's Kronecker-factored approximation of the gradient second-moment matrix reduces ill-conditioning.
Impact
This work highlights that optimizer choice is an underappreciated design axis for MLIPs. The recommended default is SOAP-Muon (or SOAP alone) for faster training and improved accuracy, reducing computational costs and label requirements.