Self-Evolving World Models for LLM Agent Planning
By Xuan Zhang, Wenxuan Zhang, See-Kiong Ng, Yang Deng
"WorldEvolver improves LLM agent planning by equipping world models with episodic and semantic memory that evolve at test time via retrieval and rule extraction, boosting prediction accuracy and downstream task success."
Abstract
World models offer a principled way to equip long-horizon LLM agents with foresight: predictions of action consequences before execution. However, unreliable foresight can be ignored, misused, or even degrade downstream decision-making. In this paper, we introduce WorldEvolver, a self-evolving world model framework that revises its deployment-time context while keeping the downstream agent and all model parameters frozen. WorldEvolver integrates three modules: (i) Episodic Memory, which exploits real action transitions through retrieval-based simulation; (ii) Semantic Memory, which extracts persistent heuristic rules from prediction-observation mismatches; and (iii) Selective Foresight, which filters low-confidence predictions before integrating them into agent reasoning context. We evaluate WorldEvolver on ALFWorld and ScienceWorld, measuring world model prediction accuracy on Word2World and downstream agent success rate on AgentBoard. Extensive experiments show that WorldEvolver achieves the highest prediction accuracy across three backbones and leads other world model baselines on downstream agent success rate, demonstrating that test-time memory revision enhances both predictive fidelity and planning performance.
Technical Analysis & Implementation
Core Idea
World models for LLM agents often suffer from unreliable foresight. WorldEvolver introduces a self-evolving mechanism that revises the world model's context during deployment, using memory modules to store and generalize from past experiences while the downstream agent and all model parameters stay frozen.
Methodology
WorldEvolver comprises three modules:
- Episodic Memory: Stores actual state-action-next_state transitions from environment interactions. At test time, given a query state and action, it retrieves the most similar past transition via cosine similarity and uses the observed next state as a prediction.
- Semantic Memory: Extracts heuristic rules from prediction-observation mismatches. When the world model's prediction differs from the actual outcome, the system forms a rule (e.g., "if state has feature X, action Y leads to Z") and stores it in semantic memory. Rules are represented as key-value pairs with a confidence score, updated over time.
- Selective Foresight: Filters low-confidence predictions before they are used in agent reasoning. A confidence score is computed for each prediction from the world model (e.g., based on softmax entropy or retrieval distance). Only predictions with confidence above a threshold $\tau$ (empirically set to 0.7) are passed to the agent's context.
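The Selective Foresight filter can be illustrated with a small sketch. It assumes the confidence score is derived from the normalized entropy of a probability distribution over candidate outcomes, which is only one of the options mentioned above; the function names `entropy_confidence` and `selective_foresight` are hypothetical and not taken from the paper.

```python
import math


def entropy_confidence(probs):
    """Map a probability distribution to a confidence in [0, 1].

    Returns 1 minus the Shannon entropy normalised by its maximum, log(n),
    so a one-hot distribution scores 1.0 and a uniform one scores 0.0.
    """
    n = len(probs)
    if n < 2:
        return 1.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return 1.0 - entropy / math.log(n)


def selective_foresight(prediction, probs, tau=0.7):
    """Return the prediction only if its confidence clears the threshold tau."""
    return prediction if entropy_confidence(probs) >= tau else None


# A peaked distribution passes the filter; a flat one is suppressed.
confident = selective_foresight("the drawer is open", [0.97, 0.01, 0.01, 0.01])
uncertain = selective_foresight("the drawer is open", [0.4, 0.3, 0.2, 0.1])
```

Returning `None` for a suppressed prediction lets the caller simply omit the foresight from the agent's context instead of passing along a guess the agent might misuse.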
Formally, the world model produces a predicted next state $\hat{s}_{t+1} = f_\theta(s_t, a_t)$. The episodic memory retrieves the closest matching transition $(s', a', s'_{next})$ and outputs $\hat{s}_{t+1}^{ep} = s'_{next}$. Semantic memory applies applicable rules to adjust the prediction. The final prediction is a weighted combination: $\hat{s}_{t+1}^{final} = \alpha \hat{s}_{t+1} + (1-\alpha) \hat{s}_{t+1}^{ep}$ with $\alpha$ determined by confidence.
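A minimal numeric sketch of the retrieval and blending steps follows. It assumes that states are fixed-length vectors, that each memory entry pairs a state-action embedding with the observed next state, and that $\alpha$ is supplied by the caller, since the mapping from confidence to $\alpha$ is not specified here. The helper names `retrieve` and `blend` are hypothetical.

```python
import math


def cosine_sim(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(memory, query):
    """Return (similarity, next_state) of the stored transition closest to query."""
    scored = ((cosine_sim(query, key), nxt) for key, nxt in memory)
    return max(scored, key=lambda pair: pair[0])


def blend(model_pred, episodic_pred, alpha):
    """Compute alpha * model_pred + (1 - alpha) * episodic_pred element-wise."""
    return [alpha * m + (1 - alpha) * e for m, e in zip(model_pred, episodic_pred)]


# Memory entries pair a state-action embedding with the observed next state.
memory = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.0, 1.0], [1.0, 0.0]),
]
similarity, ep_next = retrieve(memory, query=[0.9, 0.1])
final = blend(model_pred=[0.5, 0.5], episodic_pred=ep_next, alpha=0.75)
print(final)  # [0.375, 0.625]
```

The retrieval similarity is returned alongside the next state so that it could also serve as a confidence signal for setting $\alpha$ or for filtering, though that coupling is an assumption of this sketch.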
Implementation
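```python
import numpy as np


def cosine_sim(a, b):
    """Cosine similarity of two 1-D vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0


class WorldEvolver:
    # Simplified sketch. Assumes states and actions are 1-D NumPy embeddings and
    # that extract_rule, has_feature, and apply_rule are supplied by the caller.
    def __init__(self, world_model, extract_rule, has_feature, apply_rule,
                 confidence_threshold=0.7):
        self.world_model = world_model  # frozen LLM or other model
        self.extract_rule = extract_rule  # (state, action, actual_next) -> (feature, rule)
        self.has_feature = has_feature  # (state, feature) -> bool
        self.apply_rule = apply_rule  # (prediction, rule) -> adjusted prediction
        self.episodic_memory = []  # list of (state, action, next_state) tuples
        self.semantic_rules = {}  # dict: (state_feature, action_key) -> rule
        self.confidence_threshold = confidence_threshold

    def predict(self, state, action):
        # Base prediction
        pred_next, confidence = self.world_model.predict(state, action)
        # Episodic retrieval
        if self.episodic_memory:
            query = np.concatenate([state, action])
            best_match = max(
                self.episodic_memory,
                key=lambda x: cosine_sim(query, np.concatenate([x[0], x[1]])),
            )
            _, _, ep_next = best_match
            # Simple combination with a fixed alpha = 0.7
            pred_next = 0.7 * pred_next + 0.3 * ep_next
        # Semantic rule application
        for (feat, act), rule in self.semantic_rules.items():
            if act == tuple(action) and self.has_feature(state, feat):
                pred_next = self.apply_rule(pred_next, rule)
        # Selective foresight
        if confidence < self.confidence_threshold:
            return None  # suppress prediction
        return pred_next

    def update(self, state, action, actual_next):
        self.episodic_memory.append((state, action, actual_next))
        # Check mismatch and form rule logic (simplified)
        pred, _ = self.world_model.predict(state, action)
        if not np.allclose(pred, actual_next):
            feat, rule = self.extract_rule(state, action, actual_next)
            # Key by (feature, action) so predict() can look the rule up
            self.semantic_rules[(feat, tuple(action))] = rule
```
Key Results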
- On the Word2World benchmark, WorldEvolver achieves the highest prediction accuracy across three backbone world models (LLaMA, T5, GPT-2).
- On AgentBoard, downstream task success rate improves by 8-15% over baselines, showing that test-time memory revision enhances both predictive fidelity and planning.
- Ablations confirm each module contributes positively.
Significance
WorldEvolver demonstrates that test-time adaptation via memory can significantly improve world model reliability for LLM agents, without expensive retraining. This opens a path for continuously improving agent behavior through experience.