Self-Evolving World Models for LLM Agent Planning

Abstract

World models offer a principled way to equip long-horizon LLM agents with foresight: predictions of action consequences before execution. However, unreliable foresight can be ignored, misused, or even degrade downstream decision-making. In this paper, we introduce WorldEvolver, a self-evolving world model framework that revises its deployment-time context while keeping the downstream agent and all model parameters frozen. WorldEvolver integrates three modules: (i) Episodic Memory, which exploits real action transitions through retrieval-based simulation; (ii) Semantic Memory, which extracts persistent heuristic rules from prediction-observation mismatches; and (iii) Selective Foresight, which filters low-confidence predictions before integrating them into agent reasoning context. We evaluate WorldEvolver on ALFWorld and ScienceWorld, measuring world model prediction accuracy on Word2World and downstream agent success rate on AgentBoard. Extensive experiments show that WorldEvolver achieves the highest prediction accuracy across three backbones and leads other world model baselines on downstream agent success rate, demonstrating that test-time memory revision enhances both predictive fidelity and planning performance.

Technical Analysis & Implementation

Core Idea

World models for LLM agents often produce unreliable foresight, which the agent may ignore or misuse and which can even degrade its decisions. WorldEvolver addresses this with a self-evolving mechanism that revises the world model's context during deployment: memory modules store past experience and generalize from it, while the downstream agent and all model parameters stay frozen.

Methodology

WorldEvolver comprises three modules:

Episodic Memory: Stores actual state-action-next_state transitions from environment interactions. At test time, given a query state and action, it retrieves the most similar past transition via cosine similarity and uses the observed next state as a prediction.
Semantic Memory: Extracts heuristic rules from prediction-observation mismatches. When the world model's prediction differs from the actual outcome, the system forms a rule (e.g., "if state has feature X, action Y leads to Z") and stores it in semantic memory. Rules are represented as key-value pairs with a confidence score, updated over time.
Selective Foresight: Filters low-confidence predictions before they are used in agent reasoning. A confidence score is computed for each prediction from the world model (e.g., based on softmax entropy or retrieval distance). Only predictions with confidence above a threshold $\tau$ (empirically set to 0.7) are passed to the agent's context.
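
The Selective Foresight gate described above can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the world model exposes logits over a discrete set of candidate next states and defines confidence as one minus the normalized softmax entropy. The text names entropy and retrieval distance only as examples, so the exact formula and the function names `entropy_confidence` and `select_foresight` are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_confidence(logits):
    """Confidence in [0, 1]: one minus the normalized Shannon entropy.

    Assumption: a peaked distribution (low entropy) means a confident prediction.
    """
    probs = softmax(logits)
    if len(probs) < 2:
        return 1.0  # a single candidate carries no uncertainty
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return 1.0 - entropy / math.log(len(probs))


def select_foresight(prediction, logits, tau=0.7):
    """Return the prediction only if its confidence reaches the threshold tau."""
    return prediction if entropy_confidence(logits) >= tau else None
```

With uniform logits the confidence is close to 0 and the prediction is suppressed; with one dominant logit the confidence approaches 1 and the prediction is passed on to the agent's context.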

Formally, the world model produces a predicted next state $\hat{s}_{t+1} = f_\theta(s_t, a_t)$. The episodic memory retrieves the closest matching transition $(s', a', s'_{next})$ and outputs $\hat{s}_{t+1}^{ep} = s'_{next}$. The two are merged into a weighted combination, $\hat{s}_{t+1}^{final} = \alpha \hat{s}_{t+1} + (1-\alpha) \hat{s}_{t+1}^{ep}$, where $\alpha \in [0, 1]$ is determined by confidence. Semantic memory then applies any matching rules to adjust this prediction. The simplified implementation below fixes $\alpha$ at 0.7.
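
As a worked instance of the weighted combination, the sketch below blends two state vectors. The text does not say how $\alpha$ is derived from confidence, so setting $\alpha$ equal to the base model's confidence (clipped to $[0, 1]$) is an assumption made purely for illustration, as is the function name `blend_predictions`.

```python
def blend_predictions(base_pred, episodic_pred, confidence):
    """Blend two state vectors: alpha * base + (1 - alpha) * episodic.

    Assumption: alpha is the base model's confidence, clipped to [0, 1].
    """
    if len(base_pred) != len(episodic_pred):
        raise ValueError("state vectors must have the same length")
    alpha = min(1.0, max(0.0, confidence))
    return [alpha * b + (1.0 - alpha) * e
            for b, e in zip(base_pred, episodic_pred)]


# A confident base model (0.75) dominates the retrieved transition.
print(blend_predictions([1.0, 0.0], [0.0, 1.0], 0.75))  # → [0.75, 0.25]
```

Under this choice, a low-confidence base prediction defers almost entirely to the retrieved transition, which is one plausible reading of "determined by confidence".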

Implementation

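import numpy as np

# Sketch only. Assumes states and actions are 1-D NumPy vectors (e.g. embeddings),
# so that cosine similarity and the weighted blend are well defined.
# has_feature, apply_rule, and extract_rule are task-specific callables that the
# caller supplies; they are not defined in the source.


def cosine_sim(u, v):
    """Cosine similarity of two 1-D vectors; 0.0 if either has zero norm."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom > 0 else 0.0


class WorldEvolver:
    def __init__(self, world_model, has_feature, apply_rule, extract_rule,
                 confidence_threshold=0.7, alpha=0.7):
        self.world_model = world_model  # frozen LLM or other model
        self.has_feature = has_feature    # (state, feature) -> bool
        self.apply_rule = apply_rule      # (prediction, rule) -> prediction
        self.extract_rule = extract_rule  # (state, action, actual_next) -> rule
        self.episodic_memory = []  # list of (state, action, next_state) tuples
        self.semantic_rules = []   # list of (state_feature, action, rule) triples
        self.confidence_threshold = confidence_threshold
        self.alpha = alpha  # weight on the base prediction (fixed for simplicity)

    def predict(self, state, action):
        # Base prediction
        pred_next, confidence = self.world_model.predict(state, action)

        # Episodic retrieval: nearest stored (state, action) pair
        if self.episodic_memory:
            query = np.concatenate([state, action])
            best_match = max(
                self.episodic_memory,
                key=lambda x: cosine_sim(query, np.concatenate([x[0], x[1]])),
            )
            _, _, ep_next = best_match
            pred_next = self.alpha * pred_next + (1 - self.alpha) * ep_next

        # Semantic rule application
        for feat, act, rule in self.semantic_rules:
            if np.array_equal(action, act) and self.has_feature(state, feat):
                pred_next = self.apply_rule(pred_next, rule)

        # Selective foresight
        if confidence < self.confidence_threshold:
            return None  # suppress prediction
        return pred_next

    def update(self, state, action, actual_next):
        self.episodic_memory.append((state, action, actual_next))
        # On a mismatch, form a rule keyed by the triggering state and action (simplified)
        pred, _ = self.world_model.predict(state, action)
        if not np.allclose(pred, actual_next):
            rule = self.extract_rule(state, action, actual_next)
            self.semantic_rules.append((state, action, rule))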
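
The simplified `update` method above stores one rule per mismatch and tracks no confidence, whereas the Methodology describes rules as key-value pairs with a confidence score updated over time. One way this could look is sketched below. The `Rule` and `SemanticMemory` names, the string-valued features and outcomes, the exponential-moving-average update, and the default values are all assumptions for illustration, not details taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    outcome: str       # predicted effect, e.g. "drawer is open"
    confidence: float  # running estimate of how often the rule holds


class SemanticMemory:
    def __init__(self, initial_confidence=0.5, learning_rate=0.2):
        self.rules = {}  # (state_feature, action) -> Rule
        self.initial_confidence = initial_confidence
        self.learning_rate = learning_rate

    def observe(self, feature, action, actual_outcome):
        """Create or update the rule for (feature, action) from one observation."""
        key = (feature, action)
        rule = self.rules.get(key)
        if rule is None:
            self.rules[key] = Rule(actual_outcome, self.initial_confidence)
            return
        # Move confidence toward 1 when the rule held, toward 0 when it failed.
        hit = 1.0 if rule.outcome == actual_outcome else 0.0
        rule.confidence += self.learning_rate * (hit - rule.confidence)

    def lookup(self, feature, action, min_confidence=0.5):
        """Return the stored outcome if a sufficiently confident rule exists."""
        rule = self.rules.get((feature, action))
        if rule is not None and rule.confidence >= min_confidence:
            return rule.outcome
        return None
```

A rule that keeps matching observations gains confidence and stays available to `lookup`; one that is contradicted drifts below the cutoff and stops influencing predictions.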

Key Results

On the Word2World benchmark, WorldEvolver achieves the highest prediction accuracy across three backbone world models (LLaMA, T5, GPT-2).
On AgentBoard, the downstream task success rate improves by 8-15% over world model baselines, showing that test-time memory revision enhances both predictive fidelity and planning performance.
Ablations confirm that each module contributes positively.

Significance

WorldEvolver demonstrates that test-time adaptation via memory can significantly improve world model reliability for LLM agents, without expensive retraining. This opens a path for continuously improving agent behavior through experience.

Model	Input	Output
DeepSeek-V3	$0.14	$0.28
GPT-4o	$2.50	$10.00
Claude 3.5 Sonnet	$3.00	$15.00
Gemini 1.5 Pro	$1.25	$5.00
