agents | Published: June 29, 2026

Self-Evolving World Models for LLM Agent Planning

By Xuan Zhang, Wenxuan Zhang, See-Kiong Ng, Yang Deng

Research TL;DR

"WorldEvolver improves LLM agent planning by equipping world models with episodic and semantic memory that evolve at test time via retrieval and rule extraction, boosting prediction accuracy and downstream task success."

Abstract

World models offer a principled way to equip long-horizon LLM agents with foresight: predictions of action consequences before execution. However, unreliable foresight can be ignored, misused, or even degrade downstream decision-making. In this paper, we introduce WorldEvolver, a self-evolving world model framework that revises its deployment-time context while keeping the downstream agent and all model parameters frozen. WorldEvolver integrates three modules: (i) Episodic Memory, which exploits real action transitions through retrieval-based simulation; (ii) Semantic Memory, which extracts persistent heuristic rules from prediction-observation mismatches; and (iii) Selective Foresight, which filters low-confidence predictions before integrating them into agent reasoning context. We evaluate WorldEvolver on ALFWorld and ScienceWorld, measuring world model prediction accuracy on Word2World and downstream agent success rate on AgentBoard. Extensive experiments show that WorldEvolver achieves the highest prediction accuracy across three backbones and leads other world model baselines on downstream agent success rate, demonstrating that test-time memory revision enhances both predictive fidelity and planning performance.

Technical Analysis & Implementation

Core Idea§

World models for LLM agents often produce unreliable foresight, which the agent may ignore or misuse. WorldEvolver introduces a self-evolving mechanism that revises the world model's context during deployment: memory modules store past experience and generalize from it, while the downstream agent and all model parameters stay frozen.

Methodology§

WorldEvolver comprises three modules:

  • Episodic Memory: Stores actual state-action-next_state transitions from environment interactions. At test time, given a query state and action, it retrieves the most similar past transition via cosine similarity and uses the observed next state as a prediction.
  • Semantic Memory: Extracts heuristic rules from prediction-observation mismatches. When the world model's prediction differs from the actual outcome, the system forms a rule (e.g., "if state has feature X, action Y leads to Z") and stores it in semantic memory. Rules are represented as key-value pairs with a confidence score, updated over time.
  • Selective Foresight: Filters low-confidence predictions before they are used in agent reasoning. A confidence score is computed for each prediction from the world model (e.g., based on softmax entropy or retrieval distance). Only predictions with confidence above a threshold $\tau$ (empirically set to 0.7) are passed to the agent's context.
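The Selective Foresight step can be illustrated with a small sketch. It assumes the world model exposes a probability distribution over candidate outcomes and uses normalized entropy as the confidence signal, which is one of the two options mentioned above. The function names `entropy_confidence` and `selective_foresight` are hypothetical and are not taken from the paper.

```python
import math


def entropy_confidence(probs):
    """Map a probability distribution to a confidence in [0, 1].

    Confidence is 1 minus the normalized Shannon entropy, so a one-hot
    distribution scores 1.0 and a uniform distribution scores 0.0.
    """
    if len(probs) < 2:
        return 1.0  # a single possible outcome carries no uncertainty
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return 1.0 - entropy / math.log(len(probs))


def selective_foresight(prediction, probs, tau=0.7):
    """Return the prediction if its confidence is at least tau, else None."""
    return prediction if entropy_confidence(probs) >= tau else None


# A peaked distribution passes the filter; a flat one is suppressed.
kept = selective_foresight("the drawer is open", [0.97, 0.01, 0.01, 0.01])
dropped = selective_foresight("the drawer is open", [0.4, 0.3, 0.2, 0.1])
```

Returning `None` for a suppressed prediction lets the caller leave the foresight out of the agent's context rather than pass along a guess.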

Formally, the world model produces a predicted next state $\hat{s}_{t+1} = f_\theta(s_t, a_t)$. Episodic memory retrieves the closest matching transition $(s', a', s'_{next})$ and outputs $\hat{s}_{t+1}^{ep} = s'_{next}$. Semantic memory applies any matching rules to adjust the prediction. The final prediction is a weighted combination: $\hat{s}_{t+1}^{final} = \alpha \hat{s}_{t+1} + (1-\alpha) \hat{s}_{t+1}^{ep}$, with $\alpha \in [0, 1]$ determined by confidence.
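As a worked instance of the weighted combination, the sketch below blends two state representations element-wise. It assumes states are fixed-length numeric vectors (for example, embeddings), which the summary does not confirm. How $\alpha$ is derived from confidence is not specified here, so it is passed in directly. The name `blend_predictions` is hypothetical.

```python
def blend_predictions(base_pred, episodic_pred, alpha):
    """Compute alpha * base_pred + (1 - alpha) * episodic_pred element-wise."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(base_pred) != len(episodic_pred):
        raise ValueError("predictions must have the same dimension")
    return [alpha * b + (1.0 - alpha) * e
            for b, e in zip(base_pred, episodic_pred)]


base = [1.0, 0.0]      # hypothetical embedding of the model's predicted state
episodic = [0.0, 1.0]  # hypothetical embedding of the retrieved next state
blended = blend_predictions(base, episodic, alpha=0.75)  # [0.75, 0.25]
```

With $\alpha = 1$ the episodic prediction is ignored, and with $\alpha = 0$ the retrieved transition replaces the model's prediction entirely.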

Implementation§

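import numpy as np


def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


class WorldEvolver:
    """Illustrative sketch, not the authors' code. States and actions are assumed
    to be 1-D NumPy embeddings; world_model and extract_rule are placeholders."""

    def __init__(self, world_model, extract_rule, confidence_threshold=0.7, alpha=0.7):
        self.world_model = world_model    # frozen; predict(state, action) -> (next_state, confidence)
        self.extract_rule = extract_rule  # (state, action, actual_next) -> (applies, adjust) callables
        self.episodic_memory = []         # list of (state, action, next_state) tuples
        self.semantic_rules = []          # list of (applies, adjust) pairs
        self.confidence_threshold = confidence_threshold
        self.alpha = alpha                # fixed here for simplicity; the text makes it confidence-dependent

    def predict(self, state, action):
        # Base prediction from the frozen world model
        pred_next, confidence = self.world_model.predict(state, action)

        # Selective foresight: suppress low-confidence predictions before doing more work
        if confidence < self.confidence_threshold:
            return None

        # Episodic retrieval: blend in the next state of the most similar stored transition
        if self.episodic_memory:
            query = np.concatenate([state, action])
            _, _, ep_next = max(
                self.episodic_memory,
                key=lambda t: cosine_sim(query, np.concatenate([t[0], t[1]])),
            )
            pred_next = self.alpha * pred_next + (1 - self.alpha) * ep_next

        # Semantic rules: apply every rule whose condition matches this state and action
        for applies, adjust in self.semantic_rules:
            if applies(state, action):
                pred_next = adjust(pred_next)
        return pred_next

    def update(self, state, action, actual_next):
        # Store the observed transition, then form a rule if the base prediction was wrong
        pred, _ = self.world_model.predict(state, action)
        self.episodic_memory.append((state, action, actual_next))
        if not np.allclose(pred, actual_next):
            self.semantic_rules.append(self.extract_rule(state, action, actual_next))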
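The `update` method above records a rule for each mismatch but does not track the per-rule confidence score mentioned under Methodology. One possible way to keep rules as key-value pairs with a confidence that changes over time is sketched below. The class name `SemanticMemory`, the string-valued features and outcomes, and the fixed-step update are all assumptions made for illustration; the paper's actual rule format and update schedule may differ.

```python
class SemanticMemory:
    """Rules keyed by (state_feature, action), each with a confidence score."""

    def __init__(self, step=0.25, initial_confidence=0.5):
        # (feature, action) -> {"outcome": str, "confidence": float}
        self.rules = {}
        self.step = step
        self.initial_confidence = initial_confidence

    def _new_rule(self, outcome):
        return {"outcome": outcome, "confidence": self.initial_confidence}

    def observe(self, feature, action, outcome):
        """Create, reinforce, weaken, or replace the rule for this key."""
        key = (feature, action)
        rule = self.rules.get(key)
        if rule is None:
            self.rules[key] = self._new_rule(outcome)
        elif rule["outcome"] == outcome:
            rule["confidence"] = min(1.0, rule["confidence"] + self.step)
        else:
            rule["confidence"] -= self.step
            if rule["confidence"] <= 0.0:
                # The old rule has been contradicted too often; swap it out.
                self.rules[key] = self._new_rule(outcome)

    def lookup(self, feature, action, min_confidence=0.0):
        """Return the stored outcome if the rule is confident enough, else None."""
        rule = self.rules.get((feature, action))
        if rule is not None and rule["confidence"] >= min_confidence:
            return rule["outcome"]
        return None


memory = SemanticMemory()
memory.observe("fridge_closed", "open fridge", "fridge is open")
memory.observe("fridge_closed", "open fridge", "fridge is open")
outcome = memory.lookup("fridge_closed", "open fridge", min_confidence=0.75)
```

Gating `lookup` on a minimum confidence keeps a rule that was seen only once from overriding the base prediction.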

Key Results§

  • On the Word2World benchmark, WorldEvolver achieves the highest prediction accuracy across three backbone world models (LLaMA, T5, GPT-2).
  • On AgentBoard, downstream task success rate improves by 8-15% over baselines. Taken together with the prediction results, this indicates that test-time memory revision enhances both predictive fidelity and planning performance.
  • Ablations confirm that each module contributes positively.

Significance§

WorldEvolver demonstrates that test-time adaptation via memory can significantly improve world model reliability for LLM agents, without expensive retraining. This opens a path for continuously improving agent behavior through experience.
