agents · Published: June 25, 2026

Empowering GUI Agents via Autonomous Experience Exploration and Hindsight Experience Utilization for Task Planning

By Tianyi Men, Zhuoran Jin, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao

Research TL;DR

"PEEU lets small MLLMs autonomously explore GUI environments and use hindsight to synthesize high-level planning experiences, boosting compositional generalization and outperforming larger models on web tasks."

Abstract

Multimodal web agents can assist humans in operating repetitive GUI tasks, where effective task planning is essential for decomposing complex tasks into executable actions. While small open source MLLMs are cost efficient and privacy preserving compared with commercial large models, they suffer from weak planning and limited cross website generalization. To address these limitations, we introduce the planning experience exploration and utilization (PEEU) method, which autonomously explores environments to discover experiences and utilizes hindsight experience to synthesize strictly aligned, high level training data. To quantitatively analyze the generalization behaviors driving this performance, we propose the task decomposition hierarchical analysis framework (TDHAF) to systematically study compositional generalization across three task granularities: low, middle and high levels. Our analysis reveals that mastering low level atomic skills does not guarantee high level planning competence, while high level task training yields stronger OOD generalization. Experiments on real world benchmarks demonstrate PEEU's superior effectiveness: our 7B model achieves 30.6% accuracy, outperforming the much larger Qwen2.5-VL-32B model. These demonstrate constructing hindsight high level tasks and leveraging experiences is crucial for OOD planning abilities of small MLLMs.

Technical Analysis & Implementation

Methodology

The Planning Experience Exploration and Utilization (PEEU) method enables small Multimodal Large Language Models (MLLMs) to improve task planning for GUI agents. It consists of two phases:

Autonomous Experience Exploration

An agent built on a small MLLM interacts with a GUI environment (e.g., web pages) through a set of atomic skills (e.g., click, type, scroll). It explores at random and records trajectories $\tau = (o_1, a_1, \dots, o_T, a_T)$, where $o_t$ is the observation (screenshot + DOM) and $a_t$ is the action at step $t$. Trajectories that complete a task are stored as positive experiences; those that do not are kept as negative experiences.
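The exploration phase can be sketched as follows. This is a minimal illustration, not the authors' implementation: `ToyEnv`, `Trajectory`, `explore`, and `collect` are hypothetical names, and the toy success rule (two clicks) stands in for a real task checker on a web page.

```python
import random
from dataclasses import dataclass, field

ATOMIC_SKILLS = ["click", "type", "scroll"]  # skill names taken from the text


@dataclass
class Trajectory:
    observations: list = field(default_factory=list)  # o_1 .. o_T
    actions: list = field(default_factory=list)       # a_1 .. a_T
    success: bool = False


class ToyEnv:
    """Hypothetical stand-in for a GUI environment.

    The task counts as completed once 'click' has been chosen `needed` times.
    """

    def __init__(self, needed=2):
        self.needed = needed
        self.clicks = 0

    def reset(self):
        self.clicks = 0
        return {"screenshot": None, "dom": "<button/>"}

    def step(self, action):
        if action == "click":
            self.clicks += 1
        obs = {"screenshot": None, "dom": f"<button clicks={self.clicks}/>"}
        return obs, self.clicks >= self.needed


def explore(env, rng, max_steps=5):
    """Roll out one random episode and record (o_t, a_t) pairs."""
    traj = Trajectory()
    obs = env.reset()
    for _ in range(max_steps):
        action = rng.choice(ATOMIC_SKILLS)
        traj.observations.append(obs)
        traj.actions.append(action)
        obs, done = env.step(action)
        if done:
            traj.success = True
            break
    return traj


def collect(env, episodes, seed=0):
    """Split explored trajectories into positive and negative experiences."""
    rng = random.Random(seed)
    positive, negative = [], []
    for _ in range(episodes):
        traj = explore(env, rng)
        (positive if traj.success else negative).append(traj)
    return positive, negative
```

Calling `collect(ToyEnv(), episodes=20)` returns two lists whose lengths sum to 20. The negative list is kept rather than discarded because those trajectories are the input to hindsight relabeling.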

Hindsight Experience Utilization

For failed or suboptimal trajectories, the agent applies hindsight relabeling: given a trajectory that did not achieve the original goal $g$, it infers a simpler goal $g'$ that the trajectory did achieve. This is done by prompting the MLLM to summarize the action sequence into a high-level task description. The relabeled pair $(\tau, g')$ then serves as a training example for predicting action sequences from goals.
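A small sketch of how the training set might be assembled from mixed outcomes. The episode dictionary layout, `build_dataset`, and the rule-based `toy_summarize` are assumptions made for illustration; in the method described here the summarizer would be an MLLM prompt, and suboptimal (not only failed) trajectories could be relabeled as well.

```python
def build_dataset(episodes, summarize):
    """Keep successful (tau, g) pairs and relabel failed ones as (tau, g').

    episodes:  list of dicts with keys 'actions', 'goal', 'success'
    summarize: callable mapping an action list to a goal string
               (stands in for the MLLM summarization prompt)
    """
    dataset = []
    for ep in episodes:
        if ep["success"]:
            goal, relabeled = ep["goal"], False
        else:
            # Hindsight: replace the unmet goal g with an achieved goal g'
            goal, relabeled = summarize(ep["actions"]), True
        dataset.append(
            {"actions": ep["actions"], "goal": goal, "relabeled": relabeled}
        )
    return dataset


def toy_summarize(actions):
    # Hypothetical rule-based stand-in for the MLLM
    return "Perform: " + " then ".join(actions)


episodes = [
    {"actions": ["click login", "type password"], "goal": "log in", "success": True},
    {"actions": ["click search", "type shoes"], "goal": "buy shoes", "success": False},
]
dataset = build_dataset(episodes, toy_summarize)
```

The failed "buy shoes" episode is not thrown away: its actions are paired with a goal they do satisfy, so every explored trajectory contributes a consistent (goal, actions) example.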

Compositional Generalization via TDHAF

The Task Decomposition Hierarchical Analysis Framework (TDHAF) categorizes tasks into three granularities:

  • Low-level: atomic skills (e.g., clicking a button)
  • Mid-level: sequences of atoms (e.g., filling a form)
  • High-level: complete multi-step tasks (e.g., booking a flight)

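According to the analysis, mastering low-level atomic skills does not guarantee high-level planning competence, whereas training on high-level tasks yields stronger out-of-distribution (OOD) generalization because the model learns to compose skills in novel ways.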
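The three granularities can be pictured as a task tree. The sketch below is only an illustration of that hierarchy: the `Level` enum, the `Task` class, and the flight-booking decomposition are assumptions, not the framework's actual task definitions.

```python
from dataclasses import dataclass, field
from enum import Enum


class Level(Enum):
    LOW = "low"    # atomic skill
    MID = "mid"    # sequence of atomic skills
    HIGH = "high"  # complete multi-step task


@dataclass
class Task:
    name: str
    level: Level
    subtasks: list = field(default_factory=list)


def atomic_plan(task):
    """Flatten a task into its low-level steps, depth first."""
    if task.level is Level.LOW:
        return [task.name]
    steps = []
    for sub in task.subtasks:
        steps.extend(atomic_plan(sub))
    return steps


# Hypothetical decomposition of the "booking a flight" example
book_flight = Task("book a flight", Level.HIGH, [
    Task("fill the search form", Level.MID, [
        Task("click origin field", Level.LOW),
        Task("type origin city", Level.LOW),
        Task("click search button", Level.LOW),
    ]),
    Task("select a result", Level.MID, [
        Task("scroll results", Level.LOW),
        Task("click first flight", Level.LOW),
    ]),
])

plan = atomic_plan(book_flight)
```

Here `plan` lists the five atomic steps in execution order. A model trained only on the leaves sees each step in isolation, while a model trained on the root must also learn the ordering and grouping, which is the planning skill the analysis targets.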

Training Objective

The model $\pi_\theta(a_t|o_{\le t}, g)$ is optimized via behavior cloning on the collected experiences. The loss is:

$$ \mathcal{L} = -\mathbb{E}_{(\tau, g) \sim \mathcal{D}} \left[ \sum_{t=1}^T \log \pi_\theta(a_t | o_{\le t}, g) \right] $$

where $\mathcal{D}$ includes both original and hindsight-relabeled trajectories.
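To make the objective concrete, the following sketch evaluates it on toy numbers. The per-step probabilities are invented for illustration; `bc_loss` and `dataset_loss` are hypothetical helper names, and the average over trajectories is a sample estimate of the expectation over $\mathcal{D}$.

```python
import math


def bc_loss(step_probs):
    """Negative log-likelihood of one trajectory.

    step_probs[t] is the probability the policy assigns to the
    recorded action a_t given o_<=t and the goal g.
    """
    return -sum(math.log(p) for p in step_probs)


def dataset_loss(trajectories):
    """Average per-trajectory loss, estimating the expectation over D."""
    return sum(bc_loss(t) for t in trajectories) / len(trajectories)


# Invented probabilities for two three-step trajectories
confident = [0.9, 0.8, 0.95]
uncertain = [0.5, 0.4, 0.6]

loss_confident = bc_loss(confident)
loss_uncertain = bc_loss(uncertain)
loss_mean = dataset_loss([confident, uncertain])
```

The loss is zero only when every recorded action receives probability 1, and it grows as the policy spreads probability away from the demonstrated actions, so `loss_confident` is smaller than `loss_uncertain`.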

Implementation Details

  • Base model: 7B parameter MLLM (e.g., Qwen2.5-VL-7B)
  • Exploration: agent uses a set of predefined atomic skills; environment simulator (e.g., WebArena)
  • Hindsight relabeling: prompts the MLLM to generate a new goal given the trajectory (e.g., "What task does this sequence achieve?")
  • Training: fine-tune with LoRA on collected data
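As background on why LoRA suits a 7B model, the arithmetic below counts trainable parameters for one adapted weight matrix. LoRA freezes a weight of shape `d_out x d_in` and trains two low-rank factors of shapes `d_out x r` and `r x d_in`. The layer size (4096) and rank (8) are illustrative assumptions; the source does not state the rank or which layers are adapted.

```python
def full_params(d_out, d_in):
    """Parameters updated when the whole weight matrix is fine-tuned."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Parameters in the two LoRA factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


# Illustrative sizes only, not taken from the paper
d_out, d_in, rank = 4096, 4096, 8

full = full_params(d_out, d_in)          # 16,777,216
lora = lora_params(d_out, d_in, rank)    # 65,536
fraction = lora / full                   # 1/256, about 0.39%
```

Under these assumed sizes, each adapted matrix trains 1/256 of the parameters that full fine-tuning would update, which is what keeps fine-tuning on explored data inexpensive.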

Code Snippet

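# Sketch of hindsight experience generation and training (PyTorch).
# `mllm`, `model`, and `experience_buffer` are supplied by the caller.
import torch.nn.functional as F
from torch.utils.data import DataLoader

def hindsight_relabel(trajectory, original_goal, mllm):
    """Ask the MLLM which high-level task the trajectory actually accomplished."""
    # original_goal is accepted for bookkeeping but is not part of the prompt
    prompt = (
        f"Given the following sequence of actions: {trajectory['actions']}, "
        "what high-level task does this accomplish?"
    )
    return mllm.generate(prompt)

def train_epoch(model, optimizer, experience_buffer, batch_size=8):
    """One behavior-cloning pass over original and relabeled experiences."""
    dataloader = DataLoader(experience_buffer, batch_size=batch_size, shuffle=True)
    for obs, actions, goals in dataloader:
        logits = model(obs, goals)  # expected shape: (batch, seq_len, vocab_size)
        # Token-level cross-entropy between predicted and recorded actions
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), actions.view(-1))
        optimizer.zero_grad()  # clear gradients left over from the previous step
        loss.backward()
        optimizer.step()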

Results

  • On real-world web benchmarks, the 7B PEEU model achieves 30.6% accuracy, surpassing Qwen2.5-VL-32B, which has roughly 4.6x as many parameters (32B vs. 7B).
  • Ablation studies show that high-level task training is crucial for OOD generalization, while low-level skill training alone is insufficient.