Empowering GUI Agents via Autonomous Experience Exploration and Hindsight Experience Utilization for Task Planning

Abstract

Multimodal web agents can help humans carry out repetitive GUI tasks, where effective task planning is essential for decomposing complex tasks into executable actions. Small open-source MLLMs are cost-efficient and privacy-preserving compared with large commercial models, but they suffer from weak planning and limited cross-website generalization. To address these limitations, we introduce the planning experience exploration and utilization (PEEU) method, which autonomously explores environments to discover experiences and uses hindsight experience to synthesize strictly aligned, high-level training data. To quantitatively analyze the generalization behaviors driving this performance, we propose the task decomposition hierarchical analysis framework (TDHAF), which systematically studies compositional generalization across three task granularities: low, middle, and high. Our analysis reveals that mastering low-level atomic skills does not guarantee high-level planning competence, while high-level task training yields stronger out-of-distribution (OOD) generalization. Experiments on real-world benchmarks demonstrate PEEU's effectiveness: our 7B model achieves 30.6% accuracy, outperforming the much larger Qwen2.5-VL-32B model. These results demonstrate that constructing hindsight high-level tasks and leveraging experiences is crucial for the OOD planning abilities of small MLLMs.

Technical Analysis & Implementation

Methodology

The Planning Experience Exploration and Utilization (PEEU) method enables small Multimodal Large Language Models (MLLMs) to improve task planning for GUI agents. It consists of two phases:

Autonomous Experience Exploration

An agent built on a small MLLM interacts with a GUI environment (e.g., web pages) using a set of atomic skills (e.g., click, type, scroll). It explores randomly and records trajectories $\tau = (o_1, a_1, \ldots, o_T, a_T)$, where $o_t$ is the observation (screenshot plus DOM) and $a_t$ is the action at step $t$. Trajectories that complete a task are stored as positive experiences; unsuccessful trajectories are kept as negative experiences.
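The exploration loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `ToyEnv`, its `reset`/`step` methods, `Trajectory`, and `explore` are hypothetical names, and a plain string stands in for the screenshot-plus-DOM observation.

```python
import random
from dataclasses import dataclass, field

ATOMIC_SKILLS = ["click", "type", "scroll"]  # atomic skills named in the text


@dataclass
class Trajectory:
    observations: list = field(default_factory=list)
    actions: list = field(default_factory=list)
    success: bool = False


class ToyEnv:
    """Hypothetical GUI stand-in: the task is done after 'click' then 'type'."""

    def reset(self):
        self.history = []
        return "page:start"

    def step(self, action):
        self.history.append(action)
        done = self.history[-2:] == ["click", "type"]
        return f"page:after_{action}", done


def explore(env, max_steps, rng):
    """Roll out a uniformly random policy and record (o_t, a_t) pairs."""
    traj = Trajectory()
    obs = env.reset()
    for _ in range(max_steps):
        action = rng.choice(ATOMIC_SKILLS)
        traj.observations.append(obs)
        traj.actions.append(action)
        obs, done = env.step(action)
        if done:
            traj.success = True
            break
    return traj


rng = random.Random(0)
positive, negative = [], []
for _ in range(20):
    traj = explore(ToyEnv(), max_steps=5, rng=rng)
    # Route each trajectory into the positive or negative experience pool.
    (positive if traj.success else negative).append(traj)
print(len(positive), len(negative))
```

Keeping the failed rollouts rather than discarding them is what makes the later hindsight step possible.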

Hindsight Experience Utilization

Failed or suboptimal trajectories are reused through hindsight relabeling: given a trajectory that did not achieve its original goal $g$, the agent infers a simpler goal $g'$ that the trajectory did achieve. It does so by prompting the MLLM to summarize the action sequence into a high-level task description. The relabeled experience $(\tau, g')$ is then used to train the model to predict action sequences from goals.
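A small sketch of how relabeled pairs $(\tau, g')$ might be assembled. The `summarize_actions` function is a hypothetical stand-in for the MLLM prompt, and the dictionary keys (`goal`, `actions`, `success`) are assumptions made for illustration only.

```python
def summarize_actions(actions):
    """Hypothetical stand-in for the MLLM summarization prompt."""
    return "Perform: " + ", then ".join(actions)


def relabel(experiences, summarize=summarize_actions):
    """Keep the goal of successful experiences; replace the goal of failed ones."""
    dataset = []
    for exp in experiences:
        if exp["success"]:
            goal = exp["goal"]                # original goal g was achieved
        else:
            goal = summarize(exp["actions"])  # hindsight goal g'
        dataset.append({"actions": exp["actions"], "goal": goal})
    return dataset


experiences = [
    {"goal": "Book a flight to Paris",
     "actions": ["click search", "type Paris"], "success": False},
    {"goal": "Open the login page",
     "actions": ["click login"], "success": True},
]
dataset = relabel(experiences)
for item in dataset:
    print(item["goal"], "<-", item["actions"])
```

Because the new goal is derived from the actions themselves, every relabeled pair is aligned by construction, whatever the original goal was.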

Compositional Generalization via TDHAF

The Task Decomposition Hierarchical Analysis Framework (TDHAF) categorizes tasks into three granularities:

Low-level: atomic skills (e.g., clicking a button)
Mid-level: sequences of atoms (e.g., filling a form)
High-level: complete multi-step tasks (e.g., booking a flight)

Training on high-level tasks yields stronger out-of-distribution (OOD) generalization, as the model learns to compose skills in novel ways.
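One simple way to use the three granularities in an analysis is to report accuracy per level. The sketch below assumes each evaluation record is already tagged with its level; the function name and the records are made-up placeholders, not results or code from the paper.

```python
from collections import defaultdict

LEVELS = ("low", "mid", "high")


def accuracy_by_level(records):
    """Return {level: fraction of successful records} for the levels present."""
    totals, hits = defaultdict(int), defaultdict(int)
    for level, success in records:
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level}")
        totals[level] += 1
        hits[level] += int(success)
    return {lv: hits[lv] / totals[lv] for lv in LEVELS if totals[lv]}


# Placeholder (level, success) records; illustrative only.
records = [
    ("low", True), ("low", True),
    ("mid", True), ("mid", False),
    ("high", False),
]
report = accuracy_by_level(records)
print(report)
```

Reporting the levels separately makes it visible when strong low-level accuracy coexists with weak high-level planning.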

Training Objective

The model $\pi_\theta(a_t|o_{\le t}, g)$ is optimized via behavior cloning on the collected experiences. The loss is:

$$ \mathcal{L} = -\mathbb{E}_{(\tau, g) \sim \mathcal{D}} \left[ \sum_{t=1}^T \log \pi_\theta(a_t | o_{\le t}, g) \right] $$

where $\mathcal{D}$ includes both original and hindsight-relabeled trajectories.
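To make the objective concrete, the following sketch evaluates it on a tiny hand-written dataset. The helper names and the per-step probabilities are invented for illustration; the expectation over $\mathcal{D}$ is approximated by a plain average over trajectories.

```python
import math


def trajectory_nll(step_probs):
    """Negative log-likelihood of one trajectory: -sum_t log pi(a_t | o_<=t, g)."""
    return -sum(math.log(p) for p in step_probs)


def bc_loss(dataset_probs):
    """Average trajectory NLL, an empirical estimate of the expectation over D."""
    return sum(trajectory_nll(p) for p in dataset_probs) / len(dataset_probs)


# Probability the policy assigns to the recorded action at each step (made-up values).
dataset_probs = [
    [0.5, 0.25],  # two-step trajectory
    [1.0],        # one-step trajectory predicted with certainty
]
loss = bc_loss(dataset_probs)
print(round(loss, 4))
```

The first trajectory contributes $\ln 2 + \ln 4 = \ln 8$ and the second contributes 0, so the average is $\tfrac{1}{2}\ln 8 \approx 1.04$.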

Implementation Details

Base model: 7B parameter MLLM (e.g., Qwen2.5-VL-7B)
Exploration: agent uses a set of predefined atomic skills; environment simulator (e.g., WebArena)
Hindsight relabeling: prompts the MLLM to generate a new goal given the trajectory (e.g., "What task does this sequence achieve?")
Training: fine-tune with LoRA on collected data
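As background on the LoRA step: LoRA freezes a weight matrix of shape $d \times k$ and trains a low-rank update $BA$ with $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$. The sketch below compares trainable parameter counts for a single matrix. The dimensions and rank are illustrative assumptions; the source does not state the configuration used.

```python
def full_params(d, k):
    """Trainable parameters when the whole d x k matrix is updated."""
    return d * k


def lora_params(d, k, r):
    """Trainable parameters for the low-rank factors B (d x r) and A (r x k)."""
    return r * (d + k)


d, k, r = 4096, 4096, 8  # illustrative sizes, not taken from the source
full = full_params(d, k)
lora = lora_params(d, k, r)
ratio = lora / full
print(full, lora, f"{ratio:.2%}")
```

With these assumed sizes, the low-rank factors hold 1/256 of the parameters of the full matrix, which is why LoRA fine-tuning of a 7B model is comparatively cheap.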

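Code Snippet

# Sketch of hindsight experience generation and training.
# mllm, model, optimizer, experience_buffer, and vocab_size are assumed to be defined elsewhere.
import torch.nn.functional as F
from torch.utils.data import DataLoader

def hindsight_relabel(trajectory, original_goal, mllm):
    """Return a goal that the recorded trajectory actually achieved."""
    # original_goal is unused here: the new goal is derived from the actions alone.
    prompt = (
        f"Given the following sequence of actions: {trajectory['actions']}, "
        "what high-level task does this accomplish?"
    )
    new_goal = mllm.generate(prompt)  # placeholder interface for the MLLM
    return new_goal

# Training loop (behavior cloning on the experience buffer)
dataloader = DataLoader(experience_buffer)
for batch in dataloader:
    obs, actions, goals = batch
    optimizer.zero_grad()  # clear gradients left over from the previous step
    logits = model(obs, goals)  # assumed shape: (batch, seq_len, vocab_size)
    loss = F.cross_entropy(logits.view(-1, vocab_size), actions.view(-1))
    loss.backward()
    optimizer.step()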

Results

On real-world web benchmarks, the 7B PEEU model achieves 30.6% accuracy, surpassing Qwen2.5-VL-32B, which has about 4.6 times as many parameters (32B vs. 7B).
Ablation studies show that high-level task training is crucial for OOD generalization, while low-level skill training alone is insufficient.

API pricing comparison (USD per million tokens)

Model	Input	Output
DeepSeek-V3	$0.14	$0.28
GPT-4o	$2.50	$10.00
Claude 3.5 Sonnet	$3.00	$15.00
Gemini 1.5 Pro	$1.25	$5.00
