Empowering GUI Agents via Autonomous Experience Exploration and Hindsight Experience Utilization for Task Planning
By Tianyi Men, Zhuoran Jin, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
"PEEU lets small MLLMs autonomously explore GUI environments and use hindsight to synthesize high-level planning experiences, boosting compositional generalization and outperforming larger models on web tasks."
Abstract
Multimodal web agents can assist humans in operating repetitive GUI tasks, where effective task planning is essential for decomposing complex tasks into executable actions. While small open source MLLMs are cost efficient and privacy preserving compared with commercial large models, they suffer from weak planning and limited cross website generalization. To address these limitations, we introduce the planning experience exploration and utilization (PEEU) method, which autonomously explores environments to discover experiences and utilizes hindsight experience to synthesize strictly aligned, high level training data. To quantitatively analyze the generalization behaviors driving this performance, we propose the task decomposition hierarchical analysis framework (TDHAF) to systematically study compositional generalization across three task granularities: low, middle and high levels. Our analysis reveals that mastering low level atomic skills does not guarantee high level planning competence, while high level task training yields stronger OOD generalization. Experiments on real world benchmarks demonstrate PEEU's superior effectiveness: our 7B model achieves 30.6% accuracy, outperforming the much larger Qwen2.5-VL-32B model. These results demonstrate that constructing hindsight high level tasks and leveraging experiences are crucial for the OOD planning abilities of small MLLMs.
Technical Analysis & Implementation
Methodology§
The Planning Experience Exploration and Utilization (PEEU) method enables small Multimodal Large Language Models (MLLMs) to improve task planning for GUI agents. It consists of two phases:
Autonomous Experience Exploration§
An agent (based on a small MLLM) interacts with a GUI environment (e.g., web pages) using a set of atomic skills (e.g., click, type, scroll). It randomly explores and records trajectories $\tau = (o_1, a_1, ..., o_T, a_T)$ where $o_t$ is the observation (screenshot + DOM) and $a_t$ is the action. Successful trajectories that complete a task are stored as positive experiences. The agent also collects unsuccessful trajectories as negative experiences.
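The exploration phase can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: `ToyEnv`, `Trajectory`, `explore`, and `collect` are hypothetical names, the environment is a toy stand-in for a real web page, and the policy is uniformly random over the three atomic skills named above.

```python
import random
from dataclasses import dataclass, field

# Atomic skills named in the text; a real agent would also carry arguments
# (target element, text to type, ...). This list is an assumption.
ATOMIC_SKILLS = ["click", "type", "scroll"]


@dataclass
class Trajectory:
    observations: list = field(default_factory=list)  # o_1 ... o_T
    actions: list = field(default_factory=list)       # a_1 ... a_T
    success: bool = False


class ToyEnv:
    """Hypothetical stand-in for a GUI environment: the task succeeds after three clicks."""

    def reset(self):
        self.clicks = 0
        return {"screenshot": None, "dom": "<html>clicks=0</html>"}

    def step(self, action):
        if action == "click":
            self.clicks += 1
        obs = {"screenshot": None, "dom": f"<html>clicks={self.clicks}</html>"}
        return obs, self.clicks >= 3  # (next observation, task completed?)


def explore(env, max_steps, rng):
    """Roll out a uniformly random policy and record (o_t, a_t) pairs."""
    traj = Trajectory()
    obs = env.reset()
    for _ in range(max_steps):
        action = rng.choice(ATOMIC_SKILLS)
        traj.observations.append(obs)
        traj.actions.append(action)
        obs, done = env.step(action)
        if done:
            traj.success = True
            break
    return traj


def collect(env, episodes, max_steps, seed=0):
    """Split explored trajectories into positive and negative experiences."""
    rng = random.Random(seed)
    positives, negatives = [], []
    for _ in range(episodes):
        traj = explore(env, max_steps, rng)
        (positives if traj.success else negatives).append(traj)
    return positives, negatives
```

Calling `collect(ToyEnv(), episodes=20, max_steps=5)` returns two lists whose lengths sum to 20; trajectories in the second list are the ones hindsight relabeling would later reuse.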
Hindsight Experience Utilization§
For failed or suboptimal trajectories, the agent uses hindsight relabeling: Given a failed trajectory that did not achieve the original goal $g$, the agent imagines a simpler goal $g'$ that the trajectory actually achieved. This is done by prompting the MLLM to summarize the sequence of actions into a high-level task description. The relabeled experience $(\tau, g')$ is used to train the model to predict action sequences from goals.
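To make the data flow concrete, the sketch below assembles a training set from original and relabeled experiences. It is an assumption-laden illustration: `build_dataset` and `stub_describe` are hypothetical names, the string-joining stub merely stands in for the MLLM summary, and skipping empty trajectories is a choice made here, not something the source specifies.

```python
def build_dataset(positives, negatives, describe):
    """Return (actions, goal) training pairs.

    positives / negatives: lists of (actions, original_goal) tuples.
    describe: callable mapping a list of actions to a goal string. In PEEU this
              role is played by prompting the MLLM; here it is a placeholder.
    """
    # Successful trajectories keep their original goal g.
    dataset = [(actions, goal) for actions, goal in positives]
    for actions, _failed_goal in negatives:
        if not actions:
            continue  # assumption: nothing was done, so there is nothing to relabel
        dataset.append((actions, describe(actions)))  # relabeled pair (tau, g')
    return dataset


def stub_describe(actions):
    # Placeholder for the MLLM summary: simply joins the action strings.
    return "Task: " + ", then ".join(actions)


# Made-up example trajectories, for illustration only.
positives = [(["click search box", "type 'laptop'", "click search"], "Search for a laptop")]
negatives = [(["click cart"], "Check out the items in the cart"), ([], "Log in")]
dataset = build_dataset(positives, negatives, stub_describe)
```

Here the failed checkout attempt is kept as a valid example of the simpler goal "Task: click cart", while the empty trajectory is dropped.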
Compositional Generalization via TDHAF§
The Task Decomposition Hierarchical Analysis Framework (TDHAF) categorizes tasks into three granularities:
- Low-level: atomic skills (e.g., clicking a button)
- Mid-level: sequences of atoms (e.g., filling a form)
- High-level: complete multi-step tasks (e.g., booking a flight)
Training on high-level tasks yields stronger out-of-distribution (OOD) generalization, as the model learns to compose skills in novel ways.
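One way to picture a granularity-aware analysis is to tag each evaluation record with its level and report accuracy per level. The sketch below is only a guess at such bookkeeping: `Granularity` and `accuracy_by_level` are hypothetical names, and the records are placeholders used to exercise the function, not results from the paper.

```python
from collections import defaultdict
from enum import Enum


class Granularity(Enum):
    LOW = "low"    # atomic skills, e.g. clicking a button
    MID = "mid"    # sequences of atomic skills, e.g. filling a form
    HIGH = "high"  # complete multi-step tasks, e.g. booking a flight


def accuracy_by_level(records):
    """records: iterable of (Granularity, bool success). Returns {level: accuracy}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for level, success in records:
        totals[level] += 1
        hits[level] += int(success)
    return {level: hits[level] / totals[level] for level in totals}


# Placeholder records purely to exercise the function; not results from the paper.
toy_records = [
    (Granularity.LOW, True), (Granularity.LOW, True),
    (Granularity.MID, True), (Granularity.MID, False),
    (Granularity.HIGH, False), (Granularity.HIGH, False),
]
report = accuracy_by_level(toy_records)
```

Reporting the three levels separately is what lets a gap between low-level skill accuracy and high-level planning accuracy become visible at all.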
Training Objective§
The model $\pi_\theta(a_t|o_{\le t}, g)$ is optimized via behavior cloning on the collected experiences. The loss is:
$$ \mathcal{L} = -\mathbb{E}_{(\tau, g) \sim \mathcal{D}} \left[ \sum_{t=1}^T \log \pi_\theta(a_t | o_{\le t}, g) \right] $$
where $\mathcal{D}$ includes both original and hindsight-relabeled trajectories.
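As a small worked example of this objective, the sketch below evaluates the loss on two trajectories using hand-picked probabilities. The numbers are illustrative only, the function names are hypothetical, and the expectation over $\mathcal{D}$ is approximated by a plain average over the trajectories supplied.

```python
import math


def bc_loss(step_probs):
    """Negative log-likelihood of one trajectory.

    step_probs[t] is pi_theta(a_t | o_<=t, g): the probability the policy
    assigns to the action actually taken at step t.
    """
    return -sum(math.log(p) for p in step_probs)


def dataset_loss(trajectories):
    """Empirical estimate of the expectation: mean per-trajectory loss over D."""
    return sum(bc_loss(p) for p in trajectories) / len(trajectories)


# Illustrative probabilities only (not taken from the paper).
original = [0.5, 0.25]       # two-step trajectory paired with its original goal g
relabeled = [0.5, 0.5, 0.5]  # three-step trajectory paired with a hindsight goal g'
loss = dataset_loss([original, relabeled])
```

Both trajectories contribute $3 \ln 2$ here, so the average is $3 \ln 2 \approx 2.079$; original and relabeled pairs enter the loss in exactly the same way, differing only in which goal conditions the policy.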
Implementation Details§
- Base model: 7B parameter MLLM (e.g., Qwen2.5-VL-7B)
- Exploration: agent uses a set of predefined atomic skills; environment simulator (e.g., WebArena)
- Hindsight relabeling: prompts the MLLM to generate a new goal given the trajectory (e.g., "What task does this sequence achieve?")
- Training: fine-tune with LoRA on collected data
Code Snippet§
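# Sketch of hindsight experience generation and behavior-cloning training
import torch.nn.functional as F
from torch.utils.data import DataLoader

def hindsight_relabel(trajectory, original_goal, mllm):
    # Prompt the MLLM to summarize the executed actions into a goal the
    # trajectory actually achieved; original_goal is not needed for the prompt
    prompt = (
        f"Given the following sequence of actions: {trajectory['actions']}, "
        "what high-level task does this accomplish?"
    )
    new_goal = mllm.generate(prompt)
    return new_goal

# Training loop (model, optimizer and experience_buffer are supplied by the caller)
def train_one_epoch(model, optimizer, experience_buffer, vocab_size):
    dataloader = DataLoader(experience_buffer)
    for batch in dataloader:
        obs, actions, goals = batch
        optimizer.zero_grad()  # clear gradients left over from the previous step
        logits = model(obs, goals)  # assumed shape: (batch, seq_len, vocab_size)
        loss = F.cross_entropy(logits.view(-1, vocab_size), actions.view(-1))
        loss.backward()
        optimizer.step()
Results§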
- On real-world web benchmarks, the 7B PEEU model achieves 30.6% accuracy, surpassing Qwen2.5-VL-32B (which has roughly 4.5x as many parameters).
- Ablation studies show that high-level task training is crucial for OOD generalization, while low-level skill training alone is insufficient.