AutoMem: Automated Learning of Memory as a Cognitive Skill

Abstract

Memory expertise is a learned skill: knowing what to encode, when to retrieve, and how to organize knowledge--a capacity known in cognitive science as metamemory. We bring this perspective to LLMs by treating memory management as a trainable skill. We promote file-system operations to first-class memory actions alongside task actions, letting the model itself decide how to manage its memory. This memory skill improves along two axes: the structure that supports it (prompts, file schemas, action vocabulary), and the proficiency of the model exercising it. Both axes resist manual optimization: episodes in long-horizon tasks run for thousands of steps, and a single memory mistake can hide long before it surfaces, making human review of full trajectories impractical. We introduce AutoMem, a framework that automates both axes. In the first loop, a strong LLM reviews complete agent trajectories and iteratively revises the memory structure that shapes how the agent interacts with its memory files. In the second loop, the agent's own good memory decisions are identified from many episodes and used as training signal to sharpen the model's memory proficiency directly. Across three procedurally generated long-horizon games (Crafter, MiniHack, and NetHack), optimizing memory alone--without modifying the model's task-action behavior--improved the base agent's performance ~2x-4x, bringing a 32B open-weight model competitive with frontier systems such as Claude Opus 4.5 and Gemini 3.1 Pro Thinking. Our results show that memory management is an independently learnable skill, and a high-leverage objective yielding large gains on long-horizon tasks.

Technical Analysis & Implementation

Technical Breakdown

AutoMem addresses the challenge of memory management in LLM agents operating over long horizons. The core idea is to treat memory operations (read, write, delete, etc.) as first-class actions, and then automate the optimization of both (1) the memory structure (how memory files are organized, schemas, prompts) and (2) the memory proficiency of the model.

Two-Loop Architecture

Outer Loop (Structure Optimization): A strong LLM reviews complete trajectories of the base agent (a 32B open-weight model in the reported experiments) and proposes revisions to the memory prompts, file schemas, and action definitions. The loop runs iteratively, improving the scaffolding that governs how the agent interacts with its memory files.

Inner Loop (Proficiency Tuning): Across many episodes, the agent's own good memory decisions (e.g., retrieving relevant information at the right time, organizing files logically) are identified. These are used as supervised fine-tuning data to update the base model's weights directly, sharpening its memory proficiency.
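The data flow between the two loops can be sketched as a single round. This is a minimal illustration, not the authors' implementation: the function `automem_round`, its four callables, and the ordering (revise the structure first, then collect fresh episodes for fine-tuning) are all assumptions made for the example, and the stubs stand in for a real agent, reviewer, and trainer.

```python
from typing import Callable, List


def automem_round(
    run_episodes: Callable[[str], List[dict]],
    review: Callable[[List[dict], str], str],
    select_good: Callable[[List[dict]], List[dict]],
    fine_tune: Callable[[List[dict]], None],
    prompt: str,
) -> str:
    """Run one structure revision followed by one proficiency update.

    All arguments are hypothetical interfaces, not part of AutoMem's API.
    """
    # Outer loop step: a reviewer reads full trajectories and revises the prompt.
    trajectories = run_episodes(prompt)
    revised = review(trajectories, prompt)
    # Inner loop step: collect episodes under the revised prompt,
    # keep the good memory decisions, and train on them.
    trajectories = run_episodes(revised)
    fine_tune(select_good(trajectories))
    return revised


# Toy stubs so the sketch runs end to end.
tuning_log = []


def fake_run(prompt):
    return [{"prompt": prompt, "score": len(prompt)}]


def fake_review(trajectories, prompt):
    return prompt + " Keep goal.txt updated."


def fake_select(trajectories):
    return [t for t in trajectories if t["score"] > 0]


def fake_tune(examples):
    tuning_log.append(len(examples))


new_prompt = automem_round(
    fake_run, fake_review, fake_select, fake_tune, "Use memory files."
)
```

Passing the components in as callables keeps the sketch independent of any particular agent, reviewer model, or training library.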

Memory Actions

The agent's action space is extended with file-system operations:

memory_actions = ['read', 'write', 'append', 'delete', 'create_file', 'list_dir', 'search']

These actions are executed on a structured memory directory that the agent can reorganize. For example, an agent can create a file goal.txt and later read it to recall its current goal.
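As a rough illustration of how these actions might map onto ordinary file operations, the sketch below dispatches each action name to a standard-library call inside a temporary directory. The function `run_memory_action`, its parameters, and the exact semantics of each action (for example, `search` as a substring match over file contents) are assumptions for the example; the source does not specify them.

```python
import os
import tempfile


def run_memory_action(base_dir, action, filename="", content="", query=""):
    """Apply one file-system memory action inside base_dir and return its result.

    Hypothetical helper: action semantics here are illustrative assumptions.
    """
    path = os.path.join(base_dir, filename)
    if action in ("write", "create_file"):
        with open(path, "w") as f:
            f.write(content)
        return None
    if action == "append":
        with open(path, "a") as f:
            f.write(content)
        return None
    if action == "read":
        with open(path) as f:
            return f.read()
    if action == "delete":
        os.remove(path)
        return None
    if action == "list_dir":
        return sorted(os.listdir(base_dir))
    if action == "search":
        # Return the names of files whose text contains the query string.
        hits = []
        for name in sorted(os.listdir(base_dir)):
            candidate = os.path.join(base_dir, name)
            if os.path.isfile(candidate):
                with open(candidate) as f:
                    if query in f.read():
                        hits.append(name)
        return hits
    raise ValueError(f"unknown memory action: {action}")


with tempfile.TemporaryDirectory() as memory_dir:
    run_memory_action(memory_dir, "write", "goal.txt", "collect wood")
    run_memory_action(memory_dir, "append", "goal.txt", "; then craft a pickaxe")
    recalled = run_memory_action(memory_dir, "read", "goal.txt")
    files = run_memory_action(memory_dir, "list_dir")
    matches = run_memory_action(memory_dir, "search", query="pickaxe")
```

Here `recalled` holds the full goal text, and both `files` and `matches` contain only `goal.txt`.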

Formalization

The overall objective is to maximize task reward $R$ by optimizing two components:

Memory structure parameters $\theta_s$ (prompts, schemas)
Model weights $\theta_m$

$$ \max_{\theta_s, \theta_m} \mathbb{E}_{\tau \sim \pi(\cdot|\theta_s,\theta_m)} [R(\tau)] $$

where $\tau$ is a trajectory of (observation, memory_action, task_action) steps sampled from the agent's policy $\pi$ under structure $\theta_s$ and weights $\theta_m$.
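One way to read this joint objective, assuming the two loops are applied in alternation (the schedule is not stated here), is as block coordinate ascent on an empirical estimate $\hat{J}$ computed from $N$ sampled trajectories. The symbols $\hat{J}$, $N$, and the iteration index $k$ are introduced only for this illustration:

$$ \hat{J}(\theta_s, \theta_m) = \frac{1}{N} \sum_{i=1}^{N} R(\tau_i), \qquad \tau_i \sim \pi(\cdot|\theta_s,\theta_m) $$

$$ \theta_s^{(k+1)} \approx \arg\max_{\theta_s} \hat{J}(\theta_s, \theta_m^{(k)}), \qquad \theta_m^{(k+1)} \approx \arg\max_{\theta_m} \hat{J}(\theta_s^{(k+1)}, \theta_m) $$

The approximations matter: neither step is an exact maximization. The structure update is a revision proposed by a reviewing LLM, and the weight update is fine-tuning on selected memory decisions, so each step is better understood as a heuristic improvement of $\hat{J}$ with the other component held fixed.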

Training Signal for Inner Loop

Given a set of trajectories, AutoMem identifies good memory decisions using a heuristic: if the agent later performs a task action that depends on a previously stored piece of memory, the storage decision is counted as good. The corresponding episodes are extracted and used for supervised fine-tuning (SFT) of the base model.
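A simple proxy for this heuristic can be written down directly. The sketch below assumes a hypothetical step format (dicts with a `kind` of `"memory"` or `"task"`, plus `action` and `filename` for memory steps) and treats "a task action depends on stored memory" as "the file was read back and a task action followed". The actual dependence test is not specified here, so this is only one plausible reading.

```python
def label_good_writes(trajectory):
    """Return indices of write steps whose file is read back before a later task action.

    The step format and the read-then-act proxy are illustrative assumptions.
    """
    good = []
    for i, step in enumerate(trajectory):
        if step["kind"] != "memory" or step["action"] not in ("write", "append"):
            continue
        read_back = False
        for later in trajectory[i + 1:]:
            same_file = (
                later["kind"] == "memory"
                and later.get("filename") == step["filename"]
            )
            if same_file:
                if later["action"] == "read":
                    read_back = True
                elif later["action"] in ("write", "delete") and not read_back:
                    break  # content was replaced or removed before being used
            elif later["kind"] == "task" and read_back:
                good.append(i)
                break
    return good


example_trajectory = [
    {"kind": "memory", "action": "write", "filename": "goal.txt"},
    {"kind": "task", "action": "move_north"},
    {"kind": "memory", "action": "write", "filename": "scratch.txt"},
    {"kind": "memory", "action": "read", "filename": "goal.txt"},
    {"kind": "task", "action": "collect_wood"},
]
good_indices = label_good_writes(example_trajectory)
```

In this example only the write to `goal.txt` (index 0) is labeled good, since `scratch.txt` is never read again. A proxy like this over-credits reads that did not actually influence the next action, which is one reason a stricter dependence test might be preferred in practice.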

Implementation Sketch

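# Sketch of memory action execution (illustrative, not the authors' code)
import os

class MemoryManager:
    """Executes file-system memory actions inside a base directory."""

    def __init__(self, base_dir):
        self.base_dir = base_dir
        # Make sure the memory directory exists before any action runs.
        os.makedirs(base_dir, exist_ok=True)

    def execute(self, action_type, **kwargs):
        if action_type == 'write':
            # Overwrite the target file with the new content.
            path = os.path.join(self.base_dir, kwargs['filename'])
            with open(path, 'w') as f:
                f.write(kwargs['content'])
            return None
        elif action_type == 'read':
            # Use a context manager so the file handle is always closed.
            path = os.path.join(self.base_dir, kwargs['filename'])
            with open(path) as f:
                return f.read()
        # ... other actions
        raise ValueError(f"unsupported memory action: {action_type}")

# Outer loop: revise memory prompt
# (strong_llm_review, trajectory, and current_prompt are placeholders)
revised_prompt = strong_llm_review(trajectory, current_prompt)

# Inner loop: fine-tune model on good memory sequences
# (train_step, model, and positive_memory_episodes are placeholders)
for memory_seq in positive_memory_episodes:
    train_step(model, memory_seq)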

Experimental Results

AutoMem was evaluated on three procedurally generated long-horizon games: Crafter, MiniHack, and NetHack. Optimizing memory alone, without modifying the model's task-action behavior, improved the base agent's performance roughly 2x-4x, making the 32B open-weight model competitive with frontier systems such as Claude Opus 4.5 and Gemini 3.1 Pro Thinking. This suggests that memory management is an independently learnable, high-leverage skill for long-horizon tasks.

