ReContext: Recursive Evidence Replay as LLM Harness for Long-Context Reasoning
By Yanjun Zhao, Ruizhong Qiu, Tianxin Wei, Yuanchen Bei, Zhining Liu, Lingjie Chen, Ismini Lourentzou, Hanghang Tong, Jingrui He
"ReContext recursively replays query-relevant evidence from long contexts using attention signals, improving LLM reasoning without training or pruning."
Abstract
Understanding and reasoning over long contexts has become a key requirement for deploying large language models (LLMs) in realistic applications. Although recent LLMs support increasingly long context windows, they often fail to use relevant evidence that is already present in the input, revealing a gap between context access and effective context utilization. In this work, we propose Recursive Evidence Replay as LLM Harness for Long-Context Reasoning (RECONTEXT), a training-free inference method for improving long-context reasoning. RECONTEXT uses model-internal relevance signals to construct a query-conditioned evidence pool and replays it before final generation while preserving the full original context. This recursive selection process separates evidence organization from answer generation without training, external memory, or context pruning. We also provide a theoretical analysis based on associative memory, which characterizes the context as a memory store, the question as a retrieval cue, attention as cue-trace association, and replay as trace reactivation. Experiments on eight long-context datasets with 128K context length show that RECONTEXT consistently improves evidence utilization across Qwen3-4B, Qwen3-8B, and Llama3-8B, achieving the best average rank on all three backbones. Code is available at https://github.com/Yanjun-Zhao/ReContext.
Technical Analysis & Implementation
Technical Breakdown of ReContext
ReContext (Recursive Evidence Replay) is a training-free inference method to enhance long-context reasoning in LLMs by separating evidence organization from answer generation. It uses model-internal relevance signals to construct a query-conditioned evidence pool and replays it before final generation, preserving the full original context.
Core Methodology
ReContext operates in two phases:
1. Evidence Selection: Given a context $C$ (length $N$) and question $Q$, the model computes attention scores $A = \text{softmax}(Q^T K / \sqrt{d})$, where $K$ are the key embeddings from $C$. It then selects the top-$k$ tokens with the highest attention values to form an evidence pool $E$. This selection is recursive: in each round, the query is concatenated with previously selected evidence to refine the relevance signals.
2. Evidence Replay: After $R$ rounds of recursive selection, the final evidence pool $E_R$ is prepended to the original context $C$ (ensuring full context preservation) and passed to the model for answer generation.
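A single round of the selection step can be sketched in plain Python. This is a minimal toy, assuming one query vector and one attention head; the function names `relevance_scores` and `top_k_indices` and the vector values are illustrative assumptions, not part of the released code.

```python
import math

def relevance_scores(query, keys):
    """Softmax of scaled dot products between one query vector and each key."""
    d = len(query)
    logits = [sum(q * x for q, x in zip(query, key)) / math.sqrt(d) for key in keys]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_indices(scores, k):
    """Indices of the k largest scores, returned in original token order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Toy context of four 2-d key vectors (hypothetical values)
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
query = [1.0, 0.0]
scores = relevance_scores(query, keys)
evidence = top_k_indices(scores, k=2)
print(evidence)  # [0, 2]
```

Returning the indices in token order is a design choice of this sketch: it keeps replayed evidence in the order it appears in the context.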
Mathematically, the process for round $r$ is:
$$ A_r = \text{Attention}(Q \oplus E_{r-1}, C) \quad \text{(all tokens)} $$
$$ E_r = \text{top-}k(A_r) \quad \text{(token indices)} $$
where $\oplus$ denotes concatenation and $E_0 = \emptyset$.
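The recursion can be illustrated without a model by passing in a scoring callback. The sketch below assumes, as the code snippet in this section does, that each round adds $k$ new indices to the pool rather than replacing it; `recursive_select`, `toy_score`, and the `BASE` values are hypothetical stand-ins for $\text{Attention}(Q \oplus E_{r-1}, C)$.

```python
def recursive_select(score_fn, k, rounds):
    """Grow an evidence pool over several rounds.

    score_fn(evidence) returns one score per context token, recomputed each
    round from the question plus the evidence selected so far.
    """
    evidence = []  # E_0 is empty
    for _ in range(rounds):
        scores = score_fn(evidence)
        chosen = set(evidence)
        candidates = [i for i in range(len(scores)) if i not in chosen]
        candidates.sort(key=lambda i: scores[i], reverse=True)
        evidence.extend(candidates[:k])
    return sorted(evidence)

# Hypothetical scorer: a fixed base relevance, plus a bonus for tokens next to
# evidence that is already in the pool.
BASE = [0.1, 0.2, 0.9, 0.3, 0.1, 0.25]

def toy_score(evidence):
    near = {i + d for i in evidence for d in (-1, 1)}
    return [b + (0.5 if i in near else 0.0) for i, b in enumerate(BASE)]

pool = recursive_select(toy_score, k=1, rounds=3)
print(pool)  # [1, 2, 3]
```

In this toy, a single pass that takes the top three base scores would pick tokens 2, 3, and 5, while the recursive version picks 1, 2, and 3 because earlier evidence changes later scores.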
Theoretical Justification
The authors frame ReContext through associative memory theory:
- Context $C$ is a memory store whose traces sit at token positions.
- Question $Q$ acts as a retrieval cue.
- Attention computes cue-trace association strength.
- Evidence replay corresponds to trace reactivation, strengthening the memory signal.
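One way to build intuition for the last point is a toy softmax calculation: storing a second copy of an evidence trace raises the total attention mass that the cue assigns to that evidence whenever other traces compete for it. This is an illustrative sketch, not the paper's analysis, and the association strengths are hypothetical values.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cue-trace association strengths; trace 1 is the evidence.
strengths = [0.5, 2.0, 0.3, 0.1]
evidence_idx = 1

before = softmax(strengths)[evidence_idx]

# Replay: store a second copy of the evidence trace ahead of the original store.
replayed = [strengths[evidence_idx]] + strengths
weights = softmax(replayed)
after = weights[0] + weights[1 + evidence_idx]

print(before < after)  # True
```

The original traces are all still present after replay, which mirrors the method's choice to keep the full context rather than prune it.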
Implementation Details
ReContext requires no training or external memory. It uses the LLM's own attention mechanism to guide evidence selection. Key hyperparameters: $k$ (number of tokens per round), $R$ (number of recursive rounds). In experiments, $k=128$ and $R=3$ for 128K context. The method is applied to the final layer's attention heads, averaging across heads for a single relevance score per token.
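To get a feel for these settings, the size of the replayed pool can be bounded with simple arithmetic. The sketch assumes the pool grows by $k$ tokens per round and that 128K means 128 × 1024 tokens; `replay_budget` is a hypothetical helper name.

```python
def replay_budget(k, rounds, context_len):
    """Upper bound on replayed tokens and their share of the context."""
    pool = k * rounds
    return pool, pool / context_len

pool, share = replay_budget(k=128, rounds=3, context_len=128 * 1024)
print(pool, f"{share:.2%}")  # 384 0.29%
```

Under these assumptions the replayed evidence adds well under one percent to the prompt length, so most of the extra cost comes from the additional forward passes rather than from the longer input.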
Code Snippet (PyTorch-like)
    import torch

    def recontext(model, input_ids, question_ids, context_start, k=128, R=3):
        # Assumes a Hugging Face-style causal LM and a batch of one sequence
        # input_ids: [1, seq] tensor laid out as prefix + context + question
        # context_start: index where context begins
        ids = input_ids[0]
        q_start = ids.shape[0] - len(question_ids)
        prefix, context, question = ids[:context_start], ids[context_start:q_start], ids[q_start:]
        evidence_indices = []  # positions relative to the start of the context

        def build_input():
            # Replay selected evidence (in document order) before the full context
            picked = torch.tensor(sorted(evidence_indices), dtype=torch.long, device=ids.device)
            return picked, torch.cat([prefix, context[picked], context, question]).unsqueeze(0)

        for r in range(R):
            picked, round_ids = build_input()
            with torch.no_grad():
                # Returning attention weights may require eager attention and is
                # memory-heavy at long sequence lengths
                outputs = model(round_ids, output_attentions=True)
            attn = outputs.attentions[-1]  # last layer, shape [batch, heads, seq, seq]
            # Average attention over batch, heads, and the question (query) positions
            attn_mean = attn[:, :, -len(question):, :].mean(dim=(0, 1, 2))  # [seq]
            # Focus on the original context span, which starts after the replayed evidence
            c0 = context_start + len(evidence_indices)
            context_attn = attn_mean[c0:c0 + len(context)].clone()
            # Mask evidence already selected, then take the top-k new positions
            context_attn[picked] = float("-inf")
            new_k = min(k, len(context) - len(evidence_indices))
            if new_k <= 0:
                break
            evidence_indices.extend(torch.topk(context_attn, new_k).indices.tolist())
        # Final generation on replayed evidence + full context + question
        _, final_ids = build_input()
        output = model.generate(final_ids, max_new_tokens=128)
        return output

Experiments
ReContext was evaluated on eight long-context datasets (e.g., NarrativeQA, HotpotQA, 2WikiMultihop) at a 128K context length, using Qwen3-4B, Qwen3-8B, and Llama3-8B. It consistently improves evidence utilization (F1 score) over baselines such as no-retrieval and simple retrieval, achieving the best average rank on all three backbones. Ablations show that recursive selection outperforms single-pass selection.
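For readers unfamiliar with the average-rank metric, the sketch below shows one common way to compute it: rank the methods on each dataset, then average each method's ranks. The method names and scores are placeholders for illustration only, not results from the paper, and ties are not handled.

```python
def average_ranks(scores):
    """scores[method] is a list of per-dataset scores (higher is better)."""
    methods = list(scores)
    n_datasets = len(next(iter(scores.values())))
    totals = {m: 0 for m in methods}
    for d in range(n_datasets):
        ordered = sorted(methods, key=lambda m: scores[m][d], reverse=True)
        for rank, m in enumerate(ordered, start=1):
            totals[m] += rank
    return {m: totals[m] / n_datasets for m in methods}

# Placeholder numbers for illustration only; not results from the paper.
toy = {
    "method_a": [0.50, 0.40, 0.70],
    "method_b": [0.45, 0.42, 0.60],
    "method_c": [0.30, 0.35, 0.65],
}
ranks = average_ranks(toy)
best = min(ranks, key=ranks.get)
print(best)  # method_a
```

A lower average rank is better, so a method can have the best average rank without winning on every dataset.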
Key Advantages
- Training-free and model-agnostic.
- Preserves full context (no pruning).
- Low overhead: only a few forward passes for evidence selection.
- Theoretically grounded in associative memory.