Published: July 2, 2026

ReContext: Recursive Evidence Replay as LLM Harness for Long-Context Reasoning

By Yanjun Zhao, Ruizhong Qiu, Tianxin Wei, Yuanchen Bei, Zhining Liu, Lingjie Chen, Ismini Lourentzou, Hanghang Tong, Jingrui He

Research TL;DR

"ReContext recursively replays query-relevant evidence from long contexts using attention signals, improving LLM reasoning without training or pruning."

Abstract

Understanding and reasoning over long contexts has become a key requirement for deploying large language models (LLMs) in realistic applications. Although recent LLMs support increasingly long context windows, they often fail to use relevant evidence that is already present in the input, revealing a gap between context access and effective context utilization. In this work, we propose Recursive Evidence Replay as LLM Harness for Long-Context Reasoning (RECONTEXT), a training-free inference method for improving long-context reasoning. RECONTEXT uses model-internal relevance signals to construct a query-conditioned evidence pool and replays it before final generation while preserving the full original context. This recursive selection process separates evidence organization from answer generation without training, external memory, or context pruning. We also provide a theoretical analysis based on associative memory, which characterizes the context as a memory store, the question as a retrieval cue, attention as cue-trace association, and replay as trace reactivation. Experiments on eight long-context datasets with 128K context length show that RECONTEXT consistently improves evidence utilization across Qwen3-4B, Qwen3-8B, and Llama3-8B, achieving the best average rank on all three backbones. Code is available at https://github.com/Yanjun-Zhao/ReContext.

Technical Analysis & Implementation

Technical Breakdown of ReContext

ReContext (Recursive Evidence Replay) is a training-free inference method to enhance long-context reasoning in LLMs by separating evidence organization from answer generation. It uses model-internal relevance signals to construct a query-conditioned evidence pool and replays it before final generation, preserving the full original context.

Core Methodology

ReContext operates in two phases:

  1. Evidence Selection: Given a context $C$ (length $N$) and question $Q$, the model computes attention scores $A = \text{softmax}(Q^T K / \sqrt{d})$, where $K$ holds the key embeddings of $C$. It then selects the top-$k$ tokens with the highest attention values to form an evidence pool $E$. This selection is recursive: in each round, the query is concatenated with previously selected evidence to refine the relevance signals.
  2. Evidence Replay: After $R$ rounds of recursive selection, the final evidence pool $E_R$ is prepended to the original context $C$ (so the full context is preserved) and passed to the model for answer generation.
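A single selection pass can be sketched on toy vectors. This is a minimal illustration in plain Python, not the authors' code: the helper names `attention_scores` and `top_k` and the 2-dimensional vectors are assumptions made for the example, and the scores come from hand-written dot products rather than a real model's attention heads.

```python
import math

def attention_scores(query_vecs, key_vecs):
    """Return one relevance score per context token.

    Each question-token vector is scored against every context key with a
    scaled dot product, softmax-normalized over the context, and the
    resulting rows are averaged over the question tokens.
    """
    d = len(key_vecs[0])
    totals = [0.0] * len(key_vecs)
    for q in query_vecs:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in key_vecs]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        for i, e in enumerate(exps):
            totals[i] += e / z
    return [t / len(query_vecs) for t in totals]

def top_k(scores, k):
    """Indices of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Four hypothetical context-token keys and one question-token query
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
query = [[2.0, 0.0]]

scores = attention_scores(query, keys)
evidence = top_k(scores, k=2)
print(evidence)  # [0, 2]
```

The two keys most aligned with the query direction are selected, which is the role the evidence pool $E$ plays in the first phase.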

Mathematically, round $r$ computes:

$$ A_r = \text{Attention}(Q \oplus E_{r-1}, C) \quad \text{(all tokens)} $$

$$ E_r = \text{top-}k(A_r) \quad \text{(token indices)} $$

where $\oplus$ denotes concatenation and $E_0 = \emptyset$.
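The effect of conditioning each round on $Q \oplus E_{r-1}$ can be seen in a toy sketch. Everything here is an assumption made for illustration: word overlap stands in for attention, the context items are tiny word sets, and each round adds $k$ new items to a growing pool, which is one reading of the equations rather than a confirmed detail of the method.

```python
def overlap_score(cue_words, item_words):
    """Hypothetical relevance signal: number of shared words."""
    return len(cue_words & item_words)

def recursive_select(question, context, k, rounds):
    """Grow an evidence pool over `rounds` passes, adding k new items per pass."""
    evidence = []  # E_0 is empty
    for _ in range(rounds):
        # Q concatenated with E_{r-1}: the cue is the question plus prior evidence
        cue = set(question)
        for i in evidence:
            cue |= context[i]
        # A_r: score every context item that is not yet in the pool
        candidates = [i for i in range(len(context)) if i not in evidence]
        candidates.sort(key=lambda i: overlap_score(cue, context[i]), reverse=True)
        evidence.extend(candidates[:k])
    return evidence

context = [
    {"tea", "is", "warm"},
    {"bob", "birthplace", "rome"},
    {"alice", "birthplace", "paris"},
    {"alice", "founder", "acme"},
    {"acme", "rockets", "maker"},
]
question = {"where", "rockets", "maker"}

pool = recursive_select(question, context, k=1, rounds=3)
print(pool)  # [4, 3, 2]
```

In this toy case the question alone only matches item 4. Item 3 becomes reachable once "acme" enters the cue, and item 2 once "alice" does, so the chain of supporting facts is recovered one round at a time.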

Theoretical Justification

The authors frame ReContext through associative memory theory:

  • Context $C$ is a memory store of traces (token positions).
  • Question $Q$ acts as a retrieval cue.
  • Attention computes cue-trace association strength.
  • Evidence replay corresponds to trace reactivation, strengthening the memory signal.
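The mapping above can be made concrete with a toy associative memory. This sketch is an illustration only and does not reproduce the authors' analysis: the trace vectors, the cue, and the choice to model replay as storing a second copy of the strongest trace are all assumptions.

```python
import math

def association(cue, traces):
    """Softmax-normalized cue-trace association strengths."""
    logits = [sum(c * t for c, t in zip(cue, trace)) for trace in traces]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

traces = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # memory store (context)
cue = [1.0, 0.2]                                # retrieval cue (question)

before = association(cue, traces)
best = max(range(len(traces)), key=lambda i: before[i])

# Replay: reactivate the strongest trace by storing a second copy of it
replayed = traces + [traces[best]]
after = association(cue, replayed)

mass_before = before[best]
mass_after = after[best] + after[-1]  # both copies of the replayed trace
print(mass_after > mass_before)  # True
```

Because the softmax normalizes over all stored traces, a second copy of a trace raises the total association mass assigned to that evidence whenever other traces are present, which is the sense in which replay strengthens the signal in this toy model.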

Implementation Details

ReContext requires no training or external memory. It uses the LLM's own attention mechanism to guide evidence selection. Key hyperparameters: $k$ (number of tokens per round), $R$ (number of recursive rounds). In experiments, $k=128$ and $R=3$ for 128K context. The method is applied to the final layer's attention heads, averaging across heads for a single relevance score per token.
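As a rough sense of scale for these hyperparameters, the arithmetic below estimates the size of the replayed pool. It assumes each round contributes $k$ new tokens and that "128K" means 128 × 1024 tokens; neither assumption is confirmed by the text above.

```python
k = 128                      # tokens selected per round
R = 3                        # recursive rounds
context_tokens = 128 * 1024  # assumed meaning of "128K"

pool_size = k * R
fraction = pool_size / context_tokens
print(pool_size)          # 384
print(f"{fraction:.2%}")  # 0.29%
```

Under those assumptions the replayed evidence adds well under one percent to the input length of the final generation pass.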

Code Snippet (PyTorch-like)

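import torch

def recontext(model, input_ids, question_ids, context_start, k=128, R=3):
    # input_ids: [1, seq] tensor laid out as prefix + context + question
    # question_ids: 1-D sequence of question tokens (the tail of input_ids)
    # context_start: index in input_ids where the context begins
    # Note: output_attentions=True materializes full [seq, seq] attention maps and
    # typically requires an eager attention backend, so this sketch is only
    # practical for inputs far shorter than 128K tokens.
    original = input_ids[0]
    seq_len = original.shape[0]
    num_q = len(question_ids)
    q_start = seq_len - num_q
    context = original[context_start:q_start]
    selected = torch.zeros(context.shape[0], dtype=torch.bool, device=original.device)
    evidence_indices = torch.empty(0, dtype=torch.long, device=original.device)
    current = input_ids
    for r in range(R):
        with torch.no_grad():
            outputs = model(current, output_attentions=True)
            attn = outputs.attentions[-1]  # last layer, shape [batch, heads, seq, seq]
        # Average over batch, heads, and the question positions (last num_q tokens)
        attn_mean = attn[:, :, -num_q:, :].float().mean(dim=(0, 1, 2))  # [cur_seq]
        # Replayed evidence sits before the context, shifting it by `offset` tokens
        offset = current.shape[1] - seq_len
        context_attn = attn_mean[offset + context_start : offset + q_start].clone()
        # Exclude evidence already selected, then take the top-k remaining tokens
        context_attn[selected] = float("-inf")
        n_new = min(k, int((~selected).sum()))
        new_evidence = torch.topk(context_attn, n_new).indices
        selected[new_evidence] = True
        evidence_indices = torch.cat([evidence_indices, new_evidence])
        # Rebuild the input from the original: prefix + evidence + context + question
        evidence_tokens = context[evidence_indices]
        current = torch.cat(
            [original[:context_start], evidence_tokens, original[context_start:]]
        ).unsqueeze(0)
    # Final generation over the replayed evidence plus the full original input
    return model.generate(current, max_new_tokens=128)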

Experiments

ReContext was evaluated on eight long-context datasets (e.g., NarrativeQA, HotpotQA, 2WikiMultihop) with 128K context length, using Qwen3-4B, Qwen3-8B, and Llama3-8B. It consistently improves evidence utilization (F1 score) over baselines such as no-retrieval and simple retrieval, achieving the best average rank on all three backbones. Ablations show that recursive selection outperforms single-pass selection.

Key Advantages

  • Training-free and model-agnostic.
  • Preserves full context (no pruning).
  • Low overhead: only a few forward passes for evidence selection.
  • Theoretically grounded in associative memory.