ReContext: Recursive Evidence Replay as LLM Harness for Long-Context Reasoning

Technical Breakdown of ReContext§

ReContext (Recursive Evidence Replay) is a training-free inference method to enhance long-context reasoning in LLMs by separating evidence organization from answer generation. It uses model-internal relevance signals to construct a query-conditioned evidence pool and replays it before final generation, preserving the full original context.

Core Methodology§

ReContext operates in two phases:

1. Evidence Selection: Given a context $C$ (length $N$) and question $Q$, the model computes attention scores $A = \text{softmax}(Q^T K / \sqrt{d})$, where $Q$ here stands for the query vectors of the question tokens, $K$ for the key vectors of the tokens in $C$, and $d$ for the head dimension. It then selects the top-$k$ context tokens with the highest attention values to form an evidence pool $E$. This selection is recursive: in each round, the question is concatenated with the previously selected evidence to refine the relevance signal.
2. Evidence Replay: After $R$ rounds of recursive selection, the final evidence pool $E_R$ is prepended to the original context $C$, which is kept in full, and the combined input is passed to the model for answer generation.
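A single round of the selection phase can be sketched with plain Python lists. This is a toy illustration under stated assumptions, not the authors' implementation: the query is one vector rather than a set of question tokens, and the names `softmax`, `select_top_k`, `query`, and `keys` are hypothetical.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_top_k(query, keys, k):
    """Return the indices of the k keys with the highest attention weight."""
    d = len(query)
    # Scaled dot-product score between the query and every context key.
    scores = [sum(q * x for q, x in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    ranked = sorted(range(len(keys)), key=lambda i: weights[i], reverse=True)
    # Report the selected positions in document order.
    return sorted(ranked[:k]), weights

# Hypothetical 2-dimensional query and four context keys.
query = [1.0, 0.0]
keys = [[0.1, 0.9], [2.0, 0.0], [0.0, 1.0], [1.5, 0.5]]
evidence, weights = select_top_k(query, keys, k=2)
print(evidence)
```

Positions 1 and 3 have the largest dot product with the query, so they form the evidence pool for this round.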

Mathematically, the process for round $r$ is:

$$ A_r = \text{Attention}(Q \oplus E_{r-1}, C) \quad \text{(all tokens)} $$

$$ E_r = \text{top-}k(A_r) \quad \text{(token indices)} $$

where $\oplus$ denotes concatenation and $E_0 = \emptyset$.
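The recursion can be traced on a toy example. The sketch below makes two assumptions that are not taken from the source: concatenating the question with earlier evidence is approximated by averaging their vectors into one retrieval cue, and the pool accumulates across rounds as in the code snippet further down. The helper names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def recursive_select(query, keys, k, rounds):
    evidence = []  # E_0 is empty
    for _ in range(rounds):
        # Stand-in for "question concatenated with E_{r-1}":
        # pool the query with the keys of the evidence selected so far.
        cue = mean_vector([query] + [keys[i] for i in evidence])
        candidates = [i for i in range(len(keys)) if i not in evidence]
        # Softmax is monotonic, so ranking by raw score gives the same top-k.
        candidates.sort(key=lambda i: dot(cue, keys[i]), reverse=True)
        evidence.extend(candidates[:k])
    return evidence

query = [1.0, 0.0, 0.0]
keys = [
    [0.9, 0.8, 0.0],  # matches the query and carries a second feature
    [0.0, 1.0, 0.0],  # reachable only through that second feature
    [0.0, 0.0, 1.0],  # distractor
    [0.1, 0.0, 0.2],  # weak distractor
]
two_rounds = recursive_select(query, keys, k=1, rounds=2)
single_pass = recursive_select(query, keys, k=2, rounds=1)
print(two_rounds, single_pass)
```

With the same budget of two tokens, the two-round run reaches position 1 through the evidence found in round one, while the single pass settles for the weak distractor at position 3. This is the kind of multi-hop behaviour the recursive formulation is meant to capture.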

Theoretical Justification§

The authors frame ReContext through associative memory theory:

Context $C$ is a memory store with cues (token positions).
Question $Q$ acts as a retrieval cue.
Attention computes cue-trace association strength.
Evidence replay corresponds to trace reactivation, strengthening the memory signal.
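One way to see why replay could strengthen the signal is to look at softmax mass. The sketch below is an illustration under simplifying assumptions, not a claim about the authors' analysis: it treats attention scores as fixed, ignores positional effects, and assumes a replayed copy receives the same score as the original token. The function name `attention_mass` and the scores are hypothetical.

```python
import math

def attention_mass(scores, target_indices):
    """Share of softmax mass that falls on the given positions."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(exps[i] for i in target_indices) / total

# One relevant token (score 2.0) among five equally weak distractors.
scores = [2.0, 0.5, 0.5, 0.5, 0.5, 0.5]
before = attention_mass(scores, [0])

# Replay: a copy of the relevant token is prepended, so the same content
# now occupies two positions (0 and 1) in the input.
replayed = [scores[0]] + scores
after = attention_mass(replayed, [0, 1])
print(before, after)
```

Under these assumptions, the share of attention landing on the relevant content rises after replay, because the duplicated trace contributes twice to the numerator while the distractors are unchanged.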

Implementation Details§

ReContext requires no training or external memory. It uses the LLM's own attention mechanism to guide evidence selection. Key hyperparameters: $k$ (number of tokens per round), $R$ (number of recursive rounds). In experiments, $k=128$ and $R=3$ for 128K context. The method is applied to the final layer's attention heads, averaging across heads for a single relevance score per token.
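The head-averaging step can be written out without any tensor library. This is a minimal sketch assuming the last layer's attention is available as a nested list indexed `[head][query_pos][key_pos]` and that the question occupies the final query positions; `token_relevance` is a hypothetical name.

```python
def token_relevance(attn, num_question_tokens):
    """Average attention over heads and question positions.

    attn: nested list [heads][query_pos][key_pos] from one layer.
    Returns one relevance score per key position.
    """
    num_keys = len(attn[0][0])
    relevance = [0.0] * num_keys
    rows = 0
    for head in attn:
        # Only rows whose query position belongs to the question are used.
        for row in head[-num_question_tokens:]:
            for j, weight in enumerate(row):
                relevance[j] += weight
            rows += 1
    return [r / rows for r in relevance]

# Two heads, three positions, causal attention; the last position is the question.
head0 = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.2, 0.3, 0.5]]
head1 = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.6, 0.1, 0.3]]
scores = token_relevance([head0, head1], num_question_tokens=1)
print(scores)
```

Each score is the mean of the question row across the two heads. In a real run, the scores over the context span would then be ranked to pick the top $k$ tokens.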

Code Snippet (PyTorch-like)§

import torch

def recontext(model, input_ids, question_ids, context_start, k=128, R=3):
    # input_ids: [1, seq] tensor laid out as prefix + context + question
    # question_ids: 1-D sequence of question tokens (the tail of input_ids)
    # context_start: index in input_ids where the context begins
    # Assumes a Hugging Face-style model that can return attention weights.
    num_q = len(question_ids)
    q_start = input_ids.shape[1] - num_q  # sequence length is dim 1, not len(input_ids)
    prefix = input_ids[0, :context_start]
    context = input_ids[0, context_start:q_start]
    question = input_ids[0, q_start:]
    evidence_indices = []  # positions within `context`, in selection order
    cur_ids = input_ids
    for r in range(R):
        with torch.no_grad():
            outputs = model(cur_ids, output_attentions=True)
            attn = outputs.attentions[-1]  # last layer, shape [batch, heads, seq, seq]
        # Average over batch, heads, and the question's query positions -> [seq]
        attn_mean = attn[:, :, -num_q:, :].float().mean(dim=(0, 1, 2))
        # Replayed evidence sits between prefix and context, shifting the context span
        ctx_start = context_start + len(evidence_indices)
        context_attn = attn_mean[ctx_start:ctx_start + context.shape[0]].clone()
        # Mask context tokens that are already in the evidence pool
        if evidence_indices:
            context_attn[evidence_indices] = float("-inf")
        # Select the top-k remaining context tokens
        n_new = min(k, context.shape[0] - len(evidence_indices))
        new_evidence = torch.topk(context_attn, n_new).indices.tolist()
        evidence_indices.extend(new_evidence)
        # Rebuild the input: prefix + evidence + full context + question
        evidence_tokens = context[evidence_indices]
        cur_ids = torch.cat([prefix, evidence_tokens, context, question], dim=0).unsqueeze(0)
    # Final generation
    output = model.generate(cur_ids, max_new_tokens=128)
    return output

Experiments§

ReContext was evaluated on 8 long-context datasets (e.g., NarrativeQA, HotpotQA, 2WikiMultihop) at a context length of 128K, using Qwen3-4B, Qwen3-8B, and Llama3-8B as backbones. It consistently improves evidence utilization, measured by F1 score, over baselines such as no-retrieval and simple retrieval, and achieves the best average rank on all three backbones. Ablations show that recursive selection outperforms single-pass selection.

Key Advantages§

Training-free and model-agnostic.
Preserves full context (no pruning).
Low overhead: only a few forward passes for evidence selection.
Theoretically grounded in associative memory.
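The overhead claim can be made concrete with the reported hyperparameters. The arithmetic below is a back-of-the-envelope sketch that assumes "128K" means 128 × 1024 tokens and that every round adds exactly $k$ new tokens. It counts tokens and forward passes, not wall-clock cost, and `replay_overhead` is a hypothetical helper.

```python
def replay_overhead(context_len, k, rounds):
    """Token and pass counts implied by k tokens per round over R rounds."""
    evidence_tokens = k * rounds
    return {
        "evidence_tokens": evidence_tokens,
        "final_input_len": context_len + evidence_tokens,
        "relative_growth": evidence_tokens / context_len,
        "selection_passes": rounds,
    }

# Reported setting: k = 128, R = 3; context length assumed to be 128 * 1024.
stats = replay_overhead(context_len=128 * 1024, k=128, rounds=3)
for name, value in stats.items():
    print(name, value)
```

Under these assumptions the replayed pool is at most 384 tokens, which lengthens the final input by under 0.3%, while selection costs three additional forward passes over the context.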
