alignment
Published: June 30, 2026

Reinforcement Learning with Metacognitive Feedback Elicits Faithful Uncertainty Expression in LLMs

By Gabrielle Kaili-May Liu, Avi Caciularu, Gal Yona, Idan Szpektor, Arman Cohan

Research TL;DR

"Proposes RL with metacognitive feedback (RLMF) using self-judgment quality as reward for preference optimization, achieving superior calibration and uncertainty expression in LLMs."

Abstract

Metacognition is a critical component of intelligence that describes the ability to monitor and regulate one's own cognitive processes. Yet LLMs exhibit systemic deficiencies in key metacognitive faculties: they hallucinate with high confidence, fail to recognize knowledge boundaries, and misrepresent their internal uncertainty--undermining trustworthiness and reliability. Since monitoring task performance and adapting behavior accordingly are central to metacognition, we posit that models capable of accurately judging their own performance are better positioned to improve it. We operationalize this idea via two novel mechanisms: reinforcement learning with metacognitive feedback (RLMF), a paradigm to refine completion rankings during preference optimization based on the quality of a model's self-judgments of performance, and metacognitive data selection, which uses similar self-judgments to identify high-value training examples, outperforming naive active learning. We apply these innovations to the problem of faithful calibration (FC), a task that is itself fundamentally metacognitive: the goal is to align expressed with intrinsic uncertainty, difficult even for frontier LLMs. We adopt a two-stage, decoupled approach, first using these methods to calibrate the faithfulness of models' self-reported confidence scores, then mapping to natural, context-adaptable linguistic uncertainty via targeted output editing. Extensive experiments show RLMF achieves generalizable, state-of-the-art FC on diverse tasks while preserving accuracy. Further, RLMF surpasses standard RL by up to 63% while enhancing models' ability to assess and express their own capability limits. This positions RLMF as a promising paradigm to enhance LLM metacognition toward improved abilities and alignment, and suggests metacognitive performance as an effective RL signal to overcome limits of prior intrinsic feedback methods.

Technical Analysis & Implementation

Core Methodology

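The paper introduces Reinforcement Learning with Metacognitive Feedback (RLMF) and applies it within a two-stage, decoupled approach to faithful calibration: first calibrate the model's self-reported confidence scores, then map them to linguistic expressions of uncertainty.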

Stage 1: Faithful Calibration via RLMF

RLMF extends standard preference optimization (e.g., DPO) by incorporating a metacognitive reward. For each input $x$, the model generates a response $y$ and a self-reported confidence $c \in [0,1]$. During training, a metacognitive judge (described here as a committee of experts or a separate model) evaluates how well the confidence agrees with correctness; the abstract frames this signal as the quality of the model's own self-judgments of performance, which is used to refine completion rankings. The reward $R$ for a pair $(y, c)$ is:

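$$R(y,c) = \begin{cases} 1 - (c - \mathbb{I}[\text{correct}])^2 - \alpha & \text{if } c > 0.5 \text{ and } y \text{ is incorrect} \\ 1 - (c - \mathbb{I}[\text{correct}])^2 & \text{otherwise} \end{cases}$$

where $\mathbb{I}[\text{correct}]$ is 1 if $y$ is correct and 0 otherwise, and $\alpha \geq 0$ is a penalty subtracted for overconfident errors, matching the reward computed in the code snippet in this section. The preference optimization objective (e.g., InfoNCA) maximizes the likelihood of high-reward completions over low-reward ones.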
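To make the ranking step concrete, here is a minimal sketch of how a reward of this kind could order sampled completions into preference pairs. It mirrors the penalty logic of the PyTorch-style snippet in this section (squared calibration error, plus a penalty for confidence above 0.5 on a wrong answer). The function names, the candidate tuple layout, and the pairing rule are assumptions for illustration, not details from the paper.

```python
from itertools import combinations

def metacognitive_reward(correct: bool, confidence: float, alpha: float = 0.5) -> float:
    """1 - squared calibration error, minus alpha for overconfident errors."""
    target = 1.0 if correct else 0.0
    reward = 1.0 - (confidence - target) ** 2
    if confidence > 0.5 and not correct:
        reward -= alpha
    return reward

def preference_pairs(candidates, alpha=0.5):
    """Rank (text, correct, confidence) candidates by reward and return
    (chosen, rejected) text pairs for every pair with unequal rewards."""
    scored = [(metacognitive_reward(c, p, alpha), t) for t, c, p in candidates]
    pairs = []
    for (r_a, t_a), (r_b, t_b) in combinations(scored, 2):
        if r_a > r_b:
            pairs.append((t_a, t_b))
        elif r_b > r_a:
            pairs.append((t_b, t_a))
    return pairs

# Hypothetical completions for one prompt.
candidates = [
    ("Paris (confidence 0.9)", True, 0.9),   # correct and confident
    ("Lyon (confidence 0.9)", False, 0.9),   # wrong and overconfident
    ("Lyon (confidence 0.2)", False, 0.2),   # wrong but appropriately unsure
]
pairs = preference_pairs(candidates)
```

Under these assumptions the wrong but unsure completion is preferred over the wrong and overconfident one, which is the behavior a calibration-oriented reward is meant to encourage.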

Stage 2: Linguistic Uncertainty Mapping

After calibration, the model's confidence scores are mapped to natural language expressions via a trainable linear transformation followed by an editing module that adjusts output phrases (e.g., "high confidence" vs. "unsure") based on context.
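As a rough illustration of the second stage, the sketch below maps a calibrated confidence score to a hedging phrase and edits it into the answer. It uses a fixed lookup table as a stand-in for the learned mapping and editing module; the thresholds, phrases, and function names are hypothetical and not taken from the paper.

```python
# Hypothetical confidence bands and hedging phrases; not taken from the paper.
HEDGES = [
    (0.9, "I'm confident that"),
    (0.7, "I believe"),
    (0.4, "I think, though I'm not sure, that"),
    (0.0, "I'm unsure, but my best guess is that"),
]

def hedge_for(confidence: float) -> str:
    """Return the phrase for the highest band whose threshold is met."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    for threshold, phrase in HEDGES:
        if confidence >= threshold:
            return phrase
    return HEDGES[-1][1]  # not reached: the last threshold is 0.0

def express(answer: str, confidence: float) -> str:
    """Prefix the answer with a hedge that reflects the confidence score."""
    return f"{hedge_for(confidence)} {answer}."

confident = express("the capital of France is Paris", 0.95)
unsure = express("the treaty was signed in 1648", 0.30)
```

A context-adaptable editor would choose phrasing based on the surrounding text rather than a static table; the table only shows the direction of the mapping.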

Metacognitive Data Selection

To augment training data, the model's self-judgments of difficulty (e.g., confidence after initial training) are used to select high-value examples. This outperforms random or uncertainty-based active learning.
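One plausible reading of this selection step is sketched below: score each candidate example by how far the model's self-reported confidence is from its actual correctness, then keep the examples with the largest gap. The scoring rule, the dictionary fields, and the function name are assumptions made for illustration; the paper may define "high-value" differently.

```python
def select_high_value(examples, budget):
    """Pick the `budget` examples where the self-judgment is most wrong.

    examples: list of dicts with 'id', 'confidence' (self-reported, in [0, 1])
    and 'correct' (bool). Returns the selected ids, worst misjudgment first.
    """
    def misjudgment(ex):
        target = 1.0 if ex["correct"] else 0.0
        return abs(ex["confidence"] - target)

    ranked = sorted(examples, key=misjudgment, reverse=True)
    return [ex["id"] for ex in ranked[:budget]]

# Hypothetical pool of candidate training examples.
pool = [
    {"id": "a", "confidence": 0.95, "correct": True},   # well judged
    {"id": "b", "confidence": 0.90, "correct": False},  # overconfident error
    {"id": "c", "confidence": 0.30, "correct": True},   # underconfident success
    {"id": "d", "confidence": 0.10, "correct": False},  # well judged
]
selected = select_high_value(pool, budget=2)
```

Unlike plain uncertainty sampling, this rule also surfaces confidently wrong examples, which low-confidence filters would miss.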

Implementation Details

  • Base model: Llama 2 7B
  • RLMF uses a lightweight judge (trained on calibration metrics) with a reward shaping term.
  • Training follows a two-stage decoupled process: first calibrate confidence, then train language mapping.

Code Snippet (PyTorch-style)

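import torch
import torch.nn.functional as F

class RLMFTrainer:
    """Illustrative sketch of one RLMF-style training step; not the authors' implementation."""

    def __init__(self, model, judge_model, optimizer=None, alpha=0.5):
        self.model = model
        self.judge = judge_model    # stored but unused in this sketch
        self.optimizer = optimizer  # optional torch.optim optimizer
        self.alpha = alpha          # penalty for overconfident errors

    def compute_reward(self, responses, confidences, targets):
        # responses and targets: class ids of shape [B]; confidences: floats in [0, 1], shape [B]
        correctness = (responses == targets).float()
        calibration_error = (confidences - correctness) ** 2
        reward = 1 - calibration_error
        # Penalize overconfident errors (confidence above 0.5 on a wrong answer)
        penalty = (confidences > 0.5) & (correctness < 0.5)
        return reward - self.alpha * penalty.float()

    def train_step(self, batch):
        # Assumes the model returns (logits [B, C], confidence [B]) from a confidence head
        logits, confidence = self.model(batch['input_ids'])
        responses = logits.argmax(dim=-1)
        rewards = self.compute_reward(responses, confidence, batch['labels'])
        # Placeholder for a preference-optimization loss (e.g., InfoNCA): treats the
        # batch as one candidate set with item 0 as the preferred completion
        loss = -F.log_softmax(rewards, dim=0)[0]
        if self.optimizer is not None:
            self.optimizer.zero_grad()
        loss.backward()
        if self.optimizer is not None:
            self.optimizer.step()
        return loss.item()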
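The snippet above names InfoNCA as an example objective but leaves the loss as a one-line placeholder. A minimal sketch of an InfoNCA-style loss follows, assuming the commonly described form: a cross-entropy between a softmax over explicit rewards and a softmax over implicit rewards, where the implicit reward is a scaled log-ratio of policy to reference likelihood. The function names and the `reward_temp` and `beta` parameters are illustrative assumptions, and the paper's exact objective may differ.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def infonca_style_loss(rewards, policy_logps, ref_logps, reward_temp=1.0, beta=1.0):
    """Cross-entropy between reward-derived targets and policy-derived scores.

    rewards: explicit reward per candidate completion of one prompt.
    policy_logps / ref_logps: log-likelihood of each candidate under the
    trained policy and a frozen reference model.
    """
    targets = softmax([r / reward_temp for r in rewards])
    implicit = [beta * (lp - lr) for lp, lr in zip(policy_logps, ref_logps)]
    probs = softmax(implicit)
    return -sum(t * math.log(p) for t, p in zip(targets, probs))

rewards = [0.99, -0.31]   # e.g. well calibrated vs. overconfident error
ref = [-2.0, -2.0]
aligned = infonca_style_loss(rewards, [-1.0, -3.0], ref)   # policy prefers candidate 0
reversed_ = infonca_style_loss(rewards, [-3.0, -1.0], ref)  # policy prefers candidate 1
```

The loss is lower when the policy assigns relatively more likelihood to the higher-reward completion, so minimizing it pushes the policy toward the reward ranking.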

Key Results

  • RLMF achieves generalizable, state-of-the-art faithful calibration on diverse tasks (e.g., MMLU, TriviaQA) and, per the abstract, surpasses standard RL by up to 63%.
  • Metacognitive data selection yields faster convergence and better final calibration than random or active learning baselines.
  • The two-stage approach preserves task accuracy while improving uncertainty expression.
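Calibration of numeric confidence scores is often summarized with expected calibration error (ECE), which compares average confidence to accuracy within confidence bins. The sketch below is a generic equal-width-bin ECE, included only to show what such a measurement involves; the paper's faithful-calibration metric may be defined differently, and the function name and bin count are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |avg confidence - accuracy| per bin."""
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, 1.0 if ok else 0.0))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(a for _, a in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece

# Four answers at 95% confidence, three of them correct.
overconfident = expected_calibration_error([0.95] * 4, [True, True, True, False])
# Confidence matches correctness exactly.
perfect = expected_calibration_error([1.0, 0.0], [True, False])
```

A lower value means reported confidence tracks observed accuracy more closely.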

Significance

This work bridges metacognition and RL, showing that the quality of a model's self-judgments can serve as an effective RL signal that overcomes limits of prior intrinsic feedback methods. It positions RLMF as a promising path to more trustworthy LLMs.
