Reasoning LLM Improves Speaker Recognition in Long-form TV Dramas

Technical Breakdown

Core Contributions

1. DramaSR-532K: A large-scale benchmark with 532K annotated dialogue lines from TV dramas, covering >900 characters. Each utterance is annotated with speaker identity, and the benchmark requires integrating audio, visual (facial appearances, body language), and linguistic (dialogue context, scene info) cues.
2. DramaSR-LRM: A method that uses a Large Reasoning Model (LRM) to autonomously aggregate evidence via multimodal tool-use, improving speaker recognition accuracy.

Methodology

DramaSR-LRM employs an LRM (an LLM with explicit reasoning capabilities) that can invoke external tools to gather contextual evidence. The architecture consists of three components:

Multimodal input module: Processes audio, video frames, and dialogue transcripts.
Tool-use reasoning: The LRM decides which tools (e.g., face detection, speaker diarization, text-based character name retrieval) to call based on the current utterance and context.
Attribution head: Fuses the evidence from tools to predict the speaker identity.

The reasoning process is iterative: the model may call tools multiple times to refine its prediction. Training follows a two-stage approach: (1) supervised fine-tuning on the DramaSR-532K dataset with ground-truth speaker labels, and (2) reasoning-oriented fine-tuning, in which the model learns to plan its tool calls.
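The iterative evidence-gathering loop can be sketched as follows. This is a minimal illustration, not the paper's implementation: the tool stubs, the `TOOLS` registry, the additive scoring rule, and the `threshold` and `max_steps` parameters are all hypothetical, and a fixed call order stands in for the tool-selection policy that the LRM would learn.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical tool interface: each tool maps an utterance record to
# (candidate speaker, confidence). Real tools would wrap face, voice,
# or text models; these stubs only read precomputed fields.
Tool = Callable[[dict], Tuple[Optional[str], float]]

def face_tool(utt: dict) -> Tuple[Optional[str], float]:
    return utt.get("face", (None, 0.0))

def voice_tool(utt: dict) -> Tuple[Optional[str], float]:
    return utt.get("voice", (None, 0.0))

def text_tool(utt: dict) -> Tuple[Optional[str], float]:
    return utt.get("text", (None, 0.0))

TOOLS: Dict[str, Tool] = {"face": face_tool, "voice": voice_tool, "text": text_tool}

def attribute_speaker(utt: dict, tools: Dict[str, Tool] = TOOLS,
                      threshold: float = 1.0, max_steps: int = 3):
    """Call tools one at a time, summing per-speaker confidence, and stop
    early once the leading candidate's score reaches the threshold."""
    scores: Dict[str, float] = {}
    trace: List[str] = []  # names of the tools that were actually called
    for name in list(tools)[:max_steps]:
        speaker, conf = tools[name](utt)
        trace.append(name)
        if speaker is not None:
            scores[speaker] = scores.get(speaker, 0.0) + conf
        if scores and max(scores.values()) >= threshold:
            break
    best = max(scores, key=scores.get) if scores else None
    return best, trace

utt = {"face": ("Anna", 0.6), "voice": ("Anna", 0.5), "text": ("Ben", 0.9)}
speaker, trace = attribute_speaker(utt)
```

In this toy run the face and voice evidence agree, so the loop stops after two calls and never consults the text tool. A learned policy would instead choose which tool to call next based on the evidence gathered so far.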

Key Equations

Let the input be a multimodal sequence: audio features $\mathbf{A}$, video frames $\mathbf{V}$, and textual dialogue $\mathbf{T}$. The LRM produces a sequence of reasoning steps $\mathbf{R} = [r_1, r_2, \dots, r_n]$, where each step $r_i$ is either a tool call (e.g., face_recognize(frame)) or an intermediate inference. The final speaker probability is:

$$P(\text{speaker} \mid \mathbf{A}, \mathbf{V}, \mathbf{T}) = \text{softmax}(f_{\text{head}}(\mathbf{R}))$$
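As a small numerical illustration of the softmax step, the sketch below turns a vector of attribution-head outputs into speaker probabilities. The character names and logit values are made up for the example; only the softmax itself comes from the formula.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

characters = ["Anna", "Ben", "Clara"]  # hypothetical character set
head_logits = [2.0, 1.0, 0.1]          # hypothetical f_head(R) output
probs = softmax(head_logits)
predicted = characters[probs.index(max(probs))]
```

The probabilities sum to one and the predicted speaker is the character with the largest logit, since softmax preserves the ordering of its inputs.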

The tool-use is formulated as a Markov decision process where the LRM learns to maximize attribution accuracy.
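One way to write this down, under assumptions not stated in the source, is to take the state at step $t$ as $s_t = (\mathbf{A}, \mathbf{V}, \mathbf{T}, r_1, \dots, r_{t-1})$, the action as the next reasoning step $a_t = r_t$ sampled from a policy $\pi_\theta$, and a terminal reward of 1 when the predicted speaker $\hat{y}(\mathbf{R})$ matches the ground-truth label $y$ and 0 otherwise. The objective would then be the expected attribution accuracy:

$$J(\theta) = \mathbb{E}_{\mathbf{R} \sim \pi_\theta(\cdot \mid \mathbf{A}, \mathbf{V}, \mathbf{T})}\left[\mathbb{1}[\hat{y}(\mathbf{R}) = y]\right]$$

The symbols $\pi_\theta$, $s_t$, $a_t$, $\hat{y}$, and $y$ are introduced here for illustration only; the actual state, reward, and optimization procedure may differ.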

Implementation Details

LRM backbone: Uses a decoder-only transformer (e.g., LLaMA-style) pretrained on text and fine-tuned on multimodal data.
Tool functions: Predefined APIs for face detection (MTCNN), speaker diarization (pyannote.audio), and text-based name lookup (TF-IDF on character descriptions).
Training: Batch size 64, learning rate 1e-5, AdamW optimizer, trained for 20 epochs on 8 A100 GPUs.
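To make the text-based name lookup concrete, here is a minimal TF-IDF retrieval sketch over character descriptions using only the standard library. The character descriptions, the tokenizer, the smoothed IDF variant, and the function names `build_index` and `lookup` are illustrative assumptions, not the tool described above.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase whitespace tokenizer that strips surrounding punctuation."""
    tokens = (w.strip(".,!?;:\"'").lower() for w in text.split())
    return [t for t in tokens if t]

def build_index(descriptions):
    """Build TF-IDF vectors from a {character name: description} dict."""
    counts = {name: Counter(tokenize(text)) for name, text in descriptions.items()}
    n_docs = len(counts)
    doc_freq = Counter(term for c in counts.values() for term in c)
    # Smoothed IDF, so terms found in every description keep a small weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = {name: {t: tf * idf[t] for t, tf in c.items()}
               for name, c in counts.items()}
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def lookup(query, vectors, idf):
    """Return the best-matching character name, or None if no term is known."""
    q = {t: tf * idf[t] for t, tf in Counter(tokenize(query)).items() if t in idf}
    if not q:
        return None
    return max(vectors, key=lambda name: cosine(q, vectors[name]))

# Hypothetical character descriptions.
descriptions = {
    "Anna": "a young doctor at the city hospital",
    "Ben": "a retired detective who runs a bookshop",
}
vectors, idf = build_index(descriptions)
match = lookup("who is that detective", vectors, idf)
```

Returning `None` for queries with no known terms lets the caller fall back to other tools instead of accepting an arbitrary match.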

Code Snippet (PyTorch)

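import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

class DramaSR_LRM(nn.Module):
    # num_characters was undefined in the original snippet, so it is now an explicit argument.
    # The "-hf" checkpoint is the one published in the Transformers format.
    def __init__(self, num_characters, model_name="meta-llama/Llama-2-7b-hf", num_tools=5):
        super().__init__()
        self.lrm = AutoModelForCausalLM.from_pretrained(model_name)
        hidden_size = self.lrm.config.hidden_size
        # One learned embedding per tool id
        self.tool_embeddings = nn.Embedding(num_tools, hidden_size)
        # Maps the pooled hidden state to one logit per character
        self.attribution_head = nn.Linear(hidden_size, num_characters)

    def forward(self, audio_feats, video_feats, text_ids, tool_mask):
        # Encode inputs (simplified). How audio_feats and video_feats are fused
        # is not specified here, so this sketch embeds the text tokens only.
        inputs_embeds = self.lrm.get_input_embeddings()(text_ids)
        outputs = self.lrm(inputs_embeds=inputs_embeds, output_hidden_states=True)
        multimodal_hidden = outputs.hidden_states[-1]  # [batch, seq, hidden]
        # Add tool embeddings at reasoning steps
        tool_emb = self.tool_embeddings(tool_mask)  # tool_mask: LongTensor of tool ids, [batch, seq]
        combined = multimodal_hidden + tool_emb
        # Mean-pool over the sequence, then score each character
        logits = self.attribution_head(combined.mean(dim=1))
        return logits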

Experimental Results

DramaSR-LRM achieves 92.3% accuracy on short utterances (<2 s), compared with 78.1% for the best baseline (audio-only). On full-length utterances, it reaches 96.1%. Ablation studies show that tool-use reasoning contributes a +5.2% improvement over a standard multimodal fusion baseline.
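For readers who want to reproduce this kind of duration-split reporting on their own predictions, the sketch below computes accuracy separately for short utterances and for the rest. The 2-second threshold comes from the text; the record format, the bucket names, and the toy data are assumptions, and the exact split used in the evaluation may be defined differently.

```python
def accuracy_by_duration(records, short_threshold_s=2.0):
    """records: iterable of (duration_seconds, predicted_speaker, gold_speaker).
    Returns accuracy per bucket, or None for an empty bucket."""
    buckets = {"short": [0, 0], "rest": [0, 0]}  # [correct, total]
    for duration, pred, gold in records:
        key = "short" if duration < short_threshold_s else "rest"
        buckets[key][0] += int(pred == gold)
        buckets[key][1] += 1
    return {k: (c / n if n else None) for k, (c, n) in buckets.items()}

# Toy predictions, not results from the benchmark.
records = [
    (1.2, "Anna", "Anna"),
    (0.8, "Ben", "Anna"),
    (3.5, "Clara", "Clara"),
    (5.0, "Anna", "Anna"),
]
report = accuracy_by_duration(records)
```

Reporting the two buckets separately matters because short utterances carry little acoustic evidence, which is where contextual cues are expected to help most.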
