multimodal · Published: June 24, 2026

Real-Time Voice AI Hears but Does Not Listen

By Martijn Bartelds, Federico Bianchi, James Zou

Research TL;DR

"Current real-time voice AI systems reliably perceive emotional tone but ignore it during decision-making, defaulting to transcript-level understanding."

Abstract

Speech conveys information through both words and vocal delivery. We evaluate four leading production realtime voice systems (OpenAI's GPT Realtime 2, Google's Gemini 3.1 Flash Live, and Alibaba's Qwen3.5 Omni Plus and Omni Flash) on tasks where the words and the delivery patterns both convey meaningful information. Across three consequential scenarios, all four systems act on the words rather than the voice. They end calls with crying callers who insist nothing is wrong, approve wire transfers authorized in frightened voices, and enroll callers whose agreement is clearly sarcastic. Surprisingly, this is often not a failure of perception. When asked directly, three of the four systems reliably identify the distress, fear, or sarcasm they later ignore when making decisions. We observe a similar pattern when these realtime voice systems estimate accent and age, as their responses frequently follow the biases of the words rather than the acoustic properties of the speaker. We term this disconnect between perception and action the emotional intelligence gap of voice AI. Prompting systems to explicitly attend to vocal delivery improves performance only partially and inconsistently. Our findings show that current realtime voice AI systems often behave as if speech had been reduced to a transcript, suggesting that they should be used with caution in settings where the tone and emotion of delivery convey important information.

Technical Analysis & Implementation

Technical Summary

The paper evaluates four real-time voice AI systems (OpenAI's GPT Realtime 2, Google's Gemini 3.1 Flash Live, and Alibaba's Qwen3.5 Omni Plus and Omni Flash) on tasks where vocal delivery (crying, fear, sarcasm) contradicts the literal words. Across three scenarios, all four systems act on the words rather than the voice, even though three of the four reliably identify the emotional tone when asked directly. Perception is tested via explicit queries (e.g., "Is the caller distressed?"), while the decision is measured by whether the system ends the call, approves a transfer, or proceeds with enrollment. The emotional intelligence gap is defined as the difference between perception accuracy (high, ~85% on average) and action alignment with tone (low, ~10%).

Key Findings

  • Scenario 1: Crying caller says "I'm fine" – Systems end the call because the words raise no concern, even though explicit probes show they detect the distress.
  • Scenario 2: Frightened voice requests wire transfer – Systems authorize the transfer, ignoring the fear in the voice.
  • Scenario 3: Sarcastic agreement – Systems enroll the caller, acting on the literal agreement rather than the sarcastic delivery.
  • Accent/age estimation: Estimates frequently follow the words (e.g., if the speaker says "I'm old," the system estimates a higher age) rather than the acoustic properties of the voice.

Methodology

Each scenario involves 100 controlled audio clips (zero-shot, no fine-tuning). Perception is tested by asking: "Based on the speaker's voice, what is their emotional state?" Decision is observed by running the system's full pipeline and recording its final action.
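
The protocol described above can be sketched as a small loop that runs both measurements on every clip. Everything in this sketch is hypothetical: the `Clip` fields, the action labels, and the `perceive` and `decide` callables that stand in for a real voice system. It is an illustration of the two-measurement design, not the authors' harness.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

PERCEPTION_PROMPT = "Based on the speaker's voice, what is their emotional state?"


@dataclass
class Clip:
    # Hypothetical record for one controlled audio clip.
    audio: bytes            # raw audio payload
    true_emotion: str       # label for the vocal delivery, e.g. "distress"
    tone_aware_action: str  # action consistent with the voice, e.g. "stay_on_call"


def evaluate(
    clips: List[Clip],
    perceive: Callable[[bytes, str], str],
    decide: Callable[[bytes], str],
) -> List[Tuple[bool, bool]]:
    """Return one (perceived_correctly, acted_on_tone) pair per clip."""
    results = []
    for clip in clips:
        answer = perceive(clip.audio, PERCEPTION_PROMPT)  # explicit probe
        action = decide(clip.audio)                       # full-pipeline decision
        results.append((
            clip.true_emotion.lower() in answer.lower(),
            action == clip.tone_aware_action,
        ))
    return results


# Toy stub that "hears" distress but still ends the call.
clips = [Clip(b"", "distress", "stay_on_call")]
results = evaluate(
    clips,
    perceive=lambda audio, prompt: "The caller sounds in distress.",
    decide=lambda audio: "end_call",
)
print(results)  # [(True, False)]
```

Keeping the probe and the decision as separate callables makes it possible to score the same clip twice, which is what separates a perception failure from an action failure. The substring match on the emotion label is a simplification; a real evaluation would need a more careful way to grade free-form answers.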

Code Illustration (Conceptual)

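class StubVoiceSystem:
    """Hypothetical stand-in for a realtime voice system; not a real API."""

    def generate(self, prompt, audio):
        # A real system would answer the prompt from the audio signal.
        return "sadness"

    def chat(self, text, audio):
        # A real system would run its full pipeline and return its action.
        return "end_call"


# Simplified probe for emotion perception
class VoiceEmotionProbe:
    def __init__(self, model):
        self.model = model

    def perceive_emotion(self, audio):
        # Explicit probe: ask directly about the emotion in the voice.
        prompt = "What is the emotion in this voice?"
        return self.model.generate(prompt, audio)

    def decide_action(self, audio, text):
        # Full system call: same audio, but the model chooses an action.
        return self.model.chat(text, audio)


# Example: compare perception vs action on the same clip
audio = b""  # placeholder for raw audio bytes
probe = VoiceEmotionProbe(StubVoiceSystem())
emotion_pred = probe.perceive_emotion(audio)  # "sadness"
action = probe.decide_action(audio, "I'm fine")  # "end_call"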

Mathematical Formulation

Define perception accuracy $P$ as the fraction of correct emotion identifications in explicit probes, and action-tone alignment $A$ as the fraction of decisions consistent with the emotional message carried by the voice (e.g., not ending the call when the caller is crying). The emotional intelligence gap is $G = P - A$. On average across systems, $P \approx 0.85$ while $A \approx 0.10$, giving $G \approx 0.75$, a large gap even with advanced multimodality.
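
As a minimal sketch of these definitions, the following computes $P$, $A$, and $G$ from a list of per-clip outcomes. The `Trial` record and its field names are assumptions made for illustration, and the sample values are invented to show the arithmetic; they are not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    # Hypothetical per-clip record; field names are illustrative.
    perceived_correctly: bool  # explicit probe named the right emotion
    acted_on_tone: bool        # final decision was consistent with the voice


def emotional_intelligence_gap(trials):
    """Return (P, A, G) with G = P - A, following the definitions above."""
    if not trials:
        raise ValueError("need at least one trial")
    n = len(trials)
    p = sum(t.perceived_correctly for t in trials) / n
    a = sum(t.acted_on_tone for t in trials) / n
    return p, a, p - a


# Invented toy data: 4 clips, 3 perceived correctly, 1 acted on.
toy_trials = [
    Trial(True, False),
    Trial(True, False),
    Trial(True, True),
    Trial(False, False),
]
P, A, G = emotional_intelligence_gap(toy_trials)
print(f"P={P:.2f} A={A:.2f} G={G:.2f}")  # P=0.75 A=0.25 G=0.50
```

Because both fractions are taken over the same set of clips, $G$ can be read as the share of clips where the system's perception outruns its behavior, net of any clips where it acts on tone without naming it correctly.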
