Real-Time Voice AI Hears but Does Not Listen

Abstract

Speech conveys information through both words and vocal delivery. We evaluate four leading production realtime voice systems (OpenAI's GPT Realtime 2, Google's Gemini 3.1 Flash Live, and Alibaba's Qwen3.5 Omni Plus and Omni Flash) on tasks where the words and the delivery patterns both convey meaningful information. Across three consequential scenarios, all four systems act on the words rather than the voice. They end calls with crying callers who insist nothing is wrong, approve wire transfers authorized in frightened voices, and enroll callers whose agreement is clearly sarcastic. Surprisingly, this is often not a failure of perception. When asked directly, three of the four systems reliably identify the distress, fear, or sarcasm they later ignore when making decisions. We observe a similar pattern when these realtime voice systems estimate accent and age, as their responses frequently follow the biases of the words rather than the acoustic properties of the speaker. We term this disconnect between perception and action the emotional intelligence gap of voice AI. Prompting systems to explicitly attend to vocal delivery improves performance only partially and inconsistently. Our findings show that current realtime voice AI systems often behave as if speech had been reduced to a transcript, suggesting that they should be used with caution in settings where the tone and emotion of delivery convey important information.

Technical Analysis & Implementation

Technical Summary

The paper evaluates four real-time voice AI systems (OpenAI's GPT Realtime 2, Google's Gemini 3.1 Flash Live, and Alibaba's Qwen3.5 Omni Plus and Omni Flash) on tasks where vocal delivery (crying, fear, sarcasm) contradicts the literal words. Across three scenarios, all four systems act on the words rather than the voice, even though three of the four reliably identify the emotional tone when asked directly. Perception is tested with explicit queries (e.g., "Is the caller distressed?"), while the decision is measured by whether the system ends the call, approves a transfer, or proceeds with enrollment. The emotional intelligence gap is defined as the difference between perception accuracy (high, ~85% on average) and action alignment with tone (low, ~10%).

Key Findings

Scenario 1: Crying caller says "I'm fine" – Systems end the call because the words report no problem, even though explicit probes show that most of them detect the distress.
Scenario 2: Frightened voice requests wire transfer – Systems authorize the transfer, ignoring the fear in the caller's voice.
Scenario 3: Sarcastic agreement – Systems enroll the caller, acting on the literal agreement even though most of them identify the sarcasm when asked directly.
Accent/age estimation: Estimates follow the words (e.g., if the speaker says "I'm old," the estimated age goes up) rather than the acoustic cues.

Methodology

Each scenario involves 100 controlled audio clips (zero-shot, no fine-tuning). Perception is tested by asking: "Based on the speaker's voice, what is their emotional state?" Decision is observed by running the system's full pipeline and recording its final action.
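The two measurements described above can be recorded per clip roughly as follows. This is a minimal sketch, not the paper's evaluation harness: the `Trial` fields, the scenario keys, the action labels, and the substring-matching rule for scoring the probe answer are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass

# Hypothetical labels: the action a system takes when it follows the words alone.
WORD_DRIVEN_ACTION = {
    "crying_im_fine": "end_call",
    "frightened_transfer": "approve_transfer",
    "sarcastic_agreement": "enroll",
}


@dataclass(frozen=True)
class Trial:
    scenario: str      # key into WORD_DRIVEN_ACTION
    true_emotion: str  # emotion carried by the vocal delivery
    probe_answer: str  # system's reply to the explicit perception question
    action: str        # final action taken by the full pipeline


def perceived(trial: Trial) -> bool:
    """Assumed scoring rule: the probe answer mentions the true emotion."""
    return trial.true_emotion.lower() in trial.probe_answer.lower()


def tone_aligned(trial: Trial) -> bool:
    """Assumed scoring rule: any action other than the word-driven one is aligned."""
    return trial.action != WORD_DRIVEN_ACTION[trial.scenario]


trial = Trial(
    scenario="crying_im_fine",
    true_emotion="distress",
    probe_answer="The caller sounds like they are in distress.",
    action="end_call",
)
print(perceived(trial), tone_aligned(trial))
```

In this invented example the system names the distress in the probe but still ends the call, which is the perception-without-action pattern the scenarios are designed to expose.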

Code Illustration (Conceptual)

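# Hypothetical stand-in for a realtime voice system; not a real vendor API.
class StubVoiceSystem:
    def generate(self, prompt, audio):
        return "sadness"   # canned perception answer
    def chat(self, text, audio):
        return "end_call"  # canned word-driven action

# Simplified probe comparing what a system perceives with what it does
class VoiceEmotionProbe:
    def __init__(self, model):
        self.model = model

    def perceive_emotion(self, audio):
        # Explicit perception probe: ask directly about the vocal delivery
        prompt = "Based on the speaker's voice, what is their emotional state?"
        return self.model.generate(prompt, audio=audio)

    def decide_action(self, audio, text):
        # Full system call: the model receives the audio and chooses an action
        return self.model.chat(text, audio=audio)

# Example: compare perception vs. action on a crying "I'm fine" clip
system = StubVoiceSystem()
audio = b""  # placeholder for the clip's raw audio bytes
probe = VoiceEmotionProbe(system)
emotion_pred = probe.perceive_emotion(audio)     # stub returns "sadness"
action = probe.decide_action(audio, "I'm fine")  # stub returns "end_call"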

Mathematical Formulation

Define perception accuracy $P$ as the fraction of correct emotion identifications in explicit probes. Define action-tone alignment $A$ as the fraction of decisions consistent with the emotional message carried by the voice (e.g., not ending the call when the caller is crying). The emotional intelligence gap is $G = P - A$. On average across systems, $P \approx 0.85$ while $A \approx 0.10$, so $G \approx 0.75$, a large gap even for advanced multimodal systems.
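As a worked instance of the definitions above, the sketch below computes $P$, $A$, and $G$ from per-clip boolean outcomes. The function name and the toy data are assumptions made for illustration; the 20 outcomes are invented to match the approximate averages quoted above and are not results from the paper.

```python
def emotional_intelligence_gap(perception_correct, action_aligned):
    """Return (P, A, G) from two equal-length lists of per-clip booleans."""
    if not perception_correct or len(perception_correct) != len(action_aligned):
        raise ValueError("need two non-empty lists of equal length")
    n = len(perception_correct)
    p = sum(perception_correct) / n  # fraction of correct explicit probes
    a = sum(action_aligned) / n      # fraction of tone-consistent decisions
    return p, a, p - a


# Toy data, invented for illustration: 20 clips for one hypothetical system.
perception = [True] * 17 + [False] * 3  # 17 of 20 probes correct
actions = [True] * 2 + [False] * 18     # 2 of 20 decisions tone-aligned

p, a, g = emotional_intelligence_gap(perception, actions)
print(f"P={p:.2f} A={a:.2f} G={g:.2f}")  # P=0.85 A=0.10 G=0.75
```

Computing the gap per system and per scenario, rather than only in aggregate, would show whether a low $A$ comes from one scenario or holds across all three.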
