Audio-Based Understanding of Audiobook Narration Appeal

Technical Breakdown

This paper presents a computational study linking narration qualities (tone, pace, loudness) to audiobook appeal, measured via view-rate and proprietary engagement metrics. The method has two stages: features are extracted from LibriVox audio with pre-trained speech models, and a statistical model then relates those features to appeal.

Feature Extraction

Audio features are extracted using two pre-trained models:

wav2vec 2.0 (self-supervised): produces frame-level representations.
HuBERT (hidden-unit BERT, also self-supervised): likewise produces frame-level representations.

Features are aggregated over time via mean pooling to obtain utterance-level embeddings. Additionally, hand-crafted prosodic features (e.g., mean pitch, speaking rate, energy) are computed using librosa.
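The pooling step can be illustrated with a small NumPy sketch. It assumes batches are zero-padded to a common length and that the number of valid frames per utterance is known; the function name `masked_mean_pool` and the toy data are hypothetical and not taken from the paper. Restricting the average to valid frames keeps padding from diluting the embeddings of shorter utterances.

```python
import numpy as np

def masked_mean_pool(frames, lengths):
    """Average frame-level features over the valid (unpadded) frames only.

    frames:  (batch, seq_len, hidden_dim) array of frame representations
    lengths: (batch,) number of valid frames per utterance
    """
    frames = np.asarray(frames, dtype=np.float64)
    lengths = np.asarray(lengths)
    # mask[b, t] is 1.0 where frame t belongs to utterance b, else 0.0
    mask = (np.arange(frames.shape[1])[None, :] < lengths[:, None]).astype(frames.dtype)
    summed = (frames * mask[:, :, None]).sum(axis=1)
    return summed / lengths[:, None]

# Toy batch: 2 utterances, up to 3 frames, 2 feature dimensions.
frames = np.array([
    [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]],   # last frame is padding
    [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]],
])
pooled = masked_mean_pool(frames, lengths=[2, 3])   # shape (2, 2)
```

When every utterance in a batch has the same length, this reduces to a plain mean over the time axis.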

Modeling

A linear mixed-effects model is used to predict view-rate $y_{i,j,k}$ for audiobook $i$, genre $j$, title $k$:

$$y_{i,j,k} = \mu + \alpha_j + \beta_k + \mathbf{x}_i^T \mathbf{w} + \epsilon_{i,j,k}$$

where $\alpha_j$ and $\beta_k$ are random intercepts for genre and title, $\mathbf{x}_i$ is the feature vector, $\mathbf{w}$ is the vector of learned coefficients, and $\epsilon_{i,j,k}$ is noise. The model is fitted via maximum likelihood.
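To make the roles of the terms concrete, the following sketch simulates data from the equation above. It assumes zero-mean Gaussian random intercepts and noise, which is the usual convention for linear mixed-effects models but is not stated in the summary. All sizes, variances, and coefficients are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_genres, n_titles, n_books, n_features = 4, 30, 200, 8

# Hypothetical variance components and coefficients, for illustration only.
sigma_genre, sigma_title, sigma_noise = 0.5, 0.3, 1.0
mu = 2.0
w = rng.normal(size=n_features)

alpha = rng.normal(scale=sigma_genre, size=n_genres)   # random intercept per genre
beta = rng.normal(scale=sigma_title, size=n_titles)    # random intercept per title

genre_of = rng.integers(n_genres, size=n_books)        # genre index j of each audiobook
title_of = rng.integers(n_titles, size=n_books)        # title index k of each audiobook
X = rng.normal(size=(n_books, n_features))             # feature vectors x_i

eps = rng.normal(scale=sigma_noise, size=n_books)      # per-audiobook noise
y = mu + alpha[genre_of] + beta[title_of] + X @ w + eps
```

Under these assumptions, two recordings of the same title share the same $\beta_k$, so their view-rates are correlated even when their acoustic features differ. The random intercepts are there to absorb this shared variation.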

Code Snippet

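import torch
import torch.nn as nn
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

class AudioFeatureExtractor(nn.Module):
    def __init__(self, model_name='facebook/wav2vec2-base'):
        super().__init__()
        # The pre-trained-only base checkpoint ships no tokenizer, so load the
        # feature extractor directly rather than a full Wav2Vec2Processor.
        self.processor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
        self.model = Wav2Vec2Model.from_pretrained(model_name).eval()

    def forward(self, waveform):
        # waveform: (batch, T) float tensor, 16 kHz mono, normalized to [-1, 1]
        # The feature extractor expects NumPy input, one 1-D array per utterance.
        arrays = [w.cpu().numpy() for w in waveform]
        inputs = self.processor(arrays, sampling_rate=16000, return_tensors='pt', padding=True)
        with torch.no_grad():
            outputs = self.model(**inputs)
        # outputs.last_hidden_state: (batch, seq_len, hidden_dim)
        # Mean-pool over time to get one embedding per utterance.
        return outputs.last_hidden_state.mean(dim=1)  # (batch, hidden_dim)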

Key Findings

Acoustic features alone explain a significant portion of view-rate variance, even after controlling for genre and title (partial $R^2 \approx 0.15$).
Effects of specific features (e.g., pace, loudness) vary by genre: faster pace is appealing for thrillers but not for classics.
Validation on proprietary data (listener engagement) confirms robustness.
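For readers unfamiliar with partial $R^2$, the sketch below computes the standard fixed-effects version: the fraction of variance left unexplained by the controls that the added features account for. This is an ordinary least-squares analogue on synthetic data, not the study's mixed-model computation. The helper names `sse` and `partial_r2` and the toy data are hypothetical.

```python
import numpy as np

def sse(design, y):
    """Residual sum of squares of an ordinary least-squares fit."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)

def partial_r2(controls, features, y):
    """Share of the variance left unexplained by `controls` that `features` explains."""
    sse_reduced = sse(controls, y)
    sse_full = sse(np.hstack([controls, features]), y)
    return (sse_reduced - sse_full) / sse_reduced

rng = np.random.default_rng(0)
n = 500
genre = rng.integers(3, size=n)
controls = np.eye(3)[genre]            # one-hot genre indicators (these absorb the intercept)
features = rng.normal(size=(n, 2))     # stand-in for acoustic features
y = (controls @ np.array([1.0, 2.0, 3.0])
     + features @ np.array([0.5, -0.3])
     + rng.normal(size=n))

r2 = partial_r2(controls, features, y)
```

A value near 0.15, as reported above, would mean the acoustic features remove roughly 15% of the variance that genre and title leave unexplained.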

The study is limited by sparse consumption data but demonstrates potential for data-driven narrator casting.
