Published: July 2, 2026

Audio-Based Understanding of Audiobook Narration Appeal

By Shahar Elisha, Mariano Beguerisse-Díaz, Emmanouil Benetos

Research TL;DR

"Uses pre-trained audio models to extract vocal/acoustic features from audiobooks, showing robust correlation with consumption metrics after controlling for title/genre effects."

Abstract

Narration is central to the audiobook listening experience, shaping how listeners engage with and understand the content. This work explores how narration qualities shape an audiobook's appeal, noting that their effects can vary by genre, title, and audience. We extract vocal and acoustic features (e.g., tone, pace, loudness) from LibriVox using pre-trained audio models and analyse their relationship with consumption data (specifically, view-rate) and their interplay with genre and title. Despite limited consumption data, we find that acoustic information alone has a robust association with appeal, even after accounting for title effects. We further validate these findings using more nuanced proprietary engagement metrics. To our knowledge, this is the first systematic computational study linking narration qualities, genre, title, and audiobook consumption, highlighting the potential of data-driven insights to improve audiobook personalisation and narrator casting.

Technical Analysis & Implementation

Technical Breakdown

This paper presents a computational study linking narration qualities (tone, pace, loudness) to audiobook appeal, measured via view-rate and proprietary engagement metrics. The method has two stages: features are extracted from LibriVox audio with pre-trained models, and their relationship to appeal is then estimated with a statistical model that accounts for genre and title.

Feature Extraction

Audio features are extracted using two pre-trained models:

  • wav2vec 2.0 (self-supervised): produces frame-level representations.
  • HuBERT (hidden-unit BERT): also frame-level.

Features are aggregated over time via mean pooling to obtain utterance-level embeddings. Additionally, hand-crafted prosodic features (e.g., mean pitch, speaking rate, energy) are computed using librosa.
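The mean-pooling step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `masked_mean_pool`, the `lengths` argument, and the choice to exclude padded frames from the average are all assumptions made for this example.

```python
import numpy as np

def masked_mean_pool(frames, lengths):
    """Average frame-level features over time, ignoring padded frames.

    frames:  (batch, seq_len, hidden_dim) frame-level representations
    lengths: (batch,) number of valid (non-padded) frames per clip
    returns: (batch, hidden_dim) utterance-level embeddings
    """
    frames = np.asarray(frames, dtype=np.float64)
    lengths = np.asarray(lengths)
    seq_len = frames.shape[1]
    # mask[b, t] is True where frame t of clip b is real audio, False where it is padding
    mask = np.arange(seq_len)[None, :] < lengths[:, None]
    summed = (frames * mask[:, :, None]).sum(axis=1)
    return summed / lengths[:, None]

# Toy batch (hypothetical values): two clips, up to 3 frames, 2 feature dimensions.
# The second clip has only 2 valid frames; its third frame is padding.
frames = np.array([
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[2.0, 0.0], [4.0, 2.0], [0.0, 0.0]],
])
pooled = masked_mean_pool(frames, lengths=[3, 2])
```

A plain `mean` over the time axis gives the same result when every clip in the batch has the same length; masking only matters when shorter clips are padded.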

Modeling

A linear mixed-effects model is used to predict view-rate $y_{i,j,k}$ for audiobook $i$ in genre $j$ with title $k$:

$$y_{i,j,k} = \mu + \alpha_j + \beta_k + \mathbf{x}_i^\top \mathbf{w} + \epsilon_{i,j,k}$$

where $\mu$ is the global intercept, $\alpha_j$ and $\beta_k$ are random intercepts for genre and title, $\mathbf{x}_i$ is the feature vector, $\mathbf{w}$ is the vector of learned coefficients, and $\epsilon_{i,j,k}$ is noise. The model is fitted via maximum likelihood.
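The description above leaves the distributions of the random terms implicit. Under the conventional assumptions for a random-intercept model (an assumption made here, not something the summary states), the terms are mutually independent with

$$\alpha_j \sim \mathcal{N}(0, \sigma_\alpha^2), \qquad \beta_k \sim \mathcal{N}(0, \sigma_\beta^2), \qquad \epsilon_{i,j,k} \sim \mathcal{N}(0, \sigma_\epsilon^2).$$

In that case the variance of view-rate, conditional on the acoustic features, splits into genre, title, and residual parts:

$$\operatorname{Var}(y_{i,j,k} \mid \mathbf{x}_i) = \sigma_\alpha^2 + \sigma_\beta^2 + \sigma_\epsilon^2,$$

so the share attributable to title would be $\sigma_\beta^2 / (\sigma_\alpha^2 + \sigma_\beta^2 + \sigma_\epsilon^2)$. The variance symbols $\sigma_\alpha^2$, $\sigma_\beta^2$, and $\sigma_\epsilon^2$ are introduced here for illustration only.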

Code Snippet

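import torch
import torch.nn as nn
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

class AudioFeatureExtractor(nn.Module):
    def __init__(self, model_name='facebook/wav2vec2-base'):
        super().__init__()
        # The base checkpoint was pre-trained on audio alone and has no tokenizer,
        # so load the feature extractor rather than the full Wav2Vec2Processor.
        self.processor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
        self.model = Wav2Vec2Model.from_pretrained(model_name).eval()

    def forward(self, waveform):
        # waveform: (batch, T) 16 kHz mono audio, normalized to [-1, 1]
        # The feature extractor expects a list of 1-D arrays, one per clip.
        batch = torch.as_tensor(waveform, dtype=torch.float32).detach().cpu()
        clips = [clip.numpy() for clip in batch]
        inputs = self.processor(clips, sampling_rate=16000, return_tensors='pt', padding=True)
        with torch.no_grad():
            outputs = self.model(**inputs)
        # outputs.last_hidden_state: (batch, seq_len, hidden_dim)
        # Mean-pool over time to get one embedding per clip.
        return outputs.last_hidden_state.mean(dim=1)  # (batch, hidden_dim)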

Key Findings

  • Acoustic features alone explain a significant portion of view-rate variance, even after controlling for genre and title (partial $R^2 \approx 0.15$).
  • Effects of specific features (e.g., pace, loudness) vary by genre: faster pace is appealing for thrillers but not for classics.
  • Validation against proprietary listener-engagement metrics supports the same findings.
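The summary does not say how the partial $R^2$ was computed. One common definition compares a reduced model (controls only) with a full model (controls plus acoustic features), and can be sketched as below. Everything in the example is an assumption: the data are synthetic, the helper names `sse` and `partial_r2` are hypothetical, and title is treated as a fixed effect through indicator columns, which simplifies the mixed-effects model described earlier. If each title belongs to a single genre, the title indicators also absorb the genre intercepts.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares from an ordinary least-squares fit of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

def partial_r2(controls, features, y):
    """Share of the variance left unexplained by `controls` that `features` explain."""
    sse_reduced = sse(controls, y)
    sse_full = sse(np.hstack([controls, features]), y)
    return (sse_reduced - sse_full) / sse_reduced

# Synthetic data (hypothetical): 40 titles, 5 recordings each, 3 acoustic features.
rng = np.random.default_rng(0)
n_titles, per_title, n_feat = 40, 5, 3
title = np.repeat(np.arange(n_titles), per_title)
title_dummies = np.eye(n_titles)[title]        # one indicator column per title
x = rng.normal(size=(title.size, n_feat))      # stand-in acoustic features
w_true = np.array([0.5, -0.3, 0.0])            # made-up coefficients
title_effect = rng.normal(size=n_titles)[title]
y = title_effect + x @ w_true + rng.normal(size=title.size)

r2 = partial_r2(title_dummies, x, y)
print(f"partial R^2 of acoustic features given title: {r2:.3f}")
```

Because the full model nests the reduced one, this quantity always lies between 0 and 1. A value from this synthetic setup says nothing about the figure reported in the findings.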

The study is limited by sparse consumption data but demonstrates potential for data-driven narrator casting.
