Measuring the Gap Between Human and LLM Research Ideas

Abstract

LLMs are increasingly used to brainstorm research ideas, but existing evaluations mostly judge individual ideas by novelty, feasibility, or expert preference. We instead ask: how far are current LLM-generated ideas from those of human researchers? To characterize this gap, we build a large-scale evaluation framework for ideation from high-quality human research papers. For each paper, we reverse-engineer a small set of closely related prior works that likely inspired its core idea. LLMs are then prompted to generate a new idea from the set of paper titles and summaries. We introduce a two-axis research-taste taxonomy to profile each idea by its opportunity pattern and research paradigm, and use it to quantify the divergence between human and LLM ideas. Across idea sets generated by different LLMs, we observe a consistent distributional gap: LLM ideas are disproportionately concentrated around bridge-like opportunities and synthesis methods, whereas the human paper reference distribution spreads more broadly across ways of framing gaps and constructing contributions. This result suggests that strong LLMs can produce a range of reasonable ideas, but that range remains narrower than, and systematically shifted relative to, human research taste.

Technical Analysis & Implementation

Overview

This paper quantifies the distributional gap between human-written research ideas and LLM-generated ideas. The authors reverse-engineer the prior works that likely inspired each human paper, then prompt LLMs to generate new ideas from those same prior works. They introduce a two-axis taxonomy (opportunity pattern × research paradigm) to profile each idea and use it to compare the human and LLM idea distributions.

Methodology

Reverse-engineering prior works

For each human paper $P$, the authors identify a small set of prior works $\mathcal{C}_P$ (typically two) that likely inspired $P$, using citation analysis and human annotation. LLMs are then given the titles and summaries of the works in $\mathcal{C}_P$ and asked to propose a new idea.
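The prompting step can be sketched as follows. This is a minimal illustration, not the paper's prompt: the `PriorWork` class, the `build_ideation_prompt` helper, and the instruction wording are all hypothetical names and text introduced here.

```python
from dataclasses import dataclass


@dataclass
class PriorWork:
    """One inspiring prior work; the field names are illustrative."""
    title: str
    summary: str


def build_ideation_prompt(prior_works: list[PriorWork]) -> str:
    """Format the titles and summaries of C_P into a single ideation prompt."""
    if not prior_works:
        raise ValueError("expected at least one prior work")
    blocks = [
        f"[{i}] {work.title}\nSummary: {work.summary}"
        for i, work in enumerate(prior_works, start=1)
    ]
    # Placeholder instruction; the paper's actual wording is not given here.
    instruction = "Propose one new research idea inspired by the papers below."
    return instruction + "\n\n" + "\n\n".join(blocks)


prompt = build_ideation_prompt([
    PriorWork("Paper A", "Summary of paper A."),
    PriorWork("Paper B", "Summary of paper B."),
])
print(prompt)
```

Keeping the prompt limited to titles and summaries mirrors the setup described above, where the model sees only the inspiring works and not the human paper itself.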

Two-axis taxonomy

Each idea is classified along:

Opportunity pattern: how the gap is framed (e.g., bridge between fields, fill a hole, identify a new direction)
Research paradigm: how the contribution is constructed (e.g., synthesis, analysis, empirical study)

Classifying every idea in a set along both axes yields a 2D distribution over (opportunity pattern, research paradigm) pairs. The authors compute the Wasserstein distance between the human and LLM distributions.
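One way to build such a 2D distribution from classified ideas is sketched below. The category names are taken from the examples listed above and are assumptions; the paper's actual label sets (and their sizes) may differ, and `joint_distribution` is a hypothetical helper.

```python
import numpy as np

# Hypothetical label sets based on the examples above; not the paper's full taxonomy.
OPPORTUNITIES = ["bridge", "hole-filling", "new-direction"]
PARADIGMS = ["synthesis", "analysis", "empirical"]


def joint_distribution(labels):
    """Turn (opportunity, paradigm) label pairs into a normalized 2D histogram."""
    counts = np.zeros((len(OPPORTUNITIES), len(PARADIGMS)))
    for opp, para in labels:
        counts[OPPORTUNITIES.index(opp), PARADIGMS.index(para)] += 1
    total = counts.sum()
    if total == 0:
        raise ValueError("no labelled ideas")
    return counts / total


# Made-up labels for illustration only.
llm_labels = [
    ("bridge", "synthesis"),
    ("bridge", "synthesis"),
    ("bridge", "analysis"),
    ("hole-filling", "empirical"),
]
dist = joint_distribution(llm_labels)
print(dist)
print(dist.sum(axis=1))  # marginal distribution over opportunity patterns
```

Rows index opportunity patterns and columns index research paradigms, so summing along either axis recovers the per-axis marginal.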

Key Findings

LLM ideas are disproportionately concentrated on "bridge" opportunities and "synthesis" paradigms.
Human ideas are spread more evenly across categories, including "hole-filling" and "analysis".
The gap persists across various LLMs (GPT-4, Claude, Gemini) and prompt variations.
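The "concentrated versus evenly spread" contrast can be made concrete with a breadth score. The sketch below uses normalized Shannon entropy, which is an assumption chosen for illustration and not a metric the paper is stated to use; the frequencies are made up.

```python
import numpy as np


def normalized_entropy(dist):
    """Shannon entropy divided by its maximum, so 1.0 means perfectly even."""
    p = np.asarray(dist, dtype=float).ravel()
    if p.size < 2:
        raise ValueError("need at least two categories")
    p = p / p.sum()
    nonzero = p[p > 0]  # 0 * log(0) is treated as 0
    entropy = -np.sum(nonzero * np.log(nonzero))
    return entropy / np.log(p.size)


# Made-up frequencies for illustration only, not results from the paper.
human = [0.25, 0.25, 0.25, 0.25]
llm = [0.70, 0.10, 0.10, 0.10]
print(normalized_entropy(human), normalized_entropy(llm))
```

A lower score for the LLM distribution would indicate mass piling up in a few categories, which is the pattern the findings describe qualitatively.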

Code Snippet (Idea Classification)

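import numpy as np
from scipy.stats import wasserstein_distance

def profile_idea(idea_text, classifier):
    # classifier returns one-hot vectors over 4 opportunity patterns and 4 paradigms
    opp, para = classifier(idea_text)
    return opp, para

# Example category frequencies (illustrative values, not results from the paper)
opp_dist_human = np.array([0.2, 0.3, 0.3, 0.2])
opp_dist_llm = np.array([0.1, 0.5, 0.2, 0.2])

# wasserstein_distance expects sample values plus optional weights, so the
# category indices are the values and the frequencies are the weights.
# Passing the frequencies as the first two arguments would treat them as samples.
# This is the 1D case over the opportunity axis only, and it assumes the
# category order is meaningful.
categories = np.arange(len(opp_dist_human))
w_dist = wasserstein_distance(
    categories, categories, u_weights=opp_dist_human, v_weights=opp_dist_llm
)
print(f"Wasserstein distance: {w_dist:.3f}")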

Equations

Let $X_H$ and $X_L$ be random variables representing the taxonomy category of human and LLM ideas, respectively. The gap is measured by the Wasserstein-1 distance: $$W_1(X_H, X_L) = \inf_{\gamma \in \Gamma(X_H, X_L)} \mathbb{E}_{(x,y)\sim\gamma}[d(x,y)]$$ where $\Gamma(X_H, X_L)$ is the set of joint distributions (couplings) whose marginals are the distributions of $X_H$ and $X_L$, and $d$ is the Euclidean distance on the 2D taxonomy grid.
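For a small discrete grid, the infimum over couplings is a linear program, so $W_1$ can be computed exactly. The sketch below is one way to do that with `scipy.optimize.linprog`; the `grid_wasserstein` helper is hypothetical, the input distributions are made up, and treating categories as integer grid coordinates assumes an ordering of categories that the taxonomy may not actually have.

```python
import numpy as np
from scipy.optimize import linprog


def grid_wasserstein(p, q):
    """W1 between two distributions on the same 2D grid, Euclidean ground cost."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    if p.shape != q.shape or p.ndim != 2:
        raise ValueError("p and q must be 2D arrays of the same shape")
    p = p / p.sum()
    q = q / q.sum()
    rows, cols = p.shape
    n = rows * cols
    # Grid coordinates of each cell, in the same row-major order as ravel().
    coords = np.array([(i, j) for i in range(rows) for j in range(cols)], dtype=float)
    # cost[a, b] = Euclidean distance d between cells a and b.
    cost = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # The coupling gamma is flattened row-major: gamma[a, b] -> index a * n + b.
    a_eq = np.zeros((2 * n, n * n))
    for a in range(n):
        a_eq[a, a * n:(a + 1) * n] = 1.0  # mass leaving cell a must equal p[a]
        a_eq[n + a, a::n] = 1.0           # mass arriving at cell a must equal q[a]
    b_eq = np.concatenate([p.ravel(), q.ravel()])
    result = linprog(cost.ravel(), A_eq=a_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    if not result.success:
        raise RuntimeError(result.message)
    return result.fun


# Made-up distributions: all mass in opposite corners of a 2x2 grid.
p = [[1.0, 0.0], [0.0, 0.0]]
q = [[0.0, 0.0], [0.0, 1.0]]
print(grid_wasserstein(p, q))
```

The number of variables grows with the square of the number of grid cells, which stays small for a taxonomy with a handful of categories per axis.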
