Measuring the Gap Between Human and LLM Research Ideas
By Ziyu Chen, Yilun Zhao, Arman Cohan
"Proposes a two-axis research-taste taxonomy to profile ideas, revealing that LLMs over-concentrate on bridge opportunities and synthesis methods compared to humans."
Abstract
LLMs are increasingly used to brainstorm research ideas, but existing evaluations mostly judge individual ideas by novelty, feasibility, or expert preference. We instead ask: how far are current LLM-generated ideas from human researchers? To characterize this gap, we build a large-scale evaluation framework for ideation from high-quality human research papers. For each paper, we reverse-engineer a small set of closely related prior works that likely inspired its core idea. LLMs are then prompted to generate a new idea from the set of paper titles and summaries. We introduce a two-axis research-taste taxonomy to profile each idea by its opportunity pattern and research paradigm, and use it to quantify the divergence between human and LLM ideas. Across idea sets generated by different LLMs, we observe a consistent distributional gap: LLM ideas are disproportionately concentrated around bridge-like opportunities and synthesis methods, whereas the human paper reference distribution spreads more broadly across ways of framing gaps and constructing contributions. This result suggests that strong LLMs can produce a range of reasonable ideas, but that range remains narrower than, and systematically shifted relative to, human research taste.
Technical Analysis & Implementation
Overview§
This paper quantifies the distributional gap between human-written research ideas and LLM-generated ideas. The authors reverse-engineer likely prior works that inspired each human paper, then prompt LLMs to generate new ideas from those prior works. A two-axis taxonomy (opportunity pattern × research paradigm) is introduced to profile ideas.
Methodology§
Reverse-engineering prior works§
For each human paper $P$, a small set of prior works $\mathcal{C}_P$ (typically 2) that likely inspired $P$ is identified via citation analysis and human annotation. LLMs are given the titles and summaries of $\mathcal{C}_P$ and asked to propose a new idea.
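The prompting step can be sketched as follows. This is a minimal illustration only: the `PriorWork` class, the `build_ideation_prompt` function, and the prompt wording are assumptions made for this example, not the authors' actual prompt or code.

```python
from dataclasses import dataclass


@dataclass
class PriorWork:
    """One inspiring prior work in C_P (hypothetical container)."""
    title: str
    summary: str


def build_ideation_prompt(prior_works: list[PriorWork]) -> str:
    """Format titles and summaries of the prior works into one prompt string.

    The wording below is illustrative; the paper's real prompt is not
    reproduced in this summary.
    """
    lines = ["You are given prior works that may inspire a new research idea.", ""]
    for i, work in enumerate(prior_works, start=1):
        lines.append(f"[{i}] Title: {work.title}")
        lines.append(f"    Summary: {work.summary}")
    lines.append("")
    lines.append("Propose one new research idea that builds on these works.")
    return "\n".join(lines)


# Placeholder inputs, not real papers.
example = [
    PriorWork("Paper A", "Summary of A."),
    PriorWork("Paper B", "Summary of B."),
]
prompt = build_ideation_prompt(example)
print(prompt)
```

The resulting string would then be sent to whichever LLM is being evaluated; the model call itself is omitted because it depends on the provider.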
Two-axis taxonomy§
Each idea is classified along:
- Opportunity pattern: how the gap is framed (e.g., bridge between fields, fill a hole, identify a new direction)
- Research paradigm: how the contribution is constructed (e.g., synthesis, analysis, empirical study)
This yields, for each idea set, a 2D distribution over (opportunity pattern, research paradigm) pairs. The authors compute the Wasserstein distance between the human and LLM distributions.
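One way to build such a 2D distribution from classified ideas is to count label pairs into a grid and normalize. The sketch below assumes three categories per axis, taken from the examples listed above; the label names, `OPPORTUNITIES`, `PARADIGMS`, and `joint_distribution` are hypothetical, and the real taxonomy may have a different number of categories.

```python
import numpy as np

# Hypothetical label sets based on the examples in the text; the actual
# taxonomy categories may differ.
OPPORTUNITIES = ["bridge", "hole-filling", "new-direction"]
PARADIGMS = ["synthesis", "analysis", "empirical"]


def joint_distribution(labels):
    """Return a normalized (opportunity x paradigm) grid.

    `labels` is an iterable of (opportunity, paradigm) string pairs,
    one pair per classified idea.
    """
    grid = np.zeros((len(OPPORTUNITIES), len(PARADIGMS)))
    for opp, para in labels:
        grid[OPPORTUNITIES.index(opp), PARADIGMS.index(para)] += 1
    total = grid.sum()
    if total == 0:
        raise ValueError("no labeled ideas")
    return grid / total


# Toy labels for illustration only, not data from the paper.
toy_labels = [
    ("bridge", "synthesis"),
    ("bridge", "synthesis"),
    ("hole-filling", "analysis"),
    ("new-direction", "empirical"),
]
dist = joint_distribution(toy_labels)
print(dist)
```

Each row of `dist` corresponds to an opportunity pattern and each column to a research paradigm, so row and column sums give the two marginal distributions.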
Key Findings§
- LLM ideas are disproportionately concentrated on "bridge" opportunities and "synthesis" paradigms.
- Human ideas span more evenly across categories like "hole-filling" and "analysis".
- The gap persists across various LLMs (GPT-4, Claude, Gemini) and prompt variations.
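To make "concentrated" versus "spread more evenly" concrete, one simple measure is the Shannon entropy of a category distribution: lower entropy means mass piles up in fewer categories. This is an illustrative choice, not a metric the summary attributes to the authors, and the two distributions below are made-up values.

```python
import numpy as np


def shannon_entropy(p):
    """Shannon entropy in bits of a discrete distribution (zeros are skipped)."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()          # normalize in case the input is raw counts
    nonzero = p[p > 0]
    return float(-(nonzero * np.log2(nonzero)).sum())


# Illustrative values only, not results from the paper.
human = [0.25, 0.25, 0.25, 0.25]   # evenly spread over four categories
llm = [0.7, 0.1, 0.1, 0.1]         # concentrated on one category

h_human = shannon_entropy(human)
h_llm = shannon_entropy(llm)
print(f"human entropy: {h_human:.3f} bits, LLM entropy: {h_llm:.3f} bits")
```

A uniform distribution over four categories attains the maximum of 2 bits, so any concentrated distribution scores lower.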
Code Snippet (Idea Classification)§
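import numpy as np
from scipy.stats import wasserstein_distance

def profile_idea(idea_text, classifier):
    # classifier returns one-hot vectors over 4 opportunity patterns and 4 paradigms
    opp, para = classifier(idea_text)
    return opp, para

# Example category distributions (illustrative values)
opp_dist_human = np.array([0.2, 0.3, 0.3, 0.2])
opp_dist_llm = np.array([0.1, 0.5, 0.2, 0.2])

# Category indices are the support points; the distributions are the weights.
# Passing the probabilities as the first two arguments would treat them as samples.
categories = np.arange(len(opp_dist_human))
w_dist = wasserstein_distance(
    categories, categories, u_weights=opp_dist_human, v_weights=opp_dist_llm
)
print(f"Wasserstein distance: {w_dist:.3f}")
Equations§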
Let $X_H$ and $X_L$ be random variables representing the taxonomy category of human and LLM ideas, respectively. The gap is measured by the Wasserstein-1 distance:
$$W_1(X_H, X_L) = \inf_{\gamma \in \Gamma(X_H, X_L)} \mathbb{E}_{(x,y)\sim\gamma}[d(x,y)]$$
where $\Gamma(X_H, X_L)$ is the set of joint distributions whose marginals are the distributions of $X_H$ and $X_L$, and $d$ is Euclidean distance on the 2D taxonomy grid.
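For a small grid, the infimum over couplings $\gamma$ can be computed exactly as a linear program. The sketch below is one possible way to do that with `scipy.optimize.linprog`; the function name `wasserstein_2d` and the mapping of categories to integer grid coordinates are assumptions of this example, not details confirmed by the source.

```python
import numpy as np
from scipy.optimize import linprog


def wasserstein_2d(p, q):
    """W1 between two distributions on the same 2D grid, Euclidean ground distance.

    p and q are arrays of identical shape whose entries each sum to 1.
    Grid cell (i, j) is placed at coordinate (i, j), which is an assumption:
    categorical labels have no inherent ordering.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    rows, cols = p.shape
    # Row-major cell coordinates, matching the order of p.ravel().
    coords = np.array([(i, j) for i in range(rows) for j in range(cols)], dtype=float)
    n = len(coords)
    # cost[a * n + b] = Euclidean distance between cell a and cell b.
    cost = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1).ravel()
    # Marginal constraints on the flattened coupling gamma (n x n, row-major).
    a_eq = np.zeros((2 * n, n * n))
    for k in range(n):
        a_eq[k, k * n:(k + 1) * n] = 1.0   # sum over b of gamma[k, b] = p[k]
        a_eq[n + k, k::n] = 1.0            # sum over a of gamma[a, k] = q[k]
    b_eq = np.concatenate([p.ravel(), q.ravel()])
    result = linprog(cost, A_eq=a_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    if not result.success:
        raise RuntimeError(result.message)
    return float(result.fun)


# Toy check: all mass moves from cell (0, 0) to cell (1, 1).
p_toy = np.array([[1.0, 0.0], [0.0, 0.0]])
q_toy = np.array([[0.0, 0.0], [0.0, 1.0]])
w_toy = wasserstein_2d(p_toy, q_toy)
print(f"W1 on toy grid: {w_toy:.4f}")
```

Because the ground distance depends on where each category sits on the grid, reordering the categories along either axis can change the value, so any reported distance is only meaningful relative to a fixed ordering.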