Creativity Benchmark: A benchmark for marketing creativity for large language models

Ninad Bhat et al.

2025

Abstract

We introduce Creativity Benchmark, an evaluation framework for large language models (LLMs) in marketing creativity. The benchmark covers 100 brands across 12 categories and three prompt types (Insights, Ideas, Wild Ideas). Human pairwise preferences from 678 practising creatives over 11,012 anonymised comparisons, analysed with Bradley-Terry models, show tightly clustered performance with no model dominating across brands or prompt types: the top-bottom spread is $\Delta\theta \approx 0.45$, which implies a head-to-head win probability of $0.61$, so the highest-rated model beats the lowest only about $61\%$ of the time. We also analyse model diversity using cosine distances to capture intra- and inter-model variation and sensitivity to prompt reframing. Comparing three LLM-as-judge setups with human rankings reveals weak, inconsistent correlations and judge-specific biases, underscoring that automated judges cannot substitute for human evaluation. Conventional creativity tests also transfer only partially to brand-constrained tasks. Overall, the results highlight the need for expert human evaluation and diversity-aware workflows.
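
The step from the Bradley-Terry spread to the quoted win probability can be checked with a short sketch. It assumes the ability parameters are on the natural-log (logit) scale, which is consistent with the abstract's figures but is not stated there. The function name `bt_win_probability` is illustrative and not taken from the paper.

```python
import math


def bt_win_probability(delta_theta: float) -> float:
    """Bradley-Terry probability that item i beats item j.

    delta_theta is theta_i - theta_j, assumed to be on the logit scale
    (an assumption; the abstract does not state the parameterisation).
    """
    return 1.0 / (1.0 + math.exp(-delta_theta))


# Top-bottom spread reported in the abstract.
p_top_beats_bottom = bt_win_probability(0.45)
print(f"{p_top_beats_bottom:.2f}")  # 0.61
```

A spread of zero gives a probability of exactly 0.5, so a value of 0.61 means the top-ranked model is only modestly more likely than a coin flip to be preferred over the bottom-ranked one.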

Tags

Human-In-The-Loop › Autonomous Generation
Creativity Frameworks › Computational Creativity
Creativity Evaluation Methods › LLM-Based Evaluation
Creativity Evaluation Methods › Automatic Metrics
Creativity Evaluation Methods › Human Evaluation
Creativity Evaluation Methods › Creativity-Specific Evaluation
Level of Analysis › Document-Level
Creative Phenomena Studied › Logics
Relationship to Creativity › Explicit
Proprietary Models › OpenAI ChatGPT
Proprietary Models › Google Gemini
Proprietary Models › Anthropic Claude
Model Scale › Large (>32B)
Research Focus › Benchmark

Paper ID: 54b6e6ba-0d65-4d3f-abba-28600e81890b
Added: 10/26/2025