Creativity or brute force? Using brainteasers as a window into the problem-solving abilities of large language models

Simeng Han

2025

Computer science - artificial intelligence, computer science - computation and language

Abstract

Accuracy remains a standard metric for evaluating AI systems, but it offers limited insight into how models arrive at their solutions. In this work, we introduce a benchmark based on brainteasers written in long narrative form to probe more deeply into the types of reasoning strategies that models use. Brainteasers are well-suited for this goal because they can be solved with multiple approaches, such as a few-step solution that uses a creative insight or a longer solution that uses more brute f

Tags

Human-In-The-Loop › Autonomous Generation
Creativity Frameworks › Logical Creativity
Creative Phenomena Studied › Logics
Creative Phenomena Studied › Wordplay
Proprietary Models › OpenAI ChatGPT
Proprietary Models › Google Gemini
Proprietary Models › Anthropic Claude
Model Scale › Large (>32B)
Model Scale › Medium (8-24)
Model Scale › Small (<3B)
Relationship to Creativity › Explicit
Research Focus › Prompt Engineering
Research Focus › Architectural Research
Research Focus › Benchmark
Creativity Evaluation Methods › LLM-Based Evaluation
Creativity Evaluation Methods › Human Evaluation
Creativity Evaluation Methods › Creativity-Specific Evaluation
Level of Analysis › Document-Level

Paper ID: 164b4354-472e-4b14-9ef6-a2c770a16bb2
Added: 10/26/2025