Papers

250 papers found

Review Papers

E.A.R.T.H.: Structuring creative evolution through model error in generative AI

Yusen Peng

How can AI move beyond imitation toward genuine creativity? This paper proposes the E.A.R.T.H. framework, a five-stage generative pipeline that transforms model-generated errors into creative assets through Error generation, Amplification, Refine selection, Transform, and Harness feedback. Drawing on cognitive science and generative modeling, we posit that "creative potential hides in failure" and operationalize this via structured prompts, semantic scoring, and human-in-the-loop evaluation. Imp

Computer science - artificial intelligence
Relevant
creativity frameworks › creative-textual creativity · evaluates a creative feature › slogan · evaluation › automatic metrics · evaluation › creativity evaluation · evaluation › human eval · model used › Medium (8-24) · related to creativity › related to creativity as a human ability · scope › prompt engineering

Origins of creativity in attention-based diffusion models

Emma Finn

As diffusion models have become the tool of choice for image generation and as the quality of the images continues to improve, the question of how 'creativity' originates in diffusion has become increasingly important. The score matching perspective on diffusion has proven particularly fruitful for understanding how and why diffusion models generate images that remain plausible while differing significantly from their training images. In particular, as explained in (Kamb & Ganguli, 2024) and ot

Computer science - machine learning, computer science - computer vision and pattern recognition
Not Relevant

An analytic theory of creativity in convolutional diffusion models

Mason Kamb

We obtain an analytic, interpretable and predictive theory of creativity in convolutional diffusion models. Indeed, score-matching diffusion models can generate highly original images that lie far from their training data. However, optimal score-matching theory suggests that these models should only be able to produce memorized training examples. To reconcile this theory-experiment gap, we identify two simple inductive biases, locality and equivariance, that: (1) induce a form of combinatorial c

Computer science - machine learning, condensed matter - disordered systems and neural networks, computer science - artificial intelligence, quantitative biology - neurons and cognition, statistics - machine learning, I.2.10
Not Relevant

Creativity in LLM-based multi-agent systems: a survey

Yi-Cheng Lin

Large language model (LLM)-driven multi-agent systems (MAS) are transforming how humans and AIs collaboratively generate ideas and artifacts. While existing surveys provide comprehensive overviews of MAS infrastructures, they largely overlook the dimension of creativity, including how novel outputs are generated and evaluated, how creativity informs agent personas, and how creative workflows are coordinated. This is the first survey dedicated to creativity in MAS. We focus on text and ima

Computer science - human-computer interaction, computer science - artificial intelligence, computer science - computation and language
Relevant
creativity frameworks › computational creativity · scope › agents · evaluation › creativity evaluation · related to creativity › related to creativity as a textual genre

Toward modeling creative processes for algorithmic painting

Aaron Hertzmann

This paper proposes a framework for computational modeling of artistic painting algorithms, inspired by human creative practices. Based on examples from expert artists and from the author's own experience, the paper argues that creative processes often involve two important components: vague, high-level goals (e.g., "make a good painting"), and exploratory processes for discovering new ideas. This paper then sketches out possible computational mechanisms for imitating those elements of the paint

Computer science - artificial intelligence, computer science - computer vision and pattern recognition, computer science - graphics
Not Relevant

Modeling creativity: Case studies in python

Tom De Smedt

Modeling Creativity (doctoral dissertation, 2013) explores how creativity can be represented using computational approaches. Our aim is to construct computer models that exhibit creativity in an artistic context, that is, that are capable of generating or evaluating an artwork (visual or linguistic), an interesting new idea, a subjective opinion. The research was conducted in 2008-2012 at the Computational Linguistics Research Group (CLiPS, University of Antwerp) under the supervision of Prof. W

Computer science - artificial intelligence
Relevant
creativity frameworks › computational creativity · creativity frameworks › psychological/cognitive · creativity frameworks › creative-textual creativity · evaluation › creativity evaluation · evaluation › human eval · related to creativity › mentions creativity as a human ability

HLLM-creator: Hierarchical LLM-based personalized creative generation

Junyi Chen

AI-generated content technologies are widely used in content creation. However, current AIGC systems rely heavily on creators' inspiration, rarely generating truly user-personalized content. In real-world applications such as online advertising, a single product may have multiple selling points, with different users focusing on different features. This underscores the significant value of personalized, user-centric creative generation. Effective personalized content generation faces two main cha

Computer science - information retrieval, computer science - computation and language
Not Relevant

Taming flow-based I2V models for creative video editing

Xianghao Kong

Although image editing techniques have advanced significantly, video editing, which aims to manipulate videos according to user intent, remains an emerging challenge. Most existing image-conditioned video editing methods either require inversion with model-specific design or need extensive optimization, limiting their capability of leveraging up-to-date image-to-video (I2V) models to transfer the editing capability of image editing models to the video domain. To this end, we propose IF-V2V, an I

Computer science - computer vision and pattern recognition, computer science - multimedia
Not Relevant

Enhancing creative generation on stable diffusion-based models

Jiyeon Han

Recent text-to-image generative models, particularly Stable Diffusion and its distilled variants, have achieved impressive fidelity and strong text-image alignment. However, their creative capability remains constrained, as including 'creative' in prompts seldom yields the desired results. This paper introduces C3 (Creative Concept Catalyst), a training-free approach designed to enhance creativity in Stable Diffusion-based models. C3 selectively amplifies features during the denoising process to

Computer science - computer vision and pattern recognition
Not Relevant

Does generation require memorization? Creative diffusion models using ambient diffusion

Kulin Shah

There is strong empirical evidence that the state-of-the-art diffusion modeling paradigm leads to models that memorize the training set, especially when the training set is small. Prior methods to mitigate the memorization problem often lead to a decrease in image quality. Is it possible to obtain strong and creative generative models, i.e., models that achieve high generation quality and low memorization? Despite the current pessimistic landscape of results, we make significant progress in push

Computer science - machine learning, statistics - machine learning
Not Relevant

CommonCanvas: An open diffusion model trained with creative-commons images

Aaron Gokaslan

We assemble a dataset of Creative-Commons-licensed (CC) images, which we use to train a set of open diffusion models that are qualitatively competitive with Stable Diffusion 2 (SD2). This task presents two challenges: (1) high-resolution CC images lack the captions necessary to train text-to-image generative models; (2) CC images are relatively scarce. In turn, to address these challenges, we use an intuitive transfer learning technique to produce a set of high-quality synthetic captions paired

Computer science - computer vision and pattern recognition, computer science - computers and society
Not Relevant

CCEdit: Creative and controllable video editing via diffusion models

Ruoyu Feng

In this paper, we present CCEdit, a versatile generative video editing framework based on diffusion models. Our approach employs a novel trident network structure that separates structure and appearance control, ensuring precise and creative editing capabilities. Utilizing the foundational ControlNet architecture, we maintain the structural integrity of the video during editing. The incorporation of an additional appearance branch enables users to exert fine-grained control over the edited key f

Computer science - computer vision and pattern recognition
Not Relevant

Base models beat aligned models at randomness and creativity

Peter West

Alignment has quickly become a default ingredient in LLM development, with techniques such as reinforcement learning from human feedback making models act safely, follow instructions, and perform ever-better on complex tasks. While these techniques are certainly useful, we propose that they should not be universally applied and demonstrate a range of tasks on which base language models consistently outperform their popular aligned forms. Particularly, we study tasks that require unpredictable ou

Computer science - computation and language
Relevant
creativity frameworks › computational creativity · evaluation › human eval · textual genre › poetry · creativity frameworks › psychological/cognitive · evaluates a creative feature › logic (puzzles, etc.)

L-C4: Language-based video colorization for creative and consistent color

Zheng Chang

Automatic video colorization is inherently an ill-posed problem because each monochrome frame has multiple optional color candidates. Previous exemplar-based video colorization methods restrict the user's imagination due to the elaborate retrieval process. Alternatively, conditional image colorization methods combined with post-processing algorithms still struggle to maintain temporal consistency. To address these issues, we present Language-based video Colorization for Creative and Consistent C

Computer science - computer vision and pattern recognition
Not Relevant

Pattern languages as media for the creative society

Takashi Iba

This paper proposes new languages for basic skills in the Creative Society, where people create their own goods, tools, concepts, knowledge, and mechanisms with their own hands: the skills of learning, presentation, and collaboration. These languages are written as a pattern language, which is a way of describing the tacit practical knowledge. In this paper, a new type of pattern language is proposed as "Pattern Languages 3.0" and three examples are introduced: Learning Patterns, Collaboration

Computer science - computers and society
Not Relevant

Ranking creative language characteristics in small data scenarios

Julia Siekiera

The ability to rank creative natural language provides an important general tool for downstream language understanding and generation. However, current deep ranking models require substantial amounts of labeled data that are difficult and expensive to obtain for different domains, languages and creative characteristics. A recent neural approach, the DirectRanker, promises to reduce the amount of training data needed but its application to text isn't fully explored. We therefore adapt the DirectR

Computer science - computation and language
Relevant
creativity frameworks › computational creativity · creativity frameworks › creative-textual creativity · evaluates a creative feature › figures of speech · evaluates a creative feature › humor · evaluates a creative feature › puns · evaluation › creativity evaluation · evaluation › human eval · model used › Small (<3B) · evaluation › automatic metrics · scope › technical research · textual genre › literature · related to creativity › related to creativity as a human ability

Evaluating creative language generation: The case of rap lyric ghostwriting

Peter Potash

Language generation tasks that seek to mimic human ability to use language creatively are difficult to evaluate, since one must consider creativity, style, and other non-trivial aspects of the generated text. The goal of this paper is to develop evaluation methods for one such task, ghostwriting of rap lyrics, and to provide an explicit, quantifiable foundation for the goals and future directions of this task. Ghostwriting must produce text that is similar in style to the emulated artist, yet di

Computer science - computation and language
Not Relevant

Galton's law of mediocrity: Why large language models regress to the mean and fail at creativity in advertising

Matt Keon

Large language models (LLMs) generate fluent text yet often default to safe, generic phrasing, raising doubts about their ability to handle creativity. We formalize this tendency as a Galton-style regression to the mean in language and evaluate it using a creativity stress test in advertising concepts. When ad ideas were simplified step by step, creative features such as metaphors, emotions, and visual cues disappeared early, while factual content remained, showing that models favor high-probabi

Computer science - artificial intelligence
Relevant
related to creativity › related to creativity as a human ability · related to creativity › related to creativity as a textual genre · creativity frameworks › computational creativity · creativity frameworks › creative-textual creativity · model used › Small (<3B) · evaluation › human eval · evaluation › automatic metrics · evaluation › sentence-level · evaluation › creativity evaluation · scope › technical research · textual genre › music

VisuCraft: Enhancing large vision-language models for complex visual-guided creative content generation via structured information extraction

Rongxin Jiang

This paper introduces VisuCraft, a novel framework designed to significantly enhance the capabilities of Large Vision-Language Models (LVLMs) in complex visual-guided creative content generation. Existing LVLMs often exhibit limitations in maintaining high visual fidelity, genuine creativity, and precise adherence to nuanced user instructions when generating long-form texts. VisuCraft addresses these challenges by integrating a multimodal structured information extractor (E) and a dynamic prompt

Computer science - computer vision and pattern recognition, computer science - computation and language
Not Relevant

Creativity or brute force? Using brainteasers as a window into the problem-solving abilities of large language models

Simeng Han

Accuracy remains a standard metric for evaluating AI systems, but it offers limited insight into how models arrive at their solutions. In this work, we introduce a benchmark based on brainteasers written in long narrative form to probe more deeply into the types of reasoning strategies that models use. Brainteasers are well-suited for this goal because they can be solved with multiple approaches, such as a few-step solution that uses a creative insight or a longer solution that uses more brute f

Computer science - artificial intelligence, computer science - computation and language
Relevant
creativity frameworks › psychological/cognitive · evaluates a creative feature › logic (puzzles, etc.) · evaluates a creative feature › riddles · model used › ChatGPT · model used › Large (>32B) · model used › Medium (8-24) · model used › Small (<3B) · related to creativity › mentions creativity as a human ability · scope › prompt engineering · scope › technical research · evaluation › LLM-as-a-judge · evaluation › human eval · evaluation › document-level · evaluation › creativity evaluation

Let's think outside the box: Exploring leap-of-thought in large language models with creative humor generation

Shanshan Zhong

Chain-of-Thought (CoT) guides large language models (LLMs) to reason step-by-step, and can motivate their logical reasoning ability. While effective for logical tasks, CoT is not conducive to creative problem-solving which often requires out-of-box thoughts and is crucial for innovation advancements. In this paper, we explore the Leap-of-Thought (LoT) abilities within LLMs – a non-sequential, creative paradigm involving strong associations and knowledge leaps. To this end, we study LLMs on the

Computer science - artificial intelligence, computer science - computation and language, computer science - computer vision and pattern recognition
Relevant
creativity frameworks › psychological/cognitive · evaluates a creative feature › humor · evaluation › automatic metrics · evaluation › human eval · evaluation › sentence-level · evaluation › creativity evaluation · model used › ChatGPT · model used › Medium (8-24) · model used › Large (>32B) · scope › creative training · scope › prompt engineering · scope › technical research · related to creativity › related to creativity as a human ability

Curiosity-driven LLM-as-a-judge for personalized creative judgment

Vanya Bannihatti Kumar

Modern large language models (LLMs) excel at objective tasks such as evaluating mathematical reasoning and factual accuracy, yet they falter when faced with the nuanced, subjective nature of assessing creativity. In this work, we propose a novel curiosity-driven LLM-as-a-judge for evaluating creative writing which is personalized to each individual's creative judgments. We use the Torrance Test of Creative Thinking (TTCW) benchmark introduced in Chakrabarty et al. (2024), which has stories annotat

Computer science - computation and language, computer science - machine learning
Relevant
evaluation › LLM-as-a-judge · creativity frameworks › psychological/cognitive · evaluation › human eval · evaluation › automatic metrics · evaluation › document-level · model used › Large (>32B) · model used › Small (<3B) · related to creativity › related to creativity as a human ability · related to creativity › related to creativity as a textual genre · textual genre › literature · scope › technical research

LLM-driven e-commerce marketing content optimization: Balancing creativity and conversion

Haowei Yang

As e-commerce competition intensifies, balancing creative content with conversion effectiveness becomes critical. Leveraging LLMs' language generation capabilities, we propose a framework that integrates prompt engineering, multi-objective fine-tuning, and post-processing to generate marketing copy that is both engaging and conversion-driven. Our fine-tuning method combines sentiment adjustment, diversity enhancement, and CTA embedding. Through offline evaluations and online A/B tests across cat

Computer science - computation and language, computer science - artificial intelligence, computer science - information retrieval
Not Relevant

We're different, we're the same: Creative homogeneity across LLMs

Emily Wenger

Numerous powerful large language models (LLMs) are now available for use as writing support tools, idea generators, and beyond. Although these LLMs are marketed as helpful creative assistants, several works have shown that using an LLM as a creative partner results in a narrower set of creative outputs. However, these studies only consider the effects of interacting with a single LLM, begging the question of whether such narrowed creativity stems from using a particular LLM – which arguably has

Computer science - computers and society, computer science - artificial intelligence, computer science - computation and language, computer science - machine learning
Relevant
creativity frameworks › psychological/cognitive · related to creativity › mentions creativity as a human ability · model used › ChatGPT · model used › Large (>32B) · model used › Medium (8-24) · model used › Small (<3B) · evaluation › automatic metrics · evaluation › creativity evaluation · evaluation › human eval · scope › prompt engineering · scope › technical research

Designing and evaluating dialogue LLMs for co-creative improvised theatre

Boyd Branch

Social robotics researchers are increasingly interested in multi-party trained conversational agents. With a growing demand for real-world evaluations, our study presents Large Language Models (LLMs) deployed in a month-long live show at the Edinburgh Festival Fringe. This case study investigates human improvisers co-creating with conversational agents in a professional theatre setting. We explore the technical capabilities and constraints of on-the-spot multi-party dialogue, providing comprehen

Computer science - computation and language
Not Relevant

Creative beam search: LLM-as-a-judge for improving response generation

Giorgio Franceschelli

Large language models are revolutionizing several areas, including artificial creativity. However, the process of generation in machines profoundly diverges from that observed in humans. In particular, machine generation is characterized by a lack of intentionality and an underlying creative process. We propose a method called Creative Beam Search that uses Diverse Beam Search and LLM-as-a-Judge to perform response generation and response validation. The results of a qualitative experiment show

Computer science - artificial intelligence, computer science - computation and language, computer science - human-computer interaction, computer science - machine learning
Relevant
creativity frameworks › psychological/cognitive · evaluation › LLM-as-a-judge · evaluation › creativity evaluation · evaluation › human eval · model used › Medium (8-24) · related to creativity › mentions creativity as a human ability · evaluation › document-level · scope › technical research

CreativEval: Evaluating creativity of LLM-based hardware code generation

Matthew DeLorenzo

Large Language Models (LLMs) have proved effective and efficient in generating code, leading to their utilization within the hardware design process. Prior works evaluating LLMs' abilities for register transfer level code generation solely focus on functional correctness. However, the creativity associated with these LLMs, or the ability to generate novel and unique solutions, is a metric not as well understood, in part due to the challenge of quantifying this quality. To address this research

Computer science - computation and language
Not Relevant

On recipe memorization and creativity in large language models: Is your model a creative cook, a bad cook, or merely a plagiator?

Jan Kvapil

This work-in-progress investigates the memorization, creativity, and nonsense found in cooking recipes generated from Large Language Models (LLMs). Precisely, we aim (i) to analyze memorization, creativity, and non-sense in LLMs using a small, high-quality set of human judgments and (ii) to evaluate potential approaches to automate such a human annotation in order to scale our study to hundreds of recipes. To achieve (i), we conduct a detailed human annotation on 20 preselected recipes generated

Computer science - computation and language
Not Relevant

Probing and inducing combinational creativity in vision-language models

Yongqian Peng

The ability to combine existing concepts into novel ideas stands as a fundamental hallmark of human intelligence. Recent advances in Vision-Language Models (VLMs) like GPT-4V and DALLE-3 have sparked debate about whether their outputs reflect combinational creativity–defined by M. A. Boden (1998) as synthesizing novel ideas through combining existing concepts–or sophisticated pattern matching of training data. Drawing inspiration from cognitive science, we investigate the combinational creativ

Computer science - computer vision and pattern recognition, computer science - artificial intelligence, computer science - computation and language
Not Relevant

EscapeBench: Towards advancing creative intelligence of language model agents

Cheng Qian

Language model agents excel in long-session planning and reasoning, but existing benchmarks primarily focus on goal-oriented tasks with explicit objectives, neglecting creative adaptation in unfamiliar environments. To address this, we introduce EscapeBench, a benchmark suite of room escape game environments designed to challenge agents with creative reasoning, unconventional tool use, and iterative problem-solving to uncover implicit goals. Our results show that current LM models, despite emplo

Computer science - computation and language, computer science - artificial intelligence, computer science - machine learning
Not Relevant

Weaver: Foundation models for creative writing

Tiannan Wang

This work introduces Weaver, our first family of large language models (LLMs) dedicated to content creation. Weaver is pre-trained on a carefully selected corpus that focuses on improving the writing capabilities of large language models. We then fine-tune Weaver for creative and professional writing purposes and align it to the preference of professional writers using a suite of novel methods for instruction data synthesis and LLM alignment, making it able to produce more human-like texts and fo

Computer science - computation and language, computer science - artificial intelligence, computer science - machine learning
Relevant
creativity frameworks › creative-textual creativity · evaluation › LLM-as-a-judge · evaluation › human eval · evaluation › creativity evaluation · evaluation › document-level · model used › ChatGPT · model used › Large (>32B) · model used › Medium (8-24) · model used › Small (<3B) · post-editing › post-editing with LLMs · scope › agents · scope › creative training · scope › technical research · textual genre › literature · related to creativity › related to creativity as a human ability · related to creativity › related to creativity as a textual genre

Creative divergent synthesis with generative models

Axel Chemla–Romeu-Santos

Machine learning approaches now achieve impressive generation capabilities in numerous domains such as image, audio or video. However, most training & evaluation frameworks revolve around the idea of strictly modelling the original data distribution rather than trying to extrapolate from it. This precludes the ability of such models to diverge from the original distribution and, hence, exhibit some creative traits. In this paper, we propose various perspectives on how this complicated goal coul

Computer science - machine learning, statistics - machine learning
Not Relevant

Creative painting with latent diffusion models

Xianchao Wu

Artistic painting has achieved significant progress during recent years. Using an autoencoder to connect the original images with compressed latent spaces and a cross attention enhanced U-Net as the backbone of diffusion, latent diffusion models (LDMs) have achieved stable and high fidelity image generation. In this paper, we focus on enhancing the creative painting ability of current LDMs in two directions, textual condition extension and model retraining with Wikiart dataset. Through textual

Computer science - computer vision and pattern recognition, computer science - artificial intelligence, computer science - computation and language, computer science - graphics, computer science - machine learning
Not Relevant

Ownership and creativity in generative models

Omri Avrahami

Machine learning generated content such as image artworks, textual poems and music has become prominent in recent years. These tools attract much attention from the media, artists, researchers, and investors. Because these tools are data-driven, they are inherently different from traditional creative tools, which raises the question: who may own the content that is generated by these tools? In this paper we aim to address this question. We start by providing a background to this problem, raising

Computer science - computers and society
Not Relevant

BILLY: Steering large language models via merging persona vectors for creative generation

Tsung-Min Pai

Multi-LLM systems enhance the creativity of large language models by simulating human collective intelligence but suffer from significant drawbacks, such as high computational costs and inference latency. To address these limitations, we propose BILLY (BlendIng persona vectors for Large Language model creativitY), a training-free framework that captures the benefits of multi-LLM collaboration, i.e. inducing diverse perspectives and specialized expertise, within a single model. BILLY operates by

Computer science - computation and language, computer science - artificial intelligence
Relevant
creativity frameworks › psychological/cognitive · evaluation › LLM-as-a-judge · evaluation › creativity evaluation · evaluation › document-level · model used › Medium (8-24) · related to creativity › related to creativity as a human ability · scope › technical research

Creative writers' attitudes on writing as training data for large language models

Katy Ilonka Gero

The use of creative writing as training data for large language models (LLMs) is highly contentious and many writers have expressed outrage at the use of their work without consent or compensation. In this paper, we seek to understand how creative writers reason about the real or hypothetical use of their writing as training data. We interviewed 33 writers with variation across genre, method of publishing, degree of professionalization, and attitudes toward and engagement with LLMs. We report on

Computer science - human-computer interaction
Not Relevant

The creative psychometric item generator: a framework for item generation and validation using large language models

Antonio Laverghetta Jr.

Increasingly, large language models (LLMs) are being used to automate workplace processes requiring a high degree of creativity. While much prior work has examined the creativity of LLMs, there has been little research on whether they can generate valid creativity assessments for humans despite the increasingly central role of creativity in modern economies. We develop a psychometrically inspired framework for creating test items (questions) for a classic free-response creativity test: the creat

Computer science - computation and language
Not Relevant

C2Ideas: Supporting creative interior color design ideation with large language model

Yihan Hou

Interior color design is a creative process that endeavors to allocate colors to furniture and other elements within an interior space. While much research focuses on generating realistic interior designs, these automated approaches often misalign with user intention and disregard design rationales. Informed by a need-finding preliminary study, we develop C2Ideas, an innovative system for designers to creatively ideate color schemes enabled by an intent-aligned and domain-oriented large language

Computer science - human-computer interaction
Not Relevant

Inspire creativity with ORIBA: Transform artists' original characters into chatbots through large language model

Yuqian Sun

This research delves into the intersection of illustration art and artificial intelligence (AI), focusing on how illustrators engage with AI agents that embody their original characters (OCs). We introduce 'ORIBA', a customizable AI chatbot that enables illustrators to converse with their OCs. This approach allows artists to not only receive responses from their OCs but also to observe their inner monologues and behavior. Despite the existing tension between artists and AI, our study explores in

Computer science - multimedia, computer science - artificial intelligence, computer science - human-computer interaction, 14J60 (primary) 14F05, 14J26 (secondary), F.2.2; I.2.7
Not Relevant

A mathematical abstraction for balancing the trade-off between creativity and reality in large language models

Ritwik Sinha

Large Language Models have become popular for their remarkable capabilities in human-oriented tasks and traditional natural language processing tasks. Their efficient functioning is attributed to the attention mechanism in the Transformer architecture, enabling them to concentrate on particular aspects of the input. LLMs are increasingly being used in domains such as generating prose, poetry or art, which require the model to be creative (e.g. Adobe firefly). LLMs possess advanced language generat

Computer science - computation and language, computer science - machine learning
Not Relevant

Sasha: Creative goal-oriented reasoning in smart homes with large language models

Evan King

Smart home assistants function best when user commands are direct and well-specified (e.g., "turn on the kitchen light"), or when a hard-coded routine specifies the response. In more natural communication, however, human speech is unconstrained, often describing goals (e.g., "make it cozy in here" or "help me save energy") rather than indicating specific target devices and actions to take on those devices. Current systems fail to understand these under-specified commands since they cannot reason

Computer science - human-computer interaction, computer science - artificial intelligence
Not Relevant

LLMs can realize combinatorial creativity: Generating creative ideas via LLMs for scientific research

Tianyang Gu

Scientific idea generation has been extensively studied in creativity theory and computational creativity research, providing valuable frameworks for understanding and implementing creative processes. However, recent work using Large Language Models (LLMs) for research idea generation often overlooks these theoretical foundations. We present a framework that explicitly implements combinatorial creativity theory using LLMs, featuring a generalization-level retrieval system for cross-domain knowle

Computer science - artificial intelligence
Not Relevant

Mathemyths: Leveraging large language models to teach mathematical language through child-AI co-creative storytelling

Chao Zhang

Mathematical language is a cornerstone of a child's mathematical development, and children can effectively acquire this language through storytelling with a knowledgeable and engaging partner. In this study, we leverage the recent advances in large language models to conduct free-form, creative conversations with children. Consequently, we developed Mathemyths, a joint storytelling agent that takes turns co-creating stories with children while integrating mathematical terms into the evolving nar

Computer science - human-computer interaction
Not Relevant

FlexMind: Supporting deeper creative thinking with LLMs

Yaqing Yang

Effective ideation requires both broad exploration of diverse ideas and deep evaluation of their potential. Generative AI can support such processes, but current tools typically emphasize either generating many ideas or supporting in-depth consideration of a few, lacking support for both. Research also highlights risks of over-reliance on LLMs, including shallow exploration and negative creative outcomes. We present FlexMind, an AI-augmented system that scaffolds iterative exploration of ideas,

Computer science - human-computer interaction
Not Relevant

Igniting creative writing in small language models: LLM-as-a-judge versus multi-agent refined rewards

Xiaolong Wei

Large Language Models (LLMs) have demonstrated remarkable creative writing capabilities, yet their substantial computational demands hinder widespread use. Enhancing Small Language Models (SLMs) offers a promising alternative, but current methods like Supervised Fine-Tuning (SFT) struggle with novelty, and Reinforcement Learning from Human Feedback (RLHF) is costly. This paper explores two distinct AI-driven reward strategies within a Reinforcement Learning from AI Feedback (RLAIF) framework to

Computer science - computation and language, computer science - artificial intelligence
Not Relevant

Cooking up creativity: Enhancing LLM creativity through structured recombination

Moran Mizrahi

Large Language Models (LLMs) excel at many tasks, yet they struggle to produce truly creative, diverse ideas. In this paper, we introduce a novel approach that enhances LLM creativity. We apply LLMs for translating between natural language and structured representations, and perform the core creative leap via cognitively inspired manipulations on these representations. Our notion of creativity goes beyond superficial token-level variations; rather, we recombine structured representations of exis

Computer science - computation and language, computer science - artificial intelligence, computer science - machine learning
Relevant
creativity frameworks › psychological/cognitive · creativity frameworks › computational creativity · evaluation › automatic metrics · evaluation › creativity evaluation · evaluation › document-level · evaluation › human eval · model used › ChatGPT · model used › Large (>32B) · related to creativity › mentions creativity as a human ability · scope › prompt engineering · scope › technical research

CreativityPrism: a holistic benchmark for large language model creativity

Zhaoyi Joey Hou

Creativity is often seen as a hallmark of human intelligence. While large language models (LLMs) are increasingly perceived as producing creative text, there is still no holistic framework to evaluate their creativity across diverse scenarios. Existing evaluation methods remain fragmented, with dramatic variation across domains and tasks, largely due to differing definitions and measurements of creativity. Inspired by the hypothesis that creativity is not one fixed idea, we propose CreativityPri

Computer science - computation and language, computer science - artificial intelligence
Relevant
related to creativity › mentions creativity as a human ability · creativity frameworks › psychological/cognitive · evaluates a creative feature › logic (puzzles, etc.) · evaluation › automatic metrics · evaluation › word-level · model used › Medium (8-24) · evaluation › creativity evaluation · textual genre › literature · textual genre › poetry

Modifying large language model post-training for diverse creative writing

John Joon Young Chung

As creative writing tasks do not have singular correct answers, large language models (LLMs) trained to perform these tasks should be able to generate diverse valid outputs. However, LLM post-training often focuses on improving generation quality but neglects to facilitate output diversity. Hence, in creative writing generation, we investigate post-training approaches to promote both output diversity and quality. Our core idea is to include deviation – the degree of difference between a trainin

Computer science - computation and language, computer science - machine learning
Relevant
creativity frameworks › creative-textual creativity · evaluation › document-level · evaluation › automatic metrics · model used › Medium (8-24) · related to creativity › related to creativity as a textual genre · textual genre › literature · scope › creative training · scope › technical research

Characterising the creative process in humans and large language models

Surabhi S. Nath

Large language models appear quite creative, often performing on par with the average human on creative tasks. However, research on LLM creativity has focused solely on products, with little attention on the creative process. Process analyses of human creativity often require hand-coded categories or exploit response times, which do not apply to LLMs. We provide an automated method to characterise how humans and LLMs explore semantic spaces on the Alternate Uses Task, and contr

Computer science - human-computer interaction, computer science - artificial intelligence, computer science - computation and language, quantitative biology - neurons and cognition
Not Relevant

Creative robot tool use with large language models

Mengdi Xu

Tool use is a hallmark of advanced intelligence, exemplified in both animal behavior and robotic capabilities. This paper investigates the feasibility of imbuing robots with the ability to creatively use tools in tasks that involve implicit physical constraints and long-term planning. Leveraging Large Language Models (LLMs), we develop RoboTool, a system that accepts natural language instructions and outputs executable code for controlling robots in both simulated and real-world environments. Ro

Computer science - robotics, computer science - artificial intelligence, computer science - machine learning
Not Relevant

Generating creativity through ChatGPT: an empirical investigation in open innovation platforms

Li Yongjun et al.

Since the release of ChatGPT, the remarkable content comprehension and generation abilities of large language models (LLMs) have spurred knowledge democratization. On the one hand, they can lower the barriers to individual innovation, potentially enhancing personal innovative capabilities; on the other hand, their exceptional capabilities and popularity may lead to resistance and speculation, hindering demand for innovation. This work examines whether LLMs can assist people in creative tasks. The study utilizes an extensive dataset from an Open Innovation Platform (OIP), known for fostering creativity through external ideas. Leveraging ChatGPT's release as a natural experiment, it employs the Difference-in-Differences (DID) technique to examine how LLMs affect people's creative participation. The results indicate that the launch enhances people's willingness to engage in creative tasks, reflected in an increased number of submitted works. This enhancement is not necessarily a threat to the quality of innovation, as works increased in length by 44%, complexity by 35%, and votes by 1.5% on average. Additionally, heterogeneity analysis shows these effects are more pronounced for higher-tier innovation works and among less experienced users. Further mechanism analysis suggests that the improvement in innovation performance on OIPs stems from the lower innovation threshold enabled by ChatGPT. Moreover, LLMs can increase interaction feedback to stimulate external creativity, thereby improving innovation performance. These conclusions contribute to understanding the impact of LLMs on innovation activities across diverse populations.

Not Relevant

Generative AI and creativity: a systematic literature review and meta-analysis

Niklas Holzner et al.

Generative artificial intelligence (GenAI) is increasingly used to support a wide range of human tasks, yet empirical evidence on its effect on creativity remains scattered. Can GenAI generate ideas that are creative? To what extent can it support humans in generating ideas that are both creative and diverse? In this study, we conduct a meta-analysis to evaluate the effect of GenAI on the performance in creative tasks. For this, we first perform a systematic literature search, based on which we identify n = 28 relevant studies (m = 8214 participants) for inclusion in our meta-analysis. We then compute standardized effect sizes based on Hedges' g. We compare different outcomes: (i) how creative GenAI is; (ii) how creative humans augmented by GenAI are; and (iii) the diversity of ideas by humans augmented by GenAI. Our results show no significant difference in creative performance between GenAI and humans (g = -0.05), while humans collaborating with GenAI significantly outperform those working without assistance (g = 0.27). However, GenAI has a significant negative effect on the diversity of ideas for such collaborations between humans and GenAI (g = -0.86). We further analyze heterogeneity across different GenAI models (e.g., GPT-3.5, GPT-4), different tasks (e.g., creative writing, ideation, divergent thinking), and different participant populations (e.g., laypeople, business, academia). Overall, our results position GenAI as an augmentative tool that can support, rather than replace, human creativity, particularly in tasks benefiting from ideation support.

Not Relevant

How AI ideas affect the creativity, diversity, and evolution of human ideas: Evidence from a large, dynamic experiment

Joshua Ashkinaze et al.

Exposure to large language model output is rapidly increasing. How will seeing AI-generated ideas affect human ideas? We conducted a dynamic experiment (800+ participants, 40+ countries) where participants viewed creative ideas that were from ChatGPT or prior experimental participants, and then brainstormed their own idea. We varied the number of AI-generated examples (none, low, or high exposure) and if the examples were labeled as “AI” (disclosure). We find that high AI exposure (but not low AI exposure) did not affect the creativity of individual ideas but did increase the average amount and rate of change of collective idea diversity. AI made ideas different, not better. There were no main effects of disclosure. We also found that self-reported creative people were less influenced by knowing an idea was from AI and that participants may knowingly adopt AI ideas when the task is difficult. Our findings suggest that introducing AI ideas may increase collective diversity but not individual creativity.

Proceedings of the ACM collective intelligence conference
Not Relevant

Leveraging large models to evaluate novel content: a case study on advertisement creativity

Zhaoyi Joey Hou et al.

Evaluating creativity is challenging, even for humans, not only because of its subjectivity but also because it involves complex cognitive processes. Inspired by work in marketing, we attempt to break down visual advertisement creativity into atypicality and originality. With fine-grained human annotations on these dimensions, we propose a suite of tasks specifically for such a subjective problem. We also evaluate the alignment between state-of-the-art (SoTA) vision language models (VLMs) and humans on our proposed benchmark, demonstrating both the promises and challenges of using VLMs for automatic creativity assessment.

Not Relevant

Artificial intelligence reshapes creativity: a multidimensional evaluation

Chenchen Zhang et al.

Artificial intelligence (AI) is reshaping creativity by challenging its long-held status as a uniquely human faculty. This study uses bibliometric analysis to reveal AI’s evolution from a passive instrument to an active co-creator that amplifies human intuition and expands creative possibilities. We highlight how AI-driven evaluative frameworks offer more objective, scalable, and inclusive assessments of creativity, disrupting bias-prone traditional methods. Also, this transformation raises pressing ethical and legal concerns, particularly regarding authorship, intellectual property, and recognition of machine-generated outputs. By mapping these tensions and opportunities, the study provides a critical foundation for rethinking creativity in the age of human–machine collaboration. Our findings point toward an urgent need for new conceptual models that align innovation with ethical and societal responsibility.

Not Relevant

Evaluating AI’s ideas: The role of individual creativity and expertise in human-AI co-creativity

Paul V DiStefano et al.

As generative artificial intelligence (genAI) becomes increasingly integrated into education and work, understanding who benefits most from human-AI collaboration is crucial. This study examines how domain expertise and individual differences—creative self-efficacy and baseline creative ability—influence human-AI co-creativity in an engineering design task. Using pre-generated ideas from GPT-3.5-turbo, engineering (N = 99) and psychology students (N = 212) generated an initial solution, evaluated AI-generated ideas, and revised their idea. Linear mixed-effects models demonstrated expertise and generation ability predicted solution quality. Engineering students produced more original and effective solutions, yet both groups improved comparably supporting the “rising tides lift all boats” hypothesis. A novel categorization scheme revealed group differences in idea inspiration: engineers generated more novel solutions, while psychology students tended to adopt existing ideas. These findings highlight the role of domain knowledge and individual differences in pre-existing creativity in maximizing human-AI co-creativity, emphasizing the need to develop these human abilities alongside genAI.

Not Relevant

Assessing creativity across multi-step intervention using generative AI models

Eran Hadas and Arnon Hershkovitz

Creativity is an imperative skill for today’s learners, one that has important contributions to issues of inclusion and equity in education. Therefore, assessing creativity is of major importance in educational contexts. However, scoring creativity based on traditional tools suffers from subjectivity and is heavily time- and labour-consuming. This is indeed the case for the commonly used Alternative Uses Test (AUT), in which participants are asked to list as many different uses as possible for a daily object. The test measures divergent thinking (DT), which involves exploring multiple possible solutions in various semantic domains. This study leverages recent advancements in generative AI (GenAI) to automate the AUT scoring process, potentially increasing efficiency and objectivity. Using two validated models, we analyze the dynamics of creativity dimensions in a multi-step intervention aimed at improving creativity by using repeated AUT sessions (N=157 9th-grade students). Our research questions focus on the behavioural patterns of DT dimensions over time, their correlation with the number of practice opportunities, and the influence of response order on creativity scores. The results show improvement in fluency and flexibility, as a function of practice opportunities, as well as various correlations between DT dimensions. By automating the scoring process, this study aims to provide deeper insights into the development of creative skills over time and explore the capabilities of GenAI in educational assessments. Eventually, the use of automatic evaluation can incorporate creativity evaluation in various educational processes at scale.

Relevant
creativity frameworks › psychological/cognitive · evaluation › LLM-as-a-judge · evaluation › automatic metrics · evaluation › creativity evaluation · evaluation › sentence-level · related to creativity › mentions creativity as a human ability

Creativity, context, and large language models

Max Peeperkorn

Not Relevant

Creativity behind the prompts: Automated creativity assessment in prompting for text-to-image models

Saif Abdoelrazak

This thesis delves into the intersection of artificial intelligence and creativity, specifically focusing on the application of text-to-image synthesis models. These models, gaining significant attention in recent years, hold the potential to redefine the boundaries of human imagination and challenge conventional notions of creativity. However, they also raise pertinent questions about originality, copyright, and the role of human input in the creative process. The study investigates the use of prompt engineering to augment the creativity of the generated artworks. Various prompt modifiers, including artist names and aesthetic quality descriptors, are employed to guide the synthesis process. The results indicate that the strategic use of these modifiers significantly enhances the creativity of the generated images, providing a concrete strategy for both novice and experienced users of these models. The research also explores the use of topic modeling methods, such as Gibbs Sampling Dirichlet Mixture Model (GSDMM) and BERTopic. However, several challenges, including computational constraints and limitations in the clustering methods used, are identified. Despite these challenges, the research offers valuable insights into the potential of text-to-image synthesis models and the role of prompt engineering in enhancing creativity. Future work aims to address these challenges and further explore the potential of these models in various creative domains.

Not Relevant

AI as humanity's salieri: Quantifying linguistic creativity of language models via systematic attribution of machine text against web text

Ximing Lu et al.

Creativity has long been considered one of the most difficult aspects of human intelligence for AI to mimic. However, the rise of Large Language Models (LLMs), like ChatGPT, has raised questions about whether AI can match or even surpass human creativity. We present CREATIVITY INDEX as the first step to quantify the linguistic creativity of a text by reconstructing it from existing text snippets on the web. CREATIVITY INDEX is motivated by the hypothesis that the seemingly remarkable creativity of LLMs may be attributable in large part to the creativity of human-written texts on the web. To compute CREATIVITY INDEX efficiently, we introduce DJ SEARCH, a novel dynamic programming algorithm that can search verbatim and near-verbatim matches of text snippets from a given document against the web. Experiments reveal that the CREATIVITY INDEX of professional human authors is on average 66.2% higher than that of LLMs, and that alignment reduces the CREATIVITY INDEX of LLMs by an average of 30.1%. In addition, we find that distinguished authors like Hemingway exhibit measurably higher CREATIVITY INDEX compared to other human writers. Finally, we demonstrate that CREATIVITY INDEX can be used as a surprisingly effective criterion for zero-shot machine text detection, surpassing the strongest existing zero-shot system, DetectGPT, by a significant margin of 30.2%, and even outperforming the strongest supervised system, GhostBuster, in five out of six domains.

Relevant
creativity frameworks › creative-textual creativity · evaluation › automatic metrics · evaluation › creativity evaluation · evaluation › document-level · model used › ChatGPT · model used › Large (>32B) · model used › Medium (8-24) · related to creativity › related to creativity as a human ability · scope › prompt engineering · scope › technical research · textual genre › literature · textual genre › poetry
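
As a rough illustration of the coverage idea behind such an index (not the paper's DJ SEARCH algorithm, which handles near-verbatim matches against a web-scale index via dynamic programming), the sketch below scores a document by the fraction of its n-grams not found verbatim in a small in-memory reference corpus. The function name `creativity_index`, the tokenization, and the choice of n are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a coverage-based "creativity index", assuming a small
# in-memory reference corpus instead of the web-scale index used in the paper.
# Only verbatim n-gram overlap is counted here; near-verbatim matching is omitted.

def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def creativity_index(document: str, reference_texts: list[str], n: int = 5) -> float:
    """Fraction of the document's n-grams NOT found verbatim in the reference
    corpus (higher = less reconstructible from existing text)."""
    doc_ngrams = ngrams(document.lower().split(), n)
    if not doc_ngrams:
        return 0.0
    ref_ngrams = set()
    for text in reference_texts:
        ref_ngrams |= ngrams(text.lower().split(), n)
    matched = sum(1 for g in doc_ngrams if g in ref_ngrams)
    return 1.0 - matched / len(doc_ngrams)

if __name__ == "__main__":
    corpus = ["the cat sat on the mat and looked at the moon"]
    print(creativity_index("the cat sat on the mat dreaming of electric mice", corpus, n=3))
```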

Creative agents: Simulating the systems model of creativity with generative agents

Naomi Imasato et al.

With the growing popularity of generative AI for images, video, and music, we witnessed models rapidly improve in quality and performance. However, not much attention is paid towards enabling AI's ability to “be creative”. We often attribute the quality of “being creative” to an individual or an object, but we believe that countless variables participate in determining what or who is creative, transcending a single entity or artifact. Csikszentmihalyi's systems model of creativity suggests that creativity is a product of interactions among multiple parts of a society that create, evaluate, and record. In this study, we implemented and simulated Csikszentmihalyi's systems model of creativity using virtual agents utilizing large language models (LLMs) and text prompts. We conducted experiments in virtual settings where creativity is achieved with the presence of specific characteristics in the artifact. For comparison, the simulations were conducted with two “virtual artists” being 1) in the system, which received feedback from the field, and 2) isolated, which did not. Both agents were compared by analyzing the novelty, which was measured via Creativity Implication Network, and value, quantified through the desired characteristics present in artifacts. Our results suggest that the agents that receive feedback from the field can generate artifacts that are more novel and more valuable, thus more creative, in the framework of the systems model of creativity. Furthermore, the difference becomes more evident when external factors enact changes to the domain.

Not Relevant

Human creativity in the age of LLMs: Randomized experiments on divergent and convergent thinking

Harsh Kumar et al.

Large language models are transforming the creative process by offering unprecedented capabilities to algorithmically generate ideas. While these tools can enhance human creativity when people co-create with them, it’s unclear how this will impact unassisted human creativity. We conducted two large pre-registered parallel experiments involving 1,100 participants attempting tasks targeting the two core components of creativity, divergent and convergent thinking. We compare the effects of two forms of large language model (LLM) assistance—a standard LLM providing direct answers and a coach-like LLM offering guidance—with a control group receiving no AI assistance, and focus particularly on how all groups perform in a final, unassisted stage. Our findings reveal that while LLM assistance can provide short-term boosts in creativity during assisted tasks, it may inadvertently hinder independent creative performance when users work without assistance, raising concerns about the long-term impact on human creativity and cognition.

Proceedings of the 2025 CHI conference on human factors in computing systems
Relevant
creativity frameworks › psychological/cognitive · evaluates a creative feature › logic (puzzles, etc.) · evaluation › LLM-as-a-judge · evaluation › automatic metrics · evaluation › creativity evaluation · model used › ChatGPT · model used › Large (>32B) · related to creativity › mentions creativity as a human ability

A new dataset and method for creativity assessment using the alternate uses task

Luning Sun et al.

Creativity ratings by humans for the alternate uses task (AUT) tend to be subjective and inefficient. To automate the scoring process of the AUT, previous literature suggested using semantic distance from non-contextual models. In this paper, we extend this line of research by including contextual semantic models and more importantly, exploring the feasibility of predicting creativity ratings with supervised discriminative machine learning models. Based on a newly collected dataset, our results show that supervised models can successfully classify between creative and non-creative responses even with unbalanced data, and can generalise well to out-of-domain unseen prompts.

Intelligent computers, algorithms, and applications
Not Relevant
creativity frameworks › computational creativity

Autonomous measure of creativity in large language models (LLM)

Javier M. Mora-Merchan et al.

The present work proposes a new metric to measure the creativity of Large Language Models (LLM). Unlike traditional approaches that compare the results produced by LLMs with those produced by humans, this study presents an autonomous approach that does not depend on external reference systems. The methodology consists of providing a specific input prompt to LLMs, for example, generating a short three-paragraph story describing children in a park. Then, a large number of stories are generated, and techniques such as morphological or semantic grouping are applied to determine the number of original stories produced. Given that natural language processing (NLP) comparison techniques are computationally intensive and the calculation of distances is of order $O(n^2)$, it’s not possible to directly count the number of original stories generated by prompt. To address this, models have been developed from a calibration set, which simulate story generation and from which the maximum number of stories that can be generated is inferred. This methodology is not specific to the LLM or the prompt used, so it can be used to compare existing systems.

A cross-disciplinary exploration of STEM
Not Relevant
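
The counting step described in the abstract above is easy to picture with a toy sketch. The snippet below is a minimal illustration, not the paper's implementation: it groups near-duplicate stories by pairwise cosine similarity over TF-IDF vectors and counts the resulting groups, which makes the $O(n^2)$ cost of exhaustive pairwise comparison explicit. The vectorizer and the 0.8 threshold are illustrative assumptions.

```python
# Hypothetical sketch: counting "original" stories by greedy similarity grouping.
# Not the paper's method; it only illustrates the O(n^2) pairwise-comparison step.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def count_original_stories(stories, threshold=0.8):
    """Assign each story to an existing group if it is similar enough to that
    group's representative; otherwise it starts a new group. Return group count."""
    vectors = TfidfVectorizer().fit_transform(stories)
    sims = cosine_similarity(vectors)  # n x n similarity matrix: the O(n^2) step
    representatives = []
    for i in range(len(stories)):
        if all(sims[i, j] < threshold for j in representatives):
            representatives.append(i)  # story i is not a near-duplicate of any group
    return len(representatives)

stories = [
    "Two children fly a red kite in the park.",
    "Two kids fly a red kite at the park.",
    "A dog chases pigeons around the fountain.",
]
print(count_original_stories(stories))  # number of distinct story groups under the threshold
```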

The impact of AI on creativity: Enhancing human potential or challenging creative expression

Luka Baklaga


Not Relevant

Luminate: Structured generation and exploration of design space with large language models for human-AI co-creation

Sangho Suh et al.

Thanks to their generative capabilities, large language models (LLMs) have become an invaluable tool for creative processes. These models have the capacity to produce hundreds and thousands of visual and textual outputs, offering abundant inspiration for creative endeavors. But are we harnessing their full potential? We argue that current interaction paradigms fall short, guiding users towards rapid convergence on a limited set of ideas, rather than empowering them to explore the vast latent design space in generative models. To address this limitation, we propose a framework that facilitates the structured generation of design space in which users can seamlessly explore, evaluate, and synthesize a multitude of responses. We demonstrate the feasibility and usefulness of this framework through the design and development of an interactive system, Luminate, and a user study with 14 professional writers. Our work advances how we interact with LLMs for creative tasks, introducing a way to harness the creative potential of LLMs.

Proceedings of the 2024 CHI conference on human factors in computing systems
Relevant
evaluation › creativity evaluation · evaluation › human eval · creativity frameworks › creative-textual creativity · creativity frameworks › psychological/cognitive · post-editing › post-editing with LLMs · model used › ChatGPT · scope › prompt engineering · textual genre › poetry · textual genre › music · textual genre › literature

Style over story: a process-oriented study of authorial creativity in large language models

Donghoon Jung et al.

Evaluations of large language models (LLMs)' creativity have focused primarily on the quality of their outputs rather than the processes that shape them. This study takes a process-oriented approach, drawing on narratology to examine LLMs as computational authors. We introduce constraint-based decision-making as a lens for authorial creativity. Using controlled prompting to assign authorial personas, we analyze the creative preferences of the models. Our findings show that LLMs consistently emphasize Style over other elements, including Character, Event, and Setting. By also probing the reasoning the models provide for their choices, we show that distinctive profiles emerge across models and argue that our approach provides a novel systematic tool for analyzing AI's authorial creativity.

Relevant
creativity frameworks › creative-textual creativity · evaluation › creativity evaluation · evaluation › document-level · evaluation › automatic metrics · model used › ChatGPT · textual genre › literature · scope › prompt engineering

Artificial intelligence as a tool for creativity

Zorana Ivcevic and Mike Grandinetti

The release of ChatGPT has sparked quite a bit of interest about creativity in the context of artificial intelligence (AI), with theorizing and empirical research asking questions about the nature of creativity (both human and artificially-produced) and the valuing of work produced by humans and artificial means. In this article, we discuss one specific scenario identified in the creativity research community – co-creation, or use of AI as a tool that could augment human creativity. We present emerging research relevant to how AI can be used on a continuum of four levels of creativity, from mini-c/creativity in learning to little-c/everyday creativity to Pro-C/professional creativity and Big-C/eminent creativity. In this discussion, AI is defined broadly, not to include only large language models (e.g., ChatGPT) which might approach general AI, but also other computer programs that perform tasks typically understood as requiring human intelligence. We conclude by considering future directions for research on AI as a tool for creativity across the four c's.

Not Relevant

Evaluation is creation: Self and social judgments of creativity across the four-c model

Denis Dumas and James C. Kaufman

Who should evaluate the originality and task-appropriateness of a given idea has been a perennial debate among psychologists of creativity. Here, we argue that the most relevant evaluator of a given idea depends crucially on the level of expertise of the person who generated it. To build this argument, we draw on two complementary theoretical perspectives. The model of domain learning (MDL) suggests that, for novices in a domain, creativity is by-necessity self-referenced, but as expertise develops, more socially referenced creativity is possible. Relatedly, the four-C model posits four forms of creativity that fall along a continuum of social impact: mini-c, little-c, Pro-c, and Big-C. We show that the MDL implies a learning trajectory that connects the four Cs because, as socially referenced creativity develops, greater societal impact becomes available to a creator. Then, we describe four sources of evaluations that become relevant as an individual learns: judgments from the creators themselves, their local community, consumers of the idea, and finally, critics in the domain. We suggest that creators’ judgments are of essential importance for mini-c, community judgments are paramount for little-c, Pro-c requires either positive evaluations from consumers or critics, and Big-C requires both consumers and critics to evaluate an idea positively for an extended time. We identify key insights and imperatives for the field: aligning our measures (both human and AI scored) with the most relevant evaluations of ideas to support the reliability and validity of our measurements, using evaluations as feedback for learners to support the development of creative metacognition, and the importance of considering domain differences when evaluating ideas.

Not Relevant

The effects of (dis)similarities between the creator and the assessor on assessing creativity: a comparison of humans and LLMs

Martin ‘t Hof et al.

Current research predominantly involves human subjects to evaluate AI creativity. In this explorative study, we questioned the validity of this practice and examined how creator–assessor (dis)similarity—namely to what extent the creator and the assessor were alike—along two dimensions of culture (Western and English-speaking vs. Eastern and Chinese-speaking) and agency (human vs. AI) influences the assessment of creativity. We first asked four types of subjects to create stories, including Eastern participants (university students from China), Eastern AI (Kimi from China), Western participants (university students from The Netherlands), and Western AI (ChatGPT 3.5 from the US). Both Eastern participants and AI created stories in Chinese, which were then translated into English, while both Western participants and AI created stories in English, which were then translated into Chinese. A subset of these stories (2 creative and 2 uncreative per creator type, in total 16 stories) was then randomly selected as assessment materials. Adopting a within-subject design, we then asked new subjects from the same four types (n = 120, 30 per type) to assess these stories on creativity, originality, and appropriateness. The results confirmed that similarities in both dimensions of culture and agency influence the assessment of originality and appropriateness. As for the agency dimension, human assessors preferred human-created stories for originality, while AI assessors showed no preference. Conversely, AI assessors rated AI-generated stories higher in appropriateness, whereas human assessors showed no preference. Culturally, both Eastern and Western assessors favored Eastern-created stories in originality. And as for appropriateness, the assessors always preferred stories from the creators with the same cultural backgrounds. The present study is significant in attempting to ask an often-overlooked question and provides the first empirical evidence to underscore the need for more discussion on using humans to judge AI agents’ creativity or the other way around.

Relevant
creativity frameworks › psychological/cognitive · evaluation › LLM-as-a-judge · evaluation › automatic metrics · evaluation › creativity evaluation · evaluation › document-level · evaluation › human eval · model used › ChatGPT · related to creativity › related to creativity as a textual genre

Assessing the creativity of LLMs in proposing novel solutions to mathematical problems

Junyi Ye et al.

The mathematical capabilities of AI systems are complex and multifaceted. Most existing research has predominantly focused on the correctness of AI-generated solutions to mathematical problems. In this work, we argue that beyond producing correct answers, AI systems should also be capable of, or assist humans in, developing novel solutions to mathematical challenges. This study explores the creative potential of Large Language Models (LLMs) in mathematical reasoning, an aspect that has received limited attention in prior research. We introduce a novel framework and benchmark, CreativeMath, which encompasses problems ranging from middle school curricula to Olympic-level competitions, designed to assess LLMs' ability to propose innovative solutions after some known solutions have been provided. Our experiments demonstrate that, while LLMs perform well on standard mathematical tasks, their capacity for creative problem-solving varies considerably. Notably, the Gemini-1.5-Pro model outperformed other LLMs in generating novel solutions. This research opens a new frontier in evaluating AI creativity, shedding light on both the strengths and limitations of LLMs in fostering mathematical innovation, and setting the stage for future developments in AI-assisted mathematical discovery.

Relevant
creativity frameworks › computational creativity · creativity frameworks › psychological/cognitive · evaluation › LLM-as-a-judge · evaluation › automatic metrics · evaluation › creativity evaluation · evaluation › human eval · model used › ChatGPT · model used › Large (>32B) · related to creativity › related to creativity as a human ability

A survey on evaluation of large language models

Yupeng Chang et al.

Large language models (LLMs) are gaining increasing popularity in both academia and industry, owing to their unprecedented performance in various applications. As LLMs continue to play a vital role in both research and daily use, their evaluation becomes increasingly critical, not only at the task level, but also at the society level for better understanding of their potential risks. Over the past years, significant efforts have been made to examine LLMs from various perspectives. This paper presents a comprehensive review of these evaluation methods for LLMs, focusing on three key dimensions: what to evaluate, where to evaluate, and how to evaluate. Firstly, we provide an overview from the perspective of evaluation tasks, encompassing general natural language processing tasks, reasoning, medical usage, ethics, education, natural and social sciences, agent applications, and other areas. Secondly, we answer the ‘where’ and ‘how’ questions by diving into the evaluation methods and benchmarks, which serve as crucial components in assessing the performance of LLMs. Then, we summarize the success and failure cases of LLMs in different tasks. Finally, we shed light on several future challenges that lie ahead in LLMs evaluation. Our aim is to offer invaluable insights to researchers in the realm of LLMs evaluation, thereby aiding the development of more proficient LLMs. Our key point is that evaluation should be treated as an essential discipline to better assist the development of LLMs. We consistently maintain the related open-source materials at:

Not Relevant

The creative agency of large language models: a philosophical inquiry

Paschal Mmesoma Ukpaka

This paper explores the difficult question of whether Large Language Models (LLMs) are intrinsically creative. Because they can independently create original content, LLMs are often seen as creative agents. Contrary to the belief that LLMs are creative, this paper argues that LLMs are not creative for two reasons. First, LLMs are not creative because they lack an essential component of creativity, which is the first-person experience of the world. Secondly, LLMs are not creative because they are not the principal authors of their creative output, for they lack the subjective awareness and intentionality necessary to be regarded as authors, and their output is a collaborative effort of the AI model, data providers, and other stakeholders. Since they are not full-fledged authors in a traditional sense, they are not creative.

Not Relevant

Exploring automated assessment of primary students' creativity in a flow-based music programming environment

Zifeng Liu et al.

Creativity is a vital skill in science, technology, engineering, and mathematics (STEM)-related education, fostering innovation and problem-solving. Traditionally, creativity assessments relied on human evaluations, such as the consensual assessment technique (CAT), which are resource-intensive, time-consuming, and often subjective. Recent advances in computational methods, particularly large language models (LLMs), have enabled automated creativity assessments. In this study, we extend research on automated creativity scoring to a flow-based music programming environment, a context that integrates computational and creative thinking. We collected 383 programming artifacts from 194 primary school students (2022-2024) and employed two automated approaches: an evidence-centred design (ECD) framework-based approach and an LLM-based approach using ChatGPT-4 with few-shot learning. The ECD-based approach integrates divergent thinking, complexity, efficiency, and emotional expressiveness, while the LLM-based approach uses CAT ratings and ECD examples to learn creativity scoring. Results revealed moderate to strong correlations with human evaluations (ECD-based: r = 0.48; LLM-based: r = 0.68), with the LLM-based approach demonstrating greater consistency across varying learning examples (r = 0.82). These findings highlight the potential of automated tools for scalable, objective, and efficient creativity assessment, paving the way for their application in creativity-focused learning environments.

Relevant
creativity frameworks › psychological/cognitive · evaluation › creativity evaluation · evaluation › automatic metrics · evaluation › LLM-as-a-judge · related to creativity › mentions creativity as a human ability

Creativity has left the chat: The price of debiasing language models

Behnam Mohammadi

Large Language Models (LLMs) have revolutionized natural language processing but can exhibit biases and may generate toxic content. While alignment techniques like Reinforcement Learning from Human Feedback (RLHF) reduce these issues, their impact on creativity, defined as syntactic and semantic diversity, remains unexplored. We investigate the unintended consequences of RLHF on the creativity of LLMs through three experiments focusing on the Llama-2 series. Our findings reveal that aligned models exhibit lower entropy in token predictions, form distinct clusters in the embedding space, and gravitate towards "attractor states", indicating limited output diversity. Our findings have significant implications for marketers who rely on LLMs for creative tasks such as copywriting, ad creation, and customer persona generation. The trade-off between consistency and creativity in aligned models should be carefully considered when selecting the appropriate model for a given application. We also discuss the importance of prompt engineering in harnessing the creative potential of base models.

Relevant
creativity frameworks › creative-textual creativity · evaluation › automatic metrics · evaluation › creativity evaluation · evaluation › document-level · related to creativity › mentions creativity as a textual genre · scope › creative training · scope › prompt engineering
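
The entropy diagnostic mentioned in this abstract can be approximated with standard tooling. The sketch below is an illustration under assumed choices (it loads gpt2 as a stand-in, whereas the study examines the Llama-2 series): it computes the Shannon entropy of each next-token distribution and averages it, the quantity the abstract reports as lower for aligned models.

```python
# Illustrative sketch, not the paper's code: mean entropy of next-token
# distributions for a prompt. Model name is a placeholder assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # stand-in model for illustration only
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = "Write a slogan for a coffee shop:"
ids = tok(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits[0]                      # (seq_len, vocab_size)
probs = torch.softmax(logits, dim=-1)
entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=-1)  # entropy per position
print(f"mean next-token entropy: {entropy.mean().item():.2f} nats")
```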

Automating evaluation of creativity in LLMs with semantic entropy and efficient multi-agent judge

Tan Min Sen et al.

Large Language Models (LLMs) have achieved remarkable progress in natural language comprehension, reasoning, and generation, sparking interest in their creative potential. Automating creativity evaluation in LLMs, particularly in physical reasoning tasks, presents a transformative opportunity to accelerate scientific discovery by enabling innovative solutions, uncovering patterns, and automating problem-solving processes. Current creativity evaluation frameworks, however, rely heavily on human annotation, making them subjective, resource-intensive, and impractical for scaling. To address this, we introduce a novel automated evaluation framework rooted in cognitive science principles of divergent and convergent thinking. Divergent creativity is measured using Semantic Entropy, a sampling-based metric that quantifies variability in generated outputs to capture the novelty of ideas. Convergent creativity is assessed using a modified retrieval-based discussion framework—60% more efficient—where autonomous multi-agent systems evaluate task solutions across feasibility, safety, and effectiveness. We implement these methodologies within a benchmark based on the MacGyver dataset, which contains 300 real-world, solvable problems requiring innovative use of everyday objects. Our framework evaluates state-of-the-art LLMs, such as GPT and LLaMA models, while analyzing the effects of key parameters like temperature, model size, and recency. By automating creativity evaluation, we establish a scalable, objective, and reproducible methodology to enhance LLM development, paving the way for breakthroughs in scientific discovery and creative problem-solving across diverse fields.

Relevant
creativity frameworks › psychological/cognitive · creativity frameworks › computational creativity · evaluation › creativity evaluation · evaluation › human eval · evaluation › sentence-level · evaluation › LLM-as-a-judge · evaluation › automatic metrics · model used › ChatGPT · model used › Large (>32B) · model used › Medium (8-24) · related to creativity › mentions creativity as a human ability
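
As a rough illustration of the divergent half of this framework, semantic entropy can be sketched as: sample several answers, group them into meaning clusters, and take the Shannon entropy of the cluster sizes. The grouping below is a deliberate simplification (normalised string matching rather than real semantic clustering) and is not the authors' implementation.

```python
# Minimal sketch of a semantic-entropy style diversity score. The clustering
# function is a stand-in assumption; real use would cluster by meaning.
from collections import Counter
from math import log2

def semantic_entropy(samples, cluster_fn=lambda s: s.lower().strip(" .!?")):
    clusters = Counter(cluster_fn(s) for s in samples)   # size of each meaning cluster
    n = sum(clusters.values())
    return -sum((c / n) * log2(c / n) for c in clusters.values())

samples = [
    "Use a paperclip as a zipper pull.",
    "use a paperclip as a zipper pull",
    "Bend a paperclip into a phone stand.",
]
print(round(semantic_entropy(samples), 3))  # 0.918 bits: two distinct ideas, unevenly sampled
```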

DeepMath-creative: a benchmark for evaluating mathematical creativity of large language models

Xiaoyang Chen et al.

To advance the mathematical proficiency of large language models (LLMs), the DeepMath team has launched an open-source initiative aimed at developing an open mathematical LLM and systematically evaluating its mathematical creativity. This paper represents the initial contribution of this initiative. While recent developments in mathematical LLMs have predominantly emphasized reasoning skills, as evidenced by benchmarks on elementary to undergraduate-level mathematical tasks, the creative capabilities of these models have received comparatively little attention, and evaluation datasets remain scarce. To address this gap, we propose an evaluation criteria for mathematical creativity and introduce DeepMath-Creative, a novel, high-quality benchmark comprising constructive problems across algebra, geometry, analysis, and other domains. We conduct a systematic evaluation of mainstream LLMs' creative problem-solving abilities using this dataset. Experimental results show that even under lenient scoring criteria – emphasizing core solution components and disregarding minor inaccuracies, such as small logical gaps, incomplete justifications, or redundant explanations – the best-performing model, O3 Mini, achieves merely 70% accuracy, primarily on basic undergraduate-level constructive tasks. Performance declines sharply on more complex problems, with models failing to provide substantive strategies for open problems. These findings suggest that, although current LLMs display a degree of constructive proficiency on familiar and lower-difficulty problems, such performance is likely attributable to the recombination of memorized patterns rather than authentic creative insight or novel synthesis.

Not Relevant

How do hackathons foster creativity? Towards automated evaluation of creativity at scale

Jeanette Falk et al.

Hackathons have become popular collaborative events for accelerating the development of creative ideas and prototypes. There are several case studies showcasing creative outcomes across domains such as industry, education, and research. However, there are no large-scale studies on creativity in hackathons which can advance theory on how hackathon formats lead to creative outcomes. We conducted a computational analysis of 193,353 hackathon projects. By operationalizing creativity through usefulness and novelty, we refined our dataset to 10,363 projects, allowing us to analyze how participant characteristics, collaboration patterns, and hackathon setups influence the development of creative projects. The contribution of our paper is twofold: We identified means for organizers to foster creativity in hackathons. We also explore the use of large language models (LLMs) to augment the evaluation of creative outcomes and discuss challenges and opportunities of doing this, which has implications for creativity research at large.

Proceedings of the 2025 CHI conference on human factors in computing systems

Automatic assessment of mathematical creativity using natural language processing

Rebecca Marrone et al.

Creativity is now accepted as a core 21st-century competency and is increasingly an explicit part of school curricula around the world. Therefore, the ability to assess creativity for both formative and summative purposes is vital. However, the fitness-for-purpose of creativity tests has recently come under scrutiny. Current creativity assessments have up to five key weaknesses that create a barrier to their widespread use in educational settings. These are: (a) A lack of domain/subject specificity; (b) Inconsistency, leading to a lack of trust; (c) A lack of authenticity in classroom settings; (d) Slowness (in providing useful results); (e) High cost to administer. The aim of the present study is to explore the feasibility of the automated assessment of mathematical creativity, drawing on tools and techniques from the field of natural language processing, as a means to address these weaknesses. This paper describes the performance of a machine learning algorithm, relative to human judges, demonstrating the practicality of automated creativity assessment for large-scale, school-based assessments. The importance of creativity is recognized in education systems globally. The ability to assess creativity, for both formative and summative purposes, is therefore vital. However, the quality of creativity tests (their validity and reliability) tends to come at a cost. In simple terms, the better the creativity test, the greater the effort, and therefore cost, required to deliver and score that test. The aim of the present study is to explore the feasibility of the automated assessment of mathematical creativity, drawing on tools and techniques from the field of natural language processing. This paper describes how a machine learning algorithm assesses mathematical creativity and compares this to human judges.

How do hackathons foster creativity? Towards AI collaborative evaluation of creativity at scale

Jeanette Falk et al.

Hackathons have become popular collaborative events for accelerating the development of creative ideas and prototypes. There are several case studies showcasing creative outcomes across domains such as industry, education, and research. However, there are no large-scale studies on creativity in hackathons which can advance theory on how hackathon formats lead to creative outcomes. We conducted a computational analysis of 193,353 hackathon projects. By operationalizing creativity through usefulness and novelty, we refined our dataset to 10,363 projects, allowing us to analyze how participant characteristics, collaboration patterns, and hackathon setups influence the development of creative projects. The contribution of our paper is twofold: We identified means for organizers to foster creativity in hackathons. We also explore the use of large language models (LLMs) to augment the evaluation of creative outcomes and discuss challenges and opportunities of doing this, which has implications for creativity research at large.

A robot walks into a bar: Can language models serve as creativity support tools for comedy? An evaluation of LLMs’ humour alignment with comedians

Piotr Mirowski et al.

We interviewed twenty professional comedians who perform live shows in front of audiences and who use artificial intelligence in their artistic process as part of 3-hour workshops on “AI x Comedy” conducted at the Edinburgh Festival Fringe in August 2023 and online. The workshop consisted of a comedy writing session with large language models (LLMs), a human-computer interaction questionnaire to assess the Creativity Support Index of AI as a writing tool, and a focus group interrogating the comedians’ motivations for and processes of using AI, as well as their ethical concerns about bias, censorship and copyright. Participants noted that existing moderation strategies used in safety filtering and instruction-tuned LLMs reinforced hegemonic viewpoints by erasing minority groups and their perspectives, and qualified this as a form of censorship. At the same time, most participants felt the LLMs did not succeed as a creativity support tool, by producing bland and biased comedy tropes, akin to “cruise ship comedy material from the 1950s, but a bit less racist”. Our work extends scholarship about the subtle difference between, on the one hand, harmful speech, and on the other hand, “offensive” language as a practice of resistance, satire and “punching up”. We also interrogate the global value alignment behind such language models, and discuss the importance of community-based value alignment and data ownership to build AI tools that better suit artists’ needs. Warning: this study may contain offensive language and discusses self-harm.

Proceedings of the 2024 ACM conference on fairness, accountability, and transparency

Knowledge-Enhanced Large Language Models and Human-AI Collaboration Frameworks for Creativity Support

Unknown


Automated scoring of scientific creativity in german

Benjamin Goecke et al.

Automated scoring is a current hot topic in creativity research. However, most research has focused on the English language and popular verbal creative thinking tasks, such as the alternate uses task. Therefore, in this study, we present a large language model approach for automated scoring of a scientific creative thinking task that assesses divergent ideation in experimental tasks in the German language. Participants are required to generate alternative explanations for an empirical observation. This work analyzed a total of 13,423 unique responses. To predict human ratings of originality, we used XLM-RoBERTa (Cross-lingual Language Model-RoBERTa), a large, multilingual model. The prediction model was trained on 9,400 responses. Results showed a strong correlation between model predictions and human ratings in a held-out test set (n = 2,682; r = 0.80; 95% CI [0.79, 0.81]). These promising findings underscore the potential of large language models for automated scoring of scientific creative thinking in the German language. We encourage researchers to further investigate automated scoring of other domain-specific creative thinking tasks.

Exploring poetic creativity in large language models: a dynamic multi-agent framework for poem generation

Scarlet Weatherby et al.

Creativity Benchmark: A benchmark for marketing creativity for large language models

Ninad Bhat et al.

We introduce Creativity Benchmark, an evaluation framework for large language models (LLMs) in marketing creativity. The benchmark covers 100 brands (12 categories) and three prompt types (Insights, Ideas, Wild Ideas). Human pairwise preferences from 678 practising creatives over 11,012 anonymised comparisons, analysed with Bradley-Terry models, show tightly clustered performance with no model dominating across brands or prompt types: the top-bottom spread is $\Delta\theta \approx 0.45$, which implies a head-to-head win probability of $0.61$; the highest-rated model beats the lowest only about $61\%$ of the time. We also analyse model diversity using cosine distances to capture intra- and inter-model variation and sensitivity to prompt reframing. Comparing three LLM-as-judge setups with human rankings reveals weak, inconsistent correlations and judge-specific biases, underscoring that automated judges cannot substitute for human evaluation. Conventional creativity tests also transfer only partially to brand-constrained tasks. Overall, the results highlight the need for expert human evaluation and diversity-aware workflows.
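
For readers unfamiliar with the Bradley-Terry analysis used here: pairwise preference counts are converted into per-model strengths, and under the model a log-strength gap of $\Delta\theta \approx 0.45$ translates into a head-to-head win probability of $1/(1+e^{-0.45}) \approx 0.61$, the figure quoted above. The sketch below fits strengths with the standard MM update on invented comparison counts; it is illustrative only and not the benchmark's own code.

```python
# Hypothetical sketch: Bradley-Terry strengths from pairwise preferences using
# the classic MM update (Hunter, 2004). Model names and counts are invented.
from collections import defaultdict

def bradley_terry(wins, iters=200):
    """wins[(a, b)] = number of times a was preferred over b."""
    items = {x for pair in wins for x in pair}
    p = {x: 1.0 for x in items}
    games = defaultdict(int)
    for (a, b), w in wins.items():
        games[(a, b)] += w
        games[(b, a)] += w          # total comparisons between a and b (symmetric)
    for _ in range(iters):
        new_p = {}
        for i in items:
            w_i = sum(w for (a, b), w in wins.items() if a == i)   # total wins of i
            denom = sum(games[(i, j)] / (p[i] + p[j])
                        for j in items if j != i and games[(i, j)] > 0)
            new_p[i] = w_i / denom if denom else p[i]
        total = sum(new_p.values())
        p = {i: v / total for i, v in new_p.items()}               # normalise strengths
    return p

wins = {("model_a", "model_b"): 7, ("model_b", "model_a"): 3,
        ("model_a", "model_c"): 6, ("model_c", "model_a"): 4,
        ("model_b", "model_c"): 5, ("model_c", "model_b"): 5}
print(bradley_terry(wins))  # larger strength = more often preferred overall
```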

Automated assessment of creativity in multilingual narratives.

Simone A. Luchini et al.

Multilingual semantic distance: Automatic verbal creativity assessment in many languages.

John D. Patterson et al.

"It Felt Like Having a Second Mind": Investigating human-AI co-creativity in prewriting with large language models

Qian Wan et al.

Prewriting is the process of discovering and developing ideas before writing a first draft, which requires divergent thinking and often implies unstructured strategies such as diagramming, outlining, free-writing, etc. Although large language models (LLMs) have been demonstrated to be useful for a variety of tasks including creative writing, little is known about how users would collaborate with LLMs to support prewriting. The preferred collaborative role and initiative of LLMs during such a creative process is also unclear. To investigate human-LLM collaboration patterns and dynamics during prewriting, we conducted a three-session qualitative study with 15 participants in two creative tasks: story writing and slogan writing. The findings indicated that during collaborative prewriting, there appears to be a three-stage iterative Human-AI Co-creativity process that includes Ideation, Illumination, and Implementation stages. This collaborative process champions the human in a dominant role, in addition to mixed and shifting levels of initiative that exist between humans and LLMs. This research also reports on collaboration breakdowns that occur during this process, user perceptions of using existing LLMs during Human-AI Co-creativity, and discusses design implications to support this co-creativity process.

Natural language processing in computational creativity: a systematic

Owen Graham and Megan Walter

An integrated benchmark for verbal creativity testing of LLMs and humans

Anca Dinu and Andra Maria Florescu

Homogenizing effect of large language model on creativity: An empirical comparison of human and ChatGPT writing

Kibum Moon

Evaluating creativity and deception in large language models: a simulation framework for multi-agent balderdash

Parsa Hejabi et al.

Large Language Models (LLMs) have shown impressive capabilities in complex tasks and interactive environments, yet their creativity remains underexplored. This paper introduces a simulation framework utilizing the game Balderdash to evaluate both the creativity and logical reasoning of LLMs. In Balderdash, players generate fictitious definitions for obscure terms to deceive others while identifying correct definitions. Our framework enables multiple LLM agents to participate in this game, assessing their ability to produce plausible definitions and strategize based on game rules and history. We implemented a centralized game engine featuring various LLMs as participants and a judge LLM to evaluate semantic equivalence. Through a series of experiments, we analyzed the performance of different LLMs, examining metrics such as True Definition Ratio, Deception Ratio, and Correct Guess Ratio. The results provide insights into the creative and deceptive capabilities of LLMs, highlighting their strengths and areas for improvement. Specifically, the study reveals that infrequent vocabulary in LLMs' input leads to poor reasoning on game rules and historical context (https://github.com/ParsaHejabi/Simulation-Framework-for-Multi-Agent-Balderdash).

Enhancing creativity in large language models through associative thinking strategies

Pronita Mehrotra et al.

This paper explores the enhancement of creativity in Large Language Models (LLMs) like vGPT-4 through associative thinking, a cognitive process where creative ideas emerge from linking seemingly unrelated concepts. Associative thinking strategies have been found to effectively help humans boost creativity. However, whether the same strategies can help LLMs become more creative remains under-explored. In this work, we investigate whether prompting LLMs to connect disparate concepts can augment their creative outputs. Focusing on three domains – Product Design, Storytelling, and Marketing – we introduce creativity tasks designed to assess vGPT-4's ability to generate original and useful content. By challenging the models to form novel associations, we evaluate the potential of associative thinking to enhance the creative capabilities of LLMs. Our findings show that leveraging associative thinking techniques can significantly improve the originality of vGPT-4's responses.

Putting GPT-3's creativity to the (alternative uses) test

Claire Stevenson et al.

AI large language models have (co-)produced amazing written works from newspaper articles to novels and poetry. These works meet the standards of the standard definition of creativity: being original and useful, and sometimes even the additional element of surprise. But can a large language model designed to predict the next text fragment provide creative, out-of-the-box, responses that still solve the problem at hand? We put OpenAI's generative natural language model, GPT-3, to the test. Can it provide creative solutions to one of the most commonly used tests in creativity research? We assessed GPT-3's creativity on Guilford's Alternative Uses Test and compared its performance to previously collected human responses on expert ratings of originality, usefulness and surprise of responses, flexibility of each set of ideas as well as an automated method to measure creativity based on the semantic distance between a response and the AUT object in question. Our results show that – on the whole – humans currently outperform GPT-3 when it comes to creative output. But, we believe it is only a matter of time before GPT-3 catches up on this particular task. We discuss what this work reveals about human and AI creativity, creativity testing and our definition of creativity.

Aug-creativity: Framework for human-centered creativity with vision language models

Dan Li et al.

Human-computer interaction

User-controlled knowledge fusion in large language models: Balancing creativity and hallucination

Chen Zhang

In modern dialogue systems, the use of Large Language Models (LLMs) has grown exponentially due to their capacity to generate diverse, relevant, and creative responses. Despite their strengths, striking a balance between the LLMs' creativity and their faithfulness to external knowledge remains a key challenge. This paper presents an innovative user-controllable mechanism that modulates the balance between an LLM's imaginative capabilities and its adherence to factual information. Our approach incorporates a numerical tag during the fine-tuning phase of the LLM's training, representing the degree of faithfulness to the reference knowledge in the generated responses. This degree is computed through an automated process that measures lexical overlap using ROUGE scores, semantic similarity using Sentence-BERT embeddings, and an LLM's self-evaluation score. During model inference, users can manipulate this numerical tag, thus controlling the degree of the LLM's reliance on external knowledge. We conduct extensive experiments across various scenarios, demonstrating the adaptability of our method and its efficacy in ensuring the quality and accuracy of the LLM's responses. The results highlight the potential of our approach to enhance the versatility of LLMs while maintaining a balance between creativity and hallucination.
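
A rough sketch of how such a faithfulness degree could be computed follows. It combines lexical overlap (ROUGE-L) with Sentence-BERT cosine similarity; the equal weighting, the choice of embedding model, and the omission of the LLM self-evaluation term are assumptions made for illustration, not the paper's implementation.

```python
# Hypothetical faithfulness "degree": lexical overlap plus embedding similarity.
# Weights and model choice are illustrative assumptions.
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

_rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
_encoder = SentenceTransformer("all-MiniLM-L6-v2")

def faithfulness_degree(reference_knowledge: str, response: str) -> float:
    lexical = _rouge.score(reference_knowledge, response)["rougeL"].fmeasure
    embeddings = _encoder.encode([reference_knowledge, response])
    semantic = util.cos_sim(embeddings[0], embeddings[1]).item()
    return 0.5 * lexical + 0.5 * semantic   # assumed equal weighting of the two signals

print(faithfulness_degree("The Eiffel Tower is 330 metres tall.",
                          "It stands about 330 metres high."))
```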

Evaluating the creativity of LLMs in persian literary text generation

Armin Tourajmehr et al.

Large language models (LLMs) have demonstrated notable creative abilities in generating literary texts, including poetry and short stories. However, prior research has primarily centered on English, with limited exploration of non-English literary traditions and without standardized methods for assessing creativity. In this paper, we evaluate the capacity of LLMs to generate Persian literary text enriched with culturally relevant expressions. We build a dataset of user-generated Persian literary texts spanning 20 diverse topics and assess model outputs along four creativity dimensions-originality, fluency, flexibility, and elaboration-by adapting the Torrance Tests of Creative Thinking. To reduce evaluation costs, we adopt an LLM as a judge for automated scoring and validate its reliability against human judgments using intraclass correlation coefficients, observing strong agreement. In addition, we analyze the models' ability to understand and employ four core literary devices: simile, metaphor, hyperbole, and antithesis. Our results highlight both the strengths and limitations of LLMs in Persian literary text generation, underscoring the need for further refinement.

Do LLMs agree on the creativity evaluation of alternative uses?

Abdullah Al Rabeyah et al.

This paper investigates whether large language models (LLMs) show agreement in assessing creativity in responses to the Alternative Uses Test (AUT). While LLMs are increasingly used to evaluate creative content, previous studies have primarily focused on a single model assessing responses generated by the same model or humans. This paper explores whether LLMs can impartially and accurately evaluate creativity in outputs generated by both themselves and other models. Using an oracle benchmark set of AUT responses, categorized by creativity level (common, creative, and highly creative), we experiment with four state-of-the-art LLMs evaluating these outputs. We test both scoring and ranking methods and employ two evaluation settings (comprehensive and segmented) to examine if LLMs agree on the creativity evaluation of alternative uses. Results reveal high inter-model agreement, with Spearman correlations averaging above 0.7 across models and reaching over 0.77 with respect to the oracle, indicating a high level of agreement and validating the reliability of LLMs in creativity assessment of alternative uses. Notably, models do not favour their own responses, instead they provide similar creativity assessment scores or rankings for alternative uses generated by other models. These findings suggest that LLMs exhibit impartiality and high alignment in creativity evaluation, offering promising implications for their use in automated creativity assessment.

The current state of artificial intelligence generative language models is more creative than humans on divergent thinking tasks

Kent F. Hubert et al.

Does increasing reliance on artificial intelligence boost creativity? Assessing AI-augmented creativity with large language models

Jiaoping Chen et al.

What shapes a creative machine mind? Comprehensively benchmarking creativity in foundation models

Zicong He et al.

The meteoric rise of foundation models (FMs) has expanded their capabilities far beyond conventional tasks. Creativity, long regarded as a hallmark of human intelligence and a driver of innovation, is now increasingly recognized as a critical dimension of machine intelligence in the era of generative FMs, complementing traditional measures of accuracy. However, existing evaluation frameworks for creativity remain fragmented, relying on ad hoc metrics not firmly grounded in established theories. To address this gap, we introduce C^2-Eval, a holistic benchmark for unified assessment of creativity in FMs. C^2-Eval distinguishes between two complementary forms of creativity: convergent creativity, where tasks admit constrained solutions (e.g., code generation), and divergent creativity, where tasks are open-ended (e.g., storytelling). It evaluates both dimensions using fine-grained criteria derived from social-science theory, focusing on Usefulness, Originality, and Surprise (U-O-S). Through extensive experiments on leading proprietary and open-source models, we analyze trade-offs in their creative capabilities. Our results highlight both the strengths and challenges of current FMs in pursuing a creative machine mind, showing that C^2-Eval is an effective lens for examining the evolving landscape of creative AI.

How the ideation process shapes the creative output in innovation contests—an analysis using a large language model

Martin G. Moehrle et al.

LLM discussion: Enhancing the creativity of large language models via discussion framework and role-play

Li-Chun Lu et al.

Large language models (LLMs) have shown exceptional proficiency in natural language processing but often fall short of generating creative and original responses to open-ended questions. To enhance LLM creativity, our key insight is to emulate the human process of inducing collective creativity through engaging discussions with participants from diverse backgrounds and perspectives. To this end, we propose LLM Discussion, a three-phase discussion framework that facilitates vigorous and diverging idea exchanges and ensures convergence to creative answers. Moreover, we adopt a role-playing technique by assigning distinct roles to LLMs to combat the homogeneity of LLMs. We evaluate the efficacy of the proposed framework with the Alternative Uses Test, Similarities Test, Instances Test, and Scientific Creativity Test through both LLM evaluation and human study. The results show that our proposed framework outperforms single-LLM approaches and existing multi-LLM frameworks across various creativity metrics. The code is available at https://github.com/lawraa/LLM-Discussion.

Creative thought embeddings: a framework for instilling creativity in large language models

Qusay H. Mahmoud

Proceedings of the AAAI symposium series

Comparing Large Language Models verbal creativity to human verbal creativity

Anca Dinu and Andra Florescu

Proceedings of the 10th italian conference on computational linguistics (CLiC-it 2024)

Steering large language models to evaluate and amplify creativity

Matthew Lyle Olson et al.

Although capable of generating creative text, Large Language Models (LLMs) are poor judges of what constitutes "creativity". In this work, we show that we can leverage this knowledge of how to write creatively in order to better judge what is creative. We take a mechanistic approach that extracts differences in the internal states of an LLM when prompted to respond "boringly" or "creatively" to provide a robust measure of creativity that corresponds strongly with human judgment. We also show these internal state differences can be applied to enhance the creativity of generated text at inference time.

Creativity support in the age of large language models: An empirical study involving emerging writers

Tuhin Chakrabarty et al.

The development of large language models (LLMs) capable of following instructions and engaging in conversational interactions sparked increased interest in their utilization across various support tools. We investigate the utility of modern LLMs in assisting professional writers via an empirical user study (n=30). The design of our collaborative writing interface is grounded in the cognitive process model of writing that views writing as a goal-oriented thinking process encompassing non-linear cognitive activities: planning, translating, and reviewing. Participants are asked to submit a post-completion survey to provide feedback on the potential and pitfalls of LLMs as writing collaborators. Upon analyzing the writer-LLM interactions, we find that while writers seek LLM's help across all three types of cognitive activities, they find LLMs more helpful in translation and reviewing. Our findings from analyzing both the interactions and the survey responses highlight future research directions in creative writing assistance using LLMs.

Using large language models to evaluate alternative uses task flexibility score

Eran Hadas and Arnon Hershkovitz

Brainstorm, then select: a generative language model improves its creativity score

Douglas Summers-Stay et al.

The AAAI-23 workshop on creative AI across modalities

Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity

Zijian Ding et al.

Creativity and cognition

Large language model in creative work: The role of collaboration modality and user expertise

Zenan Chen and Jason Chan

Since the launch of ChatGPT in December 2022, large language models (LLMs) have been rapidly adopted by businesses to assist users in a wide range of open-ended tasks, including creative work. Although the versatility of LLM has unlocked new ways of human-artificial intelligence collaboration, it remains uncertain how LLMs should be used to enhance business outcomes. To examine the effects of human-LLM collaboration on business outcomes, we conducted an experiment where we tasked expert and nonexpert users to write an ad copy with and without the assistance of LLMs. Here, we investigate and compare two ways of working with LLMs: (1) using LLMs as “ghostwriters,” which assume the main role of the content generation task, and (2) using LLMs as “sounding boards” to provide feedback on human-created content. We measure the quality of the ads using the number of clicks generated by the created ads on major social media platforms. Our results show that different collaboration modalities can result in very different outcomes for different user types. Using LLMs as sounding boards enhances the quality of the resultant ad copies for nonexperts. However, using LLMs as ghostwriters did not provide significant benefits and is, in fact, detrimental to expert users. We rely on textual analyses to understand the mechanisms, and we learned that using LLMs as ghostwriters produces an anchoring effect, which leads to lower-quality ads. On the other hand, using LLMs as sounding boards helped nonexperts achieve ad content with low semantic divergence to content produced by experts, thereby closing the gap between the two types of users. This paper was accepted by D. J. Wu, information systems. Supplemental Material: The online appendix and data files are available at https://doi.org/10.1287/mnsc.2023.03014 .

Evaluating creativity with AI: Comparing GPT models and human experts in idea evaluation

Theresa S. Kränzle

Creativity support in the age of large language models: An empirical study involving professional writers

Tuhin Chakrabarty et al.

Creativity and cognition

A large-scale evaluation of the collaborative potential of human and machine creativity

Dawei Wang et al.

Homogenization effects of large language models on human creative ideation

Barrett R Anderson et al.

Creativity and cognition

Large Language Models show both individual and collective creativity comparable to humans

Luning Sun et al.

CS4: Measuring the creativity of large language models automatically by controlling the number of story-writing constraints

Anirudh Atmakuru et al.

Evaluating the creativity of large language models (LLMs) in story writing is difficult because LLM-generated stories could seemingly look creative but be very similar to some existing stories in their huge and proprietary training corpus. To overcome this challenge, we introduce a novel benchmark dataset with varying levels of prompt specificity: CS4 (Comparing the Skill of Creating Stories by Controlling the Synthesized Constraint Specificity). By increasing the number of requirements/constraints in the prompt, we can increase the prompt specificity and hinder LLMs from retelling high-quality narratives in their training data. Consequently, CS4 empowers us to indirectly measure the LLMs' creativity without human annotations. Our experiments on LLaMA, Gemma, and Mistral not only highlight the creativity challenges LLMs face when dealing with highly specific prompts but also reveal that different LLMs perform very differently under different numbers of constraints and achieve different balances between the model's instruction-following ability and narrative coherence. Additionally, our experiments on OLMo suggest that Learning from Human Feedback (LHF) can help LLMs select better stories from their training data but has limited influence in boosting LLMs' ability to produce creative stories that are unseen in the training corpora. The benchmark is released at https://github.com/anirudhlakkaraju/cs4_benchmark.

Evaluating text creativity across diverse domains: a dataset and large language model evaluator

Qian Cao et al.

Creativity evaluation remains a challenging frontier for large language models (LLMs). Current evaluations heavily rely on inefficient and costly human judgments, hindering progress in enhancing machine creativity. While automated methods exist, ranging from psychological testing to heuristic- or prompting-based approaches, they often lack generalizability or alignment with human judgment. To address these issues, in this paper, we propose a novel pairwise-comparison framework for assessing textual creativity, leveraging shared contextual instructions to improve evaluation consistency. We introduce CreataSet, a large-scale dataset with 100K+ human-level and 1M+ synthetic creative instruction-response pairs spanning diverse open-domain tasks. Through training on CreataSet, we develop an LLM-based evaluator named CrEval. CrEval demonstrates remarkable superiority over existing methods in alignment with human judgments. Experimental results underscore the indispensable significance of integrating both human-generated and synthetic data in training highly robust evaluators, and showcase the practical utility of CrEval in boosting the creativity of LLMs. We will release all data, code, and models publicly soon to support further research.

Has the Creativity of Large-Language Models peaked? An analysis of inter- and intra-LLM variability

Jennifer Haase et al.

Following the widespread adoption of ChatGPT in early 2023, numerous studies reported that large language models (LLMs) can match or even surpass human performance in creative tasks. However, it remains unclear whether LLMs have become more creative over time, and how consistent their creative output is. In this study, we evaluated 14 widely used LLMs – including GPT-4, Claude, Llama, Grok, Mistral, and DeepSeek – across two validated creativity assessments: the Divergent Association Task (DAT) and the Alternative Uses Task (AUT). Contrary to expectations, we found no evidence of increased creative performance over the past 18-24 months, with GPT-4 performing worse than in previous studies. For the more widely used AUT, all models performed on average better than the average human, with GPT-4o and o3-mini performing best. However, only 0.28% of LLM-generated responses reached the top 10% of human creativity benchmarks. Beyond inter-model differences, we document substantial intra-model variability: the same LLM, given the same prompt, can produce outputs ranging from below-average to original. This variability has important implications for both creativity research and practical applications. Ignoring such variability risks misjudging the creative potential of LLMs, either inflating or underestimating their capabilities. The choice of prompts affected LLMs differently. Our findings underscore the need for more nuanced evaluation frameworks and highlight the importance of model selection, prompt design, and repeated assessment when using Generative AI (GenAI) tools in creative contexts.

A framework for collaborating a large language model tool in brainstorming for triggering creative thoughts

Hung-Fu Chang and Tong Li

A comparative approach to assessing linguistic creativity of large language models and humans

Anca Dinu et al.

The following paper introduces a general linguistic creativity test for humans and Large Language Models (LLMs). The test consists of various tasks aimed at assessing their ability to generate new original words and phrases based on word formation processes (derivation and compounding) and on metaphorical language use. We administered the test to 24 humans and to an equal number of LLMs, and we automatically evaluated their answers using OCSAI tool for three criteria: Originality, Elaboration, and Flexibility. The results show that LLMs not only outperformed humans in all the assessed criteria, but did better in six out of the eight test tasks. We then computed the uniqueness of the individual answers, which showed some minor differences between humans and LLMs. Finally, we performed a short manual analysis of the dataset, which revealed that humans are more inclined towards E(extending)-creativity, while LLMs favor F(ixed)-creativity.

Evaluating creativity: Can LLMs be good evaluators in creative writing tasks?

Sungeun Kim and Dongsuk Oh

Evaluating creative output with generative artificial intelligence: Comparing GPT models and human experts in idea evaluation

Theresa Kranzle and Katelyn Sharratt

Traditional techniques for evaluating creative outcomes are typically based on evaluations made by human experts. These methods suffer from challenges such as subjectivity, biases, limited availability, ‘crowding’, and high transaction costs. We propose that large language models (LLMs) can be used to overcome these shortcomings. However, there is a dearth of research comparing the performance of LLMs to traditional expert evaluations for evaluating creative outcomes such as ideas. Our study compares the alignment of expert evaluations with evaluations from the LLM GPT-4. Our results reveal that to achieve moderate evaluation alignment with experts, LLMs require using a base framework and a spectrum-based few-shot prompt. We offer six theoretical contributions, shifting the focus from whether LLMs can evaluate to how specific design choices shape their alignment with human judgement. These insights are situated within broader frameworks from cognitive science, creativity theory, and machine learning. Furthermore, we outline six propositions for organizations interested in LLM-supported evaluation methods. Key recommendations include utilizing base frameworks for large-scale idea screening, establishing a database of evaluated ideas to optimize few-shot performance, and leveraging AI-human collaboration for internal and external idea sourcing. Additionally, we highlight the need for privacy considerations when using third-party LLMs for proprietary idea evaluations. This research contributes to innovation management literature by exploring methods for integrating LLM into creative evaluation processes to enhance scalability and efficiency while retaining evaluation quality.

Testing language creativity of large language models and humans

Anca Dinu and Andra-Maria Florescu

Proceedings of the 5th international conference on natural language processing for digital humanities

Deep associations, high creativity: a simple yet effective metric for evaluating large language models

Ziliang Qiu and Renfen Hu

The evaluation of LLMs' creativity represents a crucial research domain, though challenges such as data contamination and costly human assessments often impede progress. Drawing inspiration from human creativity assessment, we propose PACE, asking LLMs to generate Parallel Association Chains to Evaluate their creativity. PACE minimizes the risk of data contamination and offers a straightforward, highly efficient evaluation, as evidenced by its strong correlation with Chatbot Arena Creative Writing rankings (Spearman's $\rho = 0.739$, $p < 0.001$) across various proprietary and open-source models. A comparative analysis of associative creativity between LLMs and humans reveals that while high-performing LLMs achieve scores comparable to average human performance, professional humans consistently outperform LLMs. Furthermore, linguistic analysis reveals that both humans and LLMs exhibit a trend of decreasing concreteness in their associations, with humans demonstrating a greater diversity of associative patterns.
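The reported agreement with leaderboard rankings is an ordinary rank correlation. As a quick illustration (with made-up placeholder scores, not the paper's data), it can be computed as follows; note the sign flip so that higher scores align with better (lower-numbered) ranks.

```python
from scipy.stats import spearmanr

# Placeholder numbers for illustration only (not the paper's data):
# a metric score per model and the corresponding leaderboard rank (1 = best).
pace_scores = [0.71, 0.64, 0.58, 0.52, 0.47]   # hypothetical PACE-style scores
arena_ranks = [1, 2, 4, 3, 5]                  # hypothetical creative-writing ranks

# Negate the ranks so that "higher is better" on both sides, then measure
# monotonic agreement between the two orderings with Spearman's rho.
rho, p_value = spearmanr(pace_scores, [-r for r in arena_ranks])
print(f"Spearman's rho = {rho:.3f}, p = {p_value:.3f}")
```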

Automating chemosensory creativity assessment with large language models

Qian Janice Wang and Robert Pellegrino

How do humans and language models reason about creativity? A comparative analysis

Antonio Laverghetta Jr et al.

Creativity assessment in science and engineering is increasingly based on both human and AI judgment, but the cognitive processes and biases behind these evaluations remain poorly understood. We conducted two experiments examining how including example solutions with ratings impacts creativity evaluation, using a fine-grained annotation protocol where raters were tasked with explaining their originality scores and rating the facets of remoteness (whether the response is "far" from everyday ideas), uncommonness (whether the response is rare), and cleverness. In Study 1, we analyzed creativity ratings from 72 experts with formal science or engineering training, comparing those who received example solutions with ratings (example) to those who did not (no example). Computational text analysis revealed that, compared to experts with examples, no-example experts used more comparative language (e.g., "better/worse") and emphasized solution uncommonness, suggesting they may have relied more on memory retrieval for comparisons. In Study 2, parallel analyses with state-of-the-art LLMs revealed that models prioritized uncommonness and remoteness of ideas when rating originality, suggesting an evaluative process rooted around the semantic similarity of ideas. In the example condition, while LLM accuracy in predicting the true originality scores improved, the correlations of remoteness, uncommonness, and cleverness with originality also increased substantially – to upwards of $0.99$ – suggesting a homogenization in the LLMs' evaluation of the individual facets. These findings highlight important implications for how humans and AI reason about creativity and suggest diverging preferences for what different populations prioritize when rating.

Automated scoring of creative problem solving with large language models: A comparison of originality and quality ratings.

Simone A. Luchini et al.

Rethinking creativity evaluation: a critical analysis of existing creativity evaluations

Li-Chun Lu et al.

We systematically examine, analyze, and compare representative creativity measures–creativity index, perplexity, syntactic templates, and LLM-as-a-Judge–across diverse creative domains, including creative writing, unconventional problem-solving, and research ideation. Our analyses reveal that these metrics exhibit limited consistency, capturing different dimensions of creativity. We highlight key limitations, including the creativity index's focus on lexical diversity, perplexity's sensitivity to model confidence, and syntactic templates' inability to capture conceptual creativity. Additionally, LLM-as-a-Judge shows instability and bias. Our findings underscore the need for more robust, generalizable evaluation frameworks that better align with human judgments of creativity.

Evaluating creative short story generation in humans and large language models

Mete Ismayilzada et al.

Story-writing is a fundamental aspect of human imagination, relying heavily on creativity to produce narratives that are novel, effective, and surprising. While large language models (LLMs) have demonstrated the ability to generate high-quality stories, their creative story-writing capabilities remain under-explored. In this work, we conduct a systematic analysis of creativity in short story generation across 60 LLMs and 60 people using a five-sentence cue-word-based creative story-writing task. We use measures to automatically evaluate model- and human-generated stories across several dimensions of creativity, including novelty, surprise, diversity, and linguistic complexity. We also collect creativity ratings and Turing Test classifications from non-expert and expert human raters and LLMs. Automated metrics show that LLMs generate stylistically complex stories, but tend to fall short in terms of novelty, surprise and diversity when compared to average human writers. Expert ratings generally coincide with automated metrics. However, LLMs and non-experts rate LLM stories to be more creative than human-generated stories. We discuss why and how these differences in ratings occur, and their implications for both human and artificial creativity.

A survey on large language model hallucination via a creativity perspective

Xuhui Jiang et al.

Hallucinations in large language models (LLMs) are always seen as limitations. However, could they also be a source of creativity? This survey explores this possibility, suggesting that hallucinations may contribute to LLM application by fostering creativity. This survey begins with a review of the taxonomy of hallucinations and their negative impact on LLM reliability in critical applications. Then, through historical examples and recent relevant theories, the survey explores the potential creative benefits of hallucinations in LLMs. To elucidate the value and evaluation criteria of this connection, we delve into the definitions and assessment methods of creativity. Following the framework of divergent and convergent thinking phases, the survey systematically reviews the literature on transforming and harnessing hallucinations for creativity in LLMs. Finally, the survey discusses future research directions, emphasizing the need to further explore and refine the application of hallucinations in creative processes within LLMs.

On the creativity of large language models

Giorgio Franceschelli and Mirco Musolesi

Large language models (LLMs) are revolutionizing several areas of Artificial Intelligence. One of the most remarkable applications is creative writing, e.g., poetry or storytelling: the generated outputs are often of astonishing quality. However, a natural question arises: can LLMs be really considered creative? In this article, we first analyze the development of LLMs under the lens of creativity theories, investigating the key open questions and challenges. In particular, we focus our discussion on the dimensions of value, novelty, and surprise as proposed by Margaret Boden in her work. Then, we consider different classic perspectives, namely product, process, press, and person. We discuss a set of “easy” and “hard” problems in machine creativity, presenting them in relation to LLMs. Finally, we examine the societal impact of these technologies with a particular focus on the creative industries, analyzing the opportunities offered, the challenges arising from them, and the potential associated risks, from both legal and ethical points of view.

Automated creativity evaluation for large language models: a reference-based approach

Ruizhe Li et al.

Creative writing is a key capability of Large Language Models (LLMs), with potential applications in literature, storytelling, and various creative domains. However, evaluating the creativity of machine-generated texts remains a significant challenge, as existing methods either rely on costly manual annotations or fail to align closely with human assessments. In this paper, we propose an effective automated evaluation method based on the Torrance Test of Creative Writing (TTCW), which evaluates creativity as product. Our method employs a reference-based Likert-style approach, scoring generated creative texts relative to high-quality reference texts across various tests. Experimental results demonstrate that our method significantly improves the alignment between LLM evaluations and human assessments, achieving a pairwise accuracy of 0.75 (+15\%).
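Pairwise accuracy of the kind reported above is simply the fraction of human-judged pairs on which an automatic evaluator picks the same winner. A minimal sketch follows, with the scoring function left abstract; `auto_score` is a hypothetical stand-in for any automatic creativity score, such as a reference-based Likert-style score.

```python
def pairwise_accuracy(pairs, auto_score):
    """Fraction of pairs where the automatic evaluator agrees with humans.

    `pairs` is a list of (preferred_text, other_text) tuples ordered by human
    judgment; `auto_score(text)` is any automatic creativity scoring function,
    left abstract here.
    """
    agreements = sum(auto_score(preferred) > auto_score(other)
                     for preferred, other in pairs)
    return agreements / len(pairs)
```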

A causality-aware paradigm for evaluating creativity of multimodal large language models

Zhongzhan Huang et al.

Noveltybench: Evaluating creativity and diversity in language models

Yiming Zhang et al.

Second conference on language modeling

Automatic scoring of metaphor creativity with large language models

Paul V. DiStefano et al.

Divergent creativity in humans and large language models

Antoine Bellemare-Pepin et al.

The recent surge of Large Language Models (LLMs) has led to claims that they are approaching a level of creativity akin to human capabilities. This idea has sparked a blend of excitement and apprehension. However, a critical piece that has been missing in this discourse is a systematic evaluation of LLMs' semantic diversity, particularly in comparison to human divergent thinking. To bridge this gap, we leverage recent advances in computational creativity to analyze semantic divergence in both state-of-the-art LLMs and a substantial dataset of 100,000 humans. We found evidence that LLMs can surpass average human performance on the Divergent Association Task, and approach human creative writing abilities, though they fall short of the typical performance of highly creative humans. Notably, even the top performing LLMs are still largely surpassed by highly creative individuals, underscoring a ceiling that current LLMs still fail to surpass. Our human-machine benchmarking framework addresses the polemic surrounding the imminent replacement of human creative labour by AI, disentangling the quality of the respective creative linguistic outputs using established objective measures. While prompting deeper exploration of the distinctive elements of human inventive thought compared to those of AI systems, we lay out a series of techniques to improve their outputs with respect to semantic diversity, such as prompt design and hyper-parameter tuning.
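The Divergent Association Task score referenced here is commonly computed as the mean pairwise semantic distance between a set of nouns. Below is a sketch under that assumption, where `embed` is a hypothetical function mapping a word to a vector (e.g. from any word-embedding model) and the 0-100 scaling follows the common DAT convention.

```python
from itertools import combinations
import numpy as np

def dat_score(words, embed):
    """Mean pairwise cosine distance between word embeddings, scaled to 0-100.

    `embed(word)` is a hypothetical function returning a 1-D numpy vector.
    """
    vectors = [np.asarray(embed(w), dtype=float) for w in words]
    distances = []
    for a, b in combinations(vectors, 2):
        cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        distances.append(1.0 - cosine_sim)
    return 100.0 * float(np.mean(distances))
```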

Is temperature the creativity parameter of large language models?

Max Peeperkorn et al.

Large language models (LLMs) are applied to all sorts of creative tasks, and their outputs vary from beautiful, to peculiar, to pastiche, into plain plagiarism. The temperature parameter of an LLM regulates the amount of randomness, leading to more diverse outputs; therefore, it is often claimed to be the creativity parameter. Here, we investigate this claim using a narrative generation task with a predetermined fixed context, model and prompt. Specifically, we present an empirical analysis of the LLM output for different temperature values using four necessary conditions for creativity in narrative generation: novelty, typicality, cohesion, and coherence. We find that temperature is weakly correlated with novelty, and unsurprisingly, moderately correlated with incoherence, but there is no relationship with either cohesion or typicality. However, the influence of temperature on creativity is far more nuanced and weak than suggested by the "creativity parameter" claim; overall results suggest that the LLM generates slightly more novel outputs as temperatures get higher. Finally, we discuss ideas to allow more controlled LLM creativity, rather than relying on chance via changing the temperature parameter.
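For reference, temperature enters next-token sampling as a simple rescaling of the logits before the softmax; the snippet below is a generic illustration of that mechanism and is not tied to any particular model or to the paper's experimental setup.

```python
import numpy as np

def sample_token(logits, temperature=1.0, rng=None):
    """Sample a token index from logits scaled by a temperature.

    Lower temperature sharpens the distribution (more deterministic output);
    higher temperature flattens it (more diverse, potentially less coherent).
    """
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
    probs = np.exp(scaled - scaled.max())      # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```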

On characterizations of large language models and creativity evaluation

Max Peeperkorn et al.

Evaluating large language model creativity from a literary perspective

Murray Shanahan and Catherine Clarke

This paper assesses the potential for large language models (LLMs) to serve as assistive tools in the creative writing process, by means of a single, in-depth case study. In the course of the study, we develop interactive and multi-voice prompting strategies that interleave background descriptions (scene setting, plot elements), instructions that guide composition, samples of text in the target style, and critical discussion of the given samples. We qualitatively evaluate the results from a literary critical perspective, as well as from the standpoint of computational creativity (a sub-field of artificial intelligence). Our findings lend support to the view that the sophistication of the results that can be achieved with an LLM mirrors the sophistication of the prompting.

The language of creativity: Evidence from humans and large language models

William Orwig et al.

Recent developments in computerized scoring via semantic distance have provided automated assessments of verbal creativity. Here, we extend past work, applying computational linguistic approaches to characterize salient features of creative text. We hypothesize that, in addition to semantic diversity, the degree to which a story includes perceptual details, thus transporting the reader to another time and place, would be predictive of creativity. Additionally, we explore the use of generative language models to supplement human data collection and examine the extent to which machine-generated stories can mimic human creativity. We collect 600 short stories from human participants and GPT-3, subsequently randomized and assessed on their creative quality. Results indicate that the presence of perceptual details, in conjunction with semantic diversity, is highly predictive of creativity. These results were replicated in an independent sample of stories (n = 120) generated by GPT-4. We do not observe a significant difference between human and AI-generated stories in terms of creativity ratings, and we also observe positive correlations between human and AI assessments of creativity. Implications and future directions are discussed.

Art or artifice? Large language models and the false promise of creativity

Tuhin Chakrabarty et al.

Proceedings of the CHI conference on human factors in computing systems

Assessing and understanding creativity in large language models

Yunpu Zhao et al.

In the field of natural language processing, the rapid development of large language model (LLM) has attracted increasing attention. LLMs have shown a high level of creativity in various tasks, but the methods for assessing such creativity are inadequate. Assessment of LLM creativity needs to consider differences from humans, requiring multiple dimensional measurement while balancing accuracy and efficiency. This paper aims to establish an efficient framework for assessing the level of creativity in LLMs. By adapting the modified Torrance tests of creative thinking, the research evaluates the creative performance of various LLMs across 7 tasks, emphasizing 4 criteria including fluency, flexibility, originality, and elaboration. In this context, we develop a comprehensive dataset of 700 questions for testing and an LLM-based evaluation method. In addition, this study presents a novel analysis of LLMs’ responses to diverse prompts and role-play situations. We found that the creativity of LLMs primarily falls short in originality, while excelling in elaboration. In addition, the use of prompts and role-play settings of the model significantly influence creativity. Additionally, the experimental results also indicate that collaboration among multiple LLMs can enhance originality. Notably, our findings reveal a consensus between human evaluations and LLMs regarding the personality traits that influence creativity. The findings underscore the significant impact of LLM design on creativity and bridge artificial intelligence and human creativity, offering insights into LLMs’ creativity and potential applications.

Exploring Multimodal Relation Extraction of Hierarchical Tabular Data with Multi-task Learning

Xinyu Zhang

Relation Extraction (RE) is a key task in table understanding, aiming to extract semantic relations between columns. However, it is hard to obtain high-quality textual formats (e.g., Markdown) as input for complex tables with hierarchical headers under practical scenarios like webpage screenshots and scanned documents, while table images are more accessible and intuitive. Besides, existing works overlook the need for mining relations among multiple columns rather than just the semantic relation between two specific columns in real-world practice. In this work, we explore utilizing Multimodal Large Language Models (MLLMs) to address RE in tables with complex structures. We creatively extend the concept of RE to include calculational relations, enabling multi-task learning of both semantic and calculational RE for mutual reinforcement. Specifically, we reconstruct table images into graph structure based on neighboring nodes to extract graph-level visual features. Such feature enhancement alleviates the insensitivity of MLLMs to the positional information within table images. We then propose a Chain-of-Thought distillation framework with a self-correction mechanism to enhance MLLMs’ reasoning capabilities without increasing parameter scale. Our method significantly outperforms most baselines on wide datasets. Additionally, we release a benchmark dataset for calculational RE in complex tables.

2025ACLNot Relevant

Cautious Next Token Prediction

Yizhou Wang

The next token prediction paradigm has been prevailing for autoregressive models in the era of LLMs. The current default sampling choice for popular LLMs is temperature scaling together with nucleus sampling to balance diversity and coherence. Nevertheless, such an approach leads to inferior performance in various NLP tasks when the model is not certain about testing questions. To this end, we propose a brand new training-free decoding strategy, dubbed Cautious Next Token Prediction (CNTP). In the decoding process, if the model has comparatively high prediction entropy at a certain step, we sample multiple trials starting from the step independently and stop when encountering any punctuation. Then we select the trial with the lowest perplexity score viewed as the most probable and reliable trial path given the model’s capacity. The trial number is negatively correlated with the prediction confidence, i.e., the less confident the model is, the more trials it should sample. This is consistent with human beings’ behaviour: when feeling uncertain or unconfident, one tends to think more creatively, exploring multiple thinking paths, to cautiously select the path one feels most confident about. Extensive experiments on both LLMs and MLLMs show that our proposed CNTP approach outperforms existing standard decoding strategies consistently by a clear margin. Moreover, the integration of CNTP with self consistency can further improve over vanilla self consistency. We believe our proposed CNTP has the potential to become one of the default choices for LLM decoding. Code is available at https://github.com/wyzjack/CNTP.
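A rough sketch of the decoding idea the abstract describes, with the model interface left abstract: `next_token_probs` and `complete_trial` are hypothetical stand-ins, and the entropy threshold and trial budget are illustrative choices, not the paper's settings.

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def cautious_step(context, next_token_probs, complete_trial,
                  entropy_threshold=2.0, max_trials=8):
    """One decoding step in the spirit of cautious next-token prediction.

    `next_token_probs(context)` returns the model's next-token distribution;
    `complete_trial(context)` samples a continuation up to the next punctuation
    mark and returns (text, perplexity). Both are hypothetical interfaces.
    """
    h = entropy(next_token_probs(context))
    if h < entropy_threshold:
        # Confident: a single standard continuation suffices.
        return complete_trial(context)[0]
    # Uncertain: spend more trials the less confident the model is.
    n_trials = min(max_trials, max(2, int(h)))
    trials = [complete_trial(context) for _ in range(n_trials)]
    best_text, _ = min(trials, key=lambda t: t[1])  # lowest perplexity wins
    return best_text
```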

2025FINDINGSNot Relevant

KIA: Knowledge-Guided Implicit Vision-Language Alignment for Chest X-Ray Report Generation

Heng Yin

Report generation (RG) faces challenges in understanding complex medical images and establishing cross-modal semantic alignment in radiology image-report pairs. Previous methods often overlook fine-grained cross-modal interaction, leading to insufficient understanding of detailed information. Recently, various large multimodal models have been proposed for image-text tasks. However, such models still underperform on rare domain tasks like understanding complex medical images. To address these limitations, we develop a new framework of Knowledge-guided Implicit vision-language Alignment for radiology report generation, named KIA. To better understand medical reports and images and build alignment between them, multi-task implicit alignment is creatively introduced, forming comprehensive understanding of medical images and reports. Additionally, to further meet medical refinement requirements, we design novel masking strategies guided by medical knowledge to enhance pathological observation and anatomical landm

2025COLINGNot Relevant

SemEval-2023 Task 2: Fine-grained Multilingual Named Entity Recognition (MultiCoNER 2)

Besnik Fetahu

We present the findings of SemEval-2023 Task 2 on Fine-grained Multilingual Named Entity Recognition (MultiCoNER 2). Divided into 13 tracks, the task focused on methods to identify complex fine-grained named entities (like WRITTENWORK, VEHICLE, MUSICALGRP) across 12 languages, in both monolingual and multilingual scenarios, as well as noisy settings. The task used the MultiCoNER V2 dataset, composed of 2.2 million instances in Bangla, Chinese, English, Farsi, French, German, Hindi, Italian, Portuguese, Spanish, Swedish, and Ukrainian. MultiCoNER 2 was one of the most popular tasks of SemEval-2023. It attracted 842 submissions from 47 teams, and 34 teams submitted system papers. Results showed that complex entity types such as media titles and product names were the most challenging. Methods fusing external knowledge into transformer models achieved the best performance, and the largest gains were on the Creative Work and Group classes, which are still challenging even with external knowledge. Some fine-grained classes proved to be more challenging than others, such as SCIENTIST, ARTWORK, and PRIVATECORP. We also observed that noisy data has a significant impact on model performance, with an average drop of 10% on the noisy subset. The task highlights the need for future research on improving NER robustness on noisy data containing complex entities.

2023SEMEVALNot Relevant

How Does the Disclosure of AI Assistance Affect the Perceptions of Writing?

Zhuoyan Li

Recent advances in generative AI technologies like large language models have boosted the incorporation of AI assistance in writing workflows, leading to the rise of a new paradigm of human-AI co-creation in writing. To understand how people perceive writings that are produced under this paradigm, in this paper, we conduct an experimental study to understand whether and how the disclosure of the level and type of AI assistance in the writing process would affect people’s perceptions of the writing on various aspects, including their evaluation on the quality of the writing, and their ranking of different writings. Our results suggest that disclosing the AI assistance in the writing process, especially if AI has provided assistance in generating new content, decreases the average quality ratings for both argumentative essays and creative stories. This decrease in the average quality ratings often comes with an increased level of variations in different individuals’ quality evaluations of the same writing. Indeed, factors such as an individual’s writing confidence and familiarity with AI writing assistants are shown to moderate the impact of AI assistance disclosure on their writing quality evaluations. We also find that disclosing the use of AI assistance may significantly reduce the proportion of writings produced with AI’s content generation assistance among the top-ranked writings.

2024EMNLPNot Relevant

VISIAR: Empower MLLM for Visual Story Ideation

Zhaoyang Xia

Ideation, the process of forming ideas from concepts, is a big part of the content creation process. However, the noble goal of helping visual content creators by suggesting meaningful sequences of visual assets from a limited collection is challenging. It requires a nuanced understanding of visual assets and the integration of open-world knowledge to support creative exploration. Despite its importance, this task has yet to be explored fully in existing literature. To fill this gap, we propose Visual Story Ideation, a novel and underexplored task focused on the automated selection and arrangement of visual assets into coherent sequences that convey expressive storylines. We also present VISIAR, Visual Ideation through Sequence Integration and Asset Rearrangement, a robust framework leveraging Multimodal Large Language Models (MLLMs), and a novel Story Graph mechanism. Our framework operates in three key stages: visual content understanding, candidate asset selection, and asset rearrangement via MLLMs. In addition, we curated a new benchmark dataset, called VTravel, to evaluate our methods both qualitatively and quantitatively. User studies and GPT-as-the-judge evaluation show that our approach surpasses GPT-4o based baseline by an average of 33.5% and 18.5% across three different metrics, demonstrating the effectiveness of our framework for generating compelling visual stories.

2025FINDINGSNot Relevant

Investigating Neural Machine Translation for Low-Resource Languages: Using Bavarian as a Case Study

Wan-hua Her

Machine Translation has made impressive progress in recent years offering close to human-level performance on many languages, but studies have primarily focused on high-resource languages with broad online presence and resources. With the help of growing Large Language Models, more and more low-resource languages achieve better results through the presence of other languages. However, studies have shown that not all low-resource languages can benefit from multilingual systems, especially those with insufficient training and evaluation data. In this paper, we revisit state-of-the-art Neural Machine Translation techniques to develop automatic translation systems between German and Bavarian. We investigate conditions of low-resource languages such as data scarcity and parameter sensitivity and focus on refined solutions that combat low-resource difficulties and creative solutions such as harnessing language similarity. Our experiment entails applying Back-translation and Transfer Learning to automatically generate more training data and achieve higher translation performance. We demonstrate noisiness in the data and present our approach to carry out text preprocessing extensively. Evaluation was conducted using combined metrics: BLEU, chrF and TER. Statistical significance results with Bonferroni correction show surprisingly high baseline systems, and that Back-translation leads to significant improvement. Furthermore, we present a qualitative analysis of translation errors and system limitations.

2024SIGUL, WSNot Relevant

Tailoring with Targeted Precision: Edit-Based Agents for Open-Domain Procedure Customization

Yash Kumar Lal

How-to procedures, such as how to plant a garden, are now used by millions of users, but sometimes need customizing to meet a user’s specific needs, e.g., planting a garden without pesticides. Our goal is to measure and improve an LLM’s ability to perform such customization. Our approach is to test several simple multi-LLM-agent architectures for customization, as well as an end-to-end LLM, using a new evaluation set, called CustomPlans, of over 200 WikiHow procedures each with a customization need. We find that a simple architecture with two LLM agents used sequentially performs best, one that edits a generic how-to procedure and one that verifies its executability, significantly outperforming (10.5% absolute) an end-to-end prompted LLM. This suggests that LLMs can be configured reasonably effectively for procedure customization. This also suggests that multi-agent editing architectures may be worth exploring further for other customization applications (e.g. coding, creative writing) in the future.

2024FINDINGSNot Relevant

CMDAG: A Chinese Metaphor Dataset with Annotated Grounds as CoT for Boosting Metaphor Generation

Yujie Shao

Metaphor is a prominent linguistic device in human language and literature, as it adds color, imagery, and emphasis to enhance effective communication. This paper introduces a large-scale, high-quality annotated Chinese Metaphor Corpus, which comprises around 28K sentences drawn from a diverse range of Chinese literary sources, such as poems, prose, song lyrics, etc. To ensure the accuracy and consistency of our annotations, we introduce a comprehensive set of guidelines. These guidelines address the facets of metaphor annotation, from identifying tenors, vehicles, and grounds to handling the complexities of similes, personifications, juxtapositions, and hyperboles. Breaking tradition, our approach to metaphor generation emphasizes tenors and their distinct features rather than the conventional combination of tenors and vehicles. By integrating “ground” as a CoT (Chain of Thoughts) input, we are able to generate metaphors that resonate more with real-world intuition. We test generative models such as Belle, Baichuan, and Chinese-alpaca-33B using our annotated corpus. These models are able to generate creative and fluent metaphor sentences more frequently when induced by selected samples from our dataset, demonstrating the value of our corpus for Chinese metaphor research.

2024LREC, COLINGRelevant
related to creativity › related to creativity as a human abilityevaluation › human eval

ViPE: Visualise Pretty-much Everything

Hassan Shahmohammadi

Figurative and non-literal expressions are profoundly integrated in human communication. Visualising such expressions allow us to convey our creative thoughts, and evoke nuanced emotions. Recent text-to-image models like Stable Diffusion, on the other hand, struggle to depict non-literal expressions. Recent works primarily deal with this issue by compiling humanly annotated datasets on a small scale, which not only demands specialized expertise but also proves highly inefficient. To address this issue, we introduce ViPE: Visualise Pretty-much Everything. ViPE offers a series of lightweight and robust language models that have been trained on a large-scale set of lyrics with noisy visual descriptions that represent their implicit meaning. The synthetic visual descriptions are generated by GPT3.5 relying on neither human annotations nor images. ViPE effectively expresses any arbitrary piece of text into a visualisable description, enabling meaningful and high-quality image generation. We provide compelling evidence that ViPE is more robust than GPT3.5 in synthesising visual elaborations. ViPE also exhibits an understanding of figurative expressions comparable to human experts, providing a powerful and open-source backbone to many downstream applications such as music video and caption generation.

2023EMNLPNot Relevant

Generation and Evaluation of English Grammar Multiple-Choice Cloze Exercises

Nicolò Donati

English grammar Multiple-Choice Cloze (MCC) exercises are crucial for improving learners’ grammatical proficiency and comprehension skills. However, creating these exercises is labour-intensive and requires expert knowledge. Effective MCC exercises must be contextually relevant and engaging, incorporating distractors—plausible but incorrect alternatives—to balance difficulty and maintain learner motivation. Despite the increasing interest in utilizing large language models (LLMs) in education, their application in generating English grammar MCC exercises is still limited. Previous methods typically impose constraints on LLMs, producing grammatically correct yet uncreative results. This paper explores the potential of LLMs to independently generate diverse and contextually relevant MCC exercises without predefined limitations. We hypothesize that LLMs can craft self-contained sentences that foster learner’s communicative competence. Our analysis of existing MCC exercise datasets revealed issues of diversity, completeness, and correctness. Furthermore, we address the lack of a standardized automatic metric for evaluating the quality of generated exercises. Our contributions include developing an LLM-based solution for generating MCC exercises, curating a comprehensive dataset spanning 19 grammar topics, and proposing an automatic metric validated against human expert evaluations. This work aims to advance the automatic generation of English grammar MCC exercises, enhancing both their quality and creativity.

2024CLICITNot Relevant

AnaloBench: Benchmarking the Identification of Abstract and Long-context Analogies

Xiao Ye

Humans regularly engage in analogical thinking, relating personal experiences to current situations (X is analogous to Y because of Z). Analogical thinking allows humans to solve problems in creative ways, grasp difficult concepts, and articulate ideas more effectively. Can language models (LMs) do the same? To answer this question, we propose AnaloBench, a benchmark to determine analogical reasoning ability in LMs. Our benchmarking approach focuses on aspects of this ability that are common among humans: (i) recalling related experiences from a large amount of information, and (ii) applying analogical reasoning to complex and lengthy scenarios. We collect a set of 340 high quality, human written analogies for use in our benchmark, which constitutes the largest such collection to date. We then test a broad collection of models consisting of 12 open source and 3 proprietary in various sizes and architectures. As in prior results, scaling up LMs results in some performance boosts. Surprisingly, scale offers minimal gains when, (i) analogies involve lengthy scenarios, or (ii) recalling relevant scenarios from a large pool of information, a process analogous to finding a needle in a haystack. We hope these observations encourage further research in this field.

2024EMNLPNot Relevant

mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus

Matthieu Futeral

Multimodal Large Language Models (mLLMs) are trained on a large amount of text-image data. While most mLLMs are trained on caption-like data only, Alayrac et al. (2022) showed that additionally training them on interleaved sequences of text and images can lead to the emergence of in-context learning capabilities. However, the dataset they used, M3W, is not public and is only in English. There have been attempts to reproduce their results but the released datasets are English-only. In contrast, current multilingual and multimodal datasets are either composed of caption-like only or medium-scale or fully private data. This limits mLLM research for the 7,000 other languages spoken in the world. We therefore introduce mOSCAR, to the best of our knowledge the first large-scale multilingual and multimodal document corpus crawled from the web. It covers 163 languages, 303M documents, 200B tokens and 1.15B images. We carefully conduct a set of filtering and evaluation steps to make sure mOSCAR is sufficiently safe, diverse and of good quality. We additionally train two types of multilingual model to prove the benefits of mOSCAR: (1) a model trained on a subset of mOSCAR and captioning data and (2) a model trained on captioning data only. The model additionally trained on mOSCAR shows a strong boost in few-shot learning performance across various multilingual image-text tasks and benchmarks, confirming previous findings for English-only mLLMs. The dataset will be made publicly accessible under the Creative Commons CC BY 4.0 license.

2025FINDINGSNot Relevant

BAMO at SemEval-2024 Task 9: BRAINTEASER: A Novel Task Defying Common Sense

Baktash Ansari

This paper outlines our approach to SemEval 2024 Task 9, BRAINTEASER: A Novel Task Defying Common Sense. The task aims to evaluate the ability of language models to think creatively. The dataset comprises multi-choice questions that challenge models to think ‘outside of the box’. We fine-tune 2 models, BERT and RoBERTa Large. Next, we employ a Chain of Thought (CoT) zero-shot prompting approach with 6 large language models, such as GPT-3.5, Mixtral, and Llama2. Finally, we utilize ReConcile, a technique that employs a ‘round table conference’ approach with multiple agents for zero-shot learning, to generate consensus answers among 3 selected language models. Our best method achieves an overall accuracy of 85 percent on the sentence puzzles subtask.

2024SEMEVALNot Relevant

Evaluating Diversity in Automatic Poetry Generation

Yanran Chen

Natural Language Generation (NLG), and more generally generative AI, are among the currently most impactful research fields. Creative NLG, such as automatic poetry generation, is a fascinating niche in this area. While most previous research has focused on forms of the Turing test when evaluating automatic poetry generation — can humans distinguish between automatic and human generated poetry — we evaluate the diversity of automatically generated poetry (with a focus on quatrains), by comparing distributions of generated poetry to distributions of human poetry along structural, lexical, semantic and stylistic dimensions, assessing different model types (word vs. character-level, general purpose LLMs vs. poetry-specific models), including the very recent LLaMA3-8B, and types of fine-tuning (conditioned vs. unconditioned). We find that current automatic poetry systems are considerably underdiverse along multiple dimensions — they often do not rhyme sufficiently, are semantically too uniform and even do not match the length distribution of human poetry. Our experiments reveal, however, that style-conditioning and character-level modeling clearly increases diversity across virtually all dimensions we explore. Our identified limitations may serve as the basis for more genuinely diverse future poetry generation models.

2024EMNLPRelevant

DeepRapper: Neural Rap Generation with Rhyme and Rhythm Modeling

Lanqing Xue

Rap generation, which aims to produce lyrics and corresponding singing beats, needs to model both rhymes and rhythms. Previous works for rap generation focused on rhyming lyrics, but ignored rhythmic beats, which are important for rap performance. In this paper, we develop DeepRapper, a Transformer-based rap generation system that can model both rhymes and rhythms. Since there are no available rap datasets with rhythmic beats, we develop a data mining pipeline to collect a large-scale rap dataset, which includes a large number of rap songs with aligned lyrics and rhythmic beats. Second, we design a Transformer-based autoregressive language model which carefully models rhymes and rhythms. Specifically, we generate lyrics in the reverse order with rhyme representation and constraint for rhyme enhancement, and insert a beat symbol into lyrics for rhythm/beat modeling. To our knowledge, DeepRapper is the first system to generate rap with both rhymes and rhythms. Both objective and subjective evaluations demonstrate that DeepRapper generates creative and high-quality raps with rhymes and rhythms.

2021ACL, IJCNLPNot Relevant

ACT: Knowledgeable Agents to Design and Perform Complex Tasks

Makoto Nakatsuji

Large language models enhance collaborative task execution in multi-agent systems. Current studies break complex tasks into manageable tasks, but agents lack understanding of the overall task and how others approach their tasks, hindering synergy and integration. We propose a method called knowledgeable Agents to design and perform Complex Tasks (ACT), where: (1) Agents independently manage their knowledge and tasks while collaboratively designing the complex task into a more comprehensible form. In parallel, each agent also acquires knowledge of others, defined as a structured description of how other agents approach their tasks based on the agent’s own task resolution. (2) Each agent updates its knowledge and refines its task through interactions with others. By referencing structured knowledge, they effectively integrate their tasks to collaboratively solve the complex task. Three evaluations, including creative writing and tool utilization, show that ACT outperforms existing methods in solving complex tasks.

2025ACLRelevant

IGA: An Intent-Guided Authoring Assistant

Simeng Sun

While large-scale pretrained language models have significantly improved writing assistance functionalities such as autocomplete, more complex and controllable writing assistants have yet to be explored. We leverage advances in language modeling to build an interactive writing assistant that generates and rephrases text according to fine-grained author specifications. Users provide input to our Intent-Guided Assistant (IGA) in the form of text interspersed with tags that correspond to specific rhetorical directives (e.g., adding description or contrast, or rephrasing a particular sentence). We fine-tune a language model on a dataset heuristically-labeled with author intent, which allows IGA to fill in these tags with generated text that users can subsequently edit to their liking. A series of automatic and crowdsourced evaluations confirm the quality of IGA’s generated outputs, while a small-scale user study demonstrates author preference for IGA over baseline methods in a creative writing task. We release our dataset, code, and demo to spur further research into AI-assisted writing.

2021EMNLPRelevant

BordIRlines: A Dataset for Evaluating Cross-lingual Retrieval Augmented Generation

Bryan Li

Large language models excel at creative generation but continue to struggle with the issues of hallucination and bias. While retrieval-augmented generation (RAG) provides a framework for grounding LLMs’ responses in accurate and up-to-date information, it still raises the question of bias: which sources should be selected for inclusion in the context? And how should their importance be weighted? In this paper, we study the challenge of cross-lingual RAG and present a dataset to investigate the robustness of existing systems at answering queries about geopolitical disputes, which exist at the intersection of linguistic, cultural, and political boundaries. Our dataset is sourced from Wikipedia pages containing information relevant to the given queries and we investigate the impact of including additional context, as well as the composition of this context in terms of language and source, on an LLM’s response. Our results show that existing RAG systems continue to be challenged by cross-lingual use cases and suffer from a lack of consistency when they are provided with competing information in multiple languages. We present case studies to illustrate these issues and outline steps for future research to address these challenges.

2024WIKINLP, WSRelevant

Language Models can Categorize System Inputs for Performance Analysis

Dominic Sobhani

Language model systems are used to process diverse categories of input requests, ranging from improving creative writing to solving programming challenges. It would be useful to know which categories they are good at. However, existing evaluations compare model performance on pre-defined categories, failing to reflect a system’s performance on finer-grained or novel ones. We propose to automatically search for finer-grained categories based on inputs where a system performs well or poorly, and describe them in natural language. To search for these categories, we propose a large number of candidate category descriptions, e.g. “Communication Improvement”, find the subset of inputs that match the category descriptions, and calculate the performance on these categories; then we sort these categories based on their performance, thereby highlighting those that score high or low. As one application, we apply our method to compare LLaMA 3-70B and Claude 3 Opus, which have similar Elo-ratings on Chatbot Arena; our method finds the former is weaker at making text more professional and humorous while better at providing psychological insights, depicting a more nuanced picture of model performance.

2025NAACLRelevant

JAMDEC: Unsupervised Authorship Obfuscation using Constrained Decoding over Small Language Models

Jillian Fisher

The permanence of online content combined with the enhanced authorship identification techniques calls for stronger computational methods to protect the identity and privacy of online authorship when needed, e.g., blind reviews for scientific papers, anonymous online reviews, or anonymous interactions in the mental health forums. In this paper, we propose an unsupervised inference-time approach to authorship obfuscation to address the unique challenges of authorship obfuscation: lack of supervision data for diverse authorship and domains, and the need for a sufficient level of revision beyond simple paraphrasing to obfuscate the authorship, all the while preserving the original content and fluency. We introduce JAMDEC, a user-controlled, inference-time algorithm for authorship obfuscation that can be in principle applied to any text and authorship. Our approach builds on small language models such as GPT2-XL in order to help avoid disclosing the original content to proprietary LLM APIs, while also reducing the performance gap between small and large language models via algorithmic enhancement. The key idea behind our approach is to boost the creative power of smaller language models through constrained decoding, while also allowing for user-specified controls and flexibility. Experimental results demonstrate that our approach based on GPT2-XL outperforms previous state-of-the-art methods based on comparably small models, while performing competitively against GPT3.5 175B, a proprietary model that is two orders of magnitude larger.

2024NAACLRelevant

ChatMusician: Understanding and Generating Music Intrinsically with LLM

Ruibin Yuan

While LLMs demonstrate impressive capabilities in musical knowledge, we find that music reasoning is still an unsolved task. We introduce ChatMusician, an open-source large language model (LLM) that integrates intrinsic musical abilities. It is based on continual pre-training and finetuning LLaMA2 on a text-compatible music representation, ABC notation, and the music is treated as a second language. ChatMusician can understand and generate music with a pure text tokenizer without external multi-modal neural structures or tokenizers. Interestingly, endowing musical abilities does not harm language abilities, even achieving a slightly higher MMLU score. ChatMusician is capable of composing well-structured, full-length music, condition on texts, chords, melodies, motifs, musical forms, etc. On our meticulously curated college-level music understanding benchmark, MusicTheoryBench, ChatMusician surpasses LLaMA2 and GPT-3.5 by a noticeable margin. We show that ChatMusician preserves or even surpasses the original LLaMA2 7B’s language abilities by evaluating on MMLU benchmark. Our work reveals that LLMs can be an excellent compressor for music, which can be seen as humanity’s creative language, but there remains significant territory to be conquered. We release our 5B token music-language corpora MusicPiles, the collected MusicTheoryBench, code, model and demo.

2024FINDINGSNot Relevant

Synthetic Lyrics Detection Across Languages and Genres

Yanis Labrak

In recent years, the use of large language models (LLMs) to generate music content, particularly lyrics, has gained in popularity. These advances provide valuable tools for artists and enhance their creative processes, but they also raise concerns about copyright violations, consumer satisfaction, and content spamming. Previous research has explored content detection in various domains. However, no work has focused on the text modality, lyrics, in music. To address this gap, we curated a diverse dataset of real and synthetic lyrics from multiple languages, music genres, and artists. The generation pipeline was validated using both humans and automated methods. We performed a thorough evaluation of existing synthetic text detection approaches on lyrics, a previously unexplored data type. We also investigated methods to adapt the best-performing features to lyrics through unsupervised domain adaptation. Following both music and industrial constraints, we examined how well these approaches generalize across languages, scale with data availability, handle multilingual language content, and perform on novel genres in few-shot settings. Our findings show promising results that could inform policy decisions around AI-generated music and enhance transparency for users.

2025TRUSTNLP, WSNot Relevant

Summary of the Visually Grounded Story Generation Challenge

Xudong Hong

Recent advancements in vision-and-language models have opened new possibilities for natural language generation, particularly in generating creative stories from visual input. We thus host an open-sourced shared task, Visually Grounded Story Generation (VGSG), to explore whether these models can create coherent, diverse, and visually grounded narratives. This task challenges participants to generate coherent stories based on sequences of images, where characters and events must be grounded in the images provided. The task is structured into two tracks: the Closed track with constraints on fixed visual features and the Open track which allows all kinds of models. We propose the first two-stage model using GPT-4o as the baseline for the Open track that first generates descriptions for the images and then creates a story based on those descriptions. Human and automatic evaluations indicate that: 1) Retrieval augmentation helps generate more human-like stories, and 2) Large-scale pre-trained LLMs improve story quality by a large margin; 3) Traditional automatic metrics cannot capture the overall quality.

2024INLGNot Relevant

A synthetic data approach for domain generalization of NLI models

Mohammad Javad Hosseini

Natural Language Inference (NLI) remains an important benchmark task for LLMs. NLI datasets are a springboard for transfer learning to other semantic tasks, and NLI models are standard tools for identifying the faithfulness of model-generated text. There are several large scale NLI datasets today, and models have improved greatly by hill-climbing on these collections. Yet their realistic performance on out-of-distribution/domain data is less well-understood. We explore the opportunity for synthetic high-quality datasets to adapt NLI models for zero-shot use in downstream applications across new and unseen text domains. We demonstrate a new approach for generating NLI data in diverse domains and lengths, so far not covered by existing training sets. The resulting examples have meaningful premises, the hypotheses are formed in creative ways rather than simple edits to a few premise tokens, and the labels have high accuracy. We show that models trained on this data (685K synthetic examples) have the best generalization to completely new downstream test settings. On the TRUE benchmark, a T5-small model trained with our data improves around 7% on average compared to training on the best alternative dataset. The improvements are more pronounced for smaller models, while still meaningful on a T5 XXL model. We also demonstrate gains on test sets when in-domain training data is augmented with our domain-general synthetic data.

2024ACLNot Relevant

How much coffee was consumed during EMNLP 2019? Fermi Problems: A New Reasoning Challenge for AI

Ashwin Kalyan

Many real-world problems require the combined application of multiple reasoning abilities—employing suitable abstractions, commonsense knowledge, and creative synthesis of problem-solving strategies. To help advance AI systems towards such capabilities, we propose a new reasoning challenge, namely Fermi Problems (FPs), which are questions whose answers can only be approximately estimated because their precise computation is either impractical or impossible. For example, “How much would the sea level rise if all ice in the world melted?” FPs are commonly used in quizzes and interviews to bring out and evaluate the creative reasoning abilities of humans. To do the same for AI systems, we present two datasets: 1) A collection of 1k real-world FPs sourced from quizzes and olympiads; and 2) a bank of 10k synthetic FPs of intermediate complexity to serve as a sandbox for the harder real-world challenge. In addition to question-answer pairs, the datasets contain detailed solutions in the form of an executable program and supporting facts, helping in supervision and evaluation of intermediate steps. We demonstrate that even extensively fine-tuned large-scale language models perform poorly on these datasets, on average making estimates that are off by two orders of magnitude. Our contribution is thus the crystallization of several unsolved AI problems into a single, new challenge that we hope will spur further advances in building systems that can reason.

2021EMNLPNot Relevant

MultiSubs: A Large-scale Multimodal and Multilingual Dataset

Josiah Wang

This paper introduces a large-scale multimodal and multilingual dataset that aims to facilitate research on grounding words to images in their contextual usage in language. The dataset consists of images selected to unambiguously illustrate concepts expressed in sentences from movie subtitles. The dataset is a valuable resource as (i) the images are aligned to text fragments rather than whole sentences; (ii) multiple images are possible for a text fragment and a sentence; (iii) the sentences are free-form and real-world like; (iv) the parallel texts are multilingual. We also set up a fill-in-the-blank game for humans to evaluate the quality of the automatic image selection process of our dataset. Finally, we propose a fill-in-the-blank task to demonstrate the utility of the dataset, and present some baseline prediction models. The dataset will benefit research on visual grounding of words especially in the context of free-form sentences, and can be obtained from https://doi.org/10.5281/zenodo.5034604 under a Creative Commons licence.

2022LRECNot Relevant

MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria

Wentao Ge

Multimodal large language models (MLLMs) have broadened the scope of AI applications. Existing automatic evaluation methodologies for MLLMs are mainly limited in evaluating objective queries without considering real-world user experiences, inadequately addressing the nuances of creative and associative multimodal tasks. However, the open-ended and subjective nature of such tasks poses a significant challenge to the evaluation methodology, where it is difficult to define the ground-truth answers for them. To this end, in our paper, we propose a new evaluation paradigm for MLLMs, which is evaluating MLLMs with per-sample criteria using potent MLLM as the judge. To validate the feasibility and effectiveness of this paradigm, we design a benchmark, dubbed MLLM-Bench, by curating the evaluation samples across six comprehensive cognitive levels. We benchmark 26 popular MLLMs in a pairwise-comparison fashion, showing diverse performance across models. Moreover, the validity of our benchmark manifests itself in reaching 88.02% agreement with human evaluation. We contend that the proposed paradigm explores the potential of MLLMs as effective evaluation tools with the help of per-sample criteria.

2025NAACLNot Relevant

M2QA: Multi-domain Multilingual Question Answering

Leon Engländer

Generalization and robustness to input variation are core desiderata of machine learning research. Language varies along several axes, most importantly, language instance (e.g. French) and domain (e.g. news). While adapting NLP models to new languages within a single domain, or to new domains within a single language, is widely studied, research in joint adaptation is hampered by the lack of evaluation datasets. This prevents the transfer of NLP systems from well-resourced languages and domains to non-dominant language-domain combinations. To address this gap, we introduce M2QA, a multi-domain multilingual question answering benchmark. M2QA includes 13,500 SQuAD 2.0-style question-answer instances in German, Turkish, and Chinese for the domains of product reviews, news, and creative writing. We use M2QA to explore cross-lingual cross-domain performance of fine-tuned models and state-of-the-art LLMs and investigate modular approaches to domain and language adaptation. We witness 1) considerable performance variations across domain-language combinations within model classes and 2) considerable performance drops between source and target language-domain combinations across all model sizes. We demonstrate that M2QA is far from solved, and new methods to effectively transfer both linguistic and domain-specific information are necessary.

2024FINDINGSNot Relevant

BOOKWORLD: From Novels to Interactive Agent Societies for Story Creation

Yiting Ran

Recent advances in large language models (LLMs) have enabled social simulation through multi-agent systems. Prior efforts focus on agent societies created from scratch, assigning agents newly defined personas. However, simulating established fictional worlds and characters remains largely underexplored, despite its significant practical value. In this paper, we introduce BookWorld, a comprehensive system for constructing and simulating book-based multi-agent societies. BookWorld’s design covers comprehensive real-world intricacies, including diverse and dynamic characters, fictional worldviews, geographical constraints and changes, etc. BookWorld enables diverse applications including story generation, interactive games and social simulation, offering novel ways to extend and explore beloved fictional works. Through extensive experiments, we demonstrate that BookWorld generates creative, high-quality stories while maintaining fidelity to the source books, surpassing previous methods with a win rate of 75.36%. The code and demo of this paper can be found at the project page: https://bookworld2025.github.io/.

2025ACLNot Relevant

Self-Reasoning Language Models: Unfold Hidden Reasoning Chains with Few Reasoning Catalyst

Hongru Wang

Inference-time scaling has attracted much attention because it significantly enhances the performance of Large Language Models (LLMs) on complex reasoning tasks by increasing the length of the Chain-of-Thought. These longer intermediate reasoning rationales embody various meta-reasoning skills found in human cognition, such as reflection and decomposition, which are difficult to create and acquire. In this work, we introduce the Self-Reasoning Language Model (SRLM), where the model itself can synthesize longer CoT data and iteratively improve performance through self-training. By incorporating a few demonstration examples (i.e., 1,000 samples) on how to unfold hidden reasoning chains from existing responses, which act as a reasoning catalyst, we demonstrate that SRLM not only enhances the model’s initial performance but also ensures more stable and consistent improvements in subsequent iterations. Our proposed SRLM achieves an average absolute improvement of more than 2.5 points across five reasoning tasks (MMLU, GSM8K, ARC-C, HellaSwag, and BBH) on two backbone models. Moreover, it yields further gains with more sampling at inference time, such as an absolute average improvement of 7.89 points with 64 samples, revealing the in-depth, diverse and creative reasoning paths in SRLM compared to the strong baseline.

2025FINDINGSNot Relevant

Prospector: Improving LLM Agents with Self-Asking and Trajectory Ranking

Byoungjip Kim

Large language models (LLMs) have shown the ability to solve complex decision-making tasks beyond natural language processing tasks. LLM agents based on few-shot in-context learning (ICL) achieve surprisingly high performance without training. Despite their simplicity and generalizability, ICL-based agents are limited in their ability to incorporate feedback from an environment. In this paper, we introduce Prospector, an LLM agent that consists of two complementary LLMs, an Actor and a Critic. To elicit better instruction-aligned actions from the LLM agent, we propose AskAct prompting that performs an additional self-asking step such as goal and progress checking before generating an action. Furthermore, to implicitly incorporate the environment feedback, we propose Trajectory Ranking that orders generated trajectories by predicting the expected total reward. Prospector encourages the LLM Actor to generate diverse (creative) trajectories, and harnesses the LLM Critic to select the most rewarding trajectory. On representative decision-making benchmark environments such as ALFWorld and WebShop, we empirically demonstrate that Prospector can considerably increase the success rate of given tasks, while outperforming recent advancements such as ReAct and Reflexion.

2024FINDINGSNot Relevant
scope › prompt engineering

Top-n𝜎: Eliminating Noise in Logit Space for Robust Token Sampling of LLM

Chenxia Tang

Large language models (LLMs) rely heavily on sampling methods to generate diverse and high-quality text. While existing sampling methods like top-p and min-p have identified the detrimental effects of low-probability tails in LLMs’ outputs, they still fail to effectively distinguish between diversity and noise. This limitation stems from their reliance on probability-based metrics that are inherently sensitive to temperature scaling. Through empirical and theoretical analysis, we make two key discoveries: (1) the pre-softmax logits exhibit a clear statistical separation between informative tokens and noise, and (2) we prove the mathematical equivalence of min-p and top-(1-p) under a uniform distribution over logits. These findings motivate the design of top-n𝜎, a novel sampling method that identifies informative tokens by eliminating noise directly in logit space. Unlike existing methods that become unstable at high temperatures, top-n𝜎 achieves temperature-invariant token selection while preserving output diversity. Extensive experiments across reasoning and creative writing tasks demonstrate that our method consistently outperforms existing approaches, with particularly significant improvements in high-temperature settings.

2025ACLRelevant
evaluation › automatic metrics
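
To give a rough sense of how a logit-space filter of this kind can be implemented, here is a minimal sketch. It assumes the selection rule keeps tokens whose pre-softmax logits fall within n standard deviations of the maximum logit; the exact formulation and defaults are in the paper, so treat this as an approximation of the idea rather than a reference implementation.

```python
import numpy as np

def top_nsigma_sample(logits, n=1.0, temperature=1.0, rng=None):
    """Sketch of a top-n-sigma style sampler (assumed rule, see lead-in).

    Tokens whose raw logit is within n * std(logits) of the maximum logit
    are kept; everything else is discarded as noise. Filtering happens in
    logit space before temperature scaling, which is what makes the token
    selection temperature-invariant.
    """
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)
    keep = logits >= logits.max() - n * logits.std()
    scaled = logits / temperature
    scaled[~keep] = -np.inf                 # drop noise tokens entirely
    probs = np.exp(scaled - scaled[keep].max())
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

# Example: the two low-logit "noise" tokens are never sampled,
# no matter how high the temperature is set.
token_id = top_nsigma_sample([4.2, 3.9, 0.1, -6.0, -7.5], n=1.0, temperature=2.0)
```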

Beemo: Benchmark of Expert-edited Machine-generated Outputs

Ekaterina Artemova

The rapid proliferation of large language models (LLMs) has increased the volume of machine-generated texts (MGTs) and blurred text authorship in various domains. However, most existing MGT benchmarks include single-author texts (human-written and machine-generated). This conventional design fails to capture more practical multi-author scenarios, where the user refines the LLM response for natural flow, coherence, and factual correctness. Our paper introduces the Benchmark of Expert-edited Machine-generated Outputs (Beemo), which includes 6.5k texts written by humans, generated by ten instruction-finetuned LLMs, and edited by experts for various use cases, ranging from creative writing to summarization. Beemo additionally comprises 13.1k machine-generated and LLM-edited texts, allowing for diverse MGT detection evaluation across various edit types. We document Beemo’s creation protocol and present the results of benchmarking 33 configurations of MGT detectors in different experimental setups. We find that expert-based editing evades MGT detection, while LLM-edited texts are unlikely to be recognized as human-written. Beemo and all materials are publicly available.

2025NAACLNot Relevant

BriefMe: A Legal NLP Benchmark for Assisting with Legal Briefs

Jesse Woo

A core part of legal work that has been underexplored in Legal NLP is the writing and editing of legal briefs. This requires not only a thorough understanding of the law of a jurisdiction, from judgments to statutes, but also the ability to make new arguments to try to expand the law in a new direction and make novel and creative arguments that are persuasive to judges. To capture and evaluate these legal skills in language models, we introduce BRIEFME, a new dataset focused on legal briefs. It contains three tasks for language models to assist legal professionals in writing briefs: argument summarization, argument completion, and case retrieval. In this work, we describe the creation of these tasks, analyze them, and show how current models perform. We see that today’s large language models (LLMs) are already quite good at the summarization and guided completion tasks, even beating human-generated headings. Yet, they perform poorly on other tasks in our benchmark: realistic argument completion and retrieving relevant legal cases. We hope this dataset encourages more development in Legal NLP in ways that will specifically aid people in performing legal work.

2025FINDINGSNot Relevant

Ghostbuster: Detecting Text Ghostwritten by Large Language Models

Vivek Verma

We introduce Ghostbuster, a state-of-the-art system for detecting AI-generated text. Our method works by passing documents through a series of weaker language models, running a structured search over possible combinations of their features, and then training a classifier on the selected features to predict whether documents are AI-generated. Crucially, Ghostbuster does not require access to token probabilities from the target model, making it useful for detecting text generated by black-box or unknown models. In conjunction with our model, we release three new datasets of human- and AI-generated text as detection benchmarks in the domains of student essays, creative writing, and news articles. We compare Ghostbuster to several existing detectors, including DetectGPT and GPTZero, as well as a new RoBERTa baseline. Ghostbuster achieves 99.0 F1 when evaluated across domains, which is 5.9 F1 higher than the best preexisting model. It also outperforms all previous approaches in generalization across writing domains (7.5 F1), prompting strategies (2.1 F1), and language models (4.4 F1). We also analyze our system’s robustness to a variety of perturbations and paraphrasing attacks, and evaluate its performance on documents by non-native English speakers.

2024NAACLNot Relevant

MusicAgent: An AI Agent for Music Understanding and Generation with Large Language Models

Dingyao Yu

AI-empowered music processing is a diverse field that encompasses dozens of tasks, ranging from generation tasks (e.g., timbre synthesis) to comprehension tasks (e.g., music classification). For developers and amateurs, it is very difficult to grasp all of these tasks to satisfy their requirements in music processing, especially considering the huge differences in the representations of music data and the model applicability across platforms among various tasks. Consequently, it is necessary to build a system to organize and integrate these tasks, and thus help practitioners to automatically analyze their demands and call suitable tools as solutions to fulfill their requirements. Inspired by the recent success of large language models (LLMs) in task automation, we develop a system, named MusicAgent, which integrates numerous music-related tools and an autonomous workflow to address user requirements. More specifically, we build 1) a toolset that collects tools from diverse sources, including Hugging Face, GitHub, and Web APIs, and 2) an autonomous workflow empowered by LLMs (e.g., ChatGPT) to organize these tools and automatically decompose user requests into multiple sub-tasks and invoke corresponding music tools. The primary goal of this system is to free users from the intricacies of AI-music tools, enabling them to concentrate on the creative aspect. By granting users the freedom to effortlessly combine tools, the system offers a seamless and enriching music experience. The code is available on GitHub along with a brief instructional video.

2023EMNLPNot Relevant

Does the “Most Sinfully Decadent Cake Ever” Taste Good? Answering Yes/No Questions from Figurative Contexts

Geetanjali Rakshit

Figurative language is commonplace in natural language, and while it makes communication memorable and creative, it can be difficult to understand. In this work, we investigate the robustness of Question Answering (QA) models on figurative text. Yes/no questions, in particular, are a useful probe of the figurative language understanding capabilities of large language models. We propose FigurativeQA, a set of 1000 yes/no questions with figurative and non-figurative contexts, extracted from the domains of restaurant and product reviews. We show that state-of-the-art BERT-based QA models exhibit an average performance drop of up to 15 percentage points when answering questions from figurative contexts, as compared to non-figurative ones. While models like GPT-3 and ChatGPT are better at handling figurative texts, we show that further performance gains can be achieved by automatically simplifying the figurative contexts into their non-figurative (literal) counterparts. We find that the best overall model is ChatGPT with chain-of-thought prompting to generate non-figurative contexts. Our work provides a promising direction for building more robust QA models with figurative language understanding capabilities.

2023RANLPRelevant
model used › ChatGPTmodel used › Large (>32B)evaluation › human eval

ProBench: Judging Multimodal Foundation Models on Open-ended Multi-domain Expert Tasks

Yan Yang

Solving expert-level multimodal tasks is a key milestone in general intelligence. As the capabilities of multimodal large language models (MLLMs) continue to evolve, evaluation of frontier multimodal intelligence becomes necessary yet challenging. In this work, we introduce ProBench, a benchmark of open-ended user queries encapsulating professional expertise and advanced reasoning. ProBench consists of 4,000 high-quality samples independently collected from professionals based on their productivity demands. It spans 10 fields and 56 sub-fields, including science, arts, humanities, coding, mathematics, and creative writing. Experimentally, we evaluate and compare 24 of the latest models using MLLM-as-a-Judge. Our results reveal that although the best open-source models rival the proprietary ones, they all face significant challenges in visual perception, textual understanding, domain knowledge, and advanced reasoning. Our benchmark is publicly accessible at TBC.

2025FINDINGSNot Relevant

Evaluating LLM Performance in Character Analysis: A Study of Artificial Beings in Recent Korean Science Fiction

Woori Jang

Literary works present diverse and complex character behaviors, often implicit or intentionally obscured, making character analysis an inherently challenging task. This study explores LLMs’ capability to identify and interpret behaviors of artificial beings in 11 award-winning contemporary Korean science fiction short stories. Focusing on artificial beings as a distinct class of characters, rather than on conventional human characters, adds to the multi-layered complexity of analysis. We compared two LLMs, Claude 3.5 Sonnet and GPT-4o, with human experts using a custom eight-label system and a unique agreement metric developed to capture the cognitive intricacies of literary interpretation. Human inter-annotator agreement was around 50%, confirming the subjectivity of literary comprehension. LLMs differed from humans in selected text spans but demonstrated high agreement in label assignment for correctly identified spans. LLMs notably excelled at discerning ‘actions’ as semantic units rather than isolated grammatical components. This study reaffirms literary interpretation’s multifaceted nature while expanding the boundaries of NLP, contributing to discussions about AI’s capacity to understand and interpret creative works.

2024NLP4DH, WSRelevant
evaluation › human evalmodel used › ChatGPTmodel used › Large (>32B)

Unleashing the Emergent Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration

Zhenhailong Wang

Human intelligence thrives on cognitive synergy, where collaboration among different minds yields superior outcomes compared to isolated individuals. In this work, we propose Solo Performance Prompting (SPP), which transforms a single LLM into a cognitive synergist by engaging in multi-turn self-collaboration with multiple personas. A cognitive synergist is an intelligent agent that collaboratively combines multiple minds’ strengths and knowledge to enhance problem-solving in complex tasks. By dynamically identifying and simulating different personas based on task inputs, SPP unleashes the potential of cognitive synergy in LLMs. Our in-depth analysis shows that assigning multiple fine-grained personas in LLMs improves problem-solving abilities compared to using a single or fixed number of personas. We evaluate SPP on three challenging tasks: Trivia Creative Writing, Codenames Collaborative, and Logic Grid Puzzle, encompassing both knowledge-intensive and reasoning-intensive types. Unlike previous works, such as Chain-of-Thought, that solely enhance the reasoning abilities in LLMs, experimental results demonstrate that SPP effectively reduces factual hallucination and maintains strong reasoning capabilities. Additionally, comparative experiments show that cognitive synergy only emerges in GPT-4 and does not appear in less capable models, such as GPT-3.5-turbo and Llama2-13b-chat, which draws an interesting analogy to human development. Code, data, and prompts can be found at: https://github.com/MikeWangWZHL/Solo-Performance-Prompting.git.

2024NAACLRelevant
model used › ChatGPTmodel used › Large (>32B)evaluation › automatic metrics

Measuring Psychological Depth in Language Models

Fabrice Y. Harel-Canada

Evaluations of creative stories generated by large language models (LLMs) often focus on objective properties of the text, such as its style, coherence, and diversity. While these metrics are indispensable, they do not speak to a story’s subjective, psychological impact from a reader’s perspective. We introduce the Psychological Depth Scale (PDS), a novel framework rooted in literary theory that measures an LLM’s ability to produce authentic and narratively complex stories that provoke emotion, empathy, and engagement. We empirically validate our framework by showing that humans can consistently evaluate stories based on PDS (0.72 Krippendorff’s alpha). We also explore techniques for automating the PDS to easily scale future analyses. GPT-4o, combined with a novel Mixture-of-Personas (MoP) prompting strategy, achieves an average Spearman correlation of 0.51 with human judgment while Llama-3-70B with constrained decoding scores as high as 0.68 for empathy. Finally, we compared the depth of stories authored by both humans and LLMs. Surprisingly, GPT-4 stories either surpassed or were statistically indistinguishable from highly-rated human-written stories sourced from Reddit. By shifting the focus from text to reader, the Psychological Depth Scale is a validated, automated, and systematic means of measuring the capacity of LLMs to connect with humans through the stories they tell.

2024EMNLPRelevant
evaluates a creative featureevaluates a creative feature › figures of speechevaluation › human evalmodel used › ChatGPTmodel used › Large (>32B)evaluation › word-levelscope › prompt engineeringtextual genre › literature

Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor

Or Honovich

Instruction tuning enables pretrained language models to perform new tasks from inference-time natural language descriptions. These approaches rely on vast amounts of human supervision in the form of crowdsourced datasets or user interactions. In this work, we introduce Unnatural Instructions: a large dataset of creative and diverse instructions, collected with virtually no human labor. We collect 64,000 examples by prompting a language model with three seed examples of instructions and eliciting a fourth. This set is then expanded by prompting the model to rephrase each instruction, creating a total of approximately 240,000 examples of instructions, inputs, and outputs. Experiments show that despite containing a fair amount of noise, training on Unnatural Instructions rivals the effectiveness of training on open-source manually-curated datasets, surpassing the performance of models such as T0 and Tk-Instruct across various benchmarks. These results demonstrate the potential of model-generated data as a cost-effective alternative to crowdsourcing for dataset expansion and diversification.

2023ACLNot Relevant

I Spy a Metaphor: Large Language Models and Diffusion Models Co-Create Visual Metaphors

Tuhin Chakrabarty

Visual metaphors are powerful rhetorical devices used to persuade or communicate creative ideas through images. Similar to linguistic metaphors, they convey meaning implicitly through symbolism and juxtaposition of the symbols. We propose a new task of generating visual metaphors from linguistic metaphors. This is a challenging task for diffusion-based text-to-image models, such as DALL⋅E 2, since it requires the ability to model implicit meaning and compositionality. We propose to solve the task through the collaboration between Large Language Models (LLMs) and Diffusion Models: Instruct GPT-3 (davinci-002) with Chain-of-Thought prompting generates text that represents a visual elaboration of the linguistic metaphor containing the implicit meaning and relevant objects, which is then used as input to the diffusion-based text-to-image models. Using a human-AI collaboration framework, where humans interact both with the LLM and the top-performing diffusion model, we create a high-quality dataset containing 6,476 visual metaphors for 1,540 linguistic metaphors and their associated visual elaborations. Evaluation by professional illustrators shows the promise of LLM-Diffusion Model collaboration for this task. To evaluate the utility of our Human-AI collaboration framework and the quality of our dataset, we perform both an intrinsic human-based evaluation and an extrinsic evaluation using visual entailment as a downstream task.

2023FINDINGSNot Relevant

Human-Centered Design Recommendations for LLM-as-a-judge

Qian Pan

Traditional reference-based metrics, such as BLEU and ROUGE, are less effective for assessing outputs from Large Language Models (LLMs) that produce highly creative or superior-quality text, or in situations where reference outputs are unavailable. While human evaluation remains an option, it is costly and difficult to scale. Recent work using LLMs as evaluators (LLM-as-a-judge) is promising, but trust and reliability remain a significant concern. Integrating human input is crucial to ensure that the evaluation criteria are aligned with the human’s intent and that evaluations are robust and consistent. This paper presents a user study of a design exploration called EvaluLLM, which enables users to leverage LLMs as customizable judges, promoting human involvement to balance trust and cost-saving potential with caution. Through interviews with eight domain experts, we identified the need for assistance in developing effective evaluation criteria to align the LLM-as-a-judge with practitioners’ preferences and expectations. We offer findings and design recommendations to optimize human-assisted LLM-as-judge systems.

2024HUCLLM, WSNot Relevant

CoST of breaking the LLMs

Ananya Mukherjee

This paper presents an evaluation of 16 machine translation systems submitted to the Shared Task of the 9th Conference of Machine Translation (WMT24) for the English-Hindi (en-hi) language pair using our Complex Structures Test (CoST) suite. Aligning with this year’s test suite sub-task theme, “Help us break LLMs”, we curated a comprehensive test suite encompassing diverse datasets across various categories, including autobiography, poetry, legal, conversation, play, narration, technical, and mixed genres. Our evaluation reveals that all the systems struggle significantly with archaic styles of text, such as legal and technical writing, and with text that has a creative twist, such as the conversation and poetry datasets, highlighting their weaknesses in handling complex linguistic structures and stylistic nuances inherent in these text types. Our evaluation identifies the strengths and limitations of the submitted models, pointing to specific areas where further research and development are needed to enhance their performance. Our test suite is available at https://github.com/AnanyaCoder/CoST-WMT-24-Test-Suite-Task.

2024WMT, WSRelevant

NewsMet : A ‘do it all’ Dataset of Contemporary Metaphors in News Headlines

Rohan Joseph

Metaphors are highly creative constructs of human language that grow old and eventually die. Popular datasets used for metaphor processing tasks were constructed from dated source texts. In this paper, we propose NewsMet, a large high-quality contemporary dataset of news headlines hand-annotated with metaphorical verbs. The dataset comprises headlines from various sources including political, satirical, reliable and fake. Our dataset serves the purpose of evaluation for the tasks of metaphor interpretation and generation. The experiments reveal several insights and limitations of using LLMs to automate metaphor processing tasks as frequently seen in the recent literature. The dataset is publicly available for research purposes at https://github.com/AxleBlaze3/NewsMet_Metaphor_Dataset.

2023FINDINGSRelevant
evaluates a creative feature › figures of speechrelated to creativity › related to creativity as a human abilityevaluation › human eval

Cracking the Code: Multi-domain LLM Evaluation on Real-World Professional Exams in Indonesia

Fajri Koto

While knowledge evaluation in large language models has predominantly focused on academic subjects like math and physics, these assessments often fail to capture the practical demands of real-world professions. In this paper, we introduce IndoCareer, a dataset comprising 8,834 multiple-choice questions designed to evaluate performance in vocational and professional certification exams across various fields. With a focus on Indonesia, IndoCareer provides rich local contexts, spanning six key sectors: (1) healthcare, (2) insurance and finance, (3) creative and design, (4) tourism and hospitality, (5) education and training, and (6) law. Our comprehensive evaluation of 27 large language models shows that these models struggle particularly in fields with strong local contexts, such as insurance and finance. Additionally, while using the entire dataset, shuffling answer options generally maintains consistent evaluation results across models, but it introduces instability specifically in the insurance and finance sectors.

2025NAACLNot Relevant

CharacterGPT: A Persona Reconstruction Framework for Role-Playing Agents

Jeiyoon Park

The recent introduction of the Assistants API highlights its potential for large language models (LLMs) in role-playing agents (RPA). However, maintaining consistent character personas remains a significant challenge due to variability in information extraction, which frequently omits critical elements such as backstory or interpersonal relationships. To address this limitation, we introduce CharacterGPT, a framework designed to dynamically reconstruct character personas through Character Persona Training (CPT). This approach incrementally updates personas by extracting traits from chapter-wise novel summaries, reflecting the progression of the narrative. Our framework is evaluated through Big Five personality evaluations and creative tasks, in which characters generate original narratives, demonstrating the efficacy of CharacterGPT in preserving persona consistency. The code and results are available at https://github.com/Jeiyoon/charactergpt

2025NAACLRelevant
model used › ChatGPTscope › prompt engineeringrelated to creativity › related to creativity as a human abilitymodel used › Large (>32B)

Story-Yarn : An Interactive Story Building Application

Hryadyansh Saraswat

Story building is an important part of language and overall development of a child. Developing an interactive and artificial intelligence (AI) based solution to create stories for children is an open and challenging problem. Methods combining large language models (LLMs) and knowledge graphs (KGs) have further enabled high-quality and coherent story generation. In this work, we present a platform, Story Yarn, developed for interactive story creation for children. We customise a KG, using children’s stories, which captures relationships between components of stories. This customised KG is then used along with an LLM to collaboratively create a story. We have also built a simple app to facilitate user interaction. This platform can aid the creative development of children, and can be used at home or in schools.

2024ICONNot Relevant

Mothman at SemEval-2024 Task 9: An Iterative System for Chain-of-Thought Prompt Optimization

Alvin Po-Chun Chen

Extensive research exists on the performance of large language models on logic-based tasks, whereas relatively little has been done on their ability to generate creative solutions on lateral thinking tasks. The BrainTeaser shared task tests lateral thinking and uses adversarial datasets to prevent memorization, resulting in poor performance for out-of-the-box models. We propose a system for iterative, chain-of-thought prompt engineering which optimizes prompts using human evaluation. Using this shared task, we demonstrate our system’s ability to significantly improve model performance by optimizing prompts and evaluate the input dataset.

2024SEMEVALNot Relevant

Automating Humor: A Novel Approach to Joke Generation Using Template Extraction and Infilling

Mayank Goel

This paper presents a novel approach to humor generation in natural language processing by automating the creation of jokes through template extraction and infilling. Traditional methods have relied on predefined templates or neural network models, which either lack complexity or fail to produce genuinely humorous content. Our method introduces a technique to extract templates from existing jokes based on semantic salience and BERT’s attention weights. We then infill these templates using advanced techniques, through BERT and large language models (LLMs) like GPT-4, to generate new jokes. Our results indicate that the generated jokes are novel and human-like, with BERT showing promise in generating funny content and GPT-4 excelling in creating clever jokes. The study contributes to a deeper understanding of humor generation and the potential of AI in creative domains.

2024ICONRelevant
model used › ChatGPTscope › prompt engineeringrelated to creativity › related to creativity as a human ability
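
As a rough illustration of the template-and-infill idea (not the authors' exact pipeline), the sketch below masks one slot in a joke template and lets a masked language model propose fillers. The use of the Hugging Face transformers fill-mask pipeline with bert-base-uncased, and the template itself, are assumptions of this example rather than details taken from the paper.

```python
from transformers import pipeline

# Hypothetical one-slot template; the paper extracts templates automatically
# from existing jokes via semantic salience and attention weights, which is
# not reproduced here.
template = "I told my computer I needed a break, and it gave me a [MASK]."

fill = pipeline("fill-mask", model="bert-base-uncased")
candidates = fill(template, top_k=5)

for c in candidates:
    # Each candidate is a dict holding the filled-in sequence and its score.
    print(f"{c['score']:.3f}  {c['sequence']}")
```

In a fuller system, the scored candidates would then be filtered or re-ranked (for example, by an LLM judge) for actual funniness rather than mere fluency.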

Plug-and-Play Controller for Story Completion: A Pilot Study toward Emotion-aware Story Writing Assistance

Yusuke Mori

Emotions are essential for storytelling and narrative generation, and as such, the relationship between stories and emotions has been extensively studied. The authors of this paper, including a professional novelist, have examined the use of natural language processing to address the problems of novelists from the perspective of practical creative writing. In particular, the story completion task, which requires understanding the existing unfinished context, was studied from the perspective of creative support for human writers, to generate appropriate content to complete the unfinished parts. It was found that unsupervised pre-trained large neural models of the sequence-to-sequence type are useful for this task. Furthermore, based on the plug-and-play module for controllable text generation using GPT-2, an additional module was implemented to consider emotions. Although this is a preliminary study, and the results leave room for improvement before incorporating the model into a practical system, this effort is an important step in complementing the emotional trajectory of the story.

2022IN2WRITINGRelevant

Break-Ideate-Generate (BrIdGe): Moving beyond Translations for Localization using LLMs

Swapnil Gupta

Language localization is the adaptation of written content to different linguistic and cultural contexts. The ability to localize written content is crucial for global businesses to provide a consistent and reliable customer experience across diverse markets. Traditional methods have approached localization as an application of machine translation (MT), but localization requires more than linguistic conversion – content needs to align with the target audience’s cultural norms, linguistic nuances, and technical requirements. This difference is prominent for long-form text, where multiple facts are expressed through creative choices of language. We propose a novel prompting approach for Large Language Models (LLMs), called Break-Ideate-Generate (BrIdGe), for language localization. BrIdGe ‘breaks’ the source content into granular facts, ‘ideates’ an action plan for content creation in the target language by organizing the granular facts, and finally executes the plan to ‘generate’ localized content. This approach emulates the cognitive processes humans employ in writing, which begin with identifying important points, followed by brainstorming on how to structure and organize the output. We evaluated the BrIdGe methodology from multiple perspectives, including the impact of the BrIdGe prompt on different LLMs and performance comparisons with traditional MT models and direct translation through LLMs on public benchmark and proprietary e-commerce datasets. Through human and LLM-based automated evaluations across content in multiple languages, we demonstrate the effectiveness of BrIdGe in generating fluent localized content while preserving factual consistency between source and target languages.

2025NAACLRelevant
scope › prompt engineeringrelated to creativity › related to creativity as a textual genremodel used › ChatGPTevaluation › automatic metricsevaluation › human evaltextual genre › literature

ALF at SemEval-2024 Task 9: Exploring Lateral Thinking Capabilities of LMs through Multi-task Fine-tuning

Seyed Ali Farokh

Recent advancements in natural language processing (NLP) have prompted the development of sophisticated reasoning benchmarks. This paper presents our system for the SemEval 2024 Task 9 competition and also investigates the efficacy of fine-tuning language models (LMs) on BrainTeaser—a benchmark designed to evaluate NLP models’ lateral thinking and creative reasoning abilities. Our experiments focus on two prominent families of pre-trained models, BERT and T5. Additionally, we explore the potential benefits of multi-task fine-tuning on commonsense reasoning datasets to enhance performance. Our top-performing model, DeBERTa-v3-large, achieves an impressive overall accuracy of 93.33%, surpassing human performance.

2024SEMEVALNot Relevant

Measuring the Robustness of Reference-Free Dialogue Evaluation Systems

Justin Vasselli

Advancements in dialogue systems powered by large language models (LLMs) have outpaced the development of reliable evaluation metrics, particularly for diverse and creative responses. We present a benchmark for evaluating the robustness of reference-free dialogue metrics against four categories of adversarial attacks: speaker tag prefixes, static responses, ungrammatical responses, and repeated conversational context. We analyze metrics such as DialogRPT, UniEval, and PromptEval—a prompt-based method leveraging LLMs—across grounded and ungrounded datasets. By examining both their correlation with human judgment and susceptibility to adversarial attacks, we find that these two axes are not always aligned; metrics that appear to be equivalent when judged by traditional benchmarks may, in fact, vary in their scores of adversarial responses. These findings motivate the development of nuanced evaluation frameworks to address real-world dialogue challenges.

2025COLINGNot Relevant

Exploring the Capability of Multimodal LLMs with Yonkoma Manga: The YManga Dataset and Its Challenging Tasks

Qi Yang

Yonkoma Manga, characterized by its four-panel structure, presents unique challenges due to its rich contextual information and strong sequential features. To address the limitations of current multimodal large language models (MLLMs) in understanding this type of data, we create a novel dataset named YManga from the Internet. After filtering out low-quality content, we collect a dataset of 1,015 yonkoma strips, containing 10,150 human annotations. We then define three challenging tasks for this dataset: panel sequence detection, generation of the author’s creative intention, and description generation for masked panels. These tasks progressively introduce the complexity of understanding and utilizing such image-text data. To the best of our knowledge, YManga is the first dataset specifically designed for yonkoma manga strips understanding. Extensive experiments conducted on this dataset reveal significant challenges faced by current multimodal large language models. Our results show a substantial performance gap between models and humans across all three tasks.

2024FINDINGSNot Relevant

NOVA: An Iterative Planning Framework for Enhancing Scientific Innovation with Large Language Models

Xiang Hu

Scientific innovation is pivotal for humanity, and harnessing large language models (LLMs) to generate research ideas could transform discovery. However, existing LLMs often produce simplistic and repetitive suggestions due to their limited ability in acquiring external knowledge for innovation. To address this problem, we introduce an enhanced planning and search methodology designed to boost the creative potential of LLM-based systems. Our approach involves an iterative process to purposely plan the retrieval of external knowledge, progressively enriching the idea generation with broader and deeper insights. Validation through automated and human assessments demonstrates that our framework substantially elevates the quality of generated ideas, particularly in novelty and diversity. The number of unique novel ideas produced by our framework is 3.4 times higher than without it. Moreover, our method outperforms the current state-of-the-art, generating at least 2.5 times more top-rated ideas based on 170 seed papers in a Swiss Tournament evaluation. Our code is available at https://github.com/hflyzju/Nova

2025FINDINGSNot Relevant

GROVE: A Retrieval-augmented Complex Story Generation Framework with A Forest of Evidence

Zhihua Wen

Conditional story generation is significant in human-machine interaction, particularly in producing stories with complex plots. While Large language models (LLMs) perform well on multiple NLP tasks, including story generation, it is challenging to generate stories with both complex and creative plots. Existing methods often rely on detailed prompts to guide LLMs to meet target conditions, which inadvertently restrict the creative potential of the generated stories. We argue that leveraging information from exemplary human-written stories facilitates generating more diverse plotlines. Delving deeper into story details helps build complex and credible plots. In this paper, we propose a retrieval-auGmented stoRy generation framework with a fOrest of eVidEnce (GROVE) to enhance stories’ complexity. We build a retrieval repository for target conditions to produce few-shot examples to prompt LLMs. Additionally, we design an “asking-why” prompting scheme that extracts a forest of evidence, providing compensation for the ambiguities that may occur in the generated story. This iterative process uncovers underlying story backgrounds. Finally, we select the most fitting chains of evidence from the evidence forest and integrate them into the generated story, thereby enhancing the narrative’s complexity and credibility. Experimental results and numerous examples verify the effectiveness of our method.

2023FINDINGSNot Relevant

NewsBench: A Systematic Evaluation Framework for Assessing Editorial Capabilities of Large Language Models in Chinese Journalism

Miao Li

We present NewsBench, a novel evaluation framework to systematically assess the editorial capabilities of Large Language Models (LLMs) in Chinese journalism. Our constructed benchmark dataset focuses on four facets of writing proficiency and six facets of safety adherence, and it comprises 1,267 manually and carefully designed test samples in the form of multiple-choice questions and short-answer questions for five editorial tasks in 24 news domains. To measure performance, we propose different GPT-4 based automatic evaluation protocols to assess LLM generations for short-answer questions in terms of writing proficiency and safety adherence, and both are validated by high correlations with human evaluations. Based on the systematic evaluation framework, we conduct a comprehensive analysis of eleven popular LLMs which can handle Chinese. The experimental results highlight GPT-4 and ERNIE Bot as top performers, yet reveal a relative deficiency in journalistic safety adherence in creative writing tasks. Our findings also underscore the need for enhanced ethical guidance in machine-generated journalistic content, marking a step forward in aligning LLMs with journalistic standards and safety considerations. The evaluation framework and experimental results are expected to provide an in-depth understanding of the editorial capabilities of LLMs and speed up the development of LLMs in journalism.

2024ACLNot Relevant

Decoding Decoded: Understanding Hyperparameter Effects in Open-Ended Text Generation

Esteban Garces Arias

Decoding strategies for generative large language models (LLMs) are a critical but often underexplored aspect of text generation tasks. Guided by specific hyperparameters, these strategies aim to transform the raw probability distributions produced by language models into coherent, fluent text. In this study, we undertake a large-scale empirical assessment of a range of decoding methods, open-source LLMs, textual domains, and evaluation protocols to determine how hyperparameter choices shape the outputs. Our experiments include both factual (e.g., news) and creative (e.g., fiction) domains, and incorporate a broad suite of automatic evaluation metrics alongside human judgments. Through extensive sensitivity analyses, we distill practical recommendations for selecting and tuning hyperparameters, noting that optimal configurations vary across models and tasks. By synthesizing these insights, this study provides actionable guidance for refining decoding strategies, enabling researchers and practitioners to achieve higher-quality, more reliable, and context-appropriate text generation outcomes.

2025COLINGNot Relevant

Dynamics of Instruction Fine-Tuning for Chinese Large Language Models

Chiyu Song

Instruction tuning is a burgeoning method to elicit the general intelligence of Large Language Models (LLMs). While numerous studies have examined the impact of factors such as data volume and model size on English models, the scaling properties of instruction tuning in other languages remain largely unexplored. In this work, we systematically investigate the effects of data quantity, model size, and data construction methods on instruction tuning for Chinese LLMs. We utilize a newly curated dataset, DoIT, which includes over 40,000 high-quality instruction instances covering ten underlying abilities, such as creative writing, code generation, and logical reasoning. Our experiments, conducted on models ranging from 7b to 33b parameters, yield three key findings: (i) While these factors directly affect overall model performance, some abilities are more responsive to scaling, whereas others demonstrate significant resistance. (ii) The scaling sensitivity of different abilities to these factors can be explained by two features: Complexity and Transference. (iii) By tailoring training strategies to their varying sensitivities, specific abilities can be efficiently learned, enhancing performance on two public benchmarks.

2025COLINGNot Relevant

Comparing the Evaluation and Production of Loophole Behavior in Humans and Large Language Models

Sonia Murthy

In law, lore, and everyday life, loopholes are commonplace. When people exploit a loophole, they understand the intended meaning or goal of another person, but choose to go with a different interpretation. Past and current AI research has shown that artificial intelligence engages in what seems superficially like the exploitation of loopholes, but this is likely anthropomorphization. It remains unclear to what extent current models, especially Large Language Models (LLMs), capture the pragmatic understanding required for engaging in loopholes. We examined the performance of LLMs on two metrics developed for studying loophole behavior in humans: evaluation (ratings of trouble, upset, and humor), and generation (coming up with new loopholes in a given context). We conducted a fine-grained comparison of state-of-the-art LLMs to humans, and find that while many of the models rate loophole behaviors as resulting in less trouble and upset than outright non-compliance (in line with adults), they struggle to recognize the humor in the creative exploitation of loopholes in the way that humans do. Furthermore, only two of the models, GPT-3 and GPT-3.5, are capable of generating loopholes of their own, with GPT-3.5 performing closest to the human baseline.

2023FINDINGSReviewed
evaluates a creative featureevaluation › automatic metricsevaluation › human evalmodel used › Large (>32B)related to creativity › related to creativity as a human abilityscope › prompt engineering

Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks

Justin Zhao

As Large Language Models (LLMs) continue to evolve, evaluating them remains a persistent challenge. Many recent evaluations use LLMs as judges to score outputs from other LLMs, often relying on a single large model like GPT-4o. However, using a single LLM judge is prone to intra-model bias, and many tasks – such as those related to emotional intelligence, creative writing, and persuasiveness – may be too subjective for a single model to judge fairly. We introduce the Language Model Council (LMC), where a group of LLMs collaborate to create tests, respond to them, and evaluate each other’s responses to produce a ranking in a democratic fashion. Unlike previous approaches that focus on reducing cost or bias by using a panel of smaller models, our work examines the benefits and nuances of a fully inclusive LLM evaluation system. In a detailed case study on emotional intelligence, we deploy a council of 20 recent LLMs to rank each other on open-ended responses to interpersonal conflicts. Our results show that the LMC produces rankings that are more separable and more robust, and through a user study, we show that they are more consistent with human evaluations than any individual LLM judge. Using all LLMs for judging can be costly, however, so we use Monte Carlo simulations and hand-curated sub-councils to study hypothetical council compositions and discuss the value of the incremental LLM judge.

2025NAACLRelevant

Can Peter Pan Survive MT? A Stylometric Study of LLMs, NMTs, and HTs in Children’s Literature Translation

Delu Kong

This study focuses on evaluating the performance of machine translations (MTs) compared to human translations (HTs) in children’s literature translation (CLT) from a stylometric perspective. The research constructs a Peter Pan corpus, comprising 21 translations: 7 human translations (HTs), 7 large language model translations (LLMs), and 7 neural machine translation outputs (NMTs). The analysis employs a generic feature set (including lexical, syntactic, readability, and n-gram features) and a creative text translation (CTT-specific) feature set, which captures repetition, rhyme, translatability, and miscellaneous levels, yielding 447 linguistic features in total. Using classification and clustering techniques in machine learning, we conduct a stylometric analysis of these translations. Results reveal that in generic features, HTs and MTs exhibit significant differences in conjunction word distributions and the ratio of 1-word-gram-一样, while NMTs and LLMs show significant variation in descriptive word usage and adverb ratios. Regarding CTT-specific features, LLMs outperform NMTs in distribution, aligning more closely with HTs in stylistic characteristics and demonstrating the potential of LLMs in CLT.

2025CTTRelevant

Help Me Write a Story: Evaluating LLMs’ Ability to Generate Writing Feedback

Hannah Rashkin

Can LLMs provide support to creative writers by giving meaningful writing feedback? In this paper, we explore the challenges and limitations of model-generated writing feedback by defining a new task, dataset, and evaluation frameworks. To study model performance in a controlled manner, we present a novel test set of 1,300 stories that we corrupted to intentionally introduce writing issues. We study the performance of commonly used LLMs in this task with both automatic and human evaluation metrics. Our analysis shows that current models have strong out-of-the-box behavior in many respects—providing specific and mostly accurate writing feedback. However, models often fail to identify the biggest writing issue in the story and to correctly decide when to offer critical vs. positive feedback.

2025ACLRelevant
related to creativity › related to creativity as a human abilitytextual genre › literaturemodel used › Large (>32B)evaluation › automatic metricsevaluation › human eval

The Translator’s Canvas: Using LLMs to Enhance Poetry Translation

Natália Resende

We explore the potential of LLMs to enhance the translation process of rhymed and non-rhymed poetry. We examine LLMs’ performance in terms of lexical variety, lexical density, and sentence length compared to human translations (HT). We also examine the models’ abilities to translate sonnets while preserving the rhyme scheme of the source text. Our findings suggest that LLMs can serve as valuable tools for literary translators, assisting with the creative process and suggesting solutions to problems that may not otherwise have been considered. However, if the paradigm is flipped, such that instead of the systems being used as tools by human translators, humans are asked to post-edit the outputs to a standard comparable to published translations, the amount of work required to complete the post-editing stage may outweigh any benefits associated with using machine translation in the first place.

2024AMTARelevant
evaluates a creative featuretextual genre › poetryevaluation › automatic metricsevaluationmodel used › Large (>32B)related to creativity › related to creativity as a human ability

Little Red Riding Hood Goes around the Globe: Crosslingual Story Planning and Generation with Large Language Models

Evgeniia Razumovskaia

Previous work has demonstrated the effectiveness of planning for story generation exclusively in a monolingual setting focusing primarily on English. We consider whether planning brings advantages to automatic story generation across languages. We propose a new task of crosslingual story generation with planning and present a new dataset for this task. We conduct a comprehensive study of different plans and generate stories in several languages, by leveraging the creative and reasoning capabilities of large pretrained language models. Our results demonstrate that plans which structure stories into three acts lead to more coherent and interesting narratives, while allowing to explicitly control their content and structure.

2024LREC, COLINGRelevant
textual genre › literaturerelated to creativity › related to creativity as a human abilityevaluation › automatic metricsevaluation › human evalscope › prompt engineering

Small But Funny: A Feedback-Driven Approach to Humor Distillation

Sahithya Ravi

The emergence of Large Language Models (LLMs) has brought to light promising language generation capabilities, particularly in performing tasks like complex reasoning and creative writing. Consequently, distillation through imitation of teacher responses has emerged as a popular technique to transfer knowledge from LLMs to more accessible, Small Language Models (SLMs). While this works well for simpler tasks, there is a substantial performance gap on tasks requiring intricate language comprehension and creativity, such as humor generation. We hypothesize that this gap may stem from the fact that creative tasks might be hard to learn by imitation alone and explore whether an approach, involving supplementary guidance from the teacher, could yield higher performance. To address this, we study the effect of assigning a dual role to the LLM - as a “teacher” generating data, as well as a “critic” evaluating the student’s performance. Our experiments on humor generation reveal that the incorporation of feedback significantly narrows the performance gap between SLMs and their larger counterparts compared to merely relying on imitation. As a result, our research highlights the potential of using feedback as an additional dimension to data when transferring complex language abilities via distillation.

2024ACLRelevant
evaluation › automatic metricsmodel used › Medium (8-24)evaluates a creative featurerelated to creativity › related to creativity as a human ability

On the Controllability of Large Language Models for Dialogue Interaction

Nicolas Wagner

This paper investigates the enhancement of Dialogue Systems by integrating the creative capabilities of Large Language Models. While traditional Dialogue Systems focus on understanding user input and selecting appropriate system actions, Language Models excel at generating natural language text based on prompts. Therefore, we propose to improve controllability and coherence of interactions by guiding a Language Model with control signals that enable explicit control over the system behaviour. To address this, we tested and evaluated our concept in 815 conversations with over 3600 dialogue exchanges on a dataset. Our experiment examined the quality of generated system responses using two strategies: An unguided strategy where task data was provided to the models, and a controlled strategy in which a simulated Dialogue Controller provided appropriate system actions. The results show that the average BLEU score and the classification of dialogue acts improved in the controlled Natural Language Generation.

2024SIGDIALNot Relevant

JU-CSE-NLP’25 at SemEval-2025 Task 4: Learning to Unlearn LLMs

Arkajyoti Naskar

Large Language Models (LLMs) have achieved enormous success recently due to their ability to understand and solve various non-trivial tasks in natural language. However, they have been shown to memorize their training data which, among other concerns, increases the risk of the model regurgitating creative or private content, potentially leading to legal issues for the model developer and/or vendors. Such issues are often discovered post-model training during testing or red teaming. While unlearning has been studied for some time in classification problems, it is still a relatively underdeveloped area of study in LLM research since the latter operates in a potentially unbounded output label space. Specifically, robust evaluation frameworks are lacking to assess the accuracy of these unlearning strategies. In this challenge, we aim to bridge this gap by developing a comprehensive evaluation challenge for unlearning sensitive datasets in LLMs.

2025SEMEVAL, WSNot Relevant

Ad Headline Generation using Self-Critical Masked Language Model

Yashal Shakti Kanungo

For any E-commerce website it is a nontrivial problem to build enduring advertisements that attract shoppers. It is hard to pass the creative quality bar of the website, especially at a large scale. We thus propose a programmatic solution to generate product advertising headlines using retail content. We propose a state of the art application of Reinforcement Learning (RL) Policy gradient methods on Transformer (Vaswani et al., 2017) based Masked Language Models (Devlin et al., 2019). Our method creates the advertising headline by jointly conditioning on multiple products that a seller wishes to advertise. We demonstrate that our method outperforms existing Transformer and LSTM RL methods in overlap metrics and quality audits. We also show that our model generated headlines outperform human submitted headlines in terms of both grammar and creative quality as determined by audits.

2021NAACLNot Relevant

PositionID: LLMs can Control Lengths, Copy and Paste with Explicit Positional Awareness

Noah Wang

Large Language Models (LLMs) demonstrate impressive capabilities across various domains, including role-playing, creative writing, mathematical reasoning, and coding. Despite these advancements, LLMs still encounter challenges with length control, frequently failing to adhere to specific length constraints due to their token-level operations and insufficient training on data with strict length limitations. We identify this issue as stemming from a lack of positional awareness and propose novel approaches—PositionID Prompting and PositionID Fine-Tuning—to address it. These methods enhance the model’s ability to continuously monitor and manage text length during generation. Additionally, we introduce PositionID CP Prompting to enable LLMs to perform copy and paste operations accurately. Furthermore, we develop two benchmarks for evaluating length control and copy-paste abilities. Our experiments demonstrate that our methods significantly improve the model’s adherence to length constraints and copy-paste accuracy without compromising response quality.
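One way to picture the positional awareness this abstract describes is to make indices explicit in the prompt itself. The sketch below is a hedged guess at that general idea; the template is an assumption for illustration, not the paper's exact PositionID format.

```python
# Hedged illustration of the PositionID idea: make positions explicit in the prompt
# so the model can track how many units it has produced. The template below is an
# assumption for illustration, not the paper's prompt.

def annotate_positions(text: str) -> str:
    """Prefix every word with its 1-based index, e.g. '1:The 2:quick 3:fox'."""
    return " ".join(f"{i}:{w}" for i, w in enumerate(text.split(), start=1))

def length_controlled_prompt(instruction: str, target_words: int) -> str:
    example = annotate_positions("Creativity hides in small failures")
    return (
        f"{instruction}\n"
        f"Write exactly {target_words} words. Number each word as you write it, "
        f"like this: {example}\n"
        f"Stop once you reach word {target_words}."
    )

print(length_controlled_prompt("Describe a rainy street.", 25))
```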

2024FINDINGSNot Relevant

Building Japanese Creativity Benchmarks and Applying them to Enhance LLM Creativity

So Fukuda

To evaluate the creativity of large language models (LLMs) in Japanese, we construct three benchmarks: Japanese Creativity Questions (JCQ), Divergent Association Task (DAT), and Story Alteration Task (SAT). JCQ comprehensively evaluates creativity using LLMs. Meanwhile, DAT and SAT measure specific aspects of creative ability using embeddings. We also analyze correlations between JCQ and DAT, JCQ and SAT, and DAT and SAT. While JCQ provides comprehensive evaluation, it is relatively time and resource intensive. In contrast, DAT and SAT offer lower comprehensiveness but enable quick, low-cost assessment. Additionally, we investigate whether training with DAT contributes to enhancing LLM creativity.

2025ACL, WSRelevant
evaluation › creativity evaluationrelated to creativity › mentions creativity as a human abilitymodel used › Medium (8-24)model used › Small (<3B)evaluation › automatic metricsevaluates a creative featureevaluation › human eval

RISCORE: Enhancing In-Context Riddle Solving in Language Models through Context-Reconstructed Example Augmentation

Ioannis Panagiotopoulos

Riddle-solving requires advanced reasoning skills, pushing Large Language Models (LLMs) to engage in abstract thinking and creative problem-solving, often revealing limitations in their cognitive abilities. In this paper, we examine the riddle-solving capabilities of LLMs using a multiple-choice format, exploring how different prompting techniques impact performance on riddles that demand diverse reasoning skills. To enhance results, we introduce RISCORE (RIddle Solving with COntext REconstruction), a novel fully automated prompting method that generates and utilizes contextually reconstructed sentence-based puzzles in conjunction with the original examples to create few-shot exemplars. Our experiments demonstrate that RISCORE significantly improves the performance of language models in both vertical and lateral thinking tasks, surpassing traditional exemplar selection strategies across a variety of few-shot settings.
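A rough sketch of how such few-shot exemplars could be assembled, pairing original riddles with context-reconstructed variants in a single multiple-choice prompt; the data structures and formatting are assumptions, not the released RISCORE code.

```python
# Sketch of RISCORE-style prompt assembly: original exemplars plus contextually
# reconstructed variants are concatenated before the target riddle. The pairing
# and formatting here are illustrative assumptions.

from typing import List, Tuple

def build_fewshot_prompt(
    exemplars: List[Tuple[str, str]],      # (riddle, answer) originals
    reconstructed: List[Tuple[str, str]],  # context-reconstructed variants
    target_riddle: str,
    options: List[str],
) -> str:
    blocks = [
        f"Riddle: {q}\nAnswer: {a}"
        for q, a in list(exemplars) + list(reconstructed)
    ]
    letters = "ABCDE"
    choices = "\n".join(f"{letters[i]}. {o}" for i, o in enumerate(options))
    return "\n\n".join(blocks) + f"\n\nRiddle: {target_riddle}\n{choices}\nAnswer:"
```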

2025COLINGRelevant
scope › prompt engineeringevaluates a creative featurerelated to creativity › related to creativity as a human abilitymodel used › Large (>32B)evaluates a creative feature › riddles

Collective Critics for Creative Story Generation

Minwook Bae

Generating a long story of several thousand words with narrative coherence using Large Language Models (LLMs) has been a challenging task. Previous research has addressed this challenge by proposing different frameworks that create a story plan and generate a long story based on that plan. However, these frameworks have mainly focused on maintaining narrative coherence in stories, often overlooking creativity in story planning and the expressiveness of the stories generated from those plans, which are desirable properties to captivate readers’ interest. In this paper, we propose the Collective Critics for Creative Story Generation framework (CritiCS), which is composed of a plan refining stage (CrPlan) and a story generation stage (CrText), to integrate a collective revision mechanism that promotes those properties into the long-form story generation process. Specifically, in each stage, a group of LLM critics and one leader collaborate to incrementally refine drafts of the plan and story throughout multiple rounds. Extensive human evaluation shows that CritiCS can significantly enhance story creativity and reader engagement, while also maintaining narrative coherence. Furthermore, the design of the framework allows active participation from human writers in any role within the critique process, enabling interactive human-machine collaboration in story writing.
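A hedged sketch of one collective-critique round in the spirit described above: several critic personas comment on a draft and a leader merges the critiques into a single revision instruction. The `llm` callable and the personas are placeholders, not the CritiCS implementation.

```python
# Hedged sketch of a collective-critique round: critics comment, a leader merges
# the comments, and the draft is revised. `llm` stands in for any chat-completion
# call; none of this is the authors' code.

from typing import Callable, List

def critique_round(
    llm: Callable[[str], str],
    draft: str,
    critic_personas: List[str],
) -> str:
    comments = [
        llm(f"You are {persona}. Critique this story draft for creativity and "
            f"expressiveness:\n{draft}")
        for persona in critic_personas
    ]
    merged = llm(
        "You are the leader of a writers' room. Combine these critiques into one "
        "concrete revision instruction:\n" + "\n---\n".join(comments)
    )
    return llm(f"Revise the draft according to the instruction.\n"
               f"Draft:\n{draft}\nInstruction:\n{merged}")

def refine(llm: Callable[[str], str], draft: str, rounds: int = 3) -> str:
    personas = ["a poet", "a screenwriter", "an avid reader"]
    for _ in range(rounds):
        draft = critique_round(llm, draft, personas)
    return draft
```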

2024EMNLPRelevant

SemEval-2025 Task 4: Unlearning sensitive content from Large Language Models

Anil Ramakrishna

We introduce SemEval-2025 Task 4: unlearning sensitive content from Large Language Models (LLMs). The task features 3 subtasks for LLM unlearning spanning different use cases: (1) unlearn long form synthetic creative documents spanning different genres; (2) unlearn short form synthetic biographies containing personally identifiable information (PII), including fake names, phone numbers, SSN, email and home addresses; and (3) unlearn real documents sampled from the target model’s training dataset. We received over 100 submissions from over 30 institutions and we summarize the key techniques and lessons in this paper.

2025SEMEVAL, WSNot Relevant

HoLLMwood: Unleashing the Creativity of Large Language Models in Screenwriting via Role Playing

Jing Chen

Generative AI has demonstrated unprecedented creativity in the field of computer vision, yet such phenomena have not been observed in natural language processing. In particular, large language models (LLMs) can hardly produce written works at the level of human experts due to the extremely high complexity of literature writing. In this paper, we present HoLLMwood, an automated framework for unleashing the creativity of LLMs and exploring their potential in screenwriting, which is a highly demanding task. Mimicking the human creative process, we assign LLMs to different roles involved in the real-world scenario. In addition to the common practice of treating LLMs as Writer, we also apply LLMs as Editor, who is responsible for providing feedback and revision advice to Writer. Besides, to enrich the characters and deepen the plots, we introduce a role-playing mechanism and adopt LLMs as Actors that can communicate and interact with each other. Evaluations on automatically generated screenplays show that HoLLMwood substantially outperforms strong baselines in terms of coherence, relevance, interestingness and overall quality.

2024FINDINGSRelevant
evaluation › LLM-as-a-judgeevaluation › automatic metricsevaluates a creative featurerelated to creativity › related to creativity as a human abilityscope › prompt engineering

“A good pun is its own reword”: Can Large Language Models Understand Puns?

Zhijun Xu

Puns play a vital role in academic research due to their distinct structure and clear definition, which aid in the comprehensive analysis of linguistic humor. However, the understanding of puns in large language models (LLMs) has not been thoroughly examined, limiting their use in creative writing and humor creation. In this paper, we leverage three popular tasks, i.e., pun recognition, explanation and generation to systematically evaluate the capabilities of LLMs in pun understanding. In addition to adopting the automated evaluation metrics from prior research, we introduce new evaluation methods and metrics that are better suited to the in-context learning paradigm of LLMs. These new metrics offer a more rigorous assessment of an LLM’s ability to understand puns and align more closely with human cognition than previous metrics. Our findings reveal the “lazy pun generation” pattern and identify the primary challenges LLMs encounter in understanding puns.

2024EMNLPRelevant
model used › ChatGPTmodel used › Large (>32B)model used › Medium (8-24)evaluation › automatic metricsevaluates a creative feature › punsscope › prompt engineering

Systematic Task Exploration with LLMs: A Study in Citation Text Generation

Furkan Şahinuç

Large language models (LLMs) bring unprecedented flexibility in defining and executing complex, creative natural language generation (NLG) tasks. Yet, this flexibility brings new challenges, as it introduces new degrees of freedom in formulating the task inputs and instructions and in evaluating model performance. To facilitate the exploration of creative NLG tasks, we propose a three-component research framework that consists of systematic input manipulation, reference data, and output measurement. We use this framework to explore citation text generation – a popular scholarly NLP task that lacks consensus on the task definition and evaluation metric and has not yet been tackled within the LLM paradigm. Our results highlight the importance of systematically investigating both task instruction and input configuration when prompting LLMs, and reveal non-trivial relationships between different evaluation metrics used for citation text generation. Additional human generation and human evaluation experiments provide new qualitative insights into the task to guide future research in citation text generation. We make our code and data publicly available.

2024ACLNot Relevant

Navigating the Political Compass: Evaluating Multilingual LLMs across Languages and Nationalities

Chadi Helwe

Large Language Models (LLMs) have become ubiquitous in today’s technological landscape, boasting a plethora of applications, and even endangering human jobs in complex and creative fields. One such field is journalism: LLMs are being used for summarization, generation and even fact-checking. However, in today’s political landscape, LLMs could accentuate tensions if they exhibit political bias. In this work, we evaluate the political bias of the most used 15 multilingual LLMs via the Political Compass Test. We test different scenarios, where we vary the language of the prompt, while also assigning a nationality to the model. We evaluate models on the 50 most populous countries and their official languages. Our results indicate that language has a strong influence on the political ideology displayed by a model. In addition, smaller models tend to display a more stable political ideology, i.e. ideology that is less affected by variations in the prompt.

2025FINDINGSNot Relevant

LLM Factoscope: Uncovering LLMs’ Factual Discernment through Measuring Inner States

Jinwen He

Large Language Models (LLMs) have revolutionized various domains with extensive knowledge and creative capabilities. However, a critical issue with LLMs is their tendency to produce outputs that diverge from factual reality. This phenomenon is particularly concerning in sensitive applications such as medical consultation and legal advice, where accuracy is paramount. Inspired by human lie detectors using physiological responses, we introduce the LLM Factoscope, a novel Siamese network-based model that leverages the inner states of LLMs for factual detection. Our investigation reveals distinguishable patterns in LLMs’ inner states when generating factual versus non-factual content. We demonstrate its effectiveness across various architectures, achieving over 96% accuracy on our custom-collected factual detection dataset. Our work opens a new avenue for utilizing LLMs’ inner states for factual detection and encourages further exploration into LLMs’ inner workings for enhanced reliability and transparency.

2024FINDINGSNot Relevant

Towards A “Novel” Benchmark: Evaluating Literary Fiction with Large Language Models

Wenqing Wang

Current exploration of creative generation focuses mainly on short stories, poetry, and scripts. With the expansion of Large Language Model (LLM) context windows, “novel” avenues emerge. This study aims to extend the boundaries of Natural Language Generation (NLG) evaluation by exploring LLMs’ capabilities in more challenging long-form fiction. We propose a new multi-level evaluation framework that incorporates ten metrics across the Macro, Meso, and Micro levels. An annotated fiction dataset, sourced from human authors, LLMs, and human-AI collaborations in both English and Chinese, is then constructed. Human evaluation reveals notable disparities between LLM-generated and human-authored fictions, particularly the “high-starting, low-ending” pattern in LLM outputs. We further probe ten high-performing LLMs through different prompt templates, achieving moderate correlations by strategically utilizing diverse LLMs tailored to different levels, as an initial step towards better automatic fiction evaluation. Finally, we offer a fine-grained analysis of LLMs’ capabilities through six issues, providing promising insights for future advancements.

2025FINDINGSRelevant
evaluation › creativity evaluationevaluation › human evaltextual genre › literaturescope › prompt engineeringrelated to creativity › mentions creativity as a human abilitymodel used › ChatGPTmodel used › Large (>32B)

Puzzle Solving using Reasoning of Large Language Models: A Survey

Panagiotis Giadikiaroglou

Exploring the capabilities of Large Language Models (LLMs) in puzzle solving unveils critical insights into their potential and challenges in AI, marking a significant step towards understanding their applicability in complex reasoning tasks. This survey leverages a unique taxonomy—dividing puzzles into rule-based and rule-less categories—to critically assess LLMs through various methodologies, including prompting techniques, neuro-symbolic approaches, and fine-tuning. Through a critical review of relevant datasets and benchmarks, we assess LLMs’ performance, identifying significant challenges in complex puzzle scenarios. Our findings highlight the disparity between LLM capabilities and human-like reasoning, particularly in puzzles requiring advanced logical inference. The survey underscores the necessity for novel strategies and richer datasets to advance LLMs’ puzzle-solving proficiency and contribute to AI’s logical reasoning and creative problem-solving advancements.

2024EMNLPNot Relevant

Unveiling the Essence of Poetry: Introducing a Comprehensive Dataset and Benchmark for Poem Summarization

Ridwan Mahbub

While research in natural language processing has progressed significantly in creative language generation, the question of whether language models can interpret the intended meaning of creative language largely remains unanswered. Poetry as a creative art form has existed for generations, and summarization of such content requires deciphering the figurative patterns to find out the actual intent and message of the poet. This task can provide researchers with an opportunity to evaluate the creative language interpretation capacity of language models. Unlike typical text, summarization of poems is a challenging task as poems carry a deeper meaning, which can be easily lost if only the literal meaning is considered. That being said, we propose a new task in the field of natural language understanding called ‘Poem Summarization’. As a starting point, we propose the first-ever dataset for this task, named ‘PoemSum’, consisting of 3011 samples of poetry and its corresponding summarized interpretation in the English language. We have benchmarked the performance of different state-of-the-art summarization models and provided observations on their limitations. The dataset and all relevant code used in this work have been made publicly available.

2023EMNLPNot Relevant

MacGyver: Are Large Language Models Creative Problem Solvers?

Yufei Tian

We explore the creative problem-solving capabilities of modern LLMs in a novel constrained setting. To this end, we create MACGYVER, an automatically generated dataset consisting of over 1,600 real-world problems deliberately designed to trigger innovative usage of objects and necessitate out-of-the-box thinking. We then present our collection to both LLMs and humans to compare and contrast their problem-solving abilities. MACGYVER is challenging for both groups, but in unique and complementary ways. For instance, humans excel in tasks they are familiar with but struggle with domain-specific knowledge, leading to a higher variance. In contrast, LLMs, exposed to a variety of specialized knowledge, attempt broader problems but fail by proposing physically-infeasible actions. Finally, we provide a detailed error analysis of LLMs, and demonstrate the potential of enhancing their problem-solving ability with novel prompting techniques such as iterative step-wise reflection and divergent-convergent thinking. This work (1) introduces a fresh arena for intelligent agents focusing on intricate aspects of physical reasoning, planning, and unconventional thinking, which supplements the existing spectrum of machine intelligence; and (2) provides insight into the constrained problem-solving capabilities of both humans and AI.

2024NAACLRelevant
evaluates a creative feature › logic (puzzles, etc.)creativity frameworks › psychological/cognitive

Can Large Language Models Transform Computational Social Science?

Caleb Ziems

Large language models (LLMs) are capable of successfully performing many language processing tasks zero-shot (without training data). If zero-shot LLMs can also reliably classify and explain social phenomena like persuasiveness and political ideology, then LLMs could augment the computational social science (CSS) pipeline in important ways. This work provides a road map for using LLMs as CSS tools. Towards this end, we contribute a set of prompting best practices and an extensive evaluation pipeline to measure the zero-shot performance of 13 language models on 25 representative English CSS benchmarks. On taxonomic labeling tasks (classification), LLMs fail to outperform the best fine-tuned models but still achieve fair levels of agreement with humans. On free-form coding tasks (generation), LLMs produce explanations that often exceed the quality of crowdworkers’ gold references. We conclude that the performance of today’s LLMs can augment the CSS research pipeline in two ways: (1) serving as zero-shot data annotators on human annotation teams, and (2) bootstrapping challenging creative generation tasks (e.g., explaining the underlying attributes of a text). In summary, LLMs are poised to meaningfully participate in social science analysis in partnership with humans.

2024CLNot Relevant

GODBench: A Benchmark for Multimodal Large Language Models in Video Comment Art

Yiming Lei

Video Comment Art enhances user engagement by providing creative content that conveys humor, satire, or emotional resonance, requiring a nuanced and comprehensive grasp of cultural and contextual subtleties. Although Multimodal Large Language Models (MLLMs) and Chain-of-Thought (CoT) have demonstrated strong reasoning abilities in STEM tasks (e.g., mathematics and coding), they still struggle to generate creative expressions such as resonant jokes and insightful satire. Moreover, existing benchmarks are constrained by their limited modalities and insufficient categories, hindering the exploration of comprehensive creativity in video-based Comment Art creation. To address these limitations, we introduce GODBench, a novel benchmark that integrates video and text modalities to systematically evaluate MLLMs’ abilities to compose Comment Art. Furthermore, inspired by the propagation patterns of waves in physics, we propose Ripple of Thought (RoT), a multi-step reasoning framework designed to enhance the creativity of MLLMs. Extensive experiments on GODBench reveal that existing MLLMs and CoT methods still face significant challenges in understanding and generating creative video comments. In contrast, RoT provides an effective approach to improving creative composing, highlighting its potential to drive meaningful advancements in MLLM-based creativity.

2025ACLNot Relevant

Do LLMs Generate Creative and Visually Accessible Data Visualisations?

Clarissa Miranda-Pena

Data visualisation is a valuable task that combines careful data processing with creative design. Large Language Models (LLMs) are now capable of responding to a data visualisation request in natural language with code that generates accurate data visualisations (e.g., using Matplotlib), but what about human-centered factors, such as the creativity and accessibility of the data visualisations? In this work, we study human perceptions of creativity in the data visualisations generated by LLMs, and propose metrics for accessibility. We generate a range of visualisations using GPT-4 and Claude-2 with controlled variations in prompt and inference parameters, to encourage the generation of different types of data visualisations for the same data. Subsets of these data visualisations are presented to people in a survey with questions that probe human perceptions of different aspects of creativity and accessibility. We find that the models produce visualisations that are novel, but not surprising. Our results also show that our accessibility metrics are consistent with human judgements. In all respects, the LLMs underperform visualisations produced by human-written code. To go beyond the simplest requests, these models need to become aware of human-centered factors, while maintaining accuracy.
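A compact sketch of the controlled-variation setup this abstract describes: the same visualisation request is issued at several temperatures to elicit different chart designs. The `ask_llm` callable stands in for a GPT-4 or Claude-2 call, and the request wording is an assumption.

```python
# Sketch of controlled prompt/inference variation for data-visualisation generation.
# `ask_llm` is a placeholder for a GPT-4 / Claude-2 call; prompt text is illustrative.

from typing import Callable, Dict

REQUEST = (
    "Write Python code using matplotlib that visualises monthly rainfall for "
    "2023 from a list of 12 numbers. Make the chart creative but readable."
)

def collect_variants(
    ask_llm: Callable[[str, float], str],  # (prompt, temperature) -> generated code
    temperatures=(0.2, 0.7, 1.0),
) -> Dict[float, str]:
    # Each temperature yields a different candidate visualisation for the survey.
    return {t: ask_llm(REQUEST, t) for t in temperatures}
```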

2024ALTARelevant
related to creativity › mentions creativity as a human abilitymodel used › Large (>32B)evaluation › automatic metricsevaluation › human evalscope › prompt engineeringevaluation › creativity evaluationcreativity frameworks › psychological/cognitivecreativity frameworks › computational creativity

HW-TSC at SemEval-2024 Task 9: Exploring Prompt Engineering Strategies for Brain Teaser Puzzles Through LLMs

Yinglu Li

Large Language Models (LLMs) have demonstrated impressive performance on many Natural Language Processing (NLP) tasks. However, their ability to solve more creative, lateral thinking puzzles remains relatively unexplored. In this work, we develop methods to enhance the lateral thinking and puzzle-solving capabilities of LLMs. We curate a dataset of word-type and sentence-type brain teasers requiring creative problem-solving abilities beyond commonsense reasoning. We first evaluate the zero-shot performance of models like GPT-3.5 and GPT-4 on this dataset. To improve their puzzle-solving skills, we employ prompting techniques like providing reasoning clues and chaining multiple examples to demonstrate the desired thinking process. We also fine-tune the state-of-the-art Mixtral 8x7B LLM on our dataset. Our methods enable the models to achieve strong results, securing 2nd and 3rd places in the brain teaser task. Our work highlights the potential of LLMs in acquiring complex reasoning abilities with the appropriate training. The efficacy of our approaches opens up new research avenues into advancing lateral thinking and creative problem-solving with AI systems.

2024SEMEVALRelevant
scope › prompt engineeringevaluates a creative feature › logic (puzzles, etc.)creativity frameworks › psychological/cognitive

Benchmarking Language Model Creativity: A Case Study on Code Generation

Yining Lu

As LLMs become increasingly prevalent, it is interesting to consider how “creative” these models can be. From cognitive science, creativity consists of at least two key characteristics: convergent thinking (purposefulness to achieve a given goal) and divergent thinking (adaptability to explore new environments or constraints) (CITATION). In this work, we introduce a framework for quantifying LLM creativity that incorporates the two design ingredients: (1) We introduce DENIAL PROMPTING which pushes LLMs to develop more creative solutions to a given problem by incrementally imposing new constraints on the previous solution, compelling LLMs to adopt new strategies. (2) We define NEOGAUGE, a metric that quantifies both convergent and divergent thinking in the generated creative responses by LLMs. We test the proposed framework on Codeforces problems, which serve as both a natural dataset for coding tasks and a collection of prior human solutions. We quantify NEOGAUGE for various proprietary and open-source models and find that even the most creative model, GPT-4, still falls short of demonstrating human-like creativity. We also experiment with advanced reasoning strategies (MCTS, self-correction, etc.) and observe no significant improvement in creativity. As a by-product of our analysis, we release NEOCODER dataset for reproducing our results on future models.
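A minimal sketch of a denial-prompting loop as described above: each round bans a construct used in the previous solution, forcing the model to change strategy. The `llm` and `forbid` callables are placeholders, and the paper's NeoGauge scoring is not reproduced here.

```python
# Hedged sketch of denial prompting: constraints accumulate across rounds so each
# new solution must avoid constructs used before. All callables are placeholders.

from typing import Callable, List, Tuple

def denial_prompting(
    llm: Callable[[str], str],
    problem: str,
    forbid: Callable[[str], str],  # maps a solution to a new "do not use X" constraint
    rounds: int = 4,
) -> List[Tuple[List[str], str]]:
    constraints: List[str] = []
    history: List[Tuple[List[str], str]] = []
    for _ in range(rounds):
        prompt = problem
        if constraints:
            prompt += "\nConstraints:\n" + "\n".join(f"- {c}" for c in constraints)
        solution = llm(prompt)
        history.append((list(constraints), solution))   # snapshot constraints per round
        constraints.append(forbid(solution))            # e.g. "do not use a for loop"
    return history
```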

2025NAACLRelevant
evaluation › creativity evaluationevaluation › automatic metricsevaluation › human evalrelated to creativity › mentions creativity as a human abilitycreativity frameworks › computational creativity

Probing the “Creativity” of Large Language Models: Can models produce divergent semantic association?

Honghua Chen

Large language models possess remarkable capacity for processing language, but it remains unclear whether these models can further generate creative content. The present study aims to investigate the creative thinking of large language models through a cognitive perspective. We utilize the divergent association task (DAT), an objective measurement of creativity that asks models to generate unrelated words and calculates the semantic distance between them. We compare the results across different models and decoding strategies. Our findings indicate that: (1) When using the greedy search strategy, GPT-4 outperforms 96% of humans, while GPT-3.5-turbo exceeds the average human level. (2) Stochastic sampling and temperature scaling are effective to obtain higher DAT scores for models except GPT-4, but face a trade-off between creativity and stability. These results imply that advanced large language models have divergent semantic associations, which is a fundamental process underlying creativity.
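The DAT scoring described here reduces to averaging pairwise semantic distances over the generated words; a minimal sketch follows, with the embedding function left abstract (the original DAT uses static word vectors such as GloVe). This is not the paper's exact scoring code.

```python
# Minimal sketch of DAT-style scoring: embed the generated words and average the
# pairwise cosine distances. The `embed` function is a placeholder.

from itertools import combinations
from typing import Callable, List
import numpy as np

def dat_score(words: List[str], embed: Callable[[str], np.ndarray]) -> float:
    vecs = [embed(w) for w in words]
    dists = [
        1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        for a, b in combinations(vecs, 2)
    ]
    return float(np.mean(dists))  # higher = more divergent semantic associations
```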

2023FINDINGSRelevant
evaluation › creativity evaluationrelated to creativity › mentions creativity as a human abilityscope › prompt engineeringmodel used › ChatGPTmodel used › Large (>32B)evaluates a creative feature › logic (puzzles, etc.)

A Confederacy of Models: a Comprehensive Evaluation of LLMs on Creative Writing

Carlos Gómez-Rodríguez

We evaluate a range of recent LLMs on English creative writing, a challenging and complex task that requires imagination, coherence, and style. We use a difficult, open-ended scenario chosen to avoid training data reuse: an epic narration of a single combat between Ignatius J. Reilly, the protagonist of the Pulitzer Prize-winning novel A Confederacy of Dunces (1980), and a pterodactyl, a prehistoric flying reptile. We ask several LLMs and humans to write such a story and conduct a human evaluation involving various criteria such as fluency, coherence, originality, humor, and style. Our results show that some state-of-the-art commercial LLMs match or slightly outperform our writers in most dimensions, whereas open-source LLMs lag behind. Humans retain an edge in creativity, while humor shows a binary divide between LLMs that can handle it comparably to humans and those that fail at it. We discuss the implications and limitations of our study and suggest directions for future research.

2023FINDINGSRelevant
related to creativity › mentions creativity as a human abilityevaluation › creativity evaluationtextual genre › literatureevaluation › human evalmodel used › Large (>32B)model used › ChatGPTcreativity frameworks › creative-textual creativity

Scientific and Creative Analogies in Pretrained Language Models

Tamara Czinczoll

This paper examines the encoding of analogy in large-scale pretrained language models, such as BERT and GPT-2. Existing analogy datasets typically focus on a limited set of analogical relations, with a high similarity of the two domains between which the analogy holds. As a more realistic setup, we introduce the Scientific and Creative Analogy dataset (SCAN), a novel analogy dataset containing systematic mappings of multiple attributes and relational structures across dissimilar domains. Using this dataset, we test the analogical reasoning capabilities of several widely-used pretrained language models (LMs). We find that state-of-the-art LMs achieve low performance on these complex analogy tasks, highlighting the challenges still posed by analogy understanding.

2022FINDINGSRelevant
creativity frameworks › creative-textual creativityevaluates a creative feature › logic (puzzles, etc.)evaluation › automatic metricsevaluation › creativity evaluationevaluation › word-levelmodel used › Medium (8-24)related to creativity › related to creativity as a human abilityscope › technical research

SimulBench: Evaluating Language Models with Creative Simulation Tasks

Qi Jia

We introduce SimulBench, a benchmark designed to evaluate large language models (LLMs) across a diverse collection of creative simulation tasks, such as acting as a Linux terminal or playing text games with users. While these simulation tasks serve as effective measures of an LLM’s general intelligence, they are seldom incorporated into existing benchmarks. A major challenge is to develop an evaluation framework for testing different LLMs fairly while preserving the multi-round interactive nature of simulation tasks between users and AI. To tackle this issue, we suggest first using a fixed LLM as a user agent to engage with an LLM and collect dialogues under different tasks. Then, challenging dialogue scripts are extracted for evaluating different target LLMs. To facilitate automatic assessment on SimulBench, GPT-4 is employed as the evaluator, tasked with reviewing the quality of the final response generated by the target LLMs given multi-turn dialogue scripts. Our comprehensive experiments indicate that these creative simulation tasks continue to pose a significant challenge given their unique natures and show the gap between proprietary models and the most advanced open LLMs. For example, GPT-4-turbo outperforms LLaMA-3-70b-Chat on 18.55% more cases.

2025FINDINGSNot Relevant

Investigating Wit, Creativity, and Detectability of Large Language Models in Domain-Specific Writing Style Adaptation of Reddit’s Showerthoughts

Tolga Buz

Recent Large Language Models (LLMs) have shown the ability to generate content that is difficult or impossible to distinguish from human writing. We investigate the ability of differently-sized LLMs to replicate human writing style in short, creative texts in the domain of Showerthoughts, thoughts that may occur during mundane activities. We compare GPT-2 and GPT-Neo fine-tuned on Reddit data as well as GPT-3.5 invoked in a zero-shot manner, against human-authored texts. We measure human preference on the texts across the specific dimensions that account for the quality of creative, witty texts. Additionally, we compare the ability of humans versus fine-tuned RoBERTa-based classifiers to detect AI-generated texts. We conclude that human evaluators rate the generated texts slightly worse on average regarding their creative quality, but they are unable to reliably distinguish between human-written and AI-generated texts. We further provide the dataset for creative, witty text generation based on Reddit Showerthoughts posts.
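The detection side of this study uses a RoBERTa-based classifier; a minimal sketch with the transformers pipeline is shown below. The checkpoint name is a placeholder (the authors fine-tune their own classifier on Showerthoughts data), so the classification head loaded here is untrained and purely illustrative.

```python
# Sketch of a RoBERTa-based human-vs-AI text detector. The model name is a
# placeholder; in practice one would load a checkpoint fine-tuned for detection.

from transformers import pipeline

detector = pipeline(
    "text-classification",
    model="roberta-base",  # placeholder: head is untrained until fine-tuned on labeled texts
)

texts = [
    "Your age is just the number of laps you've done around the sun.",
    "Water is simply a beverage enjoyed by all living creatures daily.",
]
for t in texts:
    print(t, "->", detector(t)[0])  # illustrative labels only with an untrained head
```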

2024STARSEMRelevant
creativity frameworks › creative-textual creativitymodel used › ChatGPTmodel used › Large (>32B)model used › Small (<3B)textual genre › literatureevaluation › human eval

Extending CREAMT: Leveraging Large Language Models for Literary Translation Post-Editing

Antonio Castaldo

Post-editing machine translation (MT) for creative texts, such as literature, requires balancing efficiency with the preservation of creativity and style. While neural MT systems struggle with these challenges, large language models (LLMs) offer improved capabilities for context-aware and creative translation. This study evaluates the feasibility of post-editing literary translations generated by LLMs. Using a custom research tool, we collaborated with professional literary translators to analyze editing time, quality, and creativity. Our results indicate that post-editing (PE) LLM-generated translations significantly reduce editing time compared to human translation while maintaining a similar level of creativity. The minimal difference in creativity between PE and MT, combined with substantial productivity gains, suggests that LLMs may effectively support literary translators.

2025MTSUMMITRelevant
creativity frameworks › creative-textual creativityrelated to creativity › mentions creativity as a human abilitytextual genre › literaturemodel used › Large (>32B)model used › ChatGPTevaluation › sentence-levelevaluation › automatic metricsevaluation › creativity evaluation

Do LLMs Plan Like Human Writers? Comparing Journalist Coverage of Press Releases with LLMs

Alexander Spangher

Journalists engage in multiple steps in the news writing process that depend on human creativity, like exploring different “angles” (i.e. the specific perspectives a reporter takes). These can potentially be aided by large language models (LLMs). By affecting planning decisions, such interventions can have an outsize impact on creative output. We advocate a careful approach to evaluating these interventions to ensure alignment with human values. In a case study of journalistic coverage of press releases, we assemble a large dataset of 250k press releases and 650k articles covering them. We develop methods to identify news articles that challenge and contextualize press releases. Finally, we evaluate suggestions made by LLMs for these articles and compare these with decisions made by human journalists. Our findings are three-fold: (1) Human-written news articles that more strongly challenge and contextualize press releases take more creative angles and use more informational sources. (2) LLMs align better with humans when recommending angles, compared with informational sources. (3) Both the angles and sources LLMs suggest are significantly less creative than humans’.

2024EMNLPRelevant
creativity frameworks › computational creativityrelated to creativity › related to creativity as a textual genreevaluation › automatic metricsmodel used › Large (>32B)model used › ChatGPT

Small Language Models can Outperform Humans in Short Creative Writing: A Study Comparing SLMs with Humans and LLMs

Guillermo Marco

In this paper, we evaluate the creative fiction writing abilities of a fine-tuned small language model (SLM), BART-large, and compare its performance to human writers and two large language models (LLMs): GPT-3.5 and GPT-4o. Our evaluation consists of two experiments: (i) a human study in which 68 participants rated short stories from humans and the SLM on grammaticality, relevance, creativity, and attractiveness, and (ii) a qualitative linguistic analysis examining the textual characteristics of stories produced by each model. In the first experiment, BART-large outscored average human writers overall (2.11 vs. 1.85), a 14% relative improvement, though the slight human advantage in creativity was not statistically significant. In the second experiment, qualitative analysis showed that while GPT-4o demonstrated near-perfect coherence and used fewer clichéd phrases, it tended to produce more predictable language, with only 3% of its synopses featuring surprising associations (compared to 15% for BART). These findings highlight how model size and fine-tuning influence the balance between creativity, fluency, and coherence in creative writing tasks, and demonstrate that smaller models can, in certain contexts, rival both humans and larger models.

2025COLINGRelevant
related to creativity › related to creativity as a textual genremodel used › Large (>32B)evaluation › human evalevaluation › creativity evaluationtextual genre › literaturecreativity frameworks › creative-textual creativity

Subjective Topic meets LLMs: Unleashing Comprehensive, Reflective and Creative Thinking through the Negation of Negation

Fangrui Lv

Large language models (LLMs) exhibit powerful reasoning capacity, as evidenced by prior studies focusing on objective topics with unique standard answers, such as arithmetic and commonsense reasoning. However, reasoning toward definite answers emphasizes logical thinking and falls short of effectively reflecting the comprehensive, reflective, and creative thinking that is also critical for the overall reasoning prowess of LLMs. In light of this, we build a dataset, SJTP, comprising diverse SubJective ToPics with free responses, as well as three evaluation indicators to fully explore LLMs’ reasoning ability. We observe that a sole emphasis on logical thinking falls short in effectively tackling subjective challenges. Therefore, we introduce a framework grounded in the principle of the Negation of Negation (NeoN) to unleash the potential comprehensive, reflective, and creative thinking abilities of LLMs. Comprehensive experiments on SJTP demonstrate the efficacy of NeoN, and the enhanced performance on various objective reasoning tasks unequivocally underscores the benefits of stimulating LLMs’ subjective thinking in augmenting overall reasoning capabilities.

2024EMNLPRelevant
creativity frameworks › psychological/cognitivemodel used › Large (>32B)scope › creative trainingmodel used › Medium (8-24)related to creativity › related to creativity as a human ability

Pron vs Prompt: Can Large Language Models already Challenge a World-Class Fiction Author at Creative Text Writing?

Guillermo Marco

Are LLMs ready to compete in creative writing skills with a top (rather than average) novelist? To provide an initial answer to this question, we have carried out a contest between Patricio Pron (an award-winning novelist, considered one of the best of his generation) and GPT-4 (one of the top-performing LLMs), in the spirit of AI-human duels such as DeepBlue vs Kasparov and AlphaGo vs Lee Sedol. We asked Pron and GPT-4 to provide thirty titles each, and then to write short stories for both their titles and their opponent’s. Then, we prepared an evaluation rubric inspired by Boden’s definition of creativity, and we collected several detailed expert assessments of the texts, provided by literature critics and scholars. The results of our experimentation indicate that LLMs are still far from challenging a top human creative writer. We also observed that GPT-4 writes more creatively using Pron’s titles than its own titles (which is an indication of the potential for human-machine co-creation). Additionally, we found that GPT-4 has a more creative writing style in English than in Spanish.

2024EMNLPRelevant
creativity frameworks › psychological/cognitiveevaluation › human evalevaluation › creativity evaluationmodel used › Large (>32B)model used › ChatGPTrelated to creativity › mentions creativity as a human abilitytextual genre › literature

Prompting Large Language Models for Idiomatic Translation

Antonio Castaldo

Large Language Models (LLMs) have demonstrated impressive performance in translating content across different languages and genres. Yet, their potential in the creative aspects of machine translation has not been fully explored. In this paper, we seek to identify the strengths and weaknesses inherent in different LLMs when applied to one of the most prominent features of creative works: the translation of idiomatic expressions. We present an overview of their performance in the EN→IT language pair, a context characterized by an evident lack of bilingual data tailored for idiomatic translation. Lastly, we investigate the impact of prompt design on the quality of machine translation, drawing on recent findings which indicate a substantial variation in the performance of LLMs depending on the prompts utilized.
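A small illustration of the prompt-design comparison for EN→IT idiomatic translation described above; both prompt templates and the `ask_llm` callable are assumptions for illustration, not the prompts evaluated in the paper.

```python
# Sketch of two prompt designs for idiom translation: a plain request versus one
# that explicitly asks for an equivalent idiom. Templates are illustrative only.

from typing import Callable

IDIOM = "It's raining cats and dogs."

def literal_prompt(sentence: str) -> str:
    return f"Translate into Italian: {sentence}"

def idiom_aware_prompt(sentence: str) -> str:
    return (
        "Translate the following English sentence into Italian. If it contains an "
        f"idiom, use an equivalent Italian idiom rather than a literal rendering:\n{sentence}"
    )

def compare(ask_llm: Callable[[str], str]) -> None:
    print("literal prompt ->", ask_llm(literal_prompt(IDIOM)))
    print("idiom-aware prompt ->", ask_llm(idiom_aware_prompt(IDIOM)))
```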

2024CTT, WSRelevant
related to creativity › mentions creativity as a human abilitycreativity frameworks › creative-textual creativitymodel used › Large (>32B)model used › ChatGPTevaluation › sentence-levelevaluation › automatic metricsevaluation › human evalevaluates a creative feature › idiomatic expressionstextual genre › literature

Creative Natural Language Generation

Tuhin Chakrabarty

Large language models such as GPT-3, GPT-4, Claude, etc., have advanced the state of the art in several natural language generation tasks such as text summarization and machine translation. However, when it comes to open-ended tasks with a focus on creativity, such as generating stories, poetry, or various forms of figurative language, these state-of-the-art language models are often found to be inadequate. This tutorial aims to bring awareness of the important and emerging research area of open-domain creative generation, with a focus on language generation while also touching on multi-modal generation (e.g., image captioning, visual metaphors). It targets natural language processing (NLP) and artificial intelligence (AI) researchers as well as creative writing practitioners who are interested in building systems that are capable of emulating as well as augmenting human creativity. In particular, we will review recent studies on creative language generation both at the sentence level as well as in longer forms of text. We will provide the audience with a holistic view of 1) the importance and challenges of building creative language generation systems; 2) how we incorporate content planning, domain knowledge and creativity-specific heuristics for different forms of creative language generation such as stories, poetry, humor, metaphors, etc.; and 3) how we can build better evaluation methods for creative text generation. In particular, how could the recent advancement of AI shape the future workforce for creativity? We will conclude the tutorial by outlining future research directions in this area.

2023EMNLPRelevant
creativity frameworks › creative-textual creativitycreativity frameworks › psychological/cognitivecreativity frameworks › computational creativityevaluation › LLM-as-a-judgeevaluation › automatic metricsevaluation › creativity evaluation

Story Centaur: Large Language Model Few Shot Learning as a Creative Writing Tool

Ben Swanson

Few-shot learning with large language models has the potential to give individuals without formal machine learning training access to a wide range of text-to-text models. We consider how this applies to creative writers and present Story Centaur, a user interface for prototyping few-shot models and a set of recombinable web components that deploy them. Story Centaur’s goal is to expose creative writers to few-shot learning with a simple but powerful interface that lets them compose their own co-creation tools that further their own unique artistic directions. We build out several examples of such tools, and in the process probe the boundaries and issues surrounding generation with large language models.

2021EACLRelevant
creativity frameworks › creative-textual creativitymodel used › Large (>32B)scope › technical research

Unmet Creativity Support Needs in Computationally Supported Creative Writing

Max Kreminski

Large language models (LLMs) enabled by the datasets and computing power of the last decade have recently gained popularity for their capacity to generate plausible natural language text from human-provided prompts. This ability makes them appealing to fiction writers as prospective co-creative agents, addressing the common challenge of writer’s block, or getting unstuck. However, creative writers face additional challenges, including maintaining narrative consistency, developing plot structure, architecting reader experience, and refining their expressive intent, which are not well-addressed by current LLM-backed tools. In this paper, we define these needs by grounding them in cognitive and theoretical literature, then survey previous computational narrative research that holds promise for supporting each of them in a co-creative setting.

2022IN2WRITINGRelevant
creativity frameworks › computational creativitycreativity frameworks › psychological/cognitivecreativity frameworks › creative-textual creativityevaluation › creativity evaluation

Creative Planning with Language Models: Practice, Evaluation and Applications

Alexander Spangher

The use of large language models (LLMs) in human-centered creative tasks — such as journalism, scientific writing, and storytelling — has showcased their potential for content generation but highlighted a critical gap: planning. Planning, used here to describe the “actions” humans perform before (and during) the writing process, is a fundamental process in many creative domains. This tutorial explores how planning has been learned and deployed in creative workflows, unifying three scenarios: Full Data Regimens (when observational data for actions and the resulting text exist), Partial (when text exists but actions can be inferred) and Low (when neither exist). The tutorial discusses forward and backward learning approaches for planning in LLMs, evaluation metrics tailored to latent plans, and practical applications in computational journalism, web agents, and other creative domains. By bridging theoretical concepts and practical demonstrations, this tutorial aims to inspire new research directions in leveraging LLMs for creative and goal-oriented planning tasks.

2025NAACLRelevant
creativity frameworks › computational creativitycreativity frameworks › psychological/cognitive

Creative Problem Solving in Large Language and Vision Models - What Would it Take?

Lakshmi Nair

We advocate for a strong integration of Computational Creativity (CC) with research in large language and vision models (LLVMs) to address a key limitation of these models, i.e., creative problem solving. We present preliminary experiments showing how CC principles can be applied to address this limitation. Our goal is to foster discussions on creative problem solving in LLVMs and CC at prestigious ML venues.

2024FINDINGSRelevant
creativity frameworks › psychological/cognitivecreativity frameworks › computational creativityevaluation › creativity evaluationevaluation › automatic metrics