Papers

93 papers found

Review Papers

Generating creativity through ChatGPT: an empirical investigation in open innovation platforms

Li Yongjun et al.

Since the release of ChatGPT, the remarkable content comprehension and generation abilities of large language models (LLMs) have spurred knowledge democratization. On the one hand, they can lower the barriers to individual innovation, potentially enhancing personal innovative capabilities; on the other hand, their exceptional capabilities and popularity may lead to resistance and speculation, dampening demand for innovation. This work examines whether they can assist people in creative tasks. This study utilizes an extensive dataset from an Open Innovation Platform (OIP), known for fostering creativity through external ideas. Leveraging ChatGPT’s release as a natural experiment, it employs the Difference-in-Difference (DID) technique to examine how LLMs impact people’s creative participation. The results indicate that the launch enhanced people’s willingness to engage in creative tasks, reflected in an increased number of submitted works. This enhancement is not necessarily a threat to the quality of innovation, as the works increased in length by 44%, complexity by 35%, and votes by 1.5% on average. Additionally, heterogeneity analysis shows these effects are more pronounced in higher-tier innovation works and among less experienced users. Further mechanism analysis suggests that the improvement in innovation performance in OIPs stems from the lowering of the innovation threshold enabled by ChatGPT. Moreover, LLMs can increase interaction feedback to stimulate external creativity, thereby improving innovation performance. This study’s conclusions contribute to understanding the impact of the new technology, LLMs, on innovation activities across diverse populations.
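The identification strategy described above, a Difference-in-Difference comparison around the ChatGPT release, boils down to an interaction-term regression. The sketch below is a minimal illustration in Python with statsmodels, using a toy data frame and hypothetical column names (works, treated, post); it is not the authors' actual specification, which would include controls and fixed effects.

# Minimal difference-in-differences sketch (illustrative only; toy data, hypothetical columns).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "works":   [3, 4, 2, 3, 5, 9, 2, 4],   # creative submissions per user-period
    "treated": [1, 1, 0, 0, 1, 1, 0, 0],   # exposed-group indicator
    "post":    [0, 0, 0, 0, 1, 1, 1, 1],   # period after the ChatGPT release
})

# The coefficient on treated:post is the DID estimate of the release effect.
model = smf.ols("works ~ treated * post", data=df).fit()
print(model.params["treated:post"])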

Not Relevant

Generative AI and creativity: a systematic literature review and meta-analysis

Niklas Holzner et al.

Generative artificial intelligence (GenAI) is increasingly used to support a wide range of human tasks, yet empirical evidence on its effect on creativity remains scattered. Can GenAI generate ideas that are creative? To what extent can it support humans in generating ideas that are both creative and diverse? In this study, we conduct a meta-analysis to evaluate the effect of GenAI on the performance in creative tasks. For this, we first perform a systematic literature search, based on which we identify n = 28 relevant studies (m = 8214 participants) for inclusion in our meta-analysis. We then compute standardized effect sizes based on Hedges' g. We compare different outcomes: (i) how creative GenAI is; (ii) how creative humans augmented by GenAI are; and (iii) the diversity of ideas by humans augmented by GenAI. Our results show no significant difference in creative performance between GenAI and humans (g = -0.05), while humans collaborating with GenAI significantly outperform those working without assistance (g = 0.27). However, GenAI has a significant negative effect on the diversity of ideas for such collaborations between humans and GenAI (g = -0.86). We further analyze heterogeneity across different GenAI models (e.g., GPT-3.5, GPT-4), different tasks (e.g., creative writing, ideation, divergent thinking), and different participant populations (e.g., laypeople, business, academia). Overall, our results position GenAI as an augmentative tool that can support, rather than replace, human creativity, particularly in tasks benefiting from ideation support.
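The standardized effect sizes reported above use Hedges' g, i.e., Cohen's d with a small-sample correction. A short Python sketch with made-up group statistics shows the standard computation; it is a generic formula, not the authors' analysis code.

import math

def hedges_g(m1, s1, n1, m2, s2, n2):
    # Standardized mean difference between two groups with small-sample correction.
    pooled_sd = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    d = (m1 - m2) / pooled_sd              # Cohen's d
    j = 1 - 3 / (4 * (n1 + n2) - 9)        # Hedges' correction factor
    return j * d

# Hypothetical creativity ratings: human+GenAI group vs. human-only group.
print(round(hedges_g(3.4, 1.0, 50, 3.1, 1.1, 50), 2))  # ≈ 0.28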

Not Relevant

How AI ideas affect the creativity, diversity, and evolution of human ideas: Evidence from a large, dynamic experiment

Joshua Ashkinaze et al.

Exposure to large language model output is rapidly increasing. How will seeing AI-generated ideas affect human ideas? We conducted a dynamic experiment (800+ participants, 40+ countries) where participants viewed creative ideas that were from ChatGPT or prior experimental participants, and then brainstormed their own idea. We varied the number of AI-generated examples (none, low, or high exposure) and if the examples were labeled as “AI” (disclosure). We find that high AI exposure (but not low AI exposure) did not affect the creativity of individual ideas but did increase the average amount and rate of change of collective idea diversity. AI made ideas different, not better. There were no main effects of disclosure. We also found that self-reported creative people were less influenced by knowing an idea was from AI and that participants may knowingly adopt AI ideas when the task is difficult. Our findings suggest that introducing AI ideas may increase collective diversity but not individual creativity.

Proceedings of the ACM collective intelligence conference
Not Relevant

Leveraging large models to evaluate novel content: a case study on advertisement creativity

Zhaoyi Joey Hou et al.

Evaluating creativity is challenging, even for humans, not only because of its subjectivity but also because it involves complex cognitive processes. Inspired by work in marketing, we attempt to break down visual advertisement creativity into atypicality and originality. With fine-grained human annotations on these dimensions, we propose a suite of tasks specifically for such a subjective problem. We also evaluate the alignment between state-of-the-art (SoTA) vision language models (VLMs) and humans on our proposed benchmark, demonstrating both the promises and challenges of using VLMs for automatic creativity assessment.

Not Relevant

Artificial intelligence reshapes creativity: a multidimensional evaluation

Chenchen Zhang et al.

Artificial intelligence (AI) is reshaping creativity by challenging its long-held status as a uniquely human faculty. This study uses bibliometric analysis to reveal AI’s evolution from a passive instrument to an active co-creator that amplifies human intuition and expands creative possibilities. We highlight how AI-driven evaluative frameworks offer more objective, scalable, and inclusive assessments of creativity, disrupting bias-prone traditional methods. Also, this transformation raises pressing ethical and legal concerns, particularly regarding authorship, intellectual property, and recognition of machine-generated outputs. By mapping these tensions and opportunities, the study provides a critical foundation for rethinking creativity in the age of human–machine collaboration. Our findings point toward an urgent need for new conceptual models that align innovation with ethical and societal responsibility.

Not Relevant

Evaluating AI’s ideas: The role of individual creativity and expertise in human-AI co-creativity

Paul V DiStefano et al.

As generative artificial intelligence (genAI) becomes increasingly integrated into education and work, understanding who benefits most from human-AI collaboration is crucial. This study examines how domain expertise and individual differences—creative self-efficacy and baseline creative ability—influence human-AI co-creativity in an engineering design task. Using pre-generated ideas from GPT-3.5-turbo, engineering (N = 99) and psychology students (N = 212) generated an initial solution, evaluated AI-generated ideas, and revised their idea. Linear mixed-effects models demonstrated expertise and generation ability predicted solution quality. Engineering students produced more original and effective solutions, yet both groups improved comparably supporting the “rising tides lift all boats” hypothesis. A novel categorization scheme revealed group differences in idea inspiration: engineers generated more novel solutions, while psychology students tended to adopt existing ideas. These findings highlight the role of domain knowledge and individual differences in pre-existing creativity in maximizing human-AI co-creativity, emphasizing the need to develop these human abilities alongside genAI.

Not Relevant

Assessing creativity across multi-step intervention using generative AI models

Eran Hadas and Arnon Hershkovitz

Creativity is an imperative skill for today’s learners, one that has important contributions to issues of inclusion and equity in education. Therefore, assessing creativity is of major importance in educational contexts. However, scoring creativity based on traditional tools suffers from subjectivity and is highly time- and labour-intensive. This is indeed the case for the commonly used Alternative Uses Test (AUT), in which participants are asked to list as many different uses as possible for a daily object. The test measures divergent thinking (DT), which involves exploring multiple possible solutions in various semantic domains. This study leverages recent advancements in generative AI (GenAI) to automate the AUT scoring process, potentially increasing efficiency and objectivity. Using two validated models, we analyze the dynamics of creativity dimensions in a multi-step intervention aimed at improving creativity by using repeated AUT sessions (N=157 9th-grade students). Our research questions focus on the behavioural patterns of DT dimensions over time, their correlation with the number of practice opportunities, and the influence of response order on creativity scores. The results show improvement in fluency and flexibility, as a function of practice opportunities, as well as various correlations between DT dimensions. By automating the scoring process, this study aims to provide deeper insights into the development of creative skills over time and explore the capabilities of GenAI in educational assessments. Ultimately, automatic evaluation can allow creativity assessment to be incorporated into various educational processes at scale.

Relevant
creativity frameworks › psychological/cognitive; evaluation › LLM-as-a-judge; evaluation › automatic metrics; evaluation › creativity evaluation; evaluation › sentence-level; related to creativity › mentions creativity as a human ability

Creativity, context, and large language models

Max Peeperkorn

Not Relevant

Creativity behind the prompts: Automated creativity assessment in prompting for text-to-image models

Saif Abdoelrazak

This thesis delves into the intersection of artificial intelligence and creativity, specifically focusing on the application of text-to-image synthesis models. These models, gaining significant attention in recent years, hold the potential to redefine the boundaries of human imagination and challenge conventional notions of creativity. However, they also raise pertinent questions about originality, copyright, and the role of human input in the creative process. The study investigates the use of prompt engineering to augment the creativity of the generated artworks. Various prompt modifiers, including artist names and aesthetic quality descriptors, are employed to guide the synthesis process. The results indicate that the strategic use of these modifiers significantly enhances the creativity of the generated images, providing a concrete strategy for both novice and experienced users of these models. The research also explores the use of topic modeling methods, such as Gibbs Sampling Dirichlet Mixture Model (GSDMM) and BERTopic. However, several challenges, including computational constraints and limitations in the clustering methods used, are identified. Despite these challenges, the research offers valuable insights into the potential of text-to-image synthesis models and the role of prompt engineering in enhancing creativity. Future work aims to address these challenges and further explore the potential of these models in various creative domains.

Not Relevant

AI as humanity's salieri: Quantifying linguistic creativity of language models via systematic attribution of machine text against web text

Ximing Lu et al.

Creativity has long been considered one of the most difficult aspects of human intelligence for AI to mimic. However, the rise of Large Language Models (LLMs), like ChatGPT, has raised questions about whether AI can match or even surpass human creativity. We present CREATIVITY INDEX as the first step to quantify the linguistic creativity of a text by reconstructing it from existing text snippets on the web. CREATIVITY INDEX is motivated by the hypothesis that the seemingly remarkable creativity of LLMs may be attributable in large part to the creativity of human-written texts on the web. To compute CREATIVITY INDEX efficiently, we introduce DJ SEARCH, a novel dynamic programming algorithm that can search verbatim and near-verbatim matches of text snippets from a given document against the web. Experiments reveal that the CREATIVITY INDEX of professional human authors is on average 66.2% higher than that of LLMs, and that alignment reduces the CREATIVITY INDEX of LLMs by an average of 30.1%. In addition, we find that distinguished authors like Hemingway exhibit measurably higher CREATIVITY INDEX compared to other human writers. Finally, we demonstrate that CREATIVITY INDEX can be used as a surprisingly effective criterion for zero-shot machine text detection, surpassing the strongest existing zero-shot system, DetectGPT, by a significant margin of 30.2%, and even outperforming the strongest supervised system, GhostBuster, in five out of six domains.
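The reconstruction idea above can be illustrated with a toy coverage measure: the fraction of a document's word n-grams that can be found verbatim in a reference corpus. The sketch below is only a naive stand-in for intuition; it is not the paper's DJ SEARCH algorithm, which searches against the web and also handles near-verbatim matches.

def ngram_coverage(document, corpus, n=5):
    # Fraction of the document's word n-grams that appear verbatim in the corpus.
    doc_tokens = document.lower().split()
    doc_ngrams = [tuple(doc_tokens[i:i + n]) for i in range(len(doc_tokens) - n + 1)]
    corpus_ngrams = set()
    for text in corpus:
        tokens = text.lower().split()
        corpus_ngrams.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not doc_ngrams:
        return 0.0
    return sum(ng in corpus_ngrams for ng in doc_ngrams) / len(doc_ngrams)

# Higher coverage means the text is more reconstructible from existing text and,
# in the spirit of the paper, less linguistically creative.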

Relevant
creativity frameworks › creative-textual creativity; evaluation › automatic metrics; evaluation › creativity evaluation; evaluation › document-level; model used › ChatGPT; model used › Large (>32B); model used › Medium (8-24); related to creativity › related to creativity as a human ability; scope › prompt engineering; scope › technical research; textual genre › literature; textual genre › poetry

Creative agents: Simulating the systems model of creativity with generative agents

Naomi Imasato et al.

With the growing popularity of generative AI for images, video, and music, we witnessed models rapidly improve in quality and performance. However, not much attention is paid towards enabling AI’s ability to “be creative”. We often attribute the quality of “being creative” to an individual or an object, but we believe that countless variables participate in determining what or who is creative, transcending a single entity or artifact. Csikszentmihalyi’s systems model of creativity suggests that creativity is a product of interactions among multiple parts of a society that create, evaluate, and record. In this study, we implemented and simulated Csikszentmihalyi’s systems model of creativity using virtual agents utilizing large language models (LLMs) and text prompts. We conducted experiments in virtual settings where creativity is achieved with the presence of specific characteristics in the artifact. For comparison, the simulations were conducted with two “virtual artists” being 1) in the system, which received feedback from the field, and 2) isolated, which did not. Both agents were compared by analyzing the novelty, which was measured via Creativity Implication Network, and value, quantified through the desired characteristics present in artifacts. Our results suggest that the agents that receive feedback from the field can generate artifacts that are more novel and more valuable, thus more creative, in the framework of the systems model of creativity. Furthermore, the difference becomes more evident when external factors enact changes to the domain.

Not Relevant

Human creativity in the age of LLMs: Randomized experiments on divergent and convergent thinking

Harsh Kumar et al.

Large language models are transforming the creative process by offering unprecedented capabilities to algorithmically generate ideas. While these tools can enhance human creativity when people co-create with them, it’s unclear how this will impact unassisted human creativity. We conducted two large pre-registered parallel experiments involving 1,100 participants attempting tasks targeting the two core components of creativity, divergent and convergent thinking. We compare the effects of two forms of large language model (LLM) assistance—a standard LLM providing direct answers and a coach-like LLM offering guidance—with a control group receiving no AI assistance, and focus particularly on how all groups perform in a final, unassisted stage. Our findings reveal that while LLM assistance can provide short-term boosts in creativity during assisted tasks, it may inadvertently hinder independent creative performance when users work without assistance, raising concerns about the long-term impact on human creativity and cognition.

Proceedings of the 2025 CHI conference on human factors in computing systems
Relevant
creativity frameworks › psychological/cognitive; evaluates a creative feature › logic (puzzles, etc.); evaluation › LLM-as-a-judge; evaluation › automatic metrics; evaluation › creativity evaluation; model used › ChatGPT; model used › Large (>32B); related to creativity › mentions creativity as a human ability

A new dataset and method for creativity assessment using the alternate uses task

Luning Sun et al.

Creativity ratings by humans for the alternate uses task (AUT) tend to be subjective and inefficient. To automate the scoring process of the AUT, previous literature suggested using semantic distance from non-contextual models. In this paper, we extend this line of research by including contextual semantic models and more importantly, exploring the feasibility of predicting creativity ratings with supervised discriminative machine learning models. Based on a newly collected dataset, our results show that supervised models can successfully classify between creative and non-creative responses even with unbalanced data, and can generalise well to out-of-domain unseen prompts.
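The semantic-distance approach mentioned above can be reproduced in a few lines with an off-the-shelf sentence encoder; the model name below is an arbitrary illustrative choice, not the one used in the paper, and real scoring pipelines normalize and validate these distances against human ratings.

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # arbitrary contextual encoder

def semantic_distance(prompt_object, response):
    # 1 - cosine similarity between the AUT object and a candidate use.
    emb = encoder.encode([prompt_object, response])
    return 1 - util.cos_sim(emb[0], emb[1]).item()

# Larger distances are commonly used as a proxy for more original uses.
print(semantic_distance("brick", "use it to hold a door open"))
print(semantic_distance("brick", "grind it into pigment for cave-style paintings"))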

Intelligent computers, algorithms, and applications
Not Relevant
creativity frameworks › computational creativity

Autonomous measure of creativity in large language models (LLM)

Javier M. Mora-Merchan et al.

The present work proposes a new metric to measure the creativity of Large Language Models (LLM). Unlike traditional approaches that compare the results produced by LLMs with those produced by humans, this study presents an autonomous approach that does not depend on external reference systems. The methodology consists of providing a specific input prompt to LLMs, for example, generating a short three-paragraph story describing children in a park. Then, a large number of stories are generated, and techniques such as morphological or semantic grouping are applied to determine the number of original stories produced. Given that natural language processing (NLP) comparison techniques are computationally intensive and the calculation of distances is of order O(n²), it is not possible to directly count the number of original stories generated per prompt. To address this, models have been developed from a calibration set, which simulate story generation and from which the maximum number of stories that can be generated is inferred. This methodology is not specific to the LLM or the prompt used, so it can be used to compare existing systems.

A cross-disciplinary exploration of STEM
Not Relevant

The impact of AI on creativity: Enhancing human potential or challenging creative expression

Luka Baklaga


Not Relevant

Luminate: Structured generation and exploration of design space with large language models for human-AI co-creation

Sangho Suh et al.

Thanks to their generative capabilities, large language models (LLMs) have become an invaluable tool for creative processes. These models have the capacity to produce hundreds and thousands of visual and textual outputs, offering abundant inspiration for creative endeavors. But are we harnessing their full potential? We argue that current interaction paradigms fall short, guiding users towards rapid convergence on a limited set of ideas, rather than empowering them to explore the vast latent design space in generative models. To address this limitation, we propose a framework that facilitates the structured generation of design space in which users can seamlessly explore, evaluate, and synthesize a multitude of responses. We demonstrate the feasibility and usefulness of this framework through the design and development of an interactive system, Luminate, and a user study with 14 professional writers. Our work advances how we interact with LLMs for creative tasks, introducing a way to harness the creative potential of LLMs.

Proceedings of the 2024 CHI conference on human factors in computing systems
Relevant
evaluation › creativity evaluation; evaluation › human eval; creativity frameworks › creative-textual creativity; creativity frameworks › psychological/cognitive; post-editing › post-editing with LLMs; model used › ChatGPT; scope › prompt engineering; textual genre › poetry; textual genre › music; textual genre › literature

Style over story: a process-oriented study of authorial creativity in large language models

Donghoon Jung et al.

Evaluations of large language models (LLMs)' creativity have focused primarily on the quality of their outputs rather than the processes that shape them. This study takes a process-oriented approach, drawing on narratology to examine LLMs as computational authors. We introduce constraint-based decision-making as a lens for authorial creativity. Using controlled prompting to assign authorial personas, we analyze the creative preferences of the models. Our findings show that LLMs consistently emphasize Style over other elements, including Character, Event, and Setting. By also probing the reasoning the models provide for their choices, we show that distinctive profiles emerge across models and argue that our approach provides a novel systematic tool for analyzing AI's authorial creativity.

Relevant
creativity frameworks › creative-textual creativity; evaluation › creativity evaluation; evaluation › document-level; evaluation › automatic metrics; model used › ChatGPT; textual genre › literature; scope › prompt engineering

Artificial intelligence as a tool for creativity

Zorana Ivcevic and Mike Grandinetti

The release of ChatGPT has sparked quite a bit of interest about creativity in the context of artificial intelligence (AI), with theorizing and empirical research asking questions about the nature of creativity (both human and artificially-produced) and the valuing of work produced by humans and artificial means. In this article, we discuss one specific scenario identified in the creativity research community – co-creation, or use of AI as a tool that could augment human creativity. We present emerging research relevant to how AI can be used on a continuum of four levels of creativity, from mini-c/creativity in learning to little-c/everyday creativity to Pro-C/professional creativity and Big-C/eminent creativity. In this discussion, AI is defined broadly, not to include only large language models (e.g., ChatGPT) which might approach general AI, but also other computer programs that perform tasks typically understood as requiring human intelligence. We conclude by considering future directions for research on AI as a tool for creativity across the four c's.

Not Relevant

Evaluation is creation: Self and social judgments of creativity across the four-c model

Denis Dumas and James C. Kaufman

Who should evaluate the originality and task-appropriateness of a given idea has been a perennial debate among psychologists of creativity. Here, we argue that the most relevant evaluator of a given idea depends crucially on the level of expertise of the person who generated it. To build this argument, we draw on two complementary theoretical perspectives. The model of domain learning (MDL) suggests that, for novices in a domain, creativity is by-necessity self-referenced, but as expertise develops, more socially referenced creativity is possible. Relatedly, the four-C model posits four forms of creativity that fall along a continuum of social impact: mini-c, little-c, Pro-c, and Big-C. We show that the MDL implies a learning trajectory that connects the four Cs because, as socially referenced creativity develops, greater societal impact becomes available to a creator. Then, we describe four sources of evaluations that become relevant as an individual learns: judgments from the creators themselves, their local community, consumers of the idea, and finally, critics in the domain. We suggest that creators’ judgments are of essential importance for mini-c, community judgments are paramount for little-c, Pro-c requires either positive evaluations from consumers or critics, and Big-C requires both consumers and critics to evaluate an idea positively for an extended time. We identify key insights and imperatives for the field: aligning our measures (both human and AI scored) with the most relevant evaluations of ideas to support the reliability and validity of our measurements, using evaluations as feedback for learners to support the development of creative metacognition, and the importance of considering domain differences when evaluating ideas.

Not Relevant

The effects of (dis)similarities between the creator and the assessor on assessing creativity: a comparison of humans and LLMs

Martin ‘t Hof et al.

Current research predominantly involves human subjects to evaluate AI creativity. In this explorative study, we questioned the validity of this practice and examined how creator–assessor (dis)similarity—namely to what extent the creator and the assessor were alike—along two dimensions of culture (Western and English-speaking vs. Eastern and Chinese-speaking) and agency (human vs. AI) influences the assessment of creativity. We first asked four types of subjects to create stories, including Eastern participants (university students from China), Eastern AI (Kimi from China), Western participants (university students from The Netherlands), and Western AI (ChatGPT 3.5 from the US). Both Eastern participants and AI created stories in Chinese, which were then translated into English, while both Western participants and AI created stories in English, which were then translated into Chinese. A subset of these stories (2 creative and 2 uncreative per creator type, in total 16 stories) was then randomly selected as assessment materials. Adopting a within-subject design, we then asked new subjects from the same four types (n = 120, 30 per type) to assess these stories on creativity, originality, and appropriateness. The results confirmed that similarities in both dimensions of culture and agency influence the assessment of originality and appropriateness. As for the agency dimension, human assessors preferred human-created stories for originality, while AI assessors showed no preference. Conversely, AI assessors rated AI-generated stories higher in appropriateness, whereas human assessors showed no preference. Culturally, both Eastern and Western assessors favored Eastern-created stories in originality. And as for appropriateness, the assessors always preferred stories from the creators with the same cultural backgrounds. The present study is significant in attempting to ask an often-overlooked question and provides the first empirical evidence to underscore the need for more discussion on using humans to judge AI agents’ creativity or the other way around.

Relevant
creativity frameworks › psychological/cognitive; evaluation › LLM-as-a-judge; evaluation › automatic metrics; evaluation › creativity evaluation; evaluation › document-level; evaluation › human eval; model used › ChatGPT; related to creativity › related to creativity as a textual genre

Assessing the creativity of LLMs in proposing novel solutions to mathematical problems

Junyi Ye et al.

The mathematical capabilities of AI systems are complex and multifaceted. Most existing research has predominantly focused on the correctness of AI-generated solutions to mathematical problems. In this work, we argue that beyond producing correct answers, AI systems should also be capable of, or assist humans in, developing novel solutions to mathematical challenges. This study explores the creative potential of Large Language Models (LLMs) in mathematical reasoning, an aspect that has received limited attention in prior research. We introduce a novel framework and benchmark, CreativeMath, which encompasses problems ranging from middle school curricula to Olympic-level competitions, designed to assess LLMs' ability to propose innovative solutions after some known solutions have been provided. Our experiments demonstrate that, while LLMs perform well on standard mathematical tasks, their capacity for creative problem-solving varies considerably. Notably, the Gemini-1.5-Pro model outperformed other LLMs in generating novel solutions. This research opens a new frontier in evaluating AI creativity, shedding light on both the strengths and limitations of LLMs in fostering mathematical innovation, and setting the stage for future developments in AI-assisted mathematical discovery.

Relevant
creativity frameworks › computational creativity; creativity frameworks › psychological/cognitive; evaluation › LLM-as-a-judge; evaluation › automatic metrics; evaluation › creativity evaluation; evaluation › human eval; model used › ChatGPT; model used › Large (>32B); related to creativity › related to creativity as a human ability

A survey on evaluation of large language models

Yupeng Chang et al.

Large language models (LLMs) are gaining increasing popularity in both academia and industry, owing to their unprecedented performance in various applications. As LLMs continue to play a vital role in both research and daily use, their evaluation becomes increasingly critical, not only at the task level, but also at the society level for better understanding of their potential risks. Over the past years, significant efforts have been made to examine LLMs from various perspectives. This paper presents a comprehensive review of these evaluation methods for LLMs, focusing on three key dimensions: what to evaluate, where to evaluate, and how to evaluate. Firstly, we provide an overview from the perspective of evaluation tasks, encompassing general natural language processing tasks, reasoning, medical usage, ethics, education, natural and social sciences, agent applications, and other areas. Secondly, we answer the ‘where’ and ‘how’ questions by diving into the evaluation methods and benchmarks, which serve as crucial components in assessing the performance of LLMs. Then, we summarize the success and failure cases of LLMs in different tasks. Finally, we shed light on several future challenges that lie ahead in LLMs evaluation. Our aim is to offer invaluable insights to researchers in the realm of LLMs evaluation, thereby aiding the development of more proficient LLMs. Our key point is that evaluation should be treated as an essential discipline to better assist the development of LLMs. We consistently maintain the related open-source materials at:

Not Relevant

The creative agency of large language models: a philosophical inquiry

Paschal Mmesoma Ukpaka

This paper explores the difficult question of whether Large Language Models (LLMs) are intrinsically creative. Because they can independently create original content, LLMs are often seen as creative agents. Contrary to the belief that LLMs are creative, this paper argues that LLMs are not creative for two reasons. First, LLMs are not creative because they lack an essential component of creativity, which is the first-person experience of the world. Secondly, LLMs are not creative because they are not the principal authors of their creative output, for they lack the subjective awareness and intentionality necessary to be regarded as authors, and their output is a collaborative effort of the AI model, data providers, and other stakeholders. Since they are not full-fledged authors in a traditional sense, they are not creative.

Not Relevant

Exploring automated assessment of primary students' creativity in a flow-based music programming environment

Zifeng Liu et al.

Creativity is a vital skill in science, technology, engineering, and mathematics (STEM)-related education, fostering innovation and problem-solving. Traditionally, creativity assessments relied on human evaluations, such as the consensual assessment technique (CAT), which are resource-intensive, time-consuming, and often subjective. Recent advances in computational methods, particularly large language models (LLMs), have enabled automated creativity assessments. In this study, we extend research on automated creativity scoring to a flow-based music programming environment, a context that integrates computational and creative thinking. We collected 383 programming artifacts from 194 primary school students (2022-2024) and employed two automated approaches: an evidence-centred design (ECD) framework-based approach and an LLM-based approach using ChatGPT-4 with few-shot learning. The ECD-based approach integrates divergent thinking, complexity, efficiency, and emotional expressiveness, while the LLM-based approach uses CAT ratings and ECD examples to learn creativity scoring. Results revealed moderate to strong correlations with human evaluations (ECD-based: r = 0.48; LLM-based: r = 0.68), with the LLM-based approach demonstrating greater consistency across varying learning examples (r = 0.82). These findings highlight the potential of automated tools for scalable, objective, and efficient creativity assessment, paving the way for their application in creativity-focused learning environments.

Relevant
creativity frameworks › psychological/cognitive; evaluation › creativity evaluation; evaluation › automatic metrics; evaluation › LLM-as-a-judge; related to creativity › mentions creativity as a human ability

Creativity has left the chat: The price of debiasing language models

Behnam Mohammadi

Large Language Models (LLMs) have revolutionized natural language processing but can exhibit biases and may generate toxic content. While alignment techniques like Reinforcement Learning from Human Feedback (RLHF) reduce these issues, their impact on creativity, defined as syntactic and semantic diversity, remains unexplored. We investigate the unintended consequences of RLHF on the creativity of LLMs through three experiments focusing on the Llama-2 series. Our findings reveal that aligned models exhibit lower entropy in token predictions, form distinct clusters in the embedding space, and gravitate towards "attractor states", indicating limited output diversity. Our findings have significant implications for marketers who rely on LLMs for creative tasks such as copywriting, ad creation, and customer persona generation. The trade-off between consistency and creativity in aligned models should be carefully considered when selecting the appropriate model for a given application. We also discuss the importance of prompt engineering in harnessing the creative potential of base models.
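The entropy measurement described above, lower token-prediction entropy in aligned models, can be approximated with standard tooling. The sketch below uses GPT-2 purely as a small stand-in (the paper studies the Llama-2 series) and computes the entropy of a single next-token distribution.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"                                     # stand-in model for illustration
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = "Write a slogan for a coffee brand:"
ids = tok(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits[0, -1]             # logits for the next token
probs = torch.softmax(logits, dim=-1)
entropy = -(probs * probs.clamp_min(1e-12).log()).sum().item()
print(f"next-token entropy: {entropy:.2f} nats")  # lower entropy suggests less diverse output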

Relevant
creativity frameworks › creative-textual creativity; evaluation › automatic metrics; evaluation › creativity evaluation; evaluation › document-level; related to creativity › mentions creativity as a textual genre; scope › creative training; scope › prompt engineering

Automating evaluation of creativity in LLMs with semantic entropy and efficient multi-agent judge

Tan Min Sen et al.

Large Language Models (LLMs) have achieved remarkable progress in natural language comprehension, reasoning, and generation, sparking interest in their creative potential. Automating creativity evaluation in LLMs, particularly in physical reasoning tasks, presents a transformative opportunity to accelerate scientific discovery by enabling innovative solutions, uncovering patterns, and automating problem-solving processes. Current creativity evaluation frameworks, however, rely heavily on human annotation, making them subjective, resource-intensive, and impractical for scaling. To address this, we introduce a novel automated evaluation framework rooted in cognitive science principles of divergent and convergent thinking. Divergent creativity is measured using Semantic Entropy, a sampling-based metric that quantifies variability in generated outputs to capture the novelty of ideas. Convergent creativity is assessed using a modified retrieval-based discussion framework—60% more efficient—where autonomous multi-agent systems evaluate task solutions across feasibility, safety, and effectiveness. We implement these methodologies within a benchmark based on the MacGyver dataset, which contains 300 real-world, solvable problems requiring innovative use of everyday objects. Our framework evaluates state-of-the-art LLMs, such as GPT and LLaMA models, while analyzing the effects of key parameters like temperature, model size, and recency. By automating creativity evaluation, we establish a scalable, objective, and reproducible methodology to enhance LLM development, paving the way for breakthroughs in scientific discovery and creative problem-solving across diverse fields.
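A rough sketch of a semantic-entropy-style divergence score, in the spirit of the framework above: sample several outputs for the same prompt, cluster them by embedding similarity, and take the Shannon entropy of the cluster proportions. The encoder and distance threshold below are arbitrary illustrative choices, not the paper's configuration.

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

encoder = SentenceTransformer("all-MiniLM-L6-v2")         # arbitrary encoder

def semantic_entropy(samples, distance_threshold=0.4):
    # Entropy over semantic clusters of sampled outputs (higher = more divergent).
    emb = encoder.encode(samples, normalize_embeddings=True)
    labels = AgglomerativeClustering(
        n_clusters=None, metric="cosine", linkage="average",
        distance_threshold=distance_threshold,
    ).fit_predict(emb)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())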

Relevant
creativity frameworks › psychological/cognitive; creativity frameworks › computational creativity; evaluation › creativity evaluation; evaluation › human eval; evaluation › sentence-level; evaluation › LLM-as-a-judge; evaluation › automatic metrics; model used › ChatGPT; model used › Large (>32B); model used › Medium (8-24); related to creativity › mentions creativity as a human ability

DeepMath-creative: a benchmark for evaluating mathematical creativity of large language models

Xiaoyang Chen et al.

To advance the mathematical proficiency of large language models (LLMs), the DeepMath team has launched an open-source initiative aimed at developing an open mathematical LLM and systematically evaluating its mathematical creativity. This paper represents the initial contribution of this initiative. While recent developments in mathematical LLMs have predominantly emphasized reasoning skills, as evidenced by benchmarks on elementary to undergraduate-level mathematical tasks, the creative capabilities of these models have received comparatively little attention, and evaluation datasets remain scarce. To address this gap, we propose evaluation criteria for mathematical creativity and introduce DeepMath-Creative, a novel, high-quality benchmark comprising constructive problems across algebra, geometry, analysis, and other domains. We conduct a systematic evaluation of mainstream LLMs' creative problem-solving abilities using this dataset. Experimental results show that even under lenient scoring criteria – emphasizing core solution components and disregarding minor inaccuracies, such as small logical gaps, incomplete justifications, or redundant explanations – the best-performing model, O3 Mini, achieves merely 70% accuracy, primarily on basic undergraduate-level constructive tasks. Performance declines sharply on more complex problems, with models failing to provide substantive strategies for open problems. These findings suggest that, although current LLMs display a degree of constructive proficiency on familiar and lower-difficulty problems, such performance is likely attributable to the recombination of memorized patterns rather than authentic creative insight or novel synthesis.

Not Relevant

How do hackathons foster creativity? Towards automated evaluation of creativity at scale

Jeanette Falk et al.

Hackathons have become popular collaborative events for accelerating the development of creative ideas and prototypes. There are several case studies showcasing creative outcomes across domains such as industry, education, and research. However, there are no large-scale studies on creativity in hackathons which can advance theory on how hackathon formats lead to creative outcomes. We conducted a computational analysis of 193,353 hackathon projects. By operationalizing creativity through usefulness and novelty, we refined our dataset to 10,363 projects, allowing us to analyze how participant characteristics, collaboration patterns, and hackathon setups influence the development of creative projects. The contribution of our paper is twofold: We identified means for organizers to foster creativity in hackathons. We also explore the use of large language models (LLMs) to augment the evaluation of creative outcomes and discuss challenges and opportunities of doing this, which has implications for creativity research at large.

Proceedings of the 2025 CHI conference on human factors in computing systems

Automatic assessment of mathematical creativity using natural language processing

Rebecca Marrone et al.

Creativity is now accepted as a core 21st-century competency and is increasingly an explicit part of school curricula around the world. Therefore, the ability to assess creativity for both formative and summative purposes is vital. However, the fitness-for-purpose of creativity tests has recently come under scrutiny. Current creativity assessments have up to five key weaknesses that create a barrier to their widespread use in educational settings. These are: (a) A lack of domain/subject specificity; (b) Inconsistency, leading to a lack of trust; (c) A lack of authenticity in classroom settings; (d) Slowness (in providing useful results); (e) High cost to administer. The aim of the present study is to explore the feasibility of the automated assessment of mathematical creativity, drawing on tools and techniques from the field of natural language processing, as a means to address these weaknesses. This paper describes the performance of a machine learning algorithm, relative to human judges, demonstrating the practicality of automated creativity assessment for large-scale, school-based assessments.

How do hackathons foster creativity? Towards AI collaborative evaluation of creativity at scale

Jeanette Falk et al.

Hackathons have become popular collaborative events for accelerating the development of creative ideas and prototypes. There are several case studies showcasing creative outcomes across domains such as industry, education, and research. However, there are no large-scale studies on creativity in hackathons which can advance theory on how hackathon formats lead to creative outcomes. We conducted a computational analysis of 193,353 hackathon projects. By operationalizing creativity through usefulness and novelty, we refined our dataset to 10,363 projects, allowing us to analyze how participant characteristics, collaboration patterns, and hackathon setups influence the development of creative projects. The contribution of our paper is twofold: We identified means for organizers to foster creativity in hackathons. We also explore the use of large language models (LLMs) to augment the evaluation of creative outcomes and discuss challenges and opportunities of doing this, which has implications for creativity research at large.

A robot walks into a bar: Can language models serve as creativity support tools for comedy? An evaluation of LLMs’ humour alignment with comedians

Piotr Mirowski et al.

We interviewed twenty professional comedians who perform live shows in front of audiences and who use artificial intelligence in their artistic process as part of 3-hour workshops on “AI x Comedy” conducted at the Edinburgh Festival Fringe in August 2023 and online. The workshop consisted of a comedy writing session with large language models (LLMs), a human-computer interaction questionnaire to assess the Creativity Support Index of AI as a writing tool, and a focus group interrogating the comedians’ motivations for and processes of using AI, as well as their ethical concerns about bias, censorship and copyright. Participants noted that existing moderation strategies used in safety filtering and instruction-tuned LLMs reinforced hegemonic viewpoints by erasing minority groups and their perspectives, and qualified this as a form of censorship. At the same time, most participants felt the LLMs did not succeed as a creativity support tool, by producing bland and biased comedy tropes, akin to “cruise ship comedy material from the 1950s, but a bit less racist”. Our work extends scholarship about the subtle difference between, on the one hand, harmful speech, and on the other hand, “offensive” language as a practice of resistance, satire and “punching up”. We also interrogate the global value alignment behind such language models, and discuss the importance of community-based value alignment and data ownership to build AI tools that better suit artists’ needs. Warning: this study may contain offensive language and discusses self-harm.

Proceedings of the 2024 ACM conference on fairness, accountability, and transparency

Knowledge-Enhanced Large Language Models and Human-AI Collaboration Frameworks for Creativity Support

Unknown


Automated scoring of scientific creativity in german

Benjamin Goecke et al.

Automated scoring is a current hot topic in creativity research. However, most research has focused on the English language and popular verbal creative thinking tasks, such as the alternate uses task. Therefore, in this study, we present a large language model approach for automated scoring of a scientific creative thinking task that assesses divergent ideation in experimental tasks in the German language. Participants are required to generate alternative explanations for an empirical observation. This work analyzed a total of 13,423 unique responses. To predict human ratings of originality, we used XLM-RoBERTa (Cross-lingual Language Model-RoBERTa), a large, multilingual model. The prediction model was trained on 9,400 responses. Results showed a strong correlation between model predictions and human ratings in a held-out test set (n = 2,682; r = 0.80; 95% CI [0.79, 0.81]). These promising findings underscore the potential of large language models for automated scoring of scientific creative thinking in the German language. We encourage researchers to further investigate automated scoring of other domain-specific creative thinking tasks.

Exploring poetic creativity in large language models: a dynamic multi-agent framework for poem generation

Scarlet Weatherby et al.

Creativity Benchmark: A benchmark for marketing creativity for large language models

Ninad Bhat et al.

We introduce Creativity Benchmark, an evaluation framework for large language models (LLMs) in marketing creativity. The benchmark covers 100 brands (12 categories) and three prompt types (Insights, Ideas, Wild Ideas). Human pairwise preferences from 678 practising creatives over 11,012 anonymised comparisons, analysed with Bradley-Terry models, show tightly clustered performance with no model dominating across brands or prompt types: the top-bottom spread is Δθ ≈ 0.45, which implies a head-to-head win probability of 0.61; the highest-rated model beats the lowest only about 61% of the time. We also analyse model diversity using cosine distances to capture intra- and inter-model variation and sensitivity to prompt reframing. Comparing three LLM-as-judge setups with human rankings reveals weak, inconsistent correlations and judge-specific biases, underscoring that automated judges cannot substitute for human evaluation. Conventional creativity tests also transfer only partially to brand-constrained tasks. Overall, the results highlight the need for expert human evaluation and diversity-aware workflows.
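The win probability quoted above follows directly from the Bradley-Terry model, where the probability that model i beats model j is the logistic function of the strength gap θ_i − θ_j. A two-line check with the gap value taken from the abstract:

import math
delta_theta = 0.45                         # top-bottom strength gap reported above
print(1 / (1 + math.exp(-delta_theta)))    # ≈ 0.61 head-to-head win probability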

Automated assessment of creativity in multilingual narratives.

Simone A. Luchini et al.

Multilingual semantic distance: Automatic verbal creativity assessment in many languages.

John D. Patterson et al.

"It Felt Like Having a Second Mind": Investigating human-AI co-creativity in prewriting with large language models

Qian Wan et al.

Prewriting is the process of discovering and developing ideas before writing a first draft, which requires divergent thinking and often implies unstructured strategies such as diagramming, outlining, free-writing, etc. Although large language models (LLMs) have been demonstrated to be useful for a variety of tasks including creative writing, little is known about how users would collaborate with LLMs to support prewriting. The preferred collaborative role and initiative of LLMs during such a creative process is also unclear. To investigate human-LLM collaboration patterns and dynamics during prewriting, we conducted a three-session qualitative study with 15 participants in two creative tasks: story writing and slogan writing. The findings indicated that during collaborative prewriting, there appears to be a three-stage iterative Human-AI Co-creativity process that includes Ideation, Illumination, and Implementation stages. This collaborative process champions the human in a dominant role, in addition to mixed and shifting levels of initiative that exist between humans and LLMs. This research also reports on collaboration breakdowns that occur during this process, user perceptions of using existing LLMs during Human-AI Co-creativity, and discusses design implications to support this co-creativity process.

Natural language processing in computational creativity: a systematic

Owen Graham and Megan Walter

An integrated benchmark for verbal creativity testing of LLMs and humans

Anca Dinu and Andra Maria Florescu

Homogenizing effect of large language model on creativity: An empirical comparison of human and ChatGPT writing

Kibum Moon

Evaluating creativity and deception in large language models: a simulation framework for multi-agent balderdash

Parsa Hejabi et al.

Large Language Models (LLMs) have shown impressive capabilities in complex tasks and interactive environments, yet their creativity remains underexplored. This paper introduces a simulation framework utilizing the game Balderdash to evaluate both the creativity and logical reasoning of LLMs. In Balderdash, players generate fictitious definitions for obscure terms to deceive others while identifying correct definitions. Our framework enables multiple LLM agents to participate in this game, assessing their ability to produce plausible definitions and strategize based on game rules and history. We implemented a centralized game engine featuring various LLMs as participants and a judge LLM to evaluate semantic equivalence. Through a series of experiments, we analyzed the performance of different LLMs, examining metrics such as True Definition Ratio, Deception Ratio, and Correct Guess Ratio. The results provide insights into the creative and deceptive capabilities of LLMs, highlighting their strengths and areas for improvement. Specifically, the study reveals that infrequent vocabulary in LLMs' input leads to poor reasoning on game rules and historical context (https://github.com/ParsaHejabi/Simulation-Framework-for-Multi-Agent-Balderdash).

Enhancing creativity in large language models through associative thinking strategies

Pronita Mehrotra et al.

This paper explores the enhancement of creativity in Large Language Models (LLMs) like vGPT-4 through associative thinking, a cognitive process where creative ideas emerge from linking seemingly unrelated concepts. Associative thinking strategies have been found to effectively help humans boost creativity. However, whether the same strategies can help LLMs become more creative remains under-explored. In this work, we investigate whether prompting LLMs to connect disparate concepts can augment their creative outputs. Focusing on three domains – Product Design, Storytelling, and Marketing – we introduce creativity tasks designed to assess vGPT-4's ability to generate original and useful content. By challenging the models to form novel associations, we evaluate the potential of associative thinking to enhance the creative capabilities of LLMs. Our findings show that leveraging associative thinking techniques can significantly improve the originality of vGPT-4's responses.

Putting GPT-3's creativity to the (alternative uses) test

Claire Stevenson et al.

AI large language models have (co-)produced amazing written works from newspaper articles to novels and poetry. These works meet the standard definition of creativity: being original and useful, and sometimes even showing the additional element of surprise. But can a large language model designed to predict the next text fragment provide creative, out-of-the-box responses that still solve the problem at hand? We put OpenAI's generative natural language model, GPT-3, to the test. Can it provide creative solutions to one of the most commonly used tests in creativity research? We assessed GPT-3's creativity on Guilford's Alternative Uses Test and compared its performance to previously collected human responses on expert ratings of originality, usefulness and surprise of responses, flexibility of each set of ideas as well as an automated method to measure creativity based on the semantic distance between a response and the AUT object in question. Our results show that – on the whole – humans currently outperform GPT-3 when it comes to creative output. But we believe it is only a matter of time before GPT-3 catches up on this particular task. We discuss what this work reveals about human and AI creativity, creativity testing and our definition of creativity.

Aug-creativity: Framework for human-centered creativity with vision language models

Dan Li et al.

Human-computer interaction

User-controlled knowledge fusion in large language models: Balancing creativity and hallucination

Chen Zhang

In modern dialogue systems, the use of Large Language Models (LLMs) has grown exponentially due to their capacity to generate diverse, relevant, and creative responses. Despite their strengths, striking a balance between the LLMs' creativity and their faithfulness to external knowledge remains a key challenge. This paper presents an innovative user-controllable mechanism that modulates the balance between an LLM's imaginative capabilities and its adherence to factual information. Our approach incorporates a numerical tag during the fine-tuning phase of the LLM's training, representing the degree of faithfulness to the reference knowledge in the generated responses. This degree is computed through an automated process that measures lexical overlap using ROUGE scores, semantic similarity using Sentence-BERT embeddings, and an LLM's self-evaluation score. During model inference, users can manipulate this numerical tag, thus controlling the degree of the LLM's reliance on external knowledge. We conduct extensive experiments across various scenarios, demonstrating the adaptability of our method and its efficacy in ensuring the quality and accuracy of the LLM's responses. The results highlight the potential of our approach to enhance the versatility of LLMs while maintaining a balance between creativity and hallucination.
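The faithfulness degree described above blends lexical overlap and semantic similarity (plus an LLM self-evaluation score). A hedged sketch of how such a degree might be assembled is shown below; the equal weights, the choice of encoder, and the omission of the self-evaluation term are my own simplifications for illustration, not the paper's recipe.

from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
encoder = SentenceTransformer("all-MiniLM-L6-v2")   # arbitrary encoder choice

def faithfulness_degree(reference_knowledge, response):
    # Blend of ROUGE-L overlap and embedding similarity, roughly in [0, 1].
    lexical = rouge.score(reference_knowledge, response)["rougeL"].fmeasure
    emb = encoder.encode([reference_knowledge, response])
    semantic = util.cos_sim(emb[0], emb[1]).item()
    return 0.5 * lexical + 0.5 * semantic           # illustrative weights only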

Evaluating the creativity of LLMs in persian literary text generation

Armin Tourajmehr et al.

Large language models (LLMs) have demonstrated notable creative abilities in generating literary texts, including poetry and short stories. However, prior research has primarily centered on English, with limited exploration of non-English literary traditions and without standardized methods for assessing creativity. In this paper, we evaluate the capacity of LLMs to generate Persian literary text enriched with culturally relevant expressions. We build a dataset of user-generated Persian literary texts spanning 20 diverse topics and assess model outputs along four creativity dimensions (originality, fluency, flexibility, and elaboration) by adapting the Torrance Tests of Creative Thinking. To reduce evaluation costs, we adopt an LLM as a judge for automated scoring and validate its reliability against human judgments using intraclass correlation coefficients, observing strong agreement. In addition, we analyze the models' ability to understand and employ four core literary devices: simile, metaphor, hyperbole, and antithesis. Our results highlight both the strengths and limitations of LLMs in Persian literary text generation, underscoring the need for further refinement.

Do LLMs agree on the creativity evaluation of alternative uses?

Abdullah Al Rabeyah et al.

This paper investigates whether large language models (LLMs) show agreement in assessing creativity in responses to the Alternative Uses Test (AUT). While LLMs are increasingly used to evaluate creative content, previous studies have primarily focused on a single model assessing responses generated by the same model or humans. This paper explores whether LLMs can impartially and accurately evaluate creativity in outputs generated by both themselves and other models. Using an oracle benchmark set of AUT responses, categorized by creativity level (common, creative, and highly creative), we experiment with four state-of-the-art LLMs evaluating these outputs. We test both scoring and ranking methods and employ two evaluation settings (comprehensive and segmented) to examine if LLMs agree on the creativity evaluation of alternative uses. Results reveal high inter-model agreement, with Spearman correlations averaging above 0.7 across models and reaching over 0.77 with respect to the oracle, indicating a high level of agreement and validating the reliability of LLMs in creativity assessment of alternative uses. Notably, models do not favour their own responses; instead, they provide similar creativity assessment scores or rankings for alternative uses generated by other models. These findings suggest that LLMs exhibit impartiality and high alignment in creativity evaluation, offering promising implications for their use in automated creativity assessment.
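
A minimal sketch of how such inter-model agreement can be quantified: pairwise Spearman rank correlations over the same set of responses. The judge names and scores below are invented placeholders.

```python
# Sketch: quantify agreement between LLM judges of AUT responses via
# pairwise Spearman rank correlations (hypothetical scores shown).
from itertools import combinations
from scipy.stats import spearmanr

# Creativity scores assigned by three hypothetical judge models to the same
# ten alternative-use responses.
scores = {
    "judge_a": [1, 3, 2, 5, 4, 4, 2, 5, 3, 1],
    "judge_b": [1, 2, 2, 5, 5, 4, 3, 4, 3, 1],
    "judge_c": [2, 3, 1, 4, 5, 4, 2, 5, 2, 1],
}

for m1, m2 in combinations(scores, 2):
    rho, p = spearmanr(scores[m1], scores[m2])
    print(f"{m1} vs {m2}: rho={rho:.2f} (p={p:.3f})")
```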

The current state of artificial intelligence generative language models is more creative than humans on divergent thinking tasks

Kent F. Hubert et al.

Does increasing reliance on artificial intelligence boost creativity? Assessing AI-augmented creativity with large language models

Jiaoping Chen et al.

What shapes a creative machine mind? Comprehensively benchmarking creativity in foundation models

Zicong He et al.

The meteoric rise of foundation models (FMs) has expanded their capabilities far beyond conventional tasks. Creativity, long regarded as a hallmark of human intelligence and a driver of innovation, is now increasingly recognized as a critical dimension of machine intelligence in the era of generative FMs, complementing traditional measures of accuracy. However, existing evaluation frameworks for creativity remain fragmented, relying on ad hoc metrics not firmly grounded in established theories. To address this gap, we introduce C^2-Eval, a holistic benchmark for unified assessment of creativity in FMs. C^2-Eval distinguishes between two complementary forms of creativity: convergent creativity, where tasks admit constrained solutions (e.g., code generation), and divergent creativity, where tasks are open-ended (e.g., storytelling). It evaluates both dimensions using fine-grained criteria derived from social-science theory, focusing on Usefulness, Originality, and Surprise (U-O-S). Through extensive experiments on leading proprietary and open-source models, we analyze trade-offs in their creative capabilities. Our results highlight both the strengths and challenges of current FMs in pursuing a creative machine mind, showing that C^2-Eval is an effective lens for examining the evolving landscape of creative AI.

How the ideation process shapes the creative output in innovation contests—an analysis using a large language model

Martin G. Moehrle et al.

LLM discussion: Enhancing the creativity of large language models via discussion framework and role-play

Li-Chun Lu et al.

Large language models (LLMs) have shown exceptional proficiency in natural language processing but often fall short of generating creative and original responses to open-ended questions. To enhance LLM creativity, our key insight is to emulate the human process of inducing collective creativity through engaging discussions with participants from diverse backgrounds and perspectives. To this end, we propose LLM Discussion, a three-phase discussion framework that facilitates vigorous and diverging idea exchanges and ensures convergence to creative answers. Moreover, we adopt a role-playing technique by assigning distinct roles to LLMs to combat the homogeneity of LLMs. We evaluate the efficacy of the proposed framework with the Alternative Uses Test, Similarities Test, Instances Test, and Scientific Creativity Test through both LLM evaluation and human study. The results show that our proposed framework outperforms single-LLM approaches and existing multi-LLM frameworks across various creativity metrics. The code is available at https://github.com/lawraa/LLM-Discussion.

Creative thought embeddings: a framework for instilling creativity in large language models

Qusay H. Mahmoud

Proceedings of the AAAI symposium series

Comparing Large Language Models verbal creativity to human verbal creativity

Anca Dinu and Andra Florescu

Proceedings of the 10th italian conference on computational linguistics (CLiC-it 2024)

Steering large language models to evaluate and amplify creativity

Matthew Lyle Olson et al.

Although capable of generating creative text, Large Language Models (LLMs) are poor judges of what constitutes "creativity". In this work, we show that we can leverage this knowledge of how to write creatively in order to better judge what is creative. We take a mechanistic approach that extracts differences in the internal states of an LLM when prompted to respond "boringly" or "creatively" to provide a robust measure of creativity that corresponds strongly with human judgment. We also show these internal state differences can be applied to enhance the creativity of generated text at inference time.
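
The sketch below illustrates the general idea of a "creativity direction" derived from hidden-state differences under contrasting instructions; it is not the paper's exact mechanistic procedure, and the model, prompts, and mean-pooling choice are assumptions.

```python
# Sketch: derive a direction in hidden-state space from "creative" vs. "boring"
# instructions, then score new text by projecting its representation onto it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def mean_hidden(text: str) -> torch.Tensor:
    """Mean-pooled final-layer hidden state for a piece of text."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[-1].mean(dim=1).squeeze(0)

prompt = "Describe a rainy day."
direction = mean_hidden("Respond creatively: " + prompt) - \
            mean_hidden("Respond boringly: " + prompt)
direction = direction / direction.norm()

def creativity_score(text: str) -> float:
    """Projection of the text representation onto the creativity direction."""
    return torch.dot(mean_hidden(text), direction).item()

print(creativity_score("The sky wept silver threads onto a city of umbrellas."))
print(creativity_score("It rained. People used umbrellas."))
```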

Creativity support in the age of large language models: An empirical study involving emerging writers

Tuhin Chakrabarty et al.

The development of large language models (LLMs) capable of following instructions and engaging in conversational interactions sparked increased interest in their utilization across various support tools. We investigate the utility of modern LLMs in assisting professional writers via an empirical user study (n=30). The design of our collaborative writing interface is grounded in the cognitive process model of writing that views writing as a goal-oriented thinking process encompassing non-linear cognitive activities: planning, translating, and reviewing. Participants are asked to submit a post-completion survey to provide feedback on the potential and pitfalls of LLMs as writing collaborators. Upon analyzing the writer-LLM interactions, we find that while writers seek LLM's help across all three types of cognitive activities, they find LLMs more helpful in translation and reviewing. Our findings from analyzing both the interactions and the survey responses highlight future research directions in creative writing assistance using LLMs.

Using large language models to evaluate alternative uses task flexibility score

Eran Hadas and Arnon Hershkovitz

Brainstorm, then select: a generative language model improves its creativity score

Douglas Summers-Stay et al.

The AAAI-23 workshop on creative AI across modalities

Fluid transformers and creative analogies: Exploring large language models’ capacity for augmenting cross-domain analogical creativity

Zijian Ding et al.

Creativity and cognition

Large language model in creative work: The role of collaboration modality and user expertise

Zenan Chen and Jason Chan

Since the launch of ChatGPT in December 2022, large language models (LLMs) have been rapidly adopted by businesses to assist users in a wide range of open-ended tasks, including creative work. Although the versatility of LLMs has unlocked new ways of human-artificial intelligence collaboration, it remains uncertain how LLMs should be used to enhance business outcomes. To examine the effects of human-LLM collaboration on business outcomes, we conducted an experiment where we tasked expert and nonexpert users with writing ad copy with and without the assistance of LLMs. Here, we investigate and compare two ways of working with LLMs: (1) using LLMs as “ghostwriters,” which assume the main role of the content generation task, and (2) using LLMs as “sounding boards” to provide feedback on human-created content. We measure the quality of the ads using the number of clicks generated by the created ads on major social media platforms. Our results show that different collaboration modalities can result in very different outcomes for different user types. Using LLMs as sounding boards enhances the quality of the resultant ad copies for nonexperts. However, using LLMs as ghostwriters did not provide significant benefits and is, in fact, detrimental to expert users. We rely on textual analyses to understand the mechanisms, and we learned that using LLMs as ghostwriters produces an anchoring effect, which leads to lower-quality ads. On the other hand, using LLMs as sounding boards helped nonexperts achieve ad content with low semantic divergence to content produced by experts, thereby closing the gap between the two types of users. This paper was accepted by D. J. Wu, information systems. Supplemental Material: The online appendix and data files are available at https://doi.org/10.1287/mnsc.2023.03014.

Evaluating creativity with AI: Comparing GPT models and human experts in idea evaluation

Theresa S. Kränzle

Creativity support in the age of large language models: An empirical study involving professional writers

Tuhin Chakrabarty et al.

Creativity and cognition

A large-scale evaluation of the collaborative potential of human and machine creativity

Dawei Wang et al.

Homogenization effects of large language models on human creative ideation

Barrett R Anderson et al.

Creativity and cognition

Large Language Models show both individual and collective creativity comparable to humans

Luning Sun et al.

CS4: Measuring the creativity of large language models automatically by controlling the number of story-writing constraints

Anirudh Atmakuru et al.

Evaluating the creativity of large language models (LLMs) in story writing is difficult because LLM-generated stories could seemingly look creative but be very similar to some existing stories in their huge and proprietary training corpus. To overcome this challenge, we introduce a novel benchmark dataset with varying levels of prompt specificity: CS4 (Comparing the Skill of Creating Stories by Controlling the Synthesized Constraint Specificity). By increasing the number of requirements/constraints in the prompt, we can increase the prompt specificity and hinder LLMs from retelling high-quality narratives in their training data. Consequently, CS4 empowers us to indirectly measure the LLMs' creativity without human annotations. Our experiments on LLaMA, Gemma, and Mistral not only highlight the creativity challenges LLMs face when dealing with highly specific prompts but also reveal that different LLMs perform very differently under different numbers of constraints and achieve different balances between the model's instruction-following ability and narrative coherence. Additionally, our experiments on OLMo suggest that Learning from Human Feedback (LHF) can help LLMs select better stories from their training data but has limited influence in boosting LLMs' ability to produce creative stories that are unseen in the training corpora. The benchmark is released at https://github.com/anirudhlakkaraju/cs4_benchmark.
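
A minimal sketch of the constraint-controlled prompting idea: prompts grow more specific as more constraints are appended. The instruction and constraints below are invented placeholders, not items from the released CS4 benchmark.

```python
# Sketch: build story-writing prompts whose specificity is controlled by the
# number of constraints included (0 = free generation, up to all constraints).
BASE_INSTRUCTION = "Write a short story."
CONSTRAINTS = [
    "The protagonist is a retired lighthouse keeper.",
    "The story must mention a broken compass.",
    "The story takes place over a single night.",
    "Include a conversation with a stranger.",
    "End the story with an unanswered question.",
]

def build_prompt(num_constraints: int) -> str:
    """Return a prompt with the first `num_constraints` constraints attached."""
    chosen = CONSTRAINTS[:num_constraints]
    lines = [BASE_INSTRUCTION] + [f"Constraint {i + 1}: {c}" for i, c in enumerate(chosen)]
    return "\n".join(lines)

for k in range(len(CONSTRAINTS) + 1):
    prompt = build_prompt(k)
    print(prompt, end="\n\n")
    # story = some_llm.generate(prompt)  # model call omitted in this sketch
```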

Evaluating text creativity across diverse domains: a dataset and large language model evaluator

Qian Cao et al.

Creativity evaluation remains a challenging frontier for large language models (LLMs). Current evaluations heavily rely on inefficient and costly human judgments, hindering progress in enhancing machine creativity. While automated methods exist, ranging from psychological testing to heuristic- or prompting-based approaches, they often lack generalizability or alignment with human judgment. To address these issues, in this paper, we propose a novel pairwise-comparison framework for assessing textual creativity, leveraging shared contextual instructions to improve evaluation consistency. We introduce CreataSet, a large-scale dataset with 100K+ human-level and 1M+ synthetic creative instruction-response pairs spanning diverse open-domain tasks. Through training on CreataSet, we develop an LLM-based evaluator named CrEval. CrEval demonstrates remarkable superiority over existing methods in alignment with human judgments. Experimental results underscore the indispensable significance of integrating both human-generated and synthetic data in training highly robust evaluators, and showcase the practical utility of CrEval in boosting the creativity of LLMs. We will release all data, code, and models publicly soon to support further research.

Has the Creativity of Large-Language Models peaked? An analysis of inter- and intra-LLM variability

Jennifer Haase et al.

Following the widespread adoption of ChatGPT in early 2023, numerous studies reported that large language models (LLMs) can match or even surpass human performance in creative tasks. However, it remains unclear whether LLMs have become more creative over time, and how consistent their creative output is. In this study, we evaluated 14 widely used LLMs – including GPT-4, Claude, Llama, Grok, Mistral, and DeepSeek – across two validated creativity assessments: the Divergent Association Task (DAT) and the Alternative Uses Task (AUT). Contrary to expectations, we found no evidence of increased creative performance over the past 18-24 months, with GPT-4 performing worse than in previous studies. For the more widely used AUT, all models performed on average better than the average human, with GPT-4o and o3-mini performing best. However, only 0.28% of LLM-generated responses reached the top 10% of human creativity benchmarks. Beyond inter-model differences, we document substantial intra-model variability: the same LLM, given the same prompt, can produce outputs ranging from below-average to original. This variability has important implications for both creativity research and practical applications. Ignoring such variability risks misjudging the creative potential of LLMs, either inflating or underestimating their capabilities. The choice of prompts affected LLMs differently. Our findings underscore the need for more nuanced evaluation frameworks and highlight the importance of model selection, prompt design, and repeated assessment when using Generative AI (GenAI) tools in creative contexts.
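
For context, the Divergent Association Task mentioned above is typically scored as the mean pairwise semantic distance between the nouns a respondent produces. The sketch below mimics that scoring with sentence-transformers as a stand-in for the task's original GloVe embeddings.

```python
# Sketch of DAT-style scoring: average pairwise cosine distance between the
# embeddings of the produced words, scaled to roughly 0-100.
from itertools import combinations
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def dat_score(words: list[str]) -> float:
    """Mean pairwise cosine distance (scaled to 0-100) between word embeddings."""
    emb = model.encode(words, convert_to_tensor=True)
    dists = [1 - util.cos_sim(emb[i], emb[j]).item()
             for i, j in combinations(range(len(words)), 2)]
    return 100 * sum(dists) / len(dists)

print(dat_score(["cat", "dog", "mouse", "hamster"]))            # related words: lower score
print(dat_score(["volcano", "umbrella", "algebra", "pickle"]))  # unrelated words: higher score
```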

A framework for collaborating a large language model tool in brainstorming for triggering creative thoughts

Hung-Fu Chang and Tong Li

A comparative approach to assessing linguistic creativity of large language models and humans

Anca Dinu et al.

The following paper introduces a general linguistic creativity test for humans and Large Language Models (LLMs). The test consists of various tasks aimed at assessing their ability to generate new original words and phrases based on word formation processes (derivation and compounding) and on metaphorical language use. We administered the test to 24 humans and to an equal number of LLMs, and we automatically evaluated their answers using the OCSAI tool for three criteria: Originality, Elaboration, and Flexibility. The results show that LLMs not only outperformed humans in all the assessed criteria, but also did better in six out of the eight test tasks. We then computed the uniqueness of the individual answers, which showed some minor differences between humans and LLMs. Finally, we performed a short manual analysis of the dataset, which revealed that humans are more inclined towards E(extending)-creativity, while LLMs favor F(ixed)-creativity.

Evaluating creativity: Can LLMs be good evaluators in creative writing tasks?

Sungeun Kim and Dongsuk Oh

Evaluating creative output with generative artificial intelligence: Comparing GPT models and human experts in idea evaluation

Theresa Kranzle and Katelyn Sharratt

Traditional techniques for evaluating creative outcomes are typically based on evaluations made by human experts. These methods suffer from challenges such as subjectivity, biases, limited availability, ‘crowding’, and high transaction costs. We propose that large language models (LLMs) can be used to overcome these shortcomings. However, there is a dearth of research comparing the performance of LLMs to traditional expert evaluations for evaluating creative outcomes such as ideas. Our study compares the alignment of expert evaluations with evaluations from the LLM GPT‐4. Our results reveal that to achieve moderate evaluation alignment with experts, LLMs require using a base framework and a spectrum‐based few‐shot prompt. We offer six theoretical contributions, shifting the focus from whether LLMs can evaluate to how specific design choices shape their alignment with human judgement. These insights are situated within broader frameworks from cognitive science, creativity theory, and machine learning. Furthermore, we outline six propositions for organizations interested in LLM‐supported evaluation methods. Key recommendations include utilizing base frameworks for large‐scale idea screening, establishing a database of evaluated ideas to optimize few‐shot performance, and leveraging AI–human collaboration for internal and external idea sourcing. Additionally, we highlight the need for privacy considerations when using third‐party LLMs for proprietary idea evaluations. This research contributes to innovation management literature by exploring methods for integrating LLM into creative evaluation processes to enhance scalability and efficiency while retaining evaluation quality.

Testing language creativity of large language models and humans

Anca Dinu and Andra-Maria Florescu

Proceedings of the 5th international conference on natural language processing for digital humanities

Deep associations, high creativity: a simple yet effective metric for evaluating large language models

Ziliang Qiu and Renfen Hu

The evaluation of LLMs' creativity represents a crucial research domain, though challenges such as data contamination and costly human assessments often impede progress. Drawing inspiration from human creativity assessment, we propose PACE, asking LLMs to generate Parallel Association Chains to Evaluate their creativity. PACE minimizes the risk of data contamination and offers a straightforward, highly efficient evaluation, as evidenced by its strong correlation with Chatbot Arena Creative Writing rankings (Spearman's ρ = 0.739, p < 0.001) across various proprietary and open-source models. A comparative analysis of associative creativity between LLMs and humans reveals that while high-performing LLMs achieve scores comparable to average human performance, professional humans consistently outperform LLMs. Furthermore, linguistic analysis reveals that both humans and LLMs exhibit a trend of decreasing concreteness in their associations, with humans demonstrating a greater diversity of associative patterns.

Automating chemosensory creativity assessment with large language models

Qian Janice Wang and Robert Pellegrino

How do humans and language models reason about creativity? A comparative analysis

Antonio Laverghetta Jr et al.

Creativity assessment in science and engineering is increasingly based on both human and AI judgment, but the cognitive processes and biases behind these evaluations remain poorly understood. We conducted two experiments examining how including example solutions with ratings impacts creativity evaluation, using a fine-grained annotation protocol where raters were tasked with explaining their originality scores and providing ratings for the facets of remoteness (whether the response is "far" from everyday ideas), uncommonness (whether the response is rare), and cleverness. In Study 1, we analyzed creativity ratings from 72 experts with formal science or engineering training, comparing those who received example solutions with ratings (example) to those who did not (no example). Computational text analysis revealed that, compared to experts with examples, no-example experts used more comparative language (e.g., "better/worse") and emphasized solution uncommonness, suggesting they may have relied more on memory retrieval for comparisons. In Study 2, parallel analyses with state-of-the-art LLMs revealed that models prioritized uncommonness and remoteness of ideas when rating originality, suggesting an evaluative process rooted in the semantic similarity of ideas. In the example condition, while LLM accuracy in predicting the true originality scores improved, the correlations of remoteness, uncommonness, and cleverness with originality also increased substantially, to upwards of 0.99, suggesting a homogenization in the LLMs' evaluation of the individual facets. These findings highlight important implications for how humans and AI reason about creativity and suggest diverging preferences for what different populations prioritize when rating.

Automated scoring of creative problem solving with large language models: A comparison of originality and quality ratings.

Simone A. Luchini et al.

Rethinking creativity evaluation: a critical analysis of existing creativity evaluations

Li-Chun Lu et al.

We systematically examine, analyze, and compare representative creativity measures–creativity index, perplexity, syntactic templates, and LLM-as-a-Judge–across diverse creative domains, including creative writing, unconventional problem-solving, and research ideation. Our analyses reveal that these metrics exhibit limited consistency, capturing different dimensions of creativity. We highlight key limitations, including the creativity index's focus on lexical diversity, perplexity's sensitivity to model confidence, and syntactic templates' inability to capture conceptual creativity. Additionally, LLM-as-a-Judge shows instability and bias. Our findings underscore the need for more robust, generalizable evaluation frameworks that better align with human judgments of creativity.
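
As a concrete example of one of the critiqued metrics, the sketch below computes perplexity under GPT-2 (a placeholder model); it illustrates why perplexity tracks how predictable a model finds a text, which is the sensitivity-to-model-confidence issue noted above.

```python
# Sketch: perplexity of a text under a causal language model. Lower perplexity
# means the model finds the text more predictable; this is why perplexity
# conflates model confidence with creativity.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean token-level
        # cross-entropy loss; perplexity is its exponential.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

print(perplexity("The cat sat on the mat."))
print(perplexity("The mat dreamed of electric cats in violet rain."))
```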

Evaluating creative short story generation in humans and large language models

Mete Ismayilzada et al.

Story-writing is a fundamental aspect of human imagination, relying heavily on creativity to produce narratives that are novel, effective, and surprising. While large language models (LLMs) have demonstrated the ability to generate high-quality stories, their creative story-writing capabilities remain under-explored. In this work, we conduct a systematic analysis of creativity in short story generation across 60 LLMs and 60 people using a five-sentence cue-word-based creative story-writing task. We use measures to automatically evaluate model- and human-generated stories across several dimensions of creativity, including novelty, surprise, diversity, and linguistic complexity. We also collect creativity ratings and Turing Test classifications from non-expert and expert human raters and LLMs. Automated metrics show that LLMs generate stylistically complex stories, but tend to fall short in terms of novelty, surprise and diversity when compared to average human writers. Expert ratings generally coincide with automated metrics. However, LLMs and non-experts rate LLM stories to be more creative than human-generated stories. We discuss why and how these differences in ratings occur, and their implications for both human and artificial creativity.

A survey on large language model hallucination via a creativity perspective

Xuhui Jiang et al.

Hallucinations in large language models (LLMs) are always seen as limitations. However, could they also be a source of creativity? This survey explores this possibility, suggesting that hallucinations may contribute to LLM applications by fostering creativity. This survey begins with a review of the taxonomy of hallucinations and their negative impact on LLM reliability in critical applications. Then, through historical examples and recent relevant theories, the survey explores the potential creative benefits of hallucinations in LLMs. To elucidate the value and evaluation criteria of this connection, we delve into the definitions and assessment methods of creativity. Following the framework of divergent and convergent thinking phases, the survey systematically reviews the literature on transforming and harnessing hallucinations for creativity in LLMs. Finally, the survey discusses future research directions, emphasizing the need to further explore and refine the application of hallucinations in creative processes within LLMs.

On the creativity of large language models

Giorgio Franceschelli and Mirco Musolesi

Large language models (LLMs) are revolutionizing several areas of Artificial Intelligence. One of the most remarkable applications is creative writing, e.g., poetry or storytelling: the generated outputs are often of astonishing quality. However, a natural question arises: can LLMs be really considered creative? In this article, we first analyze the development of LLMs under the lens of creativity theories, investigating the key open questions and challenges. In particular, we focus our discussion on the dimensions of value, novelty, and surprise as proposed by Margaret Boden in her work. Then, we consider different classic perspectives, namely product, process, press, and person. We discuss a set of “easy” and “hard” problems in machine creativity, presenting them in relation to LLMs. Finally, we examine the societal impact of these technologies with a particular focus on the creative industries, analyzing the opportunities offered, the challenges arising from them, and the potential associated risks, from both legal and ethical points of view.

Automated creativity evaluation for large language models: a reference-based approach

Ruizhe Li et al.

Creative writing is a key capability of Large Language Models (LLMs), with potential applications in literature, storytelling, and various creative domains. However, evaluating the creativity of machine-generated texts remains a significant challenge, as existing methods either rely on costly manual annotations or fail to align closely with human assessments. In this paper, we propose an effective automated evaluation method based on the Torrance Test of Creative Writing (TTCW), which evaluates creativity as a product. Our method employs a reference-based Likert-style approach, scoring generated creative texts relative to high-quality reference texts across various tests. Experimental results demonstrate that our method significantly improves the alignment between LLM evaluations and human assessments, achieving a pairwise accuracy of 0.75 (+15%).
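
A small sketch of the pairwise-accuracy notion used to report evaluator-human alignment: the fraction of text pairs on which the automated evaluator and human judges prefer the same text. The scores below are placeholder values, not data from the paper.

```python
# Sketch: pairwise accuracy between human preferences and an automated
# evaluator, computed over all pairs of rated texts (human ties skipped).
from itertools import combinations

human_scores = {"text_a": 4.5, "text_b": 2.0, "text_c": 3.5, "text_d": 1.0}
auto_scores = {"text_a": 4.0, "text_b": 3.0, "text_c": 3.5, "text_d": 1.5}

def pairwise_accuracy(human: dict, auto: dict) -> float:
    agree, total = 0, 0
    for x, y in combinations(human, 2):
        if human[x] == human[y]:      # skip pairs the humans rate as ties
            continue
        total += 1
        agree += (human[x] > human[y]) == (auto[x] > auto[y])
    return agree / total if total else float("nan")

print(pairwise_accuracy(human_scores, auto_scores))
```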

A causality-aware paradigm for evaluating creativity of multimodal large language models

Zhongzhan Huang et al.

Noveltybench: Evaluating creativity and diversity in language models

Yiming Zhang et al.

Second conference on language modeling

Automatic scoring of metaphor creativity with large language models

Paul V. DiStefano et al.

Divergent creativity in humans and large language models

Antoine Bellemare-Pepin et al.

The recent surge of Large Language Models (LLMs) has led to claims that they are approaching a level of creativity akin to human capabilities. This idea has sparked a blend of excitement and apprehension. However, a critical piece that has been missing in this discourse is a systematic evaluation of LLMs' semantic diversity, particularly in comparison to human divergent thinking. To bridge this gap, we leverage recent advances in computational creativity to analyze semantic divergence in both state-of-the-art LLMs and a substantial dataset of 100,000 humans. We found evidence that LLMs can surpass average human performance on the Divergent Association Task, and approach human creative writing abilities, though they fall short of the typical performance of highly creative humans. Notably, even the top performing LLMs are still largely surpassed by highly creative individuals, underscoring a ceiling that current LLMs still fail to surpass. Our human-machine benchmarking framework addresses the polemic surrounding the imminent replacement of human creative labour by AI, disentangling the quality of the respective creative linguistic outputs using established objective measures. While prompting deeper exploration of the distinctive elements of human inventive thought compared to those of AI systems, we lay out a series of techniques to improve their outputs with respect to semantic diversity, such as prompt design and hyper-parameter tuning.

Is temperature the creativity parameter of large language models?

Max Peeperkorn et al.

Large language models (LLMs) are applied to all sorts of creative tasks, and their outputs vary from beautiful, to peculiar, to pastiche, into plain plagiarism. The temperature parameter of an LLM regulates the amount of randomness, leading to more diverse outputs; therefore, it is often claimed to be the creativity parameter. Here, we investigate this claim using a narrative generation task with a predetermined fixed context, model and prompt. Specifically, we present an empirical analysis of the LLM output for different temperature values using four necessary conditions for creativity in narrative generation: novelty, typicality, cohesion, and coherence. We find that temperature is weakly correlated with novelty, and unsurprisingly, moderately correlated with incoherence, but there is no relationship with either cohesion or typicality. However, the influence of temperature on creativity is far more nuanced and weak than suggested by the "creativity parameter" claim; overall results suggest that the LLM generates slightly more novel outputs as temperatures get higher. Finally, we discuss ideas to allow more controlled LLM creativity, rather than relying on chance via changing the temperature parameter.
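
To make the setup concrete, the sketch below samples continuations of a fixed narrative prompt at several temperature settings via the transformers text-generation pipeline; the model and prompt are placeholders, and the novelty/coherence scoring used in the study is omitted.

```python
# Sketch: generate story continuations at increasing temperatures and inspect
# how the sampled text changes (scoring of novelty, typicality, cohesion, and
# coherence is left out of this sketch).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
prompt = "Once upon a time in a town where it always rained,"

for temperature in (0.3, 0.7, 1.0, 1.5, 2.0):
    out = generator(
        prompt,
        do_sample=True,
        temperature=temperature,
        max_new_tokens=60,
        num_return_sequences=1,
    )
    print(f"--- temperature={temperature} ---")
    print(out[0]["generated_text"])
```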

On characterizations of large language models and creativity evaluation

Max Peeperkorn et al.

Evaluating large language model creativity from a literary perspective

Murray Shanahan and Catherine Clarke

This paper assesses the potential for large language models (LLMs) to serve as assistive tools in the creative writing process, by means of a single, in-depth case study. In the course of the study, we develop interactive and multi-voice prompting strategies that interleave background descriptions (scene setting, plot elements), instructions that guide composition, samples of text in the target style, and critical discussion of the given samples. We qualitatively evaluate the results from a literary critical perspective, as well as from the standpoint of computational creativity (a sub-field of artificial intelligence). Our findings lend support to the view that the sophistication of the results that can be achieved with an LLM mirrors the sophistication of the prompting.

The language of creativity: Evidence from humans and large language models

William Orwig et al.

Recent developments in computerized scoring via semantic distance have provided automated assessments of verbal creativity. Here, we extend past work, applying computational linguistic approaches to characterize salient features of creative text. We hypothesize that, in addition to semantic diversity, the degree to which a story includes perceptual details, thus transporting the reader to another time and place, would be predictive of creativity. Additionally, we explore the use of generative language models to supplement human data collection and examine the extent to which machine‐generated stories can mimic human creativity. We collect 600 short stories from human participants and GPT‐3, subsequently randomized and assessed on their creative quality. Results indicate that the presence of perceptual details, in conjunction with semantic diversity, is highly predictive of creativity. These results were replicated in an independent sample of stories (n = 120) generated by GPT‐4. We do not observe a significant difference between human and AI‐generated stories in terms of creativity ratings, and we also observe positive correlations between human and AI assessments of creativity. Implications and future directions are discussed.

Art or artifice? Large language models and the false promise of creativity

Tuhin Chakrabarty et al.

Proceedings of the CHI conference on human factors in computing systems

Assessing and understanding creativity in large language models

Yunpu Zhao et al.

In the field of natural language processing, the rapid development of large language models (LLMs) has attracted increasing attention. LLMs have shown a high level of creativity in various tasks, but the methods for assessing such creativity are inadequate. Assessment of LLM creativity needs to consider differences from humans, requiring multidimensional measurement while balancing accuracy and efficiency. This paper aims to establish an efficient framework for assessing the level of creativity in LLMs. By adapting the modified Torrance Tests of Creative Thinking, the research evaluates the creative performance of various LLMs across 7 tasks, emphasizing 4 criteria: fluency, flexibility, originality, and elaboration. In this context, we develop a comprehensive dataset of 700 questions for testing and an LLM-based evaluation method. In addition, this study presents a novel analysis of LLMs' responses to diverse prompts and role-play situations. We found that the creativity of LLMs primarily falls short in originality, while excelling in elaboration. The use of prompts and the role-play settings of the model also significantly influence creativity, and the experimental results indicate that collaboration among multiple LLMs can enhance originality. Notably, our findings reveal a consensus between human evaluations and LLMs regarding the personality traits that influence creativity. The findings underscore the significant impact of LLM design on creativity and bridge artificial intelligence and human creativity, offering insights into LLMs' creativity and potential applications.