Towards A “Novel” Benchmark: Evaluating Literary Fiction with Large Language Models

Wenqing Wang

2025 FINDINGS

Abstract

Current research on creative generation focuses mainly on short stories, poetry, and scripts. With the expansion of Large Language Model (LLM) context windows, “novel” avenues emerge. This study aims to extend the boundaries of Natural Language Generation (NLG) evaluation by exploring LLMs’ capabilities in more challenging long-form fiction. We propose a new multi-level evaluation framework that incorporates ten metrics across the Macro, Meso, and Micro levels. We then construct an annotated fiction dataset, in both English and Chinese, sourced from human authors, LLMs, and human-AI collaborations. Human evaluation reveals notable disparities between LLM-generated and human-authored fiction, particularly a “high-starting, low-ending” pattern in LLM outputs. We further probe ten high-performing LLMs with different prompt templates and, by strategically assigning different LLMs to different levels, achieve moderate correlations with human evaluation, as an initial step towards better automatic fiction evaluation. Finally, we offer a fine-grained analysis of LLMs’ capabilities through six issues, providing promising insights for future advancements.

Tags

Creativity Evaluation Methods › Creativity-Specific Evaluation
Creativity Evaluation Methods › Human Evaluation
Textual Domain › Literary Texts
Research Focus › Prompt Engineering
Relationship to Creativity › Explicit
Proprietary Models › OpenAI ChatGPT
Model Scale › Large (>32B)
Human-In-The-Loop › Autonomous Generation
Creativity Frameworks › Linguistic Creativity
Proprietary Models › Anthropic Claude
Proprietary Models › Google Gemini
Research Focus › Benchmark
Research Focus › Controllable Generation

Search Queries

("LLMs" OR large language models) AND creative