
Summary of the Visually Grounded Story Generation Challenge

Xudong Hong

INLG 2024

Abstract

Recent advancements in vision-and-language models have opened new possibilities for natural language generation, particularly in generating creative stories from visual input. We thus host an open-source shared task, Visually Grounded Story Generation (VGSG), to explore whether these models can create coherent, diverse, and visually grounded narratives. This task challenges participants to generate coherent stories based on sequences of images, where characters and events must be grounded in the images provided. The task is structured into two tracks: the Closed track, which constrains models to fixed visual features, and the Open track, which allows any kind of model. As the baseline for the Open track, we propose the first two-stage model using GPT-4o, which first generates descriptions for the images and then creates a story based on those descriptions. Human and automatic evaluations indicate that: 1) retrieval augmentation helps generate more human-like stories; 2) large-scale pre-trained LLMs improve story quality by a large margin; and 3) traditional automatic metrics cannot capture the overall quality.
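
The Open-track baseline described above is a describe-then-narrate pipeline. Below is a minimal sketch of such a two-stage setup, assuming the OpenAI Python SDK; the prompts and function names are illustrative, not the authors' actual implementation.

```python
# Hypothetical sketch of a two-stage describe-then-narrate pipeline with GPT-4o.
# Prompts and helper names are assumptions, not the paper's exact setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def describe_image(image_url: str) -> str:
    """Stage 1: generate a short description for a single image."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one or two sentences."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return resp.choices[0].message.content

def generate_story(descriptions: list[str]) -> str:
    """Stage 2: write a story grounded in the per-image descriptions."""
    joined = "\n".join(f"{i + 1}. {d}" for i, d in enumerate(descriptions))
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                "Write a short, coherent story whose characters and events "
                "are grounded in these image descriptions, in order:\n" + joined
            ),
        }],
    )
    return resp.choices[0].message.content

# Usage: story = generate_story([describe_image(u) for u in image_urls])
```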
