MagicScroll: Enhancing Immersive Storytelling with
Controllable Scroll Image Generation

Bingyuan Wang1 Hengyu Meng1 Rui Cao2 Zeyu Cai1 Lanjiong Li1 Yue Ma3 Qifeng Chen3 Zeyu Wang1,3#
1 HKUST(GZ)       2 Xiamen University Malaysia       3 HKUST      
(# Corresponding Author)

Example results generated by MagicScroll. Our framework is designed for generating coherent, controllable, and engaging nontypical aspect-ratio images from story texts.
We support multi-layered, refined controls over style, content, and layout, with multiple conditions including predicted masks, reference images, and style concepts.



Abstract

Scroll images are a unique medium commonly used in virtual reality (VR) to provide an immersive visual storytelling experience. Despite rapid advances in diffusion-based image generation, it remains an open research question to generate scroll images suitable for immersive, coherent, and controllable storytelling in VR. This paper proposes a multi-layered, diffusion-based scroll image generation framework with a novel semantic-aware denoising process. We incorporate layout prediction and style control modules to generate coherent scroll images of any aspect ratio. Based on the scroll image generation framework, we use different multi-window strategies to render diverse visual forms such as chains, rings, and forks for VR storytelling. Quantitative and qualitative evaluations demonstrate that our techniques can significantly enhance text-image consistency and visual coherence in scroll image generation, as well as the level of immersion and engagement of VR storytelling. We will release our source code to facilitate better collaborations on immersive storytelling between AI researchers and creative practitioners.



Method

A framework to generate nontypical aspect-ratio images from storytelling text with style and layout controls.
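
The abstract mentions multi-window strategies for rendering forms such as chains and rings. As a rough illustration only, and not the authors' implementation, the sketch below shows one common way to cover a wide canvas with overlapping windows and average the overlaps. All function names (`chain_windows`, `ring_windows`, `fuse`), the column-wise 1D simplification, and the stride and averaging choices are assumptions made for this example.

```python
# Illustrative sketch only: names, the 1D column simplification, and the
# averaging rule are assumptions, not taken from the MagicScroll code.

def chain_windows(total_width, window, stride):
    """Left-to-right windows over columns [0, total_width).

    The last window is clamped to the right edge so every column is covered.
    """
    if window > total_width:
        raise ValueError("window must not exceed total_width")
    starts = list(range(0, total_width - window + 1, stride))
    if starts[-1] != total_width - window:
        starts.append(total_width - window)
    return [list(range(s, s + window)) for s in starts]


def ring_windows(total_width, window, stride):
    """Windows that wrap around, so the last window overlaps the first."""
    return [
        [(s + i) % total_width for i in range(window)]
        for s in range(0, total_width, stride)
    ]


def fuse(total_width, windows, values):
    """Average per-window column values wherever windows overlap."""
    sums = [0.0] * total_width
    counts = [0] * total_width
    for cols, vals in zip(windows, values):
        for c, v in zip(cols, vals):
            sums[c] += v
            counts[c] += 1
    return [s / n for s, n in zip(sums, counts)]


if __name__ == "__main__":
    wins = chain_windows(6, 4, 2)
    # Pretend each window produced a constant prediction for its columns.
    print(fuse(6, wins, [[1.0] * 4, [3.0] * 4]))
```

In this toy setup, a chain has open ends, while a ring wraps the final window back onto the first columns so that the left and right edges stay consistent. A real diffusion pipeline would apply the same idea to 2D latent tensors at each denoising step.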


Results

Qualitative Comparison of Our Method with Other Baselines

The results demonstrate the high versatility and core advantages of MagicScroll. (a) Style mimicry in specific historical contexts. (b) Semantic layout planning and control. (c) Content richness and diversity.



Generation in Different Aspect Ratios

By providing control over style, concept, and layout at the foreground, midground, and background levels, our framework can meet the needs of visual storytelling content generation in various scenarios.




From left to right: “In a serene garden, lakes and waterfalls flow gently as two girls, dressed in long skirts, run through it. The flowing water travels through dense forests, reaching vast meadows covered with green trees. Distant mountain peaks come into view, and amidst the ever-changing clouds, the mountains and waters harmonize. In this fantastical world, we gradually witness a series of towering castles standing in the distant lakeside, narrating an ancient story under the blue sky.”



More Results Generated by MagicScroll


Videos Synthesized from MagicScroll Outputs

Our outputs can be fed to an image-to-video method (e.g., Runway) to produce dynamic videos, making them a better fit for industrial demands and creative needs.