MagicScroll: Nontypical Aspect-Ratio Image Generation for Visual Storytelling
via Multi-Layered Semantic-Aware Denoising

Bingyuan Wang1 Hengyu Meng3 Zeyu Cai1 Lanjiong Li1 Yue Ma2 Qifeng Chen2 Zeyu Wang1,2#
1 HKUST(GZ)       2 HKUST       3 South China University of Technology      
(# Corresponding Author)

Example results generated by MagicScroll. Our framework is designed for generating coherent, controllable, and engaging nontypical aspect-ratio images from story texts.
We support multi-layered, refined controls over style, content, and layout, with multiple conditions including predicted masks, reference images, and style concepts.


Visual storytelling often uses nontypical aspect-ratio images like scroll paintings, comic strips, and panoramas to create an expressive and compelling narrative. While generative AI has achieved great success and shown the potential to reshape the creative industry, it remains a challenge to generate coherent and engaging content with arbitrary size and controllable style, concept, and layout, all of which are essential for visual storytelling. To overcome the shortcomings of previous methods, including repetitive content, style inconsistency, and lack of controllability, we propose MagicScroll, a multi-layered, progressive diffusion-based image generation framework with a novel semantic-aware denoising process. The model enables fine-grained control over the generated image at the object, scene, and background levels with text, image, and layout conditions. We also establish the first benchmark for nontypical aspect-ratio image generation for visual storytelling, covering media such as paintings, comics, and cinematic panoramas, with customized metrics for systematic evaluation. Through comparative and ablation studies, MagicScroll shows promising results in aligning with the narrative text, improving visual coherence, and engaging the audience. We plan to release the code and benchmark in the hope of fostering closer collaboration between AI researchers and creative practitioners working on visual storytelling.


A framework to generate nontypical aspect-ratio images from storytelling text with style and layout controls.



Qualitative Comparison of Our Method with Other Baselines

The results demonstrate the high versatility and core advantages of MagicScroll. (a) Style mimicry in specific historical contexts. (b) Semantic layout planning and control. (c) Content richness and diversity.


Generation in Different Aspect Ratios

By providing control over style, concept, and layout at the foreground, midground, and background levels, our framework can meet the needs of visual storytelling content generation in various scenarios.
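To make the idea of per-layer control more concrete, the following is a minimal sketch of how a request with separate foreground, midground, and background conditions might be described. It is not the MagicScroll API: the class names `LayerCondition` and `ScrollSpec`, all field names, the canvas size, and the "ink painting" style value are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class LayerCondition:
    """Conditions for one layer. Every field name here is hypothetical."""
    text: str                              # text prompt for this layer
    style_concept: Optional[str] = None    # illustrative style concept name
    reference_image: Optional[str] = None  # path to a reference image, if any
    mask: Optional[str] = None             # path to a layout mask, if any


@dataclass
class ScrollSpec:
    """A wide-canvas request with one condition per layer (assumed structure)."""
    width: int
    height: int
    foreground: LayerCondition
    midground: LayerCondition
    background: LayerCondition

    @property
    def aspect_ratio(self) -> float:
        # Width divided by height; large values mean a long horizontal scroll.
        return self.width / self.height

    def layers(self) -> List[Tuple[str, LayerCondition]]:
        # Return (name, condition) pairs ordered from front to back.
        return [
            ("foreground", self.foreground),
            ("midground", self.midground),
            ("background", self.background),
        ]


# Illustrative values only; the canvas size and prompts are assumptions.
spec = ScrollSpec(
    width=4096,
    height=512,
    foreground=LayerCondition(text="two girls in long skirts running"),
    midground=LayerCondition(text="lakes and waterfalls in a serene garden"),
    background=LayerCondition(
        text="distant mountain peaks under changing clouds",
        style_concept="ink painting",
    ),
)

for name, cond in spec.layers():
    print(name, "->", cond.text)
print("aspect ratio:", spec.aspect_ratio)
```

Keeping each layer's text, style, reference image, and mask in its own record is one way to let a caller change a single layer, for example swapping the background style, without touching the others.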


From left to right: “In a serene garden, lakes and waterfalls flow gently as two girls, dressed in long skirts, run through it. The flowing water travels through dense forests, reaching vast meadows covered with green trees. Distant mountain peaks come into view, and amidst the ever-changing clouds, the mountains and waters harmonize. In this fantastical world, we gradually witness a series of towering castles standing in the distant lakeside, narrating an ancient story under the blue sky.”

More Results Generated by MagicScroll

Videos Synthesized from MagicScroll Outputs

Our results can be combined with an image-to-video method (e.g., Runway) to add impressive dynamics, making them a better fit for industrial demands and creative needs.