CineCtrl: Generative Photographic Control for Scene-Consistent Video Cinematic Editing

Huiqiang Sun¹,², Liao Shen¹,², Zhan Peng¹, Kun Wang³, Size Wu², Yuhang Zang⁴, Tianqi Liu¹,², Zihao Huang¹,²,
Xingyu Zeng³, Zhiguo Cao¹, Wei Li²†, Chen Change Loy²
¹HUST ²S-Lab, NTU ³SenseTime ⁴AI Lab

TL;DR

CineCtrl is the first video cinematic editing framework that provides fine control over professional camera parameters. It exposes five photographic effect parameters (bokeh blur, refocused disparity, focal length, shutter speed, and color temperature) and one camera pose control parameter. In the cases below, para gives the photographic parameter values and cam gives the camera motion direction.
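As a reading aid, the control inputs can be pictured as a small record. The sketch below is illustrative only: the class name `CinematicControls`, the helper `from_para`, and the field order (assumed to follow the order in which the parameters are listed above) are hypothetical and are not taken from the CineCtrl code. The camera directions are simply those that appear in the cases on this page.

```python
from dataclasses import dataclass

# Camera directions that appear in the cases on this page; others may exist.
CAMERA_DIRECTIONS = ("Up", "Down", "Left", "Right")


@dataclass(frozen=True)
class CinematicControls:
    """Hypothetical container for the five photographic values plus camera motion."""

    bokeh_blur: float
    refocused_disparity: float
    focal_length: float
    shutter_speed: float
    color_temperature: float
    camera: str

    def __post_init__(self):
        if self.camera not in CAMERA_DIRECTIONS:
            raise ValueError(f"unknown camera direction: {self.camera!r}")

    @classmethod
    def from_para(cls, para, cam):
        # ASSUMPTION: para is ordered as the parameters are listed in the text.
        if len(para) != 5:
            raise ValueError("expected exactly five photographic parameters")
        return cls(*(float(v) for v in para), camera=cam)


controls = CinematicControls.from_para([0.8, 0.5, 0, 0, 0.5], "Up")
```

Keeping the photographic values and the camera motion in separate fields mirrors the idea that the two kinds of control are specified independently.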

Each case below pairs an input video with its output result.

Case 1: para = [0.8, 0.5, 0, 0, 0.5], cam: Up

Case 2: para = [0.7, 0.1, 0.4, 0.8, 0], cam: Up

Case 3: para = [0.5, 0.8, 0.1, 0.7, 0], cam: Right

Case 4: para = [0.9, 0.8, 0, 0.9, 0], cam: Left

Case 5: para = [0.7, 0.1, 0.0, 0, 0.7], cam: Up

Case 6: para = [0.9, 0.8, 0, 0.8, 0], cam: Down

Case 7: para = [0.7, 0.8, 0, 0.7, 0], cam: Right

Abstract

Cinematic storytelling is profoundly shaped by the artful manipulation of photographic elements such as depth of field and exposure. These effects are crucial in conveying mood and creating aesthetic appeal. However, controlling these effects in generative video models remains highly challenging, as most existing methods are restricted to camera motion control. In this paper, we propose CineCtrl, the first video cinematic editing framework that provides fine control over professional camera parameters (e.g., bokeh, shutter speed). We introduce a decoupled cross-attention mechanism to disentangle camera motion from photographic inputs, allowing fine-grained, independent control without compromising scene consistency. To overcome the shortage of training data, we develop a comprehensive data generation strategy that combines simulated photographic effects with a dedicated real-world collection pipeline, enabling the construction of a large-scale dataset for robust model training. Extensive experiments demonstrate that our model generates high-fidelity videos with precisely controlled, user-specified photographic camera effects.

Method

Overall framework of CineCtrl. The model is built upon the Wan2.1 text-to-video (T2V) framework and extended to a video-to-video (V2V) model. To enable camera control, we inject both camera trajectory and photographic parameter signals into the DiT block. Through our proposed Camera-Decoupled Cross-Attention mechanism, we disentangle these two signals to achieve accurate and independent control.
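To make the decoupling idea concrete, here is a minimal NumPy sketch in which video tokens attend to camera tokens and photographic tokens through two separate cross-attention branches whose outputs are added back to the video tokens. This is an assumption about the general shape of such a mechanism, not the CineCtrl implementation: the function names, the single-head formulation, the additive fusion, and all dimensions are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    # Single-head cross-attention: queries attend to the context tokens.
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def decoupled_cross_attention(video_tokens, cam_tokens, photo_tokens, params):
    # ASSUMPTION: each control signal gets its own projections, and the two
    # branch outputs are summed into the residual stream.
    cam_out = cross_attention(video_tokens, cam_tokens, *params["cam"])
    photo_out = cross_attention(video_tokens, photo_tokens, *params["photo"])
    return video_tokens + cam_out + photo_out


rng = np.random.default_rng(0)
dim = 8  # toy embedding size


def init_branch():
    # Independent (W_q, W_k, W_v) for one branch.
    return tuple(rng.normal(scale=0.1, size=(dim, dim)) for _ in range(3))


params = {"cam": init_branch(), "photo": init_branch()}
video_tokens = rng.normal(size=(16, dim))  # 16 video tokens
cam_tokens = rng.normal(size=(4, dim))     # camera trajectory embedding
photo_tokens = rng.normal(size=(5, dim))   # photographic parameter embedding

out = decoupled_cross_attention(video_tokens, cam_tokens, photo_tokens, params)
```

Because the branches share no weights and no attention scores, changing the photographic tokens in this sketch leaves the camera branch's contribution untouched, which is the property the term "decoupled" refers to here.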

Dataset

We generate training pairs by applying our proposed photographic effect simulator to both a synthetic dataset and a high-quality real-world dataset, which we curated from web and movie sources through a shot detection and filtering pipeline.
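The pairing step can be sketched as follows: a clean clip is the source, and a simulated photographic effect applied to it gives the target. The two toy effects below (temporal averaging as a stand-in for a slower shutter, and per-channel gains as a stand-in for color temperature) and every name and constant in the code are assumptions made for illustration; the actual simulator is not described on this page.

```python
import numpy as np


def apply_shutter_blur(frames, shutter):
    # ASSUMPTION: a larger value averages more neighbouring frames,
    # roughly imitating motion blur from a longer exposure.
    radius = int(round(shutter * 3))
    if radius == 0:
        return frames.copy()
    head = np.repeat(frames[:1], radius, axis=0)
    tail = np.repeat(frames[-1:], radius, axis=0)
    padded = np.concatenate([head, frames, tail], axis=0)
    window = 2 * radius + 1
    return np.stack(
        [padded[i:i + window].mean(axis=0) for i in range(frames.shape[0])]
    )


def apply_color_temperature(frames, temp):
    # ASSUMPTION: 0.5 is treated as neutral here purely for illustration;
    # higher values boost red and cut blue (warmer), lower values do the opposite.
    shift = (temp - 0.5) * 0.4
    gains = np.array([1.0 + shift, 1.0, 1.0 - shift])
    return np.clip(frames * gains, 0.0, 1.0)


def make_training_pair(frames, shutter, temp):
    # frames: float array of shape (T, H, W, 3) with values in [0, 1].
    target = apply_color_temperature(apply_shutter_blur(frames, shutter), temp)
    return {
        "source": frames,
        "target": target,
        "params": {"shutter_speed": shutter, "color_temperature": temp},
    }


rng = np.random.default_rng(0)
clip = rng.uniform(size=(6, 4, 4, 3))
pair = make_training_pair(clip, shutter=0.7, temp=0.8)
```

Storing the parameter values alongside each pair gives the model the conditioning signal it needs to learn the mapping from controls to visual effect.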