LOVECon: text-driven training-free long video editing with ControlNet

Liao Z Y, Xie Q S, Deng Z J

Sci China Inf Sci, 2025, 68(10): 200112

In this work, we aim to bridge this gap by establishing a simple and effective baseline for text-driven training-free LOng Video Editing with ControlNet, dubbed LOVECon. Technically, LOVECon follows the basic video editing pipeline based on Stable Diffusion and ControlNet, with an additional step that splits long videos into consecutive windows to accommodate limited computational memory. On top of this pipeline, we introduce a novel cross-window attention mechanism to maintain coherence in style and subtle details across windows. To ensure structural fidelity to the source video, we enrich the latent states of the edited frames with information extracted from the source video through DDIM inversion. Additionally, LOVECon incorporates a video interpolation model that polishes the latent states of the edited frames in the late stages of generation to alleviate frame flickering. Together, these techniques yield smoother transitions in long videos and significantly mitigate visual artifacts.
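
The window-based processing, cross-window attention, and latent refinements described above can be illustrated with a short PyTorch sketch. The function names (split_into_windows, cross_window_attention, fuse_with_source, smooth_with_interpolation), the choice of reference frames, and the blending rules are assumptions made purely for illustration and are not taken from the LOVECon implementation; the sketch only shows the general mechanism under those assumptions.

import torch
import torch.nn.functional as F

def split_into_windows(frames, window_size):
    # Split a long video tensor (T, C, H, W) into consecutive windows of at
    # most window_size frames, so each window fits in limited memory.
    return [frames[i:i + window_size]
            for i in range(0, frames.shape[0], window_size)]

def cross_window_attention(q, k, v, k_ref, v_ref):
    # Let queries of the current window attend to its own keys/values
    # concatenated with keys/values taken from reference frames of other
    # windows (an assumed choice here), so style and fine details stay
    # coherent across windows.
    # Shapes: q, k, v are (B, heads, N, d); k_ref, v_ref are (B, heads, M, d).
    k_all = torch.cat([k_ref, k], dim=2)
    v_all = torch.cat([v_ref, v], dim=2)
    return F.scaled_dot_product_attention(q, k_all, v_all)

def fuse_with_source(edited_latents, inverted_source_latents, weight=0.5):
    # Enrich the edited latent states with latents obtained by DDIM inversion
    # of the source video; the linear blend with a fixed weight is an
    # illustrative choice, not the paper's exact formulation.
    return weight * edited_latents + (1.0 - weight) * inverted_source_latents

def smooth_with_interpolation(latents, interpolate, blend=0.5):
    # In the late stages of generation, blend each intermediate frame's latent
    # with the frame predicted by a video interpolation model from its
    # neighbours to reduce flicker. `interpolate` is a stand-in for such a
    # model.
    out = latents.clone()
    for t in range(1, latents.shape[0] - 1):
        out[t] = blend * latents[t] + \
            (1.0 - blend) * interpolate(latents[t - 1], latents[t + 1])
    return out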
