Layer-Aware Video Composition via Split-then-Merge

¹University of Illinois Urbana-Champaign    ²Google
Work done during an internship at Google    Joint last authors

TL;DR

Split-then-Merge (StM) is a new video composition framework that improves control and handles data scarcity by splitting unlabeled videos into foreground and background layers, and then self-composing them.

Method

Decomposer

[Figure: Decomposer Architecture]

Composer

[Figure: Composer Architecture]

The StM Decomposer integrates off-the-shelf models to split unlabeled videos into layers. First, motion segmentation generates a foreground mask, which is used to extract the foreground layer. An inpainting model then fills the resulting "holes" in the masked background video. The Composer is trained to reconstruct the ground-truth video latent from the foreground, background, and text inputs. A transformation-aware training pipeline and an identity-preservation loss discourage "copy-paste" shortcuts, so that the model learns genuine affordance.
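The split step and the "copy-paste" shortcut can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`split_layers`, `copy_paste`, `shift_foreground`) are hypothetical, the mask is hand-made rather than produced by motion segmentation, inpainting is skipped, and a simple horizontal shift stands in for whatever transformations the actual training pipeline applies.

```python
import numpy as np

def split_layers(video, mask):
    """Split a video into a foreground layer and a background with holes.

    video: float array of shape (T, H, W, 3)
    mask:  bool array of shape (T, H, W), True on foreground pixels
    """
    m = mask[..., None].astype(video.dtype)
    foreground = video * m                      # background pixels zeroed
    background_with_holes = video * (1.0 - m)   # foreground pixels zeroed
    return foreground, background_with_holes

def copy_paste(foreground, background, mask):
    """Naive composition: paste foreground pixels over the background."""
    m = mask[..., None].astype(foreground.dtype)
    return foreground * m + background * (1.0 - m)

def shift_foreground(foreground, mask, dx):
    """Assumed stand-in transformation: shift layer and mask by dx pixels
    along the width axis (wrapping around at the border)."""
    return np.roll(foreground, dx, axis=2), np.roll(mask, dx, axis=2)

# Toy video and a hand-made mask (assumption: a real mask would come from
# a motion segmentation model).
rng = np.random.default_rng(0)
video = rng.random((4, 8, 8, 3))
mask = np.zeros((4, 8, 8), dtype=bool)
mask[:, 2:5, 1:4] = True

fg, bg_holes = split_layers(video, mask)

# Without any transformation, pasting the layers back reproduces the video.
naive = copy_paste(fg, bg_holes, mask)
shortcut_error = float(np.abs(naive - video).max())

# After shifting the foreground, the same paste no longer matches the target.
fg_t, mask_t = shift_foreground(fg, mask, dx=3)
shifted = copy_paste(fg_t, bg_holes, mask_t)
transformed_error = float(np.abs(shifted - video).max())
```

In this toy setup `shortcut_error` is zero, which shows why a model trained on untransformed layers could reconstruct the target by pasting alone. Once the foreground input is transformed, pasting no longer recovers the target, so reconstruction requires more than copying pixels.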

StM-50K Test Examples

Additional Results (Outdoor)

Additional Results (Indoor)

Logically Impossible Composition

Multi Object Composition

SOTA Comparison

[Video comparison; columns, left to right: Input Foreground, Input Background, StM (Ours), SkyReels [1], AnyV2V [2], Qwen+I2V [3], PBE+I2V [4], Copy-Paste]

Example 1 prompt: "A goat is running on a snowy road"

Example 2 prompt: "A boat is moving on the water"

References

[1] Zhengcong Fei, Debang Li, Di Qiu, Jiahua Wang, Yikun Dou, Rui Wang, Jingtao Xu, Mingyuan Fan, Guibin Chen, Yang Li, et al. SkyReels-A2: Compose Anything in Video Diffusion Transformers. arXiv preprint arXiv:2504.02436, 2025.

[2] Max Ku, Cong Wei, Weiming Ren, Huan Yang, and Wenhu Chen. AnyV2V: A Tuning-Free Framework For Any Video-to-Video Editing Tasks. Transactions on Machine Learning Research, 2024.

[3] Chenfei Wu et al. Qwen-Image Technical Report. 2025.

[4] Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, and Fang Wen. Paint by Example: Exemplar-based Image Editing with Diffusion Models. CVPR 2023.

BibTeX

@article{kara2025stm,
  title={Layer-Aware Video Composition via Split-then-Merge},
  author={Kara, Ozgur and Chen, Yujia and Yang, Ming-Hsuan and Rehg, James M. and Chu, Wen-Sheng and Tran, Du},
  journal={arXiv preprint},
  year={2025}
}