Split-then-Merge (StM) is a video composition framework that improves controllability and sidesteps the scarcity of labeled composition data: it splits unlabeled videos into foreground and background layers, then trains a model to compose those layers back into the original clip.
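The split step can be pictured as a two-stage decomposition. The sketch below is a minimal illustration assuming PyTorch tensors; `motion_segmenter` and `inpainter` are hypothetical stand-ins for the off-the-shelf models, not the paper's actual interfaces.

```python
import torch

def split(video: torch.Tensor, motion_segmenter, inpainter):
    """Decompose an unlabeled video into foreground/background layers.

    video:            (T, C, H, W) frames in [0, 1]
    motion_segmenter: callable returning a (T, 1, H, W) mask, 1 = foreground
    inpainter:        callable filling the masked region of the background
    Both callables are hypothetical placeholders for off-the-shelf models.
    """
    mask = motion_segmenter(video)        # soft foreground mask from motion
    foreground = video * mask             # extract the foreground layer
    background = inpainter(video, mask)   # fill the holes left by the mask
    return foreground, background, mask
```

Self-composition then asks a generator to merge `foreground` and `background` back into the original clip, so every unlabeled video supplies its own supervision.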
The StM Decomposer chains off-the-shelf models to split unlabeled videos into layers. First, motion segmentation produces a foreground mask, which is used to extract the foreground layer; a video inpainting model then fills the holes the mask leaves in the background. During training, the Composer learns to reconstruct the ground-truth video latent from the foreground layer, the background layer, and a text prompt. A transformation-aware training pipeline and an identity-preservation loss keep the model from taking "copy-paste" shortcuts and push it to learn genuine affordances.
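As a rough sketch of how that objective could be wired up, assuming a latent-space reconstruction loss: `composer`, `vae`, and `id_encoder` are hypothetical module names, the jitter function is only an illustrative stand-in for the transformation-aware pipeline, and the loss weight is arbitrary.

```python
import torch
import torch.nn.functional as F

def jitter(fg: torch.Tensor) -> torch.Tensor:
    """Illustrative transformation-aware augmentation of the foreground.

    Perturbing the conditioning foreground relative to the target video
    means the Composer cannot satisfy the loss by pasting pixels back.
    The actual transforms used by StM are not specified here.
    """
    if torch.rand(()) < 0.5:
        fg = torch.flip(fg, dims=[-1])        # random horizontal flip
    return fg * (0.8 + 0.4 * torch.rand(()))  # random brightness jitter

def composer_loss(video, caption, fg, bg, composer, vae, id_encoder, lam=0.1):
    """One training step; all module names are hypothetical placeholders.

    composer:   predicts a video latent from (foreground, background, text)
    vae:        video autoencoder with .encode() / .decode()
    id_encoder: feature extractor used for the identity-preservation term
    lam:        illustrative weight on the identity term
    """
    target = vae.encode(video)                # ground-truth video latent
    pred = composer(jitter(fg), bg, caption)  # recompose from the layers
    rec = F.mse_loss(pred, target)            # latent reconstruction term
    # Identity preservation: the decoded composition should keep the
    # appearance of the conditioning foreground.
    idp = F.mse_loss(id_encoder(vae.decode(pred)), id_encoder(fg))
    return rec + lam * idp
```

Because the conditioning foreground no longer aligns pixel-for-pixel with the target, the reconstruction term alone cannot be minimized by copying, and the identity term keeps the subject's appearance intact under the transformation.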
@article{kara2025stm,
  title={Layer-Aware Video Composition via Split-then-Merge},
  author={Kara, Ozgur and Chen, Yujia and Yang, Ming-Hsuan and Rehg, James M. and Chu, Wen-Sheng and Tran, Du},
  journal={arXiv preprint},
  year={2025}
}