A Review on Background, Technology, Comparison, and Future Tendency of Video Generation

Zhiyu Han

doi:10.54097/18313836

Keywords:

Video Generation; Generative Adversarial Model; Variational Auto-Encoders; Transformer Model; Diffusion Model.

Abstract

Video generation techniques build on recent advances in deep learning and generative modeling and are widely used in film and television, education, advertising, virtual reality, and other fields. The field is driven by the growing need for high-resolution, dynamically consistent, and semantically accurate videos that meet diverse scene requirements. Existing techniques, including Generative Adversarial Networks (GANs), Variational Auto-Encoders (VAEs), Transformers, and diffusion models, have achieved significant improvements in video quality and generation efficiency. This paper systematically reviews the development of video generation technology, covering model principles, application scenarios, and technical advantages and shortcomings, and analyzes the performance of the current mainstream models in detail. Drawing on the experimental results, it identifies four future trends: multimodal fusion, resolution improvement, generation efficiency optimization, and 3D video generation. Specifically, video generation technology is expected to focus on deeper multimodal alignment, real-time high-resolution generation, dynamic scene optimization, and 3D modeling, which should promote its wide application in virtual reality, scientific research, and film and television production and open new paths for interactive content generation.
