A Review on Background, Technology, Comparison, and Future Tendency of Video Generation
DOI:
https://doi.org/10.54097/18313836
Keywords:
Video Generation; Generative Adversarial Model; Variational Auto-Encoders; Transformer Model; Diffusion Model.
Abstract
Video generation techniques build on recent advances in deep learning and generative modeling and are widely used in film and television, education, advertising, virtual reality, and other fields. The work is motivated by the growing need to generate high-resolution, dynamically consistent, and semantically accurate videos for diverse scenarios. Existing techniques, including Generative Adversarial Networks (GANs), Variational Auto-Encoders (VAEs), Transformers, and Diffusion Models, have achieved significant improvements in video quality and generation efficiency. This paper systematically reviews the development of video generation technology, covering model principles, application scenarios, and technical advantages and shortcomings, and analyzes the performance of current mainstream models in detail. Drawing on the experimental results, it identifies four future trends: multimodal fusion, resolution improvement, generation efficiency optimization, and 3D video generation. Going forward, video generation technology is expected to focus on deep multimodal alignment, real-time high-resolution generation, dynamic scene optimization, and 3D modeling, which should promote its wide application in virtual reality, scientific research, and film and television production and open new paths for interactive content generation.
License
Copyright (c) 2025 Highlights in Science, Engineering and Technology

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.