A text-to-video model is a machine learning model that uses a natural language description as input to produce a video relevant to the input text.[1] Advancements during the 2020s in the generation of high-quality, text-conditioned videos have largely been driven by the development of video diffusion models.[2]
^Artificial Intelligence Index Report 2023(PDF) (Report). Stanford Institute for Human-Centered Artificial Intelligence. p. 98. Multiple high quality text-to-video models, AI systems that can generate video clips from prompted text, were released in 2022.