Diffusion model

In machine learning, diffusion models, also known as diffusion probabilistic models or score-based generative models, are a class of latent variable generative models. A diffusion model consists of three major components: the forward process, the reverse process, and the sampling procedure.^[1] The goal of diffusion models is to learn a diffusion process for a given dataset, such that the process can generate new elements that are distributed similarly as the original dataset. A diffusion model models data as generated by a diffusion process, whereby a new datum performs a random walk with drift through the space of all possible data.^[2] A trained diffusion model can be sampled in many ways, with different efficiency and quality.

There are various equivalent formalisms, including Markov chains, denoising diffusion probabilistic models, noise conditioned score networks, and stochastic differential equations.^[3] They are typically trained using variational inference.^[4] The model responsible for denoising is typically called its "backbone". The backbone may be of any kind, but they are typically U-nets or transformers.

As of 2024^[update], diffusion models are mainly used for computer vision tasks, including image denoising, inpainting, super-resolution, image generation, and video generation. These typically involves training a neural network to sequentially denoise images blurred with Gaussian noise.^[2]^[5] The model is trained to reverse the process of adding noise to an image. After training to convergence, it can be used for image generation by starting with an image composed of random noise, and applying the network iteratively to denoise the image.

Diffusion-based image generators have seen widespread commercial interest, such as Stable Diffusion and DALL-E. These models typically combine diffusion models with other models, such as text-encoders and cross-attention modules to allow text-conditioned generation.^[6]

Other than computer vision, diffusion models have also found applications in natural language processing^[7] such as text generation^[8]^[9] and summarization,^[10] sound generation,^[11] and reinforcement learning.^[12]^[13]

^ Chang, Ziyi; Koulieris, George Alex; Shum, Hubert P. H. (2023). "On the Design Fundamentals of Diffusion Models: A Survey". arXiv:2306.04542 [cs.LG].
^ ^a ^b Song, Yang; Sohl-Dickstein, Jascha; Kingma, Diederik P.; Kumar, Abhishek; Ermon, Stefano; Poole, Ben (2021-02-10). "Score-Based Generative Modeling through Stochastic Differential Equations". arXiv:2011.13456 [cs.LG].
^ Croitoru, Florinel-Alin; Hondru, Vlad; Ionescu, Radu Tudor; Shah, Mubarak (2023). "Diffusion Models in Vision: A Survey". IEEE Transactions on Pattern Analysis and Machine Intelligence. 45 (9): 10850–10869. arXiv:2209.04747. doi:10.1109/TPAMI.2023.3261988. PMID 37030794. S2CID 252199918.
^ Cite error: The named reference ho was invoked but never defined (see the help page).
^ Gu, Shuyang; Chen, Dong; Bao, Jianmin; Wen, Fang; Zhang, Bo; Chen, Dongdong; Yuan, Lu; Guo, Baining (2021). "Vector Quantized Diffusion Model for Text-to-Image Synthesis". arXiv:2111.14822 [cs.CV].
^ Cite error: The named reference dalle2 was invoked but never defined (see the help page).
^ Li, Yifan; Zhou, Kun; Zhao, Wayne Xin; Wen, Ji-Rong (August 2023). "Diffusion Models for Non-autoregressive Text Generation: A Survey". Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence. California: International Joint Conferences on Artificial Intelligence Organization. pp. 6692–6701. arXiv:2303.06574. doi:10.24963/ijcai.2023/750. ISBN 978-1-956792-03-4.
^ Han, Xiaochuang; Kumar, Sachin; Tsvetkov, Yulia (2023). "SSD-LM: Semi-autoregressive Simplex-based Diffusion Language Model for Text Generation and Modular Control". Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, PA, USA: Association for Computational Linguistics: 11575–11596. arXiv:2210.17432. doi:10.18653/v1/2023.acl-long.647.
^ Xu, Weijie; Hu, Wenxiang; Wu, Fanyou; Sengamedu, Srinivasan (2023). "DeTiME: Diffusion-Enhanced Topic Modeling using Encoder-decoder based LLM". Findings of the Association for Computational Linguistics: EMNLP 2023. Stroudsburg, PA, USA: Association for Computational Linguistics: 9040–9057. arXiv:2310.15296. doi:10.18653/v1/2023.findings-emnlp.606.
^ Zhang, Haopeng; Liu, Xiao; Zhang, Jiawei (2023). "DiffuSum: Generation Enhanced Extractive Summarization with Diffusion". Findings of the Association for Computational Linguistics: ACL 2023. Stroudsburg, PA, USA: Association for Computational Linguistics: 13089–13100. arXiv:2305.01735. doi:10.18653/v1/2023.findings-acl.828.
^ Yang, Dongchao; Yu, Jianwei; Wang, Helin; Wang, Wen; Weng, Chao; Zou, Yuexian; Yu, Dong (2023). "Diffsound: Discrete Diffusion Model for Text-to-Sound Generation". IEEE/ACM Transactions on Audio, Speech, and Language Processing. 31: 1720–1733. arXiv:2207.09983. doi:10.1109/taslp.2023.3268730. ISSN 2329-9290.
^ Janner, Michael; Du, Yilun; Tenenbaum, Joshua B.; Levine, Sergey (2022-12-20). "Planning with Diffusion for Flexible Behavior Synthesis". arXiv:2205.09991 [cs.LG].
^ Chi, Cheng; Xu, Zhenjia; Feng, Siyuan; Cousineau, Eric; Du, Yilun; Burchfiel, Benjamin; Tedrake, Russ; Song, Shuran (2024-03-14). "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion". arXiv:2303.04137 [cs.RO].

[chang23design-1] Chang, Ziyi; Koulieris, George Alex; Shum, Hubert P. H. (2023). "On the Design Fundamentals of Diffusion Models: A Survey". arXiv:2306.04542 [cs.LG].

[song-2] Song, Yang; Sohl-Dickstein, Jascha; Kingma, Diederik P.; Kumar, Abhishek; Ermon, Stefano; Poole, Ben (2021-02-10). "Score-Based Generative Modeling through Stochastic Differential Equations". arXiv:2011.13456 [cs.LG].

[3] Croitoru, Florinel-Alin; Hondru, Vlad; Ionescu, Radu Tudor; Shah, Mubarak (2023). "Diffusion Models in Vision: A Survey". IEEE Transactions on Pattern Analysis and Machine Intelligence. 45 (9): 10850–10869. arXiv:2209.04747. doi:10.1109/TPAMI.2023.3261988. PMID 37030794. S2CID 252199918.

[ho-4] Cite error: The named reference ho was invoked but never defined (see the help page).

[gu-5] Gu, Shuyang; Chen, Dong; Bao, Jianmin; Wen, Fang; Zhang, Bo; Chen, Dongdong; Yuan, Lu; Guo, Baining (2021). "Vector Quantized Diffusion Model for Text-to-Image Synthesis". arXiv:2111.14822 [cs.CV].

[dalle2-6] Cite error: The named reference dalle2 was invoked but never defined (see the help page).

[7] Li, Yifan; Zhou, Kun; Zhao, Wayne Xin; Wen, Ji-Rong (August 2023). "Diffusion Models for Non-autoregressive Text Generation: A Survey". Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence. California: International Joint Conferences on Artificial Intelligence Organization. pp. 6692–6701. arXiv:2303.06574. doi:10.24963/ijcai.2023/750. ISBN 978-1-956792-03-4.

[8] Han, Xiaochuang; Kumar, Sachin; Tsvetkov, Yulia (2023). "SSD-LM: Semi-autoregressive Simplex-based Diffusion Language Model for Text Generation and Modular Control". Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, PA, USA: Association for Computational Linguistics: 11575–11596. arXiv:2210.17432. doi:10.18653/v1/2023.acl-long.647.

[9] Xu, Weijie; Hu, Wenxiang; Wu, Fanyou; Sengamedu, Srinivasan (2023). "DeTiME: Diffusion-Enhanced Topic Modeling using Encoder-decoder based LLM". Findings of the Association for Computational Linguistics: EMNLP 2023. Stroudsburg, PA, USA: Association for Computational Linguistics: 9040–9057. arXiv:2310.15296. doi:10.18653/v1/2023.findings-emnlp.606.

[10] Zhang, Haopeng; Liu, Xiao; Zhang, Jiawei (2023). "DiffuSum: Generation Enhanced Extractive Summarization with Diffusion". Findings of the Association for Computational Linguistics: ACL 2023. Stroudsburg, PA, USA: Association for Computational Linguistics: 13089–13100. arXiv:2305.01735. doi:10.18653/v1/2023.findings-acl.828.

[11] Yang, Dongchao; Yu, Jianwei; Wang, Helin; Wang, Wen; Weng, Chao; Zou, Yuexian; Yu, Dong (2023). "Diffsound: Discrete Diffusion Model for Text-to-Sound Generation". IEEE/ACM Transactions on Audio, Speech, and Language Processing. 31: 1720–1733. arXiv:2207.09983. doi:10.1109/taslp.2023.3268730. ISSN 2329-9290.

[12] Janner, Michael; Du, Yilun; Tenenbaum, Joshua B.; Levine, Sergey (2022-12-20). "Planning with Diffusion for Flexible Behavior Synthesis". arXiv:2205.09991 [cs.LG].

[13] Chi, Cheng; Xu, Zhenjia; Feng, Siyuan; Cousineau, Eric; Du, Yilun; Burchfiel, Benjamin; Tedrake, Russ; Song, Shuran (2024-03-14). "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion". arXiv:2303.04137 [cs.RO].

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]