Multimodal learning

Multimodal learning is a type of deep learning that integrates and processes multiple types of data, referred to as modalities, such as text, audio, images, or video. Combining modalities gives a model complementary views of the same input, which can improve performance in tasks such as visual question answering, cross-modal retrieval,[1] text-to-image generation,[2] aesthetic ranking,[3] and image captioning.[4]
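As an illustration of cross-modal retrieval, the sketch below ranks images against a text query by cosine similarity in a shared embedding space. It is a minimal sketch under stated assumptions, not any particular model's method: the three-dimensional vectors, the file names, and the helper functions `cosine_similarity` and `retrieve` are all hypothetical, hand-written for the example. In a real system the embeddings would be produced by trained text and image encoders.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, image_embeddings):
    """Return image ids ordered from most to least similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, embedding), image_id)
        for image_id, embedding in image_embeddings.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [image_id for _, image_id in scored]


# Hypothetical toy embeddings; a trained image encoder would produce these.
IMAGE_EMBEDDINGS = {
    "dog.jpg": [0.9, 0.1, 0.0],
    "car.jpg": [0.0, 0.2, 0.9],
    "cat.jpg": [0.7, 0.6, 0.1],
}

# Hypothetical embedding standing in for the text query "a photo of a dog".
TEXT_QUERY_EMBEDDING = [1.0, 0.2, 0.0]

ranking = retrieve(TEXT_QUERY_EMBEDDING, IMAGE_EMBEDDINGS)
print(ranking)
```

The only multimodal ingredient here is the assumption that text and images are mapped into the same vector space, so that a text vector can be compared directly with image vectors.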

Large multimodal models, such as Google Gemini and GPT-4o, have grown in popularity since 2023. Because they handle more than one modality, they are more versatile and can capture a broader range of real-world phenomena.[5]

  1. Hendriksen, Mariya; Bleeker, Maurits; Vakulenko, Svitlana; van Noord, Nanne; Kuiper, Ernst; de Rijke, Maarten (2021). "Extending CLIP for Category-to-image Retrieval in E-commerce". arXiv:2112.11294 [cs.CV].
  2. "Stable Diffusion Repository on GitHub". CompVis - Machine Vision and Learning Research Group, LMU Munich. 17 September 2022. Archived from the original on 18 January 2023. Retrieved 17 September 2022.
  3. LAION-AI/aesthetic-predictor. LAION AI. 6 September 2024. Retrieved 8 September 2024.
  4. Mokady, Ron; Hertz, Amir; Bermano, Amit H. (2021). "ClipCap: CLIP Prefix for Image Captioning". arXiv:2111.09734 [cs.CV].
  5. Zia, Tehseen (8 January 2024). "Unveiling of Large Multimodal Models: Shaping the Landscape of Language Models in 2024". Unite.ai. Retrieved 1 June 2024.