In machine learning, multimodal learning is a type of deep learning that uses multiple modalities of data, such as text, audio, or images.
In contrast, unimodal models can process only one type of data, such as text (typically represented as feature vectors) or images. Multimodal learning differs from combining independently trained unimodal models: a multimodal model combines information from different modalities in order to make better predictions.[1]
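As a rough illustration of combining information across modalities, the following minimal sketch trains a single logistic-regression model on concatenated text and image features (one simple approach, often called early fusion). The function names `fuse`, `predict`, and `train`, the one-dimensional feature vectors, and the toy labels are all assumptions made for this example; real multimodal models use learned encoders and far richer fusion schemes.

```python
import math


def fuse(text_vec, image_vec):
    """Early fusion (illustrative): concatenate per-modality feature vectors."""
    return list(text_vec) + list(image_vec)


def predict(weights, bias, features):
    """Logistic-regression probability for one fused feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, labels, epochs=2000, lr=0.5):
    """Fit weights jointly over all fused features with plain SGD."""
    weights, bias = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            err = predict(weights, bias, x) - y
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


# Hypothetical toy data: the label is 1 only when BOTH modalities signal it,
# so neither modality on its own determines the label.
text_feats = [[0.0], [0.0], [1.0], [1.0]]
image_feats = [[0.0], [1.0], [0.0], [1.0]]
labels = [0, 0, 0, 1]

fused = [fuse(t, i) for t, i in zip(text_feats, image_feats)]
weights, bias = train(fused, labels)
predictions = [int(predict(weights, bias, x) >= 0.5) for x in fused]
```

In this toy setup the model is trained on both modalities at once, rather than training one model per modality and merging their outputs afterwards.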
Large multimodal models, such as Google Gemini and GPT-4o, have become increasingly popular since 2023, offering greater versatility and a broader understanding of real-world phenomena.[2]