{"id":7016,"date":"2023-08-01T06:15:38","date_gmt":"2023-08-01T14:15:38","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=7016"},"modified":"2025-04-24T17:15:00","modified_gmt":"2025-04-24T17:15:00","slug":"introduction-to-multimodal-deep-learning","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/introduction-to-multimodal-deep-learning\/","title":{"rendered":"Introduction to Multimodal Deep Learning"},"content":{"rendered":"\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/introduction-to-multimodal-deep-learning\">\n\n\n\n<div class=\"fh fi fj fk fl\">\n<div class=\"mf bg\">\n<figure class=\"mg mh mi mj mk mf bg paragraph-image\"><picture><img loading=\"lazy\" decoding=\"async\" class=\"bg ml mm c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:2500\/1*-BXJVpQniyc7lK6c6u_p7A.jpeg\" alt=\"\" width=\"2400\" height=\"1875\"><\/picture><figcaption class=\"mn mo mp mq mr ms mt be b bf z dv\" data-selectable-paragraph=\"\">Photo by <a class=\"af mu\" href=\"https:\/\/unsplash.com\/@anniespratt?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\" target=\"_blank\" rel=\"noopener ugc nofollow\">Annie Spratt<\/a> on <a class=\"af mu\" href=\"https:\/\/unsplash.com\/s\/photos\/photo-book?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\" target=\"_blank\" rel=\"noopener ugc nofollow\">Unsplash<\/a><\/figcaption><\/figure>\n<\/div>\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<p id=\"b353\" class=\"pw-post-body-paragraph mv mw fo be b gm mx my mz gp na nb nc nd ne nf ng nh ni nj nk nl nm nn no np fh bj\" data-selectable-paragraph=\"\">Our experience of the world is multimodal \u2014 we see objects, hear sounds, feel textures, smell odors, and taste flavors, and then come to a decision. 
<strong class=\"be nq\">Multimodal learning<\/strong> suggests that when several of our senses \u2014 visual, auditory, kinesthetic \u2014 are engaged in processing information, we understand and remember more. By combining these modes, learners can integrate information from different sources.<\/p>\n<figure class=\"mg mh mi mj mk mf mq mr paragraph-image\">\n<div class=\"ns nt eb nu bg nv\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ml mm c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*AW5Xn9Fkog_wNDaBBeCG3w.png\" alt=\"\" width=\"700\" height=\"329\"><\/figure><div class=\"mq mr nr\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*AW5Xn9Fkog_wNDaBBeCG3w.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*AW5Xn9Fkog_wNDaBBeCG3w.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*AW5Xn9Fkog_wNDaBBeCG3w.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*AW5Xn9Fkog_wNDaBBeCG3w.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*AW5Xn9Fkog_wNDaBBeCG3w.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*AW5Xn9Fkog_wNDaBBeCG3w.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*AW5Xn9Fkog_wNDaBBeCG3w.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*AW5Xn9Fkog_wNDaBBeCG3w.png 640w, 
https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*AW5Xn9Fkog_wNDaBBeCG3w.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*AW5Xn9Fkog_wNDaBBeCG3w.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*AW5Xn9Fkog_wNDaBBeCG3w.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*AW5Xn9Fkog_wNDaBBeCG3w.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*AW5Xn9Fkog_wNDaBBeCG3w.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*AW5Xn9Fkog_wNDaBBeCG3w.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"mn mo mp mq mr ms mt be b bf z dv\" data-selectable-paragraph=\"\">Baseline of multimodal learning \u2014 Photo on <a class=\"af mu\" href=\"https:\/\/www.researchgate.net\/figure\/Baseline-of-multimodal-deep-learning-model-It-deals-with-multisource-data-directly-and_fig1_334532323\" target=\"_blank\" rel=\"noopener ugc nofollow\">ResearchGate<\/a><\/figcaption>\n<\/figure>\n<p id=\"f678\" class=\"pw-post-body-paragraph mv mw fo be b gm mx my mz gp na nb nc nd ne nf ng nh ni nj nk nl nm nn no np fh bj\" data-selectable-paragraph=\"\">In deep learning, training models on only one source of information\u2014be it images, text, audio, or video\u2014is commonplace.<\/p>\n<p id=\"3597\" class=\"pw-post-body-paragraph mv mw fo be b gm mx my mz gp na nb nc nd ne nf ng nh ni nj nk nl nm nn no np fh bj\" data-selectable-paragraph=\"\"><strong class=\"be nq\">But there\u2019s also a way to build models that 
incorporate two data types\u2014say, text and images\u2014at the same time. <\/strong>Working with multimodal data can improve a neural network, because features extracted from all sources contribute to its predictions.<\/p>\n<h1 id=\"b1a2\" class=\"nw nx fo be ny nz oa go ob oc od gr oe of og oh oi oj ok ol om on oo op oq or bj\" data-selectable-paragraph=\"\">Benefits of multimodal data<\/h1>\n<p id=\"9258\" class=\"pw-post-body-paragraph mv mw fo be b gm os my mz gp ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np fh bj\" data-selectable-paragraph=\"\">Modes are, essentially, channels of information. Data from multiple sources are semantically correlated and sometimes provide complementary information, revealing patterns that aren\u2019t visible when working with individual modalities on their own. Multimodal systems consolidate heterogeneous, disconnected data from various sensors, which helps produce more robust predictions.<\/p>\n<p id=\"34b2\" class=\"pw-post-body-paragraph mv mw fo be b gm mx my mz gp na nb nc nd ne nf ng nh ni nj nk nl nm nn no np fh bj\" data-selectable-paragraph=\"\">For example, an emotion detector could combine EEG recordings with eye movement signals to classify someone\u2019s current mood, using two different data sources for one deep learning task.<\/p>\n<figure class=\"mg mh mi mj mk mf mq mr paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ml mm c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:455\/1*ZR7roGjziJzwm7iGCrpECw.gif\" alt=\"\" width=\"455\" height=\"368\"><\/figure><div class=\"mq mr ox\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*ZR7roGjziJzwm7iGCrpECw.gif 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*ZR7roGjziJzwm7iGCrpECw.gif 720w, 
https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*ZR7roGjziJzwm7iGCrpECw.gif 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*ZR7roGjziJzwm7iGCrpECw.gif 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*ZR7roGjziJzwm7iGCrpECw.gif 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*ZR7roGjziJzwm7iGCrpECw.gif 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:910\/1*ZR7roGjziJzwm7iGCrpECw.gif 910w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 455px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*ZR7roGjziJzwm7iGCrpECw.gif 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*ZR7roGjziJzwm7iGCrpECw.gif 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*ZR7roGjziJzwm7iGCrpECw.gif 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*ZR7roGjziJzwm7iGCrpECw.gif 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*ZR7roGjziJzwm7iGCrpECw.gif 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*ZR7roGjziJzwm7iGCrpECw.gif 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:910\/1*ZR7roGjziJzwm7iGCrpECw.gif 910w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 
700px) 100vw, 455px\" data-testid=\"og\"><\/picture><\/div>\n<figcaption class=\"mn mo mp mq mr ms mt be b bf z dv\" data-selectable-paragraph=\"\">Example of multimodal learning \u2014 Photo on <a class=\"af mu\" href=\"https:\/\/media.springernature.com\/original\/springer-static\/image\/chp%3A10.1007%2F978-3-319-46672-9_58\/MediaObjects\/432150_1_En_58_Fig1_HTML.gif\" target=\"_blank\" rel=\"noopener ugc nofollow\">Springer<\/a><\/figcaption>\n<\/figure>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"fh fi fj fk fl\">\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<h1 id=\"39f5\" class=\"nw nx fo be ny nz pq go ob oc pr gr oe of ps oh oi oj pt ol om on pu op oq or bj\" data-selectable-paragraph=\"\">How multimodal learning works<\/h1>\n<p id=\"2054\" class=\"pw-post-body-paragraph mv mw fo be b gm os my mz gp ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np fh bj\" data-selectable-paragraph=\"\">Deep neural networks have been successfully applied to unsupervised feature learning for single modalities (e.g., text, images, or audio). Here, we aim to fuse information from different modalities to improve our network\u2019s predictive ability. 
The overall task can mainly be divided into three phases \u2014 individual feature learning, information fusion and testing.<\/p>\n<figure class=\"mg mh mi mj mk mf mq mr paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ml mm c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:418\/1*EnM3FqCkm4MhpBCexePWWQ.jpeg\" alt=\"\" width=\"418\" height=\"325\"><\/figure><div class=\"mq mr pv\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*EnM3FqCkm4MhpBCexePWWQ.jpeg 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*EnM3FqCkm4MhpBCexePWWQ.jpeg 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*EnM3FqCkm4MhpBCexePWWQ.jpeg 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*EnM3FqCkm4MhpBCexePWWQ.jpeg 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*EnM3FqCkm4MhpBCexePWWQ.jpeg 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*EnM3FqCkm4MhpBCexePWWQ.jpeg 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:836\/format:webp\/1*EnM3FqCkm4MhpBCexePWWQ.jpeg 836w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 418px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*EnM3FqCkm4MhpBCexePWWQ.jpeg 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*EnM3FqCkm4MhpBCexePWWQ.jpeg 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*EnM3FqCkm4MhpBCexePWWQ.jpeg 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*EnM3FqCkm4MhpBCexePWWQ.jpeg 
786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*EnM3FqCkm4MhpBCexePWWQ.jpeg 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*EnM3FqCkm4MhpBCexePWWQ.jpeg 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:836\/1*EnM3FqCkm4MhpBCexePWWQ.jpeg 836w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 418px\" data-testid=\"og\"><\/picture><\/div>\n<figcaption class=\"mn mo mp mq mr ms mt be b bf z dv\" data-selectable-paragraph=\"\">Photo by <a class=\"af mu\" href=\"https:\/\/dev.to\/kayis\" target=\"_blank\" rel=\"noopener ugc nofollow\">kayis <\/a>on <a class=\"af mu\" href=\"https:\/\/dev.to\/kayis\/introduction-to-multimodal-learning-model-4ngm\" target=\"_blank\" rel=\"noopener ugc nofollow\">Dev<\/a><\/figcaption>\n<\/figure>\n<p id=\"ba3b\" class=\"pw-post-body-paragraph mv mw fo be b gm mx my mz gp na nb nc nd ne nf ng nh ni nj nk nl nm nn no np fh bj\" data-selectable-paragraph=\"\">We\u2019ll need the following:<\/p>\n<ul class=\"\">\n<li id=\"2a90\" class=\"mv mw fo be b gm mx my mz gp na nb nc pw ne nf ng px ni nj nk py nm nn no np pz qa qb bj\" data-selectable-paragraph=\"\">At least two information sources<\/li>\n<li id=\"6d27\" class=\"mv mw fo be b gm qc my mz gp qd nb nc pw qe nf ng px qf nj nk py qg nn no np pz qa qb bj\" data-selectable-paragraph=\"\">An information processing model for each source<\/li>\n<li id=\"b904\" class=\"mv mw fo be b gm qc my mz gp qd nb nc pw qe nf ng px qf nj nk py qg nn no np pz qa qb bj\" data-selectable-paragraph=\"\">A learning model for the combined information<\/li>\n<\/ul>\n<p 
id=\"a392\" class=\"pw-post-body-paragraph mv mw fo be b gm mx my mz gp na nb nc nd ne nf ng nh ni nj nk nl nm nn no np fh bj\" data-selectable-paragraph=\"\">Given these prerequisites, let\u2019s take a look at the steps involved in multimodal learning in more detail.<\/p>\n<h2 id=\"9350\" class=\"qh nx fo be ny qi qj qk ob ql qm qn oe nd qo qp qq nh qr qs qt nl qu qv qw qx bj\" data-selectable-paragraph=\"\"><strong class=\"al\">Representation of modalities<\/strong><\/h2>\n<p id=\"578c\" class=\"pw-post-body-paragraph mv mw fo be b gm os my mz gp ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np fh bj\" data-selectable-paragraph=\"\">The first fundamental step is learning how to represent and summarize the input data in a way that captures the multiple modalities. The heterogeneity of multimodal data makes it challenging to construct such representations.<\/p>\n<p id=\"5cec\" class=\"pw-post-body-paragraph mv mw fo be b gm mx my mz gp na nb nc nd ne nf ng nh ni nj nk nl nm nn no np fh bj\" data-selectable-paragraph=\"\">For example, text is often symbolic, while audio and visual modalities are represented as signals. For more details, have a look at <a class=\"af mu\" href=\"https:\/\/openreview.net\/forum?id=Hk4OO3W_bS\" target=\"_blank\" rel=\"noopener ugc nofollow\">this foundational research paper on multimodal learning<\/a>.<\/p>\n<h2 id=\"1055\" class=\"qh nx fo be ny qi qj qk ob ql qm qn oe nd qo qp qq nh qr qs qt nl qu qv qw qx bj\" data-selectable-paragraph=\"\"><strong class=\"al\">Translation<\/strong><\/h2>\n<p id=\"71dd\" class=\"pw-post-body-paragraph mv mw fo be b gm os my mz gp ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np fh bj\" data-selectable-paragraph=\"\">The second step addresses how to translate (map) data from one modality to another. Not only is the data heterogeneous, but the relationship between modalities is often open-ended or subjective. 
We also need to identify the direct relations between (sub)elements from two or more different modalities.<\/p>\n<p id=\"b9c2\" class=\"pw-post-body-paragraph mv mw fo be b gm mx my mz gp na nb nc nd ne nf ng nh ni nj nk nl nm nn no np fh bj\" data-selectable-paragraph=\"\">For example, we may want to align the steps in a recipe to a video showing the dish being made. To tackle this challenge, we need to measure similarity between different modalities and deal with possible long-range dependencies and ambiguities.<\/p>\n<h2 id=\"cd7b\" class=\"qh nx fo be ny qi qj qk ob ql qm qn oe nd qo qp qq nh qr qs qt nl qu qv qw qx bj\" data-selectable-paragraph=\"\"><strong class=\"al\">Feature extraction<\/strong><\/h2>\n<p id=\"7972\" class=\"pw-post-body-paragraph mv mw fo be b gm os my mz gp ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np fh bj\" data-selectable-paragraph=\"\">Features need to be extracted from individual sources of information by building models that best suit the type of data. Feature extraction from one source is independent of the others.<\/p>\n<p id=\"7878\" class=\"pw-post-body-paragraph mv mw fo be b gm mx my mz gp na nb nc nd ne nf ng nh ni nj nk nl nm nn no np fh bj\" data-selectable-paragraph=\"\">For example, in image-to-text translation, the features extracted from images take the form of fine details, like edges and environmental surroundings, while the corresponding features extracted from text take the form of tokens.<\/p>\n<p id=\"c68b\" class=\"pw-post-body-paragraph mv mw fo be b gm mx my mz gp na nb nc nd ne nf ng nh ni nj nk nl nm nn no np fh bj\" data-selectable-paragraph=\"\">After all the features important for prediction are extracted from both data sources, it\u2019s time to combine the different features into one shared representation.<\/p>\n<h2 id=\"caef\" class=\"qh nx fo be ny qi qj qk ob ql qm qn oe nd qo qp qq nh qr qs qt nl qu qv qw qx bj\" data-selectable-paragraph=\"\"><strong class=\"al\">Fusion and Co-learning<\/strong><\/h2>\n<p 
id=\"7465\" class=\"pw-post-body-paragraph mv mw fo be b gm os my mz gp ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np fh bj\" data-selectable-paragraph=\"\">The next step is to combine information from two or more modalities to make a prediction.<\/p>\n<p id=\"d62e\" class=\"pw-post-body-paragraph mv mw fo be b gm mx my mz gp na nb nc nd ne nf ng nh ni nj nk nl nm nn no np fh bj\" data-selectable-paragraph=\"\">For example, in audio-visual speech recognition, a visual description of lip motion is fused with the audio input to predict spoken words. The information coming from these different modalities may have varying predictive power and noise topology, with possibly missing data in at least one of the modalities. For more details, have a look at <a class=\"af mu\" href=\"http:\/\/datascienceassn.org\/sites\/default\/files\/Improved%20Multimodal%20Deep%20Learning%20with%20Variation%20of%20Information.pdf\" target=\"_blank\" rel=\"noopener ugc nofollow\">this<\/a> source.<\/p>\n<p id=\"897c\" class=\"pw-post-body-paragraph mv mw fo be b gm mx my mz gp na nb nc nd ne nf ng nh ni nj nk nl nm nn no np fh bj\" data-selectable-paragraph=\"\">Here, we can take a weighted combination of the subnetworks so that each input modality has a learned contribution (Theta) towards the output prediction. 
This lets the model weight useful features from some sources more heavily than those from others.<\/p>\n<figure class=\"mg mh mi mj mk mf mq mr paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ml mm c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:597\/1*gGxjiJ0G3Be7u770n349lQ.png\" alt=\"\" width=\"597\" height=\"209\"><\/figure><div class=\"mq mr qy\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*gGxjiJ0G3Be7u770n349lQ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*gGxjiJ0G3Be7u770n349lQ.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*gGxjiJ0G3Be7u770n349lQ.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*gGxjiJ0G3Be7u770n349lQ.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*gGxjiJ0G3Be7u770n349lQ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*gGxjiJ0G3Be7u770n349lQ.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1194\/format:webp\/1*gGxjiJ0G3Be7u770n349lQ.png 1194w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 597px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*gGxjiJ0G3Be7u770n349lQ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*gGxjiJ0G3Be7u770n349lQ.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*gGxjiJ0G3Be7u770n349lQ.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*gGxjiJ0G3Be7u770n349lQ.png 786w, 
https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*gGxjiJ0G3Be7u770n349lQ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*gGxjiJ0G3Be7u770n349lQ.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1194\/1*gGxjiJ0G3Be7u770n349lQ.png 1194w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 597px\" data-testid=\"og\"><\/picture><\/div>\n<figcaption class=\"mn mo mp mq mr ms mt be b bf z dv\" data-selectable-paragraph=\"\">Image to text translation using multimodal deep learning<\/figcaption>\n<\/figure>\n<p id=\"88af\" class=\"pw-post-body-paragraph mv mw fo be b gm mx my mz gp na nb nc nd ne nf ng nh ni nj nk nl nm nn no np fh bj\" data-selectable-paragraph=\"\">The model architecture for each modality can be chosen as needed, e.g., an LSTM for text data or a CNN for images. We can then aggregate the models by combining their features and passing them to the final classifier.<\/p>\n<h1 id=\"a456\" class=\"nw nx fo be ny nz oa go ob oc od gr oe of og oh oi oj ok ol om on oo op oq or bj\" data-selectable-paragraph=\"\">Conclusion<\/h1>\n<p id=\"233c\" class=\"pw-post-body-paragraph mv mw fo be b gm os my mz gp ot nb nc nd ou nf ng nh ov nj nk nl ow nn no np fh bj\" data-selectable-paragraph=\"\">The primary thing to keep in mind when dealing with multimodal datasets is the aggregation of features. Everything up to and including feature extraction from individual data sources follows the same rules and steps and is independent of other sources. 
The fusion of information, including how much weight to give each data type, is the primary area of research.<\/p>\n<p id=\"906a\" class=\"pw-post-body-paragraph mv mw fo be b gm mx my mz gp na nb nc nd ne nf ng nh ni nj nk nl nm nn no np fh bj\" data-selectable-paragraph=\"\">To learn more about specific multimodal learning techniques, check out this <a href=\"https:\/\/github.com\/pliang279\/awesome-multimodal-ml\">GitHub repo<\/a>.<\/p>\n<p id=\"1dcb\" class=\"pw-post-body-paragraph mv mw fo be b gm mx my mz gp na nb nc nd ne nf ng nh ni nj nk nl nm nn no np fh bj\" data-selectable-paragraph=\"\">Happy learning! Do share your experiences. If there are any areas, papers, or interesting datasets to work on, please let me know!<\/p>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Photo by Annie Spratt on Unsplash Our experience of the world is multimodal \u2014 we see objects, hear sounds, feel the texture, smell odors and taste flavors and then come up to a decision. 
Multimodal learning suggests that when a number of our senses \u2014 visual, auditory, kinesthetic \u2014 are being engaged in the processing [&hellip;]<\/p>\n","protected":false},"author":53,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[6],"tags":[],"coauthors":[155],"class_list":["post-7016","post","type-post","status-publish","format-standard","hentry","category-machine-learning"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Introduction to Multimodal Deep Learning - Comet<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/introduction-to-multimodal-deep-learning\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Introduction to Multimodal Deep Learning\" \/>\n<meta property=\"og:description\" content=\"Photo by Annie Spratt on Unsplash Our experience of the world is multimodal \u2014 we see objects, hear sounds, feel the texture, smell odors and taste flavors and then come up to a decision. 
Multimodal learning suggests that when a number of our senses \u2014 visual, auditory, kinesthetic \u2014 are being engaged in the processing [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/introduction-to-multimodal-deep-learning\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2023-08-01T14:15:38+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-04-24T17:15:00+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/miro.medium.com\/v2\/resize:fit:2500\/1*-BXJVpQniyc7lK6c6u_p7A.jpeg\" \/>\n<meta name=\"author\" content=\"Pragati Baheti\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Cometml\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Pragati Baheti\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"5 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"Introduction to Multimodal Deep Learning - Comet","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/introduction-to-multimodal-deep-learning\/","og_locale":"en_US","og_type":"article","og_title":"Introduction to Multimodal Deep Learning","og_description":"Photo by Annie Spratt on Unsplash Our experience of the world is multimodal \u2014 we see objects, hear sounds, feel the texture, smell odors and taste flavors and then come up to a decision. 
Multimodal learning suggests that when a number of our senses \u2014 visual, auditory, kinesthetic \u2014 are being engaged in the processing [&hellip;]","og_url":"https:\/\/www.comet.com\/site\/blog\/introduction-to-multimodal-deep-learning\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2023-08-01T14:15:38+00:00","article_modified_time":"2025-04-24T17:15:00+00:00","og_image":[{"url":"https:\/\/miro.medium.com\/v2\/resize:fit:2500\/1*-BXJVpQniyc7lK6c6u_p7A.jpeg","type":"","width":"","height":""}],"author":"Pragati Baheti","twitter_card":"summary_large_image","twitter_creator":"@Cometml","twitter_site":"@Cometml","twitter_misc":{"Written by":"Pragati Baheti","Est. reading time":"5 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/introduction-to-multimodal-deep-learning\/#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/introduction-to-multimodal-deep-learning\/"},"author":{"name":"Pragati Baheti","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/54958874fd9a373469e70e19b6597439"},"headline":"Introduction to Multimodal Deep Learning","datePublished":"2023-08-01T14:15:38+00:00","dateModified":"2025-04-24T17:15:00+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/introduction-to-multimodal-deep-learning\/"},"wordCount":873,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/introduction-to-multimodal-deep-learning\/#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:2500\/1*-BXJVpQniyc7lK6c6u_p7A.jpeg","articleSection":["Machine Learning"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/introduction-to-multimodal-deep-learning\/","url":"https:\/\/www.comet.com\/site\/blog\/introduction-to-multimodal-deep-learning\/","name":"Introduction to Multimodal Deep 
Learning - Comet","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/introduction-to-multimodal-deep-learning\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/introduction-to-multimodal-deep-learning\/#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:2500\/1*-BXJVpQniyc7lK6c6u_p7A.jpeg","datePublished":"2023-08-01T14:15:38+00:00","dateModified":"2025-04-24T17:15:00+00:00","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/introduction-to-multimodal-deep-learning\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/introduction-to-multimodal-deep-learning\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/introduction-to-multimodal-deep-learning\/#primaryimage","url":"https:\/\/miro.medium.com\/v2\/resize:fit:2500\/1*-BXJVpQniyc7lK6c6u_p7A.jpeg","contentUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:2500\/1*-BXJVpQniyc7lK6c6u_p7A.jpeg"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/introduction-to-multimodal-deep-learning\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"Introduction to Multimodal Deep Learning"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models 
Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/54958874fd9a373469e70e19b6597439","name":"Pragati Baheti","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/851362323c20d10f17041155fc07cae2","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/08\/1535716570395-96x96.jpg","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/08\/1535716570395-96x96.jpg","caption":"Pragati 
Baheti"},"url":"https:\/\/www.comet.com\/site\/blog\/author\/pragatibaheti001gmail-com\/"}]}},"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7016","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/53"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=7016"}],"version-history":[{"count":1,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7016\/revisions"}],"predecessor-version":[{"id":15592,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7016\/revisions\/15592"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=7016"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=7016"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=7016"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=7016"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}