{"id":7009,"date":"2023-08-01T06:02:08","date_gmt":"2023-08-01T14:02:08","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=7009"},"modified":"2025-04-24T17:15:03","modified_gmt":"2025-04-24T17:15:03","slug":"recurrent-neural-networks-rnns-in-computer-vision-image-captioning","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/recurrent-neural-networks-rnns-in-computer-vision-image-captioning\/","title":{"rendered":"Recurrent Neural Networks (RNNs) in Computer Vision: Image Captioning"},"content":{"rendered":"\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/recurrent-neural-networks-rnns-in-computer-vision-image-captioning\">\n\n\n\n<div class=\"fh fi fj fk fl\">\n<div class=\"lt bg\">\n<figure class=\"lu lv lw lx ly lt bg paragraph-image\"><picture><img loading=\"lazy\" decoding=\"async\" class=\"bg lz ma c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:2500\/1*H_2MTNOMBya7B0qqyrHqeg.jpeg\" alt=\"\" width=\"2400\" height=\"1667\"><\/picture><figcaption class=\"mb mc md me mf mg mh be b bf z dv\" data-selectable-paragraph=\"\">Photo by <a class=\"af mi\" href=\"https:\/\/unsplash.com\/@simplicity?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\" target=\"_blank\" rel=\"noopener ugc nofollow\">Marija Zaric<\/a> on <a class=\"af mi\" href=\"https:\/\/unsplash.com\/s\/photos\/label?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\" target=\"_blank\" rel=\"noopener ugc nofollow\">Unsplash<\/a><\/figcaption><\/figure>\n<\/div>\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<p id=\"c5a7\" class=\"pw-post-body-paragraph mj mk fo be b ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf fh bj\" data-selectable-paragraph=\"\"><a class=\"af mi\" href=\"https:\/\/medium.com\/@jeremyscohen\/rnns-in-computer-vision-b04b438d805c\" rel=\"noopener\">In a previous article<\/a>, I discussed the possibilities of computer 
vision-based deep learning with both RNNs and CNNs.<\/p>\n<p id=\"05cb\" class=\"pw-post-body-paragraph mj mk fo be b ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf fh bj\" data-selectable-paragraph=\"\">Generally, ML engineers will specialize in one model architecture and let the other slide.<\/p>\n<blockquote class=\"ng nh ni\"><p id=\"3fc9\" class=\"mj mk nj be b ml mm mn mo mp mq mr ms nk mu mv mw nl my mz na nm nc nd ne nf fh bj\" data-selectable-paragraph=\"\">My point and purpose for writing this post is the following: learning both lets you tackle a wider range of use cases.<\/p><\/blockquote>\n<p id=\"3b48\" class=\"pw-post-body-paragraph mj mk fo be b ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf fh bj\" data-selectable-paragraph=\"\">Last week, I tried the final project of the course <a class=\"af mi\" href=\"https:\/\/www.coursera.org\/learn\/intro-to-deep-learning\" target=\"_blank\" rel=\"noopener ugc nofollow\">Introduction to Deep Learning<\/a> from HSE (Higher School of Economics). In this project, we learn how to use the output of a convolutional neural network (CNN) for tasks other than image classification or regression.<\/p>\n<p id=\"a167\" class=\"pw-post-body-paragraph mj mk fo be b ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf fh bj\" data-selectable-paragraph=\"\">Here, we\u2019ll instead learn how to feed this output into another neural network: a recurrent neural network (RNN). An RNN is a type of neural network that can work with sequences such as text, sound, videos, finance data, and more.<\/p>\n<p id=\"08f7\" class=\"pw-post-body-paragraph mj mk fo be b ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf fh bj\" data-selectable-paragraph=\"\">Combining CNNs and RNNs helps us work with images and sequences of words in this case. 
The goal, then, is to generate captions for a given image.<\/p>\n<p id=\"685c\" class=\"pw-post-body-paragraph mj mk fo be b ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf fh bj\" data-selectable-paragraph=\"\"><strong class=\"be nn\">\ud83d\udcac For example, we could run the desired network on Conor McGregor\u2019s UFC image and get a description. This will be our objective.<\/strong><\/p>\n<figure class=\"np nq nr ns nt lt me mf paragraph-image\">\n<div class=\"nu nv eb nw bg nx\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg lz ma c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*Wd5nSZa7p069yG8i8yCTwA.png\" alt=\"\" width=\"700\" height=\"471\"><\/figure><div class=\"me mf no\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*Wd5nSZa7p069yG8i8yCTwA.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*Wd5nSZa7p069yG8i8yCTwA.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*Wd5nSZa7p069yG8i8yCTwA.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*Wd5nSZa7p069yG8i8yCTwA.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*Wd5nSZa7p069yG8i8yCTwA.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*Wd5nSZa7p069yG8i8yCTwA.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*Wd5nSZa7p069yG8i8yCTwA.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 
700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*Wd5nSZa7p069yG8i8yCTwA.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*Wd5nSZa7p069yG8i8yCTwA.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*Wd5nSZa7p069yG8i8yCTwA.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*Wd5nSZa7p069yG8i8yCTwA.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*Wd5nSZa7p069yG8i8yCTwA.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*Wd5nSZa7p069yG8i8yCTwA.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*Wd5nSZa7p069yG8i8yCTwA.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"mb mc md me mf mg mh be b bf z dv\" data-selectable-paragraph=\"\">(image produced by neural network from this <a class=\"af mi\" href=\"https:\/\/cdn.vox-cdn.com\/thumbor\/uGr5UyGoRjjUdkr8-npC61MRoEY=\/0x0:2880x1920\/1200x800\/filters:focal(1321x381:1781x841)\/cdn.vox-cdn.com\/uploads\/chorus_image\/image\/66129128\/087_Conor_McGregor_x_Donald_Cerrone.0.jpg\" target=\"_blank\" rel=\"noopener ugc nofollow\">source<\/a>)<\/figcaption>\n<\/figure>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"fh fi fj fk fl\">\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<h2 id=\"12a2\" class=\"og oh fo be oi oj ok ol om on oo op oq mt or os ot mx ou ov ow nb ox oy oz pa bj\" data-selectable-paragraph=\"\">Before we start<\/h2>\n<ul class=\"\">\n<li id=\"fbe9\" class=\"mj mk fo be b ml pb mn mo mp pc mr ms nk pd mv mw nl pe 
mz na nm pf nd ne nf pg ph pi bj\" data-selectable-paragraph=\"\"><a class=\"af mi\" href=\"https:\/\/mailchi.mp\/820bed51b8dc\/onestepahead?utm_source=medium&amp;utm_medium=email&amp;utm_campaign=Medium+captioning\" target=\"_blank\" rel=\"noopener ugc nofollow\"><em class=\"nj\">Subscribe to the daily emails<\/em><\/a><em class=\"nj\"> and learn cutting-edge Computer Vision &amp; Self-Driving Cars every day!<\/em><\/li>\n<li id=\"d1a2\" class=\"mj mk fo be b ml pj mn mo mp pk mr ms nk pl mv mw nl pm mz na nm pn nd ne nf pg ph pi bj\" data-selectable-paragraph=\"\"><em class=\"nj\">Visit <\/em><a class=\"af mi\" href=\"https:\/\/www.thinkautonomous.ai\/\" target=\"_blank\" rel=\"noopener ugc nofollow\"><em class=\"nj\">thinkautonomous.ai<\/em><\/a><em class=\"nj\"> and get at the leading-edge of Autonomous Technologies<\/em><\/li>\n<\/ul>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"fh fi fj fk fl\">\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<h1 id=\"18df\" class=\"po oh fo be oi pp pq pr om ps pt pu oq pv pw px py pz qa qb qc qd qe qf qg qh bj\" data-selectable-paragraph=\"\">Why image captioning?<\/h1>\n<figure class=\"np nq nr ns nt lt me mf paragraph-image\">\n<div class=\"nu nv eb nw bg nx\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg lz ma c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*DKqniHLa5_5ZFwVN_ncgZw.png\" alt=\"\" width=\"700\" height=\"284\"><\/figure><div class=\"me mf qi\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*DKqniHLa5_5ZFwVN_ncgZw.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*DKqniHLa5_5ZFwVN_ncgZw.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*DKqniHLa5_5ZFwVN_ncgZw.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*DKqniHLa5_5ZFwVN_ncgZw.png 786w, 
https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*DKqniHLa5_5ZFwVN_ncgZw.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*DKqniHLa5_5ZFwVN_ncgZw.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*DKqniHLa5_5ZFwVN_ncgZw.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*DKqniHLa5_5ZFwVN_ncgZw.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*DKqniHLa5_5ZFwVN_ncgZw.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*DKqniHLa5_5ZFwVN_ncgZw.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*DKqniHLa5_5ZFwVN_ncgZw.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*DKqniHLa5_5ZFwVN_ncgZw.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*DKqniHLa5_5ZFwVN_ncgZw.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*DKqniHLa5_5ZFwVN_ncgZw.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"mb mc md me mf mg mh be b bf z dv\" 
data-selectable-paragraph=\"\">(<a class=\"af mi\" href=\"https:\/\/research.googleblog.com\/2014\/11\/a-picture-is-worth-thousand-coherent.html\" target=\"_blank\" rel=\"noopener ugc nofollow\">source<\/a>)<\/figcaption>\n<\/figure>\n<blockquote class=\"qj\"><p id=\"6364\" class=\"qk ql fo be qm qn qo qp qq qr qs nf dv\" data-selectable-paragraph=\"\">A picture is worth a thousand words. But sometimes we actually want the words.<\/p><\/blockquote>\n<p id=\"352a\" class=\"pw-post-body-paragraph mj mk fo be b ml qt mn mo mp qu mr ms mt qv mv mw mx qw mz na nb qx nd ne nf fh bj\" data-selectable-paragraph=\"\">Let\u2019s pause a moment and try to understand the possibilities of image captioning.<\/p>\n<p id=\"287a\" class=\"pw-post-body-paragraph mj mk fo be b ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf fh bj\" data-selectable-paragraph=\"\"><strong class=\"be nn\">If the output is a bunch of words, it means that we are going to use these words. <\/strong>Specifically, we\u2019ll use these words for contextual understanding, or to describe more detailed scenarios.<\/p>\n<blockquote class=\"ng nh ni\"><p id=\"a4d2\" class=\"mj mk nj be b ml mm mn mo mp mq mr ms nk mu mv mw nl my mz na nm nc nd ne nf fh bj\" data-selectable-paragraph=\"\">Let\u2019s say that you have to identify a specific type of clothing to then recommend other clothes in matching styles. 
This is called \u201cvisual search\u201d and could change fashion retail forever.<\/p><\/blockquote>\n<figure class=\"np nq nr ns nt lt me mf paragraph-image\">\n<div class=\"nu nv eb nw bg nx\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg lz ma c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*XnC1B8q3jHGwtUpMhtIDRQ.png\" alt=\"\" width=\"700\" height=\"375\"><\/figure><div class=\"me mf qy\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*XnC1B8q3jHGwtUpMhtIDRQ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*XnC1B8q3jHGwtUpMhtIDRQ.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*XnC1B8q3jHGwtUpMhtIDRQ.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*XnC1B8q3jHGwtUpMhtIDRQ.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*XnC1B8q3jHGwtUpMhtIDRQ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*XnC1B8q3jHGwtUpMhtIDRQ.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*XnC1B8q3jHGwtUpMhtIDRQ.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*XnC1B8q3jHGwtUpMhtIDRQ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*XnC1B8q3jHGwtUpMhtIDRQ.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*XnC1B8q3jHGwtUpMhtIDRQ.png 750w, 
https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*XnC1B8q3jHGwtUpMhtIDRQ.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*XnC1B8q3jHGwtUpMhtIDRQ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*XnC1B8q3jHGwtUpMhtIDRQ.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*XnC1B8q3jHGwtUpMhtIDRQ.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<\/figure>\n<p id=\"264a\" class=\"pw-post-body-paragraph mj mk fo be b ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf fh bj\" data-selectable-paragraph=\"\">In the image above, captioning can help us understand the specific clothes a person is wearing\u2014and their overall style. In this example, the detail is not strong enough. A better image to input to the network would include more specific fashion items.<\/p>\n<p id=\"c481\" class=\"pw-post-body-paragraph mj mk fo be b ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf fh bj\" data-selectable-paragraph=\"\">Going to the extreme use case, we could even translate a football match in real-time and replace the on-air commentary with AI-generated voices discussing the match. 
I\u2019m not talking about a robot voice here; we could imitate whoever we want.<\/p>\n<blockquote class=\"ng nh ni\"><p id=\"5988\" class=\"mj mk nj be b ml mm mn mo mp mq mr ms nk mu mv mw nl my mz na nm nc nd ne nf fh bj\" data-selectable-paragraph=\"\">Just look at this AI that can imitate Joe Rogan: <a class=\"af mi\" href=\"https:\/\/fakejoerogan.com\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">https:\/\/fakejoerogan.com<\/a><\/p><\/blockquote>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"fh fi fj fk fl\">\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<h1 id=\"a4d8\" class=\"po oh fo be oi pp pq pr om ps pt pu oq pv pw px py pz qa qb qc qd qe qf qg qh bj\" data-selectable-paragraph=\"\">How?<\/h1>\n<p id=\"7b4b\" class=\"pw-post-body-paragraph mj mk fo be b ml pb mn mo mp pc mr ms mt pd mv mw mx pe mz na nb pf nd ne nf fh bj\" data-selectable-paragraph=\"\">To do this, we need to use two different neural networks: a CNN and an RNN. Here I assume you\u2019re a bit familiar with both. 
Before getting into technical details, let\u2019s view the dataset and the output we want to generate.<\/p>\n<h2 id=\"9d54\" class=\"og oh fo be oi oj ok ol om on oo op oq mt or os ot mx ou ov ow nb ox oy oz pa bj\" data-selectable-paragraph=\"\">Dataset<\/h2>\n<figure class=\"np nq nr ns nt lt me mf paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg lz ma c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:534\/1*CNk55KpPjeK12qVGZ1xZxA.png\" alt=\"\" width=\"534\" height=\"366\"><\/figure><div class=\"me mf re\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*CNk55KpPjeK12qVGZ1xZxA.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*CNk55KpPjeK12qVGZ1xZxA.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*CNk55KpPjeK12qVGZ1xZxA.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*CNk55KpPjeK12qVGZ1xZxA.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*CNk55KpPjeK12qVGZ1xZxA.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*CNk55KpPjeK12qVGZ1xZxA.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1068\/format:webp\/1*CNk55KpPjeK12qVGZ1xZxA.png 1068w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 534px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*CNk55KpPjeK12qVGZ1xZxA.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*CNk55KpPjeK12qVGZ1xZxA.png 720w, 
https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*CNk55KpPjeK12qVGZ1xZxA.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*CNk55KpPjeK12qVGZ1xZxA.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*CNk55KpPjeK12qVGZ1xZxA.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*CNk55KpPjeK12qVGZ1xZxA.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1068\/1*CNk55KpPjeK12qVGZ1xZxA.png 1068w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 534px\" data-testid=\"og\"><\/picture><\/div>\n<figcaption class=\"mb mc md me mf mg mh be b bf z dv\" data-selectable-paragraph=\"\">Image \u2014 Label<\/figcaption>\n<\/figure>\n<p id=\"f372\" class=\"pw-post-body-paragraph mj mk fo be b ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf fh bj\" data-selectable-paragraph=\"\">The dataset is a collection of images and captions. Here, it\u2019s the <a class=\"af mi\" href=\"http:\/\/cocodataset.org\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">COCO dataset<\/a>. 
For each image, a set of sentences (captions) is used as a label to describe the scene.<\/p>\n<blockquote class=\"ng nh ni\"><p id=\"2627\" class=\"mj mk nj be b ml mm mn mo mp mq mr ms nk mu mv mw nl my mz na nm nc nd ne nf fh bj\" data-selectable-paragraph=\"\">It means our final output will be a sentence like these.<\/p><\/blockquote>\n<h2 id=\"8f1a\" class=\"og oh fo be oi oj ok ol om on oo op oq mt or os ot mx ou ov ow nb ox oy oz pa bj\" data-selectable-paragraph=\"\">Pre-processing<\/h2>\n<blockquote class=\"ng nh ni\"><p id=\"ab6d\" class=\"mj mk nj be b ml mm mn mo mp mq mr ms nk mu mv mw nl my mz na nm nc nd ne nf fh bj\" data-selectable-paragraph=\"\">The words are converted into tokens, and each token is then mapped to a vector called a <strong class=\"be nn\">word embedding<\/strong>.<\/p><\/blockquote>\n<figure class=\"np nq nr ns nt lt me mf paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg lz ma c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:611\/1*IitCsxlfRaEsbU_6bzU6Bw.png\" alt=\"\" width=\"611\" height=\"335\"><\/figure><div class=\"me mf rf\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*IitCsxlfRaEsbU_6bzU6Bw.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*IitCsxlfRaEsbU_6bzU6Bw.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*IitCsxlfRaEsbU_6bzU6Bw.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*IitCsxlfRaEsbU_6bzU6Bw.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*IitCsxlfRaEsbU_6bzU6Bw.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*IitCsxlfRaEsbU_6bzU6Bw.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1222\/format:webp\/1*IitCsxlfRaEsbU_6bzU6Bw.png 1222w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) 
and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 611px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*IitCsxlfRaEsbU_6bzU6Bw.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*IitCsxlfRaEsbU_6bzU6Bw.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*IitCsxlfRaEsbU_6bzU6Bw.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*IitCsxlfRaEsbU_6bzU6Bw.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*IitCsxlfRaEsbU_6bzU6Bw.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*IitCsxlfRaEsbU_6bzU6Bw.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1222\/1*IitCsxlfRaEsbU_6bzU6Bw.png 1222w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 611px\" data-testid=\"og\"><\/picture><\/div>\n<figcaption class=\"mb mc md me mf mg mh be b bf z dv\" data-selectable-paragraph=\"\">(<a class=\"af mi\" href=\"https:\/\/freecontent.manning.com\/wp-content\/uploads\/Chollet_DLfT_01.png\" target=\"_blank\" rel=\"noopener ugc nofollow\">source<\/a>)<\/figcaption>\n<\/figure>\n<blockquote class=\"qj\"><p id=\"ff72\" class=\"qk ql fo be qm qn qo qp qq qr qs nf dv\" data-selectable-paragraph=\"\">The process to convert an image into words\/token is as follows:<\/p><\/blockquote>\n<ul class=\"\">\n<li id=\"5875\" 
class=\"mj mk fo be b ml qt mn mo mp qu mr ms nk qv mv mw nl qw mz na nm qx nd ne nf pg ph pi bj\" data-selectable-paragraph=\"\"><strong class=\"be nn\">Take an image <\/strong>as an input and embed it<\/li>\n<li id=\"f4f1\" class=\"mj mk fo be b ml pj mn mo mp pk mr ms nk pl mv mw nl pm mz na nm pn nd ne nf pg ph pi bj\" data-selectable-paragraph=\"\"><strong class=\"be nn\">Condition<\/strong> the RNN on that embedding<\/li>\n<li id=\"74ef\" class=\"mj mk fo be b ml pj mn mo mp pk mr ms nk pl mv mw nl pm mz na nm pn nd ne nf pg ph pi bj\" data-selectable-paragraph=\"\"><strong class=\"be nn\">Predict<\/strong> the next token given a START input token<\/li>\n<li id=\"3689\" class=\"mj mk fo be b ml pj mn mo mp pk mr ms nk pl mv mw nl pm mz na nm pn nd ne nf pg ph pi bj\" data-selectable-paragraph=\"\"><strong class=\"be nn\">Use<\/strong> the predicted token as an input at the next time step<\/li>\n<li id=\"93ad\" class=\"mj mk fo be b ml pj mn mo mp pk mr ms nk pl mv mw nl pm mz na nm pn nd ne nf pg ph pi bj\" data-selectable-paragraph=\"\"><strong class=\"be nn\">Iterate<\/strong> until you predict an END token<\/li>\n<\/ul>\n<blockquote class=\"ng nh ni\"><p id=\"27f0\" class=\"mj mk nj be b ml mm mn mo mp mq mr ms nk mu mv mw nl my mz na nm nc nd ne nf fh bj\" data-selectable-paragraph=\"\">TL;DR \u2014 We have images and sentences for each. 
Sentences are converted into vectors.<\/p><\/blockquote>\n<h2 id=\"6dc8\" class=\"og oh fo be oi oj ok ol om on oo op oq mt or os ot mx ou ov ow nb ox oy oz pa bj\" data-selectable-paragraph=\"\">Encoder<\/h2>\n<p id=\"a831\" class=\"pw-post-body-paragraph mj mk fo be b ml pb mn mo mp pc mr ms mt pd mv mw mx pe mz na nb pf nd ne nf fh bj\" data-selectable-paragraph=\"\">The encoder is a convolutional neural network named <strong class=\"be nn\">Inception V3.<\/strong> This is a popular architecture for image classification.<\/p>\n<figure class=\"np nq nr ns nt lt me mf paragraph-image\">\n<div class=\"nu nv eb nw bg nx\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg lz ma c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*XGgOduWU6d2zpSEa\" alt=\"\" width=\"700\" height=\"262\"><\/figure><div class=\"me mf qi\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*XGgOduWU6d2zpSEa 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*XGgOduWU6d2zpSEa 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*XGgOduWU6d2zpSEa 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*XGgOduWU6d2zpSEa 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*XGgOduWU6d2zpSEa 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*XGgOduWU6d2zpSEa 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*XGgOduWU6d2zpSEa 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source 
srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*XGgOduWU6d2zpSEa 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*XGgOduWU6d2zpSEa 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*XGgOduWU6d2zpSEa 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*XGgOduWU6d2zpSEa 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*XGgOduWU6d2zpSEa 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*XGgOduWU6d2zpSEa 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*XGgOduWU6d2zpSEa 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"mb mc md me mf mg mh be b bf z dv\" data-selectable-paragraph=\"\">Inception v3 (<a class=\"af mi\" href=\"https:\/\/research.googleblog.com\/2016\/03\/train-your-own-image-classifier-with.html\" target=\"_blank\" rel=\"noopener ugc nofollow\">source<\/a>)<\/figcaption>\n<\/figure>\n<p id=\"29b2\" class=\"pw-post-body-paragraph mj mk fo be b ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf fh bj\" data-selectable-paragraph=\"\">The code used to build that CNN encoder with Keras is below:<\/p>\n<pre>import keras\nfrom keras import backend as K  # assumes Keras 2.x, where K.set_learning_phase exists\n\ndef get_cnn_encoder():\n    # Inference mode: dropout and batch norm behave as at test time\n    K.set_learning_phase(False)\n    # Inception V3 without its classification head\n    model = keras.applications.InceptionV3(include_top=False)\n    preprocess_for_model = keras.applications.inception_v3.preprocess_input\n\n    # Average-pool the last feature map into one vector per image\n    model = keras.models.Model(model.inputs, keras.layers.GlobalAveragePooling2D()(model.output))\n    return model, preprocess_for_model<\/pre>\n<p id=\"3c5f\" class=\"pw-post-body-paragraph mj mk fo be b ml 
mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf fh bj\" data-selectable-paragraph=\"\">As you can see, the fully-connected classification head is removed with the parameter <code class=\"cw rj rk rl rm b\">include_top=False<\/code> inside the function call. This means that we use the convolutional features directly, without passing them through an output layer built for a particular purpose (classification, regression, etc.).<\/p>\n<blockquote class=\"ng nh ni\"><p id=\"c39f\" class=\"mj mk nj be b ml mm mn mo mp mq mr ms nk mu mv mw nl my mz na nm nc nd ne nf fh bj\" data-selectable-paragraph=\"\">Here, I assume you are already familiar with CNNs and this kind of code.<br>\nWe simply create an Inception v3 model that we return; we don\u2019t have to create the layers ourselves.<\/p><\/blockquote>\n<h2 id=\"2418\" class=\"og oh fo be oi oj ok ol om on oo op oq mt or os ot mx ou ov ow nb ox oy oz pa bj\" data-selectable-paragraph=\"\">Decoder<\/h2>\n<p id=\"a58f\" class=\"pw-post-body-paragraph mj mk fo be b ml pb mn mo mp pc mr ms mt pd mv mw mx pe mz na nb pf nd ne nf fh bj\" data-selectable-paragraph=\"\">The decoder part of the model uses a recurrent neural network with LSTM cells to generate the captions.<\/p>\n<figure class=\"np nq nr ns nt lt me mf paragraph-image\">\n<div class=\"nu nv eb nw bg nx\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg lz ma c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*xouGPXQYGad-e4T3\" alt=\"\" width=\"700\" height=\"311\"><\/figure><div class=\"me mf rn\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*xouGPXQYGad-e4T3 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*xouGPXQYGad-e4T3 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*xouGPXQYGad-e4T3 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*xouGPXQYGad-e4T3 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*xouGPXQYGad-e4T3 828w, 
https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*xouGPXQYGad-e4T3 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*xouGPXQYGad-e4T3 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*xouGPXQYGad-e4T3 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*xouGPXQYGad-e4T3 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*xouGPXQYGad-e4T3 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*xouGPXQYGad-e4T3 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*xouGPXQYGad-e4T3 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*xouGPXQYGad-e4T3 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*xouGPXQYGad-e4T3 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"mb mc md me mf mg mh be b bf z dv\" data-selectable-paragraph=\"\">(<a class=\"af mi\" href=\"https:\/\/miro.medium.com\/max\/4744\/1*ERwScS7k6IH3hZIJmGdHDg.png\" rel=\"noopener\">source<\/a>)<\/figcaption>\n<\/figure>\n<p id=\"9c1b\" 
class=\"pw-post-body-paragraph mj mk fo be b ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf fh bj\" data-selectable-paragraph=\"\">Essentially, the CNN output is adapted and fed to an RNN that learns to generate the caption word by word.<\/p>\n<h1 id=\"270a\" class=\"po oh fo be oi pp ro pr om ps rp pu oq pv rq px py pz rr qb qc qd rs qf qg qh bj\" data-selectable-paragraph=\"\"><strong class=\"al\">How?<\/strong><\/h1>\n<p id=\"4d3c\" class=\"pw-post-body-paragraph mj mk fo be b ml pb mn mo mp pc mr ms mt pd mv mw mx pe mz na nb pf nd ne nf fh bj\" data-selectable-paragraph=\"\">First, you might notice the vertical layers. This is what a recurrent neural network produces: one for each step of the sequence. Every vertical layer tries to predict the next word given the image and the words before it.<\/p>\n<p id=\"8011\" class=\"pw-post-body-paragraph mj mk fo be b ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf fh bj\" data-selectable-paragraph=\"\">The first layer takes the embedded image and the \u201cstart\u201d token and predicts \u201ca\u201d; then \u201cman\u201d is predicted, so the RNN writes \u201ca man\u201d; other words are then generated, such as \u201cpizza\u201d.<\/p>\n<p id=\"2460\" class=\"pw-post-body-paragraph mj mk fo be b ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf fh bj\" data-selectable-paragraph=\"\">We then learn how to say that a man holds a slice of pizza. <strong class=\"be nn\">The image features are used, and we try to correlate them with our captions.<\/strong><\/p>\n<p id=\"dad4\" class=\"pw-post-body-paragraph mj mk fo be b ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf fh bj\" data-selectable-paragraph=\"\">To get long-term memory, the RNN is built from LSTM (Long Short-Term Memory) cells, which can keep track of earlier words. 
For example, <code class=\"cw rj rk rl rm b\">a man holding ___ beer<\/code> could be completed as <code class=\"cw rj rk rl rm b\">a man holding his beer<\/code>, because the notion of masculinity is preserved.<\/p>\n<p id=\"ac28\" class=\"pw-post-body-paragraph mj mk fo be b ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf fh bj\" data-selectable-paragraph=\"\">Finally, the horizontal layers are ordinary neural network layers, as in any deep network. We could even stack more of these.<\/p>\n<p id=\"8ed1\" class=\"pw-post-body-paragraph mj mk fo be b ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf fh bj\" data-selectable-paragraph=\"\">Let\u2019s dive into the code to see how this works.<\/p>\n<blockquote class=\"ng nh ni\"><p id=\"7c10\" class=\"mj mk nj be b ml mm mn mo mp mq mr ms nk mu mv mw nl my mz na nm nc nd ne nf fh bj\" data-selectable-paragraph=\"\">The decoder part first uses word embeddings. Let\u2019s analyze the code.<\/p><\/blockquote>\n<p id=\"a000\" class=\"pw-post-body-paragraph mj mk fo be b ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf fh bj\" data-selectable-paragraph=\"\">We first define a <code class=\"cw rj rk rl rm b\">decoder<\/code> class and two placeholders. In TensorFlow 1.x, a placeholder is used to feed data into the graph at run time, for example during training. 
We\u2019ll have one placeholder for the image embeddings and one for the sentences.<\/p>\n<pre>class decoder:\n    # Image embeddings, shape [batch, IMG_EMBED_SIZE]\n    img_embeds = tf.placeholder('float32', [None, IMG_EMBED_SIZE])\n    # Token ids of the captions, shape [batch, time]\n    sentences = tf.placeholder('int32', [None, None])<\/pre>\n<p id=\"a85f\" class=\"pw-post-body-paragraph mj mk fo be b ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf fh bj\" data-selectable-paragraph=\"\">Then, we define our layers:<\/p>\n<ul class=\"\">\n<li id=\"d9f4\" class=\"mj mk fo be b ml mm mn mo mp mq mr ms nk mu mv mw nl my mz na nm nc nd ne nf pg ph pi bj\" data-selectable-paragraph=\"\"><code class=\"cw rj rk rl rm b\"><strong class=\"be nn\">img_embed_to_bottleneck<\/strong><\/code> projects the image embedding to a smaller bottleneck, which reduces the number of parameters.<\/li>\n<li id=\"d041\" class=\"mj mk fo be b ml pj mn mo mp pk mr ms nk pl mv mw nl pm mz na nm pn nd ne nf pg ph pi bj\" data-selectable-paragraph=\"\"><code class=\"cw rj rk rl rm b\"><strong class=\"be nn\">img_embed_bottleneck_to_h0<\/strong><\/code> converts the bottlenecked image embedding into the initial LSTM state.<\/li>\n<li id=\"77f8\" class=\"mj mk fo be b ml pj mn mo mp pk mr ms nk pl mv mw nl pm mz na nm pn nd ne nf pg ph pi bj\" data-selectable-paragraph=\"\"><code class=\"cw rj rk rl rm b\"><strong class=\"be nn\">word_embed<\/strong><\/code> creates a word embedding layer whose input size is the length of the vocabulary (all known words).<\/li>\n<\/ul>\n<pre>  img_embed_to_bottleneck = L.Dense(IMG_EMBED_BOTTLENECK, input_shape=(None, IMG_EMBED_SIZE), activation='elu')\n  img_embed_bottleneck_to_h0 = L.Dense(LSTM_UNITS, input_shape=(None, IMG_EMBED_BOTTLENECK), activation='elu')\n  word_embed = L.Embedding(len(vocab), WORD_EMBED_SIZE)<\/pre>\n<figure class=\"np nq nr ns nt lt\"><\/figure>\n<ul class=\"\">\n<li id=\"936d\" class=\"mj mk fo be b ml mm mn mo mp mq mr ms nk mu mv mw nl my mz na nm nc nd ne nf pg ph pi bj\" data-selectable-paragraph=\"\">The next part creates an <strong class=\"be nn\">LSTM<\/strong> cell of a few hundred 
units<\/li>\n<li id=\"7b37\" class=\"mj mk fo be b ml pj mn mo mp pk mr ms nk pl mv mw nl pm mz na nm pn nd ne nf pg ph pi bj\" data-selectable-paragraph=\"\">Finally, the network must predict words. The raw, unnormalized scores over the vocabulary are called logits, so we need to convert the LSTM output into logits<\/li>\n<li id=\"21ec\" class=\"mj mk fo be b ml pj mn mo mp pk mr ms nk pl mv mw nl pm mz na nm pn nd ne nf pg ph pi bj\" data-selectable-paragraph=\"\"><code class=\"cw rj rk rl rm b\"><strong class=\"be nn\">token_logits_bottleneck<\/strong><\/code> converts the LSTM output to a logits bottleneck, which reduces the model complexity<\/li>\n<li id=\"7024\" class=\"mj mk fo be b ml pj mn mo mp pk mr ms nk pl mv mw nl pm mz na nm pn nd ne nf pg ph pi bj\" data-selectable-paragraph=\"\"><code class=\"cw rj rk rl rm b\"><strong class=\"be nn\">token_logits<\/strong><\/code> converts the bottleneck features into logits using a <code class=\"cw rj rk rl rm b\">Dense()<\/code> layer<\/li>\n<\/ul>\n<pre>  lstm = tf.nn.rnn_cell.LSTMCell(LSTM_UNITS)\n  token_logits_bottleneck = L.Dense(LOGIT_BOTTLENECK, input_shape=(None, LSTM_UNITS), activation=\"elu\")\n  token_logits = L.Dense(len(vocab), input_shape=(None, LOGIT_BOTTLENECK))<\/pre>\n<figure class=\"np nq nr ns nt lt\"><\/figure>\n<ul class=\"\">\n<li id=\"940d\" class=\"mj mk fo be b ml mm mn mo mp mq mr ms nk mu mv mw nl my mz na nm nc nd ne nf pg ph pi bj\" data-selectable-paragraph=\"\">We can then <strong class=\"be nn\">condition our LSTM cell on the image<\/strong> embeddings placeholder.<\/li>\n<li id=\"76ed\" class=\"mj mk fo be b ml pj mn mo mp pk mr ms nk pl mv mw nl pm mz na nm pn nd ne nf pg ph pi bj\" data-selectable-paragraph=\"\">We <strong class=\"be nn\">embed all the tokens<\/strong> but the last, since the last token has no next word to predict<\/li>\n<li id=\"36dc\" class=\"mj mk fo be b ml pj mn mo mp pk mr ms nk pl mv mw nl pm mz na nm pn nd ne nf pg ph pi bj\" data-selectable-paragraph=\"\">Then, we <strong class=\"be nn\">create a dynamic RNN<\/strong> and <strong 
class=\"be nn\">calculate token logits<\/strong> for all the hidden states. We\u2019ll compare these with the ground truth<\/li>\n<li id=\"6fa2\" class=\"mj mk fo be b ml pj mn mo mp pk mr ms nk pl mv mw nl pm mz na nm pn nd ne nf pg ph pi bj\" data-selectable-paragraph=\"\">We create a <strong class=\"be nn\">loss mask<\/strong> that takes the value <strong class=\"be nn\">1 for real tokens<\/strong> and <strong class=\"be nn\">0 for padding<\/strong><\/li>\n<li id=\"d466\" class=\"mj mk fo be b ml pj mn mo mp pk mr ms nk pl mv mw nl pm mz na nm pn nd ne nf pg ph pi bj\" data-selectable-paragraph=\"\">Finally, we compute a <strong class=\"be nn\">cross-entropy loss<\/strong>, generally used for classification. This loss compares the <code class=\"cw rj rk rl rm b\">flat_ground_truth<\/code> (the next-word targets) to the <code class=\"cw rj rk rl rm b\">flat_token_logits<\/code> (prediction).<\/li>\n<\/ul>\n<pre>  # Initial LSTM state (cell and hidden) computed from the image embedding\n  c0 = h0 = img_embed_bottleneck_to_h0(img_embed_to_bottleneck(img_embeds))\n  # Inputs are all tokens but the last\n  word_embeds = word_embed(sentences[:, :-1])\n  hidden_states, _ = tf.nn.dynamic_rnn(lstm, word_embeds, initial_state=tf.nn.rnn_cell.LSTMStateTuple(c0, h0))\n\n  # Flatten [batch, time, units] to [batch * time, units] before the dense layers\n  flat_hidden_states = tf.reshape(hidden_states, [-1, LSTM_UNITS])\n  flat_token_logits = token_logits(token_logits_bottleneck(flat_hidden_states))\n  # Targets are the same sentences shifted by one token\n  flat_ground_truth = tf.reshape(sentences[:, 1:], [-1])\n\n  # Ignore padding positions when averaging the loss\n  flat_loss_mask = tf.not_equal(flat_ground_truth, pad_idx)\n  xent = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=flat_ground_truth, logits=flat_token_logits)\n  loss = tf.reduce_mean(tf.boolean_mask(xent, flat_loss_mask))<\/pre>\n<figure class=\"np nq nr ns nt lt\"><\/figure>\n<h1 id=\"aeb6\" class=\"po oh fo be oi pp ro pr om ps rp pu oq pv rq px py pz rr qb qc qd rs qf qg qh bj\" data-selectable-paragraph=\"\">Results<\/h1>\n<p id=\"013b\" class=\"pw-post-body-paragraph mj mk fo be b ml pb mn mo mp pc mr ms mt pd mv mw mx pe mz na nb pf nd ne nf fh bj\" data-selectable-paragraph=\"\">Let\u2019s visualize some results on 
real data.<\/p>\n<\/div>\n<\/div>\n<div class=\"lt\">\n<div class=\"ab ca\">\n<div class=\"rt ru rv rw rx ry ce rz cf sa ch bg\">\n<div class=\"np nq nr ns nt ab jw\">\n<figure class=\"kv lt sb sc sd se sf paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg lz ma c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:1088\/1*7tAF6FTcycwW8jWbX2MqJQ.png\" alt=\"\" width=\"552\" height=\"1132\"><\/figure><div class=\"nu nv eb nw bg nx\" tabindex=\"0\" role=\"button\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*7tAF6FTcycwW8jWbX2MqJQ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*7tAF6FTcycwW8jWbX2MqJQ.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*7tAF6FTcycwW8jWbX2MqJQ.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*7tAF6FTcycwW8jWbX2MqJQ.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*7tAF6FTcycwW8jWbX2MqJQ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*7tAF6FTcycwW8jWbX2MqJQ.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1104\/format:webp\/1*7tAF6FTcycwW8jWbX2MqJQ.png 1104w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 552px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*7tAF6FTcycwW8jWbX2MqJQ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*7tAF6FTcycwW8jWbX2MqJQ.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*7tAF6FTcycwW8jWbX2MqJQ.png 750w, 
https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*7tAF6FTcycwW8jWbX2MqJQ.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*7tAF6FTcycwW8jWbX2MqJQ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*7tAF6FTcycwW8jWbX2MqJQ.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1104\/1*7tAF6FTcycwW8jWbX2MqJQ.png 1104w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 552px\" data-testid=\"og\"><\/picture><\/div>\n<\/figure>\n<figure class=\"kv lt sg sc sd se sf paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg lz ma c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:662\/1*9Jv-tjzpcKT5_CvuayLXZg.png\" alt=\"\" width=\"449\" height=\"846\"><\/figure><div class=\"nu nv eb nw bg nx\" tabindex=\"0\" role=\"button\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*9Jv-tjzpcKT5_CvuayLXZg.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*9Jv-tjzpcKT5_CvuayLXZg.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*9Jv-tjzpcKT5_CvuayLXZg.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*9Jv-tjzpcKT5_CvuayLXZg.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*9Jv-tjzpcKT5_CvuayLXZg.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*9Jv-tjzpcKT5_CvuayLXZg.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:898\/format:webp\/1*9Jv-tjzpcKT5_CvuayLXZg.png 898w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, 
(-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 449px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*9Jv-tjzpcKT5_CvuayLXZg.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*9Jv-tjzpcKT5_CvuayLXZg.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*9Jv-tjzpcKT5_CvuayLXZg.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*9Jv-tjzpcKT5_CvuayLXZg.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*9Jv-tjzpcKT5_CvuayLXZg.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*9Jv-tjzpcKT5_CvuayLXZg.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:898\/1*9Jv-tjzpcKT5_CvuayLXZg.png 898w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 449px\" data-testid=\"og\"><\/picture><\/div>\n<\/figure>\n<\/div>\n<div class=\"ab jw\">\n<figure class=\"kv lt sh sc sd se sf paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg lz ma c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:766\/1*-YJ3Ut7W_Os01kp53CSZYA.png\" alt=\"\" width=\"391\" height=\"834\"><\/figure><div class=\"nu nv eb nw bg nx\" tabindex=\"0\" role=\"button\"><picture><source 
srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*-YJ3Ut7W_Os01kp53CSZYA.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*-YJ3Ut7W_Os01kp53CSZYA.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*-YJ3Ut7W_Os01kp53CSZYA.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*-YJ3Ut7W_Os01kp53CSZYA.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*-YJ3Ut7W_Os01kp53CSZYA.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*-YJ3Ut7W_Os01kp53CSZYA.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:782\/format:webp\/1*-YJ3Ut7W_Os01kp53CSZYA.png 782w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 391px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*-YJ3Ut7W_Os01kp53CSZYA.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*-YJ3Ut7W_Os01kp53CSZYA.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*-YJ3Ut7W_Os01kp53CSZYA.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*-YJ3Ut7W_Os01kp53CSZYA.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*-YJ3Ut7W_Os01kp53CSZYA.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*-YJ3Ut7W_Os01kp53CSZYA.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:782\/1*-YJ3Ut7W_Os01kp53CSZYA.png 782w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 
3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 391px\" data-testid=\"og\"><\/picture><\/div>\n<\/figure>\n<figure class=\"kv lt si sc sd se sf paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg lz ma c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:856\/1*r9MBLECh9btjfk_VaLoKOA.png\" alt=\"\" width=\"610\" height=\"598\"><\/figure><div class=\"nu nv eb nw bg nx\" tabindex=\"0\" role=\"button\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*r9MBLECh9btjfk_VaLoKOA.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*r9MBLECh9btjfk_VaLoKOA.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*r9MBLECh9btjfk_VaLoKOA.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*r9MBLECh9btjfk_VaLoKOA.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*r9MBLECh9btjfk_VaLoKOA.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*r9MBLECh9btjfk_VaLoKOA.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1220\/format:webp\/1*r9MBLECh9btjfk_VaLoKOA.png 1220w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 610px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*r9MBLECh9btjfk_VaLoKOA.png 640w, 
https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*r9MBLECh9btjfk_VaLoKOA.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*r9MBLECh9btjfk_VaLoKOA.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*r9MBLECh9btjfk_VaLoKOA.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*r9MBLECh9btjfk_VaLoKOA.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*r9MBLECh9btjfk_VaLoKOA.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1220\/1*r9MBLECh9btjfk_VaLoKOA.png 1220w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 610px\" data-testid=\"og\"><\/picture><\/div>\n<\/figure>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<p id=\"3430\" class=\"pw-post-body-paragraph mj mk fo be b ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf fh bj\" data-selectable-paragraph=\"\">A couple of representative takeaways:<\/p>\n<ul class=\"\">\n<li id=\"c0a4\" class=\"mj mk fo be b ml mm mn mo mp mq mr ms nk mu mv mw nl my mz na nm nc nd ne nf pg ph pi bj\" data-selectable-paragraph=\"\">It\u2019s not perfect, but the model shows a solid understanding of context.<\/li>\n<li id=\"13fc\" class=\"mj mk fo be b ml pj mn mo mp pk mr ms nk pl mv mw nl pm mz na nm pn nd ne nf pg ph pi bj\" data-selectable-paragraph=\"\">In the top-right image, the woman is mistaken for a man.<\/li>\n<\/ul>\n<blockquote class=\"qj\"><p id=\"3380\" class=\"qk ql fo be qm qn qo qp qq qr qs nf dv\" data-selectable-paragraph=\"\">To dive deeper, we might want to train the full network on more objects that 
represent more specific use cases (fashion, sports, etc).<\/p><\/blockquote>\n<h1 id=\"d585\" class=\"po oh fo be oi pp ro pr om ps rp pu oq pv sj px py pz sk qb qc qd sl qf qg qh bj\" data-selectable-paragraph=\"\">Now, what about the UFC image?<\/h1>\n<figure class=\"np nq nr ns nt lt me mf paragraph-image\">\n<div class=\"nu nv eb nw bg nx\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg lz ma c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*qo28kO9eP0LlHcY9E42-6A.png\" alt=\"\" width=\"700\" height=\"497\"><\/figure><div class=\"me mf sm\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*qo28kO9eP0LlHcY9E42-6A.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*qo28kO9eP0LlHcY9E42-6A.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*qo28kO9eP0LlHcY9E42-6A.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*qo28kO9eP0LlHcY9E42-6A.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*qo28kO9eP0LlHcY9E42-6A.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*qo28kO9eP0LlHcY9E42-6A.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*qo28kO9eP0LlHcY9E42-6A.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*qo28kO9eP0LlHcY9E42-6A.png 640w, 
https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*qo28kO9eP0LlHcY9E42-6A.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*qo28kO9eP0LlHcY9E42-6A.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*qo28kO9eP0LlHcY9E42-6A.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*qo28kO9eP0LlHcY9E42-6A.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*qo28kO9eP0LlHcY9E42-6A.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*qo28kO9eP0LlHcY9E42-6A.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<\/figure>\n<p id=\"e5b6\" class=\"pw-post-body-paragraph mj mk fo be b ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf fh bj\" data-selectable-paragraph=\"\">I\u2019m a bit disappointed: the model doesn\u2019t understand what UFC is, who Conor is, or what a left hook looks like! We definitely can\u2019t use this model on just any image. We\u2019d need to train it on UFC examples to get better captions. 
However, I\u2019m convinced that we can achieve it.<\/p>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"fh fi fj fk fl\">\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<p id=\"7b8d\" class=\"pw-post-body-paragraph mj mk fo be b ml mm mn mo mp mq mr ms mt mu mv mw mx my mz na nb nc nd ne nf fh bj\" data-selectable-paragraph=\"\">To learn more about the full project check out this <a href=\"https:\/\/github.com\/Jeremy26\/image-captioning\/blob\/master\/image_captioning.ipynb\">GitHub repo<\/a>.<\/p>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Photo by Marija Zaric on Unsplash In a previous article, I discussed the possibilities of computer vision-based deep learning with both RNNs and CNNs. Generally, ML engineers will specialize in one model architecture and let the other slide. My point and purpose for writing this post is the following: learning both allows to tackle a [&hellip;]<\/p>\n","protected":false},"author":62,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[6],"tags":[],"coauthors":[162],"class_list":["post-7009","post","type-post","status-publish","format-standard","hentry","category-machine-learning"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Recurrent Neural Networks (RNNs) in Computer Vision: Image Captioning - Comet<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/recurrent-neural-networks-rnns-in-computer-vision-image-captioning\/\" \/>\n<meta property=\"og:locale\" 
content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Recurrent Neural Networks (RNNs) in Computer Vision: Image Captioning\" \/>\n<meta property=\"og:description\" content=\"Photo by Marija Zaric on Unsplash In a previous article, I discussed the possibilities of computer vision-based deep learning with both RNNs and CNNs. Generally, ML engineers will specialize in one model architecture and let the other slide. My point and purpose for writing this post is the following: learning both allows to tackle a [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/recurrent-neural-networks-rnns-in-computer-vision-image-captioning\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2023-08-01T14:02:08+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-04-24T17:15:03+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/miro.medium.com\/v2\/resize:fit:2500\/1*H_2MTNOMBya7B0qqyrHqeg.jpeg\" \/>\n<meta name=\"author\" content=\"Jeremy Cohen\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Cometml\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Jeremy Cohen\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"9 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. 
-->","yoast_head_json":{"title":"Recurrent Neural Networks (RNNs) in Computer Vision: Image Captioning - Comet","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/recurrent-neural-networks-rnns-in-computer-vision-image-captioning\/","og_locale":"en_US","og_type":"article","og_title":"Recurrent Neural Networks (RNNs) in Computer Vision: Image Captioning","og_description":"Photo by Marija Zaric on Unsplash In a previous article, I discussed the possibilities of computer vision-based deep learning with both RNNs and CNNs. Generally, ML engineers will specialize in one model architecture and let the other slide. My point and purpose for writing this post is the following: learning both allows to tackle a [&hellip;]","og_url":"https:\/\/www.comet.com\/site\/blog\/recurrent-neural-networks-rnns-in-computer-vision-image-captioning\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2023-08-01T14:02:08+00:00","article_modified_time":"2025-04-24T17:15:03+00:00","og_image":[{"url":"https:\/\/miro.medium.com\/v2\/resize:fit:2500\/1*H_2MTNOMBya7B0qqyrHqeg.jpeg","type":"","width":"","height":""}],"author":"Jeremy Cohen","twitter_card":"summary_large_image","twitter_creator":"@Cometml","twitter_site":"@Cometml","twitter_misc":{"Written by":"Jeremy Cohen","Est. 
reading time":"9 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/recurrent-neural-networks-rnns-in-computer-vision-image-captioning\/#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/recurrent-neural-networks-rnns-in-computer-vision-image-captioning\/"},"author":{"name":"Jeremy Cohen","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/1e64ca8044dcb12997aabe6d1d38c5a7"},"headline":"Recurrent Neural Networks (RNNs) in Computer Vision: Image Captioning","datePublished":"2023-08-01T14:02:08+00:00","dateModified":"2025-04-24T17:15:03+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/recurrent-neural-networks-rnns-in-computer-vision-image-captioning\/"},"wordCount":1257,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/recurrent-neural-networks-rnns-in-computer-vision-image-captioning\/#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:2500\/1*H_2MTNOMBya7B0qqyrHqeg.jpeg","articleSection":["Machine Learning"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/recurrent-neural-networks-rnns-in-computer-vision-image-captioning\/","url":"https:\/\/www.comet.com\/site\/blog\/recurrent-neural-networks-rnns-in-computer-vision-image-captioning\/","name":"Recurrent Neural Networks (RNNs) in Computer Vision: Image Captioning - 
Comet","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/recurrent-neural-networks-rnns-in-computer-vision-image-captioning\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/recurrent-neural-networks-rnns-in-computer-vision-image-captioning\/#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:2500\/1*H_2MTNOMBya7B0qqyrHqeg.jpeg","datePublished":"2023-08-01T14:02:08+00:00","dateModified":"2025-04-24T17:15:03+00:00","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/recurrent-neural-networks-rnns-in-computer-vision-image-captioning\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/recurrent-neural-networks-rnns-in-computer-vision-image-captioning\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/recurrent-neural-networks-rnns-in-computer-vision-image-captioning\/#primaryimage","url":"https:\/\/miro.medium.com\/v2\/resize:fit:2500\/1*H_2MTNOMBya7B0qqyrHqeg.jpeg","contentUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:2500\/1*H_2MTNOMBya7B0qqyrHqeg.jpeg"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/recurrent-neural-networks-rnns-in-computer-vision-image-captioning\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"Recurrent Neural Networks (RNNs) in Computer Vision: Image Captioning"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models 
Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/1e64ca8044dcb12997aabe6d1d38c5a7","name":"Jeremy Cohen","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/2cf8c0f1d0bef51059a3a80ededdf00a","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/08\/1686841399905-96x96.jpg","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/08\/1686841399905-96x96.jpg","caption":"Jeremy 
Cohen"},"url":"https:\/\/www.comet.com\/site\/blog\/author\/hellothinkautonomous-ai\/"}]}},"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7009","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/62"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=7009"}],"version-history":[{"count":1,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7009\/revisions"}],"predecessor-version":[{"id":15595,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7009\/revisions\/15595"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=7009"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=7009"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=7009"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=7009"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}