{"id":6871,"date":"2023-07-19T09:51:54","date_gmt":"2023-07-19T17:51:54","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=6871"},"modified":"2025-04-24T17:15:11","modified_gmt":"2025-04-24T17:15:11","slug":"6-significant-computer-vision-problems-solved-by-ml","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/6-significant-computer-vision-problems-solved-by-ml\/","title":{"rendered":"6 Significant Computer Vision Problems Solved by ML"},"content":{"rendered":"\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/6-significant-computer-vision-problems-solved-by-ml\">\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:2572\/1*0RspkFABiHzobtJlX1X9aw.jpeg\" alt=\"\"\/><\/figure>\n\n\n\n<div class=\"mf bg\">\n<figure class=\"mg mh mi mj mk mf bg paragraph-image\"><picture><\/picture><\/figure>\n<\/div>\n\n\n\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<h1 id=\"9b04\" class=\"mn mo fo be mp mq mr go ms mt mu gr mv mw mx my mz na nb nc nd ne nf ng nh ni bj\" data-selectable-paragraph=\"\">Introduction<\/h1>\n<p id=\"8f6b\" class=\"pw-post-body-paragraph nj nk fo be b gm nl nm nn gp no np nq nr ns nt nu nv nw nx ny nz oa ob oc od fh bj\" data-selectable-paragraph=\"\">Machine learning has expanded computers\u2019 ability to understand images and extract different information from visual data. In this article, different computer vision tasks will be presented alongside explanations for how each has been tackled using machine learning.<\/p>\n<p id=\"c920\" class=\"pw-post-body-paragraph nj nk fo be b gm oe nm nn gp of np nq nr og nt nu nv oh nx ny nz oi ob oc od fh bj\" data-selectable-paragraph=\"\">A lot of machine learning research has been done in the field of computer vision throughout the last 3 decades. 
Different topics, tasks, and problems have been studied thoroughly; however, we\u2019ll focus on the core problems of computer vision, and we\u2019ll briefly present some of the more advanced hot topics in computer vision towards the end.<\/p>\n<h1 id=\"f9ae\" class=\"mn mo fo be mp mq mr go ms mt mu gr mv mw mx my mz na nb nc nd ne nf ng nh ni bj\" data-selectable-paragraph=\"\">Classification<\/h1>\n<p id=\"b88d\" class=\"pw-post-body-paragraph nj nk fo be b gm nl nm nn gp no np nq nr ns nt nu nv nw nx ny nz oa ob oc od fh bj\" data-selectable-paragraph=\"\">Image classification is the first computer vision task to be tackled by machine learning\u2014in the 1950s, the perceptron algorithm was implemented in the <a class=\"af oj\" href=\"https:\/\/en.wikipedia.org\/wiki\/Harvard_Mark_I\" target=\"_blank\" rel=\"noopener ugc nofollow\">Mark 1<\/a> Perceptron machine, which was used for image classification.<\/p>\n<p id=\"38ca\" class=\"pw-post-body-paragraph nj nk fo be b gm oe nm nn gp of np nq nr og nt nu nv oh nx ny nz oi ob oc od fh bj\" data-selectable-paragraph=\"\">Although this algorithm was efficient for structured data problems, it could only perform well on trivial tasks such as classifying different geometric shapes. 
A few decades later, the <a class=\"af oj\" href=\"https:\/\/en.wikipedia.org\/wiki\/Support_vector_machine\" target=\"_blank\" rel=\"noopener ugc nofollow\">SVM<\/a> algorithm was introduced, which could handle high-dimensional data with relatively few samples, such as small image datasets.<\/p>\n<p id=\"3099\" class=\"pw-post-body-paragraph nj nk fo be b gm oe nm nn gp of np nq nr og nt nu nv oh nx ny nz oi ob oc od fh bj\" data-selectable-paragraph=\"\">What really revolutionized computer vision was the introduction of convolutional neural networks (CNNs) by <a class=\"af oj\" href=\"http:\/\/yann.lecun.com\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">Yann LeCun<\/a> with his <a class=\"af oj\" href=\"http:\/\/yann.lecun.com\/exdb\/lenet\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">LeNet<\/a> model. CNNs were shown to be superior to other vision-based ML techniques in 2012, when <a class=\"af oj\" href=\"https:\/\/www.researchgate.net\/publication\/267960550_ImageNet_Classification_with_Deep_Convolutional_Neural_Networks\" target=\"_blank\" rel=\"noopener ugc nofollow\">AlexNet<\/a> became the first CNN-based model to win the famous <a class=\"af oj\" href=\"http:\/\/www.image-net.org\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">ImageNet<\/a> competition.<\/p>\n<figure class=\"mg mh mi mj mk mf ok ol paragraph-image\">\n<div class=\"on oo eb op bg oq\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ml mm c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*QqblVCEB2TjuuJhh.png\" alt=\"\" width=\"700\" height=\"240\"><\/figure><div class=\"ok ol om\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/0*QqblVCEB2TjuuJhh.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/0*QqblVCEB2TjuuJhh.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/0*QqblVCEB2TjuuJhh.png 750w, 
https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/0*QqblVCEB2TjuuJhh.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/0*QqblVCEB2TjuuJhh.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/0*QqblVCEB2TjuuJhh.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/0*QqblVCEB2TjuuJhh.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*QqblVCEB2TjuuJhh.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*QqblVCEB2TjuuJhh.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*QqblVCEB2TjuuJhh.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*QqblVCEB2TjuuJhh.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*QqblVCEB2TjuuJhh.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*QqblVCEB2TjuuJhh.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*QqblVCEB2TjuuJhh.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div><figcaption class=\"or os ot 
ok ol ou ov be b bf z dv\" data-selectable-paragraph=\"\"><a class=\"af oj\" href=\"https:\/\/www.researchgate.net\/figure\/Main-operations-of-a-typical-CNN-architecture_fig2_323589806\" target=\"_blank\" rel=\"noopener ugc nofollow\">Source<\/a><\/figcaption><\/figure>\n<h1 id=\"6757\" class=\"mn mo fo be mp mq mr go ms mt mu gr mv mw mx my mz na nb nc nd ne nf ng nh ni bj\" data-selectable-paragraph=\"\">Problem Formulation<\/h1>\n<p id=\"d53f\" class=\"pw-post-body-paragraph nj nk fo be b gm nl nm nn gp no np nq nr ns nt nu nv nw nx ny nz oa ob oc od fh bj\" data-selectable-paragraph=\"\">This part will be short since problem formulation is quite simple for classification. A classification problem generally involves classifying images into 2 or more classes.<\/p>\n<p id=\"ea4b\" class=\"pw-post-body-paragraph nj nk fo be b gm oe nm nn gp of np nq nr og nt nu nv oh nx ny nz oi ob oc od fh bj\" data-selectable-paragraph=\"\">In the case of using just two classes (ex: cat and no cat, text and background, etc.), the problem is known as a <em class=\"ow\">binary classification<\/em> problem\u2014for which, the last layer of the network will contain 1 neuron with a sigmoid activation function.<\/p>\n<p id=\"9788\" class=\"pw-post-body-paragraph nj nk fo be b gm oe nm nn gp of np nq nr og nt nu nv oh nx ny nz oi ob oc od fh bj\" data-selectable-paragraph=\"\">And in the case of using more than 2 classes (ex: digits, animals, vehicles, etc.), the problem is deemed a <em class=\"ow\">multi-class classification<\/em> problem, for which the last layer will contain <code class=\"cw ox oy oz pa b\">n<\/code> neurons <code class=\"cw ox oy oz pa b\">(n = Number Of Classes)<\/code> with a softmax activation function.<\/p>\n<h1 id=\"b29f\" class=\"mn mo fo be mp mq mr go ms mt mu gr mv mw mx my mz na nb nc nd ne nf ng nh ni bj\" data-selectable-paragraph=\"\">Datasets and Benchmarks<\/h1>\n<p id=\"873b\" class=\"pw-post-body-paragraph nj nk fo be b gm nl nm nn gp no np nq nr ns 
nt nu nv nw nx ny nz oa ob oc od fh bj\" data-selectable-paragraph=\"\">The most famous dataset is the <a class=\"af oj\" href=\"https:\/\/en.wikipedia.org\/wiki\/MNIST_database\" target=\"_blank\" rel=\"noopener ugc nofollow\">MNIST<\/a> Handwritten Digits dataset, which was used in the early days of computer vision and is still used as an introduction to image classification problems. Although this dataset has played an essential role in the development of computer vision, the task is considered trivial for the current state of the field and industry.<\/p>\n<p id=\"c184\" class=\"pw-post-body-paragraph nj nk fo be b gm oe nm nn gp of np nq nr og nt nu nv oh nx ny nz oi ob oc od fh bj\" data-selectable-paragraph=\"\">Currently, one of the most significant datasets is <a class=\"af oj\" href=\"http:\/\/www.image-net.org\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">ImageNet<\/a>, whose widely used competition subset consists of roughly 1.2M training images across 1K classes, covering animals as well as many everyday objects. This dataset is important for image classification for two reasons.<\/p>\n<p id=\"2609\" class=\"pw-post-body-paragraph nj nk fo be b gm oe nm nn gp of np nq nr og nt nu nv oh nx ny nz oi ob oc od fh bj\" data-selectable-paragraph=\"\">First, it\u2019s used as a benchmark for evaluating new network architectures: a yearly competition ran from 2010 until 2017, and the dataset is still used for evaluating newer architectures. 
Second, it\u2019s widely used to pre-train other networks: the dataset is so varied that it teaches networks to detect general image features that can be reused in other computer vision tasks.<\/p>\n<p id=\"a29c\" class=\"pw-post-body-paragraph nj nk fo be b gm oe nm nn gp of np nq nr og nt nu nv oh nx ny nz oi ob oc od fh bj\" data-selectable-paragraph=\"\">Many other image classification datasets are used repeatedly in research, such as <a class=\"af oj\" href=\"https:\/\/cs.stanford.edu\/~acoates\/stl10\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">STL-10<\/a>, <a class=\"af oj\" href=\"https:\/\/www.cs.toronto.edu\/~kriz\/cifar.html\" target=\"_blank\" rel=\"noopener ugc nofollow\">CIFAR-10, and CIFAR-100<\/a>. They serve a role similar to ImageNet, but with less data and smaller images.<\/p>\n<p id=\"b352\" class=\"pw-post-body-paragraph nj nk fo be b gm oe nm nn gp of np nq nr og nt nu nv oh nx ny nz oi ob oc od fh bj\" data-selectable-paragraph=\"\">As for medical research, many datasets have been developed for specific tasks, such as <a class=\"af oj\" href=\"https:\/\/www.isic-archive.com\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">ISIC<\/a>, <a class=\"af oj\" href=\"https:\/\/stanfordmlgroup.github.io\/competitions\/mura\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">MURA<\/a>, and <a class=\"af oj\" href=\"https:\/\/dermnetnz.org\/image-library\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">DermNet<\/a>. 
Medical datasets are harder to collect which is challenging for complicated tasks, as most of the time it\u2019s not very feasible to collect large datasets when needed.<\/p>\n<figure class=\"mg mh mi mj mk mf ok ol paragraph-image\">\n<div class=\"on oo eb op bg oq\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ml mm c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*zPTiJDBSFU5lkCE1.png\" alt=\"\" width=\"700\" height=\"684\"><\/figure><div class=\"ok ol pb\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/0*zPTiJDBSFU5lkCE1.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/0*zPTiJDBSFU5lkCE1.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/0*zPTiJDBSFU5lkCE1.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/0*zPTiJDBSFU5lkCE1.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/0*zPTiJDBSFU5lkCE1.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/0*zPTiJDBSFU5lkCE1.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/0*zPTiJDBSFU5lkCE1.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*zPTiJDBSFU5lkCE1.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*zPTiJDBSFU5lkCE1.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*zPTiJDBSFU5lkCE1.png 750w, 
https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*zPTiJDBSFU5lkCE1.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*zPTiJDBSFU5lkCE1.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*zPTiJDBSFU5lkCE1.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*zPTiJDBSFU5lkCE1.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"or os ot ok ol ou ov be b bf z dv\" data-selectable-paragraph=\"\">Examples from <a class=\"af oj\" href=\"https:\/\/www.researchgate.net\/figure\/Examples-in-the-ImageNet-dataset_fig7_310476818\" target=\"_blank\" rel=\"noopener ugc nofollow\">ImageNet<\/a><\/figcaption>\n<\/figure>\n<h1 id=\"94ec\" class=\"mn mo fo be mp mq mr go ms mt mu gr mv mw mx my mz na nb nc nd ne nf ng nh ni bj\" data-selectable-paragraph=\"\">Significant Models<\/h1>\n<p id=\"fe23\" class=\"pw-post-body-paragraph nj nk fo be b gm nl nm nn gp no np nq nr ns nt nu nv nw nx ny nz oa ob oc od fh bj\" data-selectable-paragraph=\"\">Although datasets can affect classification performance by presenting different variations in the data, the model architecture is also critical, as it affects the speed and the ability of the network to fit the data.<\/p>\n<h2 id=\"85d6\" class=\"pc mo fo be mp pd pe pf ms pg ph pi mv nr pj pk pl nv pm pn po nz pp pq pr ps bj\" data-selectable-paragraph=\"\"><a class=\"af oj\" href=\"http:\/\/yann.lecun.com\/exdb\/lenet\/\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"al\">LeNet5<\/strong> 
<\/a>(LeCun et al.)<\/h2>\n<p id=\"51e2\" class=\"pw-post-body-paragraph nj nk fo be b gm nl nm nn gp no np nq nr ns nt nu nv nw nx ny nz oa ob oc od fh bj\" data-selectable-paragraph=\"\">The LeNet architecture is considered the father of CNNs. It has a simple, shallow architecture, but it demonstrated the ability of convolution layers to learn good features from image data.<\/p>\n<h2 id=\"602e\" class=\"pc mo fo be mp pd pe pf ms pg ph pi mv nr pj pk pl nv pm pn po nz pp pq pr ps bj\" data-selectable-paragraph=\"\"><a class=\"af oj\" href=\"https:\/\/arxiv.org\/abs\/1409.1556\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"al\">VGG<\/strong><\/a> (Simonyan, Zisserman)<\/h2>\n<p id=\"9330\" class=\"pw-post-body-paragraph nj nk fo be b gm nl nm nn gp no np nq nr ns nt nu nv nw nx ny nz oa ob oc od fh bj\" data-selectable-paragraph=\"\">This network showed that the deeper the network, the more it can learn. It has been proven efficient through experiments using different numbers of layers in the network. 
Although this network is considered relatively shallow by today\u2019s standards, it can still handle different tasks when pre-trained on ImageNet (weights are already <a class=\"af oj\" href=\"https:\/\/www.tensorflow.org\/api_docs\/python\/tf\/keras\/applications\/VGG16\" target=\"_blank\" rel=\"noopener ugc nofollow\">available<\/a>).<\/p>\n<h2 id=\"8966\" class=\"pc mo fo be mp pd pe pf ms pg ph pi mv nr pj pk pl nv pm pn po nz pp pq pr ps bj\" data-selectable-paragraph=\"\"><a class=\"af oj\" href=\"https:\/\/arxiv.org\/abs\/1409.4842\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"al\">GoogLeNet<\/strong><\/a> (Google), aka InceptionNet<\/h2>\n<p id=\"86d5\" class=\"pw-post-body-paragraph nj nk fo be b gm nl nm nn gp no np nq nr ns nt nu nv nw nx ny nz oa ob oc od fh bj\" data-selectable-paragraph=\"\">In this work, the authors built a network that handles more complex tasks at a lower computational cost, using a special building block (the Inception block) that contains 4 parallel paths: three convolutional paths with different kernel sizes and one pooling path.<\/p>\n<p id=\"efec\" class=\"pw-post-body-paragraph nj nk fo be b gm oe nm nn gp of np nq nr og nt nu nv oh nx ny nz oi ob oc od fh bj\" data-selectable-paragraph=\"\">This technique enabled the network to utilize different kernel sizes in each layer while giving more weight to the more suitable kernel sizes. 
Another feature of this network is the use of intermediate (auxiliary) classifiers during training, which help mitigate the problem of <a class=\"af oj\" href=\"https:\/\/en.wikipedia.org\/wiki\/Vanishing_gradient_problem\" target=\"_blank\" rel=\"noopener ugc nofollow\">vanishing gradients<\/a>.<\/p>\n<h2 id=\"72d4\" class=\"pc mo fo be mp pd pe pf ms pg ph pi mv nr pj pk pl nv pm pn po nz pp pq pr ps bj\" data-selectable-paragraph=\"\"><a class=\"af oj\" href=\"https:\/\/arxiv.org\/abs\/1512.03385\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"al\">ResNet<\/strong><\/a> (Microsoft)<\/h2>\n<p id=\"4796\" class=\"pw-post-body-paragraph nj nk fo be b gm nl nm nn gp no np nq nr ns nt nu nv nw nx ny nz oa ob oc od fh bj\" data-selectable-paragraph=\"\">Using residual layers, this network attempts to solve the problem of depth: the deeper the network, the harder it is to train. By adding shortcut connections that skip one or more layers, a block can easily learn an identity mapping, so in principle a deeper network should perform no worse than one with fewer layers. Using this technique, the authors successfully trained a network 8 times deeper than VGG.<\/p>\n<h2 id=\"e87d\" class=\"pc mo fo be mp pd pe pf ms pg ph pi mv nr pj pk pl nv pm pn po nz pp pq pr ps bj\" data-selectable-paragraph=\"\"><a class=\"af oj\" href=\"https:\/\/arxiv.org\/abs\/1704.04861\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"al\">MobileNet<\/strong><\/a> (Google)<\/h2>\n<p id=\"6834\" class=\"pw-post-body-paragraph nj nk fo be b gm nl nm nn gp no np nq nr ns nt nu nv nw nx ny nz oa ob oc od fh bj\" data-selectable-paragraph=\"\">This work focused on speed: researchers at Google used depthwise separable convolutions, which decreased the number of computations required without significantly affecting the model\u2019s accuracy. 
This technique made <a class=\"af oj\" href=\"https:\/\/heartbeat.comet.ml\/how-to-fit-large-neural-networks-on-the-edge-eb621cdbb33\" target=\"_blank\" rel=\"noopener ugc nofollow\">fitting convolutional neural networks on mobile devices<\/a> much more achievable.<\/p>\n<h1 id=\"2286\" class=\"mn mo fo be mp mq mr go ms mt mu gr mv mw mx my mz na nb nc nd ne nf ng nh ni bj\" data-selectable-paragraph=\"\">Object Localization and Detection<\/h1>\n<p id=\"bc24\" class=\"pw-post-body-paragraph nj nk fo be b gm nl nm nn gp no np nq nr ns nt nu nv nw nx ny nz oa ob oc od fh bj\" data-selectable-paragraph=\"\">Object localization and detection is a computer vision problem in which, given an image, the algorithm has to decide the locations of one or more target objects, outputting bounding boxes for each that appears in the image or video frame.<\/p>\n<p id=\"be51\" class=\"pw-post-body-paragraph nj nk fo be b gm oe nm nn gp of np nq nr og nt nu nv oh nx ny nz oi ob oc od fh bj\" data-selectable-paragraph=\"\">This task is used heavily in different applications such as self-driving cars, robotics, augmented reality, and medical applications.<\/p>\n<figure class=\"mg mh mi mj mk mf ok ol paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ml mm c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:653\/1*52aTV_JJZYdIUXnF-P91Vg.jpeg\" alt=\"\" width=\"653\" height=\"409\"><\/figure><div class=\"ok ol pt\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*52aTV_JJZYdIUXnF-P91Vg.jpeg 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*52aTV_JJZYdIUXnF-P91Vg.jpeg 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*52aTV_JJZYdIUXnF-P91Vg.jpeg 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*52aTV_JJZYdIUXnF-P91Vg.jpeg 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*52aTV_JJZYdIUXnF-P91Vg.jpeg 828w, 
https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*52aTV_JJZYdIUXnF-P91Vg.jpeg 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1306\/format:webp\/1*52aTV_JJZYdIUXnF-P91Vg.jpeg 1306w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 653px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*52aTV_JJZYdIUXnF-P91Vg.jpeg 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*52aTV_JJZYdIUXnF-P91Vg.jpeg 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*52aTV_JJZYdIUXnF-P91Vg.jpeg 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*52aTV_JJZYdIUXnF-P91Vg.jpeg 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*52aTV_JJZYdIUXnF-P91Vg.jpeg 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*52aTV_JJZYdIUXnF-P91Vg.jpeg 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1306\/1*52aTV_JJZYdIUXnF-P91Vg.jpeg 1306w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 653px\" data-testid=\"og\"><\/picture><\/div>\n<figcaption class=\"or os ot ok ol ou ov be b bf z dv\" data-selectable-paragraph=\"\">Example from <a class=\"af oj\" 
href=\"https:\/\/www.researchgate.net\/figure\/Example-of-the-difficult-annotation-Objects-shown-dashed-have-been-marked-difficult_fig2_220659463\" target=\"_blank\" rel=\"noopener ugc nofollow\">PASCAL VOC dataset<\/a><\/figcaption>\n<\/figure>\n<h1 id=\"fdcf\" class=\"mn mo fo be mp mq mr go ms mt mu gr mv mw mx my mz na nb nc nd ne nf ng nh ni bj\" data-selectable-paragraph=\"\">Problem Formulation<\/h1>\n<p id=\"fe46\" class=\"pw-post-body-paragraph nj nk fo be b gm nl nm nn gp no np nq nr ns nt nu nv nw nx ny nz oa ob oc od fh bj\" data-selectable-paragraph=\"\">This problem can be formulated in different ways depending on the architecture being used. Generally, though, the network should output a class for each target object in the image using a sigmoid or a softmax activation function.<\/p>\n<p id=\"d17f\" class=\"pw-post-body-paragraph nj nk fo be b gm oe nm nn gp of np nq nr og nt nu nv oh nx ny nz oi ob oc od fh bj\" data-selectable-paragraph=\"\">To localize the object, the network outputs four variables representing the bounding box, which can be <code class=\"cw ox oy oz pa b\">(x, y, w, h)<\/code>, with <code class=\"cw ox oy oz pa b\">x, y<\/code> representing either the center or the top-left corner of the bounding box, and <code class=\"cw ox oy oz pa b\">w, h<\/code> its width and height.<\/p>\n<p id=\"212b\" class=\"pw-post-body-paragraph nj nk fo be b gm oe nm nn gp of np nq nr og nt nu nv oh nx ny nz oi ob oc od fh bj\" data-selectable-paragraph=\"\">Predicting the bounding box is considered a regression problem. 
Most algorithms also output another variable indicating whether an object exists in the selected area, because outputs are produced for many parts of the image, either through a convolutional implementation or with a sliding window.<\/p>\n<h1 id=\"9db6\" class=\"mn mo fo be mp mq mr go ms mt mu gr mv mw mx my mz na nb nc nd ne nf ng nh ni bj\" data-selectable-paragraph=\"\">Datasets and Benchmarks<\/h1>\n<p id=\"f458\" class=\"pw-post-body-paragraph nj nk fo be b gm nl nm nn gp no np nq nr ns nt nu nv nw nx ny nz oa ob oc od fh bj\" data-selectable-paragraph=\"\">Datasets for object localization and detection require more work than those for image classification, as a bounding box must be annotated around each target object in a given image. Let\u2019s quickly review a few datasets.<\/p>\n<p id=\"0c6b\" class=\"pw-post-body-paragraph nj nk fo be b gm oe nm nn gp of np nq nr og nt nu nv oh nx ny nz oi ob oc od fh bj\" data-selectable-paragraph=\"\"><a class=\"af oj\" href=\"https:\/\/cocodataset.org\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">COCO<\/a>, <a class=\"af oj\" href=\"http:\/\/host.robots.ox.ac.uk\/pascal\/VOC\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">PASCAL VOC<\/a>, and <a class=\"af oj\" href=\"http:\/\/www.image-net.org\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">ImageNet<\/a> are considered the main datasets used for evaluating new object detection architectures. They consist of large numbers of images of general objects like people, animals, cars, planes, etc., with annotations describing bounding boxes and classes for objects in the image. 
These datasets are also used for segmentation, which we\u2019ll discuss in the next part.<\/p>\n<h1 id=\"573b\" class=\"mn mo fo be mp mq mr go ms mt mu gr mv mw mx my mz na nb nc nd ne nf ng nh ni bj\" data-selectable-paragraph=\"\">Significant Models<\/h1>\n<p id=\"387c\" class=\"pw-post-body-paragraph nj nk fo be b gm nl nm nn gp no np nq nr ns nt nu nv nw nx ny nz oa ob oc od fh bj\" data-selectable-paragraph=\"\">Different techniques are used for building models for object detection. Some models rely on extracting region proposals and classifying each region separately, others use regions of interest (ROI) as an input to the model, while other approaches just use a single-shot network to handle the problem.<\/p>\n<h2 id=\"9d0e\" class=\"pc mo fo be mp pd pe pf ms pg ph pi mv nr pj pk pl nv pm pn po nz pp pq pr ps bj\" data-selectable-paragraph=\"\"><strong class=\"al\">Sliding Window (Deformable Parts Models)<\/strong><\/h2>\n<p id=\"6846\" class=\"pw-post-body-paragraph nj nk fo be b gm nl nm nn gp no np nq nr ns nt nu nv nw nx ny nz oa ob oc od fh bj\" data-selectable-paragraph=\"\">Some models have taken a sliding window approach to implement detection, in which a window slides through the image with a certain stride while inserting each block of the image into a classifier. The window is applied to different resolutions of the image to detect objects of different sizes. 
This approach is considered quite slow compared to the following examples, as the classification model is run once for every window position and scale in the image.<\/p>\n<figure class=\"mg mh mi mj mk mf ok ol paragraph-image\">\n<div class=\"on oo eb op bg oq\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ml mm c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*CkyqO-KuYC8HTYdj.png\" alt=\"\" width=\"700\" height=\"229\"><\/figure><div class=\"ok ol pu\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/0*CkyqO-KuYC8HTYdj.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/0*CkyqO-KuYC8HTYdj.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/0*CkyqO-KuYC8HTYdj.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/0*CkyqO-KuYC8HTYdj.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/0*CkyqO-KuYC8HTYdj.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/0*CkyqO-KuYC8HTYdj.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/0*CkyqO-KuYC8HTYdj.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*CkyqO-KuYC8HTYdj.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*CkyqO-KuYC8HTYdj.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*CkyqO-KuYC8HTYdj.png 750w, 
https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*CkyqO-KuYC8HTYdj.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*CkyqO-KuYC8HTYdj.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*CkyqO-KuYC8HTYdj.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*CkyqO-KuYC8HTYdj.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"or os ot ok ol ou ov be b bf z dv\" data-selectable-paragraph=\"\"><a class=\"af oj\" href=\"https:\/\/www.researchgate.net\/figure\/Object-detection-by-sliding-window-approach_fig1_266215670\" target=\"_blank\" rel=\"noopener ugc nofollow\">Explaining Sliding Window approach<\/a><\/figcaption>\n<\/figure>\n<h2 id=\"3f53\" class=\"pc mo fo be mp pd pe pf ms pg ph pi mv nr pj pk pl nv pm pn po nz pp pq pr ps bj\" data-selectable-paragraph=\"\"><a class=\"af oj\" href=\"https:\/\/en.wikipedia.org\/wiki\/Region_Based_Convolutional_Neural_Networks\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"al\">R-CNN<\/strong><\/a><\/h2>\n<p id=\"413f\" class=\"pw-post-body-paragraph nj nk fo be b gm nl nm nn gp no np nq nr ns nt nu nv nw nx ny nz oa ob oc od fh bj\" data-selectable-paragraph=\"\">The R-CNN approach relies on extracting region proposals via selective search. Each region is then warped to a fixed size and forwarded to a CNN pre-trained on ImageNet for feature extraction. Finally, the extracted features are forwarded to an SVM to classify each region. 
This approach has proved more accurate than sliding window approaches, but inference takes longer to process given the network\u2019s separate stages.<\/p>\n<h2 id=\"56ae\" class=\"pc mo fo be mp pd pe pf ms pg ph pi mv nr pj pk pl nv pm pn po nz pp pq pr ps bj\" data-selectable-paragraph=\"\"><strong class=\"al\">R-CNN Improvements<\/strong><\/h2>\n<p id=\"2d4b\" class=\"pw-post-body-paragraph nj nk fo be b gm nl nm nn gp no np nq nr ns nt nu nv nw nx ny nz oa ob oc od fh bj\" data-selectable-paragraph=\"\">Other architectures based on region proposals have followed the R-CNN approach. Fast R-CNN has better performance and speed relative to R-CNN, as it has merged feature extraction and classification into the same CNN. So the network has an image and multiple ROIs as input and it outputs a prediction for both the class and the bounding box for each ROI.<\/p>\n<p id=\"b255\" class=\"pw-post-body-paragraph nj nk fo be b gm oe nm nn gp of np nq nr og nt nu nv oh nx ny nz oi ob oc od fh bj\" data-selectable-paragraph=\"\">Faster R-CNN took it a step further by extracting ROIs through the network, which improved the accuracy and speed again. The reason behind this is that the network has more freedom to solve the problem. The weights can be updated throughout the network using end-to-end training, so the network has full control over ROIs and feature extraction.<\/p>\n<h2 id=\"10d2\" class=\"pc mo fo be mp pd pe pf ms pg ph pi mv nr pj pk pl nv pm pn po nz pp pq pr ps bj\" data-selectable-paragraph=\"\"><a class=\"af oj\" href=\"https:\/\/arxiv.org\/abs\/1506.02640\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"al\">YOLO<\/strong><\/a><\/h2>\n<p id=\"b91b\" class=\"pw-post-body-paragraph nj nk fo be b gm nl nm nn gp no np nq nr ns nt nu nv nw nx ny nz oa ob oc od fh bj\" data-selectable-paragraph=\"\">You only look once (YOLO) is a single-shot network that\u2019s designed for optimal performance. 
It applies a CNN version of the sliding window approach by dividing the input image into an <code class=\"cw ox oy oz pa b\">s*s<\/code> grid: the network reduces the image\u2019s spatial size to <code class=\"cw ox oy oz pa b\">s*s<\/code> cells of equal depth, where each cell corresponds to one grid region of the original image.<\/p>\n<p id=\"02c0\" class=\"pw-post-body-paragraph nj nk fo be b gm oe nm nn gp of np nq nr og nt nu nv oh nx ny nz oi ob oc od fh bj\" data-selectable-paragraph=\"\">For each cell, there are <code class=\"cw ox oy oz pa b\">B*5 + C<\/code> values: <code class=\"cw ox oy oz pa b\">B<\/code> bounding box predictions (four coordinates plus a confidence score each) and an array of length <code class=\"cw ox oy oz pa b\">C<\/code> holding the class probabilities. Because of this design, the network is limited in the number of nearby objects it can predict, since each cell of the <code class=\"cw ox oy oz pa b\">s*s<\/code> grid can predict only one class.<\/p>\n<h2 id=\"e062\" class=\"pc mo fo be mp pd pe pf ms pg ph pi mv nr pj pk pl nv pm pn po nz pp pq pr ps bj\" data-selectable-paragraph=\"\"><a class=\"af oj\" href=\"https:\/\/arxiv.org\/abs\/1512.02325\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"al\">SSD<\/strong><\/a><\/h2>\n<p id=\"fd02\" class=\"pw-post-body-paragraph nj nk fo be b gm nl nm nn gp no np nq nr ns nt nu nv nw nx ny nz oa ob oc od fh bj\" data-selectable-paragraph=\"\">The single-shot multi-box detector (SSD) can be both faster and more accurate than YOLO. It uses features from different levels of the network, which helps detect objects of different sizes. 
Adding to that, the fast implementation of non-maximum suppression is essential for the network, as it outputs a large number of boxes for each image.<\/p>\n<h1 id=\"1fb2\" class=\"mn mo fo be mp mq mr go ms mt mu gr mv mw mx my mz na nb nc nd ne nf ng nh ni bj\" data-selectable-paragraph=\"\">Segmentation<\/h1>\n<p id=\"8275\" class=\"pw-post-body-paragraph nj nk fo be b gm nl nm nn gp no np nq nr ns nt nu nv nw nx ny nz oa ob oc od fh bj\" data-selectable-paragraph=\"\">Image segmentation is used in various applications (medical, robotics, satellite imagery analysis, etc.) to not only understand the locations of objects in images and video frames, but to more precisely map the boundaries between different objects in the same image.<\/p>\n<figure class=\"mg mh mi mj mk mf ok ol paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ml mm c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:535\/1*V3YHZCm4EqI8E4E4Qbk-mQ.jpeg\" alt=\"\" width=\"535\" height=\"572\"><\/figure><div class=\"ok ol pv\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*V3YHZCm4EqI8E4E4Qbk-mQ.jpeg 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*V3YHZCm4EqI8E4E4Qbk-mQ.jpeg 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*V3YHZCm4EqI8E4E4Qbk-mQ.jpeg 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*V3YHZCm4EqI8E4E4Qbk-mQ.jpeg 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*V3YHZCm4EqI8E4E4Qbk-mQ.jpeg 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*V3YHZCm4EqI8E4E4Qbk-mQ.jpeg 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1070\/format:webp\/1*V3YHZCm4EqI8E4E4Qbk-mQ.jpeg 1070w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) 
and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 535px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*V3YHZCm4EqI8E4E4Qbk-mQ.jpeg 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*V3YHZCm4EqI8E4E4Qbk-mQ.jpeg 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*V3YHZCm4EqI8E4E4Qbk-mQ.jpeg 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*V3YHZCm4EqI8E4E4Qbk-mQ.jpeg 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*V3YHZCm4EqI8E4E4Qbk-mQ.jpeg 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*V3YHZCm4EqI8E4E4Qbk-mQ.jpeg 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1070\/1*V3YHZCm4EqI8E4E4Qbk-mQ.jpeg 1070w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 535px\" data-testid=\"og\"><\/picture><\/div>\n<figcaption class=\"or os ot ok ol ou ov be b bf z dv\" data-selectable-paragraph=\"\">Examples from <a class=\"af oj\" href=\"https:\/\/www.researchgate.net\/figure\/Example-images-with-image-segmentation-from-the-COCO-dataset-19_fig3_328953083\" target=\"_blank\" rel=\"noopener ugc nofollow\">COCO dataset<\/a><\/figcaption>\n<\/figure>\n<h1 id=\"ad34\" class=\"mn mo fo be mp mq mr go ms mt mu gr mv mw mx my mz na nb nc nd ne nf ng nh ni bj\" data-selectable-paragraph=\"\">Problem Formulation<\/h1>\n<p id=\"5b0c\" class=\"pw-post-body-paragraph nj nk fo be b gm nl nm nn gp no 
np nq nr ns nt nu nv nw nx ny nz oa ob oc od fh bj\" data-selectable-paragraph=\"\">Image segmentation can be divided into two categories: semantic segmentation and instance segmentation, both of which require pixel-level labels.<\/p>\n<p id=\"7e78\" class=\"pw-post-body-paragraph nj nk fo be b gm oe nm nn gp of np nq nr og nt nu nv oh nx ny nz oi ob oc od fh bj\" data-selectable-paragraph=\"\">For semantic segmentation, every pixel that belongs to a given class must be labeled with the same value. Meanwhile, instance segmentation requires separating different instances of the same class by assigning their pixels different values. Some approaches handle the occlusion of objects so that the occluded part of the object is also represented.<\/p>\n<h1 id=\"976f\" class=\"mn mo fo be mp mq mr go ms mt mu gr mv mw mx my mz na nb nc nd ne nf ng nh ni bj\" data-selectable-paragraph=\"\">Datasets and Benchmarks<\/h1>\n<p id=\"483f\" class=\"pw-post-body-paragraph nj nk fo be b gm nl nm nn gp no np nq nr ns nt nu nv nw nx ny nz oa ob oc od fh bj\" data-selectable-paragraph=\"\">Needless to say, segmentation datasets are among the most time-consuming datasets to build. Again, the <a class=\"af oj\" href=\"https:\/\/cocodataset.org\/#home\" target=\"_blank\" rel=\"noopener ugc nofollow\">COCO<\/a> and <a class=\"af oj\" href=\"http:\/\/host.robots.ox.ac.uk\/pascal\/VOC\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">PASCAL<\/a> datasets are two of the largest datasets for segmenting general objects, as mentioned earlier.<\/p>\n<p id=\"b692\" class=\"pw-post-body-paragraph nj nk fo be b gm oe nm nn gp of np nq nr og nt nu nv oh nx ny nz oi ob oc od fh bj\" data-selectable-paragraph=\"\">Other datasets are built for more specific applications. 
For medical applications, there are many datasets such as <a class=\"af oj\" href=\"http:\/\/braintumorsegmentation.org\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">BraTS<\/a> and <a class=\"af oj\" href=\"https:\/\/competitions.codalab.org\/competitions\/17094\" target=\"_blank\" rel=\"noopener ugc nofollow\">LiTS<\/a>, which target tasks like tumor segmentation in different parts of the body and for different types of diseases. Datasets like <a class=\"af oj\" href=\"https:\/\/spacenetchallenge.github.io\/datasets\/datasetHomePage.html\" target=\"_blank\" rel=\"noopener ugc nofollow\">SpaceNet<\/a> and the <a class=\"af oj\" href=\"https:\/\/www.agriculture-vision.com\/dataset\" target=\"_blank\" rel=\"noopener ugc nofollow\">Agriculture-Vision<\/a> Database consist of satellite images, which are used in a variety of applications to label large-scale features like streets, buildings, and water bodies.<\/p>\n<h1 id=\"21c0\" class=\"mn mo fo be mp mq mr go ms mt mu gr mv mw mx my mz na nb nc nd ne nf ng nh ni bj\" data-selectable-paragraph=\"\">Significant Models<\/h1>\n<p id=\"a27d\" class=\"pw-post-body-paragraph nj nk fo be b gm nl nm nn gp no np nq nr ns nt nu nv nw nx ny nz oa ob oc od fh bj\" data-selectable-paragraph=\"\">For models handling segmentation problems, a different kind of ConvLayer is needed in addition to the traditional one: <mark class=\"ado adp ao\">transpose convolution (also called deconvolution)<\/mark>. 
Transpose convolution can output feature maps with a larger spatial size than its input, which segmentation needs because the network infers the segmented image from features with a smaller spatial size.<\/p>\n<h2 id=\"3052\" class=\"pc mo fo be mp pd pe pf ms pg ph pi mv nr pj pk pl nv pm pn po nz pp pq pr ps bj\" data-selectable-paragraph=\"\"><a class=\"af oj\" href=\"https:\/\/arxiv.org\/abs\/1505.04597\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"al\">U-Net<\/strong><\/a><\/h2>\n<p id=\"ec05\" class=\"pw-post-body-paragraph nj nk fo be b gm nl nm nn gp no np nq nr ns nt nu nv nw nx ny nz oa ob oc od fh bj\" data-selectable-paragraph=\"\">The U-Net architecture is a fully convolutional network: it first applies ConvLayers that reduce the spatial size and increase the depth, and then does the reverse using transpose ConvLayers. An important detail that affects the accuracy of the network is the skip connections between its early layers and the later layers with the same spatial size, which restore information lost when the spatial size is reduced.<\/p>\n<figure class=\"mg mh mi mj mk mf ok ol paragraph-image\">\n<div class=\"on oo eb op bg oq\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ml mm c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*EaY7ImL1gy-KFWdBIfYi8g.png\" alt=\"\" width=\"700\" height=\"458\"><\/figure><div class=\"ok ol pw\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*EaY7ImL1gy-KFWdBIfYi8g.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*EaY7ImL1gy-KFWdBIfYi8g.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*EaY7ImL1gy-KFWdBIfYi8g.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*EaY7ImL1gy-KFWdBIfYi8g.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*EaY7ImL1gy-KFWdBIfYi8g.png 828w, 
https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*EaY7ImL1gy-KFWdBIfYi8g.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*EaY7ImL1gy-KFWdBIfYi8g.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*EaY7ImL1gy-KFWdBIfYi8g.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*EaY7ImL1gy-KFWdBIfYi8g.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*EaY7ImL1gy-KFWdBIfYi8g.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*EaY7ImL1gy-KFWdBIfYi8g.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*EaY7ImL1gy-KFWdBIfYi8g.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*EaY7ImL1gy-KFWdBIfYi8g.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*EaY7ImL1gy-KFWdBIfYi8g.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"or os ot ok ol ou ov be b bf z dv\" data-selectable-paragraph=\"\"><a class=\"af oj\" 
href=\"https:\/\/www.researchgate.net\/publication\/305193694_U-Net_Convolutional_Networks_for_Biomedical_Image_Segmentation\" target=\"_blank\" rel=\"noopener ugc nofollow\">U-Net architecture<\/a><\/figcaption>\n<\/figure>\n<h2 id=\"fb07\" class=\"pc mo fo be mp pd pe pf ms pg ph pi mv nr pj pk pl nv pm pn po nz pp pq pr ps bj\" data-selectable-paragraph=\"\"><a class=\"af oj\" href=\"https:\/\/arxiv.org\/abs\/1703.06870#:~:text=The%20method%2C%20called%20Mask%20R,CNN%2C%20running%20at%205%20fps.\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"al\">Mask R-CNN<\/strong><\/a><\/h2>\n<p id=\"e464\" class=\"pw-post-body-paragraph nj nk fo be b gm nl nm nn gp no np nq nr ns nt nu nv nw nx ny nz oa ob oc od fh bj\" data-selectable-paragraph=\"\">This network is an extension of Faster R-CNN, adding an extra branch predicting an object mask parallel to the bounding box prediction branch. In addition to this network being the state of the art for segmentation, it has been extended to perform several different tasks such as human pose estimation and rigid object pose estimation.<\/p>\n<figure class=\"mg mh mi mj mk mf ok ol paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ml mm c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:274\/1*0DkaX5fQUNIYcCRMTAkDAw.png\" alt=\"\" width=\"274\" height=\"184\"><\/figure><div class=\"ok ol px\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*0DkaX5fQUNIYcCRMTAkDAw.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*0DkaX5fQUNIYcCRMTAkDAw.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*0DkaX5fQUNIYcCRMTAkDAw.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*0DkaX5fQUNIYcCRMTAkDAw.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*0DkaX5fQUNIYcCRMTAkDAw.png 828w, 
https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*0DkaX5fQUNIYcCRMTAkDAw.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:548\/format:webp\/1*0DkaX5fQUNIYcCRMTAkDAw.png 548w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 274px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*0DkaX5fQUNIYcCRMTAkDAw.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*0DkaX5fQUNIYcCRMTAkDAw.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*0DkaX5fQUNIYcCRMTAkDAw.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*0DkaX5fQUNIYcCRMTAkDAw.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*0DkaX5fQUNIYcCRMTAkDAw.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*0DkaX5fQUNIYcCRMTAkDAw.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:548\/1*0DkaX5fQUNIYcCRMTAkDAw.png 548w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 274px\" data-testid=\"og\"><\/picture><\/div>\n<figcaption class=\"or os ot ok ol ou ov be b bf z dv\" data-selectable-paragraph=\"\"><a class=\"af oj\" 
href=\"https:\/\/www.researchgate.net\/publication\/315454672_Mask_R-CNN\" target=\"_blank\" rel=\"noopener ugc nofollow\">Mask R-CNN architecture<\/a><\/figcaption>\n<\/figure>\n<h1 id=\"5116\" class=\"mn mo fo be mp mq mr go ms mt mu gr mv mw mx my mz na nb nc nd ne nf ng nh ni bj\" data-selectable-paragraph=\"\">More Advanced Tasks<\/h1>\n<p id=\"7a22\" class=\"pw-post-body-paragraph nj nk fo be b gm nl nm nn gp no np nq nr ns nt nu nv nw nx ny nz oa ob oc od fh bj\" data-selectable-paragraph=\"\">Computer vision can also tackle different tasks, which are a bit out of the scope of this article; however, we\u2019ll offer simple descriptions for some examples of other tasks.<\/p>\n<h2 id=\"737f\" class=\"pc mo fo be mp pd pe pf ms pg ph pi mv nr pj pk pl nv pm pn po nz pp pq pr ps bj\" data-selectable-paragraph=\"\">Data Generation<\/h2>\n<p id=\"ebcf\" class=\"pw-post-body-paragraph nj nk fo be b gm nl nm nn gp no np nq nr ns nt nu nv nw nx ny nz oa ob oc od fh bj\" data-selectable-paragraph=\"\">By learning the distribution of a dataset using approaches like <a class=\"af oj\" href=\"https:\/\/heartbeat.fritz.ai\/introduction-to-generative-adversarial-networks-gans-35ef44f21193\" target=\"_blank\" rel=\"noopener ugc nofollow\">GANs<\/a>, we can generate new images that look real and can be used in new datasets. 
For example, by going to <a href=\"https:\/\/thispersondoesnotexist.com\/\">this website<\/a>, you will see a new picture of a person that looks real, but has just been generated by a CNN.<\/p>\n<h2 id=\"e7d2\" class=\"pc mo fo be mp pd pe pf ms pg ph pi mv nr pj pk pl nv pm pn po nz pp pq pr ps bj\" data-selectable-paragraph=\"\">Domain Adaptation<\/h2>\n<p id=\"ce41\" class=\"pw-post-body-paragraph nj nk fo be b gm nl nm nn gp no np nq nr ns nt nu nv nw nx ny nz oa ob oc od fh bj\" data-selectable-paragraph=\"\">Approaches like GANs and VAEs can be used to transform images from a source domain (street view in summer) to a target domain (street view in winter), which helps networks generalize to new conditions without annotating new data. This is also used for cool applications like deepfakes.<\/p>\n<h1 id=\"8797\" class=\"mn mo fo be mp mq mr go ms mt mu gr mv mw mx my mz na nb nc nd ne nf ng nh ni bj\" data-selectable-paragraph=\"\">Neural Style Transfer<\/h1>\n<p id=\"a203\" class=\"pw-post-body-paragraph nj nk fo be b gm nl nm nn gp no np nq nr ns nt nu nv nw nx ny nz oa ob oc od fh bj\" data-selectable-paragraph=\"\">Another cool application of CNNs is neural style transfer, which takes a content image and a style image and outputs an image with the same content as the content image, but in the style of the style image. 
Using this, we can transform normal pictures into versions that appear to have been created by Van Gogh or Picasso.<\/p>\n<figure class=\"mg mh mi mj mk mf ok ol paragraph-image\">\n<div class=\"on oo eb op bg oq\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg ml mm c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*Ld5_hOPgcLQOn5ICcnknlw.jpeg\" alt=\"\" width=\"700\" height=\"329\"><\/figure><div class=\"ok ol pu\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*Ld5_hOPgcLQOn5ICcnknlw.jpeg 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*Ld5_hOPgcLQOn5ICcnknlw.jpeg 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*Ld5_hOPgcLQOn5ICcnknlw.jpeg 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*Ld5_hOPgcLQOn5ICcnknlw.jpeg 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*Ld5_hOPgcLQOn5ICcnknlw.jpeg 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*Ld5_hOPgcLQOn5ICcnknlw.jpeg 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*Ld5_hOPgcLQOn5ICcnknlw.jpeg 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*Ld5_hOPgcLQOn5ICcnknlw.jpeg 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*Ld5_hOPgcLQOn5ICcnknlw.jpeg 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*Ld5_hOPgcLQOn5ICcnknlw.jpeg 750w, 
https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*Ld5_hOPgcLQOn5ICcnknlw.jpeg 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*Ld5_hOPgcLQOn5ICcnknlw.jpeg 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*Ld5_hOPgcLQOn5ICcnknlw.jpeg 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*Ld5_hOPgcLQOn5ICcnknlw.jpeg 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"or os ot ok ol ou ov be b bf z dv\" data-selectable-paragraph=\"\"><a class=\"af oj\" href=\"https:\/\/www.researchgate.net\/figure\/Two-examples-of-image-style-transfer-generated-using-the-neural-style-algorithm-of-Gatys_fig1_330828467\" target=\"_blank\" rel=\"noopener ugc nofollow\">Two examples of image style transfer generated using the neural style algorithm of Gatys<\/a><\/figcaption>\n<\/figure>\n<h1 id=\"3190\" class=\"mn mo fo be mp mq mr go ms mt mu gr mv mw mx my mz na nb nc nd ne nf ng nh ni bj\" data-selectable-paragraph=\"\">Conclusion<\/h1>\n<p id=\"520b\" class=\"pw-post-body-paragraph nj nk fo be b gm nl nm nn gp no np nq nr ns nt nu nv nw nx ny nz oa ob oc od fh bj\" data-selectable-paragraph=\"\">In this survey of today\u2019s most essential machine learning-based computer vision techniques, we gave simple, brief explanations of each topic to give the reader an overview of the range of possibilities, which can be a good way to start your journey with the 
field.<\/p>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Introduction Machine learning has expanded computers\u2019 ability to understand images and extract different information from visual data. In this article, different computer vision tasks will be presented alongside explanations for how each has been tackled using machine learning. A lot of machine learning research has been done in the field of computer vision throughout the [&hellip;]<\/p>\n","protected":false},"author":55,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","footnotes":""},"categories":[6],"tags":[],"coauthors":[158],"class_list":["post-6871","post","type-post","status-publish","format-standard","hentry","category-machine-learning"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>6 Significant Computer Vision Problems Solved by ML - Comet<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/6-significant-computer-vision-problems-solved-by-ml\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"6 Significant Computer Vision Problems Solved by ML\" \/>\n<meta property=\"og:description\" content=\"Introduction Machine learning has expanded computers\u2019 ability to understand images and extract different information from visual data. In this article, different computer vision tasks will be presented alongside explanations for how each has been tackled using machine learning. 
A lot of machine learning research has been done in the field of computer vision throughout the [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/6-significant-computer-vision-problems-solved-by-ml\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2023-07-19T17:51:54+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-04-24T17:15:11+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/miro.medium.com\/v2\/resize:fit:2572\/1*0RspkFABiHzobtJlX1X9aw.jpeg\" \/>\n<meta name=\"author\" content=\"Mohamed Maher\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Cometml\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Mohamed Maher\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"13 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"6 Significant Computer Vision Problems Solved by ML - Comet","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/6-significant-computer-vision-problems-solved-by-ml\/","og_locale":"en_US","og_type":"article","og_title":"6 Significant Computer Vision Problems Solved by ML","og_description":"Introduction Machine learning has expanded computers\u2019 ability to understand images and extract different information from visual data. In this article, different computer vision tasks will be presented alongside explanations for how each has been tackled using machine learning. 
A lot of machine learning research has been done in the field of computer vision throughout the [&hellip;]","og_url":"https:\/\/www.comet.com\/site\/blog\/6-significant-computer-vision-problems-solved-by-ml\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2023-07-19T17:51:54+00:00","article_modified_time":"2025-04-24T17:15:11+00:00","og_image":[{"url":"https:\/\/miro.medium.com\/v2\/resize:fit:2572\/1*0RspkFABiHzobtJlX1X9aw.jpeg","type":"","width":"","height":""}],"author":"Mohamed Maher","twitter_card":"summary_large_image","twitter_creator":"@Cometml","twitter_site":"@Cometml","twitter_misc":{"Written by":"Mohamed Maher","Est. reading time":"13 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/6-significant-computer-vision-problems-solved-by-ml\/#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/6-significant-computer-vision-problems-solved-by-ml\/"},"author":{"name":"Mohamed Maher","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/ea19bf83a4d1d2f02261bdfe5bd08dd4"},"headline":"6 Significant Computer Vision Problems Solved by ML","datePublished":"2023-07-19T17:51:54+00:00","dateModified":"2025-04-24T17:15:11+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/6-significant-computer-vision-problems-solved-by-ml\/"},"wordCount":2341,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/6-significant-computer-vision-problems-solved-by-ml\/#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:2572\/1*0RspkFABiHzobtJlX1X9aw.jpeg","articleSection":["Machine 
Learning"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/6-significant-computer-vision-problems-solved-by-ml\/","url":"https:\/\/www.comet.com\/site\/blog\/6-significant-computer-vision-problems-solved-by-ml\/","name":"6 Significant Computer Vision Problems Solved by ML - Comet","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/6-significant-computer-vision-problems-solved-by-ml\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/6-significant-computer-vision-problems-solved-by-ml\/#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:2572\/1*0RspkFABiHzobtJlX1X9aw.jpeg","datePublished":"2023-07-19T17:51:54+00:00","dateModified":"2025-04-24T17:15:11+00:00","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/6-significant-computer-vision-problems-solved-by-ml\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/6-significant-computer-vision-problems-solved-by-ml\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/6-significant-computer-vision-problems-solved-by-ml\/#primaryimage","url":"https:\/\/miro.medium.com\/v2\/resize:fit:2572\/1*0RspkFABiHzobtJlX1X9aw.jpeg","contentUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:2572\/1*0RspkFABiHzobtJlX1X9aw.jpeg"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/6-significant-computer-vision-problems-solved-by-ml\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"6 Significant Computer Vision Problems Solved by ML"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models 
Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/ea19bf83a4d1d2f02261bdfe5bd08dd4","name":"Mohamed Maher","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/327a8368116d24e7283e987c1b905be8","url":"https:\/\/secure.gravatar.com\/avatar\/c3b3884370befe07abfaeca11672bf0a2d715a4d8db99a5c70bbd43f59063b5f?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/c3b3884370befe07abfaeca11672bf0a2d715a4d8db99a5c70bbd43f59063b5f?s=96&d=mm&r=g","caption":"Mohamed 
Maher"},"url":"https:\/\/www.comet.com\/site\/blog\/author\/mohamed-mahergmail-com\/"}]}},"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/6871","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/55"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=6871"}],"version-history":[{"count":1,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/6871\/revisions"}],"predecessor-version":[{"id":15602,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/6871\/revisions\/15602"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=6871"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=6871"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=6871"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=6871"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}