{"id":6203,"date":"2023-06-15T08:08:17","date_gmt":"2023-06-15T16:08:17","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=6203"},"modified":"2025-04-24T17:15:24","modified_gmt":"2025-04-24T17:15:24","slug":"the-5-computer-vision-techniques-that-will-change-how-you-see-the-world","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/the-5-computer-vision-techniques-that-will-change-how-you-see-the-world\/","title":{"rendered":"The 5 Computer Vision Techniques That Will Change How You See The World"},"content":{"rendered":"\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/the-5-computer-vision-techniques-that-will-change-how-you-see-the-world\">\n\n\n\n<div class=\"fh fi fj fk fl\">\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<p id=\"7594\" class=\"pw-post-body-paragraph lt lu fo be b lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp fh bj\" data-selectable-paragraph=\"\"><a class=\"af mq\" href=\"https:\/\/heartbeat.comet.ml\/the-5-trends-that-dominated-computer-vision-in-2018-de38fbb9bd86\" target=\"_blank\" rel=\"noopener ugc nofollow\">Computer Vision<\/a> is one of the hottest research fields within Deep Learning at the moment. It sits at the intersection of many academic subjects, such as Computer Science (Graphics, Algorithms, Theory, Systems, Architecture), Mathematics (Information Retrieval, Machine Learning), Engineering (Robotics, Speech, NLP, Image Processing), Physics (Optics), Biology (Neuroscience), and Psychology (Cognitive Science).<\/p>\n<p id=\"aec3\" class=\"pw-post-body-paragraph lt lu fo be b lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp fh bj\" data-selectable-paragraph=\"\">As Computer Vision represents a relative understanding of visual environments and their contexts, many scientists believe the field paves the way towards Artificial General Intelligence due to its cross-domain mastery.<\/p>\n<h2 id=\"a7b3\" class=\"mr ms fo be mt mu mv mw mx my mz na nb md nc nd ne mh nf ng nh ml ni nj nk nl bj\" data-selectable-paragraph=\"\">So <strong class=\"al\">what is Computer Vision?<\/strong><\/h2>\n<p id=\"854e\" class=\"pw-post-body-paragraph lt lu fo be b lv nm lx ly lz nn mb mc md no mf mg mh np mj mk ml nq mn mo mp fh bj\" data-selectable-paragraph=\"\">Here are a couple of formal textbook definitions:<\/p>\n<ul class=\"\">\n<li id=\"d661\" class=\"lt lu fo be b lv lw lx ly lz ma mb mc nr me mf mg ns mi mj mk nt mm mn mo mp nu nv nw bj\" data-selectable-paragraph=\"\">\u201cthe construction of explicit, meaningful descriptions of physical objects from images\u201d (<a class=\"af mq\" href=\"https:\/\/www.amazon.com\/Computer-Vision-Dana-H-Ballard\/dp\/0131653164\" target=\"_blank\" rel=\"noopener ugc nofollow\">Ballard &amp; Brown<\/a>, 1982)<\/li>\n<li id=\"cf11\" class=\"lt lu fo be b lv nx lx ly lz ny mb mc nr nz mf mg ns oa mj mk nt ob mn mo mp nu nv nw bj\" data-selectable-paragraph=\"\">\u201ccomputing properties of the 3D world from one or more digital images\u201d (<a class=\"af mq\" href=\"https:\/\/www.amazon.com\/Introductory-Techniques-3-D-Computer-Vision\/dp\/0132611082\" target=\"_blank\" rel=\"noopener ugc nofollow\">Trucco &amp; Verri<\/a>, 1998)<\/li>\n<li id=\"acca\" class=\"lt lu fo be b lv nx lx ly lz ny mb mc nr nz mf mg ns oa mj mk nt ob mn mo mp nu nv nw bj\" data-selectable-paragraph=\"\">\u201cto make useful decisions about real physical objects and scenes based on sensed images\u201d (<a class=\"af mq\" 
href=\"https:\/\/www.amazon.com\/Computer-Vision-Linda-G-Shapiro\/dp\/0130307963\" target=\"_blank\" rel=\"noopener ugc nofollow\">Sockman &amp; Shapiro<\/a>, 2001)<\/li>\n<\/ul>\n<h2 id=\"825d\" class=\"mr ms fo be mt mu mv mw mx my mz na nb md nc nd ne mh nf ng nh ml ni nj nk nl bj\" data-selectable-paragraph=\"\"><strong class=\"al\">Why study Computer Vision?<\/strong><\/h2>\n<p id=\"4b5a\" class=\"pw-post-body-paragraph lt lu fo be b lv nm lx ly lz nn mb mc md no mf mg mh np mj mk ml nq mn mo mp fh bj\" data-selectable-paragraph=\"\">The most obvious answer is that there\u2019s a fast-growing collection of useful applications derived from this field of study. Here are just a handful of them:<\/p>\n<ul class=\"\">\n<li id=\"dd43\" class=\"lt lu fo be b lv lw lx ly lz ma mb mc nr me mf mg ns mi mj mk nt mm mn mo mp nu nv nw bj\" data-selectable-paragraph=\"\">Face recognition: Snapchat and Facebook use <a class=\"af mq\" href=\"https:\/\/heartbeat.comet.ml\/building-a-real-time-face-detector-in-android-with-ml-kit-f930eb7b36d9\" target=\"_blank\" rel=\"noopener ugc nofollow\">face-detection<\/a>algorithms to apply filters and recognize you in pictures.<\/li>\n<li id=\"bc7a\" class=\"lt lu fo be b lv nx lx ly lz ny mb mc nr nz mf mg ns oa mj mk nt ob mn mo mp nu nv nw bj\" data-selectable-paragraph=\"\">Image retrieval: Google Images uses content-based queries to search relevant images. The algorithms analyze the content in the query image and return results based on best-matched content.<\/li>\n<li id=\"ffb7\" class=\"lt lu fo be b lv nx lx ly lz ny mb mc nr nz mf mg ns oa mj mk nt ob mn mo mp nu nv nw bj\" data-selectable-paragraph=\"\">Gaming and controls: A great commercial product in gaming that uses stereo vision is Microsoft Kinect.<\/li>\n<li id=\"a75f\" class=\"lt lu fo be b lv nx lx ly lz ny mb mc nr nz mf mg ns oa mj mk nt ob mn mo mp nu nv nw bj\" data-selectable-paragraph=\"\">Surveillance: Surveillance cameras are ubiquitous at public locations and are used to detect suspicious behaviors.<\/li>\n<li id=\"73d1\" class=\"lt lu fo be b lv nx lx ly lz ny mb mc nr nz mf mg ns oa mj mk nt ob mn mo mp nu nv nw bj\" data-selectable-paragraph=\"\">Biometrics: Fingerprint, iris and <a class=\"af mq\" href=\"https:\/\/support.apple.com\/en-us\/HT208108\" target=\"_blank\" rel=\"noopener ugc nofollow\">face matching<\/a> remains some common methods in biometric identification.<\/li>\n<li id=\"325c\" class=\"lt lu fo be b lv nx lx ly lz ny mb mc nr nz mf mg ns oa mj mk nt ob mn mo mp nu nv nw bj\" data-selectable-paragraph=\"\">Smart cars: Vision remains the main source of information to detect traffic signs and lights and other visual features.<\/li>\n<\/ul>\n<p id=\"06c6\" class=\"pw-post-body-paragraph lt lu fo be b lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp fh bj\" data-selectable-paragraph=\"\">I recently finished Stanford\u2019s wonderful <a class=\"af mq\" href=\"http:\/\/cs231n.stanford.edu\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">CS231n course<\/a> on using <a class=\"af mq\" href=\"https:\/\/heartbeat.comet.ml\/a-beginners-guide-to-convolutional-neural-networks-cnn-cf26c5ee17ed\" target=\"_blank\" rel=\"noopener ugc nofollow\">Convolutional Neural Networks<\/a> for visual recognition. 
<mark class=\"we wf ao\">Visual recognition tasks such as image classification, <\/mark><mark class=\"we wf ao\"><a class=\"af mq\" href=\"https:\/\/heartbeat.comet.ml\/gentle-guide-on-how-yolo-object-localization-works-with-keras-part-2-65fe59ac12d\" target=\"_blank\" rel=\"noopener ugc nofollow\">localization<\/a><\/mark><mark class=\"we wf ao\">, and <\/mark><mark class=\"we wf ao\"><a class=\"af mq\" href=\"https:\/\/heartbeat.comet.ml\/detecting-objects-in-videos-and-camera-feeds-using-keras-opencv-and-imageai-c869fe1ebcdb\" target=\"_blank\" rel=\"noopener ugc nofollow\">detection<\/a><\/mark><mark class=\"we wf ao\"> are key components of Computer vision.<\/mark><\/p>\n<p id=\"f7b7\" class=\"pw-post-body-paragraph lt lu fo be b lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp fh bj\" data-selectable-paragraph=\"\">Recent developments in neural networks and deep learning approaches have greatly advanced the performance of these state-of-the-art visual recognition systems. The course is a phenomenal resource that taught me the details of deep learning architectures being used in cutting-edge computer vision research. In this article, I want to share the 5 major computer vision techniques I\u2019ve learned as well as major deep learning models and applications using each of them.<\/p>\n<h1 id=\"f4aa\" class=\"oc ms fo be mt od oe of mx og oh oi nb oj ok ol om on oo op oq or os ot ou ov bj\" data-selectable-paragraph=\"\"><strong class=\"al\">1 \u2014 Image Classification<\/strong><\/h1>\n<\/div>\n<\/div>\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg pc pd c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:2000\/1*TaXXuvQ6kBn1nCcLVlhpAA.jpeg\" alt=\"\" width=\"2000\" height=\"653\"><\/figure><div class=\"ow bg\">\n<figure class=\"ox oy oz pa pb ow bg paragraph-image\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*TaXXuvQ6kBn1nCcLVlhpAA.jpeg 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*TaXXuvQ6kBn1nCcLVlhpAA.jpeg 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*TaXXuvQ6kBn1nCcLVlhpAA.jpeg 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*TaXXuvQ6kBn1nCcLVlhpAA.jpeg 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*TaXXuvQ6kBn1nCcLVlhpAA.jpeg 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*TaXXuvQ6kBn1nCcLVlhpAA.jpeg 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:4000\/format:webp\/1*TaXXuvQ6kBn1nCcLVlhpAA.jpeg 4000w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 2000px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*TaXXuvQ6kBn1nCcLVlhpAA.jpeg 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*TaXXuvQ6kBn1nCcLVlhpAA.jpeg 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*TaXXuvQ6kBn1nCcLVlhpAA.jpeg 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*TaXXuvQ6kBn1nCcLVlhpAA.jpeg 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*TaXXuvQ6kBn1nCcLVlhpAA.jpeg 828w, 
https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*TaXXuvQ6kBn1nCcLVlhpAA.jpeg 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:4000\/1*TaXXuvQ6kBn1nCcLVlhpAA.jpeg 4000w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 2000px\" data-testid=\"og\"><\/picture><\/figure>\n<\/div>\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<p id=\"c398\" class=\"pw-post-body-paragraph lt lu fo be b lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp fh bj\" data-selectable-paragraph=\"\">The problem of <a class=\"af mq\" href=\"https:\/\/heartbeat.comet.ml\/basics-of-image-classification-with-pytorch-2f8973c51864\" target=\"_blank\" rel=\"noopener ugc nofollow\">image classification <\/a>goes like this: Given a set of images that are all labeled with a single category, we\u2019re asked to predict these categories for a novel set of test images and measure the accuracy of the predictions. There are a variety of challenges associated with this task, including viewpoint variation, scale variation, intra-class variation, image deformation, image occlusion, illumination conditions, and background clutter.<\/p>\n<p id=\"62b7\" class=\"pw-post-body-paragraph lt lu fo be b lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp fh bj\" data-selectable-paragraph=\"\">How might we go about writing an algorithm that can classify images into distinct categories? Computer Vision researchers have come up with a data-driven approach to solve this. Instead of trying to specify what every one of the image categories of interest look like directly in code, they provide the computer with many examples of each image class and then develop learning algorithms that look at these examples and learn about the visual appearance of each class.<\/p>\n<p id=\"44d4\" class=\"pw-post-body-paragraph lt lu fo be b lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp fh bj\" data-selectable-paragraph=\"\">In other words, they first accumulate a training dataset of labeled images, then feed it to the computer to process the data. 
Given that fact, the complete image classification pipeline can be formalized as follows:<\/p>\n<ul class=\"\">\n<li id=\"c2ad\" class=\"lt lu fo be b lv lw lx ly lz ma mb mc nr me mf mg ns mi mj mk nt mm mn mo mp nu nv nw bj\" data-selectable-paragraph=\"\">Our input is a training dataset that consists of <em class=\"pe\">N<\/em> images, each labeled with one of <em class=\"pe\">K<\/em> different classes.<\/li>\n<li id=\"a619\" class=\"lt lu fo be b lv nx lx ly lz ny mb mc nr nz mf mg ns oa mj mk nt ob mn mo mp nu nv nw bj\" data-selectable-paragraph=\"\">Then, we use this training set to <a class=\"af mq\" href=\"https:\/\/heartbeat.comet.ml\/training-a-core-ml-model-with-turi-create-to-classify-dog-breeds-d10009bd30b6\" target=\"_blank\" rel=\"noopener ugc nofollow\">train a classifier<\/a> to learn what every one of the classes looks like.<\/li>\n<li id=\"3c97\" class=\"lt lu fo be b lv nx lx ly lz ny mb mc nr nz mf mg ns oa mj mk nt ob mn mo mp nu nv nw bj\" data-selectable-paragraph=\"\">In the end, we evaluate the quality of the classifier by asking it to predict labels for a new set of images that it\u2019s never seen before. We\u2019ll then compare the true labels of these images to the ones predicted by the classifier.<\/li>\n<\/ul>\n<p id=\"91cd\" class=\"pw-post-body-paragraph lt lu fo be b lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp fh bj\" data-selectable-paragraph=\"\">The most popular architecture used for image classification is <strong class=\"be pf\">Convolutional Neural Networks (CNNs). <\/strong>A typical use case for CNNs is where you feed the network images and the network classifies the data. CNNs tend to start with an input \u201cscanner\u201d which isn\u2019t intended to parse all the training data at once. For example, to input an image of 100 x 100 pixels, you wouldn\u2019t want a layer with 10,000 nodes.<\/p>\n<p id=\"7544\" class=\"pw-post-body-paragraph lt lu fo be b lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp fh bj\" data-selectable-paragraph=\"\">Rather, you create a scanning input layer of say 10 x 10 which you feed the first 10 x 10 pixels of the image. Once you passed that input, you feed it the next 10 x 10 pixels by moving the scanner one pixel to the right. 
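To make this pipeline concrete, here is a minimal sketch using Keras (assuming TensorFlow 2.x is installed). The random arrays stand in for a real labeled dataset, and the layer sizes are arbitrary illustrative choices:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Placeholder data: N training images, K classes (random stand-ins for a real dataset).
N, K, H, W = 1000, 10, 32, 32
x_train = np.random.rand(N, H, W, 3).astype("float32")
y_train = np.random.randint(0, K, size=N)
x_test = np.random.rand(200, H, W, 3).astype("float32")
y_test = np.random.randint(0, K, size=200)

# A small CNN classifier.
model = keras.Sequential([
    layers.Input(shape=(H, W, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(2),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dense(K, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Train on the labeled set, then evaluate on images the model has never seen.
model.fit(x_train, y_train, epochs=3, batch_size=64, verbose=0)
loss, acc = model.evaluate(x_test, y_test, verbose=0)
print(f"held-out accuracy: {acc:.3f}")
```
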
The most popular architecture for image classification is the **Convolutional Neural Network (CNN)**. A typical use case for CNNs is one where you feed the network images and it classifies the data. CNNs tend to start with an input "scanner" that isn't intended to parse all of the training data at once. For example, to input an image of 100 x 100 pixels, you wouldn't want a layer with 10,000 nodes. Rather, you create a scanning input layer of, say, 10 x 10, and feed it the first 10 x 10 pixels of the image. Once you've passed that input, you feed it the next 10 x 10 pixels by moving the scanner one pixel to the right. This technique is known as **sliding windows**.

![](https://miro.medium.com/v2/resize:fit:700/1*VqRKWmxwIakOSnWPURoCSA.jpeg)

This input data is then fed through convolutional layers instead of normal layers. Each node only concerns itself with close neighboring cells. These convolutional layers also tend to shrink as they become deeper, mostly by easily divisible factors of the input. Besides convolutional layers, these networks often feature [pooling layers](https://www.coursera.org/lecture/convolutional-neural-networks/pooling-layers-hELHk). Pooling is a way to filter out details: a common pooling technique is *max pooling*, where we take, say, a 2 x 2 patch of pixels and pass on the pixel with the highest value.

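Both ideas are easy to see in plain NumPy. A toy sketch, with arbitrary window size, stride, and pooling factor:

```python
import numpy as np

def sliding_windows(image, size=10, stride=1):
    """Yield size x size crops, moving the scanner one stride at a time."""
    h, w = image.shape
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            yield image[y:y + size, x:x + size]

def max_pool(feature_map, k=2):
    """k x k max pooling: keep only the highest value in each patch."""
    h, w = feature_map.shape
    h, w = h - h % k, w - w % k  # drop ragged edges
    patches = feature_map[:h, :w].reshape(h // k, k, w // k, k)
    return patches.max(axis=(1, 3))

image = np.random.rand(100, 100)
crops = list(sliding_windows(image))     # 91 * 91 = 8281 crops of 10 x 10
pooled = max_pool(np.random.rand(8, 8))  # an 8 x 8 map shrinks to 4 x 4
print(len(crops), pooled.shape)
```
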
Most image classification techniques nowadays are trained on [**ImageNet**](http://www.image-net.org/), a dataset of approximately 1.2 million high-resolution training images. Test images are presented with no initial annotation (no segmentation or labels), and algorithms have to produce labelings specifying what objects are present in each image. Some of the best existing computer vision methods were tried on this dataset by leading computer vision groups from Oxford, INRIA, and XRCE. Typically, such computer vision systems used complicated multi-stage pipelines, with the early stages hand-tuned by optimizing a few parameters.

The winner of the 2012 ImageNet competition, [Alex Krizhevsky (NIPS 2012)](http://www.image-net.org/challenges/LSVRC/2012/supervision.pdf), developed a very deep convolutional neural net of the type pioneered by Yann LeCun. Its architecture includes 7 hidden layers, not counting the max pooling layers. The early layers were convolutional, while the last 2 layers were globally connected. The activation functions were rectified linear units in every hidden layer; these train much faster and are more expressive than logistic units. In addition, the network uses competitive normalization (local response normalization) to suppress hidden activities when nearby units have stronger activities, which helps with variations in intensity.

![](https://miro.medium.com/v2/resize:fit:700/1*kQEY8G7mi88orwys7CQLjw.jpeg)

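As a rough illustration of those ingredients, here is an AlexNet-style stack in PyTorch: ReLU activations, local response normalization, interleaved max pooling, and globally connected final layers. The filter counts and sizes are simplified placeholders, not the exact architecture from the paper:

```python
import torch
from torch import nn

# Illustrative AlexNet-style stack: convolutional layers with ReLU and local
# response normalization ("competitive normalization"), interleaved max
# pooling, then globally connected layers. Sizes are placeholders.
net = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),
    nn.LocalResponseNorm(size=5),                # suppress units when neighbors fire strongly
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
    nn.LocalResponseNorm(size=5),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Flatten(),
    nn.LazyLinear(4096), nn.ReLU(),              # globally connected layers
    nn.Linear(4096, 1000),                       # 1000 ImageNet classes
)

logits = net(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 1000])
```
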
In terms of hardware, Krizhevsky used a very efficient implementation of convolutional nets on 2 Nvidia GTX 580 GPUs (over 1,000 fast little cores in total). GPUs are very good at matrix-matrix multiplies and also have very high memory bandwidth. This allowed him to train the network in a week, and it makes it quick to combine results from 10 patches at test time. We can spread a network over many cores if we can communicate the states fast enough. As cores get cheaper and datasets get bigger, big neural nets will improve faster than old-fashioned CV systems. Since AlexNet, multiple new models have used CNNs as their backbone architecture and achieved excellent results on ImageNet: [ZFNet](https://arxiv.org/pdf/1311.2901.pdf) (2013), [GoogLeNet](https://arxiv.org/pdf/1409.4842.pdf) (2014), [VGGNet](https://arxiv.org/pdf/1409.1556.pdf) (2014), [ResNet](https://arxiv.org/pdf/1512.03385.pdf) (2015), [DenseNet](https://arxiv.org/pdf/1608.06993.pdf) (2016), and so on.

## 2. Object Detection

![](https://miro.medium.com/v2/resize:fit:1200/1*QSgvANEHZ99_MMYqAb1eBg.jpeg)

The task of [detecting objects within images](https://heartbeat.comet.ml/detecting-objects-in-videos-and-camera-feeds-using-keras-opencv-and-imageai-c869fe1ebcdb) usually involves outputting bounding boxes and labels for individual objects. This differs from the classification/localization task in that classification and localization are applied to many objects, not just a single dominant one. At the level of an individual candidate region, there are only 2 classes: bounding boxes that contain an object and those that don't. For example, in car detection, you have to detect all the cars in a given image, each with its bounding box.

If we use the sliding window technique the same way we classify and localize images, we need to apply a CNN to many different crops of the image. Because the CNN classifies each crop as object or background, we need to apply it to a huge number of locations and scales, which is very computationally expensive!

![](https://miro.medium.com/v2/resize:fit:700/1*jIjsYKmgH5nykbqQ6Jyfhg.jpeg)

To cope with this, neural network researchers have proposed using **regions** instead: we find "blobby" image regions that are likely to contain objects, which is relatively fast to run.

The first model that kicked things off was [**R-CNN**](https://www.cv-foundation.org/openaccess/content_cvpr_2014/papers/Girshick_Rich_Feature_Hierarchies_2014_CVPR_paper.pdf) (Region-based Convolutional Neural Network). In R-CNN, we first scan the input image for possible objects using an algorithm called Selective Search, generating ~2,000 region proposals. Then we run a CNN on top of each of these region proposals. Finally, we take the output of each CNN and feed it into an SVM to classify the region and a linear regressor to tighten the object's bounding box.

Essentially, we've turned object detection into an image classification problem. However, this comes with some problems: training is slow, a lot of disk space is required, and inference is also slow.

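Schematically, the R-CNN recipe looks like the sketch below. Here `propose_regions`, `cnn_features`, and `svm_score` are stand-ins for Selective Search, a pretrained CNN, and the per-class SVMs; only the data flow is meant to be faithful:

```python
import numpy as np

def propose_regions(image, n=200):
    """Stand-in for Selective Search: random (x0, y0, x1, y1) boxes."""
    h, w = image.shape[:2]
    boxes = []
    for _ in range(n):
        x0, y0 = np.random.randint(0, w - 1), np.random.randint(0, h - 1)
        boxes.append((x0, y0, np.random.randint(x0 + 1, w), np.random.randint(y0 + 1, h)))
    return boxes

def cnn_features(crop):
    """Stand-in for a pretrained CNN feature extractor (e.g. a 4096-d fc7 vector)."""
    return np.random.rand(4096)

def svm_score(features):
    """Stand-in for one class's linear SVM score."""
    return float(features.mean())

image = np.random.rand(300, 400, 3)
scored = []
for box in propose_regions(image):   # ~2,000 proposals in the real pipeline
    x0, y0, x1, y1 = box
    crop = image[y0:y1, x0:x1]       # crop/warp each proposal for the CNN
    scored.append((svm_score(cnn_features(crop)), box))

# Keep the most confident region (the separate box regressor is omitted here).
best_score, best_box = max(scored)
print(best_score, best_box)
```
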
An immediate descendant of R-CNN is [**Fast R-CNN**](https://arxiv.org/pdf/1504.08083.pdf), which improves detection speed through 2 changes: 1) performing feature extraction before proposing regions, thus running only one CNN over the entire image, and 2) replacing the SVM with a softmax layer, thus extending the neural network to make the predictions instead of creating a new model.

![](https://miro.medium.com/v2/resize:fit:700/1*KD6dHKVvLX7fQIk6PM_cKw.jpeg)

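The trick behind change 1) is that each proposal's features are pooled out of the single shared feature map rather than recomputed, a step known as RoI pooling. A crude NumPy version (real implementations handle sub-bin alignment and tiny boxes more carefully):

```python
import numpy as np

def roi_max_pool(feature_map, box, out_size=7):
    """Max-pool one region of a shared feature map down to out_size x out_size.

    feature_map: (H, W, C) array from a single CNN pass over the whole image.
    box: (x0, y0, x1, y1) proposal in feature-map coordinates; assumed to be
         at least out_size pixels in each dimension.
    """
    x0, y0, x1, y1 = box
    region = feature_map[y0:y1, x0:x1]
    ys = np.linspace(0, region.shape[0], out_size + 1, dtype=int)
    xs = np.linspace(0, region.shape[1], out_size + 1, dtype=int)
    out = np.zeros((out_size, out_size, feature_map.shape[2]))
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max(axis=(0, 1))
    return out

fmap = np.random.rand(50, 60, 256)            # one CNN pass over the image
pooled = roi_max_pool(fmap, (5, 10, 33, 40))  # fixed-size features per proposal
print(pooled.shape)                           # (7, 7, 256)
```
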
id=\"c5ce\" class=\"pw-post-body-paragraph lt lu fo be b lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp fh bj\" data-selectable-paragraph=\"\">Fast R-CNN performed much better in terms of speed, because it trains just one CNN for the entire image. However, the selective search algorithm is still taking a lot of time to generate region proposals.<\/p>\n<p id=\"d5cb\" class=\"pw-post-body-paragraph lt lu fo be b lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp fh bj\" data-selectable-paragraph=\"\">Thus comes the invention of <a class=\"af mq\" href=\"https:\/\/arxiv.org\/pdf\/1506.01497.pdf\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"be pf\">Faster R-CNN<\/strong><\/a>, which now is a canonical model for deep learning-based object detection. It replaces the slow selective search algorithm with a fast neural network by inserting a <a class=\"af mq\" href=\"https:\/\/medium.com\/@tanaykarmarkar\/region-proposal-network-rpn-backbone-of-faster-r-cnn-4a744a38d7f9\" rel=\"noopener\"><strong class=\"be pf\">Region Proposal Network<\/strong> <\/a>(RPN) to predict proposals from features. The RPN is used to decide \u201cwhere\u201d to look in order to reduce the computational requirements of the overall inference process. The RPN quickly and efficiently scans every location in order to assess whether further processing needs to be carried out in a given region. It does that by outputting <em class=\"pe\">k<\/em> bounding box proposals each with 2 scores representing the probability of object or not at each location.<\/p>\n<figure class=\"ox oy oz pa pb ow pg ph paragraph-image\">\n<div class=\"pj pk eb pl bg pm\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg pc pd c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*5KOZC81rRy7GsTLzXkNjYg.jpeg\" alt=\"\" width=\"700\" height=\"628\"><\/figure><div class=\"pg ph pq\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*5KOZC81rRy7GsTLzXkNjYg.jpeg 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*5KOZC81rRy7GsTLzXkNjYg.jpeg 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*5KOZC81rRy7GsTLzXkNjYg.jpeg 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*5KOZC81rRy7GsTLzXkNjYg.jpeg 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*5KOZC81rRy7GsTLzXkNjYg.jpeg 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*5KOZC81rRy7GsTLzXkNjYg.jpeg 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*5KOZC81rRy7GsTLzXkNjYg.jpeg 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*5KOZC81rRy7GsTLzXkNjYg.jpeg 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*5KOZC81rRy7GsTLzXkNjYg.jpeg 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*5KOZC81rRy7GsTLzXkNjYg.jpeg 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*5KOZC81rRy7GsTLzXkNjYg.jpeg 786w, 
Once we have our region proposals, we feed them straight into what is essentially a Fast R-CNN: a pooling layer, some fully-connected layers, and finally a softmax classification layer and bounding-box regressor.

Altogether, Faster R-CNN achieved much better speed and higher accuracy. It's worth noting that although later models did a lot to increase detection speed, few managed to outperform Faster R-CNN by a significant margin. In other words, Faster R-CNN may not be the simplest or fastest method for object detection, but it's still one of the best performing.

Major object detection trends in recent years have shifted towards quicker, more efficient detection systems. This is visible in approaches like [You Only Look Once](http://lanl.arxiv.org/pdf/1612.08242v1) (YOLO), [Single Shot MultiBox Detector](http://lanl.arxiv.org/pdf/1512.02325v5) (SSD), and [Region-Based Fully Convolutional Networks](http://lanl.arxiv.org/pdf/1605.06409v2) (R-FCN), which move towards sharing computation across the whole image, thereby avoiding the costly per-region subnetworks of the 3 R-CNN techniques. The main rationale behind these trends is to avoid having separate algorithms focus on their respective subproblems in isolation, as this typically increases training time and can lower network accuracy.

## 3. Object Tracking

![](https://miro.medium.com/v2/resize:fit:1112/1*ecz875HfaF_7S7hPsTc6pA.jpeg)

href=\"https:\/\/www.pyimagesearch.com\/2018\/07\/23\/simple-object-tracking-with-opencv\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">Object Tracking <\/a>refers to the process of following a specific object of interest, or multiple objects, in a given scene. It traditionally has applications in video and real-world interactions where observations are made following an initial object detection. Now, it\u2019s crucial to autonomous driving systems such as self-driving vehicles from companies like Uber and Tesla.<\/p>\n<p id=\"acf7\" class=\"pw-post-body-paragraph lt lu fo be b lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp fh bj\" data-selectable-paragraph=\"\">Object Tracking methods can be divided into 2 categories according to the observation model: generative method and discriminative method. The generative method uses the generative model to describe the apparent characteristics and minimizes the reconstruction error to search the object, such as PCA.<\/p>\n<p id=\"f7e8\" class=\"pw-post-body-paragraph lt lu fo be b lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp fh bj\" data-selectable-paragraph=\"\">The discriminative method can be used to distinguish between the object and the background, its performance is more robust, and it gradually becomes the main method in tracking. The discriminative method is also referred to as Tracking-by-Detection, and deep learning belongs to this category. To achieve tracking-by-detection, we detect candidate objects for all frames and use deep learning to recognize the wanted object from the candidates. There are 2 kinds of basic network models that can be used: <strong class=\"be pf\">stacked auto encoders (SAE)<\/strong> and <strong class=\"be pf\">convolutional neural network (CNN).<\/strong><\/p>\n<p id=\"8604\" class=\"pw-post-body-paragraph lt lu fo be b lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp fh bj\" data-selectable-paragraph=\"\">The most popular deep network for tracking tasks using SAE is <a class=\"af mq\" href=\"https:\/\/papers.nips.cc\/paper\/5192-learning-a-deep-compact-image-representation-for-visual-tracking.pdf\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"be pf\">Deep Learning Tracker<\/strong><\/a><strong class=\"be pf\">,<\/strong> which proposes offline pre-training and online fine-tuning the net. The process works like this:<\/p>\n<ul class=\"\">\n<li id=\"20a8\" class=\"lt lu fo be b lv lw lx ly lz ma mb mc nr me mf mg ns mi mj mk nt mm mn mo mp nu nv nw bj\" data-selectable-paragraph=\"\">Off-line unsupervised pre-train the stacked denoising auto-encoder using large-scale natural image datasets to obtain the general object representation. Stacked denoising auto-encoder can obtain more robust feature expression ability by adding noise in input images and reconstructing the original images.<\/li>\n<li id=\"0010\" class=\"lt lu fo be b lv nx lx ly lz ny mb mc nr nz mf mg ns oa mj mk nt ob mn mo mp nu nv nw bj\" data-selectable-paragraph=\"\">Combine the coding part of the pre-trained network with a classifier to get the classification network, then use the positive and negative samples obtained from the initial frame to fine-tune the network, which can discriminate the current object and background. DLT uses particle filter as the motion model to produce candidate patches of the current frame. 
![](https://miro.medium.com/v2/resize:fit:700/1*z94UOY-jMke-nZJWh3ZoKA.jpeg)

Because of its superiority in image classification and object detection, the CNN has become the mainstream deep model in computer vision, including visual tracking.

Generally speaking, a large-scale CNN can be trained both as a classifier and as a tracker. 2 representative CNN-based tracking algorithms are the [**fully-convolutional network tracker (FCNT)**](https://pdfs.semanticscholar.org/bf94/906f0d7a8ca9da5f6b86e2a476fde1a34dd0.pdf) and the [**multi-domain CNN (MDNet)**](https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Nam_Learning_Multi-Domain_Convolutional_CVPR_2016_paper.pdf).

FCNT analyzes and takes advantage of the feature maps of the [VGG model](http://www.robots.ox.ac.uk/~vgg/research/very_deep/), a network pre-trained on ImageNet, and arrives at the following observations:

- CNN feature maps can be used for localization and tracking.
- Many CNN feature maps are noisy or irrelevant to the task of discriminating a particular object from its background.
- Higher layers capture semantic concepts of object categories, whereas lower layers encode more discriminative features that capture intra-class variation.

Because of these observations, FCNT designs a feature selection network to pick the most relevant feature maps from the conv4-3 and conv5-3 layers of the VGG network. Then, to avoid overfitting on noisy maps, it designs two extra channels (called SNet and GNet) for the feature maps selected from the two layers separately. **GNet** captures the category information of the object, while **SNet** discriminates the object from backgrounds with a similar appearance.

Both networks are initialized with the given bounding box in the first frame to obtain heat maps of the object. For each new frame, a region of interest (ROI) centered at the object's location in the previous frame is cropped and propagated through them. Finally, SNet and GNet produce two heat maps for prediction, and the tracker decides which heat map to use to generate the final tracking result according to whether there are distractors.

The pipeline of FCNT is shown below.

![FCNT pipeline](https://miro.medium.com/v2/resize:fit:700/1*tGDaMhb--A2VODKkL3IyYQ.jpeg)

Unlike FCNT, MDNet uses all the sequences of a video to track movements in them. The approaches above rely on unrelated image data to reduce the training demand for tracking data, an idea that deviates somewhat from tracking itself: an object that belongs to one class in one video can be background in another. MDNet therefore proposes the idea of multi-domain learning, distinguishing object from background in every domain independently,
where a domain denotes a set of videos that contain the same kind of object.

As shown below, MDNet is divided into two parts: the shared layers and K branches of domain-specific layers. Each branch contains a binary classification layer with softmax loss, used to distinguish object from background in its own domain, while the shared layers, common to all domains, ensure a general representation.

![MDNet architecture](https://miro.medium.com/v2/resize:fit:700/1*HL4OhGMRtbHuy7v8depXjg.jpeg)
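The shared-versus-domain-specific split is straightforward to express in code. A minimal sketch with made-up layer sizes (not the MDNet release):

```python
import torch
import torch.nn as nn

class MultiDomainNet(nn.Module):
    def __init__(self, num_domains: int):
        super().__init__()
        # Shared layers: learn a generic target representation across all domains.
        self.shared = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=7, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Domain-specific layers: one 2-way (object vs. background) head per domain.
        self.heads = nn.ModuleList([nn.Linear(64, 2) for _ in range(num_domains)])

    def forward(self, x, domain: int):
        return self.heads[domain](self.shared(x))

net = MultiDomainNet(num_domains=5)
patch = torch.randn(8, 3, 107, 107)   # candidate patches from video domain 3
scores = net(patch, domain=3)          # logits for the softmax loss
loss = nn.CrossEntropyLoss()(scores, torch.ones(8, dtype=torch.long))
```

During training, only the head matching the current video's domain receives gradients alongside the shared backbone, which is what keeps the shared representation general.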
In recent years, deep learning researchers have tried different ways to adapt to the characteristics of the visual tracking task. Many directions have been explored: applying other network models such as the [Recurrent Neural Net](https://heartbeat.comet.ml/detecting-the-language-of-a-persons-name-using-pytorch-rnn-29a9090c20f2) and the [Deep Belief Net](https://codeburst.io/deep-learning-deep-belief-network-fundamentals-d0dcfd80d7d4); designing network structures suited to video processing and end-to-end learning; optimizing the process, structure, and parameters; or even combining deep learning with traditional computer vision methods, or with approaches from other fields such as [Language Processing](https://heartbeat.comet.ml/the-7-nlp-techniques-that-will-change-how-you-communicate-in-the-future-part-i-f0114b2f0497) and Speech Recognition.

## 4 — Semantic Segmentation

![Semantic segmentation example](https://miro.medium.com/v2/resize:fit:1280/1*0V2fYKOROa4nCuj3Mi3DgQ.jpeg)
Central to Computer Vision is the process of [segmentation](https://medium.com/@ghop02/building-an-image-segmentation-app-in-ios-3377eb4a3e7c), which divides whole images into pixel groupings that can then be labelled and classified.

In particular, [Semantic Segmentation](https://heartbeat.comet.ml/a-2019-guide-to-semantic-segmentation-ca8242f5a7fc) tries to semantically understand the role of each pixel in the image (e.g. is it a car, a motorbike, or some other class?). For example, in the picture above, apart from recognizing the person, the road, the cars, and the trees, we also have to delineate the boundaries of each object. Therefore, unlike classification, we need dense pixel-wise predictions from our models.

As with other computer vision tasks, CNNs have had enormous success on segmentation problems. One popular initial approach was patch classification through a sliding window, where each pixel was classified separately using a patch of the image around it. This, however, is computationally very inefficient because we don't reuse the features shared between overlapping patches.

The solution, instead, is UC Berkeley's [**Fully Convolutional Networks (FCN)**](https://arxiv.org/pdf/1411.4038.pdf), which popularized end-to-end CNN architectures for dense predictions without any fully connected layers. This allowed segmentation maps to be generated for images of any size, and it was also much faster than the patch classification approach.
Almost all subsequent approaches to semantic segmentation have adopted this paradigm.

![FCN architecture](https://miro.medium.com/v2/resize:fit:700/1*_k5SCYeFy43b_CFv_zFimQ.jpeg)

However, one problem remains: convolutions at the original image resolution are very expensive. To deal with this, FCN uses downsampling and upsampling inside the network: the downsampling layers are strided convolutions, while the upsampling layers are transposed convolutions.
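A minimal sketch of that downsample-then-upsample pattern (illustrative layer sizes, not the Berkeley FCN itself): strided convolutions shrink the feature map, and a transposed convolution brings the per-class scores back to input resolution.

```python
import torch
import torch.nn as nn

num_classes = 21  # e.g. the 21 PASCAL VOC classes

fcn = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),    # downsample x2
    nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),  # downsample x2
    nn.ReLU(),
    nn.Conv2d(128, num_classes, kernel_size=1),              # per-class scores
    nn.ConvTranspose2d(num_classes, num_classes,             # upsample x4
                       kernel_size=8, stride=4, padding=2),
)

image = torch.randn(1, 3, 224, 224)
logits = fcn(image)          # (1, 21, 224, 224): a dense prediction per pixel
pred = logits.argmax(dim=1)  # (1, 224, 224) class map, same size as the input
```

Because there are no fully connected layers, the same network accepts inputs of any spatial size; the output map simply scales with it.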
Despite the upsampling and downsampling layers, FCN produces coarse segmentation maps because of the information lost during pooling. [SegNet](https://arxiv.org/pdf/1511.00561.pdf) is a more memory-efficient architecture than FCN that uses max-pooling within an encoder-decoder framework. In SegNet, shortcut/skip connections from higher-resolution feature maps are introduced to sharpen the coarse results of upsampling and downsampling.

![SegNet architecture](https://miro.medium.com/v2/resize:fit:700/1*1-ulho5NzNJhq6YNR9KREg.jpeg)
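SegNet's decoder reuses the encoder's max-pooling indices when it upsamples, so activations land back where the maxima came from. PyTorch exposes exactly this mechanism, so a tiny sketch of it (illustrative, not the paper's code) looks like:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2)

x = torch.randn(1, 64, 56, 56)   # an encoder feature map
down, indices = pool(x)          # (1, 64, 28, 28) plus the argmax locations
up = unpool(down, indices)       # (1, 64, 56, 56), maxima restored in place
```

Storing only the pooling indices, rather than full feature maps, is what makes SegNet lighter on memory than FCN-style skip connections.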
id=\"7b6b\" class=\"pw-post-body-paragraph lt lu fo be b lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp fh bj\" data-selectable-paragraph=\"\">Recent research in Semantic Segmentation all relies heavily on fully convolutional networks, such as <a class=\"af mq\" href=\"https:\/\/arxiv.org\/pdf\/1511.07122.pdf\" target=\"_blank\" rel=\"noopener ugc nofollow\">Dilated Convolutions<\/a>, <a class=\"af mq\" href=\"https:\/\/arxiv.org\/pdf\/1412.7062.pdf\" target=\"_blank\" rel=\"noopener ugc nofollow\">DeepLab<\/a>, and <a class=\"af mq\" href=\"https:\/\/arxiv.org\/pdf\/1611.06612.pdf\" target=\"_blank\" rel=\"noopener ugc nofollow\">RefineNet<\/a>.<\/p>\n<h1 id=\"da80\" class=\"oc ms fo be mt od oe of mx og oh oi nb oj ok ol om on oo op oq or os ot ou ov bj\" data-selectable-paragraph=\"\"><strong class=\"al\">5 \u2014 Instance Segmentation<\/strong><\/h1>\n<\/div>\n<\/div>\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg pc pd c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:1048\/1*pDJ1P9Rv-jcas51SZsVt4A.jpeg\" alt=\"\" width=\"1048\" height=\"600\"><\/figure><div class=\"ow bg\">\n<figure class=\"ox oy oz pa pb ow bg paragraph-image\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*pDJ1P9Rv-jcas51SZsVt4A.jpeg 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*pDJ1P9Rv-jcas51SZsVt4A.jpeg 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*pDJ1P9Rv-jcas51SZsVt4A.jpeg 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*pDJ1P9Rv-jcas51SZsVt4A.jpeg 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*pDJ1P9Rv-jcas51SZsVt4A.jpeg 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*pDJ1P9Rv-jcas51SZsVt4A.jpeg 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:2096\/format:webp\/1*pDJ1P9Rv-jcas51SZsVt4A.jpeg 2096w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 1048px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*pDJ1P9Rv-jcas51SZsVt4A.jpeg 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*pDJ1P9Rv-jcas51SZsVt4A.jpeg 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*pDJ1P9Rv-jcas51SZsVt4A.jpeg 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*pDJ1P9Rv-jcas51SZsVt4A.jpeg 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*pDJ1P9Rv-jcas51SZsVt4A.jpeg 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*pDJ1P9Rv-jcas51SZsVt4A.jpeg 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:2096\/1*pDJ1P9Rv-jcas51SZsVt4A.jpeg 2096w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 1048px\" 
data-testid=\"og\"><\/picture><\/figure>\n<\/div>\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<p id=\"7839\" class=\"pw-post-body-paragraph lt lu fo be b lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp fh bj\" data-selectable-paragraph=\"\">Beyond Semantic Segmentation, Instance Segmentation segments different instances of classes, such as labelling 5 cars with 5 different colors. In classification, there\u2019s generally an image with a single object as the focus and the task is to say what that image is. But in order to segment instances, we need to carry out far more complex tasks. We see complicated sights with multiple overlapping objects and different backgrounds, and we not only classify these different objects but also identify their boundaries, differences, and relations to one another!<\/p>\n<figure class=\"ox oy oz pa pb ow pg ph paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg pc pd c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:558\/1*ClYLqVgNwZP_nUON061x_w.jpeg\" alt=\"\" width=\"558\" height=\"422\"><\/figure><div class=\"pg ph qr\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*ClYLqVgNwZP_nUON061x_w.jpeg 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*ClYLqVgNwZP_nUON061x_w.jpeg 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*ClYLqVgNwZP_nUON061x_w.jpeg 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*ClYLqVgNwZP_nUON061x_w.jpeg 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*ClYLqVgNwZP_nUON061x_w.jpeg 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*ClYLqVgNwZP_nUON061x_w.jpeg 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1116\/format:webp\/1*ClYLqVgNwZP_nUON061x_w.jpeg 1116w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 558px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*ClYLqVgNwZP_nUON061x_w.jpeg 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*ClYLqVgNwZP_nUON061x_w.jpeg 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*ClYLqVgNwZP_nUON061x_w.jpeg 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*ClYLqVgNwZP_nUON061x_w.jpeg 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*ClYLqVgNwZP_nUON061x_w.jpeg 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*ClYLqVgNwZP_nUON061x_w.jpeg 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1116\/1*ClYLqVgNwZP_nUON061x_w.jpeg 1116w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 558px\" data-testid=\"og\"><\/picture><\/div>\n<\/figure>\n<p id=\"fb61\" 
class=\"pw-post-body-paragraph lt lu fo be b lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp fh bj\" data-selectable-paragraph=\"\">So far, we\u2019ve seen how to use CNN features in many interesting ways to effectively locate different objects in an image with bounding boxes. Can we extend such techniques to locate exact pixels of each object instead of just bounding boxes? This instance segmentation problem is explored at Facebook AI using an architecture known as <a class=\"af mq\" href=\"https:\/\/arxiv.org\/pdf\/1703.06870.pdf\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"be pf\">Mask R-CNN<\/strong><\/a>.<\/p>\n<p id=\"9678\" class=\"pw-post-body-paragraph lt lu fo be b lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp fh bj\" data-selectable-paragraph=\"\">Much like Fast R-CNN, and Faster R-CNN, Mask R-CNN\u2019s underlying intuition is straightforward Given that Faster R-CNN works so well for object detection, could we extend it to also carry out pixel-level segmentation?<\/p>\n<p id=\"7c24\" class=\"pw-post-body-paragraph lt lu fo be b lv lw lx ly lz ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp fh bj\" data-selectable-paragraph=\"\">Mask R-CNN does this by adding a branch to Faster R-CNN that outputs a binary mask that says whether or not a given pixel is part of an object. The branch is a Fully Convolutional Network on top of a CNN-based feature map. Given the CNN Feature Map as the input, the network outputs a matrix with 1s on all locations where the pixel belongs to the object and 0s elsewhere (this is known as a <a class=\"af mq\" href=\"https:\/\/en.wikipedia.org\/wiki\/Mask_%28computing%29\" target=\"_blank\" rel=\"noopener ugc nofollow\">binary mask<\/a>).<\/p>\n<figure class=\"ox oy oz pa pb ow pg ph paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg pc pd c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:692\/1*QgOk_xUmBM-_MlWSXBK-Dg.jpeg\" alt=\"\" width=\"692\" height=\"300\"><\/figure><div class=\"pg ph qs\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*QgOk_xUmBM-_MlWSXBK-Dg.jpeg 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*QgOk_xUmBM-_MlWSXBK-Dg.jpeg 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*QgOk_xUmBM-_MlWSXBK-Dg.jpeg 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*QgOk_xUmBM-_MlWSXBK-Dg.jpeg 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*QgOk_xUmBM-_MlWSXBK-Dg.jpeg 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*QgOk_xUmBM-_MlWSXBK-Dg.jpeg 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1384\/format:webp\/1*QgOk_xUmBM-_MlWSXBK-Dg.jpeg 1384w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 692px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*QgOk_xUmBM-_MlWSXBK-Dg.jpeg 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*QgOk_xUmBM-_MlWSXBK-Dg.jpeg 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*QgOk_xUmBM-_MlWSXBK-Dg.jpeg 750w, 
Additionally, when run without modifications on the original Faster R-CNN architecture, the regions of the feature map selected by RoIPool (Region of Interest Pooling) were slightly misaligned with the corresponding regions of the original image. Since image segmentation requires pixel-level specificity, unlike bounding boxes, this naturally led to inaccuracies. Mask R-CNN solves the problem with a more precisely aligned replacement for RoIPool, known as **RoIAlign** (Region of Interest Align). Essentially, RoIAlign samples the feature map with bilinear interpolation, avoiding the rounding errors that cause inaccuracies in detection and segmentation.
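torchvision exposes both pooling ops directly, which makes the difference easy to see: `roi_pool` snaps box coordinates to the integer grid, while `roi_align` samples with bilinear interpolation. A tiny illustrative comparison on made-up data:

```python
import torch
from torchvision.ops import roi_pool, roi_align

features = torch.randn(1, 256, 50, 50)  # a CNN feature map
# One box in (batch_index, x1, y1, x2, y2) format, with fractional coordinates.
boxes = torch.tensor([[0.0, 11.3, 14.6, 35.1, 39.8]])

pooled = roi_pool(features, boxes, output_size=(7, 7))    # rounds coordinates
aligned = roi_align(features, boxes, output_size=(7, 7))  # bilinear sampling
print(pooled.shape, aligned.shape)  # both (1, 256, 7, 7), with different values
```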
Once these masks are generated, Mask R-CNN combines them with the classifications and bounding boxes from Faster R-CNN to produce wonderfully precise segmentations:

![Mask R-CNN segmentation results](https://miro.medium.com/v2/resize:fit:1000/1*fbDDJ5z8q5xaZ4BhiQGDIw.jpeg)

## Conclusion

These 5 major computer vision techniques can help a computer extract, analyze, and understand useful information from a single image or a sequence of images. There are many other advanced techniques I haven't touched on, including [style transfer](https://medium.com/@jamesonthecrow/20-minute-masterpiece-4b6043fdfff5), colorization, action recognition, [3D objects](https://heartbeat.comet.ml/3d-face-reconstruction-with-position-map-regression-networks-36f0ac2d3ef1), human pose estimation, and more.

Indeed, the field of Computer Vision is too expansive to cover in depth, and I would encourage you to explore it further, whether through online courses, blog tutorials, or formal documents. I'd highly recommend CS231n for starters, as you'll learn to implement, train, and debug your own neural networks. As a bonus, you can get all the lecture slides and assignment guidelines from [**my GitHub repository**](https://github.com/khanhnamle1994/computer-vision).
I hope it'll guide you in your quest to change how you see the world!

*If you enjoyed this piece, I'd love it if you hit the clap button* 👏 *so others might stumble upon it. You can find my own code on* [*GitHub*](https://github.com/khanhnamle1994)*, and more of my writing and projects at* [*https://jameskle.com/*](https://jameskle.com/)*. You can also follow me on* [*Twitter*](https://twitter.com/@james_aka_yale)*,* [*email me directly*](mailto:khanhle.1013@gmail.com) *or* [*find me on LinkedIn*](http://www.linkedin.com/in/khanhnamle94)*.* [*Sign up for my newsletter*](http://eepurl.com/deWjzb) *to receive my latest thoughts on data science, machine learning, and artificial intelligence right in your inbox!*

**Discuss this post on [Hacker News](https://news.ycombinator.com/item?id=16820833).**