{"id":7465,"date":"2023-09-12T16:11:25","date_gmt":"2023-09-13T00:11:25","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=7465"},"modified":"2025-04-24T17:14:08","modified_gmt":"2025-04-24T17:14:08","slug":"computer-vision-and-deep-learning-from-image-to-video-analysis","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/computer-vision-and-deep-learning-from-image-to-video-analysis\/","title":{"rendered":"Computer Vision and Deep Learning: From Image to Video Analysis"},"content":{"rendered":"\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/computer-vision-and-deep-learning-from-image-to-video-analysis\">\n\n\n\n<div class=\"fh fi fj fk fl\">\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg nt nu c\" style=\"color: var(--wpex-text-2); font-family: var(--wpex-body-font-family, var(--wpex-font-sans)); font-size: var(--wpex-body-font-size, 13px);\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:1000\/1*-ajPIIkeIBzMY-050e96kA.jpeg\" alt=\"\" width=\"1000\" height=\"750\"><\/figure><p id=\"5947\" class=\"lx ly lz be b ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu fh bj\" data-selectable-paragraph=\"\"><\/p>\n<\/div>\n<\/div>\n<div class=\"mw\">\n<div class=\"ab ca\">\n<div class=\"mx my mz na nb nc ce nd cf ne ch bg\">\n<figure class=\"ni nj nk nl nm mw nn no paragraph-image\"><figcaption class=\"nv nw nx nf ng ny nz be b bf z dv\" data-selectable-paragraph=\"\">Photo by <a class=\"af mv\" href=\"https:\/\/unsplash.com\/@dmjdenise?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\" target=\"_blank\" rel=\"noopener ugc nofollow\">Denise Jans<\/a> on <a class=\"af mv\" href=\"https:\/\/unsplash.com\/s\/photos\/film?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\" target=\"_blank\" rel=\"noopener ugc 
nofollow\">Unsplash<\/a><\/figcaption><\/figure>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<p id=\"3f71\" class=\"pw-post-body-paragraph lx ly fo be b ma mb mc md me mf mg mh oa mj mk ml ob mn mo mp oc mr ms mt mu fh bj\" data-selectable-paragraph=\"\">Computer vision, at its core, is about understanding images. The field has seen rapid growth over the last few years, largely thanks to deep learning, which makes it possible to detect obstacles, segment images, or extract relevant context from a given scene.<\/p>\n<p id=\"ad6c\" class=\"pw-post-body-paragraph lx ly fo be b ma mb mc md me mf mg mh oa mj mk ml ob mn mo mp oc mr ms mt mu fh bj\" data-selectable-paragraph=\"\">Using computer vision, we can build autonomous cars, smart buildings, fashion recommender systems, augmented reality tools\u2026the possibilities are endless.<\/p>\n<p id=\"b9ae\" class=\"pw-post-body-paragraph lx ly fo be b ma mb mc md me mf mg mh oa mj mk ml ob mn mo mp oc mr ms mt mu fh bj\" data-selectable-paragraph=\"\">One area in particular is starting to garner more attention: <strong class=\"be od\"><em class=\"lz\">Video<\/em><\/strong>. Most applications of computer vision today center on single images, with fewer focused on sequences of images (i.e., video frames).<\/p>\n<p id=\"bac0\" class=\"pw-post-body-paragraph lx ly fo be b ma mb mc md me mf mg mh oa mj mk ml ob mn mo mp oc mr ms mt mu fh bj\" data-selectable-paragraph=\"\">Video allows for deeper situational understanding, because sequences of images provide new information about <em class=\"lz\">action<\/em>. For example, we can track an obstacle through a sequence of images and understand its behavior to predict its next move. 
We can also track a human pose and use action classification to understand the action being performed.<\/p>\n<p id=\"f24b\" class=\"pw-post-body-paragraph lx ly fo be b ma mb mc md me mf mg mh oa mj mk ml ob mn mo mp oc mr ms mt mu fh bj\" data-selectable-paragraph=\"\">When analyzing videos, we create new use cases and move from \u201cthis image contains 3 people\u201d to \u201cthis image contains 3 people playing X\u201d.<\/p>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"fh fi fj fk fl\">\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<h1 id=\"e7f1\" class=\"pk on fo be oo pl pm pn os po pp pq ow pr ps pt pu pv pw px py pz qa qb qc qd bj\" data-selectable-paragraph=\"\">Video Analysis Algorithms<\/h1>\n<figure class=\"ni nj nk nl nm mw nf ng paragraph-image\">\n<div class=\"np nq eb nr bg ns\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg nt nu c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*gxagNRM2ZKVwUui0\" alt=\"\" width=\"700\" height=\"394\"><\/figure><div class=\"nf ng qe\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*gxagNRM2ZKVwUui0 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*gxagNRM2ZKVwUui0 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*gxagNRM2ZKVwUui0 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*gxagNRM2ZKVwUui0 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*gxagNRM2ZKVwUui0 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*gxagNRM2ZKVwUui0 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*gxagNRM2ZKVwUui0 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, 
(min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*gxagNRM2ZKVwUui0 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*gxagNRM2ZKVwUui0 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*gxagNRM2ZKVwUui0 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*gxagNRM2ZKVwUui0 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*gxagNRM2ZKVwUui0 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*gxagNRM2ZKVwUui0 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*gxagNRM2ZKVwUui0 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"nv nw nx nf ng ny nz be b bf z dv\" data-selectable-paragraph=\"\">Photo by <a class=\"af mv\" href=\"https:\/\/unsplash.com\/@fredasem?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener ugc nofollow\">Fred Kearney<\/a> on <a class=\"af mv\" href=\"https:\/\/unsplash.com\/?utm_source=medium&amp;utm_medium=referral\" target=\"_blank\" rel=\"noopener ugc nofollow\">Unsplash<\/a><\/figcaption>\n<\/figure>\n<h2 id=\"6857\" class=\"om on fo be oo op oq or os ot ou ov ow oa ox oy oz ob pa pb pc oc pd pe pf pg bj\" data-selectable-paragraph=\"\"><strong class=\"al\">Obstacle tracking &amp; video analysis \u2014 An active area of research<\/strong><\/h2>\n<p id=\"151a\" class=\"pw-post-body-paragraph lx ly fo be b ma qf mc md me qg mg mh oa qh mk ml ob qi 
mo mp oc qj ms mt mu fh bj\" data-selectable-paragraph=\"\">Whether for surveillance camera systems or football analysis, the next generation of computer vision algorithms will take time into account.<\/p>\n<p id=\"62fa\" class=\"pw-post-body-paragraph lx ly fo be b ma mb mc md me mf mg mh oa mj mk ml ob mn mo mp oc mr ms mt mu fh bj\" data-selectable-paragraph=\"\">The task of video surveillance involves two kinds of algorithms:<\/p>\n<ul class=\"\">\n<li id=\"586c\" class=\"lx ly fo be b ma mb mc md me mf mg mh oa qk mk ml ob ql mo mp oc qm ms mt mu qn qo qp bj\" data-selectable-paragraph=\"\"><strong class=\"be od\">Object tracking<\/strong><\/li>\n<li id=\"938f\" class=\"lx ly fo be b ma qq mc md me qr mg mh oa qs mk ml ob qt mo mp oc qu ms mt mu qn qo qp bj\" data-selectable-paragraph=\"\"><strong class=\"be od\">Action classification<\/strong><\/li>\n<\/ul>\n<p id=\"6c04\" class=\"lx ly lz be b ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu fh bj\" data-selectable-paragraph=\"\">Let\u2019s have a look at both. By the end of this article, you\u2019ll have a more complete picture of video analysis systems.<\/p>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"fh fi fj fk fl\">\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<h1 id=\"a1d5\" class=\"pk on fo be oo pl pm pn os po pp pq ow pr ps pt pu pv pw px py pz qa qb qc qd bj\" data-selectable-paragraph=\"\">1. Object Tracking<\/h1>\n<p id=\"9061\" class=\"pw-post-body-paragraph lx ly fo be b ma qf mc md me qg mg mh oa qh mk ml ob qi mo mp oc qj ms mt mu fh bj\" data-selectable-paragraph=\"\">A video is an ordered sequence of frames. 
When studying a video, we can work with either <strong class=\"be od\">a video stream<\/strong> (live image feed) or <strong class=\"be od\">a video sequence<\/strong> (fixed-length video).<\/p>\n<ul class=\"\">\n<li id=\"ad86\" class=\"lx ly fo be b ma mb mc md me mf mg mh oa qk mk ml ob ql mo mp oc qm ms mt mu qn qo qp bj\" data-selectable-paragraph=\"\">In a video stream, we consider the current image and the previous ones.<\/li>\n<li id=\"4582\" class=\"lx ly fo be b ma qq mc md me qr mg mh oa qs mk ml ob qt mo mp oc qu ms mt mu qn qo qp bj\" data-selectable-paragraph=\"\">In a video sequence, we have access to the full video, from the first image to the last.<\/li>\n<\/ul>\n<p id=\"1910\" class=\"pw-post-body-paragraph lx ly fo be b ma mb mc md me mf mg mh oa mj mk ml ob mn mo mp oc mr ms mt mu fh bj\" data-selectable-paragraph=\"\">Videos take up a lot of storage space and have usually not been processed by any AI system. This means that, with video, we simply have raw image data to work with.<\/p>\n<p id=\"a6d2\" class=\"pw-post-body-paragraph lx ly fo be b ma mb mc md me mf mg mh oa mj mk ml ob mn mo mp oc mr ms mt mu fh bj\" data-selectable-paragraph=\"\">But there is a key difference between a video and a single image.<strong class=\"be od\"> Specifically, motion. <\/strong>Motion is what sets a video apart from an image. It\u2019s a powerful thing to track and can lead to action understanding, pose estimation, or movement tracking.<\/p>\n<h2 id=\"3826\" class=\"om on fo be oo op oq or os ot ou ov ow oa ox oy oz ob pa pb pc oc pd pe pf pg bj\" data-selectable-paragraph=\"\">Optical Flow<\/h2>\n<p id=\"5252\" class=\"qw qx fo be qy qz ra rb rc rd re mu dv\" data-selectable-paragraph=\"\"><strong class=\"al\">In video analysis, the problem of estimating motion is called optical flow estimation. <\/strong>Optical flow is the apparent shift of each pixel between two consecutive frames. 
This is handled as a correspondence problem, as illustrated in the following image:<\/p>\n<figure class=\"rg rh ri rj rk mw nf ng paragraph-image\">\n<div class=\"np nq eb nr bg ns\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg nt nu c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*HdSZAckkzQchDz9BEkcAlg.png\" alt=\"\" width=\"700\" height=\"398\"><\/figure><div class=\"nf ng rf\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*HdSZAckkzQchDz9BEkcAlg.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*HdSZAckkzQchDz9BEkcAlg.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*HdSZAckkzQchDz9BEkcAlg.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*HdSZAckkzQchDz9BEkcAlg.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*HdSZAckkzQchDz9BEkcAlg.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*HdSZAckkzQchDz9BEkcAlg.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*HdSZAckkzQchDz9BEkcAlg.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*HdSZAckkzQchDz9BEkcAlg.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*HdSZAckkzQchDz9BEkcAlg.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*HdSZAckkzQchDz9BEkcAlg.png 750w, 
https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*HdSZAckkzQchDz9BEkcAlg.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*HdSZAckkzQchDz9BEkcAlg.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*HdSZAckkzQchDz9BEkcAlg.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*HdSZAckkzQchDz9BEkcAlg.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"nv nw nx nf ng ny nz be b bf z dv\" data-selectable-paragraph=\"\">Frame 0 \u2014 Frame 1<\/figcaption>\n<\/figure>\n<p id=\"8112\" class=\"pw-post-body-paragraph lx ly fo be b ma mb mc md me mf mg mh oa mj mk ml ob mn mo mp oc mr ms mt mu fh bj\" data-selectable-paragraph=\"\">The output optical flow is a field of motion vectors, one per pixel, describing the movement between frame 0 and frame 1. 
It looks like this:<\/p>\n<figure class=\"ni nj nk nl nm mw nf ng paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg nt nu c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:480\/1*YuhBcSo1PMqApCNfaEDXxA.gif\" alt=\"\" width=\"480\" height=\"270\"><\/figure><div class=\"nf ng rl\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*YuhBcSo1PMqApCNfaEDXxA.gif 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*YuhBcSo1PMqApCNfaEDXxA.gif 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*YuhBcSo1PMqApCNfaEDXxA.gif 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*YuhBcSo1PMqApCNfaEDXxA.gif 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*YuhBcSo1PMqApCNfaEDXxA.gif 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*YuhBcSo1PMqApCNfaEDXxA.gif 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:960\/1*YuhBcSo1PMqApCNfaEDXxA.gif 960w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 480px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*YuhBcSo1PMqApCNfaEDXxA.gif 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*YuhBcSo1PMqApCNfaEDXxA.gif 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*YuhBcSo1PMqApCNfaEDXxA.gif 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*YuhBcSo1PMqApCNfaEDXxA.gif 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*YuhBcSo1PMqApCNfaEDXxA.gif 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*YuhBcSo1PMqApCNfaEDXxA.gif 1100w, 
https:\/\/miro.medium.com\/v2\/resize:fit:960\/1*YuhBcSo1PMqApCNfaEDXxA.gif 960w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 480px\" data-testid=\"og\"><\/picture><\/div>\n<figcaption class=\"nv nw nx nf ng ny nz be b bf z dv\" data-selectable-paragraph=\"\">(<a class=\"af mv\" href=\"https:\/\/pharrellwang.com\/static\/gif\/flow.gif\" target=\"_blank\" rel=\"noopener ugc nofollow\">source<\/a>)<\/figcaption>\n<\/figure>\n<p id=\"23b0\" class=\"pw-post-body-paragraph lx ly fo be b ma mb mc md me mf mg mh oa mj mk ml ob mn mo mp oc mr ms mt mu fh bj\" data-selectable-paragraph=\"\">A lot of existing datasets address the optical flow problem, such as <a class=\"af mv\" href=\"http:\/\/www.cvlibs.net\/datasets\/kitti\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">KITTI Vision Benchmark Suite<\/a> or <a class=\"af mv\" href=\"http:\/\/sintel.is.tue.mpg.de\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">MPI Sintel<\/a>. 
They both contain ground truth optical flow data, which is generally hard to get from more popular datasets.<\/p>\n<p id=\"b8e3\" class=\"pw-post-body-paragraph lx ly fo be b ma mb mc md me mf mg mh oa mj mk ml ob mn mo mp oc mr ms mt mu fh bj\" data-selectable-paragraph=\"\">To solve the optical flow problem, convolutional neural networks can help.<br>\n<a class=\"af mv\" href=\"https:\/\/arxiv.org\/abs\/1504.06852\" target=\"_blank\" rel=\"noopener ugc nofollow\">FlowNet<\/a> is an example of a CNN designed for optical flow tasks, and it can output the optical flow from two frames.<\/p>\n<figure class=\"ni nj nk nl nm mw nf ng paragraph-image\">\n<div class=\"np nq eb nr bg ns\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg nt nu c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*xxmptxywxZ2_xFmZ.png\" alt=\"\" width=\"700\" height=\"160\"><\/figure><div class=\"nf ng rm\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/0*xxmptxywxZ2_xFmZ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/0*xxmptxywxZ2_xFmZ.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/0*xxmptxywxZ2_xFmZ.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/0*xxmptxywxZ2_xFmZ.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/0*xxmptxywxZ2_xFmZ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/0*xxmptxywxZ2_xFmZ.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/0*xxmptxywxZ2_xFmZ.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 
80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*xxmptxywxZ2_xFmZ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*xxmptxywxZ2_xFmZ.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*xxmptxywxZ2_xFmZ.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*xxmptxywxZ2_xFmZ.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*xxmptxywxZ2_xFmZ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*xxmptxywxZ2_xFmZ.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*xxmptxywxZ2_xFmZ.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"nv nw nx nf ng ny nz be b bf z dv\" data-selectable-paragraph=\"\">(<a class=\"af mv\" href=\"https:\/\/arxiv.org\/pdf\/1504.06852.pdf\" target=\"_blank\" rel=\"noopener ugc nofollow\">source<\/a>)<\/figcaption>\n<\/figure>\n<p id=\"1ae5\" class=\"pw-post-body-paragraph lx ly fo be b ma mb mc md me mf mg mh oa mj mk ml ob mn mo mp oc mr ms mt mu fh bj\" data-selectable-paragraph=\"\">The input of the network is a set of two RGB images; thus it has a depth of 6.<\/p>\n<p class=\"pw-post-body-paragraph lx ly fo be b ma mb mc md me mf mg mh oa mj mk ml ob mn mo mp oc mr ms mt mu fh bj\" data-selectable-paragraph=\"\">Optical flow is often represented by colors.<\/p>\n<p id=\"9c96\" class=\"qw qx fo be qy qz ra rb rc rd re mu dv\" 
data-selectable-paragraph=\"\">The first problem we want to solve is understanding the movement of pixels from one frame to another. Optical flow estimation can be done on a video stream or a video sequence. The output vectors can then be classified to understand the movement.<\/p>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"fh fi fj fk fl\">\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<h2 id=\"6281\" class=\"om on fo be oo op oq or os ot ou ov ow oa ox oy oz ob pa pb pc oc pd pe pf pg bj\" data-selectable-paragraph=\"\">Visual Object Tracking (VOT)<\/h2>\n<p id=\"c1e4\" class=\"pw-post-body-paragraph lx ly fo be b ma qf mc md me qg mg mh oa qh mk ml ob qi mo mp oc qj ms mt mu fh bj\" data-selectable-paragraph=\"\">First, we can simply track objects. <strong class=\"be od\">Visual Object Tracking (VOT)<\/strong> is the task of tracking an object given its position in the first frame. We are not using a detection algorithm here, so the tracker is model-free. In other words, we don\u2019t know what we are tracking. We are simply given an initial bounding box and are asked to keep track of this object throughout the video.<\/p>\n<p id=\"3a7a\" class=\"pw-post-body-paragraph lx ly fo be b ma mb mc md me mf mg mh oa mj mk ml ob mn mo mp oc mr ms mt mu fh bj\" data-selectable-paragraph=\"\">Tracking is performed by computing the similarity between frames 0 and 1. We check what\u2019s in the bounding box and try to find it again in the next frame.<br>\nWe can then move the bounding box accordingly and track our obstacle.<\/p>\n<p id=\"d80b\" class=\"pw-post-body-paragraph lx ly fo be b ma mb mc md me mf mg mh oa mj mk ml ob mn mo mp oc mr ms mt mu fh bj\" data-selectable-paragraph=\"\">Other features, such as color, can also be used to track objects. Here, we compute the color of the given object and then the color of the background region closest to the object. 
Then we remove it from our original image to track it.<\/p>\n<figure class=\"ni nj nk nl nm mw nf ng paragraph-image\">\n<div class=\"np nq eb nr bg ns\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg nt nu c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*CNDNAd1GRaA_uBAC.png\" alt=\"\" width=\"700\" height=\"206\"><\/figure><div class=\"nf ng rn\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/0*CNDNAd1GRaA_uBAC.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/0*CNDNAd1GRaA_uBAC.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/0*CNDNAd1GRaA_uBAC.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/0*CNDNAd1GRaA_uBAC.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/0*CNDNAd1GRaA_uBAC.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/0*CNDNAd1GRaA_uBAC.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/0*CNDNAd1GRaA_uBAC.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*CNDNAd1GRaA_uBAC.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*CNDNAd1GRaA_uBAC.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*CNDNAd1GRaA_uBAC.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*CNDNAd1GRaA_uBAC.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*CNDNAd1GRaA_uBAC.png 828w, 
https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*CNDNAd1GRaA_uBAC.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*CNDNAd1GRaA_uBAC.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"nv nw nx nf ng ny nz be b bf z dv\" data-selectable-paragraph=\"\">(<a class=\"af mv\" href=\"https:\/\/www.coursera.org\/lecture\/deep-learning-in-computer-vision\/color-models-Y8Wzn\" target=\"_blank\" rel=\"noopener ugc nofollow\">source<\/a>)<\/figcaption>\n<\/figure>\n<figure class=\"ni nj nk nl nm mw nf ng paragraph-image\">\n<div class=\"np nq eb nr bg ns\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg nt nu c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*lzDFT1YHYIixAHeO.png\" alt=\"\" width=\"700\" height=\"220\"><\/figure><div class=\"nf ng ro\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/0*lzDFT1YHYIixAHeO.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/0*lzDFT1YHYIixAHeO.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/0*lzDFT1YHYIixAHeO.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/0*lzDFT1YHYIixAHeO.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/0*lzDFT1YHYIixAHeO.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/0*lzDFT1YHYIixAHeO.png 1100w, 
https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/0*lzDFT1YHYIixAHeO.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*lzDFT1YHYIixAHeO.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*lzDFT1YHYIixAHeO.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*lzDFT1YHYIixAHeO.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*lzDFT1YHYIixAHeO.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*lzDFT1YHYIixAHeO.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*lzDFT1YHYIixAHeO.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*lzDFT1YHYIixAHeO.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"nv nw nx nf ng ny nz be b bf z dv\" data-selectable-paragraph=\"\">(<a class=\"af mv\" href=\"https:\/\/www.coursera.org\/lecture\/deep-learning-in-computer-vision\/color-models-Y8Wzn\" target=\"_blank\" rel=\"noopener ugc nofollow\">source<\/a>)<\/figcaption>\n<\/figure>\n<p 
id=\"9b69\" class=\"pw-post-body-paragraph lx ly fo be b ma mb mc md me mf mg mh oa mj mk ml ob mn mo mp oc mr ms mt mu fh bj\" data-selectable-paragraph=\"\">This is a very powerful technique, and it only uses classical computer vision. We don\u2019t need a single neural network to do this. To summarize this process:<\/p>\n<ol class=\"\">\n<li id=\"5d73\" class=\"lx ly fo be b ma mb mc md me mf mg mh oa qk mk ml ob ql mo mp oc qm ms mt mu rp qo qp bj\" data-selectable-paragraph=\"\">We receive the initial object to track using a bounding box<\/li>\n<li id=\"dc6c\" class=\"lx ly fo be b ma qq mc md me qr mg mh oa qs mk ml ob qt mo mp oc qu ms mt mu rp qo qp bj\" data-selectable-paragraph=\"\">We compute a color histogram of this object<\/li>\n<li id=\"5f17\" class=\"lx ly fo be b ma qq mc md me qr mg mh oa qs mk ml ob qt mo mp oc qu ms mt mu rp qo qp bj\" data-selectable-paragraph=\"\">We compute the color of the background (near the object)<\/li>\n<li id=\"366a\" class=\"lx ly fo be b ma qq mc md me qr mg mh oa qs mk ml ob qt mo mp oc qu ms mt mu rp qo qp bj\" data-selectable-paragraph=\"\">We remove the object color from the total image<\/li>\n<li id=\"4488\" class=\"lx ly fo be b ma qq mc md me qr mg mh oa qs mk ml ob qt mo mp oc qu ms mt mu rp qo qp bj\" data-selectable-paragraph=\"\">We now have a color-based obstacle tracker<\/li>\n<\/ol>\n<blockquote class=\"lu lv lw\"><p id=\"2d1d\" class=\"lx ly lz be b ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu fh bj\" data-selectable-paragraph=\"\">To find datasets for this task, check out <a class=\"af mv\" href=\"https:\/\/votchallenge.net\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">votchallenge.net<\/a><\/p><\/blockquote>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"fh fi fj fk fl\">\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<h2 id=\"110c\" class=\"om on fo be oo op oq or os ot ou ov ow oa ox oy oz ob pa pb pc oc pd pe pf pg bj\" data-selectable-paragraph=\"\"><strong 
class=\"al\">The next step is to apply CNNs for this task<\/strong><\/h2>\n<p id=\"dc6a\" class=\"pw-post-body-paragraph lx ly fo be b ma qf mc md me qg mg mh oa qh mk ml ob qi mo mp oc qj ms mt mu fh bj\" data-selectable-paragraph=\"\">We must distinguish two main models here: <strong class=\"be od\">MDNet<\/strong> and <strong class=\"be od\">GOTURN.<\/strong><\/p>\n<ul class=\"\">\n<li id=\"e219\" class=\"lx ly fo be b ma mb mc md me mf mg mh oa qk mk ml ob ql mo mp oc qm ms mt mu qn qo qp bj\" data-selectable-paragraph=\"\"><mark class=\"aee aef ao\">An <\/mark><mark class=\"aee aef ao\"><strong class=\"be od\">MDNet (Multi-Domain Net)<\/strong><\/mark><mark class=\"aee aef ao\"> tracker trains a neural network to distinguish between an object and the background.<\/mark><\/li>\n<\/ul>\n<p id=\"a4e5\" class=\"pw-post-body-paragraph lx ly fo be b ma mb mc md me mf mg mh oa mj mk ml ob mn mo mp oc mr ms mt mu fh bj\" data-selectable-paragraph=\"\">The architecture looks like a VGG model\u2014in the end, we have domain-specific layers (object vs background classifier).<\/p>\n<figure class=\"ni nj nk nl nm mw nf ng paragraph-image\">\n<div class=\"np nq eb nr bg ns\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg nt nu c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*VjLXLedY0OQAdvlR.png\" alt=\"\" width=\"700\" height=\"319\"><\/figure><div class=\"nf ng rq\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/0*VjLXLedY0OQAdvlR.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/0*VjLXLedY0OQAdvlR.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/0*VjLXLedY0OQAdvlR.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/0*VjLXLedY0OQAdvlR.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/0*VjLXLedY0OQAdvlR.png 828w, 
https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/0*VjLXLedY0OQAdvlR.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/0*VjLXLedY0OQAdvlR.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*VjLXLedY0OQAdvlR.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*VjLXLedY0OQAdvlR.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*VjLXLedY0OQAdvlR.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*VjLXLedY0OQAdvlR.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*VjLXLedY0OQAdvlR.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*VjLXLedY0OQAdvlR.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*VjLXLedY0OQAdvlR.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"nv nw nx nf ng ny nz be b bf z dv\" data-selectable-paragraph=\"\">(<a class=\"af mv\" href=\"https:\/\/arxiv.org\/pdf\/1510.07945.pdf\" target=\"_blank\" rel=\"noopener ugc 
nofollow\">source<\/a>)<\/figcaption>\n<\/figure>\n<ul class=\"\">\n<li id=\"00d5\" class=\"lx ly fo be b ma mb mc md me mf mg mh oa qk mk ml ob ql mo mp oc qm ms mt mu qn qo qp bj\" data-selectable-paragraph=\"\"><strong class=\"be od\">GOTURN<\/strong> (Generic Object Tracking Using Regression Networks) works by using two neural networks and specifying the region to search. It can work at over 100 FPS, which is amazing for the task of video tracking.<\/li>\n<\/ul>\n<figure class=\"ni nj nk nl nm mw nf ng paragraph-image\">\n<div class=\"np nq eb nr bg ns\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg nt nu c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*8O4aU3RHVqx7lgzi.png\" alt=\"\" width=\"700\" height=\"344\"><\/figure><div class=\"nf ng rr\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/0*8O4aU3RHVqx7lgzi.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/0*8O4aU3RHVqx7lgzi.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/0*8O4aU3RHVqx7lgzi.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/0*8O4aU3RHVqx7lgzi.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/0*8O4aU3RHVqx7lgzi.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/0*8O4aU3RHVqx7lgzi.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/0*8O4aU3RHVqx7lgzi.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 
100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*8O4aU3RHVqx7lgzi.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*8O4aU3RHVqx7lgzi.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*8O4aU3RHVqx7lgzi.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*8O4aU3RHVqx7lgzi.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*8O4aU3RHVqx7lgzi.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*8O4aU3RHVqx7lgzi.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*8O4aU3RHVqx7lgzi.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"nv nw nx nf ng ny nz be b bf z dv\" data-selectable-paragraph=\"\">(<a class=\"af mv\" href=\"https:\/\/davheld.github.io\/GOTURN\/GOTURN.pdf\" target=\"_blank\" rel=\"noopener ugc nofollow\">source<\/a>)<\/figcaption>\n<\/figure>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"fh fi fj fk fl\">\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<h2 id=\"a4da\" class=\"om on fo be oo op oq or os ot ou ov ow oa ox oy oz ob pa pb pc oc pd pe pf pg bj\" data-selectable-paragraph=\"\">Multiple Object Tracking (MOT)<\/h2>\n<p id=\"8f1d\" class=\"pw-post-body-paragraph lx ly fo be b ma qf mc md me qg mg mh oa qh mk ml ob qi mo mp oc qj ms mt mu fh bj\" data-selectable-paragraph=\"\">The last family of trackers is referred to as multiple object tracking. 
Here\u2019s a look at MOT in practice:<\/p>\n<figure class=\"ni nj nk nl nm mw\">\n<div class=\"ph ig l eb\">\n<div class=\"rs pj l\"><iframe loading=\"lazy\" class=\"ek n fc dx bg\" title=\"Multiple Human Tracking - Deep SORT Implementation\" src=\"https:\/\/cdn.embedly.com\/widgets\/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2Fbkn6M4LAoHk%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3Dbkn6M4LAoHk&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2Fbkn6M4LAoHk%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube\" width=\"854\" height=\"480\" frameborder=\"0\" scrolling=\"no\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/div>\n<\/div>\n<figcaption class=\"nv nw nx nf ng ny nz be b bf z dv\">Deep SORT algorithm for MOT<\/figcaption>\n<\/figure>\n<p id=\"bcfe\" class=\"pw-post-body-paragraph lx ly fo be b ma mb mc md me mf mg mh oa mj mk ml ob mn mo mp oc mr ms mt mu fh bj\" data-selectable-paragraph=\"\">Unlike the other family of trackers (VOT), MOT is more long-term.<\/p>\n<p id=\"6830\" class=\"pw-post-body-paragraph lx ly fo be b ma mb mc md me mf mg mh oa mj mk ml ob mn mo mp oc mr ms mt mu fh bj\" data-selectable-paragraph=\"\">We distinguish two variants:<\/p>\n<ul class=\"\">\n<li id=\"3fbd\" class=\"lx ly fo be b ma mb mc md me mf mg mh oa qk mk ml ob ql mo mp oc qm ms mt mu qn qo qp bj\" data-selectable-paragraph=\"\"><em class=\"lz\">Detection-Based Tracking<\/em><\/li>\n<li id=\"2b50\" class=\"lx ly fo be b ma qq mc md me qr mg mh oa qs mk ml ob qt mo mp oc qu ms mt mu qn qo qp bj\" data-selectable-paragraph=\"\"><em class=\"lz\">Detection-Free Tracking<\/em><\/li>\n<\/ul>\n<p id=\"6b27\" class=\"pw-post-body-paragraph lx ly fo be b ma mb mc md me mf mg mh oa mj mk ml ob mn mo mp oc mr ms mt mu fh bj\" data-selectable-paragraph=\"\">Let\u2019s consider Detection-Based Tracking. 
We have two tasks here:<\/p>\n<ul class=\"\">\n<li id=\"36e9\" class=\"lx ly fo be b ma mb mc md me mf mg mh oa qk mk ml ob ql mo mp oc qm ms mt mu qn qo qp bj\" data-selectable-paragraph=\"\">Object detection<\/li>\n<li id=\"c259\" class=\"lx ly fo be b ma qq mc md me qr mg mh oa qs mk ml ob qt mo mp oc qu ms mt mu qn qo qp bj\" data-selectable-paragraph=\"\">Object association<\/li>\n<\/ul>\n<p id=\"8fb1\" class=\"pw-post-body-paragraph lx ly fo be b ma mb mc md me mf mg mh oa mj mk ml ob mn mo mp oc mr ms mt mu fh bj\" data-selectable-paragraph=\"\">Object association means that we have to associate detections from time t to detections from time t+1. It relies heavily on the quality of the detector.<br>\nA bad detector will render the tracker not functional.<\/p>\n<p id=\"c725\" class=\"pw-post-body-paragraph lx ly fo be b ma mb mc md me mf mg mh oa mj mk ml ob mn mo mp oc mr ms mt mu fh bj\" data-selectable-paragraph=\"\">A good tracker should handle a few frames with no detections.<\/p>\n<figure class=\"ni nj nk nl nm mw nf ng paragraph-image\">\n<div class=\"np nq eb nr bg ns\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg nt nu c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*nJptHkqZerAsKKM-H8RAGw.png\" alt=\"\" width=\"700\" height=\"223\"><\/figure><div class=\"nf ng rt\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*nJptHkqZerAsKKM-H8RAGw.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*nJptHkqZerAsKKM-H8RAGw.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*nJptHkqZerAsKKM-H8RAGw.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*nJptHkqZerAsKKM-H8RAGw.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*nJptHkqZerAsKKM-H8RAGw.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*nJptHkqZerAsKKM-H8RAGw.png 1100w, 
https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*nJptHkqZerAsKKM-H8RAGw.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*nJptHkqZerAsKKM-H8RAGw.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*nJptHkqZerAsKKM-H8RAGw.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*nJptHkqZerAsKKM-H8RAGw.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*nJptHkqZerAsKKM-H8RAGw.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*nJptHkqZerAsKKM-H8RAGw.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*nJptHkqZerAsKKM-H8RAGw.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*nJptHkqZerAsKKM-H8RAGw.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<\/figure>\n<p id=\"c192\" class=\"pw-post-body-paragraph lx ly fo be b ma mb mc md me mf mg mh oa mj mk ml ob mn mo mp oc mr ms mt mu fh bj\" data-selectable-paragraph=\"\">We can also distinguish between online and offline tracking. 
Online tracking means that we work on a live feed, so only past and current frames are available. Offline tracking works on a full video, so future frames are available too.<\/p>\n<p id=\"9a69\" class=\"pw-post-body-paragraph lx ly fo be b ma mb mc md me mf mg mh oa mj mk ml ob mn mo mp oc mr ms mt mu fh bj\" data-selectable-paragraph=\"\">For online tracking, we\u2019re tracking the bounding boxes detected by the CNN. We can use:<\/p>\n<ul class=\"\">\n<li id=\"46fa\" class=\"lx ly fo be b ma mb mc md me mf mg mh oa qk mk ml ob ql mo mp oc qm ms mt mu qn qo qp bj\" data-selectable-paragraph=\"\">A CNN for the detection<\/li>\n<li id=\"91ff\" class=\"lx ly fo be b ma qq mc md me qr mg mh oa qs mk ml ob qt mo mp oc qu ms mt mu qn qo qp bj\" data-selectable-paragraph=\"\">A Kalman Filter to predict the position at time t from the position at time t-1<\/li>\n<li id=\"51d0\" class=\"lx ly fo be b ma qq mc md me qr mg mh oa qs mk ml ob qt mo mp oc qu ms mt mu qn qo qp bj\" data-selectable-paragraph=\"\">The Hungarian Algorithm to associate detections across frames<\/li>\n<\/ul>\n<p id=\"30ee\" class=\"pw-post-body-paragraph lx ly fo be b ma mb mc md me mf mg mh oa mj mk ml ob mn mo mp oc mr ms mt mu fh bj\" data-selectable-paragraph=\"\">The matching metric for the Hungarian algorithm can be IOU (Intersection Over Union) or deep convolutional features. 
Using deep convolutional features allows for re-identification after occlusion but slows down the tracker.<\/p>\n<blockquote class=\"lu lv lw\"><p id=\"7738\" class=\"lx ly lz be b ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu fh bj\" data-selectable-paragraph=\"\">To find datasets for MOT tasks, check out <a class=\"af mv\" href=\"https:\/\/motchallenge.net\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">motchallenge.net<\/a><\/p><\/blockquote>\n<p id=\"54a4\" class=\"pw-post-body-paragraph lx ly fo be b ma mb mc md me mf mg mh oa mj mk ml ob mn mo mp oc mr ms mt mu fh bj\" data-selectable-paragraph=\"\">Multi-object tracking extends obstacle detection to new applications such as game analysis or behavioral prediction.<\/p>\n<p id=\"864c\" class=\"pw-post-body-paragraph lx ly fo be b ma mb mc md me mf mg mh oa mj mk ml ob mn mo mp oc mr ms mt mu fh bj\" data-selectable-paragraph=\"\">For more on this, check out my article <a class=\"af mv\" href=\"https:\/\/towardsdatascience.com\/computer-vision-for-tracking-8220759eee85\" target=\"_blank\" rel=\"noopener\">Computer Vision for Tracking<\/a>!<\/p>\n<blockquote class=\"qv\"><p id=\"9bf9\" class=\"qw qx fo be qy qz ra rb rc rd re mu dv\" data-selectable-paragraph=\"\"><strong class=\"al\">I just released an online course on Multi Object Tracking called <\/strong><a class=\"af mv\" href=\"https:\/\/courses.thinkautonomous.ai\/obstacle-tracking\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"al\">LEARN OBSTACLE TRACKING \u2014 The Master Key to become a Self-Driving Car Professional<\/strong><\/a><strong class=\"al\">.<\/strong><\/p><p id=\"0eef\" class=\"qw qx fo be qy qz ra rb rc rd re mu dv\" data-selectable-paragraph=\"\">It\u2019s a course that doesn\u2019t exist anywhere else, and where you\u2019ll learn how to code this exact tracking system.<br>\n\ud83d\ude80 <a class=\"af mv\" href=\"https:\/\/courses.thinkautonomous.ai\/obstacle-tracking\" target=\"_blank\" rel=\"noopener 
ugc nofollow\">Join here<\/a><\/p><\/blockquote>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"fh fi fj fk fl\">\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<p id=\"b98b\" class=\"qw qx fo be qy qz ra rb rc rd re mu dv\" data-selectable-paragraph=\"\"><em class=\"ru\">To summarize:<\/em> The first family of video analysis systems is obstacle tracking. It includes optical flow estimation, visual object tracking, and multi-object tracking.<\/p>\n<p id=\"e98e\" class=\"qw qx fo be qy qz ra rb rc rd re mu dv\" data-selectable-paragraph=\"\">All these algorithms can be detection-free or detection-based, and all share one idea: track an object in a video.<\/p>\n<p id=\"6e2c\" class=\"qw qx fo be qy qz ra rb rc rd re mu dv\" data-selectable-paragraph=\"\">Next up, we\u2019ll take a closer look at action classification.<\/p>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"fh fi fj fk fl\">\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<h1 id=\"42c8\" class=\"pk on fo be oo pl pm pn os po pp pq ow pr ps pt pu pv pw px py pz qa qb qc qd bj\" data-selectable-paragraph=\"\">2. Action Classification<\/h1>\n<p id=\"607b\" class=\"pw-post-body-paragraph lx ly fo be b ma qf mc md me qg mg mh oa qh mk ml ob qi mo mp oc qj ms mt mu fh bj\" data-selectable-paragraph=\"\">Action classification is the second family of tasks involved in building computer vision-based surveillance systems. Once we know how many people we have in the store, and once we know what they\u2019ve been doing, we must analyze their actions.<\/p>\n<p id=\"9ce8\" class=\"pw-post-body-paragraph lx ly fo be b ma mb mc md me mf mg mh oa mj mk ml ob mn mo mp oc mr ms mt mu fh bj\" data-selectable-paragraph=\"\">Action classification depends directly on object detection and tracking, because we first need to understand a given situation or scene. 
Once we have that understanding, we can classify the actions inside the bounding box.<\/p>\n<p id=\"c780\" class=\"pw-post-body-paragraph lx ly fo be b ma mb mc md me mf mg mh oa mj mk ml ob mn mo mp oc mr ms mt mu fh bj\" data-selectable-paragraph=\"\">First, we must choose the camera that sees them with the best angle. Some angles might be better than others. If we choose the correct camera every time\u2014for example, the camera that shows a face\u2014then we can be sure we have a workable image.<\/p>\n<p id=\"f4ee\" class=\"pw-post-body-paragraph lx ly fo be b ma mb mc md me mf mg mh oa mj mk ml ob mn mo mp oc mr ms mt mu fh bj\" data-selectable-paragraph=\"\">Actions can be really simple, like walking, running, clapping, or waving. They can also be more complex, like making a sandwich, which involves a series of multiple actions (cutting bread, washing tomatoes, etc.).<\/p>\n<h2 id=\"409a\" class=\"om on fo be oo op oq or os ot ou ov ow oa ox oy oz ob pa pb pc oc pd pe pf pg bj\" data-selectable-paragraph=\"\">Datasets<\/h2>\n<p id=\"8734\" class=\"pw-post-body-paragraph lx ly fo be b ma qf mc md me qg mg mh oa qh mk ml ob qi mo mp oc qj ms mt mu fh bj\" data-selectable-paragraph=\"\">Labeling is much easier for classification than for tracking\u2014we can simply assign a label to a set of images.<\/p>\n<p id=\"7c8f\" class=\"pw-post-body-paragraph lx ly fo be b ma mb mc md me mf mg mh oa mj mk ml ob mn mo mp oc mr ms mt mu fh bj\" data-selectable-paragraph=\"\">The KTH Actions Dataset is good for gathering videos and associated labels. 
The <a class=\"af mv\" href=\"https:\/\/www.crcv.ucf.edu\/data\/UCF_Sports_Action.php\" target=\"_blank\" rel=\"noopener ugc nofollow\">UCF Sports Action<\/a> dataset is sports-oriented, but it includes useful samples.<\/p>\n<p id=\"fd1e\" class=\"pw-post-body-paragraph lx ly fo be b ma mb mc md me mf mg mh oa mj mk ml ob mn mo mp oc mr ms mt mu fh bj\" data-selectable-paragraph=\"\">More recently, datasets like Hollywood2 for movie scenes, HMDB, or UCF 101 have been released.<\/p>\n<h2 id=\"9201\" class=\"om on fo be oo op oq or os ot ou ov ow oa ox oy oz ob pa pb pc oc pd pe pf pg bj\" data-selectable-paragraph=\"\">Optical flow<\/h2>\n<p id=\"4fa0\" class=\"pw-post-body-paragraph lx ly fo be b ma qf mc md me qg mg mh oa qh mk ml ob qi mo mp oc qj ms mt mu fh bj\" data-selectable-paragraph=\"\">Since optical flow determines a motion vector between two frames, it can be used as an input for a classification neural network.<\/p>\n<figure class=\"ni nj nk nl nm mw nf ng paragraph-image\">\n<div class=\"np nq eb nr bg ns\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg nt nu c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*urTVRZuE7GScrR83Nu6tgA.png\" alt=\"\" width=\"700\" height=\"226\"><\/figure><div class=\"nf ng rv\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*urTVRZuE7GScrR83Nu6tgA.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*urTVRZuE7GScrR83Nu6tgA.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*urTVRZuE7GScrR83Nu6tgA.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*urTVRZuE7GScrR83Nu6tgA.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*urTVRZuE7GScrR83Nu6tgA.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*urTVRZuE7GScrR83Nu6tgA.png 1100w, 
https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*urTVRZuE7GScrR83Nu6tgA.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*urTVRZuE7GScrR83Nu6tgA.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*urTVRZuE7GScrR83Nu6tgA.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*urTVRZuE7GScrR83Nu6tgA.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*urTVRZuE7GScrR83Nu6tgA.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*urTVRZuE7GScrR83Nu6tgA.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*urTVRZuE7GScrR83Nu6tgA.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*urTVRZuE7GScrR83Nu6tgA.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"nv nw nx nf ng ny nz be b bf z dv\" data-selectable-paragraph=\"\">(<a class=\"af mv\" href=\"https:\/\/arxiv.org\/pdf\/1612.03052.pdf\" target=\"_blank\" rel=\"noopener ugc nofollow\">source<\/a>)<\/figcaption>\n<\/figure>\n<p 
id=\"0d0a\" class=\"lx ly lz be b ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu fh bj\" data-selectable-paragraph=\"\">From the optical flow, we define actions and stack a neural network classifier.<\/p>\n<h2 id=\"63be\" class=\"om on fo be oo op oq or os ot ou ov ow oa ox oy oz ob pa pb pc oc pd pe pf pg bj\" data-selectable-paragraph=\"\">Action Classification with Machine Learning (End-To-End)<\/h2>\n<p id=\"20b5\" class=\"pw-post-body-paragraph lx ly fo be b ma qf mc md me qg mg mh oa qh mk ml ob qi mo mp oc qj ms mt mu fh bj\" data-selectable-paragraph=\"\">The more recent and modern solution would be to use CNNs.<\/p>\n<p id=\"ef7c\" class=\"pw-post-body-paragraph lx ly fo be b ma mb mc md me mf mg mh oa mj mk ml ob mn mo mp oc mr ms mt mu fh bj\" data-selectable-paragraph=\"\">Action happens in a video, not an image. This means that we must send multiple frames to the CNN, which must then perform a classification task on what\u2019s called a <strong class=\"be od\">space-time volume<\/strong>.<\/p>\n<p id=\"a53d\" class=\"pw-post-body-paragraph lx ly fo be b ma mb mc md me mf mg mh oa mj mk ml ob mn mo mp oc mr ms mt mu fh bj\" data-selectable-paragraph=\"\">With an image, it\u2019s hard enough to do object detection or classification due to the image size, its rotation, etc. 
In a video, it\u2019s even more difficult.<br>\nHere\u2019s an example of a two-stream model working to classify actions from image streams.<\/p>\n<figure class=\"ni nj nk nl nm mw nf ng paragraph-image\">\n<div class=\"np nq eb nr bg ns\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg nt nu c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*Z_NvXLogrn-A1Ii3A1D-XQ.png\" alt=\"\" width=\"700\" height=\"238\"><\/figure><div class=\"nf ng rw\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*Z_NvXLogrn-A1Ii3A1D-XQ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*Z_NvXLogrn-A1Ii3A1D-XQ.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*Z_NvXLogrn-A1Ii3A1D-XQ.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*Z_NvXLogrn-A1Ii3A1D-XQ.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*Z_NvXLogrn-A1Ii3A1D-XQ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*Z_NvXLogrn-A1Ii3A1D-XQ.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*Z_NvXLogrn-A1Ii3A1D-XQ.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*Z_NvXLogrn-A1Ii3A1D-XQ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*Z_NvXLogrn-A1Ii3A1D-XQ.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*Z_NvXLogrn-A1Ii3A1D-XQ.png 750w, 
https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*Z_NvXLogrn-A1Ii3A1D-XQ.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*Z_NvXLogrn-A1Ii3A1D-XQ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*Z_NvXLogrn-A1Ii3A1D-XQ.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*Z_NvXLogrn-A1Ii3A1D-XQ.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"nv nw nx nf ng ny nz be b bf z dv\" data-selectable-paragraph=\"\">(<a class=\"af mv\" href=\"http:\/\/blog.qure.ai\/assets\/images\/actionrec\/2stream_high.png\" target=\"_blank\" rel=\"noopener ugc nofollow\">source<\/a>)<\/figcaption>\n<\/figure>\n<p id=\"b9d5\" class=\"pw-post-body-paragraph lx ly fo be b ma mb mc md me mf mg mh oa mj mk ml ob mn mo mp oc mr ms mt mu fh bj\" data-selectable-paragraph=\"\">These neural networks work on 2 inputs and output an action. <strong class=\"be od\">The spatial stream is working on a single image<\/strong>; it\u2019s stacked with the <strong class=\"be od\">temporal stream working on an input optical flow<\/strong>. 
A linear classifier is applied here.<\/p>\n<p id=\"72f5\" class=\"pw-post-body-paragraph lx ly fo be b ma mb mc md me mf mg mh oa mj mk ml ob mn mo mp oc mr ms mt mu fh bj\" data-selectable-paragraph=\"\">Many action classification networks already exist; it\u2019s a hard problem to solve.<\/p>\n<p id=\"7e78\" class=\"lx ly lz be b ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu fh bj\" data-selectable-paragraph=\"\">For a complete overview of action classification neural networks, I encourage you to read <a class=\"af mv\" href=\"http:\/\/blog.qure.ai\/assets\/images\/actionrec\/2stream_high.png\" target=\"_blank\" rel=\"noopener ugc nofollow\">this blog post<\/a>.<\/p>\n<h2 id=\"c2e1\" class=\"om on fo be oo op oq or os ot ou ov ow oa ox oy oz ob pa pb pc oc pd pe pf pg bj\" data-selectable-paragraph=\"\">Pose Estimation<\/h2>\n<p id=\"3f61\" class=\"pw-post-body-paragraph lx ly fo be b ma qf mc md me qg mg mh oa qh mk ml ob qi mo mp oc qj ms mt mu fh bj\" data-selectable-paragraph=\"\">Finally, know that pose estimation is another deep learning technique used as a means for action classification.<\/p>\n<p id=\"3a58\" class=\"pw-post-body-paragraph lx ly fo be b ma mb mc md me mf mg mh oa mj mk ml ob mn mo mp oc mr ms mt mu fh bj\" data-selectable-paragraph=\"\">The process of pose estimation includes:<\/p>\n<ul class=\"\">\n<li id=\"ae2e\" class=\"lx ly fo be b ma mb mc md me mf mg mh oa qk mk ml ob ql mo mp oc qm ms mt mu qn qo qp bj\" data-selectable-paragraph=\"\"><strong class=\"be od\">Detecting keypoints <\/strong>(similar to facial landmarks)<\/li>\n<li id=\"12fe\" class=\"lx ly fo be b ma qq mc md me qr mg mh oa qs mk ml ob qt mo mp oc qu ms mt mu qn qo qp bj\" data-selectable-paragraph=\"\"><strong class=\"be od\">Tracking these keypoints<\/strong><\/li>\n<li id=\"3567\" class=\"lx ly fo be b ma qq mc md me qr mg mh oa qs mk ml ob qt mo mp oc qu ms mt mu qn qo qp bj\" data-selectable-paragraph=\"\"><strong class=\"be 
od\">Classifying the keypoints\u2019 movement<\/strong><\/li>\n<\/ul>\n<figure class=\"ni nj nk nl nm mw nf ng paragraph-image\">\n<div class=\"np nq eb nr bg ns\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg nt nu c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*1Vx9-Yxk07gdQtzystnpXw.png\" alt=\"\" width=\"700\" height=\"487\"><\/figure><div class=\"nf ng rx\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*1Vx9-Yxk07gdQtzystnpXw.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*1Vx9-Yxk07gdQtzystnpXw.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*1Vx9-Yxk07gdQtzystnpXw.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*1Vx9-Yxk07gdQtzystnpXw.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*1Vx9-Yxk07gdQtzystnpXw.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*1Vx9-Yxk07gdQtzystnpXw.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*1Vx9-Yxk07gdQtzystnpXw.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*1Vx9-Yxk07gdQtzystnpXw.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*1Vx9-Yxk07gdQtzystnpXw.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*1Vx9-Yxk07gdQtzystnpXw.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*1Vx9-Yxk07gdQtzystnpXw.png 
786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*1Vx9-Yxk07gdQtzystnpXw.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*1Vx9-Yxk07gdQtzystnpXw.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*1Vx9-Yxk07gdQtzystnpXw.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"nv nw nx nf ng ny nz be b bf z dv\" data-selectable-paragraph=\"\">(<a class=\"af mv\" href=\"https:\/\/medium.com\/@masherov14\/pose-estimation-metrics-844c07ba0a78\" rel=\"noopener\">source<\/a>)<\/figcaption>\n<\/figure>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"fh fi fj fk fl\">\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<h6 id=\"5cf7\" class=\"lx ly lz be b ma mb mc md me mf mg mh mi mj mk ml mm mn mo mp mq mr ms mt mu fh bj\">Here is a complete overview of the algorithms.<\/h6>\n<\/div>\n<\/div>\n<div class=\"mw\">\n<div class=\"ab ca\">\n<div class=\"mx my mz na nb nc ce nd cf ne ch bg\">\n<figure class=\"ni nj nk nl nm mw nn no paragraph-image\">\n<div class=\"np nq eb nr bg ns\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg nt nu c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:1000\/1*vAhTRWkkjqX6__t1xN9HcA.png\" alt=\"\" width=\"1000\" height=\"565\"><\/figure><div class=\"nf ng ry\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*vAhTRWkkjqX6__t1xN9HcA.png 640w, 
https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*vAhTRWkkjqX6__t1xN9HcA.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*vAhTRWkkjqX6__t1xN9HcA.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*vAhTRWkkjqX6__t1xN9HcA.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*vAhTRWkkjqX6__t1xN9HcA.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*vAhTRWkkjqX6__t1xN9HcA.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:2000\/format:webp\/1*vAhTRWkkjqX6__t1xN9HcA.png 2000w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 1000px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*vAhTRWkkjqX6__t1xN9HcA.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*vAhTRWkkjqX6__t1xN9HcA.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*vAhTRWkkjqX6__t1xN9HcA.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*vAhTRWkkjqX6__t1xN9HcA.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*vAhTRWkkjqX6__t1xN9HcA.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*vAhTRWkkjqX6__t1xN9HcA.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:2000\/1*vAhTRWkkjqX6__t1xN9HcA.png 2000w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, 
(-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 1000px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<\/figure>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<p id=\"49f8\" class=\"qw qx fo be qy qz rz sa sb sc sd mu dv\" data-selectable-paragraph=\"\"><strong class=\"al\">Video analysis is the next step in computer vision. Our algorithms will now need to understand sequences of images, 6D inputs, and time-related scenes.<\/strong><\/p>\n<figure class=\"rg rh ri rj rk mw nf ng paragraph-image\">\n<div class=\"np nq eb nr bg ns\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg nt nu c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*hNL2ZhVZ7m4CWkJVVhaHuA.png\" alt=\"\" width=\"700\" height=\"438\"><\/figure><div class=\"nf ng se\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*hNL2ZhVZ7m4CWkJVVhaHuA.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*hNL2ZhVZ7m4CWkJVVhaHuA.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*hNL2ZhVZ7m4CWkJVVhaHuA.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*hNL2ZhVZ7m4CWkJVVhaHuA.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*hNL2ZhVZ7m4CWkJVVhaHuA.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*hNL2ZhVZ7m4CWkJVVhaHuA.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*hNL2ZhVZ7m4CWkJVVhaHuA.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and 
(max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*hNL2ZhVZ7m4CWkJVVhaHuA.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*hNL2ZhVZ7m4CWkJVVhaHuA.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*hNL2ZhVZ7m4CWkJVVhaHuA.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*hNL2ZhVZ7m4CWkJVVhaHuA.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*hNL2ZhVZ7m4CWkJVVhaHuA.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*hNL2ZhVZ7m4CWkJVVhaHuA.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*hNL2ZhVZ7m4CWkJVVhaHuA.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"nv nw nx nf ng ny nz be b bf z dv\" data-selectable-paragraph=\"\">(<a class=\"af mv\" href=\"https:\/\/umbrellatech.co\/wp-content\/uploads\/2019\/10\/Cameras-for-Business-Security-and-Key-Facts-of-Video-Surveillance-Systems-min-1080x675.png\" target=\"_blank\" rel=\"noopener ugc nofollow\">source<\/a>)<\/figcaption>\n<\/figure>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Photo by Denise Jans on Unsplash Computer vision, at its core, is about understanding images. 
The field has seen rapid growth over the last few years, especially due to deep learning and the ability to detect obstacles, segment images, or extract relevant context from a given scene. Using computer vision, we can build autonomous cars, [&hellip;]<\/p>\n","protected":false},"author":62,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[6],"tags":[],"coauthors":[162],"class_list":["post-7465","post","type-post","status-publish","format-standard","hentry","category-machine-learning"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Computer Vision and Deep Learning: From Image to Video Analysis - Comet<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/computer-vision-and-deep-learning-from-image-to-video-analysis\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Computer Vision and Deep Learning: From Image to Video Analysis\" \/>\n<meta property=\"og:description\" content=\"Photo by Denise Jans on Unsplash Computer vision, at its core, is about understanding images. The field has seen rapid growth over the last few years, especially due to deep learning and the ability to detect obstacles, segment images, or extract relevant context from a given scene. 
Using computer vision, we can build autonomous cars, [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/computer-vision-and-deep-learning-from-image-to-video-analysis\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2023-09-13T00:11:25+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-04-24T17:14:08+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/miro.medium.com\/v2\/resize:fit:1000\/1*-ajPIIkeIBzMY-050e96kA.jpeg\" \/>\n<meta name=\"author\" content=\"Jeremy Cohen\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Cometml\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Jeremy Cohen\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"11 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"Computer Vision and Deep Learning: From Image to Video Analysis - Comet","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/computer-vision-and-deep-learning-from-image-to-video-analysis\/","og_locale":"en_US","og_type":"article","og_title":"Computer Vision and Deep Learning: From Image to Video Analysis","og_description":"Photo by Denise Jans on Unsplash Computer vision, at its core, is about understanding images. The field has seen rapid growth over the last few years, especially due to deep learning and the ability to detect obstacles, segment images, or extract relevant context from a given scene. 
Using computer vision, we can build autonomous cars, [&hellip;]","og_url":"https:\/\/www.comet.com\/site\/blog\/computer-vision-and-deep-learning-from-image-to-video-analysis\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2023-09-13T00:11:25+00:00","article_modified_time":"2025-04-24T17:14:08+00:00","og_image":[{"url":"https:\/\/miro.medium.com\/v2\/resize:fit:1000\/1*-ajPIIkeIBzMY-050e96kA.jpeg","type":"","width":"","height":""}],"author":"Jeremy Cohen","twitter_card":"summary_large_image","twitter_creator":"@Cometml","twitter_site":"@Cometml","twitter_misc":{"Written by":"Jeremy Cohen","Est. reading time":"11 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/computer-vision-and-deep-learning-from-image-to-video-analysis\/#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/computer-vision-and-deep-learning-from-image-to-video-analysis\/"},"author":{"name":"Jeremy Cohen","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/1e64ca8044dcb12997aabe6d1d38c5a7"},"headline":"Computer Vision and Deep Learning: From Image to Video Analysis","datePublished":"2023-09-13T00:11:25+00:00","dateModified":"2025-04-24T17:14:08+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/computer-vision-and-deep-learning-from-image-to-video-analysis\/"},"wordCount":1765,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/computer-vision-and-deep-learning-from-image-to-video-analysis\/#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:1000\/1*-ajPIIkeIBzMY-050e96kA.jpeg","articleSection":["Machine 
Learning"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/computer-vision-and-deep-learning-from-image-to-video-analysis\/","url":"https:\/\/www.comet.com\/site\/blog\/computer-vision-and-deep-learning-from-image-to-video-analysis\/","name":"Computer Vision and Deep Learning: From Image to Video Analysis - Comet","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/computer-vision-and-deep-learning-from-image-to-video-analysis\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/computer-vision-and-deep-learning-from-image-to-video-analysis\/#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:1000\/1*-ajPIIkeIBzMY-050e96kA.jpeg","datePublished":"2023-09-13T00:11:25+00:00","dateModified":"2025-04-24T17:14:08+00:00","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/computer-vision-and-deep-learning-from-image-to-video-analysis\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/computer-vision-and-deep-learning-from-image-to-video-analysis\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/computer-vision-and-deep-learning-from-image-to-video-analysis\/#primaryimage","url":"https:\/\/miro.medium.com\/v2\/resize:fit:1000\/1*-ajPIIkeIBzMY-050e96kA.jpeg","contentUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:1000\/1*-ajPIIkeIBzMY-050e96kA.jpeg"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/computer-vision-and-deep-learning-from-image-to-video-analysis\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"Computer Vision and Deep Learning: From Image to Video 
Analysis"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/1e64ca8044dcb12997aabe6d1d38c5a7","name":"Jeremy Cohen","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/2cf8c0f1d0bef51059a3a80ededdf00a","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/08\/1686841399905-96x96.jpg","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/08\/1686841399905-96x96.jpg","caption":"Jeremy 
Cohen"},"url":"https:\/\/www.comet.com\/site\/blog\/author\/hellothinkautonomous-ai\/"}]}},"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7465","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/62"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=7465"}],"version-history":[{"count":1,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7465\/revisions"}],"predecessor-version":[{"id":15545,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7465\/revisions\/15545"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=7465"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=7465"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=7465"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=7465"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}