{"id":7389,"date":"2023-09-07T10:03:37","date_gmt":"2023-09-07T18:03:37","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=7389"},"modified":"2025-04-24T17:14:24","modified_gmt":"2025-04-24T17:14:24","slug":"pre-trained-machine-learning-models-vs-models-trained-from-scratch","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/pre-trained-machine-learning-models-vs-models-trained-from-scratch\/","title":{"rendered":"Pre-Trained Machine Learning Models vs Models Trained from Scratch"},"content":{"rendered":"\n<div class=\"eo ep eq er es\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg lw lx c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:2500\/1*fyM5WMsMfLjGCrB1tovshw.jpeg\" alt=\"\" width=\"2400\" height=\"1667\"><\/figure><div class=\"lq bg\">\n<figure class=\"lr ls lt lu lv lq bg paragraph-image\"><picture><\/picture><\/figure>\n<\/div>\n<div class=\"ab ca\">\n<div class=\"ch bg dx dy dz ea\">\n<blockquote class=\"ly lz ma\"><p id=\"25fd\" class=\"mb mc md be b ft me mf mg fw mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw eo bj\" data-selectable-paragraph=\"\"><strong class=\"be mx\">From the Abstract<\/strong>: We report competitive results on object detection and instance segmentation on the COCO dataset using standard models trained from random initialization. The results are no worse than their ImageNet pre-training counterparts, with the sole exception of increasing the number of training iterations so the randomly initialized models may converge. 
Training from random initialization is surprisingly robust; our results hold even when: (i) using only 10% of the training data, (ii) for deeper and wider models, and (iii) for multiple tasks and metrics.<\/p><\/blockquote>\n<h1 id=\"4ec9\" class=\"my mz ev be na nb nc fv nd ne nf fy ng nh ni nj nk nl nm nn no np nq nr ns nt bj\" data-selectable-paragraph=\"\"><strong class=\"al\">Contents<\/strong><\/h1>\n<ul class=\"\">\n<li id=\"9f83\" class=\"mb mc ev be b ft nu mf mg fw nv mi mj mk nw mm mn mo nx mq mr ms ny mu mv mw nz oa ob bj\" data-selectable-paragraph=\"\"><a class=\"af gj\" href=\"https:\/\/heartbeat.comet.ml\/pre-trained-machine-learning-models-vs-models-trained-from-scratch-63e079ed648f#665f\" rel=\"noopener ugc nofollow\">ImageNet Pre-training &amp; fine-tune paradigm<\/a><\/li>\n<li id=\"c7a5\" class=\"mb mc ev be b ft oc mf mg fw od mi mj mk oe mm mn mo of mq mr ms og mu mv mw nz oa ob bj\" data-selectable-paragraph=\"\"><a class=\"af gj\" href=\"https:\/\/heartbeat.comet.ml\/pre-trained-machine-learning-models-vs-models-trained-from-scratch-63e079ed648f#f3f6\" rel=\"noopener ugc nofollow\">Different modes of training<\/a><\/li>\n<li id=\"89a3\" class=\"mb mc ev be b ft oc mf mg fw od mi mj mk oe mm mn mo of mq mr ms og mu mv mw nz oa ob bj\" data-selectable-paragraph=\"\"><a class=\"af gj\" href=\"https:\/\/heartbeat.comet.ml\/pre-trained-machine-learning-models-vs-models-trained-from-scratch-63e079ed648f#00b8\" rel=\"noopener ugc nofollow\">Feature Representations and Random Initialization<\/a><\/li>\n<li id=\"895d\" class=\"mb mc ev be b ft oc mf mg fw od mi mj mk oe mm mn mo of mq mr ms og mu mv mw nz oa ob bj\" data-selectable-paragraph=\"\"><a class=\"af gj\" href=\"https:\/\/heartbeat.comet.ml\/pre-trained-machine-learning-models-vs-models-trained-from-scratch-63e079ed648f#3b12\" rel=\"noopener ugc nofollow\">Related Work<\/a><\/li>\n<li id=\"5977\" class=\"mb mc ev be b ft oc mf mg fw od mi mj mk oe mm mn mo of mq mr ms og mu mv mw nz oa ob 
bj\" data-selectable-paragraph=\"\"><a class=\"af gj\" href=\"https:\/\/heartbeat.comet.ml\/pre-trained-machine-learning-models-vs-models-trained-from-scratch-63e079ed648f#7ee3\" rel=\"noopener ugc nofollow\">Normalization and Convergence Comparison<\/a><\/li>\n<li id=\"b980\" class=\"mb mc ev be b ft oc mf mg fw od mi mj mk oe mm mn mo of mq mr ms og mu mv mw nz oa ob bj\" data-selectable-paragraph=\"\"><a class=\"af gj\" href=\"https:\/\/heartbeat.comet.ml\/pre-trained-machine-learning-models-vs-models-trained-from-scratch-63e079ed648f#435f\" rel=\"noopener ugc nofollow\">Experimental Settings &amp; Results<\/a><\/li>\n<li id=\"44b1\" class=\"mb mc ev be b ft oc mf mg fw od mi mj mk oe mm mn mo of mq mr ms og mu mv mw nz oa ob bj\" data-selectable-paragraph=\"\"><a class=\"af gj\" href=\"https:\/\/heartbeat.comet.ml\/pre-trained-machine-learning-models-vs-models-trained-from-scratch-63e079ed648f#7e71\" rel=\"noopener ugc nofollow\">Enhanced Baselines<\/a><\/li>\n<li id=\"da76\" class=\"mb mc ev be b ft oc mf mg fw od mi mj mk oe mm mn mo of mq mr ms og mu mv mw nz oa ob bj\" data-selectable-paragraph=\"\"><a class=\"af gj\" href=\"https:\/\/heartbeat.comet.ml\/pre-trained-machine-learning-models-vs-models-trained-from-scratch-63e079ed648f#076b\" rel=\"noopener ugc nofollow\">Experiments with less data<\/a><\/li>\n<li id=\"35c1\" class=\"mb mc ev be b ft oc mf mg fw od mi mj mk oe mm mn mo of mq mr ms og mu mv mw nz oa ob bj\" data-selectable-paragraph=\"\"><a class=\"af gj\" href=\"https:\/\/heartbeat.comet.ml\/pre-trained-machine-learning-models-vs-models-trained-from-scratch-63e079ed648f#eb16\" rel=\"noopener ugc nofollow\">Summary<\/a><\/li>\n<\/ul>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"eo ep eq er es\">\n<div class=\"ab ca\">\n<div class=\"ch bg dx dy dz ea\">\n<h1 id=\"2ea7\" class=\"my mz ev be na nb op fv nd ne oq fy ng nh or nj nk nl os nn no np ot nr ns nt bj\" data-selectable-paragraph=\"\">Transfer Learning \u2014 the Pre-train &amp; Fine-tune 
Paradigm<\/h1>\n<p id=\"31d4\" class=\"pw-post-body-paragraph mb mc ev be b ft nu mf mg fw nv mi mj ou nw mm mn ov nx mq mr ow ny mu mv mw eo bj\" data-selectable-paragraph=\"\">Deep learning has seen a lot of progress in recent years. It\u2019s hard to think of an industry that doesn\u2019t use deep learning. The availability of large amounts of data, along with increased computational resources, has fueled this progress. Many methods, both well known and novel, have contributed to the growth of deep learning.<\/p>\n<p id=\"5310\" class=\"pw-post-body-paragraph mb mc ev be b ft me mf mg fw mh mi mj ou ml mm mn ov mp mq mr ow mt mu mv mw eo bj\" data-selectable-paragraph=\"\">One of those is <a class=\"af gj\" href=\"https:\/\/heartbeat.comet.ml\/transfer-learning-with-pytorch-cfcb69016c72\" target=\"_blank\" rel=\"noopener ugc nofollow\">transfer learning<\/a>, which is the method of reusing the representations learned by one trained model in another model that needs to be trained on different data, for a similar or a different task. Transfer learning uses pre-trained models (i.e., models already trained on larger benchmark datasets like ImageNet).<\/p>\n<p id=\"bc6a\" class=\"pw-post-body-paragraph mb mc ev be b ft me mf mg fw mh mi mj ou ml mm mn ov mp mq mr ow mt mu mv mw eo bj\" data-selectable-paragraph=\"\">Training a neural network can take anywhere from minutes to months, depending on the data and the target task. Until a few years ago, due to computational constraints, this was possible only for research institutes and tech organizations.<\/p>\n<p id=\"b3b6\" class=\"pw-post-body-paragraph mb mc ev be b ft me mf mg fw mh mi mj ou ml mm mn ov mp mq mr ow mt mu mv mw eo bj\" data-selectable-paragraph=\"\">But with pre-trained models readily available (along with a few other factors), that scenario has changed. 
Using transfer learning, we can now build deep learning applications that solve vision-related tasks much more quickly.<\/p>\n<p id=\"5968\" class=\"pw-post-body-paragraph mb mc ev be b ft me mf mg fw mh mi mj ou ml mm mn ov mp mq mr ow mt mu mv mw eo bj\" data-selectable-paragraph=\"\">With recent developments in the past year, transfer learning is now possible for language-related tasks as well. All of this supports what Andrew Ng said a few years ago: that transfer learning will be the next driver of commercial ML success.<\/p>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"eo ep eq er es\">\n<div class=\"ab ca\">\n<div class=\"ch bg dx dy dz ea\">\n<h1 id=\"6735\" class=\"my mz ev be na nb op fv nd ne oq fy ng nh or nj nk nl os nn no np ot nr ns nt bj\" data-selectable-paragraph=\"\">Different Modes of Training<\/h1>\n<figure class=\"lr ls lt lu lv lq ph pi paragraph-image\">\n<figure><img decoding=\"async\" class=\"lw bg lx c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/1*9t7Po_ZFsT5_lZj445c-Lw.png\" alt=\"\" width=\"700\"><\/figure><div class=\"ab cm ca pj\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/format:webp\/1*9t7Po_ZFsT5_lZj445c-Lw.png 640w, https:\/\/miro.medium.com\/v2\/format:webp\/1*9t7Po_ZFsT5_lZj445c-Lw.png 720w, https:\/\/miro.medium.com\/v2\/format:webp\/1*9t7Po_ZFsT5_lZj445c-Lw.png 750w, 
https:\/\/miro.medium.com\/v2\/format:webp\/1*9t7Po_ZFsT5_lZj445c-Lw.png 786w, https:\/\/miro.medium.com\/v2\/format:webp\/1*9t7Po_ZFsT5_lZj445c-Lw.png 828w, https:\/\/miro.medium.com\/v2\/format:webp\/1*9t7Po_ZFsT5_lZj445c-Lw.png 1100w, https:\/\/miro.medium.com\/v2\/format:webp\/1*9t7Po_ZFsT5_lZj445c-Lw.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/1*9t7Po_ZFsT5_lZj445c-Lw.png 640w, https:\/\/miro.medium.com\/v2\/1*9t7Po_ZFsT5_lZj445c-Lw.png 720w, https:\/\/miro.medium.com\/v2\/1*9t7Po_ZFsT5_lZj445c-Lw.png 750w, https:\/\/miro.medium.com\/v2\/1*9t7Po_ZFsT5_lZj445c-Lw.png 786w, https:\/\/miro.medium.com\/v2\/1*9t7Po_ZFsT5_lZj445c-Lw.png 828w, https:\/\/miro.medium.com\/v2\/1*9t7Po_ZFsT5_lZj445c-Lw.png 1100w, https:\/\/miro.medium.com\/v2\/1*9t7Po_ZFsT5_lZj445c-Lw.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div><figcaption class=\"pk pl pm ph pi pn po be b bf z gi\" data-selectable-paragraph=\"\"><a class=\"af gj\" 
href=\"https:\/\/towardsdatascience.com\/transfer-learning-from-pre-trained-models-f2393f124751\" target=\"_blank\" rel=\"noopener\">Image Credit<\/a><\/figcaption><\/figure>\n<p id=\"3784\" class=\"pw-post-body-paragraph mb mc ev be b ft me mf mg fw mh mi mj ou ml mm mn ov mp mq mr ow mt mu mv mw eo bj\" data-selectable-paragraph=\"\">In the blog post <a class=\"af gj\" href=\"https:\/\/blog.keras.io\/building-powerful-image-classification-models-using-very-little-data.html\" target=\"_blank\" rel=\"noopener ugc nofollow\">Building powerful image classification models using very little data<\/a>, Francois Chollet walks through the process of training a model with limited data. He starts by training a model from scratch for 50 epochs and gets an accuracy of 80% on dogs vs. cats classification. Using the bottleneck features of a pre-trained model, the accuracy jumps to 90% on the same data. As a last step, after fine-tuning the top layers of the network, an accuracy of 94% is reported.<\/p>\n<p id=\"8e04\" class=\"pw-post-body-paragraph mb mc ev be b ft me mf mg fw mh mi mj ou ml mm mn ov mp mq mr ow mt mu mv mw eo bj\" data-selectable-paragraph=\"\">As is evident here, transfer learning with pre-trained models can boost accuracy and converge faster than a model trained from scratch. Does this mean the pre-train and fine-tune paradigm is a clear winner against training from scratch?<\/p>\n<p id=\"3ec1\" class=\"pw-post-body-paragraph mb mc ev be b ft me mf mg fw mh mi mj ou ml mm mn ov mp mq mr ow mt mu mv mw eo bj\" data-selectable-paragraph=\"\">The above image helps visualize a few possibilities when training a model. The image on the right represents a model in which the convolutional base of a pre-trained model is frozen and the bottleneck features obtained from it are used to train the remaining layers. This is a typical scenario with pre-trained models. 
The image in the middle represents a model where, except for the first few layers, the rest of the network is trained. And the final model on the left represents training a model from scratch, which is the method we\u2019ll be looking into in this post.<\/p>\n<p id=\"8003\" class=\"pw-post-body-paragraph mb mc ev be b ft me mf mg fw mh mi mj ou ml mm mn ov mp mq mr ow mt mu mv mw eo bj\" data-selectable-paragraph=\"\">In the pre-train and fine-tune paradigm, model training starts with learned weights that come from a pre-trained model. This has become the standard approach, especially for vision-related tasks like object detection and image segmentation.<\/p>\n<p id=\"9096\" class=\"pw-post-body-paragraph mb mc ev be b ft me mf mg fw mh mi mj ou ml mm mn ov mp mq mr ow mt mu mv mw eo bj\" data-selectable-paragraph=\"\">Models that are pre-trained on ImageNet are good at detecting generic visual features such as edges and textures. These learned feature representations can be reused. This helps in quicker convergence and is used in state-of-the-art approaches to tasks like object detection, segmentation, and activity recognition. But how good are these representations?<\/p>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"eo ep eq er es\">\n<div class=\"ab ca\">\n<div class=\"ch bg dx dy dz ea\">\n<h1 id=\"aaf9\" class=\"my mz ev be na nb op fv nd ne oq fy ng nh or nj nk nl os nn no np ot nr ns nt bj\" data-selectable-paragraph=\"\">Feature Representations and Random Initialization<\/h1>\n<p id=\"06ba\" class=\"pw-post-body-paragraph mb mc ev be b ft nu mf mg fw nv mi mj ou nw mm mn ov nx mq mr ow ny mu mv mw eo bj\" data-selectable-paragraph=\"\">Feature representations learned by pre-trained models are domain dependent. They learn from the benchmark dataset they\u2019re trained on. Can we achieve universal feature representations by building much larger datasets? 
Some work has already been done in this area, with annotated datasets almost 3000 times the size of ImageNet.<\/p>\n<p id=\"8f7f\" class=\"pw-post-body-paragraph mb mc ev be b ft me mf mg fw mh mi mj ou ml mm mn ov mp mq mr ow mt mu mv mw eo bj\" data-selectable-paragraph=\"\">However, the improvements on target tasks scale poorly with the size of the datasets used for pre-training. This shows that simply building much larger datasets doesn\u2019t always lead to better results on the target tasks. The alternative is to train a model from scratch with random weight initialization.<\/p>\n<p id=\"de9e\" class=\"pw-post-body-paragraph mb mc ev be b ft me mf mg fw mh mi mj ou ml mm mn ov mp mq mr ow mt mu mv mw eo bj\" data-selectable-paragraph=\"\">To train a model from scratch, all the parameters (weights) in the network are randomly initialized. The experiments carried out in the paper <a class=\"af gj\" href=\"https:\/\/arxiv.org\/abs\/1811.08883\" target=\"_blank\" rel=\"noopener ugc nofollow\"><em class=\"md\">Rethinking ImageNet Pre-training<\/em><\/a> use Mask R-CNN as the baseline. 
This baseline model was trained on the COCO dataset both with and without pre-training, and the results were compared.<\/p>\n<p id=\"74a0\" class=\"pw-post-body-paragraph mb mc ev be b ft me mf mg fw mh mi mj ou ml mm mn ov mp mq mr ow mt mu mv mw eo bj\" data-selectable-paragraph=\"\">The results show that, with a sufficient number of training iterations and appropriate techniques, the model trained from scratch gives results comparable to those of the fine-tuned model.<\/p>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"eo ep eq er es\">\n<div class=\"ab ca\">\n<div class=\"ch bg dx dy dz ea\">\n<h1 id=\"95b6\" class=\"my mz ev be na nb op fv nd ne oq fy ng nh or nj nk nl os nn no np ot nr ns nt bj\" data-selectable-paragraph=\"\">Related Work<\/h1>\n<p id=\"d06f\" class=\"pw-post-body-paragraph mb mc ev be b ft nu mf mg fw nv mi mj ou nw mm mn ov nx mq mr ow ny mu mv mw eo bj\" data-selectable-paragraph=\"\">If you\u2019re used to working with pre-trained models, training from scratch might sound time- and resource-consuming. To examine this assumption, we can look at prior research on training models from scratch instead of using pre-trained models.<\/p>\n<p id=\"a410\" class=\"pw-post-body-paragraph mb mc ev be b ft me mf mg fw mh mi mj ou ml mm mn ov mp mq mr ow mt mu mv mw eo bj\" data-selectable-paragraph=\"\">DetNet and CornerNet use specialized model architectures to accommodate training models from scratch. But there was no evidence that these specialized architectures give results comparable to those of the pre-train &amp; fine-tune paradigm.<\/p>\n<p id=\"5fc8\" class=\"pw-post-body-paragraph mb mc ev be b ft me mf mg fw mh mi mj ou ml mm mn ov mp mq mr ow mt mu mv mw eo bj\" data-selectable-paragraph=\"\">In this work, the authors considered using existing baseline architectures with a couple of changes. 
One is to train the model for more iterations, and the other is to use batch normalization alternatives like group normalization and synchronized batch normalization. With these changes, the authors were able to produce results close to those of the fine-tuning approach.<\/p>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"eo ep eq er es\">\n<div class=\"ab ca\">\n<div class=\"ch bg dx dy dz ea\">\n<h1 id=\"a9ae\" class=\"my mz ev be na nb op fv nd ne oq fy ng nh or nj nk nl os nn no np ot nr ns nt bj\" data-selectable-paragraph=\"\">Normalization and Convergence Comparison<\/h1>\n<p id=\"f960\" class=\"pw-post-body-paragraph mb mc ev be b ft nu mf mg fw nv mi mj ou nw mm mn ov nx mq mr ow ny mu mv mw eo bj\" data-selectable-paragraph=\"\">Training models from scratch without proper normalization can produce misleading results, which might wrongly suggest that training from scratch is not viable at all.<\/p>\n<p id=\"f172\" class=\"pw-post-body-paragraph mb mc ev be b ft me mf mg fw mh mi mj ou ml mm mn ov mp mq mr ow mt mu mv mw eo bj\" data-selectable-paragraph=\"\">For vision-related tasks, training data consists of high-resolution images. This means the batch size has to be adjusted to meet memory constraints. Batch normalization (BN) computes its statistics over each batch, so it works best with larger batch sizes. But with high-resolution images and limited memory, model training has to use a small batch size. This leads to poor results with BN.<\/p>\n<p id=\"9a89\" class=\"pw-post-body-paragraph mb mc ev be b ft me mf mg fw mh mi mj ou ml mm mn ov mp mq mr ow mt mu mv mw eo bj\" data-selectable-paragraph=\"\">To avoid this, group normalization (GN) and synchronized batch normalization (SyncBN) are used. Group normalization is independent of the batch size. Synchronized BN computes batch statistics across multiple devices. 
This increases the effective batch size and avoids small batches, which enables training models from scratch.<\/p>\n<p id=\"0a33\" class=\"pw-post-body-paragraph mb mc ev be b ft me mf mg fw mh mi mj ou ml mm mn ov mp mq mr ow mt mu mv mw eo bj\" data-selectable-paragraph=\"\">Fine-tuned models get a head start, as the pre-trained model has already learned low-level features such as edges and textures. That means models trained from scratch cannot converge as fast as fine-tuned models. Though this makes fine-tuned models look better, one should also consider the time and resources it takes to pre-train a model on large benchmark datasets like ImageNet. During ImageNet pre-training, the model is trained on over a million images for many iterations. So for randomly initialized training to catch up, the model needs many more training iterations.<\/p>\n<figure class=\"lr ls lt lu lv lq ph pi paragraph-image\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg lw lx c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:621\/1*YBssNo_EyHtADpe3RMsloQ.png\" alt=\"\" width=\"621\" height=\"493\"><\/figure><div class=\"ph pi pp\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*YBssNo_EyHtADpe3RMsloQ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*YBssNo_EyHtADpe3RMsloQ.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*YBssNo_EyHtADpe3RMsloQ.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*YBssNo_EyHtADpe3RMsloQ.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*YBssNo_EyHtADpe3RMsloQ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*YBssNo_EyHtADpe3RMsloQ.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1242\/format:webp\/1*YBssNo_EyHtADpe3RMsloQ.png 1242w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, 
(min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 621px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*YBssNo_EyHtADpe3RMsloQ.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*YBssNo_EyHtADpe3RMsloQ.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*YBssNo_EyHtADpe3RMsloQ.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*YBssNo_EyHtADpe3RMsloQ.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*YBssNo_EyHtADpe3RMsloQ.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*YBssNo_EyHtADpe3RMsloQ.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1242\/1*YBssNo_EyHtADpe3RMsloQ.png 1242w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 621px\" data-testid=\"og\"><\/picture><\/div>\n<\/figure>\n<p id=\"4442\" class=\"pw-post-body-paragraph mb mc ev be b ft me mf mg fw mh mi mj ou ml mm mn ov mp mq mr ow mt mu mv mw eo bj\" data-selectable-paragraph=\"\">The image here summarizes the number of training samples seen in both cases\u2014with and without pre-training. Depending on the target task, the samples can be images, instances, or pixels. For a segmentation task, the model works at the pixel level. For object detection what matters is the instances of objects in each image. 
We see that, except at the pixel level (segmentation), training from scratch sees substantially fewer training samples.<\/p>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"eo ep eq er es\">\n<div class=\"ab ca\">\n<div class=\"ch bg dx dy dz ea\">\n<h1 id=\"3188\" class=\"my mz ev be na nb op fv nd ne oq fy ng nh or nj nk nl os nn no np ot nr ns nt bj\" data-selectable-paragraph=\"\">Experimental Settings and Results<\/h1>\n<p id=\"1de8\" class=\"pw-post-body-paragraph mb mc ev be b ft nu mf mg fw nv mi mj ou nw mm mn ov nx mq mr ow ny mu mv mw eo bj\" data-selectable-paragraph=\"\">Here are the experimental settings: the architecture and <a class=\"af gj\" href=\"https:\/\/heartbeat.comet.ml\/tuning-machine-learning-hyperparameters-40265a35c9b8\" target=\"_blank\" rel=\"noopener ugc nofollow\">hyperparameters<\/a> used. The experiments use Mask R-CNN with ResNet and ResNeXt architectures as the baseline. GN or SyncBN is used for normalization. The model is trained with an initial <a class=\"af gj\" href=\"https:\/\/heartbeat.comet.ml\/introduction-to-learning-rates-in-machine-learning-6ed685c16506\" target=\"_blank\" rel=\"noopener ugc nofollow\">learning rate<\/a> of 0.02, which is reduced by a factor of 10 in the last 60k iterations and again in the last 20k iterations. The training data is augmented with horizontal flipping, and there is no test-time augmentation for the baseline model. 
A total of 8 GPUs were used for training.<\/p>\n<figure class=\"lr ls lt lu lv lq ph pi paragraph-image\">\n<div class=\"pr ps hc pt bg pu\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg lw lx c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*eCZ-f8hhCkE8UVfds5gMbg.png\" alt=\"\" width=\"700\" height=\"337\"><\/figure><div class=\"ph pi pq\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*eCZ-f8hhCkE8UVfds5gMbg.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*eCZ-f8hhCkE8UVfds5gMbg.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*eCZ-f8hhCkE8UVfds5gMbg.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*eCZ-f8hhCkE8UVfds5gMbg.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*eCZ-f8hhCkE8UVfds5gMbg.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*eCZ-f8hhCkE8UVfds5gMbg.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*eCZ-f8hhCkE8UVfds5gMbg.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*eCZ-f8hhCkE8UVfds5gMbg.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*eCZ-f8hhCkE8UVfds5gMbg.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*eCZ-f8hhCkE8UVfds5gMbg.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*eCZ-f8hhCkE8UVfds5gMbg.png 786w, 
https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*eCZ-f8hhCkE8UVfds5gMbg.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*eCZ-f8hhCkE8UVfds5gMbg.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*eCZ-f8hhCkE8UVfds5gMbg.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<\/figure>\n<p id=\"b7fd\" class=\"pw-post-body-paragraph mb mc ev be b ft me mf mg fw mh mi mj ou ml mm mn ov mp mq mr ow mt mu mv mw eo bj\" data-selectable-paragraph=\"\">With these settings on the COCO dataset (118k training and 5k validation images), the model trained from scratch was able to catch up in accuracy with the pre-trained model.<\/p>\n<p id=\"5621\" class=\"pw-post-body-paragraph mb mc ev be b ft me mf mg fw mh mi mj ou ml mm mn ov mp mq mr ow mt mu mv mw eo bj\" data-selectable-paragraph=\"\">In this case, object detection and image segmentation are the two target tasks. The metrics are average precision (AP) for bounding boxes and for masks. In the plots above, the one on the left shows results with ResNet-101 and GN, whereas the one on the right shows results for Mask R-CNN with ResNet-50 and SyncBN.<\/p>\n<p id=\"2c2f\" class=\"pw-post-body-paragraph mb mc ev be b ft me mf mg fw mh mi mj ou ml mm mn ov mp mq mr ow mt mu mv mw eo bj\" data-selectable-paragraph=\"\">We see that when fine-tuning, pre-training gives the model a head start: the AP starts at a value close to 20. 
When training from scratch, by contrast, the model starts with an AP close to 5. But the important thing to note is that the model trained from scratch goes on to reach similar results. The spikes in the curves reflect different schedules and learning rates, all merged into the same plot.<\/p>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"eo ep eq er es\">\n<div class=\"ab ca\">\n<div class=\"ch bg dx dy dz ea\">\n<h1 id=\"b600\" class=\"my mz ev be na nb op fv nd ne oq fy ng nh or nj nk nl os nn no np ot nr ns nt bj\" data-selectable-paragraph=\"\">Enhanced Baselines<\/h1>\n<p id=\"4bc6\" class=\"pw-post-body-paragraph mb mc ev be b ft nu mf mg fw nv mi mj ou nw mm mn ov nx mq mr ow ny mu mv mw eo bj\" data-selectable-paragraph=\"\">The authors also tried making enhancements to their baseline model. Better results were reported by adding scale augmentation during training. Similarly, using Cascade R-CNN and test-time augmentation also improved the results.<\/p>\n<p id=\"48e3\" class=\"pw-post-body-paragraph mb mc ev be b ft me mf mg fw mh mi mj ou ml mm mn ov mp mq mr ow mt mu mv mw eo bj\" data-selectable-paragraph=\"\"><strong class=\"be mx\">We see that with train- and test-time augmentation, models trained from scratch give better results than the pre-trained models<\/strong>. 
These plots show the results with enhanced baseline models.<\/p>\n<figure class=\"lr ls lt lu lv lq ph pi paragraph-image\">\n<div class=\"pr ps hc pt bg pu\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg lw lx c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*10ab8pEOq4_17U_lTAVjOg.png\" alt=\"\" width=\"700\" height=\"322\"><\/figure><div class=\"ph pi pv\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*10ab8pEOq4_17U_lTAVjOg.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*10ab8pEOq4_17U_lTAVjOg.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*10ab8pEOq4_17U_lTAVjOg.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*10ab8pEOq4_17U_lTAVjOg.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*10ab8pEOq4_17U_lTAVjOg.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*10ab8pEOq4_17U_lTAVjOg.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*10ab8pEOq4_17U_lTAVjOg.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*10ab8pEOq4_17U_lTAVjOg.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*10ab8pEOq4_17U_lTAVjOg.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*10ab8pEOq4_17U_lTAVjOg.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*10ab8pEOq4_17U_lTAVjOg.png 786w, 
https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*10ab8pEOq4_17U_lTAVjOg.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*10ab8pEOq4_17U_lTAVjOg.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*10ab8pEOq4_17U_lTAVjOg.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<\/figure>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"eo ep eq er es\">\n<div class=\"ab ca\">\n<div class=\"ch bg dx dy dz ea\">\n<h1 id=\"932a\" class=\"my mz ev be na nb op fv nd ne oq fy ng nh or nj nk nl os nn no np ot nr ns nt bj\" data-selectable-paragraph=\"\">Experiments with Less Data<\/h1>\n<p id=\"34c5\" class=\"pw-post-body-paragraph mb mc ev be b ft nu mf mg fw nv mi mj ou nw mm mn ov nx mq mr ow ny mu mv mw eo bj\" data-selectable-paragraph=\"\">The final experiment was to try different amounts of training data. While the first interesting finding from this work is that we can get comparable results even with models trained from scratch, the other surprising discovery is that even when there\u2019s less data, training from scratch can still yield close results to that of the fine tuned models.<\/p>\n<p id=\"76bb\" class=\"pw-post-body-paragraph mb mc ev be b ft me mf mg fw mh mi mj ou ml mm mn ov mp mq mr ow mt mu mv mw eo bj\" data-selectable-paragraph=\"\">When working with only 1\/3rd of the whole COCO training data, i.e. close to 35K images, experiments show that the fine-tune approach starts overfitting after some iterations. 
That shows that ImageNet pre-training doesn&#8217;t automatically help reduce overfitting. But despite the smaller dataset, training from scratch still catches up with the fine-tuned results.<\/p>\n<p id=\"88c2\" class=\"pw-post-body-paragraph mb mc ev be b ft me mf mg fw mh mi mj ou ml mm mn ov mp mq mr ow mt mu mv mw eo bj\" data-selectable-paragraph=\"\">When only one-tenth of the training data (close to 10k images) is used, a similar trend is observed. In the plot on the left, the model trained with the pre-train &amp; fine-tune approach starts to overfit after some iterations.<\/p>\n<p id=\"5f07\" class=\"pw-post-body-paragraph mb mc ev be b ft me mf mg fw mh mi mj ou ml mm mn ov mp mq mr ow mt mu mv mw eo bj\" data-selectable-paragraph=\"\">The plots in the middle and on the right show that training from scratch gives results close to those of the fine-tuned models. What if we try with much less data, such as one-hundredth of the entire training set? When only 1k images are used, training from scratch still converges, although slowly, but it produces much worse results. While the pre-trained models give an AP of 9.9, the from-scratch approach gives only 3.5. 
This is a sign that the model has overfitted due to a lack of data.<\/p>\n<figure class=\"lr ls lt lu lv lq ph pi paragraph-image\">\n<div class=\"pr ps hc pt bg pu\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg lw lx c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*mWkkgkeWj2rro65b6SPKqg.png\" alt=\"\" width=\"700\" height=\"299\"><\/figure><div class=\"ph pi pw\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/format:webp\/1*mWkkgkeWj2rro65b6SPKqg.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/format:webp\/1*mWkkgkeWj2rro65b6SPKqg.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/format:webp\/1*mWkkgkeWj2rro65b6SPKqg.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/format:webp\/1*mWkkgkeWj2rro65b6SPKqg.png 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/format:webp\/1*mWkkgkeWj2rro65b6SPKqg.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/format:webp\/1*mWkkgkeWj2rro65b6SPKqg.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*mWkkgkeWj2rro65b6SPKqg.png 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/1*mWkkgkeWj2rro65b6SPKqg.png 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/1*mWkkgkeWj2rro65b6SPKqg.png 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/1*mWkkgkeWj2rro65b6SPKqg.png 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/1*mWkkgkeWj2rro65b6SPKqg.png 
786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/1*mWkkgkeWj2rro65b6SPKqg.png 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/1*mWkkgkeWj2rro65b6SPKqg.png 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*mWkkgkeWj2rro65b6SPKqg.png 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<\/figure>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"eo ep eq er es\">\n<div class=\"ab ca\">\n<div class=\"ch bg dx dy dz ea\">\n<h2 id=\"e34d\" class=\"px mz ev be na py pz qa nd qb qc qd ng ou qe qf qg ov qh qi qj ow qk ql qm qn bj\" data-selectable-paragraph=\"\">Summary:<\/h2>\n<ul class=\"\">\n<li id=\"2ead\" class=\"mb mc ev be b ft nu mf mg fw nv mi mj mk nw mm mn mo nx mq mr ms ny mu mv mw nz oa ob bj\" data-selectable-paragraph=\"\">Training from scratch on target tasks is possible without architectural changes or specialized networks.<\/li>\n<li id=\"d07e\" class=\"mb mc ev be b ft oc mf mg fw od mi mj mk oe mm mn mo of mq mr ms og mu mv mw nz oa ob bj\" data-selectable-paragraph=\"\">Training from scratch requires more iterations to sufficiently converge.<\/li>\n<li id=\"9d3c\" class=\"mb mc ev be b ft oc mf mg fw od mi mj mk oe mm mn mo of mq mr ms og mu mv mw nz oa ob bj\" data-selectable-paragraph=\"\">Training from scratch can be no worse than its ImageNet pre-training counterparts under many circumstances, down to as few as 10k COCO images.<\/li>\n<li id=\"233b\" class=\"mb mc ev be b ft oc mf mg fw od mi mj mk oe mm mn mo of mq mr ms og mu mv mw nz oa ob bj\" 
data-selectable-paragraph=\"\">ImageNet pre-training speeds up convergence on the target task but does not necessarily help reduce overfitting unless we enter a very small data regime.<\/li>\n<li id=\"4013\" class=\"mb mc ev be b ft oc mf mg fw od mi mj mk oe mm mn mo of mq mr ms og mu mv mw nz oa ob bj\" data-selectable-paragraph=\"\">ImageNet pre-training helps less if the target task is more sensitive to localization than classification.<\/li>\n<li id=\"2143\" class=\"mb mc ev be b ft oc mf mg fw od mi mj mk oe mm mn mo of mq mr ms og mu mv mw nz oa ob bj\" data-selectable-paragraph=\"\">Pre-training can help with learning universal representations, but we should be careful when evaluating the pre-trained features.<\/li>\n<\/ul>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"eo ep eq er es\">\n<div class=\"ab ca\">\n<div class=\"ch bg dx dy dz ea\">\n<h1 id=\"7584\" class=\"my mz ev be na nb op fv nd ne oq fy ng nh or nj nk nl os nn no np ot nr ns nt bj\" data-selectable-paragraph=\"\">Conclusion<\/h1>\n<p id=\"47bc\" class=\"pw-post-body-paragraph mb mc ev be b ft nu mf mg fw nv mi mj ou nw mm mn ov nx mq mr ow ny mu mv mw eo bj\" data-selectable-paragraph=\"\">The paper doesn\u2019t claim that the pre-train and fine-tune approach is not recommended in any way. But the experiments show that, in some scenarios, training a model from scratch gave slightly better results than the pre-train and fine-tune approach. This means that if computation is not a constraint, then for certain scenarios and configuration settings, a model trained from scratch can give better results than a fine-tuned one.<\/p>\n<p id=\"b654\" class=\"pw-post-body-paragraph mb mc ev be b ft me mf mg fw mh mi mj ou ml mm mn ov mp mq mr ow mt mu mv mw eo bj\" data-selectable-paragraph=\"\">This is an interesting study, especially because the pre-train and fine-tune paradigm is increasingly used as a standard procedure. 
And considering where deep learning is being applied (including use cases for automobiles, health, retail, etc., where even slight improvements in accuracy can make huge differences), it\u2019s essential for research not only to aim for novel and innovative methods, but also to study existing methods in more detail. This could lead to better insights and new discoveries.<\/p>\n<blockquote class=\"ly lz ma\"><p id=\"6cff\" class=\"mb mc md be b ft me mf mg fw mh mi mj mk ml mm mn mo mp mq mr ms mt mu mv mw eo bj\" data-selectable-paragraph=\"\"><strong class=\"be mx\">Did you find this post useful? Feel free to leave any feedback\/comments. Thanks for reading!!<\/strong><\/p><\/blockquote>\n<p id=\"137a\" class=\"pw-post-body-paragraph mb mc ev be b ft me mf mg fw mh mi mj ou ml mm mn ov mp mq mr ow mt mu mv mw eo bj\" data-selectable-paragraph=\"\"><strong class=\"be mx\"><em class=\"md\">To connect: <\/em><\/strong><a class=\"af gj\" href=\"https:\/\/www.linkedin.com\/in\/avinash-kappa\/\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"be mx\"><em class=\"md\">LinkedIn<\/em><\/strong><\/a><strong class=\"be mx\"><em class=\"md\">, <\/em><\/strong><a class=\"af gj\" href=\"http:\/\/twitter.com\/avinashso13\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"be mx\"><em class=\"md\">Twitter<\/em><\/strong><\/a><strong class=\"be mx\"><em class=\"md\"> and my <\/em><\/strong><a class=\"af gj\" href=\"https:\/\/theimgclist.github.io\/\" target=\"_blank\" rel=\"noopener ugc nofollow\"><strong class=\"be mx\"><em class=\"md\">Blog<\/em><\/strong><\/a><strong class=\"be mx\"><em class=\"md\">.<\/em><\/strong><\/p>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>From the Abstract: We report competitive results on object detection and instance segmentation on the COCO dataset using standard models trained from random initialization.The results are no worse than their ImageNet pre-training counterparts, with 
the sole exception of increasing the number of training iterations so the randomly initialized models may converge. Training from random initialization [&hellip;]<\/p>\n","protected":false},"author":86,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[6],"tags":[],"coauthors":[183],"class_list":["post-7389","post","type-post","status-publish","format-standard","hentry","category-machine-learning"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Pre-Trained Machine Learning Models vs Models Trained from Scratch - Comet<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/pre-trained-machine-learning-models-vs-models-trained-from-scratch\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Pre-Trained Machine Learning Models vs Models Trained from Scratch\" \/>\n<meta property=\"og:description\" content=\"From the Abstract: We report competitive results on object detection and instance segmentation on the COCO dataset using standard models trained from random initialization.The results are no worse than their ImageNet pre-training counterparts, with the sole exception of increasing the number of training iterations so the randomly initialized models may converge. 
Training from random initialization [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/pre-trained-machine-learning-models-vs-models-trained-from-scratch\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2023-09-07T18:03:37+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-04-24T17:14:24+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/miro.medium.com\/v2\/resize:fit:2500\/1*fyM5WMsMfLjGCrB1tovshw.jpeg\" \/>\n<meta name=\"author\" content=\"Avinash\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Cometml\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Avinash\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"11 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. 
-->","yoast_head_json":{"title":"Pre-Trained Machine Learning Models vs Models Trained from Scratch - Comet","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/pre-trained-machine-learning-models-vs-models-trained-from-scratch\/","og_locale":"en_US","og_type":"article","og_title":"Pre-Trained Machine Learning Models vs Models Trained from Scratch","og_description":"From the Abstract: We report competitive results on object detection and instance segmentation on the COCO dataset using standard models trained from random initialization.The results are no worse than their ImageNet pre-training counterparts, with the sole exception of increasing the number of training iterations so the randomly initialized models may converge. Training from random initialization [&hellip;]","og_url":"https:\/\/www.comet.com\/site\/blog\/pre-trained-machine-learning-models-vs-models-trained-from-scratch\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2023-09-07T18:03:37+00:00","article_modified_time":"2025-04-24T17:14:24+00:00","og_image":[{"url":"https:\/\/miro.medium.com\/v2\/resize:fit:2500\/1*fyM5WMsMfLjGCrB1tovshw.jpeg","type":"","width":"","height":""}],"author":"Avinash","twitter_card":"summary_large_image","twitter_creator":"@Cometml","twitter_site":"@Cometml","twitter_misc":{"Written by":"Avinash","Est. 
reading time":"11 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/pre-trained-machine-learning-models-vs-models-trained-from-scratch\/#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/pre-trained-machine-learning-models-vs-models-trained-from-scratch\/"},"author":{"name":"Avinash","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/74366c178abfe23ce556820bb678b37a"},"headline":"Pre-Trained Machine Learning Models vs Models Trained from Scratch","datePublished":"2023-09-07T18:03:37+00:00","dateModified":"2025-04-24T17:14:24+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/pre-trained-machine-learning-models-vs-models-trained-from-scratch\/"},"wordCount":2156,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/pre-trained-machine-learning-models-vs-models-trained-from-scratch\/#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:2500\/1*fyM5WMsMfLjGCrB1tovshw.jpeg","articleSection":["Machine Learning"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/pre-trained-machine-learning-models-vs-models-trained-from-scratch\/","url":"https:\/\/www.comet.com\/site\/blog\/pre-trained-machine-learning-models-vs-models-trained-from-scratch\/","name":"Pre-Trained Machine Learning Models vs Models Trained from Scratch - 
Comet","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/pre-trained-machine-learning-models-vs-models-trained-from-scratch\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/pre-trained-machine-learning-models-vs-models-trained-from-scratch\/#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:2500\/1*fyM5WMsMfLjGCrB1tovshw.jpeg","datePublished":"2023-09-07T18:03:37+00:00","dateModified":"2025-04-24T17:14:24+00:00","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/pre-trained-machine-learning-models-vs-models-trained-from-scratch\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/pre-trained-machine-learning-models-vs-models-trained-from-scratch\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/pre-trained-machine-learning-models-vs-models-trained-from-scratch\/#primaryimage","url":"https:\/\/miro.medium.com\/v2\/resize:fit:2500\/1*fyM5WMsMfLjGCrB1tovshw.jpeg","contentUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:2500\/1*fyM5WMsMfLjGCrB1tovshw.jpeg"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/pre-trained-machine-learning-models-vs-models-trained-from-scratch\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"Pre-Trained Machine Learning Models vs Models Trained from Scratch"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models 
Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/74366c178abfe23ce556820bb678b37a","name":"Avinash","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/e314f1b295110907dfc5a61ee678ec26","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/09\/1553148340349-96x96.jpg","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/09\/1553148340349-96x96.jpg","caption":"Avinash"},"sameAs":["https:\/\/theimgclist.github.io\/","https:\/\/www.linkedin.com\/in\/avinash-kappa\/"],"url":"https:\/\/www.comet.com\/site\/blog\/author\/avinashgmail-com\/"}]}},"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7389","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-
json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/86"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=7389"}],"version-history":[{"count":1,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7389\/revisions"}],"predecessor-version":[{"id":15560,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7389\/revisions\/15560"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=7389"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=7389"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=7389"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=7389"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}