{"id":7059,"date":"2023-08-08T11:54:07","date_gmt":"2023-08-08T19:54:07","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=7059"},"modified":"2025-04-24T17:14:54","modified_gmt":"2025-04-24T17:14:54","slug":"deep-learning-has-a-size-problem","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/deep-learning-has-a-size-problem\/","title":{"rendered":"Deep Learning Has a Size Problem"},"content":{"rendered":"\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/deep-learning-has-a-size-problem\">\n\n\n\n<div class=\"fh fi fj fk fl\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg fo fp c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:1500\/1*W1wdSrVbZANv2W57yKyGyw.png\" alt=\"\" width=\"1500\" height=\"843\"><\/figure><div class=\"fm bg\">\n<figure class=\"fn fm bg paragraph-image\"><picture><\/picture><\/figure>\n<\/div>\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<blockquote class=\"mj mk ml\"><p id=\"78df\" class=\"mm mn mo be b gq mp mq mr gt ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh fh bj\" data-selectable-paragraph=\"\">The following is an adaptation of two talks I recently gave at the O\u2019Reilly AI Conference and DroidCon in London. 
<a class=\"af ni\" href=\"https:\/\/www.slideshare.net\/jamesontoole\/creating-smaller-faster-productionready-mobile-machine-learning-models\" target=\"_blank\" rel=\"noopener ugc nofollow\">Slides are available<\/a> at the end of this post.<\/p><\/blockquote>\n<p id=\"ca89\" class=\"pw-post-body-paragraph mm mn fs be b gq mp mq mr gt ms mt mu nj mw mx my nk na nb nc nl ne nf ng nh fh bj\" data-selectable-paragraph=\"\">Earlier this year, researchers at <a class=\"af ni\" href=\"https:\/\/nv-adlr.github.io\/MegatronLM\" target=\"_blank\" rel=\"noopener ugc nofollow\">NVIDIA announced MegatronLM<\/a>, a massive transformer model with 8.3 billion parameters (24 times larger than BERT) that achieved state-of-the-art performance on a variety of language tasks. While this was an undoubtedly impressive technical achievement, I couldn\u2019t help but ask myself: is deep learning going in the right direction?<\/p>\n<p id=\"646e\" class=\"pw-post-body-paragraph mm mn fs be b gq mp mq mr gt ms mt mu nj mw mx my nk na nb nc nl ne nf ng nh fh bj\" data-selectable-paragraph=\"\">The parameters alone weigh in at just over 33 GB on disk. Training the final model took 512 V100 GPUs running continuously for 9.2 days. Given the power requirements per card, a back-of-the-envelope estimate put the amount of energy used to train this model at over 3X the yearly energy consumption of the average American.<\/p>\n<p id=\"c30d\" class=\"pw-post-body-paragraph mm mn fs be b gq mp mq mr gt ms mt mu nj mw mx my nk na nb nc nl ne nf ng nh fh bj\" data-selectable-paragraph=\"\">I don\u2019t mean to single out this particular project. 
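(That back-of-the-envelope estimate is easy to reproduce. In the sketch below, the ~300 W sustained draw per V100 and the ~11,000 kWh annual per-capita electricity figure are my assumptions, not numbers from the announcement.)

```python
# Rough energy estimate for the MegatronLM training run described above.
# Assumed figures (not from NVIDIA): ~300 W sustained draw per V100,
# ~11,000 kWh of electricity per American per year.
gpus = 512
watts_per_gpu = 300           # assumed sustained draw per card
days = 9.2
per_capita_kwh = 11_000       # assumed annual per-capita figure

kwh = gpus * watts_per_gpu * days * 24 / 1000   # total energy in kWh
print(f"{kwh:,.0f} kWh, about {kwh / per_capita_kwh:.1f}x annual per-capita use")
```

With those assumptions the run lands at roughly 34 MWh, which is where the "over 3X" figure comes from.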
There are <a class=\"af ni\" href=\"https:\/\/medium.com\/syncedreview\/the-staggering-cost-of-training-sota-ai-models-e329e80fa82\" rel=\"noopener\">many examples<\/a> of <a class=\"af ni\" href=\"https:\/\/ai.googleblog.com\/2019\/10\/exploring-massively-multilingual.html\" target=\"_blank\" rel=\"noopener ugc nofollow\">massive models<\/a> <a class=\"af ni\" href=\"https:\/\/github.com\/facebookresearch\/FixRes\" target=\"_blank\" rel=\"noopener ugc nofollow\">being trained<\/a> to achieve ever-so-slightly higher accuracy on various benchmarks. Despite being 24X larger than BERT, MegatronLM is only 34% better at its language modeling task. As a one-off experiment to demonstrate the performance of new hardware, there isn\u2019t much harm here. But in the long term, this trend is <a class=\"af ni\" href=\"https:\/\/www.technologyreview.com\/s\/613630\/training-a-single-ai-model-can-emit-as-much-carbon-as-five-cars-in-their-lifetimes\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">going to cause a few problems<\/a>.<\/p>\n<p id=\"4d5e\" class=\"pw-post-body-paragraph mm mn fs be b gq mp mq mr gt ms mt mu nj mw mx my nk na nb nc nl ne nf ng nh fh bj\" data-selectable-paragraph=\"\">First, it hinders democratization. If we believe in a world where millions of engineers are going to use deep learning to make every application and device better, we won\u2019t get there with massive models that take large amounts of time and money to train.<\/p>\n<p id=\"651f\" class=\"pw-post-body-paragraph mm mn fs be b gq mp mq mr gt ms mt mu nj mw mx my nk na nb nc nl ne nf ng nh fh bj\" data-selectable-paragraph=\"\">Second, it restricts scale. There are probably fewer than 100 million processors across all the public and private clouds in the world. But there are already 3 billion mobile phones, 12 billion IoT devices, and 150 billion microcontrollers out there. 
In the long term, it\u2019s these small, low-power devices that will consume the most deep learning, and massive models simply won\u2019t be an option.<\/p>\n<figure class=\"np nq nr ns nt fm nm nn paragraph-image\">\n<div class=\"nu nv eb nw bg nx\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg fo fp c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*xe-7MFH9uVOfQjs7xo5nnw.png\" alt=\"\" width=\"700\" height=\"394\"><\/figure>\n<\/div>\n<\/figure>\n<p id=\"5dee\" class=\"pw-post-body-paragraph mm mn fs be b gq mp mq mr gt ms mt mu nj mw mx my nk na nb nc nl ne nf ng nh fh bj\" data-selectable-paragraph=\"\">To make sure deep learning lives up to its promise, we need to re-orient research away from state-of-the-art accuracy and towards state-of-the-art efficiency. We need to ask if models enable the largest number of people to iterate as fast as possible using the fewest resources on the most devices.<\/p>\n<p id=\"f97b\" class=\"pw-post-body-paragraph mm mn fs be b gq mp mq mr gt ms mt mu nj mw mx my nk na nb nc nl ne nf ng nh fh bj\" data-selectable-paragraph=\"\">The good news is that work is being done to make deep learning models smaller, faster, and more efficient. Early returns are incredible. 
Take, for example, one result from <a class=\"af ni\" href=\"https:\/\/arxiv.org\/abs\/1510.00149\" target=\"_blank\" rel=\"noopener ugc nofollow\">a 2015 paper by Han et al<\/a>.<\/p>\n<p id=\"4530\" class=\"pw-post-body-paragraph mm mn fs be b gq mp mq mr gt ms mt mu nj mw mx my nk na nb nc nl ne nf ng nh fh bj\" data-selectable-paragraph=\"\">\u201cOn the ImageNet dataset, our method reduced the storage required by AlexNet by 35x, from 240MB to 6.9MB, without loss of accuracy. Our method reduced the size of VGG-16 by 49x from 552MB to 11.3MB, again with no loss of accuracy.\u201d<\/p>\n<p id=\"c519\" class=\"pw-post-body-paragraph mm mn fs be b gq mp mq mr gt ms mt mu nj mw mx my nk na nb nc nl ne nf ng nh fh bj\" data-selectable-paragraph=\"\">To achieve results like this, we have to consider the entire machine learning lifecycle\u2014from model selection to training to deployment. For the rest of this article, we\u2019ll dive into those phases and look at ways to make smaller, faster, more efficient models.<\/p>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<div class=\"fh fi fj fk fl\">\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<h1 id=\"7621\" class=\"oq or fs be os ot ou gs ov ow ox gv oy oz pa pb pc pd pe pf pg ph pi pj pk pl bj\" data-selectable-paragraph=\"\"><strong class=\"al\">Model Selection<\/strong><\/h1>\n<figure class=\"np nq nr ns nt fm nm nn paragraph-image\">\n<div class=\"nu nv eb nw bg nx\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg fo fp c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*syWD6EzOjNDDjG95\" alt=\"\" width=\"700\" height=\"322\"><\/figure>\n<\/div>\n<\/figure>\n<p id=\"e4ce\" class=\"pw-post-body-paragraph mm mn fs be b gq mp mq mr gt ms mt mu nj mw mx my nk na nb nc nl 
ne nf ng nh fh bj\" data-selectable-paragraph=\"\">The best way to end up with a smaller, more efficient model is to start with one. The graph above plots the rough size (in megabytes) of various model architectures. I\u2019ve overlaid lines denoting the typical size of mobile applications (code and assets included), as well as the amount of SRAM that might be available in an embedded device.<\/p>\n<p id=\"b6db\" class=\"pw-post-body-paragraph mm mn fs be b gq mp mq mr gt ms mt mu nj mw mx my nk na nb nc nl ne nf ng nh fh bj\" data-selectable-paragraph=\"\">The logarithmic scale on the Y-axis softens the visual blow, but the unfortunate truth is that the majority of model architectures are orders of magnitude too large for deployment anywhere but the larger corners of a datacenter.<\/p>\n<p id=\"2667\" class=\"pw-post-body-paragraph mm mn fs be b gq mp mq mr gt ms mt mu nj mw mx my nk na nb nc nl ne nf ng nh fh bj\" data-selectable-paragraph=\"\">Incredibly, the smaller architectures to the right <a class=\"af ni\" href=\"https:\/\/medium.com\/@culurciello\/analysis-of-deep-neural-networks-dcf398e71aae\" rel=\"noopener\">don\u2019t perform much worse than the large ones<\/a> to the left. 
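The sizes in the chart follow almost directly from parameter counts: a float32 model weighs roughly four bytes per parameter on disk. A quick sketch, using approximate published parameter counts:

```python
# Rough on-disk size of a float32 checkpoint: parameters * 4 bytes.
# Parameter counts below are approximate published figures.
PARAMS = {
    "VGG-16": 138_000_000,
    "ResNet-50": 25_600_000,
    "MobileNetV2": 3_500_000,
    "SqueezeNet": 1_250_000,
}

def fp32_size_mb(num_params: int) -> float:
    """Approximate float32 model size in megabytes."""
    return num_params * 4 / 1e6

for name, n in PARAMS.items():
    print(f"{name:>12}: {fp32_size_mb(n):7.1f} MB")
```

VGG-16\u2019s ~138M parameters land it right at the 552MB figure quoted from Han et al earlier.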
A MobileNet model (20MB) performs about as well as an architecture like VGG-16 (300\u2013500MB), despite being nearly 25X smaller.<\/p>\n<figure class=\"np nq nr ns nt fm nm nn paragraph-image\">\n<div class=\"nu nv eb nw bg nx\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg fo fp c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*00c7M7rKYFdDN1rc\" alt=\"\" width=\"700\" height=\"469\"><\/figure>\n<\/div>\n<\/figure>\n<p id=\"13dc\" class=\"pw-post-body-paragraph mm mn fs be b gq mp mq mr gt ms mt mu nj mw mx my nk na nb nc nl ne nf ng nh fh bj\" data-selectable-paragraph=\"\">What makes smaller architectures like MobileNet and SqueezeNet so efficient? Based on experiments by <a class=\"af ni\" href=\"https:\/\/arxiv.org\/abs\/1602.07360\" target=\"_blank\" rel=\"noopener ugc nofollow\">Iandola et al<\/a> (SqueezeNet), <a class=\"af ni\" href=\"https:\/\/arxiv.org\/abs\/1905.02244\" target=\"_blank\" rel=\"noopener ugc nofollow\">Howard et al<\/a> (MobileNetV3), and <a class=\"af ni\" href=\"https:\/\/arxiv.org\/abs\/1706.05587\" target=\"_blank\" rel=\"noopener ugc nofollow\">Chen et al<\/a> (DeepLab V3), some answers lie in the macro- and micro-architectures of models.<\/p>\n<p id=\"c268\" class=\"pw-post-body-paragraph mm mn fs be b gq mp mq mr gt ms mt mu nj mw mx my nk na nb nc nl ne nf ng nh fh bj\" data-selectable-paragraph=\"\">Macro-architecture refers to the types of layers used by a model and how they are arranged into modules and blocks. 
To produce efficient macro-architectures:<\/p>\n<ul class=\"\">\n<li id=\"da24\" class=\"mm mn fs be b gq mp mq mr gt ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh po pp pq bj\" data-selectable-paragraph=\"\">Keep activation maps large by downsampling later or using atrous (dilated) convolutions<\/li>\n<li id=\"81b4\" class=\"mm mn fs be b gq pr mq mr gt ps mt mu mv pt mx my mz pu nb nc nd pv nf ng nh po pp pq bj\" data-selectable-paragraph=\"\">Use more channels, but fewer layers<\/li>\n<li id=\"4fca\" class=\"mm mn fs be b gq pr mq mr gt ps mt mu mv pt mx my mz pu nb nc nd pv nf ng nh po pp pq bj\" data-selectable-paragraph=\"\">Use skip connections and residual connections to improve accuracy and re-use parameters during calculation<\/li>\n<li id=\"2bec\" class=\"mm mn fs be b gq pr mq mr gt ps mt mu mv pt mx my mz pu nb nc nd pv nf ng nh po pp pq bj\" data-selectable-paragraph=\"\">Replace standard convolutions with <a class=\"af ni\" href=\"https:\/\/heartbeat.fritz.ai\/building-an-image-recognition-model-for-mobile-using-depthwise-convolutions-643d70e0f7e2\" target=\"_blank\" rel=\"noopener ugc nofollow\">depthwise separable ones<\/a><\/li>\n<\/ul>\n<p id=\"9ad8\" class=\"pw-post-body-paragraph mm mn fs be b gq mp mq mr gt ms mt mu nj mw mx my nk na nb nc nl ne nf ng nh fh bj\" data-selectable-paragraph=\"\">A model\u2019s micro-architecture is defined by choices related to individual layers. 
Best practices include:<\/p>\n<ul class=\"\">\n<li id=\"0c7d\" class=\"mm mn fs be b gq mp mq mr gt ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh po pp pq bj\" data-selectable-paragraph=\"\">Making input and output blocks as efficient as possible, as they are typically 15\u201325% of a model\u2019s computation cost<\/li>\n<li id=\"c023\" class=\"mm mn fs be b gq pr mq mr gt ps mt mu mv pt mx my mz pu nb nc nd pv nf ng nh po pp pq bj\" data-selectable-paragraph=\"\">Decreasing the size of convolution kernels<\/li>\n<li id=\"b2ed\" class=\"mm mn fs be b gq pr mq mr gt ps mt mu mv pt mx my mz pu nb nc nd pv nf ng nh po pp pq bj\" data-selectable-paragraph=\"\">Adding a width multiplier to control the number of channels per convolution with a hyperparameter, alpha<\/li>\n<li id=\"8d6e\" class=\"mm mn fs be b gq pr mq mr gt ps mt mu mv pt mx my mz pu nb nc nd pv nf ng nh po pp pq bj\" data-selectable-paragraph=\"\">Arranging layers so that parameters can be fused (e.g. bias and batch normalization)<\/li>\n<\/ul>\n<h1 id=\"f8a0\" class=\"oq or fs be os ot pw gs ov ow px gv oy oz py pb pc pd pz pf pg ph qa pj pk pl bj\" data-selectable-paragraph=\"\"><strong class=\"al\">Model Training<\/strong><\/h1>\n<p id=\"56b5\" class=\"pw-post-body-paragraph mm mn fs be b gq qb mq mr gt qc mt mu nj qd mx my nk qe nb nc nl qf nf ng nh fh bj\" data-selectable-paragraph=\"\">After a model architecture has been selected, there\u2019s still a lot that can be done to shrink it and make it more efficient during training. In case it wasn\u2019t already obvious, most neural networks are over-parameterized. Many trained weights have little impact on overall accuracy and can be removed. 
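A toy, framework-agnostic sketch of what removing weights means mechanically: zero out the entries with the smallest magnitudes (the helper name and values here are illustrative):

```python
# Toy illustration: "remove" the 80% of weights with the smallest
# magnitudes by setting them to zero. Zeros compress well when the
# tensor is stored in a sparse or compressed format.
def magnitude_prune(weights, fraction=0.8):
    """Zero out the given fraction of weights with the smallest |w|."""
    k = int(len(weights) * fraction)
    # indices sorted from smallest to largest magnitude
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

w = [0.01, -1.2, 0.003, 0.9, -0.02, 2.1, 0.0005, -0.4, 0.05, 1.7]
pruned = magnitude_prune(w)  # only the two largest-magnitude weights survive
```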
<a class=\"af ni\" href=\"https:\/\/arxiv.org\/abs\/1803.03635\" target=\"_blank\" rel=\"noopener ugc nofollow\">Frankle et al<\/a> find that in many networks, 80\u201390% of network weights can be removed \u2014 along with most of the precision in those weights \u2014 with little loss in accuracy.<\/p>\n<p id=\"5da7\" class=\"pw-post-body-paragraph mm mn fs be b gq mp mq mr gt ms mt mu nj mw mx my nk na nb nc nl ne nf ng nh fh bj\" data-selectable-paragraph=\"\">There are three main strategies for finding and removing these parameters: knowledge distillation, pruning, and quantization. They can be applied together or separately.<\/p>\n<h2 id=\"6834\" class=\"qg or fs be os qh qi qj ov qk ql qm oy nj qn qo qp nk qq qr qs nl qt qu qv qw bj\" data-selectable-paragraph=\"\"><strong class=\"al\">Knowledge Distillation<\/strong><\/h2>\n<figure class=\"np nq nr ns nt fm nm nn paragraph-image\">\n<div class=\"nu nv eb nw bg nx\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg fo fp c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*5NzPZ8U76AMid5cv\" alt=\"\" width=\"700\" height=\"317\"><\/figure>\n<\/div>\n<\/figure>\n<p id=\"5b92\" class=\"pw-post-body-paragraph mm mn fs be b gq mp mq mr gt ms mt mu nj mw mx my nk na nb nc nl ne nf ng nh fh bj\" data-selectable-paragraph=\"\">Knowledge distillation uses a larger \u201cteacher\u201d model to train a smaller \u201cstudent\u201d model. 
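In skeletal form, the training objective combines a hard-label loss with a term that pushes the student toward the teacher\u2019s softened score distribution. A pure-Python sketch; the function names, temperature, and blending weight are illustrative, and the usual T-squared scaling of the soft term is omitted for clarity:

```python
import math

# Sketch of a two-term distillation loss (names are illustrative,
# not from any particular library).
def softmax(logits, temperature=1.0):
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, true_class,
                      temperature=4.0, alpha=0.5):
    # Hard term: cross-entropy of the student against the ground-truth label.
    hard = -math.log(softmax(student_logits)[true_class])
    # Soft term: cross-entropy against the teacher's temperature-softened
    # distribution, encouraging the student to match it across all classes.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))
    return alpha * hard + (1 - alpha) * soft

loss = distillation_loss([2.0, 0.5, -1.0], [3.0, 1.0, -2.0], true_class=0)
```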
<a class=\"af ni\" href=\"https:\/\/arxiv.org\/abs\/1503.02531\" target=\"_blank\" rel=\"noopener ugc nofollow\">Popularized by Hinton et al in 2015<\/a>, the technique rests on two loss terms: one comparing the student\u2019s predictions against the ground-truth hard labels, and a second pushing the student to reproduce the teacher\u2019s distribution of scores across all output classes.<\/p>\n<p id=\"fe5a\" class=\"pw-post-body-paragraph mm mn fs be b gq mp mq mr gt ms mt mu nj mw mx my nk na nb nc nl ne nf ng nh fh bj\" data-selectable-paragraph=\"\"><a class=\"af ni\" href=\"https:\/\/arxiv.org\/abs\/1802.05668v1\" target=\"_blank\" rel=\"noopener ugc nofollow\">Polino et al<\/a> were able to achieve a 46X reduction in size for ResNet models trained on CIFAR10 with only 10% loss in accuracy, and a 2X reduction in size on ImageNet with only a 2% loss in accuracy. More recently, <a class=\"af ni\" href=\"https:\/\/arxiv.org\/abs\/1909.10351v2\" target=\"_blank\" rel=\"noopener ugc nofollow\">Jiao et al<\/a> distilled BERT to create TinyBERT: 7.5X smaller, 9.4X faster, and only 3% less accurate. There are a few great open source libraries with implementations of distillation frameworks, including <a class=\"af ni\" href=\"https:\/\/github.com\/NervanaSystems\/distiller\" target=\"_blank\" rel=\"noopener ugc nofollow\">Distiller<\/a> and <a class=\"af ni\" href=\"https:\/\/github.com\/huggingface\/transformers\/tree\/master\/examples\/distillation\" target=\"_blank\" rel=\"noopener ugc nofollow\">Distil* for transformers<\/a>.<\/p>\n<h2 id=\"f4ae\" class=\"qg or fs be os qh qi qj ov qk ql qm oy nj qn qo qp nk qq qr qs nl qt qu qv qw bj\" data-selectable-paragraph=\"\"><strong class=\"al\">Pruning<\/strong><\/h2>\n<p id=\"7bd7\" class=\"pw-post-body-paragraph mm mn fs be b gq qb mq mr gt qc mt mu nj qd mx my nk qe nb nc nl qf nf ng nh fh bj\" data-selectable-paragraph=\"\">The second technique to shrink models is pruning. 
Pruning involves assessing the importance of weights in a model and removing those that contribute the least to overall model accuracy. Pruning can be done at multiple scales in a network. The smallest models are achieved by pruning at the individual weight level. Weights with small magnitudes are set to zero. When models are compressed or stored in a sparse format, these zeros are very efficient to store.<\/p>\n<p id=\"b595\" class=\"pw-post-body-paragraph mm mn fs be b gq mp mq mr gt ms mt mu nj mw mx my nk na nb nc nl ne nf ng nh fh bj\" data-selectable-paragraph=\"\"><a class=\"af ni\" href=\"https:\/\/arxiv.org\/abs\/1506.02626\" target=\"_blank\" rel=\"noopener ugc nofollow\">Han et al<\/a> use this approach to shrink common computer vision architectures by 9\u201313X with negligible changes in accuracy. Unfortunately, a lack of support for fast sparse matrix operations means that weight-level pruning doesn\u2019t also increase runtime speeds.<\/p>\n<p id=\"c3cc\" class=\"pw-post-body-paragraph mm mn fs be b gq mp mq mr gt ms mt mu nj mw mx my nk na nb nc nl ne nf ng nh fh bj\" data-selectable-paragraph=\"\">To create models that are both smaller and faster, pruning needs to be done at filter or layer levels\u2014for example, removing the filters of a convolution layer that contribute least to overall prediction accuracy. Models pruned at the filter level aren\u2019t quite as small but are typically faster. 
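A toy sketch of filter-level pruning: score each filter by the L1 norm of its weights, then keep only the highest-scoring ones. Ranking by L1 norm is the criterion Li et al use; the filter values below are made up:

```python
# Filter-level pruning sketch: rank conv filters by the L1 norm of
# their weights and drop the lowest-ranked ones. Toy values only.
def filter_l1_norms(filters):
    """filters: list of filters, each a flat list of weights."""
    return [sum(abs(w) for w in f) for f in filters]

def prune_filters(filters, keep):
    norms = filter_l1_norms(filters)
    ranked = sorted(range(len(filters)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])           # preserve original filter order
    return [filters[i] for i in kept]

conv = [[0.1, -0.1, 0.2],      # L1 norm 0.4
        [1.0, -0.8, 0.5],      # L1 norm 2.3
        [0.01, 0.02, -0.01],   # L1 norm 0.04
        [0.6, 0.7, -0.9]]      # L1 norm 2.2
pruned = prune_filters(conv, keep=2)  # keeps the two highest-norm filters
```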
<a class=\"af ni\" href=\"https:\/\/arxiv.org\/abs\/1608.08710\" target=\"_blank\" rel=\"noopener ugc nofollow\">Li et al<\/a> were able to reduce the size and runtime of a VGG model by 34% with no loss in accuracy using this technique.<\/p>\n<p id=\"9729\" class=\"pw-post-body-paragraph mm mn fs be b gq mp mq mr gt ms mt mu nj mw mx my nk na nb nc nl ne nf ng nh fh bj\" data-selectable-paragraph=\"\">Finally, it\u2019s worth noting that <a class=\"af ni\" href=\"https:\/\/arxiv.org\/abs\/1810.05270v2\" target=\"_blank\" rel=\"noopener ugc nofollow\">Liu et al<\/a> have shown mixed results as to whether it\u2019s better to start from a larger model and prune or to train a smaller model from scratch.<\/p>\n<h2 id=\"1a0d\" class=\"qg or fs be os qh qi qj ov qk ql qm oy nj qn qo qp nk qq qr qs nl qt qu qv qw bj\" data-selectable-paragraph=\"\"><strong class=\"al\">Quantization<\/strong><\/h2>\n<figure class=\"np nq nr ns nt fm nm nn paragraph-image\">\n<div class=\"nu nv eb nw bg nx\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg fo fp c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*iOZv9VyOCg9Xwb6c\" alt=\"\" width=\"700\" height=\"170\"><\/figure>\n<\/div><figcaption class=\"qz ra rb nm nn rc rd be b bf z dv\" data-selectable-paragraph=\"\"><a class=\"af ni\" href=\"https:\/\/medium.com\/@kaustavtamuly\/compressing-and-accelerating-high-dimensional-neural-networks-6b501983c0c8\" rel=\"noopener\">source<\/a><\/figcaption><\/figure>\n<p id=\"1831\" class=\"pw-post-body-paragraph mm mn fs be b gq mp mq mr gt ms mt mu nj mw mx my nk na nb nc nl ne nf ng nh fh bj\" data-selectable-paragraph=\"\">After a model has been trained, it needs to be prepared for deployment. Here, too, there are techniques to squeeze even more optimizations out of a model. 
Typically, the weights of a model are stored as 32-bit floating point numbers, but for most applications this is far more precision than necessary. We can save space and (sometimes) time by quantizing these weights, again with minimal impact on accuracy.<\/p>\n<p id=\"a48b\" class=\"pw-post-body-paragraph mm mn fs be b gq mp mq mr gt ms mt mu nj mw mx my nk na nb nc nl ne nf ng nh fh bj\" data-selectable-paragraph=\"\">Quantization maps each floating point weight to a fixed-precision integer containing fewer bits than the original. While there are a number of quantization techniques, the two most important factors are the bit depth of the final model and whether weights are quantized during or after training (quantization-aware training and post-training quantization, respectively).<\/p>\n<p id=\"51b4\" class=\"pw-post-body-paragraph mm mn fs be b gq mp mq mr gt ms mt mu nj mw mx my nk na nb nc nl ne nf ng nh fh bj\" data-selectable-paragraph=\"\">Finally, it\u2019s important to quantize both weights and activations to speed up model runtime. Activation functions are mathematical operations that will naturally produce floating point numbers. 
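As a minimal sketch of the weight side, here is a naive affine quantizer that maps a float tensor onto 8-bit integers with a scale and an offset. This is an illustration only, not any framework\u2019s actual scheme:

```python
# Naive post-training affine quantization: map float weights onto
# integers in [0, 2**bits - 1], then dequantize to inspect the error.
def quantize(weights, bits=8):
    lo, hi = min(weights), max(weights)
    qmax = 2 ** bits - 1
    scale = (hi - lo) / qmax or 1.0            # guard against constant input
    q = [round((w - lo) / scale) for w in weights]   # ints in [0, qmax]
    dequant = [qi * scale + lo for qi in q]          # reconstructed floats
    return q, dequant, scale

w = [-0.42, 0.0, 0.13, 0.37, -0.05]
q, approx, scale = quantize(w)
# Rounding error per weight is bounded by half the quantization step.
max_err = max(abs(a - b) for a, b in zip(w, approx))
```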
If these functions aren\u2019t modified to produce quantized outputs, models can actually run slower, because values must be converted between integer and floating point representations at runtime.<\/p>\n<p id=\"004f\" class=\"pw-post-body-paragraph mm mn fs be b gq mp mq mr gt ms mt mu nj mw mx my nk na nb nc nl ne nf ng nh fh bj\" data-selectable-paragraph=\"\">In a fantastic review, <a class=\"af ni\" href=\"https:\/\/arxiv.org\/abs\/1806.08342\" target=\"_blank\" rel=\"noopener ugc nofollow\">Krishnamoorthi<\/a> tests a number of quantization schemes and configurations to provide a set of best practices:<\/p>\n<p id=\"1dea\" class=\"pw-post-body-paragraph mm mn fs be b gq mp mq mr gt ms mt mu nj mw mx my nk na nb nc nl ne nf ng nh fh bj\" data-selectable-paragraph=\"\"><strong class=\"be re\">Results<\/strong>:<\/p>\n<ul class=\"\">\n<li id=\"dce5\" class=\"mm mn fs be b gq mp mq mr gt ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh po pp pq bj\" data-selectable-paragraph=\"\">Post-training quantization can generally be applied down to 8 bits, resulting in 4X smaller models with &lt;2% accuracy loss<\/li>\n<li id=\"1e0c\" class=\"mm mn fs be b gq pr mq mr gt ps mt mu mv pt mx my mz pu nb nc nd pv nf ng nh po pp pq bj\" data-selectable-paragraph=\"\">Quantization-aware training allows a reduction of bit depth to 4 or 2 bits (8\u201316X smaller models) with minimal accuracy loss<\/li>\n<li id=\"4719\" class=\"mm mn fs be b gq pr mq mr gt ps mt mu mv pt mx my mz pu nb nc nd pv nf ng nh po pp pq bj\" data-selectable-paragraph=\"\">Quantizing weights and activations can result in a 2\u20133X speed increase on CPUs<\/li>\n<\/ul>\n<h1 id=\"25cc\" class=\"oq or fs be os ot pw gs ov ow px gv oy oz py pb pc pd pz pf pg ph qa pj pk pl bj\" data-selectable-paragraph=\"\"><strong class=\"al\">Deployment<\/strong><\/h1>\n<p id=\"39f3\" class=\"pw-post-body-paragraph mm mn fs be b gq qb mq mr gt qc mt mu nj qd mx my nk qe nb nc nl qf nf ng nh fh bj\" data-selectable-paragraph=\"\">A common thread among these techniques is that they generate a continuum of
models, each with different shapes, sizes, and accuracies. While this creates a bit of a management and organization problem, it maps nicely onto the wide variety of hardware and software conditions models will face in the wild.<\/p>\n<figure class=\"np nq nr ns nt fm nm nn paragraph-image\">\n<div class=\"nu nv eb nw bg nx\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg fo fp c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*FFFZJWI9OZa4u8lv\" alt=\"\" width=\"700\" height=\"328\"><\/figure><div class=\"nm nn rf\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*FFFZJWI9OZa4u8lv 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*FFFZJWI9OZa4u8lv 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*FFFZJWI9OZa4u8lv 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*FFFZJWI9OZa4u8lv 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*FFFZJWI9OZa4u8lv 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*FFFZJWI9OZa4u8lv 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*FFFZJWI9OZa4u8lv 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*FFFZJWI9OZa4u8lv 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*FFFZJWI9OZa4u8lv 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*FFFZJWI9OZa4u8lv 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*FFFZJWI9OZa4u8lv 786w, 
https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*FFFZJWI9OZa4u8lv 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*FFFZJWI9OZa4u8lv 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*FFFZJWI9OZa4u8lv 1400w\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<\/figure>\n<p id=\"6ede\" class=\"pw-post-body-paragraph mm mn fs be b gq mp mq mr gt ms mt mu nj mw mx my nk na nb nc nl ne nf ng nh fh bj\" data-selectable-paragraph=\"\">The graph above shows the runtime speed of a MobileNetV2 model across various smartphones. There can be an 80X speed difference between the lowest and highest end devices. In order to deliver users a consistent experience, it\u2019s important to put the right model on the right device. 
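A minimal sketch of what that routing might look like, with entirely hypothetical benchmark thresholds and model file names (no real deployment uses exactly these values):

```python
# Hypothetical routing table mapping device capability to a model variant.
# Thresholds and file names are illustrative, not from any real deployment.
MODEL_VARIANTS = [
    (300.0, "style_transfer_full_fp16.tflite"),    # flagship phones (GPU/DSP)
    (100.0, "style_transfer_pruned_int8.tflite"),  # mid-range devices
    (0.0, "style_transfer_tiny_int8.tflite"),      # low-end devices
]

def pick_model(benchmark_score: float) -> str:
    """Return the largest model variant a device can run at the target latency."""
    for min_score, model_file in MODEL_VARIANTS:
        if benchmark_score >= min_score:
            return model_file
    return MODEL_VARIANTS[-1][1]  # fall back to the smallest model

print(pick_model(150.0))  # -> style_transfer_pruned_int8.tflite
```

The score here stands in for whatever capability signal is available, such as an on-device micro-benchmark or a lookup by device model.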
This means training multiple models and deploying them to different devices based on available resources.<\/p>\n<p id=\"fb3b\" class=\"pw-post-body-paragraph mm mn fs be b gq mp mq mr gt ms mt mu nj mw mx my nk na nb nc nl ne nf ng nh fh bj\" data-selectable-paragraph=\"\">Typically, the best on-device performance is achieved by:<\/p>\n<ul class=\"\">\n<li id=\"f608\" class=\"mm mn fs be b gq mp mq mr gt ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh po pp pq bj\" data-selectable-paragraph=\"\">Using native formats and frameworks (e.g. Core ML on iOS and TFLite on Android)<\/li>\n<li id=\"4a1e\" class=\"mm mn fs be b gq pr mq mr gt ps mt mu mv pt mx my mz pu nb nc nd pv nf ng nh po pp pq bj\" data-selectable-paragraph=\"\">Leveraging any available accelerators like GPUs or DSPs by using supported operations only<\/li>\n<li id=\"af83\" class=\"mm mn fs be b gq pr mq mr gt ps mt mu mv pt mx my mz pu nb nc nd pv nf ng nh po pp pq bj\" data-selectable-paragraph=\"\">Monitoring performance across devices, identifying model bottlenecks, and iterating architectures for specific hardware<\/li>\n<\/ul>\n<h1 id=\"abce\" class=\"oq or fs be os ot pw gs ov ow px gv oy oz py pb pc pd pz pf pg ph qa pj pk pl bj\" data-selectable-paragraph=\"\"><strong class=\"al\">Putting it all together<\/strong><\/h1>\n<p id=\"1ac8\" class=\"pw-post-body-paragraph mm mn fs be b gq qb mq mr gt qc mt mu nj qd mx my nk qe nb nc nl qf nf ng nh fh bj\" data-selectable-paragraph=\"\">By applying these techniques, it\u2019s possible to shrink and speed up most models by at least an order of magnitude.
To quote just a few papers discussed thus far:<\/p>\n<ul class=\"\">\n<li id=\"26b3\" class=\"mm mn fs be b gq mp mq mr gt ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh po pp pq bj\" data-selectable-paragraph=\"\">\u201cTinyBERT is empirically effective and achieves comparable results with BERT in GLUE datasets, while being 7.5x smaller and 9.4x faster on inference.\u201d \u2014 <a class=\"af ni\" href=\"https:\/\/arxiv.org\/abs\/1909.10351\" target=\"_blank\" rel=\"noopener ugc nofollow\">Jiao et al<\/a><\/li>\n<li id=\"b1ff\" class=\"mm mn fs be b gq pr mq mr gt ps mt mu mv pt mx my mz pu nb nc nd pv nf ng nh po pp pq bj\" data-selectable-paragraph=\"\">\u201cOur method reduced the size of VGG-16 by 49x from 552MB to 11.3MB, again with no loss of accuracy.\u201d \u2014 <a class=\"af ni\" href=\"https:\/\/arxiv.org\/abs\/1510.00149v5\" target=\"_blank\" rel=\"noopener ugc nofollow\">Han et al<\/a><\/li>\n<li id=\"7a3c\" class=\"mm mn fs be b gq pr mq mr gt ps mt mu mv pt mx my mz pu nb nc nd pv nf ng nh po pp pq bj\" data-selectable-paragraph=\"\">\u201cThe model itself takes up less than 20KB of Flash storage space \u2026 and it only needs 30KB of RAM to operate.\u201d \u2014 <a class=\"af ni\" href=\"https:\/\/petewarden.com\/2019\/03\/07\/launching-tensorflow-lite-for-microcontrollers\/\" target=\"_blank\" rel=\"noopener ugc nofollow\">Pete Warden at TensorFlow Dev Summit 2019<\/a><\/li>\n<\/ul>\n<p id=\"7b9d\" class=\"pw-post-body-paragraph mm mn fs be b gq mp mq mr gt ms mt mu nj mw mx my nk na nb nc nl ne nf ng nh fh bj\" data-selectable-paragraph=\"\">To prove that it can be done by mere mortals, I took the liberty of <a class=\"af ni\" href=\"https:\/\/heartbeat.fritz.ai\/creating-a-17kb-style-transfer-model-with-layer-pruning-and-quantization-864d7cc53693\" target=\"_blank\" rel=\"noopener ugc nofollow\">creating a tiny 17KB style transfer model<\/a> that contains just 11,686 parameters, yet still produces results that look as good as a 1.6 million
parameter model.<\/p>\n<figure class=\"np nq nr ns nt fm nm nn paragraph-image\">\n<div class=\"nu nv eb nw bg nx\" tabindex=\"0\" role=\"button\">\n<figure><img loading=\"lazy\" decoding=\"async\" class=\"bg fo fp c\" role=\"presentation\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*6XCx7B4zirWZUtQZ\" alt=\"\" width=\"700\" height=\"312\"><\/figure><div class=\"nm nn rg\"><picture><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*6XCx7B4zirWZUtQZ 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*6XCx7B4zirWZUtQZ 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*6XCx7B4zirWZUtQZ 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*6XCx7B4zirWZUtQZ 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*6XCx7B4zirWZUtQZ 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*6XCx7B4zirWZUtQZ 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*6XCx7B4zirWZUtQZ 1400w\" type=\"image\/webp\" sizes=\"(min-resolution: 4dppx) and (max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\"><source srcset=\"https:\/\/miro.medium.com\/v2\/resize:fit:640\/0*6XCx7B4zirWZUtQZ 640w, https:\/\/miro.medium.com\/v2\/resize:fit:720\/0*6XCx7B4zirWZUtQZ 720w, https:\/\/miro.medium.com\/v2\/resize:fit:750\/0*6XCx7B4zirWZUtQZ 750w, https:\/\/miro.medium.com\/v2\/resize:fit:786\/0*6XCx7B4zirWZUtQZ 786w, https:\/\/miro.medium.com\/v2\/resize:fit:828\/0*6XCx7B4zirWZUtQZ 828w, https:\/\/miro.medium.com\/v2\/resize:fit:1100\/0*6XCx7B4zirWZUtQZ 1100w, https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*6XCx7B4zirWZUtQZ 1400w\" sizes=\"(min-resolution: 4dppx) and 
(max-width: 700px) 50vw, (-webkit-min-device-pixel-ratio: 4) and (max-width: 700px) 50vw, (min-resolution: 3dppx) and (max-width: 700px) 67vw, (-webkit-min-device-pixel-ratio: 3) and (max-width: 700px) 65vw, (min-resolution: 2.5dppx) and (max-width: 700px) 80vw, (-webkit-min-device-pixel-ratio: 2.5) and (max-width: 700px) 80vw, (min-resolution: 2dppx) and (max-width: 700px) 100vw, (-webkit-min-device-pixel-ratio: 2) and (max-width: 700px) 100vw, 700px\" data-testid=\"og\"><\/picture><\/div>\n<\/div>\n<figcaption class=\"qz ra rb nm nn rc rd be b bf z dv\" data-selectable-paragraph=\"\">Left: Original image. Middle: Stylized image from our small, 17KB model. Right: Stylized image from the larger 7MB model.<\/figcaption>\n<\/figure>\n<p id=\"0d73\" class=\"pw-post-body-paragraph mm mn fs be b gq mp mq mr gt ms mt mu nj mw mx my nk na nb nc nl ne nf ng nh fh bj\" data-selectable-paragraph=\"\">I am consistently floored that results like this are easily achievable, yet aren\u2019t done as a standard process in every paper. If we don\u2019t change our practices, I worry we\u2019ll waste time, money, and resources, while failing to bring deep learning to applications and devices that could benefit from it.<\/p>\n<p id=\"3025\" class=\"pw-post-body-paragraph mm mn fs be b gq mp mq mr gt ms mt mu nj mw mx my nk na nb nc nl ne nf ng nh fh bj\" data-selectable-paragraph=\"\">The good news, though, is that the marginal benefits of bigger models seem to be falling, and thanks to the techniques outlined here, we can make optimizations to size and speed that don\u2019t sacrifice much accuracy.
We can have our cake and eat it, too.<\/p>\n<h1 id=\"46c4\" class=\"oq or fs be os ot pw gs ov ow px gv oy oz py pb pc pd pz pf pg ph qa pj pk pl bj\" data-selectable-paragraph=\"\"><strong class=\"al\">Some open questions about what\u2019s next<\/strong><\/h1>\n<p id=\"6df5\" class=\"pw-post-body-paragraph mm mn fs be b gq qb mq mr gt qc mt mu nj qd mx my nk qe nb nc nl qf nf ng nh fh bj\" data-selectable-paragraph=\"\">Thus far, I believe we\u2019ve only scratched the surface of what\u2019s possible in terms of model optimization. With more research and experimentation, I think it\u2019s possible to go even further. To that end, here are some areas that I think are ripe for additional work:<\/p>\n<ul class=\"\">\n<li id=\"bb2f\" class=\"mm mn fs be b gq mp mq mr gt ms mt mu mv mw mx my mz na nb nc nd ne nf ng nh po pp pq bj\" data-selectable-paragraph=\"\">Better framework support for quantized operations and quantization-aware training<\/li>\n<li id=\"da3e\" class=\"mm mn fs be b gq pr mq mr gt ps mt mu mv pt mx my mz pu nb nc nd pv nf ng nh po pp pq bj\" data-selectable-paragraph=\"\">A more rigorous study of model optimization vs. task complexity<\/li>\n<li id=\"4cf0\" class=\"mm mn fs be b gq pr mq mr gt ps mt mu mv pt mx my mz pu nb nc nd pv nf ng nh po pp pq bj\" data-selectable-paragraph=\"\">Additional work to determine the usefulness of <a class=\"af ni\" href=\"https:\/\/arxiv.org\/abs\/1807.11626\" target=\"_blank\" rel=\"noopener ugc nofollow\">platform-aware neural architecture search<\/a><\/li>\n<li id=\"6ba9\" class=\"mm mn fs be b gq pr mq mr gt ps mt mu mv pt mx my mz pu nb nc nd pv nf ng nh po pp pq bj\" data-selectable-paragraph=\"\">Continued investment in a multi-level intermediate representation (<a class=\"af ni\" href=\"https:\/\/ai.google\/research\/pubs\/pub48035\" target=\"_blank\" rel=\"noopener ugc nofollow\">MLIR<\/a>)<\/li>\n<\/ul>\n<h1 id=\"bb5a\" class=\"oq or fs be os ot pw gs ov ow px gv oy oz py pb pc pd pz pf pg ph qa pj pk pl
bj\" data-selectable-paragraph=\"\">Additional Resources<\/h1>\n<ul class=\"\">\n<li id=\"0bf3\" class=\"mm mn fs be b gq qb mq mr gt qc mt mu mv qd mx my mz qe nb nc nd qf nf ng nh po pp pq bj\" data-selectable-paragraph=\"\"><a class=\"af ni\" href=\"https:\/\/github.com\/NervanaSystems\/distiller\" target=\"_blank\" rel=\"noopener ugc nofollow\">Distiller<\/a> \u2014 A library for optimizing PyTorch models<\/li>\n<li id=\"b22d\" class=\"mm mn fs be b gq pr mq mr gt ps mt mu mv pt mx my mz pu nb nc nd pv nf ng nh po pp pq bj\" data-selectable-paragraph=\"\"><a class=\"af ni\" href=\"https:\/\/www.tensorflow.org\/model_optimization\" target=\"_blank\" rel=\"noopener ugc nofollow\">TensorFlow Model Optimization Toolkit<\/a><\/li>\n<li id=\"68cc\" class=\"mm mn fs be b gq pr mq mr gt ps mt mu mv pt mx my mz pu nb nc nd pv nf ng nh po pp pq bj\" data-selectable-paragraph=\"\"><a class=\"af ni\" href=\"https:\/\/github.com\/keras-team\/keras-tuner\" target=\"_blank\" rel=\"noopener ugc nofollow\">Keras Tuner<\/a> \u2014 Hyperparameter optimization for Keras<\/li>\n<li id=\"20e9\" class=\"mm mn fs be b gq pr mq mr gt ps mt mu mv pt mx my mz pu nb nc nd pv nf ng nh po pp pq bj\" data-selectable-paragraph=\"\"><a class=\"af ni\" href=\"https:\/\/tinymlsummit.org\/#home\" target=\"_blank\" rel=\"noopener ugc nofollow\">TinyML<\/a> \u2014 Group dedicated to embedded ML<\/li>\n<\/ul>\n<div class=\"fh fi fj fk fl\">\n<div class=\"ab ca\">\n<div class=\"ch bg et eu ev ew\">\n<figure class=\"np nq nr ns nt fm\">\n<div class=\"rh iw l eb\">\n<div class=\"ri rj l\"><iframe loading=\"lazy\" class=\"ek n fc dx bg\" title=\"Creating smaller, faster, production-ready mobile machine learning models.\" 
src=\"https:\/\/cdn.embedly.com\/widgets\/media.html?src=https%3A%2F%2Fwww.slideshare.net%2Fslideshow%2Fembed_code%2Fkey%2FK1PuSPCL2PLPV7&amp;url=https%3A%2F%2Fwww.slideshare.net%2Fjamesontoole%2Fcreating-smaller-faster-productionready-mobile-machine-learning-models&amp;image=https%3A%2F%2Fcdn.slidesharecdn.com%2Fss_thumbnails%2Foreillyailondon20191-191029193817-thumbnail-4.jpg%3Fcb%3D1572378063&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=slideshare\" width=\"600\" height=\"500\" frameborder=\"0\" scrolling=\"no\" allowfullscreen=\"allowfullscreen\" data-mce-fragment=\"1\"><\/iframe><\/div>\n<\/div>\n<\/figure>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>The following is an adaptation of two talks I recently gave at the O\u2019Reilly AI Conference and DroidCon in London. Slides are available at the end of this post. Earlier this year, researchers at NVIDIA announced MegatronLM, a massive transformer model with 8.3 billion parameters (24 times larger than BERT) that achieved state-of-the-art performance on [&hellip;]<\/p>\n","protected":false},"author":69,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","footnotes":""},"categories":[6],"tags":[],"coauthors":[167],"class_list":["post-7059","post","type-post","status-publish","format-standard","hentry","category-machine-learning"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Deep Learning Has a Size Problem - Comet<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" 
href=\"https:\/\/www.comet.com\/site\/blog\/deep-learning-has-a-size-problem\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Deep Learning Has a Size Problem\" \/>\n<meta property=\"og:description\" content=\"The following is an adaptation of two talks I recently gave at the O\u2019Reilly AI Conference and DroidCon in London. Slides are available at the end of this post. Earlier this year, researchers at NVIDIA announced MegatronLM, a massive transformer model with 8.3 billion parameters (24 times larger than BERT) that achieved state-of-the-art performance on [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/deep-learning-has-a-size-problem\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2023-08-08T19:54:07+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-04-24T17:14:54+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/miro.medium.com\/v2\/resize:fit:1500\/1*W1wdSrVbZANv2W57yKyGyw.png\" \/>\n<meta name=\"author\" content=\"Jameson Toole\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Cometml\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Jameson Toole\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"11 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. 
-->","yoast_head_json":{"title":"Deep Learning Has a Size Problem - Comet","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/deep-learning-has-a-size-problem\/","og_locale":"en_US","og_type":"article","og_title":"Deep Learning Has a Size Problem","og_description":"The following is an adaptation of two talks I recently gave at the O\u2019Reilly AI Conference and DroidCon in London. Slides are available at the end of this post. Earlier this year, researchers at NVIDIA announced MegatronLM, a massive transformer model with 8.3 billion parameters (24 times larger than BERT) that achieved state-of-the-art performance on [&hellip;]","og_url":"https:\/\/www.comet.com\/site\/blog\/deep-learning-has-a-size-problem\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2023-08-08T19:54:07+00:00","article_modified_time":"2025-04-24T17:14:54+00:00","og_image":[{"url":"https:\/\/miro.medium.com\/v2\/resize:fit:1500\/1*W1wdSrVbZANv2W57yKyGyw.png","type":"","width":"","height":""}],"author":"Jameson Toole","twitter_card":"summary_large_image","twitter_creator":"@Cometml","twitter_site":"@Cometml","twitter_misc":{"Written by":"Jameson Toole","Est. 
reading time":"11 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/deep-learning-has-a-size-problem\/#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/deep-learning-has-a-size-problem\/"},"author":{"name":"Jameson Toole","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/bd1d8200ec883a28460980cb71cc3386"},"headline":"Deep Learning Has a Size Problem","datePublished":"2023-08-08T19:54:07+00:00","dateModified":"2025-04-24T17:14:54+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/deep-learning-has-a-size-problem\/"},"wordCount":2050,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/deep-learning-has-a-size-problem\/#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:1500\/1*W1wdSrVbZANv2W57yKyGyw.png","articleSection":["Machine Learning"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/deep-learning-has-a-size-problem\/","url":"https:\/\/www.comet.com\/site\/blog\/deep-learning-has-a-size-problem\/","name":"Deep Learning Has a Size Problem - 
Comet","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/deep-learning-has-a-size-problem\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/deep-learning-has-a-size-problem\/#primaryimage"},"thumbnailUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:1500\/1*W1wdSrVbZANv2W57yKyGyw.png","datePublished":"2023-08-08T19:54:07+00:00","dateModified":"2025-04-24T17:14:54+00:00","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/deep-learning-has-a-size-problem\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/deep-learning-has-a-size-problem\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/deep-learning-has-a-size-problem\/#primaryimage","url":"https:\/\/miro.medium.com\/v2\/resize:fit:1500\/1*W1wdSrVbZANv2W57yKyGyw.png","contentUrl":"https:\/\/miro.medium.com\/v2\/resize:fit:1500\/1*W1wdSrVbZANv2W57yKyGyw.png"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/deep-learning-has-a-size-problem\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"Deep Learning Has a Size Problem"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, 
Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/bd1d8200ec883a28460980cb71cc3386","name":"Jameson Toole","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/a0991bebec29a5f5147c9620233a2612","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/08\/1631128623025-96x96.jpg","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2023\/08\/1631128623025-96x96.jpg","caption":"Jameson 
Toole"},"url":"https:\/\/www.comet.com\/site\/blog\/author\/jamesontoolegmail-com\/"}]}},"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7059","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/69"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=7059"}],"version-history":[{"count":1,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7059\/revisions"}],"predecessor-version":[{"id":15587,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/7059\/revisions\/15587"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=7059"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=7059"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=7059"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=7059"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}