{"id":2296,"date":"2021-04-28T17:58:34","date_gmt":"2021-04-29T01:58:34","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/blog\/how-to-10x-throughput-when-serving-hugging-face-models-without-a-gpu\/"},"modified":"2021-04-28T17:58:34","modified_gmt":"2021-04-29T01:58:34","slug":"how-to-10x-throughput-when-serving-hugging-face-models-without-a-gpu","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/how-to-10x-throughput-when-serving-hugging-face-models-without-a-gpu\/","title":{"rendered":"How to 10x Throughput When Serving Hugging Face Models Without a GPU"},"content":{"rendered":"\n<blockquote class=\"wp-block-quote is-style-large is-layout-flow wp-block-quote-is-layout-flow\">\n<p>In less than 50 lines of code, you can deploy a Bert-like model from the Hugging Face library and achieve over 100 requests per second with latencies below 100 milliseconds for less than $250 a\u00a0month.<\/p>\n<\/blockquote>\n\n\n\n<p><em>The code for this blog post available here: <\/em><a href=\"https:\/\/github.com\/comet-ml\/blog-serving-hugging-face-models\">https:\/\/github.com\/comet-ml\/blog-serving-hugging-face-models<\/a>.<\/p>\n\n\n\n<p>Simple models and simple inference pipelines are much more likely to generate business value than complex approaches. When it comes to deploying NLP models, nothing is as simple as creating a FastAPI server to make real-time predictions.<\/p>\n\n\n\n<p>While GPU accelerated inference has its place, this blog post will focus on how to optimize your CPU inference service to achieve sub 100 millisecond latency and over 100 requests per second throughput. One key advantage of using a Python inference service rather than more complex GPU accelerated deployment options is that we will be able to have the tokenization built-in further reducing the complexity of the deployment.<\/p>\n\n\n\n<p>In order to achieve good performance for CPU inference we need to make optimisations to our serving framework. 
We break the post down into:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Benchmarking setup<\/strong><\/li>\n<li><strong>Baseline:<\/strong> FastAPI web server using default options<\/li>\n<li><strong>PyTorch and FastAPI optimizations:<\/strong> Tuning FastAPI for ML inference<\/li>\n<li><strong>Model optimizations:<\/strong> Using model distillation and quantization to improve performance<\/li>\n<li><strong>Hardware optimization:<\/strong> 3x performance improvement by choosing the right cloud instances to use<\/li>\n<\/ol>\n\n\n\n<h2 class=\"wp-block-heading\">Benchmarking setup<\/h2>\n\n\n\n<p>Benchmarks are notoriously difficult [1]; we highly recommend you create your own based on your specific requirements. We provide all the code used to reproduce the numbers presented below on GitHub <a href=\"https:\/\/github.com\/comet-ml\/blog-serving-hugging-face-models\">here<\/a>.<\/p>\n\n\n\n<p>As we can\u2019t test everything, we have had to make a number of decisions:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>We will be using GCP; similar results can be expected on other cloud providers<\/li>\n<li>We will not be implementing batching on prediction requests<\/li>\n<li>Each user we simulate sends as many requests as they can; as soon as they get a response, they send another request<\/li>\n<li>The input request to our model is a string with between 45 and 55 words (~3 sentences); if your input text is longer, latencies will increase.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Baseline<\/h2>\n\n\n\n<p><em>The code for the baseline inference service is available on GitHub <\/em><a href=\"https:\/\/github.com\/comet-ml\/blog-serving-hugging-face-models\/tree\/main\/python-api\/baseline\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\" (opens in a new tab)\"><em>here<\/em><\/a><em>.<\/em><\/p>\n\n\n\n<p>The baseline approach relies on the default parameters for FastAPI, PyTorch and Hugging Face. 
As we start optimising these libraries for our inference task, we will be able to compare the impact on the performance metrics.<\/p>\n\n\n\n<p>Our baseline approach will use:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Machine: GCP e2-standard-4 = 4 virtual CPUs\u200a\u2014\u200a16 GB memory [2]<\/li>\n<li>Inference service: FastAPI service with default Gunicorn arguments<\/li>\n<li>Model: Hugging Face implementation of Bert [3]<\/li>\n<\/ul>\n\n\n\n<p>Thanks to the awesome work of both the Hugging Face and FastAPI teams, we can create an API in just a few lines of code:<\/p>\n\n\n\n<p><script src=\"https:\/\/gist.github.com\/jacques-comet\/b4cec23c927270894a301f66700d80fa.js\"><\/script><\/p>\n\n\n\n<p>We can then start the FastAPI server using: <code>gunicorn main:app<\/code><\/p>\n\n\n\n<p>Using this approach, we obtain the following performance metrics:<\/p>\n\n\n\n<p><iframe loading=\"lazy\" src=\"https:\/\/sheetsu.com\/tables\/e3e9731426\" width=\"100%\" height=\"100\" frameborder=\"0\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/p>\n\n\n\n<p style=\"color: #000000ad; font-size: 0.8em;\"><em>* for this benchmark we used 2 concurrent users in the load testing software<\/em><\/p>\n\n\n\n<p><strong>Learnings<\/strong><\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>A simple Python API can serve up to 6 predictions a second; that is over 15 million predictions a month!<\/p>\n<\/blockquote>\n\n\n\n<h3 class=\"wp-block-heading\">PyTorch and FastAPI optimizations<\/h3>\n\n\n\n<p><em>The code for the PyTorch and FastAPI optimized inference service is available on GitHub <\/em><a href=\"https:\/\/github.com\/comet-ml\/blog-serving-hugging-face-models\/tree\/main\/python-api\/model-optimised\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\" (opens in a new tab)\"><em>here<\/em><\/a><em>.<\/em><\/p>\n\n\n\n<p>In the baseline server we used the default configuration settings for both PyTorch and 
FastAPI; by making some small changes, we can increase throughput by 25%.<\/p>\n\n\n\n<p>Most of these optimisations come from a great blog post by the Roblox team on how they scaled Bert to 1 billion requests a day [4].<\/p>\n\n\n\n<p><strong>Changes to PyTorch configuration:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code>torch.set_grad_enabled(False)<\/code>\u00a0: During inference we don\u2019t need to compute the gradients<\/li>\n<li><code>torch.set_num_threads(1)<\/code>\u00a0: We would like to configure the parallelism using Gunicorn workers rather than through PyTorch. This will maximise CPU usage<\/li>\n<\/ul>\n\n\n\n<p><strong>Changes to FastAPI configuration:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Turn off asynchronous processing: Our application is CPU-bound and therefore asynchronous processing can hurt performance [needs reference]<\/li>\n<li><code>gunicorn main:app --workers $NB_WORKERS<\/code>\u00a0: Load a separate model in each worker; each worker uses one CPU, so we can process requests in parallel<\/li>\n<\/ul>\n\n\n\n<p>In order to understand the impact of these changes, we run a couple of benchmarks with the same number of concurrent users as we used for the baseline approach:<\/p>\n\n\n\n<p><iframe loading=\"lazy\" src=\"https:\/\/sheetsu.com\/tables\/01a3b8428b\" width=\"100%\" height=\"310\" frameborder=\"0\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/p>\n\n\n\n<p style=\"color: #000000ad; font-size: 0.8em;\"><em>* for this benchmark we used 2 concurrent users in the load testing software<\/em><\/p>\n\n\n\n<p>Looking at the benchmark above, we find that having the same number of workers as we have CPU cores is a good rule of thumb when configuring Gunicorn. Going forward, we will use this rule of thumb for all machine types.<\/p>\n\n\n\n<p>By making some small changes to the way our models are served, we have achieved a 25% increase in throughput compared to our baseline. 
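The PyTorch-side changes described above amount to two lines at the top of the server module. A minimal sketch (the worker count shown in the comment is an assumption, following the one-worker-per-core rule of thumb):

```python
import torch

# Inference only: gradients are never needed, so disable autograd
# bookkeeping globally. This saves memory and time on every forward pass.
torch.set_grad_enabled(False)

# One intra-op thread per process: parallelism comes from running one
# Gunicorn worker per CPU core, not from PyTorch's internal thread pool.
torch.set_num_threads(1)

# The server is then started with one worker per core, e.g. on a
# 4-vCPU machine:
#   gunicorn main:app --workers 4
```

Defining the endpoint with a plain `def` (rather than `async def`) lets FastAPI run it in a threadpool, which is the behaviour we want for a CPU-bound handler.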
In addition, both the median latency and the 95th percentile latency have decreased.<\/p>\n\n\n\n<p><strong>Learnings<\/strong><\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>When serving ML models, we should not be using PyTorch parallelism or FastAPI asynchronous processes and instead manage the parallelism using Gunicorn workers.<\/p>\n<\/blockquote>\n\n\n\n<h3 class=\"wp-block-heading\">Model Optimizations<\/h3>\n\n\n\n<p><em>The code for the Model optimized inference service is available on GitHub <\/em><a href=\"https:\/\/github.com\/comet-ml\/blog-serving-hugging-face-models\/tree\/main\/python-api\/model-optimised\" target=\"_blank\" rel=\"noreferrer noopener\" aria-label=\" (opens in a new tab)\"><em>here<\/em><\/a><em>.<\/em><\/p>\n\n\n\n<p>While Bert is a very versatile model, it is also a large model. In order to decrease latency and improve throughput, there are two main strategies we can use:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Distillation: Use Bert to train a smaller model that mimics the outcome of Bert [5]<\/li>\n<li>Quantization: Reduce the size of the weights by converting them from float32 to 8-bit integers [6]<\/li>\n<\/ul>\n\n\n\n<p>While both options will improve inference latency, they will impact the accuracy of the model. We haven\u2019t looked into the impact on accuracy, but we can expect the drop in accuracy to be small [7].<\/p>\n\n\n\n<p>Moving from Bert to a distilled version of Bert is very straightforward given we are using Hugging Face: all we need to do is change <code>BertForSequenceClassification.from_pretrained('bert-base-uncased')<\/code> to <code>DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')<\/code>.<\/p>\n\n\n\n<p>When using PyTorch, quantization is very easy to implement; all we need to do is call <code>model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)<\/code>. 
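The `quantize_dynamic` call above can be exercised end-to-end on any module containing `torch.nn.Linear` layers. A self-contained sketch using a toy two-layer model (a stand-in for DistilBert, so it runs without downloading any weights):

```python
import torch

# Toy stand-in for a transformer classification head: dynamic
# quantization targets the nn.Linear layers, converting their weights
# from float32 to int8 at load time and quantizing activations on the fly.
model = torch.nn.Sequential(
    torch.nn.Linear(768, 768),
    torch.nn.ReLU(),
    torch.nn.Linear(768, 2),
)
model.eval()

quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# The quantized model is a drop-in replacement for inference.
x = torch.randn(1, 768)
with torch.no_grad():
    out = quantized(x)
print(out.shape)  # torch.Size([1, 2])
```

Dynamic quantization needs no calibration data, which is why it is the easiest of PyTorch's quantization modes to bolt onto an existing serving stack.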
For TensorFlow models, quantization is not as straightforward: you have to use either TensorFlow Lite or TensorRT, which are much more temperamental. For this benchmark, we will use the PyTorch version of the model.<\/p>\n\n\n\n<p><iframe loading=\"lazy\" src=\"https:\/\/sheetsu.com\/tables\/c57bc2d5d4\" width=\"100%\" height=\"190\" frameborder=\"0\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/p>\n\n\n\n<p style=\"color: #000000ad; font-size: 0.8em;\"><em>* for this benchmark we used 2 concurrent users in the load testing software<\/em><\/p>\n\n\n\n<p><strong>Learnings<\/strong><\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>Using quantization and distillation leads to a 3x increase in throughput and a 3x reduction in latency<\/p>\n<\/blockquote>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Hardware optimization<\/strong><\/h3>\n\n\n\n<p><em>The code for the hardware optimized inference service is available <\/em><a href=\"https:\/\/github.com\/comet-ml\/comet-deployment-blog\/tree\/main\/python-api\/model-hardware-optimised\" target=\"_blank\" rel=\"noreferrer noopener\"><em>here<\/em><\/a><em>.<\/em><\/p>\n\n\n\n<p>The hardware used to make the inference can also have a big impact on performance; for example, having more CPUs allows us to process more concurrent requests.<\/p>\n\n\n\n<p>In addition, recent versions of Intel CPUs include optimisations for ML inference thanks to the newly released Intel Deep Learning Boost [4].<\/p>\n\n\n\n<p>To understand the impact of this new instruction set, we run a new set of benchmarks using the <code>Compute Optimized<\/code> machines on GCP running the new generation of Intel CPUs:<\/p>\n\n\n\n<p><iframe loading=\"lazy\" src=\"https:\/\/sheetsu.com\/tables\/47c8b1a316\" width=\"100%\" height=\"210\" frameborder=\"0\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/p>\n\n\n\n<p style=\"color: #000000ad; font-size: 0.8em;\"><em>* for this benchmark the number of 
concurrent users was equal to the number of vCPUs<\/em><\/p>\n\n\n\n<p><strong>Learnings<\/strong><\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>By optimizing the hardware we use to run our ML inference server, we can triple throughput and decrease latency by 30%<\/p>\n<\/blockquote>\n\n\n\n<h3 class=\"wp-block-heading\">Conclusion<\/h3>\n\n\n\n<p>Our baseline inference server could make up to 6 predictions per second with each prediction taking around 320 milliseconds.<\/p>\n\n\n\n<p>By optimising how we make predictions, using quantization and distillation, and choosing the right hardware, we created an inference service that can make up to 68 predictions per second, with each prediction taking about 60 milliseconds!<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>By optimizing our Python inference service, we have increased throughput by a factor of 10 (to 70 requests per second) and divided latency by 5 (to 60 milliseconds)!<\/p>\n<\/blockquote>\n\n\n\n<p>If you would like to optimise your serving framework further, check out the series that Hugging Face have released: <a href=\"https:\/\/huggingface.co\/blog\/bert-cpu-scaling-part-1\" target=\"_blank\" rel=\"noreferrer noopener\">Scaling up BERT-like model Inference on modern CPU<\/a><\/p>\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n<h4 class=\"wp-block-heading\">References:<\/h4>\n\n\n\n<p>[0]: <a href=\"https:\/\/github.com\/comet-ml\/comet-deployment-blog\" target=\"_blank\" rel=\"noreferrer noopener\">Code used for these benchmarks<\/a><\/p>\n\n\n\n<p>[1]: <a href=\"https:\/\/jbd.dev\/benchmarks-are-hard\/#:~:text=Benchmarking%20generally%20mean%20producing%20some,Optimizing%20costly%20workloads.\" target=\"_blank\" rel=\"noreferrer noopener\">Benchmarks are hard<\/a><\/p>\n\n\n\n<p>[2]: <a href=\"https:\/\/cloud.google.com\/compute\/docs\/machine-types#e2_machine_types\" 
target=\"_blank\" rel=\"noreferrer noopener\">GCP instance types<\/a><\/p>\n\n\n\n<p>[3]: <a href=\"https:\/\/huggingface.co\/transformers\/model_doc\/bert.html\" target=\"_blank\" rel=\"noreferrer noopener\">Hugging Face implementation of Bert<\/a><\/p>\n\n\n\n<p>[4]: <a href=\"https:\/\/blog.roblox.com\/2020\/05\/scaled-bert-serve-1-billion-daily-requests-cpus\/\" target=\"_blank\" rel=\"noreferrer noopener\">How We Scaled Bert To Serve 1+ Billion Daily Requests on CPUs<\/a><\/p>\n\n\n\n<p>[5]: <a href=\"https:\/\/arxiv.org\/abs\/1910.01108\" target=\"_blank\" rel=\"noreferrer noopener\">DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter<\/a><\/p>\n\n\n\n<p>[6]: <a href=\"https:\/\/pytorch.org\/docs\/stable\/quantization.html#modules-that-provide-quantization-functions-and-classes\" target=\"_blank\" rel=\"noreferrer noopener\">PyTorch quantization<\/a><\/p>\n\n\n\n<p>[7]: <a href=\"https:\/\/pytorch.org\/blog\/introduction-to-quantization-on-pytorch\/\" target=\"_blank\" rel=\"noreferrer noopener\">Introduction to Quantization on PyTorch<\/a><\/p>\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n<h2 class=\"wp-block-heading\"><em>Want to stay in the loop?\u00a0<a href=\"https:\/\/info.comet.ml\/newsletter-signup\/?utm_campaign=tensorboard-integration&amp;utm_source=blog&amp;utm_medium=CTA\">Subscribe to the Comet Newsletter<\/a>\u00a0for weekly insights and perspective on the latest ML news, projects, and more.<\/em><\/h2>\n","protected":false},"excerpt":{"rendered":"<p>By optimising how a model is served, we serve over 100 predictions per second with a simple Python API using CPU 
inference<\/p>\n","protected":false},"author":1,"featured_media":2297,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","footnotes":""},"categories":[7],"tags":[],"coauthors":[129],"class_list":["post-2296","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-tutorials"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>How to 10x Throughput When Serving Hugging Face Models Without a GPU - Comet<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/how-to-10x-throughput-when-serving-hugging-face-models-without-a-gpu\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"How to 10x Throughput When Serving Hugging Face Models Without a GPU\" \/>\n<meta property=\"og:description\" content=\"By optimising how a model is served, we serve over 100 predictions per second with a simply Python API using CPU inference\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/how-to-10x-throughput-when-serving-hugging-face-models-without-a-gpu\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2021-04-29T01:58:34+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/charlotte-coneybeer-L9VXW4A9QZM-unsplash-scaled-1.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"2560\" \/>\n\t<meta 
property=\"og:image:height\" content=\"1707\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Jacques Verre\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Cometml\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Jacques Verre\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"7 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"How to 10x Throughput When Serving Hugging Face Models Without a GPU - Comet","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/how-to-10x-throughput-when-serving-hugging-face-models-without-a-gpu\/","og_locale":"en_US","og_type":"article","og_title":"How to 10x Throughput When Serving Hugging Face Models Without a GPU","og_description":"By optimising how a model is served, we serve over 100 predictions per second with a simply Python API using CPU inference","og_url":"https:\/\/www.comet.com\/site\/blog\/how-to-10x-throughput-when-serving-hugging-face-models-without-a-gpu\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2021-04-29T01:58:34+00:00","og_image":[{"width":2560,"height":1707,"url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/charlotte-coneybeer-L9VXW4A9QZM-unsplash-scaled-1.jpg","type":"image\/jpeg"}],"author":"Jacques Verre","twitter_card":"summary_large_image","twitter_creator":"@Cometml","twitter_site":"@Cometml","twitter_misc":{"Written by":"Jacques Verre","Est. 
reading time":"7 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/how-to-10x-throughput-when-serving-hugging-face-models-without-a-gpu\/#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/how-to-10x-throughput-when-serving-hugging-face-models-without-a-gpu\/"},"author":{"name":"engineering@atre.net","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/550ac35e8e821db8064c5bd1f0a04e6b"},"headline":"How to 10x Throughput When Serving Hugging Face Models Without a GPU","datePublished":"2021-04-29T01:58:34+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/how-to-10x-throughput-when-serving-hugging-face-models-without-a-gpu\/"},"wordCount":1291,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/how-to-10x-throughput-when-serving-hugging-face-models-without-a-gpu\/#primaryimage"},"thumbnailUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/charlotte-coneybeer-L9VXW4A9QZM-unsplash-scaled-1.jpg","articleSection":["Tutorials"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/how-to-10x-throughput-when-serving-hugging-face-models-without-a-gpu\/","url":"https:\/\/www.comet.com\/site\/blog\/how-to-10x-throughput-when-serving-hugging-face-models-without-a-gpu\/","name":"How to 10x Throughput When Serving Hugging Face Models Without a GPU - 
Comet","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/how-to-10x-throughput-when-serving-hugging-face-models-without-a-gpu\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/how-to-10x-throughput-when-serving-hugging-face-models-without-a-gpu\/#primaryimage"},"thumbnailUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/charlotte-coneybeer-L9VXW4A9QZM-unsplash-scaled-1.jpg","datePublished":"2021-04-29T01:58:34+00:00","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/how-to-10x-throughput-when-serving-hugging-face-models-without-a-gpu\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/how-to-10x-throughput-when-serving-hugging-face-models-without-a-gpu\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/how-to-10x-throughput-when-serving-hugging-face-models-without-a-gpu\/#primaryimage","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/charlotte-coneybeer-L9VXW4A9QZM-unsplash-scaled-1.jpg","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2022\/06\/charlotte-coneybeer-L9VXW4A9QZM-unsplash-scaled-1.jpg","width":2560,"height":1707,"caption":"turbo"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/how-to-10x-throughput-when-serving-hugging-face-models-without-a-gpu\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"How to 10x Throughput When Serving Hugging Face Models Without a GPU"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models 
Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/550ac35e8e821db8064c5bd1f0a04e6b","name":"engineering@atre.net","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/027c18177377edf459980f0cfb83706c","url":"https:\/\/secure.gravatar.com\/avatar\/d002a459a297e0d1779329318029aee19868c312b3e1f3c9ec9b3e3add2740de?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/d002a459a297e0d1779329318029aee19868c312b3e1f3c9ec9b3e3add2740de?s=96&d=mm&r=g","caption":"engineering@atre.net"},"sameAs":["https:\/\/live-cometml.pantheonsite.io"],"url":"https:\/\/www.comet.com\/site\/blog\/author\/engineeringatre-net\/"}]}},"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/2296","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/p
osts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=2296"}],"version-history":[{"count":0,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/2296\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media\/2297"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=2296"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=2296"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=2296"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=2296"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}