
How to 10x Throughput When Serving Hugging Face Models Without a GPU

In less than 50 lines of code, you can deploy a Bert-like model from the Hugging Face library and achieve over 100 requests per second with latencies below 100 milliseconds for less than $250 a month.

The code for this blog post is available on GitHub [0].

Simple models and simple inference pipelines are much more likely to generate business value than complex approaches. When it comes to deploying NLP models, nothing is as simple as creating a FastAPI server to make real-time predictions.

While GPU-accelerated inference has its place, this blog post focuses on how to optimize your CPU inference service to achieve sub-100 millisecond latency and a throughput of over 100 requests per second. One key advantage of using a Python inference service rather than more complex GPU-accelerated deployment options is that tokenization is built in, which further reduces the complexity of the deployment.

To achieve good performance for CPU inference, we need to optimise our serving framework. We break the post down into:

  1. Benchmarking setup
  2. Baseline: FastAPI web server using default options
  3. PyTorch and FastAPI optimizations: Tuning FastAPI for ML inference
  4. Model optimizations: Using model distillation and quantisation to improve performance
  5. Hardware optimization: 3x performance improvement by choosing the right cloud instances to use

Benchmarking setup

Benchmarks are notoriously difficult [1], so we highly recommend you create your own based on your specific requirements. We provide all the code used to reproduce the numbers presented below on GitHub [0].

As we can’t test everything, we have had to make a number of decisions:

  • We will be using GCP; similar results can be expected on other cloud providers
  • We will not be implementing batching on prediction requests
  • Each user we simulate sends as many requests as they can: as soon as they get a response, they send another request
  • The input request to our model is a string of between 45 and 55 words (~3 sentences); if your input text is longer, latencies will increase
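The closed-loop behaviour described above can be sketched with the standard library alone. This is not the load testing software used for the benchmarks; the helper names (`make_input`, `run_user`, `run_load_test`), the word list, and the `fake_send` stand-in for an HTTP call are all hypothetical.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical vocabulary; any words work, only the count matters here.
WORDS = ["model", "latency", "request", "server", "token", "batch", "cpu", "worker"]


def make_input(rng: random.Random) -> str:
    """Build a request body of 45 to 55 words, matching the benchmark input size."""
    n_words = rng.randint(45, 55)
    return " ".join(rng.choice(WORDS) for _ in range(n_words))


def run_user(send, duration_s: float, rng: random.Random) -> list[float]:
    """Closed-loop user: send a request, wait for the response, then send the next."""
    latencies = []
    deadline = time.perf_counter() + duration_s
    while time.perf_counter() < deadline:
        start = time.perf_counter()
        send(make_input(rng))  # blocks until the "response" comes back
        latencies.append(time.perf_counter() - start)
    return latencies


def run_load_test(send, n_users: int, duration_s: float, seed: int = 0) -> list[float]:
    """Run n_users closed-loop users in parallel threads and pool their latencies."""
    with ThreadPoolExecutor(max_workers=n_users) as pool:
        futures = [
            pool.submit(run_user, send, duration_s, random.Random(seed + i))
            for i in range(n_users)
        ]
        return [latency for future in futures for latency in future.result()]


def fake_send(text: str) -> None:
    """Stand-in for an HTTP POST to the inference service (assumption)."""
    time.sleep(0.01)


latencies = run_load_test(fake_send, n_users=2, duration_s=0.2)
print(f"{len(latencies)} requests completed, {len(latencies) / 0.2:.0f} requests/s")
```

Because each simulated user waits for a response before sending again, throughput in this setup is bounded by the number of users divided by the latency per request.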


Baseline

The code for the baseline inference service is available on GitHub here.

The baseline approach relies on the default parameters for FastAPI, PyTorch and Hugging Face. As we start optimising these libraries for our inference task, we will be able to compare the impact on the performance metrics.

Our baseline approach will use:

  • Machine: GCP e2-standard-4 = 4 virtual CPUs — 16 GB memory [2]
  • Inference service: FastAPI service with default Gunicorn arguments
  • Model: Hugging Face implementation of Bert [3]

Thanks to the awesome work of both the Hugging Face and FastAPI teams, we can create an API in just a few lines of code:

We can then start the FastAPI server using: gunicorn main:app

Using this approach we obtain the following performance metrics:

* for this benchmark we used 2 concurrent users in the load testing software


A simple Python API can serve up to 6 predictions a second, which is over 15 million predictions a month!
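The monthly figure follows from simple arithmetic, assuming a 30-day month and a service that sustains its peak rate around the clock. The function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_MONTH = 30  # assumption: a 30-day month


def predictions_per_month(requests_per_second: float) -> int:
    """Predictions served in a month at a constant request rate."""
    return int(requests_per_second * SECONDS_PER_DAY * DAYS_PER_MONTH)


print(f"{predictions_per_month(6):,}")  # 15,552,000
```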

PyTorch and FastAPI optimizations

The code for the PyTorch and FastAPI optimized inference service is available on GitHub here.

In the baseline server we used the default configuration settings for both PyTorch and FastAPI. By making some small changes we can increase throughput by 25%.

Most of these optimisations come from a really great blog post by the Roblox team on how they scaled Bert to 1 billion requests a day [4].

Changes to PyTorch configuration:

  • torch.set_grad_enabled(False) : During inference we don’t need to compute the gradients
  • torch.set_num_threads(1) : We would like to configure the parallelism using Gunicorn workers rather than through PyTorch. This will maximise CPU usage
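As a rough illustration, both settings are process-wide and would typically be applied once at import time in each worker, before the model handles any request. The small `Linear` layer below is only a stand-in for the real model.

```python
import torch

# Process-wide settings: apply them once, before serving any request.
torch.set_grad_enabled(False)  # no autograd bookkeeping during inference
torch.set_num_threads(1)       # one intra-op thread; Gunicorn workers provide the parallelism

# Stand-in for the real model (assumption): a single linear layer.
layer = torch.nn.Linear(8, 2)
output = layer(torch.randn(1, 8))

print(output.requires_grad)     # False: no gradient graph was recorded
print(torch.get_num_threads())  # 1
```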

Changes to FastAPI configuration:

  • Turn off asynchronous processing: Our application is CPU bound and therefore asynchronous processing can hurt performance [needs reference]
  • gunicorn main:app --workers $NB_WORKERS : Load a new model for each worker that will each use one CPU so that we can process requests in parallel
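One way to keep the worker count next to the code is a Gunicorn configuration file, which is plain Python. The sketch below assumes it is passed with `gunicorn -c gunicorn.conf.py main:app`. The bind address is hypothetical, and the ASGI worker class is an assumption about how the FastAPI app is served, not something taken from the repository.

```python
# gunicorn.conf.py
import multiprocessing
import os

# One worker per CPU core unless NB_WORKERS overrides it.
workers = int(os.environ.get("NB_WORKERS", multiprocessing.cpu_count()))

# Assumption: FastAPI is an ASGI application, so an ASGI-capable worker class is used.
worker_class = "uvicorn.workers.UvicornWorker"

# Hypothetical address; adjust to your deployment.
bind = "0.0.0.0:8000"
```

Each worker is a separate process with its own copy of the model, so memory use grows roughly linearly with the number of workers.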

In order to understand the impact of these changes, we run a couple of benchmarks with the same number of concurrent users as we used for the baseline approach:

* for this benchmark we used 2 concurrent users in the load testing software

Looking at the benchmark above, we find that having the same number of workers as we have CPU cores is a good rule of thumb when configuring Gunicorn. Going forward we will be using this rule of thumb for all machine types.

By making some small changes to the way our models are served, we have achieved a 25% increase in throughput compared to our baseline. In addition, both the median latency and the 95th percentile latency have decreased.
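For readers building their own benchmark, the median and 95th percentile can be computed from raw latency samples with a nearest-rank percentile. The sketch below uses made-up sample values, not measurements from this post, and the load testing software may use a slightly different percentile definition.

```python
import math


def percentile(samples, p: int):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up latencies in milliseconds, for illustration only.
latencies_ms = [310, 295, 330, 420, 305, 315, 300, 325, 290, 600]

print(percentile(latencies_ms, 50))  # 310
print(percentile(latencies_ms, 95))  # 600
```

The 95th percentile is dominated by the slowest requests, which is why it is reported alongside the median.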


When serving ML models, we should not be using PyTorch parallelism or FastAPI asynchronous processes and instead manage the parallelism using Gunicorn workers.

Model Optimizations

The code for the Model optimized inference service is available on GitHub here.

While Bert is a very versatile model, it is also a large model. In order to decrease latency and improve throughput there are two main strategies we can use:

  • Distillation: Use Bert to train a smaller model that mimics the outcome of Bert [5]
  • Quantization: Reduce the size of the weights by converting them from float32 to 8 bit integers [6]

While both options will improve inference latency, they will also impact the accuracy of the model. We haven’t looked into the impact on accuracy, but we can expect the drop in accuracy to be small [7].

Moving from Bert to a distilled version of Bert is very straightforward given we are using Hugging Face: all we need to do is change BertForSequenceClassification.from_pretrained('bert-base-uncased') to DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased').

When using PyTorch, quantization is very easy to implement: all we need to do is call model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8). For TensorFlow models, quantization is not as straightforward, as you have to use either TensorFlow Lite or TensorRT, which is much more temperamental. For this benchmark, we will use the PyTorch version of the model.
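To show the quantization call in isolation, the sketch below applies it to a small stand-in network rather than to DistilBERT, so it runs without downloading any weights. The layer sizes are arbitrary assumptions. With a Hugging Face model, the same call would be applied to the loaded model object.

```python
import torch

# Stand-in for the transformer (assumption): two linear layers with a ReLU in between.
model = torch.nn.Sequential(
    torch.nn.Linear(768, 768),
    torch.nn.ReLU(),
    torch.nn.Linear(768, 2),
)
model.eval()

# Replace every nn.Linear with a dynamically quantized version using int8 weights.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 768)
with torch.no_grad():
    print(model(x).shape, quantized(x).shape)  # same output shape before and after

print(type(quantized[0]).__module__)  # the Linear layers now come from a quantized module
```

Only the `Linear` layers are converted; other layers, such as the `ReLU` here, are left untouched, and activations are quantized on the fly at inference time.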

* for this benchmark we used 2 concurrent users in the load testing software


Using quantization and distillation roughly triples throughput and cuts latency to about a third.

Hardware optimization

The code for the hardware optimized inference service is available here.

The hardware used to make the inference can also have a big impact on performance. Having more CPUs, for example, allows us to process more concurrent requests.

In addition, recent versions of Intel CPUs include optimisations for ML inference thanks to the newly released Intel Deep Learning Boost [4].
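Intel Deep Learning Boost is exposed through the AVX-512 VNNI instructions, which Linux reports as the `avx512_vnni` CPU flag. A quick way to check whether a given instance has it, assuming a Linux machine with /proc/cpuinfo, is sketched below. The helper names are hypothetical.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the CPU feature flags listed on the first 'flags' line of /proc/cpuinfo."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            return set(line.split(":", 1)[1].split())
    return set()


def has_vnni(cpuinfo_text: str) -> bool:
    """True if the CPU advertises AVX-512 VNNI."""
    return "avx512_vnni" in cpu_flags(cpuinfo_text)


cpuinfo = Path("/proc/cpuinfo")
if cpuinfo.exists():  # Linux only; other platforms are skipped
    print("AVX-512 VNNI available:", has_vnni(cpuinfo.read_text()))
```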

To understand the impact of this new instruction set, we run a new set of benchmarks using the Compute Optimized machines on GCP running the new generation of Intel CPUs:

* for this benchmark the number of concurrent users was equal to the number of vCPUs


By optimizing the hardware we use to run our ML inference server, we can triple throughput (a 3x improvement) and decrease latency by 30%.


Our baseline inference server could make up to 6 predictions per second with each prediction taking around 320 milliseconds.

By optimising how we made the predictions, utilizing quantization and distillation, and changing the hardware used, we created an inference service that could make up to 68 predictions per second, with each prediction taking about 60 milliseconds!
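These figures can be sanity-checked against each other. In a closed-loop test, where each user waits for a response before sending the next request, throughput is approximately the number of concurrent users divided by the latency. The sketch below applies that relationship to the numbers quoted above. The function names are hypothetical.

```python
def closed_loop_throughput(concurrent_users: int, latency_s: float) -> float:
    """Requests per second when each user sends the next request right after a response."""
    return concurrent_users / latency_s


def implied_concurrency(throughput_rps: float, latency_s: float) -> float:
    """Average number of requests in flight for a given throughput and latency."""
    return throughput_rps * latency_s


# Baseline: 2 concurrent users at about 320 ms per prediction.
print(round(closed_loop_throughput(2, 0.320), 2))  # 6.25

# Final service: 68 predictions per second at about 60 ms each.
print(round(implied_concurrency(68, 0.060), 2))  # 4.08
```

The baseline result matches the reported 6 predictions per second, and the final figures imply about four requests in flight at any time.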

By optimizing our Python inference service, we have increased throughput by a factor of 10 (to 70 requests per second) and divided latency by 5 (to 60 milliseconds)!

If you would like to optimise your serving framework further, check out the series that Hugging Face have released: Scaling up BERT-like model Inference on modern CPU


[0]: Code used for these benchmarks

[1]: Benchmark are hard

[2]: GCP instance types

[3]: Hugging Face implementation of Bert

[4]: How We Scaled Bert To Serve 1+ Billion Daily Requests on CPUs

[5]: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

[6]: PyTorch quantization

[7]: Introduction to Quantization on PyTorch

Jacques Verre | Comet ML

Jacques is a technical product manager leading the development of Comet MPM (Model Production Monitoring). Prior to joining Comet, Jacques founded the machine learning monitoring startup Stakion based on his experience building and deploying real-time models in the Fintech space.