
Building a fully reproducible machine learning pipeline with Comet and Quilt

Classifying fruits using a Keras multi-class image classification model and Google Open Images


Photo by Luke Michael on Unsplash

This post was written in collaboration with Aleksey Bilogur from the Quilt Data team. Follow Aleksey on Twitter and his personal website here. Follow Quilt here.

The term machine learning ‘pipeline’ can suggest a one-way flow of data and transformations, but in reality, machine learning pipelines are cyclical and iterative. For a given project, a data scientist can try hundreds or even thousands of experiments before arriving at a champion model to put in production.

With each iteration, it becomes harder to manage subsets and variations of your data and models. Keeping track of which model iteration ran on which dataset is key to reproducibility.

Having a proper machine learning pipeline that tracks specific versions of data, code, and environment details can not only help you easily reproduce your own model results, but also allow you to share your work with fellow data scientists or machine learning engineers who need to deploy your model.

In this article, we’ll show you how to build a simple and reproducible end-to-end machine learning pipeline using a Keras multi-class image classification model and a custom dataset crafted from Google Open Images using Quilt T4 and Comet.

You can access the full tutorial in this GitHub repository. For a walk-through of the tutorial, continue reading below ⬇️.

Creating your custom dataset

The Open Images Dataset is an attractive target for building image recognition algorithms because it is one of the largest, most accurate, and most easily accessible image recognition datasets. For image recognition tasks, Open Images contains 15 million bounding boxes for 600 categories of objects on 1.75 million images. Image labeling tasks meanwhile enjoy 30 million labels across almost 20,000 categories.

The images come from Flickr and are of highly variable quality, as would be realistic in an applied machine learning setting.

Downloading the entire Google Open Images corpus is possible and potentially necessary if you want to build a general purpose image classifier or bounding box algorithm. However, downloading everything is a waste if you just want a small categorical subset of the data in the corpus. For this tutorial, we are just interested in downloading and working with fruit images.

View an interactive version of this plot on Quilt T4 here

The src/openimager subfolder in the GitHub repository provided contains a small module that handles downloading a categorical subset of the Open Images corpus: just the images corresponding to a user-selected group of labels, and just from the set of images with bounding box information attached. Instead of using the zipped blob files, it downloads the source images from Flickr directly.

This script will allow you to download any subset of the 600 labels that have bounding box information. Here’s a taste of what’s possible:


For the purposes of this article, we’ll limit ourselves to just fruit classes including:


For more information on Open Images, check out the article ‘How to classify photos in 600 classes using nine million Open Images’.
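The label-based subsetting described above can be sketched in a few lines. This is a minimal illustration, not the openimager module itself: it assumes one metadata table mapping label IDs to class names and one table of bounding box annotations, and the tiny inline CSVs, label IDs, and the `select_image_ids` helper are all hypothetical stand-ins.

```python
import csv
import io

# Hypothetical miniature stand-ins for the Open Images metadata files.
CLASS_DESCRIPTIONS = """\
/m/fruit1,Banana
/m/fruit2,Pear
/m/other1,Car
"""

BBOX_ANNOTATIONS = """\
ImageID,LabelName,XMin,XMax,YMin,YMax
img001,/m/fruit1,0.10,0.50,0.20,0.80
img002,/m/other1,0.00,1.00,0.00,1.00
img003,/m/fruit2,0.30,0.60,0.10,0.40
img003,/m/fruit1,0.05,0.25,0.50,0.90
"""


def select_image_ids(class_csv, bbox_csv, wanted_names):
    """Map each wanted class name to the set of image IDs with a box for it."""
    # Label ID -> human-readable class name.
    name_by_label = {
        label: name for label, name in csv.reader(io.StringIO(class_csv))
    }
    selected = {name: set() for name in wanted_names}
    # Keep only annotation rows whose label is one of the wanted classes.
    for row in csv.DictReader(io.StringIO(bbox_csv)):
        name = name_by_label.get(row["LabelName"])
        if name in selected:
            selected[name].add(row["ImageID"])
    return selected


subset = select_image_ids(CLASS_DESCRIPTIONS, BBOX_ANNOTATIONS, ["Banana", "Pear"])
```

Only the images whose IDs end up in `subset` would then need to be fetched, which is what makes a categorical download so much cheaper than pulling the whole corpus.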

Preprocessing your data — and packaging it

This annotated Jupyter notebook in the demo GitHub repository handles the preprocessing. After running the notebook code, we will have an images_cropped folder on disk containing all of the cropped images.
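As a rough idea of what a cropping step involves: Open Images stores bounding boxes as coordinates normalized to the range 0 to 1, so they need to be scaled by the image size before cropping. The sketch below assumes that cropping is done per bounding box; the `bbox_to_pixels` helper is hypothetical and is not taken from the notebook.

```python
def bbox_to_pixels(xmin, xmax, ymin, ymax, width, height):
    """Convert normalized [0, 1] box edges to an integer pixel box.

    Returns (left, upper, right, lower), the ordering that image
    libraries such as Pillow expect for a crop box.
    """
    left = max(0, int(round(xmin * width)))
    upper = max(0, int(round(ymin * height)))
    right = min(width, int(round(xmax * width)))
    lower = min(height, int(round(ymax * height)))
    if right <= left or lower <= upper:
        raise ValueError("degenerate bounding box")
    return left, upper, right, lower


# Example: a box covering the middle of a 640x480 image.
crop_box = bbox_to_pixels(0.25, 0.75, 0.1, 0.5, width=640, height=480)
```

Clamping to the image bounds and rejecting empty boxes guards against the occasional malformed annotation.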

The easiest way to access the fruit class data along with the pre-processed images is via the Quilt T4 package. To access the data, simply run these commands:

! pip install t4

import t4
t4.Package.install('quilt/open_fruit', registry='s3://quilt-example', dest='some/path/some/where')

Looking closely at the fruit data, we can see that there is a class imbalance. There are over 26,000 samples of bananas but then only a few hundred labelled common fig or pear examples. This skew is important to note as we approach building our image classifier.

View this plot on Quilt T4 here
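One common way to account for a skew like this during training is to weight each class inversely to its frequency. The sketch below is an illustration of that idea only and is not something this tutorial's notebooks are known to do; the counts are hypothetical round numbers and `inverse_frequency_weights` is a made-up helper.

```python
def inverse_frequency_weights(counts):
    """Weight each class by total / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Hypothetical sample counts, chosen only to mimic the imbalance.
counts = {"Banana": 26000, "Pear": 400, "Common fig": 300}
weights = inverse_frequency_weights(counts)
```

Keras accepts weights like these through the `class_weight` argument of `fit`, keyed by integer class index rather than by class name, so the names would need to be mapped to indices first.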

Building your image classification model

Now that we’ve downloaded our fruit data from Quilt, we can begin building our image classification model! As with any machine learning project, we’ll go through a few experiments to try to maximize our model’s validation accuracy:

  • First, we’ll start with a simple baseline convolutional neural network (CNN) model.
  • Then, we’ll try to leverage a pre-trained network (InceptionV3 architecture, pre-trained on the ImageNet dataset) whose learned features can help us reach a higher accuracy more effectively than just relying on our fruits dataset. We’ll use transfer learning by fine-tuning the top layers of our pre-trained network.
  • Finally, we’ll do a quick overview of different approaches for optimization including changing parameters like the amount of dropout, learning rate, and weight decay to see how they could contribute to model performance.

The material for this tutorial was inspired by Francois Chollet’s excellent post ‘Building powerful image classification models using very little data’. We’ve expanded upon Chollet’s example and adjusted to reflect our multi-class classification problem space.

Along with having proper data versioning from Quilt, we’ll also make sure to track our results, code, and environment for our different model iterations as this is critical to building a reproducible machine learning model pipeline.

Note: We’ll be using Jupyter notebooks for this tutorial, but Comet has native support for both Jupyter notebooks and scripts.

Baseline model — Simple CNN

For our baseline model, we are using a small CNN with three convolution layers, each using a ReLU activation and followed by a max-pooling layer. We’ll include data augmentation and fairly aggressive dropout to prevent overfitting. Remember, we’re not expecting our best accuracy here, so if you’d like to go straight to the pre-trained model, simply skip ahead to the next section.
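A network of the shape described above could look roughly like the following. This is a sketch under assumptions, not the tutorial's exact model: the filter counts, the 150x150 input size, the dense layer width, the dropout rate, and the `build_baseline_cnn` name are illustrative choices, and data augmentation is left out.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_baseline_cnn(n_classes, input_shape=(150, 150, 3), dropout=0.5):
    """Small CNN: three conv + ReLU + max-pool stages, then a dense head."""
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dropout(dropout),  # fairly aggressive dropout against overfitting
        # One softmax output per fruit class (multi-class, not binary).
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="rmsprop",
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

The main change relative to a binary classifier is the final layer: a softmax over `n_classes` outputs paired with a categorical cross-entropy loss.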

Here are the experiment details for our small CNN model:

Not surprisingly, our simple CNN model did not perform that well on the multi-class classification task (which puts us in a multi-dimensional space). The model was originally meant to support a binary classification task, so having more than three times the number of classes means that you need more nodes to get the same performance. Here are the metrics for one run of our model (link here):

To log your experiment results from training, set up your Comet account here. For each run of the model, we initialize the Comet experiment object and provide our API key and project name.

Once you run the model, you’ll be able to see your different model runs in Comet through the direct experiment URL. As an example for this tutorial, we have created a Comet project that you can view and interact with here.

Since we’re using Keras, Comet’s auto-logging for popular machine learning frameworks allows us to automatically capture model details such as metrics like accuracy and loss, the model’s graph definition, and package dependencies — this significantly reduces the amount of manual logging we have to do from our end.

Using a pre-trained model with transfer learning: InceptionV3

A popular starting point for building image classifiers these days is to use a pre-trained network and fine-tune it with new classes of data. Let’s use this approach to build our image classifier (just make sure to take note of these implementation details for pre-trained models).

There are several popular CNN architectures such as VGGNet, ResNet, and AlexNet along with a wealth of resources to read more about CNNs (see here and here). Keras enables users to easily access these pre-trained models (i.e. their weights pre-trained on ImageNet) through keras.applications.

We selected InceptionV3 because it’s a smaller model than VGGNet and because it’s documented to provide a higher accuracy on benchmark datasets. Transfer learning with InceptionV3 essentially means that we re-use the feature extraction portion of the model that has been trained with the ImageNet dataset and re-train the classification portion on our fruit dataset.

See this helpful diagram on transfer learning with an Inception V3 architecture
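The idea of re-using the feature extractor and re-training the classifier can be sketched with keras.applications. This is a minimal illustration rather than the tutorial's own code: it freezes the whole InceptionV3 base and trains only a new head, and the head's layer sizes, the input size, and the `build_transfer_model` name are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_transfer_model(n_classes, input_shape=(150, 150, 3), weights="imagenet"):
    """InceptionV3 feature extractor with a new classification head."""
    # Load InceptionV3 without its original ImageNet classification layers.
    base = keras.applications.InceptionV3(
        weights=weights, include_top=False, input_shape=input_shape
    )
    base.trainable = False  # freeze the pre-trained feature extraction layers

    # New classification head (layer sizes are illustrative assumptions).
    x = layers.GlobalAveragePooling2D()(base.output)
    x = layers.Dense(256, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)

    model = keras.Model(inputs=base.input, outputs=outputs)
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Calling `build_transfer_model(n_classes)` with the default `weights="imagenet"` downloads the pre-trained weights on first use; passing `weights=None` builds the same architecture with random weights, which is handy for a quick shape check. Fine-tuning the top layers of the base, as opposed to only the new head, would mean setting some of the later base layers back to trainable and recompiling.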

Here’s the code plus experiment details for our fine-tuned InceptionV3 model:

Once we begin training, we can use Comet to track how the model is performing in real-time. We can also check to make sure that we’re properly using our GPUs in the System Metrics tab. The experiment charts in Comet update with our model’s accuracy and loss metrics:

We’ll make sure to log our model weights at the end of the training process to Comet so we can reproduce the model in the future if we need to.

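# save locally
# (assumption: `model` is the trained Keras model; newer Keras releases
# expect weight files to end in `.weights.h5`)
model.save_weights('./inceptionv3_tuned.h5')

# save to Comet Asset Tab
# you can retrieve these weights later via the REST API
experiment.log_asset(file_path='./inceptionv3_tuned.h5', file_name='inceptionv3_tuned.h5')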

If you want to retrieve the model code and have trained your model from a git directory, simply use the Reproduce button in the Comet experiment view.

The Reproduce dropdown will surface key pieces of information about your environment, git commit, and everything you need to reproduce your experiment, including the actual run commands or notebook file. If you have uncommitted changes, we also provide you with a patch for applying your changes later.

Evaluating the model

In order to evaluate our image classifier model, it’s useful to generate a few sample predictions and plot a confusion matrix so we can see where our model classified certain fruits correctly and incorrectly.
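For reference, a confusion matrix is just a table of counts indexed by true class and predicted class. The sketch below builds one with the standard library; the labels and predictions are hypothetical and are not results from this tutorial's model.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(t, p)] for p in labels] for t in labels]


# Hypothetical ground truth and predictions for five images.
labels = ["Banana", "Pear", "Strawberry"]
y_true = ["Banana", "Banana", "Pear", "Strawberry", "Pear"]
y_pred = ["Banana", "Pear", "Pear", "Strawberry", "Banana"]

matrix = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(f"{label:>10}: {row}")
```

Counts on the diagonal are correct predictions; off-diagonal cells show which fruits the model confuses with which.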

These images and figures would also be useful to share with teammates, so we can log them to Comet even after the experiment is complete using the Experiment.log_figure() and Experiment.log_image() methods (see more here).

For this experiment, we’ve logged some random samples from our fruit dataset. You can see this sample image is hardly a very clear image of a strawberry (in fact, there was some preprocessing!).

See this great resource on evaluating machine learning models from Jeremy Jordan.

Further optimizations

There are several ways we could approach improving our model. Here is a non-exhaustive list of things we could try to adjust:

  • Type of architecture — we also provide the code for VGG16 here
  • Number of Layers — Increase network depth to give it more capacity. Try adding more layers, one at a time, and check for improvements
  • Number of Neurons in a layer
  • Adding regularization and adjusting those parameters
  • Learning Rate — you can incorporate the Keras LearningRateScheduler through the callback (see here and here)
  • Type of optimization / back-propagation technique to use
  • Dropout rate
  • Progressive Resizing
  • Hyperparameter optimization services.
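As an example of the learning rate item in the list above, here is a simple step-decay schedule. It is a sketch with assumed values: the halving factor, the ten-epoch interval, and the `step_decay` name are illustrative, not settings used in this tutorial.

```python
def step_decay(epoch, lr, drop=0.5, epochs_per_drop=10):
    """Multiply the learning rate by `drop` every `epochs_per_drop` epochs."""
    if epoch > 0 and epoch % epochs_per_drop == 0:
        return lr * drop
    return lr


# Simulate 25 epochs starting from a learning rate of 0.001.
lr = 0.001
history = []
for epoch in range(25):
    lr = step_decay(epoch, lr)
    history.append(lr)

# With Keras, the same function could be passed to the callback, e.g.
#   keras.callbacks.LearningRateScheduler(step_decay)
# since the callback calls the schedule with (epoch, current_lr).
```

In the simulation the rate drops at epochs 10 and 20, ending at a quarter of its starting value.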

As you try these different optimizations, Comet allows you to create visualizations like bar charts, line plots, and parallel coordinate charts to track your experiments. These experiment-level and project-level visualizations help you quickly identify your best-performing models and understand your parameter space.

Your full machine learning pipeline

If you had to share your model results or intermediate work with a fellow data scientist today, how would you do it?

The benefit of using Quilt for data versioning and Comet for model versioning is that by combining these best-of-breed tools you can simultaneously make your machine learning model experiments easily accessible, trackable, and reproducible.

Sharing a model and the code used to generate it? Link your collaborator to the Comet experiment page. Sharing the data you used? Share a link to the Quilt T4 package.

Reproducing the result locally, or using an old experiment as the starting point for a new one? Get back to where you left off with this code:

git clone
cd open_fruit/

python -c "import t4; t4.Package.install('quilt/open_fruit', 's3://quilt-example', dest='keras-fruit-classifier/')"

# There are a *lot* of ways to do this: a pip requirements.txt, a
# conda environment.yml, a Docker container...

# Here's one cool way - cloning the Comet runtime  
PY_VERSION=$(python -c "import comet_ml; print(comet_ml.API().get_experiment_system_details('01e427cedce145f8bc69f19ae9fb45bb')['python_version'])")

conda create -n my_test_env python=$PY_VERSION
conda activate my_test_env

python -c "import comet_ml; print('\n'.join(comet_ml.API().get_experiment_installed_packages('01e427cedce145f8bc69f19ae9fb45bb')))" > requirements.txt
pip install -r requirements.txt

# You can also get this from Comet by clicking on the Download button

jupyter notebook

Congratulations! You’ve gone beyond building a multi-class image classifier model to building a fully reproducible (and shareable) machine learning pipeline with data, code, and environment details ⭐️

Thanks to Gideon Mendels and Aleksey Bilogur. 

Gideon Mendels | Comet ML


Gideon Mendels is the CEO and co-founder of Comet, a leading provider of machine learning operations (MLOps) solutions that accelerate getting machine learning models into production. Before Comet, Gideon founded GroupWize, where they trained and deployed over 50 Natural Language Processing (NLP) models in 15 different languages. His journey with NLP and Speech Recognition models began at Columbia University and Google, where he worked on hate speech and deception detection.