
Run Experiments in distributed environments

The Comet Python SDK can be used when training is distributed across several processes and potentially several GPUs and servers. Each training process can be assigned one or more GPUs on a given machine in a cluster.

We recommend creating a single Experiment object in each process to avoid unexpected behaviors from the SDK.

There are two recommended ways for running distributed training with Comet.

  • Log a distributed training run to a single Experiment
  • Log a distributed training run with multiple Experiments

Log a Distributed Training Run to a single Experiment

If you want to log the training metrics from each worker as a single experiment, refer to this guide.

When logging to a single experiment, you need to manually set an experiment key for your distributed training run. This key will be used by each process to log data to Comet.

An example project is available here.
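A minimal sketch of the key sharing described above, assuming the launcher generates the key once and exports it to every process through an environment variable. The variable name `RUN_EXPERIMENT_KEY` and the helper names are hypothetical. The sketch also assumes an SDK version in which both `Experiment` and `ExistingExperiment` accept an `experiment_key` argument, so check the version you have installed. Comet expects the key to be an alphanumeric string of 32 to 50 characters, which a UUID4 in hex form satisfies.

```python
import os
import uuid


def make_experiment_key() -> str:
    """Return a 32-character alphanumeric key (a UUID4 in hex form)."""
    return uuid.uuid4().hex


def shared_experiment_key(env_var: str = "RUN_EXPERIMENT_KEY") -> str:
    """Read the key exported by the launcher; every worker must see the same value.

    RUN_EXPERIMENT_KEY is a hypothetical variable name used for illustration.
    """
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"{env_var} is not set; export it from the launcher")
    return key


def attach_worker(rank: int, project_name: str):
    """Create the shared experiment on rank 0, attach to it on every other rank."""
    import comet_ml  # imported lazily so the helpers above work without the SDK

    key = shared_experiment_key()
    if rank == 0:
        return comet_ml.Experiment(project_name=project_name, experiment_key=key)
    # Assumption: rank 0 has already created the experiment when this runs.
    return comet_ml.ExistingExperiment(experiment_key=key)
```

In this sketch, workers other than rank 0 should attach only after rank 0 has created the experiment, for example after a synchronization barrier in your training framework.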

Log a Distributed Training Run to multiple experiments

To capture training metrics from each machine while running distributed training, we recommend creating an Experiment object per distributed worker and grouping these experiments under a user-provided run ID.

An example project is available here. You can use the code provided here to reproduce this project.
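The grouping approach can be sketched as follows, assuming the launcher generates one run ID and passes it to every worker. The helper names and the `run_id`, `rank`, and `world_size` field names are illustrative choices, not names required by Comet. The metadata is attached with `Experiment.log_others`, so the experiments can later be filtered or grouped by `run_id`.

```python
import uuid


def new_run_id() -> str:
    """Generate one ID per distributed run; the launcher shares it with every worker."""
    return uuid.uuid4().hex


def worker_tags(run_id: str, rank: int, world_size: int) -> dict:
    """Metadata that lets the per-worker experiments be grouped and told apart."""
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside world size {world_size}")
    return {"run_id": run_id, "rank": rank, "world_size": world_size}


def start_worker_experiment(run_id: str, rank: int, world_size: int, project_name: str):
    """Create this worker's own Experiment and attach the grouping metadata."""
    import comet_ml  # lazy import: the helpers above do not need the SDK

    experiment = comet_ml.Experiment(project_name=project_name)
    experiment.log_others(worker_tags(run_id, rank, world_size))
    return experiment
```

Each worker calls `start_worker_experiment` once with the same `run_id` and its own `rank`, which keeps to the recommendation of one Experiment object per process.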

Set up the API Key

Setting up the API Key in a distributed cluster can be challenging. The best practices below cover common cases, but each installation is unique and may require its own solution.

Using your scheduler features

If you are using a job scheduler, it may support secure sharing of secrets. You can search for "JOB_SCHEDULER secret sharing" on your preferred search engine to see if this feature is available. If it is, we recommend using it.

If you are using Kubernetes, you can use Kubernetes secrets to distribute and retrieve the Comet API Key.
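Kubernetes can expose a secret to a pod either as an environment variable or as a file in a mounted volume. The sketch below handles both cases on the worker side. The mount path `/etc/comet/api-key` and the function name are hypothetical and depend on how your pod spec is written; `COMET_API_KEY` is the environment variable the SDK reads.

```python
import os
from pathlib import Path
from typing import Optional


def load_comet_api_key(secret_file: str = "/etc/comet/api-key") -> Optional[str]:
    """Return the key from COMET_API_KEY, else from a mounted secret file, else None.

    The default path is a hypothetical mount point chosen for illustration.
    """
    key = os.environ.get("COMET_API_KEY")
    if key:
        return key.strip()
    path = Path(secret_file)
    if path.is_file():
        # Secret files often end with a newline; strip it before use.
        return path.read_text(encoding="utf-8").strip() or None
    return None
```

If the secret is injected as `COMET_API_KEY`, the SDK can pick it up without any extra code. A helper like this is only needed when the secret is mounted as a file, in which case the returned value can be passed as the `api_key` argument when creating the Experiment.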

Using a cloud-provider Secret Management System

The Comet Python SDK supports retrieving the API Key from either GCP Secret Manager or AWS Secrets Manager. For this to work, each worker must have permission to access the corresponding secret in the Secret Manager. See Secret Management for more details and instructions on how to set it up.

Using a personal API Key

If your cluster is used by a single user, you can set up that user's API Key on each worker node, either as the environment variable COMET_API_KEY or in a configuration file. See Configure Comet for more details.

With this approach, the API Key is stored in clear text, so anyone who can access a worker node will likely be able to read it.
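Before launching a job, it can help to confirm that each worker node actually has a key configured through one of these two mechanisms. The sketch below reports where the key was found without printing it. It assumes the configuration file is an INI file at `~/.comet.config` with a `[comet]` section containing an `api_key` entry; confirm the location and layout against Configure Comet for your setup. The function name is hypothetical.

```python
import configparser
import os
from pathlib import Path


def api_key_source(config_path: str = "~/.comet.config") -> str:
    """Report where the API key is configured, without revealing the key itself."""
    if os.environ.get("COMET_API_KEY"):
        return "environment"
    parser = configparser.ConfigParser()
    parser.read(Path(config_path).expanduser())  # a missing file is silently skipped
    # Assumed layout: a [comet] section with an api_key entry.
    if parser.get("comet", "api_key", fallback=""):
        return "config file"
    return "missing"
```

Running this on every node as a pre-flight step surfaces a missing key before training starts rather than when the first Experiment is created.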

Using a Service Account

If your cluster is shared by multiple users but used for a single Comet Organization, you can set up a Service Account API Key on each worker node, either as the environment variable COMET_API_KEY or in a configuration file. See Configure Comet for more details.

With this approach, the API Key is stored in clear text, so anyone who can access a worker node will likely be able to read it.

On-premise solution

If you are using Comet on-premise, an additional option relies on AWS Key Management Service or Google Cloud Key Management Service. Please contact your Comet Representative to learn how to set it up.

May. 17, 2024