Use Comet in distributed systems

The Comet Python SDK can be used when training is distributed across several processes, and potentially across several GPUs and servers. Each training process can be assigned one or more GPUs on a given machine in the cluster.

We recommend creating a single Experiment object in each process to avoid unexpected behavior from the SDK.

There are two recommended ways to log a distributed training run with Comet:

  • Log a distributed training run to a single Experiment
  • Log a distributed training run with multiple Experiments

Log a distributed training run to a single Experiment

This section describes how to log the metrics from each worker to a single experiment.

When logging to a single experiment, you must manually set an experiment key for your distributed training run that each process can then use to log data to Comet.
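The sketch below shows one way to share a key across workers, assuming a PyTorch distributed setup with the process group already initialized; the helper name `get_shared_experiment` is illustrative. Rank 0 creates the Experiment and broadcasts its key, and the remaining ranks attach to that key with `ExistingExperiment`. Alternatively, the SDK can read the key from the `COMET_EXPERIMENT_KEY` environment variable, which your launcher can export to every process before training starts.

```python
# A minimal sketch, assuming torch.distributed with an initialized process
# group and a configured Comet API key (e.g. via COMET_API_KEY).
import comet_ml
import torch.distributed as dist

def get_shared_experiment():
    rank = dist.get_rank()
    if rank == 0:
        # Create the experiment once; its key is shared with the other workers.
        experiment = comet_ml.Experiment()
        payload = [experiment.get_key()]
    else:
        payload = [None]

    # Broadcast the key from rank 0 so all processes log to one experiment.
    dist.broadcast_object_list(payload, src=0)

    if rank != 0:
        # Attach to the experiment rank 0 created.
        experiment = comet_ml.ExistingExperiment(previous_experiment=payload[0])
    return experiment
```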

Keep in mind that logging system metrics (CPU/GPU usage, RAM, and so on) from multiple workers to a single experiment is not currently supported. We recommend using one Experiment per GPU process instead.

An example project is provided here.

Log a distributed training run with multiple Experiments

To capture model metrics and system metrics (CPU/GPU usage, RAM, and so on) from each machine while running distributed training, we recommend creating an Experiment object per GPU process and grouping these experiments under a user-provided run ID, as sketched below.
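A minimal sketch of this pattern, assuming the launcher exports the same `RUN_ID` environment variable to every process (the variable name and project name are illustrative): each process creates its own Experiment, which captures that process's system metrics, and tags it with the shared run ID via `log_other` so the experiments can be grouped in the Comet UI.

```python
# A minimal sketch: one Experiment per GPU process, grouped by a shared run ID.
import os
import comet_ml

# The launcher is assumed to export the same RUN_ID to every process
# (the variable name is illustrative).
run_id = os.environ["RUN_ID"]

# Each process gets its own Experiment, which also captures that
# process's system metrics (CPU/GPU usage, RAM, and so on).
experiment = comet_ml.Experiment(project_name="distributed-training")  # illustrative name
experiment.log_other("run_id", run_id)  # group experiments by this field in the UI

# ... training loop ...
experiment.log_metric("loss", 0.42, step=1)  # illustrative metric
```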

An example project is provided here.

To reproduce the project, run this example.
