Distributed Mode

Using the Python SDK in distributed mode

The Comet Python SDK can be used when training is distributed across several processes, potentially spanning multiple GPUs and/or servers. Distributing training this way usually reduces the overall training time.

Distributed Data Parallel Training

In data parallel mode, most machine learning libraries run N independent processes (on the same server or on different ones) that are synced at various points during training. The same model is copied to each process, and each process receives a different subset of the data. The gradients computed by all of these processes are averaged, and the resulting weights are synced across the model replicas in each process.
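The sync step described above can be simulated with plain Python (no ML framework needed): each worker produces gradients from its own data shard, an all-reduce averages them, and every replica applies the same update. The function and variable names below are illustrative, not part of any library API.

```python
def all_reduce_mean(per_worker_grads):
    """Average gradients element-wise across workers (an all-reduce mean)."""
    n_workers = len(per_worker_grads)
    return [sum(g) / n_workers for g in zip(*per_worker_grads)]

def sgd_step(weights, grads, lr=0.1):
    """Apply one SGD update; every replica runs this with the same averaged grads."""
    return [w - lr * g for w, g in zip(weights, grads)]

# Two workers, each with gradients computed on a different data shard.
grads_worker0 = [0.2, -0.4]
grads_worker1 = [0.6, 0.0]

avg = all_reduce_mean([grads_worker0, grads_worker1])  # ≈ [0.4, -0.2]
weights = sgd_step([1.0, 1.0], avg)  # identical result on every replica
```

Because all replicas apply the same averaged gradients, their weights stay identical after every sync point.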

You can learn more about data parallelism in this blog post by Tim Dettmers: data parallelism

Common caveats

  • The "native" console logging can be incompatible with some distributed-mode configurations, or with training scripts that fork themselves. If you see incorrect behavior in your training script, switch to the "simple" console logging by passing auto_output_logging="simple" when creating the Experiment object.
  • Creating only one Experiment reduces the amount of duplicated information logged to the Comet Dashboard, but limits System Details logging (CPU metrics / GPU metrics / Python version, etc.) to the main server.
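For the first caveat, the switch is a single keyword argument when constructing the Experiment. This fragment is a sketch: the project name is a placeholder, and the API key is assumed to come from your environment or Comet config file.

```python
from comet_ml import Experiment

# "simple" console logging avoids the fork-related issues described above.
# "my-project" is a placeholder; the API key is read from the environment
# or the Comet configuration file.
experiment = Experiment(
    project_name="my-project",
    auto_output_logging="simple",
)
```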

PyTorch Distributed Data Parallel

You can find an example of PyTorch DDP + the Comet Python SDK in the comet-examples repository here: https://github.com/comet-ml/comet-examples/tree/master/pytorch#using-cometml-with-pytorch-parallel-data-training.
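A common pattern in DDP scripts, following the single-Experiment caveat above, is to create the Experiment only on the main process. Launchers such as torchrun export a RANK environment variable for each worker, which can be used to gate logging. The helper below is a stdlib-only sketch; `is_main_process` is an illustrative name, not a Comet or PyTorch API.

```python
import os

def is_main_process() -> bool:
    # torchrun exports RANK for each worker process; default to 0 so the
    # script also behaves as the main process when run standalone.
    return int(os.environ.get("RANK", "0")) == 0

# Hypothetical usage inside a DDP training script:
# experiment = None
# if is_main_process():
#     from comet_ml import Experiment
#     experiment = Experiment(auto_output_logging="simple")
# ...
# if experiment is not None:
#     experiment.log_metric("loss", loss_value, step=step)
```

Gating on rank 0 keeps a single Experiment on the Comet Dashboard instead of one per worker.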