Use Comet in distributed systems¶
The Comet Python SDK can be used when training is distributed across several processes, and potentially across several GPUs and servers. Each training process can be assigned one or more GPUs on a given machine in a cluster.
We recommend creating a single Experiment object in each process to avoid unexpected behavior from the SDK.
There are two recommended ways to run distributed training with Comet:
- Log a distributed training run to a single Experiment
- Log a distributed training run with multiple Experiments
Log a distributed training run to a single Experiment¶
To log the metrics from every worker to a single experiment, you must manually set an experiment key for the distributed training run. Each process then uses that key to log data to Comet.
Keep in mind that logging system metrics (CPU/GPU usage, RAM, and so on) from multiple workers to a single experiment is not currently supported. If you need them, we recommend using one Experiment per GPU process instead.
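The shared-key approach can be sketched as follows. This is a minimal illustration, not the SDK's prescribed method: the helper names `derive_experiment_key` and `get_shared_experiment` are hypothetical, and the sketch assumes that every process receives the same `run_id`, that the SDK picks up the key from the `COMET_EXPERIMENT_KEY` environment variable, and that rank 0 creates the experiment before the other ranks attach to it.

```python
import hashlib
import os


def derive_experiment_key(run_id: str) -> str:
    """Derive a deterministic experiment key from a run ID shared by all workers."""
    # SHA-1 yields 40 hexadecimal characters, so the key is alphanumeric and
    # identical in every process that is given the same run_id.
    return hashlib.sha1(run_id.encode("utf-8")).hexdigest()


def get_shared_experiment(run_id: str, rank: int, project_name: str):
    """Return an experiment object bound to the key shared by all workers."""
    # Imported lazily so the key helper above has no SDK dependency.
    import comet_ml

    # Assumption: the SDK reads the experiment key from this environment variable.
    os.environ["COMET_EXPERIMENT_KEY"] = derive_experiment_key(run_id)

    if rank == 0:
        # Assumption: rank 0 is responsible for creating the experiment.
        return comet_ml.Experiment(project_name=project_name)
    # Every other rank appends to the experiment that rank 0 created.
    return comet_ml.ExistingExperiment()
```

With this layout the non-zero ranks must not attach before rank 0 has created the experiment, so a synchronization point (for example a barrier in your distributed framework) is needed between the two branches.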
An example project is provided here.
Log a distributed training run with multiple Experiments¶
To capture both model metrics and system metrics (GPU/CPU usage, RAM, and so on) from each machine during distributed training, we recommend creating one Experiment object per GPU process and grouping these experiments under a user-provided run ID.
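One way to group per-process experiments is to log the run ID on each of them. The sketch below is illustrative only: `worker_metadata` and `create_worker_experiment` are hypothetical helpers, and the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables are an assumption about the launcher (they are the names set by `torchrun`); adapt them to your own setup.

```python
import os


def worker_metadata(run_id: str, environ=os.environ) -> dict:
    """Collect the values that identify this worker within a distributed run."""
    # Assumption: the launcher exports these variables; defaults cover a
    # single-process run.
    return {
        "run_id": run_id,
        "rank": int(environ.get("RANK", "0")),
        "local_rank": int(environ.get("LOCAL_RANK", "0")),
        "world_size": int(environ.get("WORLD_SIZE", "1")),
    }


def create_worker_experiment(run_id: str, project_name: str):
    """Create one experiment for this process and tag it with the run ID."""
    # Imported lazily so the metadata helper above has no SDK dependency.
    import comet_ml

    experiment = comet_ml.Experiment(project_name=project_name)
    # Log the grouping values so experiments from one run can be filtered together.
    experiment.log_others(worker_metadata(run_id))
    return experiment
```

Because each process owns its own experiment, system metrics are collected per worker, and the logged `run_id` value can be used to filter or group the experiments in the project view.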
An example project is provided here. You can use the code available here to reproduce this project.
Set up the API Key¶
Setting up the API Key in a distributed cluster can be challenging. The best practices below are a starting point, but each installation is unique and might require its own solution.
Using your scheduler features¶
If you are using a job scheduler, it might support secure sharing of secrets. Try searching for the name of your job scheduler followed by "secret sharing" in your favorite search engine. If the feature exists, we recommend using it.
If you are using Kubernetes, you can also rely on Kubernetes secrets to distribute and retrieve the Comet API Key.
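As a rough sketch of the Kubernetes case, a worker could read the key from a secret mounted as a file and fall back to the `COMET_API_KEY` environment variable. The helper name `load_api_key` and the mount path `/etc/comet/api-key` are hypothetical; use whatever path or variable your secret is actually exposed under.

```python
import os
from pathlib import Path
from typing import Optional


def load_api_key(secret_file: str = "/etc/comet/api-key", environ=os.environ) -> Optional[str]:
    """Return the API key from a mounted secret file, else from the environment."""
    path = Path(secret_file)
    if path.is_file():
        # Secret files often end with a trailing newline, so strip whitespace.
        key = path.read_text(encoding="utf-8").strip()
        if key:
            return key
    # Fall back to the environment variable; return None if it is unset or empty.
    return environ.get("COMET_API_KEY") or None
```

The returned value can then be passed explicitly when creating the experiment, for example `comet_ml.Experiment(api_key=load_api_key())`.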
Using a personal API Key¶
If your cluster is used by a single user, you can set up that user's API Key on each worker node, either as the environment variable COMET_API_KEY or in a configuration file. See Configure Comet for more details.
With that solution, the API Key is stored in clear text and people accessing a worker node will likely have access to the API Key.
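If you choose the configuration file route, restricting the file's permissions limits who can read the clear-text key on a shared node. The sketch below is one possible approach under stated assumptions: `write_comet_config` is a hypothetical helper, and it assumes the SDK reads an INI-style file with a `[comet]` section and an `api_key` entry, as described in Configure Comet.

```python
import configparser
import os
import stat
from pathlib import Path


def write_comet_config(api_key: str, config_path: Path) -> Path:
    """Write an INI config file holding the API key, readable by the owner only."""
    config = configparser.ConfigParser()
    # Assumption: the SDK expects the key under the [comet] section.
    config["comet"] = {"api_key": api_key}
    with open(config_path, "w", encoding="utf-8") as handle:
        config.write(handle)
    # Restrict access to the file owner (mode 0o600).
    os.chmod(config_path, stat.S_IRUSR | stat.S_IWUSR)
    return config_path
```

This does not encrypt the key; anyone who can act as the file owner or as root on the node can still read it.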
Using a Service Account¶
If your cluster is shared by multiple users but used for a single Comet Organization, you can set up a Service Account API Key on each worker node, either as the environment variable COMET_API_KEY or in a configuration file. See Configure Comet for more details.
With that solution, the API Key is stored in clear text and people accessing a worker node will likely have access to the API Key.
On-premise solution¶
If you are using Comet on-premise, there is an additional possibility that relies on AWS Key Management Service or Google Cloud Key Management. Please contact your Comet Representative about how to set it up.