
integration.ray

CometTrainLoggerCallback

class comet_ml.integration.ray.CometTrainLoggerCallback(ray_config: Dict[str, Any], tags: Optional[List[str]] = None, save_checkpoints: bool = False, share_api_key_to_workers: bool = False, experiment_name: Optional[str] = None, **experiment_kwargs)

Ray Callback for logging Train results to Comet.

This Ray Train LoggerCallback sends metrics and parameters to Comet for tracking.

This callback is based on Ray's native Comet callback and has been modified to track resource usage on all distributed workers when running a distributed training job. It cannot be used with Ray Tune.

Args:

  • ray_config: dict (required), Ray configuration dictionary to share with workers. It must be the same dictionary instance, not a copy.
  • tags: list of strings (optional), tags to add to the logged Experiment. Defaults to None.
  • save_checkpoints: boolean (optional), if True, model checkpoints will be saved to Comet ML as artifacts. Defaults to False.
  • share_api_key_to_workers: boolean (optional), if True, the Comet API key will be shared with workers via the ray_config dictionary. This is unsafe; we recommend a more secure way of setting up your API Key in your cluster. Defaults to False.
  • experiment_name: string (optional), custom name for the Comet experiment. If None, a name is generated automatically. Defaults to None.
  • experiment_kwargs: other keyword arguments are passed to the comet_ml.Experiment constructor.
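The "same dictionary instance, not a copy" requirement for ray_config matters because the callback stores shared state in the dictionary after you create it, and workers only see that state if they hold a reference to the same object. A minimal sketch of the mechanism (the key name below is hypothetical, not Comet's actual internal key):

```python
# Why ray_config must be the SAME dict instance: later mutations by the
# callback are visible through shared references, but not through copies.
config = {"lr": 1e-3}

worker_view = config        # same instance: sees later updates
stale_copy = dict(config)   # a copy: frozen at this point

# Hypothetical: the callback later writes shared state into the dict.
config["_comet_key"] = "abc"

assert worker_view.get("_comet_key") == "abc"  # shared reference sees it
assert "_comet_key" not in stale_copy          # the copy does not
```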

Example:

from comet_ml.integration.ray import CometTrainLoggerCallback
from ray.train import RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

config = {"lr": 1e-3, "batch_size": 64, "epochs": 20}

comet_callback = CometTrainLoggerCallback(
    config,
    tags=["torch_ray_callback"],
    save_checkpoints=True,
    share_api_key_to_workers=True,
)

trainer = TorchTrainer(
    train_func,  # your training function, run on each worker
    train_loop_config=config,
    scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
    run_config=RunConfig(callbacks=[comet_callback]),
)
result = trainer.fit()

Returns: None

comet_worker_logger

comet_ml.integration.ray.comet_worker_logger(ray_config, api_key=None, **experiment_kwargs)

This context manager allows you to track resource usage from each distributed worker when running a distributed training job. It must be used in conjunction with comet_ml.integration.ray.CometTrainLoggerCallback callback.

Args:

  • ray_config: dict (required), Ray configuration dictionary from the Ray driver node.
  • api_key: str (optional), if not None, it will be passed to ExistingExperiment. This argument takes priority over the api_key in the ray_config dict and the API key in the environment.
  • experiment_kwargs: other keyword arguments are passed to the comet_ml.ExistingExperiment constructor.

Example:

from typing import Dict

from comet_ml.integration.ray import comet_worker_logger

def train_func(ray_config: Dict):
    with comet_worker_logger(ray_config) as experiment:
        ...  # ray worker training code

If some required information is missing (such as the API Key) or something goes wrong, this returns a disabled Experiment: all method calls will succeed, but no data will be logged.
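This disabled-Experiment fallback follows the null-object pattern: every method call succeeds and silently does nothing, so worker code does not need error handling around logging calls. A generic sketch of the pattern, not Comet's actual implementation:

```python
class DisabledExperiment:
    """Null-object stand-in: any method call succeeds and is a no-op."""

    def __getattr__(self, name):
        # Return a no-op callable for any attribute, so arbitrary
        # logging calls (log_metric, log_parameter, ...) never fail.
        def _noop(*args, **kwargs):
            return None
        return _noop

exp = DisabledExperiment()
exp.log_metric("loss", 0.5)      # succeeds, logs nothing
exp.log_parameter("lr", 1e-3)    # succeeds, logs nothing
```

This design lets the same train_func run unchanged whether logging is enabled or not.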

Returns: An Experiment object.

May. 17, 2024