
comet_ml.integration.ray

CometTrainLoggerCallback

CometTrainLoggerCallback(
    ray_config: Dict[str, Any],
    tags: Optional[List[str]] = None,
    save_checkpoints: bool = False,
    share_api_key_to_workers: bool = False,
    experiment_name: Optional[str] = None,
    api_key: Optional[str] = None,
    workspace: Optional[str] = None,
    project_name: Optional[str] = None,
    experiment_key: Optional[str] = None,
    mode: Optional[str] = None,
    online: Optional[bool] = None,
    **experiment_kwargs
)

Ray Callback for logging Train results to Comet.

This Ray Train LoggerCallback sends metrics and parameters to Comet for tracking.

This callback is based on the native Ray Comet callback and has been modified to allow tracking resource usage on all distributed workers when running a distributed training job. It cannot be used with Ray Tune.

Parameters:

  • ray_config (Dict[str, Any]) –

    Ray configuration dictionary to share with workers. It must be the same dictionary instance, not a copy.

  • tags (Optional[List[str]], default: None ) –

    Tags to add to the logged Experiment.

  • save_checkpoints (bool, default: False ) –

    If True, model checkpoints will be saved to Comet ML as artifacts.

  • share_api_key_to_workers (bool, default: False ) –

    If True, the Comet API key will be shared with workers via the ray_config dictionary. This is unsafe, and we recommend using a more secure way to set up your API key in your cluster.

  • experiment_name (Optional[str], default: None ) –

    Custom name for the Comet experiment. If None, a name is generated automatically.

  • api_key (string, default: None ) –

    Comet API key.

  • workspace (string, default: None ) –

    Comet workspace name.

  • project_name (string, default: None ) –

    Comet project name.

  • experiment_key (string, default: None ) –

    Experiment key to be used for logging.

  • mode (string, default: None ) –

    Controls how the Comet experiment is started; three options are possible:

    • "get": Continue logging to an existing experiment identified by the experiment_key value.
    • "create": Always creates a new experiment, useful for HPO sweeps.
    • "get_or_create" (default): Starts a fresh experiment if required, or continues logging to an existing one.
  • online (bool, default: None ) –

    If True, the data will be logged to the Comet server; otherwise it will be stored locally in an offline experiment.

  • experiment_kwargs –

    Other keyword arguments will be passed to the constructor for comet_ml.Experiment.
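The requirement that ray_config be the same dictionary instance (not a copy) can be illustrated with plain Python. This is a hypothetical sketch, not the Comet API itself; the key name used here is made up:

```python
# Hypothetical illustration of why ray_config must be the same dict
# instance: the callback mutates the dict in place so that workers
# holding a reference to it can read the injected values.
config = {"lr": 1e-3}

def driver_injects(shared):
    # Stands in for what the callback does: add keys workers will read.
    shared["_shared_key"] = "abc123"  # hypothetical key name

same_ref = config        # passing the same instance: workers see updates
copy_ref = dict(config)  # passing a copy: updates never reach the workers

driver_injects(same_ref)
print("_shared_key" in config)  # → True

driver_injects(copy_ref)
print(copy_ref is config)  # → False: the copy is a different object
```

Passing a copy would leave the dictionary held by the workers unmodified, so the sharing mechanism silently breaks.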

Example
from comet_ml.integration.ray import CometTrainLoggerCallback
from ray.train import RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

config = {"lr": 1e-3, "batch_size": 64, "epochs": 20}

comet_callback = CometTrainLoggerCallback(
    config,
    tags=["torch_ray_callback"],
    save_checkpoints=True,
    share_api_key_to_workers=True,
)

trainer = TorchTrainer(
    train_func,
    train_loop_config=config,
    scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
    run_config=RunConfig(callbacks=[comet_callback]),
)
result = trainer.fit()

on_experiment_end

on_experiment_end(trials: List[Trial], **info)

comet_ray_train_logger

comet_ray_train_logger(
    trainer: DataParallelTrainer,
    tags: Optional[List[str]] = None,
    save_checkpoints: bool = False,
    share_api_key_to_workers: bool = False,
    experiment_name: Optional[str] = None,
    api_key: Optional[str] = None,
    workspace: Optional[str] = None,
    project_name: Optional[str] = None,
    experiment_key: Optional[str] = None,
    mode: Optional[str] = None,
    online: Optional[bool] = None,
    **experiment_kwargs
) -> None

Enables the registration of a Comet Ray callback with the specified trainer to collect and send training metrics and parameters to Comet for experiment tracking.

This callback is adapted from the native Ray Comet callback and modified to monitor resource usage across all distributed workers during distributed training jobs. Note that it is not compatible with Ray Tune.

Parameters:

  • trainer (DataParallelTrainer) –

    Ray Trainer object.

  • tags (Optional[List[str]], default: None ) –

    Tags to add to the logged Experiment.

  • save_checkpoints (bool, default: False ) –

    If True, model checkpoints will be saved to Comet ML as artifacts.

  • share_api_key_to_workers (bool, default: False ) –

    If True, the Comet API key will be shared with workers via the ray_config dictionary. This is unsafe, and we recommend using a more secure way to set up your API key in your cluster.

  • experiment_name (Optional[str], default: None ) –

    Custom name for the Comet experiment. If None, a name is generated automatically.

  • api_key (string, default: None ) –

    Comet API key.

  • workspace (string, default: None ) –

    Comet workspace name.

  • project_name (string, default: None ) –

    Comet project name.

  • experiment_key (string, default: None ) –

    Experiment key to be used for logging.

  • mode (string, default: None ) –

    Controls how the Comet experiment is started; three options are possible:

    • "get": Continue logging to an existing experiment identified by the experiment_key value.
    • "create": Always creates a new experiment, useful for HPO sweeps.
    • "get_or_create" (default): Starts a fresh experiment if required, or continues logging to an existing one.
  • online (bool, default: None ) –

    If True, the data will be logged to the Comet server; otherwise it will be stored locally in an offline experiment.

  • experiment_kwargs –

    Other keyword arguments will be passed to the constructor for comet_ml.Experiment.

Example
from comet_ml.integration.ray import comet_ray_train_logger
from ray.train import RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

config = {"lr": 1e-3, "batch_size": 64, "epochs": 20}

trainer = TorchTrainer(
    train_func,
    train_loop_config=config,
    scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
    run_config=RunConfig(),
)
comet_ray_train_logger(
    trainer=trainer,
    tags=["torch_ray_callback"],
    save_checkpoints=True,
    share_api_key_to_workers=True,
)

result = trainer.fit()
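As an alternative to share_api_key_to_workers=True, which the parameter description above flags as unsafe, the API key can be provided through the environment on each node. A minimal sketch, assuming the standard COMET_API_KEY environment variable and a placeholder value:

```shell
# Set the Comet API key in the environment of every Ray node (for
# example in the cluster's setup scripts) instead of sharing it via
# the ray_config dictionary.
export COMET_API_KEY="your-api-key"   # placeholder value
```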

comet_worker

comet_worker(func)

This decorator enables you to monitor resource usage for each distributed worker during a distributed training job. By applying this decorator, you can annotate any training function to integrate Comet’s resource tracking.

Note: This should be used together with the comet_ml.integration.ray.CometTrainLoggerCallback callback, and the training function must accept a configuration dictionary as an input argument.

Parameters:

  • func (Callable) –

    The training function to be wrapped which should have configuration dictionary as an input argument. The training function is a user-defined Python function that contains the end-to-end model training loop logic. When launching a distributed training job, each worker executes this training function.

Example
from typing import Dict

from comet_ml.integration.ray import comet_worker

@comet_worker
def train_func(config: Dict):
    ...  # ray worker training code
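The contract the decorator relies on (the wrapped function must accept a configuration dictionary as its input argument) can be sketched with a stand-in decorator. This is NOT the real comet_worker, just a pure-Python illustration of the wrapping pattern:

```python
import functools

# Minimal stand-in (not the real comet_worker) showing the expected
# contract: the wrapped training function takes a config dict first.
def fake_comet_worker(func):
    @functools.wraps(func)
    def wrapper(config, *args, **kwargs):
        # The real decorator would set up per-worker Comet resource
        # tracking here, reading connection details from the shared
        # config dictionary before running the training function.
        return func(config, *args, **kwargs)
    return wrapper

@fake_comet_worker
def train_func(config):
    # Stand-in for the training loop; just reads a config value.
    return config["epochs"]

print(train_func({"epochs": 20}))  # → 20
```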

comet_worker_logger

comet_worker_logger(
    ray_config: Dict[str, Any],
    api_key: Optional[str] = None,
    **experiment_kwargs
)

This context manager allows you to track resource usage from each distributed worker when running a distributed training job. It must be used in conjunction with the comet_ml.integration.ray.CometTrainLoggerCallback callback.

Parameters:

  • ray_config (dict) –

    Ray configuration dictionary from ray driver node.

  • api_key (str, default: None ) –

    Comet API key. If not None, it will be passed to ExistingExperiment. This argument takes priority over the api_key in the ray_config dict and the API key in the environment.

  • **experiment_kwargs –

    Other keyword arguments will be passed to the constructor for comet_ml.ExistingExperiment.

Example
from typing import Dict

from comet_ml.integration.ray import comet_worker_logger

def train_func(ray_config: Dict):
    with comet_worker_logger(ray_config) as experiment:
        ...  # ray worker training code

If some required information is missing (such as the API key), or if something goes wrong, this returns a disabled Experiment: all method calls will succeed, but no data will be logged.
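The disabled-Experiment fallback behaves like a null object: every method call succeeds silently. A sketch of that behaviour (an assumption about the pattern, not Comet source code):

```python
# Sketch of a "disabled experiment": every attribute lookup returns a
# no-op callable, so log_metric, log_parameter, etc. all succeed but
# record nothing (the classic null-object pattern).
class DisabledExperiment:
    def __getattr__(self, name):
        def _noop(*args, **kwargs):
            return None  # silently do nothing
        return _noop

exp = DisabledExperiment()
exp.log_metric("loss", 0.1)    # succeeds, logs nothing
exp.log_parameter("lr", 1e-3)  # also a no-op
```

Worker code therefore does not need to guard every logging call against a missing API key; the calls simply become no-ops.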

Apr. 17, 2025