comet_ml.integration.ray ¶

CometTrainLoggerCallback ¶

CometTrainLoggerCallback(
    ray_config: Dict[str, Any],
    tags: Optional[List[str]] = None,
    save_checkpoints: bool = False,
    share_api_key_to_workers: bool = False,
    experiment_name: Optional[str] = None,
    api_key: Optional[str] = None,
    workspace: Optional[str] = None,
    project_name: Optional[str] = None,
    experiment_key: Optional[str] = None,
    mode: Optional[str] = None,
    online: Optional[bool] = None,
    **experiment_kwargs
)

Ray Callback for logging Train results to Comet.

This Ray Train LoggerCallback sends metrics and parameters to Comet for tracking.

This callback is based on Ray's native Comet callback and has been modified to track resource usage on all distributed workers when running a distributed training job. It cannot be used with Ray Tune.

Parameters:

  • ray_config (Dict[str, Any]) –

    Ray configuration dictionary to share with workers. It must be the same dictionary instance, not a copy.

  • tags (Optional[List[str]], default: None ) –

    Tags to add to the logged Experiment.

  • save_checkpoints (bool, default: False ) –

    If True, model checkpoints will be saved to Comet ML as artifacts.

  • share_api_key_to_workers (bool, default: False ) –

    If True, the Comet API key is shared with workers via the ray_config dictionary. This is insecure; we recommend using a more secure way to set up your API key in your cluster.

  • experiment_name (Optional[str], default: None ) –

    Custom name for the Comet experiment. If None, a name is generated automatically.

  • api_key (Optional[str], default: None ) –

    Comet API key.

  • workspace (Optional[str], default: None ) –

    Comet workspace name.

  • project_name (Optional[str], default: None ) –

    Comet project name.

  • experiment_key (Optional[str], default: None ) –

    Experiment key to be used for logging.

  • mode (Optional[str], default: None ) –

    Controls how the Comet experiment is started. Three options are possible:

    • "get": Continues logging to an existing experiment identified by the experiment_key value.
    • "create": Always creates a new experiment; useful for HPO sweeps.
    • "get_or_create" (default): Starts a fresh experiment if required, or continues logging to an existing one.
  • online (Optional[bool], default: None ) –

    If True, the data is logged to the Comet server; otherwise it is stored locally in an offline experiment.

  • **experiment_kwargs –

    Other keyword arguments will be passed to the constructor for comet_ml.Experiment.
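
The requirement that ray_config be the same dictionary instance, not a copy, can be illustrated without Comet or Ray. The sketch below is a toy stand-in and not the integration's code: FakeCallback and the "_comet_experiment_key" entry are hypothetical names, and it assumes the callback writes extra entries into the dictionary it receives, as sharing the API key "via ray_config dictionary" suggests.

```python
import copy
from typing import Any, Dict


class FakeCallback:
    """Hypothetical stand-in for a callback that annotates the config it is given."""

    def __init__(self, ray_config: Dict[str, Any]) -> None:
        # Keep a reference to the caller's dictionary; do not copy it.
        self._ray_config = ray_config

    def setup(self) -> None:
        # Hypothetical key name, used only for this illustration.
        self._ray_config["_comet_experiment_key"] = "abc123"


# Same instance: the entry written by the callback reaches the trainer config.
config = {"lr": 1e-3, "batch_size": 64}
callback = FakeCallback(config)
train_loop_config = config
callback.setup()
same_instance_sees_key = "_comet_experiment_key" in train_loop_config

# Copy: the callback annotates its own copy, so the original config misses it.
config2 = {"lr": 1e-3, "batch_size": 64}
callback2 = FakeCallback(copy.deepcopy(config2))
callback2.setup()
copy_sees_key = "_comet_experiment_key" in config2

print(same_instance_sees_key, copy_sees_key)  # True False
```

In practice this means building the config once and passing that one object to both the callback and train_loop_config, as the example on this page does.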

Example
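from comet_ml.integration.ray import CometTrainLoggerCallback
from ray.train import RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

# train_func, num_workers and use_gpu are assumed to be defined elsewhere.
config = {"lr": 1e-3, "batch_size": 64, "epochs": 20}

# Pass the same dict instance to the callback and to the trainer.
comet_callback = CometTrainLoggerCallback(
    config,
    tags=["torch_ray_callback"],
    save_checkpoints=True,
    share_api_key_to_workers=True,
)

trainer = TorchTrainer(
    train_func,
    train_loop_config=config,
    scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
    run_config=RunConfig(callbacks=[comet_callback]),
)
result = trainer.fit()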

on_experiment_end ¶

on_experiment_end(trials: List[Trial], **info)

comet_worker_logger ¶

comet_worker_logger(
    ray_config: Dict[str, Any],
    api_key: Optional[str] = None,
    **experiment_kwargs
)

This context manager allows you to track resource usage from each distributed worker when running a distributed training job. It must be used in conjunction with comet_ml.integration.ray.CometTrainLoggerCallback callback.

Parameters:

  • ray_config (Dict[str, Any]) –

    Ray configuration dictionary from the Ray driver node.

  • api_key (Optional[str], default: None ) –

    Comet API key. If not None, it is passed to ExistingExperiment. This argument takes priority over the api_key in the ray_config dict and the API key in the environment.

  • **experiment_kwargs –

    Other keyword arguments will be passed to the constructor for comet_ml.ExistingExperiment.
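
The priority order described for api_key (explicit argument, then ray_config, then environment) can be sketched as a small lookup. This is an illustration under assumptions, not the integration's implementation: resolve_api_key and the "_comet_api_key" entry name are hypothetical, and only the COMET_API_KEY environment variable is considered, not Comet configuration files.

```python
import os
from typing import Any, Dict, Mapping, Optional

# Hypothetical key name; the real entry used inside ray_config is not documented here.
RAY_CONFIG_API_KEY = "_comet_api_key"


def resolve_api_key(
    ray_config: Dict[str, Any],
    api_key: Optional[str] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the first API key found: argument, then ray_config, then environment."""
    env = os.environ if environ is None else environ
    if api_key is not None:
        return api_key
    if ray_config.get(RAY_CONFIG_API_KEY) is not None:
        return ray_config[RAY_CONFIG_API_KEY]
    return env.get("COMET_API_KEY")


# The explicit argument wins over the other two sources.
print(resolve_api_key({"_comet_api_key": "cfg"}, "arg", {"COMET_API_KEY": "env"}))
```

Accepting the environment as a parameter keeps the sketch easy to exercise without touching the real process environment.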

Example
from typing import Dict
from comet_ml.integration.ray import comet_worker_logger

def train_func(ray_config: Dict):
    with comet_worker_logger(ray_config) as experiment:
        # Ray worker training code goes here.
        ...

If some required information is missing (such as the API key) or an error occurs, the context manager yields a disabled Experiment: all method calls succeed, but no data is logged.
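
The "disabled Experiment" behavior resembles the null-object pattern. The sketch below shows that idea only; DisabledExperiment is a hypothetical class written for illustration and is not how comet_ml implements it.

```python
from typing import Any, Callable, List, Tuple


class DisabledExperiment:
    """Hypothetical null object: accepts any method call and records nothing."""

    def __init__(self) -> None:
        self.sent: List[Tuple[str, tuple, dict]] = []  # stays empty by design

    def __getattr__(self, name: str) -> Callable[..., None]:
        # Called only for attributes not found normally, i.e. any logging
        # method such as log_metric or log_parameters.
        def _noop(*args: Any, **kwargs: Any) -> None:
            return None

        return _noop


experiment = DisabledExperiment()
experiment.log_metric("loss", 0.25, step=1)  # succeeds, logs nothing
experiment.log_parameters({"lr": 1e-3})      # succeeds, logs nothing
print(len(experiment.sent))  # 0
```

The practical consequence is that training code inside the context manager does not need to guard each logging call; a missing API key shows up as absent data in Comet rather than as an exception on the worker.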

Sep. 12, 2024