Remote Artifacts

Comet Artifacts is a tool that provides a convenient way to log, version, and browse data from all parts of the experimentation pipeline.

There are cases where uploading data to Comet directly might not serve your purposes but you still want to maintain data lineage and reproducibility of your training runs. Here are some typical scenarios where you might prefer to store these Assets in a remote location:

  • Your files are very large and uploading them to Comet takes too much time.
  • Your data is already stored in a centralized location, such as shared storage or cloud storage, and it is convenient to continue working within those locations.

In such cases, use remote Artifacts. Rather than tracking the file content, you track a reference to the file. This reference can be any string, but if the remote Artifact is stored in S3 or GCS, Comet can facilitate saving and downloading these assets by tracking some additional metadata.

Integration with S3 and GCS

If you reference files in cloud object storage such as Amazon S3 or Google Cloud Storage, the Comet Python SDK automatically logs additional data and facilitates the download of those files.

When you log a remote Artifact Asset starting with s3:// or gcs://, the Comet SDK queries the cloud object storage, finds matching objects and stores every one of them as a remote Artifact Asset. For each object found, Comet saves the object URI, its checksum and its size.

In addition, if object versioning has been enabled on the bucket, Comet also logs the remote object version ID as an opaque string, specific to the cloud provider.
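
As a rough illustration of the metadata described above, the sketch below models one synced object as a small record and checks a local copy against it. The `RemoteObjectRecord` class, its field names, the `matches_record` helper, and the use of an MD5 hex digest are all assumptions made for this example. They are not Comet's actual schema, and the checksum a cloud provider reports may use a different algorithm.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class RemoteObjectRecord:
    """Hypothetical record of what is tracked for one remote object."""

    uri: str
    checksum: str  # assumed here to be an MD5 hex digest
    size: int  # size in bytes
    version_id: Optional[str] = None  # set only when bucket versioning is enabled


def matches_record(path: Path, record: RemoteObjectRecord) -> bool:
    """Return True if the local file has the recorded size and checksum."""
    if path.stat().st_size != record.size:
        return False
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # Read in 1 MiB chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == record.checksum
```

A check like this could be run on a downloaded file to confirm that it is the same content that was referenced when the Artifact was logged.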

Syncing is enabled by default, but it requires you to have configured authentication credentials following the cloud provider's recommendations. See each provider's documentation for details.

Note

If the authentication credentials are not set, Comet falls back to logging only the bucket name and path as a string.
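
To make "bucket name and path" concrete, here is a minimal sketch that splits a remote URI into its parts using only the Python standard library. The `split_remote_uri` helper is hypothetical and is not part of the Comet SDK; it only shows which pieces of the URI the fallback refers to.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_remote_uri(uri: str) -> Tuple[str, str, str]:
    """Split a URI such as s3://bucket/path into (scheme, bucket, path)."""
    parsed = urlparse(uri)
    # For URIs of this form, the bucket name is parsed as the network location.
    return parsed.scheme, parsed.netloc, parsed.path.lstrip("/")
```

For example, `split_remote_uri("s3://bucket/my-model/")` yields the scheme `s3`, the bucket `bucket`, and the path `my-model/`.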

Log a remote Artifact Asset

Logging a remote Artifact Asset relies on the Artifact.add_remote method.

Assuming that you have the following files in S3:

s3://bucket/my-model/run-4b6d6ab025a14f0593c9c25289be7e9f/
├── keras_metadata.pb
├── saved_model.pb
└── variables
    ├── variables.data-00000-of-00001
    └── variables.index

Logging a remote Artifact Asset is as simple as:

from comet_ml import Artifact, Experiment

experiment = Experiment(
    api_key="<Your API Key>",
    project_name="<Your Project Name>"
)

artifact = Artifact("my-inference-model", "model")
artifact.add_remote(
    "s3://bucket/my-model/run-4b6d6ab025a14f0593c9c25289be7e9f/"
)

experiment.log_artifact(artifact)
experiment.end()

Access (download) an S3 or GCS remote Artifact Asset

If the Artifact version you are trying to download contains S3 or GCS Remote Assets, the Comet SDK tries to download the files from the cloud object storage.

We can download the Artifact logged above using:

from comet_ml import Experiment, Artifact

experiment = Experiment(
    api_key="<Your API Key>",
    project_name="<Your Project Name>"
)

logged_artifact = experiment.get_artifact("my-inference-model")
local_artifact = logged_artifact.download("./data")

This code downloads all files from S3 into the ./data folder, which then looks like this:

./data/
├── keras_metadata.pb
├── saved_model.pb
└── variables
    ├── variables.data-00000-of-00001
    └── variables.index
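
One way to confirm that the download produced the layout shown above is to list the files under the target folder. The sketch below uses only the standard library; the `list_downloaded_files` helper is a hypothetical name introduced for this example and is not part of the Comet SDK.

```python
from pathlib import Path
from typing import List


def list_downloaded_files(root: str) -> List[str]:
    """Return the sorted relative paths of all files found under root."""
    base = Path(root)
    return sorted(
        path.relative_to(base).as_posix()
        for path in base.rglob("*")
        if path.is_file()
    )
```

Calling `list_downloaded_files("./data")` after the download should return one entry per file, with nested files reported as paths such as `variables/variables.index`.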

Access (download) a remote Artifact Asset

Because a generic remote Artifact Asset is a link to a file, the Comet Python SDK cannot download the file automatically -- you need to access the link and process it on your side:

from comet_ml import Experiment, Artifact

experiment = Experiment(
    api_key="<Your API Key>",
    project_name="<Your Project Name>"
)

logged_artifact = experiment.get_artifact("artifact-name")
for asset in logged_artifact.assets:
    if asset.logical_path == "my_dataset":
        my_dataset_link = asset.link

        # TODO: Download the asset based on the link
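
How you fill in that download step depends on what the link points to. As a minimal sketch, assuming the link is a URL that Python's `urllib` can open (for example an `https://` or `file://` URL), the content can be streamed to a local file. The `fetch_link` helper is hypothetical and is not part of the Comet SDK; links that are arbitrary strings or that need custom authentication require their own handling.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_link(link: str, destination: str) -> Path:
    """Copy the content behind link to destination and return the local path."""
    target = Path(destination)
    # Create the destination folder if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(link) as response, target.open("wb") as out:
        # Stream the response to disk instead of reading it all into memory.
        shutil.copyfileobj(response, out)
    return target
```

Inside the loop above, this could be called as `fetch_link(my_dataset_link, "./data/my_dataset")`.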

Object Versioning

Both AWS S3 and Google Cloud Storage support object versioning. Object versioning allows you to store several copies of each object. Whenever an object is overwritten or removed, a new version is created and the older version is still accessible.

Whenever object versioning is enabled on the cloud object storage, the Comet SDK stores additional metadata to be able to download the correct version. All this happens automatically, so you don't have to worry about getting incorrect content.

To learn more about object versioning, see the documentation of the supported cloud object storage providers.

Try it out!

Try out remote Artifacts for yourself in this Colab Notebook.

Mar. 27, 2024