Integrate with Hugging Face Transformers¶

Hugging Face Transformers provides general-purpose machine learning models for natural language processing (NLP). It gives you easy access to pre-trained model weights and interoperability between PyTorch and TensorFlow.

Instrument Transformers with Comet to manage experiments, create dataset versions, and track hyperparameters for faster, easier reproducibility and collaboration.

Start logging¶

Connect Comet to your existing Hugging Face Trainer code by configuring it through environment variables.

Add the following lines of code to your script or notebook:

import os

import comet_ml
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

comet_ml.init(project_name="comet-examples-transformers-trainer")

# 1. Enable logging of model checkpoints
os.environ["COMET_LOG_ASSETS"] = "True"

# 2. Define your model
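model = AutoModelForSequenceClassification.from_pretrained(
    ...
)

# 3. Train your model
# training_args, train_dataset, test_dataset, and compute_metrics
# are assumed to be defined elsewhere in your script.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)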

trainer.train()

Log automatically¶

By integrating with Hugging Face's Trainer object, Comet automatically logs the following items, with no additional configuration:

Metrics (such as loss and accuracy)
Hyperparameters
Assets (such as checkpoints and log files)

End-to-end example¶

Get started with a basic example of using Comet with the Hugging Face Trainer.

You can check out the results of this example Transformers experiment for a preview of what's to come.

Install dependencies¶

python -m pip install comet_ml datasets torch transformers scikit-learn
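
Before running the example, it can help to confirm that the packages installed above are importable. The following is a minimal sketch using only the Python standard library; the helper name `missing_modules` is hypothetical and not part of Comet or Transformers. Note that scikit-learn is imported under the module name `sklearn`.

```python
import importlib.util
from typing import Iterable, List

# Import names for the packages installed above.
# scikit-learn is installed as "scikit-learn" but imported as "sklearn".
REQUIRED_MODULES = ["comet_ml", "datasets", "torch", "transformers", "sklearn"]


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All example dependencies are importable.")
```

If any module is reported missing, rerun the install command in the same Python environment that you use to run the example.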

Run the example¶

import os

import comet_ml
from datasets import load_dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

# Enable logging of model checkpoints
os.environ["COMET_LOG_ASSETS"] = "True"

comet_ml.init(project_name="comet-examples-transformers-trainer")


PRE_TRAINED_MODEL_NAME = "distilbert-base-uncased"

raw_datasets = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    PRE_TRAINED_MODEL_NAME, num_labels=2
)


def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


def get_example(index):
    return eval_dataset[index]["text"]


def compute_metrics(pred):
    experiment = comet_ml.get_global_experiment()

    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="macro"
    )
    acc = accuracy_score(labels, preds)

    if experiment:
        epoch = int(experiment.curr_epoch) if experiment.curr_epoch is not None else 0
        experiment.set_epoch(epoch)
        experiment.log_confusion_matrix(
            y_true=labels,
            y_predicted=preds,
            file_name=f"confusion-matrix-epoch-{epoch}.json",
            labels=["negative", "positive"],
            index_to_example_function=get_example,
        )

    return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall}


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(200))
eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(200))


training_args = TrainingArguments(
    seed=42,
    output_dir="./results",
    overwrite_output_dir=True,
    num_train_epochs=1,
    do_train=True,
    do_eval=True,
    evaluation_strategy="steps",
    eval_steps=25,
    save_strategy="steps",
    save_total_limit=10,
    save_steps=25,
    per_device_train_batch_size=8,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    data_collator=data_collator,
)
trainer.train()
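
To see what `compute_metrics` reports without training a model, the following dependency-free sketch reproduces the same quantities on a toy batch: the argmax over logits, the accuracy, and the macro-averaged precision, recall, and F1. The logits, the labels, and the helper names `argmax` and `macro_metrics` are made up for illustration. The example script itself relies on scikit-learn for these numbers; this sketch only mirrors the arithmetic under the assumption that undefined ratios count as 0.

```python
from typing import Dict, List, Sequence


def argmax(row: Sequence[float]) -> int:
    """Index of the largest value, like `predictions.argmax(-1)` for one row."""
    return max(range(len(row)), key=lambda i: row[i])


def macro_metrics(labels: List[int], preds: List[int], num_classes: int = 2) -> Dict[str, float]:
    """Accuracy plus unweighted (macro) averages of per-class precision, recall, F1."""
    precisions, recalls, f1s = [], [], []
    for c in range(num_classes):
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        # Assumption: an undefined ratio (zero denominator) is treated as 0.0.
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    accuracy = sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)
    return {
        "accuracy": accuracy,
        "f1": sum(f1s) / num_classes,
        "precision": sum(precisions) / num_classes,
        "recall": sum(recalls) / num_classes,
    }


# Toy logits for four reviews: column 0 = negative, column 1 = positive.
toy_logits = [[2.0, -1.0], [0.1, 0.3], [1.5, 0.2], [-0.5, 0.9]]
toy_labels = [0, 1, 1, 1]

toy_preds = [argmax(row) for row in toy_logits]
toy_metrics = macro_metrics(toy_labels, toy_preds)
print(toy_preds)
print(toy_metrics)
```

Macro averaging weights both classes equally, so a mistake on the less frequent class moves the score more than it would with a plain per-example average.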

Try it out!¶

Here's an example for using Comet with Hugging Face.

Configure Comet for Hugging Face¶

You can control which Hugging Face items are logged automatically by setting the following environment variables:

export COMET_MODE=ONLINE # Set to OFFLINE to run an Offline Experiment or DISABLE to turn off logging
export COMET_LOG_ASSETS=True # Set to False to disable logging model checkpoints
export COMET_PROJECT_NAME=<your project name> # Configure your project name
export COMET_OFFLINE_DIRECTORY=<path to offline directory> # Folder to use for saving offline experiments when `COMET_MODE` is "OFFLINE"
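
In a notebook, where `export` is not always convenient, the same variables can be set from Python. The following is a minimal sketch, assuming the variables are set before the Trainer is created, as in the example script. The helper name `apply_comet_settings`, the project name, and the offline directory path are placeholders for illustration.

```python
import os
from typing import Dict

# Placeholder values for illustration; substitute your own.
comet_settings = {
    "COMET_MODE": "OFFLINE",
    "COMET_LOG_ASSETS": "True",
    "COMET_PROJECT_NAME": "comet-examples-transformers-trainer",
    "COMET_OFFLINE_DIRECTORY": "./comet-offline",
}


def apply_comet_settings(settings: Dict[str, str], overwrite: bool = False) -> Dict[str, str]:
    """Set each variable in os.environ and return the values now in effect.

    With overwrite=False, values already exported in the shell take precedence.
    """
    for name, value in settings.items():
        if overwrite or name not in os.environ:
            os.environ[name] = value
    return {name: os.environ[name] for name in settings}


effective = apply_comet_settings(comet_settings)
for name, value in effective.items():
    print(f"{name}={value}")
```

Leaving `overwrite` at its default lets a value exported in the shell win over the one in the script, which keeps per-run overrides possible without editing code.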

For more information about using environment variables in Comet, see Configure Comet.

Apr. 25, 2024