Observability for Hugging Face Datasets with Opik
Hugging Face Datasets is a library that provides easy access to thousands of datasets for machine learning and natural language processing tasks.
This guide explains how to integrate Opik with Hugging Face Datasets to convert and import datasets into Opik for model evaluation and optimization.
Account Setup
Comet provides a hosted version of the Opik platform; simply create an account and grab your API key.
You can also run the Opik platform locally; see the installation guide for more information.
Getting Started
Installation
To use Hugging Face Datasets with Opik, install both the datasets and opik packages (for example, with pip install opik datasets).
Configuring Opik
Configure the Opik Python SDK for your deployment type. See the Python SDK Configuration guide for detailed instructions on:
- CLI configuration: opik configure
- Code configuration: opik.configure()
- Self-hosted vs Cloud vs Enterprise setup
- Configuration files and environment variables
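As an illustration of the code configuration route, the sketch below picks arguments for opik.configure() based on the deployment type. The helper names build_configure_kwargs and configure_opik are hypothetical. The use_local, api_key, and workspace arguments are accepted by opik.configure() in current SDK releases, but check the Python SDK Configuration guide for your version.

```python
from typing import Any, Dict, Optional


def build_configure_kwargs(use_local: bool,
                           api_key: Optional[str] = None,
                           workspace: Optional[str] = None) -> Dict[str, Any]:
    """Collect keyword arguments for opik.configure() (hypothetical helper)."""
    if use_local:
        # Local deployment: no API key is needed.
        return {"use_local": True}
    if not api_key:
        raise ValueError("An API key is required for the hosted platform")
    kwargs: Dict[str, Any] = {"api_key": api_key}
    if workspace:
        kwargs["workspace"] = workspace
    return kwargs


def configure_opik(**kwargs: Any) -> None:
    """Apply the configuration; requires the opik package."""
    import opik  # imported lazily so the helper above works without opik

    opik.configure(**kwargs)


# Example (not executed here):
# configure_opik(**build_configure_kwargs(use_local=True))
```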
Configuring Hugging Face
To access private datasets on Hugging Face, you need a Hugging Face access token. You can create and manage tokens in your Hugging Face account settings.
You can set it as an environment variable or set it programmatically:
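Both options can be sketched in Python. This assumes HF_TOKEN is the environment variable read by the huggingface_hub library, and that login(token=...) from huggingface_hub is available in your installed version. The function names and the token value are placeholders.

```python
import os


def set_hf_token(token: str) -> None:
    """Option 1: expose the token through the HF_TOKEN environment variable."""
    os.environ["HF_TOKEN"] = token


def login_programmatically(token: str) -> None:
    """Option 2: register the token through huggingface_hub."""
    from huggingface_hub import login  # imported lazily; needs huggingface_hub

    login(token=token)


# Example (placeholder value, not a real token):
# set_hf_token("hf_your_token_here")
```

In a shell, the equivalent of the first option is to export HF_TOKEN before starting Python.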
HuggingFaceToOpikConverter
The integration provides a utility class to convert Hugging Face datasets to Opik format:
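The class definition is not reproduced in this text, so the following is only a minimal sketch of what such a converter might look like. The method names (load, convert), the column parameters, and the metadata keys source_dataset and source_split are assumptions for illustration. Only load_dataset and Dataset.select come from the datasets library.

```python
from typing import Any, Dict, Iterable, List, Mapping, Optional, Sequence


class HuggingFaceToOpikConverter:
    """Illustrative sketch; the integration's real class may differ."""

    def __init__(self, dataset_name: str, split: str = "train") -> None:
        self.dataset_name = dataset_name
        self.split = split

    def load(self, config_name: Optional[str] = None,
             subset_size: Optional[int] = None):
        """Load the split from the Hub, optionally keeping only the first rows."""
        from datasets import load_dataset  # imported lazily; needs `datasets`

        dataset = load_dataset(self.dataset_name, config_name, split=self.split)
        if subset_size is not None:
            dataset = dataset.select(range(min(subset_size, len(dataset))))
        return dataset

    def convert(self, rows: Iterable[Mapping[str, Any]],
                input_columns: Sequence[str],
                output_columns: Sequence[str],
                metadata_columns: Sequence[str] = ()) -> List[Dict[str, Any]]:
        """Map each row to a dict with input, expected_output, and metadata."""
        items: List[Dict[str, Any]] = []
        for row in rows:
            metadata = {col: row[col] for col in metadata_columns}
            # Record where the item came from (assumed key names).
            metadata["source_dataset"] = self.dataset_name
            metadata["source_split"] = self.split
            items.append({
                "input": {col: row[col] for col in input_columns},
                "expected_output": {col: row[col] for col in output_columns},
                "metadata": metadata,
            })
        return items
```

Because convert accepts any iterable of row mappings, it can be exercised on plain dictionaries without downloading a dataset.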
Basic Usage
Convert and Upload a Dataset
Here’s how to convert a Hugging Face dataset to Opik format and upload it:
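The sketch below shows one way this could look, using the Opik client methods get_or_create_dataset and insert. The function upload_rows, its parameters, and the item layout (input and expected_output keys) are assumptions for illustration. The optional client argument exists only so the function can be tried with a stand-in object.

```python
from typing import Any, Iterable, Mapping, Optional


def upload_rows(rows: Iterable[Mapping[str, Any]],
                dataset_name: str,
                input_column: str,
                output_column: str,
                client: Optional[Any] = None) -> int:
    """Convert rows to Opik items and insert them; returns the item count."""
    if client is None:
        import opik  # imported lazily; needs a configured Opik SDK

        client = opik.Opik()

    items = [
        {
            "input": {input_column: row[input_column]},
            "expected_output": {output_column: row[output_column]},
        }
        for row in rows
    ]
    dataset = client.get_or_create_dataset(name=dataset_name)
    dataset.insert(items)
    return len(items)


# Example (not executed here; downloads data and needs Opik configured):
# from datasets import load_dataset
# rows = load_dataset("imdb", split="test[:100]")
# upload_rows(rows, "imdb-sample", input_column="text", output_column="label")
```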
Convert to Opik Format
The converter provides a method to transform Hugging Face datasets into Opik’s expected format:
Using with @track decorator
Use the @track decorator to create comprehensive traces when working with your converted datasets:
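A minimal sketch follows, assuming items shaped like {"input": {"question": ...}, "expected_output": {"answer": ...}}. The predict function is a placeholder for a real model call, and all function names are hypothetical. Only track is imported from opik.

```python
from typing import Any, Callable, Dict, List, Mapping, Sequence


def predict(question: str) -> str:
    """Placeholder for a real model call."""
    return f"answer to: {question}"


def run_over_items(items: Sequence[Mapping[str, Any]],
                   model: Callable[[str], str] = predict) -> List[Dict[str, Any]]:
    """Run the model on every item and pair predictions with expectations."""
    return [
        {
            "question": item["input"]["question"],
            "prediction": model(item["input"]["question"]),
            "expected": item["expected_output"]["answer"],
        }
        for item in items
    ]


def run_with_tracing(items: Sequence[Mapping[str, Any]]) -> List[Dict[str, Any]]:
    """Same loop, with each call recorded by Opik; needs a configured SDK."""
    from opik import track  # imported lazily

    @track(name="predict")
    def traced_predict(question: str) -> str:
        return predict(question)

    @track(name="process_hf_items")
    def traced_run(batch: Sequence[Mapping[str, Any]]) -> List[Dict[str, Any]]:
        # Calls to traced_predict are nested under this outer trace.
        return run_over_items(batch, model=traced_predict)

    return traced_run(items)
```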
Popular Dataset Examples
SQuAD (Question Answering)
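As a sketch, SQuAD rows can be mapped to Opik-style items as shown below. The column names (id, title, context, question, answers) are those of the SQuAD dataset on the Hugging Face Hub, published there as rajpurkar/squad. The item layout and the function names are assumptions.

```python
from typing import Any, Dict, List, Mapping


def squad_row_to_item(row: Mapping[str, Any]) -> Dict[str, Any]:
    """Map one SQuAD row to an item with input, expected_output, metadata."""
    answers = row["answers"]["text"]  # list of reference answers
    return {
        "input": {"question": row["question"], "context": row["context"]},
        "expected_output": {"answer": answers[0] if answers else ""},
        "metadata": {"id": row["id"], "title": row["title"], "source": "squad"},
    }


def load_squad_items(limit: int = 100) -> List[Dict[str, Any]]:
    """Download the first `limit` validation rows and convert them."""
    from datasets import load_dataset  # imported lazily; needs `datasets`

    rows = load_dataset("rajpurkar/squad", split=f"validation[:{limit}]")
    return [squad_row_to_item(row) for row in rows]
```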
GLUE (General Language Understanding)
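GLUE is a collection of tasks, so a configuration name is required when loading it. The sketch below uses the SST-2 sentiment task (columns sentence, label, idx; label 0 is negative and 1 is positive), published on the Hub as nyu-mll/glue. The item layout and function names are assumptions.

```python
from typing import Any, Dict, List, Mapping

SST2_LABELS = {0: "negative", 1: "positive"}


def sst2_row_to_item(row: Mapping[str, Any]) -> Dict[str, Any]:
    """Map one SST-2 row to an item with a human-readable label."""
    return {
        "input": {"sentence": row["sentence"]},
        "expected_output": {"label": SST2_LABELS[row["label"]]},
        "metadata": {"idx": row["idx"], "source": "glue/sst2"},
    }


def load_sst2_items(limit: int = 100) -> List[Dict[str, Any]]:
    """Download the first `limit` validation rows and convert them."""
    from datasets import load_dataset  # imported lazily; needs `datasets`

    # The validation split is used because test labels are hidden in GLUE.
    rows = load_dataset("nyu-mll/glue", "sst2", split=f"validation[:{limit}]")
    return [sst2_row_to_item(row) for row in rows]
```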
Common Crawl (Text Classification)
Results viewing
Once your Hugging Face datasets are converted and uploaded to Opik, you can view them in the Opik UI. Each dataset will contain:
- Input data from specified columns
- Expected output from specified columns
- Metadata from additional columns
- Source information (Hugging Face dataset name and split)
Feedback Scores and Evaluation
Once your Hugging Face datasets are in Opik, you can evaluate your LLM applications using Opik’s evaluation framework:
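A minimal sketch of such an evaluation is shown below. It assumes items with an input.sentence field and an expected_output.label field. The classify function is a placeholder for your application, and the experiment name is made up. The evaluate function and the Equals metric are imported from opik.evaluation, and Equals compares the output and reference keys returned by the task.

```python
from typing import Any, Dict, Mapping


def classify(sentence: str) -> str:
    """Placeholder for a real LLM application call."""
    return "positive" if "good" in sentence.lower() else "negative"


def evaluation_task(item: Mapping[str, Any]) -> Dict[str, str]:
    """Produce the keys that the Equals metric scores: output and reference."""
    return {
        "output": classify(item["input"]["sentence"]),
        "reference": item["expected_output"]["label"],
    }


def run_evaluation(dataset_name: str):
    """Run the task over an existing Opik dataset; needs a configured SDK."""
    import opik  # imported lazily
    from opik.evaluation import evaluate
    from opik.evaluation.metrics import Equals

    client = opik.Opik()
    dataset = client.get_dataset(name=dataset_name)
    return evaluate(
        dataset=dataset,
        task=evaluation_task,
        scoring_metrics=[Equals()],
        experiment_name="hf-dataset-eval",  # hypothetical name
    )
```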
Environment Variables
Make sure to set the following environment variables:
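The exact list is not reproduced in this text. A plausible set, stated here as an assumption, is OPIK_API_KEY and OPIK_WORKSPACE for the hosted Opik platform, plus HF_TOKEN for private Hugging Face datasets. Self-hosted deployments typically set OPIK_URL_OVERRIDE instead of an API key. The sketch below reports which of these are missing; the helper name is hypothetical.

```python
import os
from typing import Iterable, List, Mapping, Optional

# Assumed variable names; adjust to your deployment.
OPIK_VARS = ("OPIK_API_KEY", "OPIK_WORKSPACE")
HF_VARS = ("HF_TOKEN",)


def missing_env_vars(names: Iterable[str],
                     environ: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names that are unset or empty, preserving input order."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


# Example (not executed here):
# missing = missing_env_vars(OPIK_VARS + HF_VARS)
# if missing:
#     raise RuntimeError(f"Missing environment variables: {missing}")
```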
Troubleshooting
Common Issues
- Authentication Errors: Ensure your Hugging Face token is correct for private datasets
- Dataset Not Found: Verify the dataset name and configuration are correct
- Memory Issues: Use the subset_size parameter to limit large datasets
- Data Type Conversion: The converter handles most data types, but complex nested structures may need custom handling
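For the data type conversion case, one possible form of custom handling is a helper that recursively turns values into JSON-friendly types before they are inserted. This is a sketch under assumptions, not part of the converter: the name to_jsonable and the chosen encodings (ISO strings for dates, base64 for bytes) are illustrative.

```python
import base64
from datetime import date, datetime
from typing import Any


def to_jsonable(value: Any) -> Any:
    """Recursively convert a value into JSON-serializable Python types."""
    if isinstance(value, dict):
        return {str(key): to_jsonable(val) for key, val in value.items()}
    if isinstance(value, (list, tuple, set)):
        return [to_jsonable(val) for val in value]
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    if isinstance(value, bytes):
        return base64.b64encode(value).decode("ascii")
    if hasattr(value, "tolist"):
        # Covers NumPy arrays and scalars without importing NumPy.
        return to_jsonable(value.tolist())
    if value is None or isinstance(value, (str, int, float, bool)):
        return value
    # Fallback for unknown objects: keep a readable representation.
    return str(value)
```

Applying it to each row before conversion, for example to_jsonable(dict(row)), keeps nested answers, timestamps, and binary fields from failing at upload time.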
Getting Help
- Check the Hugging Face Datasets Documentation for dataset loading
- Review the Hugging Face Hub Documentation for authentication
- Contact Hugging Face support for dataset-specific problems
- Check Opik documentation for tracing and evaluation features
Next Steps
Once you have Hugging Face Datasets integrated with Opik, you can:
- Evaluate your LLM applications using Opik’s evaluation framework
- Create datasets to test and improve your models
- Set up feedback collection to gather human evaluations
- Monitor performance across different models and configurations