🎯 Advanced Filtering & Search Capabilities
We’ve expanded filtering and search capabilities to help you find and analyze data more effectively:
- Custom Trace Filters: Support for custom filters on input/output fields for traces and spans, allowing more precise data filtering
- Enhanced Search: Improved search functionality with better result highlighting and local search within code blocks
- Filter Crash Fix: Fixed filtering issues for values containing special characters such as % to prevent crashes
- Dataset Filtering: Added support for filtering experiments by datasetId and promptId

📊 Metrics & Analytics Improvements
We’ve enhanced the metrics and analytics capabilities:
- Thread Feedback Scores: Added comprehensive thread feedback scoring system for better conversation quality assessment
- Thread Duration Monitoring: New duration widgets in the Metrics dashboard for monitoring conversation length trends
- Online Evaluation Rules: Added ability to enable/disable online evaluation rules for more flexible monitoring
- Cost Optimization: Reduced cost prompt queries to improve performance and reduce unnecessary API calls
🎨 UX Enhancements
We’ve made several UX improvements to make the platform more intuitive and efficient:
- Full-Screen Popup Improvements: Enhanced the full-screen popup experience with better navigation and usability
- Tag Component Optimization: Made tag components smaller and more compact for better space utilization
- Column Sorting: Enabled sorting and filtering on all Prompt columns for better data organization
- Multi-Item Tagging: Added ability to add tags to multiple items in the Traces and Spans tables simultaneously
🔌 SDK, integrations and docs
- LangChain Integration: Enhanced LangChain integration with improved provider and model logging
- Google ADK Integration: Updated Google ADK integration with better graph building capabilities
- Bedrock Integration: Added comprehensive cost tracking support for ChatBedrock and ChatBedrockConverse
🔒 Security & Stability Enhancements
We’ve implemented several security and stability improvements:
- Dependency Updates: Updated critical dependencies including MySQL connector, OpenTelemetry, and various security patches
- Error Handling: Improved error handling and logging across the platform
- Performance Monitoring: Enhanced NewRelic support for better performance monitoring
- Sentry Integration: Added more metadata about package versions to Sentry events for better debugging
And much more! 👉 See full commit log on GitHub
Releases: 1.8.7, 1.8.8, 1.8.9, 1.8.10, 1.8.11, 1.8.12, 1.8.13, 1.8.14, 1.8.15, 1.8.16
🧵 Thread-level LLMs-as-Judge
We now support thread-level LLMs-as-a-Judge metrics!
We’ve implemented Online evaluation for threads, enabling the evaluation of entire conversations between humans and agents.
This allows for scalable measurement of metrics such as user frustration, goal achievement, conversational turn quality, clarification request rates, alignment with user intent, and much more.
We’ve also implemented Python metrics support for threads, giving you full code control over metric definitions.
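For illustration, here is a minimal sketch of what a code-defined metric can look like. It assumes the standard BaseMetric interface from the Opik Python SDK; the exact way conversation turns are passed to thread-level metrics may differ, so treat the signature below as a hypothetical example rather than the definitive API.

```python
# A minimal custom-metric sketch, assuming the standard BaseMetric interface
# from the Opik Python SDK. How conversation turns are passed to thread-level
# metrics may differ from this illustration.
from opik.evaluation.metrics import base_metric, score_result


class ConversationLengthPenalty(base_metric.BaseMetric):
    """Hypothetical metric: scores a conversation lower the more turns it takes."""

    def __init__(self, name: str = "conversation_length_penalty", max_turns: int = 10):
        super().__init__(name=name)
        self.max_turns = max_turns

    def score(self, conversation: list, **ignored_kwargs) -> score_result.ScoreResult:
        # `conversation` is assumed to be a list of {"role": ..., "content": ...} turns.
        n_turns = len(conversation)
        value = max(0.0, 1.0 - n_turns / self.max_turns)
        return score_result.ScoreResult(
            name=self.name,
            value=value,
            reason=f"{n_turns} turns against a budget of {self.max_turns}.",
        )
```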

To improve visibility into trends and to help detect spikes in these metrics when the agent is running in production, we’ve added Thread Feedback Scores and Thread Duration widgets to the Metrics dashboard. These additions make it easier to monitor changes over time in live environments.

🔍 Improved Trace Inspection Experience
Once you’ve identified problematic sessions or traces, we’ve made it easier to inspect and analyze them with the following improvements:
- Field Selector for Trace Tree: Quickly choose which fields to display in the trace view.
- Span Type Filter: Filter spans by type to focus on what matters.
- Improved Agent Graph: Now supports full-page view and zoom for easier navigation.
- Free Text Search: Search across traces and spans freely without constraints.
- Better Search Usability: search results are now highlighted and local search is available within code blocks.

📊 Spans Tab Improvements
The Spans tab provides a clearer, more comprehensive view of agent activity to help you analyze tool and sub-agent usage across threads, uncover trends, and spot latency outliers more easily.
What’s New:
- LLM Calls → Spans: we’ve renamed the LLM Calls tab to Spans to reflect broader coverage and richer insights.
- Unified View: see all spans in one place, including LLM calls, tools, guardrails, and more.
- Span Type Filter: quickly filter spans by type to focus on what matters most.
- Customizable Columns: highlight key span types by adding them as dedicated columns.
These improvements make it faster and easier to inspect agent behavior and performance at a glance.

📈 Experiments Improvements
Slow model response times can lead to frustrating user experiences and create hidden bottlenecks in production systems. However, identifying latency issues early (during experimentation) is often difficult without clear visibility into model performance.
To help address this, we’ve added Duration as a key metric for monitoring model latency in the Experiments engine. You can now include Duration as a selectable column in both the Experiments and Experiment Details views. This makes it easier to identify slow-responding models or configurations early, so you can proactively address potential performance risks before they impact users.

📦 Enhanced Data Organization & Tagging
When usage grows and data volumes increase, effective data management becomes crucial. We’ve added several capabilities to make team workflows easier:
- Tagging, filtering, and column sorting support for Prompts
- Tagging, filtering, and column sorting support for Datasets
- Ability to add tags to multiple items in the Traces and Spans tables
🤖 New Models Support
We’ve added support for:
- OpenAI GPT-4.1 and GPT-4.1-mini models
- Anthropic Claude 4 Sonnet model
🌐 Integration Updates
We’ve enhanced several integrations:
- Build graph for Google ADK agents
- Update Langchain integration to log provider, model and usage when using Google Generative AI models
- Implement Groq LLM usage tracking support in the Langchain integration
And much more! 👉 See full commit log on GitHub
Releases: 1.8.0, 1.8.1, 1.8.2, 1.8.3, 1.8.4, 1.8.5, 1.8.6
🛠 Agent Optimizer 1.0 released!
The Opik Agent Optimizer now supports full agentic systems and not just single prompts.
With support for LangGraph, Google ADK, PydanticAI, and more, this release brings a simplified API, model customization for evaluation, and standardized interfaces to streamline optimization workflows. Learn more in the docs.
🧵 Thread-level improvements
Added Thread-Level Feedback, Tags & Comments: You can now add expert feedback scores directly at the thread level, enabling SMEs to review full agent conversations, flag risks, and collaborate with dev teams more effectively. Added support for thread-level tags and comments to streamline workflows and improve context sharing.

🖥️ UX improvements
- We’ve redesigned the Opik Home Page to deliver a cleaner, more intuitive first-use experience, with a focused value proposition, direct access to key metrics, and a polished look. The demo data has also been upgraded to showcase Opik’s capabilities more effectively for new users. Additionally, we’ve added inter-project comparison capabilities for metrics and cost control, allowing you to benchmark and monitor performance and expenses across multiple projects.


- Improved Error Visualization: Enhanced how span-level errors are surfaced across the project. Errors now bubble up to the project view, with quick-access shortcuts to detailed error logs and variation stats for better debugging and error tracking.
- Improved Sidebar Hotkeys: Updated sidebar hotkeys for more efficient keyboard navigation between items and detail views.
🔌 SDK, integrations and docs
- Added Langchain support in metric classes, allowing use of Langchain as a model proxy alongside LiteLLM for flexible LLM judge customization.
- Added support for the Gemini 2.5 model family.
- Updated pretty mode to support Dify and LangGraph + OpenAI responses.
- Added the OpenAI agents integration cookbook (link).
- Added a cookbook on how to import Huggingface Datasets to Opik.
👉 See full commit log on GitHub
Releases: 1.7.37, 1.7.38, 1.7.39, 1.7.40, 1.7.41, 1.7.42
🔌 Integrations and SDK
- Added CloudFlare’s WorkersAI integration (docs)
- Google ADK integration: tracing is now automatically propagated to all sub-agents in agentic systems with the new track_adk_agent_recursive feature, eliminating the need to manually add tracing to each sub-agent.
- Google ADK integration: we now retrieve session-level information from the ADK framework to enrich the threads data.
- New in the SDK! Real-time tracking for long-running spans/traces is now supported. When enabled (set os.environ["OPIK_LOG_START_TRACE_SPAN"] = "True" in your environment), you can see traces and spans update live in the UI, even for jobs that are still running. This makes debugging and monitoring long-running agents much more responsive and convenient.
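As a quick sketch, enabling it only requires the environment variable; the opik.track decorator below is the standard tracing pattern, and the function itself is just a stand-in for a long-running job.

```python
import os

# Enable live UI updates for traces/spans that are still in progress (new in this release).
os.environ["OPIK_LOG_START_TRACE_SPAN"] = "True"

import opik


@opik.track
def long_running_agent_step(query: str) -> str:
    # A slow LLM call, tool call, or retrieval step would live here.
    return f"processed: {query}"


long_running_agent_step("summarize the quarterly report")
```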
🧵 Threads improvements
- Added Token Count and Cost Metrics in Thread table
- Added Sorting on all Thread table columns
- Added Navigation from Thread Detail to all related traces
- Added support for “pretty mode” in OpenAI Agents threads
🧪 Experiments improvements
- Added support for filtering experiments by configuration metadata. It is now also possible to add a column displaying the configuration in the experiments table.
🛠 Agent Optimizer improvements
- New Public API for Agent Optimization
- Added optimization run display link
- Added optimization_context
🛡️ Security Fixes
- Fixed: h11 accepted some malformed Chunked-Encoding bodies
- Fixed: setuptools had a path traversal vulnerability in PackageIndex.download that could lead to Arbitrary File Write
- Fixed: LiteLLM had an Improper Authorization Vulnerability
👉 See full commit log on GitHub
Releases: 1.7.32, 1.7.33, 1.7.34, 1.7.35, 1.7.36
💡 Product Enhancements
- Ability to upload CSV datasets directly through the user interface
- Added experiment cost tracking to the Experiments table
- Added hinters and helpers for onboarding new users across the platform
- Added “LLM calls count” to the traces table
- Pretty formatting for complex agentic threads
- Preview support for MP3 files in the frontend
🛠 SDKs and API Enhancements
- Good news for JS developers! We’ve released experiments support for the JS SDK (official docs coming very soon)
- New Experiments Bulk API: an API for logging Experiments in bulk.
- Rate Limiting improvements both in the API and the SDK
🔌 Integrations
- Support for OpenAI o3-mini and Groq models added to the Playground
- OpenAI Agents: context awareness implemented, robustness improved, and thread handling improved
- Google ADK: added support for multi-agent integration
- LiteLLM: token and cost tracking added for SDK calls. Integration now compatible with opik.configure(…)
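As a minimal sketch of the LiteLLM flow, assuming the OpikLogger callback that ships inside LiteLLM (its import path may vary between LiteLLM versions) and an example model name:

```python
# Sketch: configure the Opik SDK, then let LiteLLM log token usage and cost to Opik.
# The OpikLogger import path lives inside LiteLLM and may differ between versions.
import litellm
import opik
from litellm.integrations.opik.opik import OpikLogger

opik.configure()  # sets up Opik credentials/workspace (may prompt on first run)

litellm.callbacks = [OpikLogger()]

response = litellm.completion(
    model="gpt-4o-mini",  # example model name
    messages=[{"role": "user", "content": "Say hello in one word."}],
)
print(response.choices[0].message.content)
```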
👉 See full commit log on GitHub
Releases: 1.7.27, 1.7.28, 1.7.29, 1.7.30, 1.7.31
✨ New Features
- Opik Agent Optimizer: A comprehensive toolkit designed to enhance the performance and efficiency of your Large Language Model (LLM) applications. Read more
- Opik Guardrails: Guardrails help you protect your application from risks inherent in LLMs. Use them to check the inputs and outputs of your LLM calls, and detect issues like off-topic answers or leaking sensitive information. Read more
💡 Product Enhancements
- New Prompt Selector in Playground — Choose existing prompts from your Prompt Library to streamline your testing workflows.
- Improved “Pretty Format” for Agents — Enhanced readability for complex threads in the UI.
🔌 Integrations
- Vertex AI (Gemini) — Offline and online evaluation support integrated directly into Opik. Also available now in the Playground.
- OpenAI Integration in the JS/TS SDK
- AWS Strands Agents
- Agno Framework
- Google ADK Multi-agent support
🛠 SDKs and API Enhancements
- OpenAI LLM advanced configurations — Support for custom headers and base URLs (see the sketch after this list).
- Span Timing Precision — Time resolution improved to microseconds for accurate monitoring.
- Better Error Messaging — More descriptive errors for SDK validation and runtime failures.
- Stream-based Tracing and Enhanced Streaming support
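Here is a minimal sketch of the advanced OpenAI configuration mentioned above: a custom base URL and extra headers on the OpenAI client, wrapped with Opik's track_openai helper. The gateway URL and header below are placeholders, not real endpoints.

```python
# Sketch: custom base URL and headers on the OpenAI client, traced by Opik.
from openai import OpenAI
from opik.integrations.openai import track_openai

client = OpenAI(
    base_url="https://llm-gateway.example.com/v1",  # hypothetical proxy/gateway URL
    default_headers={"x-team": "platform-eval"},  # hypothetical custom header
)
tracked_client = track_openai(client)

completion = tracked_client.chat.completions.create(
    model="gpt-4o-mini",  # example model name
    messages=[{"role": "user", "content": "ping"}],
)
print(completion.choices[0].message.content)
```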
👉 See full commit log on GitHub
Releases: 1.7.19, 1.7.20, 1.7.21, 1.7.22, 1.7.23, 1.7.24, 1.7.25, 1.7.26
Opik Dashboard:
Python and JS / TS SDK:
- Added support for streaming in the ADK integration
- Added cost tracking for the ADK integration
- Added support for OpenAI responses.parse
- Reduced the memory and CPU overhead of the Python SDK through various performance optimizations
Deployments:
- Updated port mapping when using opik.sh
- Fixed persistence when using Docker compose deployments
Releases: 1.7.15, 1.7.16, 1.7.17, 1.7.18
Opik Dashboard:
- Updated the experiment page charts to better handle nulls; all metric values are now displayed.
- Added lazy loading to the trace and span sidebar to better handle very large traces.
- Added support for trace and span attachments; you can now log PDF, video, and audio files to your traces.

- Improved performance of some Experiment endpoints
Python and JS / TS SDK:
- Updated the DSPy integration following the latest DSPy release
- New Autogen integration based on Opik’s OpenTelemetry endpoints
- Added compression to the request payload
Releases: 1.7.12, 1.7.13, 1.7.14
Opik Dashboard:
- Released Python code metrics for online evaluations, for both Opik Cloud and self-hosted deployments. This allows you to define Python functions to evaluate your traces in production.
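As a purely hypothetical sketch of what such a scoring function could look like (the exact contract the online rule editor expects, including which trace fields are passed in, may differ):

```python
# Hypothetical online-evaluation code metric: flag traces whose output apologizes
# to the user. The function name and signature are illustrative assumptions, not
# the definitive contract expected by the rule editor.
def contains_apology(output: str) -> float:
    text = output.lower()
    return 1.0 if ("sorry" in text or "apolog" in text) else 0.0
```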

Python and JS / TS SDK:
- Fixed LLM as a judge metrics so they return an error rather than a score of 0.5 when the LLM returns a score outside the range 0 to 1.
Deployments:
- Updated Dockerfiles to ensure all containers run as non-root users.
Release: 1.7.11
Opik Dashboard:
- Updated the feedback scores UI in the experiment page to make it easier to annotate experiment results.
- Fixed an issue with base64 encoded images in the experiment sidebar.
- Improved the loading speeds of the traces table and traces sidebar for traces that have very large payloads (25MB+).
Python and JS / TS SDK:
- Improved the robustness of LLM as a Judge metrics with better parsing.
- Fixed usage tracking for Anthropic models hosted on VertexAI.
- When using LiteLLM, we fall back to the LiteLLM cost if no model provider or model is specified.
- Added support for thread_id in the LangGraph integration.
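A minimal sketch of how this could look with a LangGraph graph, assuming the OpikTracer callback from the LangChain integration and LangGraph's standard configurable.thread_id (the exact mechanism Opik uses to pick up the thread ID is an assumption here):

```python
# Sketch: group LangGraph runs into a single Opik thread.
# Assumes the OpikTracer callback handler and LangGraph's configurable.thread_id;
# how Opik associates the run with a thread may differ, check the docs.
from typing import TypedDict

from langgraph.graph import END, START, StateGraph
from opik.integrations.langchain import OpikTracer


class State(TypedDict):
    question: str
    answer: str


def answer_node(state: State) -> dict:
    return {"answer": f"You asked: {state['question']}"}


builder = StateGraph(State)
builder.add_node("answer", answer_node)
builder.add_edge(START, "answer")
builder.add_edge("answer", END)
app = builder.compile()

opik_tracer = OpikTracer()

result = app.invoke(
    {"question": "What changed in this release?"},
    config={
        "callbacks": [opik_tracer],
        "configurable": {"thread_id": "conversation-42"},  # groups runs into one thread
    },
)
```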
Releases: 1.7.4, 1.7.5, 1.7.6, 1.7.7 and 1.7.8.