🎯 Advanced Filtering & Search Capabilities
We’ve expanded filtering and search capabilities to help you find and analyze data more effectively:
- Custom Trace Filters: Support for custom filters on input/output fields for traces and spans, allowing more precise data filtering
- Enhanced Search: Improved search functionality with better result highlighting and local search within code blocks
- Filter Crash Fix: Fixed filtering issues for values containing special characters such as % to prevent crashes
- Dataset Filtering: Added support for filtering experiments by datasetId and promptId

📊 Metrics & Analytics Improvements
We’ve enhanced the metrics and analytics capabilities:
- Thread Feedback Scores: Added comprehensive thread feedback scoring system for better conversation quality assessment
- Thread Duration Monitoring: New duration widgets in the Metrics dashboard for monitoring conversation length trends
- Online Evaluation Rules: Added ability to enable/disable online evaluation rules for more flexible monitoring
- Cost Optimization: Reduced cost prompt queries to improve performance and reduce unnecessary API calls
🎨 UX Enhancements
We’ve made several UX improvements to make the platform more intuitive and efficient:
- Full-Screen Popup Improvements: Enhanced the full-screen popup experience with better navigation and usability
- Tag Component Optimization: Made tag components smaller and more compact for better space utilization
- Column Sorting: Enabled sorting and filtering on all Prompt columns for better data organization
- Multi-Item Tagging: Added ability to add tags to multiple items in the Traces and Spans tables simultaneously
🔌 SDK, integrations and docs
- LangChain Integration: Enhanced LangChain integration with improved provider and model logging
- Google ADK Integration: Updated Google ADK integration with better graph building capabilities
- Bedrock Integration: Added comprehensive cost tracking support for ChatBedrock and ChatBedrockConverse
🔒 Security & Stability Enhancements
We’ve implemented several security and stability improvements:
- Dependency Updates: Updated critical dependencies including MySQL connector, OpenTelemetry, and various security patches
- Error Handling: Improved error handling and logging across the platform
- Performance Monitoring: Enhanced NewRelic support for better performance monitoring
- Sentry Integration: Added more metadata about package versions to Sentry events for better debugging
And much more! 👉 See full commit log on GitHub
Releases: 1.8.7, 1.8.8, 1.8.9, 1.8.10, 1.8.11, 1.8.12, 1.8.13, 1.8.14, 1.8.15, 1.8.16
🧵 Thread-level LLMs-as-Judge
We now support thread-level LLMs-as-a-Judge metrics!
We’ve implemented Online evaluation for threads, enabling the evaluation of entire conversations between humans and agents.
This allows for scalable measurement of metrics such as user frustration, goal achievement, conversational turn quality, clarification request rates, alignment with user intent, and much more.
We’ve also implemented Python metrics support for threads, giving you full code control over metric definitions.
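For illustration, here is a minimal sketch of what a code-defined metric can look like. It assumes the standard BaseMetric interface from the Opik Python SDK; the exact way conversation turns are passed to thread-level metrics may differ, so treat the signature below as a hypothetical example rather than the definitive API.

```python
# A minimal custom-metric sketch, assuming the standard BaseMetric interface
# from the Opik Python SDK. How conversation turns are passed to thread-level
# metrics may differ from this illustration.
from opik.evaluation.metrics import base_metric, score_result


class ConversationLengthPenalty(base_metric.BaseMetric):
    """Hypothetical metric: scores a conversation lower the more turns it takes."""

    def __init__(self, name: str = "conversation_length_penalty", max_turns: int = 10):
        super().__init__(name=name)
        self.max_turns = max_turns

    def score(self, conversation: list, **ignored_kwargs) -> score_result.ScoreResult:
        # `conversation` is assumed to be a list of {"role": ..., "content": ...} turns.
        n_turns = len(conversation)
        value = max(0.0, 1.0 - n_turns / self.max_turns)
        return score_result.ScoreResult(
            name=self.name,
            value=value,
            reason=f"{n_turns} turns against a budget of {self.max_turns}.",
        )
```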

To improve visibility into trends and to help detect spikes in these metrics when the agent is running in production, we’ve added Thread Feedback Scores and Thread Duration widgets to the Metrics dashboard. These additions make it easier to monitor changes over time in live environments.

🔍 Improved Trace Inspection Experience
Once you’ve identified problematic sessions or traces, we’ve made it easier to inspect and analyze them with the following improvements:
- Field Selector for Trace Tree: Quickly choose which fields to display in the trace view.
- Span Type Filter: Filter spans by type to focus on what matters.
- Improved Agent Graph: Now supports full-page view and zoom for easier navigation.
- Free Text Search: Search across traces and spans freely without constraints.
- Better Search Usability: search results are now highlighted and local search is available within code blocks.

📊 Spans Tab Improvements
The Spans tab provides a clearer, more comprehensive view of agent activity to help you analyze tool and sub-agent usage across threads, uncover trends, and spot latency outliers more easily.
What’s New:
- LLM Calls → Spans: we’ve renamed the LLM Calls tab to Spans to reflect broader coverage and richer insights.
- Unified View: see all spans in one place, including LLM calls, tools, guardrails, and more.
- Span Type Filter: quickly filter spans by type to focus on what matters most.
- Customizable Columns: highlight key span types by adding them as dedicated columns.
These improvements make it faster and easier to inspect agent behavior and performance at a glance.

📈 Experiments Improvements
Slow model response times can lead to frustrating user experiences and create hidden bottlenecks in production systems. However, identifying latency issues early (during experimentation) is often difficult without clear visibility into model performance.
To help address this, we’ve added Duration as a key metric for monitoring model latency in the Experiments engine. You can now include Duration as a selectable column in both the Experiments and Experiment Details views. This makes it easier to identify slow-responding models or configurations early, so you can proactively address potential performance risks before they impact users.

📦 Enhanced Data Organization & Tagging
When usage grows and data volumes increase, effective data management becomes crucial. We’ve added several capabilities to make team workflows easier:
- Tagging, filtering, and column sorting support for Prompts
- Tagging, filtering, and column sorting support for Datasets
- Ability to add tags to multiple items in the Traces and Spans tables
🤖 New Models Support
We’ve added support for:
- OpenAI GPT-4.1 and GPT-4.1-mini models
- Anthropic Claude 4 Sonnet model
🌐 Integration Updates
We’ve enhanced several integrations:
- Build graph for Google ADK agents
- Update Langchain integration to log provider, model and usage when using Google Generative AI models
- Implement Groq LLM usage tracking support in the Langchain integration
And much more! 👉 See full commit log on GitHub
Releases: 1.8.0, 1.8.1, 1.8.2, 1.8.3, 1.8.4, 1.8.5, 1.8.6
🛠 Agent Optimizer 1.0 released!
The Opik Agent Optimizer now supports full agentic systems and not just single prompts.
With support for LangGraph, Google ADK, PydanticAI, and more, this release brings a simplified API, model customization for evaluation, and standardized interfaces to streamline optimization workflows. Learn more in the docs.
🧵 Thread-level improvements
Added Thread-Level Feedback, Tags & Comments: You can now add expert feedback scores directly at the thread level, enabling SMEs to review full agent conversations, flag risks, and collaborate with dev teams more effectively. Added support for thread-level tags and comments to streamline workflows and improve context sharing.

🖥️ UX improvements
- We’ve redesigned the Opik Home Page to deliver a cleaner, more intuitive first-use experience, with a focused value proposition, direct access to key metrics, and a polished look. The demo data has also been upgraded to showcase Opik’s capabilities more effectively for new users. Additionally, we’ve added inter-project comparison capabilities for metrics and cost control, allowing you to benchmark and monitor performance and expenses across multiple projects.


- Improved Error Visualization: Enhanced how span-level errors are surfaced across the project. Errors now bubble up to the project view, with quick-access shortcuts to detailed error logs and variation stats for better debugging and error tracking.
- Improved Sidebar Hotkeys: Updated sidebar hotkeys for more efficient keyboard navigation between items and detail views.
🔌 SDK, integrations and docs
- Added Langchain support in metric classes, allowing use of Langchain as a model proxy alongside LiteLLM for flexible LLM judge customization.
- Added support for the Gemini 2.5 model family.
- Updated pretty mode to support Dify and LangGraph + OpenAI responses.
- Added the OpenAI agents integration cookbook (link).
- Added a cookbook on how to import Huggingface Datasets to Opik.
👉 See full commit log on GitHub
Releases: 1.7.37, 1.7.38, 1.7.39, 1.7.40, 1.7.41, 1.7.42
🔌 Integrations and SDK
- Added CloudFlare’s WorkersAI integration (docs)
- Google ADK integration: tracing is now automatically propagated to all sub-agents in agentic systems with the new track_adk_agent_recursive feature, eliminating the need to manually add tracing to each sub-agent.
- Google ADK integration: we now retrieve session-level information from the ADK framework to enrich the threads data.
- New in the SDK! Real-time tracking for long-running spans/traces is now supported. When enabled (set os.environ["OPIK_LOG_START_TRACE_SPAN"] = "True" in your environment), you can see traces and spans update live in the UI, even for jobs that are still running. This makes debugging and monitoring long-running agents much more responsive and convenient.
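As a quick sketch, enabling it only requires the environment variable; the opik.track decorator below is the standard tracing pattern, and the function itself is just a stand-in for a long-running job.

```python
import os

# Enable live UI updates for traces/spans that are still in progress (new in this release).
os.environ["OPIK_LOG_START_TRACE_SPAN"] = "True"

import opik


@opik.track
def long_running_agent_step(query: str) -> str:
    # A slow LLM call, tool call, or retrieval step would live here.
    return f"processed: {query}"


long_running_agent_step("summarize the quarterly report")
```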
🧵 Threads improvements
- Added Token Count and Cost Metrics in Thread table
- Added Sorting on all Thread table columns
- Added Navigation from Thread Detail to all related traces
- Added support for “pretty mode” in OpenAI Agents threads
🧪 Experiments improvements
- Added support for filtering experiments by configuration metadata. It is now also possible to add a column displaying the configuration in the experiments table.
🛠 Agent Optimizer improvements
- New Public API for Agent Optimization
- Added optimization run display link
- Added optimization_context
🛡️ Security Fixes
- Fixed: h11 accepted some malformed Chunked-Encoding bodies
- Fixed: setuptools had a path traversal vulnerability in PackageIndex.download that could lead to Arbitrary File Write
- Fixed: LiteLLM had an Improper Authorization Vulnerability
👉 See full commit log on GitHub
Releases: 1.7.32, 1.7.33, 1.7.34, 1.7.35, 1.7.36
💡 Product Enhancements
- Ability to upload CSV datasets directly through the user interface
- Added experiment cost tracking to the Experiments table
- Added hinters and helpers for onboarding new users across the platform
- Added “LLM calls count” to the traces table
- Pretty formatting for complex agentic threads
- Preview support for MP3 files in the frontend
🛠 SDKs and API Enhancements
- Good news for JS developers! We’ve released experiments support for the JS SDK (official docs coming very soon)
- New Experiments Bulk API: an API for logging Experiments in bulk.
- Rate Limiting improvements both in the API and the SDK
🔌 Integrations
- Support for OpenAI o3-mini and Groq models added to the Playground
- OpenAI Agents: context awareness implemented, robustness improved, and thread handling improved
- Google ADK: added support for multi-agent integration
- LiteLLM: token and cost tracking added for SDK calls. Integration now compatible with opik.configure(…)
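As a minimal sketch of the LiteLLM flow, assuming the OpikLogger callback that ships inside LiteLLM (its import path may vary between LiteLLM versions) and an example model name:

```python
# Sketch: configure the Opik SDK, then let LiteLLM log token usage and cost to Opik.
# The OpikLogger import path lives inside LiteLLM and may differ between versions.
import litellm
import opik
from litellm.integrations.opik.opik import OpikLogger

opik.configure()  # sets up Opik credentials/workspace (may prompt on first run)

litellm.callbacks = [OpikLogger()]

response = litellm.completion(
    model="gpt-4o-mini",  # example model name
    messages=[{"role": "user", "content": "Say hello in one word."}],
)
print(response.choices[0].message.content)
```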
👉 See full commit log on GitHub
Releases: 1.7.27, 1.7.28, 1.7.29, 1.7.30, 1.7.31
✨ New Features
- Opik Agent Optimizer: A comprehensive toolkit designed to enhance the performance and efficiency of your Large Language Model (LLM) applications. Read more
- Opik Guardrails: Guardrails help you protect your application from risks inherent in LLMs. Use them to check the inputs and outputs of your LLM calls, and detect issues like off-topic answers or leaking sensitive information. Read more
💡 Product Enhancements
- New Prompt Selector in Playground — Choose existing prompts from your Prompt Library to streamline your testing workflows.
- Improved “Pretty Format” for Agents — Enhanced readability for complex threads in the UI.
🔌 Integrations
- Vertex AI (Gemini) — Offline and online evaluation support integrated directly into Opik. Also available now in the Playground.
- OpenAI Integration in the JS/TS SDK
- AWS Strands Agents
- Agno Framework
- Google ADK Multi-agent support
🛠 SDKs and API Enhancements
- OpenAI LLM advanced configurations — Support for custom headers and base URLs (see the sketch after this list).
- Span Timing Precision — Time resolution improved to microseconds for accurate monitoring.
- Better Error Messaging — More descriptive errors for SDK validation and runtime failures.
- Stream-based Tracing and Enhanced Streaming support
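Here is a minimal sketch of the advanced OpenAI configuration mentioned above: a custom base URL and extra headers on the OpenAI client, wrapped with Opik's track_openai helper. The gateway URL and header below are placeholders, not real endpoints.

```python
# Sketch: custom base URL and headers on the OpenAI client, traced by Opik.
from openai import OpenAI
from opik.integrations.openai import track_openai

client = OpenAI(
    base_url="https://llm-gateway.example.com/v1",  # hypothetical proxy/gateway URL
    default_headers={"x-team": "platform-eval"},  # hypothetical custom header
)
tracked_client = track_openai(client)

completion = tracked_client.chat.completions.create(
    model="gpt-4o-mini",  # example model name
    messages=[{"role": "user", "content": "ping"}],
)
print(completion.choices[0].message.content)
```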
👉 See full commit log on GitHub
Releases: 1.7.19, 1.7.20, 1.7.21, 1.7.22, 1.7.23, 1.7.24, 1.7.25, 1.7.26
Opik Dashboard:
Python and JS / TS SDK:
- Added support for streaming in the ADK integration
- Added cost tracking for the ADK integration
- Added support for OpenAI responses.parse
- Reduced the memory and CPU overhead of the Python SDK through various performance optimizations
Deployments:
- Updated port mapping when using opik.sh
- Fixed persistence when using Docker compose deployments
Releases: 1.7.15, 1.7.16, 1.7.17, 1.7.18
Opik Dashboard:
- Updated the experiment page charts to better handle nulls; all metric values are now displayed.
- Added lazy loading to the trace and span sidebar to better handle very large traces.
- Added support for trace and span attachments; you can now log PDF, video, and audio files to your traces.

- Improved performance of some Experiment endpoints
Python and JS / TS SDK:
- Updated the DSPy integration following the latest DSPy release
- New Autogen integration based on Opik’s OpenTelemetry endpoints
- Added compression to the request payload
Releases: 1.7.12, 1.7.13, 1.7.14
Opik Dashboard:
- Released Python code metrics for online evaluations, for both Opik Cloud and self-hosted deployments. This allows you to define Python functions to evaluate your traces in production.
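As a purely hypothetical sketch of what such a scoring function could look like (the exact contract the online rule editor expects, including which trace fields are passed in, may differ):

```python
# Hypothetical online-evaluation code metric: flag traces whose output apologizes
# to the user. The function name and signature are illustrative assumptions, not
# the definitive contract expected by the rule editor.
def contains_apology(output: str) -> float:
    text = output.lower()
    return 1.0 if ("sorry" in text or "apolog" in text) else 0.0
```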

Python and JS / TS SDK:
- Fixed LLM as a judge metrics so they return an error rather than a score of 0.5 when the LLM returns a score outside the range 0 to 1.
Deployments:
- Updated Dockerfiles to ensure all containers run as non-root users.
Release: 1.7.11
Opik Dashboard:
- Updated the feedback scores UI in the experiment page to make it easier to annotate experiment results.
- Fixed an issue with base64 encoded images in the experiment sidebar.
- Improved the loading speeds of the traces table and traces sidebar for traces that have very large payloads (25MB+).
Python and JS / TS SDK:
- Improved the robustness of LLM as a Judge metrics with better parsing.
- Fixed usage tracking for Anthropic models hosted on VertexAI.
- When using LiteLLM, we fall back to the LiteLLM cost if no model provider or model is specified.
- Added support for thread_id in the LangGraph integration.
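A minimal sketch of how this could look with a LangGraph graph, assuming the OpikTracer callback from the LangChain integration and LangGraph's standard configurable.thread_id (the exact mechanism Opik uses to pick up the thread ID is an assumption here):

```python
# Sketch: group LangGraph runs into a single Opik thread.
# Assumes the OpikTracer callback handler and LangGraph's configurable.thread_id;
# how Opik associates the run with a thread may differ, check the docs.
from typing import TypedDict

from langgraph.graph import END, START, StateGraph
from opik.integrations.langchain import OpikTracer


class State(TypedDict):
    question: str
    answer: str


def answer_node(state: State) -> dict:
    return {"answer": f"You asked: {state['question']}"}


builder = StateGraph(State)
builder.add_node("answer", answer_node)
builder.add_edge(START, "answer")
builder.add_edge("answer", END)
app = builder.compile()

opik_tracer = OpikTracer()

result = app.invoke(
    {"question": "What changed in this release?"},
    config={
        "callbacks": [opik_tracer],
        "configurable": {"thread_id": "conversation-42"},  # groups runs into one thread
    },
)
```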
Releases: 1.7.4, 1.7.5, 1.7.6, 1.7.7 and 1.7.8.