Optimize, Annotate and Score Full Agent Systems

When multiple steps in an agentic system are contextually related, logging and evaluating individual LLM calls doesn’t tell the whole story. That’s where the latest round of Opik releases comes in, with a focus on evaluating groups of actions so you can quantify and improve your AI application’s performance at a higher level.

Working with AI chatbots? Now you can capture entire multi-turn conversations, invite human experts to review and score them, and run conversation-level eval metrics like user frustration and conversational coherence.

Opik’s agent optimizer SDK is ready for more complexity too: it can now go beyond single prompts and run automated optimization on multi-step agents.

Read on for more details and tips on using these new features – plus, check out how Zencoder relies on Opik to build and test their fully agentic software pipelines, and discover where to connect with fellow AI developers in the coming weeks!

Opik SDK Thread Evaluation

Now, when you run multi-turn conversations through Opik, the platform will automatically group related traces into conversation threads.
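To make the idea of a conversation thread concrete, here is a minimal sketch in plain Python. It is not Opik's implementation: the trace records and the `thread_id` field name are assumptions chosen for illustration, and the helper `group_into_threads` is hypothetical.

```python
from collections import defaultdict

# Hypothetical trace records; the field names are assumptions for illustration.
traces = [
    {"thread_id": "t1", "input": "Hi", "output": "Hello! How can I help?"},
    {"thread_id": "t2", "input": "Reset my password", "output": "Check your email."},
    {"thread_id": "t1", "input": "Where is my order?", "output": "It ships tomorrow."},
]


def group_into_threads(traces):
    """Group traces that share a thread identifier, preserving their order."""
    threads = defaultdict(list)
    for trace in traces:
        threads[trace["thread_id"]].append(trace)
    return dict(threads)


threads = group_into_threads(traces)
for thread_id, turns in threads.items():
    print(thread_id, len(turns), "turn(s)")
```

Each thread then holds the turns of one conversation in order, which is the unit that conversation-level metrics operate on.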

To evaluate your conversation threads, use the new evaluate_threads function in the SDK: specify a filter to select the threads you care about, then apply conversation-level metrics such as user frustration and conversational coherence. The evaluation report is generated locally within the SDK, covers every evaluated thread in your agent system, and is immediately visible in the Opik UI.
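The filter-then-score flow can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the SDK's API: `evaluate_thread_sketch` and its parameters are hypothetical, and the keyword-based `user_frustration` heuristic is only a toy stand-in for a real conversation-level metric.

```python
def user_frustration(turns):
    """Toy heuristic (assumption): share of user messages with a frustration cue."""
    cues = ("not working", "again", "useless", "!!")
    user_msgs = [turn["input"].lower() for turn in turns]
    flagged = sum(any(cue in msg for cue in cues) for msg in user_msgs)
    return flagged / len(user_msgs) if user_msgs else 0.0


def evaluate_thread_sketch(threads, thread_filter, metrics):
    """Score every thread that passes the filter with each metric (hypothetical helper)."""
    report = {}
    for thread_id, turns in threads.items():
        if thread_filter(thread_id, turns):
            report[thread_id] = {name: fn(turns) for name, fn in metrics.items()}
    return report


# Hypothetical threads: thread id -> ordered list of turns.
threads = {
    "support-1": [
        {"input": "My export is not working", "output": "Try again in a minute."},
        {"input": "Still not working!!", "output": "Escalating to support."},
    ],
    "support-2": [{"input": "Thanks, all good", "output": "Glad to help."}],
    "sales-1": [{"input": "Pricing?", "output": "See our plans page."}],
}

report = evaluate_thread_sketch(
    threads,
    thread_filter=lambda thread_id, turns: thread_id.startswith("support-"),
    metrics={"user_frustration": user_frustration},
)
print(report)
```

The report contains one entry per thread that matched the filter, with one score per metric; threads outside the filter (here, `sales-1`) are left out.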

View docs →

Thread-Level Feedback Scores

Opik’s new thread-level expert feedback feature is now available! This feature has been tailored for subject-matter experts to review entire chatbot conversations in context, flag insights and risks, and collaborate directly with dev teams. In addition to manual scoring, you can now tag threads and leave contextual comments to enhance collaboration and provide greater clarity within workflows.

View docs →

Agent Optimizer 1.0

You can now automatically optimize not just single prompts, but full agentic systems! With built-in support for LangGraph, Google ADK, PydanticAI, and more, this release simplifies the API, allows you to bring your own model to evaluation, and separates the optimizing LLM from the evaluation LLM for more control.

View docs →

Insights From the Comet Team

How Opik Helps Zencoder Build & Test Fully Agentic Software Pipelines

Dmitrii Krasnov, Engineering Manager at Zencoder, shares how his team uses Opik to build and scale Zencoder, an AI-powered code assistant capable of everything from real-time code repair to autonomous JIRA ticket resolution. Learn how Opik has improved research efficiency and given the Zencoder team full trace visibility and faster iteration across daily experiments:

Read here →

Connect & Learn with Fellow GenAI & ML Developers

Join us live in the coming weeks for the following conferences, workshops, prizes, and opportunities to connect with fellow AI builders:

ICML 2025 (International Conference on Machine Learning) – Vancouver, July 14th-17th
AI Tinkerers NYC Hackathon – NYC, July 19th-20th
Las Vegas Comet X Weaviate Hacknight – Las Vegas, August 12th
