Tracing Core Concepts

Understanding the fundamental concepts behind Opik's tracing platform

If you want to jump straight to logging traces, you can head to the Log traces or Log agents guides.

Tracing is the foundation of observability in Opik. It allows you to monitor, debug, and optimize your LLM applications by capturing detailed information about their execution. Understanding these core concepts is essential for effectively using Opik’s tracing capabilities.

Overview

When working with LLM applications, understanding what’s happening under the hood is crucial for debugging issues, optimizing performance, and ensuring reliability. Opik’s tracing system provides comprehensive observability by capturing detailed execution information at multiple levels.

To use Opik’s tracing capabilities effectively, it’s important to understand six key concepts:

  1. Trace: A complete execution path representing a single interaction with an LLM or agent
  2. Span: Individual operations or steps within a trace that represent specific actions or computations
  3. Thread: A collection of related traces that form a coherent conversation or workflow
  4. Metric: Quantitative measurements that provide objective assessments of your AI models’ performance
  5. Optimization: The systematic process of refining and evaluating LLM prompts and configurations
  6. Evaluation: A framework for systematically testing your prompts and models against datasets

Traces

A trace represents a complete execution path for a single interaction with an LLM or agent. Think of it as a detailed record of everything that happened during one request-response cycle. Each trace captures the full context of the interaction, including inputs, outputs, timing, and any intermediate steps.

Key Characteristics of Traces:

  • Unique Identity: Each trace has a unique identifier that allows you to track and reference it
  • Complete Context: Contains all the information needed to understand what happened during the interaction
  • Timing Information: Records when the interaction started, ended, and how long each part took
  • Input/Output Data: Captures the exact prompts sent to the LLM and the responses received
  • Metadata: Includes additional context like model used, temperature settings, and custom tags
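
As an illustration, the sketch below logs a trace with Opik’s Python SDK. It assumes the @opik.track decorator, which records the decorated function’s input, output, and timing automatically; the answer_question body is a placeholder for a real LLM call.

```python
import opik

# Each call to a tracked function is logged as one trace,
# capturing input arguments, the return value, and timing.
@opik.track
def answer_question(question: str) -> str:
    # Placeholder for a real LLM call.
    return f"Answer to: {question}"

answer_question("What is tracing?")
```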

Example Use Cases:

  • Debugging: When an LLM produces unexpected output, you can examine the trace to understand what went wrong
  • Performance Analysis: Identify bottlenecks and slow operations by analyzing trace timing
  • Cost Tracking: Monitor token usage and associated costs for each interaction
  • Quality Assurance: Review traces to ensure your application is behaving as expected

Spans

A span represents an individual operation or step within a trace. While a trace shows the complete picture, spans break down the execution into granular, measurable components. This hierarchical structure allows you to understand both the high-level flow and the detailed operations within your LLM application.

Key Characteristics of Spans:

  • Hierarchical Structure: Spans can contain other spans, creating a tree-like structure within a trace
  • Specific Operations: Each span represents a distinct action, such as a function call, API request, or data processing step
  • Detailed Timing: Precise start and end times for each operation
  • Context Preservation: Maintains the relationship between parent and child operations
  • Custom Attributes: Can include additional metadata specific to the operation

Common Span Types:

  • LLM Calls: Individual requests to language models
  • Function Calls: Tool or function invocations within an agent
  • Data Processing: Transformations or manipulations of data
  • External API Calls: Requests to third-party services
  • Custom Operations: Any user-defined operation you want to track

Example Span Hierarchy:

Trace: "Customer Support Chat"
├── Span: "Parse User Intent"
├── Span: "Query Knowledge Base"
│ ├── Span: "Search Vector Database"
│ └── Span: "Rank Results"
├── Span: "Generate Response"
│ ├── Span: "LLM Call: GPT-4"
│ └── Span: "Post-process Response"
└── Span: "Log Interaction"
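
In the Python SDK, a hierarchy like this emerges naturally when tracked functions call each other: each nested call is recorded as a child span of its caller. A minimal sketch, assuming that nesting behavior (the names and logic are illustrative):

```python
from opik import track

@track(name="Search Vector Database")
def search_vectors(query: str) -> list[str]:
    return ["doc-1", "doc-2"]  # placeholder results

@track(name="Query Knowledge Base")
def query_knowledge_base(query: str) -> list[str]:
    # Recorded as a child span of "Query Knowledge Base".
    return search_vectors(query)

@track(name="Customer Support Chat")
def handle_message(message: str) -> str:
    docs = query_knowledge_base(message)
    return f"Found {len(docs)} documents"  # placeholder response

handle_message("How do I reset my password?")
```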

Threads

A thread is a collection of related traces that form a coherent conversation or workflow. Threads are essential for understanding multi-turn interactions and maintaining context across multiple LLM calls. They provide a way to group related traces together, making it easier to analyze conversational patterns and user journeys.

Key Characteristics of Threads:

  • Conversation Context: Maintains the flow of multi-turn interactions
  • Trace Grouping: Organizes related traces under a single thread identifier
  • Temporal Ordering: Traces within a thread are ordered chronologically
  • Shared Context: Allows you to see how context evolves throughout a conversation
  • Cross-Trace Analysis: Enables analysis of patterns across multiple related interactions

When to Use Threads:

  • Chat Applications: Group all messages in a conversation
  • Multi-Step Workflows: Track complex processes that span multiple LLM calls
  • User Sessions: Organize all interactions from a single user session
  • Agent Conversations: Follow the complete interaction between an agent and a user

Thread Management:

Threads are created by defining a thread_id and referencing it in your traces. This allows you to:

  • Maintain Context: Keep track of conversation history and user state
  • Debug Conversations: Understand how a conversation evolved over time
  • Analyze Patterns: Identify common conversation flows and user behaviors
  • Optimize Performance: Find bottlenecks in multi-turn interactions
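
For example, assuming the SDK’s opik_context.update_current_trace helper, a chat handler can attach each trace to its conversation like this (the conversation-ID scheme is illustrative):

```python
from opik import track, opik_context

@track
def handle_chat_turn(conversation_id: str, message: str) -> str:
    # Group this trace with all other turns of the same conversation.
    opik_context.update_current_trace(thread_id=conversation_id)
    return f"Reply to: {message}"  # placeholder LLM response

# Both turns appear under the same thread in the Opik UI.
handle_chat_turn("conv-42", "Hi, I need help with my order.")
handle_chat_turn("conv-42", "It still hasn't arrived.")
```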

Metrics

Metrics provide quantitative assessments of your AI models’ outputs, enabling objective comparisons and performance tracking over time. They are essential for understanding how well your LLM applications are performing and identifying areas for improvement.

Key Characteristics of Metrics:

  • Quantitative Measurement: Provide numerical scores that can be compared and tracked
  • Objective Assessment: Remove subjective bias from performance evaluation
  • Trend Analysis: Enable tracking of performance changes over time
  • Comparative Analysis: Allow comparison between different models, prompts, or configurations
  • Automated Evaluation: Can be computed automatically without human intervention

Common Metric Types:

  • Accuracy Metrics: Measure how often the model produces correct outputs
  • Quality Metrics: Assess the quality of generated text (e.g., coherence, relevance)
  • Efficiency Metrics: Track performance characteristics like latency and throughput
  • Cost Metrics: Monitor token usage and associated costs
  • Custom Metrics: Domain-specific measurements tailored to your use case
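
As a small example, the SDK ships heuristic metrics that score outputs locally, without calling a model. A sketch assuming the Equals metric and its score(output=..., reference=...) signature:

```python
from opik.evaluation.metrics import Equals

metric = Equals()
result = metric.score(output="Paris", reference="Paris")
print(result.value)  # 1.0 when the strings match, 0.0 otherwise
```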

Optimization

Optimization is the systematic process of refining and evaluating LLM prompts and configurations to improve performance. It involves iteratively testing different approaches and using data-driven insights to make improvements.

Key Aspects of Optimization:

  • Prompt Engineering: Refining the instructions given to LLMs
  • Parameter Tuning: Adjusting model settings like temperature, top-p, and max tokens
  • Few-shot Learning: Optimizing example selection for in-context learning
  • Tool Integration: Improving how LLMs interact with external tools and functions
  • Performance Monitoring: Tracking improvements and regressions over time
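
As a toy illustration of this loop, the sketch below scores two candidate prompts against a reference answer and keeps the better one. The llm_call function is a placeholder, and the single test case is purely illustrative; in practice you would evaluate candidates against a full dataset (see Evaluation below).

```python
from opik.evaluation.metrics import Equals

def llm_call(prompt: str, question: str) -> str:
    # Placeholder for a real model call.
    return "Paris" if "single word" in prompt else f"The answer to {question!r} is Paris."

candidates = [
    "You are a helpful assistant.",
    "You are a concise assistant; answer with a single word.",
]
metric = Equals()

# Score each candidate prompt and keep the best-performing one.
best = max(
    candidates,
    key=lambda p: metric.score(
        output=llm_call(p, "What is the capital of France?"),
        reference="Paris",
    ).value,
)
print(best)
```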

Evaluation

Evaluation provides a framework for systematically testing your prompts and models against datasets using various metrics to measure performance. It’s the foundation for making data-driven decisions about your LLM applications.

Key Components of Evaluation:

  • Datasets: Collections of test cases with inputs and expected outputs
  • Experiments: Individual evaluation runs that test specific configurations
  • Metrics: Quantitative measures of performance
  • Comparative Analysis: Side-by-side comparison of different approaches
  • Statistical Significance: Ensuring results are reliable and reproducible
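
Putting these pieces together, a minimal evaluation run with the Python SDK might look like the sketch below. It assumes the evaluate entry point, get_or_create_dataset, and the scoring_key_mapping parameter for wiring dataset fields to metric arguments; the dataset name and task logic are illustrative.

```python
from opik import Opik
from opik.evaluation import evaluate
from opik.evaluation.metrics import Equals

client = Opik()
dataset = client.get_or_create_dataset(name="capitals-demo")
dataset.insert([
    {"question": "What is the capital of France?", "expected_output": "Paris"},
])

def task(item: dict) -> dict:
    # Placeholder application logic; a real task would call your LLM app.
    return {"output": "Paris"}

evaluate(
    dataset=dataset,
    task=task,
    scoring_metrics=[Equals()],
    # Map the dataset's expected_output field to the metric's reference argument.
    scoring_key_mapping={"reference": "expected_output"},
)
```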

Learn More

Now that you understand the core concepts, explore the guides on tracing and observability, evaluation and testing, optimization, and integrations to dive deeper.

Best Practices for Tracing

1. Start with Clear Trace Boundaries

Define clear boundaries for what constitutes a single trace. Typically, this should align with a complete user interaction or business operation.

2. Use Meaningful Span Names

Choose descriptive names for your spans that clearly indicate what operation is being performed. This makes debugging much easier.
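
For instance, assuming the name and type parameters on the track decorator, you can give spans descriptive names instead of relying on raw function names:

```python
from opik import track

# Shows up in the trace tree as "Rank Results" rather than "rank".
@track(name="Rank Results", type="tool")
def rank(results: list[str]) -> list[str]:
    return sorted(results)
```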

3. Leverage Thread IDs for Conversations

Use consistent thread IDs for related interactions. This is especially important for chat applications and multi-step workflows.

4. Add Relevant Metadata

Include custom attributes and metadata that will be useful for analysis. Consider adding user IDs, session information, and business context.
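
One way to do this, assuming the opik_context.update_current_trace helper and its tags and metadata parameters (the field values are illustrative):

```python
from opik import track, opik_context

@track
def handle_request(user_id: str, prompt: str) -> str:
    # Attach business context to the current trace for later filtering.
    opik_context.update_current_trace(
        tags=["production", "support-bot"],
        metadata={"user_id": user_id, "app_version": "1.4.2"},
    )
    return "..."  # placeholder response
```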

5. Monitor Performance Continuously

Set up alerts and dashboards to monitor trace performance, error rates, and costs. This helps you catch issues early.

6. Use Traces for Optimization

Regularly analyze your traces to identify optimization opportunities, such as reducing latency or improving prompt effectiveness.

Pro Tip: Start with basic tracing and gradually add more detailed spans as you identify areas that need deeper observability. Don’t try to trace everything at once; focus on the most critical paths first.

Important: Be mindful of sensitive data when tracing. Avoid logging personally identifiable information (PII) or sensitive business data in your traces. Use Opik’s data filtering capabilities to protect sensitive information.