
Evaluation Concepts and Overview


Understanding LLM Evaluation with Opik

This video introduces the fundamentals of LLM evaluation and explains how it differs from traditional machine learning evaluation. Conventional ML evaluation relies on metrics such as accuracy and F1 score computed against fixed labels. LLM outputs are free-form text, so evaluation must instead assess qualities like relevance, factual accuracy, and helpfulness. You’ll learn about Opik’s three-component evaluation framework and see how it enables quantitative performance measurement across hundreds of test cases.

Key Highlights

  • Beyond Traditional Metrics: LLM evaluation requires new approaches since outputs are text that must be assessed for qualities like relevance, accuracy, and helpfulness
  • Three-Component Framework: Opik’s evaluation system consists of datasets (example inputs/outputs), metrics (automated scoring methods), and experiments (evaluation runs)
  • Comprehensive Dataset Management: Datasets are collections of example inputs and expected outputs that represent your specific use cases and requirements
  • Flexible Metrics System: Metrics range from simple heuristics to sophisticated LLM-as-a-judge approaches for automated output scoring
  • Systematic Experimentation: Each experiment represents a specific LLM application configuration tested against datasets with defined metrics
  • Model Comparison Power: Compare different models (GPT-3.5 vs Claude 3.5 vs GPT-4 vs Gemini) systematically on the same datasets
  • Prompt Template Testing: Evaluate various prompt templates against datasets with specific models for optimization
  • Quantitative Decision Making: Replace subjective judgments based on a few examples with quantitative measurement across hundreds of test cases
  • Production Confidence: A structured evaluation approach provides confidence before deploying LLM applications to production
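
The three components above (datasets, metrics, experiments) can be illustrated with a minimal, library-free Python sketch. It does not use the Opik SDK: the names `DatasetItem`, `exact_match`, `contains_expected`, `run_experiment`, `terse_app`, and `verbose_app` are hypothetical and chosen only for illustration, and the two "applications" are canned stand-ins for real LLM calls.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass(frozen=True)
class DatasetItem:
    """One dataset example: an input and the output we expect for it."""
    input: str
    expected_output: str


# A metric scores one output against the expected output, in [0, 1].
Metric = Callable[[str, str], float]
# A task is the application under test: it maps an input to an output.
Task = Callable[[str], str]


def exact_match(output: str, expected: str) -> float:
    """Heuristic metric: 1.0 if the normalized strings are identical."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def contains_expected(output: str, expected: str) -> float:
    """Heuristic metric: 1.0 if the expected text appears in the output."""
    return 1.0 if expected.strip().lower() in output.lower() else 0.0


def run_experiment(
    task: Task, dataset: List[DatasetItem], metrics: Dict[str, Metric]
) -> Dict[str, float]:
    """Run one configuration over the dataset; return the mean score per metric."""
    if not dataset:
        raise ValueError("dataset must contain at least one item")
    scores: Dict[str, List[float]] = {name: [] for name in metrics}
    for item in dataset:
        output = task(item.input)
        for name, metric in metrics.items():
            scores[name].append(metric(output, item.expected_output))
    return {name: mean(values) for name, values in scores.items()}


# Dataset: example inputs with expected outputs (illustrative data only).
DATASET = [
    DatasetItem("What is the capital of France?", "Paris"),
    DatasetItem("What is 2 + 2?", "4"),
]

# Canned answers standing in for model responses (an assumption of this sketch).
CANNED = {item.input: item.expected_output for item in DATASET}


def terse_app(question: str) -> str:
    """Hypothetical configuration A: answers with the bare value."""
    return CANNED[question]


def verbose_app(question: str) -> str:
    """Hypothetical configuration B: wraps the value in a sentence."""
    return f"The answer is {CANNED[question]}."


METRICS: Dict[str, Metric] = {
    "exact_match": exact_match,
    "contains_expected": contains_expected,
}

# Experiments: each configuration is scored on the same dataset and metrics.
results = {
    name: run_experiment(app, DATASET, METRICS)
    for name, app in {"terse": terse_app, "verbose": verbose_app}.items()
}

for name, summary in results.items():
    print(name, summary)
```

Because both configurations run against the same dataset with the same metrics, their scores are directly comparable. The sketch also shows why the choice of metric matters: the verbose configuration fails the exact-match check while still passing the containment check.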