Evaluate your LLM Application

Bringing It All Together: Complete LLM Evaluation

This video walks through the complete evaluation workflow in Opik, where datasets and metrics come together to systematically assess LLM performance. You’ll see a practical comparison of GPT-4 and Gemini models on a RAG application, learn about prompt versioning and experiment management, and see how to make data-driven decisions for production deployment. It ties the concepts from the previous lessons into a single workflow.

Key Highlights

  • End-to-End Evaluation Workflow: Run complete evaluations that process datasets, apply models, and score outputs using defined metrics in a systematic pipeline
  • Prompt Management & Versioning: Use Opik’s prompt class to create versioned prompts with commit history, which makes experiments reproducible and saves time and money
  • Multi-Model Benchmarking: Compare different models (GPT-4 vs Gemini) side-by-side using evaluation tasks and systematic scoring across identical datasets
  • Smart Experiment Organization: Name experiments deliberately (e.g., by model name) so they are easy to identify and compare, rather than relying on randomly generated names
  • Live Experiment Monitoring: Track evaluation progress in real-time through the Opik UI, viewing dataset processing and results as they’re generated
  • Side-by-Side Comparison: Use the compare feature to evaluate multiple experiments simultaneously, making model selection decisions based on quantitative metrics
  • Template Generation: Leverage the “Create New Experiment” button to automatically generate evaluation scripts with selected metrics for reuse in Python. Each metric in the modal includes a documentation link for quick reference
  • Trace-Level Inspection: Dive deep into individual responses by opening traces from experiment results to understand model behavior and decision paths
  • Data-Driven Production Decisions: Choose the best-performing prompts and models based on concrete metrics rather than subjective assessment, building confidence for deployment
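
The workflow in the highlights above (one dataset, one versioned prompt, two models, one metric, named experiments, then a comparison) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Opik SDK: `model_a` and `model_b` are hypothetical stubs standing in for the two models being compared, `exact_match` is a simple stand-in metric, the dataset is made up, and `prompt_commit` is an illustrative hash that does not claim to match how Opik computes prompt commits.

```python
import hashlib
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical dataset: each item pairs an input question with an expected answer.
DATASET: List[Dict[str, str]] = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "What is 2 + 2?", "expected": "4"},
    {"question": "Who wrote Hamlet?", "expected": "Shakespeare"},
]

PROMPT_TEMPLATE = "Answer concisely: {question}"


def prompt_commit(template: str) -> str:
    """Illustrative version id: first 8 hex chars of the template's SHA-256."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:8]


def _lookup(prompt: str, answers: Dict[str, str]) -> str:
    """Return the canned answer whose key appears in the prompt."""
    return next((a for key, a in answers.items() if key in prompt), "unknown")


# Stub "models" (assumptions): real code would call an LLM provider here.
def model_a(prompt: str) -> str:
    return _lookup(prompt, {"capital of France": "Paris", "2 + 2": "4", "Hamlet": "Shakespeare"})


def model_b(prompt: str) -> str:
    return _lookup(prompt, {"capital of France": "Paris", "2 + 2": "5", "Hamlet": "Shakespeare"})


def exact_match(output: str, expected: str) -> float:
    """Stand-in metric: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


@dataclass
class Experiment:
    name: str            # chosen deliberately, e.g. after the model
    prompt_commit: str   # ties the scores to one prompt version
    scores: List[float]

    @property
    def mean_score(self) -> float:
        return mean(self.scores)


def run_experiment(
    name: str,
    model: Callable[[str], str],
    template: str,
    dataset: List[Dict[str, str]],
    metric: Callable[[str, str], float],
) -> Experiment:
    """Format the prompt per item, call the model, and score each output."""
    scores = []
    for item in dataset:
        prompt = template.format(question=item["question"])
        scores.append(metric(model(prompt), item["expected"]))
    return Experiment(name=name, prompt_commit=prompt_commit(template), scores=scores)


# Same dataset, prompt version, and metric for both runs; only the model changes.
experiments = [
    run_experiment("model-a", model_a, PROMPT_TEMPLATE, DATASET, exact_match),
    run_experiment("model-b", model_b, PROMPT_TEMPLATE, DATASET, exact_match),
]
best = max(experiments, key=lambda e: e.mean_score)

for e in experiments:
    print(f"{e.name} (prompt {e.prompt_commit}): mean exact_match = {e.mean_score:.2f}")
print("best:", best.name)
```

Holding the dataset, prompt version, and metric fixed while varying only the model is what makes the two mean scores directly comparable. In Opik itself, the dataset, metrics, and experiment tracking would come from the SDK and UI rather than these hand-rolled stand-ins.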