No-code LLM Evaluation Workflow | Opik Documentation

End-to-End UI-Based LLM Experimentation

This comprehensive video demonstrates how to conduct complete LLM experimentation using Opik’s UI in a 15-minute video. Through a practical example of building risk-aware LLM applications, you’ll learn the complete workflow from dataset creation to experiment comparison. The video walks through testing different prompt strategies to ensure LLM outputs include adequate cautionary statements, using LLM-as-a-judge metrics for systematic evaluation without requiring any coding experience.

Key Highlights

Complete UI Workflow: Demonstrates end-to-end LLM evaluation entirely through Opik’s interface, making it accessible to non-technical users and data scientists alike
Practical Use Case: Real-world example of ensuring LLM applications provide adequate risk warnings, preventing potentially dangerous command executions like Docker volume deletions
Dataset Creation & Management: Upload CSV files or use AI to synthetically expand datasets with risky prompts for comprehensive testing scenarios
System Prompt Testing: Compare different prompting strategies including “Risky Rick” (minimal warnings), “Safety Sid” (comprehensive cautions), and baseline ChatGPT responses
Interactive Playground: Test multiple prompt variations side-by-side with configurable model parameters and dataset integration using template variables
LLM-as-a-Judge Evaluation: Create custom evaluation rules that automatically assess output quality using Boolean scoring (0/1) for systematic risk assessment
Automated Experiment Comparison: Run parallel tests across entire datasets and compare aggregated metrics to identify the best-performing prompt strategy
Production Cost Considerations: Configure sampling and filtering options for online evaluation to manage costs when running LLM-as-a-judge on production data
Scalable Evaluation Approach: Move beyond manual spot-checking to systematic assessment across 20+ test cases, enabling data-driven decisions for production deployment