-
SelfCheckGPT for LLM Evaluation
Detecting hallucinations in language models is challenging. There are three general approaches. The problem with many LLM-as-a-Judge techniques is that…
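As a rough illustration (not code from the post, and the model name is a placeholder): SelfCheckGPT's core idea is that facts a model actually knows tend to reappear across independently sampled responses, while hallucinations vary from sample to sample. The sketch below samples several responses and scores consistency with a crude token-overlap heuristic standing in for SelfCheckGPT's NLI/QA/n-gram scorers.

```python
from openai import OpenAI

client = OpenAI()
prompt = "Who discovered penicillin, and in what year?"

# 1. Sample several stochastic responses to the same prompt.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
    temperature=1.0,
    n=6,
)
samples = [choice.message.content.lower() for choice in response.choices]

# 2. Treat the first response as the "answer" and score how consistent it is
#    with the remaining samples. A simple token-overlap ratio stands in for
#    SelfCheckGPT's actual consistency scorers.
def overlap(a: str, b: str) -> float:
    tokens_a, tokens_b = set(a.split()), set(b.split())
    return len(tokens_a & tokens_b) / max(len(tokens_a), 1)

answer, others = samples[0], samples[1:]
consistency = sum(overlap(answer, s) for s in others) / len(others)
print(f"Consistency: {consistency:.2f} (low scores hint at hallucination)")
```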
-
LLM Juries for Evaluation
Evaluating the correctness of generated responses is an inherently challenging task. LLM-as-a-Judge evaluators have gained popularity for their ability to…
-
A Simple Recipe for LLM Observability
So, you’re building an AI application on top of an LLM, and you’re planning on setting it live in production…
-
G-Eval for LLM Evaluation
LLM-as-a-judge evaluators have gained widespread adoption due to their flexibility, scalability, and close alignment with human judgment. They excel at…
-
Build Multi-Index Advanced RAG Apps
Welcome to Lesson 12 of 12 in our free course series, LLM Twin: Building Your Production-Ready AI Replica. You’ll learn…
-
Build a scalable RAG ingestion pipeline using 74.3% less code
Welcome to Lesson 11 of 12 in our free course series, LLM Twin: Building Your Production-Ready AI Replica. You’ll learn…
-
BERTScore For LLM Evaluation
BERTScore represents a pivotal shift in LLM evaluation, moving beyond traditional heuristic-based metrics like BLEU and ROUGE to a…
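As a quick, hedged illustration of that shift (a sketch, not the post's code): the open-source bert-score package compares candidate and reference texts through contextual embeddings, so a paraphrase with little exact n-gram overlap can still score well where BLEU or ROUGE would penalize it.

```python
from bert_score import score  # pip install bert-score

# The candidate paraphrases the reference with almost no shared n-grams,
# which n-gram metrics punish; embedding-based matching is more forgiving.
candidates = ["The weather turned out to be lovely this afternoon."]
references = ["It was a beautiful afternoon outside."]

# Returns per-example precision, recall, and F1 tensors.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.3f}")
```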
-
Building ClaireBot, an AI Personal Stylist Chatbot
Follow the evolution of my personal AI project and discover how to integrate image analysis, LLMs, and LLM-as-a-judge evaluation…
-
Perplexity for LLM Evaluation
Perplexity is, historically speaking, one of the “standard” evaluation metrics for language models. And while recent years have seen a…
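As a refresher (a minimal sketch, not the post's code; GPT-2 is used purely as an example model): perplexity is the exponential of the average negative log-likelihood a model assigns to a sequence, so lower values mean the model finds the text less "surprising".

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "Perplexity measures how well a language model predicts a sample of text."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # With labels supplied, the model returns the mean cross-entropy loss
    # over the predicted next tokens.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

# Perplexity = exp(mean negative log-likelihood); lower is better.
print(f"Perplexity: {torch.exp(loss).item():.2f}")
```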
-
OpenAI Evals: Log Datasets & Evaluate LLM Performance with Opik
OpenAI’s Python library is quickly becoming one of the most-downloaded Python packages. With an easy-to-use SDK and access…