{"id":10094,"date":"2024-07-31T10:37:20","date_gmt":"2024-07-31T18:37:20","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=10094"},"modified":"2025-04-29T12:44:08","modified_gmt":"2025-04-29T12:44:08","slug":"rag-evaluation-framework-ragas","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/rag-evaluation-framework-ragas\/","title":{"rendered":"The Ultimate Prompt Monitoring Pipeline"},"content":{"rendered":"\n<p><em>Welcome to Lesson 10 of 12&nbsp;in our free course series, LLM Twin: Building Your Production-Ready AI Replica. You\u2019ll learn how to use LLMs, vector DBs, and LLMOps best practices to design, train, and deploy a production-ready \u201cLLM twin\u201d of yourself. This AI character will write like you, incorporating your style, personality, and voice into an LLM. For a full overview of course objectives and prerequisites, start with&nbsp;<a href=\"https:\/\/www.comet.com\/site\/blog\/an-end-to-end-framework-for-production-ready-llm-systems-by-building-your-llm-twin\/\">Lesson 1<\/a>.<\/em><\/p>\n\n\n\n<p><strong>Lessons&nbsp;<\/strong><\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/an-end-to-end-framework-for-production-ready-llm-systems-by-building-your-llm-twin\/\">An End-to-End Framework for Production-Ready LLM Systems by Building Your LLM Twin<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/the-importance-of-data-pipelines-in-the-era-of-generative-ai\/\">Your Content is Gold: I Turned 3 Years of Blog Posts into an LLM Training<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/llm-twin-3-change-data-capture\/\">I Replaced 1000 Lines of Polling Code with 50 Lines of CDC Magic<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/streaming-pipelines-for-fine-tuning-llms\/\">SOTA Python Streaming Pipelines for Fine-tuning LLMs and RAG \u2014 in Real-Time!<\/a><\/li>\n\n\n\n<li><a 
href=\"https:\/\/www.comet.com\/site\/blog\/advanced-rag-algorithms-optimize-retrieval\/\">The 4 Advanced RAG Algorithms You Must Know to Implement<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/llm-fine-tuning-dataset\/\">Turning Raw Data Into Fine-Tuning Datasets<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/mistral-llm-fine-tuning\/\">8B Parameters, 1 GPU, No Problems: The Ultimate LLM Fine-tuning Pipeline<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/llm-evaluation-best-practices\/\">The Engineer\u2019s Framework for LLM &amp; RAG Evaluation<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/llm-rag-inference-pipelines\/\">Beyond Proof of Concept: Building RAG Systems That Scale<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/rag-evaluation-framework-ragas\/\">The Ultimate Prompt Monitoring Pipeline<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/refactoring-rag-retrieval\/\">[Bonus] Build a scalable RAG ingestion pipeline using 74.3% less code<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/multi-index-rag-apps\/\">[Bonus] Build Multi-Index Advanced RAG Apps<\/a><\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p>This lesson will show you how to build a specialized prompt monitoring layer on top of your LLM Twin inference pipeline.<\/p>\n\n\n\n<p>We will also show you how to compute evaluation metrics on top of your production data to alert us when we experience hallucinations, moderation, or other business-related issues while the system is in production.<\/p>\n\n\n\n<p>In this lesson, you will learn the following:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Why does having specialized software for monitoring LLM apps matter?<\/li>\n\n\n\n<li>How to implement a prompt monitoring layer for your complex traces.<\/li>\n\n\n\n<li>Build a monitoring evaluation pipeline to 
alarm you when the system degrades.<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*Nqetd24KxJcseLEl.png\" alt=\"\"\/><\/figure>\n\n\n\n<p>If you haven\u2019t followed the rest of the LLM Twin series, to understand the particularities of our use case, we recommend you to read the following lessons:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/llm-evaluation-best-practices\/\">Lesson 8<\/a>\u00a0on LLM &amp; RAG evaluation.<\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/llm-rag-inference-pipelines\/\">Lesson 9<\/a>\u00a0on building the LLM Twin inference pipeline.<\/li>\n<\/ul>\n\n\n\n<p><em>You are good to go if you are here just for the monitor stuff. Enjoy!<\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Table of Contents<\/h2>\n\n\n\n<ol class=\"wp-block-list\">\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/rag-evaluation-framework-ragas\/#bhyt\">Understanding the challenges of monitoring LLM apps<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/rag-evaluation-framework-ragas\/#pll8\">Monitoring a simple LLM call with Opik<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/rag-evaluation-framework-ragas\/#erty\">Monitoring complex traces with Opik<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/rag-evaluation-framework-ragas\/#sdre\">Sampling items for evaluating chains in production<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/rag-evaluation-framework-ragas\/#x33b\">Evaluating chains in production<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/rag-evaluation-framework-ragas\/#4f4f\">Testing out the prompt monitoring service<\/a><\/li>\n<\/ol>\n\n\n\n<p>\ud83d\udd17 Consider checking out the GitHub repository [1] and support us with a \u2b50\ufe0f<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" 
id=\"bhyt\">1. Understanding the challenges of monitoring LLM apps<\/h2>\n\n\n\n<p>Monitoring is not new to LLMOps, but in the LLM world, we have a new entity to manage: the prompt. Thus, we have to find specific ways to log and analyze them.<\/p>\n\n\n\n<p>Most ML platforms such as&nbsp;<a href=\"\/login?from=llm\">Opik<\/a>&nbsp;(by&nbsp;<a href=\"\/signup\/?utm_source=decoding_ml&amp;utm_medium=partner&amp;utm_content=substack\">Comet<\/a>), have implemented logging tools to debug and monitor prompts. In production, these tools are usually used to track user input, prompt templates, input variables, generated responses, token numbers, and latency.<\/p>\n\n\n\n<p>When generating an answer with an LLM, we don\u2019t wait for the whole answer to be generated; we stream the output token by token. This makes the entire process snappier and more responsive.<\/p>\n\n\n\n<p>Thus, when it comes to tracking the latency of generating an answer, the final user experience must look at this from multiple perspectives, such as:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Time to First Token (TTFT): The time it takes for the first token to be generated<\/li>\n\n\n\n<li>Time between Tokens (TBT): The interval between each token generation<\/li>\n\n\n\n<li>Tokens per Second (TPS): The rate at which tokens are generated<\/li>\n\n\n\n<li>Time per Output Token (TPOT): The time it takes to generate each output token<\/li>\n\n\n\n<li>Total Latency: The total time required to complete a response<\/li>\n<\/ul>\n\n\n\n<p>Also, tracking the total input and output tokens is critical to understanding the costs of hosting your LLMs.<\/p>\n\n\n\n<p><em>Before shipping a new model (or features) to production, it\u2019s recommended that you compute all these latency metrics, along with others such as average input and output token length. 
To do so, you can use open-source benchmarking tools such as&nbsp;<a href=\"https:\/\/github.com\/philschmid\/llmperf\">llmperf<\/a>.<\/em><\/p>\n\n\n\n<p>Finally, you can compute metrics that validate your model\u2019s performance for each input, prompt, and output tuple. Depending on your use case, you can compute things such as accuracy, toxicity, and hallucination rate. When working with RAG systems, you can also compute metrics for the relevance and precision of the retrieved context.<\/p>\n\n\n\n<p>Another essential thing to consider when monitoring prompts is to log their full traces. You might have multiple intermediate steps from the user query to the final generated answer.<\/p>\n\n\n\n<p>For example, rewriting the query to improve the RAG\u2019s retrieval accuracy involves one or more intermediate steps. Thus, logging the full trace reveals the entire process from when a user sends a query to when the final response is returned, including the actions the system takes, the documents retrieved, and the final prompt sent to the model.<\/p>\n\n\n\n<p>Additionally, you can log the latency, tokens, and costs at each step, providing a more fine-grained view of all the steps.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*yUNB2oEzBK2A6s_p.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Figure 1: Trace example from Opik<\/figcaption><\/figure>\n\n\n\n<p>As shown in&nbsp;<em>Figure 1,<\/em>&nbsp;the end goal is to trace each step from the user\u2019s input to the generated answer. If something fails or behaves unexpectedly, you can pinpoint the faulty step. The query can fail due to an incorrect answer, an invalid context, or incorrect data processing. Also, the application can behave unexpectedly if the number of generated tokens suddenly fluctuates during specific steps.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"pll8\">2. 
Monitoring a simple LLM call with Opik<\/h2>\n\n\n\n<p>We will use Opik to implement the prompt monitoring layer.<\/p>\n\n\n\n<p>We have also used&nbsp;<a href=\"\/login?from=llm\">Opik<\/a>&nbsp;in&nbsp;<a href=\"https:\/\/www.comet.com\/site\/blog\/llm-evaluation-best-practices\/\">Lesson 8<\/a>&nbsp;for LLM &amp; RAG evaluation, as Opik\u2019s mission is to build an open-source Python tool for end-to-end LLM development (backed by Comet).<\/p>\n\n\n\n<p>The first step in understanding its monitoring Python SDK is to know how to monitor a simple LLM call.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">When working with custom APIs<\/h3>\n\n\n\n<p>To do so, we annotate the function with the&nbsp;<strong>@opik.track(name=\"\u2026\")<\/strong>&nbsp;Python decorator.<\/p>\n\n\n\n<p>The&nbsp;<strong>name<\/strong>&nbsp;parameter adds little when logging a single prompt, but it is beneficial when logging traces with multiple prompts. It helps you structure your monitoring strategy and quickly identify the issue.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code> import opik\n\n\n# Shown standalone for brevity: in the project, this is a method of the LLMTwin class,\n# and `settings` is imported from the project's config module.\n@opik.track(name=\"inference_pipeline.call_llm_service\")\ndef call_llm_service(self, messages: list&#91;dict&#91;str, str]]) -&gt; str:\n    answer = self._llm_endpoint.predict(\n        data={\n            \"messages\": messages,\n            \"parameters\": {\n                \"max_new_tokens\": settings.MAX_TOTAL_TOKENS\n                - settings.MAX_INPUT_TOKENS,\n                \"temperature\": 0.01,\n                \"top_p\": 0.6,\n                \"stop\": &#91;\"&lt;|eot_id|&gt;\"],\n                \"return_full_text\": False,\n            },\n        }\n    )\n    answer = answer&#91;\"choices\"]&#91;0]&#91;\"message\"]&#91;\"content\"].strip()\n\n    return answer <\/code><\/pre>\n\n\n\n<p>Doing so will automatically log the input and output to the Opik dashboard, as seen in Figure 2.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*cHYjnuwBHmcdCW1p.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Figure 2: Part of the input logged to the Opik 
dashboard.<\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">When working with LangChain, OpenAI or other standardized frameworks<\/h3>\n\n\n\n<p>As we use LangChain for our OpenAI calls (used for advanced RAG steps, such as query expansion), we will show you how easy it is to integrate these prompt monitoring tools into your ecosystem.<\/p>\n\n\n\n<p>Instead of using the&nbsp;<strong>@opik.track()<\/strong>&nbsp;Python decorator, we define an OpikTracer(), which is hooked as a callback to the LangChain chain.<\/p>\n\n\n\n<p>This will automatically log all your chain inputs and outputs, similar to the decorator.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code> from langchain_openai import ChatOpenAI\nfrom opik.integrations.langchain import OpikTracer\n\nfrom config import settings\nfrom core.rag.prompt_templates import QueryExpansionTemplate\n\n\nclass QueryExpansion:\n    opik_tracer = OpikTracer(tags=&#91;\"QueryExpansion\"])\n\n    @staticmethod\n    def generate_response(query: str, to_expand_to_n: int) -&gt; list&#91;str]:\n        query_expansion_template = QueryExpansionTemplate()\n        prompt = query_expansion_template.create_template(to_expand_to_n)\n        model = ChatOpenAI(\n            model=settings.OPENAI_MODEL_ID,\n            api_key=settings.OPENAI_API_KEY,\n            temperature=0,\n        )\n        chain = prompt | model\n        # Hook the Opik tracer as a callback so the chain's inputs and outputs are logged.\n        chain = chain.with_config({\"callbacks\": &#91;QueryExpansion.opik_tracer]})\n\n        response = chain.invoke({\"question\": query})\n\n        ...\n\n        return expanded_queries\n <\/code><\/pre>\n\n\n\n<p>Opik supports many integrations for the most popular LLM tools, such as LlamaIndex, Ollama, Groq, AWS Bedrock, Anthropic, and more.<\/p>\n\n\n\n<p>\ud83d\udd17 Check the complete list&nbsp;<a href=\"https:\/\/www.comet.com\/docs\/opik\/cookbook\/quickstart_notebook\/\">here<\/a>&nbsp;[2].<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Tracking metadata<\/h3>\n\n\n\n<p>The last step is to attach the necessary metadata for your use case to the current trace.<\/p>\n\n\n\n<p>As seen in the following code snippet, you can easily do that by calling the&nbsp;<strong>update_current_trace()<\/strong>&nbsp;function, where you can tag your trace or add any other metadata through a Python dictionary, such as:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>the number of input and output tokens;<\/li>\n\n\n\n<li>the model IDs used throughout the inference;<\/li>\n\n\n\n<li>the prompt template and variables.<\/li>\n<\/ul>\n\n\n\n<p>All of this is critical information when debugging and evaluating prompts!<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code> from opik import opik_context\n\nopik_context.update_current_trace(\n    tags=&#91;\"rag\"],\n    metadata={\n        \"prompt_template\": prompt_template.template,\n        \"prompt_template_variables\": prompt_template_variables,\n        \"model_id\": settings.MODEL_ID,\n        \"embedding_model_id\": settings.EMBEDDING_MODEL_ID,\n        \"input_tokens\": input_num_tokens,\n        \"answer_tokens\": num_answer_tokens,\n        \"total_tokens\": input_num_tokens + num_answer_tokens,\n    },\n) \n <\/code><\/pre>\n\n\n\n<p>In Figure 3, we can observe how the metadata looks in Opik.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*nW_TgkVNDZdodghn.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Figure 3: Example of metadata in the Opik dashboard.<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"erty\">3. 
Monitoring complex traces with Opik<\/h2>\n\n\n\n<p>We must track a more complex trace than a simple prompt to monitor our&nbsp;<em>LLM Twin inference pipeline.<\/em><\/p>\n\n\n\n<p>To thoroughly debug and analyze our application, following a top-down approach, we have to track the following aspects:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The main\u00a0<strong>generate()<\/strong>\u00a0method.<\/li>\n\n\n\n<li>The prompt formatting step tracks the prompt template and variables.<\/li>\n\n\n\n<li>The call to the LLM service, which is hosted as a real-time endpoint on AWS SageMaker.<\/li>\n<\/ul>\n\n\n\n<p>Or advanced RAG elements, such as:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Top K chunks used as context.<\/li>\n\n\n\n<li>The results of the\u00a0<strong>QueryExpansion<\/strong>\u00a0step.<\/li>\n\n\n\n<li>The results of the\u00a0<strong>SelfQuery<\/strong>\u00a0step.<\/li>\n\n\n\n<li>The input and output of reranking the final chunks.<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*IUnf9vbN7L4ftdai.png\" alt=\"\"\/><\/figure>\n\n\n\n<p>Let\u2019s dig into the code to see how easily we can aggregate all these aspects into a single trace using Opik.<\/p>\n\n\n\n<p>We will start with the&nbsp;<strong>LLMTwin<\/strong>&nbsp;class, which aggregates all our inference logic. 
We won\u2019t discuss the class details, as we presented them in Lesson 9 when implementing the inference layer.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code> import opik\nfrom opik import opik_context\n\n# Remaining imports (settings, logger, pprint, sagemaker, etc.) are omitted for brevity.\n\n\nclass LLMTwin:\n    def __init__(self, mock: bool = False) -&gt; None:\n        self._mock = mock\n        self._llm_endpoint = self.build_sagemaker_predictor()\n        self.prompt_template_builder = InferenceTemplate()\n\n    def build_sagemaker_predictor(self) -&gt; HuggingFacePredictor:\n        return HuggingFacePredictor(\n            endpoint_name=settings.DEPLOYMENT_ENDPOINT_NAME,\n            sagemaker_session=sagemaker.Session(),\n        )\n\n    @opik.track(name=\"inference_pipeline.generate\")\n    def generate(\n        self,\n        query: str,\n        enable_rag: bool = False,\n        sample_for_evaluation: bool = False,\n    ) -&gt; dict:\n        system_prompt, prompt_template = self.prompt_template_builder.create_template(\n            enable_rag=enable_rag\n        )\n        prompt_template_variables = {\"question\": query}\n\n        if enable_rag is True:\n            retriever = VectorRetriever(query=query)\n            hits = retriever.retrieve_top_k(\n                k=settings.TOP_K, to_expand_to_n_queries=settings.EXPAND_N_QUERY\n            )\n            context = retriever.rerank(hits=hits, keep_top_k=settings.KEEP_TOP_K)\n            prompt_template_variables&#91;\"context\"] = context\n        else:\n            context = None\n\n        messages, input_num_tokens = self.format_prompt(\n            system_prompt, prompt_template, prompt_template_variables\n        )\n\n        logger.debug(f\"Prompt: {pprint.pformat(messages)}\")\n        answer = self.call_llm_service(messages=messages)\n        logger.debug(f\"Answer: {answer}\")\n\n        num_answer_tokens = compute_num_tokens(answer)\n        # Attach tags and metadata to the trace opened by the decorator above.\n        opik_context.update_current_trace(\n            tags=&#91;\"rag\"],\n            metadata={\n                \"prompt_template\": prompt_template.template,\n                \"prompt_template_variables\": prompt_template_variables,\n                \"model_id\": settings.MODEL_ID,\n                \"embedding_model_id\": settings.EMBEDDING_MODEL_ID,\n                \"input_tokens\": input_num_tokens,\n                \"answer_tokens\": num_answer_tokens,\n                \"total_tokens\": input_num_tokens + num_answer_tokens,\n            },\n        )\n\n        answer = {\"answer\": answer, \"context\": context}\n        if sample_for_evaluation is True:\n            add_to_dataset_with_sampling(\n                item={\"input\": {\"query\": query}, \"expected_output\": answer},\n                dataset_name=\"LLMTwinMonitoringDataset\",\n            )\n\n        return answer\n\n    @opik.track(name=\"inference_pipeline.format_prompt\")\n    def format_prompt(\n        self,\n        system_prompt,\n        prompt_template: PromptTemplate,\n        prompt_template_variables: dict,\n    ) -&gt; tuple&#91;list&#91;dict&#91;str, str]], int]:\n        ...  # Implementation here.\n\n        return messages, total_input_tokens\n\n    @opik.track(name=\"inference_pipeline.call_llm_service\")\n    def call_llm_service(self, messages: list&#91;dict&#91;str, str]]) -&gt; str:\n        ...  # Implementation here.\n\n        return answer <\/code><\/pre>\n\n\n\n<p>To monitor complex traces, it all boils down to two simple things:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use Opik\u2019s\u00a0<strong>@opik.track(name=\"\u2026\")<\/strong>\u00a0Python decorator on all your relevant functions, using the name argument to distinguish different steps.<\/li>\n\n\n\n<li>Split your core logic into functions that do only one thing (following the single-responsibility principle from software engineering). That way, you can ignore the implementation details and simply track the input and output of each function, as we did for the\u00a0<strong>format_prompt()<\/strong>\u00a0and\u00a0<strong>call_llm_service()<\/strong>\u00a0functions.<\/li>\n<\/ul>\n\n\n\n<p>To dig even deeper into our RAG logic, we can apply the same strategy to other components, such as the&nbsp;<strong>VectorRetriever<\/strong>, which retrieves our context and applies all the advanced RAG methods mentioned above.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code> class VectorRetriever:\n    def __init__(self, query: str) -&gt; None:\n        ...\n\n        self._query_expander = QueryExpansion()\n        self._metadata_extractor = SelfQuery()\n        self._reranker = Reranker()\n\n    @opik.track(name=\"retriever.retrieve_top_k\")\n    def retrieve_top_k(self, k: int, to_expand_to_n_queries: int) -&gt; list:\n        ...\n\n        return hits\n\n    @opik.track(name=\"retriever.rerank\")\n    def rerank(self, hits: list, keep_top_k: int) -&gt; list&#91;str]:\n        ...\n\n        return rerank_hits <\/code><\/pre>\n\n\n\n<p>We can go even deeper and monitor the QueryExpansion and SelfQuery functionality as follows:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code> class QueryExpansion:\n    opik_tracer = OpikTracer(tags=&#91;\"QueryExpansion\"])\n\n    @staticmethod\n    @opik.track(name=\"QueryExpansion.generate_response\")\n    def generate_response(query: str, to_expand_to_n: int) -&gt; list&#91;str]:\n        ...\n\n        chain = prompt | model\n        chain = chain.with_config({\"callbacks\": &#91;QueryExpansion.opik_tracer]})\n\n        ...\n\n        return stripped_queries <\/code><\/pre>\n\n\n\n<p>We applied both the Python decorator and Opik\u2019s LangChain integration as a proof of concept. Using both can be overkill in real-world applications, as it adds duplicate noise to the trace. If that happens, you can easily pick only one of the two options.<\/p>\n\n\n\n<p>Opik knows how to aggregate all these elements into a single trace, which can easily be visualized in its&nbsp;<a href=\"\/login?from=llm\">dashboard<\/a>, as seen in Figure 4.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*uucT34vIoxIcwCqX.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Figure 4: Example of monitoring the LLM Twin inference pipeline using Opik.<\/figcaption><\/figure>\n\n\n\n<p>You can easily debug and analyze each step, as illustrated in Figure 5.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*scSo0QhWxHmBnyre.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Figure 5: Inspect one step of the trace using Opik.<\/figcaption><\/figure>\n\n\n\n<p>Also, you can quickly see its associated&nbsp;<strong>metadata<\/strong>, as seen in Figure 6.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*cCD2xny6sc3zss3f.png\" alt=\"\"\/><figcaption 
class=\"wp-element-caption\">Figure 6: Inspect the trace\u2019s metadata using Opik.<\/figcaption><\/figure>\n\n\n\n<p>You can even use Opik\u2019s dashboard to label each trace with feedback scores. These scores can then be aggregated into a preference alignment dataset, which you can use to fine-tune your LLMs using techniques such as RLHF or DPO.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"sdre\">4. Sampling items for evaluating chains in production<\/h2>\n\n\n\n<p>So far, we\u2019ve looked into how to log and manually inspect our traces. Another important monitoring aspect is automatically assessing the inputs and outputs generated by your LLM system to ensure that everything still works as it did before deployment.<\/p>\n\n\n\n<p>To do so, while the inference pipeline is in production, you can add your input and output to an Opik monitoring dataset:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code> answer = {\"answer\": answer, \"context\": context}\nif sample_for_evaluation is True:\n    add_to_dataset_with_sampling(\n        item={\"input\": {\"query\": query}, \"expected_output\": answer},\n        dataset_name=\"LLMTwinMonitoringDataset\",\n    ) <\/code><\/pre>\n\n\n\n<p>As evaluating LLM systems using LLM judges is expensive, we don\u2019t want to assess all our traffic. The easiest way to avoid this is to do random sampling and save only a subset of your data:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code> import random\n\nimport opik\n\n\ndef add_to_dataset_with_sampling(item: dict, dataset_name: str) -&gt; bool:\n    # Keep roughly half of the items (both outcomes have a weight of 0.5).\n    if \"1\" in random.choices(&#91;\"0\", \"1\"], weights=&#91;0.5, 0.5]):\n        client = opik.Opik()\n        dataset = client.get_dataset(name=dataset_name)\n        dataset.insert(&#91;item])\n\n        return True\n\n    return False <\/code><\/pre>\n\n\n\n<p>You could move this to a different thread to avoid blocking your main thread with I\/O operations. Python releases the GIL while waiting on I\/O, so these calls can easily be parallelized across threads.<\/p>\n\n\n\n<p>You can also manually flag and add samples to the monitoring dataset from the traces you monitor. 
This is good practice when manually investigating your production data and finding helpful edge cases you want to evaluate, as seen in Figure 7.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:1400\/0*WjPxpzzsvyfhf779.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Figure 7: Add to dataset example.<\/figcaption><\/figure>\n\n\n\n<p>\ud83d\udd17&nbsp;<a href=\"https:\/\/github.com\/decodingml\/llm-twin-course\/blob\/main\/src\/inference_pipeline\/llm_twin.py\">Full code<\/a>&nbsp;of the&nbsp;<strong>LLMTwin<\/strong>&nbsp;class.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"x33b\">5. Evaluating chains in production<\/h2>\n\n\n\n<p>The last step is to evaluate the samples we collected while in production. We don\u2019t have ground truth (GT), so we cannot leverage all the metrics we presented in Lesson 8.<\/p>\n\n\n\n<p>But as LLM judges are super versatile, we don\u2019t need GTs for metrics such as:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Hallucination<\/li>\n\n\n\n<li>Moderation<\/li>\n\n\n\n<li>AnswerRelevance<\/li>\n\n\n\n<li>Style<\/li>\n<\/ul>\n\n\n\n<p>These are enough to trigger a monitoring alarm and notice when the system malfunctions.<\/p>\n\n\n\n<p>In the code snippet below, we implemented a Python script that runs all these metrics on top of the&nbsp;<strong>LLMTwinMonitoringDataset<\/strong>, which aggregates samples from production.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code> import argparse\n\nimport opik\nfrom config import settings\nfrom core.logger_utils import get_logger\nfrom opik.evaluation import evaluate\nfrom opik.evaluation.metrics import AnswerRelevance, Hallucination, Moderation\n\nfrom .style import Style\n\nlogger = get_logger(__name__)\n\n\ndef evaluation_task(x: dict) -&gt; dict:\n    # Map a stored production sample to the fields the metrics expect.\n    return {\n        \"input\": x&#91;\"input\"]&#91;\"query\"],\n        \"context\": x&#91;\"expected_output\"]&#91;\"context\"],\n        \"output\": x&#91;\"expected_output\"]&#91;\"answer\"],\n    }\n\n\ndef main() -&gt; None:\n    parser = argparse.ArgumentParser(description=\"Evaluate monitoring script.\")\n    parser.add_argument(\n        \"--dataset_name\",\n        type=str,\n        default=\"LLMTwinMonitoringDataset\",\n        help=\"Name of the dataset to evaluate\",\n    )\n\n    args = parser.parse_args()\n\n    dataset_name = args.dataset_name\n\n    logger.info(f\"Evaluating Opik dataset: '{dataset_name}'\")\n\n    client = opik.Opik()\n    try:\n        dataset = client.get_dataset(dataset_name)\n    except Exception:\n        logger.error(f\"Monitoring dataset '{dataset_name}' not found in Opik. Exiting.\")\n        exit(1)\n\n    experiment_config = {\n        \"model_id\": settings.MODEL_ID,\n    }\n\n    scoring_metrics = &#91;Hallucination(), Moderation(), AnswerRelevance(), Style()]\n    evaluate(\n        dataset=dataset,\n        task=evaluation_task,\n        scoring_metrics=scoring_metrics,\n        experiment_config=experiment_config,\n    ) <\/code><\/pre>\n\n\n\n<p><em>You can find more details on how the code above and LLM &amp; RAG evaluation work in Lesson 8.<\/em><\/p>\n\n\n\n<p>The production data is collected in real-time from all the requests made by the clients.<\/p>\n\n\n\n<p>The simplest way to ship the monitoring evaluation pipeline is in offline batch mode, which can easily be scheduled to run every hour.<\/p>\n\n\n\n<p>Another option is to evaluate each sample independently or to create a trigger, such as evaluating whenever ~50 new samples have been collected. How often you run the evaluation depends a lot on the nature of your application (e.g., medical vs. retail).<\/p>\n\n\n\n<p><strong>The next step<\/strong>&nbsp;is to hook the evaluation pipeline to an alerting system that notices when the application has moderation, hallucination, or other business issues so we can quickly respond.<\/p>\n\n\n\n<p>\ud83d\udd17&nbsp;<a href=\"https:\/\/github.com\/decodingml\/llm-twin-course\/blob\/main\/src\/inference_pipeline\/evaluation\/evaluate_monitoring.py\">Full code<\/a>&nbsp;of the monitoring evaluation pipeline.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"4f4f\">6. 
Testing out the prompt monitoring service<\/h2>\n\n\n\n<p>If you properly set up Opik and the LLM Twin inference pipeline, as explained in the&nbsp;<a href=\"https:\/\/github.com\/decodingml\/llm-twin-course\/blob\/main\/INSTALL_AND_USAGE.md\">INSTALL_AND_USAGE<\/a>&nbsp;document from GitHub, the data will be automatically collected in Opik\u2019s dashboard.<\/p>\n\n\n\n<p>Thus, to test things out, first deploy the infrastructure:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>make local-start # Local infrastructure for RAG\nmake deploy-inference-pipeline  # Deploy LLM to AWS SageMaker <\/code><\/pre>\n\n\n\n<p>Now, call the inference pipeline:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">make call-inference-pipeline<\/pre>\n\n\n\n<p>Ultimately, go to:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><a href=\"\/login?from=llm\">Opik\u2019s dashboard<\/a><\/li>\n\n\n\n<li>\u201c<em>llm-twin<\/em>\u201d project<\/li>\n<\/ol>\n\n\n\n<p>And you should see the traces over there.<\/p>\n\n\n\n<p>To test out the evaluation pipeline, as it runs as a different process, run the following:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">make evaluate-llm-monitoring<\/pre>\n\n\n\n<p>To run the monitoring evaluation pipeline successfully, ensure you run your inference pipeline a few times so some samples are logged into the monitoring dataset.<\/p>\n\n\n\n<p>Don\u2019t forget to stop the AWS SageMaker inference endpoint once you are done testing:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">make delete-inference-pipeline-deployment<\/pre>\n\n\n\n<p><em>Find&nbsp;<strong>step-by-step instructions<\/strong>&nbsp;on installing and running&nbsp;<strong>the entire course<\/strong>&nbsp;in our&nbsp;<a href=\"https:\/\/github.com\/decodingml\/llm-twin-course\/blob\/main\/INSTALL_AND_USAGE.md\">INSTALL_AND_USAGE<\/a>&nbsp;document from the repository.<\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>In this lesson of the LLM Twin course, you learned 
to&nbsp;<strong>build<\/strong>&nbsp;a monitoring service and evaluation pipeline.<\/p>\n\n\n\n<p><strong>First<\/strong>, we saw why we need specialized software to monitor prompts and traces.<\/p>\n\n\n\n<p><strong>Next<\/strong>, we looked into how to implement a prompt monitoring layer.<\/p>\n\n\n\n<p><strong>Finally<\/strong>, we saw how to build a monitoring evaluation pipeline.<\/p>\n\n\n\n<p><em>With this, we\u2019ve&nbsp;<strong>wrapped up<\/strong>&nbsp;the core lessons of the&nbsp;<strong>LLM Twin open-source course<\/strong>. We hope you enjoyed it and it brought value to your LLM &amp; RAG skills.<\/em><\/p>\n\n\n\n<p>Continue the course with the bonus Lesson 11, which shows you how to optimize the RAG modules using Superlinked.<\/p>\n\n\n\n<p>\ud83d\udd17 Consider checking out the GitHub repository [1] and support us with a \u2b50\ufe0f<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">References<\/h3>\n\n\n\n<p><strong>Literature<\/strong><br>[1] Your LLM Twin Course \u2014 GitHub Repository (2024), Decoding ML GitHub Organization<br>[2] Quickstart notebook \u2014 Summarization task | Opik Documentation. (n.d.). https:\/\/www.comet.com\/docs\/opik\/cookbook\/quickstart_notebook<\/p>\n\n\n\n<p><strong>Images<\/strong><br>If not otherwise stated, all images are created by the author.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Welcome to Lesson 10 of 12&nbsp;in our free course series, LLM Twin: Building Your Production-Ready AI Replica. You\u2019ll learn how to use LLMs, vector DBs, and LLMOps best practices to design, train, and deploy a production-ready \u201cLLM twin\u201d of yourself. 
This AI character will write like you, incorporating your style, personality, and voice into [&hellip;]<\/p>\n","protected":false},"author":128,"featured_media":10100,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[8,65,7],"tags":[],"coauthors":[222,223],"class_list":["post-10094","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-comet-community-hub","category-llmops","category-tutorials"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>The Ultimate Prompt Monitoring Pipeline<\/title>\n<meta name=\"description\" content=\"How to compute LLM evaluation metrics on top of your production data to get alerted about hallucinations, content moderation and other issues.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/rag-evaluation-framework-ragas\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"The Ultimate Prompt Monitoring Pipeline\" \/>\n<meta property=\"og:description\" content=\"How to compute LLM evaluation metrics on top of your production data to get alerted about hallucinations, content moderation and other issues.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/rag-evaluation-framework-ragas\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" 
content=\"2024-07-31T18:37:20+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-04-29T12:44:08+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/07\/rag-evaluation-ragas.png\" \/>\n\t<meta property=\"og:image:width\" content=\"700\" \/>\n\t<meta property=\"og:image:height\" content=\"400\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Paul Iusztin, Decoding ML\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Cometml\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Paul Iusztin, Decoding ML\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"13 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"The Ultimate Prompt Monitoring Pipeline","description":"How to compute LLM evaluation metrics on top of your production data to get alerted about hallucinations, content moderation and other issues.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/rag-evaluation-framework-ragas\/","og_locale":"en_US","og_type":"article","og_title":"The Ultimate Prompt Monitoring Pipeline","og_description":"How to compute LLM evaluation metrics on top of your production data to get alerted about hallucinations, content moderation and other 
issues.","og_url":"https:\/\/www.comet.com\/site\/blog\/rag-evaluation-framework-ragas\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2024-07-31T18:37:20+00:00","article_modified_time":"2025-04-29T12:44:08+00:00","og_image":[{"width":700,"height":400,"url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/07\/rag-evaluation-ragas.png","type":"image\/png"}],"author":"Paul Iusztin, Decoding ML","twitter_card":"summary_large_image","twitter_creator":"@Cometml","twitter_site":"@Cometml","twitter_misc":{"Written by":"Paul Iusztin, Decoding ML","Est. reading time":"13 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/rag-evaluation-framework-ragas\/#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/rag-evaluation-framework-ragas\/"},"author":{"name":"Paul Iusztin","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/87bf0cb600025605b68dcd2f0d597560"},"headline":"The Ultimate Prompt Monitoring Pipeline","datePublished":"2024-07-31T18:37:20+00:00","dateModified":"2025-04-29T12:44:08+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/rag-evaluation-framework-ragas\/"},"wordCount":2431,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/rag-evaluation-framework-ragas\/#primaryimage"},"thumbnailUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/07\/rag-evaluation-ragas.png","articleSection":["Comet Community Hub","LLMOps","Tutorials"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/rag-evaluation-framework-ragas\/","url":"https:\/\/www.comet.com\/site\/blog\/rag-evaluation-framework-ragas\/","name":"The Ultimate Prompt Monitoring 
Pipeline","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/rag-evaluation-framework-ragas\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/rag-evaluation-framework-ragas\/#primaryimage"},"thumbnailUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/07\/rag-evaluation-ragas.png","datePublished":"2024-07-31T18:37:20+00:00","dateModified":"2025-04-29T12:44:08+00:00","description":"How to compute LLM evaluation metrics on top of your production data to get alerted about hallucinations, content moderation and other issues.","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/rag-evaluation-framework-ragas\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/rag-evaluation-framework-ragas\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/rag-evaluation-framework-ragas\/#primaryimage","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/07\/rag-evaluation-ragas.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/07\/rag-evaluation-ragas.png","width":700,"height":400,"caption":"illustration of a human face with colored lines and symbols radiating outward to visualize the concept of neural networks"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/rag-evaluation-framework-ragas\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"The Ultimate Prompt Monitoring Pipeline"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models 
Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/87bf0cb600025605b68dcd2f0d597560","name":"Paul Iusztin","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/82264b94fb97af87b79646edc7e4fd81","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/05\/cropped-paul-iusztin-96x96.webp","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/05\/cropped-paul-iusztin-96x96.webp","caption":"Paul 
Iusztin"},"sameAs":["https:\/\/decodingml.substack.com\/"],"url":"https:\/\/www.comet.com\/site\/blog\/author\/paul-iusztin\/"}]}},"jetpack_featured_media_url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/07\/rag-evaluation-ragas.png","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/10094","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/128"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=10094"}],"version-history":[{"count":2,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/10094\/revisions"}],"predecessor-version":[{"id":15797,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/10094\/revisions\/15797"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media\/10100"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=10094"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=10094"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=10094"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=10094"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}