{"id":9900,"date":"2024-05-10T13:49:41","date_gmt":"2024-05-10T21:49:41","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=9900"},"modified":"2025-04-29T12:41:27","modified_gmt":"2025-04-29T12:41:27","slug":"advanced-rag-algorithms-optimize-retrieval","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/advanced-rag-algorithms-optimize-retrieval\/","title":{"rendered":"The 4 Advanced RAG Algorithms You Must Know to Implement"},"content":{"rendered":"\n<p><em>Welcome to&nbsp;<strong>Lesson 5<\/strong><strong>&nbsp;of 12<\/strong>&nbsp;in our free course series,&nbsp;<strong>LLM Twin: Building Your Production-Ready AI Replica<\/strong>. You\u2019ll learn how to use LLMs, vector DVs, and LLMOps best practices to design, train, and deploy a production ready \u201cLLM twin\u201d of yourself. This AI character will write like you, incorporating your style, personality, and voice into an LLM. For a full overview of course objectives and prerequisites, start with&nbsp;<a href=\"https:\/\/www.comet.com\/site\/blog\/an-end-to-end-framework-for-production-ready-llm-systems-by-building-your-llm-twin\/\">Lesson 1<\/a>.<\/em><\/p>\n\n\n\n<p><strong>Lessons<\/strong><\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/an-end-to-end-framework-for-production-ready-llm-systems-by-building-your-llm-twin\/\">An End-to-End Framework for Production-Ready LLM Systems by Building Your LLM Twin<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/the-importance-of-data-pipelines-in-the-era-of-generative-ai\/\">Your Content is Gold: I Turned 3 Years of Blog Posts into an LLM Training<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/llm-twin-3-change-data-capture\/\">I Replaced 1000 Lines of Polling Code with 50 Lines of CDC Magic<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/streaming-pipelines-for-fine-tuning-llms\/\">SOTA Python Streaming Pipelines for 
Fine-tuning LLMs and RAG \u2014 in Real-Time!<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/advanced-rag-algorithms-optimize-retrieval\/\">The 4 Advanced RAG Algorithms You Must Know to Implement<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/llm-fine-tuning-dataset\/\">Turning Raw Data Into Fine-Tuning Datasets<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/mistral-llm-fine-tuning\/\">8B Parameters, 1 GPU, No Problems: The Ultimate LLM Fine-tuning Pipeline<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/llm-evaluation-best-practices\/\">The Engineer\u2019s Framework for LLM &amp; RAG Evaluation<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/llm-rag-inference-pipelines\/\">Beyond Proof of Concept: Building RAG Systems That Scale<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/rag-evaluation-framework-ragas\/\">The Ultimate Prompt Monitoring Pipeline<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/refactoring-rag-retrieval\/\">[Bonus] Build a scalable RAG ingestion pipeline using 74.3% less code<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/multi-index-rag-apps\/\">[Bonus] Build Multi-Index Advanced RAG Apps<\/a><\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p id=\"b55f\">In&nbsp;<strong>Lesson 5<\/strong>, we will focus on building an advanced retrieval module used for RAG.<\/p>\n\n\n\n<p id=\"36c7\">We will show you how to implement 4&nbsp;<strong>retrieval<\/strong>&nbsp;and&nbsp;<strong>post-retrieval advanced optimization techniques<\/strong>&nbsp;to&nbsp;<strong>improve<\/strong>&nbsp;the&nbsp;<strong>accuracy<\/strong>&nbsp;of your&nbsp;<strong>RAG retrieval step<\/strong>.<\/p>\n\n\n\n<p id=\"3e1b\">In this lesson, we will focus only on the retrieval part of the RAG system.<\/p>\n\n\n\n<p id=\"16e0\">In&nbsp;<a 
href=\"https:\/\/www.comet.com\/site\/blog\/streaming-pipelines-for-fine-tuning-llms\/\"><strong>Lesson 4<\/strong><\/a>, we showed you how to clean, chunk, embed, and load social media data to a&nbsp;<a href=\"https:\/\/qdrant.tech\/?utm_source=decodingml&amp;utm_medium=referral&amp;utm_campaign=llm-course\" target=\"_blank\" rel=\"noreferrer noopener\">Qdrant vector DB<\/a>&nbsp;(the ingestion part of RAG).<\/p>\n\n\n\n<p id=\"f8ae\">In future lessons, we will integrate this retrieval module into the inference pipeline for a full-fledged RAG system.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:1000\/1*AAPl6tL6KFXKyqk16rc7Dw.png\" alt=\"flow chart visualizing retrieval-augmented generation python module architecture \" title=\"Retrieval-augmented generation python module architecture\"\/><figcaption class=\"wp-element-caption\">Retrieval Python Module Architecture<\/figcaption><\/figure>\n\n\n\n<p id=\"067a\">We<strong>&nbsp;assume&nbsp;<\/strong>you<strong>&nbsp;<\/strong>are<strong>&nbsp;already familiar&nbsp;<\/strong>with<strong>&nbsp;<\/strong>what<strong>&nbsp;<\/strong>a<strong>&nbsp;naive RAG looks like.<\/strong>&nbsp;<strong>If not<\/strong>,&nbsp;<strong>check out&nbsp;<a href=\"https:\/\/medium.com\/decodingml\/why-you-must-choose-streaming-over-batch-pipelines-when-doing-rag-in-llm-applications-3b6fd32a93ff\">this article<\/a><\/strong>&nbsp;from&nbsp;<a href=\"https:\/\/medium.com\/decodingml\">Decoding ML<\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"ed65\">Table of Contents<\/h2>\n\n\n\n<ol class=\"wp-block-list\">\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/advanced-rag-algorithms-optimize-retrieval\/#bb38\">Overview of advanced RAG optimization techniques<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/advanced-rag-algorithms-optimize-retrieval\/#ebae\">Advanced RAG techniques applied to the LLM twin<\/a><\/li>\n\n\n\n<li><a 
href=\"https:\/\/www.comet.com\/site\/blog\/advanced-rag-algorithms-optimize-retrieval\/#9e9e\">Retrieval optimization (1): Query expansion<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/advanced-rag-algorithms-optimize-retrieval\/#e64d\">Retrieval optimization (2): Self query<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/advanced-rag-algorithms-optimize-retrieval\/#e16b\">Retrieval optimization (3): Hybrid &amp; filtered vector search<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/advanced-rag-algorithms-optimize-retrieval\/#e27b\">Implement the advanced retrieval Python class<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/advanced-rag-algorithms-optimize-retrieval\/#100f\">Post-retrieval optimization: Rerank using GPT-4<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/advanced-rag-algorithms-optimize-retrieval\/#1f5c\">Running the RAG retrieval module<\/a><\/li>\n<\/ol>\n\n\n\n<p id=\"2559\"><em>\ud83d\udd17&nbsp;<\/em><strong><em>Check out&nbsp;<\/em><\/strong><a href=\"https:\/\/github.com\/decodingml\/llm-twin-course\" target=\"_blank\" rel=\"noreferrer noopener\"><em>the code on GitHub<\/em><\/a><em>&nbsp;[1] and support us with a \u2b50\ufe0f<\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"bb38\">1. 
Overview of advanced RAG optimization techniques<\/h2>\n\n\n\n<p id=\"9862\">A production RAG system is split into&nbsp;<strong>3 main components<\/strong>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Ingestion:<\/strong>\u00a0clean, chunk, embed, and load your data to a vector DB<\/li>\n\n\n\n<li><strong>Retrieval:<\/strong>\u00a0query your vector DB for context<\/li>\n\n\n\n<li><strong>Generation:<\/strong>\u00a0attach the retrieved context to your prompt and pass it to an LLM<\/li>\n<\/ul>\n\n\n\n<p id=\"ce57\">The&nbsp;<strong>ingestion component<\/strong>&nbsp;sits in the&nbsp;<em>feature pipeline<\/em>, while the&nbsp;<strong>retrieval<\/strong>&nbsp;and&nbsp;<strong>generation<\/strong>&nbsp;<strong>components<\/strong>&nbsp;are implemented inside the&nbsp;<em>inference pipeline<\/em>.<\/p>\n\n\n\n<p id=\"dfee\">You can&nbsp;<strong>also<\/strong>&nbsp;<strong>use<\/strong>&nbsp;the&nbsp;<strong>retrieval<\/strong>&nbsp;and&nbsp;<strong>generation<\/strong>&nbsp;<strong>components<\/strong>&nbsp;in your&nbsp;<em>training pipeline<\/em>&nbsp;to fine-tune your LLM further on domain-specific prompts.<\/p>\n\n\n\n<p id=\"818c\">You can apply advanced techniques to optimize your RAG system for ingestion, retrieval and generation.<\/p>\n\n\n\n<p id=\"fe40\"><em>That being said, there are 3 main types of advanced RAG techniques:<\/em><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Pre-retrieval optimization [<\/strong>ingestion]: tweak how you create the chunks<\/li>\n\n\n\n<li><strong>Retrieval optimization\u00a0<\/strong>[retrieval]:<strong>\u00a0<\/strong>improve the queries to your vector DB<\/li>\n\n\n\n<li><strong>Post-retrieval optimization\u00a0<\/strong>[retrieval]<strong>:\u00a0<\/strong>process the retrieved chunks to filter out the noise<\/li>\n<\/ul>\n\n\n\n<p><em>The&nbsp;generation step&nbsp;can be&nbsp;improved&nbsp;through fine-tuning or prompt engineering, which will be explained in future lessons.<\/em><\/p>\n\n\n\n<p 
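id=\"rag3code\"><em>To make these three components concrete, here is a minimal, self-contained sketch of a naive RAG flow. The toy bag-of-words \u201cembedding\u201d and the in-memory \u201cvector DB\u201d are illustrative stand-ins for a real embedding model and Qdrant, not the course\u2019s actual code \u2193<\/em><\/p>\n\n\n\n

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy embedding: bag-of-words counts. A real pipeline would call a
    # sentence-transformer model here instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# 1. Ingestion: clean, chunk, embed, and load the data into a "vector DB".
chunks = [
    "Qdrant is a vector database written in Rust.",
    "An LLM twin mimics your writing style and voice.",
]
vector_db = [(embed(chunk), chunk) for chunk in chunks]

# 2. Retrieval: query the vector DB for the most relevant context.
query = "Which language is the vector database written in?"
context = max(vector_db, key=lambda item: cosine(embed(query), item[0]))[1]

# 3. Generation: attach the retrieved context to the prompt and pass it to an LLM.
prompt = f"Answer using only this context: {context}\n\nQuestion: {query}"
```

\n\n\n\n<p id=\"rag3note\"><em>In the real system, step 1 lives in the feature pipeline, while steps 2 and 3 live in the inference pipeline, as described above.<\/em><\/p>\n\n\n\n<p 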
id=\"a0d1\">The&nbsp;<strong>pre-retrieval optimization techniques<\/strong>&nbsp;are explained in&nbsp;<a href=\"https:\/\/www.comet.com\/site\/blog\/streaming-pipelines-for-fine-tuning-llms\/\">Lesson 4<\/a>.<\/p>\n\n\n\n<p id=\"d6d0\">In this lesson, we will show you some&nbsp;<strong>popular<\/strong>&nbsp;<strong>retrieval<\/strong>&nbsp;and&nbsp;<strong>post-retrieval<\/strong>&nbsp;<strong>optimization techniques<\/strong>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"ebae\">2. Advanced RAG techniques applied to the LLM twin<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"81e8\"><strong>Retrieval optimization<\/strong><\/h3>\n\n\n\n<p id=\"799b\"><em>We will combine 3 techniques:<\/em><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Query Expansion<\/li>\n\n\n\n<li>Self Query<\/li>\n\n\n\n<li>Filtered vector search<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"1189\"><strong>Post-retrieval optimization<\/strong><\/h3>\n\n\n\n<p id=\"20ad\">We will&nbsp;<strong>use<\/strong>&nbsp;the&nbsp;<strong>rerank<\/strong>&nbsp;pattern&nbsp;<strong>using<\/strong>&nbsp;<strong>GPT-4<\/strong>&nbsp;and&nbsp;<strong>prompt engineering<\/strong>&nbsp;instead of&nbsp;<a href=\"https:\/\/cohere.com\/rerank\" target=\"_blank\" rel=\"noreferrer noopener\">Cohere<\/a>&nbsp;or an&nbsp;<a href=\"https:\/\/www.sbert.net\/examples\/applications\/retrieve_rerank\/README.html\" target=\"_blank\" rel=\"noreferrer noopener\">open-source re-ranker cross-encoder<\/a>&nbsp;[4].<\/p>\n\n\n\n<p id=\"da22\">I don\u2019t want to spend too much time on the theoretical aspects. 
There are plenty of articles on that.<\/p>\n\n\n\n<p id=\"717b\"><em>So, we will&nbsp;<\/em><strong><em>jump<\/em><\/strong><em>&nbsp;straight to&nbsp;<\/em><strong><em>implementing<\/em><\/strong><em>&nbsp;and&nbsp;<\/em><strong><em>integrating<\/em><\/strong><em>&nbsp;these techniques in our LLM twin system.<\/em><\/p>\n\n\n\n<p>But before seeing the code, let\u2019s clarify a few things \u2193<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*ui2cQRlRDVnKrXPXk7COLA.png\" alt=\"architecture diagram showing retrieval-augmented generation process \"\/><figcaption class=\"wp-element-caption\">Advanced RAG architecture<\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"3d24\"><strong>2.1 Important Note!<\/strong><\/h3>\n\n\n\n<p id=\"707c\">We<strong>&nbsp;<\/strong>will<strong>&nbsp;<\/strong>show<strong>&nbsp;<\/strong>you a<strong>&nbsp;custom implementation&nbsp;<\/strong>of<strong>&nbsp;<\/strong>the&nbsp;<strong>advanced techniques<\/strong>&nbsp;and&nbsp;<strong>NOT use LangChain.<\/strong><\/p>\n\n\n\n<p id=\"6ae4\">Our primary&nbsp;<strong>goal<\/strong>&nbsp;is to&nbsp;<strong>build<\/strong>&nbsp;your&nbsp;<strong>intuition<\/strong>&nbsp;about how they&nbsp;<strong>work<\/strong>&nbsp;<strong>behind the scenes<\/strong>. However, we will&nbsp;<strong>attach LangChain\u2019s equivalent<\/strong>&nbsp;so you can use them in your apps.<\/p>\n\n\n\n<p id=\"0efa\"><strong>Customizing LangChain<\/strong>&nbsp;can be a&nbsp;<strong>real headache<\/strong>. 
Thus, understanding what happens behind its utilities can help you build real-world applications.<\/p>\n\n\n\n<p id=\"7f56\">Also, it is&nbsp;<strong>critical<\/strong>&nbsp;to&nbsp;<strong>know<\/strong>&nbsp;that if you don\u2019t ingest the data using LangChain, you cannot use their retrievers either, as they expect the data to be in a specific format.<\/p>\n\n\n\n<p id=\"855f\">We haven\u2019t used LangChain\u2019s ingestion function in&nbsp;<a href=\"https:\/\/www.comet.com\/site\/blog\/streaming-pipelines-for-fine-tuning-llms\/\">Lesson 4<\/a>&nbsp;either (the feature pipeline that loads data to Qdrant) as we want to do everything \u201cby hand\u201d.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"c8a8\">2.2. Why Qdrant?<\/h3>\n\n\n\n<p id=\"edc7\">There are many vector DBs out there, too many\u2026<\/p>\n\n\n\n<p id=\"f687\">But since we discovered Qdrant, we loved it.<\/p>\n\n\n\n<p id=\"28d0\"><strong>Why?<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>It is built in Rust.<\/li>\n\n\n\n<li>Apache-2.0 license \u2014 open-source \ud83d\udd25<\/li>\n\n\n\n<li>It has a great and intuitive Python SDK.<\/li>\n\n\n\n<li>It has a freemium self-hosted version to build PoCs for free.<\/li>\n\n\n\n<li>It supports unlimited document sizes, and vector dims of up to 65,536.<\/li>\n\n\n\n<li>It is production-ready. Companies such as Disney, Mozilla, and Microsoft already use it.<\/li>\n\n\n\n<li>It is one of the most popular vector DBs out there.<\/li>\n<\/ul>\n\n\n\n<p id=\"9960\"><strong><em>To<\/em><\/strong><em>&nbsp;<\/em><strong><em>put that in perspective,<\/em><\/strong>&nbsp;Pinecone, one of its biggest competitors, supports only documents with up to 40k tokens and vectors with up to 20k dimensions\u2026
and a proprietary license.<\/p>\n\n\n\n<p id=\"be69\">I could go on and on\u2026<\/p>\n\n\n\n<p id=\"a1c6\">\u2026but if you are&nbsp;<strong>curious to find out more<\/strong>,&nbsp;<a href=\"https:\/\/qdrant.tech\/?utm_source=decodingml&amp;utm_medium=referral&amp;utm_campaign=llm-course\" target=\"_blank\" rel=\"noreferrer noopener\"><em>check out Qdrant<\/em><\/a>&nbsp;\u2190<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"9e9e\">3. Retrieval optimization (1): Query expansion<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"a7c7\">The problem<\/h3>\n\n\n\n<p id=\"e175\">In a typical retrieval step, you query your vector DB using a single point.<\/p>\n\n\n\n<p id=\"9e3a\"><strong>The issue<\/strong>&nbsp;with that approach is that by&nbsp;<strong>using<\/strong>&nbsp;a&nbsp;<strong>single vector<\/strong>, you&nbsp;<strong>cover<\/strong>&nbsp;only a&nbsp;<strong>small area<\/strong>&nbsp;of your&nbsp;<strong>embedding space<\/strong>.<\/p>\n\n\n\n<p id=\"5659\">Thus, if your embedding doesn\u2019t contain all the required information, your retrieved context will not be relevant.<\/p>\n\n\n\n<p id=\"bef6\"><strong>What<\/strong>&nbsp;if we&nbsp;<strong>could<\/strong>&nbsp;<strong>query<\/strong>&nbsp;the&nbsp;<strong>vector DB<\/strong>&nbsp;with&nbsp;<strong>multiple<\/strong>&nbsp;<strong>data points<\/strong>&nbsp;that are semantically related?<\/p>\n\n\n\n<p id=\"cbfd\">That is what the&nbsp;<strong>\u201cQuery expansion\u201d&nbsp;<\/strong>technique is doing!<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"747e\">The solution<\/h3>\n\n\n\n<p id=\"d4ea\">Query expansion is quite intuitive.<\/p>\n\n\n\n<p id=\"20c6\">You use an LLM to generate multiple queries based on your initial query.<\/p>\n\n\n\n<p id=\"4e85\">These queries should contain multiple perspectives of the initial query.<\/p>\n\n\n\n<p id=\"9c12\">Thus, when embedded, they hit different areas of your embedding space that are still relevant to our initial question.<\/p>\n\n\n\n<p 
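id=\"qe-code\"><em>As a hedged sketch of the idea (the prompt wording is illustrative, and&nbsp;<code>llm<\/code>&nbsp;is a stand-in for a real chat-model call) \u2193<\/em><\/p>\n\n\n\n

```python
# Query expansion: fan one user question out into N semantically related
# queries, each hitting a different region of the embedding space.
EXPANSION_PROMPT = (
    "You are an AI language model assistant. Generate {n} different versions "
    "of the given user question to retrieve relevant documents from a vector "
    "database. Provide each version on a separate line.\n\nQuestion: {question}"
)

def expand_query(question: str, n: int, llm) -> list[str]:
    raw = llm(EXPANSION_PROMPT.format(n=n, question=question))
    expanded = [line.strip() for line in raw.splitlines() if line.strip()]
    # Keep the original question as one of the search points as well.
    return [question] + expanded[:n]

# Stubbed LLM so the sketch runs without an API key.
def fake_llm(prompt: str) -> str:
    return "What is RAG?\nHow does retrieval-augmented generation work?"

# Each resulting query is then embedded and searched independently.
queries = expand_query("Explain RAG", n=2, llm=fake_llm)
```

\n\n\n\n<p 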
id=\"51de\">You can do query expansion with a detailed zero-shot prompt.<\/p>\n\n\n\n<p id=\"d989\">Here is our simple &amp; custom solution \u2193<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:1000\/1*j9j1SiJiB-_YMFDthF04hA.png\" alt=\"Query expansion template\"\/><figcaption class=\"wp-element-caption\">Query expansion template<\/figcaption><\/figure>\n\n\n\n<p id=\"1b3a\"><em>Here is&nbsp;<a href=\"https:\/\/python.langchain.com\/docs\/modules\/data_connection\/retrievers\/MultiQueryRetriever\/\" target=\"_blank\" rel=\"noreferrer noopener\">LangChain\u2019s&nbsp;<strong>MultiQueryRetriever<\/strong>&nbsp;class<\/a>&nbsp;[5] (their equivalent).<\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"e64d\">4. Retrieval optimization (2): Self query<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"2db2\">The problem<\/h3>\n\n\n\n<p id=\"7754\">When embedding your query, you cannot guarantee that all the aspects required by your use case are present in the embedding vector.<\/p>\n\n\n\n<p id=\"9fcc\">For example, you want to be 100% sure that your retrieval relies on the tags provided in the query.<\/p>\n\n\n\n<p id=\"9751\">The issue is that by embedding the query prompt, you can never be sure that the tags are represented in the embedding vector or have enough signal when computing the distance against other vectors.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"edaa\">The solution<\/h3>\n\n\n\n<p id=\"3a77\">What if you could extract the tags within the query and use them along the embedded query?<\/p>\n\n\n\n<p id=\"ce36\">That is what self-query is all about!<\/p>\n\n\n\n<p id=\"2c59\">You use an LLM to extract various metadata fields that are critical for your business use case (e.g., tags, author ID, number of comments, likes, shares, etc.)<\/p>\n\n\n\n<p id=\"4c2c\">In our custom solution, we are extracting just the author ID. 
Thus, a zero-shot prompt engineering technique will do the job.<\/p>\n\n\n\n<p id=\"e840\">But, when extracting multiple metadata types, you should also use few-shot learning to optimize the extraction step.<\/p>\n\n\n\n<p id=\"54bb\"><em>Self-queries work hand-in-hand with filtered vector searches, which we will explain in the next section.<\/em><\/p>\n\n\n\n<p id=\"7f64\">Here is our solution \u2193<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:1000\/1*q-8OLe-67jXNB_wl8f6t3Q.png\" alt=\"self-query code template \"\/><figcaption class=\"wp-element-caption\">Self-query template<\/figcaption><\/figure>\n\n\n\n<p id=\"c7ef\"><em>Here is&nbsp;<a href=\"https:\/\/python.langchain.com\/docs\/modules\/data_connection\/retrievers\/self_query\/\" target=\"_blank\" rel=\"noreferrer noopener\">LangChain\u2019s&nbsp;<strong>SelfQueryRetriever<\/strong>&nbsp;class<\/a>&nbsp;[6] equivalent and this is&nbsp;<a href=\"https:\/\/python.langchain.com\/docs\/integrations\/retrievers\/self_query\/qdrant_self_query\/\" target=\"_blank\" rel=\"noreferrer noopener\">an example using Qdrant<\/a>&nbsp;[8].<\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"e16b\">5.
Retrieval optimization (3): Hybrid &amp; filtered vector search<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"e1dc\">The problem<\/h3>\n\n\n\n<p id=\"75f3\">Embeddings are great for capturing the general semantics of a specific chunk.<\/p>\n\n\n\n<p id=\"8f00\">But they are not that great for querying specific keywords.<\/p>\n\n\n\n<p id=\"7ec4\">For example, if we want to retrieve article chunks about LLMs from our&nbsp;<a href=\"https:\/\/qdrant.tech\/?utm_source=decodingml&amp;utm_medium=referral&amp;utm_campaign=llm-course\" target=\"_blank\" rel=\"noreferrer noopener\">Qdrant vector DB<\/a>, embeddings would be enough.<\/p>\n\n\n\n<p id=\"122f\">However, if we want to query for a specific LLM type (e.g., Llama 3), using only similarities between embeddings won\u2019t be enough.<\/p>\n\n\n\n<p id=\"e492\">Thus, embeddings are not great for finding exact phrase matches for specific terms.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"734e\">The solution<\/h3>\n\n\n\n<p id=\"bf3d\"><strong>Combine the vector search technique with one (or more) complementary search strategies that work great for finding exact words.<\/strong><\/p>\n\n\n\n<p id=\"44f2\">There is no fixed rule for which algorithms are combined, but the most standard hybrid search strategy combines traditional keyword-based search with modern vector search.<\/p>\n\n\n\n<p id=\"1ef9\"><em>How are these combined?<\/em><\/p>\n\n\n\n<p id=\"2474\"><em>The&nbsp;<\/em><strong><em>first method<\/em><\/strong><em>&nbsp;is to merge the similarity scores of the 2 techniques as follows:<\/em><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code> hybrid_score = (1 - alpha) * sparse_score + alpha * dense_score <\/code><\/pre>\n\n\n\n<p id=\"284e\">Here,&nbsp;<strong>alpha<\/strong>&nbsp;takes a value in [0, 1], with:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>alpha = 1<\/strong>: Vector search<\/li>\n\n\n\n<li><strong>alpha = 0<\/strong>: Keyword search<\/li>\n<\/ul>\n\n\n\n<p 
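id=\"hs-code\"><em>As a runnable sketch of that merge (the score values are made up for illustration, and both scores are assumed to be normalized to [0, 1]) \u2193<\/em><\/p>\n\n\n\n

```python
def hybrid_score(sparse_score: float, dense_score: float, alpha: float) -> float:
    # Merge the keyword-search and vector-search scores with the alpha knob.
    assert 0.0 <= alpha <= 1.0
    return (1 - alpha) * sparse_score + alpha * dense_score

# alpha = 1 -> pure vector search; alpha = 0 -> pure keyword search.
pure_dense = hybrid_score(sparse_score=0.8, dense_score=0.4, alpha=1.0)   # 0.4
pure_sparse = hybrid_score(sparse_score=0.8, dense_score=0.4, alpha=0.0)  # 0.8
balanced = hybrid_score(sparse_score=0.8, dense_score=0.4, alpha=0.5)     # ~0.6
```

\n\n\n\n<p 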
id=\"94b5\">Also, the similarity scores are defined as follows:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>sparse_score:<\/strong>\u00a0is the result of the\u00a0<em>keyword search<\/em>\u00a0that, behind the scenes, uses a\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Okapi_BM25\" target=\"_blank\" rel=\"noreferrer noopener\">BM25<\/a>\u00a0algorithm [7] that sits on top of TF-IDF.<\/li>\n\n\n\n<li><strong>dense_score:<\/strong>\u00a0is the result of the\u00a0<em>vector search<\/em>\u00a0that most commonly uses a similarity metric such as cosine distance<\/li>\n<\/ul>\n\n\n\n<p id=\"fea0\"><em>The&nbsp;<\/em><strong><em>second method<\/em><\/strong><em>&nbsp;uses the vector search technique as usual and applies a filter based on your keywords on top of the metadata of retrieved results.<\/em><\/p>\n\n\n\n<p id=\"2914\"><em>\u2192 This is also known as<strong>&nbsp;filtered vector search<\/strong>.<\/em><\/p>\n\n\n\n<p id=\"a132\">In this use case, the&nbsp;<strong>similar score<\/strong>&nbsp;is&nbsp;<strong>not changed based<\/strong>&nbsp;on the&nbsp;<strong>provided<\/strong>&nbsp;<strong>keywords<\/strong>.<\/p>\n\n\n\n<p id=\"34ef\">It is just a fancy word for a simple filter applied to the metadata of your vectors.<\/p>\n\n\n\n<p id=\"c7d4\">But it is&nbsp;<strong>essential<\/strong>&nbsp;to&nbsp;<strong>understand<\/strong>&nbsp;the&nbsp;<strong>difference<\/strong>&nbsp;<strong>between<\/strong>&nbsp;the&nbsp;<strong>first<\/strong>&nbsp;and&nbsp;<strong>second<\/strong>&nbsp;<strong>methods<\/strong>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The<strong>\u00a0first method<\/strong>\u00a0combines the similarity score between the keywords and vectors using the alpha parameter;<\/li>\n\n\n\n<li>The\u00a0<strong>second method<\/strong>\u00a0is a simple filter on top of your vector search.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"79a5\">How does this fit into our architecture?<\/h3>\n\n\n\n<p id=\"a8e6\">Remember that during 
the self-query step, we extracted the&nbsp;<strong>author_id&nbsp;<\/strong>as an exact field that we have to match.<\/p>\n\n\n\n<p id=\"eed6\">Thus, we will search for the&nbsp;<strong>author_id<\/strong>&nbsp;using the keyword search algorithm and attach it to the 5 queries generated by the query expansion step.<\/p>\n\n\n\n<p id=\"ac74\"><em>As we want the&nbsp;<\/em><strong><em>most relevant chunks<\/em><\/strong><em>&nbsp;from a&nbsp;<\/em><strong><em>given author,<\/em><\/strong><em>&nbsp;it makes the most sense to use a&nbsp;<\/em><strong><em>filter<\/em><\/strong><em>&nbsp;<\/em><strong><em>using<\/em><\/strong><em>&nbsp;the&nbsp;<\/em><strong><em>author_id<\/em><\/strong><em>&nbsp;as follows (<\/em><strong><em>filtered vector search<\/em><\/strong><em>)<\/em>&nbsp;\u2193<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code> self._qdrant_client.search(\n      collection_name=\"vector_posts\",\n      query_filter=models.Filter(\n          must=&#91;\n              models.FieldCondition(\n                  key=\"author_id\",\n                  match=models.MatchValue(\n                      value=metadata_filter_value,\n                  ),\n              )\n          ]\n      ),\n      query_vector=self._embedder.encode(generated_query).tolist(),\n      limit=k,\n) <\/code><\/pre>\n\n\n\n<p id=\"763e\">Note that we can easily extend this with multiple keywords (e.g., tags), making the combination of self-query and hybrid search a powerful retrieval duo.<\/p>\n\n\n\n<p id=\"630c\">The only&nbsp;<strong>question<\/strong>&nbsp;you have to&nbsp;<strong>ask yourself&nbsp;<\/strong>is whether we want to&nbsp;<strong>use<\/strong>&nbsp;a simple&nbsp;<strong>vector search filter<\/strong>&nbsp;or the more complex&nbsp;<strong>hybrid search<\/strong>&nbsp;strategy.<\/p>\n\n\n\n<p id=\"ab68\"><em>Note that LangChain\u2019s&nbsp;<strong>SelfQueryRetriever<\/strong>&nbsp;class combines the self-query and hybrid search techniques behind the scenes, as can be seen in&nbsp;<a 
href=\"https:\/\/python.langchain.com\/docs\/integrations\/retrievers\/self_query\/qdrant_self_query\/\" target=\"_blank\" rel=\"noreferrer noopener\">their Qdrant example<\/a>&nbsp;[8]. That is why we wanted to build everything from scratch.<\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"e27b\">6. Implement the advanced retrieval Python class<\/h2>\n\n\n\n<p id=\"8da8\"><em>Now that you\u2019ve understood the&nbsp;<\/em><strong><em>advanced retrieval optimization techniques<\/em><\/strong><em>&nbsp;we\u2019re using, let\u2019s&nbsp;<\/em><strong><em>combine<\/em><\/strong><em>&nbsp;them into a&nbsp;<\/em><strong><em>Python retrieval class<\/em><\/strong><em>.<\/em><\/p>\n\n\n\n<p id=\"2995\">Here is what the main retriever function looks like \u2193<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:1000\/1*LzSgpi2it-TTJjdXW77Fzw.png\" alt=\"code example of the main retriever function with explanatory labels overlaid \"\/><figcaption class=\"wp-element-caption\">VectorRetriever: main retriever function<\/figcaption><\/figure>\n\n\n\n<p id=\"c224\"><em>Using a Python&nbsp;<strong>ThreadPoolExecutor<\/strong>&nbsp;is extremely powerful for addressing I\/O bottlenecks, as these types of operations are not blocked by Python\u2019s GIL limitations.<\/em><\/p>\n\n\n\n<p id=\"473a\">Here is how we wrapped every advanced retrieval step into its own class \u2193<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:1000\/1*5Wlet8qLzsOMNG6XnpQBwg.png\" alt=\"code example showing query expansion chains wrapper\"\/><figcaption class=\"wp-element-caption\">Query expansion chains wrapper<\/figcaption><\/figure>\n\n\n\n<p id=\"2b7f\">The&nbsp;<em>SelfQuery<\/em>&nbsp;class looks very similar \u2014 \ud83d\udd17&nbsp;<a href=\"https:\/\/github.com\/decodingml\/llm-twin-course\/blob\/main\/course\/module-3\/rag\/self_query.py\" target=\"_blank\" 
rel=\"noreferrer noopener\">access it here<\/a>&nbsp;[1] \u2190.<\/p>\n\n\n\n<p id=\"21a3\">The final step is to call&nbsp;<a href=\"https:\/\/qdrant.tech\/?utm_source=decodingml&amp;utm_medium=referral&amp;utm_campaign=llm-course\" target=\"_blank\" rel=\"noreferrer noopener\">Qdrant<\/a>&nbsp;for each query generated by the query expansion step \u2193<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:1000\/1*pgGOYr44WJvh449Btck38g.png\" alt=\"code example showing main search function with explanatory notes overlaid \"\/><figcaption class=\"wp-element-caption\">VectorRetriever: main search function<\/figcaption><\/figure>\n\n\n\n<p id=\"34bd\"><em>Note that we have&nbsp;<\/em><strong><em>3 types of data<\/em><\/strong><em>: posts, articles, and code repositories.<\/em><\/p>\n\n\n\n<p id=\"a50f\">Thus, we have to make a query for each collection and combine the results in the end.<\/p>\n\n\n\n<p id=\"056a\">The most performant method is to use multi-indexing techniques, which allow you to query multiple types of data at once.<\/p>\n\n\n\n<p id=\"ded4\">But at the time I am writing this article, this is not a solved problem at the production level.<\/p>\n\n\n\n<p id=\"a8ec\">Thus, we gathered data from each collection individually and kept the best-retrieved results using rerank.<\/p>\n\n\n\n<p id=\"2b31\">Which is the final step of the article.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"100f\">7. 
Post-retrieval optimization: Rerank using GPT-4<\/h2>\n\n\n\n<p id=\"7194\">We ran a&nbsp;<strong>separate search<\/strong>&nbsp;in the Qdrant vector DB for&nbsp;<strong>each<\/strong>&nbsp;of the&nbsp;<strong>N prompts<\/strong>&nbsp;<strong>generated<\/strong>&nbsp;by the&nbsp;<strong>query expansion step<\/strong>.<\/p>\n\n\n\n<p id=\"6bf6\"><strong>Each<\/strong>&nbsp;<strong>search<\/strong>&nbsp;returns&nbsp;<strong>K results<\/strong>.<\/p>\n\n\n\n<p id=\"5513\">Thus, we&nbsp;<strong>end up with<\/strong>&nbsp;<strong>N x K chunks<\/strong>.<\/p>\n\n\n\n<p id=\"d52c\">In our particular case,&nbsp;<strong>N = 5<\/strong>&nbsp;&amp;&nbsp;<strong>K = 3<\/strong>. Thus, we end up with 15 chunks.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*DiTUw7WKXsuGpEUzYFKlAQ.png\" alt=\"flow chart showing post-retrieval reranking process\"\/><figcaption class=\"wp-element-caption\">Post-retrieval optimization: rerank<\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"cbd9\">The problem<\/h3>\n\n\n\n<p id=\"bfb8\">The retrieved context may contain irrelevant chunks that:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Add noise:<\/strong>\u00a0the retrieved context might be irrelevant<\/li>\n\n\n\n<li><strong>Make the prompt bigger:<\/strong>\u00a0this results in higher costs, and the LLM is usually biased toward looking only at the first and last pieces of context. Thus, if you add a big context, there is a big chance it will miss the essence.<\/li>\n\n\n\n<li><strong>Are unaligned with your question:<\/strong>\u00a0the chunks are retrieved based on the query and chunk embedding similarity.
The issue is that the embedding model is not tuned to your particular question, which might result in high similarity scores that are not 100% relevant to your question.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"d9e6\">The solution<\/h3>\n\n\n\n<p id=\"5856\">We will use&nbsp;<strong>rerank<\/strong>&nbsp;to order all the&nbsp;<strong>N x K<\/strong>&nbsp;chunks based on their relevance relative to the initial question, where the first one will be the most relevant and the last chunk the least.<\/p>\n\n\n\n<p id=\"5791\">Ultimately, we will pick the TOP K most relevant chunks.<\/p>\n\n\n\n<p id=\"40ac\">Rerank works really well when combined with query expansion.<\/p>\n\n\n\n<p id=\"22cc\">A natural flow when using rerank is as follows:<\/p>\n\n\n\n<p id=\"e7a1\"><code>Search for &gt;K chunks &gt;&gt;&gt; Reorder using rerank &gt;&gt;&gt; Take top K<\/code><\/p>\n\n\n\n<p id=\"6656\">Thus, when combined with query expansion, we gather potentially useful context from multiple points in space rather than just looking for more than K samples in a single location.<\/p>\n\n\n\n<p id=\"0fbb\">Now the flow looks like:<\/p>\n\n\n\n<p id=\"4ad5\"><code>Search for N x K chunks &gt;&gt;&gt; Reorder using rerank &gt;&gt;&gt; Take top K<\/code><\/p>\n\n\n\n<p id=\"b1d3\">A typical solution for reranking is to use&nbsp;<a href=\"https:\/\/www.sbert.net\/examples\/applications\/retrieve_rerank\/README.html\" target=\"_blank\" rel=\"noreferrer noopener\">open-source Cross-Encoders from sentence transformers<\/a>&nbsp;[4].<\/p>\n\n\n\n<p id=\"bea9\">These solutions take both the question and context as input and return a score from 0 to 1.<\/p>\n\n\n\n<p id=\"d299\">In this article, we want to take a different approach and use GPT-4 + prompt engineering as our reranker.<\/p>\n\n\n\n<p id=\"0cd0\">If you want to see how to&nbsp;<strong>apply rerank using open-source algorithms<\/strong>, check out this&nbsp;<strong><a
href=\"https:\/\/medium.com\/decodingml\/a-real-time-retrieval-system-for-rag-on-social-media-data-9cc01d50a2a0?source=post_page-----5d0c7f1199d2--------------------------------\">hands-on article<\/a><\/strong>&nbsp;from&nbsp;<a href=\"https:\/\/medium.com\/decodingml\">Decoding ML<\/a>.<\/p>\n\n\n\n<p id=\"289b\">Now let\u2019s see our implementation using GPT-4 &amp; prompt engineering.<\/p>\n\n\n\n<p id=\"c8e8\">Similar to what we did for the expansion and self-query chains, we define a template and a chain builder \u2193<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:1000\/1*kChkeI2YBDX-K2yJIx4s5w.png\" alt=\"python code example building a reranking chain\"\/><figcaption class=\"wp-element-caption\">Rerank chain<\/figcaption><\/figure>\n\n\n\n<p id=\"77ea\">Here is how we integrate the rerank chain into the retriever:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:1000\/1*G2jNy3ZzSz7FN5QjA0XeeA.png\" alt=\"python code example integrating the rerank chain into the retriever\"\/><figcaption class=\"wp-element-caption\">Retriever: rerank step<\/figcaption><\/figure>\n\n\n\n<p id=\"494c\">\u2026and that\u2019s it!<\/p>\n\n\n\n<p id=\"159d\">Note that this is an experimental process, so you can tune your prompts further for better results; the primary idea remains the same.<\/p>\n\n\n\n<p>\u2192 Find the complete code used in this lesson in our GitHub repository at&nbsp;<a href=\"https:\/\/github.com\/decodingml\/llm-twin-course\/tree\/main\/src\/core\/rag\">core\/rag<\/a><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"1f5c\">8. 
Running the RAG retrieval module<\/h2>\n\n\n\n<p id=\"7961\">The last step is to run the whole thing.<\/p>\n\n\n\n<p id=\"0264\">But there is a catch.<\/p>\n\n\n\n<p id=\"d1a9\">As mentioned earlier, the retriever will not be used as a standalone component in the LLM system.<\/p>\n\n\n\n<p><em>The inference pipeline will use it as a&nbsp;<strong>layer<\/strong>&nbsp;between the&nbsp;<strong>data<\/strong>&nbsp;and the&nbsp;<strong>Qdrant vector DB<\/strong>&nbsp;to do RAG.<\/em><\/p>\n\n\n\n<p>Still, to check that everything works fine, let\u2019s test out the RAG retrieval module as a standalone script.<\/p>\n\n\n\n<p>We can call the VectorRetriever module using the following code as an example:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from core import get_logger\nfrom core.config import settings\nfrom core.rag.retriever import VectorRetriever\n\nlogger = get_logger(__name__)\n\nquery = \"\"\"\nHello, I am Paul Iusztin.\n\nCould you draft an article paragraph discussing RAG? 
\nI'm particularly interested in how to design a RAG system.\n\"\"\"\n\nretriever = VectorRetriever(query=query)\nhits = retriever.retrieve_top_k(k=6, to_expand_to_n_queries=5)\nreranked_hits = retriever.rerank(hits=hits, keep_top_k=5)\n\nlogger.info(\"====== RETRIEVED DOCUMENTS ======\")\nfor rank, hit in enumerate(reranked_hits):\n    logger.info(f\"Rank = {rank} : {hit}\")<\/code><\/pre>\n\n\n\n<p>To spin up the Qdrant vector DB locally in a Docker container, run:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>make local-start<\/code><\/pre>\n\n\n\n<p>To populate it with data, run:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>make local-ingest-data<\/code><\/pre>\n\n\n\n<p>Now, to test out the script from above, run:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>make local-test-retriever<\/code><\/pre>\n\n\n\n<p>It should print to the CLI the most similar hits found in the Qdrant vector DB.<\/p>\n\n\n\n<p>\u2026and that\u2019s it!<\/p>\n\n\n\n<p>In future lessons, we will learn to integrate it into the inference pipeline for an end-to-end RAG system.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"9f2a\">Conclusion<\/h2>\n\n\n\n<p id=\"34bb\">In&nbsp;<strong>Lesson 5<\/strong>, you learned to&nbsp;<strong>build<\/strong>&nbsp;an&nbsp;<strong>advanced RAG retrieval module<\/strong>&nbsp;optimized for searching posts, articles, and code repositories from a&nbsp;<a href=\"https:\/\/qdrant.tech\/?utm_source=decodingml&amp;utm_medium=referral&amp;utm_campaign=llm-course\" target=\"_blank\" rel=\"noreferrer noopener\">Qdrant vector DB<\/a>.<\/p>\n\n\n\n<p id=\"f7ea\"><strong>First<\/strong>, you learned about where the RAG pipeline can be optimized:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>pre-retrieval<\/li>\n\n\n\n<li>retrieval<\/li>\n\n\n\n<li>post-retrieval<\/li>\n<\/ul>\n\n\n\n<p id=\"2347\"><strong>Afterward<\/strong>, you learned how to build from scratch (without using LangChain\u2019s utilities) the following advanced RAG 
retrieval &amp; post-retrieval optimization techniques:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>query expansion<\/li>\n\n\n\n<li>self query<\/li>\n\n\n\n<li>hybrid search<\/li>\n\n\n\n<li>rerank<\/li>\n<\/ul>\n\n\n\n<p id=\"7023\"><strong>Ultimately<\/strong>, you understood where the retrieval component sits in a production RAG LLM system, where the code is shared between multiple microservices instead of sitting in a single notebook.<\/p>\n\n\n\n<p id=\"3e43\"><em>In&nbsp;<\/em><strong><em>Lesson 6<\/em><\/strong><em>, we will move to the training pipeline and show you how to automatically transform the data crawled from LinkedIn, Substack, Medium, and GitHub into an instruction dataset using GPT-4 to fine-tune your LLM Twin.<\/em><\/p>\n\n\n\n<p id=\"4c69\">See you there! \ud83e\udd17<\/p>\n\n\n\n<p id=\"9a69\"><em>\ud83d\udd17&nbsp;<\/em><strong><em>Check out<\/em><\/strong><em>&nbsp;<\/em><a href=\"https:\/\/github.com\/decodingml\/llm-twin-course\" target=\"_blank\" rel=\"noreferrer noopener\"><em>the code on GitHub<\/em><\/a><em>&nbsp;[1] and support us with a \u2b50\ufe0f<\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"c402\">References<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"1c65\">Literature<\/h3>\n\n\n\n<p id=\"951b\">[1]&nbsp;<a href=\"https:\/\/github.com\/decodingml\/llm-twin-course\" target=\"_blank\" rel=\"noreferrer noopener\">Your LLM Twin Course \u2014 GitHub Repository<\/a>&nbsp;(2024), Decoding ML GitHub Organization<\/p>\n\n\n\n<p id=\"054c\">[2]&nbsp;<a href=\"https:\/\/bytewax.io\/?utm_source=medium&amp;utm_medium=decodingml&amp;utm_campaign=2024_q1\" target=\"_blank\" rel=\"noreferrer noopener\">Bytewax<\/a>, Bytewax Landing Page<\/p>\n\n\n\n<p id=\"912c\">[3]&nbsp;<a href=\"https:\/\/qdrant.tech\/documentation\/?utm_source=decodingml&amp;utm_medium=referral&amp;utm_campaign=llm-course\" target=\"_blank\" rel=\"noreferrer noopener\">Qdrant<\/a>, Qdrant Documentation<\/p>\n\n\n\n<p id=\"c886\">[4]&nbsp;<a 
href=\"https:\/\/www.sbert.net\/examples\/applications\/retrieve_rerank\/README.html\" target=\"_blank\" rel=\"noreferrer noopener\">Retrieve &amp; Re-Rank<\/a>, Sentence Transformers Documentation<\/p>\n\n\n\n<p id=\"4ea8\">[5]&nbsp;<a href=\"https:\/\/python.langchain.com\/docs\/modules\/data_connection\/retrievers\/MultiQueryRetriever\/\" target=\"_blank\" rel=\"noreferrer noopener\">MultiQueryRetriever<\/a>, LangChain\u2019s Documentation<\/p>\n\n\n\n<p id=\"0fbf\">[6]&nbsp;<a href=\"https:\/\/python.langchain.com\/docs\/modules\/data_connection\/retrievers\/self_query\/\" target=\"_blank\" rel=\"noreferrer noopener\">Self-querying<\/a>, LangChain\u2019s Documentation<\/p>\n\n\n\n<p id=\"982b\">[7]&nbsp;<a href=\"https:\/\/en.wikipedia.org\/wiki\/Okapi_BM25\" target=\"_blank\" rel=\"noreferrer noopener\">Okapi BM25<\/a>, Wikipedia<\/p>\n\n\n\n<p id=\"367e\">[8]&nbsp;<a href=\"http:\/\/qdrant\/\" target=\"_blank\" rel=\"noreferrer noopener\">Qdrant Self Query Example<\/a>, LangChain\u2019s Documentation<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"9ef6\">Images<\/h3>\n\n\n\n<p id=\"d9ab\">If not otherwise stated, all images are created by the author.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Welcome to&nbsp;Lesson 5&nbsp;of 12&nbsp;in our free course series,&nbsp;LLM Twin: Building Your Production-Ready AI Replica. You\u2019ll learn how to use LLMs, vector DBs, and LLMOps best practices to design, train, and deploy a production-ready \u201cLLM twin\u201d of yourself. This AI character will write like you, incorporating your style, personality, and voice into an LLM. 
For [&hellip;]<\/p>\n","protected":false},"author":128,"featured_media":9903,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","footnotes":""},"categories":[65,6,7],"tags":[14,64,85,15,89,90,52,31,16,91,92],"coauthors":[222,223],"class_list":["post-9900","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-llmops","category-machine-learning","category-tutorials","tag-comet-ml","tag-cometllm","tag-data-pipeline","tag-deep-learning-experiment-management","tag-feature-engineering","tag-feature-pipeline","tag-llm","tag-llmops","tag-ml-experiment-management","tag-rag","tag-streaming-pipeline"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>4 Advanced RAG Algorithms to Implement in Your LLM System<\/title>\n<meta name=\"description\" content=\"How to build an advanced RAG retrieval module to optimize technique and improve accuracy.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/advanced-rag-algorithms-optimize-retrieval\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"The 4 Advanced RAG Algorithms You Must Know to Implement\" \/>\n<meta property=\"og:description\" content=\"Implement these 4 advanced RAG methods to improve the accuracy of your retrieval and post-retrieval algorithm.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/advanced-rag-algorithms-optimize-retrieval\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" 
content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2024-05-10T21:49:41+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-04-29T12:41:27+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/05\/llm-twin-course-project-rag-algorithms.webp\" \/>\n\t<meta property=\"og:image:width\" content=\"1400\" \/>\n\t<meta property=\"og:image:height\" content=\"800\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/webp\" \/>\n<meta name=\"author\" content=\"Paul Iusztin, Decoding ML\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:description\" content=\"Implement these 4 advanced RAG methods to improve the accuracy of your retrieval and post-retrieval algorithm.\" \/>\n<meta name=\"twitter:creator\" content=\"@Cometml\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Paul Iusztin, Decoding ML\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"15 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. 
-->","yoast_head_json":{"title":"4 Advanced RAG Algorithms to Implement in Your LLM System","description":"How to build an advanced RAG retrieval module to optimize technique and improve accuracy.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/advanced-rag-algorithms-optimize-retrieval\/","og_locale":"en_US","og_type":"article","og_title":"The 4 Advanced RAG Algorithms You Must Know to Implement","og_description":"Implement these 4 advanced RAG methods to improve the accuracy of your retrieval and post-retrieval algorithm.","og_url":"https:\/\/www.comet.com\/site\/blog\/advanced-rag-algorithms-optimize-retrieval\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2024-05-10T21:49:41+00:00","article_modified_time":"2025-04-29T12:41:27+00:00","og_image":[{"width":1400,"height":800,"url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/05\/llm-twin-course-project-rag-algorithms.webp","type":"image\/webp"}],"author":"Paul Iusztin, Decoding ML","twitter_card":"summary_large_image","twitter_description":"Implement these 4 advanced RAG methods to improve the accuracy of your retrieval and post-retrieval algorithm.","twitter_creator":"@Cometml","twitter_site":"@Cometml","twitter_misc":{"Written by":"Paul Iusztin, Decoding ML","Est. 
reading time":"15 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/advanced-rag-algorithms-optimize-retrieval\/#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/advanced-rag-algorithms-optimize-retrieval\/"},"author":{"name":"Paul Iusztin","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/87bf0cb600025605b68dcd2f0d597560"},"headline":"The 4 Advanced RAG Algorithms You Must Know to Implement","datePublished":"2024-05-10T21:49:41+00:00","dateModified":"2025-04-29T12:41:27+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/advanced-rag-algorithms-optimize-retrieval\/"},"wordCount":3008,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/advanced-rag-algorithms-optimize-retrieval\/#primaryimage"},"thumbnailUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/05\/llm-twin-course-project-rag-algorithms.webp","keywords":["Comet ML","CometLLM","Data Pipeline","Deep Learning Experiment Management","Feature Engineering","Feature pipeline","LLM","LLMOps","ML Experiment Management","RAG","Streaming pipeline"],"articleSection":["LLMOps","Machine Learning","Tutorials"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/advanced-rag-algorithms-optimize-retrieval\/","url":"https:\/\/www.comet.com\/site\/blog\/advanced-rag-algorithms-optimize-retrieval\/","name":"4 Advanced RAG Algorithms to Implement in Your LLM 
System","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/advanced-rag-algorithms-optimize-retrieval\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/advanced-rag-algorithms-optimize-retrieval\/#primaryimage"},"thumbnailUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/05\/llm-twin-course-project-rag-algorithms.webp","datePublished":"2024-05-10T21:49:41+00:00","dateModified":"2025-04-29T12:41:27+00:00","description":"How to build an advanced RAG retrieval module to optimize technique and improve accuracy.","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/advanced-rag-algorithms-optimize-retrieval\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/advanced-rag-algorithms-optimize-retrieval\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/advanced-rag-algorithms-optimize-retrieval\/#primaryimage","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/05\/llm-twin-course-project-rag-algorithms.webp","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/05\/llm-twin-course-project-rag-algorithms.webp","width":1400,"height":800,"caption":"Image by DALL-E"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/advanced-rag-algorithms-optimize-retrieval\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"The 4 Advanced RAG Algorithms You Must Know to Implement"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models 
Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/87bf0cb600025605b68dcd2f0d597560","name":"Paul Iusztin","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/0bb2983de08cbe4fe43fad876af41aee","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/03\/cropped-1664517339716-96x96.jpg","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/03\/cropped-1664517339716-96x96.jpg","caption":"Paul 
Iusztin"},"sameAs":["https:\/\/decodingml.substack.com\/"],"url":"https:\/\/www.comet.com\/site\/blog\/author\/paul-iusztin\/"}]}},"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/9900","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/128"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=9900"}],"version-history":[{"count":2,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/9900\/revisions"}],"predecessor-version":[{"id":15795,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/9900\/revisions\/15795"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media\/9903"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=9900"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=9900"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=9900"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=9900"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}