{"id":9538,"date":"2024-03-25T15:19:53","date_gmt":"2024-03-25T23:19:53","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=9538"},"modified":"2025-04-29T12:35:54","modified_gmt":"2025-04-29T12:35:54","slug":"an-end-to-end-framework-for-production-ready-llm-systems-by-building-your-llm-twin","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/an-end-to-end-framework-for-production-ready-llm-systems-by-building-your-llm-twin\/","title":{"rendered":"An End-to-End Framework for Production-Ready LLM Systems by Building Your LLM Twin"},"content":{"rendered":"\n<p><em>Welcome to Lesson 1 of 12 in our free course series, LLM Twin: Building Your Production-Ready AI Replica. You\u2019ll learn how to use LLMs, vector DBs, and LLMOps best practices to design, train, and deploy a production-ready \u201cLLM twin\u201d of yourself. This AI character will write like you, incorporating your style, personality, and voice into an LLM.<\/em><\/p>\n\n\n\n<p id=\"2b42\"><strong>What is your LLM Twin?<\/strong>&nbsp;It is an AI character that writes like you by incorporating your style, personality, and voice into an LLM.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*3HhjepgNW8LXIszD7OBSXw.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Image by DALL-E<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"fe1c\">Why is this course different?<\/h2>\n\n\n\n<p id=\"085d\"><em>By finishing the \u201c<\/em><strong><em>LLM Twin: Building Your Production-Ready AI Replica\u201d<\/em>&nbsp;<\/strong><em>free course, you will learn how to design, train, and deploy a production-ready LLM twin of yourself powered by LLMs, vector DBs, and LLMOps best practices<\/em>.<\/p>\n\n\n\n<p id=\"8ae4\"><strong>Why should you care? 
\ud83e\udef5<\/strong><\/p>\n\n\n\n<p id=\"1966\"><strong>\u2192 No more isolated scripts or Notebooks!<\/strong>&nbsp;Learn production ML by building and deploying an end-to-end production-grade LLM system.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"be83\">What will you learn to build by the end of this course?<\/h2>\n\n\n\n<p id=\"1ee7\">You will&nbsp;<strong>learn<\/strong>&nbsp;how to&nbsp;<strong>architect&nbsp;<\/strong>and<strong>&nbsp;build a real-world LLM system<\/strong>&nbsp;from&nbsp;<strong>start<\/strong>&nbsp;to&nbsp;<strong>finish<\/strong>&nbsp;\u2014 from&nbsp;<strong>data collection<\/strong>&nbsp;to&nbsp;<strong>deployment<\/strong>.<\/p>\n\n\n\n<p id=\"0537\">You will also&nbsp;<strong>learn<\/strong>&nbsp;to&nbsp;<strong>leverage MLOps best practices<\/strong>, such as experiment trackers, model registries, prompt monitoring, and versioning.<\/p>\n\n\n\n<p id=\"4f2a\"><strong>The end goal?<\/strong>&nbsp;Build and deploy your own LLM twin.<\/p>\n\n\n\n<p id=\"5a60\"><em>The&nbsp;<\/em><strong><em>architecture<\/em><\/strong><em>&nbsp;of the&nbsp;<\/em><strong><em>LLM twin<\/em><\/strong><em>&nbsp;is split into&nbsp;<\/em><strong><em>4 Python microservices<\/em><\/strong><em>:<\/em><\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong><em>The data collection pipeline<\/em><\/strong>\u00a0crawls your digital data from various social media platforms. It cleans, normalizes and loads the data to a NoSQL DB through a series of ETL pipelines. Then, using the CDC pattern, it sends database changes to a queue.<\/li>\n\n\n\n<li><strong><em>The feature pipeline<\/em><\/strong>\u00a0<em><strong>consumes<\/strong><\/em>\u00a0messages from a queue through a Bytewax streaming pipeline. It cleans, chunks, and embeds every message and loads it to a vector DB in real-time.<\/li>\n\n\n\n<li><strong><em>The training pipeline<\/em><\/strong>\u00a0creates a custom instruction dataset based on your digital data. 
It fine-tunes an LLM using Unsloth, AWS SageMaker, and Comet ML\u2019s experiment tracker, evaluates the fine-tuned LLMs using Opik, and saves the best model to the Hugging Face model registry.<\/li>\n\n\n\n<li><strong><em>The inference pipeline<\/em><\/strong>\u00a0loads and quantizes the fine-tuned LLM from the model registry to the AWS SageMaker REST API. It enhances the prompts using RAG, monitors the LLM using Opik, and hooks the LLM Twin to a Gradio UI.<\/li>\n<\/ol>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:1000\/1*necM3Dg6rZqzPVP658sO-w.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">LLM twin system architecture [Image by the Author]<\/figcaption><\/figure>\n\n\n\n<p id=\"b448\">Alongside the 4 microservices, you will learn to integrate 4 tools:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"\/signup\/?utm_source=decoding_ml&amp;utm_medium=partner&amp;utm_content=medium\" target=\"_blank\" rel=\"noreferrer noopener\">Comet\u00a0<\/a>as your ML Platform;<\/li>\n\n\n\n<li><a href=\"https:\/\/qdrant.tech\/?utm_source=decodingml&amp;utm_medium=referral&amp;utm_campaign=llm-course\" target=\"_blank\" rel=\"noreferrer noopener\">Qdrant<\/a>\u00a0as your vector DB;<\/li>\n\n\n\n<li><a href=\"https:\/\/aws.amazon.com\/sagemaker\/\">AWS SageMaker<\/a>\u00a0as your ML infrastructure;<\/li>\n\n\n\n<li><a href=\"https:\/\/github.com\/comet-ml\/opik\">Opik<\/a>\u00a0as your prompt evaluation and monitoring tool.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"87b4\">Who is this for?<\/h2>\n\n\n\n<p id=\"6b7d\"><strong>Audience:<\/strong>&nbsp;MLEs, DEs, DSs, or SWEs who want to learn to engineer production-ready LLM and RAG systems using LLMOps best practices.<\/p>\n\n\n\n<p id=\"644f\"><strong>Level:<\/strong>&nbsp;Intermediate<\/p>\n\n\n\n<p id=\"4fa9\"><strong>Prerequisites:<\/strong>&nbsp;Basic knowledge of Python and ML.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"4ffd\">How 
will you learn?<\/h2>\n\n\n\n<p id=\"7a9f\">The course contains&nbsp;<strong>10 hands-on written lessons<\/strong>&nbsp;and the&nbsp;<strong>open-source code<\/strong>&nbsp;you can&nbsp;<a href=\"https:\/\/github.com\/decodingml\/llm-twin-course\" target=\"_blank\" rel=\"noreferrer noopener\">access on GitHub,&nbsp;<\/a>showing how to build an end-to-end LLM system.<\/p>\n\n\n\n<p>Also, it includes&nbsp;<strong>2 bonus lessons<\/strong>&nbsp;on how to&nbsp;<strong>improve the RAG system.<\/strong><\/p>\n\n\n\n<p id=\"db36\">You can read everything at your own pace.<\/p>\n\n\n\n<p id=\"4eb7\"><em>\u2192 To get the most out of this course, we encourage you to clone and run the repository while you cover the lessons.<\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"7a36\">Costs?<\/h2>\n\n\n\n<p id=\"db62\">The&nbsp;<strong>articles<\/strong>&nbsp;and&nbsp;<strong>code<\/strong>&nbsp;are&nbsp;<strong>completely free<\/strong>. They will always remain free.<\/p>\n\n\n\n<p id=\"dd1a\">But if you plan to run the code while reading it, you have to know that we use several cloud tools that might generate additional costs.<\/p>\n\n\n\n<p>For example, AWS has a pay-as-you-go pricing plan. 
From our tests, it will cost you ~$15 to run the fine-tuning and inference pipelines.<\/p>\n\n\n\n<p>For the other serverless tools, such as Qdrant, Comet, and Opik, we will stick to their free tiers.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"1a4a\">Meet your teachers!<\/h2>\n\n\n\n<p id=\"7faf\">The course is created under the&nbsp;<a href=\"https:\/\/medium.com\/decodingml\">Decoding ML<\/a>&nbsp;umbrella by:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/www.linkedin.com\/in\/pauliusztin\/\" target=\"_blank\" rel=\"noreferrer noopener\">Paul Iusztin<\/a>\u00a0| Senior ML &amp; MLOps Engineer<\/li>\n\n\n\n<li><a href=\"https:\/\/www.linkedin.com\/in\/vesaalexandru\/\" target=\"_blank\" rel=\"noreferrer noopener\">Alex Vesa<\/a>\u00a0| Senior AI Engineer<\/li>\n\n\n\n<li><a href=\"https:\/\/www.linkedin.com\/in\/arazvant\/\" target=\"_blank\" rel=\"noreferrer noopener\">Alex Razvant<\/a>\u00a0| Senior ML &amp; MLOps Engineer<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"1057\">Lessons<\/h2>\n\n\n\n<ol class=\"wp-block-list\">\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/an-end-to-end-framework-for-production-ready-llm-systems-by-building-your-llm-twin\/\">An End-to-End Framework for Production-Ready LLM Systems by Building Your LLM Twin<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/the-importance-of-data-pipelines-in-the-era-of-generative-ai\/\">Your Content is Gold: I Turned 3 Years of Blog Posts into an LLM Training<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/llm-twin-3-change-data-capture\/\">I Replaced 1000 Lines of Polling Code with 50 Lines of CDC Magic<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/streaming-pipelines-for-fine-tuning-llms\/\">SOTA Python Streaming Pipelines for Fine-tuning LLMs and RAG \u2014 in Real-Time!<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/advanced-rag-algorithms-optimize-retrieval\/\">The 4 
Advanced RAG Algorithms You Must Know to Implement<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/llm-fine-tuning-dataset\/\">Turning Raw Data Into Fine-Tuning Datasets<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/mistral-llm-fine-tuning\/\">8B Parameters, 1 GPU, No Problems: The Ultimate LLM Fine-tuning Pipeline<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/llm-evaluation-best-practices\/\">The Engineer\u2019s Framework for LLM &amp; RAG Evaluation<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/llm-rag-inference-pipelines\/\">Beyond Proof of Concept: Building RAG Systems That Scale<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/rag-evaluation-framework-ragas\/\">The Ultimate Prompt Monitoring Pipeline<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/refactoring-rag-retrieval\/\">[Bonus] Build a scalable RAG ingestion pipeline using 74.3% less code<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/multi-index-rag-apps\/\">[Bonus] Build Multi-Index Advanced RAG Apps<\/a><\/li>\n<\/ol>\n\n\n\n<h2 class=\"wp-block-heading\">Lesson 1: End-to-end framework for production-ready LLM systems<\/h2>\n\n\n\n<p id=\"fcdf\">In the&nbsp;<strong>first lesson<\/strong>, we will&nbsp;<strong>present<\/strong>&nbsp;the&nbsp;<strong>project<\/strong>&nbsp;you will&nbsp;<strong>build<\/strong>&nbsp;<strong>during<\/strong>&nbsp;<strong>the<\/strong>&nbsp;<strong>course<\/strong>:&nbsp;<em>your production-ready LLM Twin\/AI replica.<\/em><\/p>\n\n\n\n<p id=\"af25\"><strong>Afterward<\/strong>, we will&nbsp;<strong>explain<\/strong>&nbsp;what&nbsp;<strong>the 3-pipeline design<\/strong>&nbsp;is and how it is applied to a standard ML system.<\/p>\n\n\n\n<p id=\"ab95\"><strong>Ultimately<\/strong>, we will&nbsp;<strong>dig into<\/strong>&nbsp;the&nbsp;<strong>LLM project system design<\/strong>.<\/p>\n\n\n\n<p id=\"bfc7\">We 
will&nbsp;<strong>present<\/strong>&nbsp;all our&nbsp;<strong>architectural decisions<\/strong>&nbsp;regarding the design of&nbsp;<em>the data collection pipeline<\/em>&nbsp;for social media data and how we applied&nbsp;<em>the 3-pipeline architecture<\/em>&nbsp;to our&nbsp;<em>LLM microservices<\/em>.<\/p>\n\n\n\n<p id=\"4658\">In the&nbsp;<strong>following lessons<\/strong>, we will&nbsp;<strong>examine<\/strong>&nbsp;<strong>each component\u2019s code<\/strong>&nbsp;and learn&nbsp;<strong>how<\/strong>&nbsp;to&nbsp;<strong>implement<\/strong>&nbsp;and&nbsp;<strong>deploy<\/strong>&nbsp;<strong>it<\/strong>&nbsp;to&nbsp;<a href=\"https:\/\/aws.amazon.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">AWS SageMaker.<\/a><\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:1000\/1*necM3Dg6rZqzPVP658sO-w.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">LLM twin system architecture [Image by the Author]<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"656e\">Table of Contents<\/h2>\n\n\n\n<ol class=\"wp-block-list\">\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/an-end-to-end-framework-for-production-ready-llm-systems-by-building-your-llm-twin\/#5b6c\">What are you going to build? The LLM twin concept<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/an-end-to-end-framework-for-production-ready-llm-systems-by-building-your-llm-twin\/#2ba3\">The 3-pipeline architecture<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/an-end-to-end-framework-for-production-ready-llm-systems-by-building-your-llm-twin\/#c9ec\">LLM twin system design\u00a0<\/a><\/li>\n<\/ol>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"5b6c\">1. What are you going to build? 
The LLM twin concept<\/h2>\n\n\n\n<p id=\"c5e6\">The&nbsp;<strong>outcome<\/strong>&nbsp;of this&nbsp;<strong>course<\/strong>&nbsp;is to learn to&nbsp;<strong>build<\/strong>&nbsp;your&nbsp;<strong>own AI replica<\/strong>. We will use an LLM to do that, hence the name of the course:&nbsp;<strong><em>LLM Twin: Building Your Production-Ready AI Replica.<\/em><\/strong><\/p>\n\n\n\n<p id=\"3294\"><strong>But what is an LLM twin?<\/strong><\/p>\n\n\n\n<p id=\"cb29\">In short, your LLM twin will be an AI character who writes like you, using your writing style and personality.<\/p>\n\n\n\n<p id=\"618c\">It will not be you. It will be your writing copycat.<\/p>\n\n\n\n<p id=\"4a8e\">More concretely, you will build an AI replica that writes social media posts or technical articles (like this one) using your own voice.<\/p>\n\n\n\n<p id=\"a50b\"><strong>Why not directly use ChatGPT, you may ask\u2026<\/strong><\/p>\n\n\n\n<p id=\"c336\">When trying to generate an article or post using an LLM, the results tend to:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>be very generic and unarticulated,<\/li>\n\n\n\n<li>contain misinformation (due to hallucination),<\/li>\n\n\n\n<li>require tedious prompting to achieve the desired result.<\/li>\n<\/ul>\n\n\n\n<p id=\"6697\"><strong><em>But here is what we are going to do to fix that<\/em><\/strong><em>&nbsp;<\/em>\u2193\u2193\u2193<\/p>\n\n\n\n<p id=\"8008\"><strong>First<\/strong>, we will fine-tune an LLM on your digital data gathered from LinkedIn, Medium, Substack, and GitHub.<\/p>\n\n\n\n<p id=\"0971\">By doing so, the LLM will align with your writing style and online personality. This will teach the LLM to talk like the online version of yourself.<\/p>\n\n\n\n<p id=\"f9ec\">Have you seen the universe of AI characters Meta released in 2023 in the Messenger app? 
If not, you can learn more about it&nbsp;<a href=\"https:\/\/ai.meta.com\/genai\/\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a>&nbsp;[2].<\/p>\n\n\n\n<p id=\"f752\">To some extent, that is what we are going to build.<\/p>\n\n\n\n<p id=\"b9b3\">But in our use case, we will focus on an LLM twin who writes social media posts or articles that reflect and articulate your voice.<\/p>\n\n\n\n<p id=\"8964\"><em>For example<\/em>, we can ask your LLM twin to write a LinkedIn post about LLMs. Instead of writing some generic and unarticulated post about LLMs (e.g., what ChatGPT will do), it will use your voice and style.<\/p>\n\n\n\n<p id=\"8bf5\"><strong>Secondly<\/strong>, we will give the LLM access to a vector DB so it can pull in external information and avoid hallucinating. Thus, we will force the LLM to write only based on concrete data.<\/p>\n\n\n\n<p id=\"ea4d\"><strong>Ultimately<\/strong>, in addition to accessing the vector DB for information, you can provide external links that will act as the building block of the generation process.<\/p>\n\n\n\n<p id=\"c3ce\"><em>For example<\/em>, we can modify the example above to: \u201cWrite me a 1000-word LinkedIn post about LLMs based on the article from this link: [URL].\u201d<\/p>\n\n\n\n<p id=\"f2c8\">Excited? Let\u2019s get started \ud83d\udd25<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"2ba3\">2. The 3-pipeline architecture<\/h2>\n\n\n\n<p id=\"7922\">We all know how messy ML systems can get. That is where the 3-pipeline architecture kicks in.<\/p>\n\n\n\n<p id=\"0001\"><strong>The<\/strong>&nbsp;<strong>3-pipeline design<\/strong>&nbsp;brings structure and modularity to your ML system while improving your MLOps processes.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"12a9\">Problem<\/h3>\n\n\n\n<p id=\"bba4\">Despite advances in MLOps tooling, transitioning from prototype to production remains challenging.<\/p>\n\n\n\n<p id=\"cd69\">In 2022, only 54% of the models made it into production. 
Ouch.<\/p>\n\n\n\n<p id=\"628a\"><em>So what happens?<\/em><\/p>\n\n\n\n<p id=\"2021\">Maybe the first things that come to your mind are:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>the model is not mature enough<\/li>\n\n\n\n<li>security risks (e.g., data privacy)<\/li>\n\n\n\n<li>not enough data<\/li>\n<\/ul>\n\n\n\n<p id=\"07bc\">To some extent, these are true.<\/p>\n\n\n\n<p id=\"057a\">But the reality is that in many scenarios\u2026<\/p>\n\n\n\n<p id=\"2481\">\u2026the architecture of the ML system is built with research in mind, or the ML system becomes a massive monolith that is extremely hard to refactor from offline to online.<\/p>\n\n\n\n<p id=\"16a8\">So, good SWE processes and a well-defined architecture are as crucial as using suitable tools and models with high accuracy.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"50ee\">Solution<\/h3>\n\n\n\n<p id=\"a1d6\"><em>\u2192 The 3-pipeline architecture<\/em><\/p>\n\n\n\n<p id=\"430a\">Let\u2019s understand what the 3-pipeline design is.<\/p>\n\n\n\n<p id=\"715e\">It is a mental map that helps you simplify the development process and split your monolithic ML pipeline into 3 components:<br>1. the feature pipeline<br>2. the training pipeline<br>3. the inference pipeline<\/p>\n\n\n\n<p id=\"2f3b\">\u2026also known as the Feature\/Training\/Inference (FTI) architecture.<\/p>\n\n\n\n<p id=\"e973\"><strong>#1.<\/strong>&nbsp;The&nbsp;<strong>feature pipeline<\/strong>&nbsp;transforms your data into features &amp; labels, which are stored and versioned in a feature store. The feature store will act as the central repository of your features. That means that features can be accessed and shared only through the feature store.<\/p>\n\n\n\n<p id=\"d321\"><strong>#2.<\/strong>&nbsp;The&nbsp;<strong>training pipeline<\/strong>&nbsp;ingests a specific version of the features &amp; labels from the feature store and outputs the trained model weights, which are stored and versioned inside a model registry. 
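<\/p>\n\n\n\n<p>To make these contracts concrete, here is a minimal, illustrative Python sketch in which plain dicts stand in for the feature store and the model registry (all names are ours, not code from the course):<\/p>\n\n\n\n
```python
# Minimal FTI sketch: in-memory dicts stand in for the real
# feature store and model registry (illustrative names only).
feature_store = {}   # maps a version to features and labels
model_registry = {}  # maps a version to trained model weights

def feature_pipeline(raw_data, version):
    # transform raw data into features, then store and version them
    features = [(text.lower(), len(text)) for text in raw_data]
    feature_store[version] = features
    return features

def training_pipeline(feature_version, model_version):
    # ingest one feature version, output versioned model weights
    features = feature_store[feature_version]
    weights = {'n_samples': len(features)}  # placeholder for real training
    model_registry[model_version] = weights
    return weights

def inference_pipeline(feature_version, model_version, query):
    # load a specific model version and serve a prediction to a client
    model = model_registry[model_version]
    n = model['n_samples']
    return f'prediction for {query} (model trained on {n} samples)'

feature_pipeline(['Hello World'], 'v1')
training_pipeline('v1', 'm1')
print(inference_pipeline('v1', 'm1', 'LLMs'))
```
\n\n\n\n<p>Each pipeline reads from and writes to only the shared stores, which is exactly the interface property that lets the 3 components be developed, deployed, and scaled independently.<\/p>\n\n\n\n<p>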
The models will be accessed and shared only through the model registry.<\/p>\n\n\n\n<p id=\"bb83\"><strong>#3.<\/strong>&nbsp;The&nbsp;<strong>inference pipeline<\/strong>&nbsp;uses a given version of the features from the feature store and downloads a specific version of the model from the model registry. Its final goal is to output the predictions to a client.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:1000\/1*0E4gQjIYKYHM-gvYGnTkTw.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">The 3-pipeline architecture [Image by the Author].<\/figcaption><\/figure>\n\n\n\n<p id=\"e61f\"><em>This is why the 3-pipeline design is so beautiful:<\/em><\/p>\n\n\n\n<p id=\"4035\">\u2013 it is intuitive<br>\u2013 it brings structure, as on a higher level, all ML systems can be reduced to these 3 components<br>\u2013 it defines a transparent interface between the 3 components, making it easier for multiple teams to collaborate<br>\u2013 the ML system has been built with modularity in mind since the beginning<br>\u2013 the 3 components can easily be divided between multiple teams (if necessary)<br>\u2013 every component can use the best stack of technologies available for the job<br>\u2013 every component can be deployed, scaled, and monitored independently<br>\u2013 the feature pipeline can easily be either batch, streaming or both<\/p>\n\n\n\n<p id=\"1bea\">But the most important benefit is that\u2026<\/p>\n\n\n\n<p id=\"f007\">\u2026by following this pattern, you know 100% that your ML model will move out of your Notebooks into production.<\/p>\n\n\n\n<p id=\"6444\">\u21b3 If you want to&nbsp;<em>learn more about the 3-pipeline design<\/em>, I recommend&nbsp;<a href=\"https:\/\/www.hopsworks.ai\/post\/mlops-to-ml-systems-with-fti-pipelines\" target=\"_blank\" rel=\"noreferrer noopener\">this excellent article<\/a>&nbsp;[3] written by Jim Dowling, one of the creators of the FTI 
architecture.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"c9ec\">3. LLM Twin System design<\/h2>\n\n\n\n<p id=\"995c\">Let\u2019s understand how to&nbsp;<strong>apply the 3-pipeline architecture<\/strong>&nbsp;to&nbsp;<strong>our LLM system<\/strong>.<\/p>\n\n\n\n<p id=\"ef6f\">The&nbsp;<strong>architecture<\/strong>&nbsp;of the&nbsp;<strong>LLM twin<\/strong>&nbsp;is split into&nbsp;<strong>4 Python microservices<\/strong>:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>The data collection pipeline<\/li>\n\n\n\n<li>The feature pipeline<\/li>\n\n\n\n<li>The training pipeline<\/li>\n\n\n\n<li>The inference pipeline<\/li>\n<\/ol>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:1000\/1*necM3Dg6rZqzPVP658sO-w.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">LLM twin system architecture [Image by the Author]<\/figcaption><\/figure>\n\n\n\n<p id=\"9270\">As you can see, the data collection pipeline doesn\u2019t follow the 3-pipeline design. Which is true.<\/p>\n\n\n\n<p id=\"ab31\">It represents the data pipeline that sits before the ML system.<\/p>\n\n\n\n<p id=\"0ddd\">The data engineering team usually implements it, and its scope is to gather, clean, normalize and store the data required to build dashboards or ML models.<\/p>\n\n\n\n<p id=\"c409\">But let\u2019s say you are part of a small team and have to build everything yourself, from data gathering to model deployment.<\/p>\n\n\n\n<p id=\"6dc2\">Thus, we will show you how the data pipeline nicely fits and interacts with the FTI architecture.<\/p>\n\n\n\n<p id=\"1bfc\"><em>Now,&nbsp;<\/em><strong><em>let\u2019s zoom in<\/em><\/strong><em>&nbsp;on&nbsp;<\/em><strong><em>each component<\/em><\/strong><em>&nbsp;to understand how they work individually and interact with each other. \u2193\u2193\u2193<\/em><\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"1d26\">3.1. 
The data collection pipeline<\/h3>\n\n\n\n<p id=\"e92f\">Its scope is to&nbsp;<strong>crawl data<\/strong>&nbsp;for&nbsp;<strong>a given user<\/strong>&nbsp;from:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Medium (articles)<\/li>\n\n\n\n<li>Substack (articles)<\/li>\n\n\n\n<li>LinkedIn (posts)<\/li>\n\n\n\n<li>GitHub (code)<\/li>\n<\/ul>\n\n\n\n<p id=\"f85a\">As every platform is unique, we implemented a different Extract, Transform, Load (ETL) pipeline for each website.<\/p>\n\n\n\n<p id=\"7901\">\ud83d\udd17 1-min read on&nbsp;<a href=\"https:\/\/www.databricks.com\/glossary\/extract-transform-load\" target=\"_blank\" rel=\"noreferrer noopener\">ETL pipelines<\/a>&nbsp;[4]<\/p>\n\n\n\n<p id=\"a49c\">However, the&nbsp;<strong>baseline steps<\/strong>&nbsp;are the&nbsp;<strong>same<\/strong>&nbsp;for&nbsp;<strong>each platform<\/strong>.<\/p>\n\n\n\n<p id=\"7f49\"><em>Thus, for each ETL pipeline, we can abstract away the following baseline steps:<\/em><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>log in using your credentials<\/li>\n\n\n\n<li>use\u00a0<em>Selenium<\/em>\u00a0to crawl your profile<\/li>\n\n\n\n<li>use\u00a0<em>BeautifulSoup\u00a0<\/em>to parse the HTML<\/li>\n\n\n\n<li>clean &amp; normalize the extracted HTML<\/li>\n\n\n\n<li>save the normalized (but still raw) data to Mongo DB<\/li>\n<\/ul>\n\n\n\n<p id=\"43a7\"><strong>Important note:<\/strong>&nbsp;We are crawling only our data, as most platforms do not allow us to access other people\u2019s data due to privacy issues. 
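<\/p>\n\n\n\n<p>Under stated assumptions (class and method names are ours; the real pipelines log in and crawl with Selenium, parse with BeautifulSoup, and write to a real Mongo DB), the shared baseline can be sketched as a template method with one subclass per platform:<\/p>\n\n\n\n
```python
# Template-method sketch of the shared ETL baseline. Everything here is
# an illustrative stand-in: the course code logs in and crawls with
# Selenium, parses with BeautifulSoup, and writes to a real Mongo DB.
mongo_collection = []  # in-memory stand-in for the Mongo DB collection

class BaseETL:
    platform = 'base'

    def login(self, credentials):
        pass  # platform-specific login (Selenium in the real pipeline)

    def crawl_profile(self):
        raise NotImplementedError  # returns raw page content per platform

    def parse(self, raw_page):
        return raw_page  # HTML parsing (BeautifulSoup) in the real pipeline

    def clean(self, text):
        return ' '.join(text.split())  # normalize whitespace

    def run(self, credentials):
        # the baseline steps, identical for every platform
        self.login(credentials)
        text = self.clean(self.parse(self.crawl_profile()))
        doc = {'platform': self.platform, 'content': text}
        mongo_collection.append(doc)  # save normalized (still raw) data
        return doc

class LinkedInETL(BaseETL):
    platform = 'linkedin'

    def crawl_profile(self):
        return '  My latest post about   LLMs  '  # stubbed crawl result

print(LinkedInETL().run({'user': 'me'}))
```
\n\n\n\n<p>Each platform subclass overrides only its platform-specific steps, while the cleaning and persistence logic stays shared.<\/p>\n\n\n\n<p>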
But this is perfect for us, as to build our LLM twin, we need only our own digital data.<\/p>\n\n\n\n<p id=\"b875\"><strong>Why Mongo DB?<\/strong><\/p>\n\n\n\n<p id=\"415d\">We wanted a NoSQL database that quickly allows us to store unstructured data (aka text).<\/p>\n\n\n\n<p id=\"f0c9\"><strong>How will the data pipeline communicate with the feature pipeline?<\/strong><\/p>\n\n\n\n<p id=\"2f93\">We will use the&nbsp;<strong>Change Data Capture (CDC) pattern<\/strong>&nbsp;to inform the feature pipeline of any change on our&nbsp;<a href=\"https:\/\/www.mongodb.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">Mongo DB<\/a>.<\/p>\n\n\n\n<p id=\"ebb0\">\ud83d\udd17 1-min read on the&nbsp;<a href=\"https:\/\/superlinked.com\/vectorhub\/12-data-modality#kAGPz?utm_source=community&amp;utm_medium=blog&amp;utm_campaign=oscourse\" target=\"_blank\" rel=\"noreferrer noopener\">CDC pattern<\/a>&nbsp;[5]<\/p>\n\n\n\n<p id=\"5ece\">To explain the CDC briefly, a watcher listens 24\/7 for any CRUD operation that happens to the Mongo DB.<\/p>\n\n\n\n<p id=\"2bc3\">The watcher will issue an event informing us what has been modified. We will add that event to a RabbitMQ queue.<\/p>\n\n\n\n<p id=\"48dc\">The feature pipeline will constantly listen to the queue, process the messages, and add them to the&nbsp;<a href=\"https:\/\/qdrant.tech\/?utm_source=decodingml&amp;utm_medium=referral&amp;utm_campaign=llm-course\" target=\"_blank\" rel=\"noreferrer noopener\">Qdrant<\/a>&nbsp;vector DB.<\/p>\n\n\n\n<p id=\"5e58\">For example, when we write a new document to the Mongo DB, the watcher creates a new event. 
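<\/p>\n\n\n\n<p>As a rough sketch, with an in-memory queue standing in for RabbitMQ and a dict standing in for the Mongo DB, the watcher-to-queue flow looks like this:<\/p>\n\n\n\n
```python
import json
import queue

# CDC sketch: the watcher and queue are simulated in memory; the course
# uses Mongo DB change events and RabbitMQ for the real thing.
rabbitmq = queue.Queue()
mongo_db = {}

def watcher_emit(operation, collection, document):
    # the CDC watcher turns every CRUD operation into a queue event
    event = {'op': operation, 'collection': collection, 'doc': document}
    rabbitmq.put(json.dumps(event))

def insert_document(collection, doc_id, document):
    mongo_db[(collection, doc_id)] = document
    watcher_emit('insert', collection, document)  # change captured here

insert_document('articles', 1, {'title': 'Hello CDC'})
event = json.loads(rabbitmq.get())
print(event['op'], event['doc']['title'])  # insert Hello CDC
```
\n\n\n\n<p>The feature pipeline on the other side of the queue never touches the Mongo DB directly; it only consumes these events.<\/p>\n\n\n\n<p>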
The event is added to the RabbitMQ queue; ultimately, the feature pipeline consumes and processes it.<\/p>\n\n\n\n<p id=\"339c\">Doing this ensures that the Mongo DB and vector DB are constantly in sync.<\/p>\n\n\n\n<p id=\"d4d3\">With the CDC technique, we transition from a batch ETL pipeline (our data pipeline) to a streaming pipeline (our feature pipeline).<\/p>\n\n\n\n<p id=\"df96\">Using the CDC pattern, we avoid implementing a complex batch pipeline to compute the difference between the Mongo DB and vector DB. This approach can quickly get very slow when working with big data.<\/p>\n\n\n\n<p id=\"b41b\"><strong>Where will the data pipeline be deployed?<\/strong><\/p>\n\n\n\n<p id=\"f77f\">The data collection pipeline and RabbitMQ service will be deployed to AWS. We will also use the freemium serverless version of Mongo DB.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"9418\">3.2. The feature pipeline<\/h3>\n\n\n\n<p id=\"4f86\">The feature pipeline is&nbsp;<strong>implemented using&nbsp;<\/strong><a href=\"https:\/\/bytewax.io\/?utm_source=medium&amp;utm_medium=decodingml&amp;utm_campaign=2024_q1\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>Bytewax<\/strong><\/a>&nbsp;(a Rust streaming engine with a Python interface). 
Thus, in&nbsp;<strong>our<\/strong>&nbsp;specific&nbsp;<strong>use case<\/strong>, we will also&nbsp;<strong>refer to it<\/strong>&nbsp;as a&nbsp;<strong>streaming ingestion pipeline<\/strong>.<\/p>\n\n\n\n<p id=\"7848\">It is an&nbsp;<strong>entirely different service<\/strong>&nbsp;than the data collection pipeline.<\/p>\n\n\n\n<p id=\"bba9\"><strong>How does it communicate with the data pipeline?<\/strong><\/p>\n\n\n\n<p id=\"6a4d\">As explained above, the&nbsp;<strong>feature pipeline communicates<\/strong>&nbsp;with the&nbsp;<strong>data<\/strong>&nbsp;<strong>pipeline<\/strong>&nbsp;through a RabbitMQ&nbsp;<strong>queue<\/strong>.<\/p>\n\n\n\n<p id=\"5588\">Currently, the streaming pipeline doesn\u2019t care how the data is generated or where it comes from.<\/p>\n\n\n\n<p id=\"2d5e\">It knows it has to listen to a given queue, consume messages from there and process them.<\/p>\n\n\n\n<p id=\"3a24\">By doing so, we&nbsp;<strong>decouple<\/strong>&nbsp;<strong>the two components<\/strong>&nbsp;entirely. In the future, we can easily add messages from multiple sources to the queue, and the streaming pipeline will know how to process them. 
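<\/p>\n\n\n\n<p>For illustration only, such a shared structure could be a small, validated message contract (the field names are ours, not the course schema):<\/p>\n\n\n\n
```python
import json
from dataclasses import dataclass, asdict

# Illustrative queue message contract: every producer emits this exact
# structure, so the streaming pipeline stays agnostic about the source.
@dataclass
class QueueMessage:
    entry_id: str
    data_type: str  # e.g., 'post', 'article' or 'code'
    content: str

def serialize(message):
    return json.dumps(asdict(message))

def deserialize(payload):
    fields = json.loads(payload)
    return QueueMessage(**fields)  # raises TypeError on a bad structure

message = QueueMessage('42', 'article', 'How to build an LLM twin')
assert deserialize(serialize(message)) == message
```
\n\n\n\n<p>A consumer that receives a payload violating this contract fails fast instead of silently ingesting malformed data.<\/p>\n\n\n\n<p>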
The only rule is that the messages in the queue should always respect the same structure\/interface.<\/p>\n\n\n\n<p id=\"68f8\"><strong>What is the scope of the feature pipeline?<\/strong><\/p>\n\n\n\n<p id=\"165c\">It represents the&nbsp;<strong>ingestion component<\/strong>&nbsp;of the&nbsp;<strong>RAG system<\/strong>.<\/p>\n\n\n\n<p id=\"d5fc\">It will&nbsp;<strong>take<\/strong>&nbsp;the&nbsp;<strong>raw data<\/strong>&nbsp;passed through the queue and:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>clean the data;<\/li>\n\n\n\n<li>chunk it;<\/li>\n\n\n\n<li>embed it using the embedding models from\u00a0<a href=\"https:\/\/superlinked.com\/?utm_source=community&amp;utm_medium=blog&amp;utm_campaign=oscourse\" target=\"_blank\" rel=\"noreferrer noopener\">Superlinked<\/a>;<\/li>\n\n\n\n<li>load it to the\u00a0<a href=\"https:\/\/qdrant.tech\/?utm_source=decodingml&amp;utm_medium=referral&amp;utm_campaign=llm-course\" target=\"_blank\" rel=\"noreferrer noopener\">Qdrant<\/a>\u00a0vector DB.<\/li>\n<\/ul>\n\n\n\n<p id=\"5cb5\"><strong>Every type of data<\/strong>&nbsp;(post, article, code) will be&nbsp;<strong>processed independently<\/strong>&nbsp;through its own set of classes.<\/p>\n\n\n\n<p id=\"7ede\">Even though all of them are text-based, we must clean, chunk and embed them using different strategies, as every type of data has its own particularities.<\/p>\n\n\n\n<p id=\"6ca0\"><strong>What data will be stored?<\/strong><\/p>\n\n\n\n<p id=\"1bda\">The&nbsp;<strong>training pipeline<\/strong>&nbsp;will have&nbsp;<strong>access<\/strong>&nbsp;<strong>only<\/strong>&nbsp;to the&nbsp;<strong>feature store<\/strong>, which, in our case, is represented by the&nbsp;<a href=\"https:\/\/qdrant.tech\/?utm_source=decodingml&amp;utm_medium=referral&amp;utm_campaign=llm-course\" target=\"_blank\" rel=\"noreferrer noopener\">Qdrant<\/a>&nbsp;vector DB.<\/p>\n\n\n\n<p id=\"a197\">Note that a vector DB can also be used as a NoSQL DB.<\/p>\n\n\n\n<p id=\"5461\"><em>With 
these 2 things in mind, we will&nbsp;<\/em><strong><em>store<\/em><\/strong><em>&nbsp;in Qdrant&nbsp;<\/em><strong><em>2 snapshots of our data:<\/em><\/strong><\/p>\n\n\n\n<p id=\"2519\">1. The&nbsp;<strong>cleaned data<\/strong>&nbsp;(without using vectors as indexes \u2014 store them in a NoSQL fashion).<\/p>\n\n\n\n<p id=\"81ed\">2. The&nbsp;<strong>cleaned, chunked, and embedded data<\/strong>&nbsp;(leveraging the vector indexes of Qdrant).<\/p>\n\n\n\n<p id=\"1e65\">The&nbsp;<strong>training pipeline<\/strong>&nbsp;needs&nbsp;<strong>access<\/strong>&nbsp;to the&nbsp;<strong>data<\/strong>&nbsp;in&nbsp;<strong>both formats<\/strong>&nbsp;as we want to fine-tune the LLM on standard and augmented prompts.<\/p>\n\n\n\n<p id=\"f72e\">With the&nbsp;<strong>cleaned data<\/strong>, we will create the prompts and answers.<\/p>\n\n\n\n<p id=\"b1dd\">With the&nbsp;<strong>chunked data<\/strong>, we will augment the prompts (aka RAG).<\/p>\n\n\n\n<p id=\"e8ad\"><strong>Why implement a streaming pipeline instead of a batch pipeline?<\/strong><\/p>\n\n\n\n<p id=\"5b7d\">There are&nbsp;<strong>2 main reasons.<\/strong><\/p>\n\n\n\n<p id=\"c0ba\">The first one is that, coupled with the&nbsp;<strong>CDC pattern<\/strong>, it is the most&nbsp;<strong>efficient<\/strong>&nbsp;way to keep&nbsp;<strong>two DBs in sync<\/strong>. Otherwise, you would have to implement batch polling or pushing techniques that aren\u2019t scalable when working with big data.<\/p>\n\n\n\n<p id=\"e7fe\">Using CDC + a streaming pipeline, you process only the changes to the source DB without any overhead.<\/p>\n\n\n\n<p id=\"f8b3\">The second reason is that by doing so, your&nbsp;<strong>source<\/strong>&nbsp;and&nbsp;<strong>vector DB<\/strong>&nbsp;will&nbsp;<strong>always be in sync<\/strong>. 
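<\/p>\n\n\n\n<p>A toy consumer loop illustrates the point, with a Queue standing in for RabbitMQ and a dict standing in for the vector DB (a real consumer would also clean, chunk, and embed each document before upserting it):<\/p>\n\n\n\n
```python
import queue

# Consumer-side CDC sketch: applying each change event as it arrives
# keeps the vector store in sync without ever diffing the two databases.
events = queue.Queue()
vector_store = {}

def apply_event(event):
    op, key = event['op'], event['key']
    if op in ('insert', 'update'):
        vector_store[key] = event['doc']  # embed + upsert in a real pipeline
    elif op == 'delete':
        vector_store.pop(key, None)

def drain():
    # process only the changes; no batch comparison of the full DBs
    while not events.empty():
        apply_event(events.get())

events.put({'op': 'insert', 'key': 'a1', 'doc': 'cleaned article text'})
events.put({'op': 'delete', 'key': 'a1'})
drain()
print(vector_store)  # {}
```
\n\n\n\n<p>The cost of staying in sync scales with the volume of changes, not with the size of the databases.<\/p>\n\n\n\n<p>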
Thus, you will always have access to the latest data when doing RAG.<\/p>\n\n\n\n<p id=\"239b\"><strong>Why Bytewax?<\/strong><\/p>\n\n\n\n<p id=\"823c\"><a href=\"https:\/\/bytewax.io\/?utm_source=medium&amp;utm_medium=decodingml&amp;utm_campaign=2024_q1\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>Bytewax<\/strong><\/a>&nbsp;is a streaming engine built in Rust that exposes a Python interface. We use Bytewax because it combines Rust\u2019s impressive speed and reliability with the ease of use and ecosystem of Python. It is incredibly light, powerful, and easy for a Python developer.<\/p>\n\n\n\n<p id=\"9c90\"><strong>Where will the feature pipeline be deployed?<\/strong><\/p>\n\n\n\n<p id=\"f9fb\">The feature pipeline will be deployed to AWS. We will also use the freemium serverless version of&nbsp;<a href=\"https:\/\/qdrant.tech\/?utm_source=decodingml&amp;utm_medium=referral&amp;utm_campaign=llm-course\" target=\"_blank\" rel=\"noreferrer noopener\">Qdrant<\/a>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"b317\">3.3. The training pipeline<\/h3>\n\n\n\n<p id=\"53d2\"><strong>How do we have access to the training features?<\/strong><\/p>\n\n\n\n<p id=\"18d6\">As highlighted in section 3.2, all the&nbsp;<strong>training data<\/strong>&nbsp;will be&nbsp;<strong>accessed<\/strong>&nbsp;from the&nbsp;<strong>feature store<\/strong>. 
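<\/p>\n\n\n\n<p>To preview how that access might look, here is a hedged data-to-prompt sketch (the template and all field names are ours): cleaned documents become training completions, while retrieved chunks augment the prompt RAG-style.<\/p>\n\n\n\n
```python
# Illustrative data-to-prompt step: cleaned documents become training
# completions, and retrieved chunks augment the prompt RAG-style.
# The template and all field names are ours, not the course code.
PROMPT_TEMPLATE = 'Write a {kind} in my voice about: {topic}. Context: {context}'

def make_training_sample(cleaned_doc, retrieved_chunks):
    context = ' | '.join(retrieved_chunks)  # from the chunked snapshot
    prompt = PROMPT_TEMPLATE.format(
        kind=cleaned_doc['type'],
        topic=cleaned_doc['topic'],
        context=context,
    )
    return {'prompt': prompt, 'completion': cleaned_doc['content']}

sample = make_training_sample(
    {'type': 'post', 'topic': 'LLM twins', 'content': 'My original post text'},
    ['chunk about fine-tuning', 'chunk about RAG'],
)
print(sample['completion'])  # My original post text
```
\n\n\n\n<p>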
In our case, the feature store is the&nbsp;<a href=\"https:\/\/qdrant.tech\/?utm_source=decodingml&amp;utm_medium=referral&amp;utm_campaign=llm-course\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>Qdrant<\/strong><\/a><strong>&nbsp;vector DB<\/strong>&nbsp;that contains:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>the cleaned digital data, from which we will create prompts &amp; answers;<\/li>\n\n\n\n<li>the chunked &amp; embedded data, which we will use for RAG to augment the prompts.<\/li>\n<\/ul>\n\n\n\n<p id=\"1969\"><em>We will implement a different vector DB retrieval client for each of our main types of data (posts, articles, code).<\/em><\/p>\n\n\n\n<p id=\"d40e\">This separation is necessary because each type must be preprocessed differently before querying the vector DB, as each has unique properties.<\/p>\n\n\n\n<p id=\"04b6\">Also, we will add custom behavior for each client based on what we want to query from the vector DB. But more on this in its dedicated lesson.<\/p>\n\n\n\n<p id=\"e30f\"><strong>What will the training pipeline do?<\/strong><\/p>\n\n\n\n<p id=\"df07\">The training pipeline contains a&nbsp;<strong>data-to-prompt layer<\/strong>&nbsp;that will preprocess the data retrieved from the vector DB into prompts.<\/p>\n\n\n\n<p id=\"dfdc\">It will also contain an&nbsp;<strong>LLM fine-tuning module<\/strong>&nbsp;that takes a HuggingFace dataset as input and uses QLoRA to fine-tune a given LLM (e.g., Mistral). By using HuggingFace, we can easily switch between different LLMs, so we won\u2019t focus too much on any specific LLM.<\/p>\n\n\n\n<p id=\"34a4\">All the experiments will be logged into&nbsp;<a href=\"\/signup\/?utm_source=decoding_ml&amp;utm_medium=partner&amp;utm_content=medium\" target=\"_blank\" rel=\"noreferrer noopener\">Comet\u2019s<\/a>&nbsp;<strong>experiment tracker<\/strong>.<\/p>\n\n\n\n<p id=\"f8fe\">We will use a bigger LLM (e.g., GPT-4) to&nbsp;<strong>evaluate<\/strong>&nbsp;the results of our fine-tuned LLM. 
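An LLM-as-judge evaluation like this boils down to building a grading prompt for the bigger model; the rubric and scale below are a hypothetical sketch, not the course's actual evaluation prompt, and the API call to the judge model is omitted:

```python
# Hypothetical sketch of the prompt we would send to a larger LLM (e.g., GPT-4)
# to grade an answer produced by the fine-tuned model. The rubric wording and
# the 1-5 scale are illustrative assumptions.

def build_eval_prompt(instruction: str, generated_answer: str) -> str:
    return (
        "You are an impartial judge. Rate the answer below from 1 to 5 for "
        "style fidelity, factuality, and relevance to the instruction. "
        "Reply with a single integer.\n\n"
        f"Instruction: {instruction}\n"
        f"Answer: {generated_answer}\n"
        "Score:"
    )
```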
These results will be logged into&nbsp;<a href=\"\/signup\/?utm_source=decoding_ml&amp;utm_medium=partner&amp;utm_content=medium\" target=\"_blank\" rel=\"noreferrer noopener\">Comet\u2019s<\/a>&nbsp;experiment tracker.<\/p>\n\n\n\n<p id=\"c2ec\"><strong>Where will the production candidate LLM be stored?<\/strong><\/p>\n\n\n\n<p id=\"c0cb\">We will compare multiple experiments, pick the best one, and promote it as an LLM production candidate to the model registry.<\/p>\n\n\n\n<p id=\"d7a3\">Afterward, we will inspect the LLM production candidate manually using&nbsp;<a href=\"\/signup\/?framework=llm&amp;utm_source=decoding_ml&amp;utm_medium=partner&amp;utm_content=medium\" target=\"_blank\" rel=\"noreferrer noopener\">Comet\u2019s<\/a>&nbsp;prompt monitoring dashboard. If this final manual check passes, we will flag the LLM from the model registry as accepted.<\/p>\n\n\n\n<p id=\"0c20\">A CI\/CD pipeline will then trigger and deploy the new LLM version to the inference pipeline.<\/p>\n\n\n\n<p id=\"999e\"><strong>Where will the training pipeline be deployed?<\/strong><\/p>\n\n\n\n<p>The training pipeline will be deployed to AWS SageMaker.<\/p>\n\n\n\n<p>AWS SageMaker is a solution for training and deploying ML models. It makes it easy to scale your training jobs while you focus on building.<\/p>\n\n\n\n<p id=\"8bde\">Also, we will use the freemium version of&nbsp;<a href=\"\/signup\/?utm_source=decoding_ml&amp;utm_medium=partner&amp;utm_content=medium\" target=\"_blank\" rel=\"noreferrer noopener\">Comet&nbsp;<\/a>for the following:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>experiment tracker;<\/li>\n\n\n\n<li>model registry;<\/li>\n\n\n\n<li>prompt monitoring.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"52eb\">3.4. The inference pipeline<\/h3>\n\n\n\n<p id=\"9308\">The inference pipeline is the&nbsp;<strong>final component<\/strong>&nbsp;of the&nbsp;<strong>LLM system<\/strong>. 
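The promotion flow from the training section (compare experiments, pick the best, tag it in the registry) can be sketched with plain dictionaries standing in for Comet's experiment tracker and model registry; all field names here are assumptions for illustration:

```python
# Illustrative promotion flow: dicts stand in for the experiment tracker
# and the model registry; field names are invented for the example.

def pick_candidate(experiments: list, metric: str = "eval_score") -> dict:
    """Compare logged runs and promote the best one as a production candidate."""
    best = max(experiments, key=lambda run: run[metric])
    return {"model": best["model"], "version": best["version"], "tag": "candidate"}

def accept(candidate: dict) -> dict:
    """Flag the candidate as accepted after the manual prompt-monitoring check."""
    return {**candidate, "tag": "accepted"}
```

Once a version carries the accepted tag, the CI/CD step ships that exact version to the inference pipeline.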
It is the one the&nbsp;<strong>clients<\/strong>&nbsp;will&nbsp;<strong>interact with<\/strong>.<\/p>\n\n\n\n<p id=\"5de7\">It will be&nbsp;<strong>wrapped<\/strong>&nbsp;under a&nbsp;<strong>REST API<\/strong>. The clients can call it through HTTP requests, much like your experience with ChatGPT or similar tools.<\/p>\n\n\n\n<p id=\"c0c1\"><strong>How do we access the features?<\/strong><\/p>\n\n\n\n<p id=\"fecb\">To access the feature store, we will use the same&nbsp;<a href=\"https:\/\/qdrant.tech\/?utm_source=decodingml&amp;utm_medium=referral&amp;utm_campaign=llm-course\" target=\"_blank\" rel=\"noreferrer noopener\">Qdrant<\/a>&nbsp;vector DB retrieval clients as in the training pipeline.<\/p>\n\n\n\n<p id=\"8ca1\">In this case, we will need the feature store to access the chunked data to do RAG.<\/p>\n\n\n\n<p id=\"103d\"><strong>How do we access the fine-tuned LLM?<\/strong><\/p>\n\n\n\n<p id=\"f941\">The fine-tuned LLM will always be downloaded from the model registry based on its tag (e.g., accepted) and version (e.g., v1.0.2, latest, etc.).<\/p>\n\n\n\n<p id=\"22f3\"><strong>How will the fine-tuned LLM be loaded?<\/strong><\/p>\n\n\n\n<p id=\"ce44\">Here we are in the inference world.<\/p>\n\n\n\n<p id=\"a8f3\">Thus, we want to optimize the LLM\u2019s speed and memory consumption as much as possible. That is why, after downloading the LLM from the model registry, we will quantize it.<\/p>\n\n\n\n<p id=\"ee18\"><strong>What are the components of the inference pipeline?<\/strong><\/p>\n\n\n\n<p id=\"6006\">The first one is the&nbsp;<strong>retrieval client<\/strong>&nbsp;used to access the vector DB to do RAG. 
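A per-type retrieval client can be sketched as a small dispatch over query-preprocessing functions; the preprocessing rules below are hypothetical simplifications (the real clients are covered in their dedicated lesson), and the embedding and Qdrant search calls are omitted:

```python
# Hypothetical per-type retrieval clients: each data type gets its own
# query preprocessing before the query would be embedded and searched in
# the vector DB. The concrete rules here are invented for illustration.

def preprocess_post_query(query: str) -> str:
    return query.strip().lower()          # posts: casual text, normalize casing

def preprocess_article_query(query: str) -> str:
    return f"passage: {query.strip()}"    # articles: prefix convention for long-form text

def preprocess_code_query(query: str) -> str:
    return query.strip()                  # code: keep casing, identifiers matter

RETRIEVAL_CLIENTS = {
    "posts": preprocess_post_query,
    "articles": preprocess_article_query,
    "code": preprocess_code_query,
}

def build_search_query(data_type: str, query: str) -> str:
    return RETRIEVAL_CLIENTS[data_type](query)
```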
This is the same module as the one used in the training pipeline.<\/p>\n\n\n\n<p id=\"9263\">Next, a&nbsp;<strong>query-to-prompt layer<\/strong>&nbsp;maps the user query and the documents retrieved from&nbsp;<a href=\"https:\/\/qdrant.tech\/?utm_source=decodingml&amp;utm_medium=referral&amp;utm_campaign=llm-course\" target=\"_blank\" rel=\"noreferrer noopener\">Qdrant<\/a>&nbsp;into a prompt.<\/p>\n\n\n\n<p id=\"f0d5\">After the LLM generates its answer, we will log it to&nbsp;<a href=\"\/signup\/?framework=llm&amp;utm_source=decoding_ml&amp;utm_medium=partner&amp;utm_content=medium\" target=\"_blank\" rel=\"noreferrer noopener\">Comet\u2019s<\/a>&nbsp;<strong>prompt monitoring dashboard<\/strong>&nbsp;and return it to the clients.<\/p>\n\n\n\n<p id=\"5de4\">For example, the client will request the inference pipeline to:<\/p>\n\n\n\n<p id=\"9136\">\u201cWrite a 1000-word LinkedIn post about LLMs,\u201d and the inference pipeline will go through all the steps above to return the generated post.<\/p>\n\n\n\n<p id=\"82a7\"><strong>Where will the inference pipeline be deployed?<\/strong><\/p>\n\n\n\n<p>The inference pipeline will be deployed to AWS SageMaker.<\/p>\n\n\n\n<p>AWS SageMaker also offers autoscaling solutions and a nice dashboard to monitor all the production environment resources.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"24b2\">Conclusion<\/h2>\n\n\n\n<p id=\"c109\">This is the 1st article of the<strong>&nbsp;<em>LLM Twin: Building Your Production-Ready AI Replica<\/em>&nbsp;<\/strong>free<strong>&nbsp;<\/strong>course.<\/p>\n\n\n\n<p id=\"3fc0\">In this lesson, we presented what&nbsp;<strong>you will build<\/strong>&nbsp;during the course.<\/p>\n\n\n\n<p id=\"3638\">Afterward, we briefly discussed how to design ML systems using&nbsp;<strong>the 3-pipeline design<\/strong>.<\/p>\n\n\n\n<p id=\"541d\">Ultimately, we went through the&nbsp;<strong>system design<\/strong>&nbsp;of the course and presented 
the&nbsp;<strong>architecture<\/strong>&nbsp;of&nbsp;<strong>each microservice<\/strong>&nbsp;and how they&nbsp;<strong>interact with each other<\/strong>:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>The data collection pipeline<\/li>\n\n\n\n<li>The feature pipeline<\/li>\n\n\n\n<li>The training pipeline<\/li>\n\n\n\n<li>The inference pipeline<\/li>\n<\/ol>\n\n\n\n<p id=\"a6a4\">In&nbsp;<strong>Lesson 2<\/strong>, we will dive deeper into the&nbsp;<strong>data collection pipeline<\/strong>, learn how to implement crawlers for various social media platforms, clean the gathered data, and store it in a MongoDB NoSQL database.<\/p>\n\n\n\n<p><em>\ud83d\udd17&nbsp;<\/em><strong><em>Check out<\/em><\/strong><em>&nbsp;<\/em><a href=\"https:\/\/github.com\/decodingml\/llm-twin-course\" target=\"_blank\" rel=\"noreferrer noopener\"><em>the code on GitHub<\/em><\/a><em>&nbsp;[1] and support us with a \u2b50\ufe0f<\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"9547\">References<\/h2>\n\n\n\n<p id=\"b22e\">[1]&nbsp;<a href=\"https:\/\/github.com\/decodingml\/llm-twin-course\" target=\"_blank\" rel=\"noreferrer noopener\">Your LLM Twin Course \u2014 GitHub Repository<\/a>&nbsp;(2024), Decoding ML GitHub Organization<\/p>\n\n\n\n<p id=\"c6c2\">[2]&nbsp;<a href=\"https:\/\/ai.meta.com\/genai\/\" target=\"_blank\" rel=\"noreferrer noopener\">Introducing new AI experiences from Meta<\/a>&nbsp;(2023), Meta<\/p>\n\n\n\n<p id=\"6fcb\">[3] Jim Dowling,&nbsp;<a href=\"https:\/\/www.hopsworks.ai\/post\/mlops-to-ml-systems-with-fti-pipelines\" target=\"_blank\" rel=\"noreferrer noopener\">From MLOps to ML Systems with Feature\/Training\/Inference Pipelines<\/a>&nbsp;(2023), Hopsworks<\/p>\n\n\n\n<p id=\"dbf2\">[4]&nbsp;<a href=\"https:\/\/www.databricks.com\/glossary\/extract-transform-load\" target=\"_blank\" rel=\"noreferrer noopener\">Extract Transform Load (ETL)<\/a>, Databricks Glossary<\/p>\n\n\n\n<p id=\"d141\">[5] Daniel Svonava and Paolo Perrone,&nbsp;<a 
href=\"https:\/\/superlinked.com\/vectorhub\/12-data-modality?utm_source=community&amp;utm_medium=blog&amp;utm_campaign=oscourse\" target=\"_blank\" rel=\"noreferrer noopener\">Understanding the different Data Modality \/ Types<\/a>&nbsp;(2023), Superlinked<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Welcome to Lesson 1 of 12 in our free course series, LLM Twin: Building Your Production-Ready AI Replica. You\u2019ll learn how to use LLMs, vector DVs, and LLMOps best practices to design, train, and deploy a production ready \u201cLLM twin\u201d of yourself. This AI character will write like you, incorporating your style, personality, and voice [&hellip;]<\/p>\n","protected":false},"author":128,"featured_media":9611,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","footnotes":""},"categories":[65,6,7],"tags":[],"coauthors":[222,223],"class_list":["post-9538","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-llmops","category-machine-learning","category-tutorials"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Free Course: Build a Production-Ready LLM System That Writes Like You<\/title>\n<meta name=\"description\" content=\"Follow this step-by-step tutorial to build an AI character that writes like you and learn key LLM development practices along the way.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/an-end-to-end-framework-for-production-ready-llm-systems-by-building-your-llm-twin\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" 
\/>\n<meta property=\"og:title\" content=\"An End-to-End Framework for Production-Ready LLM Systems by Building Your LLM Twin\" \/>\n<meta property=\"og:description\" content=\"Follow this step-by-step tutorial to build an AI character that writes like you and learn key LLM development practices along the way.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/an-end-to-end-framework-for-production-ready-llm-systems-by-building-your-llm-twin\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2024-03-25T23:19:53+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-04-29T12:35:54+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/03\/Screenshot-2024-03-27-at-10.01.21\u202fAM.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1396\" \/>\n\t<meta property=\"og:image:height\" content=\"796\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Paul Iusztin, Decoding ML\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Cometml\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Paul Iusztin, Decoding ML\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"18 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. 
-->","yoast_head_json":{"title":"Free Course: Build a Production-Ready LLM System That Writes Like You","description":"Follow this step-by-step tutorial to build an AI character that writes like you and learn key LLM development practices along the way.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/an-end-to-end-framework-for-production-ready-llm-systems-by-building-your-llm-twin\/","og_locale":"en_US","og_type":"article","og_title":"An End-to-End Framework for Production-Ready LLM Systems by Building Your LLM Twin","og_description":"Follow this step-by-step tutorial to build an AI character that writes like you and learn key LLM development practices along the way.","og_url":"https:\/\/www.comet.com\/site\/blog\/an-end-to-end-framework-for-production-ready-llm-systems-by-building-your-llm-twin\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2024-03-25T23:19:53+00:00","article_modified_time":"2025-04-29T12:35:54+00:00","og_image":[{"width":1396,"height":796,"url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/03\/Screenshot-2024-03-27-at-10.01.21\u202fAM.png","type":"image\/png"}],"author":"Paul Iusztin, Decoding ML","twitter_card":"summary_large_image","twitter_creator":"@Cometml","twitter_site":"@Cometml","twitter_misc":{"Written by":"Paul Iusztin, Decoding ML","Est. 
reading time":"18 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/an-end-to-end-framework-for-production-ready-llm-systems-by-building-your-llm-twin\/#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/an-end-to-end-framework-for-production-ready-llm-systems-by-building-your-llm-twin\/"},"author":{"name":"Paul Iusztin","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/87bf0cb600025605b68dcd2f0d597560"},"headline":"An End-to-End Framework for Production-Ready LLM Systems by Building Your LLM Twin","datePublished":"2024-03-25T23:19:53+00:00","dateModified":"2025-04-29T12:35:54+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/an-end-to-end-framework-for-production-ready-llm-systems-by-building-your-llm-twin\/"},"wordCount":4001,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/an-end-to-end-framework-for-production-ready-llm-systems-by-building-your-llm-twin\/#primaryimage"},"thumbnailUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/03\/Screenshot-2024-03-27-at-10.01.21\u202fAM.png","articleSection":["LLMOps","Machine Learning","Tutorials"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/an-end-to-end-framework-for-production-ready-llm-systems-by-building-your-llm-twin\/","url":"https:\/\/www.comet.com\/site\/blog\/an-end-to-end-framework-for-production-ready-llm-systems-by-building-your-llm-twin\/","name":"Free Course: Build a Production-Ready LLM System That Writes Like 
You","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/an-end-to-end-framework-for-production-ready-llm-systems-by-building-your-llm-twin\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/an-end-to-end-framework-for-production-ready-llm-systems-by-building-your-llm-twin\/#primaryimage"},"thumbnailUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/03\/Screenshot-2024-03-27-at-10.01.21\u202fAM.png","datePublished":"2024-03-25T23:19:53+00:00","dateModified":"2025-04-29T12:35:54+00:00","description":"Follow this step-by-step tutorial to build an AI character that writes like you and learn key LLM development practices along the way.","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/an-end-to-end-framework-for-production-ready-llm-systems-by-building-your-llm-twin\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/an-end-to-end-framework-for-production-ready-llm-systems-by-building-your-llm-twin\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/an-end-to-end-framework-for-production-ready-llm-systems-by-building-your-llm-twin\/#primaryimage","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/03\/Screenshot-2024-03-27-at-10.01.21\u202fAM.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/03\/Screenshot-2024-03-27-at-10.01.21\u202fAM.png","width":1396,"height":796,"caption":"An End-to-End Framework for Production-Ready LLM Systems by Building Your LLM Twin by Paul Iusztin of 
DecodingML"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/an-end-to-end-framework-for-production-ready-llm-systems-by-building-your-llm-twin\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"An End-to-End Framework for Production-Ready LLM Systems by Building Your LLM Twin"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/87bf0cb600025605b68dcd2f0d597560","name":"Paul 
Iusztin","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/0bb2983de08cbe4fe43fad876af41aee","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/03\/cropped-1664517339716-96x96.jpg","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/03\/cropped-1664517339716-96x96.jpg","caption":"Paul Iusztin"},"sameAs":["https:\/\/decodingml.substack.com\/"],"url":"https:\/\/www.comet.com\/site\/blog\/author\/paul-iusztin\/"}]}},"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/9538","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/128"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=9538"}],"version-history":[{"count":2,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/9538\/revisions"}],"predecessor-version":[{"id":15794,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/9538\/revisions\/15794"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media\/9611"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=9538"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=9538"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=9538"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=9538"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}