{"id":9951,"date":"2024-05-20T06:55:50","date_gmt":"2024-05-20T14:55:50","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=9951"},"modified":"2025-04-29T12:46:40","modified_gmt":"2025-04-29T12:46:40","slug":"llm-fine-tuning-dataset","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/llm-fine-tuning-dataset\/","title":{"rendered":"Turning Raw Data Into Fine-Tuning Datasets"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\"><em>Welcome to&nbsp;<strong>Lesson 6 of 12<\/strong>&nbsp;in our free course series,&nbsp;<strong>LLM Twin: Building Your Production-Ready AI Replica<\/strong>. You\u2019ll learn how to use LLMs, vector DBs, and LLMOps best practices to design, train, and deploy a production-ready \u201cLLM twin\u201d of yourself. This AI character will write like you, incorporating your style, personality, and voice into an LLM. For a full overview of course objectives and prerequisites, start with&nbsp;<a href=\"https:\/\/www.comet.com\/site\/blog\/an-end-to-end-framework-for-production-ready-llm-systems-by-building-your-llm-twin\/\">Lesson 1<\/a>.<\/em><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Lessons<\/strong><\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/an-end-to-end-framework-for-production-ready-llm-systems-by-building-your-llm-twin\/\">An End-to-End Framework for Production-Ready LLM Systems by Building Your LLM Twin<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/the-importance-of-data-pipelines-in-the-era-of-generative-ai\/\">Your Content is Gold: I Turned 3 Years of Blog Posts into an LLM Training<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/llm-twin-3-change-data-capture\/\">I Replaced 1000 Lines of Polling Code with 50 Lines of CDC Magic<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/streaming-pipelines-for-fine-tuning-llms\/\">SOTA Python Streaming Pipelines for Fine-tuning LLMs 
and RAG \u2014 in Real-Time!<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/advanced-rag-algorithms-optimize-retrieval\/\">The 4 Advanced RAG Algorithms You Must Know to Implement<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/llm-fine-tuning-dataset\/\">Turning Raw Data Into Fine-Tuning Datasets<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/mistral-llm-fine-tuning\/\">8B Parameters, 1 GPU, No Problems: The Ultimate LLM Fine-tuning Pipeline<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/llm-evaluation-best-practices\/\">The Engineer\u2019s Framework for LLM &amp; RAG Evaluation<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/llm-rag-inference-pipelines\/\">Beyond Proof of Concept: Building RAG Systems That Scale<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/rag-evaluation-framework-ragas\/\">The Ultimate Prompt Monitoring Pipeline<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/refactoring-rag-retrieval\/\">[Bonus] Build a scalable RAG ingestion pipeline using 74.3% less code<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/multi-index-rag-apps\/\">[Bonus] Build Multi-Index Advanced RAG Apps<\/a><\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"6fe9\">Large language models (LLMs) have changed how we interact with machines. 
These powerful models have a remarkable understanding of human language, enabling them to translate text, write many kinds of creative content, and answer your questions in an informative way.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"19a4\">But how do we take these LLMs and make them even better?<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"ff29\">The answer lies in&nbsp;<strong>fine-tuning.<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"34cd\"><strong>Fine-tuning<\/strong>&nbsp;is the process of taking a pre-trained LLM and adapting it to a specific task or domain.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"e571\">One important aspect of fine-tuning is&nbsp;<strong>dataset preparation.<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"5617\">Remember the old computing adage: \u201cgarbage in, garbage out.\u201d<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"374a\"><strong>The quality of your dataset<\/strong>&nbsp;directly impacts how well your fine-tuned model will perform.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"728c\">Why does data matter?<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"ddaf\">Let\u2019s explore why a well-prepared, high-quality dataset is essential for successful LLM fine-tuning:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Specificity is Key:<\/strong>\u00a0LLMs like Mistral are trained on massive amounts of general text data. This gives them a broad understanding of language, but it doesn\u2019t always align with the specific task you want the model to perform. A carefully curated dataset helps the model understand the nuances of your domain, vocabulary, and the types of outputs you expect.<\/li>\n\n\n\n<li><strong>Contextual Learning:<\/strong>\u00a0High-quality datasets offer rich context that the LLM can use to learn patterns and relationships between words within your domain. 
This context enables the model to generate more relevant and accurate responses for your specific application.<\/li>\n\n\n\n<li><strong>Avoiding Bias:<\/strong>\u00a0Unbalanced or poorly curated datasets can introduce biases into the LLM, impacting its performance and leading to unfair or undesirable results. A well-prepared dataset helps to mitigate these risks.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><em>Today, we will learn how to generate a custom dataset for our&nbsp;<strong>specific task,<\/strong>&nbsp;which is<strong>&nbsp;content generation.<\/strong><\/em><\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"17c9\">Understanding the Data Types<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"9363\">Our data consists of three primary types: posts, articles, and code. Each type serves a different purpose and is structured to accommodate specific needs:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Posts<\/strong>: Typically shorter and more dynamic, posts are often user-generated content from social platforms or forums. They are characterized by varied formats and informal language, capturing real-time user interactions and opinions.<\/li>\n\n\n\n<li><strong>Articles<\/strong>: These are more structured and content-rich, usually sourced from news outlets or blogs. Articles provide in-depth analysis or reporting and are formatted to include headings, subheadings, and multiple paragraphs, offering comprehensive information on specific topics.<\/li>\n\n\n\n<li><strong>Code<\/strong>: Sourced from repositories like GitHub, this data type encompasses scripts and programming snippets crucial for LLMs to learn and understand technical language.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">All three data types require careful handling during insertion to preserve their integrity and ensure they are stored correctly for further processing and analysis in MongoDB. 
This includes managing formatting issues and ensuring data consistency across the database.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:1400\/format:webp\/1*KJRLcR2rPXS0mGi3tFBmMA.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Fine-tuning instruct dataset generation process<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"25b5\">Table of Contents<\/h2>\n\n\n\n<ol class=\"wp-block-list\">\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/llm-fine-tuning-dataset\/#1864\">Generating fine-tuning instruct datasets<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/llm-fine-tuning-dataset\/#eec8\">Storing the dataset in a data registry<\/a><\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"6c7d\">\ud83d\udd17&nbsp;<strong>Check out<\/strong>&nbsp;<a href=\"https:\/\/github.com\/decodingml\/llm-twin-course\" target=\"_blank\" rel=\"noreferrer noopener\">the code on GitHub<\/a>&nbsp;[1] and support us with a \u2b50\ufe0f<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"1864\">1. 
Generating fine-tuning instruct datasets<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>The Challenge:<\/strong>\u00a0Manually creating a dataset for fine-tuning a language model like Mistral-7B can be time-consuming and prone to errors.<\/li>\n\n\n\n<li><strong>The Solution:<\/strong>\u00a0Instruction datasets offer an efficient way to guide a language model toward a specific task like news classification.<\/li>\n\n\n\n<li><strong>Methods:<\/strong>\u00a0While instruction datasets can be built manually or derived from existing sources, we\u2019ll leverage a powerful LLM like OpenAI\u2019s GPT-4o due to our time and budget constraints.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Using the cleaned data from Qdrant<\/strong><br>Let\u2019s analyze sample data points from Qdrant, which we cleaned within our feature pipeline in Lesson 4, to demonstrate how we can derive instructions for our instruction dataset:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code> {\n  \"author_id\": \"2\",\n  \"cleaned_content\": \"Do you want to learn to build hands-on LLM systems using good LLMOps practices? 
A new Medium series is coming up for the Hands-on LLMs course\\n.\\nBy finishing the Hands-On LLMs free course, you will learn how to use the 3-pipeline architecture &amp; LLMOps good practices to design, build, and deploy a real-time financial advisor powered by LLMs &amp; vector DBs.\\nWe will primarily focus on the engineering &amp; MLOps aspects.\\nThus, by the end of this series, you will know how to build &amp; deploy a real ML system, not some isolated code in Notebooks.\\nThere are 3 components you will learn to build during the course:\\n- a real-time streaming pipeline\\n- a fine-tuning pipeline\\n- an inference pipeline\\n.\\nWe have already released the code and video lessons of the Hands-on LLM course.\\nBut we are excited to announce an 8-lesson Medium series that will dive deep into the code and explain everything step-by-step.\\nWe have already released the first lesson of the series  \\nThe LLMs kit: Build a production-ready real-time financial advisor system using streaming pipelines, RAG, and LLMOps:  \\n&#91;URL]\\n  In Lesson 1, you will learn how to design a financial assistant using the 3-pipeline architecture (also known as the FTI architecture), powered by:\\n- LLMs\\n- vector DBs\\n- a streaming engine\\n- LLMOps\\n.\\n  The rest of the articles will be released by the end of January 2024.\\nFollow us on Medium's Decoding ML publication to get notified when we publish the other lessons:  \\n&#91;URL]\\nhashtag\\n#\\nmachinelearning\\nhashtag\\n#\\nmlops\\nhashtag\\n#\\ndatascience\",\n  \"platform\": \"linkedin\",\n  \"type\": \"posts\"\n},\n{\n  \"author_id\": \"2\",\n  \"cleaned_content\": \"RAG systems are far from perfect   This free course teaches you how to improve your RAG system.\\nI recently finished the Advanced Retrieval for AI with Chroma free course from\\nDeepLearning.AI\\nIf you are into RAG, I find it among the most valuable learning sources.\\nThe course already assumes you know what RAG is.\\nIts primary focus is to show 
you all the current issues of RAG and why it is far from perfect.\\nAfterward, it shows you the latest SoTA techniques to improve your RAG system, such as:\\n- query expansion\\n- cross-encoder re-ranking\\n- embedding adaptors\\nI am not affiliated with\\nDeepLearning.AI\\n(I wouldn't mind though).\\nThis is a great course you should take if you are into RAG systems.\\nThe good news is that it is free and takes only 1 hour.\\nCheck it out  \\n  Advanced Retrieval for AI with Chroma:\\n&#91;URL]\\nhashtag\\n#\\nmachinelearning\\nhashtag\\n#\\nmlops\\nhashtag\\n#\\ndatascience\\n.\\n  Follow me for daily lessons about ML engineering and MLOps.&#91;URL]\",\n  \"image\": null,\n  \"platform\": \"linkedin\",\n  \"type\": \"posts\"\n} <\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Process:<\/strong><br><strong>Generating instructions:<\/strong>&nbsp;We can leverage the&nbsp;<em>\u201ccleaned_content\u201d<\/em>&nbsp;to automatically generate instructions (using GPT-4o or other LLM) for each piece of content, such as:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Instruction 1: \u201cWrite a LinkedIn post promoting a new educational course on building LLM systems focusing on LLMOps. Use relevant hashtags and a tone that is both informative and engaging.\u201d<\/li>\n\n\n\n<li>Instruction 2: \u201cWrite a LinkedIn post explaining the benefits of using LLMs and vector databases in real-time financial advising applications. 
Highlight the importance of LLMOps for successful deployment.\u201d<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Generating the dataset with GPT-4o<\/strong><br>The process can be split into 3 main stages:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Query the Qdrant vector DB for cleaned content.<\/li>\n\n\n\n<li>Split it into smaller, more granular paragraphs.<\/li>\n\n\n\n<li>Feed each paragraph to GPT-4o to generate an instruction.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Result<\/strong>: This process would yield a dataset of instruction-output pairs designed to fine-tune a Llama 3.1 8B (or other LLM) for tweaking the writing style of the LLM.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Let\u2019s dig into the code!<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The example will simulate creating a training dataset for an LLM using the strategy we\u2019ve explained above.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Imagine that we want to go from this \u2193<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code> {\n  \"author_id\": \"2\",\n  \"cleaned_content\": \"Do you want to learn to build hands-on LLM systems using good LLMOps practices? 
A new Medium series is coming up for the Hands-on LLMs course\\n.\\nBy finishing the Hands-On LLMs free course, you will learn how to use the 3-pipeline architecture &amp; LLMOps good practices to design, build, and deploy a real-time financial advisor powered by LLMs &amp; vector DBs.\\nWe will primarily focus on the engineering &amp; MLOps aspects.\\nThus, by the end of this series, you will know how to build &amp; deploy a real ML system, not some isolated code in Notebooks.\\nThere are 3 components you will learn to build during the course:\\n- a real-time streaming pipeline\\n- a fine-tuning pipeline\\n- an inference pipeline\\n.\\nWe have already released the code and video lessons of the Hands-on LLM course.\\nBut we are excited to announce an 8-lesson Medium series that will dive deep into the code and explain everything step-by-step.\\nWe have already released the first lesson of the series  \\nThe LLMs kit: Build a production-ready real-time financial advisor system using streaming pipelines, RAG, and LLMOps:  \\n&#91;URL]\\n  In Lesson 1, you will learn how to design a financial assistant using the 3-pipeline architecture (also known as the FTI architecture), powered by:\\n- LLMs\\n- vector DBs\\n- a streaming engine\\n- LLMOps\\n.\\n  The rest of the articles will be released by the end of January 2024.\\nFollow us on Medium's Decoding ML publication to get notified when we publish the other lessons:  \\n&#91;URL]\\nhashtag\\n#\\nmachinelearning\\nhashtag\\n#\\nmlops\\nhashtag\\n#\\ndatascience\",\n},\n{\n  \"author_id\": \"2\",\n  \"cleaned_content\": \"RAG systems are far from perfect   This free course teaches you how to improve your RAG system.\\nI recently finished the Advanced Retrieval for AI with Chroma free course from\\nDeepLearning.AI\\nIf you are into RAG, I find it among the most valuable learning sources.\\nThe course already assumes you know what RAG is.\\nIts primary focus is to show you all the current issues of RAG and why it is far 
from perfect.\\nAfterward, it shows you the latest SoTA techniques to improve your RAG system, such as:\\n- query expansion\\n- cross-encoder re-ranking\\n- embedding adaptors\\nI am not affiliated with\\nDeepLearning.AI\\n(I wouldn't mind though).\\nThis is a great course you should take if you are into RAG systems.\\nThe good news is that it is free and takes only 1 hour.\\nCheck it out  \\n  Advanced Retrieval for AI with Chroma:\\n&#91;URL]\\nhashtag\\n#\\nmachinelearning\\nhashtag\\n#\\nmlops\\nhashtag\\n#\\ndatascience\\n.\\n  Follow me for daily lessons about ML engineering and MLOps.&#91;URL]\",\n} <\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">to this \u2193<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code> &#91;\n  {\n    \"instruction\": \"Share the announcement of the upcoming Medium series on building hands-on LLM systems using good LLMOps practices, focusing on the 3-pipeline architecture and real-time financial advisor development. Follow the Decoding ML publication on Medium for notifications on future lessons.\",\n    \"content\": \"Do you want to learn to build hands-on LLM systems using good LLMOps practices? 
A new Medium series is coming up for the Hands-on LLMs course\\n.\\nBy finishing the Hands-On LLMs free course, you will learn how to use the 3-pipeline architecture &amp; LLMOps good practices to design, build, and deploy a real-time financial advisor powered by LLMs &amp; vector DBs.\\nWe will primarily focus on the engineering &amp; MLOps aspects.\\nThus, by the end of this series, you will know how to build &amp; deploy a real ML system, not some isolated code in Notebooks.\\nThere are 3 components you will learn to build during the course:\\n- a real-time streaming pipeline\\n- a fine-tuning pipeline\\n- an inference pipeline\\n.\\nWe have already released the code and video lessons of the Hands-on LLM course.\\nBut we are excited to announce an 8-lesson Medium series that will dive deep into the code and explain everything step-by-step.\\nWe have already released the first lesson of the series  \\nThe LLMs kit: Build a production-ready real-time financial advisor system using streaming pipelines, RAG, and LLMOps:  \\n&#91;URL]\\n  In Lesson 1, you will learn how to design a financial assistant using the 3-pipeline architecture (also known as the FTI architecture), powered by:\\n- LLMs\\n- vector DBs\\n- a streaming engine\\n- LLMOps\\n.\\n  The rest of the articles will be released by the end of January 2024.\\nFollow us on Medium's Decoding ML publication to get notified when we publish the other lessons:  \\n&#91;URL]\\nhashtag\\n#\\nmachinelearning\\nhashtag\\n#\\nmlops\\nhashtag\\n#\\ndatascience\"\n  },\n  {\n    \"instruction\": \"Promote the free course 'Advanced Retrieval for AI with Chroma' from DeepLearning.AI that aims to improve RAG systems and takes only 1 hour to complete. 
Share the course link and encourage followers to check it out for the latest techniques in query expansion, cross-encoder re-ranking, and embedding adaptors.\",\n    \"content\": \"RAG systems are far from perfect   This free course teaches you how to improve your RAG system.\\nI recently finished the Advanced Retrieval for AI with Chroma free course from\\nDeepLearning.AI\\nIf you are into RAG, I find it among the most valuable learning sources.\\nThe course already assumes you know what RAG is.\\nIts primary focus is to show you all the current issues of RAG and why it is far from perfect.\\nAfterward, it shows you the latest SoTA techniques to improve your RAG system, such as:\\n- query expansion\\n- cross-encoder re-ranking\\n- embedding adaptors\\nI am not affiliated with\\nDeepLearning.AI\\n(I wouldn't mind though).\\nThis is a great course you should take if you are into RAG systems.\\nThe good news is that it is free and takes only 1 hour.\\nCheck it out  \\n  Advanced Retrieval for AI with Chroma:\\n&#91;URL]\\nhashtag\\n#\\nmachinelearning\\nhashtag\\n#\\nmlops\\nhashtag\\n#\\ndatascience\\n.\\n  Follow me for daily lessons about ML engineering and MLOps.&#91;URL]\"\n  },. <\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">We\u2019ll use the&nbsp;<strong>DataFormatter<\/strong>&nbsp;class to format these data points into a structured prompt for the LLM.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Here\u2019s how you would use the class to prepare the content:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code> class DataFormatter:\n    @classmethod\n    def get_system_prompt(cls, data_type: str) -&gt; str:\n        return (\n            f\"I will give you batches of contents of {data_type}. Please generate me exactly 1 instruction for each of them. The {data_type} text \"\n            f\"for which you have to generate the instructions is under Content number x lines. Please structure the answer in json format,\"\n            f\"ready to be loaded by json.loads(), a list of objects only with fields called instruction and content. 
For the content field, copy the number of the content only!.\"\n            f\"Please do not add any extra characters and make sure it is a list with objects in valid json format!\\n\"\n        )\n\n    @classmethod\n    def format_data(cls, data_points: list, is_example: bool, start_index: int) -&gt; str:\n        text = \"\"\n        for index, data_point in enumerate(data_points):\n            if not is_example:\n                text += f\"Content number {start_index + index }\\n\"\n            text += str(data_point) + \"\\n\"\n\n        return text\n\n    @classmethod\n    def format_batch(cls, context_msg: str, data_points: list, start_index: int) -&gt; str:\n        delimiter_msg = context_msg\n        delimiter_msg += cls.format_data(data_points, False, start_index)\n\n        return delimiter_msg\n\n    @classmethod\n    def format_prompt(\n        cls, inference_posts: list, data_type: str, start_index: int\n    ) -&gt; str:\n        initial_prompt = cls.get_system_prompt(data_type)\n        initial_prompt += f\"You must generate exactly a list of {len(inference_posts)} json objects, using the contents provided under CONTENTS FOR GENERATION\\n\"\n        initial_prompt += cls.format_batch(\n            \"\\nCONTENTS FOR GENERATION: \\n\", inference_posts, start_index\n        )\n\n        return initial_prompt <\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Output of the&nbsp;<em><strong>format_prompt<\/strong><\/em>&nbsp;function:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code> prompt = \"\"\"\nI will give you batches of contents of articles. \n\nPlease generate me exactly 1 instruction for each of them. \nThe articles text for which you have to generate the instructions is under Content number x lines. \nPlease structure the answer in json format,ready to be loaded by json.loads(), a list of objects only with fields called instruction and content. 
\n\nFor the content field, copy the number of the content only!\nPlease do not add any extra characters and make sure it is a list with objects in valid json format!\\n\n\nYou must generate exactly a list of 3 json objects, using the contents provided under CONTENTS FOR GENERATION\\n\n\nCONTENTS FOR GENERATION: \n\nContent number 0\n...\n\nContent number 1\n...\n\nContent number 2\n...\n\nContent number MAX_BATCH\n...\n\"\"\" <\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">We batch the data to avoid exceeding the LLM\u2019s maximum number of tokens per request, so we send multiple prompts to the LLM.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To&nbsp;<strong>automate<\/strong>&nbsp;the generation of fine-tuning data, we designed the DatasetGenerator class. It streamlines the process from fetching the data to logging the training data to Comet:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code> class DatasetGenerator:\n    def __init__(\n        self,\n        file_handler: FileHandler,\n        api_communicator: GptCommunicator,\n        data_formatter: DataFormatter,\n    ) -&gt; None:\n        self.file_handler = file_handler\n        self.api_communicator = api_communicator\n        self.data_formatter = data_formatter <\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>The generate_training_data()<\/strong>&nbsp;method from the&nbsp;<strong>DatasetGenerator class<\/strong>&nbsp;handles the entire lifecycle of data generation and calls the LLM for each batch:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code> def generate_training_data(\n        self, collection_name: str, data_type: str, batch_size: int = 3\n    ) -&gt; None:\n        # Fail fast if any required setting is missing.\n        assert (\n            settings.COMET_API_KEY\n        ), \"COMET_API_KEY must be set in settings, fill it in your .env file.\"\n        assert (\n            settings.COMET_WORKSPACE\n        ), \"COMET_WORKSPACE must be set in settings, fill it in your .env file.\"\n        assert (\n            
settings.COMET_PROJECT\n        ), \"COMET_PROJECT must be set in settings, fill it in your .env file.\"\n        assert (\n            settings.OPENAI_API_KEY\n        ), \"OPENAI_API_KEY must be set in settings, fill it in your .env file.\"\n\n        # Fetch the cleaned documents and split them into smaller chunks.\n        cleaned_documents = self.fetch_all_cleaned_content(collection_name)\n        cleaned_documents = chunk_documents(cleaned_documents)\n        num_cleaned_documents = len(cleaned_documents)\n\n        generated_instruct_dataset = &#91;]\n        for i in range(0, num_cleaned_documents, batch_size):\n            batch = cleaned_documents&#91;i : i + batch_size]\n            prompt = self.data_formatter.format_prompt(batch, data_type, i)\n            batch_instructions = self.api_communicator.send_prompt(prompt)\n\n            # Skip batches where the LLM did not return one instruction per document.\n            if len(batch_instructions) != len(batch):\n                logger.error(\n                    f\"Received {len(batch_instructions)} instructions for {len(batch)} documents. \"\n                    \"Skipping this batch...\"\n                )\n                continue\n\n            # Replace the content index returned by the LLM with the actual text.\n            for instruction, content in zip(batch_instructions, batch):\n                instruction&#91;\"content\"] = content\n                generated_instruct_dataset.append(instruction)\n\n        train_test_split = self._split_dataset(generated_instruct_dataset)\n\n        self.push_to_comet(train_test_split, data_type, collection_name) <\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">We could further optimize this by parallelizing the LLM calls across threads using the&nbsp;<strong>ThreadPoolExecutor<\/strong>&nbsp;class from Python\u2019s concurrent.futures module. 
For our small example, doing everything sequentially is fine.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The<strong>&nbsp;fetch_all_cleaned_content()<\/strong>&nbsp;method retrieves the cleaned documents from a Qdrant collection:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>     def fetch_all_cleaned_content(self, collection_name: str) -&gt; list:\n        all_cleaned_contents = &#91;]\n\n        scroll_response = client.scroll(collection_name=collection_name, limit=10000)\n        points = scroll_response&#91;0]\n\n        for point in points:\n            cleaned_content = point.payload&#91;\"cleaned_content\"]\n            if cleaned_content:\n                all_cleaned_contents.append(cleaned_content)\n\n        return all_cleaned_contents <\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">2. Storing the dataset in a data registry<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"eec8\">In this section, we focus on a critical aspect of MLOps: data versioning.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"9214\">We\u2019ll specifically look at how to implement this using&nbsp;<a href=\"\/signup\/\" target=\"_blank\" rel=\"noreferrer noopener\">Comet<\/a>, a platform that facilitates experiment management and reproducibility in machine learning projects.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"15f5\"><a href=\"\/signup\/\" target=\"_blank\" rel=\"noreferrer noopener\">Comet<\/a>&nbsp;is a cloud-based platform that provides tools for tracking, comparing, explaining, and optimizing experiments and models in machine learning. 
Comet helps data scientists and teams better manage and collaborate on machine learning experiments.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"b143\">Why Use Comet?<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Artifacts<\/strong>:\u00a0Leverages\u00a0<a href=\"https:\/\/www.comet.com\/docs\/v2\/guides\/artifacts\/using-artifacts\/\">artifact management<\/a>\u00a0to capture, version, and manage data snapshots and models, which helps maintain data integrity and trace experiment lineage effectively.<\/li>\n\n\n\n<li><strong>Experiment Tracking<\/strong>:\u00a0<a href=\"\/signup\/\" target=\"_blank\" rel=\"noreferrer noopener\">Comet<\/a>\u00a0automatically tracks your code, experiments, and results, allowing you to visually compare different runs and configurations.<\/li>\n\n\n\n<li><strong>Model Optimization<\/strong>: It offers tools to compare different models side by side, analyze hyperparameters, and track model performance across various metrics.<\/li>\n\n\n\n<li><strong>Collaboration and Sharing<\/strong>: Share findings and models with colleagues or the ML community, enhancing team collaboration and knowledge transfer.<\/li>\n\n\n\n<li><strong>Reproducibility<\/strong>: By logging every detail of the experiment setup, Comet ensures experiments are reproducible, making it easier to debug and iterate.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"7f18\">You may be wondering why we didn\u2019t choose&nbsp;<strong>MLflow<\/strong>, for example [<a href=\"https:\/\/qdrant.tech\/?utm_source=decodingml&amp;utm_medium=referral&amp;utm_campaign=llm-course\" target=\"_blank\" rel=\"noreferrer noopener\">2<\/a>]:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"\/signup\/\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>Comet<\/strong><\/a>\u00a0excels in user interface design, providing a clean, intuitive experience for tracking experiments and models.<\/li>\n\n\n\n<li>It offers robust collaboration tools, making it 
easier for teams to work together on machine learning projects.<\/li>\n\n\n\n<li><a href=\"\/signup\/\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>Comet<\/strong><\/a>\u00a0provides comprehensive security features that help protect data and models, an important consideration for enterprises.<\/li>\n\n\n\n<li>It has superior scalability, supporting larger datasets and more complex model training scenarios.<\/li>\n\n\n\n<li>The platform allows for more detailed tracking and analysis of experiments compared to MLflow.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"2a93\">Comet Variables<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"2faf\">When integrating Comet into your projects, you\u2019ll need to set up several environment variables to manage the authentication and configuration:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code>COMET_API_KEY<\/code>: Your unique API key that authenticates your interactions with the Comet API.<\/li>\n\n\n\n<li><code>COMET_PROJECT<\/code>: The project name under which your experiments will be logged.<\/li>\n\n\n\n<li><code>COMET_WORKSPACE<\/code>: The workspace name that organizes various projects and experiments.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"352d\">The Importance of Data Versioning in MLOps<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"bee0\"><strong>Data versioning<\/strong>&nbsp;is the practice of keeping a record of multiple versions of datasets used in training machine learning models. 
This practice is essential for several reasons:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Reproducibility<\/strong>: It ensures that experiments can be reproduced using the exact same data, which is crucial for validating and comparing machine learning models.<\/li>\n\n\n\n<li><strong>Model lineage and auditing<\/strong>: If a model\u2019s performance changes unexpectedly, data versioning allows teams to revert to previous data states to identify issues.<\/li>\n\n\n\n<li><strong>Collaboration and Experimentation<\/strong>: Teams can experiment with different data versions to see how changes affect model performance without losing the original data setups.<\/li>\n\n\n\n<li><strong>Regulatory Compliance<\/strong>: In many industries, keeping track of data modifications and training environments is required for compliance with regulations.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"351d\">Comet\u2019s Artifacts<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Version Control<\/strong>: Artifacts in Comet are versioned, allowing you to track changes and iterate on datasets and models efficiently.<\/li>\n\n\n\n<li><strong>Immutability<\/strong>: Once created, artifacts are immutable, ensuring that data integrity is maintained throughout the lifecycle of your projects.<\/li>\n\n\n\n<li><strong>Metadata and Tagging<\/strong>: You can enhance artifacts with metadata and tags, making them easier to search and organize within Comet.<\/li>\n\n\n\n<li><strong>Alias Management<\/strong>: Artifacts can be assigned aliases to simplify references to versions, streamlining workflow and reference.<\/li>\n\n\n\n<li><strong>External Storage<\/strong>: Supports integration with external storage solutions like Amazon S3, enabling scalable and secure data management.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"2c20\">The provided&nbsp;<code>push_to_comet<\/code>&nbsp;function is a key part of this process.<\/p>\n\n\n\n<pre 
class=\"wp-block-code\"><code> def push_to_comet(\n        self,\n        train_test_split: tuple&#91;list&#91;dict], list&#91;dict]],\n        data_type: str,\n        collection_name: str,\n        output_dir: Path = Path(\"generated_dataset\"),\n    ) -&gt; None:\n        output_dir.mkdir(exist_ok=True)\n\n        try:\n            logger.info(f\"Starting to push data to Comet: {collection_name}\")\n\n            # Start a Comet experiment that the artifact will be attached to\n            experiment = start()\n\n            training_data, testing_data = train_test_split\n\n            file_name_training_data = output_dir \/ f\"{collection_name}_training.json\"\n            file_name_testing_data = output_dir \/ f\"{collection_name}_testing.json\"\n\n            # Write each split to its own local JSON file\n            logger.info(f\"Writing training data to file: {file_name_training_data}\")\n            with file_name_training_data.open(\"w\") as f:\n                json.dump(training_data, f)\n\n            logger.info(f\"Writing testing data to file: {file_name_testing_data}\")\n            with file_name_testing_data.open(\"w\") as f:\n                json.dump(testing_data, f)\n\n            logger.info(\"Data written to file successfully\")\n\n            # Bundle both files into a single versioned artifact\n            artifact = Artifact(f\"{data_type}-instruct-dataset\")\n            artifact.add(str(file_name_training_data))\n            artifact.add(str(file_name_testing_data))\n            logger.info(\"Artifact created.\")\n\n            # Upload the artifact, then close the experiment\n            experiment.log_artifact(artifact)\n            experiment.end()\n            logger.info(\"Artifact pushed to Comet successfully.\")\n\n        except Exception:\n            logger.exception(\n                \"Failed to create Comet artifact and push it to Comet.\",\n            ) <\/code><\/pre>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Experiment Initialization<\/strong>: An experiment is created using the project settings. This ties all actions, like logging artifacts, to a specific experimental run.<\/li>\n\n\n\n<li><strong>Data Saving<\/strong>: The training and testing splits are saved locally as JSON files. 
This file format is versatile and widely used, making it a good choice for data interchange.<\/li>\n\n\n\n<li><strong>Artifact Creation and Logging<\/strong>: An artifact is a versioned object in Comet that can be associated with an experiment. By logging artifacts, you keep a record of all data versions used throughout the project lifecycle.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"6d8d\">After running the script that invokes the&nbsp;<code>push_to_comet<\/code>&nbsp;function, Comet will update with new data artifacts, each representing a different dataset version. This is a crucial step in ensuring that all your data versions are logged and traceable within your MLOps environment.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"f8ec\">What to Expect in Comet<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"0d78\">Here is what you should see in Comet after successfully executing the script:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Artifacts Section<\/strong>: Navigate to the \u201cArtifacts\u201d tab in your Comet dashboard.<\/li>\n<\/ol>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*33dazDcgppFs36NEbeJz9w.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Screenshot from Comet\u2019s dashboard<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>List of Artifacts<\/strong>: You will see entries for each type of data you\u2019ve processed and saved. 
For example, if you have cleaned and versioned articles and posts, they will appear as separate artifacts.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*3MvzLL9YBYOgzVSgUNUc2A.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Screenshot from Comet\u2019s Artifact dashboard<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Artifact Versions<\/strong>: Each artifact can have multiple versions. Each time you run the script with a new or updated dataset, a new version of the respective artifact is created.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*V46XcY4V29xMB0Ml42tETg.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Screenshot from a Comet Artifact<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Each version is timestamped and stored with a unique ID, allowing you to track changes over time or revert to previous versions if necessary.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"59da\">We will have a training and testing JSON file:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*Fy1ExcuZBP28mLVeec9tCg.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Inspecting a specific version of a Comet artifact<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"82c6\">Here\u2019s an example of what the final version of&nbsp;<code>cleaned_articles_training.json<\/code>&nbsp;might look like, ready for the fine-tuning task:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:2000\/1*qkhDX-1DV-4svhIMovXEgw.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Inspecting a specific file of a Comet ML artifact<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Also, we made 
our&nbsp;<strong>artifacts publicly available<\/strong>, so you can take a look, play around with them, and even use them to fine-tune the LLM in case you don\u2019t want to generate them yourself:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/www.comet.com\/decodingml\/artifacts\/articles-instruct-dataset?utm_source=decoding_ml&amp;utm_medium=partner&amp;utm_content=medium\">articles-instruct-dataset<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/decodingml\/artifacts\/posts-instruct-dataset?utm_source=decoding_ml&amp;utm_medium=partner&amp;utm_content=medium\">posts-instruct-dataset<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/decodingml\/artifacts\/repositories-instruct-dataset?utm_source=decoding_ml&amp;utm_medium=partner&amp;utm_content=medium\">repositories-instruct-dataset<\/a><\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">This lesson taught you how to generate custom instruct datasets from your raw data using other LLMs.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We\u2019ve also shown you how to push the dataset to a data registry, such as Comet ML\u2019s artifacts, to version, track, and share it within your system.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In Lesson 7, you will learn to use the generated dataset to fine-tune a Llama 3.1 8B LLM as your LLM Twin using Unsloth, TRL, and AWS SageMaker.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\ud83d\udd17 Consider checking out the GitHub repository [1] and supporting us with a \u2b50\ufe0f<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"da55\">References<\/h3>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"5f10\">Literature<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"158f\">[1]&nbsp;<a href=\"https:\/\/github.com\/decodingml\/llm-twin-course\" target=\"_blank\" rel=\"noreferrer noopener\">Your LLM Twin Course \u2014 GitHub Repository<\/a>&nbsp;(2024), Decoding ML GitHub Organization<\/p>\n\n\n\n<p 
class=\"wp-block-paragraph\" id=\"0be5\">[2]&nbsp;<a href=\"https:\/\/neptune.ai\/blog\/best-mlflow-alternatives\" target=\"_blank\" rel=\"noreferrer noopener\">MLFlow Alternatives<\/a>, Neptune.ai<\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"bfa1\">Images<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"425b\">If not otherwise stated, all images are created by the author.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Welcome to&nbsp;Lesson 6 of 12&nbsp;in our free course series,&nbsp;LLM Twin: Building Your Production-Ready AI Replica. You\u2019ll learn how to use LLMs, vector DBs, and LLMOps best practices to design, train, and deploy a production-ready \u201cLLM twin\u201d of yourself. This AI character will write like you, incorporating your style, personality, and voice into an LLM. [&hellip;]<\/p>\n","protected":false},"author":128,"featured_media":9963,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[65,6,7],"tags":[14,64,85,86,89,90,52],"coauthors":[222,223],"class_list":["post-9951","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-llmops","category-machine-learning","category-tutorials","tag-comet-ml","tag-cometllm","tag-data-pipeline","tag-data-quality","tag-feature-engineering","tag-feature-pipeline","tag-llm"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Dataset Preparation for LLM Fine Tuning<\/title>\n<meta name=\"description\" content=\"Follow this code tutorial to auto-generate instruction datasets for fine-tuning LLMs.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" 
\/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/llm-fine-tuning-dataset\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Turning Raw Data Into Fine-Tuning Datasets\" \/>\n<meta property=\"og:description\" content=\"Follow this code tutorial to auto-generate instruction datasets for fine-tuning LLMs.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/llm-fine-tuning-dataset\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2024-05-20T14:55:50+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-04-29T12:46:40+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/05\/fine-tuning-llm-dataset-1024x585.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1024\" \/>\n\t<meta property=\"og:image:height\" content=\"585\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Paul Iusztin, Decoding ML\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Cometml\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Paul Iusztin, Decoding ML\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"11 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. 
-->","yoast_head_json":{"title":"Dataset Preparation for LLM Fine Tuning","description":"Follow this code tutorial to auto-generate instruction datasets for fine-tuning LLMs.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/llm-fine-tuning-dataset\/","og_locale":"en_US","og_type":"article","og_title":"Turning Raw Data Into Fine-Tuning Datasets","og_description":"Follow this code tutorial to auto-generate instruction datasets for fine-tuning LLMs.","og_url":"https:\/\/www.comet.com\/site\/blog\/llm-fine-tuning-dataset\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2024-05-20T14:55:50+00:00","article_modified_time":"2025-04-29T12:46:40+00:00","og_image":[{"width":1024,"height":585,"url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/05\/fine-tuning-llm-dataset-1024x585.png","type":"image\/png"}],"author":"Paul Iusztin, Decoding ML","twitter_card":"summary_large_image","twitter_creator":"@Cometml","twitter_site":"@Cometml","twitter_misc":{"Written by":"Paul Iusztin, Decoding ML","Est. 
reading time":"11 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/llm-fine-tuning-dataset\/#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/llm-fine-tuning-dataset\/"},"author":{"name":"Paul Iusztin","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/87bf0cb600025605b68dcd2f0d597560"},"headline":"Turning Raw Data Into Fine-Tuning Datasets","datePublished":"2024-05-20T14:55:50+00:00","dateModified":"2025-04-29T12:46:40+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/llm-fine-tuning-dataset\/"},"wordCount":2162,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/llm-fine-tuning-dataset\/#primaryimage"},"thumbnailUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/05\/fine-tuning-llm-dataset.png","keywords":["Comet ML","CometLLM","Data Pipeline","Data Quality","Feature Engineering","Feature pipeline","LLM"],"articleSection":["LLMOps","Machine Learning","Tutorials"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/llm-fine-tuning-dataset\/","url":"https:\/\/www.comet.com\/site\/blog\/llm-fine-tuning-dataset\/","name":"Dataset Preparation for LLM Fine Tuning","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/llm-fine-tuning-dataset\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/llm-fine-tuning-dataset\/#primaryimage"},"thumbnailUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/05\/fine-tuning-llm-dataset.png","datePublished":"2024-05-20T14:55:50+00:00","dateModified":"2025-04-29T12:46:40+00:00","description":"Follow this code tutorial to auto-generate instruction datasets for fine-tuning 
LLMs.","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/llm-fine-tuning-dataset\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/llm-fine-tuning-dataset\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/llm-fine-tuning-dataset\/#primaryimage","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/05\/fine-tuning-llm-dataset.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/05\/fine-tuning-llm-dataset.png","width":1792,"height":1024,"caption":"artistic rendering of a human face with neural networks to represent llm fine tuning"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/llm-fine-tuning-dataset\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"Turning Raw Data Into Fine-Tuning Datasets"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, 
Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/87bf0cb600025605b68dcd2f0d597560","name":"Paul Iusztin","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/82264b94fb97af87b79646edc7e4fd81","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/05\/cropped-paul-iusztin-96x96.webp","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/05\/cropped-paul-iusztin-96x96.webp","caption":"Paul 
Iusztin"},"sameAs":["https:\/\/decodingml.substack.com\/"],"url":"https:\/\/www.comet.com\/site\/blog\/author\/paul-iusztin\/"}]}},"jetpack_featured_media_url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/05\/fine-tuning-llm-dataset.png","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/9951","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/128"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=9951"}],"version-history":[{"count":2,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/9951\/revisions"}],"predecessor-version":[{"id":15801,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/9951\/revisions\/15801"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media\/9963"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=9951"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=9951"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=9951"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=9951"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}