{"id":9973,"date":"2024-06-21T13:05:01","date_gmt":"2024-06-21T21:05:01","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=9973"},"modified":"2025-04-29T12:46:00","modified_gmt":"2025-04-29T12:46:00","slug":"mistral-llm-fine-tuning","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/mistral-llm-fine-tuning\/","title":{"rendered":"8B Parameters, 1 GPU, No Problems: The Ultimate LLM Fine-tuning Pipeline"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\"><em>Welcome to&nbsp;<strong>Lesson 7 of 12<\/strong>&nbsp;in our free course series,&nbsp;<strong>LLM Twin: Building Your Production-Ready AI Replica<\/strong>. You\u2019ll learn how to use LLMs, vector DBs, and LLMOps best practices to design, train, and deploy a production-ready \u201cLLM twin\u201d of yourself. This AI character will write like you, incorporating your style, personality, and voice into an LLM. For a full overview of course objectives and prerequisites, start with&nbsp;<a href=\"https:\/\/www.comet.com\/site\/blog\/an-end-to-end-framework-for-production-ready-llm-systems-by-building-your-llm-twin\/\">Lesson 1<\/a>.<\/em><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Lessons<\/strong><\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/an-end-to-end-framework-for-production-ready-llm-systems-by-building-your-llm-twin\/\">An End-to-End Framework for Production-Ready LLM Systems by Building Your LLM Twin<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/the-importance-of-data-pipelines-in-the-era-of-generative-ai\/\">Your Content is Gold: I Turned 3 Years of Blog Posts into an LLM Training<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/llm-twin-3-change-data-capture\/\">I Replaced 1000 Lines of Polling Code with 50 Lines of CDC Magic<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/streaming-pipelines-for-fine-tuning-llms\/\">SOTA Python Streaming 
Pipelines for Fine-tuning LLMs and RAG \u2014 in Real-Time!<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/advanced-rag-algorithms-optimize-retrieval\/\">The 4 Advanced RAG Algorithms You Must Know to Implement<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/llm-fine-tuning-dataset\/\">Turning Raw Data Into Fine-Tuning Datasets<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/mistral-llm-fine-tuning\/\">8B Parameters, 1 GPU, No Problems: The Ultimate LLM Fine-tuning Pipeline<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/llm-evaluation-best-practices\/\">The Engineer\u2019s Framework for LLM &amp; RAG Evaluation<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/llm-rag-inference-pipelines\/\">Beyond Proof of Concept: Building RAG Systems That Scale<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/rag-evaluation-framework-ragas\/\">The Ultimate Prompt Monitoring Pipeline<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/refactoring-rag-retrieval\/\">[Bonus] Build a scalable RAG ingestion pipeline using 74.3% less code<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/multi-index-rag-apps\/\">[Bonus] Build Multi-Index Advanced RAG Apps<\/a><\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p class=\"wp-block-paragraph\">This lesson will show you how to fine-tune open-source LLMs from Hugging Face using Unsloth, TRL, AWS SageMaker and Comet ML to:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Apply MLOps best practices with Hugging Face and Comet ML;<\/li>\n\n\n\n<li>Use VRAM efficiently during fine-tuning with Unsloth and TRL;<\/li>\n\n\n\n<li>Operationalize your training pipelines with AWS SageMaker.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">We will primarily focus on engineering scalable and reproducible fine-tuning pipelines (using LLMOps 
and SWE best practices) rather than digging into fine-tuning techniques.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We will stick to what usually works for fine-tuning, such as using LoRA for supervised fine-tuning (SFT).<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*EreDp_DC_E1UOOnXqkPF3g.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Figure 1: LLM fine-tuning production-ready pipeline with SageMaker, Unsloth and Comet<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"d0c9\">Table of Contents<\/h2>\n\n\n\n<ol class=\"wp-block-list\">\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/mistral-llm-fine-tuning\/#3d5g\">Loading the training dataset from the data registry<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/mistral-llm-fine-tuning\/#bnb8\">Digging into SFT using Unsloth, TRL and Comet<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/mistral-llm-fine-tuning\/#bnb6\">Saving the fine-tuned LLM to a model registry<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/mistral-llm-fine-tuning\/#34fb\">Scaling fine-tuning with AWS SageMaker<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/mistral-llm-fine-tuning\/#ff78\">Running the training pipeline on AWS SageMaker<\/a><\/li>\n<\/ol>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"3d5g\">1. Loading the training dataset from the data registry<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">In Lesson 6, we taught you how to generate an instruct fine-tuning dataset from raw custom data collected from various social platforms.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Ultimately, we stored and versioned the fine-tuning dataset in a data registry powered by Comet ML. 
The data registry uses artifacts to track large files and metadata such as tags, versions, and dataset size.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Figure 2 shows all the available artifacts in Comet ML.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Also, we made our artifacts publicly available, so you can take a look, play around with them, and even use them to fine-tune the LLM in case you don\u2019t want to compute them yourself:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/www.comet.com\/decodingml\/artifacts\/articles-instruct-dataset?utm_source=decoding_ml&amp;utm_medium=partner&amp;utm_content=medium\">articles-instruct-dataset<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/decodingml\/artifacts\/posts-instruct-dataset?utm_source=decoding_ml&amp;utm_medium=partner&amp;utm_content=medium\">posts-instruct-dataset<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/decodingml\/artifacts\/repositories-instruct-dataset?utm_source=decoding_ml&amp;utm_medium=partner&amp;utm_content=medium\">repositories-instruct-dataset<\/a><\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:2000\/1*KCjG_4frmJ3wbtIIznpjmQ.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Figure 2: Comet ML fine-tuning dataset artifacts.<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">For example, in Figure 3, you can observe what our&nbsp;<strong>articles-instruct-dataset<\/strong>&nbsp;artifact looks like. It has 3 versions available, the latest being version 12.0.0.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">By versioning your fine-tuning data, you ensure lineage, which means you always know exactly what data your model was trained on. 
This is critical for reproducibility, one of the pillars of MLOps.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:2000\/1*bvlZ2H3mXaT7QV8-7wme2A.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Figure 3: What the articles-instruct-dataset looks like in Comet.<\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">How can we work with these artifacts?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">If you have worked with Hugging Face datasets, you will find Comet artifacts similar. Conceptually, they are the same thing, but Comet allows you to quickly build a private data registry on top of your private data.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Let\u2019s dig into the code to see how they work.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code> import json\nfrom pathlib import Path\n\nfrom comet_ml import Artifact, Experiment\nfrom comet_ml.artifacts import ArtifactAsset\nfrom datasets import Dataset\n\n\nclass DatasetClient:\n    def __init__(\n        self,\n        output_dir: Path = Path(\".\/finetuning_dataset\"),\n    ) -&gt; None:\n        # Local directory where the downloaded artifacts are stored\n        self.output_dir = output_dir\n        self.output_dir.mkdir(parents=True, exist_ok=True) <\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">First, we define a&nbsp;<strong>DatasetClient<\/strong>&nbsp;class. It creates a dedicated directory for storing our downloaded datasets.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code> def download_dataset(self, dataset_id: str, split: str = \"train\") -&gt; Dataset:\n      assert split in &#91;\"train\", \"test\"], \"Split must be either 'train' or 'test'\"\n  \n      if \"\/\" in dataset_id:\n          tokens = dataset_id.split(\"\/\")\n          assert (\n              len(tokens) == 2\n          ), f\"Wrong format for dataset_id '{dataset_id}'. 
It should contain at most one '\/' character and follow the template 'comet_ml_workspace\/comet_ml_artifact_name'.\"\n          workspace, artifact_name = tokens\n          experiment = Experiment(workspace=workspace)\n      else:\n          artifact_name = dataset_id\n          experiment = Experiment()\n  \n      artifact = self._download_artifact(artifact_name, experiment)\n      asset = self._artifact_to_asset(artifact, split)\n      dataset = self._load_data(asset)\n  \n      experiment.end()\n  \n      return dataset <\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This is our primary entry point: a high-level method that orchestrates the entire dataset download process. It parses the optional workspace prefix, validates the inputs, and coordinates the three main steps: downloading the artifact, extracting the requested split, and loading the data.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>  def _download_artifact(self, artifact_name: str, experiment) -&gt; Artifact:\n      try:\n          logged_artifact = experiment.get_artifact(artifact_name)\n          artifact = logged_artifact.download(self.output_dir)\n      except Exception as e:\n          print(f\"Error retrieving artifact: {str(e)}\")\n          raise\n  \n      print(f\"Successfully downloaded {artifact_name} at location {self.output_dir}\")\n      return artifact <\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This method downloads the artifact from Comet into our output directory. If anything goes wrong, it logs the error and re-raises it, so failures surface immediately.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>  def _artifact_to_asset(self, artifact: Artifact, split: str) -&gt; ArtifactAsset:\n      if len(artifact.assets) == 0:\n          raise RuntimeError(\"Artifact has no assets\")\n      elif len(artifact.assets) != 2:\n          raise RuntimeError(\n              f\"Artifact has {len(artifact.assets)} assets, which is invalid. 
It should have only 2.\"\n          )\n  \n      print(f\"Picking split = '{split}'\")\n      asset = &#91;asset for asset in artifact.assets if split in asset.logical_path]&#91;0]\n      return asset <\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Here, we validate that the artifact contains exactly two assets, one per split, and pick the asset whose logical path contains the requested split (train or test).<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code> def _load_data(self, asset: ArtifactAsset) -&gt; Dataset:\n      data_file_path = asset.local_path_or_data\n      with open(data_file_path, \"r\") as file:\n          data = json.load(file)\n  \n      dataset_dict = {k: &#91;str(d&#91;k]) for d in data] for k in data&#91;0].keys()}\n      dataset = Dataset.from_dict(dataset_dict)\n  \n      print(\n          f\"Successfully loaded dataset from artifact, num_samples = {len(dataset)}\",\n      )\n  \n      return dataset <\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The final piece reads the downloaded JSON file, converts every field to a string, and turns the records into a Hugging Face Dataset object. This format is well supported across the LLM tooling ecosystem, including TRL, which we will use for fine-tuning.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">What does our data look like?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">We have ~300 training samples stored in our Comet ML artifacts that follow the structure below:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code> &#91;\n...\n {\n    \"instruction\": \"Describe the old architecture of the RAG feature pipeline and its robust design principles.\",\n    \"content\": \"Our goal is to help enterprises put vectors at the center of their\\n&gt; data &amp; compute infrastructure, to build smarter and more reliable\\n&gt; software._\\n\\nTo conclude, Superlinked is a framework that puts the vectors in the center of\\ntheir universe and allows you to:\\n\\n  * chunk and embed embeddings;\\n\\n  * store multi-index vectors in a vector DB;\\n\\n  * do complex 
vector search queries on top of your data. Screenshot from Superlinkeds landing page\\n\\n* * *\\n\\n## **2\\\\. The old architecture of the RAG feature pipeline**\\n\\nHere is a quick recap of the critical aspects of the architecture of the RAG\\nfeature pipeline presented in the 4th lesson of the LLM Twin course. _We are working with**3 different data categories** :_\\n\\n  * posts (e.g., LinkedIn, Twitter)\\n\\n  * articles (e.g., Medium, Substack, or any other blog)\\n\\n  * repositories (e.g., GitHub, GitLab)\\n\\nEvery data category has to be preprocessed differently. For example, you want\\nto chunk the posts into smaller documents while keeping the articles in bigger\\nones. _The**solution** is based on **CDC** , a **queue,** a **streaming engine,**\\nand a **vector DB:**_\\n\\n-&gt; The raw data is collected from multiple social platforms and is stored in MongoDB. (Lesson 2)\\n\\n  CDC adds any change made to the MongoDB to a RabbitMQ queue (Lesson 3). the RabbitMQ queue stores all the events until they are processed. The Bytewax streaming engine reads the messages from the RabbitMQ queue and\\ncleans, chunks, and embeds them. The processed data is uploaded to a Qdrant vector DB. The old feature\/streaming pipeline architecture that was presented in Lesson\\n4. ### **Why is this design robust?**\\n\\nHere are 4 core reasons:\\n\\n  1. The **data** is **processed** in **real-time**. 2. **Out-of-the-box recovery system:** If the streaming pipeline fails to process a message, it will be added back to the queue\\n\\n  3. **Lightweight:** No need for any diffs between databases or batching too many records\\n\\n  4.\"\n  },\n...\n] <\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">300 samples are not enough for SFT. 
Usually, you need somewhere between 10k and 100k instruct-answer pairs.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">However, they are enough to teach you how to build an end-to-end fine-tuning pipeline that can easily scale to datasets of 100k samples if you adapt it to your needs.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"bnb8\">2. Digging into SFT using Unsloth, TRL and Comet<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The next step is to define our fine-tuning strategy. We will run only a single SFT step with LoRA to keep things simple and cost-effective.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We will use&nbsp;<a href=\"https:\/\/github.com\/unslothai\/unsloth\">Unsloth<\/a>&nbsp;and&nbsp;<a href=\"https:\/\/github.com\/huggingface\/trl\">TRL<\/a>&nbsp;to define our fine-tuning script.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Unsloth is the new kid on the block for fine-tuning LLMs: it makes training up to 2x faster and up to 60% more memory-efficient than using Hugging Face directly.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This translates to faster experiments, which means more iterations, feedback, and novelty with lower costs.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Also, we will use&nbsp;<a href=\"https:\/\/www.comet.com\/site\/?utm_source=decoding_ml&amp;utm_medium=partner&amp;utm_content=medium\">Comet<\/a>&nbsp;as our experiment tracker to log the training metrics of multiple experiments, compare them, and pick the best model to push to production.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><em>\ud83d\udd17 See a&nbsp;<strong>concrete example<\/strong>&nbsp;of an&nbsp;<strong>experiment tracker<\/strong>&nbsp;by checking out one of our&nbsp;<a href=\"https:\/\/www.comet.com\/decodingml\/llm-twin\/4e649019cdbb49e1967b5f1b33ff9c2d?compareXAxis=step&amp;experiment-tab=panels&amp;showOutliers=true&amp;smoothing=0&amp;utm_content=medium&amp;utm_medium=partner&amp;utm_source=decoding_ml&amp;xAxis=step\">experiments<\/a>&nbsp;\u2190<\/em><\/p>\n\n\n\n<p 
class=\"wp-block-paragraph\">Now, let\u2019s dig into the code. Unsloth and TRL make it straightforward.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code> ALPACA_TEMPLATE = \"\"\"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{}\n\n### Response:\n{}\"\"\" <\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">We will use the Alpaca prompt format to turn each instruction-answer pair from our instruct dataset into a single training prompt.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code> from typing import List\n\n\ndef finetune(\n    model_name: str,\n    output_dir: str,\n    dataset_id: str,\n    max_seq_length: int = 2048,\n    load_in_4bit: bool = False,\n    lora_rank: int = 32,\n    lora_alpha: int = 32,\n    lora_dropout: float = 0.0,\n    target_modules: List&#91;str] = &#91;\n        \"q_proj\", \"k_proj\", \"v_proj\",\n        \"up_proj\", \"down_proj\", \"o_proj\",\n        \"gate_proj\",\n    ],\n    chat_template: str = \"chatml\",\n    learning_rate: float = 3e-4,\n    num_train_epochs: int = 3,\n    per_device_train_batch_size: int = 2,\n    gradient_accumulation_steps: int = 8,\n    is_dummy: bool = True,\n) -&gt; tuple: <\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Next, we define the fine-tuning function and its parameters: the model configuration, the LoRA parameters, and the training hyperparameters.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>   model, tokenizer = load_model(\n          model_name, max_seq_length, load_in_4bit,\n          lora_rank, lora_alpha, lora_dropout,\n          target_modules, chat_template,\n      )\n      EOS_TOKEN = tokenizer.eos_token\n      print(f\"Setting EOS_TOKEN to {EOS_TOKEN}\")\n  \n      if is_dummy:\n          num_train_epochs = 1\n          print(f\"Training in dummy mode. Setting num_train_epochs to '{num_train_epochs}'\")\n          print(f\"Training in dummy mode. 
Reducing dataset size to '400'.\") <\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Next, we load the model and tokenizer and handle the dummy mode settings (a single epoch on 400 samples) for quick testing.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code> def format_samples_sft(examples):\n        text = &#91;]\n        for instruction, output in zip(\n            examples&#91;\"instruction\"], examples&#91;\"content\"], strict=False\n        ):\n            message = ALPACA_TEMPLATE.format(instruction, output) + EOS_TOKEN\n            text.append(message)\n        return {\"text\": text} <\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This inner function formats each instruction-content pair with the Alpaca template and appends the EOS token, so the model learns where a response ends.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code> dataset_client = DatasetClient()\n    custom_dataset = dataset_client.download_dataset(dataset_id=dataset_id)\n    static_dataset = load_dataset(\"mlabonne\/FineTome-Alpaca-100k\", split=\"train&#91;:10000]\")\n    dataset = concatenate_datasets(&#91;custom_dataset, static_dataset])\n    if is_dummy:\n        dataset = dataset.select(range(400))\n    print(f\"Loaded dataset with {len(dataset)} samples.\")\n\n    dataset = dataset.map(\n        format_samples_sft, batched=True, remove_columns=dataset.column_names\n    )\n    dataset = dataset.train_test_split(test_size=0.05)\n\n    print(\"Training dataset example:\")\n    print(dataset&#91;\"train\"]&#91;0]) <\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Next, we load our custom dataset from Comet, combine it with a static dataset, format every sample with format_samples_sft(), and hold out 5% of the data for evaluation.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">As we don\u2019t have enough fine-tuning data, we enrich our custom dataset with the first 10,000 samples of the open-source FineTome-Alpaca-100k dataset. 
This keeps the SFT step stable and avoids degrading the model.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code> trainer = SFTTrainer(\n          model=model,\n          tokenizer=tokenizer,\n          train_dataset=dataset&#91;\"train\"],\n          
eval_dataset=dataset&#91;\"test\"],\n          dataset_text_field=\"text\",\n          max_seq_length=max_seq_length,\n          dataset_num_proc=2,\n          packing=True,\n          args=TrainingArguments(\n              learning_rate=learning_rate,\n              num_train_epochs=num_train_epochs,\n              per_device_train_batch_size=per_device_train_batch_size,\n              gradient_accumulation_steps=gradient_accumulation_steps,\n              fp16=not is_bfloat16_supported(),\n              bf16=is_bfloat16_supported(),\n              logging_steps=1,\n              optim=\"adamw_8bit\",\n              weight_decay=0.01,\n              lr_scheduler_type=\"linear\",\n              per_device_eval_batch_size=per_device_train_batch_size,\n              warmup_steps=10,\n              output_dir=output_dir,\n              report_to=\"comet_ml\",\n              seed=0,\n          ),\n      )\n  \n      trainer.train()\n  \n      return model, tokenizer <\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This final section sets up the&nbsp;<em>SFT (Supervised Fine-Tuning)<\/em>&nbsp;trainer and runs training. It packs several short samples into each training sequence (<strong>packing=True<\/strong>), uses the memory-saving 8-bit AdamW optimizer, and picks bf16 or fp16 mixed precision depending on what the GPU supports.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Enabling experiment tracking with Comet ML is as simple as setting&nbsp;<strong>report_to=\"comet_ml\"<\/strong>&nbsp;in the&nbsp;<strong>TrainingArguments<\/strong>&nbsp;class and defining the&nbsp;<strong>COMET_API_KEY<\/strong>,&nbsp;<strong>COMET_WORKSPACE<\/strong>&nbsp;and&nbsp;<strong>COMET_PROJECT_NAME<\/strong>&nbsp;environment variables.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Let\u2019s dig further into how the model is loaded using Unsloth.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code> def load_model(\n    model_name: str,\n    max_seq_length: int,\n    load_in_4bit: bool,\n    lora_rank: int,\n    lora_alpha: int,\n    lora_dropout: float,\n    target_modules: List&#91;str],\n    chat_template: str,\n) -&gt; tuple: 
<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The load_model function takes several essential parameters:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>model_name:<\/strong>\u00a0The identifier of the pre-trained model (e.g., \u201cmeta-llama\/Meta-Llama-3.1-8B\u201d)<\/li>\n\n\n\n<li><strong>max_seq_length:<\/strong>\u00a0Maximum number of tokens per training sequence<\/li>\n\n\n\n<li><strong>load_in_4bit:<\/strong>\u00a0Boolean flag for 4-bit quantization<\/li>\n\n\n\n<li><strong>lora_rank, lora_alpha, lora_dropout:<\/strong>\u00a0LoRA (Low-Rank Adaptation) parameters<\/li>\n\n\n\n<li><strong>target_modules:<\/strong>\u00a0Names of the model layers to apply LoRA to (e.g., q_proj, v_proj)<\/li>\n\n\n\n<li><strong>chat_template:<\/strong>\u00a0The conversation format template to use<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-code\"><code>  model, tokenizer = FastLanguageModel.from_pretrained(\n      model_name=model_name,\n      max_seq_length=max_seq_length,\n      load_in_4bit=load_in_4bit,\n  ) <\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This step from the&nbsp;<strong>load_model()<\/strong>&nbsp;function loads the pre-trained model and its tokenizer using Unsloth\u2019s&nbsp;<strong>FastLanguageModel<\/strong>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The&nbsp;<strong>load_in_4bit<\/strong>&nbsp;parameter is particularly interesting. It loads the base model weights in 4-bit precision, which cuts their memory footprint to roughly a quarter of what 16-bit weights need while largely preserving performance. Combining 4-bit quantization with LoRA is commonly known as QLoRA. 
Note that our default is&nbsp;<strong>load_in_4bit=False<\/strong>.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>   model = FastLanguageModel.get_peft_model(\n      model,\n      r=lora_rank,\n      lora_alpha=lora_alpha,\n      lora_dropout=lora_dropout,\n      target_modules=target_modules,\n  ) <\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Here\u2019s where the magic of LoRA happens. Instead of fine-tuning all model parameters, LoRA freezes the base weights and adds small trainable low-rank decomposition matrices to specific layers (defined in&nbsp;<strong>target_modules<\/strong>). 
Because only these small matrices are trained, fine-tuning needs much less memory and computation.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>lora_rank (r):<\/strong>\u00a0Determines the rank of the LoRA update matrices. A higher rank means more trainable parameters and more capacity.<\/li>\n\n\n\n<li><strong>lora_alpha:<\/strong>\u00a0Scaling factor for the LoRA updates, which are multiplied by lora_alpha \/ r.<\/li>\n\n\n\n<li><strong>lora_dropout:<\/strong>\u00a0Dropout probability applied in the LoRA layers to regularize training and prevent overfitting (our default of 0.0 disables it).<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-code\"><code>  tokenizer = get_chat_template(\n      tokenizer,\n      chat_template=chat_template,\n  ) <\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Finally, we configure the tokenizer with a specific chat template. This ensures that conversations are formatted consistently during training and inference. Common options include \u201cchatml\u201d (the ChatML format, our default) or other custom formats.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This loading pipeline is crucial for efficient fine-tuning because it:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Enables memory-efficient training through optional 4-bit quantization.<\/li>\n\n\n\n<li>Implements LoRA for parameter-efficient fine-tuning.<\/li>\n\n\n\n<li>Ensures consistent conversation formatting through chat templates.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Using this approach, you can fine-tune LLMs on consumer-grade hardware while still achieving good results.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To dig deeper into the theory of fine-tuning with LoRA, consider checking out this&nbsp;<a href=\"https:\/\/mlabonne.github.io\/blog\/posts\/2024-07-29_Finetune_Llama31.html\">article<\/a>&nbsp;written by Maxime Labonne: Fine-tune Llama 3.1 Ultra-Efficiently with Unsloth [2].<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"bnb6\">3. 
Saving the fine-tuned LLM to a model registry<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Just as we store, track, and version our data in a data registry, we do the same for our fine-tuned model by pushing it to a model registry.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">A common strategy when working with open-source models is to use the Hugging Face model registry to store and share your models, which we will also do in this lesson.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code> base_model_suffix = args.base_model_name.split(\"\/\")&#91;-1]\nsft_output_model_repo_id = f\"{args.model_output_huggingface_workspace}\/LLMTwin-{base_model_suffix}\"\n\n# Merge the LoRA weights into the base model, save it locally and push it to the Hugging Face Hub\nsave_model(\n    model,\n    tokenizer,\n    \"model_sft\",\n    push_to_hub=True,\n    repo_id=sft_output_model_repo_id,\n) <\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">First, we compute the output model ID based on our Hugging Face workspace (e.g., pauliusztin) and the new model name. For simplicity, we prefix the base model name with \u201cLLMTwin-\u201d (e.g., LLMTwin-Meta-Llama-3.1-8B).<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code> from typing import Any, Optional\n\n\ndef save_model(\n    model: Any,\n    tokenizer: Any,\n    output_dir: str,\n    push_to_hub: bool = False,\n    repo_id: Optional&#91;str] = None,\n) -&gt; None:\n    # Merge the LoRA adapters into the base weights and save the full model in 16-bit precision\n    model.save_pretrained_merged(output_dir, tokenizer, save_method=\"merged_16bit\")\n\n    if push_to_hub and repo_id:\n        model.push_to_hub_merged(repo_id, tokenizer, save_method=\"merged_16bit\") <\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">With&nbsp;<strong>save_method=\"merged_16bit\"<\/strong>, Unsloth merges the LoRA adapters into the base model weights, so the saved model can be loaded without separate adapter files. 
We save the model locally and push it to Hugging Face, as seen in Figure 4.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:1400\/1*XUgzVaO-EDtGt2v0s5yL-Q.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Figure 4: Fine-tuned model stored in the Hugging Face model registry. 
Access our fine-tuned LLM here.<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">You can then load a specific version of the model from the model registry for evaluation or serving.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Almost all ML platforms offer a model registry, such as Comet, W&amp;B, Neptune and more, but Hugging Face is a common choice for open-source LLMs.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The beauty of model registries is that, if you haven\u2019t fine-tuned your own LLM Twin, you can use ours to finish the course:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2192 Link to our&nbsp;<a href=\"https:\/\/huggingface.co\/pauliusztin\/LLMTwin-Meta-Llama-3.1-8B\">pauliusztin\/LLMTwin-Meta-Llama-3.1-8B<\/a>&nbsp;model.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2192 Full code: the&nbsp;<a href=\"https:\/\/github.com\/decodingml\/llm-twin-course\/blob\/main\/src\/training_pipeline\/finetune.py\">finetune.py<\/a>&nbsp;script.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"34fb\">4. Scaling fine-tuning with AWS SageMaker<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">So far, we have walked you through the fine-tuning script. A standard approach is to run it locally or in Google Colab (or a similar notebook environment), but what if we want to scale or automate training?<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">A 7\u20138B LLM can fit on a Google Colab machine when using LoRA\/QLoRA, but it gets trickier for larger models.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Also, Google Colab works well with small open-source datasets, but what if you work with terabytes or petabytes of data?<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This is where tools such as AWS SageMaker come in. 
They let you run your fine-tuning script on GPU clusters in AWS and give you robust access to public or private datasets of any size stored in S3 (you could even host your Comet ML artifacts on S3).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Code-wise, SageMaker makes it easy to set everything up, as seen in the code snippet below, where we:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Locate the requirements.txt file with the Python dependencies used for training.<\/li>\n\n\n\n<li>Retrieve your Hugging Face username.<\/li>\n\n\n\n<li>Define the SageMaker job using the HuggingFace estimator, a wrapper dedicated to Hugging Face training jobs that runs on Docker images preinstalled with the transformers and PyTorch libraries.<\/li>\n\n\n\n<li>Kick off the training.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Beautiful and easy.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code> from pathlib import Path\n\nfrom huggingface_hub import HfApi\nfrom sagemaker.huggingface import HuggingFace\n\n# Assumes `settings` (an instance of the Settings class shown below) and `logger`\n# are imported from the project's own modules.\n\nfinetuning_dir = Path(__file__).resolve().parent\nfinetuning_requirements_path = finetuning_dir \/ \"requirements.txt\"\n\n\ndef run_finetuning_on_sagemaker(\n    num_train_epochs: int = 3,\n    per_device_train_batch_size: int = 2,\n    learning_rate: float = 3e-4,\n    is_dummy: bool = False,\n) -&gt; None:\n    if not finetuning_requirements_path.exists():\n        raise FileNotFoundError(\n            f\"The file {finetuning_requirements_path} does not exist.\"\n        )\n\n    api = HfApi()\n    user_info = api.whoami(token=settings.HUGGINGFACE_ACCESS_TOKEN)\n    huggingface_user = user_info&#91;\"name\"]\n    logger.info(f\"Current Hugging Face user: {huggingface_user}\")\n\n    hyperparameters = {\n        \"base_model_name\": settings.HUGGINGFACE_BASE_MODEL_ID,\n        \"dataset_id\": settings.DATASET_ID,\n        \"num_train_epochs\": num_train_epochs,\n        \"per_device_train_batch_size\": per_device_train_batch_size,\n        \"learning_rate\": learning_rate,\n        
\"model_output_huggingface_workspace\": huggingface_user,\n    }\n    if is_dummy:\n        hyperparameters&#91;\"is_dummy\"] = True\n\n    # Create the HuggingFace SageMaker estimator\n    huggingface_estimator = HuggingFace(\n        entry_point=\"finetune.py\",\n        source_dir=str(finetuning_dir),\n        instance_type=\"ml.g5.2xlarge\",\n        instance_count=1,\n        role=settings.AWS_ARN_ROLE,\n        transformers_version=\"4.36\",\n        pytorch_version=\"2.1\",\n        py_version=\"py310\",\n        hyperparameters=hyperparameters,\n        requirements_file=finetuning_requirements_path,\n        environment={\n            \"HUGGING_FACE_HUB_TOKEN\": settings.HUGGINGFACE_ACCESS_TOKEN,\n            \"COMET_API_KEY\": settings.COMET_API_KEY,\n            \"COMET_WORKSPACE\": settings.COMET_WORKSPACE,\n            \"COMET_PROJECT_NAME\": settings.COMET_PROJECT,\n        },\n    )\n\n    # Start the training job on SageMaker.\n    huggingface_estimator.fit() <\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The&nbsp;<strong>hyperparameters<\/strong>&nbsp;dictionary is passed to the fine-tuning script as CLI arguments, while the&nbsp;<strong>environment<\/strong>&nbsp;dictionary is set as environment variables inside the training container. That\u2019s why we pass the credentials only through the environment argument.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Even though we train an 8B LLM, we managed to fit training on a single \u201c<strong>ml.g5.2xlarge<\/strong>\u201d instance, which has one NVIDIA A10G GPU with 24 GB of VRAM and costs ~$2\/hour.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The catch is that this is possible only because we fine-tune with Unsloth, which reduces memory consumption. 
 Without it, we could fit the training job only on an \u201c<strong>ml.g5.12xlarge<\/strong>\u201d instance with four A10G GPUs, which costs ~$9\/hour.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">So, yes, Unsloth is incredible!<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">That is a&nbsp;<strong>~78% reduction in costs<\/strong>&nbsp;(~$2 vs. ~$9 per hour), and we are not even counting that Unsloth experiments run faster, thanks to the framework itself and the lower I\/O overhead of training on a single GPU.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">More on&nbsp;<a href=\"https:\/\/aws.amazon.com\/sagemaker\/pricing\/\">SageMaker pricing<\/a>&nbsp;\u2190<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from pydantic_settings import BaseSettings, SettingsConfigDict\n\n\nclass Settings(BaseSettings):\n    # Load values from the .env file (assumed to be in the working directory)\n    model_config = SettingsConfigDict(env_file=\".env\", env_file_encoding=\"utf-8\")\n\n    HUGGINGFACE_BASE_MODEL_ID: str = \"meta-llama\/Meta-Llama-3.1-8B\"\n    HUGGINGFACE_ACCESS_TOKEN: str | None = None\n\n    COMET_API_KEY: str | None = None\n    COMET_WORKSPACE: str | None = None\n    COMET_PROJECT: str = \"llm-twin\"\n\n    DATASET_ID: str = \"decodingml\/articles-instruct-dataset\"\n\n    # AWS Authentication\n    AWS_REGION: str = \"eu-central-1\"\n    AWS_ACCESS_KEY: str | None = None\n    AWS_SECRET_KEY: str | None = None\n    AWS_ARN_ROLE: str | None = None\n\n\n# Shared instance, referenced as settings.* in the SageMaker launcher\nsettings = Settings()<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">In our&nbsp;<strong>settings<\/strong>&nbsp;object (populated from the&nbsp;<strong>.env<\/strong>&nbsp;file), we must set the base model ID (<strong>HUGGINGFACE_BASE_MODEL_ID<\/strong>), the dataset ID (<strong>DATASET_ID<\/strong>, loaded from Comet artifacts), and the credentials for Hugging Face, Comet, and AWS.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2192 Full code: the&nbsp;<a href=\"https:\/\/github.com\/decodingml\/llm-twin-course\/blob\/main\/src\/training_pipeline\/run_on_sagemaker.py\">run_on_sagemaker.py<\/a>&nbsp;script.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"ff78\">5. Running the 
training pipeline on AWS SageMaker<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">To run the fine-tuning job, you must first create an IAM execution role that AWS SageMaker uses to access other AWS resources. This is standard practice when working with SageMaker.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>make create-sagemaker-execution-role<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Add the ARN of this role to your .env file as the&nbsp;<strong>AWS_ARN_ROLE<\/strong>&nbsp;env var. Your&nbsp;<strong>.env<\/strong>&nbsp;file should then look something like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>QDRANT_APIKEY=...\n\n# Hugging Face and Comet\nHUGGINGFACE_ACCESS_TOKEN=...\nCOMET_API_KEY=...\nCOMET_WORKSPACE=...\n\n# AWS Authentication\nAWS_ARN_ROLE=...\nAWS_REGION=eu-central-1\nAWS_ACCESS_KEY=...\nAWS_SECRET_KEY=...<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Then, you can kick off a dummy training run that uses less data and fewer epochs by running:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>make start-training-pipeline-dummy-mode<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">And the full training by running:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>make start-training-pipeline<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">After you run either training command, your CLI should look similar to Figure 5.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:2000\/1*CCl8dAR6jjdxgMywSER0sQ.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Figure 5: AWS SageMaker training job provisioning time.<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">After the EC2 machine is provisioned for your training job, your CLI should look similar to Figure 6.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:2000\/1*AVubCGjrQ3pnhedOdOceeA.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Figure 6: AWS SageMaker training job starting.<\/figcaption><\/figure>\n\n\n\n<p 
class=\"wp-block-paragraph\">After the training image is downloaded, your requirements are installed from the&nbsp;<strong>requirements.txt<\/strong>&nbsp;file, and the fine-tuning script starts running.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><em>Find&nbsp;<strong>step-by-step instructions<\/strong>&nbsp;on installing and running&nbsp;<strong>the entire course<\/strong>&nbsp;in our&nbsp;<a href=\"https:\/\/github.com\/decodingml\/llm-twin-course\/blob\/main\/INSTALL_AND_USAGE.md\">INSTALL_AND_USAGE<\/a>&nbsp;document from the repository.<\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">In this lesson, you got your hands dirty fine-tuning an open-source LLM from Hugging Face using Unsloth and TRL.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Also, you\u2019ve learned why storing, versioning, and using your data from a data registry (e.g., using&nbsp;<a href=\"https:\/\/www.comet.com\/site\/?utm_source=decoding_ml&amp;utm_medium=partner&amp;utm_content=medium\">Comet artifacts<\/a>) is critical for reproducibility.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Finally, you\u2019ve seen how easy it is to automate your training processes using AWS SageMaker.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Continue the course with Lesson 8 on evaluating the fine-tuned LLM and RAG pipeline using Opik.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\ud83d\udd17 Consider checking out the GitHub repository [1] and supporting us with a \u2b50\ufe0f<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">References<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Literature<\/strong><br>[1] Decoding ML (n.d.), decodingml\/llm-twin-course, GitHub. https:\/\/github.com\/decodingml\/llm-twin-course<br>[2] Maxime Labonne (2024), Fine-tune Llama 3.1 Ultra-Efficiently with Unsloth, Maxime Labonne\u2019s blog. 
https:\/\/mlabonne.github.io\/blog\/posts\/2024-07-29_Finetune_Llama31.html<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Images<\/strong><br>If not otherwise stated, all images are created by the author.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Welcome to&nbsp;Lesson 7 of 12&nbsp;in our free course series,&nbsp;LLM Twin: Building Your Production-Ready AI Replica. You\u2019ll learn how to use LLMs, vector DBs, and LLMOps best practices to design, train, and deploy a production-ready \u201cLLM twin\u201d of yourself. This AI character will write like you, incorporating your style, personality, and voice into an LLM. [&hellip;]<\/p>\n","protected":false},"author":128,"featured_media":10007,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[7],"tags":[64,31],"coauthors":[222,223],"class_list":["post-9973","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-tutorials","tag-cometllm","tag-llmops"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>The Ultimate LLM Fine-Tuning Pipeline<\/title>\n<meta name=\"description\" content=\"Learn how to operationalize scalable and reproducible LLM training pipelines following MLOps best practices.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/mistral-llm-fine-tuning\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"8B Parameters, 1 GPU, No Problems: The Ultimate LLM Fine-tuning 
Pipeline\" \/>\n<meta property=\"og:description\" content=\"Learn how to operationalize scalable and reproducible LLM training pipelines following MLOps best practices.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/mistral-llm-fine-tuning\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2024-06-21T21:05:01+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-04-29T12:46:00+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/06\/fine-tune-mistral7binstruct-1024x585.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1024\" \/>\n\t<meta property=\"og:image:height\" content=\"585\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Paul Iusztin, Decoding ML\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Cometml\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Paul Iusztin, Decoding ML\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"12 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. 
-->","yoast_head_json":{"title":"The Ultimate LLM Fine-Tuning Pipeline","description":"Learn how to operationalize scalable and reproducible LLM training pipelines following MLOps best practices.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/mistral-llm-fine-tuning\/","og_locale":"en_US","og_type":"article","og_title":"8B Parameters, 1 GPU, No Problems: The Ultimate LLM Fine-tuning Pipeline","og_description":"Learn how to operationalize scalable and reproducible LLM training pipelines following MLOps best practices.","og_url":"https:\/\/www.comet.com\/site\/blog\/mistral-llm-fine-tuning\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2024-06-21T21:05:01+00:00","article_modified_time":"2025-04-29T12:46:00+00:00","og_image":[{"width":1024,"height":585,"url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/06\/fine-tune-mistral7binstruct-1024x585.png","type":"image\/png"}],"author":"Paul Iusztin, Decoding ML","twitter_card":"summary_large_image","twitter_creator":"@Cometml","twitter_site":"@Cometml","twitter_misc":{"Written by":"Paul Iusztin, Decoding ML","Est. 
reading time":"12 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/mistral-llm-fine-tuning\/#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/mistral-llm-fine-tuning\/"},"author":{"name":"Paul Iusztin","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/87bf0cb600025605b68dcd2f0d597560"},"headline":"8B Parameters, 1 GPU, No Problems: The Ultimate LLM Fine-tuning Pipeline","datePublished":"2024-06-21T21:05:01+00:00","dateModified":"2025-04-29T12:46:00+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/mistral-llm-fine-tuning\/"},"wordCount":2331,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/mistral-llm-fine-tuning\/#primaryimage"},"thumbnailUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/06\/fine-tune-mistral7binstruct.png","keywords":["CometLLM","LLMOps"],"articleSection":["Tutorials"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/mistral-llm-fine-tuning\/","url":"https:\/\/www.comet.com\/site\/blog\/mistral-llm-fine-tuning\/","name":"The Ultimate LLM Fine-Tuning Pipeline","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/mistral-llm-fine-tuning\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/mistral-llm-fine-tuning\/#primaryimage"},"thumbnailUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/06\/fine-tune-mistral7binstruct.png","datePublished":"2024-06-21T21:05:01+00:00","dateModified":"2025-04-29T12:46:00+00:00","description":"Learn how to operationalize scalable and reproducible LLM training pipelines following MLOps best 
practices.","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/mistral-llm-fine-tuning\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/mistral-llm-fine-tuning\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/mistral-llm-fine-tuning\/#primaryimage","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/06\/fine-tune-mistral7binstruct.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/06\/fine-tune-mistral7binstruct.png","width":1792,"height":1024},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/mistral-llm-fine-tuning\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"8B Parameters, 1 GPU, No Problems: The Ultimate LLM Fine-tuning Pipeline"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, 
Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/87bf0cb600025605b68dcd2f0d597560","name":"Paul Iusztin","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/82264b94fb97af87b79646edc7e4fd81","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/05\/cropped-paul-iusztin-96x96.webp","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/05\/cropped-paul-iusztin-96x96.webp","caption":"Paul Iusztin"},"sameAs":["https:\/\/decodingml.substack.com\/"],"url":"https:\/\/www.comet.com\/site\/blog\/author\/paul-iusztin\/"}]}},"jetpack_featured_media_url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/06\/fine-tune-mistral7binstruct.png","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/9973","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/128"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=9973"}],"version-history":[{"count":2,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/9973\/revisions"}],"predecessor-version":[{"id":15800,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/9973\/revisions\/15800"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media\/10007"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=9973"}],"wp:term":[{"taxonomy":"category","embeddable":tr
ue,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=9973"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=9973"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=9973"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}