{"id":9673,"date":"2024-04-03T07:55:26","date_gmt":"2024-04-03T15:55:26","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=9673"},"modified":"2025-11-17T20:40:44","modified_gmt":"2025-11-17T20:40:44","slug":"the-importance-of-data-pipelines-in-the-era-of-generative-ai","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/the-importance-of-data-pipelines-in-the-era-of-generative-ai\/","title":{"rendered":"Your Content is Gold: I Turned 3 Years of Blog Posts into an LLM Training"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\"><em>Welcome to Lesson 2 of 12 in our free course series, LLM Twin: Building Your Production-Ready AI Replica. You\u2019ll learn how to use LLMs, vector DBs, and LLMOps best practices to design, train, and deploy a production-ready \u201cLLM twin\u201d of yourself. This AI character will write like you, incorporating your style, personality, and voice into an LLM. For a full overview of course objectives and prerequisites, start with&nbsp;<a href=\"https:\/\/www.comet.com\/site\/blog\/an-end-to-end-framework-for-production-ready-llm-systems-by-building-your-llm-twin\/\">Lesson 1<\/a>.<\/em><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Lessons<\/strong><\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/an-end-to-end-framework-for-production-ready-llm-systems-by-building-your-llm-twin\/\">An End-to-End Framework for Production-Ready LLM Systems by Building Your LLM Twin<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/the-importance-of-data-pipelines-in-the-era-of-generative-ai\/\">Your Content is Gold: I Turned 3 Years of Blog Posts into an LLM Training<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/llm-twin-3-change-data-capture\/\">I Replaced 1000 Lines of Polling Code with 50 Lines of CDC Magic<\/a><\/li>\n\n\n\n<li><a 
href=\"https:\/\/www.comet.com\/site\/blog\/streaming-pipelines-for-fine-tuning-llms\/\">SOTA Python Streaming Pipelines for Fine-tuning LLMs and RAG \u2014 in Real-Time!<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/advanced-rag-algorithms-optimize-retrieval\/\">The 4 Advanced RAG Algorithms You Must Know to Implement<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/llm-fine-tuning-dataset\/\">Turning Raw Data Into Fine-Tuning Datasets<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/mistral-llm-fine-tuning\/\">8B Parameters, 1 GPU, No Problems: The Ultimate LLM Fine-tuning Pipeline<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/llm-evaluation-best-practices\/\">The Engineer\u2019s Framework for LLM &amp; RAG Evaluation<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/llm-rag-inference-pipelines\/\">Beyond Proof of Concept: Building RAG Systems That Scale<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/rag-evaluation-framework-ragas\/\">The Ultimate Prompt Monitoring Pipeline<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/refactoring-rag-retrieval\/\">[Bonus] Build a scalable RAG ingestion pipeline using 74.3% less code<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/multi-index-rag-apps\/\">[Bonus] Build Multi-Index Advanced RAG Apps<\/a><\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"e7db\">We have data everywhere. Linkedin, Medium, Github, Substack, and many other platforms. To be able to build your Digital Twin, you need&nbsp;<strong>data.&nbsp;<\/strong>Not all types of data, but&nbsp;<strong>organized<\/strong>,&nbsp;<strong>clean<\/strong>, and&nbsp;<strong>normalized<\/strong>&nbsp;data. 
In&nbsp;<strong>Lesson 2,&nbsp;<\/strong>we will learn how to think about and build a&nbsp;<strong>data pipeline&nbsp;<\/strong>by aggregating data from:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Medium<\/li>\n\n\n\n<li>LinkedIn<\/li>\n\n\n\n<li>GitHub<\/li>\n\n\n\n<li>Substack<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"8bff\">We will&nbsp;<strong>present<\/strong>&nbsp;all our&nbsp;<strong>architectural decisions<\/strong>&nbsp;regarding the design of the data collection pipeline for social media data and explain why separating raw data and feature data is essential.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><em>Note: This blog post is the second part of a series for the<a href=\"https:\/\/github.com\/decodingml\/llm-twin-course\">&nbsp;LLM Twin Course<\/a>. Click&nbsp;<a href=\"https:\/\/www.comet.com\/site\/blog\/an-end-to-end-framework-for-production-ready-llm-systems-by-building-your-llm-twin\/\">here<\/a>&nbsp;to read the first part!<\/em><\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"25a2\">In<strong>&nbsp;Lesson 3, we will present the CDC (change data capture) pattern, a database architecture and design pattern for data management systems<\/strong>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"9a7e\">CDC\u2019s primary purpose is to identify and capture changes made to database data, such as insertions, updates, and deletions, which we will detail in Lesson 3.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*q17lML8JiD2v5DY8Xzcl2w.png\" alt=\"\"\/><figcaption class=\"wp-element-caption\">Data Pipeline System Architecture<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"e104\">Table of Contents<\/h2>\n\n\n\n<ol class=\"wp-block-list\">\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/the-importance-of-data-pipelines-in-the-era-of-generative-ai\/#5e59\">What is a data pipeline? 
The critical point in any AI project<\/a>.<\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/the-importance-of-data-pipelines-in-the-era-of-generative-ai\/#44d7\">Data crawling. How to collect your data?<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/the-importance-of-data-pipelines-in-the-era-of-generative-ai\/#03c8\">How do you store your data?<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/the-importance-of-data-pipelines-in-the-era-of-generative-ai\/#80d1\">Raw data vs. Features data<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/the-importance-of-data-pipelines-in-the-era-of-generative-ai\/#d162\">Digging into the dispatcher and AWS Lambda<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/www.comet.com\/site\/blog\/the-importance-of-data-pipelines-in-the-era-of-generative-ai\/#5b21\">Run everything and populate your MongoDB data warehouse<\/a><\/li>\n<\/ol>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"5e59\">1. What is a data pipeline? 
The critical point in any AI project.<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"c4c4\">Data is the lifeblood of any successful AI project, and a well-engineered data pipeline is the key to harnessing its power.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"7353\">This automated system acts as the engine, seamlessly moving data through various stages and transforming it from raw form into actionable insights.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"8291\">But what exactly is a data pipeline, and why is it so critical?<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"dd0b\"><strong>A data pipeline is a series of automated steps that guide data toward a purpose.<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"3e63\">It starts with&nbsp;<strong>data collection, gathering information from diverse sources, such as LinkedIn<\/strong>, Medium, Substack, and GitHub.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"debb\">The pipeline then tackles the raw data, performing&nbsp;<strong>cleaning and transformation<\/strong>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"c1b3\">This step removes inconsistencies and irrelevant information and transforms the data into a format suitable for analysis and ML models.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"2939\"><strong>But why are data pipelines so crucial in AI projects?<\/strong>&nbsp;Here are some key reasons:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Efficiency and Automation:<\/strong>&nbsp;Manual data handling is slow and prone to errors. Pipelines automate the process, ensuring speed and accuracy, especially when dealing with massive data volumes.<\/li>\n\n\n\n<li><strong>Scalability:<\/strong>&nbsp;AI projects often grow in size and complexity. 
A well-designed pipeline can scale seamlessly, accommodating this growth without compromising performance.<\/li>\n\n\n\n<li><strong>Quality and Consistency:<\/strong>&nbsp;Pipelines standardize data handling, ensuring consistent and high-quality data throughout the project lifecycle, leading to more reliable AI models.<\/li>\n\n\n\n<li><strong>Flexibility and Adaptability:<\/strong>&nbsp;The AI landscape is constantly evolving. A robust data pipeline can adapt to changing requirements without a complete rebuild, ensuring long-term value.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"e5dd\"><strong>Data is the engine of any ML model. If we don\u2019t give it enough attention, the model\u2019s output<\/strong>&nbsp;will be unpredictable.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/0*wkjJQaVVlpyvStPK.jpg\" alt=\"The Importance of Data Pipelines in the Era of Generative AI, Decoding ML\"\/><figcaption class=\"wp-element-caption\">Importance of Data [Image by the Author]<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"ca90\">But how can we transform the raw data into&nbsp;<strong>actionable insights?<\/strong><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"44d7\">2. Data crawling. How to collect your data?<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"69d8\">The first step in building a database of relevant data is choosing our data sources. In this lesson, we will focus on four sources:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>LinkedIn<\/li>\n\n\n\n<li>Medium<\/li>\n\n\n\n<li>GitHub<\/li>\n\n\n\n<li>Substack<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"6620\">Why do we choose four data sources? We need&nbsp;<strong>complexity<\/strong>&nbsp;and<strong>&nbsp;diversity<\/strong>&nbsp;in our data to build a powerful LLM twin. 
To obtain these characteristics, we will focus on building three collections of data:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Articles<\/li>\n\n\n\n<li>Social Media Posts<\/li>\n\n\n\n<li>Code<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"1376\">For the data crawling module, we will focus on&nbsp;<strong>two libraries<\/strong>:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>BeautifulSoup<\/strong>: A Python library for parsing HTML and XML documents. It creates parse trees that help us extract the data quickly, but BeautifulSoup cannot fetch the web page itself. That\u2019s why we need to use it alongside libraries like&nbsp;<code>requests<\/code>&nbsp;or&nbsp;<code>Selenium<\/code>,&nbsp;which fetch the page for us.<\/li>\n\n\n\n<li><strong>Selenium<\/strong>: A tool for automating web browsers. It\u2019s used here to interact with web pages programmatically (like logging into LinkedIn, navigating through profiles, etc.). Selenium can work with various browsers, but this code configures it to work with Chrome. We created a base crawler class to follow software engineering best practices.<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"2a54\">The&nbsp;<code>BaseAbstractCrawler&nbsp;<\/code>class in a web crawling context is essential for several key reasons:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Code Reusability and Efficiency<\/strong>: It contains standard methods and properties used by different scrapers, reducing code duplication and promoting efficient development.<\/li>\n\n\n\n<li><strong>Simplification and Structure<\/strong>: This base class abstracts complex or repetitive code, allowing derived scraper classes to focus on specific tasks. 
It enforces a consistent structure across different scrapers.<\/li>\n\n\n\n<li><strong>Ease of Extension<\/strong>: New types of scrapers can easily extend this base class, making the system adaptable and scalable for future requirements.<\/li>\n\n\n\n<li><strong>Maintenance and Testing<\/strong>: Updates or fixes to standard functionalities must be made only once in the base class, simplifying maintenance and testing.<\/li>\n<\/ol>\n\n\n\n<pre class=\"wp-block-code\"><code>import time\nfrom abc import ABC, abstractmethod\nfrom tempfile import mkdtemp\n\nfrom core.db.documents import BaseDocument\n\nfrom selenium import webdriver\nfrom selenium.webdriver.chrome.options import Options\n\n\nclass BaseCrawler(ABC):\n    model: type&#91;BaseDocument]\n\n    @abstractmethod\n    def extract(self, link: str, **kwargs) -&gt; None: ...\n\n\nclass BaseAbstractCrawler(BaseCrawler, ABC):\n    def __init__(self, scroll_limit: int = 5) -&gt; None:\n        options = webdriver.ChromeOptions()\n\n        options.add_argument(\"--no-sandbox\")\n        options.add_argument(\"--headless=new\")\n        options.add_argument(\"--disable-dev-shm-usage\")\n        options.add_argument(\"--log-level=3\")\n        options.add_argument(\"--disable-popup-blocking\")\n        options.add_argument(\"--disable-notifications\")\n        options.add_argument(\"--disable-extensions\")\n        options.add_argument(\"--disable-background-networking\")\n        options.add_argument(\"--ignore-certificate-errors\")\n        options.add_argument(f\"--user-data-dir={mkdtemp()}\")\n        options.add_argument(f\"--data-path={mkdtemp()}\")\n        options.add_argument(f\"--disk-cache-dir={mkdtemp()}\")\n        options.add_argument(\"--remote-debugging-port=9226\")\n\n        self.set_extra_driver_options(options)\n\n        self.scroll_limit = scroll_limit\n        self.driver = webdriver.Chrome(\n            options=options,\n        )\n\n    def set_extra_driver_options(self, options: Options) 
-&gt; None:\n        pass\n\n    def login(self) -&gt; None:\n        pass\n\n    def scroll_page(self) -&gt; None:\n        \"\"\"Scroll through the LinkedIn page based on the scroll limit.\"\"\"\n        current_scroll = 0\n        last_height = self.driver.execute_script(\"return document.body.scrollHeight\")\n        while True:\n            self.driver.execute_script(\n                \"window.scrollTo(0, document.body.scrollHeight);\"\n            )\n            time.sleep(5)\n            new_height = self.driver.execute_script(\"return document.body.scrollHeight\")\n            if new_height == last_height or (\n                self.scroll_limit and current_scroll &gt;= self.scroll_limit\n            ):\n                break\n            last_height = new_height\n            current_scroll += 1 <\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The base classes can be found at&nbsp;<a href=\"https:\/\/github.com\/decodingml\/llm-twin-course\/blob\/main\/src\/data_crawling\/crawlers\/base.py\">data_crawling\/crawlers\/base.py.<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"a069\">We created separate crawlers for each collection (posts, articles, and repositories), which you can find in the&nbsp;<a href=\"https:\/\/github.com\/decodingml\/llm-twin-course\/tree\/main\/src\/data_crawling\/crawlers\">data_crawling\/crawlers folder<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"c633\">Every crawler extends the&nbsp;<strong>BaseCrawler<\/strong>&nbsp;or&nbsp;<strong>BaseAbstractCrawler<\/strong>&nbsp;class, depending on the purpose.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"a626\">The&nbsp;<em>MediumCrawler<\/em>, and&nbsp;<em>LinkedinCrawler<\/em>&nbsp;extend the&nbsp;<em>BaseAbstractCrawler<\/em>&nbsp;(as they depend on the login and scrolling functionality).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"00be\">Here is what the&nbsp;Medium<em>Crawler<\/em>&nbsp;looks like \u2193<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from 
aws_lambda_powertools import Logger\nfrom bs4 import BeautifulSoup\nfrom core.db.documents import ArticleDocument\nfrom selenium.webdriver.common.by import By\n\nfrom crawlers.base import BaseAbstractCrawler\n\nlogger = Logger(service=\"llm-twin-course\/crawler\")\n\n\nclass MediumCrawler(BaseAbstractCrawler):\n    model = ArticleDocument\n\n    def set_extra_driver_options(self, options) -&gt; None:\n        options.add_argument(r\"--profile-directory=Profile 2\")\n\n    def extract(self, link: str, **kwargs) -&gt; None:\n        logger.info(f\"Starting scraping Medium article: {link}\")\n\n        self.driver.get(link)\n        self.scroll_page()\n\n        soup = BeautifulSoup(self.driver.page_source, \"html.parser\")\n        title = soup.find_all(\"h1\", class_=\"pw-post-title\")\n        subtitle = soup.find_all(\"h2\", class_=\"pw-subtitle-paragraph\")\n\n        data = {\n            \"Title\": title&#91;0].string if title else None,\n            \"Subtitle\": subtitle&#91;0].string if subtitle else None,\n            \"Content\": soup.get_text(),\n        }\n\n        self.driver.close()\n        instance = self.model(\n            platform=\"medium\", content=data, link=link, author_id=kwargs.get(\"user\")\n        )\n        instance.save()\n        # Log success only after the document has been saved.\n        logger.info(f\"Successfully scraped and saved article: {link}\")\n\n    def login(self):\n        \"\"\"Log in to Medium with Google\"\"\"\n        self.driver.get(\"https:\/\/medium.com\/m\/signin\")\n        self.driver.find_element(By.TAG_NAME, \"a\").click()<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"44fd\">By contrast, the GitHub crawler is a static crawler that doesn\u2019t need a login function,&nbsp;<em>scroll_page<\/em>&nbsp;function, or driver. 
It uses only git commands.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"3996\">The&nbsp;<em>GithubCrawler<\/em>&nbsp;extends the&nbsp;<em>BaseCrawler<\/em>&nbsp;class and uses the extract method to retrieve the desired repository.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import os\nimport shutil\nimport subprocess\nimport tempfile\n\nfrom crawlers.base import BaseCrawler\nfrom documents import RepositoryDocument\n\nclass GithubCrawler(BaseCrawler):\n    model = RepositoryDocument\n\n    def __init__(self, ignore=(\".git\", \".toml\", \".lock\", \".png\")):\n        super().__init__()\n        self._ignore = ignore\n\n    def extract(self, link: str, **kwargs):\n        repo_name = link.rstrip(\"\/\").split(\"\/\")&#91;-1]\n        local_temp = tempfile.mkdtemp()\n        try:\n            os.chdir(local_temp)\n            subprocess.run(&#91;\"git\", \"clone\", link])\n            repo_path = os.path.join(local_temp, os.listdir(local_temp)&#91;0])\n            tree = {}\n            for root, dirs, files in os.walk(repo_path):\n                dir = root.replace(repo_path, \"\").lstrip(\"\/\")\n                if dir.startswith(self._ignore):\n                    continue\n                for file in files:\n                    if file.endswith(self._ignore):\n                        continue\n                    file_path = os.path.join(dir, file)\n                    with open(os.path.join(root, file), \"r\", errors=\"ignore\") as f:\n                        tree&#91;file_path] = f.read().replace(\" \", \"\")\n            instance = self.model(\n                name=repo_name, link=link, content=tree, owner_id=kwargs.get(\"user\")\n            )\n            instance.save()\n        except Exception:\n            raise\n        finally:\n            shutil.rmtree(local_temp)<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"03c8\">3. How do you store your data? 
An ODM approach<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"5daf\">Object Document Mapping (ODM) is a technique that maps between an object model in an application and a document database.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"e0ef\">By abstracting database interactions through model classes, it simplifies the process of storing and managing data in a document-oriented database like MongoDB. This approach is particularly beneficial in applications where data structures align well with object-oriented programming paradigms.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"9120\">The&nbsp;<a href=\"https:\/\/github.com\/decodingml\/llm-twin-course\/blob\/main\/module1\/documents.py\" target=\"_blank\" rel=\"noreferrer noopener\">documents.py<\/a>&nbsp;module serves as a foundational framework for interacting with MongoDB.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"febd\">Our data modeling centers on creating specific document classes \u2014&nbsp;<strong>UserDocument<\/strong>,&nbsp;<strong>RepositoryDocument<\/strong>,&nbsp;<strong>PostDocument<\/strong>, and&nbsp;<strong>ArticleDocument<\/strong>&nbsp;\u2014 that mirror the structure of our MongoDB collections.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"fca2\">These classes define the schema for each data type we store, such as users\u2019 details, repository metadata, post content, and article information.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"d847\">By using these classes, we can ensure that the data inserted into our database is consistent, valid, and easily retrievable for further operations.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import uuid\nfrom typing import List, Optional\n\nfrom pydantic import UUID4, BaseModel, ConfigDict, Field\nfrom pymongo import errors\n\nimport core.logger_utils as logger_utils\nfrom core.db.mongo import connection\nfrom core.errors import ImproperlyConfigured\n\n_database = connection.get_database(\"twin\")\n\nlogger = 
logger_utils.get_logger(__name__)\n\n\nclass BaseDocument(BaseModel):\n    id: UUID4 = Field(default_factory=uuid.uuid4)\n\n    model_config = ConfigDict(from_attributes=True, populate_by_name=True)\n\n    @classmethod\n    def from_mongo(cls, data: dict):\n        \"\"\"Convert \"_id\" (str object) into \"id\" (UUID object).\"\"\"\n        if not data:\n            return data\n\n        id = data.pop(\"_id\", None)\n        return cls(**dict(data, id=id))\n\n    def to_mongo(self, **kwargs) -&gt; dict:\n        \"\"\"Convert \"id\" (UUID object) into \"_id\" (str object).\"\"\"\n        exclude_unset = kwargs.pop(\"exclude_unset\", False)\n        by_alias = kwargs.pop(\"by_alias\", True)\n\n        parsed = self.model_dump(\n            exclude_unset=exclude_unset, by_alias=by_alias, **kwargs\n        )\n\n        if \"_id\" not in parsed and \"id\" in parsed:\n            parsed&#91;\"_id\"] = str(parsed.pop(\"id\"))\n\n        return parsed\n\n    def save(self, **kwargs):\n        ...\n\n    @classmethod\n    def get_or_create(cls, **filter_options) -&gt; Optional&#91;str]:\n        ...\n\n    @classmethod\n    def find(cls, **filter_options):\n        ...\n\n    @classmethod\n    def bulk_insert(cls, documents: List, **kwargs) -&gt; Optional&#91;List&#91;str]]:\n        ...\n\n    @classmethod\n    def _get_collection_name(cls):\n        if not hasattr(cls, \"Settings\") or not hasattr(cls.Settings, \"name\"):\n            raise ImproperlyConfigured(\n                \"Document should define an Settings configuration class with the name of the collection.\"\n            )\n\n        return cls.Settings.name\n\n\nclass UserDocument(BaseDocument):\n    first_name: str\n    last_name: str\n\n    class Settings:\n        name = \"users\"\n\n\nclass RepositoryDocument(BaseDocument):\n    name: str\n    link: str\n    content: dict\n    owner_id: str = Field(alias=\"owner_id\")\n\n    class Settings:\n        name = \"repositories\"\n\n\nclass 
PostDocument(BaseDocument):\n    platform: str\n    content: dict\n    author_id: str = Field(alias=\"author_id\")\n\n    class Settings:\n        name = \"posts\"\n\n\nclass ArticleDocument(BaseDocument):\n    platform: str\n    link: str\n    content: dict\n    author_id: str = Field(alias=\"author_id\")\n\n    class Settings:\n        name = \"articles\"<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"4fc6\">In our ODM approach for MongoDB, key CRUD operations are integrated:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Conversion<\/strong>: The&nbsp;<code>to_mongo<\/code>&nbsp;method transforms model instances into MongoDB-friendly formats.<\/li>\n\n\n\n<li><strong>Inserting<\/strong>: The&nbsp;<code>save<\/code>&nbsp;method uses PyMongo\u2019s&nbsp;<code>insert_one<\/code>&nbsp;for adding documents, returning the inserted ID from MongoDB\u2019s acknowledgment.<\/li>\n\n\n\n<li><strong>Bulk Operations<\/strong>:&nbsp;<code>bulk_insert<\/code>&nbsp;employs&nbsp;<code>insert_many<\/code>&nbsp;for adding multiple documents and returns their IDs.<\/li>\n\n\n\n<li><strong>Upserting<\/strong>:&nbsp;<code>get_or_create<\/code>&nbsp;either fetches an existing document or creates a new one, ensuring seamless data updates.<\/li>\n\n\n\n<li><strong>Validation and Transformation<\/strong>: Using Pydantic models, each class ensures data is correctly structured and validated before database entry.<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\"><em>Full code at&nbsp;<a href=\"https:\/\/github.com\/decodingml\/llm-twin-course\/blob\/main\/src\/core\/db\/documents.py\">core\/db\/documents.py<\/a><\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"80d1\">4. Raw data vs. features data<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"c0a2\">Now that we understand the critical role of data pipelines in preparing raw data, let\u2019s explore how we can transform this data into a usable format for our LLM twin. 
This is where the concept of features comes into play.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"b181\"><strong>Features&nbsp;<\/strong>are the processed building blocks used to fine-tune your LLM twin.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\" id=\"368a\">Imagine you\u2019re teaching someone your writing style. You wouldn\u2019t just hand them all your social media posts! Instead, you might point out your frequent use of specific keywords, the types of topics you write about, or the overall sentiment you convey. Features work similarly for your LLM twin.<\/p>\n<\/blockquote>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"ad3e\"><strong>Raw data<\/strong>, on the other hand, is the unrefined information collected from various sources. Social media posts might contain emojis, irrelevant links, or even typos. This raw data needs cleaning and transformation before it can be used effectively.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"b2c0\">In our&nbsp;<strong>data flow<\/strong>, raw data is initially captured and stored in MongoDB, where it remains unprocessed.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"6a5e\">Then, we process this data to create features, the key details we use to teach our LLM twin, and store them in&nbsp;<a href=\"https:\/\/qdrant.tech\/?utm_source=decodingml&amp;utm_medium=referral&amp;utm_campaign=llm-course\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>Qdrant<\/strong><\/a>. We do this to keep our raw data intact in case we need it again, while&nbsp;<a href=\"https:\/\/qdrant.tech\/?utm_source=decodingml&amp;utm_medium=referral&amp;utm_campaign=llm-course\" target=\"_blank\" rel=\"noreferrer noopener\">Qdrant<\/a>&nbsp;holds the ready-to-use features for efficient machine learning.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"d162\">5. 
Digging into the dispatcher and AWS Lambda<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"df9a\">In this section, we will focus on how to constantly update our database with the most recent data from our data sources.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"95d2\">Before diving into the AWS details of our data pipeline infrastructure, I would like to show you how to \u201cthink\u201d through the whole process.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"d245\">The first step in designing infrastructure is to draw a high-level overview of its components.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"9250\">So, the components of our data pipeline are:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>LinkedIn crawler<\/li>\n\n\n\n<li>Medium crawler<\/li>\n\n\n\n<li>GitHub crawler<\/li>\n\n\n\n<li>CustomArticle crawler<\/li>\n\n\n\n<li>MongoDB (Data Collector)<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*48hc90bpvgBqIeuCfDE3sg.png\" alt=\"The Importance of Data Pipelines in the Era of Generative AI, Decoding ML\"\/><figcaption class=\"wp-element-caption\">High-Level AWS Infrastructure [Image by the Author]<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"9305\">Every&nbsp;<strong>crawler<\/strong>&nbsp;is a .py file. Since this data pipeline must be constantly updated, we will design a system based on Lambda functions, where every&nbsp;<strong>AWS Lambda function<\/strong>&nbsp;represents a crawler.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"e46b\">What is an&nbsp;<strong>AWS Lambda function<\/strong>&nbsp;in the&nbsp;<strong>AWS environment?<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"aa8c\"><strong>AWS Lambda<\/strong>&nbsp;is a serverless computing service that allows you to run code without provisioning or managing servers. 
It executes your code only when needed and scales automatically, from a few daily requests to thousands per second.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"297f\">Here\u2019s how Lambda fits within the AWS environment and what makes it particularly powerful:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Event-Driven:<\/strong>&nbsp;AWS Lambda is designed to use events as triggers. These events could be changes to data in an Amazon S3 bucket, updates to a DynamoDB table, HTTP requests via Amazon API Gateway, or direct invocation via SDKs from other applications. In the diagram above, the events would likely be new or updated content on LinkedIn, Medium, or GitHub.<\/li>\n\n\n\n<li><strong>Scalable:<\/strong>&nbsp;AWS Lambda can run as many instances of the function as needed to respond to the rate of incoming events. This could mean running dozens or even hundreds of instances of your function in parallel.<\/li>\n\n\n\n<li><strong>Managed Execution Environment:<\/strong>&nbsp;AWS handles all the administration of the underlying infrastructure, including server and operating system maintenance, capacity provisioning and automatic scaling, code monitoring, and logging. This allows you to focus on your code.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"41c4\">How can we put the Medium crawler on an AWS Lambda function?<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"d470\">We need a&nbsp;<strong>handler<\/strong>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"f8d7\">The&nbsp;<code>handler<\/code>&nbsp;function is the entry point for the AWS Lambda function. 
In AWS Lambda, the&nbsp;<code>handler<\/code>&nbsp;function is invoked when an event triggers the Lambda function.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>from typing import Any\n\nfrom aws_lambda_powertools import Logger\nfrom aws_lambda_powertools.utilities.typing import LambdaContext\n\nfrom core import lib\nfrom core.db.documents import UserDocument\n\nfrom crawlers import CustomArticleCrawler, GithubCrawler, LinkedInCrawler\nfrom dispatcher import CrawlerDispatcher\n\nlogger = Logger(service=\"llm-twin-course\/crawler\")\n\n# Map each supported domain to the crawler class that handles it.\n_dispatcher = CrawlerDispatcher()\n_dispatcher.register(\"medium\", CustomArticleCrawler)\n_dispatcher.register(\"linkedin\", LinkedInCrawler)\n_dispatcher.register(\"github\", GithubCrawler)\n\n\ndef handler(event, context: LambdaContext | None = None) -&gt; dict&#91;str, Any]:\n    first_name, last_name = lib.split_user_full_name(event.get(\"user\"))\n\n    user_id = UserDocument.get_or_create(first_name=first_name, last_name=last_name)\n\n    link = event.get(\"link\")\n    crawler = _dispatcher.get_crawler(link)\n\n    try:\n        crawler.extract(link=link, user=user_id)\n\n        return {\"statusCode\": 200, \"body\": \"Link processed successfully\"}\n    except Exception as e:\n        return {\"statusCode\": 500, \"body\": f\"An error occurred: {str(e)}\"}<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><em>Full code at&nbsp;<a href=\"https:\/\/github.com\/decodingml\/llm-twin-course\/blob\/main\/src\/data_crawling\/main.py\">data_crawling\/main.py<\/a><\/em><\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"8fda\">Each crawler function is tailored to its data source: fetching posts from&nbsp;<strong>LinkedIn<\/strong>, articles from&nbsp;<strong>Medium,<\/strong>&nbsp;and repository data from&nbsp;<strong>GitHub<\/strong>.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/miro.medium.com\/v2\/resize:fit:700\/1*lsdtZ_KBvjdAGrqYhwk8qA.png\" alt=\"The Importance of Data Pipelines in the Era of Generative AI, 
Decoding ML\"\/><figcaption class=\"wp-element-caption\">AWS High Level Architecture \u2014 Overview [Image by the Author]<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"4732\">To route each link to the right crawler inside the Lambda function, we have created a&nbsp;<strong>Python dispatcher<\/strong>, which is responsible for managing the crawlers for specific domains.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"65e3\">You can register crawlers for different domains and then use the&nbsp;<code>get_crawler<\/code>&nbsp;method to get the appropriate crawler for a given URL, defaulting to the&nbsp;<em>CustomArticleCrawler<\/em>&nbsp;if the domain is not registered.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import re\n\nfrom aws_lambda_powertools import Logger\nfrom crawlers.base import BaseCrawler\nfrom crawlers.custom_article import CustomArticleCrawler\n\nlogger = Logger(service=\"llm-twin-course\/crawler\")\n\n\nclass CrawlerDispatcher:\n    def __init__(self) -&gt; None:\n        self._crawlers = {}\n\n    def register(self, domain: str, crawler: type&#91;BaseCrawler]) -&gt; None:\n        # Map a URL pattern for the domain to its crawler class.\n        self._crawlers&#91;r\"https:\/\/(www\\.)?{}\\.com\/*\".format(re.escape(domain))] = crawler\n\n    def get_crawler(self, url: str) -&gt; BaseCrawler:\n        for pattern, crawler in self._crawlers.items():\n            if re.match(pattern, url):\n                return crawler()\n        else:\n            logger.warning(\n                f\"No crawler found for {url}.
Defaulting to CustomArticleCrawler.\"\n            )\n\n            return CustomArticleCrawler()<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The responsible crawler processes its respective data and then passes it to the MongoDB data warehouse.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2192 Full code at&nbsp;<a href=\"https:\/\/github.com\/decodingml\/llm-twin-course\/blob\/main\/src\/data_crawling\/dispatcher.py\">data_crawling\/dispatcher.py<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The MongoDB component acts as a unified data store, collecting and managing the data harvested by the AWS Lambda functions.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This&nbsp;<strong>infrastructure<\/strong>&nbsp;is designed for efficient and scalable data extraction, transformation, and loading (ETL) from diverse sources into a single database.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"5b21\">6. Run everything and populate your MongoDB data warehouse<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The first step is to spin up your local infrastructure using Docker by running:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><code>make local-start<\/code><\/p>\n\n\n\n<p class=\"wp-block-paragraph\" id=\"2214\">Now you can test the crawler, which runs locally as a Lambda function, by crawling a test Medium article:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><code>make local-test-medium<\/code><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">You can also test it with a GitHub URL:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><code>make local-test-github<\/code><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To&nbsp;<strong>populate<\/strong>&nbsp;the&nbsp;<strong>MongoDB data warehouse<\/strong>&nbsp;with&nbsp;<strong>all our supported links,<\/strong>&nbsp;run the following:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><code>make local-ingest-data<\/code><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This will crawl all the links from the&nbsp;<a
href=\"https:\/\/github.com\/decodingml\/llm-twin-course\/blob\/main\/data\/links.txt\">data\/links.txt<\/a>&nbsp;file.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><em>Find&nbsp;<strong>step-by-step instructions<\/strong>&nbsp;on installing and running&nbsp;<strong>the entire course<\/strong>&nbsp;in our&nbsp;<a href=\"https:\/\/github.com\/decodingml\/llm-twin-course\/blob\/main\/INSTALL_AND_USAGE.md\">INSTALL_AND_USAGE<\/a>&nbsp;document from the repository.<\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"9e10\">Conclusion<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">In this lesson of the LLM Twin course, you\u2019ve learned how to build crawlers for various data sources such as LinkedIn, GitHub, Medium, and custom sites.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">You\u2019ve also learned how to standardize, clean, and store the results in MongoDB.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">By leveraging the dispatcher pattern, we have a central point that knows which crawler to use for each link.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Finally, we showed you how to wrap the dispatcher behind the interface expected by AWS Lambda to deploy it to AWS quickly.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In this lesson, we presented&nbsp;<strong>how to build a data pipeline<\/strong>&nbsp;and why it\u2019s so essential in an ML project.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In Lesson 3, we will dive deeper into the&nbsp;<strong>change data capture (CDC) pattern<\/strong>&nbsp;and explain how it can connect data engineering to the AI world.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\ud83d\udd17&nbsp;<strong>Check out<\/strong>&nbsp;<a href=\"https:\/\/github.com\/decodingml\/llm-twin-course\" target=\"_blank\" rel=\"noreferrer noopener\">the code on GitHub<\/a>&nbsp;[1] and support us with a&nbsp;<em>\u2b50\ufe0f<\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">References<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\"
id=\"200d\">[1]&nbsp;<a href=\"https:\/\/github.com\/decodingml\/llm-twin-course\" target=\"_blank\" rel=\"noreferrer noopener\">Your LLM Twin Course \u2014 GitHub Repository<\/a>&nbsp;(2024), Decoding ML GitHub Organization<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Welcome to Lesson 2 of 12 in our free course series, LLM Twin: Building Your Production-Ready AI Replica. You\u2019ll learn how to use LLMs, vector DBs, and LLMOps best practices to design, train, and deploy a production-ready \u201cLLM twin\u201d of yourself. This AI character will write like you, incorporating your style, personality, and voice [&hellip;]<\/p>\n","protected":false},"author":128,"featured_media":9720,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[65,6,7],"tags":[],"coauthors":[222,223],"class_list":["post-9673","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-llmops","category-machine-learning","category-tutorials"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>How to Turn Social Media Content Into an LLM Training Dataset<\/title>\n<meta name=\"description\" content=\"A guide on how to build a data pipeline for an LLM system using various data sources.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/the-importance-of-data-pipelines-in-the-era-of-generative-ai\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Your Content is Gold: I
Turned 3 Years of Blog Posts into an LLM Training\" \/>\n<meta property=\"og:description\" content=\"A guide on how to build a data pipeline for an LLM system using various data sources.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/the-importance-of-data-pipelines-in-the-era-of-generative-ai\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2024-04-03T15:55:26+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-11-17T20:40:44+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/04\/Screenshot-2024-04-03-at-4.40.11\u202fPM-1024x584.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1024\" \/>\n\t<meta property=\"og:image:height\" content=\"584\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Paul Iusztin, Decoding ML\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Cometml\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Paul Iusztin, Decoding ML\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"11 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. 
-->","yoast_head_json":{"title":"How to Turn Social Media Content Into an LLM Training Dataset","description":"A guide on how to build a data pipeline for an LLM system using various data sources.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/the-importance-of-data-pipelines-in-the-era-of-generative-ai\/","og_locale":"en_US","og_type":"article","og_title":"Your Content is Gold: I Turned 3 Years of Blog Posts into an LLM Training","og_description":"A guide on how to build a data pipeline for an LLM system using various data sources.","og_url":"https:\/\/www.comet.com\/site\/blog\/the-importance-of-data-pipelines-in-the-era-of-generative-ai\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2024-04-03T15:55:26+00:00","article_modified_time":"2025-11-17T20:40:44+00:00","og_image":[{"width":1024,"height":584,"url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/04\/Screenshot-2024-04-03-at-4.40.11\u202fPM-1024x584.png","type":"image\/png"}],"author":"Paul Iusztin, Decoding ML","twitter_card":"summary_large_image","twitter_creator":"@Cometml","twitter_site":"@Cometml","twitter_misc":{"Written by":"Paul Iusztin, Decoding ML","Est. 
reading time":"11 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/the-importance-of-data-pipelines-in-the-era-of-generative-ai\/#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/the-importance-of-data-pipelines-in-the-era-of-generative-ai\/"},"author":{"name":"Paul Iusztin","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/87bf0cb600025605b68dcd2f0d597560"},"headline":"Your Content is Gold: I Turned 3 Years of Blog Posts into an LLM Training","datePublished":"2024-04-03T15:55:26+00:00","dateModified":"2025-11-17T20:40:44+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/the-importance-of-data-pipelines-in-the-era-of-generative-ai\/"},"wordCount":2444,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/the-importance-of-data-pipelines-in-the-era-of-generative-ai\/#primaryimage"},"thumbnailUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/04\/Screenshot-2024-04-03-at-4.40.11\u202fPM.png","articleSection":["LLMOps","Machine Learning","Tutorials"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/the-importance-of-data-pipelines-in-the-era-of-generative-ai\/","url":"https:\/\/www.comet.com\/site\/blog\/the-importance-of-data-pipelines-in-the-era-of-generative-ai\/","name":"How to Turn Social Media Content Into an LLM Training 
Dataset","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/the-importance-of-data-pipelines-in-the-era-of-generative-ai\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/the-importance-of-data-pipelines-in-the-era-of-generative-ai\/#primaryimage"},"thumbnailUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/04\/Screenshot-2024-04-03-at-4.40.11\u202fPM.png","datePublished":"2024-04-03T15:55:26+00:00","dateModified":"2025-11-17T20:40:44+00:00","description":"A guide on how to build a data pipeline for an LLM system using various data sources.","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/the-importance-of-data-pipelines-in-the-era-of-generative-ai\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/the-importance-of-data-pipelines-in-the-era-of-generative-ai\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/the-importance-of-data-pipelines-in-the-era-of-generative-ai\/#primaryimage","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/04\/Screenshot-2024-04-03-at-4.40.11\u202fPM.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/04\/Screenshot-2024-04-03-at-4.40.11\u202fPM.png","width":1626,"height":928},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/the-importance-of-data-pipelines-in-the-era-of-generative-ai\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"Your Content is Gold: I Turned 3 Years of Blog Posts into an LLM Training"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models 
Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/87bf0cb600025605b68dcd2f0d597560","name":"Paul Iusztin","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/82264b94fb97af87b79646edc7e4fd81","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/05\/cropped-paul-iusztin-96x96.webp","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2026\/05\/cropped-paul-iusztin-96x96.webp","caption":"Paul 
Iusztin"},"sameAs":["https:\/\/decodingml.substack.com\/"],"url":"https:\/\/www.comet.com\/site\/blog\/author\/paul-iusztin\/"}]}},"jetpack_featured_media_url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2024\/04\/Screenshot-2024-04-03-at-4.40.11\u202fPM.png","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/9673","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/128"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=9673"}],"version-history":[{"count":3,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/9673\/revisions"}],"predecessor-version":[{"id":18469,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/9673\/revisions\/18469"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media\/9720"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=9673"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=9673"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=9673"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=9673"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}