{"id":8279,"date":"2023-11-30T06:57:42","date_gmt":"2023-11-30T14:57:42","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=8279"},"modified":"2025-04-24T17:03:59","modified_gmt":"2025-04-24T17:03:59","slug":"langchain-document-loaders-for-web-data","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/langchain-document-loaders-for-web-data\/","title":{"rendered":"LangChain Document Loaders for Web\u00a0Data"},"content":{"rendered":"\n<section class=\"section section--body\">\n<div class=\"section-divider\"><\/div>\n<div class=\"section-content\">\n<div class=\"section-inner sectionLayout--insetColumn\">\n<h2 class=\"graf graf--h4\">And An Assessment of How They Impact Your ragas Metrics<\/h2>\n<figure class=\"graf graf--figure\">\n<\/figure><\/div><\/div><\/section>\n\n\n\n<figure class=\"wp-block-image aligncenter graf-image\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/0*ou0C9vOnaGtzPThz\" alt=\"LangChain Document Loaders for Web\u00a0Data with Comet and CometLLM\"\/><figcaption class=\"wp-element-caption\">Photo by <a href=\"https:\/\/unsplash.com\/@ilyapavlov?utm_source=medium&amp;utm_medium=referral\">Ilya Pavlov<\/a> on\u00a0<a href=\"http:\/\/Unsplash.com\">Unsplash<\/a><\/figcaption><\/figure>\n\n\n\n<p class=\"graf graf--p\">If you\u2019ve ever wondered how the quality of information sourced by language models affects their outputs, you\u2019re in the right place.&nbsp;I\u2019m trying to unpack how different document loaders in LangChain impact a Retrieval Augmented Generation (RAG) system.<\/p>\n\n\n\n<p class=\"graf graf--p\"><strong class=\"markup--strong markup--p-strong\">Why is this important?&nbsp;<\/strong><\/p>\n\n\n\n<p class=\"graf graf--p\">RAG is a game-changer. It cleverly combines retrieving information from external documents with the generative capabilities of language models. 
However, the effectiveness of this system hinges on one critical aspect\u200a\u2014\u200athe method used to retrieve documents.<\/p>\n\n\n\n<p class=\"graf graf--p\">This blog is about exploring and understanding this pivotal element.<\/p>\n\n\n\n<p class=\"graf graf--p\">We\u2019ll focus on three key players in LangChain:<\/p>\n\n\n\n<ol class=\"wp-block-list postList\">\n<li>WebBaseLoader<\/li>\n\n\n\n<li>SeleniumURLLoader,<\/li>\n\n\n\n<li>NewsURLLoader.<\/li>\n<\/ol>\n\n\n\n<p class=\"graf graf--p\">Each has its approach to fetching information, and we will find out how these methods shape the final output of RAG models.<\/p>\n\n\n\n<p class=\"graf graf--p\">I invite you to join this exploration\u200a\u2014\u200ait\u2019s not just an exploration of code and algorithms but a journey to enhance the intelligence and responsiveness of AI systems.<\/p>\n\n\n\n<h3 class=\"wp-block-heading graf graf--h3\">\ud83e\uddd1\ud83c\udffd\u200d\ud83d\udcbb Let\u2019s write some&nbsp;code!<\/h3>\n\n\n\n<p class=\"graf graf--p\">Start with some preliminaries and setting the environment.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><span class=\"pre--content\">%%capture\n!pip install langchain openai unstructured selenium newspaper3k textstat tiktoken faiss-cpu\n\n<span class=\"hljs-keyword\">import<\/span> os\n<span class=\"hljs-keyword\">import<\/span> getpass\n<span class=\"hljs-keyword\">from<\/span> langchain.document_loaders <span class=\"hljs-keyword\">import<\/span> WebBaseLoader, UnstructuredURLLoader, NewsURLLoader, SeleniumURLLoader\n\n<span class=\"hljs-keyword\">import<\/span> tiktoken\n<span class=\"hljs-keyword\">import<\/span> matplotlib.pyplot <span class=\"hljs-keyword\">as<\/span> plt\n<span class=\"hljs-keyword\">import<\/span> pandas <span class=\"hljs-keyword\">as<\/span> pd\n<span class=\"hljs-keyword\">import<\/span> nltk\n<span class=\"hljs-keyword\">from<\/span> nltk.tokenize <span class=\"hljs-keyword\">import<\/span> sent_tokenize, 
word_tokenize\n<span class=\"hljs-keyword\">from<\/span> nltk.corpus <span class=\"hljs-keyword\">import<\/span> stopwords\n<span class=\"hljs-keyword\">from<\/span> textstat <span class=\"hljs-keyword\">import<\/span> flesch_reading_ease\n<span class=\"hljs-keyword\">from<\/span> collections <span class=\"hljs-keyword\">import<\/span> Counter\n\n<span class=\"hljs-keyword\">from<\/span> langchain.embeddings.openai <span class=\"hljs-keyword\">import<\/span> OpenAIEmbeddings\n<span class=\"hljs-keyword\">from<\/span> langchain.vectorstores <span class=\"hljs-keyword\">import<\/span> FAISS\n<span class=\"hljs-keyword\">from<\/span> langchain.chat_models <span class=\"hljs-keyword\">import<\/span> ChatOpenAI\n<span class=\"hljs-keyword\">from<\/span> langchain.chains <span class=\"hljs-keyword\">import<\/span> RetrievalQA\n<span class=\"hljs-keyword\">from<\/span> langchain.text_splitter <span class=\"hljs-keyword\">import<\/span> RecursiveCharacterTextSplitter\nos.environ[<span class=\"hljs-string\">'OPENAI_API_KEY'<\/span>] = getpass.getpass(<span class=\"hljs-string\">\"Input your Open AI Key:\"<\/span>)<\/span><\/pre>\n\n\n\n<p class=\"graf graf--p\">For this demonstration, we\u2019ll use <a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/phys.org\/news\/2023-11-qa-dont-blame-chatbots.html\" target=\"_blank\" rel=\"noopener\" data-href=\"https:\/\/phys.org\/news\/2023-11-qa-dont-blame-chatbots.html\">this website<\/a>.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><span class=\"pre--content\">website = <span class=\"hljs-string\">\"https:\/\/phys.org\/news\/2023-11-qa-dont-blame-chatbots.html\"<\/span><\/span><\/pre>\n\n\n\n<p class=\"graf graf--p\">The function below will load the website into a LangChain document object:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><span class=\"pre--content\"><span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title function_\">load_document<\/span>(<span class=\"hljs-params\">loader_class, 
website_url<\/span>):\n    <span class=\"hljs-string\">\"\"\"\n    Load a document using the specified loader class and website URL.\n\n    Args:\n    loader_class (class): The class of the loader to be used.\n    website_url (str): The URL of the website from which to load the document.\n\n    Returns:\n    list: The loaded LangChain Document objects.\n    \"\"\"<\/span>\n    loader = loader_class([website_url])\n    <span class=\"hljs-keyword\">return<\/span> loader.load()<\/span><\/pre>\n\n\n\n<h3 class=\"wp-block-heading graf graf--h3\"><strong class=\"markup--strong markup--h3-strong\">Understanding the WebBaseLoader<\/strong><\/h3>\n\n\n\n<figure class=\"graf graf--figure\">\n<\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter graf-image\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/0*79twjt9NMCwotiTX\" alt=\"LangChain Document Loaders for Web\u00a0Data with Comet and CometLLM\"\/><figcaption class=\"wp-element-caption\">Photo by <a href=\"https:\/\/unsplash.com\/@emilep?utm_source=medium&amp;utm_medium=referral\">Emile Perron<\/a> on\u00a0<a href=\"http:\/\/Unsplash.com\">Unsplash<\/a><\/figcaption><\/figure>\n\n\n\n<p class=\"graf graf--p\">When extracting text from websites, the WebBaseLoader in LangChain is a tool you need to know about.<\/p>\n\n\n\n<p class=\"graf graf--p\">It\u2019s like a skilled miner adept at digging through the layers of a website to retrieve the valuable textual content beneath. Let\u2019s explain exactly how it works and what this means for embedding documents into a vector database.<\/p>\n\n\n\n<section class=\"section section--body\">\n<div class=\"section-divider\">\n<hr class=\"section-divider\">\n<\/div>\n<div class=\"section-content\">\n<div class=\"section-inner sectionLayout--insetColumn\">\n<h4 class=\"graf graf--h4\"><strong class=\"markup--strong markup--h4-strong\">How WebBaseLoader Retrieves Text<\/strong><\/h4>\n<p class=\"graf graf--p\">The WebBaseLoader uses HTTP requests, a basic yet powerful way to communicate with web servers. Think of it as sending a letter to a website asking for its content. Once the website replies, WebBaseLoader takes over, sifting through the HTML, the foundational code of web pages.<\/p>\n<p class=\"graf graf--p\">This is where BeautifulSoup, a Python library, comes into play. WebBaseLoader uses BeautifulSoup to parse the HTML, effectively reading and extracting the text. 
It\u2019s like having a translator who can interpret the complex language of HTML and present you with just the readable text.<\/p>\n<h4 class=\"graf graf--h4\"><strong class=\"markup--strong markup--h4-strong\">Impact on Document Embedding and Vector Databases<\/strong><\/h4>\n<p class=\"graf graf--p\">When this extracted text is embedded into a vector database, there are a few implications:<\/p>\n<ol class=\"postList\">\n<li class=\"graf graf--li\"><strong class=\"markup--strong markup--li-strong\">Quality of Extracted Text<\/strong>: WebBaseLoader relies on HTML structure and excels with well-structured websites. However, because it does not execute JavaScript, it can miss dynamically generated content, which is increasingly common in modern web design. The text it retrieves is only as good as the HTML the server returns.<\/li>\n<li class=\"graf graf--li\"><strong class=\"markup--strong markup--li-strong\">Efficiency<\/strong>: WebBaseLoader is efficient and fast, handling multiple requests seamlessly. This efficiency translates into quicker embedding of documents into your vector database, which is crucial for large-scale applications.<\/li>\n<li class=\"graf graf--li\"><strong class=\"markup--strong markup--li-strong\">Relevance<\/strong>: The relevance of the extracted text can vary. When a page is loaded with ads or unrelated content alongside the main text, WebBaseLoader might fetch that noise along with the valuable data. This could reduce the precision of your RAG system\u2019s outputs.<\/li>\n<\/ol>\n<p class=\"graf graf--p\">While WebBaseLoader is efficient and effective for static content, its performance can be limited by dynamic web elements. 
Keep the type of content you\u2019re targeting in mind as we dive into document embedding and vector databases.<\/p>\n<p class=\"graf graf--p\">If you focus on static, well-structured websites, WebBaseLoader could be your workhorse in the RAG pipeline.<\/p>\n<pre class=\"graf graf--pre graf--preV2\" spellcheck=\"false\" data-code-block-mode=\"2\" data-code-block-lang=\"python\"><span class=\"pre--content\">wb_loader_doc = load_document(WebBaseLoader, website)<\/span><\/pre>\n<p class=\"graf graf--p\">You can examine the extracted content from any of the loaders with the following pattern:<\/p>\n<pre class=\"graf graf--pre graf--preV2\" spellcheck=\"false\" data-code-block-mode=\"2\" data-code-block-lang=\"python\"><span class=\"pre--content\">wb_loader_doc[<span class=\"hljs-number\">0<\/span>].page_content<\/span><\/pre>\n<h3 class=\"graf graf--h3\"><strong class=\"markup--strong markup--h3-strong\">Grasping the SeleniumURLLoader<\/strong><\/h3>\n<figure class=\"graf graf--figure\">\n<\/figure><\/div><\/div><\/section>\n\n\n\n<figure class=\"wp-block-image aligncenter graf-image\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/0*p-wAgGe5MoDAfNvJ\" alt=\"LangChain Document Loaders for Web\u00a0Data with Comet and CometLLM\"\/><figcaption class=\"wp-element-caption\">Photo by <a href=\"https:\/\/unsplash.com\/@markusspiske?utm_source=medium&amp;utm_medium=referral\">Markus Spiske<\/a> on\u00a0<a href=\"http:\/\/Unsplash.com\">Unsplash<\/a><\/figcaption><\/figure>\n\n\n\n<p class=\"graf graf--p\">Imagine a scenario where you need to extract text from a website as dynamic as a bustling city street, changing every moment, filled with interactive elements and content that loads as you scroll.<\/p>\n\n\n\n<p class=\"graf graf--p\">This is where the SeleniumURLLoader steps in, bringing a different skill set from the WebBaseLoader\u2019s.<\/p>\n\n\n\n<h4 class=\"wp-block-heading graf graf--h4\"><strong class=\"markup--strong markup--h4-strong\">How 
SeleniumURLLoader Retrieves Text<\/strong><\/h4>\n\n\n\n<p class=\"graf graf--p\">The SeleniumURLLoader is like an undercover agent in the world of web browsers.<\/p>\n\n\n\n<p class=\"graf graf--p\">It doesn\u2019t just send a request to a website; it navigates the web as a user would. Using Selenium, a powerful tool for browser automation, it launches a real browser (in headless mode, meaning without a visible window) and loads the webpage in it. This ability to render a page the way a user\u2019s browser does is crucial for websites where content is rendered through JavaScript, a common scenario in modern web development.<\/p>\n\n\n\n<p class=\"graf graf--p\">Before extracting the text, the loader waits for the page to load, including any dynamic content.<\/p>\n\n\n\n<h4 class=\"wp-block-heading graf graf--h4\"><strong class=\"markup--strong markup--h4-strong\">Impact on Document Embedding and Vector Databases<\/strong><\/h4>\n\n\n\n<ol class=\"wp-block-list postList\">\n<li><strong class=\"markup--strong markup--li-strong\">Comprehensive Text Retrieval<\/strong>: Since it interacts with web pages like a human user, SeleniumURLLoader can retrieve text that other methods might miss. This includes content that appears due to user interactions or is dynamically loaded by JavaScript.<\/li>\n\n\n\n<li><strong class=\"markup--strong markup--li-strong\">Performance Considerations<\/strong>: The thoroughness of SeleniumURLLoader comes at a cost. It\u2019s slower and more resource-intensive than plain HTTP requests. When embedding documents into a vector database, this could mean longer processing times, especially for large volumes of data.<\/li>\n\n\n\n<li><strong class=\"markup--strong markup--li-strong\">Accuracy and Relevance<\/strong>: The text retrieved by SeleniumURLLoader tends to be highly accurate and reflective of the user\u2019s experience on the website. 
This can lead to more relevant and context-rich embeddings in your vector database, potentially enhancing the quality of your RAG system\u2019s outputs.<\/li>\n<\/ol>\n\n\n\n<p class=\"graf graf--p\">The SeleniumURLLoader is your toolkit\u2019s Swiss army knife for dealing with dynamic, JavaScript-heavy websites. It offers a depth of text retrieval unmatched by more straightforward methods but requires more resources and time.<\/p>\n\n\n\n<p class=\"graf graf--p\">It is the ideal choice for a RAG pipeline when your focus is on comprehensively capturing the content of modern, interactive web pages.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><span class=\"pre--content\">selenium_loader_doc = load_document(SeleniumURLLoader, website)<\/span><\/pre>\n\n\n\n<p class=\"graf graf--p\">With the WebBaseLoader and SeleniumURLLoader covered, we\u2019ll next explore the NewsURLLoader, a specialized tool for news content.<\/p>\n\n\n\n<h3 class=\"wp-block-heading graf graf--h3\">Unveiling the NewsURLLoader in LangChain<\/h3>\n\n\n\n<p class=\"graf graf--p\">NewsURLLoader is designed specifically for news articles.<\/p>\n\n\n\n<h4 class=\"wp-block-heading graf graf--h4\"><strong class=\"markup--strong markup--h4-strong\">How NewsURLLoader Retrieves Text<\/strong><\/h4>\n\n\n\n<p class=\"graf graf--p\">The NewsURLLoader doesn\u2019t just fetch text; it\u2019s adept at navigating through the unique structure of news articles.<\/p>\n\n\n\n<p class=\"graf graf--p\">Using the <code class=\"markup--code markup--p-code\">newspaper<\/code> library, a Python package tailored for news extraction, it performs a more refined retrieval. This loader not only fetches the article but also understands the typical layout of news websites, effectively separating the main content from the clutter of ads and sidebars. 
Moreover, the NewsURLLoader can optionally perform light NLP (Natural Language Processing) tasks.<\/p>\n\n\n\n<p class=\"graf graf--p\">This means it doesn\u2019t just hand you the text; it can also provide summaries and extract keywords, offering a more concise and focused insight into the content.<\/p>\n\n\n\n<h4 class=\"wp-block-heading graf graf--h4\"><strong class=\"markup--strong markup--h4-strong\">Impact on Document Embedding and Vector Databases<\/strong><\/h4>\n\n\n\n<ol class=\"wp-block-list postList\">\n<li><strong class=\"markup--strong markup--li-strong\">Targeted and Clean Extraction<\/strong>: The NewsURLLoader is designed explicitly for news content, which means it can efficiently extract clean and relevant text from news articles. This leads to high-quality document embeddings, especially valuable for news-related queries in a RAG system.<\/li>\n\n\n\n<li><strong class=\"markup--strong markup--li-strong\">NLP Enhancements<\/strong>: The optional NLP features of the NewsURLLoader add an extra layer of value. If you embed the summaries and key terms, your vector database can become more efficient, focusing on the essence rather than the bulk of news articles.<\/li>\n\n\n\n<li><strong class=\"markup--strong markup--li-strong\">Scope Limitation<\/strong>: While it\u2019s a powerhouse for news content, the NewsURLLoader\u2019s specialization is also its limitation. It\u2019s not the right tool for general-purpose web scraping, or for handling dynamic, interactive content the way the SeleniumURLLoader does.<\/li>\n<\/ol>\n\n\n\n<p class=\"graf graf--p\">The NewsURLLoader shines in its domain, making it an excellent choice for RAG systems focused on current events, journalism, or news analysis. 
It offers clean, concise, and relevant text extraction, with optional NLP as a bonus.<\/p>\n\n\n\n<h3 class=\"wp-block-heading graf graf--h3\">Analyzing the content from each&nbsp;loader<\/h3>\n\n\n\n<p class=\"graf graf--p\">In this analysis, you\u2019ll dive deep into the text extracted by the three document loaders: WebBaseLoader, SeleniumURLLoader, and NewsURLLoader.<\/p>\n\n\n\n<p class=\"graf graf--p\">You\u2019ll compare their outputs based on specific metrics: the total number of characters, the count of alphanumeric characters, the number of newline characters, and the total number of tokens as counted by the GPT-4 tokenizer.<\/p>\n\n\n\n<p class=\"graf graf--p\">The goal is to quantitatively assess the nature and quality of text each loader extracts. This technical analysis will provide clear insights into the efficiency and accuracy of these loaders, helping you understand their impact on a Retrieval Augmented Generation system.<\/p>\n\n\n\n<p class=\"graf graf--p\">You\u2019ll then present the findings as simple bar plots that compare the loaders side by side.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><span class=\"pre--content\"><span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title function_\">count_alphanumeric<\/span>(<span class=\"hljs-params\">text<\/span>):\n    <span class=\"hljs-string\">\"\"\"\n    Count the number of alphanumeric characters in a given text.\n\n    Args:\n    text (str): The text to be analyzed.\n\n    Returns:\n    int: The total number of alphanumeric characters in the text.\n    \"\"\"<\/span>\n    <span class=\"hljs-keyword\">return<\/span> <span class=\"hljs-built_in\">sum<\/span>(char.isalnum() <span class=\"hljs-keyword\">for<\/span> char <span class=\"hljs-keyword\">in<\/span> text)\n\n<span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title function_\">num_tokens_from_string<\/span>(<span class=\"hljs-params\">string: <span class=\"hljs-built_in\">str<\/span><\/span>) 
-&gt; <span class=\"hljs-built_in\">int<\/span>:\n    <span class=\"hljs-string\">\"\"\"Returns the number of tokens in a text string.\"\"\"<\/span>\n    encoding = tiktoken.encoding_for_model(<span class=\"hljs-string\">\"gpt-4-1106-preview\"<\/span>)\n    num_tokens = <span class=\"hljs-built_in\">len<\/span>(encoding.encode(string))\n    <span class=\"hljs-keyword\">return<\/span> num_tokens\n\n<span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title function_\">analyze_texts<\/span>(<span class=\"hljs-params\">texts<\/span>):\n    <span class=\"hljs-string\">\"\"\"\n    Analyze the given texts to count total, alphanumeric, and newline characters, plus tokens.\n\n    Args:\n    texts (dict): A dictionary where keys are identifiers (e.g., loader names) and\n                  values are the corresponding text strings.\n\n    Returns:\n    tuple of dicts: A tuple containing four dictionaries, holding the counts of\n                    total characters, alphanumeric characters, newline characters, and tokens respectively.\n    \"\"\"<\/span>\n    total_characters = {loader: <span class=\"hljs-built_in\">len<\/span>(text) <span class=\"hljs-keyword\">for<\/span> loader, text <span class=\"hljs-keyword\">in<\/span> texts.items()}\n    alphanumeric_characters = {loader: count_alphanumeric(text) <span class=\"hljs-keyword\">for<\/span> loader, text <span class=\"hljs-keyword\">in<\/span> texts.items()}\n    newline_characters = {loader: text.count(<span class=\"hljs-string\">'\\n'<\/span>) <span class=\"hljs-keyword\">for<\/span> loader, text <span class=\"hljs-keyword\">in<\/span> texts.items()}\n    token_count = {loader: num_tokens_from_string(text) <span class=\"hljs-keyword\">for<\/span> loader, text <span class=\"hljs-keyword\">in<\/span> texts.items()}\n    <span class=\"hljs-keyword\">return<\/span> total_characters, alphanumeric_characters, newline_characters, token_count\n\n<span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title 
function_\">plot_data<\/span>(<span class=\"hljs-params\">data, title<\/span>):\n    <span class=\"hljs-string\">\"\"\"\n    Create a bar plot for the given data.\n\n    Args:\n    data (dict): A dictionary containing the data to be plotted. Keys are used as labels\n                 and values as the corresponding data points.\n    title (str): The title of the plot.\n\n    Note:\n    The bars in the plot are colored blue, green, and red, in the order of the dictionary keys.\n    \"\"\"<\/span>\n    plt.bar(<span class=\"hljs-built_in\">list<\/span>(data.keys()), <span class=\"hljs-built_in\">list<\/span>(data.values()), color=[<span class=\"hljs-string\">'blue'<\/span>, <span class=\"hljs-string\">'green'<\/span>, <span class=\"hljs-string\">'red'<\/span>])\n    plt.title(title)\n    plt.ylabel(<span class=\"hljs-string\">'Count'<\/span>)\n    plt.xticks(rotation=<span class=\"hljs-number\">45<\/span>)\n    plt.show()\n\n<span class=\"hljs-comment\"># Assumption: the original post never defines news_loader_doc or texts;<\/span>\n<span class=\"hljs-comment\"># this is one plausible way to build them from the loaders shown earlier.<\/span>\nnews_loader_doc = load_document(NewsURLLoader, website)\ntexts = {\n    <span class=\"hljs-string\">'WebBaseLoader'<\/span>: wb_loader_doc[<span class=\"hljs-number\">0<\/span>].page_content,\n    <span class=\"hljs-string\">'SeleniumURLLoader'<\/span>: selenium_loader_doc[<span class=\"hljs-number\">0<\/span>].page_content,\n    <span class=\"hljs-string\">'NewsURLLoader'<\/span>: news_loader_doc[<span class=\"hljs-number\">0<\/span>].page_content,\n}\n\ntotal_chars, alphanumeric_chars, newline_chars, token_count = analyze_texts(texts)<\/span><\/pre>\n\n\n\n<h3 class=\"wp-block-heading graf graf--h3\">Analyzing the Extracted Text<\/h3>\n\n\n\n<figure class=\"graf graf--figure\">\n<\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter graf-image\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*I7WKvlA2KZgYzx_211E8Ew.png\" alt=\"LangChain Document Loaders for Web\u00a0Data with Comet and CometLLM\"\/><figcaption class=\"wp-element-caption\">Graph by author<\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading graf graf--h3\"><strong class=\"markup--strong markup--h3-strong\">WebBaseLoader (16,191 Characters)<\/strong><\/h3>\n\n\n\n<p class=\"graf graf--p\">The text includes a mix of the main article content, website navigation elements, metadata, and other peripheral information. 
This indicates that WebBaseLoader extracts all text from the HTML without differentiating between the main content and other page elements.<\/p>\n\n\n\n<h4 class=\"wp-block-heading graf graf--h4\"><strong class=\"markup--strong markup--h4-strong\">Potential Challenges for RAG&nbsp;System<\/strong><\/h4>\n\n\n\n<ul class=\"wp-block-list postList\">\n<li>Noise in Data: The presence of non-relevant text (e.g., menu items, footer information) can introduce noise, potentially impacting the accuracy and relevance of the RAG system\u2019s outputs.<\/li>\n\n\n\n<li>Need for Post-Processing: To enhance the quality of embeddings, you might need to post-process this text to filter out irrelevant parts and focus on the main content.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading graf graf--h3\"><strong class=\"markup--strong markup--h3-strong\">SeleniumURLLoader (23,598 Characters)<\/strong><\/h3>\n\n\n\n<p class=\"graf graf--p\">&nbsp;The highest character count comes from the SeleniumURLLoader. This can be attributed to its method of loading pages as a browser would, capturing the primary content and potentially more of the surrounding elements and dynamically loaded content.<\/p>\n\n\n\n<p class=\"graf graf--p\">This text, similar to the WebBaseLoader\u2019s output, includes the main article content and additional elements like website headers, footers, and navigation links. 
However, it\u2019s more focused on the article, suggesting a better capture of the intended content.<\/p>\n\n\n\n<h4 class=\"wp-block-heading graf graf--h4\"><strong class=\"markup--strong markup--h4-strong\">Potential Challenges for RAG&nbsp;System<\/strong><\/h4>\n\n\n\n<ul class=\"wp-block-list postList\">\n<li>Reduced Noise, But Still Present: While there\u2019s less irrelevant text compared to the WebBaseLoader output, the presence of some non-article elements can still introduce noise.<\/li>\n\n\n\n<li>Post-Processing Consideration: Like with the WebBaseLoader, filtering out irrelevant parts will enhance the quality of embeddings.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading graf graf--h3\"><strong class=\"markup--strong markup--h3-strong\">NewsURLLoader (7,580 Characters)<\/strong><\/h3>\n\n\n\n<p class=\"graf graf--p\">The NewsURLLoader shows the lowest character count.<\/p>\n\n\n\n<p class=\"graf graf--p\">The text appears more focused and streamlined than WebBaseLoader and SeleniumURLLoader&#8217;s outputs. It mainly consists of the main article content, with minimal peripheral information. 
This indicates that the NewsURLLoader is effectively targeting and extracting the core content of the news article.<\/p>\n\n\n\n<h4 class=\"wp-block-heading graf graf--h4\"><strong class=\"markup--strong markup--h4-strong\">Potential Challenges for RAG&nbsp;System<\/strong><\/h4>\n\n\n\n<ul class=\"wp-block-list postList\">\n<li>High Relevance and Quality: The content&#8217;s higher relevance and focused nature mean it\u2019s more likely to produce accurate and contextually relevant embeddings in a vector database.<\/li>\n\n\n\n<li>Limited Need for Post-Processing: Unlike the other two loaders, the NewsURLLoader requires minimal post-processing to filter out noise, as it already provides a clean extraction of the news content.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading graf graf--h3\"><strong class=\"markup--strong markup--h3-strong\">Implications<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list postList\">\n<li><strong class=\"markup--strong markup--li-strong\">WebBaseLoader<\/strong>: Offers a balance between breadth and depth, suitable for general-purpose web scraping where capturing a wide range of content is necessary.<\/li>\n\n\n\n<li><strong class=\"markup--strong markup--li-strong\">SeleniumURLLoader<\/strong>: Ideal for scenarios where comprehensive text capture, including dynamic content, is crucial. 
However, this can lead to a larger volume of data, potentially increasing processing time and resource usage.<\/li>\n\n\n\n<li><strong class=\"markup--strong markup--li-strong\">NewsURLLoader<\/strong>: Best suited for applications where focused and relevant content extraction is key, such as news aggregation and analysis, providing clean and concise outputs.<\/li>\n<\/ul>\n\n\n\n<p class=\"graf graf--p\">These insights help in understanding how each loader functions and in choosing the right tool depending on the specific requirements of your application, especially in an RAG pipeline.<\/p>\n\n\n\n<h4 class=\"wp-block-heading graf graf--h4\">More analysis<\/h4>\n\n\n\n<p class=\"graf graf--p\">You can do a similar analysis as above across different axes:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><span class=\"pre--content\"><span class=\"hljs-built_in\">plot_data<\/span>(alphanumeric_chars, 'Number of Alphanumeric Characters')<\/span><\/pre>\n\n\n\n<figure class=\"graf graf--figure\">\n<\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter graf-image\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*W2Fwk68n6t2-wqXceiv6Iw.png\" alt=\"LangChain Document Loaders for Web\u00a0Data with Comet and CometLLM\"\/><figcaption class=\"wp-element-caption\">Graph by author<\/figcaption><\/figure>\n\n\n\n<pre class=\"wp-block-preformatted\"><span class=\"pre--content\"><span class=\"hljs-built_in\">plot_data<\/span>(newline_chars, 'Number of Newline Characters')<\/span><\/pre>\n\n\n\n<figure class=\"graf graf--figure\">\n<\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter graf-image\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*VjlJAUZVqyJjcTuUvyqVng.png\" alt=\"LangChain Document Loaders for Web\u00a0Data with Comet and CometLLM\"\/><figcaption class=\"wp-element-caption\">Graph by author<\/figcaption><\/figure>\n\n\n\n<pre class=\"wp-block-preformatted\"><span class=\"pre--content\"><span 
class=\"hljs-built_in\">plot_data<\/span>(token_count, 'Number of Tokens')<\/span><\/pre>\n\n\n\n<figure class=\"graf graf--figure\">\n<\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter graf-image\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*FBGtL9enVuUDdSrvtwbUkw.png\" alt=\"LangChain Document Loaders for Web\u00a0Data with Comet and CometLLM\"\/><figcaption class=\"wp-element-caption\">Graph by author<\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading graf graf--h3\">Why do we see this difference?<\/h3>\n\n\n\n<p class=\"graf graf--p\">The discrepancy in the number of characters each loader extracted can be attributed to their distinct methodologies and the source code that drives their functionality.<\/p>\n\n\n\n<p class=\"graf graf--p\">Here\u2019s a breakdown based on the source code and operational differences:<\/p>\n\n\n\n<h4 class=\"wp-block-heading graf graf--h4\"><strong class=\"markup--strong markup--h4-strong\">WebBaseLoader<\/strong><\/h4>\n\n\n\n<ul class=\"wp-block-list postList\">\n<li><strong class=\"markup--strong markup--li-strong\">Methodology<\/strong>: It performs direct HTML fetching using HTTP requests and parses the HTML content with BeautifulSoup.<\/li>\n\n\n\n<li><strong class=\"markup--strong markup--li-strong\">Why the Difference<\/strong>: This loader extracts all text content from the HTML, including main content, navigation elements, headers, footers, and possibly some script elements. However, it does not execute JavaScript, so any content loaded dynamically (which is common in modern web pages) is not captured. 
This can lead to a moderate character count\u200a\u2014\u200asubstantial but not exhaustive.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading graf graf--h4\"><strong class=\"markup--strong markup--h4-strong\">SeleniumURLLoader<\/strong><\/h4>\n\n\n\n<ul class=\"wp-block-list postList\">\n<li><strong class=\"markup--strong markup--li-strong\">Methodology<\/strong>: Uses Selenium for browser automation, which launches a browser instance (often in headless mode) and interacts with the page like a human user. It can execute JavaScript and capture dynamically loaded content.<\/li>\n\n\n\n<li><strong class=\"markup--strong markup--li-strong\">Why the Difference<\/strong>: The higher character count is likely due to this loader\u2019s ability to capture more comprehensive content, including dynamic elements that only load upon user interaction or as a part of JavaScript execution. This method fetches the static HTML content and the additional text that becomes available as the page fully renders in a browser environment. This thorough approach results in capturing a larger volume of text.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading graf graf--h4\"><strong class=\"markup--strong markup--h4-strong\">NewsURLLoader<\/strong><\/h4>\n\n\n\n<ul class=\"wp-block-list postList\">\n<li><strong class=\"markup--strong markup--li-strong\">Methodology<\/strong>: Utilizes the <code class=\"markup--code markup--li-code\">newspaper<\/code> library designed explicitly for scraping and curating news articles. It is optimized for extracting article content while excluding unrelated material.<\/li>\n\n\n\n<li><strong class=\"markup--strong markup--li-strong\">Why the Difference<\/strong>: The lower character count reflects its focused extraction. The <code class=\"markup--code markup--li-code\">newspaper<\/code> library targets the core article text and is adept at ignoring extraneous content like ads, sidebars, or site navigation elements. 
This results in a cleaner and more concise text extraction, focusing primarily on the main news content.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading graf graf--h4\"><strong class=\"markup--strong markup--h4-strong\">Summary<\/strong><\/h4>\n\n\n\n<ul class=\"wp-block-list postList\">\n<li><strong class=\"markup--strong markup--li-strong\">WebBaseLoader<\/strong>: Provides a broad capture of HTML content but misses dynamic content, leading to a moderate character count.<\/li>\n\n\n\n<li><strong class=\"markup--strong markup--li-strong\">SeleniumURLLoader<\/strong>: Captures a complete picture of the webpage, including dynamic content, which results in the highest character count.<\/li>\n\n\n\n<li><strong class=\"markup--strong markup--li-strong\">NewsURLLoader<\/strong>: Highly specialized and focused on news content, leading to the lowest character count due to its targeted extraction.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading graf graf--h3\">Splitting text for retrieval using RecursiveCharacterTextSplitter<\/h3>\n\n\n\n<p class=\"graf graf--p\">RecursiveCharacterTextSplitter is designed to split text into chunks based on a list of separators, which can be tailored for different programming languages or text formats.<\/p>\n\n\n\n<p class=\"graf graf--p\">The class employs a recursive approach to splitting, ensuring that if one separator doesn&#8217;t result in a split, it falls back to the next one in the list.<\/p>\n\n\n\n<p class=\"graf graf--p\">Here\u2019s a breakdown of the key components and functionalities of this class:<\/p>\n\n\n\n<ul class=\"wp-block-list postList\">\n<li><strong class=\"markup--strong markup--li-strong\">Separators<\/strong>: The class takes a list of separators (<code class=\"markup--code markup--li-code\">separators<\/code>) which are used to split the text. The default list includes common separators like new lines and spaces. 
The separators can be regular expressions if <code class=\"markup--code markup--li-code\">is_separator_regex<\/code> is set to <code class=\"markup--code markup--li-code\">True<\/code>.<\/li>\n\n\n\n<li><strong class=\"markup--strong markup--li-strong\">Recursive Splitting<\/strong>: The method <code class=\"markup--code markup--li-code\">_split_text<\/code> attempts to split the text using the provided separators. If a separator doesn&#8217;t successfully split the text, or if the resulting chunks are too large (exceed the specified chunk size), the method recursively tries with the next separator in the list.<\/li>\n\n\n\n<li><strong class=\"markup--strong markup--li-strong\">Language-Specific Separators<\/strong>: The class can adapt its separators based on the programming language of the text, as indicated by the <code class=\"markup--code markup--li-code\">get_separators_for_language<\/code> method. This method returns a list of separators appropriate for programming languages like Python, Java, C++, etc.<\/li>\n\n\n\n<li><strong class=\"markup--strong markup--li-strong\">Chunk Size and Merging<\/strong>: The class ensures that the resulting chunks are within a certain size limit (<code class=\"markup--code markup--li-code\">_chunk_size<\/code>). 
If the split produces pieces smaller than the limit, they are merged back together so that each chunk is of a reasonable size.<\/li>\n<\/ul>\n\n\n\n<p class=\"graf graf--p\">To use this class effectively:<\/p>\n\n\n\n<ol class=\"wp-block-list postList\">\n<li><strong class=\"markup--strong markup--li-strong\">Instantiate the Class<\/strong>: Create an instance of <code class=\"markup--code markup--li-code\">RecursiveCharacterTextSplitter<\/code>, and specify the separators if the default ones are unsuitable for your text.<\/li>\n\n\n\n<li><strong class=\"markup--strong markup--li-strong\">Split Texts<\/strong>: Use the <code class=\"markup--code markup--li-code\">split_text<\/code> method to split raw strings, or the <code class=\"markup--code markup--li-code\">split_documents<\/code> method to split the Document objects returned by each of your loaders.<\/li>\n\n\n\n<li><strong class=\"markup--strong markup--li-strong\">Post-Processing<\/strong>: After splitting, you may need to post-process the chunks, especially if the splitting results in broken sentences or contexts.<\/li>\n\n\n\n<li><strong class=\"markup--strong markup--li-strong\">Further Analysis<\/strong>: Once the text is split into manageable chunks, you can proceed with your analysis by creating embeddings or pushing them to a vector database.<\/li>\n<\/ol>\n\n\n\n<p class=\"graf graf--p\">This class is handy when dealing with large text files or texts where a simple split by a single character (like a newline) is insufficient. 
It allows for a more nuanced and flexible approach to text splitting, catering to the specific structural nuances of different text types.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><span class=\"pre--content\">text_splitter = RecursiveCharacterTextSplitter(\n    <span class=\"hljs-comment\"># Set a really small chunk size, just to show.<\/span>\n    chunk_size = <span class=\"hljs-number\">250<\/span>,\n    chunk_overlap  = <span class=\"hljs-number\">5<\/span>,\n    length_function = <span class=\"hljs-built_in\">len<\/span>\n)\n\ntexts = {\n    <span class=\"hljs-string\">'Web Base Loader'<\/span>: wb_loader_doc[<span class=\"hljs-number\">0<\/span>].page_content,\n    <span class=\"hljs-string\">'Selenium Loader'<\/span>: selenium_loader_doc[<span class=\"hljs-number\">0<\/span>].page_content,\n    <span class=\"hljs-string\">'News URL Loader'<\/span>: newsurl_docs[<span class=\"hljs-number\">0<\/span>].page_content\n}\n\n<span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title function_\">create_chunks<\/span>(<span class=\"hljs-params\">document<\/span>):\n    <span class=\"hljs-keyword\">return<\/span> text_splitter.split_documents(document)\n\n<span class=\"hljs-comment\"># Creating chunks for each document<\/span>\nwb_chunks = create_chunks(wb_loader_doc)\nselenium_chunks = create_chunks(selenium_loader_doc)\nnewsurl_chunks = create_chunks(newsurl_docs)\n\nchunk_counts = {\n    <span class=\"hljs-string\">'WebBase Loader'<\/span>: <span class=\"hljs-built_in\">len<\/span>(wb_chunks),\n    <span class=\"hljs-string\">'Selenium Loader'<\/span>: <span class=\"hljs-built_in\">len<\/span>(selenium_chunks),\n    <span class=\"hljs-string\">'News URL Loader'<\/span>: <span class=\"hljs-built_in\">len<\/span>(newsurl_chunks)\n}\n\nplot_data(chunk_counts, <span class=\"hljs-string\">'Number of Chunks in Each Document'<\/span>)<\/span><\/pre>\n\n\n\n<figure class=\"graf graf--figure\">\n<\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter 
graf-image\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*eVXFVGhn2j57o83Mbct0BQ.png\" alt=\"LangChain Document Loaders for Web\u00a0Data with Comet and CometLLM\"\/><figcaption class=\"wp-element-caption\">Graph by author<\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading graf graf--h3\">RAG Pipeline<\/h3>\n\n\n\n<p class=\"graf graf--p\">To set up your RAG pipeline, you must create vector store retrievers. The following code will do that for you:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><span class=\"pre--content\"><span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title function_\">create_index_and_retriever<\/span>(<span class=\"hljs-params\">chunks, embeddings<\/span>):\n    <span class=\"hljs-string\">\"\"\"\n    Create an index and retriever for the given chunks using the specified embeddings.\n\n    Args:\n    chunks (list): List of text chunks to be indexed.\n    embeddings (Embeddings object): Embedding model used for creating the index.\n\n    Returns:\n    retriever (Retriever object): The retriever object for the created index.\n    \"\"\"<\/span>\n    index = FAISS.from_documents(chunks, embeddings)\n    retriever = index.as_retriever()\n    <span class=\"hljs-keyword\">return<\/span> retriever\n\n\n<span class=\"hljs-comment\"># Embedding and Language Model setup<\/span>\nembeddings = OpenAIEmbeddings(show_progress_bar=<span class=\"hljs-literal\">True<\/span>)\nllm = ChatOpenAI(model=<span class=\"hljs-string\">\"gpt-4-1106-preview\"<\/span>)\n\n<span class=\"hljs-comment\"># Creating indexes and retrievers<\/span>\nwb_retriever = create_index_and_retriever(wb_chunks, embeddings)\nselenium_retriever = create_index_and_retriever(selenium_chunks, embeddings)\nnews_url_retriever = create_index_and_retriever(newsurl_chunks, embeddings)<\/span><\/pre>\n\n\n\n<p class=\"graf graf--p\">The following are the questions (queries) and ground truth answers (answers) that you\u2019ll use to assess the 
performance of each retriever.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><span class=\"pre--content\"><span class=\"hljs-attr\">queries<\/span> = [\n    <span class=\"hljs-string\">\"What are educators' main concerns regarding using AI chatbots like ChatGPT by students?\"<\/span>,\n    <span class=\"hljs-string\">\"Why do the Stanford researchers believe that concerns about AI chatbots leading to increased student cheating are misdirected?\"<\/span>,\n    <span class=\"hljs-string\">\"What findings have the Stanford researchers gathered about the prevalence of cheating among U.S. high school students in the context of AI chatbots?\"<\/span>,\n    <span class=\"hljs-string\">\"What alternative reasons might explain why students cheat, according to the article?\"<\/span>,\n    <span class=\"hljs-string\">\"What recommendations or strategies do the article or researchers suggest for addressing academic dishonesty in schools?\"<\/span>\n]\n\n<span class=\"hljs-attr\">answers<\/span> = [\n    <span class=\"hljs-string\">\"Educators are concerned about students using AI chatbots like ChatGPT to cheat by passing off AI-generated writing as their own.\"<\/span>,\n    <span class=\"hljs-string\">\"Stanford researchers believe concerns about AI chatbots leading to increased cheating are misdirected because cheating predates these technologies, and when students cheat, it's typically for reasons unrelated to technology access.\"<\/span>,\n    <span class=\"hljs-string\">\"Their research shows that 60% to 70% of students admitted to cheating before the advent of AI chatbots, and this rate has remained constant or even slightly decreased in 2023.\"<\/span>,\n    <span class=\"hljs-string\">\"Alternative reasons for cheating include struggling with material, excessive homework, assignments feeling like busywork, and overwhelming pressure to achieve.\"<\/span>,\n    <span class=\"hljs-string\">\"Recommended strategies include helping students feel more engaged and valued, 
addressing deeper systemic problems, and promoting a sense of belonging, purpose, and connection in the educational environment.\"<\/span>\n]<\/span><\/pre>\n\n\n\n<p class=\"graf graf--p\">The <code class=\"markup--code markup--p-code\">QAChainRunner<\/code> class is a small helper that streamlines running queries and collecting answers through a RetrievalQA chain.<\/p>\n\n\n\n<p class=\"graf graf--p\">It sits between your queries and the retrieval pipeline. Upon initialization, it accepts a pre-defined language model (LLM), which it reuses for every query it processes.<\/p>\n\n\n\n<p class=\"graf graf--p\">In action, the <code class=\"markup--code markup--p-code\">QAChainRunner<\/code> takes a retriever object and a query as input.<\/p>\n\n\n\n<p class=\"graf graf--p\">It then constructs a RetrievalQA chain around the provided language model and passes the query through it. The class can also run a whole list of queries, returning one structured result per query. 
Each result includes the original query, the generated answer, and the source documents that informed the response, offering an insightful peek into the retrieval process.<\/p>\n\n\n\n<p class=\"graf graf--p\">In essence, <code class=\"markup--code markup--p-code\">QAChainRunner<\/code> acts as an intelligent intermediary, transforming simple queries into insightful answers, making it an indispensable tool for any application or system focused on advanced information retrieval and question-answering tasks.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><span class=\"pre--content\"><span class=\"hljs-keyword\">from<\/span> typing <span class=\"hljs-keyword\">import<\/span> <span class=\"hljs-type\">List<\/span>, <span class=\"hljs-type\">Dict<\/span>, <span class=\"hljs-type\">Any<\/span>\n<span class=\"hljs-keyword\">from<\/span> datasets <span class=\"hljs-keyword\">import<\/span> Dataset\n\n<span class=\"hljs-keyword\">class<\/span> <span class=\"hljs-title class_\">QAChainRunner<\/span>:\n    <span class=\"hljs-string\">\"\"\"\n    Class to handle running queries through a RetrievalQA chain.\n    \"\"\"<\/span>\n    <span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title function_\">__init__<\/span>(<span class=\"hljs-params\">self, llm<\/span>):\n        self.llm = llm\n\n    <span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title function_\">run_retrieval_qa<\/span>(<span class=\"hljs-params\">self, retriever, query<\/span>):\n        <span class=\"hljs-string\">\"\"\"\n        Run a query through the RetrievalQA chain.\n\n        Args:\n        retriever (Retriever object): The retriever to use.\n        query (str): The query to process.\n\n        Returns:\n        dict: The response including the query, result, and source documents.\n        \"\"\"<\/span>\n        <span class=\"hljs-keyword\">try<\/span>:\n            qa_chain = RetrievalQA.from_chain_type(llm=self.llm,\n                                                   
retriever=retriever,\n                                                   verbose=<span class=\"hljs-literal\">True<\/span>,\n                                                   return_source_documents=<span class=\"hljs-literal\">True<\/span>)\n            <span class=\"hljs-keyword\">return<\/span> qa_chain.invoke(query)\n        <span class=\"hljs-keyword\">except<\/span> Exception <span class=\"hljs-keyword\">as<\/span> e:\n            <span class=\"hljs-built_in\">print<\/span>(<span class=\"hljs-string\">f\"Error in running RetrievalQA: <span class=\"hljs-subst\">{e}<\/span>\"<\/span>)\n            <span class=\"hljs-keyword\">return<\/span> {<span class=\"hljs-string\">\"query\"<\/span>: query, <span class=\"hljs-string\">\"result\"<\/span>: <span class=\"hljs-literal\">None<\/span>, <span class=\"hljs-string\">\"source_documents\"<\/span>: []}\n\n    <span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title function_\">run_queries<\/span>(<span class=\"hljs-params\">self, retriever, queries: <span class=\"hljs-type\">List<\/span>[<span class=\"hljs-built_in\">str<\/span>]<\/span>) -&gt; <span class=\"hljs-type\">List<\/span>[<span class=\"hljs-type\">Dict<\/span>[<span class=\"hljs-built_in\">str<\/span>, <span class=\"hljs-type\">Any<\/span>]]:\n        <span class=\"hljs-string\">\"\"\"\n        Run multiple queries through the RetrievalQA chain.\n\n        Args:\n        retriever (Retriever object): The retriever to use.\n        queries (List[str]): List of queries to process.\n\n        Returns:\n        List[Dict[str, Any]]: List of responses for each query.\n        \"\"\"<\/span>\n        <span class=\"hljs-keyword\">return<\/span> [self.run_retrieval_qa(retriever, query) <span class=\"hljs-keyword\">for<\/span> query <span class=\"hljs-keyword\">in<\/span> queries]\n\n<span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title function_\">parse_retrieval_qa_results<\/span>(<span class=\"hljs-params\">results, ground_truths<\/span>):\n  
  <span class=\"hljs-string\">\"\"\"\n    Parse the results from the RetrievalQA pipeline into a structured format.\n\n    Args:\n    results (List[Dict[str, Any]]): Results from the RetrievalQA pipeline.\n    ground_truths (List[str]): Ground truth answers.\n\n    Returns:\n    Dict[str, List[Any]]: Parsed results including questions, answers, contexts, and ground truths.\n    \"\"\"<\/span>\n    parsed_results = {<span class=\"hljs-string\">'question'<\/span>: [], <span class=\"hljs-string\">'answer'<\/span>: [], <span class=\"hljs-string\">'contexts'<\/span>: [], <span class=\"hljs-string\">'ground_truths'<\/span>: []}\n\n    <span class=\"hljs-keyword\">for<\/span> i, result <span class=\"hljs-keyword\">in<\/span> <span class=\"hljs-built_in\">enumerate<\/span>(results):\n        query = result.get(<span class=\"hljs-string\">'query'<\/span>)\n        answer = result.get(<span class=\"hljs-string\">'result'<\/span>)\n        source_documents = result.get(<span class=\"hljs-string\">'source_documents'<\/span>, [])\n\n        <span class=\"hljs-comment\"># Transform Document objects into a compatible format (e.g., string or dict)<\/span>\n        contexts = []\n        <span class=\"hljs-keyword\">for<\/span> doc <span class=\"hljs-keyword\">in<\/span> source_documents:\n            <span class=\"hljs-keyword\">if<\/span> <span class=\"hljs-built_in\">hasattr<\/span>(doc, <span class=\"hljs-string\">'page_content'<\/span>):\n                <span class=\"hljs-comment\"># Assuming doc is a Document object with a 'page_content' attribute<\/span>\n                contexts.append(doc.page_content)\n            <span class=\"hljs-keyword\">elif<\/span> <span class=\"hljs-built_in\">isinstance<\/span>(doc, <span class=\"hljs-built_in\">dict<\/span>):\n                <span class=\"hljs-comment\"># If doc is already a dictionary, use as is or convert to string<\/span>\n                contexts.append(<span class=\"hljs-built_in\">str<\/span>(doc))\n            <span 
class=\"hljs-keyword\">else<\/span>:\n                <span class=\"hljs-comment\"># Fallback for other types<\/span>\n                contexts.append(<span class=\"hljs-built_in\">str<\/span>(doc))\n\n        parsed_results[<span class=\"hljs-string\">'question'<\/span>].append(query)\n        parsed_results[<span class=\"hljs-string\">'answer'<\/span>].append(answer)\n        parsed_results[<span class=\"hljs-string\">'contexts'<\/span>].append(contexts)\n        parsed_results[<span class=\"hljs-string\">'ground_truths'<\/span>].append(ground_truths[i] <span class=\"hljs-keyword\">if<\/span> i &lt; <span class=\"hljs-built_in\">len<\/span>(ground_truths) <span class=\"hljs-keyword\">else<\/span> [])\n\n    <span class=\"hljs-keyword\">return<\/span> parsed_results\n\n\n<span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title function_\">create_hf_dataset_from_dict<\/span>(<span class=\"hljs-params\">parsed_results: <span class=\"hljs-type\">Dict<\/span>[<span class=\"hljs-built_in\">str<\/span>, <span class=\"hljs-type\">List<\/span>[<span class=\"hljs-type\">Any<\/span>]]<\/span>) -&gt; Dataset:\n    <span class=\"hljs-string\">\"\"\"\n    Convert parsed results into a Hugging Face Dataset object.\n\n    Args:\n    parsed_results (Dict[str, List[Any]]): Parsed results from the RetrievalQA pipeline.\n\n    Returns:\n    Dataset: A Hugging Face Dataset object.\n    \"\"\"<\/span>\n    <span class=\"hljs-keyword\">try<\/span>:\n        <span class=\"hljs-keyword\">return<\/span> Dataset.from_dict(parsed_results)\n    <span class=\"hljs-keyword\">except<\/span> Exception <span class=\"hljs-keyword\">as<\/span> e:\n        <span class=\"hljs-built_in\">print<\/span>(<span class=\"hljs-string\">f\"Error in creating dataset: <span class=\"hljs-subst\">{e}<\/span>\"<\/span>)\n        <span class=\"hljs-keyword\">return<\/span> <span class=\"hljs-literal\">None<\/span><\/span><\/pre>\n\n\n\n<h3 class=\"wp-block-heading graf graf--h3\">Context Recall and Context 
Precision in&nbsp;ragas<\/h3>\n\n\n\n<p class=\"graf graf--p\">Context Recall is a retrieval metric used to evaluate systems like Retrieval-Augmented Generation (RAG). It measures how much of the information needed to produce an accurate and contextually appropriate answer was actually fetched by the retrieval system.<\/p>\n\n\n\n<p class=\"graf graf--p\">Context Precision is the companion metric for evaluating retrieval in Retrieval-Augmented Generation (RAG) systems. While Context Recall focuses on the proportion of relevant documents retrieved from the total ground truths, Context Precision measures what proportion of all the retrieved documents are relevant.<\/p>\n\n\n\n<h3 class=\"wp-block-heading graf graf--h3\">How Context Recall&nbsp;Works<\/h3>\n\n\n\n<ul class=\"wp-block-list postList\">\n<li><strong class=\"markup--strong markup--li-strong\">Information Retrieval<\/strong>: In RAG systems, when a query is posed, the model retrieves a set of documents or contexts that it believes are relevant to the query.<\/li>\n\n\n\n<li><strong class=\"markup--strong markup--li-strong\">Answer Generation<\/strong>: The model then uses these contexts and its language understanding capabilities to generate an answer.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading graf graf--h4\">What Context Recall&nbsp;Measures<\/h4>\n\n\n\n<ul class=\"wp-block-list postList\">\n<li><strong class=\"markup--strong markup--li-strong\">Coverage of Relevant Contexts<\/strong>: Context Recall assesses how much of the relevant information, as defined by the ground truth, appears in the retrieved documents.<\/li>\n\n\n\n<li><strong class=\"markup--strong markup--li-strong\">Effectiveness of the Retrieval Component<\/strong>: It evaluates the retrieval component of the model, which is crucial for the overall quality of the 
answer.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading graf graf--h4\">How Context Recall is&nbsp;Measured<\/h4>\n\n\n\n<ul class=\"wp-block-list postList\">\n<li><strong class=\"markup--strong markup--li-strong\">Comparison With Ground Truths<\/strong>: Typically, for each query, there is a set of ground truth documents known to contain relevant information. Context Recall measures how many of these ground truth documents were retrieved by the model.<\/li>\n\n\n\n<li><strong class=\"markup--strong markup--li-strong\">Calculation<\/strong>: It can be calculated as the proportion or count of relevant documents retrieved out of the total ground truth documents. This is often represented as a percentage or a ratio.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading graf graf--h4\">Importance in AI&nbsp;Models<\/h4>\n\n\n\n<ul class=\"wp-block-list postList\">\n<li><strong class=\"markup--strong markup--li-strong\">Improves Answer Quality<\/strong>: High Context Recall indicates that the model effectively fetches relevant information, which is crucial for generating accurate and comprehensive answers.<\/li>\n\n\n\n<li><strong class=\"markup--strong markup--li-strong\">Model Optimization<\/strong>: By measuring Context Recall, developers can fine-tune the retrieval component of the model for better performance.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading graf graf--h3\">How Context Precision Works<\/h3>\n\n\n\n<ul class=\"wp-block-list postList\">\n<li><strong class=\"markup--strong markup--li-strong\">Information Retrieval<\/strong>: When a query is posed in a RAG system, the system retrieves a set of documents or contexts.<\/li>\n\n\n\n<li><strong class=\"markup--strong markup--li-strong\">Relevance Assessment<\/strong>: Context Precision assesses how many of these retrieved documents are relevant to the query.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading graf graf--h4\">What Context Precision Measures<\/h4>\n\n\n\n<ul class=\"wp-block-list 
postList\">\n<li><strong class=\"markup--strong markup--li-strong\">Accuracy of Retrieved Contexts<\/strong>: It measures the proportion of the retrieved documents that are relevant to the query.<\/li>\n\n\n\n<li><strong class=\"markup--strong markup--li-strong\">Efficiency of the Retrieval Component<\/strong>: High Context Precision indicates that the model&#8217;s retrieval component is effective and accurate, fetching more relevant documents than irrelevant ones.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading graf graf--h4\">How Context Precision is&nbsp;Measured<\/h4>\n\n\n\n<ul class=\"wp-block-list postList\">\n<li><strong class=\"markup--strong markup--li-strong\">Comparison With Relevant Documents<\/strong>: Context Precision is calculated by dividing the number of relevant documents retrieved by the total number of documents retrieved for each query.<\/li>\n\n\n\n<li><strong class=\"markup--strong markup--li-strong\">Calculation<\/strong>: Often expressed as a percentage, it indicates the retrieval system\u2019s accuracy.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading graf graf--h4\">Importance in AI&nbsp;Models<\/h4>\n\n\n\n<ul class=\"wp-block-list postList\">\n<li><strong class=\"markup--strong markup--li-strong\">Enhances the Quality of Generated Answers<\/strong>: High Context Precision helps the model generate more accurate and contextually correct answers by ensuring that the retrieved documents are mostly relevant.<\/li>\n\n\n\n<li><strong class=\"markup--strong markup--li-strong\">Model Optimization and Balancing<\/strong>: Alongside Context Recall, Context Precision helps fine-tune RAG models&#8217; retrieval components. 
A balance between Context Recall and Precision is often sought for optimal performance.<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-preformatted\"><span class=\"pre--content\"><span class=\"hljs-keyword\">from<\/span> ragas <span class=\"hljs-keyword\">import<\/span> evaluate\n\n<span class=\"hljs-keyword\">from<\/span> ragas.<span class=\"hljs-property\">metrics<\/span> <span class=\"hljs-keyword\">import<\/span> context_recall, context_precision\n\nwb_result = evaluate(wb_dataset, metrics=[context_precision, context_recall])\n\nselenium_result = evaluate(selenium_dataset, metrics=[context_precision, context_recall])\n\nnewsurl_result = evaluate(newsurl_dataset, metrics=[context_precision, context_recall])<\/span><\/pre>\n\n\n\n<h3 class=\"wp-block-heading graf graf--h3\">Interpretation of&nbsp;Results<\/h3>\n\n\n\n<p class=\"graf graf--p\">I\u2019m not gonna lie: I am pretty surprised by the results. I was expecting to see the NewsURLLoader win across all metrics.<\/p>\n\n\n\n<figure class=\"graf graf--figure\">\n<\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter graf-image\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*ZsTv-WnA_J6D7GE8-IK9ew.png\" alt=\"LangChain Document Loaders for Web\u00a0Data with Comet and CometLLM\"\/><figcaption class=\"wp-element-caption\">Graph by author<\/figcaption><\/figure>\n\n\n\n<p class=\"graf graf--p\"><strong class=\"markup--strong markup--p-strong\">Web Base Loader\u2019s Superiority<\/strong>: The Web Base loader has the highest RAGAS score, indicating it\u2019s the most effective overall in Retrieval Augmented Generation. 
Its high context precision and recall suggest it\u2019s adept at retrieving relevant documents without missing many important ones.<\/p>\n\n\n\n<p class=\"graf graf--p\"><strong class=\"markup--strong markup--p-strong\">Selenium Loader\u2019s Balanced Performance<\/strong>: The Selenium loader shows a slightly lower RAGAS score but maintains a high context recall, equal to the Web Base loader. Its context precision is lower, though, which might suggest that a slightly larger proportion of the documents it retrieves are less relevant.<\/p>\n\n\n\n<p class=\"graf graf--p\"><strong class=\"markup--strong markup--p-strong\">News URL Loader\u2019s Lower Recall<\/strong>: While matching the Web Base loader in precision, the News URL loader falls behind in context recall and RAGAS score. This could indicate that while the documents it retrieves are mostly relevant, it misses more of the relevant information than the other loaders do.<\/p>\n\n\n\n<h4 class=\"wp-block-heading graf graf--h4\">The observation that the NewsURLLoader extracts cleaner text yet scores lower on the overall RAGAS score and context recall is quite intriguing and points to a few potential reasons:<\/h4>\n\n\n\n<p class=\"graf graf--p\"><strong class=\"markup--strong markup--p-strong\">Precision vs. Quality of Content<\/strong>: While the NewsURLLoader might be extracting cleaner, more precise text, the effectiveness of a retrieval system in a Retrieval-Augmented Generation (RAG) setup is not solely determined by text cleanliness. The key is to retrieve content that is not just clean but also highly relevant and comprehensive in answering the query. 
If the cleaner text is less comprehensive or slightly off-topic, it might contribute less effectively to the answer generation, impacting the RAGAS score.<\/p>\n\n\n\n<p class=\"graf graf--p\"><strong class=\"markup--strong markup--p-strong\">Nature of Source Documents<\/strong>: The NewsURLLoader might be optimized for extracting text from news websites, which often have cleaner and more structured content. However, if the content from these sources is less diverse and less rich in answering a wide array of queries compared to other sources, it might lead to lower recall and RAGAS scores.<\/p>\n\n\n\n<p class=\"graf graf--p\"><strong class=\"markup--strong markup--p-strong\">Context Recall Challenge<\/strong>: The lower context recall score suggests that the NewsURLLoader, despite producing high-quality text, might be missing out on a significant number of relevant passages. This could be due to its stricter, more conservative extraction, which favors precision over breadth.<\/p>\n\n\n\n<p class=\"graf graf--p\"><strong class=\"markup--strong markup--p-strong\">Matching Query with Context<\/strong>: The effectiveness of a RAG system also depends on how well the retrieved context aligns with the nuances of the query. If the NewsURLLoader\u2019s algorithm is tuned to favour text cleanliness over nuanced matching, it might yield text that, while clean, is not as aligned with the specific needs of the query.<\/p>\n\n\n\n<p class=\"graf graf--p\"><strong class=\"markup--strong markup--p-strong\">Integration with the RAG System<\/strong>: The overall architecture and integration of the NewsURLLoader with the RAG system could also play a role. 
Even if the text is cleaner, other aspects, like how the loader interfaces with the language model, the handling of metadata, and the overall synergy with the RAG process, are crucial.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>And An Assessment of How They Impact Your ragas Metrics If you\u2019ve ever wondered how the quality of information sourced by language models affects their outputs, you\u2019re in the right place.&nbsp;I\u2019m trying to unpack how different document loaders in LangChain impact a Retrieval Augmented Generation (RAG) system. Why is this important?&nbsp; RAG is a game-changer. [&hellip;]<\/p>\n","protected":false},"author":68,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[65,7],"tags":[70,71,52,31,34],"coauthors":[166],"class_list":["post-8279","post","type-post","status-publish","format-standard","hentry","category-llmops","category-tutorials","tag-langchain","tag-language-models","tag-llm","tag-llmops","tag-prompt-engineering"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>LangChain Document Loaders for Web\u00a0Data - Comet<\/title>\n<meta name=\"description\" content=\"The effectiveness of RAG hinges on the method used to retrieve documents. 
Explore 3 key LangChain document loaders + how they effect output\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/langchain-document-loaders-for-web-data\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"LangChain Document Loaders for Web\u00a0Data\" \/>\n<meta property=\"og:description\" content=\"The effectiveness of RAG hinges on the method used to retrieve documents. Explore 3 key LangChain document loaders + how they effect output\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/langchain-document-loaders-for-web-data\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2023-11-30T14:57:42+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-04-24T17:03:59+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/0*ou0C9vOnaGtzPThz\" \/>\n<meta name=\"author\" content=\"Harpreet Sahota\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Cometml\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Harpreet Sahota\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"24 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"LangChain Document Loaders for Web\u00a0Data - Comet","description":"The effectiveness of RAG hinges on the method used to retrieve documents. 
Explore 3 key LangChain document loaders + how they affect output","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/langchain-document-loaders-for-web-data\/","og_locale":"en_US","og_type":"article","og_title":"LangChain Document Loaders for Web\u00a0Data","og_description":"The effectiveness of RAG hinges on the method used to retrieve documents. Explore 3 key LangChain document loaders + how they affect output","og_url":"https:\/\/www.comet.com\/site\/blog\/langchain-document-loaders-for-web-data\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2023-11-30T14:57:42+00:00","article_modified_time":"2025-04-24T17:03:59+00:00","og_image":[{"url":"https:\/\/cdn-images-1.medium.com\/max\/1600\/0*ou0C9vOnaGtzPThz","type":"","width":"","height":""}],"author":"Harpreet Sahota","twitter_card":"summary_large_image","twitter_creator":"@Cometml","twitter_site":"@Cometml","twitter_misc":{"Written by":"Harpreet Sahota","Est. 
reading time":"24 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/langchain-document-loaders-for-web-data\/#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/langchain-document-loaders-for-web-data\/"},"author":{"name":"Harpreet Sahota","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/46036ab474aa916e2873daece26a28d6"},"headline":"LangChain Document Loaders for Web\u00a0Data","datePublished":"2023-11-30T14:57:42+00:00","dateModified":"2025-04-24T17:03:59+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/langchain-document-loaders-for-web-data\/"},"wordCount":3767,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/langchain-document-loaders-for-web-data\/#primaryimage"},"thumbnailUrl":"https:\/\/cdn-images-1.medium.com\/max\/1600\/0*ou0C9vOnaGtzPThz","keywords":["LangChain","Language Models","LLM","LLMOps","Prompt Engineering"],"articleSection":["LLMOps","Tutorials"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/langchain-document-loaders-for-web-data\/","url":"https:\/\/www.comet.com\/site\/blog\/langchain-document-loaders-for-web-data\/","name":"LangChain Document Loaders for Web\u00a0Data - Comet","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/langchain-document-loaders-for-web-data\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/langchain-document-loaders-for-web-data\/#primaryimage"},"thumbnailUrl":"https:\/\/cdn-images-1.medium.com\/max\/1600\/0*ou0C9vOnaGtzPThz","datePublished":"2023-11-30T14:57:42+00:00","dateModified":"2025-04-24T17:03:59+00:00","description":"The effectiveness of RAG hinges on the method used to retrieve documents. 
Explore 3 key LangChain document loaders + how they affect output","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/langchain-document-loaders-for-web-data\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/langchain-document-loaders-for-web-data\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/langchain-document-loaders-for-web-data\/#primaryimage","url":"https:\/\/cdn-images-1.medium.com\/max\/1600\/0*ou0C9vOnaGtzPThz","contentUrl":"https:\/\/cdn-images-1.medium.com\/max\/1600\/0*ou0C9vOnaGtzPThz"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/langchain-document-loaders-for-web-data\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"LangChain Document Loaders for Web\u00a0Data"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, 
Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/46036ab474aa916e2873daece26a28d6","name":"Harpreet Sahota","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/2d21512be19ba7e19a71a803309e2a88","url":"https:\/\/secure.gravatar.com\/avatar\/a6ca5a533fc9f143a0a7428037ff652aa0633d66bf27e76ae89b955ae72a0f2d?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/a6ca5a533fc9f143a0a7428037ff652aa0633d66bf27e76ae89b955ae72a0f2d?s=96&d=mm&r=g","caption":"Harpreet Sahota"},"url":"https:\/\/www.comet.com\/site\/blog\/author\/theartistsofdatasciencegmail-com\/"}]}},"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/8279","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/68"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=8279"}],"version-history":[{"count":1,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/8279\/revisions"}],"predecessor-version":[{"id":15427,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/8279\/revisions\/15427"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=8279"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=8279"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/t
ags?post=8279"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=8279"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}