{"id":8267,"date":"2023-11-30T06:25:21","date_gmt":"2023-11-30T14:25:21","guid":{"rendered":"https:\/\/live-cometml.pantheonsite.io\/?p=8267"},"modified":"2026-01-02T20:57:43","modified_gmt":"2026-01-02T20:57:43","slug":"llamasherpa-document-chunking-for-llms","status":"publish","type":"post","link":"https:\/\/www.comet.com\/site\/blog\/llamasherpa-document-chunking-for-llms\/","title":{"rendered":"LlamaSherpa: Document Chunking for\u00a0LLMs"},"content":{"rendered":"\n<section class=\"section section--body\">\n<div class=\"section-divider\"><\/div>\n<div class=\"section-content\">\n<div class=\"section-inner sectionLayout--insetColumn\">\n<h2 class=\"graf graf--h4\">Smart Chunking Techniques for Enhanced RAG Pipeline Performance<\/h2>\n<figure class=\"graf graf--figure\">\n<\/figure><\/div><\/div><\/section>\n\n\n\n<figure class=\"wp-block-image alignnone graf-image\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*zVJSZ0NjWcLJs_wiRJN7oQ.jpeg\" alt=\"Document Chunking with LlamaSherpa and CometML + CometLLM\"\/><figcaption class=\"wp-element-caption\">Generated by the author using&nbsp;SDXL<\/figcaption><\/figure>\n\n\n\n<figcaption class=\"imageCaption\"><\/figcaption>\n\n\n\n<p class=\"wp-block-paragraph\">A huge pain point for Retrieval Augmented Generation is the challenge of making the text in large documents, especially PDFs, available for LLMs due to the limitations of the LLM <a href=\"https:\/\/www.comet.com\/site\/blog\/context-window\/\">context window<\/a>.<\/p>\n\n\n\n<p class=\"graf graf--p wp-block-paragraph\">You could naively chunk your documents\u200a\u2014\u200aa straightforward method of breaking down large documents into smaller text chunks without considering the document\u2019s inherent structure or layout.<\/p>\n\n\n\n<p class=\"graf graf--p wp-block-paragraph\">Going this route, you divide the text based on a predetermined size or word count, such as fitting within the LLM context window (typically 
2000\u20133000 words). The problem is that you can disrupt the semantics and context implied by the document\u2019s structure.<\/p>\n\n\n\n<p class=\"graf graf--p wp-block-paragraph\"><a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/open.substack.com\/users\/26558724-ambika-sukla?utm_source=mentions\" target=\"_blank\" rel=\"noopener\" data-href=\"https:\/\/open.substack.com\/users\/26558724-ambika-sukla?utm_source=mentions\">Ambika Sukla<\/a> <a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/ambikasukla.substack.com\/p\/efficient-rag-with-document-layout\" target=\"_blank\" rel=\"noopener\" data-href=\"https:\/\/ambikasukla.substack.com\/p\/efficient-rag-with-document-layout\"><strong class=\"markup--strong markup--p-strong\">proposes a solution called \u201csmart chunking\u201d that is layout-aware and considers the document\u2019s structure.<\/strong><\/a><\/p>\n\n\n\n<p class=\"graf graf--p wp-block-paragraph\">This method:<\/p>\n\n\n\n<ul class=\"wp-block-list postList\">\n<li>Is aware of the document\u2019s layout structure, preserving the semantics and context.<\/li>\n\n\n\n<li>Identifies and retains sections, subsections, and their nesting structures.<\/li>\n\n\n\n<li>Merges lines into coherent paragraphs and maintains connections between sections and paragraphs.<\/li>\n\n\n\n<li>Preserves table layouts, headers, subheaders, and list structures.<\/li>\n<\/ul>\n\n\n\n<p class=\"graf graf--p wp-block-paragraph\">To this end, he\u2019s created <a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/github.com\/nlmatics\/llmsherpa\" target=\"_blank\" rel=\"noopener\" data-href=\"https:\/\/github.com\/nlmatics\/llmsherpa\"><strong class=\"markup--strong markup--p-strong\">the LlamaSherpa library<\/strong><\/a><strong class=\"markup--strong markup--p-strong\">,<\/strong> which has a \u201cLayoutPDFReader,\u201d a tool designed to split text in PDFs into these layout-aware chunks, providing a more context-rich input for LLMs and 
enhancing their performance on large documents.<\/p>\n\n\n\n<p class=\"graf graf--p wp-block-paragraph\">Let\u2019s get some preliminaries out of the way:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><span class=\"pre--content\">%%capture\n!pip install llmsherpa openai llama-index\n\n<span class=\"hljs-keyword\">from<\/span> llmsherpa.readers <span class=\"hljs-keyword\">import<\/span> LayoutPDFReader\n<span class=\"hljs-keyword\">import<\/span> openai\n<span class=\"hljs-keyword\">import<\/span> getpass\n<span class=\"hljs-keyword\">from<\/span> IPython.display <span class=\"hljs-keyword\">import<\/span> display, HTML\n<span class=\"hljs-keyword\">from<\/span> llama_index.llms <span class=\"hljs-keyword\">import<\/span> OpenAI\n\nopenai.api_key = getpass.getpass(<span class=\"hljs-string\">\"What's your OpenAI key: \"<\/span>)<\/span><\/pre>\n\n\n\n<p class=\"graf graf--p wp-block-paragraph\">The following code sets up a PDF reader with a specific parser API endpoint, provides it a source (URL or path) to a PDF file, and instructs it to fetch, parse, and return the structured content of that PDF.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><span class=\"pre--content\">llmsherpa_api_url = <span class=\"hljs-string\">\"https:\/\/readers.llmsherpa.com\/api\/document\/developer\/parseDocument?renderFormat=all\"<\/span>\npdf_url = <span class=\"hljs-string\">\"https:\/\/arxiv.org\/pdf\/2310.14424.pdf\"<\/span> <span class=\"hljs-comment\"># also allowed is a file path e.g. 
\/home\/downloads\/xyz.pdf<\/span>\npdf_reader = LayoutPDFReader(llmsherpa_api_url)\ndoc = pdf_reader.read_pdf(pdf_url)<\/span><\/pre>\n\n\n\n<h3 class=\"wp-block-heading graf graf--h3\" id=\"h-step-by-step-explanation-of-what-just-happened-under-the-nbsp-hood\"><strong class=\"markup--strong markup--h3-strong\">Step-by-Step Explanation of what just happened under the&nbsp;hood<\/strong><\/h3>\n\n\n\n<figure class=\"graf graf--figure\">\n<\/figure>\n\n\n\n<figure class=\"wp-block-image alignnone graf-image\"><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*MnpBYmKdEq-d_uVXOQBIRg.jpeg\" alt=\"Document Chunking with LlamaSherpa and CometML + CometLLM\"\/><figcaption class=\"wp-element-caption\">Generated by the author using&nbsp;SDXL<\/figcaption><\/figure>\n\n\n\n<p class=\"graf graf--p wp-block-paragraph\"><strong class=\"markup--strong markup--p-strong\">Setting API Endpoint<\/strong>: The <code class=\"markup--code markup--p-code\">llmsherpa_api_url<\/code> variable is assigned the URL of the external parser API. This API is responsible for parsing the PDF files.<\/p>\n\n\n\n<p class=\"graf graf--p wp-block-paragraph\"><strong class=\"markup--strong markup--p-strong\">Setting PDF Source<\/strong>: The <code class=\"markup--code markup--p-code\">pdf_url<\/code> variable is given a URL pointing to a PDF file. However, as mentioned, it can also be assigned a local file path.<\/p>\n\n\n\n<p class=\"graf graf--p wp-block-paragraph\"><strong class=\"markup--strong markup--p-strong\">Initializing the PDF Reader<\/strong>:<\/p>\n\n\n\n<ul class=\"wp-block-list postList\">\n<li>The <code class=\"markup--code markup--li-code\">LayoutPDFReader<\/code> class is initialized with the <code class=\"markup--code markup--li-code\">llmsherpa_api_url<\/code>. 
This tells the reader which API to use for parsing PDFs.<\/li>\n\n\n\n<li>Internally, during this initialization, the class sets up two HTTP connection pools using <code class=\"markup--code markup--li-code\">urllib3<\/code>:<\/li>\n\n\n\n<li>One for downloading PDFs (<code class=\"markup--code markup--li-code\">self.download_connection<\/code>).<\/li>\n\n\n\n<li>Another for sending PDFs to the external parser API (<code class=\"markup--code markup--li-code\">self.api_connection<\/code>).<\/li>\n<\/ul>\n\n\n\n<p class=\"graf graf--p wp-block-paragraph\"><strong class=\"markup--strong markup--p-strong\">Reading and Parsing the PDF<\/strong>:<\/p>\n\n\n\n<p class=\"graf graf--p wp-block-paragraph\">The <code class=\"markup--code markup--p-code\">read_pdf<\/code> method of the <code class=\"markup--code markup--p-code\">pdf_reader<\/code> object is invoked with the <code class=\"markup--code markup--p-code\">pdf_url<\/code> as its argument.<\/p>\n\n\n\n<p class=\"graf graf--p wp-block-paragraph\">Inside this method:<\/p>\n\n\n\n<ul class=\"wp-block-list postList\">\n<li>The class determines if the input is a URL or a local file path.<\/li>\n\n\n\n<li>If it\u2019s a URL, the <code class=\"markup--code markup--li-code\">_download_pdf<\/code> method is invoked to fetch the PDF file from the given URL. 
It uses the <code class=\"markup--code markup--li-code\">download_connection<\/code> to make the HTTP request, impersonating a browser user agent to avoid potential download restrictions.<\/li>\n\n\n\n<li>The file is read directly from the system if it&#8217;s a local path.<\/li>\n\n\n\n<li>Once the PDF file data is obtained, the <code class=\"markup--code markup--li-code\">_parse_pdf<\/code> method is called to send this data to the external parser API (in this case, the <code class=\"markup--code markup--li-code\">llmsherpa<\/code> API).<\/li>\n\n\n\n<li>The API processes the PDF and returns a JSON response with parsed data.<\/li>\n\n\n\n<li>The JSON response is then processed to extract the \u2018blocks\u2019 of data, representing structured information parsed from the PDF.<\/li>\n\n\n\n<li>These blocks are finally returned as a <code class=\"markup--code markup--li-code\">Document<\/code> object.<\/li>\n<\/ul>\n\n\n\n<p class=\"graf graf--p wp-block-paragraph\"><strong class=\"markup--strong markup--p-strong\">Result<\/strong>: The <code class=\"markup--code markup--p-code\">doc<\/code> variable now holds a <code class=\"markup--code markup--p-code\">Document<\/code> object that contains the structured data parsed from the PDF. This <code class=\"markup--code markup--p-code\">Document<\/code> object can be used to access and manipulate the parsed information.<\/p>\n\n\n\n<section class=\"section section--body\">\n<div class=\"section-divider\">\n<hr class=\"section-divider\">\n<\/div>\n<div class=\"section-content\">\n<div class=\"section-inner sectionLayout--insetColumn\">\n<blockquote class=\"graf graf--pullquote\"><p>Want to learn how to build modern software with LLMs using the newest tools and techniques in the field? 
<a class=\"markup--anchor markup--pullquote-anchor\" href=\"https:\/\/www.comet.com\/production\/site\/llm-course\/?utm_source=Heartbeat&amp;utm_medium=referral&amp;utm_content=Medium&amp;utm_campaign=Heartbeat_LangChain_Series_HS\" target=\"_blank\" rel=\"noopener ugc nofollow\" data-href=\"https:\/\/www.comet.com\/production\/site\/llm-course\/?utm_source=Heartbeat&amp;utm_medium=referral&amp;utm_content=Medium&amp;utm_campaign=Heartbeat_LangChain_Series_HS\">Check out this free LLMOps course<\/a> from industry expert Elvis Saravia of&nbsp;DAIR.AI!<\/p><\/blockquote>\n<\/div>\n<\/div>\n<\/section>\n\n\n\n<section class=\"section section--body\">\n<div class=\"section-divider\">\n<hr class=\"section-divider\">\n<\/div>\n<div class=\"section-content\">\n<div class=\"section-inner sectionLayout--insetColumn\">\n<p class=\"graf graf--p\">What we end up with is a <code class=\"markup--code markup--p-code\">Document<\/code> object with several methods available to it.<\/p>\n<pre class=\"graf graf--pre graf--preV2\" spellcheck=\"false\" data-code-block-mode=\"2\" data-code-block-lang=\"python\"><span class=\"pre--content\"><span class=\"hljs-built_in\">type<\/span>(doc)\n\n<span class=\"hljs-comment\"># llmsherpa.readers.layout_reader.Document<\/span><\/span><\/pre>\n<h4 class=\"graf graf--h4\">Retrieving Chunks from the&nbsp;PDF<\/h4>\n<p class=\"graf graf--p\">The <code class=\"markup--code markup--p-code\">chunks<\/code> method provides coherent pieces or segments of content from the parsed PDF.<\/p>\n<pre class=\"graf graf--pre graf--preV2\" spellcheck=\"false\" data-code-block-mode=\"2\" data-code-block-lang=\"python\"><span class=\"pre--content\"><span class=\"hljs-keyword\">for<\/span> chunk <span class=\"hljs-keyword\">in<\/span> doc.chunks():\n    <span class=\"hljs-built_in\">print<\/span>(chunk.to_text())<\/span><\/pre>\n<h4 class=\"graf graf--h4\">Extracting Tables from the&nbsp;PDF<\/h4>\n<p class=\"graf graf--p\">The <code class=\"markup--code 
markup--p-code\">tables<\/code> method enables you to retrieve tables identified and extracted from the PDF.<\/p>\n<pre class=\"graf graf--pre graf--preV2\" spellcheck=\"false\" data-code-block-mode=\"2\" data-code-block-lang=\"python\"><span class=\"pre--content\"><span class=\"hljs-keyword\">for<\/span> table <span class=\"hljs-keyword\">in<\/span> doc.tables():\n    <span class=\"hljs-built_in\">print<\/span>(table.to_text())<\/span><\/pre>\n<h4 class=\"graf graf--h4\">Accessing Sections of the&nbsp;PDF<\/h4>\n<p class=\"graf graf--p\">The <code class=\"markup--code markup--p-code\">sections<\/code> method allows you to segment the content of the parsed PDF. This is especially handy if you want to navigate or read specific chapters, sub-chapters, or other logical divisions in the document.<\/p>\n<pre class=\"graf graf--pre graf--preV2\" spellcheck=\"false\" data-code-block-mode=\"2\" data-code-block-lang=\"python\"><span class=\"pre--content\"><span class=\"hljs-keyword\">for<\/span> section <span class=\"hljs-keyword\">in<\/span> doc.sections():\n    <span class=\"hljs-built_in\">print<\/span>(section.title)<\/span><\/pre>\n<p class=\"graf graf--p\">The code snippet below searches the parsed PDF document for a section titled \u20182 Methodology\u2019 and returns its complete content, including all subsections and nested content.<\/p>\n<p class=\"graf graf--p\">It operates on the <code class=\"markup--code markup--p-code\">Document<\/code> object from the <code class=\"markup--code markup--p-code\">llmsherpa.readers.layout_reader<\/code> module.<\/p>\n<ul class=\"postList\">\n<li class=\"graf graf--li\">The variable <code class=\"markup--code markup--li-code\">selected_section<\/code> is initialized to <code class=\"markup--code markup--li-code\">None<\/code> and acts as a placeholder for the desired section.<\/li>\n<li class=\"graf graf--li\">The code iterates over all sections in the <code class=\"markup--code markup--li-code\">doc<\/code> (a <code class=\"markup--code 
markup--li-code\">Document<\/code> object) using the <code class=\"markup--code markup--li-code\">sections()<\/code> method.<\/li>\n<li class=\"graf graf--li\">During the iteration, if a section with the title \u20182 Methodology\u2019 is found, the <code class=\"markup--code markup--li-code\">selected_section<\/code> variable is updated to reference this section, and the loop is immediately exited.<\/li>\n<li class=\"graf graf--li\">The <code class=\"markup--code markup--li-code\">to_html<\/code> method is then used to generate an HTML representation of the <code class=\"markup--code markup--li-code\">selected_section<\/code>. With both <code class=\"markup--code markup--li-code\">include_children=True<\/code> and <code class=\"markup--code markup--li-code\">recurse=True<\/code> set, the generated HTML includes the immediate child elements of the section and all of its descendants. This ensures a comprehensive view of the section and its sub-content.<\/li>\n<li class=\"graf graf--li\">Finally, the <code class=\"markup--code markup--li-code\">HTML<\/code> function is used to display the section in a Jupyter Notebook.<\/li>\n<\/ul>\n<pre class=\"graf graf--pre graf--preV2\" spellcheck=\"false\" data-code-block-mode=\"1\" data-code-block-lang=\"python\"><span class=\"pre--content\"><span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title function_\">get_section_text<\/span>(<span class=\"hljs-params\">doc, section_title<\/span>):\n    <span class=\"hljs-string\">\"\"\"\n    Extracts the text from a specific section in a parsed PDF document.\n\n    Parameters:\n    - doc (Document): A Document object from the llmsherpa.readers.layout_reader library.\n    - section_title (str): The title of the section to extract.\n\n    Returns:\n    - str: The HTML representation of the section's content, or a message if the section is not found.\n    \"\"\"<\/span>\n\n    selected_section = <span class=\"hljs-literal\">None<\/span>\n\n    <span class=\"hljs-comment\"># 
Find the desired section by title<\/span>\n    <span class=\"hljs-keyword\">for<\/span> section <span class=\"hljs-keyword\">in<\/span> doc.sections():\n        <span class=\"hljs-keyword\">if<\/span> section.title == section_title:\n            selected_section = section\n            <span class=\"hljs-keyword\">break<\/span>\n\n    <span class=\"hljs-comment\"># If the section is not found, return a message<\/span>\n    <span class=\"hljs-keyword\">if<\/span> <span class=\"hljs-keyword\">not<\/span> selected_section:\n        <span class=\"hljs-keyword\">return<\/span> <span class=\"hljs-string\">f\"No section titled '<span class=\"hljs-subst\">{section_title}<\/span>' found.\"<\/span>\n\n    <span class=\"hljs-comment\"># Return the full content of the section as HTML<\/span>\n    <span class=\"hljs-keyword\">return<\/span> selected_section.to_html(include_children=<span class=\"hljs-literal\">True<\/span>, recurse=<span class=\"hljs-literal\">True<\/span>)<\/span><\/pre>\n<p class=\"graf graf--p\"><strong class=\"markup--strong markup--p-strong\">You can see the text in any given section like so:<\/strong><\/p>\n<pre class=\"graf graf--pre graf--preV2\" spellcheck=\"false\" data-code-block-mode=\"2\" data-code-block-lang=\"python\"><span class=\"pre--content\">section_text = get_section_text(doc, <span class=\"hljs-string\">'2 Methodology'<\/span>)\nHTML(section_text)<\/span><\/pre>\n<p class=\"graf graf--p\"><strong class=\"markup--strong markup--p-strong\">And you can use that text as context for an LLM:<\/strong><\/p>\n<pre class=\"graf graf--pre graf--preV2\" spellcheck=\"false\" data-code-block-mode=\"1\" data-code-block-lang=\"python\"><span class=\"pre--content\"><span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title function_\">get_answer_from_llm<\/span>(<span class=\"hljs-params\">context, question, api_instance<\/span>):\n    <span class=\"hljs-string\">\"\"\"\n    Uses an LLM to answer a specific question about the provided context.\n\n   
 Parameters:\n    - context (str): The text or content to analyze.\n    - question (str): A question to answer about the context.\n    - api_instance: An instance of the API (e.g., OpenAI) used to generate the answer.\n\n    Returns:\n    - str: The API's response text.\n    \"\"\"<\/span>\n\n    prompt = <span class=\"hljs-string\">f\"Read this text and answer the question: <span class=\"hljs-subst\">{question}<\/span>:\\n<span class=\"hljs-subst\">{context}<\/span>\"<\/span>\n    resp = api_instance.complete(prompt)\n\n    <span class=\"hljs-keyword\">return<\/span> resp.text\n\nquestion = <span class=\"hljs-string\">\"Describe the methodology used to conduct the experiments in this research\"<\/span>\n\nllm = OpenAI()\n\nresponse = get_answer_from_llm(section_text, question, llm)\n\n<span class=\"hljs-built_in\">print<\/span>(response)<\/span><\/pre>\n<h3 class=\"graf graf--h3\">Below is the summary&nbsp;\ud83d\udc47\ud83c\udffd<\/h3>\n<blockquote class=\"graf graf--blockquote\"><p>The methodology used in this research involves conducting pairwise comparisons between different models. The researchers start by selecting a set of prompts (P) and a pool of models (M). For each prompt in P and each model in M, a completion is generated. Paired model completions (C) are formed for evaluation, where each pair consists of completions from two distinct models. An annotator then reviews each pair and assesses their relative quality, assigning evaluation scores (ScoreAi, ScoreBi) based on their preference.<\/p><\/blockquote>\n<blockquote class=\"graf graf--blockquote\"><p>To rank the prompts in P, an offline approach is proposed. The focus is on highlighting the dissimilarity between the responses of the two models. 
The researchers aim to identify and rank prompts that have a low likelihood of tie outcomes, where both completions are viewed as similarly good or bad by annotators.<\/p><\/blockquote>\n<blockquote class=\"graf graf--blockquote\"><p>To achieve this, the prompts and completion pairs within P are reordered based on dissimilarity scores. An optimal permutation (\u03c0) is found to create an ordered set (P\u0302\u03c0) that prioritizes evaluation instances with a strong preference signal from annotators. This reordering reduces the number of annotations required to determine model preference.<\/p><\/blockquote>\n<blockquote class=\"graf graf--blockquote\"><p>Conventional string matching techniques such as BLEU and ROUGE are not suitable for this problem as they may not capture the meaning or quality of completions accurately. The researchers aim to streamline and optimize the evaluation process by strategically selecting prompts that amplify the informativeness of each comparison.<\/p><\/blockquote>\n<h3 class=\"graf graf--h3\">Overview of <code class=\"markup--code markup--h3-code\">doc.tables()<\/code><\/h3>\n<p class=\"graf graf--p\">The <code class=\"markup--code markup--p-code\">doc.tables()<\/code> method is designed to extract and return tables from a parsed PDF document.<\/p>\n<h3 class=\"graf graf--h3\">Under the&nbsp;Hood:<\/h3>\n<ol class=\"postList\">\n<li class=\"graf graf--li\"><strong class=\"markup--strong markup--li-strong\">Node Traversal<\/strong>: The method begins by initializing an empty list, <code class=\"markup--code markup--li-code\">tables<\/code>, to store nodes tagged as tables. It then traverses the entire document tree, starting from the root node.<\/li>\n<li class=\"graf graf--li\"><strong class=\"markup--strong markup--li-strong\">Tag-Based Identification<\/strong>: During traversal, the method checks the <code class=\"markup--code markup--li-code\">tag<\/code> attribute of each node. 
If a node has its <code class=\"markup--code markup--li-code\">tag<\/code> set to <code class=\"markup--code markup--li-code\">'table'<\/code>, it is considered a table and is added to the <code class=\"markup--code markup--li-code\">tables<\/code> list.<\/li>\n<li class=\"graf graf--li\"><strong class=\"markup--strong markup--li-strong\">Return<\/strong>: After traversing all nodes, the method returns the <code class=\"markup--code markup--li-code\">tables<\/code> list, which contains all nodes identified as tables.<\/li>\n<\/ol>\n<h3 class=\"graf graf--h3\">Potential Parsing Discrepancies:<\/h3>\n<ol class=\"postList\">\n<li class=\"graf graf--li\"><strong class=\"markup--strong markup--li-strong\">Broad Tagging<\/strong>: The method relies solely on the <code class=\"markup--code markup--li-code\">tag<\/code> attribute to identify tables. If, during the initial PDF parsing, certain non-table elements are tagged as <code class=\"markup--code markup--li-code\">'table'<\/code> (due to layout similarities or parsing complexities), they will be incorrectly identified as tables by this method.<\/li>\n<li class=\"graf graf--li\"><strong class=\"markup--strong markup--li-strong\">PDF Structure Complexity<\/strong>: PDFs are visually-oriented documents. Elements that appear as text or lists to human readers might be structured in a tabular manner in the underlying PDF content. This can lead the parser to tag such elements as tables, even if they don\u2019t visually resemble tables.<\/li>\n<li class=\"graf graf--li\"><strong class=\"markup--strong markup--li-strong\">Lack of Additional Verification<\/strong>: The method does not employ additional checks or heuristics to verify the tabular nature of identified nodes. 
Implementing further criteria (e.g., checking for rows\/columns or tabular data patterns) could enhance table identification accuracy.<\/li>\n<\/ol>\n<p class=\"graf graf--p\">While <code class=\"markup--code markup--p-code\">doc.tables()<\/code> provides a straightforward way to extract tables from a parsed PDF, you should be aware of potential discrepancies in table identification due to the inherent complexities of PDF parsing and the method&#8217;s reliance on tags alone.<\/p>\n<pre class=\"graf graf--pre graf--preV2\" spellcheck=\"false\" data-code-block-mode=\"2\" data-code-block-lang=\"python\"><span class=\"pre--content\"><span class=\"hljs-comment\"># how many \"tables\" we have<\/span>\n<span class=\"hljs-built_in\">len<\/span>(doc.tables())\n\n<span class=\"hljs-comment\"># 13<\/span><\/span><\/pre>\n<h4 class=\"graf graf--h4\">Let\u2019s see a table and reason over it with an&nbsp;LLM<\/h4>\n<pre class=\"graf graf--pre graf--preV2\" spellcheck=\"false\" data-code-block-mode=\"1\" data-code-block-lang=\"python\"><span class=\"pre--content\"><span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title function_\">display_table<\/span>(<span class=\"hljs-params\">doc, index<\/span>):\n    <span class=\"hljs-string\">\"\"\"\n    Returns the HTML representation of a specified table from a parsed PDF document.\n\n    Parameters:\n    - doc (Document): A Document object from the llmsherpa.readers.layout_reader library.\n    - index (int): The index of the table to display.\n\n    Returns:\n    - str: The HTML representation of the table, or a message if the table is not found.\n    \"\"\"<\/span>\n\n    tables = doc.tables()\n    <span class=\"hljs-keyword\">if<\/span> index &lt; <span class=\"hljs-number\">0<\/span> <span class=\"hljs-keyword\">or<\/span> index &gt;= <span class=\"hljs-built_in\">len<\/span>(tables):\n        <span class=\"hljs-keyword\">return<\/span> <span class=\"hljs-string\">\"Table index out of range.\"<\/span>\n\n    <span 
class=\"hljs-keyword\">return<\/span> tables[index].to_html()\n\ntable_ = display_table(doc, <span class=\"hljs-number\">4<\/span>)\nHTML(table_)<\/span><\/pre>\n<figure class=\"graf graf--figure\"><img decoding=\"async\" class=\"graf-image\" src=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*ECmrtHcWjkQ0eYRw-1TurA.png\" data-image-id=\"1*ECmrtHcWjkQ0eYRw-1TurA.png\" data-width=\"2520\" data-height=\"490\"><\/figure>\n<pre class=\"graf graf--pre graf--preV2\" spellcheck=\"false\" data-code-block-mode=\"2\" data-code-block-lang=\"python\"><span class=\"pre--content\">question = <span class=\"hljs-string\">\"What insight can you glean from this table?\"<\/span>\nresponse = get_answer_from_llm(table_, question, llm)\n<span class=\"hljs-built_in\">print<\/span>(response)<\/span><\/pre>\n<p class=\"graf graf--p\">And the response from the LLM:<\/p>\n<blockquote class=\"graf graf--blockquote\"><p>From this table, we can glean the following insights:<\/p><\/blockquote>\n<blockquote class=\"graf graf--blockquote\"><p>&nbsp;1. There are four different models mentioned: Flan-t5, Dolly-v2, Falcon-instruct falcon, and MPT-instruct.<\/p><\/blockquote>\n<blockquote class=\"graf graf--blockquote\"><p>2. Each model has a different base model architecture: T5 encoder-decoder [3B, 11B] formal instruct for Flan-t5, pythia decoder-only for Dolly-v2, decoder-only for Falcon-instruct falcon, and mpt decoder-only for MPT-instruct.<\/p><\/blockquote>\n<blockquote class=\"graf graf--blockquote\"><p>3. The size of the models varies, with Flan-t5 and Dolly-v2 not specifying the size, Falcon-instruct falcon being [7B], and MPT-instruct also being [7B].<\/p><\/blockquote>\n<blockquote class=\"graf graf--blockquote\"><p>4. 
The finetuning data for each model is mentioned, with Flan-t5 and Dolly-v2 not specifying any, Falcon-instruct falcon using instruct\/chat, and MPT-instruct using colloquial instruct\/preference.<\/p><\/blockquote>\n<h3 class=\"graf graf--h3\">Vector search and Retrieval Augmented Generation with Smart&nbsp;Chunking<\/h3>\n<p class=\"graf graf--p\"><code class=\"markup--code markup--p-code\">LayoutPDFReader<\/code> is designed to chunk text while intelligently preserving the integrity of related content.<\/p>\n<p class=\"graf graf--p\">This means that all list items, including the paragraph that precedes the list, are kept together.<\/p>\n<p class=\"graf graf--p\">In addition, items in a table are grouped, and contextual information from section headers and nested section headers is included.<\/p>\n<p class=\"graf graf--p\">By using the following code, you can create a LlamaIndex query engine from the document chunks generated by <code class=\"markup--code markup--p-code\">LayoutPDFReader<\/code>.<\/p>\n<pre class=\"graf graf--pre graf--preV2\" spellcheck=\"false\" data-code-block-mode=\"1\" data-code-block-lang=\"python\"><span class=\"pre--content\"><span class=\"hljs-keyword\">from<\/span> llama_index.readers.schema.base <span class=\"hljs-keyword\">import<\/span> Document\n<span class=\"hljs-keyword\">from<\/span> llama_index <span class=\"hljs-keyword\">import<\/span> VectorStoreIndex\n\nindex = VectorStoreIndex([])\n\n<span class=\"hljs-keyword\">for<\/span> chunk <span class=\"hljs-keyword\">in<\/span> doc.chunks():\n    index.insert(Document(text=chunk.to_context_text(), extra_info={}))\n\nquery_engine = index.as_query_engine()\n\n<span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title function_\">query_vectorstore<\/span>(<span class=\"hljs-params\">question, query_engine=query_engine<\/span>):\n    <span class=\"hljs-string\">\"\"\"\n    Queries a vector store through a query engine and returns the response.\n\n    Parameters:\n    - 
question (str): The question to ask.\n    - query_engine: The engine to use for querying (e.g., an instance of a class with a query method).\n\n    Returns:\n    - response: The response from the vector store\n    \"\"\"<\/span>\n    response = query_engine.query(question)\n    <span class=\"hljs-keyword\">return<\/span> response\n\nresponse = query_vectorstore(<span class=\"hljs-string\">\"What is the methodology in this paper?\"<\/span>)\n\nresponse.response<\/span><\/pre>\n<p class=\"graf graf--p\"><strong class=\"markup--strong markup--p-strong\">And the response from the LLM:<\/strong><\/p>\n<blockquote class=\"graf graf--blockquote\"><p>The methodology in this paper focuses on prioritizing evaluation instances that showcase distinct model behaviors. The goal is to minimize tie outcomes and optimize the evaluation process, especially when resources are limited. However, this approach may inherently favor certain data points and introduce biases. The methodology also acknowledges the risk of over-representing certain challenges and under-representing areas where models have consistent outputs. It is important to note that the proposed methodology is designed to prioritize annotation within budget constraints, rather than using it for sample exclusion.<\/p><\/blockquote>\n<pre class=\"graf graf--pre graf--preV2\" spellcheck=\"false\" data-code-block-mode=\"2\" data-code-block-lang=\"python\"><span class=\"pre--content\">response = query_vectorstore(<span class=\"hljs-string\">\"How do you quantify A vs B dissimilarity?\"<\/span>)\nresponse.response<\/span><\/pre>\n<blockquote class=\"graf graf--blockquote\"><p>The quantification of A vs B dissimilarity can be done using the Kullback-Leibler (KL) divergence formula. 
This formula involves calculating the sum of the product of the probability of each element in A (pA) and the logarithm of the ratio between the probability of that element in A (pA) and the probability of that element in B (pB).<\/p><\/blockquote>\n<h3 class=\"graf graf--h3\">Conclusion<\/h3>\n<p class=\"graf graf--p\">In summary, the blog post introduces LlamaSherpa, an innovative library that addresses the challenge of chunking large documents for use with Large Language Models (LLMs).<\/p>\n<p class=\"graf graf--p\">LlamaSherpa\u2019s \u201csmart chunking\u201d method is layout-aware, preserving the semantics and structure of the original document, which is crucial for maintaining the context and meaning. The library\u2019s LayoutPDFReader tool efficiently processes PDFs to create more effective inputs for LLMs.<\/p>\n<p class=\"graf graf--p\">By utilizing LlamaSherpa, users can enhance the performance of their RAG pipelines, ensuring that the model\u2019s context window encapsulates the most relevant and structured information from large documents.<\/p>\n<\/div>\n<\/div>\n<\/section>\n","protected":false},"excerpt":{"rendered":"<p>Smart Chunking Techniques for Enhanced RAG Pipeline Performance A huge pain point for Retrieval Augmented Generation is the challenge of making the text in large documents, especially PDFs, available for LLMs due to the limitations of the LLM context window. 
You could naively chunk your documents\u200a\u2014\u200aa straightforward method of breaking down large documents into smaller [&hellip;]<\/p>\n","protected":false},"author":68,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"customer_name":"","customer_description":"","customer_industry":"","customer_technologies":"","customer_logo":"","_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[65,7],"tags":[70,71,52,31,34],"coauthors":[166],"class_list":["post-8267","post","type-post","status-publish","format-standard","hentry","category-llmops","category-tutorials","tag-langchain","tag-language-models","tag-llm","tag-llmops","tag-prompt-engineering"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>LlamaSherpa: Document Chunking for\u00a0LLMs - Comet<\/title>\n<meta name=\"description\" content=\"The new LlamaSherpa package makes smart chunking for RAG possible by being layout-aware and considering the document\u2019s structure.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.comet.com\/site\/blog\/llamasherpa-document-chunking-for-llms\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"LlamaSherpa: Document Chunking for\u00a0LLMs\" \/>\n<meta property=\"og:description\" content=\"The new LlamaSherpa package makes smart chunking for RAG possible by being layout-aware and considering the document\u2019s structure.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.comet.com\/site\/blog\/llamasherpa-document-chunking-for-llms\/\" \/>\n<meta property=\"og:site_name\" content=\"Comet\" \/>\n<meta property=\"article:publisher\" 
content=\"https:\/\/www.facebook.com\/cometdotml\" \/>\n<meta property=\"article:published_time\" content=\"2023-11-30T14:25:21+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2026-01-02T20:57:43+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*zVJSZ0NjWcLJs_wiRJN7oQ.jpeg\" \/>\n<meta name=\"author\" content=\"Harpreet Sahota\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@Cometml\" \/>\n<meta name=\"twitter:site\" content=\"@Cometml\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Harpreet Sahota\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"7 minutes\" \/>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"LlamaSherpa: Document Chunking for\u00a0LLMs - Comet","description":"The new LlamaSherpa package makes smart chunking for RAG possible by being layout-aware and considering the document\u2019s structure.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.comet.com\/site\/blog\/llamasherpa-document-chunking-for-llms\/","og_locale":"en_US","og_type":"article","og_title":"LlamaSherpa: Document Chunking for\u00a0LLMs","og_description":"The new LlamaSherpa package makes smart chunking for RAG possible by being layout-aware and considering the document\u2019s 
structure.","og_url":"https:\/\/www.comet.com\/site\/blog\/llamasherpa-document-chunking-for-llms\/","og_site_name":"Comet","article_publisher":"https:\/\/www.facebook.com\/cometdotml","article_published_time":"2023-11-30T14:25:21+00:00","article_modified_time":"2026-01-02T20:57:43+00:00","og_image":[{"url":"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*zVJSZ0NjWcLJs_wiRJN7oQ.jpeg","type":"","width":"","height":""}],"author":"Harpreet Sahota","twitter_card":"summary_large_image","twitter_creator":"@Cometml","twitter_site":"@Cometml","twitter_misc":{"Written by":"Harpreet Sahota","Est. reading time":"7 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.comet.com\/site\/blog\/llamasherpa-document-chunking-for-llms\/#article","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/blog\/llamasherpa-document-chunking-for-llms\/"},"author":{"name":"Harpreet Sahota","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/46036ab474aa916e2873daece26a28d6"},"headline":"LlamaSherpa: Document Chunking for\u00a0LLMs","datePublished":"2023-11-30T14:25:21+00:00","dateModified":"2026-01-02T20:57:43+00:00","mainEntityOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/llamasherpa-document-chunking-for-llms\/"},"wordCount":1797,"publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/llamasherpa-document-chunking-for-llms\/#primaryimage"},"thumbnailUrl":"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*zVJSZ0NjWcLJs_wiRJN7oQ.jpeg","keywords":["LangChain","Language Models","LLM","LLMOps","Prompt Engineering"],"articleSection":["LLMOps","Tutorials"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.comet.com\/site\/blog\/llamasherpa-document-chunking-for-llms\/","url":"https:\/\/www.comet.com\/site\/blog\/llamasherpa-document-chunking-for-llms\/","name":"LlamaSherpa: Document Chunking for\u00a0LLMs - 
Comet","isPartOf":{"@id":"https:\/\/www.comet.com\/site\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.comet.com\/site\/blog\/llamasherpa-document-chunking-for-llms\/#primaryimage"},"image":{"@id":"https:\/\/www.comet.com\/site\/blog\/llamasherpa-document-chunking-for-llms\/#primaryimage"},"thumbnailUrl":"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*zVJSZ0NjWcLJs_wiRJN7oQ.jpeg","datePublished":"2023-11-30T14:25:21+00:00","dateModified":"2026-01-02T20:57:43+00:00","description":"The new LlamaSherpa package makes smart chunking for RAG possible by being layout-aware and considering the document\u2019s structure.","breadcrumb":{"@id":"https:\/\/www.comet.com\/site\/blog\/llamasherpa-document-chunking-for-llms\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.comet.com\/site\/blog\/llamasherpa-document-chunking-for-llms\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/blog\/llamasherpa-document-chunking-for-llms\/#primaryimage","url":"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*zVJSZ0NjWcLJs_wiRJN7oQ.jpeg","contentUrl":"https:\/\/cdn-images-1.medium.com\/max\/1600\/1*zVJSZ0NjWcLJs_wiRJN7oQ.jpeg"},{"@type":"BreadcrumbList","@id":"https:\/\/www.comet.com\/site\/blog\/llamasherpa-document-chunking-for-llms\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.comet.com\/site\/"},{"@type":"ListItem","position":2,"name":"LlamaSherpa: Document Chunking for\u00a0LLMs"}]},{"@type":"WebSite","@id":"https:\/\/www.comet.com\/site\/#website","url":"https:\/\/www.comet.com\/site\/","name":"Comet","description":"Build Better Models 
Faster","publisher":{"@id":"https:\/\/www.comet.com\/site\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.comet.com\/site\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.comet.com\/site\/#organization","name":"Comet ML, Inc.","alternateName":"Comet","url":"https:\/\/www.comet.com\/site\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/","url":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","contentUrl":"https:\/\/www.comet.com\/site\/wp-content\/uploads\/2025\/01\/logo_comet_square.png","width":310,"height":310,"caption":"Comet ML, Inc."},"image":{"@id":"https:\/\/www.comet.com\/site\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/cometdotml","https:\/\/x.com\/Cometml","https:\/\/www.youtube.com\/channel\/UCmN63HKvfXSCS-UwVwmK8Hw"]},{"@type":"Person","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/46036ab474aa916e2873daece26a28d6","name":"Harpreet Sahota","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.comet.com\/site\/#\/schema\/person\/image\/2d21512be19ba7e19a71a803309e2a88","url":"https:\/\/secure.gravatar.com\/avatar\/a6ca5a533fc9f143a0a7428037ff652aa0633d66bf27e76ae89b955ae72a0f2d?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/a6ca5a533fc9f143a0a7428037ff652aa0633d66bf27e76ae89b955ae72a0f2d?s=96&d=mm&r=g","caption":"Harpreet 
Sahota"},"url":"https:\/\/www.comet.com\/site\/blog\/author\/theartistsofdatasciencegmail-com\/"}]}},"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/8267","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/users\/68"}],"replies":[{"embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/comments?post=8267"}],"version-history":[{"count":2,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/8267\/revisions"}],"predecessor-version":[{"id":18827,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/posts\/8267\/revisions\/18827"}],"wp:attachment":[{"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/media?parent=8267"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/categories?post=8267"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/tags?post=8267"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.comet.com\/site\/wp-json\/wp\/v2\/coauthors?post=8267"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}