Retrieval in LangChain: Part 1

Document Loaders, Document Transformers

Retrieval in LangChain refers to fetching relevant data or documents from external sources.

It is a crucial step in many language model applications, especially in Retrieval Augmented Generation (RAG) tasks.

Retrieval is useful because it allows you to supply external data to your language model at query time, providing additional context and information that may not be present in the model’s training data.

By retrieving relevant documents, you can enhance the generation process and improve the quality and relevance of the generated responses.

You may need retrieval in LangChain when you want to:

Incorporate user-specific data: Retrieval allows you to fetch data that is specific to individual users or applications, enabling personalized and context-aware responses.

Provide additional information: By retrieving relevant documents, you can supplement the model’s knowledge with up-to-date information, facts, or explanations.

Answer questions over documents: Retrieval is particularly useful for tasks like question answering, where you need to find relevant information from a large corpus of documents.

As a rule of thumb, you need retrieval if your application must access data that lies outside the model’s training set, or must look up documents in response to user queries.

In both cases, retrieval grounds the model’s responses in the retrieved material, which tends to make answers more accurate and relevant.

To use retrieval in LangChain, you can follow these steps:

1) Load documents: Use document loaders to load documents from various sources, such as files, websites, or databases.

2) Transform documents: Apply document transformers to preprocess the loaded documents, such as splitting large documents into smaller chunks or applying logic optimized for different document types.

3) Create embeddings: Generate embeddings for the documents using text embedding models. Embeddings capture the semantic meaning of text and enable efficient searching and similarity calculations.

4) Store documents and embeddings: Use vector stores to store the documents and their corresponding embeddings. Vector stores provide efficient storage and retrieval capabilities for large collections of embeddings.

5) Retrieve relevant documents: Use retrievers to query the vector store and retrieve relevant documents based on user queries or search criteria. Retrieval algorithms, such as similarity search or Maximal Marginal Relevance (MMR) search, can be used to find the most relevant documents.
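To see how MMR differs from plain similarity search, here is a minimal sketch in plain Python. It is not LangChain's implementation: the function names, the tiny two-dimensional vectors, and the value of lambda_mult are all assumptions chosen for illustration. The idea is that MMR picks documents one at a time, rewarding similarity to the query and penalizing similarity to documents that were already picked.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return the indices of k documents chosen greedily by MMR.

    lambda_mult = 1.0 behaves like plain similarity search;
    smaller values favor diversity among the selected documents.
    """
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Highest similarity to anything already selected (0 if nothing is).
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


# Made-up vectors: documents 0 and 1 point in almost the same direction.
query = [1, 1]
docs = [[5, 4], [5, 3], [1, 5]]

similarity_top2 = sorted(
    range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True
)[:2]
mmr_top2 = mmr_select(query, docs, k=2, lambda_mult=0.5)

print(similarity_top2)  # [0, 1]
print(mmr_top2)         # [0, 2]
```

Similarity search returns the two near-duplicates, while MMR swaps the second one for a less similar but more diverse document.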

Following these steps, you can effectively incorporate retrieval capabilities into your LangChain application and enhance the language model’s performance and contextual understanding.
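The five steps above can be sketched end to end in plain Python. This is a toy illustration under stated assumptions, not LangChain code: the helper names, the ToyVectorStore class, and the sample text are hypothetical, and the "embedding" is a simple word count rather than a real embedding model.

```python
import math
import re
from collections import Counter


def load_document(text, source):
    # Step 1: load raw text together with metadata about where it came from.
    return {"page_content": text, "metadata": {"source": source}}


def split_into_chunks(doc):
    # Step 2: transform the document by splitting it into one chunk per sentence.
    sentences = [s.strip() for s in doc["page_content"].split(".") if s.strip()]
    return [
        {"page_content": s, "metadata": dict(doc["metadata"], chunk=i)}
        for i, s in enumerate(sentences)
    ]


def embed(text):
    # Step 3: toy embedding, a bag-of-words count (a stand-in for a real model).
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ToyVectorStore:
    # Step 4: keep each chunk next to its embedding.
    def __init__(self):
        self.entries = []

    def add(self, chunks):
        for chunk in chunks:
            self.entries.append((embed(chunk["page_content"]), chunk))

    # Step 5: retrieve the k chunks most similar to the query.
    def similarity_search(self, query, k=1):
        query_vec = embed(query)
        ranked = sorted(
            self.entries, key=lambda e: cosine(query_vec, e[0]), reverse=True
        )
        return [chunk for _, chunk in ranked[:k]]


text = (
    "Loaders read files into documents. "
    "Splitters cut documents into chunks. "
    "Retrievers find chunks that match a query."
)
store = ToyVectorStore()
store.add(split_into_chunks(load_document(text, source="notes.txt")))

results = store.similarity_search("How do retrievers match a query?", k=1)
print(results[0]["page_content"])
print(results[0]["metadata"])
```

In a real application each of these pieces would be replaced by the corresponding LangChain component, but the flow of data between the steps stays the same.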

Document Loaders

Document loaders in LangChain are used to load data from various sources as Document objects.

A Document is a piece of text with associated metadata. Document loaders provide a convenient way to fetch data from different sources, such as text files, web pages, or even transcripts of videos. The main purpose of document loaders is to retrieve data and prepare it for further processing in LangChain.

They expose a load method that fetches data from the configured source and returns it as a list of Document objects. Some document loaders also support lazy loading, which allows data to be loaded into memory only when needed.
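To make the difference between eager and lazy loading concrete, here is a minimal sketch in plain Python. SimpleDocument and LineLoader are hypothetical stand-ins, not LangChain classes; only the page_content and metadata attribute names and the load and lazy_load method names are meant to mirror LangChain's conventions. The file contents are made up.

```python
import os
import tempfile
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class SimpleDocument:
    """A piece of text plus metadata (hypothetical stand-in for Document)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class LineLoader:
    """Hypothetical loader that yields one document per non-empty line."""

    def __init__(self, file_path: str):
        self.file_path = file_path

    def lazy_load(self) -> Iterator[SimpleDocument]:
        # Documents are produced one at a time, so only the current
        # line needs to be held in memory.
        with open(self.file_path, encoding="utf-8") as f:
            for line_number, line in enumerate(f, start=1):
                if line.strip():
                    yield SimpleDocument(
                        page_content=line.strip(),
                        metadata={"source": self.file_path, "line": line_number},
                    )

    def load(self) -> List[SimpleDocument]:
        # Eager loading simply materializes everything lazy_load yields.
        return list(self.lazy_load())


# Write a small throwaway file to load from.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("first saying\n\nsecond saying\n")
    path = tmp.name

loader = LineLoader(path)
docs = loader.load()           # the whole file, as a list

lazy_docs = loader.lazy_load()
first_lazy = next(lazy_docs)   # only the first document is read
lazy_docs.close()              # stop early and release the file handle

os.remove(path)
print(len(docs), first_lazy.page_content)
```

Lazy loading matters most for large sources, where building the full list up front would be slow or exhaust memory.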

Text loader

This is the simplest loader. It reads in a file as text and places it all into one Document.

%%capture
!pip install langchain openai tiktoken
!wget -O "golden-sayings-of-epictetus.txt" https://www.gutenberg.org/cache/epub/871/pg871.txt

import os
import getpass
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter Your OpenAI API Key:")

from langchain.document_loaders import TextLoader
loader = TextLoader("golden-sayings-of-epictetus.txt")
golden_sayings = loader.load()

type(golden_sayings)
# list

type(golden_sayings[0])
# langchain.schema.document.Document

CSV Loaders

CSV loaders in LangChain read a CSV file and convert it into LangChain’s Document format, producing one Document per row.

They are useful when you have structured data in CSV format that you want to work with in LangChain.

To use a CSV loader in LangChain, you can follow these steps:

1) Import the CSVLoader class from the langchain.document_loaders module.

2) Create an instance of the CSVLoader class, providing the path to the CSV file as the argument.

3) Use the load() method of the CSVLoader instance to load the CSV file and convert it into LangChain’s Document format.

Because each row becomes its own Document, the rows can later be embedded, stored, and retrieved like any other text in LangChain.

Here’s an example code snippet that demonstrates how to use a CSV loader in LangChain:

from langchain.document_loaders.csv_loader import CSVLoader

# Sample dataset that ships with Google Colab
loader = CSVLoader(file_path='/content/sample_data/california_housing_test.csv')
data = loader.load()

# data is a list of Document objects
print(data)
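To get a feel for what each loaded Document contains, the sketch below imitates the row-to-Document conversion using only the standard library. It reflects my understanding of CSVLoader's default behavior (one Document per row, with "column: value" lines as the content and the source and row index as metadata), so check the LangChain documentation before relying on the details. The rows_to_documents function, the file name, and the sample values are hypothetical.

```python
import csv
import io


def rows_to_documents(csv_text, source):
    """Turn each CSV row into a dict with page_content and metadata."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_index, row in enumerate(reader):
        # One "column: value" line per cell, joined into a single text block.
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append(
            {
                "page_content": content,
                "metadata": {"source": source, "row": row_index},
            }
        )
    return documents


# Made-up sample rows, loosely modeled on a housing dataset.
sample = (
    "longitude,latitude,median_house_value\n"
    "-122.05,37.37,344700\n"
    "-118.30,34.26,176500\n"
)

docs = rows_to_documents(sample, source="housing.csv")
print(docs[0]["page_content"])
print(docs[1]["metadata"])
```

Writing the column name next to every value keeps each row self-describing, which helps once the rows are embedded and retrieved out of their original table context.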