Document Loaders in LangChain
- Naveen
In this article, we will look at the multiple ways LangChain loads documents, bringing information in from various sources and preparing it for processing. These loaders act like data connectors: they fetch information and convert it into a format LangChain understands.
LangChain ships with a large collection of document loaders; the full list is available in the official LangChain documentation.
We will cover some of them:
- TextLoader
- CSVLoader
- JSONLoader
- DirectoryLoader
- PyPDFLoader
- ArxivLoader
- Docx2txtLoader
- WebBaseLoader
- UnstructuredFileLoader
- UnstructuredURLLoader
- YoutubeAudioLoader
- NotionDirectoryLoader
TextLoader
```python
from langchain_community.document_loaders import TextLoader

text = '/content/check.txt'
loader = TextLoader(text)
loader.load()

# Output
[Document(page_content='India, country that occupies the greater part of South Asia. India is made up of 28 states and eight union territories, and its national capital is New Delhi, built in the 20th century just south of the historic hub of Old Delhi to serve as India’s administrative centre. Its government is a constitutional republic that represents a highly diverse population consisting of thousands of ethnic groups and hundreds of languages. India became the world’s most populous country in 2023, according to estimates by the United Nations.', metadata={'source': '/content/check.txt'})]
```
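TextLoader also accepts encoding options, which help when a file is not plain ASCII/UTF-8. A minimal sketch (the encoding value here is just an example; both parameters are real TextLoader arguments):

```python
# Pass an explicit encoding, or let the loader retry with a detected
# encoding when decoding fails.
loader = TextLoader(text, encoding="utf-8", autodetect_encoding=True)
docs = loader.load()
```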
CSVLoader
CSV files are a common format for storing tabular data, and the CSVLoader provides a convenient way to read and process this data.
```python
import pandas as pd

# Create a simple DataFrame
data = {
    'Name': ['Rohit', 'Ayaan', 'Ajay', 'Sandesh'],
    'Age': [26, 20, 23, 23],
    'City': ['Delhi', 'Mumbai', 'Noida', 'Chicago']
}
df = pd.DataFrame(data)

# Export the DataFrame to a CSV file
csv_file_path = 'sample_data.csv'
df.to_csv(csv_file_path, index=False)

from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(file_path='sample_data.csv')
data = loader.load()
data

# Output
[Document(page_content='Name: Rohit\nAge: 26\nCity: Delhi', metadata={'source': 'sample_data.csv', 'row': 0}),
 Document(page_content='Name: Ayaan\nAge: 20\nCity: Mumbai', metadata={'source': 'sample_data.csv', 'row': 1}),
 Document(page_content='Name: Ajay\nAge: 23\nCity: Noida', metadata={'source': 'sample_data.csv', 'row': 2}),
 Document(page_content='Name: Sandesh\nAge: 23\nCity: Chicago', metadata={'source': 'sample_data.csv', 'row': 3})]
```
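CSVLoader parses files with Python's csv.DictReader, and you can forward reader options through the csv_args parameter. A small sketch, assuming a hypothetical semicolon-delimited file:

```python
# 'sample_semicolon.csv' and the delimiter are assumptions for illustration;
# csv_args is passed straight through to csv.DictReader.
loader = CSVLoader(
    file_path='sample_semicolon.csv',
    csv_args={'delimiter': ';'},
)
data = loader.load()
```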
When you load data from a CSV file, the loader typically creates a separate “Document” object for each row of data in the CSV.
By default, the source of each Document is set to the entire file path of the CSV itself. This might not be ideal if you want to track where each piece of information comes from within the CSV.
You can specify a column name within your CSV file using source_column. The value in that specific column for each row will then be used as the individual source for the corresponding Document created from that row.
```python
loader = CSVLoader(file_path='sample_data.csv', source_column="Age")
data = loader.load()
data

# Output
[Document(page_content='Name: Rohit\nAge: 26\nCity: Delhi', metadata={'source': '26', 'row': 0}),
 Document(page_content='Name: Ayaan\nAge: 20\nCity: Mumbai', metadata={'source': '20', 'row': 1}),
 Document(page_content='Name: Ajay\nAge: 23\nCity: Noida', metadata={'source': '23', 'row': 2}),
 Document(page_content='Name: Sandesh\nAge: 23\nCity: Chicago', metadata={'source': '23', 'row': 3})]
```
This becomes particularly helpful when working with “chains” that involve answering questions based on the source of the information. By having individual source information for each Document, these chains can consider the origin of the data while processing and potentially provide more nuanced or reliable answers.
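As a hedged sketch of that idea: embed the row-level Documents into a vector store and let a QA-with-sources chain cite the source_column values of the rows it used. The vector store choice, models, and question below are my own assumptions, and an OpenAI API key is required:

```python
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import RetrievalQAWithSourcesChain

# Embed the per-row Documents loaded above (needs faiss-cpu installed).
vectorstore = FAISS.from_documents(data, OpenAIEmbeddings())

# The chain returns an answer plus the "source" metadata of the rows it drew on.
chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=ChatOpenAI(),
    retriever=vectorstore.as_retriever(),
)
result = chain.invoke({"question": "Which cities do the 23-year-olds live in?"})
print(result["answer"], result["sources"])
```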
JSONLoader
JSONLoader is designed to handle data stored in JSON. Suppose we have the following file, example.json:
```json
[
  {
    "id": 1,
    "name": "Ajay Kumar",
    "email": "ajay.kumar@example.com",
    "age": 23,
    "city": "Delhi"
  },
  {
    "id": 2,
    "name": "Rohit Sharma",
    "email": "rohit.sharma@example.com",
    "age": 26,
    "city": "Mumbai"
  },
  {
    "id": 3,
    "name": "Sandesh Tukrul",
    "email": "sandesh.tukrul@example.com",
    "age": 23,
    "city": "Noida"
  }
]
```
JSONLoader utilizes the jq library for parsing JSON data; jq offers a powerful query language designed specifically for manipulating JSON structures. The jq_schema parameter lets you pass a jq expression to JSONLoader.
```python
!pip install jq

from langchain_community.document_loaders import JSONLoader

loader = JSONLoader(
    file_path='example.json',
    jq_schema='map({ name, email })',
    text_content=False)

data = loader.load()
data

# Output
[Document(page_content="[{'name': 'Ajay Kumar', 'email': 'ajay.kumar@example.com'}, {'name': 'Rohit Sharma', 'email': 'rohit.sharma@example.com'}, {'name': 'Sandesh Tukrul', 'email': 'sandesh.tukrul@example.com'}]", metadata={'source': '/content/example.json', 'seq_num': 1})]
```
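Note that map({ name, email }) returns the whole mapped array as one result, so everything lands in a single Document. A slight variation of the jq expression (my own, not from the original article) emits one Document per record instead:

```python
# '.[] | {name, email}' yields one JSON object per array element, so
# JSONLoader creates a separate Document (with its own seq_num) for each.
loader = JSONLoader(
    file_path='example.json',
    jq_schema='.[] | {name, email}',
    text_content=False)

data = loader.load()
len(data)  # 3
```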
DirectoryLoader
DirectoryLoader loads all the documents in a directory. By default, it uses UnstructuredFileLoader under the hood.
We can use the glob parameter to control which files to load. With the pattern below, only .md files are loaded; any .rst or .html files in the directory are skipped.
```python
from langchain_community.document_loaders import DirectoryLoader

loader = DirectoryLoader('../', glob="**/*.md")
docs = loader.load()
len(docs)
```
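DirectoryLoader can also swap in a different per-file loader via loader_cls, and show a progress bar with show_progress (which needs tqdm installed). A sketch with assumed paths:

```python
from langchain_community.document_loaders import DirectoryLoader, TextLoader

# Load plain-text files with TextLoader instead of the unstructured default.
loader = DirectoryLoader('../', glob="**/*.txt",
                         loader_cls=TextLoader, show_progress=True)
docs = loader.load()
```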
PyPDFLoader
The PyPDFLoader is a powerful tool in LangChain for seamlessly loading and processing PDF documents. Utilizing the pypdf library, it preserves the structure and layout of PDFs while extracting text content. You can load entire documents or individual pages, enabling granular processing. PyPDFLoader integrates with LangChain’s ecosystem, allowing advanced natural language tasks like question answering on PDF data.
```python
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("/content/sample_data/MachineLearning-Lecture01.pdf")
pages_1 = loader.load()  # Each page is a Document. A Document contains text (page_content) and metadata.
len(pages_1)
"""22"""

page = pages_1[0]
print(page.page_content[:500])

# Output
MachineLearning-Lecture01
Instructor (Andrew Ng): Okay. Good morning. Welcome to CS229, the machine learning class. So what I wanna do today is ju st spend a little time going over the logistics of the class, and then we'll start to talk a bit about machine learning. By way of introduction, my name's Andrew Ng and I'll be instru ctor for this class. And so I personally work in machine learning, and I' ve worked on it for about 15 years now, and I actually think that machine learning i
```
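Each page-level Document also records the page it came from in its metadata, which is useful when you need to cite a location inside the PDF:

```python
# PyPDFLoader adds a 0-indexed 'page' key alongside 'source'.
print(page.metadata)
# e.g. {'source': '/content/sample_data/MachineLearning-Lecture01.pdf', 'page': 0}
```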
We can also use UnstructuredPDFLoader to load PDFs.
```python
from langchain_community.document_loaders import UnstructuredPDFLoader

loader = UnstructuredPDFLoader("/content/sample_data/MachineLearning-Lecture01.pdf")
data = loader.load()
```
We also have OnlinePDFLoader for loading PDFs hosted online.
```python
from langchain_community.document_loaders import OnlinePDFLoader

loader = OnlinePDFLoader("https://arxiv.org/pdf/2302.03803.pdf")
data = loader.load()
data

# Output
[Document(page_content='3 2 0 2\n\nb e F 7\n\n]\n\nG A . h t a m\n\n[\n\n1 v 3 0 8 3 0 . 2 0 3 2 : v i X r a\n\nA WEAK (k, k)-LEFSCHETZ THEOREM FOR PROJECTIVE TORIC ORBI...
```
There are many more PDF loaders that rely on different parsing libraries:
```python
# PyPDFium2Loader
from langchain_community.document_loaders import PyPDFium2Loader
loader = PyPDFium2Loader("text.pdf")
data = loader.load()

# PDFMinerLoader
from langchain_community.document_loaders import PDFMinerLoader
loader = PDFMinerLoader("text.pdf")
data = loader.load()

# PDFMinerPDFasHTMLLoader
from langchain_community.document_loaders import PDFMinerPDFasHTMLLoader
loader = PDFMinerPDFasHTMLLoader("text.pdf")
data = loader.load()[0]  # entire PDF is loaded as a single Document

# PyMuPDFLoader
from langchain_community.document_loaders import PyMuPDFLoader
loader = PyMuPDFLoader("text.pdf")
data = loader.load()

# Directory loader for PDF
from langchain_community.document_loaders import PyPDFDirectoryLoader
loader = PyPDFDirectoryLoader("folder/")
docs = loader.load()
```
ArxivLoader
The ArxivLoader from LangChain is a game-changer for researchers and academics, providing direct access to the extensive arXiv repository of open-access publications. With just a few lines of code, you can fetch and process cutting-edge research papers, unlocking a wealth of knowledge.
```python
from langchain_community.document_loaders import ArxivLoader

docs = ArxivLoader(query="1605.08386", load_max_docs=2).load()
print(len(docs))
print()
print(docs[0].metadata)

# Output
1

{'Published': '2016-05-26', 'Title': 'Heat-bath random walks with Markov bases', 'Authors': 'Caprice Stanley, Tobias Windisch', 'Summary': 'Graphs on lattice points are studied whose edges come from a finite set of\nallowed moves of arbitrary length. We show that the diameter of these graphs on\nfibers of a fixed integer matrix can be bounded from above by a constant. We\nthen study the mixing behaviour of heat-bath random walks on these graphs. We\nalso state explicit conditions on the set of moves so that the heat-bath random\nwalk, a generalization of the Glauber dynamics, is an expander in fixed\ndimension.'}
```
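The query does not have to be an arXiv ID; a free-text search works as well. A sketch with an assumed search term:

```python
# Fetch the top matches for a keyword search and print their titles.
docs = ArxivLoader(query="heat-bath random walks", load_max_docs=2).load()
for doc in docs:
    print(doc.metadata["Title"])
```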
Docx2txtLoader
The Docx2txtLoader is a specialized tool designed to handle Microsoft Office Word documents (docx files). It allows you to effortlessly load and extract text content from Word files, making it a valuable asset for processing and analyzing documentation, reports, and other text-based materials stored in this widely-used format. With Docx2txtLoader, you can seamlessly integrate Word document data into your natural language processing pipelines and workflows within the LangChain ecosystem.
```python
from langchain_community.document_loaders import Docx2txtLoader

loader = Docx2txtLoader("example_data.docx")
data = loader.load()
data

# Output
[Document(page_content='Lorem ipsum dolor sit amet.', metadata={'source': 'example_data.docx'})]
```
WebBaseLoader
```python
from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://github.com/basecamp/handbook/blob/master/37signals-is-you.md")
docs = loader.load()
print(docs[0].page_content[:500])
```
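WebBaseLoader also accepts a list of URLs and returns one Document per page. The URLs below are placeholders:

```python
loader = WebBaseLoader([
    "https://example.com/page-one",
    "https://example.com/page-two",
])
docs = loader.load()  # one Document per URL
```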
UnstructuredFileLoader
Unlike loaders designed for specific formats like TextLoader, UnstructuredFileLoader automatically detects the file type you provide.
The loader utilizes the “unstructured” library under the hood. This library analyzes the file content and attempts to extract meaningful information based on the file type.
```python
from langchain_community.document_loaders import UnstructuredFileLoader

loader = UnstructuredFileLoader('/content/textfile.txt')
docs = loader.load()
docs

# Output
[Document(page_content='The rise of generative models\n\nGenerative AI refers to deep-learning models that can take raw data—say, all of Wikipedia or the collected works of Rembrandt—and “learn” to generate statistically probable outputs when prompted. At a high level, generative models encode a simplified representation of their training data and draw from it to create a new work that’s similar, but not identical, to the original data. Generative models have been used for years in statistics to analyze numerical data. The rise of deep learning, however, made it possible to extend them to images, speech, and other complex data types. Among the first class of AI models to achieve this cross-over feat were variational autoencoders, or VAEs, introduced in 2013. VAEs were the first deep-learning models to be widely used for generating realistic images and speech.', metadata={'source': '/content/textfile.txt'})]

loader = UnstructuredFileLoader('/content/textfile.txt', mode="elements")
docs = loader.load()
docs

# Output
[Document(page_content='The rise of generative models', metadata={'source': '/content/textfile.txt', 'file_directory': '/content', 'filename': 'textfile.txt', 'last_modified': '2024-03-09T01:18:22', 'languages': ['eng'], 'filetype': 'text/plain', 'category': 'Title'}),
 Document(page_content='Generative AI refers to deep-learning models that can take raw data—say, all of Wikipedia or the collected works of Rembrandt—and “learn” to generate statistically probable outputs when prompted. At a high level, generative models encode a simplified representation of their training data and draw from it to create a new work that’s similar, but not identical, to the original data. Generative models have been used for years in statistics to analyze numerical data. The rise of deep learning, however, made it possible to extend them to images, speech, and other complex data types. Among the first class of AI models to achieve this cross-over feat were variational autoencoders, or VAEs, introduced in 2013. VAEs were the first deep-learning models to be widely used for generating realistic images and speech.', metadata={'source': '/content/textfile.txt', 'file_directory': '/content', 'filename': 'textfile.txt', 'last_modified': '2024-03-09T01:18:22', 'languages': ['eng'], 'filetype': 'text/plain', 'category': 'NarrativeText'})]

# pip install "unstructured[pdf]"
loader = UnstructuredFileLoader("text.pdf")
docs = loader.load()
docs

# Output
[Document(page_content='Event\n\nCommence Date\n\nReference\n\nPaul Kalkbrenner\n\n10 September,Satu info@biletino.com', metadata={'source': 'text.pdf'})]
```
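Because mode="elements" tags each Document with a category ('Title', 'NarrativeText', and so on), you can filter for just the pieces you need. A small sketch against the elements-mode output above:

```python
# Keep only the body text, dropping titles and other element types.
narrative = [d for d in docs if d.metadata.get("category") == "NarrativeText"]
```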
UnstructuredURLLoader
```python
from langchain.document_loaders import UnstructuredURLLoader

loader = UnstructuredURLLoader(urls=['https://see.stanford.edu/materials/aimlcs229/transcripts/MachineLearning-Lecture01.pdf'])
pages = loader.load()
len(pages)
"""1"""

page = pages[0]
print(page.page_content[:500])

# Output
MachineLearning-Lecture01 Instructor (Andrew Ng): Okay. Good morning. Welcome to CS229, the machine learning class. So what I wanna do today is just spend a little time going over the logistics of the class, and then we'll start to talk a bit about machine learning. By way of introduction, my name's Andrew Ng and I'll be instructor for this class. And so I personally work in machine learning, and I've worked on it for about 15 years now, and I actually think that machine learning is the most e
```
YoutubeAudioLoader
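The pipeline below chains three pieces: YoutubeAudioLoader downloads a video's audio with yt_dlp, OpenAIWhisperParser transcribes that audio with OpenAI's Whisper API (so an OpenAI API key is required), and GenericLoader glues the two together into Documents.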
```python
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

# ! pip install yt_dlp
# ! pip install pydub

url = "https://www.youtube.com/watch?v=jGwO_UgTS7I"
save_dir = "docs/youtube/"
loader = GenericLoader(
    YoutubeAudioLoader([url], save_dir),
    OpenAIWhisperParser()
)
docs = loader.load()
docs[0].page_content[0:500]
```
NotionDirectoryLoader
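NotionDirectoryLoader reads a folder containing an exported Notion workspace, i.e., the Markdown and CSV files produced by Notion's export feature (the docs/Notion_DB path below is assumed to be such an export directory).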
```python
from langchain.document_loaders import NotionDirectoryLoader

loader = NotionDirectoryLoader("docs/Notion_DB")
docs = loader.load()
print(docs[0].page_content[0:200])
docs[0].metadata
```