Loading and Processing Documents with LangChain
Introductionβ
LangChain provides powerful tools for loading various document formats like Markdown, HTML, and PDF into a common representation that enables downstream natural language processing tasks. This guide covers the key concepts and includes concrete examples for real-world usage.
Loading Markdownβ
Markdown is a popular lightweight markup language used for formatting plain text documents. LangChain provides easy loading of Markdown through the UnstructuredMarkdownLoader
:
from langchain.document_loaders import UnstructuredMarkdownLoader
markdown_path = "README.md"
loader = UnstructuredMarkdownLoader(markdown_path)
data = loader.load()
The load()
method parses the Markdown and returns a list of Document
objects containing the content.
For example, loading the LangChain README:
data
[Document(page_content="# LangChain\n\nπ Building applications with LLMs through composability π\n\n...",
metadata={'source': 'README.md'})]
The page_content
contains the full extracted text. The metadata
includes the source path.
Retaining Markdown Elementsβ
By default, the Markdown is combined into a single text block. To retain the original sections, headers, etc., specify mode="elements"
:
loader = UnstructuredMarkdownLoader(markdown_path, mode="elements")
data = loader.load()
data[0]
Document(page_content='πΌπ€ LangChain',
metadata={'source': 'README.md',
'page_number': 1,
'category': 'Title'})
Now the elements are separated, with metadata like category
identifying headers, paragraphs, and more.
Loading HTMLβ
Loading HTML works similarly using UnstructuredHTMLLoader
:
from langchain.document_loaders import UnstructuredHTMLLoader
loader = UnstructuredHTMLLoader("index.html")
data = loader.load()
This extracts all text from the HTML document.
For more structured extraction, BSHTMLLoader
uses BeautifulSoup to parse the HTML and capture additional metadata like page titles:
from langchain.document_loaders import BSHTMLLoader
loader = BSHTMLLoader("index.html")
data = loader.load()
print(data[0].metadata['title'])
Loading PDFsβ
To load PDF documents, LangChain provides the PyPDFium2Loader
:
from langchain.document_loaders import PyPDFium2Loader
loader = PyPDFium2Loader("paper.pdf")
data = loader.load()
This uses the PyPDFium2 library to extract text and metadata from PDFs.
Real-World Usageβ
The document loading capabilities of LangChain enable many downstream NLP applications. Here are two concrete examples:
Search/QA over Papers: Load a folder of academic papers as PDFs. Then implement a chain that takes questions, searches the paper contents, and returns an extractive answer.
Summarize Blog Posts: Load a WordPress site's HTML pages. Summarize each post into a short paragraph to generate a digest.
Some other applications enabled by the loaders:
- Chatbots that load relevant documents to a conversation for more knowledgeable responses.
- Agents that load webpages to extract information from or reason about.
The unified Document
format returned by the loaders provides a consistent way to bring external data into a LangChain application.
Limitationsβ
Some limitations to be aware of when loading documents:
- Very large documents may exceed memory constraints. Consider streaming or chunking.
- Complex formats like PDF require additional parsing logic beyond basic text extraction.
- Expect noise and inconsistencies when extracting text from arbitrary web pages.
Conclusionβ
LangChain makes it simple to load documents in various formats for downstream NLP tasks. The built-in loaders handle details like Markdown parsing and PDF extraction so you can focus on your core application logic.