Document Loaders
Document loaders allow you to load data from various sources into Document objects that can be used in LangChain. A Document contains the text content as well as any associated metadata.
There are built-in document loaders for many common data sources like files, web pages, YouTube transcripts, and more. See the Integrations page for details on all the available document loader integrations.
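For example, a web page loader works the same way as a file loader. A minimal sketch, assuming the beautifulsoup4 package is installed (WebBaseLoader depends on it for HTML parsing):
from langchain.document_loaders import WebBaseLoader
# Fetch the page and extract its text content
loader = WebBaseLoader("https://example.com")
docs = loader.load()  # one Document, with the URL recorded in its metadata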
Key Features
The main features of document loaders are:
- Load: Load data from a source into Document objects
- Lazy Load: Lazily load data into memory only when needed
- Splitting: Automatically split documents into smaller chunks
Get Started
The simplest loader reads in a file as text and places it all into one Document:
from langchain.document_loaders import TextLoader
# Read the whole file into a single Document
loader = TextLoader("./index.md")
docs = loader.load()
This loads the contents of index.md into a single Document.
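Each Document exposes its text as page_content and its metadata as a dict; for a TextLoader, the metadata records the source path:
print(docs[0].page_content)  # the full text of index.md
print(docs[0].metadata)      # {'source': './index.md'}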
Lazy Loading
With lazy loading, documents are read from the source one at a time as you request them, rather than all at once. This is useful for large datasets where you don't want everything in memory at the same time.
Call lazy_load() and iterate over the result to process each document as it is loaded:
for doc in loader.lazy_load():
    print(doc)
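This makes it easy to process documents in fixed-size batches without ever holding the full dataset in memory. A minimal sketch (the batch size of 8 and the handle_batch step are arbitrary illustrations, not LangChain APIs):
def handle_batch(batch):
    # Hypothetical downstream step, e.g. embedding or indexing
    print(f"processing {len(batch)} documents")

batch = []
for doc in loader.lazy_load():
    batch.append(doc)
    if len(batch) == 8:  # flush each full batch
        handle_batch(batch)
        batch = []
if batch:  # don't drop the final partial batch
    handle_batch(batch)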
Splitting Documents
The recommended default text splitter is the RecursiveCharacterTextSplitter. It takes a list of separator characters: it tries to create chunks by splitting on the first separator, and if any chunks are still too large it moves on to the next one, and so forth.
To split documents into smaller chunks:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = TextLoader("./file.txt")
splitter = RecursiveCharacterTextSplitter()
# load_and_split loads the file and runs the result through the splitter
docs = loader.load_and_split(text_splitter=splitter)
Now docs will contain multiple Document objects split from the original text.
There are many different text splitting strategies available to handle different document types. You can also customize parameters like chunk size and overlap amount.
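For example, to target roughly 500-character chunks with a 50-character overlap between consecutive chunks (both values here are arbitrary illustrations):
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # maximum characters per chunk
    chunk_overlap=50,  # characters shared between consecutive chunks
)
chunks = splitter.split_documents(docs)  # re-split already-loaded Documents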
Tips for Large Datasets
When working with large volumes of documents:
- Use lazy loading to avoid loading everything into memory
- Load documents in parallel using multiprocessing (see the sketch after this list)
- Split documents into small chunks suitable for your model
- Store embeddings for fast nearest-neighbor search
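A minimal sketch of parallel loading with the standard library, assuming a set of local text files (load_one is a helper defined here, not a LangChain API):
from concurrent.futures import ProcessPoolExecutor
from langchain.document_loaders import TextLoader

def load_one(path):
    # Each worker process loads one file into Document objects
    return TextLoader(path).load()

if __name__ == "__main__":
    paths = ["./a.txt", "./b.txt", "./c.txt"]  # example file paths
    with ProcessPoolExecutor() as pool:
        results = pool.map(load_one, paths)
    docs = [doc for batch in results for doc in batch]  # flatten per-file lists
Whether a process pool pays off depends on the loader: CPU-heavy parsing benefits from separate processes, while purely I/O-bound loaders may do just as well with threads.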
See the Vector Stores page for more details on efficiently working with large datasets.