Recursively Split Text by Characters
The RecursiveCharacterTextSplitter is recommended for splitting generic text into semantically meaningful chunks. It works by recursively trying a list of separators to split the text until the resulting chunks are small enough.
How it Works
The RecursiveCharacterTextSplitter tries to split text on a provided list of separator characters, in order. The default separator list is:
["\n\n", "\n", " ", ""]
This has the effect of trying to keep paragraphs, sentences, and words together in the chunks.
Chunk size is measured by the provided length_function; with len, that is the number of characters. Splitting continues down the separator list until every chunk fits within the configured chunk size.
Spaces and the empty string serve as last resorts: the empty string splits between individual characters, which guarantees no chunk exceeds the limit, at the cost of semantic coherence.
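The recursive strategy can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's actual implementation: it omits the merge step that packs adjacent small pieces back up to the chunk size, and it ignores chunk overlap.

```python
def recursive_split(text, separators, chunk_size):
    """Simplified sketch of the recursive splitting strategy.

    Unlike the real splitter, this does not merge adjacent small
    pieces back up to chunk_size and ignores chunk overlap.
    """
    if len(text) <= chunk_size:
        return [text]
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard split at the character level.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= chunk_size:
            chunks.append(piece)
        else:
            # Piece is still too large: retry with the next separator.
            chunks.extend(recursive_split(piece, rest, chunk_size))
    return chunks

text = "First paragraph here.\n\nA much longer second paragraph that will not fit."
for chunk in recursive_split(text, ["\n\n", "\n", " ", ""], 25):
    print(chunk)
```

Note how the first paragraph survives intact while the oversized second paragraph falls through to the space separator; the real splitter would then merge those word-level pieces back into chunks close to the chunk size.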
Usage
Import the RecursiveCharacterTextSplitter:
from langchain.text_splitter import RecursiveCharacterTextSplitter
Initialize it with the desired chunk size and overlap:
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=100,
chunk_overlap=20,
length_function=len
)
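The chunk_overlap parameter makes consecutive chunks share some trailing text, which helps preserve context across chunk boundaries. A rough, hypothetical illustration of the idea (the real splitter applies overlap while merging splits, not as a post-processing step like this):

```python
def add_overlap(chunks, chunk_overlap):
    """Hypothetical sketch of what chunk_overlap means.

    Prefixes each chunk with the tail of the previous one so that
    neighboring chunks share up to chunk_overlap characters.
    """
    out = [chunks[0]]
    for prev, cur in zip(chunks, chunks[1:]):
        out.append(prev[-chunk_overlap:] + cur)
    return out

print(add_overlap(["The quick brown fox", "jumps over the dog"], 4))
```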
Then split text:
texts = text_splitter.create_documents([text])
You can also customize the separator list:
text_splitter = RecursiveCharacterTextSplitter(
separators=["\n\n", "\n", " "],
chunk_size=100,
chunk_overlap=20,
length_function=len
)
Examples
Split a long text file into chunks:
with open('book.txt') as f:
    book = f.read()
texts = text_splitter.create_documents([book])
To get plain string chunks instead of Document objects, use split_text. For example, the first five chunks:
chunks = text_splitter.split_text(book)[:5]
Pass along metadata during splitting:
metadata = {"title": "My Book"}
texts = text_splitter.create_documents([book], metadatas=[metadata])
The RecursiveCharacterTextSplitter is useful for splitting long documents into smaller chunks while trying to maintain semantic meaning. It provides flexibility to customize the separator characters and chunk size as needed.