Analyze Document
The AnalyzeDocumentChain is an end-to-end chain that allows running various NLP tasks over long documents by splitting the input document into chunks, mapping each chunk through a chain, and then reducing the outputs into a single response.
Basic Usage
The AnalyzeDocumentChain can be used as an end-to-end chain to process long documents. It takes in a single document, splits it up into chunks, runs each chunk through a mapping chain, and then combines the results using a reducing chain.
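Conceptually, the chain does roughly the following (an illustrative sketch of the flow, not the library's actual implementation):

def analyze_document(document: str, text_splitter, combine_docs_chain) -> str:
    # 1. Split the raw text into smaller Document chunks
    docs = text_splitter.create_documents([document])
    # 2. Map each chunk and 3. reduce the results, both handled by the combine chain
    return combine_docs_chain.run(docs)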
For example, here is how to summarize a long document:
from langchain import OpenAI
from langchain.chains import AnalyzeDocumentChain
from langchain.chains.summarize import load_summarize_chain

llm = OpenAI(temperature=0)

# Map-reduce summarization chain applied to the document chunks
summary_chain = load_summarize_chain(llm, chain_type="map_reduce")

# Wrap it so a single long string can be passed in directly
summarize_document_chain = AnalyzeDocumentChain(combine_docs_chain=summary_chain)

with open("long_document.txt") as f:
    document = f.read()

summary = summarize_document_chain.run(document)
print(summary)
This splits the document, summarizes each chunk, and combines the summaries into one overall summary.
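The summarization prompts themselves can also be overridden when loading the map_reduce chain; a sketch using the loader's map_prompt and combine_prompt parameters (the prompt wording here is illustrative):

from langchain.prompts import PromptTemplate

bullet_prompt = PromptTemplate(
    input_variables=["text"],
    template="Write a concise bullet point summary of the following:\n\n{text}",
)
summary_chain = load_summarize_chain(
    llm,
    chain_type="map_reduce",
    map_prompt=bullet_prompt,
    combine_prompt=bullet_prompt,
)
summarize_document_chain = AnalyzeDocumentChain(combine_docs_chain=summary_chain)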
We can also use it for question answering:
from langchain.chains.question_answering import load_qa_chain

qa_chain = load_qa_chain(llm, chain_type="map_reduce")
qa_document_chain = AnalyzeDocumentChain(combine_docs_chain=qa_chain)

# The QA chain takes two inputs, so pass them as keyword arguments
answer = qa_document_chain.run(
    input_document=document,
    question="What is the main topic of the document?",
)
print(answer)
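The same chain instance can be reused for any number of questions. Note that each call re-splits and re-maps the whole document, so for many questions over a very large document a retriever-based setup is usually cheaper. A quick sketch:

questions = [
    "What is the main topic of the document?",
    "What conclusions does the author draw?",
]
for question in questions:
    print(question)
    print(qa_document_chain.run(input_document=document, question=question))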
Customizing the Chain
The AnalyzeDocumentChain is customizable by configuring the:

- text_splitter: splits the input text into chunks. For example, the SentenceTextSplitter.
- llm_chain: the mapping chain run on each chunk, for example a summarization or QA chain. This is set on the combine_docs_chain (e.g. as the llm_chain of a MapReduceDocumentsChain), not on the AnalyzeDocumentChain itself.
- combine_docs_chain: the reducing chain that combines the mapped outputs, such as a ReduceDocumentsChain.
For example, to summarize each sentence and combine the results:
from langchain.chains import LLMChain
from langchain.chains.combine_documents.map_reduce import MapReduceDocumentsChain
from langchain.chains.combine_documents.reduce import ReduceDocumentsChain
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.prompts import PromptTemplate
from langchain.text_splitter import SentenceTextSplitter

# LLM chain that summarizes each sentence
map_prompt = PromptTemplate(
    template="Summarize this sentence: {text}", input_variables=["text"]
)
map_llm_chain = LLMChain(llm=llm, prompt=map_prompt)

# LLM chain that merges the per-sentence summaries into one text
combine_prompt = PromptTemplate(
    template="Combine these summaries into a single summary: {text}",
    input_variables=["text"],
)
combine_llm_chain = LLMChain(llm=llm, prompt=combine_prompt)

# Reduce step: stuff all sentence summaries into the combine prompt
combine_docs_chain = ReduceDocumentsChain(
    combine_documents_chain=StuffDocumentsChain(llm_chain=combine_llm_chain)
)

# Map each chunk with the sentence summarizer, then reduce
map_reduce_chain = MapReduceDocumentsChain(
    llm_chain=map_llm_chain, reduce_documents_chain=combine_docs_chain
)

# Create end-to-end chain
doc_chain = AnalyzeDocumentChain(
    text_splitter=SentenceTextSplitter(),
    combine_docs_chain=map_reduce_chain,
)
summary = doc_chain.run(long_document)
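When iterating on a custom chain like this, it helps to see exactly what each step sends to the model. verbose is a standard field on LangChain chains, so a sketch would be:

# Print formatted prompts and chain progress while debugging
map_llm_chain = LLMChain(llm=llm, prompt=map_prompt, verbose=True)
doc_chain = AnalyzeDocumentChain(
    text_splitter=SentenceTextSplitter(),
    combine_docs_chain=map_reduce_chain,
    verbose=True,
)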
Multi Input Prompts
You can also use prompts with multiple inputs. For example, to answer questions about code:
from langchain.chains.combine_documents.map_reduce import MapReduceDocumentsChain
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.text_splitter import CharacterTextSplitter

# Prompts for mapping and reducing
MAP_PROMPT = PromptTemplate(
    input_variables=["code"],
    template="Describe this code: {code}",
)
REDUCE_PROMPT = PromptTemplate(
    input_variables=["code_description", "question"],
    template=(
        "Here are descriptions of pieces of a codebase:\n"
        "{code_description}\n\n"
        "Answer this question about the code: {question}"
    ),
)

# Create mapping and reducing chains
map_llm_chain = LLMChain(llm=llm, prompt=MAP_PROMPT)
reduce_llm_chain = LLMChain(llm=llm, prompt=REDUCE_PROMPT)

# Combine the per-chunk code descriptions into one string
combine_docs_chain = StuffDocumentsChain(
    llm_chain=reduce_llm_chain, document_variable_name="code_description"
)

# Create MapReduce chain
map_reduce_chain = MapReduceDocumentsChain(
    llm_chain=map_llm_chain,
    reduce_documents_chain=ReduceDocumentsChain(
        combine_documents_chain=combine_docs_chain
    ),
)

# Create end-to-end chain
doc_chain = AnalyzeDocumentChain(
    text_splitter=CharacterTextSplitter(separator="\n\n"),
    combine_docs_chain=map_reduce_chain,
)

# `code` holds the source text to analyze
doc_chain.run(input_document=code, question="What is the time complexity?")
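Inputs other than the document itself (here, question) are passed through to both the map and reduce steps, so the map prompt can reference the question as well and produce chunk descriptions focused on it. A sketch; note that once the map prompt has more than one input variable, the document variable must be named explicitly:

MAP_PROMPT = PromptTemplate(
    input_variables=["code", "question"],
    template="Describe this code, focusing on anything relevant to: {question}\n\n{code}",
)
map_llm_chain = LLMChain(llm=llm, prompt=MAP_PROMPT)
map_reduce_chain = MapReduceDocumentsChain(
    llm_chain=map_llm_chain,
    document_variable_name="code",  # required: the prompt has more than one variable
    reduce_documents_chain=ReduceDocumentsChain(
        combine_documents_chain=combine_docs_chain
    ),
)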
Using Refine Chains
The refine chain type can also be used with AnalyzeDocumentChain for things like summarization. For example:
from langchain.chains.summarize import load_summarize_chain
summary_chain = load_summarize_chain(llm, chain_type="refine")
doc_chain = AnalyzeDocumentChain(combine_docs_chain=summary_chain)
summary = doc_chain.run(long_document)
We can also return the intermediate refinement steps with return_refine_steps=True.
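A sketch of inspecting those steps, assuming the flag is passed straight through to the underlying refine chain and that the extra output appears under an intermediate_steps key (some LangChain versions name it refine_steps instead):

summary_chain = load_summarize_chain(
    llm, chain_type="refine", return_refine_steps=True
)
doc_chain = AnalyzeDocumentChain(combine_docs_chain=summary_chain)

# With more than one output key, call the chain with a dict instead of .run()
result = doc_chain({"input_document": long_document})
print(result["output_text"])
for step in result["intermediate_steps"]:  # assumed key name; may be "refine_steps"
    print(step)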
Troubleshooting
Some common issues and solutions:
- Chain exceeds context size: use a smaller chunk_size or simplify the mapping chain.
- Reducing takes a long time: use a larger chunk_size (or less chunk_overlap) so fewer chunks are produced.
- Repeated or irrelevant text: check the mapping and reducing prompts.
- Missing key information: tweak the mapping prompt to retain details.
- Errors with customization: validate that the configuration matches the documentation.
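The first two items usually come down to the text splitter's settings; a sketch of tightening them, assuming CharacterTextSplitter (the numbers are illustrative):

from langchain.text_splitter import CharacterTextSplitter

doc_chain = AnalyzeDocumentChain(
    combine_docs_chain=summary_chain,
    text_splitter=CharacterTextSplitter(
        separator="\n\n",
        chunk_size=2000,   # smaller chunks keep each mapping call within the context window
        chunk_overlap=0,   # less overlap means fewer, shorter chunks overall
    ),
)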