Analyze Document

The AnalyzeDocumentChain is an end-to-end chain that allows running various NLP tasks over long documents by splitting the input document into chunks, mapping each chunk through a chain, and then reducing the outputs into a single response.

Basic Usage

The AnalyzeDocumentChain processes a long document end to end: it takes in a single document, splits it up into chunks, runs each chunk through a mapping chain, and then combines the results using a reducing chain.

For example, here is how to summarize a long document:

from langchain import OpenAI
from langchain.chains.summarize import load_summarize_chain

llm = OpenAI(temperature=0)
summary_chain = load_summarize_chain(llm, chain_type="map_reduce")

from langchain.chains import AnalyzeDocumentChain

summarize_document_chain = AnalyzeDocumentChain(combine_docs_chain=summary_chain)

with open("long_document.txt") as f:
    document = f.read()

summary = summarize_document_chain.run(document)
print(summary)

This splits the document, summarizes each chunk, and combines the summaries into one overall summary.
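To see what the splitting step actually produces, you can call the chain's text splitter directly. A minimal sketch: split_text is the generic TextSplitter method, and the chain exposes whichever splitter it was configured with (here the default one):

# Inspect how the document will be chunked before it is mapped
chunks = summarize_document_chain.text_splitter.split_text(document)
print(f"{len(chunks)} chunks, first chunk starts with: {chunks[0][:100]!r}")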

We can also use it for question answering:

from langchain.chains.question_answering import load_qa_chain

qa_chain = load_qa_chain(llm, chain_type="map_reduce")

qa_document_chain = AnalyzeDocumentChain(combine_docs_chain=qa_chain)

answer = qa_document_chain.run(
    input_document=document,
    question="What is the main topic of the document?",
)

print(answer)
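Each call re-splits and re-maps the entire document, so answering several questions simply means calling run once per question. A minimal sketch with hypothetical questions:

questions = [
    "What is the main topic of the document?",
    "Who are the key people mentioned?",
]
for q in questions:
    print(q, "->", qa_document_chain.run(input_document=document, question=q))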

Customizing the Chain

The AnalyzeDocumentChain is customized through its two main fields:

  • text_splitter: Splits the input text into chunks. Any TextSplitter works here, such as CharacterTextSplitter or RecursiveCharacterTextSplitter.

  • combine_docs_chain: The chain that processes the chunks and combines them into a single output. The mapping step lives inside this chain; for a MapReduceDocumentsChain, its llm_chain is run on each chunk and its reduce_documents_chain combines the mapped outputs.

For example, to split a document into sentence-based chunks, summarize each chunk, and then combine the chunk summaries (NLTKTextSplitter requires the nltk package):

from langchain.chains import LLMChain
from langchain.chains.combine_documents.map_reduce import MapReduceDocumentsChain
from langchain.chains.combine_documents.reduce import ReduceDocumentsChain
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.prompts import PromptTemplate
from langchain.text_splitter import NLTKTextSplitter

# Mapping chain: summarize each chunk
map_prompt = PromptTemplate(
    template="Summarize this text: {text}",
    input_variables=["text"],
)
map_llm_chain = LLMChain(llm=llm, prompt=map_prompt)

# Reducing chain: combine the chunk summaries into one text
reduce_prompt = PromptTemplate(
    template="Combine these summaries into a single summary: {text}",
    input_variables=["text"],
)
combine_docs_chain = ReduceDocumentsChain(
    combine_documents_chain=StuffDocumentsChain(
        llm_chain=LLMChain(llm=llm, prompt=reduce_prompt),
        document_variable_name="text",
    )
)

# Map-reduce chain that ties the two steps together
map_reduce_chain = MapReduceDocumentsChain(
    llm_chain=map_llm_chain,
    reduce_documents_chain=combine_docs_chain,
    document_variable_name="text",
)

# Create end-to-end chain: split into sentence-based chunks, then map-reduce
doc_chain = AnalyzeDocumentChain(
    text_splitter=NLTKTextSplitter(),
    combine_docs_chain=map_reduce_chain,
)

summary = doc_chain.run(document)

Multi Input Prompts

You can also use prompts with multiple inputs. For example, to answer questions about code:

from langchain.chains.combine_documents.map_reduce import MapReduceDocumentsChain
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.text_splitter import CharacterTextSplitter

# Prompt for mapping: describe each chunk of code
MAP_PROMPT = PromptTemplate(
    input_variables=["code"],
    template="Describe this code: {code}",
)

# Prompt for reducing: answer the question from the collected descriptions
REDUCE_PROMPT = PromptTemplate(
    input_variables=["code_description", "question"],
    template=(
        "Given these code descriptions:\n{code_description}\n\n"
        "Answer this question about the code: {question}"
    ),
)

# Create mapping and reducing chains
map_llm_chain = LLMChain(llm=llm, prompt=MAP_PROMPT)
reduce_llm_chain = LLMChain(llm=llm, prompt=REDUCE_PROMPT)

# Stuff the code descriptions into the {code_description} variable of the reduce prompt
combine_docs_chain = StuffDocumentsChain(
    llm_chain=reduce_llm_chain,
    document_variable_name="code_description",
)

# Create MapReduce chain; the split chunks fill the {code} variable of the map prompt
map_reduce_chain = MapReduceDocumentsChain(
    llm_chain=map_llm_chain,
    reduce_documents_chain=ReduceDocumentsChain(
        combine_documents_chain=combine_docs_chain
    ),
    document_variable_name="code",
)

# Create end-to-end chain
doc_chain = AnalyzeDocumentChain(
    text_splitter=CharacterTextSplitter(separator="\n\n"),
    combine_docs_chain=map_reduce_chain,
)

doc_chain.run(input_document=code, question="What is the time complexity?")
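In this setup the split chunks fill the {code} variable of the map prompt, the mapped descriptions are stuffed into the {code_description} variable of the reduce prompt (that is what document_variable_name selects), and the question input is passed along and only consumed by the reduce prompt.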

Using Refine Chains

The refine chain type can also be used with AnalyzeDocumentChain for things like summarization. For example:

from langchain.chains.summarize import load_summarize_chain

summary_chain = load_summarize_chain(llm, chain_type="refine")

doc_chain = AnalyzeDocumentChain(combine_docs_chain=summary_chain)

summary = doc_chain.run(document)

We can also return the intermediate refinement steps by passing return_refine_steps=True to load_summarize_chain.
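A minimal sketch, assuming the loader forwards return_refine_steps to the underlying RefineDocumentsChain (where it maps to return_intermediate_steps). Because run() only supports a single output, call the chain directly:

summary_chain = load_summarize_chain(llm, chain_type="refine", return_refine_steps=True)
doc_chain = AnalyzeDocumentChain(combine_docs_chain=summary_chain)

# Calling the chain with a dict returns all outputs, not just the final text
result = doc_chain({"input_document": document})
print(result["output_text"])         # final summary
print(result["intermediate_steps"])  # the summary after each refine pass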

Troubleshooting

Some common issues and solutions:

  • Chain exceeds context size: Use a smaller chunk_size on the text splitter, or simplify the mapping prompt (see the sketch after this list).

  • Reducing takes a long time: Increase chunk_size so the document is split into fewer chunks, which means fewer mapped outputs to combine.

  • Repeated or irrelevant text in the output: Review the mapping and reducing prompts, and lower chunk_overlap so the same passage is not summarized twice.

  • Missing key information: Tweak the mapping prompt so it retains the details you care about (for example, ask it to preserve names, figures, and dates).

  • Errors with customization: Check that the fields you pass (text_splitter, combine_docs_chain) and the prompt input variables match what each chain expects.
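For the context-size and performance issues above, the first knob to turn is usually the text splitter. A minimal sketch (the chunk sizes are placeholder values):

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Smaller chunks fit within the model's context window; larger chunks mean fewer map calls
splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)

# Check how many chunks the splitter produces before wiring it into the chain
print(len(splitter.split_text(document)), "chunks")

doc_chain = AnalyzeDocumentChain(
    combine_docs_chain=summary_chain,
    text_splitter=splitter,
)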