Analyze Document

The AnalyzeDocumentChain is an end-to-end chain that allows running various NLP tasks over long documents by splitting the input document into chunks, mapping each chunk through a chain, and then reducing the outputs into a single response.

Basic Usage

The AnalyzeDocumentChain processes a long document end to end: it takes in a single document, splits it up into chunks, runs each chunk through a mapping chain, and then combines the results using a reducing chain.

For example, here is how to summarize a long document:

from langchain import OpenAI
from langchain.chains.summarize import load_summarize_chain

llm = OpenAI(temperature=0)
summary_chain = load_summarize_chain(llm, chain_type="map_reduce")

from langchain.chains import AnalyzeDocumentChain

summarize_document_chain = AnalyzeDocumentChain(combine_docs_chain=summary_chain)

with open("long_document.txt") as f:
    document = f.read()

summary = summarize_document_chain.run(document)
print(summary)

This splits the document, summarizes each chunk, and combines the summaries into one overall summary.
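To see what the splitting step actually produces, you can call the chain's text splitter directly. A minimal sketch: split_text is the generic TextSplitter method, and the chain exposes whichever splitter it was configured with (here the default one):

# Inspect how the document will be chunked before it is mapped
chunks = summarize_document_chain.text_splitter.split_text(document)
print(f"{len(chunks)} chunks, first chunk starts with: {chunks[0][:100]!r}")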

We can also use it for question answering:

from langchain.chains.question_answering import load_qa_chain

qa_chain = load_qa_chain(llm, chain_type="map_reduce")

qa_document_chain = AnalyzeDocumentChain(combine_docs_chain=qa_chain)

answer = qa_document_chain.run(
    input_document=document,
    question="What is the main topic of the document?",
)

print(answer)
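Each call re-splits and re-maps the entire document, so answering several questions simply means calling run once per question. A minimal sketch with hypothetical questions:

questions = [
    "What is the main topic of the document?",
    "Who are the key people mentioned?",
]
for q in questions:
    print(q, "->", qa_document_chain.run(input_document=document, question=q))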

Customizing the Chain

The AnalyzeDocumentChain is customized through its two main fields:

  • text_splitter: Splits the input text into chunks. Any TextSplitter works here, such as CharacterTextSplitter or RecursiveCharacterTextSplitter.

  • combine_docs_chain: The chain that processes the chunks and combines them into a single output. The mapping step lives inside this chain; for a MapReduceDocumentsChain, its llm_chain is run on each chunk and its reduce_documents_chain combines the mapped outputs.

For example, to split a document into sentence-based chunks, summarize each chunk, and then combine the chunk summaries (NLTKTextSplitter requires the nltk package):

from langchain.chains import LLMChain
from langchain.chains.combine_documents.map_reduce import MapReduceDocumentsChain
from langchain.chains.combine_documents.reduce import ReduceDocumentsChain
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.prompts import PromptTemplate
from langchain.text_splitter import NLTKTextSplitter

# Mapping chain: summarize each chunk
map_prompt = PromptTemplate(
    template="Summarize this text: {text}",
    input_variables=["text"],
)
map_llm_chain = LLMChain(llm=llm, prompt=map_prompt)

# Reducing chain: combine the chunk summaries into one text
reduce_prompt = PromptTemplate(
    template="Combine these summaries into a single summary: {text}",
    input_variables=["text"],
)
combine_docs_chain = ReduceDocumentsChain(
    combine_documents_chain=StuffDocumentsChain(
        llm_chain=LLMChain(llm=llm, prompt=reduce_prompt),
        document_variable_name="text",
    )
)

# Map-reduce chain that ties the two steps together
map_reduce_chain = MapReduceDocumentsChain(
    llm_chain=map_llm_chain,
    reduce_documents_chain=combine_docs_chain,
    document_variable_name="text",
)

# Create end-to-end chain: split into sentence-based chunks, then map-reduce
doc_chain = AnalyzeDocumentChain(
    text_splitter=NLTKTextSplitter(),
    combine_docs_chain=map_reduce_chain,
)

summary = doc_chain.run(document)

Multi Input Prompts

You can also use prompts with multiple inputs. For example, to answer questions about code:

from langchain.chains.combine_documents.map_reduce import MapReduceDocumentsChain
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.text_splitter import CharacterTextSplitter

# Prompt for mapping: describe each chunk of code
MAP_PROMPT = PromptTemplate(
    input_variables=["code"],
    template="Describe this code: {code}",
)

# Prompt for reducing: answer the question from the collected descriptions
REDUCE_PROMPT = PromptTemplate(
    input_variables=["code_description", "question"],
    template=(
        "Given these code descriptions:\n{code_description}\n\n"
        "Answer this question about the code: {question}"
    ),
)

# Create mapping and reducing chains
map_llm_chain = LLMChain(llm=llm, prompt=MAP_PROMPT)
reduce_llm_chain = LLMChain(llm=llm, prompt=REDUCE_PROMPT)

# Stuff the code descriptions into the {code_description} variable of the reduce prompt
combine_docs_chain = StuffDocumentsChain(
    llm_chain=reduce_llm_chain,
    document_variable_name="code_description",
)

# Create MapReduce chain; the split chunks fill the {code} variable of the map prompt
map_reduce_chain = MapReduceDocumentsChain(
    llm_chain=map_llm_chain,
    reduce_documents_chain=ReduceDocumentsChain(
        combine_documents_chain=combine_docs_chain
    ),
    document_variable_name="code",
)

# Create end-to-end chain
doc_chain = AnalyzeDocumentChain(
    text_splitter=CharacterTextSplitter(separator="\n\n"),
    combine_docs_chain=map_reduce_chain,
)

doc_chain.run(input_document=code, question="What is the time complexity?")
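In this setup the split chunks fill the {code} variable of the map prompt, the mapped descriptions are stuffed into the {code_description} variable of the reduce prompt (that is what document_variable_name selects), and the question input is passed along and only consumed by the reduce prompt.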

Using Refine Chains

The refine chain type can also be used with AnalyzeDocumentChain for things like summarization. For example:

from langchain.chains.summarize import load_summarize_chain

summary_chain = load_summarize_chain(llm, chain_type="refine")

doc_chain = AnalyzeDocumentChain(combine_docs_chain=summary_chain)

summary = doc_chain.run(document)

We can also return the intermediate refinement steps by passing return_refine_steps=True to load_summarize_chain.
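A minimal sketch, assuming the loader forwards return_refine_steps to the underlying RefineDocumentsChain (where it maps to return_intermediate_steps). Because run() only supports a single output, call the chain directly:

summary_chain = load_summarize_chain(llm, chain_type="refine", return_refine_steps=True)
doc_chain = AnalyzeDocumentChain(combine_docs_chain=summary_chain)

# Calling the chain with a dict returns all outputs, not just the final text
result = doc_chain({"input_document": document})
print(result["output_text"])         # final summary
print(result["intermediate_steps"])  # the summary after each refine pass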

Troubleshooting

Some common issues and solutions:

  • Chain exceeds context size: Use a smaller chunk_size on the text splitter, or simplify the mapping prompt (see the sketch after this list).

  • Reducing takes a long time: Increase chunk_size so the document is split into fewer chunks, which means fewer mapped outputs to combine.

  • Repeated or irrelevant text in the output: Review the mapping and reducing prompts, and lower chunk_overlap so the same passage is not summarized twice.

  • Missing key information: Tweak the mapping prompt so it retains the details you care about (for example, ask it to preserve names, figures, and dates).

  • Errors with customization: Check that the fields you pass (text_splitter, combine_docs_chain) and the prompt input variables match what each chain expects.
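For the context-size and performance issues above, the first knob to turn is usually the text splitter. A minimal sketch (the chunk sizes are placeholder values):

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Smaller chunks fit within the model's context window; larger chunks mean fewer map calls
splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)

# Check how many chunks the splitter produces before wiring it into the chain
print(len(splitter.split_text(document)), "chunks")

doc_chain = AnalyzeDocumentChain(
    combine_docs_chain=summary_chain,
    text_splitter=splitter,
)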