Skip to main content

Connecting External Data to Language Models

Many language model applications require connecting user-specific data that is not part of the model's training set. The data connection module in LangChain provides the building blocks to load, transform, store and query external data, enabling it to be used alongside language models.

Overview

The main components of the data connection module are:

  • Document Loaders: Load documents from diverse sources like local files, databases, APIs, etc.
  • Document Transformers: Preprocess documents by splitting, converting to QA pairs, removing duplicates etc.
  • Text Embedding Models: Convert text to vector representations capturing semantic meaning.
  • Vector Stores: Index and store document vectors for efficient similarity search.
  • Retrievers: Query vector stores to find relevant documents.

Data Connection Pipeline

These components allow building pipelines to bring in external data, process it, index it for fast retrieval, and query it from language models.

Document Loaders

Document loaders allow loading data from various sources:

  • Local files and folders
  • Databases via SQLAlchemy
  • APIs via Requests
  • Google Drive, S3 Buckets

Custom loaders can be written for other data sources.

Examples:

# Load local text file
loader = DocumentLoader()
docs = loader.load_from_path('data.txt')

# Load SQL database
loader = PSQLDocumentLoader(table='documents', db_url='...')
docs = loader.load()

Document Transformers

Transformers preprocess loaded documents before indexing. Useful transformations:

  • Splitting documents into passages
  • Converting documents to question-answer pairs
  • Removing duplicate/irrelevant docs

Multiple transformers can be chained for complex pipelines.

Examples:

# Split documents
splitter = SplitDocuments(chunk_size=100)
docs = splitter.transform(docs)

# Convert to QA
to_qa = Doc2QA()
docs = to_qa.transform(docs)

Text Embedding Models

Embedders convert text to vector representations capturing semantic meaning, allowing vector similarity based retrieval.

LangChain supports models like SBERT, GPT-2 etc. Custom embeddings can also be provided.

Example:

embedder = SBERTEmbedding()
embeddings = embedder.embed_documents(docs)

Vector Stores

Vector stores like FAISS and Pinecone index embeddings for fast similarity search. Stores can persist data to disk.

Example:

store = FAISS()
store.add_documents(embeddings)

Retrievers

Retrievers query vector stores to find relevant documents using text or embeddings. Different retrievers optimize for latency, accuracy etc.

Example:

retriever = EmbeddingRetriever(vectorstore=store)
results = retriever.retrieve("What is LangChain?")

The data connection module provides the building blocks to connect external data to language models. The modular design allows customizing pipelines for specific use cases.