Streaming

Some Chat models provide a streaming response, meaning you can start processing the output as soon as tokens are available instead of waiting for the entire generation to finish. This is useful if you want to display partial results to the user, or process the response incrementally as it is generated.

Currently, we support streaming for a broad range of Chat model implementations, including ChatOpenAI and ChatAnthropic. To utilize streaming, use a CallbackHandler that implements on_llm_new_token.

Displaying Streaming Results

One way to use streaming is to print out the results in real-time. Here is an example using StreamingStdOutCallbackHandler:

from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

chat = ChatOpenAI(streaming=True, callbacks=[StreamingStdOutCallbackHandler()])
resp = chat([HumanMessage(content="Write me a song about sparkling water.")])

This will print the song lyrics as they are generated.
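
For reference, StreamingStdOutCallbackHandler is little more than an on_llm_new_token implementation that writes each token to stdout. A roughly equivalent custom handler looks like this:

import sys

from langchain.callbacks.base import BaseCallbackHandler

class MyStreamingHandler(BaseCallbackHandler):

    def on_llm_new_token(self, token, **kwargs):
        # Print each token as soon as it arrives, without buffering.
        sys.stdout.write(token)
        sys.stdout.flush()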

Persisting Streaming Results

Another common use case is to persist the streamed output to a database or file storage. Here is an example that accumulates tokens in on_llm_new_token and saves each completed message to SQLite in on_llm_end:

import sqlite3

from langchain.callbacks.base import BaseCallbackHandler
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage

class ChatHistoryCallback(BaseCallbackHandler):

    def __init__(self):
        self.conn = sqlite3.connect("chat_history.db")
        self.c = self.conn.cursor()
        self.c.execute("""CREATE TABLE IF NOT EXISTS chathistory
                          (id INTEGER PRIMARY KEY, role TEXT, content TEXT)""")
        self.tokens = []

    def on_llm_new_token(self, token, **kwargs):
        # Accumulate tokens as they stream in.
        self.tokens.append(token)

    def on_llm_end(self, response, **kwargs):
        # Persist the completed message once streaming finishes.
        self.c.execute("INSERT INTO chathistory VALUES (NULL, ?, ?)",
                       ("assistant", "".join(self.tokens)))
        self.conn.commit()
        self.tokens = []

# Usage:
chat = ChatOpenAI(streaming=True, callbacks=[ChatHistoryCallback()])
chat([HumanMessage(content="Hello!")])

When persisting results, be mindful of potential bottlenecks. If the chat model generates tokens faster than your database can ingest them, you may need queueing, buffering, or rate limiting, as sketched below.
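
As a rough sketch of the buffering idea, this subclass of the ChatHistoryCallback above batches rows and commits them in groups rather than once per message (the batch size is an arbitrary illustration):

class BufferedHistoryCallback(ChatHistoryCallback):

    BATCH_SIZE = 50  # illustrative; tune to your database

    def __init__(self):
        super().__init__()
        self.buffer = []

    def on_llm_end(self, response, **kwargs):
        # Buffer completed messages instead of committing each one.
        self.buffer.append(("assistant", "".join(self.tokens)))
        self.tokens = []
        if len(self.buffer) >= self.BATCH_SIZE:
            self.c.executemany("INSERT INTO chathistory VALUES (NULL, ?, ?)",
                               self.buffer)
            self.conn.commit()
            self.buffer.clear()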

Real-time NLP

You can also process each message as it streams in to enable real-time NLP. For example, you could perform sentiment analysis on each chat message and aggregate the results.
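
Here is a minimal sketch of the idea. The naive_sentiment scorer below is a toy stand-in for whatever sentiment model or library you would actually use:

from langchain.callbacks.base import BaseCallbackHandler

POSITIVE = {"great", "good", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "sad"}

def naive_sentiment(text):
    # Toy scorer: +1 per positive word, -1 per negative word.
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

class SentimentCallback(BaseCallbackHandler):

    def __init__(self):
        self.tokens = []
        self.scores = []

    def on_llm_new_token(self, token, **kwargs):
        self.tokens.append(token)

    def on_llm_end(self, response, **kwargs):
        # Score the completed message and keep a running aggregate.
        self.scores.append(naive_sentiment("".join(self.tokens)))
        self.tokens = []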

Exposing via HTTP

Finally, you can expose streaming over HTTP by wiring a CallbackHandler into a web framework like FastAPI, so that tokens are forwarded to clients as they are generated.
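
Here is a minimal sketch assuming FastAPI; the /chat endpoint and QueueCallback are illustrative names. A queue-backed handler collects tokens while generation runs in a background thread, and a StreamingResponse drains the queue to the client:

import queue
import threading

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from langchain.callbacks.base import BaseCallbackHandler
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage

app = FastAPI()

class QueueCallback(BaseCallbackHandler):

    def __init__(self, q):
        self.q = q

    def on_llm_new_token(self, token, **kwargs):
        # Forward each token to the HTTP response via the queue.
        self.q.put(token)

    def on_llm_end(self, response, **kwargs):
        self.q.put(None)  # sentinel: generation finished

@app.get("/chat")
def chat_endpoint(prompt: str):
    q = queue.Queue()
    chat = ChatOpenAI(streaming=True, callbacks=[QueueCallback(q)])

    # Run generation in a background thread so tokens stream immediately.
    threading.Thread(target=chat, args=([HumanMessage(content=prompt)],),
                     daemon=True).start()

    def token_stream():
        while (token := q.get()) is not None:
            yield token

    return StreamingResponse(token_stream(), media_type="text/plain")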

This allows you to build powerful applications on top of streaming chat models!