Using Moderation Chains to Prevent Harmful LLM Outputs
Language models can sometimes generate harmful, dangerous, or unethical outputs. Moderation chains in LangChain provide a way to detect and handle these cases before harmful content reaches your users.
Overview
Moderation chains are useful for detecting language model output that could be hateful, violent, or otherwise harmful. Some API providers like OpenAI specifically prohibit generating certain types of harmful content in their usage policies. Using a moderation chain is an important step to comply with these policies and prevent your application from generating harmful text.
There are a few ways moderation chains can handle harmful text:
- Throw an error that your application code can catch and handle
- Return a string to the user explaining that the text was harmful
- Apply custom logic tailored to your specific application
We'll cover examples of:
- Running any text through a moderation chain
- Appending a moderation chain to an LLMChain
- Best practices for handling harmful text
Running Text Through a Moderation Chain
Let's see some examples of running text through a moderation chain.
First we'll create a simple moderation chain with default settings:
from langchain.chains import OpenAIModerationChain
moderation_chain = OpenAIModerationChain()
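The chain calls OpenAI's moderation endpoint under the hood, so it needs API credentials. A minimal sketch, assuming you supply the key through the standard environment variable:

import os
os.environ["OPENAI_API_KEY"] = "..."  # your real key, or export it in your shell instead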
By default, the chain returns the input text unchanged if it passes moderation, and a canned message if it violates OpenAI's content policy:
moderation_chain.run("This is okay")
# 'This is okay'
moderation_chain.run("I will kill you")
# "Text was found that violates OpenAI's content policy."
We can also configure the chain to throw an error on violations:
moderation_chain_error = OpenAIModerationChain(error=True)
moderation_chain_error.run("This is okay")
# 'This is okay'
moderation_chain_error.run("I will kill you")
# Raises ValueError
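The raised error lets your application code decide how to respond. A minimal sketch, assuming you want to substitute a generic fallback message (the wording is illustrative, not part of LangChain):

try:
    result = moderation_chain_error.run("I will kill you")
except ValueError:
    # Catch the violation and substitute a safe fallback for the user
    result = "Sorry, that request could not be processed."
print(result)
# Sorry, that request could not be processed.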
Finally, we can subclass the chain to implement custom logic by overriding _moderate, which receives the original text along with the moderation results:
class CustomModeration(OpenAIModerationChain):
    def _moderate(self, text: str, results: dict) -> str:
        # results is the moderation endpoint's result for this text;
        # "flagged" is True when the text violates the content policy
        if results["flagged"]:
            error_str = f"The following text was found harmful: {text}"
            return error_str
        return text
custom_moderation = CustomModeration()
custom_moderation.run("This is okay")
# 'This is okay'
custom_moderation.run("I will kill you")
# "The following text was found harmful: I will kill you"
Appending Moderation to LLMChains
Now let's look at appending a moderation chain to an LLMChain using LangChain's sequential chain abstractions.
We'll start with a simple LLMChain that has a single input:
from langchain.chains import LLMChain
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
prompt = PromptTemplate(template="{text}", input_variables=["text"])
llm_chain = LLMChain(llm=OpenAI(temperature=0), prompt=prompt)
text = """
Person 1: I will kill you
Person 2:
"""
llm_chain.run(text)
# 'I will kill you'
We can append the moderation chain using SimpleSequentialChain, which works here because each chain takes a single input and returns a single output:
from langchain.chains import SequentialChain, SimpleSequentialChain
chain = SimpleSequentialChain(chains=[llm_chain, moderation_chain])
chain.run(text)
# "Text was found that violates OpenAI's content policy."
For an LLMChain with multiple inputs, we need to use SequentialChain and align the input/output keys between the two chains:
prompt = PromptTemplate(
    template="{setup}{new_input}Person 2:",
    input_variables=["setup", "new_input"],
)
llm_chain = LLMChain(llm=OpenAI(temperature=0), prompt=prompt)
setup = "..."  # conversation setup text, elided here
new_input = "I will kill you"
inputs = {"setup": setup, "new_input": new_input}
# The LLMChain writes its output under the key "text" by default, so point the
# moderation chain's input there and give its output a distinct key
moderation_chain.input_key = "text"
moderation_chain.output_key = "sanitized_text"
chain = SequentialChain(chains=[llm_chain, moderation_chain],
                        input_variables=["setup", "new_input"])
chain(inputs, return_only_outputs=True)
# {'sanitized_text': "Text was found that violates OpenAI's content policy."}
Best Practices for Handling Harmful Text
Once the moderation chain detects harmful text, there are several best practices for handling it gracefully:
- Log the incident with relevant context so you can track how often it occurs and debug issues.
- Return a generic error message to the user rather than the flagged text itself.
- If possible, allow the user to edit their input and retry.
- Have human review processes in place for appeals or edge cases.
- Use the least restrictive settings that still mitigate harm. Overly strict moderation creates a poor user experience.
- Implement thoughtful UI/UX to communicate issues tactfully with users.
The right approach depends on your specific application. The key is having a plan in place to handle errors smoothly.
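To make a few of these practices concrete, here is a minimal sketch that logs the incident server-side and returns a generic message to the user; the logger name and message text are illustrative assumptions, not part of LangChain:

import logging

from langchain.chains import OpenAIModerationChain

logger = logging.getLogger("moderation")

class LoggingModeration(OpenAIModerationChain):
    def _moderate(self, text: str, results: dict) -> str:
        if results["flagged"]:
            # Log the flagged text with context so incidents can be tracked and debugged
            logger.warning("Moderation flagged model output: %r", text)
            # Return a generic message rather than echoing the flagged text
            return "Sorry, we couldn't generate a response to that request."
        return text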
Conclusion
Using moderation chains is an important step toward preventing your application from generating harmful text. The patterns shown here demonstrate several ways to integrate moderation chains into your workflows. With the right error-handling approach, you can build safer, more responsible LLM-powered applications.