Comparison Evaluators
Comparison evaluators in LangChain measure and compare the outputs of two different chains, agents, or language models given the same input. They are helpful for comparative analyses such as A/B testing between two models or evaluating different versions of the same model, and they can also be used to generate preference scores for reinforcement learning.
Overview
Comparison evaluators inherit from the PairwiseStringEvaluator class, which provides a standard interface for evaluating two string outputs - typically generated by different prompts, models, or model versions. At their core, comparison evaluators take two string outputs and return a dictionary containing an evaluation score and other relevant details.
Some common use cases:
- Comparing outputs of different language models or prompts
- A/B testing model versions during iterative improvement
- Generating preference scores for reinforcement learning
Usage
The following example loads the built-in pairwise embedding distance evaluator and uses it to compare two output strings:

from langchain.evaluation import load_evaluator

evaluator = load_evaluator("pairwise_embedding_distance")

output1 = "This is the first output string"
output2 = "This is the second output string"

results = evaluator.evaluate_string_pairs(prediction=output1, prediction_b=output2)

The returned dictionary contains the evaluation score (here, the embedding distance between the two outputs) and other relevant details. Note that this evaluator needs an embedding model to compute the distance, so the corresponding credentials may need to be available in your environment.
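Many comparisons also take the shared input into account. As a rough sketch (this assumes credentials for the default judge LLM, an OpenAI chat model, are configured, or that you pass your own model via llm=...; the question and answers are purely illustrative), the LLM-as-judge pairwise string evaluator can be given the input alongside the two candidate outputs:

from langchain.evaluation import load_evaluator

# "pairwise_string" uses an LLM as a judge to pick the preferred output.
evaluator = load_evaluator("pairwise_string")

results = evaluator.evaluate_string_pairs(
    input="What is 2 + 2?",
    prediction="2 + 2 equals 4.",
    prediction_b="I am not sure.",
)
# The result typically includes a preference score and the judge's reasoning.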
Customization
You can create custom comparison evaluators by inheriting from PairwiseStringEvaluator and implementing the _evaluate_string_pairs method (and _aevaluate_string_pairs if you need async evaluation); a minimal example is sketched after the list below.
The key methods and properties to override are:
- evaluate_string_pairs: Implement this to evaluate pairs of strings.
- aevaluate_string_pairs: Implement this for async evaluation.
- requires_input: Boolean indicating if input is needed.
- requires_reference: Boolean indicating if reference labels are needed.
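As a minimal sketch (the class name and word-overlap scoring below are made up for illustration, and the keyword-only signature of _evaluate_string_pairs is assumed to match recent LangChain versions), a custom evaluator that prefers whichever output shares more words with a reference label might look like this:

from typing import Any, Optional

from langchain.evaluation import PairwiseStringEvaluator


class WordOverlapPairwiseEvaluator(PairwiseStringEvaluator):
    """Toy evaluator: prefers the prediction sharing more words with the reference."""

    @property
    def requires_input(self) -> bool:
        return False

    @property
    def requires_reference(self) -> bool:
        return True

    def _evaluate_string_pairs(
        self,
        *,
        prediction: str,
        prediction_b: str,
        reference: Optional[str] = None,
        input: Optional[str] = None,
        **kwargs: Any,
    ) -> dict:
        ref_words = set((reference or "").lower().split())
        overlap_a = len(ref_words & set(prediction.lower().split()))
        overlap_b = len(ref_words & set(prediction_b.lower().split()))
        # Score 1 if the first prediction wins, 0 if the second does.
        return {"score": 1 if overlap_a >= overlap_b else 0}


evaluator = WordOverlapPairwiseEvaluator()
results = evaluator.evaluate_string_pairs(
    prediction="The cat sat on the mat",
    prediction_b="A dog barked",
    reference="The cat is on the mat",
)

Because requires_reference is True here, calling the public evaluate_string_pairs method without a reference should raise an error.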
See the custom comparison evaluator guide for more details and examples.
Built-in Evaluators
LangChain ships with built-in comparison evaluators, including a pairwise string comparison evaluator (which uses an LLM to judge which of two outputs is preferred, optionally against a reference label) and a pairwise embedding distance evaluator (which scores how semantically close the two outputs are to each other).
See the API reference docs for details on usage and configuration of these built-in evaluators.
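For instance, the labeled variant of the pairwise string evaluator ("labeled_pairwise_string") also grades the pair against a reference answer. As with the other LLM-judged example above, this assumes credentials for the judge model are available, and the question and answers are purely illustrative:

from langchain.evaluation import load_evaluator

# LLM-as-judge comparison that also takes a reference label into account.
evaluator = load_evaluator("labeled_pairwise_string")

results = evaluator.evaluate_string_pairs(
    input="How many states are in the USA?",
    prediction="There are 50 states.",
    prediction_b="There are 52 states.",
    reference="There are 50 states in the USA.",
)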