Comparison Evaluators
Comparison evaluators in LangChain measure and compare the outputs of two different chains, agents, or language models given the same input. They are helpful for comparative analyses such as A/B testing between two models or evaluating different versions of the same model, and they can also be used to generate preference scores for reinforcement learning.
Overview
Comparison evaluators inherit from the PairwiseStringEvaluator class, which provides a standard interface for evaluating two string outputs - typically generated by different prompts, models, or model versions. At their core, comparison evaluators take two string outputs and return a dictionary containing an evaluation score and other relevant details.
Some common use cases:
- Comparing outputs of different language models or prompts
- A/B testing model versions during iterative improvement
- Generating preference scores for reinforcement learning
Usage
The following example loads the built-in pairwise embedding distance evaluator and uses it to compare two output strings:

from langchain.evaluation import load_evaluator

evaluator = load_evaluator("pairwise_embedding_distance")

output1 = "This is the first output string"
output2 = "This is the second output string"

results = evaluator.evaluate_string_pairs(prediction=output1, prediction_b=output2)

The returned dictionary contains the evaluation score (here, the embedding distance between the two outputs) and other relevant details. Note that this evaluator needs an embedding model to compute the distance, so the corresponding credentials may need to be available in your environment.
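Many comparisons also take the shared input into account. As a rough sketch (this assumes credentials for the default judge LLM, an OpenAI chat model, are configured, or that you pass your own model via llm=...; the question and answers are purely illustrative), the LLM-as-judge pairwise string evaluator can be given the input alongside the two candidate outputs:

from langchain.evaluation import load_evaluator

# "pairwise_string" uses an LLM as a judge to pick the preferred output.
evaluator = load_evaluator("pairwise_string")

results = evaluator.evaluate_string_pairs(
    input="What is 2 + 2?",
    prediction="2 + 2 equals 4.",
    prediction_b="I am not sure.",
)
# The result typically includes a preference score and the judge's reasoning.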
Customization
You can create custom comparison evaluators by inheriting from PairwiseStringEvaluator and implementing the _evaluate_string_pairs method (and _aevaluate_string_pairs if you need async evaluation); a minimal example is sketched after the list below.
The key methods and properties to override are:
- evaluate_string_pairs: Implement this to evaluate pairs of strings.
- aevaluate_string_pairs: Implement this for async evaluation.
- requires_input: Boolean indicating if input is needed.
- requires_reference: Boolean indicating if reference labels are needed.
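As a minimal sketch (the class name and word-overlap scoring below are made up for illustration, and the keyword-only signature of _evaluate_string_pairs is assumed to match recent LangChain versions), a custom evaluator that prefers whichever output shares more words with a reference label might look like this:

from typing import Any, Optional

from langchain.evaluation import PairwiseStringEvaluator


class WordOverlapPairwiseEvaluator(PairwiseStringEvaluator):
    """Toy evaluator: prefers the prediction sharing more words with the reference."""

    @property
    def requires_input(self) -> bool:
        return False

    @property
    def requires_reference(self) -> bool:
        return True

    def _evaluate_string_pairs(
        self,
        *,
        prediction: str,
        prediction_b: str,
        reference: Optional[str] = None,
        input: Optional[str] = None,
        **kwargs: Any,
    ) -> dict:
        ref_words = set((reference or "").lower().split())
        overlap_a = len(ref_words & set(prediction.lower().split()))
        overlap_b = len(ref_words & set(prediction_b.lower().split()))
        # Score 1 if the first prediction wins, 0 if the second does.
        return {"score": 1 if overlap_a >= overlap_b else 0}


evaluator = WordOverlapPairwiseEvaluator()
results = evaluator.evaluate_string_pairs(
    prediction="The cat sat on the mat",
    prediction_b="A dog barked",
    reference="The cat is on the mat",
)

Because requires_reference is True here, calling the public evaluate_string_pairs method without a reference should raise an error.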
See the custom comparison evaluator guide for more details and examples.
Built-in Evaluators
LangChain ships with built-in comparison evaluators, including a pairwise string comparison evaluator (which uses an LLM to judge which of two outputs is preferred, optionally against a reference label) and a pairwise embedding distance evaluator (which scores how semantically close the two outputs are to each other).
See the API reference docs for details on usage and configuration of these built-in evaluators.
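For instance, the labeled variant of the pairwise string evaluator ("labeled_pairwise_string") also grades the pair against a reference answer. As with the other LLM-judged example above, this assumes credentials for the judge model are available, and the question and answers are purely illustrative:

from langchain.evaluation import load_evaluator

# LLM-as-judge comparison that also takes a reference label into account.
evaluator = load_evaluator("labeled_pairwise_string")

results = evaluator.evaluate_string_pairs(
    input="How many states are in the USA?",
    prediction="There are 50 states.",
    prediction_b="There are 52 states.",
    reference="There are 50 states in the USA.",
)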