Evaluation

Overview

Evaluation is a critical part of developing reliable applications with language models. LangChain provides various evaluator types and tools to help measure model performance across different scenarios.

As mentioned in the LangChain FAQ:

LangChain offers various types of evaluators to help you measure performance and integrity on diverse data, and we hope to encourage the community to create and share other useful evaluators so everyone can improve.

Evaluator Types

LangChain offers the following main evaluator types:

String Evaluators

String evaluators compare a model's generated text against a reference string or input. They are useful for basic accuracy testing and quality checks.

As noted in the FAQ:

String Evaluators assess the predicted string for a given input, usually comparing it against a reference string.

For example:

from langchain.evaluation import load_evaluator

# Load the built-in string-distance evaluator (edit-distance based comparison).
evaluator = load_evaluator("string_distance")

question = "What is the capital of France?"
reference = "The capital of France is Paris."
prediction = llm.invoke(question)  # llm is any LangChain LLM; invoke returns its output string

# Returns a dict with a "score" key; a smaller distance means a closer match.
result = evaluator.evaluate_strings(prediction=prediction, reference=reference)

To create a custom string evaluator, inherit from StringEvaluator and implement _evaluate_strings.
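For instance, a minimal sketch might look like the following; the ExactMatchEvaluator class and its scoring rule are illustrative, not part of LangChain:

from typing import Any, Optional

from langchain.evaluation import StringEvaluator

class ExactMatchEvaluator(StringEvaluator):
    """Illustrative custom evaluator: 1.0 when the prediction matches the reference."""

    @property
    def requires_reference(self) -> bool:
        return True

    def _evaluate_strings(
        self,
        *,
        prediction: str,
        reference: Optional[str] = None,
        input: Optional[str] = None,
        **kwargs: Any,
    ) -> dict:
        # Case-insensitive exact match; swap in your own scoring logic here.
        score = float(prediction.strip().lower() == (reference or "").strip().lower())
        return {"score": score}

Calling evaluate_strings(prediction=..., reference=...) on an instance then returns the score dict.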

Comparison Evaluators

Comparison evaluators compare the outputs of two models (or prompts) on the same input. They are helpful for A/B testing and for computing preference scores.

As stated in the FAQ:

Comparison Evaluators are designed to compare predictions from two runs on a common input.

For example:

from langchain.evaluation import load_evaluator

# Pairwise embedding-distance evaluator: scores how far apart two outputs are
# in embedding space (it uses an embeddings model rather than an LLM grader).
evaluator = load_evaluator("pairwise_embedding_distance")

question = "How do I make pasta?"
pred_1 = llm1.invoke(question)  # output from the first model
pred_2 = llm2.invoke(question)  # output from the second model

result = evaluator.evaluate_string_pairs(prediction=pred_1, prediction_b=pred_2)
# result contains a "score"; a smaller distance means the two outputs are more similar.

To build a custom comparison evaluator, inherit from PairwiseStringEvaluator and implement _evaluate_string_pairs.
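As a sketch, the LengthPreferenceEvaluator below is illustrative, not a LangChain class; only the PairwiseStringEvaluator interface is assumed:

from typing import Any, Optional

from langchain.evaluation import PairwiseStringEvaluator

class LengthPreferenceEvaluator(PairwiseStringEvaluator):
    """Illustrative custom evaluator: prefers the shorter of the two predictions."""

    def _evaluate_string_pairs(
        self,
        *,
        prediction: str,
        prediction_b: str,
        reference: Optional[str] = None,
        input: Optional[str] = None,
        **kwargs: Any,
    ) -> dict:
        # A score of 0 means the first prediction wins, 1 means the second wins.
        return {"score": 0 if len(prediction) <= len(prediction_b) else 1}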

Trajectory Evaluators

Trajectory evaluators measure agent performance across an entire interaction. They assess qualities like consistency, coherence, and goal achievement.

As noted in the FAQ:

Trajectory Evaluators are used to evaluate the entire trajectory of agent actions.

For example:

from langchain.evaluation import load_evaluator

# The built-in "trajectory" evaluator uses an LLM grader (ChatOpenAI by default)
# to score an agent's full run, including its intermediate tool calls.
evaluator = load_evaluator("trajectory")

# agent_executor is an AgentExecutor built with return_intermediate_steps=True.
result = agent_executor.invoke({"input": "What is 3 to the power of 0.43?"})

evaluation = evaluator.evaluate_agent_trajectory(
    input=result["input"],
    prediction=result["output"],
    agent_trajectory=result["intermediate_steps"],
)
# evaluation includes a "score" between 0 and 1 plus the grader's reasoning.

To create a custom trajectory evaluator, inherit from AgentTrajectoryEvaluator and implement _evaluate_agent_trajectory.
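A minimal sketch, assuming the AgentTrajectoryEvaluator interface; the StepCountEvaluator class and its scoring heuristic are illustrative only:

from typing import Any, Optional, Sequence, Tuple

from langchain.evaluation import AgentTrajectoryEvaluator
from langchain.schema import AgentAction

class StepCountEvaluator(AgentTrajectoryEvaluator):
    """Illustrative custom evaluator: fewer intermediate steps earns a higher score."""

    def _evaluate_agent_trajectory(
        self,
        *,
        prediction: str,
        input: str,
        agent_trajectory: Sequence[Tuple[AgentAction, str]],
        reference: Optional[str] = None,
        **kwargs: Any,
    ) -> dict:
        # Reward agents that reach an answer with fewer tool calls.
        return {"score": 1.0 / (1 + len(agent_trajectory))}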

Usage Examples

The docs also include guides and cookbooks that show how to apply evaluators in real-world use cases.

As noted in the FAQ:

We also are working to share guides and cookbooks that demonstrate how to use these evaluators in real-world scenarios, such as:

  • Chain Comparisons: This example uses a comparison evaluator to predict the preferred output. It reviews ways to measure confidence intervals to select statistically significant differences in aggregate preference scores across different models or prompts.
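As a rough illustration of the aggregation idea (this is not the cookbook's code, and the win counts below are made up), preference scores from repeated pairwise comparisons can be summarized with a simple normal-approximation confidence interval:

import math

# Hypothetical results: out of 100 pairwise comparisons, model A was preferred 62 times.
wins_a, total = 62, 100
p_hat = wins_a / total  # aggregate preference score for model A

# 95% normal-approximation (Wald) confidence interval for the preference rate.
z = 1.96
margin = z * math.sqrt(p_hat * (1 - p_hat) / total)
low, high = p_hat - margin, p_hat + margin

# If the interval excludes 0.5, the preference is statistically significant at roughly the 5% level.
print(f"Preference for A: {p_hat:.2f} (95% CI: {low:.2f} to {high:.2f})")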

Reference Documentation

For detailed API docs on the evaluators, see the reference documentation.