Automatic Evaluators
Automatic evaluators can be rule-based or graded by large language models (LLMs), and they can programmatically run on LLM input or output. Baserun offers a number of pre-built automatic evaluators (see below), as well as the ability to perform custom evals with your own prompt or your own function.
Complete Reference
Checks if the submission contains any of the expected values, or if any of the expected values contain the submission. Returns true if there’s a fuzzy match, otherwise false.

Parameters:
- Name of the evaluation.
- The input string.
- A string or a list of strings to check against.
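To make the rule concrete, here is a minimal sketch of the containment check described above. The helper name and signature are ours for illustration only, not Baserun’s SDK API.

```python
from typing import List, Union

# Illustrative sketch of the fuzzy-match rule described above;
# the function name and signature are ours, not Baserun's SDK API.
def fuzzy_match(submission: str, expected: Union[str, List[str]]) -> bool:
    values = [expected] if isinstance(expected, str) else expected
    return any(value in submission or submission in value for value in values)

print(fuzzy_match("hello world", ["hello"]))    # True: the submission contains "hello"
print(fuzzy_match("hi", ["hi there, friend"]))  # True: an expected value contains the submission
print(fuzzy_match("goodbye", ["hello", "hi"]))  # False: no fuzzy match
```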
Checks if the submission does not start with any of the expected values. Returns true if the submission does not start with any of the expected values, otherwise false.

Parameters:
- Name of the evaluation.
- The input string.
- A string or a list of strings to check against.
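A minimal sketch of this check, assuming it is a plain prefix test; the helper name is ours for illustration.

```python
from typing import List, Union

# Illustrative sketch only; the name and signature are not Baserun's API.
def does_not_start_with(submission: str, expected: Union[str, List[str]]) -> bool:
    values = [expected] if isinstance(expected, str) else expected
    return not any(submission.startswith(value) for value in values)

print(does_not_start_with("As an AI model, I cannot...", ["As an AI", "I'm sorry"]))  # False
print(does_not_start_with("Sure, here is the answer.", ["As an AI", "I'm sorry"]))    # True
```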
Checks if the submission does not contain any of the expected values. Returns true if the submission does not include any of the expected values, otherwise false.

Parameters:
- Name of the evaluation.
- The input string.
- A string or a list of strings to check against.
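The same pattern with a substring test instead of a prefix test (again an illustrative sketch, not the SDK implementation):

```python
from typing import List, Union

# Illustrative sketch only; not Baserun's implementation.
def does_not_include(submission: str, expected: Union[str, List[str]]) -> bool:
    values = [expected] if isinstance(expected, str) else expected
    return not any(value in submission for value in values)

print(does_not_include("The weather is sunny today.", ["rain", "snow"]))  # True
print(does_not_include("Expect heavy rain tonight.", ["rain", "snow"]))   # False
```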
Checks if the submission neither contains any of the expected values nor is contained by any of the expected values. Returns true if there’s no fuzzy match, otherwise false.

Parameters:
- Name of the evaluation.
- The input string.
- A string or a list of strings to check against.
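This is simply the negation of the fuzzy-match rule sketched earlier; illustratively:

```python
from typing import List, Union

# Negation of the fuzzy-match sketch shown earlier (illustrative only).
def no_fuzzy_match(submission: str, expected: Union[str, List[str]]) -> bool:
    values = [expected] if isinstance(expected, str) else expected
    return not any(value in submission or submission in value for value in values)

print(no_fuzzy_match("goodbye", ["hello", "hi"]))  # True: no fuzzy match
```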
Evaluates the model’s response based on a prompt and a set of choices. Returns the choice given by the model.

Parameters:
- Name of the evaluation.
- The prompt passed to the model.
- A dictionary of choices and their scores.
- The OpenAI model that you want to use for the evaluation.
- Any metadata that might be useful for you.
- Variables that will be substituted in the formatted prompt.
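A hypothetical usage sketch follows. The import path, function name, and keyword arguments are assumptions drawn from the parameter descriptions above, not confirmed SDK identifiers; consult the SDK reference for the exact API.

```python
# Hypothetical sketch: every identifier below (evals.model_graded_custom and
# its keyword arguments) is assumed from the parameter list above, not taken
# from the actual SDK reference.
from baserun import evals

choice = evals.model_graded_custom(
    name="tone_check",                                      # name of the evaluation
    prompt="Is the following reply polite?\n\n{reply}",     # prompt passed to the model
    choices={"Polite": 1.0, "Neutral": 0.5, "Rude": 0.0},   # choices and their scores
    model="gpt-4-0613",                                     # OpenAI model used for grading
    metadata={"feature": "support-bot"},                    # any metadata useful to you
    variables={"reply": "Thanks for reaching out!"},        # substituted into the prompt
)
print(choice)  # the choice given by the model, e.g. "Polite"
```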
Checks a submitted answer against an expert answer for factual consistency, using gpt-4-0613.
Returns one of:
- “A”: The submitted answer is a subset of the expert answer and fully consistent with it.
- “B”: The submitted answer is a superset of the expert answer and fully consistent with it.
- “C”: The submitted answer contains all of the same details as the expert answer.
- “D”: There is a disagreement between the submitted answer and the expert answer.
- “E”: The answers differ, but these differences don’t matter from the perspective of factuality.

Parameters:
- Name of the evaluation.
- The question.
- The expert answer.
- The submitted answer.
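A hypothetical call sketch, with the function name and keyword arguments assumed from the parameter descriptions above:

```python
# Hypothetical sketch; the function name and keyword arguments are assumed
# from the parameter descriptions above, not confirmed SDK identifiers.
from baserun import evals

grade = evals.model_graded_fact(
    name="capital_fact_check",
    question="What is the capital of France?",      # the question
    expert="Paris is the capital of France.",       # the expert answer
    submission="The capital of France is Paris.",   # the submitted answer
)
print(grade)  # one of "A" through "E", as described above
```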
Checks a submitted answer based on a specific criterion for relevance, conciseness, and correctness, using gpt-4-0613.
Returns “Yes” if the submission meets the criterion, “No” if it does not, and “Unsure” if it cannot be determined.

Parameters:
- Name of the evaluation.
- The task.
- The submitted answer.
- The criterion.
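A hypothetical call sketch, again with assumed identifiers:

```python
# Hypothetical sketch; identifiers assumed from the parameter descriptions above.
from baserun import evals

verdict = evals.model_graded_closedqa(
    name="answer_relevance",
    task="Summarize the customer's billing question in one sentence.",            # the task
    submission="The customer is asking why they were charged twice in May.",      # the submitted answer
    criterion="The summary is concise and addresses the billing issue.",          # the criterion
)
print(verdict)  # "Yes", "No", or "Unsure"
```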