Skip to main content
Automatic evaluators can be rule-based or graded by large language models (LLMs), and they can programmatically run on LLM input or output. Baserun offers a number of pre-built automatic evaluators (see below), as well as the ability to perform custom evals with your own prompt or your own function.

Complete Reference

  • Python
  • Typescript
Checks if the submission starts with any of the expected values.Returns true if the submission starts with any of the expected values, otherwise false.
name
str
Name of the evaluation.
submission
str
The input string.
expected
Union[str, List[str]]
A string or a list of strings to check against.
Checks if the submission contains any of the expected values within it.Returns true if the submission includes any of the expected values, otherwise false.
name
str
Name of the evaluation.
submission
str
The input string.
expected
Union[str, List[str]]
A string or a list of strings to check against.
Checks if the submission contains any of the expected values or if any of the expected values contain the submission.Returns true if there’s a fuzzy match, otherwise false.
name
str
Name of the evaluation.
submission
str
The input string.
expected
Union[str, List[str]]
A string or a list of strings to check against.
Checks if the submission does not start with any of the expected values.Returns true if the submission does not start with any of the expected values, otherwise false.
name
str
Name of the evaluation.
submission
str
The input string.
expected
Union[str, List[str]]
A string or a list of strings to check against.
Checks if the submission does not contain any of the expected values.Returns true if the submission does not include any of the expected values, otherwise false.
name
str
Name of the evaluation.
submission
str
The input string.
expected
Union[str, List[str]]
A string or a list of strings to check against.
Checks if the submission neither contains any of the expected values nor is contained by any of the expected values.Returns true if there’s no fuzzy match, otherwise false.
name
str
Name of the evaluation.
submission
str
The input string.
expected
Union[str, List[str]]
A string or a list of strings to check against.
Checks if the submission is a valid JSON string.Returns true if the submission is a valid JSON string, otherwise false.
name
str
Name of the evaluation.
submission
str
The input string.
Checks if the submission includes injection.Returns true if the submission includes injection, otherwise false.
name
str
Name of the evaluation.
submission
str
The input string.
Checks the submission using a custom function.Returns the result of the custom function.
name
str
Name of the evaluation.
submission
str
The input string.
fn
Callable[[str], bool]
A custom function that takes the submission and returns a boolean.
Checks the submission using an asynchronous custom function.Returns the result of the custom function.
name
str
Name of the evaluation.
submission
str
The input string.
fn
Callable[[str], Awaitable[bool]]
A custom function that takes the submission and returns a promise that resolves to a boolean.
Evaluates the model’s response based on a prompt and a set of choices.Returns the choice given by the model.Example:
result = Baserun.evals.model_graded_custom(
    name="Truthiness",
    prompt="How true is this statement? {statement}.",
    choices={"True": 1, "Somewhat true": 0.5, "Not true": 0},
    statement=statement,
)
name
str
Name of the evaluation.
prompt
str
The prompt passed to the model.
choices
dict[str, float]
A dictionary of choices and their scores.
model
str
default:"gpt-4-0125-preview"
OpenAI model that you want to use for the evaluation.
metadata
Optional[dict[str, Any]]
default:"None"
Any metadata that might be useful for you.
variables
dict[str, str]
Variables that will be substituted in the formatted prompt.
Checks a submitted answer against an expert answer for factual consistency using gpt-4-0613.Returns one of:
  • “A”: The output is a subset of the expert answer and fully consistent with it.
  • “B”: The output is a superset of the expert answer and fully consistent with it.
  • “C”: The submitted answer contains all of the same details as the expert answer.
  • “D”: There is disagreement between the submitted answer and the expert answer.
  • “E”: The answers differ, but these differences don’t matter from the perspective of factuality.
name
str
Name of the evaluation.
question
str
The question.
expert
str
The expert answer.
submission
str
The submitted answer.
Checks a submitted answer based on a specific criterion for relevance, conciseness, and correctness using gpt-4-0613.Returns “Yes” if the submission meets the criteria, “No” if it does not, and “Unsure” if it cannot be determined.
name
str
Name of the evaluation.
task
str
The task.
submission
str
The submitted answer.
criterion
str
The criterion.
Checks the submitted string for potential malicious content using gpt-4-0613.Returns “Yes” if the submission is malicious, “No” if it is not malicious, and “Unsure” if it cannot be determined.
name
str
Name of the evaluation.
submission
str
The input string.
I