Automatic Evaluators
Automatic evaluators can be rule-based or graded by large language models (LLMs), and they can programmatically run on LLM input or output. Baserun offers a number of pre-built automatic evaluators (see below), as well as the ability to perform custom evals with your own prompt or your own function.
Complete Reference
Checks if the submission contains any of the expected values, or if any of the expected values contain the submission. Returns true if there’s a fuzzy match, otherwise false.

Parameters:
- Name of the evaluation.
- The input string.
- A string or a list of strings to check against.
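To make the rule concrete, here is a minimal sketch of the containment check described above. The helper name and signature are ours for illustration only, not Baserun’s SDK API.

```python
from typing import List, Union

# Illustrative sketch of the fuzzy-match rule described above;
# the function name and signature are ours, not Baserun's SDK API.
def fuzzy_match(submission: str, expected: Union[str, List[str]]) -> bool:
    values = [expected] if isinstance(expected, str) else expected
    return any(value in submission or submission in value for value in values)

print(fuzzy_match("hello world", ["hello"]))    # True: the submission contains "hello"
print(fuzzy_match("hi", ["hi there, friend"]))  # True: an expected value contains the submission
print(fuzzy_match("goodbye", ["hello", "hi"]))  # False: no fuzzy match
```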
Checks if the submission does not start with any of the expected values. Returns true if the submission does not start with any of the expected values, otherwise false.

Parameters:
- Name of the evaluation.
- The input string.
- A string or a list of strings to check against.
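A minimal sketch of this check, assuming it is a plain prefix test; the helper name is ours for illustration.

```python
from typing import List, Union

# Illustrative sketch only; the name and signature are not Baserun's API.
def does_not_start_with(submission: str, expected: Union[str, List[str]]) -> bool:
    values = [expected] if isinstance(expected, str) else expected
    return not any(submission.startswith(value) for value in values)

print(does_not_start_with("As an AI model, I cannot...", ["As an AI", "I'm sorry"]))  # False
print(does_not_start_with("Sure, here is the answer.", ["As an AI", "I'm sorry"]))    # True
```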
Checks if the submission does not contain any of the expected values. Returns true if the submission does not include any of the expected values, otherwise false.

Parameters:
- Name of the evaluation.
- The input string.
- A string or a list of strings to check against.
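The same pattern with a substring test instead of a prefix test (again an illustrative sketch, not the SDK implementation):

```python
from typing import List, Union

# Illustrative sketch only; not Baserun's implementation.
def does_not_include(submission: str, expected: Union[str, List[str]]) -> bool:
    values = [expected] if isinstance(expected, str) else expected
    return not any(value in submission for value in values)

print(does_not_include("The weather is sunny today.", ["rain", "snow"]))  # True
print(does_not_include("Expect heavy rain tonight.", ["rain", "snow"]))   # False
```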
Checks if the submission neither contains any of the expected values nor is contained by any of the expected values. Returns true if there’s no fuzzy match, otherwise false.

Parameters:
- Name of the evaluation.
- The input string.
- A string or a list of strings to check against.
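This is simply the negation of the fuzzy-match rule sketched earlier; illustratively:

```python
from typing import List, Union

# Negation of the fuzzy-match sketch shown earlier (illustrative only).
def no_fuzzy_match(submission: str, expected: Union[str, List[str]]) -> bool:
    values = [expected] if isinstance(expected, str) else expected
    return not any(value in submission or submission in value for value in values)

print(no_fuzzy_match("goodbye", ["hello", "hi"]))  # True: no fuzzy match
```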
Evaluates the model’s response based on a prompt and a set of choices. Returns the choice given by the model.

Parameters:
- Name of the evaluation.
- The prompt passed to the model.
- A dictionary of choices and their scores.
- The OpenAI model that you want to use for the evaluation.
- Any metadata that might be useful for you.
- Variables that will be substituted in the formatted prompt.
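A hypothetical usage sketch follows. The import path, function name, and keyword arguments are assumptions drawn from the parameter descriptions above, not confirmed SDK identifiers; consult the SDK reference for the exact API.

```python
# Hypothetical sketch: every identifier below (evals.model_graded_custom and
# its keyword arguments) is assumed from the parameter list above, not taken
# from the actual SDK reference.
from baserun import evals

choice = evals.model_graded_custom(
    name="tone_check",                                      # name of the evaluation
    prompt="Is the following reply polite?\n\n{reply}",     # prompt passed to the model
    choices={"Polite": 1.0, "Neutral": 0.5, "Rude": 0.0},   # choices and their scores
    model="gpt-4-0613",                                     # OpenAI model used for grading
    metadata={"feature": "support-bot"},                    # any metadata useful to you
    variables={"reply": "Thanks for reaching out!"},        # substituted into the prompt
)
print(choice)  # the choice given by the model, e.g. "Polite"
```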
Checks a submitted answer against an expert answer for factual consistency, using gpt-4-0613.
Returns one of:
- “A”: The submitted answer is a subset of the expert answer and fully consistent with it.
- “B”: The submitted answer is a superset of the expert answer and fully consistent with it.
- “C”: The submitted answer contains all of the same details as the expert answer.
- “D”: There is a disagreement between the submitted answer and the expert answer.
- “E”: The answers differ, but these differences don’t matter from the perspective of factuality.

Parameters:
- Name of the evaluation.
- The question.
- The expert answer.
- The submitted answer.
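A hypothetical call sketch, with the function name and keyword arguments assumed from the parameter descriptions above:

```python
# Hypothetical sketch; the function name and keyword arguments are assumed
# from the parameter descriptions above, not confirmed SDK identifiers.
from baserun import evals

grade = evals.model_graded_fact(
    name="capital_fact_check",
    question="What is the capital of France?",      # the question
    expert="Paris is the capital of France.",       # the expert answer
    submission="The capital of France is Paris.",   # the submitted answer
)
print(grade)  # one of "A" through "E", as described above
```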
Checks a submitted answer based on a specific criterion for relevance, conciseness, and correctness, using gpt-4-0613.
Returns “Yes” if the submission meets the criterion, “No” if it does not, and “Unsure” if it cannot be determined.

Parameters:
- Name of the evaluation.
- The task.
- The submitted answer.
- The criterion.
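A hypothetical call sketch, again with assumed identifiers:

```python
# Hypothetical sketch; identifiers assumed from the parameter descriptions above.
from baserun import evals

verdict = evals.model_graded_closedqa(
    name="answer_relevance",
    task="Summarize the customer's billing question in one sentence.",            # the task
    submission="The customer is asking why they were charged twice in May.",      # the submitted answer
    criterion="The summary is concise and addresses the billing issue.",          # the criterion
)
print(verdict)  # "Yes", "No", or "Unsure"
```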