Offline evaluation run report

Introduction

We leverage existing testing frameworks like pytest and Jest so you don’t have to learn a new tool and can directly integrate your existing testing logic.

Features

  • pytest, jest, vitest integration
  • supports end-to-end testing

Instructions

1. Install the Baserun SDK

pip install baserun
2. Generate an API key

Create an account at https://app.baserun.ai/sign-up. Then generate an API key for your project in the settings tab. Set it as an environment variable:

export BASERUN_API_KEY="your_api_key_here"

Or, if using Python, set baserun.api_key to its value:

baserun.api_key = "br-..."
3. Define the test

Use our pytest plugin to start testing with Baserun immediately. By default, all LLM calls are automatically logged.
In TS/JS you don’t need a plugin; just add await baserun.init() to the beforeAll hook. In TS/JS we support both Jest and Vitest.

# test_module.py

import openai

def test_paris_trip():
    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0.7,
        messages=[
            {
                "role": "user",
                "content": "What are three activities to do in Paris?"
            }
        ],
    )
    output = response.choices[0].message.content
    return output

4. Add evaluators

Baserun offers a number of pre-built evaluators, as well as the ability to perform custom evaluations with your own prompt or function.
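Since a custom evaluation can be any function of the model output, one option is a plain-Python predicate you assert on directly. The sketch below is illustrative only and does not use Baserun's evaluator API (the function name and heuristic are our own, not from Baserun):

```python
def mentions_n_activities(output: str, n: int = 3) -> bool:
    """Rough custom check: does the response appear to list at least n items?

    Illustrative heuristic only; adapt to your own output format.
    """
    # Count non-empty lines, or look for numbered list markers like "1.", "2.", ...
    lines = [line for line in output.splitlines() if line.strip()]
    return len(lines) >= n or any(f"{i}." in output for i in range(1, n + 1))
```

You could call this inside a test and `assert` its result, alongside any pre-built evaluators.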

To add an evaluator, give it a name, use baserun.evals.evaluator_name to select which evaluator to run, and pass the input variables that evaluator expects.

Each evaluator has its own set of expected input variables, so be sure to check the automatic evaluator documentation for the specific evaluator you are using.

The following example uses the includes evaluator.

# test_module.py

import baserun
import openai

def test_paris_trip():
    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0.7,
        messages=[
            {
                "role": "user",
                "content": "What are three activities to do in Paris?"
            }
        ],
    )
    output = response.choices[0].message.content
    includes_eiffel_tower = baserun.evals.includes("includes_eiffel_tower", output, ["Eiffel Tower"])

    # Optional: the test will fail if the eval fails
    assert includes_eiffel_tower
5. Run the test

pytest --baserun test_module.py
...
========================Baserun========================
Test results available at: https://app.baserun.ai/runs/<id>
=======================================================

Running multiple tests

It’s often helpful to exercise the same test over multiple examples. To do this:

In Python, we suggest using the parametrize decorator; for larger numbers of examples you can read from a file or other data source.

In TS/JS, you can use a simple for loop to autogenerate your tests; likewise, for larger numbers of examples you can read from a file or other data source.

See the Testing section for more information.


import pytest

import baserun
import openai

@pytest.mark.parametrize("place,expected", [("Paris", "Eiffel Tower"), ("Rome", "Colosseum")])
def test_city_trip(place, expected):
    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0.7,
        messages=[
            {
                "role": "user",
                "content": f"What are three activities to do in {place}?"
            }
        ],
    )

    output = response.choices[0].message.content
    includes_landmark = baserun.evals.includes("includes_landmark", output, [expected])

    # Optional: the test will fail if the eval fails
    assert includes_landmark
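For larger example sets, one way to feed the parametrize decorator is a small loader that reads cases from a file. This is a sketch under our own assumptions: the file name examples.json and its shape (a list of objects with "place" and "expected" keys) are hypothetical, not part of Baserun:

```python
import json

def load_cases(path):
    """Read (place, expected) pairs from a JSON file shaped like
    [{"place": ..., "expected": ...}, ...]."""
    with open(path) as f:
        return [(case["place"], case["expected"]) for case in json.load(f)]

# Usage (hypothetical file name):
# @pytest.mark.parametrize("place,expected", load_cases("examples.json"))
# def test_city_trip(place, expected):
#     ...
```

The same pattern works for CSV or any other source, as long as the loader returns a list of argument tuples.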

Click on the link to open the offline evaluation run report in Baserun.


Demo projects

For your reference, we have example apps using LangChain to implement an agent: