One major challenge in building products that automate complex, nuanced tasks is that defining “good” and “bad” performance becomes a major bottleneck. You may be able to define a few clear examples of “bad” behavior (e.g. AVs should try to avoid collisions, customer service chatbots should not reference stale information, medical chatbots should not misdiagnose patients), but most of the interactions users have with the system will be far more subjective.

Human evaluators enable AI teams to define their own evaluation criteria and collect feedback from human annotators.

A human evaluator returns one of three result types (sketched in code after this list):

  • Boolean: a true-or-false result.
  • Number: a numeric score on a 1-5 scale.
  • Enum: one of several predefined options.
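
For intuition, the three result types might map onto a tagged union like the following minimal sketch; the type names and shapes are illustrative assumptions, not the platform's actual schema.

```typescript
// Hypothetical tagged union for human-evaluation results; names and
// shapes here are assumptions for illustration only.
type BooleanResult = { kind: "boolean"; value: boolean };          // true or false
type NumberResult  = { kind: "number"; value: 1 | 2 | 3 | 4 | 5 }; // 1-5 scale
type EnumResult    = { kind: "enum"; value: string };              // one of several options

type HumanEvaluationResult = BooleanResult | NumberResult | EnumResult;

// Example: a reviewer rates an answer 4 out of 5.
const rating: HumanEvaluationResult = { kind: "number", value: 4 };
```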

Create a new evaluator

1. Navigate to the Evaluators tab and click on the 'New evaluator' button. You will be prompted to select an evaluator type; select the Human Evaluator.

2. Name the evaluator and select its result type (Boolean, Number, or Enum).

3. Click on the 'Create' button.
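
To make these steps concrete, the evaluator you just created could be thought of as a definition like the sketch below; the interface and field names are hypothetical, chosen only to mirror the creation form.

```typescript
// Hypothetical evaluator definition mirroring the creation form;
// these fields are assumptions, not the product's API.
interface HumanEvaluator {
  name: string;                              // chosen in step 2
  resultType: "boolean" | "number" | "enum"; // chosen in step 2
  options?: string[];                        // only needed for enum evaluators
}

const toneEvaluator: HumanEvaluator = {
  name: "response-tone",
  resultType: "enum",
  options: ["friendly", "neutral", "unhelpful"],
};
```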

4. Apply the evaluator. Navigate to a trace or an LLM request and click on the “Evaluate” button at the top right. Fill in the form and click on the “Submit” button.
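
Behind the scenes, submitting the form attaches your result to the selected trace or request. A minimal sketch of what such a submission might look like follows, assuming a REST endpoint exists; the URL, payload shape, and evaluator name are all hypothetical.

```typescript
// Minimal sketch of what the "Evaluate" form might send, assuming a
// REST endpoint; URL, payload, and evaluator name are hypothetical.
async function submitEvaluation(
  traceId: string,
  evaluator: string,
  value: boolean | number | string, // Boolean result, 1-5 score, or enum option
): Promise<void> {
  const res = await fetch(`https://api.example.com/v1/traces/${traceId}/evaluations`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ evaluator, value }),
  });
  if (!res.ok) throw new Error(`Failed to submit evaluation: ${res.status}`);
}

// e.g. await submitEvaluation("trace_123", "response-tone", "friendly");
```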

View evaluation results

You can now see the results in the monitoring tabs and aggregate them by evaluation result.
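
Aggregating by evaluation result amounts to counting evaluations grouped by their value, as in the sketch below; the record shape is an assumption for illustration.

```typescript
// Sketch of the aggregation the monitoring view performs: counting
// evaluations grouped by their result. The record shape is an assumption.
interface EvaluationRecord {
  traceId: string;
  evaluator: string;
  value: string; // e.g. "friendly" | "neutral" | "unhelpful"
}

function aggregateByResult(records: EvaluationRecord[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const { value } of records) {
    counts.set(value, (counts.get(value) ?? 0) + 1);
  }
  return counts;
}

// aggregateByResult(records) -> Map { "friendly" => 12, "neutral" => 5, ... }
```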