Unlike traditional software, which behaves deterministically, LLM applications produce non-deterministic outputs and therefore require different evaluation methods and tools.
Testing and evaluation help identify issues, quantify app performance, provide insights, and set benchmarks that let AI teams continuously improve their LLM features.
Automatic evaluation and human evaluation are the two primary approaches:
Automatic evaluation: Automatic evaluation involves creating structured testing datasets that contain predefined input values and their expected outputs. Using tools like the Baserun SDK, you can programmatically compare the LLM’s outputs against these expected results, either by executing specific checking functions or by using AI to grade the outputs. For instance, in a customer service scenario, the testing dataset might include various customer queries and the ideal responses. Automatic evaluation can be rule-based or model-graded.
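As a rough illustration, the sketch below shows what the two styles can look like in plain Python. The function names, test case, and grading prompt are hypothetical and do not reflect the Baserun SDK API; the model-graded check simply asks another LLM for a YES/NO verdict.

```python
# Illustrative sketch of rule-based and model-graded evaluation.
# Names and prompts are assumptions, not Baserun SDK calls.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

test_cases = [
    {"input": "How do I reset my password?",
     "expected": "Go to Settings > Account > Reset Password."},
]

def rule_based_eval(output: str, expected: str) -> bool:
    # Rule-based check: pass if the output mentions the key phrase.
    return "reset" in output.lower() and "password" in output.lower()

def model_graded_eval(output: str, expected: str) -> bool:
    # Model-graded check: ask another LLM whether the output matches the expectation.
    grading_prompt = (
        "Does the following answer convey the same instructions as the expected "
        f"answer? Reply YES or NO.\n\nAnswer: {output}\n\nExpected: {expected}"
    )
    grade = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": grading_prompt}],
    )
    return grade.choices[0].message.content.strip().upper().startswith("YES")

for case in test_cases:
    # In a real test run, `output` would come from the LLM feature under test.
    output = "You can reset your password under Settings > Account."
    print(rule_based_eval(output, case["expected"]),
          model_graded_eval(output, case["expected"]))
```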
Human evaluation: In situations where creativity and nuanced understanding are crucial, such as drafting a marketing email, manual review becomes essential. Here, a human evaluator would read and assess the content for its creativity, tone, and alignment with brand values. Manual review serves as a vital initial step to understand which aspects of the LLM’s outputs require closer programmatic examination.
Human evaluation can be done by having internal users grade the outputs (annotation) or by collecting feedback from end-users (feedback).
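For end-user feedback, a minimal pattern is to attach a thumbs-up/thumbs-down signal to each logged completion. The sketch below is purely illustrative; the field names and log format are assumptions, not any specific tool's schema.

```python
# Illustrative feedback record attached to a logged LLM completion.
import json
import time

def record_feedback(completion_id: str, score: int, comment: str = "") -> None:
    """Append a feedback event (+1 thumbs up, -1 thumbs down) to a local JSONL log."""
    event = {
        "completion_id": completion_id,
        "score": score,          # +1 or -1
        "comment": comment,
        "timestamp": time.time(),
    }
    with open("feedback_log.jsonl", "a") as f:
        f.write(json.dumps(event) + "\n")

record_feedback("cmpl-123", score=1, comment="Accurate and on-brand.")
```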
The choice between automatic evaluation and manual review depends on the specific use case. In some scenarios, a combination of both methods is most effective, creating a comprehensive diagnostic workflow. For example, in content creation, automatic checks can assess basic grammar and relevance, while manual reviews fine-tune the content for style and engagement.
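One way to wire the two together, sketched below with made-up placeholder checks, is to run cheap automatic checks first and pass only the drafts that survive them to human reviewers for style and engagement.

```python
# Sketch of a combined workflow: automatic checks gate which drafts
# reach human reviewers. The check functions are crude placeholders.

def passes_basic_checks(draft: str, topic: str) -> bool:
    long_enough = len(draft.split()) >= 50       # stand-in for a grammar/quality check
    on_topic = topic.lower() in draft.lower()    # stand-in for a relevance check
    return long_enough and on_topic

def route_drafts(drafts: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split drafts into those needing regeneration and those ready for human review."""
    regenerate, human_review = [], []
    for draft in drafts:
        if passes_basic_checks(draft["text"], draft["topic"]):
            human_review.append(draft)   # reviewers fine-tune tone and engagement
        else:
            regenerate.append(draft)     # fails the basics; send back to the model
    return regenerate, human_review
```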
Many terms are currently used for evaluating LLMs. To simplify, we will categorize them by development stage: