One major challenge in building products that automate more complex, nuanced tasks is that defining "good" and "bad" performance becomes a major bottleneck. While you may be able to define a few examples of clearly "bad" behavior (e.g., AVs should try to avoid collisions, customer service chatbots should not reference stale information, medical chatbots should not misdiagnose patients), most of the interactions users have with the system will be far more subjective.

Human evaluation enables AI teams to define their own evaluation criteria and collect feedback from human annotators. There are three types of human evaluators: