Model Evaluation on Amazon Bedrock allows you to evaluate, compare, and select the best foundation models for your use case. Amazon Bedrock offers a choice of using an LLM-as-a-judge, programmatic evaluation, and human evaluation. You can use an LLM-as-a-judge for metrics such as correctness, completeness, and coherence, as well as responsible AI metrics such as answer refusal and harmfulness. Programmatic evaluation offers algorithms for metrics such as accuracy, robustness, and toxicity. Additionally, for subjective or custom metrics, such as friendliness or style, you can set up a human evaluation workflow with a few clicks. Human evaluation leverages your own employees or an AWS-managed team as reviewers. Model evaluation provides built-in curated datasets, or you can bring your own datasets. Now, customers can evaluate models in the Europe (Zurich) AWS Region.
Model Evaluation on Amazon Bedrock is now available in these regions, and evaluation type availability varies by region.
To learn more about Model Evaluation on Amazon Bedrock, see the Amazon Bedrock Evaluations page. To get started, sign in to Amazon Bedrock on the AWS Management Console or use the Amazon Bedrock APIs.
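As a rough illustration of the API route, the sketch below assembles a request for a programmatic (automated) evaluation job of the kind the Bedrock `CreateEvaluationJob` operation accepts. This is a minimal sketch, not a definitive implementation: the role ARN, S3 bucket, model identifier, and the built-in dataset and metric names shown here are placeholder assumptions; check the Amazon Bedrock API reference for the exact parameter shapes and the datasets available in your region.

```python
def build_evaluation_job_request(job_name, role_arn, model_id, output_s3_uri):
    """Assemble a request dict for an automated (programmatic) model
    evaluation job, roughly in the shape expected by the Bedrock
    CreateEvaluationJob API. Dataset and metric names are placeholders."""
    return {
        "jobName": job_name,
        "roleArn": role_arn,  # IAM role Bedrock assumes to run the job
        "evaluationConfig": {
            "automated": {
                "datasetMetricConfigs": [
                    {
                        # Built-in curated dataset and metric names are
                        # assumptions here; list the real ones via the API
                        # reference or console before submitting.
                        "taskType": "Summarization",
                        "dataset": {"name": "Builtin.Bold"},
                        "metricNames": ["Builtin.Toxicity"],
                    }
                ]
            }
        },
        "inferenceConfig": {
            "models": [{"bedrockModel": {"modelIdentifier": model_id}}]
        },
        "outputDataConfig": {"s3Uri": output_s3_uri},
    }


request = build_evaluation_job_request(
    job_name="my-eval-job",
    role_arn="arn:aws:iam::123456789012:role/BedrockEvalRole",  # placeholder
    model_id="anthropic.claude-3-haiku-20240307-v1:0",  # placeholder model
    output_s3_uri="s3://my-bucket/eval-results/",  # placeholder bucket
)

# Submitting the job would then look roughly like this (requires AWS
# credentials and the boto3 SDK; eu-central-2 is the Zurich Region):
#
#   import boto3
#   bedrock = boto3.client("bedrock", region_name="eu-central-2")
#   bedrock.create_evaluation_job(**request)
```

The request is built separately from the client call so the job configuration can be reviewed or validated before anything is submitted to AWS.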