Today, we are announcing RAG evaluation support in Amazon Bedrock Knowledge Bases. This capability allows you to evaluate retrieval-augmented generation (RAG) applications built on Amazon Bedrock Knowledge Bases. You can evaluate either information retrieval alone or retrieval plus content generation. Evaluations are powered by LLM-as-a-Judge technology, and you can choose from several judge models. For retrieval evaluation, you can select from metrics such as context relevance and coverage. For retrieval plus generation evaluation, you can select from quality metrics such as correctness, completeness, and faithfulness (hallucination detection), as well as responsible AI metrics such as harmfulness, answer refusal, and stereotyping. You can also compare results across evaluation jobs to see how Knowledge Bases with different settings, such as chunking strategy or vector length, or different content-generating models perform.
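As a rough illustration of comparing two evaluation jobs, the sketch below averages each metric's scores per job and reports the difference. It assumes you have already exported per-prompt scores into a simple list of records; the record shape (`metric`, `score`) and the function names `mean_scores` and `compare_jobs` are hypothetical and are not the output format or API of Amazon Bedrock.

```python
from collections import defaultdict
from statistics import mean


def mean_scores(records):
    """Average each metric's scores over all records from one evaluation job.

    `records` is a hypothetical shape: a list of dicts such as
    {"metric": "correctness", "score": 0.5}. It is an assumption made for
    this sketch, not the format Amazon Bedrock writes.
    """
    by_metric = defaultdict(list)
    for record in records:
        by_metric[record["metric"]].append(record["score"])
    return {metric: mean(scores) for metric, scores in by_metric.items()}


def compare_jobs(job_a, job_b):
    """Return {metric: mean_b - mean_a} for metrics present in both jobs."""
    a, b = mean_scores(job_a), mean_scores(job_b)
    # Only metrics evaluated in both jobs can be compared directly.
    return {metric: b[metric] - a[metric] for metric in sorted(a.keys() & b.keys())}
```

A positive difference means the second job (for example, a Knowledge Base with a different chunking strategy) scored higher on that metric on average. Metrics that appear in only one job are skipped rather than treated as zero.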
Evaluating RAG applications can be difficult, because many components in retrieval and generation need to be optimized. The RAG evaluation tool in Amazon Bedrock Knowledge Bases now lets you evaluate your Knowledge Base-powered applications conveniently and quickly, where your data and LLMs already live. Additionally, you can incorporate Amazon Bedrock Guardrails directly into your evaluation for more thorough testing. Using these RAG evaluation tools on Amazon Bedrock can save cost as well as weeks of time compared to a full offline human-based evaluation, so you can improve your application faster and more easily.
To learn more, including region availability, read the AWS News blog and visit the Amazon Bedrock Evaluations page. To get started, log in to the Amazon Bedrock Console or use the Amazon Bedrock APIs.