Evaluators
An evaluator is an LLM agent that evaluates a Target on a test. Evaluators use foundation models directly on Amazon Bedrock. They do not use the Agents for Amazon Bedrock functionality.
Evaluation workflow
The diagram below shows the workflow an evaluator follows when it runs a test.
graph TD
classDef nodeText font-size:10pt;
A((Start)) --> B{Initial<br>prompt?}
B -->|yes| C(Invoke agent)
B -->|no| D(Generate initial prompt)
D --> C
C --> E(Get test status)
E --> F{All steps<br>attempted?}
F --> |yes| G(Evaluate conversation)
F --> |no| H{Max turns<br>reached?}
H --> |yes| I(Fail)
H --> |no| J(Generate user response)
J --> C
G --> K{All expected<br>results<br>observed?}
K --> |yes| L(Pass)
K --> |no| I(Fail)
I --> M((End))
L --> M
class A,B,C,D,E,F,G,H,I,J,K,L,M nodeText;
style I stroke:#f00
style L stroke:#0f0
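The workflow in the diagram can be read as a simple loop. The following is a minimal sketch of that loop, not the project's actual implementation: every callable name (`invoke_agent`, `generate_initial_prompt`, `all_steps_attempted`, `generate_user_response`, `all_expected_results_observed`) is hypothetical and stands in for one box in the diagram.

```python
from typing import Callable, List, Optional, Tuple

Conversation = List[Tuple[str, str]]  # (user prompt, agent response) pairs


def run_evaluation(
    invoke_agent: Callable[[str], str],
    generate_initial_prompt: Callable[[], str],
    all_steps_attempted: Callable[[Conversation], bool],
    generate_user_response: Callable[[Conversation], str],
    all_expected_results_observed: Callable[[Conversation], bool],
    max_turns: int,
    initial_prompt: Optional[str] = None,
) -> bool:
    """Return True for Pass and False for Fail (hypothetical sketch)."""
    conversation: Conversation = []
    # "Initial prompt?" branch: use the given prompt or generate one.
    if initial_prompt is not None:
        prompt = initial_prompt
    else:
        prompt = generate_initial_prompt()

    turns = 0
    while True:
        # "Invoke agent" followed by "Get test status".
        response = invoke_agent(prompt)
        conversation.append((prompt, response))
        turns += 1

        # "All steps attempted?" leads to "Evaluate conversation".
        if all_steps_attempted(conversation):
            return all_expected_results_observed(conversation)

        # "Max turns reached?" leads to "Fail".
        if turns >= max_turns:
            return False

        # Otherwise "Generate user response" and loop back to the agent.
        prompt = generate_user_response(conversation)
```

In this sketch the two outcomes of the diagram map to the boolean return value, and the only way to pass is to attempt all steps before the turn limit and then observe every expected result.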
Evaluator costs
By default, evaluators use the InvokeModel API in On-Demand mode, which incurs AWS charges based on the input tokens processed and output tokens generated. For the latest pricing details, see the Amazon Bedrock pricing page.
The cost of running an evaluator for a single test is influenced by the following:
- The number and length of the steps.
- The number and length of expected results.
- The length of the target agent's responses.
You can view the total number of input tokens processed and output tokens generated by the evaluator by using the --verbose flag when you perform a run (agenteval run --verbose).
Note
If you have purchased Provisioned Throughput model units for the model used to run evaluation, you can specify this resource using the provisioned_throughput_arn configuration.
Example
Let's use this Amazon Bedrock agent as a target we want to test.
For the following test case:
tests:
  retrieve_missing_documents:
    steps:
      - Ask agent for a list of missing documents for claim-006.
    expected_results:
      - The agent returns a list of missing documents.
We find that on average, the evaluator processes ~583 input tokens and generates ~290 output tokens.
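To turn token counts like these into a rough On-Demand cost, multiply each count by the corresponding per-token price. The sketch below assumes prices are quoted per 1,000 tokens. The function name `estimate_cost` is hypothetical, and the two prices in the usage lines are placeholders, not actual Amazon Bedrock prices, so substitute the current values from the pricing page.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_1k: float,
    output_price_per_1k: float,
) -> float:
    """Estimate the cost of one test run from its token counts."""
    input_cost = input_tokens / 1000 * input_price_per_1k
    output_cost = output_tokens / 1000 * output_price_per_1k
    return input_cost + output_cost


# Placeholder prices (NOT real pricing); token counts are the averages above.
cost = estimate_cost(583, 290, input_price_per_1k=0.01, output_price_per_1k=0.02)
print(f"Estimated cost per test run: ${cost:.5f}")
```

Because output tokens are often priced higher than input tokens, long expected results and long agent responses can affect the total differently, which is why the two counts are kept separate.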
Prerequisites
The principal must have permission to call InvokeModel on the model specified in the configuration.
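As an illustration, the sketch below builds an IAM policy document that grants the `bedrock:InvokeModel` action on a single foundation model. The helper name `build_invoke_model_policy` is hypothetical, and the model ID shown for Claude 3 Sonnet is an assumption about which Bedrock model the claude-3 setting refers to. If you use Provisioned Throughput, the resource would presumably be the provisioned model ARN instead.

```python
import json


def build_invoke_model_policy(region: str, model_id: str) -> dict:
    """Build a least-privilege policy allowing InvokeModel on one model."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "bedrock:InvokeModel",
                # Foundation model ARNs have no account ID segment.
                "Resource": f"arn:aws:bedrock:{region}::foundation-model/{model_id}",
            }
        ],
    }


# Assumed model ID for Claude 3 Sonnet; verify it in the Bedrock console.
policy = build_invoke_model_policy(
    "us-west-2", "anthropic.claude-3-sonnet-20240229-v1:0"
)
print(json.dumps(policy, indent=2))
```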
Configurations
Info
This project uses Boto3's credential resolution chain to determine the AWS credentials to use. Please refer to the Boto3 documentation for more details.
evaluator:
  model: claude-3
  provisioned_throughput_arn: my-throughput-arn
  aws_profile: my-profile
  aws_region: us-west-2
  endpoint_url: my-endpoint-url
  max_retry: 10
model (string)
Name of the model used to run evaluation. This must be one of:
- claude-3 (Claude 3 Sonnet)
provisioned_throughput_arn (string; optional)
The Amazon Resource Name (ARN) of the Provisioned Throughput.
aws_profile (string; optional)
A profile name that is used to create a Boto3 session.
aws_region (string; optional)
The AWS region that is used to create a Boto3 session.
endpoint_url (string; optional)
The endpoint URL for the AWS service which is used to construct the Boto3 client.
max_retry (integer; optional)
Configures the Boto3 client with the maximum number of retry attempts allowed. The default is 10.
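To make the role of each setting concrete, the sketch below shows one way these options could map onto Boto3 calls. It is an assumption about the mapping, not the project's actual code. The helper names and the `MODEL_IDS` table are hypothetical, and the Claude 3 Sonnet model ID is assumed. The Boto3 calls themselves (`boto3.Session`, `Session.client`, and `botocore.config.Config` with a `retries` dictionary) are standard.

```python
from typing import Any, Dict

# Hypothetical mapping from the config's model name to a Bedrock model ID.
MODEL_IDS = {"claude-3": "anthropic.claude-3-sonnet-20240229-v1:0"}


def session_kwargs(cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Arguments for boto3.Session; omitted keys fall back to the default chain."""
    kwargs: Dict[str, Any] = {}
    if cfg.get("aws_profile"):
        kwargs["profile_name"] = cfg["aws_profile"]
    if cfg.get("aws_region"):
        kwargs["region_name"] = cfg["aws_region"]
    return kwargs


def retry_settings(cfg: Dict[str, Any]) -> Dict[str, int]:
    """Retry settings for botocore's Config; 10 is the documented default."""
    return {"max_attempts": cfg.get("max_retry", 10)}


def resolve_model_id(cfg: Dict[str, Any]) -> str:
    """Prefer the Provisioned Throughput ARN when one is configured."""
    return cfg.get("provisioned_throughput_arn") or MODEL_IDS[cfg["model"]]


def create_client(cfg: Dict[str, Any]):
    """Create a Bedrock runtime client (requires boto3 to be installed)."""
    import boto3
    from botocore.config import Config

    session = boto3.Session(**session_kwargs(cfg))
    client_args: Dict[str, Any] = {"config": Config(retries=retry_settings(cfg))}
    if cfg.get("endpoint_url"):
        client_args["endpoint_url"] = cfg["endpoint_url"]
    return session.client("bedrock-runtime", **client_args)
```

Leaving out aws_profile and aws_region in this sketch yields an empty argument set for the session, so Boto3's credential resolution chain decides which credentials and region to use.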