Evaluators

An evaluator is an LLM agent that evaluates a Target on a test. Evaluators invoke foundation models directly on Amazon Bedrock; they do not use the Agents for Amazon Bedrock functionality.

Evaluation workflow

The diagram below shows the workflow an evaluator follows during evaluation.

graph TD
  classDef nodeText font-size:10pt;
  A((Start)) --> B{Initial<br>prompt?}
  B -->|yes| C(Invoke agent)
  B -->|no| D(Generate initial prompt)
  D --> C
  C --> E(Get test status)
  E --> F{All steps<br>attempted?}
  F --> |yes| G(Evaluate conversation)
  F --> |no| H{Max turns<br>reached?}
  H --> |yes| I(Fail)
  H --> |no| J(Generate user response)
  J --> C
  G --> K{All expected<br>results<br>observed?}
  K --> |yes| L(Pass)
  K --> |no| I(Fail)
  I --> M((End))
  L --> M
  class A,B,C,D,E,F,G,H,I,J,K,L,M nodeText;
  style I stroke:#f00
  style L stroke:#0f0
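
The control flow in the diagram can be sketched as a plain loop. This is a minimal illustration, not the project's actual implementation: the function name run_test and every callable passed to it are hypothetical stand-ins for the LLM-backed steps, and max_turns is assumed to be a simple integer limit.

```python
from typing import Callable, List, Optional, Tuple

# A conversation is a list of (user_message, agent_response) pairs.
Conversation = List[Tuple[str, str]]


def run_test(
    steps: List[str],
    expected_results: List[str],
    invoke_agent: Callable[[str], str],
    generate_initial_prompt: Callable[[List[str]], str],
    all_steps_attempted: Callable[[List[str], Conversation], bool],
    generate_user_response: Callable[[List[str], Conversation], str],
    all_expected_results_observed: Callable[[List[str], Conversation], bool],
    max_turns: int,
    initial_prompt: Optional[str] = None,
) -> bool:
    """Return True for Pass and False for Fail (all helpers are hypothetical)."""
    conversation: Conversation = []

    # "Initial prompt?": use the one provided, otherwise generate it.
    if initial_prompt is not None:
        prompt = initial_prompt
    else:
        prompt = generate_initial_prompt(steps)

    while True:
        # "Invoke agent"
        response = invoke_agent(prompt)
        conversation.append((prompt, response))

        # "Get test status" and "All steps attempted?"
        if all_steps_attempted(steps, conversation):
            # "Evaluate conversation" and "All expected results observed?"
            return all_expected_results_observed(expected_results, conversation)

        # "Max turns reached?"
        if len(conversation) >= max_turns:
            return False

        # "Generate user response", then loop back to "Invoke agent"
        prompt = generate_user_response(steps, conversation)
```

Note that a test can fail in two ways: the turn limit is reached before all steps are attempted, or all steps are attempted but an expected result is not observed.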

Evaluator costs

By default, evaluators use the InvokeModel API in On-Demand mode, which incurs AWS charges based on the number of input tokens processed and output tokens generated. See the Amazon Bedrock pricing page for the latest pricing details.

The cost of running an evaluator for a single test is influenced by the following:

The number and length of the steps.
The number and length of expected results.
The length of the target agent's responses.

You can view the total number of input tokens processed and output tokens generated by the evaluator by passing the --verbose flag when you perform a run (agenteval run --verbose).

Note

If you have purchased Provisioned Throughput model units for the model used to run evaluation, you can specify this resource using the provisioned_throughput_arn configuration.

Example

Let's use this Amazon Bedrock agent as a target we want to test.

For the following test case:

agenteval.yml

tests:
  retrieve_missing_documents:
    steps:
    - Ask agent for a list of missing documents for claim-006.
    expected_results:
    - The agent returns a list of missing documents.

We find that on average, the evaluator processes ~583 input tokens and generates ~290 output tokens.
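
To turn these token counts into an approximate charge, multiply each count by the corresponding On-Demand price. The sketch below shows the arithmetic only. The two prices are placeholder values chosen for illustration, not actual Amazon Bedrock prices, and the function name estimate_cost is hypothetical.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    price_per_1k_input: float,
    price_per_1k_output: float,
) -> float:
    """Estimate the charge for one test run, in the currency of the prices."""
    input_cost = (input_tokens / 1000) * price_per_1k_input
    output_cost = (output_tokens / 1000) * price_per_1k_output
    return input_cost + output_cost


# Placeholder prices per 1,000 tokens (assumptions, NOT real Bedrock pricing).
PLACEHOLDER_PRICE_PER_1K_INPUT = 0.001
PLACEHOLDER_PRICE_PER_1K_OUTPUT = 0.002

# Average token counts reported for the example test above.
cost = estimate_cost(
    583, 290, PLACEHOLDER_PRICE_PER_1K_INPUT, PLACEHOLDER_PRICE_PER_1K_OUTPUT
)
print(f"Estimated cost per run: {cost:.6f}")
```

Substitute the current prices for your model and region before relying on the result.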

Prerequisites

The principal must have permission to call InvokeModel (the bedrock:InvokeModel IAM action) on the model specified in the configuration.
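
As an illustration, an IAM policy granting this permission for a single foundation model might look like the one generated below. This is a sketch under assumptions: the helper name invoke_model_policy is hypothetical, the region and model ID are example values, and models reached through an inference profile or Provisioned Throughput are identified by different resource ARNs that this sketch does not cover.

```python
import json


def invoke_model_policy(region: str, model_id: str) -> dict:
    """Build a minimal IAM policy allowing InvokeModel on one foundation model."""
    # Foundation model ARNs have no account ID segment.
    resource = f"arn:aws:bedrock:{region}::foundation-model/{model_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "bedrock:InvokeModel",
                "Resource": resource,
            }
        ],
    }


# Example values only; use the region and model from your own configuration.
policy = invoke_model_policy(
    "us-west-2", "anthropic.claude-3-sonnet-20240229-v1:0"
)
print(json.dumps(policy, indent=2))
```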

Configurations

Info

This project uses Boto3's credential resolution chain to determine the AWS credentials to use. Please refer to the Boto3 documentation for more details.

agenteval.yml

evaluator:
  model: claude-3
  provisioned_throughput_arn: my-throughput-arn
  aws_profile: my-profile
  aws_region: us-west-2
  endpoint_url: my-endpoint-url
  max_retry: 10

model (string)

Name of the model used to run evaluation. This must be one of:

claude-3 (Claude 3 Sonnet)
claude-3_5 (Claude 3.5 Sonnet)
claude-3_7-us (Claude 3.7 Sonnet)
claude-haiku-3_5-us (Claude 3.5 Haiku)
llama-3_3-us (Llama 3.3 70B)

Models suffixed with `-us` use the default US cross-region inference profile. See the Amazon Bedrock documentation on cross-region inference for details.

custom-config (dict; optional)

A dictionary with the keys model_id and request_body, specifying which foundation model to invoke on Bedrock and the request configuration to invoke it with. See the Bedrock documentation or the default configurations in src/agenteval/evaluators/model_config/preconfigured_model_configs.py. Currently, only Meta and Anthropic models are supported.
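
The shape of such a configuration can be sketched as a Python dictionary. Only the two key names, model_id and request_body, come from the description above. The contents of request_body shown here are assumptions based on the Anthropic request format on Bedrock, the check that request_body is a dictionary is also an assumption, and validate_custom_config is a hypothetical helper, not part of the project.

```python
REQUIRED_KEYS = {"model_id", "request_body"}


def validate_custom_config(config: dict) -> dict:
    """Check that a custom config has the two documented keys (hypothetical helper)."""
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f"custom config is missing keys: {sorted(missing)}")
    # Assumption: request_body is a mapping of model request parameters.
    if not isinstance(config["request_body"], dict):
        raise TypeError("request_body must be a dict")
    return config


# Example values only. The request_body fields are assumptions; check the
# preconfigured model configs in the repository for the fields actually used.
example_config = {
    "model_id": "anthropic.claude-3-sonnet-20240229-v1:0",
    "request_body": {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 500,
        "temperature": 0.0,
    },
}

validate_custom_config(example_config)
```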

provisioned_throughput_arn (string; optional)

The Amazon Resource Name (ARN) of the Provisioned Throughput.

aws_profile (string; optional)

A profile name that is used to create a Boto3 session.

aws_region (string; optional)

The AWS region that is used to create a Boto3 session.

endpoint_url (string; optional)

The endpoint URL of the AWS service, used to construct the Boto3 client.

max_retry (integer; optional)

Configures the Boto3 client with the maximum number of retry attempts allowed. The default is 10.
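
The way these settings could map onto Boto3 can be sketched as follows. This is an illustration under assumptions, not the project's actual code: the function names are hypothetical, and the choice of the bedrock-runtime service and of the retries max_attempts option is inferred from the descriptions above.

```python
from typing import Any, Dict, Optional


def session_kwargs(
    aws_profile: Optional[str] = None,
    aws_region: Optional[str] = None,
) -> Dict[str, Any]:
    """Translate aws_profile and aws_region into boto3.Session arguments."""
    kwargs: Dict[str, Any] = {}
    if aws_profile:
        kwargs["profile_name"] = aws_profile
    if aws_region:
        kwargs["region_name"] = aws_region
    return kwargs


def create_bedrock_runtime_client(
    aws_profile: Optional[str] = None,
    aws_region: Optional[str] = None,
    endpoint_url: Optional[str] = None,
    max_retry: int = 10,
):
    """Create a Bedrock runtime client from the evaluator settings (sketch)."""
    # Imported here so the helper above can be used without Boto3 installed.
    import boto3
    from botocore.config import Config

    session = boto3.Session(**session_kwargs(aws_profile, aws_region))
    return session.client(
        "bedrock-runtime",
        endpoint_url=endpoint_url,  # None falls back to the default endpoint
        config=Config(retries={"max_attempts": max_retry}),
    )
```

When aws_profile and aws_region are omitted, Boto3 falls back to its normal credential and region resolution chain.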