Benchmarking

ML Container Creator can generate a do/benchmark script that measures LLM endpoint performance using the SageMaker AI Benchmarking service (powered by NVIDIA AIPerf). The script creates a workload configuration, runs a benchmark job against your deployed endpoint, polls for completion, and displays a results summary.

Prerequisites

| Requirement | Details |
| --- | --- |
| Deployed endpoint | Endpoint must be `InService` (run `./do/deploy` first) |
| AWS credentials | Configured via `aws configure` or environment variables |
| Supported architecture | `transformers` or `diffusors` only (`transformers-vllm`, `transformers-sglang`, `transformers-tensorrt-llm`, `transformers-lmi`, `transformers-djl`, `diffusors-vllm-omni`) |
| Deployment target | `managed-inference` only (HyperPod EKS is not supported) |
| Bootstrapped account | Run `ml-container-creator bootstrap` to provision IAM permissions for benchmark APIs |

Architecture Requirement

Benchmarking requires the OpenAI-compatible chat completions API, which is only available on transformer and diffusor model servers. HTTP and Triton architectures are not supported.

Workflow

The typical benchmarking workflow follows the standard deploy-then-measure pattern:

graph LR
    A[Generate project<br>with benchmark] --> B[Build & push<br>container] --> C[Deploy<br>endpoint] --> D[Run<br>benchmark] --> E[Interpret<br>results] --> F[Clean up]

1. Generate a project with benchmarking enabled

ml-container-creator my-llm-project \
  --deployment-config=transformers-vllm \
  --model-name=meta-llama/Llama-3.1-8B-Instruct \
  --instance-type=ml.g5.2xlarge \
  --deployment-target=managed-inference \
  --include-benchmark \
  --benchmark-concurrency=10 \
  --benchmark-input-tokens=550 \
  --benchmark-output-tokens=150 \
  --benchmark-streaming

2. Build and deploy

cd my-llm-project
./do/build        # or ./do/submit for CodeBuild
./do/push
./do/deploy

Wait for the endpoint to reach InService status.

3. Run the benchmark

./do/benchmark

The script will:

1. Verify the endpoint is `InService`
2. Create (or update) a Secrets Manager secret for the HuggingFace token (if `HF_TOKEN` is set in `do/config`)
3. Create an AI Workload Config with your benchmark parameters
4. Create an AI Benchmark Job targeting your endpoint's inference component
5. Poll every 30 seconds until the job completes (up to 30 minutes)
6. Download and display the results summary
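
The polling step above can be pictured as a bounded loop. The following is a minimal sketch, not the generated script itself: `poll_until_done`, the status command passed to it, and the `Completed`/`Failed` status strings are hypothetical, and the default of 60 attempts is an assumption derived from the stated 30-second interval and 30-minute limit.

```shell
# Sketch of a bounded polling loop (hypothetical; not the generated do/benchmark).
# Assumption: 60 attempts x 30 seconds = 30 minutes.
POLL_INTERVAL_SECONDS="${POLL_INTERVAL_SECONDS:-30}"
MAX_POLL_ATTEMPTS="${MAX_POLL_ATTEMPTS:-60}"

# poll_until_done STATUS_CMD [ARGS...]
# Runs STATUS_CMD repeatedly until it prints a terminal status.
# Returns 0 on Completed, 1 on Failed, 2 when the attempts run out.
poll_until_done() {
  local attempt=1 status
  while [ "$attempt" -le "$MAX_POLL_ATTEMPTS" ]; do
    status="$("$@")"
    case "$status" in
      Completed) echo "Job completed after ${attempt} poll(s)"; return 0 ;;
      Failed)    echo "Job failed after ${attempt} poll(s)" >&2; return 1 ;;
    esac
    # Only sleep when another attempt remains.
    if [ "$attempt" -lt "$MAX_POLL_ATTEMPTS" ]; then
      sleep "$POLL_INTERVAL_SECONDS"
    fi
    attempt=$((attempt + 1))
  done
  echo "Timed out after ${MAX_POLL_ATTEMPTS} poll(s)" >&2
  return 2
}
```

Passing the status command as an argument keeps the loop independent of how the job status is actually fetched, so the same shape works with whatever call the real script uses.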

4. Clean up benchmark resources

./do/benchmark --clean   # Clean after displaying results
./do/clean benchmark     # Clean benchmark resources independently

Parameters

All benchmark parameters are set during project generation and stored in do/config. You can also edit do/config directly to adjust parameters between runs.

| Parameter | CLI Option | Default | Description |
| --- | --- | --- | --- |
| `includeBenchmark` | `--include-benchmark` | `false` | Enable benchmark script generation |
| `benchmarkConcurrency` | `--benchmark-concurrency` | `10` | Number of concurrent requests sent to the endpoint |
| `benchmarkInputTokensMean` | `--benchmark-input-tokens` | `550` | Mean number of input tokens per request |
| `benchmarkOutputTokensMean` | `--benchmark-output-tokens` | `150` | Mean number of output tokens per request |
| `benchmarkStreaming` | `--benchmark-streaming` | `true` | Enable streaming responses during benchmark |
| `benchmarkRequestCount` | `--benchmark-request-count` | Service default | Total number of requests to send (leave empty for service default) |
| `benchmarkS3OutputPath` | `--benchmark-s3-output-path` | Auto-generated | S3 URI for benchmark results output |

Environment variables

| Variable | Description |
| --- | --- |
| `ML_INCLUDE_BENCHMARK` | Set to `true` to enable benchmarking (equivalent to `--include-benchmark`) |
| `ML_BENCHMARK_S3_OUTPUT_PATH` | Override the S3 output path for benchmark results |
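
As a small illustration, the two variables can be exported in the shell before generating a project. The bucket name and prefix below are hypothetical placeholders, and the commented-out generator call only reuses flags shown elsewhere on this page.

```shell
# Equivalent to passing --include-benchmark on the command line.
export ML_INCLUDE_BENCHMARK=true

# Hypothetical bucket and prefix; replace with an S3 location you own.
export ML_BENCHMARK_S3_OUTPUT_PATH="s3://my-example-bucket/benchmarks/my-llm-project/"

# Then generate the project as usual, for example:
# ml-container-creator my-llm-project --deployment-config=transformers-vllm
```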

do/config variables

When benchmarking is enabled, the following variables are exported in do/config:

# SageMaker AI Benchmarking configuration
export BENCHMARK_CONCURRENCY="10"
export BENCHMARK_INPUT_TOKENS_MEAN="550"
export BENCHMARK_OUTPUT_TOKENS_MEAN="150"
export BENCHMARK_STREAMING="true"
export BENCHMARK_REQUEST_COUNT=""
export BENCHMARK_S3_OUTPUT_PATH="s3://ml-container-creator-benchmark-us-east-1-123456789012/my-llm-project/"
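
Because these values can be edited by hand between runs, a quick sanity check before starting a benchmark may save a failed job. The sketch below is not part of the generated project: `is_positive_int` and `validate_benchmark_config` are hypothetical helpers, and the rules they enforce are assumptions inferred from the parameter descriptions above.

```shell
# Hypothetical pre-flight check for the values exported by do/config.

# Succeeds only for integers >= 1 without leading zeros.
is_positive_int() {
  case "$1" in
    ''|*[!0-9]*|0*) return 1 ;;
    *) return 0 ;;
  esac
}

# Prints one message per problem to stderr; succeeds only if none were found.
validate_benchmark_config() {
  local errors=0
  is_positive_int "${BENCHMARK_CONCURRENCY:-}" ||
    { echo "BENCHMARK_CONCURRENCY must be a positive integer" >&2; errors=$((errors + 1)); }
  is_positive_int "${BENCHMARK_INPUT_TOKENS_MEAN:-}" ||
    { echo "BENCHMARK_INPUT_TOKENS_MEAN must be a positive integer" >&2; errors=$((errors + 1)); }
  is_positive_int "${BENCHMARK_OUTPUT_TOKENS_MEAN:-}" ||
    { echo "BENCHMARK_OUTPUT_TOKENS_MEAN must be a positive integer" >&2; errors=$((errors + 1)); }
  case "${BENCHMARK_STREAMING:-}" in
    true|false) ;;
    *) echo "BENCHMARK_STREAMING must be true or false" >&2; errors=$((errors + 1)) ;;
  esac
  # An empty request count means "use the service default".
  if [ -n "${BENCHMARK_REQUEST_COUNT:-}" ] && ! is_positive_int "$BENCHMARK_REQUEST_COUNT"; then
    echo "BENCHMARK_REQUEST_COUNT must be empty or a positive integer" >&2
    errors=$((errors + 1))
  fi
  case "${BENCHMARK_S3_OUTPUT_PATH:-}" in
    s3://?*) ;;
    *) echo "BENCHMARK_S3_OUTPUT_PATH must start with s3://" >&2; errors=$((errors + 1)) ;;
  esac
  [ "$errors" -eq 0 ]
}

# Typical use from the project root (assumes do/config can be sourced):
#   . ./do/config && validate_benchmark_config && ./do/benchmark
```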

Interpreting Results

When the benchmark job completes, do/benchmark displays a summary table with the following metrics:

Throughput metrics

| Metric | Description |
| --- | --- |
| Request throughput | Number of completed requests per second |
| Output token throughput | Number of output tokens generated per second across all requests |

Latency metrics

| Metric | Description |
| --- | --- |
| Request latency P50 | Median end-to-end request latency |
| Request latency P90 | 90th percentile end-to-end request latency |
| Request latency P99 | 99th percentile end-to-end request latency |

Streaming metrics (when streaming is enabled)

| Metric | Description |
| --- | --- |
| TTFT P50 | Median time to first token: latency from request submission to first output token |
| TTFT P90 | 90th percentile time to first token |
| TTFT P99 | 99th percentile time to first token |
| ITL P50 | Median inter-token latency: time between consecutive output tokens |
| ITL P90 | 90th percentile inter-token latency |
| ITL P99 | 99th percentile inter-token latency |

What good results look like

- TTFT should be low (under 500 ms for most models); a high TTFT indicates the model is slow to start generating.
- ITL should be consistent; high variance suggests resource contention or batching delays.
- Request throughput should scale with concurrency; if throughput plateaus as concurrency increases, the endpoint is saturated.
- A P99 latency much higher than P50 indicates tail-latency issues, often caused by request queuing.
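
To make the saturation check concrete, the sketch below scans request throughput recorded at increasing concurrency levels and reports the first level where the gain drops below a threshold. `find_saturation` is a hypothetical helper, the input format (one `concurrency throughput` pair per line) is an assumption, and the numbers in the usage comment are made up for illustration rather than measured.

```shell
# find_saturation THRESHOLD_PCT
# Reads "concurrency throughput" pairs from stdin, sorted by ascending concurrency.
# Prints the first concurrency level whose throughput gain over the previous
# level is below THRESHOLD_PCT percent, or "none" if no such level exists.
find_saturation() {
  awk -v threshold="$1" '
    NR > 1 && prev > 0 {
      gain = ($2 - prev) / prev * 100
      if (gain < threshold) { print $1; found = 1; exit }
    }
    { prev = $2 }
    END { if (!found) print "none" }
  '
}

# Usage with made-up numbers (requests per second at concurrency 5, 10, 20, 40):
#   printf '5 10\n10 19\n20 20\n40 20\n' | find_saturation 10
```

The threshold is a judgment call: a stricter value flags saturation earlier, while a looser one tolerates diminishing returns for longer.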

Tuning tips

- Increase `benchmarkConcurrency` to find the saturation point of your endpoint
- Adjust `benchmarkInputTokensMean` and `benchmarkOutputTokensMean` to match your expected production workload
- Run multiple benchmarks with different parameters to build a performance profile
- Compare results across instance types to find the best cost/performance ratio
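
One way to run several benchmarks with different parameters is to rewrite the relevant `export` line in do/config between runs. The helpers below are a hypothetical sketch: `set_config_value` and `sweep_concurrency` are not part of the generated project, and the sketch assumes that each `./do/benchmark` invocation picks up the current values from do/config.

```shell
# set_config_value FILE KEY VALUE
# Rewrites an existing `export KEY="..."` line; other lines are left untouched.
# Intended for simple values such as integers (no "|" or "&" characters).
set_config_value() {
  local file="$1" key="$2" value="$3" tmp
  tmp="$(mktemp)"
  sed "s|^export ${key}=.*|export ${key}=\"${value}\"|" "$file" > "$tmp" &&
    cat "$tmp" > "$file"   # overwrite in place to keep the file's permissions
  rm -f "$tmp"
}

# sweep_concurrency LEVEL...
# Runs one benchmark per concurrency level from the project root.
sweep_concurrency() {
  local level
  for level in "$@"; do
    set_config_value do/config BENCHMARK_CONCURRENCY "$level"
    echo "=== concurrency ${level} ==="
    ./do/benchmark
  done
}

# Example: sweep_concurrency 5 10 20 40
```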

Cleanup

Benchmark resources (workload configs and benchmark jobs) persist in your AWS account after the benchmark completes. Clean them up to avoid clutter:

# Clean benchmark resources only
./do/clean benchmark

# Or use the --clean flag during benchmark execution
./do/benchmark --clean

The benchmark cleanup target:

- Deletes the workload config (`{project-name}-benchmark-config`)
- Deletes all completed or failed benchmark jobs matching `{project-name}-benchmark-*`
- Does not delete the Secrets Manager secret (managed separately)

The ./do/clean all command also includes benchmark cleanup when benchmarking is enabled.

Example: Full CLI invocation

Generate a project with all benchmark parameters specified for a fully non-interactive run:

ml-container-creator my-benchmark-project \
  --skip-prompts \
  --deployment-config=transformers-sglang \
  --model-name=meta-llama/Llama-3.1-8B-Instruct \
  --instance-type=ml.g5.2xlarge \
  --region=us-west-2 \
  --deployment-target=managed-inference \
  --build-target=codebuild \
  --include-benchmark \
  --benchmark-concurrency=20 \
  --benchmark-input-tokens=1024 \
  --benchmark-output-tokens=256 \
  --benchmark-streaming \
  --benchmark-request-count=100

Then deploy and benchmark:

cd my-benchmark-project
./do/submit          # Build via CodeBuild
./do/deploy          # Deploy to SageMaker
./do/benchmark       # Run the benchmark
./do/benchmark --clean  # Run again and clean up after

Troubleshooting

"Endpoint is not InService"

The benchmark requires a fully deployed endpoint. Check endpoint status:

aws sagemaker describe-endpoint --endpoint-name <your-endpoint> \
  --query 'EndpointStatus' --output text

Wait for InService status before running ./do/benchmark.
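
Instead of re-running the status check by hand, the AWS CLI waiter `aws sagemaker wait endpoint-in-service` can block until the endpoint is ready. The wrapper below is a hypothetical sketch (`wait_for_endpoint` is not part of the generated project) that only waits when the endpoint is still being created or updated.

```shell
# wait_for_endpoint ENDPOINT_NAME
# Returns 0 once the endpoint is InService, non-zero otherwise.
wait_for_endpoint() {
  local name="$1" status
  status="$(aws sagemaker describe-endpoint --endpoint-name "$name" \
    --query 'EndpointStatus' --output text)" || return 1
  case "$status" in
    InService)
      echo "Endpoint ${name} is InService" ;;
    Creating|Updating)
      echo "Endpoint ${name} is ${status}; waiting for InService..."
      # Built-in waiter: polls until InService or gives up with a non-zero exit.
      aws sagemaker wait endpoint-in-service --endpoint-name "$name" ;;
    *)
      echo "Endpoint ${name} is ${status}; not waiting" >&2
      return 1 ;;
  esac
}

# Example: wait_for_endpoint <your-endpoint> && ./do/benchmark
```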

"UnrecognizedClientException"

The AI Benchmarking APIs may not be available in all AWS regions. Try deploying to us-east-1 or us-west-2, where the service is generally available.

Benchmark job fails

Check the failure reason displayed by the script. Common causes:

- Insufficient permissions: run `ml-container-creator bootstrap` to add benchmark IAM permissions
- Endpoint not responding: verify the endpoint works with `./do/test` before benchmarking
- Tokenizer access: ensure `HF_TOKEN` is set in `do/config` for gated models

Benchmark times out

The default timeout is 30 minutes. If your benchmark needs more time (high request counts or slow endpoints), edit the MAX_POLL_ATTEMPTS variable in do/benchmark.
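
As a rough guide for choosing a new value, the arithmetic below converts a desired timeout into a number of poll attempts. It assumes the timeout equals `MAX_POLL_ATTEMPTS` multiplied by the 30-second polling interval, which matches the 30-minute default only if the default is 60 attempts; check do/benchmark for the actual values. `attempts_for_timeout` is a hypothetical helper.

```shell
# attempts_for_timeout MINUTES [INTERVAL_SECONDS]
# Prints the number of poll attempts needed to cover MINUTES, rounding up.
# INTERVAL_SECONDS defaults to 30 (the polling interval described above).
attempts_for_timeout() {
  local minutes="$1" interval="${2:-30}"
  echo $(( (minutes * 60 + interval - 1) / interval ))
}

attempts_for_timeout 30   # prints 60
attempts_for_timeout 90   # prints 180
```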