Benchmarking¶
ML Container Creator can generate a do/benchmark script that measures LLM endpoint performance using the SageMaker AI Benchmarking service (powered by NVIDIA AIPerf). The script creates a workload configuration, runs a benchmark job against your deployed endpoint, polls for completion, and displays a results summary.
Prerequisites¶
| Requirement | Details |
|---|---|
| Deployed endpoint | Endpoint must be InService (run ./do/deploy first) |
| AWS credentials | Configured via aws configure or environment variables |
| Supported architecture | transformers or diffusors only (transformers-vllm, transformers-sglang, transformers-tensorrt-llm, transformers-lmi, transformers-djl, diffusors-vllm-omni) |
| Deployment target | managed-inference only (HyperPod EKS is not supported) |
| Bootstrapped account | Run ml-container-creator bootstrap to provision IAM permissions for benchmark APIs |
Architecture Requirement
Benchmarking requires the OpenAI-compatible chat completions API, which is only available on transformer and diffusor model servers. HTTP and Triton architectures are not supported.
Workflow¶
The typical benchmarking workflow follows the standard deploy-then-measure pattern:
graph LR
A[Generate project<br>with benchmark] --> B[Build & push<br>container] --> C[Deploy<br>endpoint] --> D[Run<br>benchmark] --> E[Interpret<br>results] --> F[Clean up]
1. Generate a project with benchmarking enabled¶
ml-container-creator my-llm-project \
--deployment-config=transformers-vllm \
--model-name=meta-llama/Llama-3.1-8B-Instruct \
--instance-type=ml.g5.2xlarge \
--deployment-target=managed-inference \
--include-benchmark \
--benchmark-concurrency=10 \
--benchmark-input-tokens=550 \
--benchmark-output-tokens=150 \
--benchmark-streaming
2. Build and deploy¶
Build and push the container, then deploy the endpoint with ./do/deploy. Wait for the endpoint to reach InService status before continuing.
3. Run the benchmark¶
Run ./do/benchmark. The script will:
- Verify the endpoint is InService
- Create (or update) a Secrets Manager secret for the HuggingFace token (if HF_TOKEN is set in do/config)
- Create an AI Workload Config with your benchmark parameters
- Create an AI Benchmark Job targeting your endpoint's inference component
- Poll every 30 seconds until the job completes (up to 30 minutes)
- Download and display the results summary
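The polling step in the list above can be sketched as a bounded loop. This is a minimal illustration, not the generated do/benchmark source. The function name poll_until_done, the POLL_INTERVAL_SECONDS variable, and the status strings (Completed, Failed, Stopped) are assumptions. The default of 60 attempts is derived from the 30-second interval and 30-minute limit stated on this page.

```shell
# Hypothetical sketch of a bounded poll loop (not the real do/benchmark code).
MAX_POLL_ATTEMPTS="${MAX_POLL_ATTEMPTS:-60}"            # assumed: 60 x 30 s = 30 minutes
POLL_INTERVAL_SECONDS="${POLL_INTERVAL_SECONDS:-30}"    # assumed variable name

# poll_until_done STATUS_CMD [ARGS...]
# Runs STATUS_CMD until it prints a terminal status or attempts run out.
# Prints the final status; returns 0 on success, 1 on failure, 2 on timeout.
poll_until_done() {
  local attempt=1 status
  while [ "$attempt" -le "$MAX_POLL_ATTEMPTS" ]; do
    status="$("$@")"
    case "$status" in
      Completed) echo "$status"; return 0 ;;       # assumed status name
      Failed|Stopped) echo "$status"; return 1 ;;  # assumed status names
    esac
    sleep "$POLL_INTERVAL_SECONDS"
    attempt=$((attempt + 1))
  done
  echo "TimedOut"
  return 2
}
```

A caller would pass whatever command prints the job status, for example `poll_until_done get_job_status "$JOB_NAME"`, where get_job_status is a hypothetical wrapper around the relevant describe call.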
4. Clean up benchmark resources¶
./do/benchmark --clean # Clean after displaying results
./do/clean benchmark # Clean benchmark resources independently
Parameters¶
All benchmark parameters are set during project generation and stored in do/config. You can also edit do/config directly to adjust parameters between runs.
| Parameter | CLI Option | Default | Description |
|---|---|---|---|
| includeBenchmark | --include-benchmark | false | Enable benchmark script generation |
| benchmarkConcurrency | --benchmark-concurrency | 10 | Number of concurrent requests sent to the endpoint |
| benchmarkInputTokensMean | --benchmark-input-tokens | 550 | Mean number of input tokens per request |
| benchmarkOutputTokensMean | --benchmark-output-tokens | 150 | Mean number of output tokens per request |
| benchmarkStreaming | --benchmark-streaming | true | Enable streaming responses during benchmark |
| benchmarkRequestCount | --benchmark-request-count | Service default | Total number of requests to send (leave empty for service default) |
| benchmarkS3OutputPath | --benchmark-s3-output-path | Auto-generated | S3 URI for benchmark results output |
Environment variables¶
| Variable | Description |
|---|---|
| ML_INCLUDE_BENCHMARK | Set to true to enable benchmarking (equivalent to --include-benchmark) |
| ML_BENCHMARK_S3_OUTPUT_PATH | Override the S3 output path for benchmark results |
do/config variables¶
When benchmarking is enabled, the following variables are exported in do/config:
# SageMaker AI Benchmarking configuration
export BENCHMARK_CONCURRENCY="10"
export BENCHMARK_INPUT_TOKENS_MEAN="550"
export BENCHMARK_OUTPUT_TOKENS_MEAN="150"
export BENCHMARK_STREAMING="true"
export BENCHMARK_REQUEST_COUNT=""
export BENCHMARK_S3_OUTPUT_PATH="s3://ml-container-creator-benchmark-us-east-1-123456789012/my-llm-project/"
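To adjust one of these values between runs without opening an editor, a small helper along the following lines may be convenient. It is a sketch only: set_config_var is a hypothetical name, and it assumes the file keeps the `export NAME="value"` layout shown above.

```shell
# Hypothetical helper: rewrite one `export NAME="value"` line in a config file.
# Assumes VALUE contains no "|", "&", or backslash characters, since all three
# are special inside this sed substitution.
set_config_var() {
  local file="$1" name="$2" value="$3" tmp
  tmp="$(mktemp)" || return 1
  # Write to a temp file first so a failed sed never truncates the original.
  if sed "s|^export ${name}=.*|export ${name}=\"${value}\"|" "$file" > "$tmp"; then
    # Copy contents back instead of mv, to keep the original file permissions.
    cat "$tmp" > "$file"
    rm -f "$tmp"
  else
    rm -f "$tmp"
    return 1
  fi
}
```

For example, `set_config_var do/config BENCHMARK_CONCURRENCY 20` followed by `./do/benchmark` would rerun the benchmark at a higher concurrency.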
Interpreting Results¶
When the benchmark job completes, do/benchmark displays a summary table with the following metrics:
Throughput metrics¶
| Metric | Description |
|---|---|
| Request throughput | Number of completed requests per second |
| Output token throughput | Number of output tokens generated per second across all requests |
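The two throughput metrics are linked: averaged over a run, output token throughput is roughly request throughput multiplied by the mean number of output tokens per request. The sketch below is a rough sanity check under that assumption. The function name is hypothetical, and the numbers in the usage note are made up for illustration, not measured results.

```shell
# Hypothetical helper: estimate output tokens per second from requests per
# second and the mean output tokens per request.
# expected_token_throughput REQ_PER_SEC MEAN_OUTPUT_TOKENS
expected_token_throughput() {
  awk -v rps="$1" -v toks="$2" 'BEGIN { printf "%.1f\n", rps * toks }'
}
```

With an illustrative 2.5 requests per second and the default 150 output tokens, `expected_token_throughput 2.5 150` prints 375.0. A reported value far from the estimate may mean that responses are ending early or running long.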
Latency metrics¶
| Metric | Description |
|---|---|
| Request latency P50 | Median end-to-end request latency |
| Request latency P90 | 90th percentile end-to-end request latency |
| Request latency P99 | 99th percentile end-to-end request latency |
Streaming metrics (when streaming is enabled)¶
| Metric | Description |
|---|---|
| TTFT P50 | Median time to first token — latency from request submission to first output token |
| TTFT P90 | 90th percentile time to first token |
| TTFT P99 | 99th percentile time to first token |
| ITL P50 | Median inter-token latency — time between consecutive output tokens |
| ITL P90 | 90th percentile inter-token latency |
| ITL P99 | 99th percentile inter-token latency |
What good results look like¶
- TTFT should be low (< 500ms for most models) — high TTFT indicates the model is slow to start generating
- ITL should be consistent — high variance suggests resource contention or batching delays
- Request throughput scales with concurrency — if throughput plateaus while increasing concurrency, the endpoint is saturated
- P99 latency much higher than P50 indicates tail latency issues, often caused by request queuing
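Two of the checks above reduce to simple arithmetic on the reported numbers. The following sketch uses hypothetical helper names and takes values you copy from the results summary. The 500 ms default comes from the guideline above, and no particular P99/P50 threshold is implied.

```shell
# Hypothetical helpers for reading a results summary.

# tail_ratio P50 P99 -> prints P99/P50 to two decimals
tail_ratio() {
  awk -v p50="$1" -v p99="$2" 'BEGIN {
    if (p50 <= 0) { print "n/a"; exit 1 }
    printf "%.2f\n", p99 / p50
  }'
}

# ttft_within_budget TTFT_MS [BUDGET_MS] -> exit 0 if TTFT is under the budget
ttft_within_budget() {
  awk -v ttft="$1" -v budget="${2:-500}" 'BEGIN { exit ((ttft < budget) ? 0 : 1) }'
}
```

For instance, `tail_ratio 200 900` prints 4.50, meaning that the slowest 1% of requests take at least 4.5 times the median.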
Tuning tips¶
- Increase benchmarkConcurrency to find the saturation point of your endpoint
- Adjust benchmarkInputTokensMean and benchmarkOutputTokensMean to match your expected production workload
- Run multiple benchmarks with different parameters to build a performance profile
- Compare results across instance types to find the best cost/performance ratio
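Once you have request throughput at several concurrency levels, the saturation point can be estimated mechanically. The sketch below is one possible heuristic, not part of the generated project. The function name find_saturation and the 5% default gain threshold are assumptions. It expects one "concurrency throughput" pair per line, sorted by concurrency.

```shell
# Hypothetical heuristic: report the last concurrency level that still gave a
# meaningful throughput gain. Reads "concurrency throughput" pairs on stdin.
# find_saturation [MIN_GAIN_PERCENT]
find_saturation() {
  awk -v min_gain="${1:-5}" '
    NR > 1 && prev_tp > 0 {
      gain = ($2 - prev_tp) / prev_tp * 100
      # First step whose gain falls below the threshold marks the plateau.
      if (gain < min_gain) { print prev_c; found = 1; exit }
    }
    { prev_c = $1; prev_tp = $2 }
    END { if (!found) print "none" }
  '
}
```

With made-up measurements, `printf '5 2.0\n10 3.9\n20 4.0\n40 4.05\n' | find_saturation` prints 10, because moving from 10 to 20 concurrent requests adds less than 5% throughput. An output of none means throughput was still climbing at the highest level tested.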
Cleanup¶
Benchmark resources (workload configs and benchmark jobs) persist in your AWS account after the benchmark completes. Clean them up to avoid clutter:
# Clean benchmark resources only
./do/clean benchmark
# Or use the --clean flag during benchmark execution
./do/benchmark --clean
The benchmark cleanup target:
- Deletes the workload config ({project-name}-benchmark-config)
- Deletes all completed/failed benchmark jobs matching {project-name}-benchmark-*
- Does not delete the Secrets Manager secret (managed separately)
The ./do/clean all command also includes benchmark cleanup when benchmarking is enabled.
Example: Full CLI invocation¶
Generate a project with all benchmark parameters specified for a fully non-interactive run:
ml-container-creator my-benchmark-project \
--skip-prompts \
--deployment-config=transformers-sglang \
--model-name=meta-llama/Llama-3.1-8B-Instruct \
--instance-type=ml.g5.2xlarge \
--region=us-west-2 \
--deployment-target=managed-inference \
--build-target=codebuild \
--include-benchmark \
--benchmark-concurrency=20 \
--benchmark-input-tokens=1024 \
--benchmark-output-tokens=256 \
--benchmark-streaming \
--benchmark-request-count=100
Then deploy and benchmark:
cd my-benchmark-project
./do/submit # Build via CodeBuild
./do/deploy # Deploy to SageMaker
./do/benchmark # Run the benchmark
./do/benchmark --clean # Run again and clean up after
Troubleshooting¶
"Endpoint is not InService"¶
The benchmark requires a fully deployed endpoint. Check endpoint status:
aws sagemaker describe-endpoint --endpoint-name <your-endpoint> \
--query 'EndpointStatus' --output text
Wait for InService status before running ./do/benchmark.
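The same status check can be wrapped in a guard for your own automation. This is a sketch with a hypothetical function name, built on the describe-endpoint call shown above. It is not how do/benchmark itself is necessarily implemented.

```shell
# Hypothetical guard: succeed only when the endpoint reports InService.
# require_in_service ENDPOINT_NAME
require_in_service() {
  local endpoint="$1" status
  status="$(aws sagemaker describe-endpoint --endpoint-name "$endpoint" \
    --query 'EndpointStatus' --output text)" || return 2   # the AWS call itself failed
  if [ "$status" != "InService" ]; then
    echo "Endpoint $endpoint is $status, not InService" >&2
    return 1
  fi
}
```

A wrapper script could then run `require_in_service <your-endpoint> && ./do/benchmark`.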
"UnrecognizedClientException"¶
The AI Benchmarking APIs may not be available in every AWS region. Try deploying to us-east-1 or us-west-2, where the service is generally available.
Benchmark job fails¶
Check the failure reason displayed by the script. Common causes:
- Insufficient permissions: Run ml-container-creator bootstrap to add benchmark IAM permissions
- Endpoint not responding: Verify the endpoint works with ./do/test before benchmarking
- Tokenizer access: Ensure HF_TOKEN is set in do/config for gated models
Benchmark times out¶
The default timeout is 30 minutes. If your benchmark needs more time (high request counts or slow endpoints), edit the MAX_POLL_ATTEMPTS variable in do/benchmark.
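To choose a new value, convert the timeout you want into a number of polls. The sketch below assumes that MAX_POLL_ATTEMPTS counts polls made at the 30-second interval described earlier. Confirm this against your generated do/benchmark before relying on it. The function name is hypothetical.

```shell
# Hypothetical helper: number of polls needed to cover a timeout.
# attempts_for_minutes MINUTES [INTERVAL_SECONDS]
attempts_for_minutes() {
  local minutes="$1" interval="${2:-30}"
  # Round up so the full timeout is always covered.
  echo $(( (minutes * 60 + interval - 1) / interval ))
}
```

Under that assumption, `attempts_for_minutes 30` prints 60, matching the default 30-minute limit, and `attempts_for_minutes 45` prints 90.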