SCENARIO 3: Automatic Saturation Detection

When to use this scenario:

Use sweep mode for automated capacity discovery when you don't want to guess appropriate QPS test stages by hand: it is ideal for initial deployments, CI/CD pipelines, or quick capacity re-validation after infrastructure changes. The tool floods your system to empirically determine the saturation point, then automatically generates test stages clustered around that critical point. This removes human bias from test design and ensures a consistent, reproducible methodology across environments and teams, though you trade fine-grained control for automation.

Choose Scenario 2 when you need to validate specific load targets (e.g., "can we handle 20 QPS?") or want predictable test stages for production environments.

Choose Scenario 3 when discovering unknown capacity limits or when consistent automated methodology matters more than testing specific QPS values.

Deployment

```bash
# Add the AI on EKS Helm repository
helm repo add ai-on-eks https://awslabs.github.io/ai-on-eks-charts/
helm repo update

# Install sweep scenario
helm install sweep-test ai-on-eks/benchmark-charts \
  --set benchmark.scenario=sweep \
  --set benchmark.target.baseUrl=http://qwen3-vllm.default:8000 \
  --set benchmark.target.modelName=qwen3-8b \
  --set benchmark.target.tokenizerPath=Qwen/Qwen3-8B \
  --namespace benchmarking --create-namespace

# Monitor progress - watch for automatic stage generation in logs
kubectl logs -n benchmarking -l benchmark.scenario=sweep -f
```

Customizing Sweep Parameters

Adjust saturation probe settings:

```yaml
# custom-sweep.yaml
benchmark:
  scenario: sweep
  target:
    baseUrl: http://your-model.your-namespace:8000
  scenarios:
    sweep:
      load:
        sweep:
          numRequests: 3000          # More requests for larger systems
          timeout: 90                # Longer probe time
          numStages: 7               # More test stages
          stageDuration: 240         # Longer stage duration
          saturationPercentile: 99   # More conservative estimate
```

```bash
helm install sweep-test ai-on-eks/benchmark-charts -f custom-sweep.yaml -n benchmarking
```

Key Configuration:

  • Variable synthetic data distributions
  • Sweep mode (automatic): floods system with configurable request count (default: 2000) over 60 seconds to discover saturation point
  • Auto-generated test stages using geometric clustering near saturation
  • Streaming enabled
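The "variable synthetic data distributions" above mean that prompt and completion token counts are drawn from clipped normal distributions rather than being fixed. A minimal sketch of that sampling, using the distribution parameters from the raw config below (mean=512, std_dev=128, min=128, max=2048 for inputs); the function name and clipping approach are illustrative, not the tool's actual implementation:

```python
import random

def sample_length(mean, std_dev, lo, hi, rng=random.Random(0)):
    """Sample a synthetic token count from a normal distribution
    clipped to [lo, hi] - an illustrative stand-in for the tool's
    input_distribution / output_distribution settings."""
    x = rng.gauss(mean, std_dev)
    return int(min(max(x, lo), hi))

# Five synthetic prompt lengths; every value stays within [128, 2048]
prompt_tokens = [sample_length(512, 128, 128, 2048) for _ in range(5)]
```

Varying lengths this way exercises the server's batching and KV-cache behavior more realistically than fixed-size requests.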

Understanding the results:

The tool's preprocessing phase identifies saturation by flooding your system with 2000 requests over 60 seconds and measuring the processing rate; saturation_percentile: 95 means it uses the 95th percentile of observed rates as a conservative estimate. Review the automatically generated stages in the logs (geometric clustering produces tighter spacing near saturation, like 4, 8, 14, 17, 18 QPS) and compare the detected saturation point against your expectations from manual testing. Significant discrepancies reveal queueing bottlenecks or resource constraints you might have missed, and the geometric distribution provides rich data precisely where performance transitions from stable to degraded.
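The geometric clustering described above can be sketched as follows. This is a hypothetical formula for illustration (the tool's exact spacing may differ): each stage closes a fixed fraction of the remaining gap to the detected saturation point, so stages bunch up where performance starts to degrade.

```python
def geometric_stages(saturation_qps, num_stages=5, ratio=0.5):
    """Illustrative geometric clustering: each stage halves the
    remaining distance to the saturation point (ratio=0.5), so
    spacing shrinks geometrically as QPS approaches saturation."""
    return [round(saturation_qps * (1 - ratio ** k), 1)
            for k in range(1, num_stages + 1)]

# Stages rise monotonically and cluster just below the saturation QPS
stages = geometric_stages(18)
```

With a detected saturation of 18 QPS, the early stages are widely spaced while the last few sit within a couple of QPS of the limit, which is exactly where you want dense measurements.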

Configuring the Saturation Probe: The numRequests parameter in the sweep configuration controls how many requests are sent during the initial saturation discovery phase. The default value of 2000 is appropriate for most deployments, but you can adjust this based on your expected capacity.
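One rough sizing heuristic (an assumption of this guide, not a documented rule of the tool): the probe must be able to issue requests faster than the system can serve them during the timeout window, or saturation is never reached. That suggests scaling numRequests with your expected peak QPS and the probe timeout, plus some headroom:

```python
def min_probe_requests(expected_peak_qps, timeout_s=60, headroom=1.5):
    """Hypothetical sizing heuristic: send at least headroom x
    (expected peak QPS x probe timeout) requests so the probe
    outpaces the server and actually drives it into saturation."""
    return int(expected_peak_qps * timeout_s * headroom)

print(min_probe_requests(20))  # → 1800, close to the 2000 default
```

A system expected to peak around 20 QPS is well covered by the 2000-request default; a much larger deployment would warrant raising numRequests (and possibly timeout) proportionally.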

Alternative: Raw Kubernetes YAML
```yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: inference-perf-sweep
  namespace: benchmarking
data:
  config.yml: |
    api:
      type: completion
      streaming: true

    data:
      type: synthetic
      input_distribution:
        mean: 512
        std_dev: 128
        min: 128
        max: 2048
      output_distribution:
        mean: 256
        std_dev: 64
        min: 32
        max: 512

    load:
      type: constant
      stages: []  # Auto-generated by sweep
      sweep:
        type: geometric
        num_requests: 2000
        timeout: 60
        num_stages: 5
        stage_duration: 180
        saturation_percentile: 95
      num_workers: 8

    server:
      type: vllm
      model_name: qwen3-8b
      base_url: http://qwen3-vllm.default:8000
      ignore_eos: true

    tokenizer:
      pretrained_model_name_or_path: Qwen/Qwen3-8B

    storage:
      simple_storage_service:
        bucket_name: "inference-perf-results"
        path: "sweep-test/results"
---
apiVersion: batch/v1
kind: Job
metadata:
  name: inference-perf-sweep
  namespace: benchmarking
spec:
  backoffLimit: 2
  ttlSecondsAfterFinished: 3600
  template:
    spec:
      restartPolicy: Never
      serviceAccountName: inference-perf-sa

      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app.kubernetes.io/component: qwen3-vllm
              topologyKey: topology.kubernetes.io/zone

      containers:
        - name: inference-perf
          image: quay.io/inference-perf/inference-perf:v0.2.0
          command: ["/bin/sh", "-c"]
          args:
            - |
              inference-perf --config_file /workspace/config.yml
          volumeMounts:
            - name: config
              mountPath: /workspace/config.yml
              subPath: config.yml
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
            limits:
              cpu: "4"
              memory: "8Gi"

      volumes:
        - name: config
          configMap:
            name: inference-perf-sweep
```