SCENARIO 3: Automatic Saturation Detection

When to use this scenario:

Use sweep mode for automated capacity discovery when you don't want to guess appropriate QPS test stages by hand: it is ideal for initial deployments, CI/CD pipelines, or quick capacity re-validation after infrastructure changes. The tool floods your system to empirically determine the saturation point, then automatically generates test stages clustered around that critical point. This removes human bias from test design and ensures a consistent, reproducible methodology across environments and teams, though you trade fine-grained control for automation.

Choose Scenario 2 when you need to validate specific load targets (e.g., "can we handle 20 QPS?") or want predictable test stages for production environments.

Choose Scenario 3 when discovering unknown capacity limits or when consistent automated methodology matters more than testing specific QPS values.

Deployment

```bash
# Add the AI on EKS Helm repository
helm repo add ai-on-eks https://awslabs.github.io/ai-on-eks-charts/
helm repo update

# Install sweep scenario
helm install sweep-test ai-on-eks/benchmark-charts \
  --set benchmark.scenario=sweep \
  --set benchmark.target.baseUrl=http://qwen3-vllm.default:8000 \
  --set benchmark.target.modelName=qwen3-8b \
  --set benchmark.target.tokenizerPath=Qwen/Qwen3-8B \
  --namespace benchmarking --create-namespace

# Monitor progress - watch for automatic stage generation in logs
kubectl logs -n benchmarking -l benchmark.scenario=sweep -f
```

Customizing Sweep Parameters

Adjust saturation probe settings:

```yaml
# custom-sweep.yaml
benchmark:
  scenario: sweep
  target:
    baseUrl: http://your-model.your-namespace:8000
  scenarios:
    sweep:
      load:
        sweep:
          numRequests: 3000          # More requests for larger systems
          timeout: 90                # Longer probe time
          numStages: 7               # More test stages
          stageDuration: 240         # Longer stage duration
          saturationPercentile: 99   # More conservative estimate
```

```bash
helm install sweep-test ai-on-eks/benchmark-charts -f custom-sweep.yaml -n benchmarking
```

Key Configuration:

  • Variable synthetic data distributions
  • Sweep mode (automatic): floods system with configurable request count (default: 2000) over 60 seconds to discover saturation point
  • Auto-generated test stages using geometric clustering near saturation
  • Streaming enabled
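The "variable synthetic data distributions" above mean that prompt and completion token counts are drawn from clipped normal distributions rather than being fixed. A minimal sketch of that sampling, using the distribution parameters from the raw config below (mean=512, std_dev=128, min=128, max=2048 for inputs); the function name and clipping approach are illustrative, not the tool's actual implementation:

```python
import random

def sample_length(mean, std_dev, lo, hi, rng=random.Random(0)):
    """Sample a synthetic token count from a normal distribution
    clipped to [lo, hi] - an illustrative stand-in for the tool's
    input_distribution / output_distribution settings."""
    x = rng.gauss(mean, std_dev)
    return int(min(max(x, lo), hi))

# Five synthetic prompt lengths; every value stays within [128, 2048]
prompt_tokens = [sample_length(512, 128, 128, 2048) for _ in range(5)]
```

Varying lengths this way exercises the server's batching and KV-cache behavior more realistically than fixed-size requests.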

Understanding the results:

The tool's preprocessing phase identifies saturation by flooding your system with 2000 requests over 60 seconds and measuring the processing rate; saturation_percentile: 95 means it uses the 95th percentile of observed rates as a conservative estimate. Review the automatically generated stages in the logs (geometric clustering produces tighter spacing near saturation, like 4, 8, 14, 17, 18 QPS) and compare the detected saturation point against your expectations from manual testing. Significant discrepancies reveal queueing bottlenecks or resource constraints you might have missed, and the geometric distribution provides rich data precisely where performance transitions from stable to degraded.
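The geometric clustering described above can be sketched as follows. This is a hypothetical formula for illustration (the tool's exact spacing may differ): each stage closes a fixed fraction of the remaining gap to the detected saturation point, so stages bunch up where performance starts to degrade.

```python
def geometric_stages(saturation_qps, num_stages=5, ratio=0.5):
    """Illustrative geometric clustering: each stage halves the
    remaining distance to the saturation point (ratio=0.5), so
    spacing shrinks geometrically as QPS approaches saturation."""
    return [round(saturation_qps * (1 - ratio ** k), 1)
            for k in range(1, num_stages + 1)]

# Stages rise monotonically and cluster just below the saturation QPS
stages = geometric_stages(18)
```

With a detected saturation of 18 QPS, the early stages are widely spaced while the last few sit within a couple of QPS of the limit, which is exactly where you want dense measurements.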

Configuring the Saturation Probe: The numRequests parameter in the sweep configuration controls how many requests are sent during the initial saturation discovery phase. The default value of 2000 is appropriate for most deployments, but you can adjust this based on your expected capacity.
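One rough sizing heuristic (an assumption of this guide, not a documented rule of the tool): the probe must be able to issue requests faster than the system can serve them during the timeout window, or saturation is never reached. That suggests scaling numRequests with your expected peak QPS and the probe timeout, plus some headroom:

```python
def min_probe_requests(expected_peak_qps, timeout_s=60, headroom=1.5):
    """Hypothetical sizing heuristic: send at least headroom x
    (expected peak QPS x probe timeout) requests so the probe
    outpaces the server and actually drives it into saturation."""
    return int(expected_peak_qps * timeout_s * headroom)

print(min_probe_requests(20))  # → 1800, close to the 2000 default
```

A system expected to peak around 20 QPS is well covered by the 2000-request default; a much larger deployment would warrant raising numRequests (and possibly timeout) proportionally.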

Alternative: Raw Kubernetes YAML
```yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: inference-perf-sweep
  namespace: benchmarking
data:
  config.yml: |
    api:
      type: completion
      streaming: true

    data:
      type: synthetic
      input_distribution:
        mean: 512
        std_dev: 128
        min: 128
        max: 2048
      output_distribution:
        mean: 256
        std_dev: 64
        min: 32
        max: 512

    load:
      type: constant
      stages: []  # Auto-generated by sweep
      sweep:
        type: geometric
        num_requests: 2000
        timeout: 60
        num_stages: 5
        stage_duration: 180
        saturation_percentile: 95
      num_workers: 8

    server:
      type: vllm
      model_name: qwen3-8b
      base_url: http://qwen3-vllm.default:8000
      ignore_eos: true

    tokenizer:
      pretrained_model_name_or_path: Qwen/Qwen3-8B

    storage:
      simple_storage_service:
        bucket_name: "inference-perf-results"
        path: "sweep-test/results"
---
apiVersion: batch/v1
kind: Job
metadata:
  name: inference-perf-sweep
  namespace: benchmarking
spec:
  backoffLimit: 2
  ttlSecondsAfterFinished: 3600
  template:
    spec:
      restartPolicy: Never
      serviceAccountName: inference-perf-sa

      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app.kubernetes.io/component: qwen3-vllm
              topologyKey: topology.kubernetes.io/zone

      containers:
        - name: inference-perf
          image: quay.io/inference-perf/inference-perf:v0.2.0
          command: ["/bin/sh", "-c"]
          args:
            - |
              inference-perf --config_file /workspace/config.yml
          volumeMounts:
            - name: config
              mountPath: /workspace/config.yml
              subPath: config.yml
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
            limits:
              cpu: "4"
              memory: "8Gi"

      volumes:
        - name: config
          configMap:
            name: inference-perf-sweep
```