Skip to main content

StarRocks on EKS Infrastructure

Deploy a production-ready StarRocks stack on Amazon EKS that supports both shared-data (S3 + CN cache) and shared-nothing (EBS + BE) cluster architectures. The stack includes KEDA-based autoscaling with scale-to-zero capability, Prometheus monitoring, and a pre-built Grafana dashboard.

Architecture

This stack deploys a complete StarRocks platform with:

  • StarRocks Operator (v1.11.3) deployed via ArgoCD
  • S3 bucket for shared-data storage (via Pod Identity)
  • High-performance StorageClass gp3-starrocks (6000 IOPS, 250 MB/s, xfs) for BE nodes
  • KEDA operator for event-driven autoscaling with scale-to-zero
  • PodMonitors that automatically discover and scrape all StarRocks clusters
  • Grafana dashboard pre-configured for multi-cluster filtering
  • Karpenter NodePools for elastic compute (general-purpose, memory-optimized, compute-optimized Graviton)

Supported Cluster Types

You can deploy one or both cluster types after the stack is ready. Both use the same FE (Frontend) but differ in the storage architecture:

  • Shared-Data Cluster — Stateless Compute Nodes (CN) read from S3 with local NVMe cache. Unlimited storage, second-scale elasticity.
  • Shared-Nothing Cluster — Backend Nodes (BE) store data on local EBS volumes. Consistent low latency, simpler operations.

For a detailed architecture comparison, see the Benchmark page.

Prerequisites

  • AWS CLIInstall Guide
  • Terraform (>= 1.5.0) — Install Guide
  • kubectlInstall Guide
  • Helm (>= 3.0) — Install Guide
  • AWS credentials configured — Run aws configure or use IAM roles
  • EKS IAM permissions — Ability to create EKS clusters, VPCs, IAM roles, S3 buckets

Step 1: Clone Repository

git clone https://github.com/awslabs/data-on-eks.git
cd data-on-eks/data-stacks/starrocks-on-eks

Step 2: Customize Stack Configuration

The default terraform/data-stack.tfvars deploys a StarRocks-only environment. Edit if you need additional components:

terraform/data-stack.tfvars
name   = "starrocks-on-eks"
region = "us-east-1"

# Core component
enable_starrocks = true

# Enable optional components if needed
enable_spark_operator = false
enable_jupyterhub = false
enable_raydata = false
enable_amazon_prometheus = false
enable_superset = false
enable_ingress_nginx = false

Step 3: Deploy the Stack

./deploy.sh
Expected deployment time

15-20 minutes — includes EKS cluster, all addons, StarRocks operator, KEDA operator, PodMonitors, and the gp3-starrocks StorageClass.

Step 4: Verify the Stack

export KUBECONFIG=kubeconfig.yaml

# Verify the StarRocks operator
kubectl get pods -n starrocks

# Verify KEDA operator
kubectl get pods -n keda

# Verify StorageClass
kubectl get storageclass gp3-starrocks

# Verify PodMonitors (Prometheus auto-discovery)
kubectl get podmonitors -n starrocks

Expected output:

# starrocks namespace
NAME READY STATUS RESTARTS AGE
kube-starrocks-operator-5d558f7b8b-9s5m6 1/1 Running 0 2m

# keda namespace
NAME READY STATUS RESTARTS
keda-operator-xxxxxxxxx 1/1 Running 0
keda-operator-metrics-apiserver-xxxxxxxxx 1/1 Running 0

# StorageClass
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE
gp3-starrocks ebs.csi.aws.com Delete WaitForFirstConsumer

# PodMonitors
NAME AGE
starrocks-be 2m
starrocks-cn 2m
starrocks-fe 2m

What Gets Deployed

Infrastructure Components

ComponentPurpose
StarRocks OperatorManages StarRocksCluster CRDs (FE, BE, CN lifecycle)
S3 BucketShared-data storage (AES256 encrypted, Pod Identity access)
KEDA OperatorEvent-driven autoscaling with scale-to-zero support
PodMonitorsPrometheus scraping for FE (port 8030), BE/CN (port 8040)
gp3-starrocks StorageClassProduction-tuned EBS: 6000 IOPS, 250 MB/s, xfs, encrypted
Pod Identity RoleIAM access for CN pods to read/write S3
Grafana DashboardMulti-cluster StarRocks dashboard at infra/terraform/grafana-dashboards/starrocks-dashboard.json

Karpenter NodePools

The stack reuses existing Karpenter NodePools from the base infrastructure:

NodePoolNodeGroupType LabelUsed ForInstance Families
general-purposegeneral-purposeFE podsm6g, m7g, m7i (on-demand)
memory-optimized-gravitonSparkGravitonMemoryOptimizedBE pods (shared-nothing), CN pods (shared-data)r6g, r7g, r7gd, r8g, r8gd
compute-optimized-gravitonSparkGravitonComputeOptimizedCompute-heavy workloadsc6g, c7g, c7gd, c8gd

Instance type is selected at the StarRocksCluster level via nodeSelector with the karpenter.k8s.aws/instance-family label.

CN Autoscaling with KEDA

The stack deploys a KEDA ScaledObject for the shared-data CN nodes that enables:

  • Scale-to-zero — CN replicas drop to 0 during idle periods (saves ~$1,400/month per CN instance)
  • Prometheus-based triggers — scales on StarRocks query rate (starrocks_fe_query_total) and active connections
  • Fast activation — first query (>0.1 QPS) triggers scale from 0 → 1
  • Cooldown protection — 300s stabilization window prevents scale-down thrashing

Scaling behavior:

ParameterValuePurpose
Min replicas0Scale to zero during idle
Max replicas10Upper bound for bursts
Activation threshold0.1 QPSTriggers scale from 0 → 1
Scale-up2 pods / 30sFast response to query bursts
Scale-down1 pod / 60sConservative to keep cache warm
Cooldown300sPrevents thrashing between queries
Replicas Field Conflict

When KEDA is managing CN replicas, the StarRocksCluster CR must NOT set replicas under starRocksCnSpec. KEDA creates an underlying HPA that manages the replica count. If replicas is set, you'll see a conflict.

The example manifest examples/starrocks-shared-data.yaml uses a fixed replicas: 3 for benchmark consistency. To use KEDA autoscaling, remove that field and re-apply the manifest.

Cold Cache Trade-off

Scaling CN to 0 means losing the NVMe data cache. The first query after scale-up reads from S3 (~10-50ms per read) until the cache rebuilds. For latency-sensitive workloads, consider setting minReplicaCount: 1 in the ScaledObject to keep at least one warm CN.

Step 5: Deploy a StarRocks Cluster

The stack ships with two pre-built cluster examples. Deploy either or both.

Option A: Shared-Data Cluster (S3 + CN)

The deploy script automatically substitutes the S3 bucket ID and region from Terraform outputs:

./examples/deploy-shared-data.sh

This creates a StarRocksCluster named starrocks-shared-data with:

  • 3× FE (m6g.2xlarge) for metadata + query coordination
  • 3× CN (r8gd.8xlarge) with 1.9TB local NVMe for data cache
  • S3-backed storage with automatic writes

Option B: Shared-Nothing Cluster (EBS + BE)

./examples/deploy-shared-nothing.sh

This creates a StarRocksCluster named starrocks-shared-nothing with:

  • 3× FE (m6g.2xlarge)
  • 3× BE (r8g.8xlarge) with 500Gi gp3-starrocks EBS PVCs per BE

Monitor cluster creation

kubectl get pods -n starrocks -w

Karpenter provisions new nodes as pods are scheduled — expect 3-5 minutes for nodes to come up.

Once all pods are Running, verify the cluster:

kubectl get starrockscluster -n starrocks

Expected:

NAME                       PHASE     FESTATUS   BESTATUS   CNSTATUS
starrocks-shared-data running running running
starrocks-shared-nothing running running running

Step 6: Connect to StarRocks

Via Port-Forward (local development)

# Shared-Data on port 9030
kubectl port-forward svc/starrocks-shared-data-fe-service 9030:9030 -n starrocks

# Shared-Nothing on port 9031 (different local port to avoid conflicts)
kubectl port-forward svc/starrocks-shared-nothing-fe-service 9031:9030 -n starrocks

Connect with any MySQL client:

mysql -h 127.0.0.1 -P 9030 -u root   # shared-data
mysql -h 127.0.0.1 -P 9031 -u root # shared-nothing

Via Internal NLB (in-VPC clients)

The shared-data FE service is exposed via an AWS NLB:

NLB=$(kubectl get svc starrocks-shared-data-fe-service -n starrocks \
-o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
mysql -h $NLB -P 9030 -u root

GUI Clients (Mac)

Any MySQL-compatible client works — examples:

ToolInstall
DBeaver (free)brew install --cask dbeaver-community
TablePlusbrew install --cask tableplus
Sequel Ace (free)brew install --cask sequel-ace

Create connections as MySQL with host 127.0.0.1, port 9030/9031, user root, no password.

Step 7: Access Grafana

# Port-forward Grafana
kubectl port-forward svc/monitoring-grafana 3000:80 -n monitoring

# Get admin password
kubectl get secret grafana-admin-secret -n monitoring \
-o jsonpath='{.data.admin-password}' | base64 -d

Open http://localhost:3000 (user: admin). Import the StarRocks dashboard from infra/terraform/grafana-dashboards/starrocks-dashboard.json. Select your Prometheus datasource and cluster name from the dropdowns — metrics should appear immediately.

Observing KEDA Autoscaling

# View the ScaledObject
kubectl get scaledobject -n starrocks

# View the underlying HPA that KEDA creates
kubectl get hpa -n starrocks

# View current replica count
kubectl get starrockscluster starrocks-shared-data -n starrocks \
-o jsonpath='{.status.starRocksCnStatus.failedInstances}'

# Watch CN pods scale
kubectl get pods -n starrocks -l app.kubernetes.io/component=cn -w

To test scale-to-zero behavior, let the cluster sit idle for >5 minutes (cooldown period). Then run a query — you should see CN scale from 0 → 1 within ~30-60 seconds.

Next Steps

Troubleshooting

Pods stuck in Pending state

kubectl describe pods -n starrocks
kubectl logs -n karpenter deployment/karpenter
kubectl get nodeclaims

Common causes:

  • Karpenter provisioning delay (3-5 min expected)
  • Instance type unavailable in the AZ (check Karpenter logs for InstanceTypeNotAvailable)
  • Node selector mismatch (e.g., requesting r8gd when Karpenter can only provision r7gd)

FE not connecting to S3

# Check Pod Identity is working
kubectl describe pod starrocks-shared-data-fe-0 -n starrocks | grep -A3 "Environment"

# Verify the ServiceAccount has the role
kubectl get sa starrocks-sa -n starrocks -o yaml

# Check FE logs for S3 errors
kubectl logs starrocks-shared-data-fe-0 -n starrocks | grep -i s3

KEDA ScaledObject not scaling

# Check the ScaledObject status
kubectl describe scaledobject starrocks-cn-scaler -n starrocks

# Check the HPA KEDA created
kubectl describe hpa -n starrocks

# Check KEDA operator logs
kubectl logs -n keda deployment/keda-operator

# Verify Prometheus is reachable from KEDA
kubectl exec -n keda deployment/keda-operator -- \
wget -qO- http://kube-prometheus-kube-prome-prometheus.kube-prometheus:9090/-/healthy

Cleanup

To remove all resources:

cd data-stacks/starrocks-on-eks
./cleanup.sh
warning

This will delete all resources including the EKS cluster, S3 buckets (with data), and VPC. Back up any important data before cleanup.