TPC-DS Benchmark on StarRocks
Run the industry-standard TPC-DS decision support benchmark on your StarRocks shared-data cluster to validate query performance at scale.
Overview
TPC-DS models a decision support system with complex queries covering reporting, ad-hoc, iterative OLAP, and data mining workloads. This guide walks through generating 1TB of TPC-DS data, loading it into StarRocks, and running all 99 benchmark queries.
What you'll do:
- Generate 1TB TPC-DS dataset
- Create TPC-DS schema and tables
- Load data via Stream Load
- Run 99 TPC-DS queries and collect results
- Observe CN auto-scaling under load
Prerequisites
- StarRocks cluster deployed following the Infrastructure Guide
- StarRocks shared-data cluster running (`kubectl get starrocksclusters -n starrocks`)
- `kubectl` configured for your EKS cluster
Verify your cluster is ready:
kubectl get pods -n starrocks
Expected output:
NAME READY STATUS RESTARTS AGE
kube-starrocks-operator-xxxxxxxxx-xxxxx 1/1 Running 0 1h
starrocks-shared-data-fe-0 1/1 Running 0 30m
starrocks-shared-data-cn-0 1/1 Running 0 28m
Step 1: Generate TPC-DS Data (1TB)
Create the PVC and data generator pod:
kubectl apply -f examples/tpcds-datagen-1tb.yaml
This creates:
- A 1Ti PVC (`tpcds-data-pvc`) for storing generated data
- A generator job (`tpcds-datagen-1tb`) running on a memory-optimized Karpenter node with the `dsdgen` tool
Monitor progress:
kubectl logs -f job/tpcds-datagen-1tb -n starrocks
Data generation for TPC-DS scale factor 1000 (SF1000) takes approximately 8-10 hours depending on instance type. The "1TB" refers to the scale factor — the actual generated .dat files are ~917GB of pipe-delimited text across ~10000 files. The pod exits when complete; the data persists on the PVC for loading.
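Parallel `dsdgen` runs split each large table into per-chunk files named `<table>_<chunk>_<parallel>.dat`. A minimal sketch of that naming scheme, inferred from the file names in this guide's loader logs (the `chunk_files` helper is hypothetical, for illustration only):

```python
# Hypothetical helper: enumerate the per-chunk .dat files a parallel dsdgen
# run (e.g. -PARALLEL 1000) produces for one large table. Naming pattern
# <table>_<chunk>_<parallel>.dat is taken from the loader logs in this guide.
def chunk_files(table, parallel=1000):
    return [f"{table}_{i}_{parallel}.dat" for i in range(1, parallel + 1)]

# Chunk 215 of 1000 for catalog_returns, as seen in the loader output
print(chunk_files("catalog_returns")[214])  # catalog_returns_215_1000.dat
```

Note that small dimension tables may produce only a single chunk regardless of the parallelism setting.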
Step 2: Create TPC-DS Tables
Apply the table creation ConfigMap:
kubectl apply -f examples/tpcds-tables-configmap.yaml
Create the database:
kubectl run mysql-client -n starrocks --rm -i --restart=Never --image=mysql:8.0 -- \
mysql -hstarrocks-shared-data-fe-service -P9030 -uroot -e "CREATE DATABASE IF NOT EXISTS tpcds;"
Create all 24 TPC-DS tables:
kubectl get configmap tpcds-tables-sql -n starrocks -o jsonpath='{.data.tpcds_create\.sql}' | \
kubectl run mysql-create -n starrocks --rm -i --restart=Never --image=mysql:8.0 -- \
mysql -hstarrocks-shared-data-fe-service -P9030 -uroot
Verify tables were created:
kubectl run mysql-show -n starrocks --rm -i --restart=Never --image=mysql:8.0 -- \
mysql -hstarrocks-shared-data-fe-service -P9030 -uroot tpcds -e "SHOW TABLES;"
Expected output:
+------------------------+
| Tables_in_tpcds        |
+------------------------+
| call_center            |
| catalog_page           |
| catalog_returns        |
| catalog_sales          |
| customer               |
| customer_address       |
| customer_demographics  |
| date_dim               |
| household_demographics |
| income_band            |
| inventory              |
| item                   |
| promotion              |
| reason                 |
| ship_mode              |
| store                  |
| store_returns          |
| store_sales            |
| time_dim               |
| warehouse              |
| web_page               |
| web_returns            |
| web_sales              |
| web_site               |
+------------------------+
Step 3: Load Data via Stream Load
Run the data loader job:
kubectl apply -f examples/tpcds-data-loader.yaml
The job uses StarRocks Stream Load — an HTTP-based bulk loading API that sends each .dat file via PUT request to the FE endpoint. In shared-data mode, data flows through FE → CN → S3. Key settings:
- `column_separator: |` — pipe-delimited `.dat` files
- `max_filter_ratio: 0.3` — tolerates up to 30% bad rows per file
- `timeout: 600` — 10-minute timeout per file
- `strict_mode: false` — lenient type conversion
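Under the hood, each file becomes one HTTP PUT against the FE's Stream Load endpoint. A minimal Python sketch of the request per file, assuming the FE service name from this guide and StarRocks' default FE HTTP port 8030 (the `stream_load_request` helper and the label scheme are illustrative, not the loader job's actual code):

```python
# Sketch of the Stream Load PUT request issued per .dat file.
# FE host and the "tpcds" database follow this guide; 8030 is the FE's
# default HTTP port (distinct from the 9030 MySQL port used above).
import os

FE = "starrocks-shared-data-fe-service:8030"

def stream_load_request(db, table, path):
    url = f"http://{FE}/api/{db}/{table}/_stream_load"
    headers = {
        "Expect": "100-continue",                      # required by Stream Load
        "label": f"{table}_{os.path.basename(path)}",  # idempotency label (example scheme)
        "column_separator": "|",
        "max_filter_ratio": "0.3",
        "timeout": "600",
        "strict_mode": "false",
    }
    return url, headers

url, headers = stream_load_request(
    "tpcds", "store_sales", "/data/tpcds-1tb/store_sales_1_1000.dat")
print(url)
```

Any HTTP client that can follow the FE's redirect to a CN works here; with `curl` that means `--location-trusted` plus the headers above.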
Monitor progress:
kubectl logs -f job/tpcds-data-loader -n starrocks
Example output:
Loading file: /data/tpcds-1tb/catalog_returns_215_1000.dat
{
"TxnId": 8533,
"Label": "catalog_returns_1775615706652237951",
"Table": "catalog_returns",
"Status": "Success",
"Message": "OK",
"NumberTotalRows": 143761,
"NumberLoadedRows": 143761,
"NumberFilteredRows": 0,
"LoadBytes": 23308622,
"LoadTimeMs": 1065
}
Loading file: /data/tpcds-1tb/catalog_returns_216_1000.dat
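A loader can validate each response before moving on to the next file. A small sketch of those checks, using the sample response above (the derived throughput metric is illustrative):

```python
# Sketch: per-file checks on a Stream Load response, using the sample above.
import json

resp = json.loads("""{
  "TxnId": 8533,
  "Label": "catalog_returns_1775615706652237951",
  "Table": "catalog_returns",
  "Status": "Success",
  "Message": "OK",
  "NumberTotalRows": 143761,
  "NumberLoadedRows": 143761,
  "NumberFilteredRows": 0,
  "LoadBytes": 23308622,
  "LoadTimeMs": 1065
}""")

assert resp["Status"] == "Success", resp.get("Message")
ratio = resp["NumberFilteredRows"] / resp["NumberTotalRows"]   # stayed under max_filter_ratio
mb_s = resp["LoadBytes"] / 1024 / 1024 / (resp["LoadTimeMs"] / 1000)
print(f"filtered={ratio:.1%} throughput={mb_s:.0f} MB/s")
```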
After loading completes, verify row counts for the largest tables:
kubectl run mysql-verify -n starrocks --rm -i --restart=Never --image=mysql:8.0 -- \
mysql -hstarrocks-shared-data-fe-service -P9030 -uroot tpcds -e "
SELECT 'store_sales' as tbl, count(*) as cnt FROM store_sales
UNION ALL SELECT 'catalog_sales', count(*) FROM catalog_sales
UNION ALL SELECT 'web_sales', count(*) FROM web_sales
UNION ALL SELECT 'inventory', count(*) FROM inventory
UNION ALL SELECT 'customer', count(*) FROM customer;
"
Expected output:
tbl cnt
store_sales 2750394987
catalog_sales 1429178251
web_sales 719730368
inventory 783000000
customer 12000000
Loading ~537GB of data (SF1000) across ~5900 files takes approximately 1-2 hours depending on CN node count and instance type.
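As a rough sanity check, those figures imply the following aggregate load throughput (back-of-envelope only; real throughput varies with CN count and instance type):

```python
# Back-of-envelope: aggregate throughput implied by ~537 GB in 1-2 hours.
gb = 537
for hours in (1, 2):
    print(f"{hours}h -> {gb * 1024 / (hours * 3600):.0f} MB/s")
```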
Step 4: Run TPC-DS Benchmark
Apply the benchmark job (includes all 99 TPC-DS queries as a ConfigMap):
kubectl apply -f examples/tpcds-benchmark.yaml
Monitor the benchmark:
kubectl logs -f job/tpcds-benchmark -n starrocks
Each query has a 600-second timeout. The job produces a summary report at the end:
query01: SUCCESS (6223ms)
query02: SUCCESS (12245ms)
query03: SUCCESS (12936ms)
query04: SUCCESS (20307ms)
...
query95: SUCCESS (344360ms)
query96: SUCCESS (12043ms)
query97: SUCCESS (7944ms)
query98: SUCCESS (6980ms)
query99: SUCCESS (2383ms)
Summary:
Total Queries: 99
Successful: 99
Failed: 0
Total Time: 719127ms
Average Time: 7263ms
All 99 TPC-DS queries completed successfully with 10 CN nodes. Average query time was ~7.3 seconds. Query 95 was the slowest at ~344 seconds due to its complex multi-way join pattern.
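The summary figures are straightforward to derive from the per-query lines. A minimal sketch (the parsing logic here is an assumption, not the benchmark job's actual code, and the three sample lines are excerpted from the report above):

```python
# Sketch: deriving summary stats from per-query report lines.
import re

report = """query01: SUCCESS (6223ms)
query95: SUCCESS (344360ms)
query99: SUCCESS (2383ms)"""

times = [int(m.group(1)) for m in re.finditer(r"SUCCESS \((\d+)ms\)", report)]
print(f"Successful: {len(times)}")
print(f"Total Time: {sum(times)}ms")
print(f"Average Time: {sum(times) // len(times)}ms")
```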
Step 5: Observe Auto-Scaling
While the benchmark is running, observe CN auto-scaling in a separate terminal:
# Watch HPA scaling CN replicas
kubectl get hpa -n starrocks -w
Expected output:
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS
starrocks-shared-data-cn-autoscaler StarRocksCluster/starrocks-shared-data cpu: 6%/60%, memory: 54%/60% 1 10 10
# Watch CN pods scaling up
kubectl get pods -n starrocks -w
Expected output:
NAME READY STATUS AGE
starrocks-shared-data-cn-0 1/1 Running 42h
starrocks-shared-data-cn-1 1/1 Running 20h
starrocks-shared-data-cn-2 1/1 Running 15h
starrocks-shared-data-cn-3 1/1 Running 3h
starrocks-shared-data-cn-4 1/1 Running 17m
starrocks-shared-data-cn-5 1/1 Running 17m
starrocks-shared-data-cn-6 0/1 Pending 16m
starrocks-shared-data-cn-7 0/1 Pending 16m
starrocks-shared-data-cn-8 0/1 Pending 14m
starrocks-shared-data-cn-9 0/1 Pending 14m
The CN HPA scales from 1 to 10 replicas based on CPU and memory utilization (target: 60%). Karpenter automatically provisions new nodes as CN pods scale up.
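The scaling behavior follows Kubernetes' standard HPA rule: desired replicas is the maximum, across metrics, of ceil(currentReplicas × usage / target), clamped to the min/max bounds. A simplified sketch (real HPA additionally applies a tolerance band and stabilization windows):

```python
# Simplified HPA desired-replicas formula, with this guide's 1-10 bounds.
import math

def desired_replicas(current, metrics, lo=1, hi=10):
    # metrics: (usage_pct, target_pct) pairs, e.g. CPU and memory vs. the 60% target
    want = max(math.ceil(current * usage / target) for usage, target in metrics)
    return min(max(want, lo), hi)

# e.g. 4 replicas at 90% CPU and 30% memory against a 60% target -> scale to 6
print(desired_replicas(4, [(90, 60), (30, 60)]))
```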
Cleanup
To remove all resources:
cd data-on-eks/data-stacks/starrocks-on-eks/terraform/_local
./cleanup.sh
This will delete all resources including:
- EKS cluster and all workloads
- S3 buckets (StarRocks data)
- VPC and networking resources
Ensure you've backed up any important data before cleanup.