Skip to main content

Apache Spark with Apache Gluten + Velox Benchmarks

Apache Spark powers much of today’s large-scale analytics, but its default SQL engine is still JVM-bound and row-oriented. Even with Project Tungsten’s code generation and vectorized readers, operators often pay heavy costs for Java object creation, garbage collection, and row-to-column conversions. These costs become visible on analytic workloads that scan large Parquet or ORC tables, perform wide joins, or run memory-intensive aggregations—leading to slower queries and inefficient CPU use.

Modern C++ engines such as Velox, ClickHouse, and DuckDB show that SIMD-optimized, cache-aware vectorization can process the same data far faster. But replacing Spark is impractical given its ecosystem and scheduling model. Apache Gluten solves this by translating Spark SQL plans into the open Substrait IR and offloading execution to a native C++ backend (Velox, ClickHouse, etc.). This approach keeps Spark’s APIs and Kubernetes deployment model while accelerating the CPU-bound SQL layer—the focus of this deep dive and benchmark study on Amazon EKS.

In this guide you will:

  • Understand how the Spark + Gluten + Velox stack is assembled on Amazon EKS
  • Review TPC-DS 3TB benchmark results against native Spark on Graviton4 (r8gd.12xlarge)
  • Learn the configuration, deployment, and troubleshooting steps required to reproduce the study
TL;DR

Our TPC-DS 3TB benchmark on Amazon EKS (r8gd.12xlarge Graviton4) shows that Apache Gluten + Velox v1.6.0 delivered:

  • 39% faster overall on Parquet — total runtime dropped from 3,650.56s to 2,239.93s (1.63× speedup).
  • 81 of 103 queries improved, with peak speedups up to 4.36× on q93. 22 queries regressed, most within 20% (one outlier on q72).
  • Spark 3.5.8 on both sides; only the execution engine and related Spark settings differed.

TPC-DS 3TB Benchmark Results

Summary

Our TPC-DS 3TB benchmark on Amazon EKS demonstrates that Apache Gluten + Velox v1.6.0 delivers a 1.63× overall speedup compared to native Spark SQL on Parquet, with individual queries varying from ~77% faster to ~46% slower (excluding one outlier on q72).

NameCompletion Time (seconds)Speedup
Native Spark3,650.56Baseline
Gluten + Velox v1.6.02,239.931.63× (39% faster)

📊 View complete benchmark results and raw data →

Benchmark Infrastructure

Benchmark Methodology

Benchmarks ran sequentially on the same cluster to ensure identical hardware and eliminate resource contention. Native Spark executed first, followed by Gluten + Velox.

To ensure an apples-to-apples comparison, both native Spark and Gluten + Velox jobs ran on identical hardware, storage, and data. Only the execution engine and related Spark settings differed.

Test Environment

ComponentConfiguration
EKS ClusterAmazon EKS 1.34
Node Instance Typer8gd.12xlarge (Graviton4, 48 vCPUs, 384GB RAM, 1.8TB NVMe SSD)
Node Group12 nodes dedicated for benchmark workloads
Executor Configuration23 executors × 5 cores × 20GB on-heap + 32GB off-heap
Driver Configuration5 cores × 20GB RAM
DatasetTPC-DS 3TB (Parquet format)
StorageAmazon S3 with optimized S3A connector
Iterations5 (median runtime per query)

Spark Configuration Comparison

ConfigurationNative SparkGluten + Velox
Spark Version3.5.83.5.8
Gluten VersionN/A1.6.0
Java RuntimeOpenJDK 17OpenJDK 17
Execution EngineJVM-based TungstenNative C++ Velox
Key PluginsStandard SparkGlutenPlugin, ColumnarShuffleManager
Off-heap MemoryDefault32GB enabled
Vectorized ProcessingLimited Java SIMDFull C++ vectorization (ARM Neon)
Memory ManagementJVM GCUnified native + JVM

Critical Gluten-Specific Configurations

# Essential Gluten Plugin Configuration
"spark.plugins": "org.apache.gluten.GlutenPlugin"
"spark.shuffle.manager": "org.apache.spark.shuffle.sort.ColumnarShuffleManager"

# Memory Configuration - Critical for Gluten
"spark.memory.offHeap.enabled": "true"
"spark.memory.offHeap.size": "32g"

# Java 17 Compatibility (Required)
"spark.driver.extraJavaOptions": "--add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/sun.misc=ALL-UNNAMED --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED"
"spark.executor.extraJavaOptions": "--add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/sun.misc=ALL-UNNAMED --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED"

# AWS-Specific: Required for S3 region detection
"spark.hadoop.fs.s3a.endpoint.region": "us-west-2"

Performance Results

Overall Performance

NameCompletion Time (seconds)Performance
Native Spark3,650.56Baseline
Gluten + Velox v1.6.02,239.931.63× (39% faster)

Performance Distribution

Performance RangeQuery CountPercentage
20%+ improvement5755%
10-20% improvement1616%
±10%1616%
10-20% degradation77%
20%+ degradation77%

Top 10 Query Improvements

QueryNative Spark (s)Gluten + Velox v1.6.0 (s)Speedup
q93-v4.0207.947.74.36× (+77%)
q23a-v4.0409.699.24.13× (+76%)
q23b-v4.0412.2113.73.62× (+72%)
q5-v4.040.811.53.56× (+72%)
q97-v4.043.313.13.30× (+70%)
q78-v4.0136.046.42.93× (+66%)
q59-v4.026.49.22.89× (+65%)
q65-v4.043.615.52.82× (+65%)
q86-v4.012.84.72.75× (+64%)
q35-v4.020.08.02.50× (+60%)

Top 10 Query Regressions

QueryNative Spark (s)Gluten + Velox v1.6.0 (s)Degradation
q72-v4.040.4193.5378% slower (0.21×)
q76-v4.041.961.246% slower (0.68×)
q77-v4.02.53.438% slower (0.72×)
q13-v4.08.811.936% slower (0.74×)
q27-v4.07.49.933% slower (0.75×)
q95-v4.085.9106.724% slower (0.81×)
q52-v4.01.51.921% slower (0.83×)
q3-v4.04.15.020% slower (0.83×)
q92-v4.01.82.219% slower (0.84×)
q55-v4.01.41.617% slower (0.85×)

Notes on Performance Differences

Gluten + Velox Outperforms Most Queries:

Example - Query 93 (+77%)

  • The hot 200-task join stage went from p50 1m16.378s → 6.523s (~12× faster) after the SortMergeJoin was replaced by ShuffledHashJoinExecTransformer and the two pre-join Sorts were eliminated.
  • Per-task peak memory on the same stage dropped from 3.3 GB (on-heap) to 80 MB (off-heap), with JVM GC collapsing from 3.628s to 35ms.
  • ColumnarShuffleManager produced ~24% smaller serialized shuffle bytes (214.1 GB → 162.2 GB) at identical record counts, reducing both network and disk pressure on downstream stages.

Example - Query 23a (+76%)

  • Identical job/stage/task counts (16/16/9790) and near-identical input on both runs — Catalyst chose the same physical plan, only operator implementations differ.
  • All 9 SortMergeJoins were replaced by ShuffledHashJoinExecTransformer and all 18 pre-join Sort operators were eliminated; the hot 200-task join stage ran 6.86× faster (2m47.28s → 24.358s).
  • Columnar shuffle reduced serialized bytes 36.5% at identical record counts (16.6B records); JVM GC dropped 99.9% (3m19.058s → 200ms).

Gluten + Velox Underperforms on Some Queries:

Example - Query 72 (-378%)

  • 152s of the 152.8s regression is concentrated in a single ShuffledHashJoin probe stage at identical 200 tasks and 1.01B shuffle records — no spill, no fallback, GC near zero.
  • Velox materialises 22.5M small column-vector batches (avg ~222 rows/batch — far below the 4,096-row target) through 9 cascaded downstream operators after the probe, paying per-batch operator overhead at every step.
  • Spark's whole-stage codegen handles the same 4.99B intermediate row stream as a single fused JVM function that never materialises it. This is a structural cost of Velox not implementing SortMergeJoin, not a bug.

Example - Query 13 (-36%)

  • Final SortMergeJoin → ShuffledHashJoinExecTransformer swap adds +522% to per-task p50 on the 66-task probe stage; the hash table is probed against a complex 3-way disjunctive non-equi predicate per match.
  • Per-task off-heap peak memory rose from 42 KB → 160 MB on the store_sales scan because Velox's columnar batches are less compact than Spark's row encoder for this 9-column int+decimal payload.
  • GC elimination saves only ~1.8s of wall time on this query — far less than the +3.2s per-task cost of the SHJ probe with disjunctive predicates, so the swap is a net loss when pre-sorted streams already exist.

Resource Usage Analysis

Both benchmarks ran sequentially on identical hardware (12× r8gd.12xlarge) to enable direct comparison. The dashboards below capture cluster-wide resource utilization during the full TPC-DS 3TB run for each engine.

CPU Utilization

Native Spark (Default)CPU utilization - Native Spark
Gluten + VeloxCPU utilization - Gluten + Velox

Memory Utilization

Native Spark (Default)Memory utilization - Native Spark
Gluten + VeloxMemory utilization - Gluten + Velox

Network Bandwidth

Native Spark (Default)Network bandwidth - Native Spark
Gluten + VeloxNetwork bandwidth - Gluten + Velox

Storage I/O

Native Spark (Default)Storage IOPS - Native Spark
Gluten + VeloxStorage IOPS - Gluten + Velox
Full Grafana Dashboard — Native Spark (Default)
Full Grafana dashboard - Native Spark
Full Grafana Dashboard — Gluten + Velox
Full Grafana dashboard - Gluten + Velox

When to Consider Gluten + Velox

Gluten + Velox may be a good fit for:

  • Compute-bound TPC-DS-style analytics — Workloads with wide joins, complex aggregations, and large scans (q5, q23, q78, q93, q97 in our run all sped up by 2.5–4.4×).
  • CPU-efficient batch pipelines on Graviton — The Velox backend takes advantage of ARM Neon SIMD on r8gd-class instances; off-heap columnar execution reduces JVM GC pressure.
  • Drop-in acceleration without code changes — Existing Spark SQL/DataFrame code runs unchanged; unsupported operators automatically fall back to Spark's JVM engine.

Validate before adopting in production:

  • Test your query mix — A small set of queries can regress (q72 in particular). Compare physical plans for SortMergeJoin → ShuffledHashJoin swaps and look for high-cardinality intermediate streams that Spark's whole-stage codegen would have pipelined without materialization.
  • Tune off-heap memory — Velox needs significant off-heap allocation (this benchmark used 32GB). Plan for 20–30% of executor memory off-heap.
  • Check Java 17 compatibility flags — The --add-opens JVM options are required.
Historic Results — TPC-DS 1TB on c5d.12xlarge

The interactive dashboard and tables below capture an earlier benchmark run on a different instance class (c5d.12xlarge, x86) and a smaller dataset (1TB). It is preserved here for context and ecosystem comparisons.

Interactive Performance Dashboard

We benchmarked TPC-DS 1TB workloads on a dedicated Amazon EKS cluster to compare native Spark SQL execution with Spark enhanced by Gluten and the Velox backend. The interactive dashboard below provides a comprehensive view of performance gains and business impact.

🚀
1.72x
Overall Performance Gain
72% faster execution across all 104 TPC-DS queries
💰
42%
Cost Reduction Potential
Direct correlation between performance improvement and compute costs
📊
86.5%
Success Rate
90 out of 104 queries improved, only 14 showed degradation
⏱️
42 min
Time Saved
42 minutes saved on TPC-DS 1TB benchmark suite (1.7h → 1.0h)

Performance Comparison: Runtime Analysis

Query Speedup Distribution

Top 10 Performance Improvements

Performance Improvement Distribution
90
Improved
14
Degraded
1.77x
Median
86.5%
Success Rate
Performance Categories
Excellent (3x+): 15 queriesGood (2x-3x): 25 queriesModerate (1.5x-2x): 23 queriesSlight (1x-1.5x): 27 queriesDegraded (<1x): 14 queries
Query Execution Time: Spark vs Gluten+Velox (Top 30 Queries by Improvement)

🔍 Performance Analysis Insights

  • Complex Analytical Queries: Queries with heavy joins and aggregations (q93, q49, q50) show the highest improvements (3.8x-5.6x)
  • Scan-Heavy Operations: Large table scans benefit significantly from native columnar processing
  • Vectorization Benefits: Mathematical operations and filters see consistent 2x-3x improvements
  • Memory-Intensive Queries: Queries like q23b (146s→52s) demonstrate native memory management advantages
  • Edge Cases: 14 queries showed degradation, primarily those with simple operations where JNI overhead exceeded benefits
  • Cost Savings: 69.8% reduction in execution time translates to ~42% lower compute costs on EKS

Summary

Our comprehensive TPC-DS 1TB benchmark on Amazon EKS demonstrates that Apache Gluten with Velox delivers a 1.72x overall speedup (72% faster) compared to native Spark SQL, with individual queries showing improvements ranging from 1.1x to 5.5x.

Benchmark Infrastructure Configuration

To ensure an apples-to-apples comparison, both native Spark and Gluten + Velox jobs ran on identical hardware, storage, and data. Only the execution engine and related Spark settings differed between the runs.

Test Environment Specifications

ComponentConfiguration
EKS ClusterAmazon EKS 1.33
Node Instance Typec5d.12xlarge (48 vCPUs, 96GB RAM, 1.8TB NVMe SSD)
Node Group8 nodes dedicated for benchmark workloads
Executor Configuration23 executors × 5 cores × 20GB RAM each
Driver Configuration5 cores × 20GB RAM
DatasetTPC-DS 1TB (Parquet format)
StorageAmazon S3 with optimized S3A connector

Spark Configuration Comparison

ConfigurationNative SparkGluten + Velox
Spark Version3.5.33.5.2
Java RuntimeOpenJDK 17OpenJDK 17
Execution EngineJVM-based TungstenNative C++ Velox
Key PluginsStandard SparkGlutenPlugin, ColumnarShuffleManager
Off-heap MemoryDefault2GB enabled
Vectorized ProcessingLimited Java SIMDFull C++ vectorization
Memory ManagementJVM GCUnified native + JVM

Critical Gluten-Specific Configurations

# Essential Gluten Plugin Configuration
spark.plugins: "org.apache.gluten.GlutenPlugin"
spark.shuffle.manager: "org.apache.spark.shuffle.sort.ColumnarShuffleManager"
spark.memory.offHeap.enabled: "true"
spark.memory.offHeap.size: "2g"

# Java 17 Compatibility for Gluten-Velox
spark.driver.extraJavaOptions: "--add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/sun.misc=ALL-UNNAMED"
spark.executor.extraJavaOptions: "--add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/sun.misc=ALL-UNNAMED"

Performance Analysis: Top 20 Query Improvements

Gluten’s native execution path shines on wide, compute-heavy SQL. The table highlights the largest gains across the 104 TPC-DS queries, comparing median runtimes over multiple iterations.

RankTPC-DS QueryNative Spark (s)Gluten + Velox (s)Speedup% Improvement
1q93-v2.480.1814.635.48×448.1%
2q49-v2.425.686.663.86×285.5%
3q50-v2.438.5710.003.86×285.5%
4q59-v2.417.574.823.65×264.8%
5q5-v2.423.186.423.61×261.4%
6q62-v2.49.412.883.27×227.0%
7q97-v2.418.685.993.12×211.7%
8q40-v2.415.175.053.00×200.2%
9q90-v2.412.054.212.86×186.2%
10q23b-v2.4147.1752.962.78×177.9%
11q29-v2.417.336.452.69×168.7%
12q9-v2.460.9023.032.64×164.5%
13q96-v2.49.193.552.59×158.8%
14q84-v2.47.993.122.56×156.1%
15q6-v2.49.873.872.55×155.3%
16q99-v2.49.703.812.55×154.6%
17q43-v2.44.701.872.51×151.1%
18q65-v2.417.517.002.50×150.2%
19q88-v2.450.9020.692.46×146.1%
20q44-v2.422.909.362.45×144.7%

Speedup Distribution Across Queries

Speedup RangeCount% of Total (≈97 queries)
≥ 3× and < 5×9≈ 9%
≥ 2× and < 3×29≈ 30%
≥ 1.5× and < 2×30≈ 31%
≥ 1× and < 1.5×21≈ 22%
< 1× (slower with Gluten)8≈ 8%

Key Performance Insights

DimensionInsightImpact
Aggregate Gains
  • Total runtime dropped from 1.7 hours to 1.0 hour (42 minutes saved)
  • Overall speedup of 1.72× across the TPC-DS suite
  • Peak single-query speedup of 5.48× (q93-v2.4)
  • Shorter batch windows and faster SLAs
  • Operational stability preserved via seamless Spark fallbacks
Query Patterns
  • Complex analytical queries accelerate by 3×-5.5×
  • Join-heavy workloads benefit from Velox hash joins
  • Aggregations and scans see consistent 2×-3× improvements
  • Prioritize Gluten adoption for compute-bound SQL pipelines
  • Plan for faster dimensional modeling and BI refreshes
Resource Utilization
  • CPU efficiency improves by ~72%
  • Unified native memory dramatically reduces GC pressure
  • Columnar shuffle + native readers boost I/O throughput
  • Lower infrastructure spend for the same workload
  • Smoother execution with fewer GC pauses
  • More predictable runtimes under heavy data scans

Business Impact Assessment

Cost Optimization Summary

note

With a 1.72× speedup, organizations can achieve:

  • ≈42% lower compute spend for batch processing workloads
  • Faster time-to-insight for business-critical analytics
  • Higher cluster utilization through reduced job runtimes

Operational Benefits

tip
  • Minimal migration effort: Drop-in plugin with existing Spark SQL code
  • Production-ready reliability preserves operational stability
  • Kubernetes-native integration keeps parity with existing EKS data platforms

Technical Recommendations

When to Deploy Gluten + Velox

  • High-Volume Analytics: TPC-DS-style complex queries with joins and aggregations
  • Cost-Sensitive Workloads: Where 40%+ compute cost reduction justifies integration effort
  • Performance-Critical Pipelines: SLA-driven workloads requiring faster execution

Implementation Considerations

  • Query Compatibility: Test edge cases in your specific workload patterns
  • Memory Tuning: Optimize off-heap allocation based on data characteristics
  • Monitoring: Leverage native metrics for performance debugging and optimization

The benchmark results demonstrate that Gluten + Velox represents a significant leap forward in Spark SQL performance, delivering production-ready native acceleration without sacrificing Spark's distributed computing advantages.

Architecture Overview — Apache Spark vs. Apache Spark with Gluten + Velox

Understanding how Gluten intercepts Spark plans clarifies why certain workloads accelerate so sharply. The diagrams and tables below contrast the native execution flow with the Velox-enhanced path.

Execution Path Comparison

Memory & Processing Comparison

AspectNative SparkGluten + VeloxImpact
Memory ModelJVM heap objectsApache Arrow off-heap columnarJVM GC time dropped ~99.7% in our 3TB run (2h47m → 27s)
ProcessingRow-by-row iteration via whole-stage codegenSIMD vectorized columnar batchesHigher per-cycle throughput on scan/filter/aggregate
CPU CacheMixed locality across row layoutsCache-friendly column layoutsBetter cache reuse on wide scans and aggregations
ShuffleUnsafeRow + SnappyColumnarShuffleManager (Arrow-encoded)22–25% smaller shuffle bytes overall (11.3 TB → 8.7 TB read)

What Is Apache Gluten — Why It Matters

Apache Gluten is a middleware layer that offloads Spark SQL execution from the JVM to high-performance native execution engines. For data engineers, this means:

Core Technical Benefits

  1. Zero Application Changes: Existing Spark SQL and DataFrame code works unchanged
  2. Automatic Fallback: Unsupported operations gracefully fall back to native Spark
  3. Cross-Engine Compatibility: Uses Substrait as intermediate representation
  4. Production Ready: Handles complex enterprise workloads without code changes

Gluten Plugin Architecture

Key Configuration Parameters

# Essential Gluten Configuration
sparkConf:
# Core Plugin Activation
"spark.plugins": "org.apache.gluten.GlutenPlugin"
"spark.shuffle.manager": "org.apache.spark.shuffle.sort.ColumnarShuffleManager"

# Memory Configuration
"spark.memory.offHeap.enabled": "true"
"spark.memory.offHeap.size": "4g" # Critical for Velox performance

# Fallback Control
"spark.gluten.sql.columnar.backend.velox.enabled": "true"
"spark.gluten.sql.columnar.forceShuffledHashJoin": "true"

What Is Velox — Why Gluten Needs It (Alternatives)

Velox is Meta's C++ vectorized execution engine optimized for analytical workloads. It serves as the computational backend for Gluten, providing:

Velox Core Components

LayerComponentPurpose
OperatorsFilter, Project, Aggregate, JoinVectorized SQL operations
ExpressionsVector functions, Type systemSIMD-optimized computations
MemoryApache Arrow buffers, Custom allocatorsCache-efficient data layout
I/OParquet/ORC readers, CompressionHigh-throughput data ingestion
CPUAVX2/AVX-512, ARM NeonHardware-accelerated processing

Velox vs Alternative Backends

FeatureVeloxClickHouseApache Arrow DataFusion
LanguageC++C++Rust
SIMD SupportAVX2/AVX-512/NeonAVX2/AVX-512Limited
Memory ModelApache Arrow ColumnarNative ColumnarApache Arrow Native
Spark IntegrationNative via GlutenVia GlutenExperimental
PerformanceExcellentExcellentGood
MaturityProduction (Meta)ProductionDeveloping

Configuring Spark + Gluten + Velox

The instructions in this section walk through the baseline artifacts you need to build an image, configure Spark defaults, and deploy workloads on the Spark Operator.

Docker Image Configuration

Create a production-ready Spark image with Gluten + Velox:

You can find the sample Dockerfile here: Dockerfile-spark-gluten-velox

Spark Configuration Examples

Use the templates below to bootstrap both shared Spark defaults and a sample SparkApplication manifest.

# spark-defaults.conf - Optimized for Gluten + Velox

# Core Gluten Configuration
spark.plugins org.apache.gluten.GlutenPlugin
spark.shuffle.manager org.apache.spark.shuffle.sort.ColumnarShuffleManager

# Memory Configuration - Critical for Performance
spark.memory.offHeap.enabled true
spark.memory.offHeap.size 4g
spark.executor.memoryFraction 0.8
spark.executor.memory 20g
spark.executor.memoryOverhead 6g

# Velox-specific Optimizations
spark.gluten.sql.columnar.backend.velox.enabled true
spark.gluten.sql.columnar.forceShuffledHashJoin true
spark.gluten.sql.columnar.backend.velox.bloom_filter.enabled true

# Java 17 Module Access (Required)
spark.driver.extraJavaOptions --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/sun.misc=ALL-UNNAMED
spark.executor.extraJavaOptions --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/sun.misc=ALL-UNNAMED

# Adaptive Query Execution
spark.sql.adaptive.enabled true
spark.sql.adaptive.coalescePartitions.enabled true
spark.sql.adaptive.skewJoin.enabled true

# S3 Optimizations
spark.hadoop.fs.s3a.fast.upload.buffer disk
spark.hadoop.fs.s3a.multipart.size 128M
spark.hadoop.fs.s3a.connection.maximum 200
Why these defaults?
  • spark.plugins activates the Apache Gluten runtime so query plans can offload to Velox.
  • Off-heap configuration reserves Arrow buffers that prevent JVM garbage collection pressure.
  • Adaptive query execution settings keep shuffle partitions balanced under both native and Gluten runs.
  • S3 connector tuning avoids bottlenecks when scanning the TPC-DS dataset from Amazon S3.

Running Benchmarks

Follow the workflow below to reproduce the benchmark from data generation through post-run analysis.

TPC-DS Benchmark Setup

The complete TPC-DS harness is available in the repository: examples/benchmark/tpcds-benchmark-spark-gluten-velox/README.md.

Step 1: Generate TPC-DS Data

Follow this link to generate the test data in S3 bucket. The benchmark in this guide uses the 3TB scale.

Step 2: Submit Native & Gluten Jobs

Prerequisites

Before submitting benchmark jobs, ensure:

  1. S3 Bucket is configured: Export the S3 bucket name from your Terraform outputs
  2. Benchmark data is available: Verify TPC-DS data (e.g., 3TB) exists in the same S3 bucket

Export S3 bucket name from Terraform outputs:

Export S3 bucket variable
# Get S3 bucket name from Terraform outputs
export S3_BUCKET=$(terraform -chdir=path/to/your/terraform output -raw s3_bucket_id_data)

# Verify the bucket and data exist
aws s3 ls s3://$S3_BUCKET/TPCDS-TEST-3TB/

Submit benchmark jobs:

The Native Spark baseline manifest lives next to the Comet benchmark assets and is reused as the apples-to-apples baseline for both engines. The Gluten + Velox manifests are split by CPU architecture.

Submit native Spark benchmark
cd data-stacks/spark-on-eks/benchmarks/datafusion-comet-benchmarks
envsubst < tpcds-benchmark-parquet.yaml | kubectl apply -f -

Step 3: Monitor Benchmark Progress

Check SparkApplication status
kubectl get sparkapplications -n spark-team-a

Step 4: Spark History Server Analysis

Access detailed execution plans and metrics:

Open Spark History Server locally
kubectl port-forward svc/spark-history-server 18080:80 -n spark-history-server
Navigation Checklist
  • Point your browser to http://localhost:18080.
  • Locate both spark-<ID>-native and spark-<ID>-gluten applications.
  • In the Spark UI, inspect:
    1. SQL tab execution plans
    2. Presence of WholeStageTransformer stages in Gluten jobs
    3. Stage execution times across both runs
    4. Executor metrics for off-heap memory usage

Step 5: Summarize Findings

tip
  • Export runtime metrics from the Spark UI or event logs for both jobs.
  • Capture query-level comparisons (duration, stage counts, fallbacks) to document where Gluten accelerated or regressed.
  • Feed the results into cost or capacity planning discussions—speedups translate directly into smaller clusters or faster SLA achievement.

Key Metrics to Analyze

tip

As you compare native and Gluten runs, focus on the following signals:

  1. Query Plan Differences:

    • Native: WholeStageCodegen stages
    • Gluten: WholeStageTransformer stages
  2. Memory Usage Patterns:

    • Native: High on-heap usage, frequent GC
    • Gluten: Off-heap Arrow buffers, minimal GC
  3. CPU Utilization:

    • Native: 60-70% efficiency
    • Gluten: 80-90+ % efficiency with SIMD

Performance Analysis and Pitfalls

Gluten reduces friction for Spark adopters, but a few tuning habits help avoid regressions. Use the notes below as a checklist during rollout.

Common Configuration Pitfalls

caution
# ❌ WRONG - Insufficient off-heap memory
"spark.memory.offHeap.size": "512m" # Too small for real workloads

# ✅ CORRECT - Adequate off-heap allocation
"spark.memory.offHeap.size": "4g" # 20-30% of executor memory

# ❌ WRONG - Missing Java module access
# Results in: java.lang.IllegalAccessError

# ✅ CORRECT - Required for Java 17
"spark.executor.extraJavaOptions": "--add-opens=java.base/java.nio=ALL-UNNAMED"

# ❌ WRONG - Velox backend not enabled
"spark.gluten.sql.columnar.backend.ch.enabled": "true" # ClickHouse, not Velox!

# ✅ CORRECT - Velox backend configuration
"spark.gluten.sql.columnar.backend.velox.enabled": "true"

Performance Optimization Tips

tip
  1. Memory Sizing:

    • Off-heap: 20-30% of executor memory
    • Executor overhead: 15-20% reserved for Arrow buffers
    • Driver memory: 8-20 GB for complex queries (this benchmark used 20 GB)
  2. CPU Optimization:

    • Velox supports both x86 SIMD (AVX2/AVX-512) and ARM Neon — pick the instance class that matches your cost/perf target. Our 3TB benchmark ran on Graviton4 (r8gd.12xlarge).
    • Set spark.executor.cores = 4-8 for optimal vectorization
    • Reserve NVMe-backed instance storage for shuffle and spill (e.g., /mnt/k8s-disks/0)
  3. I/O Configuration:

    • Enable S3A fast upload: spark.hadoop.fs.s3a.fast.upload.buffer=disk
    • Increase connection pool to 200 connections: spark.hadoop.fs.s3a.connection.maximum=200
    • Use larger multipart sizes of 128 MB: spark.hadoop.fs.s3a.multipart.size=128M

Debugging Gluten Issues

note
# Enable debug logging
"spark.gluten.sql.debug": "true"
"spark.sql.planChangeLog.level": "WARN"

# Check for fallback operations
kubectl logs <spark-pod> | grep -i "fallback"

# Verify Velox library loading
kubectl exec <spark-pod> -- find /opt/spark -name "*velox*"

# Monitor off-heap memory usage
kubectl top pod <spark-pod> --containers

Verifying Gluten+Velox Execution in Spark History Server

When Gluten+Velox is working correctly, you'll see distinctive execution patterns in the Spark History Server that indicate native acceleration:

Key Indicators of Gluten+Velox Execution:

  • VeloxSparkPlanExecApi.scala references in stages and tasks
  • WholeStageCodegenTransformer nodes in the DAG visualization
  • ColumnarBroadcastExchange operations instead of standard broadcast
  • GlutenWholeStageColumnarRDD in the RDD lineage
  • Methods like executeColumnar and mapPartitions at VeloxSparkPlanExecApi.scala lines

Example DAG Pattern:

AQEShuffleRead
├── ColumnarBroadcastExchange
├── ShuffledColumnarBatchRDD [Unordered]
│ └── executeColumnar at VeloxSparkPlanExecApi.scala:630
└── MapPartitionsRDD [Unordered]
└── mapPartitions at VeloxSparkPlanExecApi.scala:632

What This Means:

  • VeloxSparkPlanExecApi: Gluten's interface layer to the Velox execution engine
  • Columnar operations: Data processed in columnar format (more efficient than row-by-row)
  • WholeStageTransformer: Multiple Spark operations fused into single native Velox operations
  • Off-heap processing: Memory management handled by Velox, not JVM garbage collector

If you see traditional Spark operations like mapPartitions at <WholeStageCodegen> without Velox references, Gluten may have fallen back to JVM execution for unsupported operations.

Conclusion

Apache Gluten with the Velox backend consistently accelerates Spark SQL workloads on Amazon EKS, delivering a 1.63× overall speedup (39% time reduction) in our TPC-DS 3TB benchmark on Graviton4 (r8gd.12xlarge). 81 of 103 queries improved — top wins reached 4.4× on q93 and 4.1× on q23a — while a small set regressed, most notably q72. The performance gains stem from offloading compute-intensive operators to a native, vectorized engine, reducing JVM overhead and improving CPU efficiency.

When planning your rollout:

  • Start by mirroring the configurations documented above, then tune off-heap memory and shuffle behavior based on workload shape.
  • Use the Spark Operator deployment flow to A/B test native and Gluten runs so you can quantify gains and detect fallbacks early.
  • Monitor Spark UI and metrics exports to build a data-backed case for production adoption or cluster right-sizing.

With the Docker image, Spark defaults, and example manifests provided in this guide, you can reproduce the benchmark end-to-end and adapt the pattern for your own cost and performance goals.


For complete implementation examples and benchmark results, see the GitHub repository.