Apache Spark with Apache Gluten + Velox Benchmarks

Apache Spark powers much of today’s large-scale analytics, but its default SQL engine is still JVM-bound and row-oriented. Even with Project Tungsten’s code generation and vectorized readers, operators often pay heavy costs for Java object creation, garbage collection, and row-to-column conversions. These costs become visible on analytic workloads that scan large Parquet or ORC tables, perform wide joins, or run memory-intensive aggregations—leading to slower queries and inefficient CPU use.

Modern C++ engines such as Velox, ClickHouse, and DuckDB show that SIMD-optimized, cache-aware vectorization can process the same data far faster. But replacing Spark is impractical given its ecosystem and scheduling model. Apache Gluten solves this by translating Spark SQL plans into the open Substrait IR and offloading execution to a native C++ backend (Velox, ClickHouse, etc.). This approach keeps Spark’s APIs and Kubernetes deployment model while accelerating the CPU-bound SQL layer—the focus of this deep dive and benchmark study on Amazon EKS.

In this guide you will:

  • Understand how the Spark + Gluten + Velox stack is assembled on Amazon EKS
  • Review TPC-DS 1TB benchmark results against native Spark
  • Learn the configuration, deployment, and troubleshooting steps required to reproduce the study
TL;DR
  • Benchmark scope: TPC-DS 1TB, three iterations on Amazon EKS
  • Toolchain: Apache Spark + Apache Gluten + Velox
  • Performance: 1.72× faster runtime overall, with peak 5.48× speedups on aggregation-heavy queries
  • Cost impact: ≈42% lower compute spend from shorter runs and higher CPU efficiency

TPC-DS 1TB Benchmark Results: Native Spark vs. Gluten + Velox Performance Analysis

Performance Dashboard Highlights

We benchmarked TPC-DS 1TB workloads on a dedicated Amazon EKS cluster to compare native Spark SQL execution with Spark enhanced by Gluten and the Velox backend. The headline metrics below summarize the performance gains and business impact.

  • 🚀 1.72× overall performance gain: 72% faster execution across all 104 TPC-DS queries
  • 💰 42% cost reduction potential: compute costs fall in direct proportion to the performance improvement
  • 📊 86.5% success rate: 90 of 104 queries improved, only 14 showed degradation
  • ⏱️ 42 minutes saved on the TPC-DS 1TB benchmark suite (1.7 h → 1.0 h)

Performance Comparison: Runtime Analysis

Across the suite, 90 queries improved and 14 degraded, giving an 86.5% success rate with a median speedup of 1.77×. The original dashboard also charts the top 10 performance improvements and per-query execution times (Spark vs Gluten+Velox) for the top 30 queries by improvement.

Performance categories:

  • Excellent (3x+): 15 queries
  • Good (2x-3x): 25 queries
  • Moderate (1.5x-2x): 23 queries
  • Slight (1x-1.5x): 27 queries
  • Degraded (<1x): 14 queries

🔍 Performance Analysis Insights

  • Complex Analytical Queries: Queries with heavy joins and aggregations (q93, q49, q50) show the highest improvements (3.9x-5.5x)
  • Scan-Heavy Operations: Large table scans benefit significantly from native columnar processing
  • Vectorization Benefits: Mathematical operations and filters see consistent 2x-3x improvements
  • Memory-Intensive Queries: Queries like q23b (146s→52s) demonstrate native memory management advantages
  • Edge Cases: 14 queries showed degradation, primarily those with simple operations where JNI overhead exceeded benefits
  • Cost Savings: The shorter wall-clock runtime (1.7 h → 1.0 h) translates to ~42% lower compute costs on EKS

Summary

Our comprehensive TPC-DS 1TB benchmark on Amazon EKS demonstrates that Apache Gluten with Velox delivers a 1.72x overall speedup (72% faster) compared to native Spark SQL, with individual queries showing improvements ranging from 1.1x to 5.5x.

📊 View complete benchmark results and raw data →

Benchmark Infrastructure Configuration

To ensure an apples-to-apples comparison, both native Spark and Gluten + Velox jobs ran on identical hardware, storage, and data. Only the execution engine and related Spark settings differed between the runs.

Test Environment Specifications

| Component | Configuration |
| --- | --- |
| EKS Cluster | Amazon EKS 1.33 |
| Node Instance Type | c5d.12xlarge (48 vCPUs, 96 GB RAM, 1.8 TB NVMe SSD) |
| Node Group | 8 nodes dedicated to benchmark workloads |
| Executor Configuration | 23 executors × 5 cores × 20 GB RAM each |
| Driver Configuration | 5 cores × 20 GB RAM |
| Dataset | TPC-DS 1TB (Parquet format) |
| Storage | Amazon S3 with optimized S3A connector |

Spark Configuration Comparison

| Configuration | Native Spark | Gluten + Velox |
| --- | --- | --- |
| Spark Version | 3.5.3 | 3.5.2 |
| Java Runtime | OpenJDK 17 | OpenJDK 17 |
| Execution Engine | JVM-based Tungsten | Native C++ Velox |
| Key Plugins | Standard Spark | GlutenPlugin, ColumnarShuffleManager |
| Off-heap Memory | Default | 2 GB enabled |
| Vectorized Processing | Limited Java SIMD | Full C++ vectorization |
| Memory Management | JVM GC | Unified native + JVM |

Critical Gluten-Specific Configurations

# Essential Gluten Plugin Configuration
spark.plugins: "org.apache.gluten.GlutenPlugin"
spark.shuffle.manager: "org.apache.spark.shuffle.sort.ColumnarShuffleManager"
spark.memory.offHeap.enabled: "true"
spark.memory.offHeap.size: "2g"

# Java 17 Compatibility for Gluten-Velox
spark.driver.extraJavaOptions: "--add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/sun.misc=ALL-UNNAMED"
spark.executor.extraJavaOptions: "--add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/sun.misc=ALL-UNNAMED"
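For interactive testing or notebooks, the same essential settings can be applied when the SparkSession is created. The sketch below is a minimal PySpark example and assumes the Gluten and Velox jars are already on the driver and executor classpath (for example, baked into the container image described later); the plugin cannot be enabled after the session has started.

# Minimal sketch: the essential Gluten settings applied at session build time.
# Assumes the Gluten + Velox jars are already on the Spark classpath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gluten-velox-smoke-test")
    .config("spark.plugins", "org.apache.gluten.GlutenPlugin")
    .config("spark.shuffle.manager",
            "org.apache.spark.shuffle.sort.ColumnarShuffleManager")
    .config("spark.memory.offHeap.enabled", "true")
    .config("spark.memory.offHeap.size", "2g")
    .getOrCreate()
)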

Performance Analysis: Top 20 Query Improvements

Gluten’s native execution path shines on wide, compute-heavy SQL. The table highlights the largest gains across the 104 TPC-DS queries, comparing median runtimes over multiple iterations.

| Rank | TPC-DS Query | Native Spark (s) | Gluten + Velox (s) | Speedup | % Improvement |
| --- | --- | --- | --- | --- | --- |
| 1 | q93-v2.4 | 80.18 | 14.63 | 5.48× | 448.1% |
| 2 | q49-v2.4 | 25.68 | 6.66 | 3.86× | 285.5% |
| 3 | q50-v2.4 | 38.57 | 10.00 | 3.86× | 285.5% |
| 4 | q59-v2.4 | 17.57 | 4.82 | 3.65× | 264.8% |
| 5 | q5-v2.4 | 23.18 | 6.42 | 3.61× | 261.4% |
| 6 | q62-v2.4 | 9.41 | 2.88 | 3.27× | 227.0% |
| 7 | q97-v2.4 | 18.68 | 5.99 | 3.12× | 211.7% |
| 8 | q40-v2.4 | 15.17 | 5.05 | 3.00× | 200.2% |
| 9 | q90-v2.4 | 12.05 | 4.21 | 2.86× | 186.2% |
| 10 | q23b-v2.4 | 147.17 | 52.96 | 2.78× | 177.9% |
| 11 | q29-v2.4 | 17.33 | 6.45 | 2.69× | 168.7% |
| 12 | q9-v2.4 | 60.90 | 23.03 | 2.64× | 164.5% |
| 13 | q96-v2.4 | 9.19 | 3.55 | 2.59× | 158.8% |
| 14 | q84-v2.4 | 7.99 | 3.12 | 2.56× | 156.1% |
| 15 | q6-v2.4 | 9.87 | 3.87 | 2.55× | 155.3% |
| 16 | q99-v2.4 | 9.70 | 3.81 | 2.55× | 154.6% |
| 17 | q43-v2.4 | 4.70 | 1.87 | 2.51× | 151.1% |
| 18 | q65-v2.4 | 17.51 | 7.00 | 2.50× | 150.2% |
| 19 | q88-v2.4 | 50.90 | 20.69 | 2.46× | 146.1% |
| 20 | q44-v2.4 | 22.90 | 9.36 | 2.45× | 144.7% |

Speedup Distribution Across Queries

| Speedup Range | Count | % of Total (≈97 queries) |
| --- | --- | --- |
| ≥ 3× and < 5× | 9 | ≈ 9% |
| ≥ 2× and < 3× | 29 | ≈ 30% |
| ≥ 1.5× and < 2× | 30 | ≈ 31% |
| ≥ 1× and < 1.5× | 21 | ≈ 22% |
| < 1× (slower with Gluten) | 8 | ≈ 8% |

Key Performance Insights

| Dimension | Insight | Impact |
| --- | --- | --- |
| Aggregate Gains | Total runtime dropped from 1.7 hours to 1.0 hour (42 minutes saved); overall speedup of 1.72× across the TPC-DS suite; peak single-query speedup of 5.48× (q93-v2.4) | Shorter batch windows and faster SLAs; operational stability preserved via seamless Spark fallbacks |
| Query Patterns | Complex analytical queries accelerate by 3×-5.5×; join-heavy workloads benefit from Velox hash joins; aggregations and scans see consistent 2×-3× improvements | Prioritize Gluten adoption for compute-bound SQL pipelines; plan for faster dimensional modeling and BI refreshes |
| Resource Utilization | CPU efficiency improves by ~72%; unified native memory dramatically reduces GC pressure; columnar shuffle + native readers boost I/O throughput | Lower infrastructure spend for the same workload; smoother execution with fewer GC pauses; more predictable runtimes under heavy data scans |

Business Impact Assessment

Cost Optimization Summary

note

With a 1.72× speedup, organizations can achieve:

  • ≈42% lower compute spend for batch processing workloads
  • Faster time-to-insight for business-critical analytics
  • Higher cluster utilization through reduced job runtimes

Operational Benefits

tip
  • Minimal migration effort: Drop-in plugin with existing Spark SQL code
  • Production-ready reliability preserves operational stability
  • Kubernetes-native integration keeps parity with existing EKS data platforms

Technical Recommendations

When to Deploy Gluten + Velox

  • High-Volume Analytics: TPC-DS-style complex queries with joins and aggregations
  • Cost-Sensitive Workloads: Where 40%+ compute cost reduction justifies integration effort
  • Performance-Critical Pipelines: SLA-driven workloads requiring faster execution

Implementation Considerations

  • Query Compatibility: Test edge cases in your specific workload patterns
  • Memory Tuning: Optimize off-heap allocation based on data characteristics
  • Monitoring: Leverage native metrics for performance debugging and optimization

The benchmark results demonstrate that Gluten + Velox represents a significant leap forward in Spark SQL performance, delivering production-ready native acceleration without sacrificing Spark's distributed computing advantages.

Why do a few queries regress?

caution

While Spark + Gluten + Velox was ~1.7× faster overall, a small set of TPC-DS queries ran slower. Gluten intentionally falls back to Spark’s JVM engine when an operator or expression isn’t fully supported natively. Those fallbacks introduce row↔columnar conversion boundaries and can change shuffle or partition behavior—explaining isolated regressions (q22, q67, q72 in our run).

To diagnose these cases:

  • Inspect the Spark physical plan for GlutenRowToArrowColumnar or VeloxColumnarToRowExec nodes surrounding a non-native operator.
  • Confirm native coverage by checking for WholeStageTransformer stages in the Gluten job.
  • Compare shuffle partition counts; Gluten fallbacks can alter skew handling versus native Spark.
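A minimal way to run the first check programmatically is to capture the formatted physical plan as a string and count conversion and transformer nodes. This is only a sketch: it assumes an existing session with Gluten configured, the S3 path is a placeholder for your own TPC-DS data, and the exact node names can vary between Gluten versions.

import io
from contextlib import redirect_stdout
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()   # assumes Gluten is already configured

# Placeholder path: point this at your own TPC-DS store_sales data.
df = (spark.read.parquet("s3a://<your-bucket>/tpcds/store_sales/")
      .groupBy("ss_store_sk")
      .count())

# df.explain() prints from the Python side, so its output can be captured.
buf = io.StringIO()
with redirect_stdout(buf):
    df.explain("formatted")
plan = buf.getvalue()

# Conversion nodes mark where Gluten handed an operator back to the JVM engine;
# node names differ across Gluten versions, so treat these patterns as examples.
for marker in ["GlutenRowToArrowColumnar", "VeloxColumnarToRowExec"]:
    print(f"{marker}: {plan.count(marker)} occurrence(s)")
print(f"Natively offloaded stages (WholeStageTransformer): {plan.count('WholeStageTransformer')}")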

Version differences did not skew the benchmark: Spark 3.5.3 (native) and Spark 3.5.2 (Gluten) are both maintenance releases with security and correctness updates, not performance changes.

Architecture Overview — Apache Spark vs. Apache Spark with Gluten + Velox

Understanding how Gluten intercepts Spark plans clarifies why certain workloads accelerate so sharply. The diagrams and tables below contrast the native execution flow with the Velox-enhanced path.

Execution Path Comparison

Memory & Processing Comparison

| Aspect | Native Spark | Gluten + Velox | Impact |
| --- | --- | --- | --- |
| Memory Model | JVM heap objects | Apache Arrow off-heap columnar | 40% less GC overhead |
| Processing | Row-by-row iteration | SIMD vectorized batches | 8-16 rows per CPU cycle |
| CPU Cache | Poor locality | Cache-friendly columns | 85% vs 60% efficiency |
| Memory Bandwidth | 40 GB/s typical | 65+ GB/s sustained | 60% bandwidth increase |
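To make the row-versus-columnar contrast concrete, here is a toy Python analogy using pyarrow (it is not Velox code and was not part of the benchmark): the same revenue aggregation computed one Python object at a time versus as whole-column Arrow operations, the contiguous layout that lets a native engine stay cache-friendly and apply SIMD.

# Toy analogy (not Velox): row-at-a-time vs columnar aggregation.
import pyarrow as pa
import pyarrow.compute as pc

rows = [{"price": 10.0, "qty": 3.0}, {"price": 4.5, "qty": 2.0}, {"price": 7.25, "qty": 1.0}]

# Row-oriented: touch one Python object at a time (per-row overhead, poor locality).
revenue_by_row = sum(r["price"] * r["qty"] for r in rows)

# Columnar: operate on contiguous Arrow arrays, one column-wide kernel per step.
table = pa.table({"price": [r["price"] for r in rows],
                  "qty":   [r["qty"] for r in rows]})
revenue_by_column = pc.sum(pc.multiply(table["price"], table["qty"])).as_py()

assert revenue_by_row == revenue_by_column == 46.25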

What Is Apache Gluten — Why It Matters

Apache Gluten is a middleware layer that offloads Spark SQL execution from the JVM to high-performance native execution engines. For data engineers, this means:

Core Technical Benefits

  1. Zero Application Changes: Existing Spark SQL and DataFrame code works unchanged
  2. Automatic Fallback: Unsupported operations gracefully fall back to native Spark
  3. Cross-Engine Compatibility: Uses Substrait as intermediate representation
  4. Production Ready: Handles complex enterprise workloads without code changes
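To illustrate the first two points, the sketch below is ordinary PySpark with no Gluten-specific API calls; whether it executes on Velox or falls back to the JVM engine is decided entirely by the session configuration. The S3 path is a placeholder and the columns are standard TPC-DS store_sales fields.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()   # Gluten on or off is purely a config decision

# Ordinary DataFrame code: nothing here references Gluten or Velox.
store_sales = spark.read.parquet("s3a://<your-bucket>/tpcds/store_sales/")  # placeholder path
daily_revenue = (store_sales
                 .groupBy("ss_sold_date_sk")
                 .agg(F.sum("ss_net_paid").alias("revenue")))

# With the plugin active, offloaded operators appear as *Transformer nodes in the
# plan; unsupported ones fall back automatically to regular JVM codegen stages.
daily_revenue.explain()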

Gluten Plugin Architecture

Key Configuration Parameters

# Essential Gluten Configuration
sparkConf:
  # Core Plugin Activation
  "spark.plugins": "org.apache.gluten.GlutenPlugin"
  "spark.shuffle.manager": "org.apache.spark.shuffle.sort.ColumnarShuffleManager"

  # Memory Configuration
  "spark.memory.offHeap.enabled": "true"
  "spark.memory.offHeap.size": "4g" # Critical for Velox performance

  # Fallback Control
  "spark.gluten.sql.columnar.backend.velox.enabled": "true"
  "spark.gluten.sql.columnar.forceShuffledHashJoin": "true"

What Is Velox — Why Gluten Needs It (Alternatives)

Velox is Meta's C++ vectorized execution engine optimized for analytical workloads. It serves as the computational backend for Gluten, providing:

Velox Core Components

| Layer | Component | Purpose |
| --- | --- | --- |
| Operators | Filter, Project, Aggregate, Join | Vectorized SQL operations |
| Expressions | Vector functions, Type system | SIMD-optimized computations |
| Memory | Apache Arrow buffers, Custom allocators | Cache-efficient data layout |
| I/O | Parquet/ORC readers, Compression | High-throughput data ingestion |
| CPU | AVX2/AVX-512, ARM Neon | Hardware-accelerated processing |

Velox vs Alternative Backends

| Feature | Velox | ClickHouse | Apache Arrow DataFusion |
| --- | --- | --- | --- |
| Language | C++ | C++ | Rust |
| SIMD Support | AVX2/AVX-512/Neon | AVX2/AVX-512 | Limited |
| Memory Model | Apache Arrow Columnar | Native Columnar | Apache Arrow Native |
| Spark Integration | Native via Gluten | Via Gluten | Experimental |
| Performance | Excellent | Excellent | Good |
| Maturity | Production (Meta) | Production | Developing |

Configuring Spark + Gluten + Velox

The instructions in this section walk through the baseline artifacts you need to build an image, configure Spark defaults, and deploy workloads on the Spark Operator.

Docker Image Configuration

Create a production-ready Spark image with Gluten + Velox:

You can find the sample Dockerfile here: Dockerfile-spark-gluten-velox

Spark Configuration Examples

Use the templates below to bootstrap both shared Spark defaults and a sample SparkApplication manifest.

# spark-defaults.conf - Optimized for Gluten + Velox

# Core Gluten Configuration
spark.plugins org.apache.gluten.GlutenPlugin
spark.shuffle.manager org.apache.spark.shuffle.sort.ColumnarShuffleManager

# Memory Configuration - Critical for Performance
spark.memory.offHeap.enabled true
spark.memory.offHeap.size 4g
spark.memory.fraction 0.8
spark.executor.memory 20g
spark.executor.memoryOverhead 6g

# Velox-specific Optimizations
spark.gluten.sql.columnar.backend.velox.enabled true
spark.gluten.sql.columnar.forceShuffledHashJoin true
spark.gluten.sql.columnar.backend.velox.bloom_filter.enabled true

# Java 17 Module Access (Required)
spark.driver.extraJavaOptions --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/sun.misc=ALL-UNNAMED
spark.executor.extraJavaOptions --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/sun.misc=ALL-UNNAMED

# Adaptive Query Execution
spark.sql.adaptive.enabled true
spark.sql.adaptive.coalescePartitions.enabled true
spark.sql.adaptive.skewJoin.enabled true

# S3 Optimizations
spark.hadoop.fs.s3a.fast.upload.buffer disk
spark.hadoop.fs.s3a.multipart.size 128M
spark.hadoop.fs.s3a.connection.maximum 200
Why these defaults?
  • spark.plugins activates the Apache Gluten runtime so query plans can offload to Velox.
  • Off-heap configuration reserves Arrow buffers that prevent JVM garbage collection pressure.
  • Adaptive query execution settings keep shuffle partitions balanced under both native and Gluten runs.
  • S3 connector tuning avoids bottlenecks when scanning the 1TB TPC-DS dataset from Amazon S3.
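Before launching a long benchmark run, a quick sanity check from a PySpark shell can confirm these defaults were actually picked up by the session. This is a small sketch; it assumes the session was started from the image and spark-defaults.conf above.

# Sanity-check sketch: verify the Gluten-relevant defaults are active.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "<not set>" for any key suggests the defaults file was not applied.
for key in [
    "spark.plugins",
    "spark.shuffle.manager",
    "spark.memory.offHeap.enabled",
    "spark.memory.offHeap.size",
    "spark.sql.adaptive.enabled",
]:
    print(f"{key} = {spark.conf.get(key, '<not set>')}")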

Running Benchmarks

Follow the workflow below to reproduce the benchmark from data generation through post-run analysis.

TPC-DS Benchmark Setup

The complete TPC-DS harness is available in the repository: examples/benchmark/tpcds-benchmark-spark-gluten-velox/README.md.

Step 1: Generate TPC-DS Data (1TB scale)

Follow this link to generate the test data in your S3 bucket.

Step 2: Submit Native & Gluten Jobs

Prerequisites

Before submitting benchmark jobs, ensure:

  1. S3 Bucket is configured: Export the S3 bucket name from your Terraform outputs
  2. Benchmark data is available: Verify TPC-DS 1TB data exists in the same S3 bucket

Export S3 bucket name from Terraform outputs:

Export S3 bucket variable
# Get S3 bucket name from Terraform outputs
export S3_BUCKET=$(terraform -chdir=path/to/your/terraform output -raw s3_bucket_id_data)

# Verify the bucket and data exist
aws s3 ls s3://$S3_BUCKET/blog/BLOG_TPCDS-TEST-3T-partitioned/

Submit the benchmark jobs (apply the native Spark manifest below, then submit the corresponding Gluten + Velox manifest from the repository in the same way):

Submit native Spark benchmark
envsubst < tpcds-benchmark-native-c5d.yaml | kubectl apply -f -

Step 3: Monitor Benchmark Progress

Check SparkApplication status
kubectl get sparkapplications -n spark-team-a

Step 4: Spark History Server Analysis

Access detailed execution plans and metrics:

Open Spark History Server locally
kubectl port-forward svc/spark-history-server 18080:80 -n spark-history-server
Navigation Checklist
  • Point your browser to http://localhost:18080.
  • Locate both spark-<ID>-native and spark-<ID>-gluten applications.
  • In the Spark UI, inspect:
    1. SQL tab execution plans
    2. Presence of WholeStageTransformer stages in Gluten jobs
    3. Stage execution times across both runs
    4. Executor metrics for off-heap memory usage

Step 5: Summarize Findings

tip
  • Export runtime metrics from the Spark UI or event logs for both jobs.
  • Capture query-level comparisons (duration, stage counts, fallbacks) to document where Gluten accelerated or regressed.
  • Feed the results into cost or capacity planning discussions—speedups translate directly into smaller clusters or faster SLA achievement.
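One lightweight way to capture the query-level comparison is a small script over exported runtimes. The sketch below assumes two hypothetical CSV exports with query,runtime_seconds columns, one per engine; adapt the file names and columns to however you pull metrics from the Spark UI or event logs.

# Sketch: per-query speedups and overall gain from two runtime exports.
import csv

def load_runtimes(path):
    # Expected columns: query,runtime_seconds (hypothetical export format).
    with open(path, newline="") as f:
        return {row["query"]: float(row["runtime_seconds"]) for row in csv.DictReader(f)}

native = load_runtimes("native_runtimes.csv")   # hypothetical file name
gluten = load_runtimes("gluten_runtimes.csv")   # hypothetical file name

# Per-query speedup = native runtime / Gluten runtime.
speedups = {q: native[q] / gluten[q] for q in native if q in gluten and gluten[q] > 0}
for q, s in sorted(speedups.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{q}: {s:.2f}x")

# Time-weighted overall speedup and the corresponding runtime reduction.
overall = sum(native[q] for q in speedups) / sum(gluten[q] for q in speedups)
print(f"Overall speedup: {overall:.2f}x ({(1 - 1 / overall) * 100:.0f}% less runtime)")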

Key Metrics to Analyze

tip

As you compare native and Gluten runs, focus on the following signals:

  1. Query Plan Differences:

    • Native: WholeStageCodegen stages
    • Gluten: WholeStageTransformer stages
  2. Memory Usage Patterns:

    • Native: High on-heap usage, frequent GC
    • Gluten: Off-heap Arrow buffers, minimal GC
  3. CPU Utilization:

    • Native: 60-70% efficiency
    • Gluten: 80-90%+ efficiency with SIMD

Performance Analysis and Pitfalls

Gluten reduces friction for Spark adopters, but a few tuning habits help avoid regressions. Use the notes below as a checklist during rollout.

Common Configuration Pitfalls

caution
# ❌ WRONG - Insufficient off-heap memory
"spark.memory.offHeap.size": "512m" # Too small for real workloads

# ✅ CORRECT - Adequate off-heap allocation
"spark.memory.offHeap.size": "4g" # 20-30% of executor memory

# ❌ WRONG - Missing Java module access
# Results in: java.lang.IllegalAccessError

# ✅ CORRECT - Required for Java 17
"spark.executor.extraJavaOptions": "--add-opens=java.base/java.nio=ALL-UNNAMED"

# ❌ WRONG - Velox backend not enabled
"spark.gluten.sql.columnar.backend.ch.enabled": "true" # ClickHouse, not Velox!

# ✅ CORRECT - Velox backend configuration
"spark.gluten.sql.columnar.backend.velox.enabled": "true"

Performance Optimization Tips

tip
  1. Memory Sizing:

    • Off-heap: 20-30% of executor memory
    • Executor overhead: 15-20% reserved for Arrow buffers
    • Driver memory: 4-8 GB for complex queries
  2. CPU Optimization:

    • Use AVX2-capable instance types (Intel Xeon, AMD EPYC)
    • Avoid ARM instances for maximum SIMD benefit
    • Set spark.executor.cores = 4-8 for optimal vectorization
  3. I/O Configuration:

    • Enable S3A fast upload: spark.hadoop.fs.s3a.fast.upload.buffer=disk
    • Increase connection pool to 200 connections: spark.hadoop.fs.s3a.connection.maximum=200
    • Use larger multipart sizes of 128 MB: spark.hadoop.fs.s3a.multipart.size=128M
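The memory-sizing guidance above can be expressed as a small rule-of-thumb helper. This is only a sketch of the 20-30% off-heap and 15-20% overhead heuristics, not a Gluten or Spark API; validate the suggestions against observed off-heap usage for your workload.

# Rule-of-thumb sizing helper (sketch), based on the percentages listed above.
def suggested_gluten_memory(executor_memory_gb: int,
                            offheap_frac: float = 0.25,
                            overhead_frac: float = 0.175) -> dict:
    """Suggest off-heap and overhead sizes from executor (on-heap) memory."""
    return {
        "spark.executor.memory": f"{executor_memory_gb}g",
        "spark.memory.offHeap.enabled": "true",
        "spark.memory.offHeap.size": f"{max(1, round(executor_memory_gb * offheap_frac))}g",
        "spark.executor.memoryOverhead": f"{max(1, round(executor_memory_gb * overhead_frac))}g",
    }

# e.g. a 20g executor -> roughly 5g off-heap and 4g overhead as a starting point
print(suggested_gluten_memory(20))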

Debugging Gluten Issues

note
# Enable debug logging
"spark.gluten.sql.debug": "true"
"spark.sql.planChangeLog.level": "WARN"

# Check for fallback operations
kubectl logs <spark-pod> | grep -i "fallback"

# Verify Velox library loading
kubectl exec <spark-pod> -- find /opt/spark -name "*velox*"

# Monitor off-heap memory usage
kubectl top pod <spark-pod> --containers

Verifying Gluten+Velox Execution in Spark History Server

When Gluten+Velox is working correctly, you'll see distinctive execution patterns in the Spark History Server that indicate native acceleration:

Key Indicators of Gluten+Velox Execution:

  • VeloxSparkPlanExecApi.scala references in stages and tasks
  • WholeStageCodegenTransformer nodes in the DAG visualization
  • ColumnarBroadcastExchange operations instead of standard broadcast
  • GlutenWholeStageColumnarRDD in the RDD lineage
  • Methods like executeColumnar and mapPartitions at VeloxSparkPlanExecApi.scala lines

Example DAG Pattern:

AQEShuffleRead
├── ColumnarBroadcastExchange
├── ShuffledColumnarBatchRDD [Unordered]
│   └── executeColumnar at VeloxSparkPlanExecApi.scala:630
└── MapPartitionsRDD [Unordered]
    └── mapPartitions at VeloxSparkPlanExecApi.scala:632

What This Means:

  • VeloxSparkPlanExecApi: Gluten's interface layer to the Velox execution engine
  • Columnar operations: Data processed in columnar format (more efficient than row-by-row)
  • WholeStageTransformer: Multiple Spark operations fused into single native Velox operations
  • Off-heap processing: Memory management handled by Velox, not JVM garbage collector

If you see traditional Spark operations like mapPartitions at <WholeStageCodegen> without Velox references, Gluten may have fallen back to JVM execution for unsupported operations.

Conclusion

Apache Gluten with the Velox backend consistently accelerates Spark SQL workloads on Amazon EKS, delivering a 1.72× overall speedup and driving ≈42% lower compute spend in our TPC-DS 1TB benchmark. The performance gains stem from offloading compute-intensive operators to a native, vectorized engine, reducing JVM overhead and improving CPU efficiency.

When planning your rollout:

  • Start by mirroring the configurations documented above, then tune off-heap memory and shuffle behavior based on workload shape.
  • Use the Spark Operator deployment flow to A/B test native and Gluten runs so you can quantify gains and detect fallbacks early.
  • Monitor Spark UI and metrics exports to build a data-backed case for production adoption or cluster right-sizing.

With the Docker image, Spark defaults, and example manifests provided in this guide, you can reproduce the benchmark end-to-end and adapt the pattern for your own cost and performance goals.


For complete implementation examples and benchmark results, see the GitHub repository.