Apache Spark with Apache Celeborn Benchmarks

Introduction

Apache Celeborn is an open-source intermediate data service designed to optimize big data compute engines like Apache Spark and Flink. It primarily manages shuffle and spilled data to enhance performance, stability, and flexibility in big data processing. By acting as a Remote Shuffle Service (RSS), it addresses issues like low I/O efficiency in traditional shuffle mechanisms. Celeborn offers high performance through asynchronous processing and a highly available architecture, contributing to more robust big data analytics.

This document presents the performance characteristics of using Apache Celeborn with Apache Spark on a 3TB TPC-DS benchmark.

Executive Summary (TL;DR)

While Celeborn provides significant operational benefits for specific use cases, it does not serve as a universal performance accelerator. For the standardized TPC-DS 3TB benchmark, overall execution time increased by 16% compared to the native Spark shuffle service.

Key observations include:

Query-Dependent Performance: Results were highly query-dependent. The best-performing query (q99-v2.4) improved by 10.1%, whereas the worst-performing query (q91-v2.4) regressed by over 100%. The performance gains appear correlated with queries that have large shuffle operations, while the overhead of the remote service penalizes queries with small shuffles.
Operational Stability: Celeborn's primary advantage is providing a centralized and fault-tolerant shuffle service. This architecture prevents job failures caused by executor loss and can improve reliability for long-running, complex queries.
Infrastructure Overhead: Using Celeborn introduces higher costs, requiring dedicated master and worker nodes, high-throughput storage (e.g., EBS), and high-bandwidth networking. Careful capacity planning is essential to handle peak shuffle traffic.

In summary, Celeborn is a strategic choice for improving the stability of Spark jobs with large shuffle data, but for general workloads, the performance and cost overhead should be carefully evaluated.

Native Spark vs. Spark with Celeborn Benchmark

Benchmark Configuration

We benchmarked TPC-DS 3TB workloads on a dedicated Amazon EKS cluster to compare native Spark SQL execution with Spark enhanced by Apache Celeborn. To ensure an apples-to-apples comparison, both native Spark and Celeborn jobs ran on identical hardware, storage, and data. Each test was run 10 times and the average was taken.

Test Environment

Component	Configuration
EKS Cluster	Amazon EKS 1.33
Dataset	TPC-DS 3TB (Parquet format)
Spark Node Instance	r6g.8xlarge
Spark Node Group	8 nodes dedicated for benchmark workloads
Spark Storage	EBS GP3, 1000 Throughput, 16000 IOPS
Celeborn Node Instance	r6g.8xlarge
Celeborn Node Group	8 nodes (3 for Master, 5 for Worker pods)
Celeborn Storage	2 x EBS GP3 Volumes per Worker, 1000 Throughput, 16000 IOPS each

Application & Engine Configurations

Spark Executor and Driver

Component	Configuration
Executor Configuration	32 executors × 14 cores × 100GB RAM each
Driver Configuration	5 cores × 20GB RAM

Spark Settings for Celeborn

The following properties were set to enable the Celeborn shuffle manager.

spark.shuffle.manager: org.apache.spark.shuffle.celeborn.SparkShuffleManager
spark.shuffle.service.enabled: false
spark.celeborn.master.endpoints: celeborn-master-0.celeborn-master-svc.celeborn.svc.cluster.local,celeborn-master-1.celeborn-master-svc.celeborn.svc.cluster.local,celeborn-master-2.celeborn-master-svc.celeborn.svc.cluster.local

Celeborn Worker Configuration

Celeborn workers were configured to use two mounted EBS volumes for shuffle storage.

# EBS volumes are mounted at /mnt/disk1 and /mnt/disk2
celeborn.worker.storage.dirs: /mnt/disk1:disktype=SSD:capacity=100Gi,/mnt/disk2:disktype=SSD:capacity=100Gi

Performance Results and Analysis

Overall Benchmark Performance

The total execution time for the 3TB TPC-DS benchmark increased from 1792.8 seconds (native Spark shuffle) to 2079.6 seconds (Celeborn), representing an 16% performance regression overall.

This result demonstrates that for a broad, mixed workload like TPC-DS, the overhead of sending shuffle data over the network can outweigh the benefits for many queries.

Per-Query Performance Analysis

While the overall time increased, performance at the individual query level was highly variable.

The tables below highlight the queries that saw the biggest improvements and the worst regressions.

Queries with Performance Gains

Celeborn improved performance for 20 out of the 99 queries. The most significant gains were seen in queries known to have substantial shuffle phases.

Rank	TPC-DS Query	Performance Change (%)
1	q99-v2.4	10.1
2	q21-v2.4	9.2
3	q22-v2.4	6.8
4	q15-v2.4	6.5
5	q45-v2.4	5.8
6	q62-v2.4	4.8
7	q39a-v2.4	4.8
8	q79-v2.4	4.7
9	q66-v2.4	1.5
10	q26-v2.4	1.2

Queries with Performance Regressions

Conversely, a large number of queries performed significantly worse, with some showing over 100% degradation. These are typically queries with smaller shuffle data volumes where the cost of involving a remote service is higher than the benefit.

Rank	TPC-DS Query	Performance Change (%)
1	q91-v2.4	-135.8
2	q92-v2.4	-121.3
3	q39b-v2.4	-100.9
4	q65-v2.4	-78.9
5	q68-v2.4	-73.3
6	q50-v2.4	-68.0
7	q8-v2.4	-66.9
8	q84-v2.4	-63.2
9	q32-v2.4	-61.2
10	q31-v2.4	-57.9

Resource Utilization Analysis

Using a remote shuffle service fundamentally changes how a Spark application utilizes resources. We observed a clear shift from local disk I/O on executor pods to network I/O between executors and Celeborn workers.

Network I/O Analysis

With the native shuffler, network traffic is typically limited to reading data from the source and inter-node communication for tasks. With Celeborn, all shuffle data is transmitted over the network, leading to a significant increase in network utilization on Spark pods and high ingress on Celeborn worker pods.

Spark Pods with Default Shuffler

The graph below shows minimal network traffic on Spark pods, corresponding to reading the TPC-DS data.

Spark Pods with Celeborn

Here, network egress is significantly higher as executors are now sending shuffle data to the remote Celeborn workers.

Celeborn Worker Pods

The Celeborn worker pods show high network ingress, corresponding to the shuffle data being received from all the Spark executors.

Storage I/O Analysis

The inverse effect was observed for storage. The native shuffler writes intermediate data to local disks on each executor's node, generating significant disk I/O. Celeborn centralizes these writes on the remote workers' dedicated high-performance volumes.

Spark Pods with Default Shuffler

High disk I/O is visible on Spark pods as they perform local shuffle read/write operations.

Spark Pods with Celeborn

With shuffle operations offloaded, the local disk I/O on Spark pods becomes negligible.

Celeborn Worker Pods

The storage I/O load is now concentrated on the Celeborn workers, which are writing the aggregated shuffle data to their attached EBS volumes.

Celeborn Performance Configuration Comparison

This section details a performance comparison after adjusting specific Celeborn configuration values, contrasting them with default settings typically found in Helm charts. The goal was to investigate potential performance improvements by optimizing memory and buffer sizes.

Configuration Adjustments

Based on the Celeborn documentation (https://celeborn.apache.org/docs/latest/configuration/), the following parameters were modified to explore their impact on performance:

Configuration Parameter	Value Used
`CELEBORN_WORKER_MEMORY`	12g
`CELEBORN_WORKER_OFFHEAP_MEMORY`	100g
`celeborn.worker.flusher.buffer.size`	10m

Performance Impact Overview

The adjustments resulted in a varied performance impact across queries. While some queries showed improvements, a majority remained largely unchanged, and several experienced significant regressions.

Detailed Query Performance Comparison

The following table compares the performance between the default settings and the adjusted memory and buffer configurations. A positive percentage indicates performance improvement, while a negative percentage indicates regression.

While a small number of queries saw performance improvements, a larger portion experienced regressions, indicating that these configuration adjustments did not yield a net positive performance gain across the TPC-DS workload.

Queries with Performance Gains

The following queries showed the most significant performance improvements after the configuration adjustments:

Rank	TPC-DS Query	Performance Change (%)
1	q31-v2.4	10.9
2	q34-v2.4	9.4
3	q12-v2.4	5.8
4	q23a-v2.4	5.1
5	q98-v2.4	4.9
6	q77-v2.4	4.7
7	q41-v2.4	2.4
8	q35-v2.4	2.3
9	q69-v2.4	2.1
10	q86-v2.4	1.9

Queries with Performance Regressions

Conversely, a substantial number of queries experienced performance degradation with these configuration changes:

Rank	TPC-DS Query	Performance Change (%)
1	q13-v2.4	-59.5
2	q3-v2.4	-56.7
3	q9-v2.4	-36.2
4	q91-v2.4	-22.7
5	q11-v2.4	-22.3
6	q55-v2.4	-20.0
7	q42-v2.4	-19.0
8	q7-v2.4	-17.7
9	q28-v2.4	-14.1
10	q2-v2.4	-11.1

Storage Configuration Impact

Our analysis of storage configurations revealed meaningful performance differences between EBS volumes and ephemeral local storage. When comparing EBS volumes against ephemeral local storage for Celeborn workers, we observed approximately 5.7% overall performance improvement with ephemeral storage.

However, ephemeral storage comes with significant operational trade-offs:

Data Loss on Node Replacement: All shuffle data is permanently lost when nodes are terminated or replaced, requiring job restarts and potential data reprocessing. While Celeborn can replicate shuffle data to other workers for fault tolerance, this replication is also lost when using ephemeral storage on the replica nodes
No Portability: Unlike EBS volumes, ephemeral storage cannot be detached and reattached to other instances, limiting flexibility during maintenance or scaling operations
Capacity Constraints: Ephemeral storage capacity is fixed per instance type and cannot be independently scaled based on workload requirements
No Backup/Recovery: Ephemeral data cannot be backed up or recovered, increasing risk for long-running or critical workloads

For production environments, the performance gains must be weighed against these reliability and operational considerations.

We also evaluated different disk configurations to understand the impact of storage layout on performance. Specifically, we compared 4 ephemeral disks configured in a RAID 0 stripe to maximize I/O and throughput versus 4 ephemeral disks presented as individual storage devices. The results showed no discernible overall performance difference, with only a 0.0008% variance between configurations.

Spark Parameter Optimization

Testing tuning parameters recommended in the Celeborn documentation yielded mixed but important results. The most significant finding involved the spark.sql.adaptive.localShuffleReader.enabled parameter. Setting this parameter to false had a dramatic impact on query performance stability, reducing the worst-performing query's regression from ~240% to ~40%.

Thread configuration adjustments did not produce major performance impacts overall. However, these changes did allow Celeborn to utilize more compute resources effectively, enabling usage of up to 15 cores out of the available 64 cores on worker nodes. While this increased resource utilization didn't translate to significant performance gains, it demonstrates Celeborn's ability to scale resource consumption when configured appropriately.

Celeborn Default CPU usage

Celeborn with 4x thread configured

Overall Performance Assessment

Despite extensive tuning efforts across storage, Spark parameters, and Celeborn-specific configurations, all Celeborn configurations resulted in longer overall completion times compared to the default Spark shuffler. This consistent finding reinforces that while Celeborn provides operational benefits, achieving performance parity or improvement with native Spark shuffle requires careful workload-specific optimization and may not be achievable for all use cases.

Overall Conclusion

This benchmark report aimed to evaluate the performance characteristics of Apache Celeborn with Apache Spark on a 3TB TPC-DS workload. The initial comparison against native Spark shuffle revealed that while Celeborn offers significant operational stability benefits (e.g., fault tolerance for large shuffle operations), it introduced an overall 16% performance regression for this mixed workload. Performance was highly query-dependent, with some queries improving modestly and others regressing severely.

Furthermore, an investigation into specific Celeborn configuration adjustments (CELEBORN_WORKER_MEMORY, CELEBORN_WORKER_OFFHEAP_MEMORY, celeborn.worker.flusher.buffer.size) demonstrated that these optimizations did not yield a net positive performance gain. A majority of queries showed either negligible change or significant performance degradation, reinforcing the idea that generic tuning might not be universally effective.

In summary, Apache Celeborn is a strategic choice when operational stability, particularly for large and complex shuffle data, is paramount. However, for general analytical workloads, its deployment requires a careful evaluation of the performance overhead and associated infrastructure costs. Optimal performance tuning with Celeborn is highly dependent on workload characteristics and necessitates detailed, workload-specific analysis rather than relying on generalized configuration adjustments.

Introduction​

Executive Summary (TL;DR)​

Native Spark vs. Spark with Celeborn Benchmark​

Benchmark Configuration​

Test Environment​

Application & Engine Configurations​

Spark Executor and Driver​

Spark Settings for Celeborn​

Celeborn Worker Configuration​

Performance Results and Analysis​

Overall Benchmark Performance​

Per-Query Performance Analysis​

Queries with Performance Gains​

Queries with Performance Regressions​

Resource Utilization Analysis​

Network I/O Analysis​

Spark Pods with Default Shuffler​

Spark Pods with Celeborn​

Celeborn Worker Pods​

Storage I/O Analysis​

Spark Pods with Default Shuffler​

Spark Pods with Celeborn​

Celeborn Worker Pods​

Celeborn Performance Configuration Comparison​

Configuration Adjustments​

Performance Impact Overview​

Detailed Query Performance Comparison​

Queries with Performance Gains​

Queries with Performance Regressions​

Storage Configuration Impact​

Spark Parameter Optimization​

Celeborn Default CPU usage​

Celeborn with 4x thread configured​

Overall Performance Assessment​

Overall Conclusion​

Introduction

Executive Summary (TL;DR)

Native Spark vs. Spark with Celeborn Benchmark

Benchmark Configuration

Test Environment

Application & Engine Configurations

Spark Executor and Driver

Spark Settings for Celeborn

Celeborn Worker Configuration

Performance Results and Analysis

Overall Benchmark Performance

Per-Query Performance Analysis

Queries with Performance Gains

Queries with Performance Regressions

Resource Utilization Analysis

Network I/O Analysis

Spark Pods with Default Shuffler

Spark Pods with Celeborn

Celeborn Worker Pods

Storage I/O Analysis

Spark Pods with Default Shuffler

Spark Pods with Celeborn

Celeborn Worker Pods

Celeborn Performance Configuration Comparison

Configuration Adjustments

Performance Impact Overview

Detailed Query Performance Comparison

Queries with Performance Gains

Queries with Performance Regressions

Storage Configuration Impact

Spark Parameter Optimization

Celeborn Default CPU usage

Celeborn with 4x thread configured

Overall Performance Assessment

Overall Conclusion