Apache Spark with Apache Celeborn Benchmarks
Introduction
Apache Celeborn is an open-source intermediate data service designed to optimize big data compute engines like Apache Spark and Flink. It primarily manages shuffle and spilled data to enhance performance, stability, and flexibility in big data processing. By acting as a Remote Shuffle Service (RSS), it addresses issues like low I/O efficiency in traditional shuffle mechanisms. Celeborn offers high performance through asynchronous processing and a highly available architecture, contributing to more robust big data analytics.
This document presents the performance characteristics of using Apache Celeborn with Apache Spark on a 3TB TPC-DS benchmark.
Executive Summary (TL;DR)
While Celeborn provides significant operational benefits for specific use cases, it does not serve as a universal performance accelerator. For the standardized TPC-DS 3TB benchmark, overall execution time increased by 16% compared to the native Spark shuffle service.
Key observations include:
- Query-Dependent Performance: Results were highly query-dependent. The best-performing query (
q99-v2.4) improved by 10.1%, whereas the worst-performing query (q91-v2.4) regressed by over 100%. The performance gains appear correlated with queries that have large shuffle operations, while the overhead of the remote service penalizes queries with small shuffles. - Operational Stability: Celeborn's primary advantage is providing a centralized and fault-tolerant shuffle service. This architecture prevents job failures caused by executor loss and can improve reliability for long-running, complex queries.
- Infrastructure Overhead: Using Celeborn introduces higher costs, requiring dedicated master and worker nodes, high-throughput storage (e.g., EBS), and high-bandwidth networking. Careful capacity planning is essential to handle peak shuffle traffic.
In summary, Celeborn is a strategic choice for improving the stability of Spark jobs with large shuffle data, but for general workloads, the performance and cost overhead should be carefully evaluated.
Native Spark vs. Spark with Celeborn Benchmark
Benchmark Configuration
We benchmarked TPC-DS 3TB workloads on a dedicated Amazon EKS cluster to compare native Spark SQL execution with Spark enhanced by Apache Celeborn. To ensure an apples-to-apples comparison, both native Spark and Celeborn jobs ran on identical hardware, storage, and data. Each test was run 10 times and the average was taken.
Test Environment
| Component | Configuration |
|---|---|
| EKS Cluster | Amazon EKS 1.33 |
| Dataset | TPC-DS 3TB (Parquet format) |
| Spark Node Instance | r6g.8xlarge |
| Spark Node Group | 8 nodes dedicated for benchmark workloads |
| Spark Storage | EBS GP3, 1000 Throughput, 16000 IOPS |
| Celeborn Node Instance | r6g.8xlarge |
| Celeborn Node Group | 8 nodes (3 for Master, 5 for Worker pods) |
| Celeborn Storage | 2 x EBS GP3 Volumes per Worker, 1000 Throughput, 16000 IOPS each |
Application & Engine Configurations
Spark Executor and Driver
| Component | Configuration |
|---|---|
| Executor Configuration | 32 executors × 14 cores × 100GB RAM each |
| Driver Configuration | 5 cores × 20GB RAM |
Spark Settings for Celeborn
The following properties were set to enable the Celeborn shuffle manager.
spark.shuffle.manager: org.apache.spark.shuffle.celeborn.SparkShuffleManager
spark.shuffle.service.enabled: false
spark.celeborn.master.endpoints: celeborn-master-0.celeborn-master-svc.celeborn.svc.cluster.local,celeborn-master-1.celeborn-master-svc.celeborn.svc.cluster.local,celeborn-master-2.celeborn-master-svc.celeborn.svc.cluster.local
Celeborn Worker Configuration
Celeborn workers were configured to use two mounted EBS volumes for shuffle storage.
# EBS volumes are mounted at /mnt/disk1 and /mnt/disk2
celeborn.worker.storage.dirs: /mnt/disk1:disktype=SSD:capacity=100Gi,/mnt/disk2:disktype=SSD:capacity=100Gi
Performance Results and Analysis
Overall Benchmark Performance
The total execution time for the 3TB TPC-DS benchmark increased from 1792.8 seconds (native Spark shuffle) to 2079.6 seconds (Celeborn), representing an 16% performance regression overall.
This result demonstrates that for a broad, mixed workload like TPC-DS, the overhead of sending shuffle data over the network can outweigh the benefits for many queries.
Per-Query Performance Analysis
While the overall time increased, performance at the individual query level was highly variable.
The tables below highlight the queries that saw the biggest improvements and the worst regressions.
Queries with Performance Gains
Celeborn improved performance for 20 out of the 99 queries. The most significant gains were seen in queries known to have substantial shuffle phases.
| Rank | TPC-DS Query | Performance Change (%) |
|---|---|---|
| 1 | q99-v2.4 | 10.1 |
| 2 | q21-v2.4 | 9.2 |
| 3 | q22-v2.4 | 6.8 |
| 4 | q15-v2.4 | 6.5 |
| 5 | q45-v2.4 | 5.8 |
| 6 | q62-v2.4 | 4.8 |
| 7 | q39a-v2.4 | 4.8 |
| 8 | q79-v2.4 | 4.7 |
| 9 | q66-v2.4 | 1.5 |
| 10 | q26-v2.4 | 1.2 |
Queries with Performance Regressions
Conversely, a large number of queries performed significantly worse, with some showing over 100% degradation. These are typically queries with smaller shuffle data volumes where the cost of involving a remote service is higher than the benefit.
| Rank | TPC-DS Query | Performance Change (%) |
|---|---|---|
| 1 | q91-v2.4 | -135.8 |
| 2 | q92-v2.4 | -121.3 |
| 3 | q39b-v2.4 | -100.9 |
| 4 | q65-v2.4 | -78.9 |
| 5 | q68-v2.4 | -73.3 |
| 6 | q50-v2.4 | -68.0 |
| 7 | q8-v2.4 | -66.9 |
| 8 | q84-v2.4 | -63.2 |
| 9 | q32-v2.4 | -61.2 |
| 10 | q31-v2.4 | -57.9 |
Resource Utilization Analysis
Using a remote shuffle service fundamentally changes how a Spark application utilizes resources. We observed a clear shift from local disk I/O on executor pods to network I/O between executors and Celeborn workers.
Network I/O Analysis
With the native shuffler, network traffic is typically limited to reading data from the source and inter-node communication for tasks. With Celeborn, all shuffle data is transmitted over the network, leading to a significant increase in network utilization on Spark pods and high ingress on Celeborn worker pods.
Spark Pods with Default Shuffler
The graph below shows minimal network traffic on Spark pods, corresponding to reading the TPC-DS data.

Spark Pods with Celeborn
Here, network egress is significantly higher as executors are now sending shuffle data to the remote Celeborn workers.

Celeborn Worker Pods
The Celeborn worker pods show high network ingress, corresponding to the shuffle data being received from all the Spark executors.

Storage I/O Analysis
The inverse effect was observed for storage. The native shuffler writes intermediate data to local disks on each executor's node, generating significant disk I/O. Celeborn centralizes these writes on the remote workers' dedicated high-performance volumes.
Spark Pods with Default Shuffler
High disk I/O is visible on Spark pods as they perform local shuffle read/write operations.

Spark Pods with Celeborn
With shuffle operations offloaded, the local disk I/O on Spark pods becomes negligible.

Celeborn Worker Pods
The storage I/O load is now concentrated on the Celeborn workers, which are writing the aggregated shuffle data to their attached EBS volumes.

Celeborn Performance Configuration Comparison
This section details a performance comparison after adjusting specific Celeborn configuration values, contrasting them with default settings typically found in Helm charts. The goal was to investigate potential performance improvements by optimizing memory and buffer sizes.
Configuration Adjustments
Based on the Celeborn documentation (https://celeborn.apache.org/docs/latest/configuration/), the following parameters were modified to explore their impact on performance:
| Configuration Parameter | Value Used |
|---|---|
CELEBORN_WORKER_MEMORY | 12g |
CELEBORN_WORKER_OFFHEAP_MEMORY | 100g |
celeborn.worker.flusher.buffer.size | 10m |
Performance Impact Overview
The adjustments resulted in a varied performance impact across queries. While some queries showed improvements, a majority remained largely unchanged, and several experienced significant regressions.
Detailed Query Performance Comparison
The following table compares the performance between the default settings and the adjusted memory and buffer configurations. A positive percentage indicates performance improvement, while a negative percentage indicates regression.
While a small number of queries saw performance improvements, a larger portion experienced regressions, indicating that these configuration adjustments did not yield a net positive performance gain across the TPC-DS workload.
Queries with Performance Gains
The following queries showed the most significant performance improvements after the configuration adjustments:
| Rank | TPC-DS Query | Performance Change (%) |
|---|---|---|
| 1 | q31-v2.4 | 10.9 |
| 2 | q34-v2.4 | 9.4 |
| 3 | q12-v2.4 | 5.8 |
| 4 | q23a-v2.4 | 5.1 |
| 5 | q98-v2.4 | 4.9 |
| 6 | q77-v2.4 | 4.7 |
| 7 | q41-v2.4 | 2.4 |
| 8 | q35-v2.4 | 2.3 |
| 9 | q69-v2.4 | 2.1 |
| 10 | q86-v2.4 | 1.9 |
Queries with Performance Regressions
Conversely, a substantial number of queries experienced performance degradation with these configuration changes:
| Rank | TPC-DS Query | Performance Change (%) |
|---|---|---|
| 1 | q13-v2.4 | -59.5 |
| 2 | q3-v2.4 | -56.7 |
| 3 | q9-v2.4 | -36.2 |
| 4 | q91-v2.4 | -22.7 |
| 5 | q11-v2.4 | -22.3 |
| 6 | q55-v2.4 | -20.0 |
| 7 | q42-v2.4 | -19.0 |
| 8 | q7-v2.4 | -17.7 |
| 9 | q28-v2.4 | -14.1 |
| 10 | q2-v2.4 | -11.1 |
Conclusion on Configuration Adjustments
Overall, the specific adjustments to CELEBORN_WORKER_MEMORY, CELEBORN_WORKER_OFFHEAP_MEMORY, and celeborn.worker.flusher.buffer.size did not lead to a net positive performance improvement for the TPC-DS 3TB benchmark. While a few queries showed minor gains, the majority experienced performance degradation, with some regressions being quite significant. This suggests that these particular optimizations, while potentially beneficial in other contexts, were not effective for this specific mixed workload and could even be detrimental. Further fine-tuning or a different approach to Celeborn configuration might be necessary to achieve overall positive results.
Overall Conclusion
This benchmark report aimed to evaluate the performance characteristics of Apache Celeborn with Apache Spark on a 3TB TPC-DS workload. The initial comparison against native Spark shuffle revealed that while Celeborn offers significant operational stability benefits (e.g., fault tolerance for large shuffle operations), it introduced an overall 16% performance regression for this mixed workload. Performance was highly query-dependent, with some queries improving modestly and others regressing severely.
Furthermore, an investigation into specific Celeborn configuration adjustments (CELEBORN_WORKER_MEMORY, CELEBORN_WORKER_OFFHEAP_MEMORY, celeborn.worker.flusher.buffer.size) demonstrated that these optimizations did not yield a net positive performance gain. A majority of queries showed either negligible change or significant performance degradation, reinforcing the idea that generic tuning might not be universally effective.
In summary, Apache Celeborn is a strategic choice when operational stability, particularly for large and complex shuffle data, is paramount. However, for general analytical workloads, its deployment requires a careful evaluation of the performance overhead and associated infrastructure costs. Optimal performance tuning with Celeborn is highly dependent on workload characteristics and necessitates detailed, workload-specific analysis rather than relying on generalized configuration adjustments.