Apache Spark with Apache Celeborn Benchmarks

Introduction

Apache Celeborn is an open-source intermediate data service designed to optimize big data compute engines like Apache Spark and Flink. It primarily manages shuffle and spilled data to enhance performance, stability, and flexibility in big data processing. By acting as a Remote Shuffle Service (RSS), it addresses issues like low I/O efficiency in traditional shuffle mechanisms. Celeborn offers high performance through asynchronous processing and a highly available architecture, contributing to more robust big data analytics.

This document presents the performance characteristics of using Apache Celeborn with Apache Spark on a 3TB TPC-DS benchmark.

Executive Summary (TL;DR)

While Celeborn provides significant operational benefits for specific use cases, it does not serve as a universal performance accelerator. For the standardized TPC-DS 3TB benchmark, overall execution time increased by 16% compared to the native Spark shuffle service.

Key observations include:

  • Query-Dependent Performance: Results were highly query-dependent. The best-performing query (q99-v2.4) improved by 10.1%, whereas the worst-performing query (q91-v2.4) regressed by 135.8%. The gains appear correlated with queries that have large shuffle operations, while the overhead of the remote service penalizes queries with small shuffles.
  • Operational Stability: Celeborn's primary advantage is providing a centralized and fault-tolerant shuffle service. This architecture prevents job failures caused by executor loss and can improve reliability for long-running, complex queries.
  • Infrastructure Overhead: Using Celeborn introduces higher costs, requiring dedicated master and worker nodes, high-throughput storage (e.g., EBS), and high-bandwidth networking. Careful capacity planning is essential to handle peak shuffle traffic.

In summary, Celeborn is a strategic choice for improving the stability of Spark jobs with large shuffle data, but for general workloads, the performance and cost overhead should be carefully evaluated.

Native Spark vs. Spark with Celeborn Benchmark

Benchmark Configuration

We benchmarked TPC-DS 3TB workloads on a dedicated Amazon EKS cluster to compare native Spark SQL execution with Spark enhanced by Apache Celeborn. To ensure an apples-to-apples comparison, both native Spark and Celeborn jobs ran on identical hardware, storage, and data. Each test was run 10 times and the average was taken.

Test Environment

Component               Configuration
EKS Cluster             Amazon EKS 1.33
Dataset                 TPC-DS 3TB (Parquet format)
Spark Node Instance     r6g.8xlarge
Spark Node Group        8 nodes dedicated for benchmark workloads
Spark Storage           EBS GP3, 1000 MiB/s throughput, 16000 IOPS
Celeborn Node Instance  r6g.8xlarge
Celeborn Node Group     8 nodes (3 for Master pods, 5 for Worker pods)
Celeborn Storage        2 x EBS GP3 volumes per Worker, 1000 MiB/s throughput, 16000 IOPS each

Application & Engine Configurations

Spark Executor and Driver

Component               Configuration
Executor Configuration  32 executors × 14 cores × 100GB RAM each
Driver Configuration    5 cores × 20GB RAM
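
For reference, these allocations correspond to standard Spark resource properties along the following lines (an illustrative sketch; the report does not reproduce the actual submission configuration used):

spark.executor.instances: 32
spark.executor.cores: 14
spark.executor.memory: 100g
spark.driver.cores: 5
spark.driver.memory: 20g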

Spark Settings for Celeborn

The following properties were set to enable the Celeborn shuffle manager.

spark.shuffle.manager: org.apache.spark.shuffle.celeborn.SparkShuffleManager
spark.shuffle.service.enabled: false
spark.celeborn.master.endpoints: celeborn-master-0.celeborn-master-svc.celeborn.svc.cluster.local,celeborn-master-1.celeborn-master-svc.celeborn.svc.cluster.local,celeborn-master-2.celeborn-master-svc.celeborn.svc.cluster.local

Celeborn Worker Configuration

Celeborn workers were configured to use two mounted EBS volumes for shuffle storage.

# EBS volumes are mounted at /mnt/disk1 and /mnt/disk2
celeborn.worker.storage.dirs: /mnt/disk1:disktype=SSD:capacity=100Gi,/mnt/disk2:disktype=SSD:capacity=100Gi

Performance Results and Analysis

Overall Benchmark Performance

The total execution time for the 3TB TPC-DS benchmark increased from 1792.8 seconds (native Spark shuffle) to 2079.6 seconds (Celeborn), a 16% performance regression overall.
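
For reference, the regression works out to (2079.6 − 1792.8) / 1792.8 ≈ 0.16, i.e., roughly 16% more total wall-clock time with Celeborn.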

This result demonstrates that for a broad, mixed workload like TPC-DS, the overhead of sending shuffle data over the network can outweigh the benefits for many queries.

Per-Query Performance Analysis

While the overall time increased, performance at the individual query level was highly variable.

The tables below highlight the queries that saw the biggest improvements and the worst regressions.
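
Throughout these tables, the percentage expresses the relative change in query runtime. Assuming the conventional definition, change (%) = (t_native − t_celeborn) / t_native × 100, so +10.1 for q99-v2.4 means the Celeborn run finished about 10% faster than the native run, while −135.8 for q91-v2.4 means it took roughly 2.4 times as long.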

Queries with Performance Gains

Celeborn improved performance for 20 out of the 99 queries. The most significant gains were seen in queries known to have substantial shuffle phases.

Rank  TPC-DS Query  Performance Change (%)
1     q99-v2.4       10.1
2     q21-v2.4        9.2
3     q22-v2.4        6.8
4     q15-v2.4        6.5
5     q45-v2.4        5.8
6     q62-v2.4        4.8
7     q39a-v2.4       4.8
8     q79-v2.4        4.7
9     q66-v2.4        1.5
10    q26-v2.4        1.2

Queries with Performance Regressions

Conversely, a large number of queries performed significantly worse, with some showing over 100% degradation. These are typically queries with smaller shuffle data volumes where the cost of involving a remote service is higher than the benefit.

Rank  TPC-DS Query  Performance Change (%)
1     q91-v2.4      -135.8
2     q92-v2.4      -121.3
3     q39b-v2.4     -100.9
4     q65-v2.4       -78.9
5     q68-v2.4       -73.3
6     q50-v2.4       -68.0
7     q8-v2.4        -66.9
8     q84-v2.4       -63.2
9     q32-v2.4       -61.2
10    q31-v2.4       -57.9

Resource Utilization Analysis

Using a remote shuffle service fundamentally changes how a Spark application utilizes resources. We observed a clear shift from local disk I/O on executor pods to network I/O between executors and Celeborn workers.

Network I/O Analysis

With the native shuffle, network traffic on the Spark pods consists mainly of reading the source data plus shuffle fetches between executors on different nodes. With Celeborn, all shuffle data is pushed over the network to the remote workers and read back from them, leading to a significant increase in network utilization on the Spark pods and heavy ingress on the Celeborn worker pods.

Spark Pods with Default Shuffler

The graph below shows minimal network traffic on Spark pods, corresponding to reading the TPC-DS data.

Spark Pods with Celeborn

Here, network egress is significantly higher as executors are now sending shuffle data to the remote Celeborn workers.

Celeborn Worker Pods

The Celeborn worker pods show high network ingress, corresponding to the shuffle data being received from all the Spark executors.

Storage I/O Analysis

The inverse effect was observed for storage. The native shuffler writes intermediate data to local disks on each executor's node, generating significant disk I/O. Celeborn centralizes these writes on the remote workers' dedicated high-performance volumes.

Spark Pods with Default Shuffler

High disk I/O is visible on Spark pods as they perform local shuffle read/write operations.

Spark Pods with Celeborn

With shuffle operations offloaded, the local disk I/O on Spark pods becomes negligible.

Celeborn Worker Pods

The storage I/O load is now concentrated on the Celeborn workers, which are writing the aggregated shuffle data to their attached EBS volumes.

Celeborn Performance Configuration Comparison

This section details a performance comparison after adjusting specific Celeborn configuration values, contrasting them with the defaults shipped in the Celeborn Helm chart. The goal was to investigate whether tuning worker memory and flusher buffer sizes would improve performance.

Configuration Adjustments

Based on the Celeborn documentation (https://celeborn.apache.org/docs/latest/configuration/), the following parameters were modified to explore their impact on performance:

Configuration Parameter              Value Used
CELEBORN_WORKER_MEMORY               12g
CELEBORN_WORKER_OFFHEAP_MEMORY       100g
celeborn.worker.flusher.buffer.size  10m
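
For context, the first two values are worker JVM sizing settings that Celeborn reads as environment variables, while the third is a regular Celeborn property. A minimal sketch of where they would typically be expressed, assuming the standard celeborn-env.sh and celeborn-defaults.conf mechanism rather than any chart-specific values keys:

# celeborn-env.sh: worker on-heap and off-heap sizing (exact wiring depends on the Helm chart version)
CELEBORN_WORKER_MEMORY=12g
CELEBORN_WORKER_OFFHEAP_MEMORY=100g

# celeborn-defaults.conf: buffer size used by the flusher when writing shuffle data to disk
celeborn.worker.flusher.buffer.size 10m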

Performance Impact Overview

The adjustments resulted in a varied performance impact across queries. While some queries showed improvements, a majority remained largely unchanged, and several experienced significant regressions.

Detailed Query Performance Comparison

The following tables compare performance with the default settings against the adjusted memory and buffer configurations. A positive percentage indicates improvement; a negative percentage indicates regression.

While a small number of queries saw performance improvements, a larger portion experienced regressions, indicating that these configuration adjustments did not yield a net positive performance gain across the TPC-DS workload.

Queries with Performance Gains

The following queries showed the most significant performance improvements after the configuration adjustments:

Rank  TPC-DS Query  Performance Change (%)
1     q31-v2.4      10.9
2     q34-v2.4       9.4
3     q12-v2.4       5.8
4     q23a-v2.4      5.1
5     q98-v2.4       4.9
6     q77-v2.4       4.7
7     q41-v2.4       2.4
8     q35-v2.4       2.3
9     q69-v2.4       2.1
10    q86-v2.4       1.9

Queries with Performance Regressions

Conversely, a substantial number of queries experienced performance degradation with these configuration changes:

Rank  TPC-DS Query  Performance Change (%)
1     q13-v2.4      -59.5
2     q3-v2.4       -56.7
3     q9-v2.4       -36.2
4     q91-v2.4      -22.7
5     q11-v2.4      -22.3
6     q55-v2.4      -20.0
7     q42-v2.4      -19.0
8     q7-v2.4       -17.7
9     q28-v2.4      -14.1
10    q2-v2.4       -11.1

Conclusion on Configuration Adjustments

Overall, the specific adjustments to CELEBORN_WORKER_MEMORY, CELEBORN_WORKER_OFFHEAP_MEMORY, and celeborn.worker.flusher.buffer.size did not lead to a net positive performance improvement for the TPC-DS 3TB benchmark. While a few queries showed minor gains, the majority experienced performance degradation, with some regressions being quite significant. This suggests that these particular optimizations, while potentially beneficial in other contexts, were not effective for this specific mixed workload and could even be detrimental. Further fine-tuning or a different approach to Celeborn configuration might be necessary to achieve overall positive results.

Overall Conclusion

This benchmark report aimed to evaluate the performance characteristics of Apache Celeborn with Apache Spark on a 3TB TPC-DS workload. The initial comparison against native Spark shuffle revealed that while Celeborn offers significant operational stability benefits (e.g., fault tolerance for large shuffle operations), it introduced an overall 16% performance regression for this mixed workload. Performance was highly query-dependent, with some queries improving modestly and others regressing severely.

Furthermore, an investigation into specific Celeborn configuration adjustments (CELEBORN_WORKER_MEMORY, CELEBORN_WORKER_OFFHEAP_MEMORY, celeborn.worker.flusher.buffer.size) demonstrated that these optimizations did not yield a net positive performance gain. A majority of queries showed either negligible change or significant performance degradation, reinforcing the idea that generic tuning might not be universally effective.

In summary, Apache Celeborn is a strategic choice when operational stability, particularly for large and complex shuffle data, is paramount. However, for general analytical workloads, its deployment requires a careful evaluation of the performance overhead and associated infrastructure costs. Optimal performance tuning with Celeborn is highly dependent on workload characteristics and necessitates detailed, workload-specific analysis rather than relying on generalized configuration adjustments.