Data Analytics

📄️ Preventing OOM Kills in Spark on EKS

Every organization running large-scale Spark workloads on Kubernetes has dealt with this: a job runs for hours, processes terabytes of data, completes 80% of its work, and then executors start disappearing. No JVM exception. No heap dump. No warning in the Spark UI. Just exit code 137 and hours of compute burned. The standard response is to throw more memory at it, bump memoryOverhead by another 10 GB, and hope for the best. That works until the next data spike.

📄️ In-Place EBS Volume Modification for Stateful Workloads

Almost every stateful data workload on EKS sits on EBS. Celeborn shuffle workers, Valkey cluster nodes, Kafka brokers, Trino spill, Pinot servers, ClickHouse, StarRocks. They all run as StatefulSets where each pod owns a PersistentVolumeClaim backed by an EBS volume. The capacity and IOPS that looked right when the cluster was first provisioned rarely stay right for long. This page is about modifying those volumes in place, while the pods keep serving traffic, with zero restarts and zero cluster impact.