Preventing OOM Kills in Spark on EKS

Every organization running large-scale Spark workloads on Kubernetes has dealt with this: a job runs for hours, processes terabytes of data, completes 80% of its work, and then executors start disappearing. No JVM exception. No heap dump. No warning in the Spark UI. Just exit code 137 and hours of compute burned. The standard response is to throw more memory at the problem, bump `spark.executor.memoryOverhead` by another 10 GB, and hope for the best. That works until the next data spike.
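For concreteness, the "bump and hope" response usually looks something like the following. This is a minimal PySpark sketch, not a recommendation; the application name and the sizes are illustrative placeholders, and only the two config keys are real Spark settings.

```python
from pyspark.sql import SparkSession

# Illustrative sketch of the memory-bump workaround described above.
spark = (
    SparkSession.builder
    .appName("etl-job")  # hypothetical job name
    # Heap available to each executor JVM.
    .config("spark.executor.memory", "16g")
    # Off-heap headroom (native buffers, Python workers, etc.) added on top
    # of the heap when sizing the executor pod. If the pod's total usage
    # exceeds its memory limit, the kubelet SIGKILLs it -- the exit code 137
    # with no JVM exception that the paragraph above describes.
    .config("spark.executor.memoryOverhead", "10g")
    .getOrCreate()
)
```

The key point is that exit code 137 comes from Kubernetes, not the JVM: the pod's total memory (heap plus overhead) crossed its limit, so raising the overhead only widens the margin until the next data spike crosses it again.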