Data on EKS Best Practices
Production-proven best practices for running data and ML workloads on Amazon EKS, covering cluster configuration, resource management, security, monitoring, and optimization strategies.
Overview
The best practices collected here come from working with AWS customers who run data and ML workloads on EKS. They are updated continuously based on real-world deployments and customer feedback.
These Data on EKS Best Practices expand upon the EKS Best Practices Guide for data-centric use cases (batch processing, stream processing, machine learning). We recommend reviewing the EKS Best Practices as a primer before diving into these recommendations.
Best Practice Categories
Cluster Architecture
Design patterns for dynamic and static clusters, scaling strategies, and resource management.
Performance Optimization
Tuning strategies for Spark, Karpenter autoscaling, and resource allocation patterns.
Data Storage
Storage strategies for S3, EBS, EFS, and ephemeral storage optimization.
Security & Compliance
IRSA, network policies, encryption at rest, and security hardening.
Observability
Monitoring, logging, and alerting strategies for data workloads.
Production Readiness
High availability, disaster recovery, and operational excellence.
Cluster Design Patterns
The recommendations are built from working with customers using one of two cluster designs:
- Dynamic Clusters - Scale with high "churn" rates. Run batch processing (Spark) with pods created for short periods. These clusters create/delete resources (pods, nodes) at high rates, adding unique pressures to Kubernetes components.
- Static Clusters - Large but stable scaling behavior. Run longer-lived jobs (streaming, training). Avoiding interruptions is critical, requiring careful change management.
Scale Considerations
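Large clusters typically have more than 500 nodes and more than 5,000 pods, or create and destroy hundreds of resources per minute. Treat these numbers as a guide rather than hard limits: Kubernetes scalability depends on several interacting dimensions, so the constraints a cluster actually hits differ from workload to workload.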
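As a rough illustration, the thresholds above can be written as a small check. This is a minimal sketch, not part of any EKS tooling: the `ClusterProfile` type and function names are hypothetical, and reading "hundreds of resources per minute" as 100 or more is an assumption made only for the example.

```python
from dataclasses import dataclass

# Size thresholds taken from the guidance above.
LARGE_NODE_COUNT = 500
LARGE_POD_COUNT = 5000
# Assumption: "hundreds of resources per minute" is read as 100 or more.
HIGH_CHURN_PER_MINUTE = 100


@dataclass(frozen=True)
class ClusterProfile:
    """Hypothetical snapshot of a cluster's size and churn."""
    nodes: int
    pods: int
    resources_changed_per_minute: int  # pods and nodes created or deleted


def is_large_by_size(profile: ClusterProfile) -> bool:
    """True when both the node and pod counts exceed the size thresholds."""
    return profile.nodes > LARGE_NODE_COUNT and profile.pods > LARGE_POD_COUNT


def is_high_churn(profile: ClusterProfile) -> bool:
    """True when resources are created or destroyed at a high rate."""
    return profile.resources_changed_per_minute >= HIGH_CHURN_PER_MINUTE


def is_large_cluster(profile: ClusterProfile) -> bool:
    """A cluster counts as large if it is big, high-churn, or both."""
    return is_large_by_size(profile) or is_high_churn(profile)
```

Under these assumptions, a 50-node Spark cluster that creates and deletes 300 pods per minute is treated as large because of its churn, even though its size is modest. A check like this is only a first filter; the workload itself determines which limits matter.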
Detailed best practice guides are being developed. Check back for updates on cluster configuration, resource management, security, monitoring, and optimization strategies.