Data on EKS Best Practices

Production-proven best practices for running data and ML workloads on Amazon EKS, covering cluster configuration, resource management, security, monitoring, and optimization strategies.

Production Tested · Security Hardened · Performance Optimized

Overview

Through working with AWS customers, we've identified production-proven best practices for running data and ML workloads on EKS. These recommendations are continuously updated based on real-world deployments and customer feedback.

These Data on EKS Best Practices expand upon the EKS Best Practices Guide for data-centric use cases (batch processing, stream processing, machine learning). We recommend reviewing the EKS Best Practices as a primer before diving into these recommendations.

Best Practice Categories

Cluster Architecture

Design patterns for dynamic and static clusters, scaling strategies, and resource management.

Dynamic Clusters · Static Clusters · High Churn

Performance Optimization

Tuning strategies for Spark, Karpenter autoscaling, and resource allocation patterns.

Spark Tuning · Autoscaling · Cost Optimization

Data Storage

Storage strategies for S3, EBS, EFS, and ephemeral storage optimization.

S3 Integration · EBS Volumes · Shuffle Data

Security & Compliance

IRSA, network policies, encryption at rest, and security hardening.

IRSA · Network Policies · Encryption

Observability

Monitoring, logging, and alerting strategies for data workloads.

Prometheus · CloudWatch · Dashboards

Production Readiness

High availability, disaster recovery, and operational excellence.

HA Setup · Backup/Restore · GitOps

Cluster Design Patterns

These recommendations were built from working with customers running one of two cluster designs:

  • Dynamic Clusters - Scale with high "churn" rates, running batch workloads (such as Spark) whose pods live only for the duration of a job. These clusters create and delete resources (pods, nodes) at high rates, which puts unique pressure on Kubernetes control-plane components. A sketch for estimating churn follows this list.

  • Static Clusters - Large, but with stable scaling behavior, running longer-lived workloads (streaming, model training). Avoiding interruptions is critical, so changes must be managed carefully.
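
To make the distinction concrete, here is a minimal sketch, assuming the official `kubernetes` Python client and credentials available via your kubeconfig, that approximates pod churn by counting pods created within a recent window. It only sees pods that still exist (deleted pods drop out of the listing), and the 10-minute window is an arbitrary choice for illustration, so treat the result as a rough signal rather than a metric.

```python
from datetime import datetime, timedelta, timezone

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in a pod
v1 = client.CoreV1Api()

# Count pods whose creationTimestamp falls inside a recent window.
# Deleted pods are invisible here, so this underestimates true churn.
window = timedelta(minutes=10)
cutoff = datetime.now(timezone.utc) - window

pods = v1.list_pod_for_all_namespaces().items
recent = [p for p in pods if p.metadata.creation_timestamp >= cutoff]

per_minute = len(recent) / (window.total_seconds() / 60)
print(f"~{per_minute:.1f} pod creations/minute over the last {window}")
```

In production you would more likely derive churn from API-server audit logs or Prometheus metrics; this point-in-time view is just a quick way to tell a high-churn batch cluster from a stable streaming cluster.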

Scale Considerations

Large clusters typically have more than 500 nodes and 5,000 pods, or create and destroy hundreds of resources (pods, nodes) per minute. That said, Kubernetes scalability is multi-dimensional, so the constraint you hit first depends on your workload.
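
As a back-of-the-envelope check against those thresholds, the sketch below (same assumed `kubernetes` Python client setup as above) compares live node and pod counts with the figures quoted here; the churn side can be estimated with the earlier sketch.

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

node_count = len(v1.list_node().items)
pod_count = len(v1.list_pod_for_all_namespaces().items)

# Size thresholds quoted above; high churn (hundreds of resource
# creations/deletions per minute) is the alternative criterion.
is_large = node_count > 500 and pod_count > 5000
print(f"nodes={node_count}, pods={pod_count} -> "
      f"{'large' if is_large else 'not large'} by the size heuristic alone")
```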

Coming Soon

Detailed best practice guides are being developed. Check back for updates on cluster configuration, resource management, security, monitoring, and optimization strategies.