CloudFormation Deployment for SageMaker HyperPod
This guide covers deploying SageMaker HyperPod infrastructure using CloudFormation templates. CloudFormation templates are available for both EKS and Slurm orchestration types, providing Infrastructure as Code (IaC) solutions.
The recommended in-console cluster creation flow also uses CloudFormation in the backend to deploy its resources.
These CloudFormation templates are hosted in the sagemaker-hyperpod-cluster-setup GitHub repository.
EKS Orchestration
Architecture Overview
The EKS CloudFormation deployment uses nested stacks to create a comprehensive HyperPod environment. Each stack is responsible for different infrastructure components.
Quick Start - EKS
🚨 Recommended: Follow the official AWS documentation for EKS orchestration for detailed instructions and latest best practices.
Deploy EKS HyperPod Cluster
# Deploy complete EKS-based HyperPod infrastructure
aws cloudformation create-stack \
--stack-name hyperpod-eks-main-stack \
--template-url https://aws-sagemaker-hyperpod-cluster-setup-us-west-2-prod.s3.us-west-2.amazonaws.com/templates/main-stack-eks-based-template.yaml \
--parameters ParameterKey=ResourceNamePrefix,ParameterValue=my-hyperpod \
ParameterKey=AvailabilityZoneIds,ParameterValue=usw2-az1\\,usw2-az2 \
ParameterKey=HyperPodClusterName,ParameterValue=ml-cluster \
--capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM
Important Parameters:
- AvailabilityZoneIds: Must correspond to your target region and have capacity for your instance types
- ResourceNamePrefix: Use consistent naming across all resources
- HyperPodClusterName: Name for your HyperPod cluster
- Stage: Deployment stage (prod/dev); affects which S3 bucket is used for templates
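List-valued parameters such as AvailabilityZoneIds are easy to mis-quote in the CLI shorthand syntax, because the comma is also the separator between ParameterKey and ParameterValue. One way to sidestep this is to keep the parameters in a JSON file and pass it with --parameters file://. The sketch below assumes the three parameters shown above; the helper name write_params and the file name params.json are illustrative and not part of the repository.

```shell
#!/usr/bin/env bash
# Sketch: write CloudFormation parameters to a JSON file so that list values
# (such as comma-separated AZ IDs) need no shorthand escaping.
# write_params and params.json are illustrative names, not part of the templates.
write_params() {
  local out_file="$1" prefix="$2" az_ids="$3" cluster_name="$4"
  cat > "$out_file" <<EOF
[
  {"ParameterKey": "ResourceNamePrefix",  "ParameterValue": "${prefix}"},
  {"ParameterKey": "AvailabilityZoneIds", "ParameterValue": "${az_ids}"},
  {"ParameterKey": "HyperPodClusterName", "ParameterValue": "${cluster_name}"}
]
EOF
}

# Example usage (needs AWS credentials, so it is left commented out):
# write_params params.json my-hyperpod "usw2-az1,usw2-az2" ml-cluster
# aws cloudformation create-stack \
#   --stack-name hyperpod-eks-main-stack \
#   --template-url https://aws-sagemaker-hyperpod-cluster-setup-us-west-2-prod.s3.us-west-2.amazonaws.com/templates/main-stack-eks-based-template.yaml \
#   --parameters file://params.json \
#   --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM
```

Keeping the file under version control also gives you a record of which values each deployment used.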
EKS Nested Stacks
The main CloudFormation template from the sagemaker-hyperpod-cluster-setup repository includes these nested stacks:
1. VPCStack - vpc-template.yaml
Creates highly available VPC architecture across multiple Availability Zones.
Resources Created:
- VPC with Internet Gateway
- Public and Private Subnets across multiple AZs
- NAT Gateways for outbound connectivity
- Route Tables and associations
Key Parameters:
- VpcCIDR: VPC CIDR range
- AvailabilityZoneIds: List of AZ IDs for multi-AZ deployment
- ResourceNamePrefix: Consistent naming prefix
Required if Disabled (values you must supply yourself when this nested stack is not created):
- VpcId: Used by multiple other stacks
- NatGatewayIds: Used by PrivateSubnetStack
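When the VPC stack is disabled, these IDs have to come from resources that already exist. A minimal sketch for collecting the NAT gateway IDs of an existing VPC with the AWS CLI follows. It assumes the NatGatewayIds parameter accepts a comma-separated list; nat_gateway_ids is an illustrative helper name.

```shell
#!/usr/bin/env bash
# Sketch: print the available NAT gateways of an existing VPC as a
# comma-separated string. Assumes NatGatewayIds takes a comma-separated list.
# nat_gateway_ids is an illustrative helper name.
nat_gateway_ids() {
  local vpc_id="$1"
  aws ec2 describe-nat-gateways \
    --filter "Name=vpc-id,Values=${vpc_id}" "Name=state,Values=available" \
    --query 'NatGateways[].NatGatewayId' \
    --output text | tr -s '[:space:]' ',' | sed 's/,$//'
}

# Example usage (placeholder VPC ID):
# nat_gateway_ids vpc-0123456789abcdef0
```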
2. PrivateSubnetStack - private-subnet-template.yaml
Creates dedicated private subnets for SageMaker HyperPod cross-account ENIs.
Resources Created:
- Secondary CIDR Block for HyperPod
- Private Subnets in specified AZs
- Private Route Tables with NAT Gateway routing
Key Parameters:
- AvailabilityZoneIds: Target availability zones
- NatGatewayIds: NAT Gateways for outbound access
Required if Disabled:
- PrivateSubnetIds: Used by HyperPodClusterStack
- PrivateRouteTableIds: Used by S3EndpointStack
3. SecurityGroupStack - security-group-template.yaml
Configures security group with EFA and FSx communication rules optimized for HyperPod.
Resources Created:
- Security Group with HyperPod-optimized rules
- Intra-Security Group communication rules
- EFA communication rules (all ports within SG)
- FSx for Lustre Rules (TCP ports 988, 1018-1023)
- Outbound internet access
Key Parameters:
- VpcId: Target VPC for security group
- SecurityGroupIds: Additional security groups to reference
Required if Disabled:
- SecurityGroupId: Used by EKSClusterStack and HyperPodClusterStack
4. EKSClusterStack - eks-template.yaml
Creates EKS cluster optimized for HyperPod workloads.
Resources Created:
- EKS Private Subnets (separate from HyperPod subnets)
- IAM Cluster Service Role
- EKS Cluster with latest Kubernetes version
- Essential EKS Add-ons:
  - VPC CNI (latest version)
  - kube-proxy
  - CoreDNS
  - EKS Pod Identity Agent
- EKS Access Entries for cluster management
Key Parameters:
- KubernetesVersion: EKS cluster version
- EKSClusterName: Name for the EKS cluster
- SecurityGroupIds: Security groups for cluster
Required if Disabled:
- EKSClusterName: Used by HelmChartStack and HyperPodClusterStack
5. S3BucketStack - s3-bucket-template.yaml
Creates encrypted S3 bucket for lifecycle scripts and cluster artifacts.
Resources Created:
- S3 Bucket with server-side encryption
- Bucket policy for HyperPod access
- Versioning and lifecycle policies
Key Parameters:
- ResourceNamePrefix: Bucket naming prefix
- S3BucketName: Custom bucket name (optional)
Required if Disabled:
- S3BucketName: Used by LifeCycleScriptStack and HyperPodClusterStack
6. S3EndpointStack - s3-endpoint-template.yaml
Creates VPC endpoint for private S3 connectivity from HyperPod instances.
Resources Created:
- VPC Endpoint for S3 service
- Route table associations for private subnets
- Endpoint policy for secure access
Key Parameters:
- VpcId: Target VPC
- PrivateRouteTableIds: Route tables to associate
7. LifeCycleScriptStack - lifecycle-script-template.yaml
Deploys Lambda function to create and manage default lifecycle scripts.
Resources Created:
- AWS Lambda Function for script management
- IAM Role for Lambda execution
- Default EKS Lifecycle Scripts uploaded to S3
Key Parameters:
- S3BucketName: Target bucket for lifecycle scripts
- GithubRawUrl: Source URL for lifecycle scripts
- OnCreatePath: Path to on_create.sh script
8. SageMakerIAMRoleStack - sagemaker-iam-role-template.yaml
Creates IAM role with comprehensive permissions for HyperPod cluster operations.
Resources Created:
- SageMaker Execution Role
- Policies for HyperPod, EKS, and AWS service access
- Trust relationships for SageMaker service
Key Parameters:
- ResourceNamePrefix: Role naming prefix
- SageMakerIAMRoleName: Custom role name (optional)
Required if Disabled:
- SageMakerIAMRoleName: Used by HyperPodClusterStack
9. HelmChartStack - helm-chart-template.yaml
Automates installation of HyperPod dependency Helm charts on EKS cluster.
Resources Created:
- AWS Lambda Function for Helm chart deployment
- EKS Access Entry for Lambda function
- Kubernetes resources from HyperPod Helm charts:
  - Health monitoring agents
  - Training operators
  - Inference operators (optional)
  - Observability components
Key Parameters:
- EKSClusterName: Target EKS cluster
- HelmRepoUrl: Helm repository URL
- HelmOperators: List of operators to install
- Namespace: Kubernetes namespace for deployment
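After this stack finishes, it can be useful to confirm that the charts actually landed in the cluster. The sketch below assumes kubectl and helm are installed locally and that your IAM identity has an EKS access entry. The helper name verify_helm_charts is illustrative, and the namespace argument should be whatever you passed as Namespace.

```shell
#!/usr/bin/env bash
# Sketch: point kubectl at the EKS cluster, then list Helm releases and pods
# in the namespace used by the HelmChartStack.
# verify_helm_charts is an illustrative helper name.
verify_helm_charts() {
  local cluster_name="$1" region="$2" namespace="$3"
  aws eks update-kubeconfig --name "$cluster_name" --region "$region" >/dev/null || return 1
  helm list --namespace "$namespace"
  kubectl get pods --namespace "$namespace"
}

# Example usage (placeholder names):
# verify_helm_charts my-eks-cluster us-west-2 kube-system
```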
10. HyperPodClusterStack - hyperpod-cluster-template.yaml
Creates the SageMaker HyperPod cluster with flexible instance group configuration.
Resources Created:
- SageMaker HyperPod Cluster
- Multiple configurable instance groups (up to 20)
- Auto-scaling configuration
- Node recovery settings
Key Parameters:
- HyperPodClusterName: Cluster name
- NodeRecovery: Automatic or Manual
- NodeProvisioningMode: Continuous or OnDemand
- AutoScalerType: HyperPod native or Karpenter
- Instance Group Settings (1-20), each with:
  - Instance type and count
  - EBS volume configuration
  - Lifecycle script settings
  - Custom AMI options
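Once the cluster stack is running, the cluster status and node list can be checked from the CLI. The following is a minimal sketch using aws sagemaker describe-cluster and list-cluster-nodes; the helper names and the default 30-second polling interval are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch: poll a HyperPod cluster until it is InService, then list its nodes.
# Helper names and the polling interval are illustrative.
cluster_status() {
  aws sagemaker describe-cluster --cluster-name "$1" \
    --query 'ClusterStatus' --output text
}

wait_for_cluster() {
  local name="$1" interval="${2:-30}" status
  while true; do
    status="$(cluster_status "$name")" || return 1
    case "$status" in
      InService) echo "$status"; return 0 ;;
      Failed)    echo "$status" >&2; return 1 ;;
    esac
    sleep "$interval"   # still Creating, Updating, etc.
  done
}

list_nodes() {
  aws sagemaker list-cluster-nodes --cluster-name "$1" \
    --query 'ClusterNodeSummaries[].[InstanceGroupName,InstanceId,InstanceStatus.Status]' \
    --output table
}

# Example usage:
# wait_for_cluster ml-cluster && list_nodes ml-cluster
```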
11. FsxStack - fsx-template.yaml
Creates high-performance FSx for Lustre file system for shared storage.
Resources Created:
- FSx for Lustre file system
- Mount targets in private subnets
- Security group rules for FSx access
Key Parameters:
- StorageCapacity: File system size in GiB
- PerUnitStorageThroughput: Performance tier
- DeploymentType: SCRATCH_1, SCRATCH_2, or PERSISTENT
- DataCompressionType: LZ4 compression option
12. PrivateSubnetTagsStack - private-subnet-tagging-template.yaml
Applies necessary tags to private subnets for HyperPod and EKS integration.
Resources Created:
- Lambda function for subnet tagging
- Tags for EKS cluster discovery
- Tags for HyperPod subnet identification
13. InferenceOperatorMainStack - inference-operator-main-stack-template.yaml
Deploys inference operator components for HyperPod inference workloads (optional).
Resources Created:
- Inference operator Helm charts
- TLS certificates for secure communication
- Service accounts and RBAC configurations
- IAM roles for inference operations
14. ObservabilityCustomStack
Deploys comprehensive observability stack with Prometheus and Grafana integration (optional).
Resources Created:
- Amazon Managed Prometheus workspace
- Amazon Managed Grafana workspace
- CloudWatch integration
- Custom dashboards for HyperPod metrics
Manual CloudFormation Deployment
The CloudFormation templates are automatically synced from AWS-managed S3 buckets to the sagemaker-hyperpod-cluster-setup repository. For custom deployments:
Using AWS-Hosted Templates (Recommended)
The templates reference AWS-managed S3 buckets by default:
- Production: aws-sagemaker-hyperpod-cluster-setup-{region}-prod
- Development: aws-sagemaker-hyperpod-cluster-setup-{region}-dev
Deploy directly using the main stack template:
# Deploy EKS-based HyperPod cluster
aws cloudformation create-stack \
--stack-name hyperpod-eks-cluster \
--template-url https://aws-sagemaker-hyperpod-cluster-setup-us-west-2-prod.s3.us-west-2.amazonaws.com/templates/main-stack-eks-based-template.yaml \
--parameters ParameterKey=ResourceNamePrefix,ParameterValue=my-hyperpod \
ParameterKey=AvailabilityZoneIds,ParameterValue=usw2-az1\\,usw2-az2 \
--capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM
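create-stack returns as soon as the request is accepted, so the nested stacks are still being created when the command exits. A small sketch for blocking until the main stack finishes and then printing the status of each nested stack is shown below. The helper name nested_stack_status is illustrative. The CloudFormation waiter gives up after roughly an hour of polling, so a long-running deployment may need the call repeated.

```shell
#!/usr/bin/env bash
# Sketch: wait for the main stack, then show the status of every nested stack.
# nested_stack_status is an illustrative helper name.
nested_stack_status() {
  local stack_name="$1"
  aws cloudformation wait stack-create-complete --stack-name "$stack_name" || return 1
  aws cloudformation list-stack-resources --stack-name "$stack_name" \
    --query "StackResourceSummaries[?ResourceType=='AWS::CloudFormation::Stack'].[LogicalResourceId,ResourceStatus]" \
    --output table
}

# Example usage:
# nested_stack_status hyperpod-eks-cluster
```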
Using Custom S3 Bucket
To host templates in your own S3 bucket:
- Clone the Repository
git clone https://github.com/aws/sagemaker-hyperpod-cluster-setup.git
cd sagemaker-hyperpod-cluster-setup
- Upload Templates to Your Bucket
BUCKET_NAME=<your-bucket-name>
aws s3 cp eks/cloudformation/ s3://$BUCKET_NAME/templates/ --recursive
aws s3 cp eks/cloudformation/resources/ s3://$BUCKET_NAME/resources/ --recursive
- Deploy with Custom Bucket
aws cloudformation create-stack \
--stack-name hyperpod-eks-cluster \
--template-body file://eks/cloudformation/main-stack-eks-based-template.yaml \
--parameters ParameterKey=CustomBucketName,ParameterValue=$BUCKET_NAME \
ParameterKey=ResourceNamePrefix,ParameterValue=my-hyperpod \
--capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM
Slurm Orchestration
Quick Start - Slurm
🚨 Recommended: Follow the official AWS documentation for Slurm orchestration for detailed instructions and latest best practices.
Deploy Slurm HyperPod Cluster
# Deploy complete Slurm-based HyperPod infrastructure
aws cloudformation create-stack \
--stack-name hyperpod-slurm-main-stack \
--template-url https://aws-sagemaker-hyperpod-cluster-setup-us-west-2-prod.s3.us-west-2.amazonaws.com/templates-slurm/main-stack-slurm-based-template.yaml \
--parameters ParameterKey=ResourceNamePrefix,ParameterValue=my-hyperpod \
ParameterKey=AvailabilityZoneIds,ParameterValue=usw2-az1\\,usw2-az2 \
ParameterKey=HyperPodClusterName,ParameterValue=ml-cluster \
--capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM
Slurm CloudFormation Templates
The Slurm deployment uses the main stack template from the sagemaker-hyperpod-cluster-setup repository:
Main Stack - main-stack-slurm-based-template.yaml
The main Slurm stack includes these nested components:
1. VPCStack - slurm-vpc-template.yaml
Creates VPC infrastructure optimized for Slurm HyperPod clusters.
Resources Created:
- VPC with public and private subnets
- Internet Gateway and NAT Gateway
- Route tables and security groups
- S3 and DynamoDB VPC endpoints
Key Parameters:
- VpcCIDR: VPC CIDR range
- AvailabilityZoneIds: Target availability zones
- ResourceNamePrefix: Consistent naming
2. SecurityGroupStack - slurm-security-group-template.yaml
Configures security groups for Slurm cluster communication.
Resources Created:
- Security group with Slurm-specific rules
- EFA communication rules
- FSx for Lustre access rules
- SSH and cluster communication ports
3. S3BucketStack
Creates S3 bucket for Slurm lifecycle scripts and cluster artifacts.
Resources Created:
- Encrypted S3 bucket
- Bucket policies for HyperPod access
- Lifecycle script storage
4. LifeCycleScriptStack - slurm-lifecycle-script-template.yaml
Manages Slurm-specific lifecycle scripts.
Resources Created:
- Lambda function for script deployment
- Slurm Lifecycle Scripts uploaded to S3
- Slurm configuration files
5. SageMakerIAMRoleStack - slurm-iam-role-template.yaml
Creates IAM role for Slurm HyperPod operations.
Resources Created:
- SageMaker execution role
- Policies for Slurm cluster management
- Trust relationships for HyperPod service
6. FsxStack - slurm-fsx-template.yaml
Creates FSx for Lustre file system for Slurm clusters.
Resources Created:
- High-performance FSx for Lustre file system
- Mount targets in private subnets
- Security group rules for FSx access
7. HyperPodClusterStack - slurm-hyperpod-cluster-template.yaml
Creates the Slurm-based HyperPod cluster with up to 20 configurable instance groups.
Resources Created:
- SageMaker HyperPod Cluster with Slurm orchestration
- Controller nodes (head nodes)
- Login nodes
- Compute nodes (worker nodes)
- Multiple instance group configurations
Key Parameters:
- HyperPodClusterName: Cluster name
- NodeRecovery: Automatic or Manual
- InstanceGroupSettings1-20: Up to 20 instance groups, each with:
  - Instance type and count
  - EBS volume configuration
  - Lifecycle script settings
  - Custom AMI options
8. SSMStack - slurm-ssm-template.yaml
Configures Systems Manager for cluster access and management.
Resources Created:
- SSM endpoints for private subnet access
- IAM roles for SSM connectivity
- Session Manager configuration
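With Session Manager in place, a node is typically reached through an SSM target string built from the cluster ID, the instance group name, and the instance ID. The sketch below assumes the target format sagemaker-cluster:<cluster-id>_<instance-group-name>-<instance-id>, where the cluster ID is the last segment of the cluster ARN. Verify this format against the current HyperPod documentation before relying on it. The helper name ssm_target is illustrative.

```shell
#!/usr/bin/env bash
# Sketch: build the SSM target string for a HyperPod node.
# Assumed format: sagemaker-cluster:<cluster-id>_<instance-group-name>-<instance-id>
# ssm_target is an illustrative helper name.
ssm_target() {
  local cluster_arn="$1" group_name="$2" instance_id="$3"
  local cluster_id="${cluster_arn##*/}"   # text after the last "/" in the ARN
  printf 'sagemaker-cluster:%s_%s-%s\n' "$cluster_id" "$group_name" "$instance_id"
}

# Example usage (placeholder ARN and instance ID):
# ARN="$(aws sagemaker describe-cluster --cluster-name ml-cluster --query ClusterArn --output text)"
# aws ssm start-session --target "$(ssm_target "$ARN" controller-machine i-0123456789abcdef0)"
```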
Manual Deployment Steps
Prerequisites
- IAM Role Creation
aws iam create-role \
--role-name AmazonSagemakerExecutionRole \
--assume-role-policy-document file://0.AmazonSageMakerClustersExecutionRoleTrustedEntities.json
aws iam create-policy \
--policy-name AmazonSagemakerExecutionPolicy \
--policy-document file://1.AmazonSageMakerClustersExecutionRolePolicy.json
POLICY=$(aws iam list-policies --query 'Policies[?PolicyName==`AmazonSagemakerExecutionPolicy`]' | jq '.[0].Arn' | tr -d '"')
aws iam attach-role-policy \
--role-name AmazonSagemakerExecutionRole \
--policy-arn $POLICY
- S3 Bucket for Lifecycle Scripts
BUCKET="sagemaker-lifecycle-$(python3 -S -c 'import uuid; print(str(uuid.uuid4().hex)[:10])')"
aws s3 mb s3://${BUCKET}
Deployment Sequence
- Deploy VPC Stack (if not using existing VPC)
- Deploy FSx Lustre Stack (optional, for shared storage)
- Upload Lifecycle Scripts
aws s3 cp --recursive LifecycleScripts/base-config s3://${BUCKET}/LifecycleScripts/base-config
- Deploy HyperPod Cluster
Configuration Examples
Slurm Cluster Configuration
[
{
"InstanceGroupName": "controller-machine",
"InstanceType": "ml.c5.xlarge",
"InstanceCount": 1,
"LifeCycleConfig": {
"SourceS3Uri": "s3://${BUCKET}/LifecycleScripts/base-config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "${ROLE}",
"ThreadsPerCore": 1
},
{
"InstanceGroupName": "compute-nodes",
"InstanceType": "ml.trn1.32xlarge",
"InstanceCount": 4,
"LifeCycleConfig": {
"SourceS3Uri": "s3://${BUCKET}/LifecycleScripts/base-config/",
"OnCreate": "on_create.sh"
},
"ExecutionRole": "${ROLE}",
"ThreadsPerCore": 1
}
]
VPC Configuration
{
"SecurityGroupIds": ["$SECURITY_GROUP"],
"Subnets": ["$SUBNET_ID"]
}
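The two JSON documents above are meant to be passed to aws sagemaker create-cluster as files. The sketch below shows one way to render the VPC configuration from shell variables and submit both files. It assumes the instance group JSON has been saved as cluster-config.json with its ${BUCKET} and ${ROLE} placeholders already replaced by real values. The file names and helper names are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: render the VPC configuration and create the cluster from JSON files.
# File names and helper names are illustrative.
render_vpc_config() {
  local security_group="$1" subnet_id="$2"
  cat <<EOF
{
  "SecurityGroupIds": ["${security_group}"],
  "Subnets": ["${subnet_id}"]
}
EOF
}

create_cluster() {
  local name="$1" groups_file="$2" vpc_file="$3"
  aws sagemaker create-cluster \
    --cluster-name "$name" \
    --instance-groups "file://${groups_file}" \
    --vpc-config "file://${vpc_file}"
}

# Example usage:
# render_vpc_config "$SECURITY_GROUP" "$SUBNET_ID" > vpc-config.json
# create_cluster ml-cluster cluster-config.json vpc-config.json
```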
Lifecycle Scripts
CloudFormation deployments automatically handle lifecycle scripts that:
For Slurm
- Configure Slurm scheduler and accounting
- Mount FSx Lustre file system
- Install Docker, Enroot, and Pyxis
- Set up user accounts and POSIX permissions
- Configure EFA and networking
For EKS
- Install HyperPod dependency Helm charts
- Configure Kubernetes resources
- Set up monitoring and logging
- Configure EFA device plugins
Best Practices
- Parameter Validation: Always validate parameters before deployment
- Resource Naming: Use consistent naming conventions with prefixes
- Security: Review security group rules and IAM policies
- Monitoring: Deploy observability stacks for production workloads
- Backup: Use S3 for lifecycle script and configuration backup
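As a starting point for the validation practice above, aws cloudformation validate-template checks a template's syntax and lists the parameters it declares. It does not check whether the parameter values you plan to pass are acceptable. A minimal sketch, with an illustrative helper name:

```shell
#!/usr/bin/env bash
# Sketch: syntax-check a template hosted in S3 and list its declared
# parameters with their defaults. This does not validate parameter values.
# show_template_parameters is an illustrative helper name.
show_template_parameters() {
  local template_url="$1"
  aws cloudformation validate-template --template-url "$template_url" \
    --query 'Parameters[].[ParameterKey,DefaultValue]' \
    --output table
}

# Example usage:
# show_template_parameters https://aws-sagemaker-hyperpod-cluster-setup-us-west-2-prod.s3.us-west-2.amazonaws.com/templates/main-stack-eks-based-template.yaml
```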
Troubleshooting
Common Issues
Stack Creation Fails: Check CloudWatch logs for detailed error messages
aws logs describe-log-groups --log-group-name-prefix /aws/sagemaker/Clusters
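For stack creation failures, the CloudFormation event history is often the quickest place to find the failing resource and its error message. A minimal sketch that filters the events of one stack down to failures follows; failed_events is an illustrative helper name. With nested stacks, the parent usually only reports that a nested stack failed, so the command may need to be repeated with the nested stack's name or ID.

```shell
#!/usr/bin/env bash
# Sketch: list only the failed events of a stack, with the reported reason.
# failed_events is an illustrative helper name.
failed_events() {
  local stack_name="$1"
  aws cloudformation describe-stack-events --stack-name "$stack_name" \
    --query "StackEvents[?ends_with(ResourceStatus, 'FAILED')].[Timestamp,LogicalResourceId,ResourceStatus,ResourceStatusReason]" \
    --output table
}

# Example usage:
# failed_events hyperpod-eks-main-stack
```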
Resource Limits: Verify service quotas for your instance types
aws service-quotas get-service-quota \
--service-code sagemaker \
--quota-code L-1194F6F1
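If you do not know the quota code for your instance type, the quotas can be searched by name instead. The sketch below assumes that HyperPod quota names contain the phrase "cluster usage"; adjust the pattern if your account shows different names. find_quotas is an illustrative helper name.

```shell
#!/usr/bin/env bash
# Sketch: search SageMaker quotas by a substring of the quota name.
# The "cluster usage" pattern in the example is an assumption about naming.
# find_quotas is an illustrative helper name.
find_quotas() {
  local pattern="$1"
  aws service-quotas list-service-quotas --service-code sagemaker \
    --query "Quotas[?contains(QuotaName, '${pattern}')].[QuotaName,QuotaCode,Value]" \
    --output table
}

# Example usage:
# find_quotas "cluster usage"
```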
VPC Configuration: Ensure subnets are in correct availability zones with capacity
Getting Help
- Review CloudFormation events in the AWS Console
- Check awsome-distributed-training repository for updates
Clean Up
Delete stacks in reverse order of creation:
# Delete HyperPod cluster first
aws sagemaker delete-cluster --cluster-name ml-cluster
# Then delete CloudFormation stacks
aws cloudformation delete-stack --stack-name hyperpod-cluster-stack
aws cloudformation delete-stack --stack-name fsx-lustre-stack
aws cloudformation delete-stack --stack-name vpc-stack
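delete-cluster and delete-stack both return before the deletion has finished, so running the commands back to back can fail while dependent resources still exist. The sketch below waits between steps. The helper names and the 30-second polling interval are illustrative. The polling loop treats any describe-cluster error as "cluster gone", which is a simplification: an expired credential would end the loop too.

```shell
#!/usr/bin/env bash
# Sketch: wait for the cluster to disappear, then delete stacks one at a time.
# Helper names and the polling interval are illustrative.
wait_for_cluster_deletion() {
  local name="$1" interval="${2:-30}"
  # describe-cluster starts failing once the cluster no longer exists.
  while aws sagemaker describe-cluster --cluster-name "$name" >/dev/null 2>&1; do
    sleep "$interval"
  done
}

delete_stack_and_wait() {
  local stack_name="$1"
  aws cloudformation delete-stack --stack-name "$stack_name" &&
    aws cloudformation wait stack-delete-complete --stack-name "$stack_name"
}

# Example usage:
# aws sagemaker delete-cluster --cluster-name ml-cluster
# wait_for_cluster_deletion ml-cluster
# delete_stack_and_wait hyperpod-cluster-stack
# delete_stack_and_wait fsx-lustre-stack
# delete_stack_and_wait vpc-stack
```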
The CloudFormation templates provide a robust, repeatable way to deploy SageMaker HyperPod infrastructure with AWS best practices built-in.