Terraform Deployment for SageMaker HyperPod

This guide covers deploying SageMaker HyperPod infrastructure using Terraform modules from the awsome-distributed-training repository. Terraform modules are available for both EKS and Slurm orchestration types.

Architecture Overview

The Terraform modules provide Infrastructure as Code (IaC) for deploying complete SageMaker HyperPod environments including:

  • VPC with public and private subnets
  • Security groups configured for EFA communication
  • FSx for Lustre file system (high-performance shared storage)
  • S3 bucket for lifecycle scripts
  • IAM roles and policies
  • SageMaker HyperPod cluster with chosen orchestration

EKS Orchestration

Architecture Diagram

The EKS Terraform modules create a comprehensive infrastructure stack:

Diagram: HyperPod EKS Terraform modules

Quick Start - EKS

  1. Clone and Navigate

    git clone https://github.com/aws-samples/awsome-distributed-training.git
    cd awsome-distributed-training/1.architectures/7.sagemaker-hyperpod-eks/terraform-modules/hyperpod-eks-tf
  2. Customize Configuration

    Review the default configurations in terraform.tfvars and create a custom configuration:

    cat > custom.tfvars << EOL
    kubernetes_version = "1.32"
    eks_cluster_name = "my-eks-cluster"
    hyperpod_cluster_name = "my-hp-cluster"
    resource_name_prefix = "hp-eks-test"
    aws_region = "us-west-2"
    availability_zone_id = "usw2-az2"
    instance_groups = {
      accelerated-instance-group-1 = {
        instance_type = "ml.p5en.48xlarge",
        instance_count = 5,
        ebs_volume_size_in_gb = 100,
        threads_per_core = 2,
        enable_stress_check = true,
        enable_connectivity_check = true,
        lifecycle_script = "on_create.sh"
      }
    }
    EOL
  3. Deploy Infrastructure

    First, clone the HyperPod Helm charts repository:

    git clone https://github.com/aws/sagemaker-hyperpod-cli.git /tmp/helm-repo

    Initialize and deploy:

    terraform init
    terraform plan -var-file=custom.tfvars
    terraform apply -var-file=custom.tfvars
  4. Set Environment Variables

    cd ..
    chmod +x terraform_outputs.sh
    ./terraform_outputs.sh
    source env_vars.sh

Using Existing EKS Cluster

To use an existing EKS cluster, configure your custom.tfvars:

create_eks_module = false
existing_eks_cluster_name = "my-eks-cluster"
existing_security_group_id = "sg-1234567890abcdef0"
create_vpc_module = false
existing_vpc_id = "vpc-1234567890abcdef0"
existing_nat_gateway_id = "nat-1234567890abcdef0"
hyperpod_cluster_name = "my-hp-cluster"
resource_name_prefix = "hp-eks-test"
aws_region = "us-west-2"
availability_zone_id = "usw2-az2"
instance_groups = {
  accelerated-instance-group-1 = {
    instance_type = "ml.p5en.48xlarge",
    instance_count = 5,
    ebs_volume_size_in_gb = 100,
    threads_per_core = 2,
    enable_stress_check = true,
    enable_connectivity_check = true,
    lifecycle_script = "on_create.sh"
  }
}
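
If you are reusing infrastructure and do not have the IDs handy, standard AWS CLI lookups can supply them before you fill in custom.tfvars (the tag and VPC ID values below are placeholders):

# Find the VPC ID by its Name tag
aws ec2 describe-vpcs \
  --filters "Name=tag:Name,Values=my-vpc" \
  --query "Vpcs[].VpcId" --output text

# List security groups in that VPC
aws ec2 describe-security-groups \
  --filters "Name=vpc-id,Values=vpc-1234567890abcdef0" \
  --query "SecurityGroups[].[GroupId,GroupName]" --output table

# Find the NAT gateway attached to that VPC
aws ec2 describe-nat-gateways \
  --filter "Name=vpc-id,Values=vpc-1234567890abcdef0" \
  --query "NatGateways[].NatGatewayId" --output text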

Slurm Orchestration

Quick Start - Slurm

  1. Clone and Navigate

    git clone https://github.com/aws-samples/awsome-distributed-training.git
    cd awsome-distributed-training/1.architectures/5.sagemaker-hyperpod/terraform-modules/hyperpod-slurm-tf
  2. Customize Configuration

    cp terraform.tfvars.example terraform.tfvars
    # Edit terraform.tfvars with your specific requirements

    Example configuration:

    # terraform.tfvars
    resource_name_prefix = "hyperpod"
    aws_region = "us-west-2"
    availability_zone_id = "usw2-az2"

    hyperpod_cluster_name = "ml-cluster"

    instance_groups = {
      controller-machine = {
        instance_type = "ml.c5.2xlarge"
        instance_count = 1
        ebs_volume_size = 100
        threads_per_core = 1
        lifecycle_script = "on_create.sh"
      }
      login-nodes = {
        instance_type = "ml.m5.4xlarge"
        instance_count = 1
        ebs_volume_size = 100
        threads_per_core = 1
        lifecycle_script = "on_create.sh"
      }
      compute-nodes = {
        instance_type = "ml.g5.4xlarge"
        instance_count = 2
        ebs_volume_size = 500
        threads_per_core = 1
        lifecycle_script = "on_create.sh"
      }
    }
  3. Deploy Infrastructure

    terraform init
    terraform plan
    terraform apply
  4. Extract Outputs

    ./terraform_outputs.sh
    source env_vars.sh

Slurm Modules

The Slurm Terraform deployment includes these modules (a composition sketch follows the list):

  • vpc: Creates VPC with public/private subnets, IGW, NAT Gateway
  • security_group: EFA-enabled security group for HyperPod
  • fsx_lustre: High-performance Lustre file system
  • s3_bucket: Storage for lifecycle scripts
  • sagemaker_iam_role: IAM role with required permissions
  • lifecycle_script: Uploads and configures Slurm lifecycle scripts
  • hyperpod_cluster: SageMaker HyperPod cluster with Slurm
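
As a rough orientation, the root configuration wires these modules together along the following lines; the module paths, input names, and outputs here are illustrative, not the repository's exact code:

# Illustrative composition only; see the repository's main.tf for the real wiring.
module "vpc" {
  source               = "./modules/vpc"
  resource_name_prefix = var.resource_name_prefix
  availability_zone_id = var.availability_zone_id
}

module "security_group" {
  source = "./modules/security_group"
  vpc_id = module.vpc.vpc_id
}

module "hyperpod_cluster" {
  source            = "./modules/hyperpod_cluster"
  cluster_name      = var.hyperpod_cluster_name
  instance_groups   = var.instance_groups
  subnet_id         = module.vpc.private_subnet_id
  security_group_id = module.security_group.security_group_id
}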

Reusing Existing Resources

Both EKS and Slurm modules support reusing existing infrastructure. Set the corresponding create_*_module variable to false and provide the IDs of the existing resources:

create_vpc_module = false
existing_vpc_id = "vpc-1234567890abcdef0"
existing_private_subnet_id = "subnet-1234567890abcdef0"
existing_security_group_id = "sg-1234567890abcdef0"
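
Under the hood this pattern usually gates each module on its flag and falls back to the supplied ID; a minimal sketch of the mechanism, with illustrative variable and output names:

# Create the VPC module only when requested.
module "vpc" {
  source = "./modules/vpc"
  count  = var.create_vpc_module ? 1 : 0
  # ...
}

# Use the newly created VPC's ID, or the existing one passed in.
locals {
  vpc_id = var.create_vpc_module ? module.vpc[0].vpc_id : var.existing_vpc_id
}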

Lifecycle Scripts

The Terraform modules automatically handle lifecycle scripts:

For Slurm

  • Uploads base Slurm configuration from ../../LifecycleScripts/base-config/ to the S3 bucket (sketched below)
  • Configures Slurm scheduler
  • Mounts FSx Lustre file system
  • Installs Docker, Enroot, and Pyxis
  • Sets up user accounts and permissions
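
The upload itself is plain Terraform; a minimal sketch of pushing the base-config scripts to the lifecycle-scripts S3 bucket with the AWS provider (the bucket resource name is an assumption):

# Upload every file under base-config/ to the lifecycle-scripts bucket.
resource "aws_s3_object" "lifecycle_scripts" {
  for_each = fileset("${path.module}/../../LifecycleScripts/base-config", "**")

  bucket = aws_s3_bucket.lifecycle_scripts.id
  key    = each.value
  source = "${path.module}/../../LifecycleScripts/base-config/${each.value}"
  etag   = filemd5("${path.module}/../../LifecycleScripts/base-config/${each.value}")
}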

For EKS

  • Deploys the HyperPod dependency Helm charts (sketched below)
  • Configures EKS cluster for HyperPod integration
  • Sets up necessary Kubernetes resources
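
The dependency charts come from the sagemaker-hyperpod-cli repository cloned earlier; a sketch of installing a chart from that local checkout with Terraform's helm provider (the chart path, release name, and eks_cluster module reference are assumptions):

# Install the HyperPod dependency chart from the locally cloned repository.
resource "helm_release" "hyperpod_dependencies" {
  name      = "hyperpod-dependencies"
  chart     = "/tmp/helm-repo/helm_chart/HyperPodHelmChart"
  namespace = "kube-system"

  depends_on = [module.eks_cluster]
}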

Accessing Your Cluster

Slurm Cluster Access

After deployment, use the provided helper script:

./easy-ssh.sh <cluster-name> <region>

Or manually:

aws ssm start-session --target sagemaker-cluster:${CLUSTER_ID}_${CONTROLLER_GROUP}-${INSTANCE_ID}
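
The pieces of the SSM target can be pulled from the SageMaker API, for example (the cluster name is a placeholder):

# Cluster ID is the last segment of the cluster ARN
CLUSTER_ID=$(aws sagemaker describe-cluster --cluster-name ml-cluster \
  --query "ClusterArn" --output text | awk -F/ '{print $NF}')

# List instance IDs by instance group
aws sagemaker list-cluster-nodes --cluster-name ml-cluster \
  --query "ClusterNodeSummaries[].[InstanceGroupName,InstanceId]" --output table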

EKS Cluster Access

Configure kubectl to access your EKS cluster:

aws eks update-kubeconfig --region $AWS_REGION --name $EKS_CLUSTER_NAME
kubectl get nodes

Configuration Examples

High-Performance Computing Setup

For large-scale training workloads:

instance_groups = {
  controller-machine = {
    instance_type = "ml.c5.4xlarge"
    instance_count = 1
    ebs_volume_size = 200
    threads_per_core = 1
    lifecycle_script = "on_create.sh"
  }
  compute-nodes = {
    instance_type = "ml.p5.48xlarge"
    instance_count = 8
    ebs_volume_size = 1000
    threads_per_core = 2
    lifecycle_script = "on_create.sh"
  }
}

Development Environment

For smaller development clusters:

instance_groups = {
  controller-machine = {
    instance_type = "ml.c5.xlarge"
    instance_count = 1
    ebs_volume_size = 100
    threads_per_core = 1
    lifecycle_script = "on_create.sh"
  }
  compute-nodes = {
    instance_type = "ml.g5.xlarge"
    instance_count = 2
    ebs_volume_size = 200
    threads_per_core = 1
    lifecycle_script = "on_create.sh"
  }
}

Monitoring and Validation

After deployment, validate your cluster:

# For Slurm clusters
sinfo
squeue

# For EKS clusters
kubectl get nodes
kubectl get pods -A
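
As a quick functional check on a Slurm cluster, a trivial job confirms the scheduler can place work on the compute nodes:

# Run hostname on two nodes to confirm job placement works
srun -N 2 hostname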

Clean Up

To destroy the infrastructure:

# Validate the destroy plan first
terraform plan -destroy

# If using custom.tfvars
terraform plan -destroy -var-file=custom.tfvars

# Destroy resources
terraform destroy

# If using custom.tfvars
terraform destroy -var-file=custom.tfvars

Best Practices

  1. Version Control: Store your terraform.tfvars or custom.tfvars files in version control
  2. State Management: Use remote state storage (S3 + DynamoDB) for production deployments (see the backend sketch after this list)
  3. Resource Tagging: Use consistent tagging strategies via the resource_name_prefix
  4. Security: Review IAM policies and security group rules before deployment
  5. Cost Optimization: Choose appropriate instance types and counts for your workload
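
For the state management point, a minimal S3 backend sketch; the bucket, key, and lock table names are placeholders you would replace:

terraform {
  backend "s3" {
    bucket         = "my-terraform-state-bucket"
    key            = "hyperpod/terraform.tfstate"
    region         = "us-west-2"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}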

Troubleshooting

Common Issues

Terraform Init Fails: Ensure you have proper AWS credentials configured

aws configure list

Resource Creation Fails: Confirm that your availability_zone_id is valid for the region and has capacity for your chosen instance types

aws ec2 describe-availability-zones --region us-west-2
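
As a rough capacity check, you can also confirm which AZ IDs offer the underlying EC2 instance type (drop the ml. prefix from the HyperPod instance type):

aws ec2 describe-instance-type-offerings \
  --location-type availability-zone-id \
  --filters "Name=instance-type,Values=p5en.48xlarge" \
  --region us-west-2 \
  --query "InstanceTypeOfferings[].Location" --output text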

EKS Access Issues: Verify your IAM permissions include EKS cluster access

Slurm Issues: Check lifecycle script logs in CloudWatch or on the instances

The Terraform modules provide a robust, repeatable way to deploy SageMaker HyperPod infrastructure with best practices built in.