Terraform Deployment for SageMaker HyperPod

This guide covers deploying SageMaker HyperPod infrastructure using Terraform modules from the awsome-distributed-training repository. Terraform modules are available for both EKS and Slurm orchestration types.

Architecture Overview

The Terraform modules provide Infrastructure as Code (IaC) for deploying complete SageMaker HyperPod environments including:

  • VPC with public and private subnets
  • Security groups configured for EFA communication
  • FSx for Lustre file system (high-performance shared storage)
  • S3 bucket for lifecycle scripts
  • IAM roles and policies
  • SageMaker HyperPod cluster with chosen orchestration

EKS Orchestration

Architecture Diagram

The EKS Terraform modules create a comprehensive infrastructure stack:

HyperPod EKS Terraform Modules

Quick Start - EKS

  1. Clone and Navigate

    git clone https://github.com/aws-samples/awsome-distributed-training.git
    cd awsome-distributed-training/1.architectures/7.sagemaker-hyperpod-eks/terraform-modules/hyperpod-eks-tf
  2. Customize Configuration

Start by reviewing the default configuration in the terraform.tfvars file, then create a custom.tfvars file with your parameter overrides.

    For example, the following custom.tfvars file would enable the creation of all new resources including a new EKS Cluster and a HyperPod instance group of 5 ml.p5en.48xlarge instances in us-west-2 using a training plan:

    cat > custom.tfvars << EOL
    kubernetes_version = "1.33"
    eks_cluster_name = "my-eks-cluster"
    hyperpod_cluster_name = "my-hp-cluster"
    resource_name_prefix = "hp-eks-test"
    aws_region = "us-west-2"
    instance_groups = [
      {
        name = "accelerated-instance-group-1"
        instance_type = "ml.p5en.48xlarge"
        instance_count = 5
        availability_zone_id = "usw2-az2"
        ebs_volume_size_in_gb = 100
        threads_per_core = 2
        enable_stress_check = true
        enable_connectivity_check = true
        lifecycle_script = "on_create.sh"
        training_plan_arn = "arn:aws:sagemaker:us-west-2:123456789012:training-plan/training-plan-example"
      }
    ]
    EOL
  3. Deploy Infrastructure

    First, clone the HyperPod Helm charts repository:

    git clone https://github.com/aws/sagemaker-hyperpod-cli.git /tmp/helm-repo

    Initialize and deploy:

    terraform init
    terraform plan -var-file=custom.tfvars
    terraform apply -var-file=custom.tfvars
  4. Set Environment Variables

    cd ..
    chmod +x terraform_outputs.sh
    ./terraform_outputs.sh
    source env_vars.sh

Using an Existing EKS Cluster with HyperPod

To use an existing EKS cluster, configure your custom.tfvars to use an existing EKS Cluster (referenced by name) along with an existing Security Group, VPC, and NAT Gateway (referenced by ID):

cat > custom.tfvars << EOL
create_eks_module = false
existing_eks_cluster_name = "my-eks-cluster"
existing_security_group_id = "sg-1234567890abcdef0"
create_vpc_module = false
existing_vpc_id = "vpc-1234567890abcdef0"
existing_nat_gateway_id = "nat-1234567890abcdef0"
hyperpod_cluster_name = "my-hp-cluster"
resource_name_prefix = "hp-eks-test"
aws_region = "us-west-2"
instance_groups = [
  {
    name = "accelerated-instance-group-1"
    instance_type = "ml.p5en.48xlarge"
    instance_count = 5
    availability_zone_id = "usw2-az2"
    ebs_volume_size_in_gb = 100
    threads_per_core = 2
    enable_stress_check = true
    enable_connectivity_check = true
    lifecycle_script = "on_create.sh"
    training_plan_arn = "arn:aws:sagemaker:us-west-2:123456789012:training-plan/training-plan-example"
  }
]
EOL

Enabling Optional Addons

Set the following parameters to true in your custom.tfvars file to enable optional addons for your HyperPod cluster (e.g. create_task_governance_module = true):

| Parameter | Usage |
| --- | --- |
| create_task_governance_module | Installs the HyperPod task governance addon for job queuing, prioritization, and scheduling on multi-team compute clusters |
| create_hyperpod_training_operator_module | Installs the HyperPod training operator addon for intelligent fault recovery, hang job detection, and process-level management capabilities (required for Checkpointless and Elastic training) |
| create_hyperpod_inference_operator_module | Installs the HyperPod inference operator addon for deployment and management of machine learning inference endpoints |
| create_observability_module | Installs the HyperPod Observability addon to publish key metrics to Amazon Managed Service for Prometheus and display them in Amazon Managed Grafana dashboards |
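For example, a custom.tfvars fragment that enables the task governance and training operator addons alongside an existing configuration might look like this (which addons you enable depends on your workload):

```hcl
create_task_governance_module            = true
create_hyperpod_training_operator_module = true
```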

Advanced Observability Metrics

In addition to enabling the HyperPod Observability addon by setting create_observability_module = true, you can configure which metrics to collect on your cluster:

| Parameter | Default | Options | Usage |
| --- | --- | --- | --- |
| training_metric_level | BASIC | BASIC, ADVANCED | Task duration, type, fault data (Advanced: event-based task performance) |
| task_governance_metric_level | DISABLED | DISABLED, ADVANCED | Team-level resource allocation |
| scaling_metric_level | DISABLED | DISABLED, ADVANCED | KEDA auto-scaling metrics |
| cluster_metric_level | BASIC | BASIC, ADVANCED | Cluster health, instance count (Advanced: detailed kube-state cluster metrics) |
| node_metric_level | BASIC | BASIC, ADVANCED | CPU, disk, OS-level usage (Advanced: full node exporter suite) |
| network_metric_level | DISABLED | DISABLED, ADVANCED | Elastic Fabric Adapter metrics |
| accelerated_compute_metric_level | BASIC | BASIC, ADVANCED | GPU utilization, temperature (Advanced: all NVIDIA GPU DCGM and Neuron metrics) |
| logging_enabled | false | true, false | When enabled, automatically creates the required log groups in Amazon CloudWatch and starts recording all container and pod logs as log streams |
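As a sketch, a custom.tfvars fragment that turns on advanced GPU and EFA metrics plus log collection, keeping the defaults elsewhere, might look like this (the exact values are illustrative):

```hcl
create_observability_module      = true
accelerated_compute_metric_level = "ADVANCED"
network_metric_level             = "ADVANCED"
logging_enabled                  = true
```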

FSx for Lustre Module

By default, the FSx for Lustre module installs the Amazon FSx for Lustre Container Storage Interface (CSI) Driver, but does not dynamically provision a new file system. For existing file systems, you can follow the static provisioning steps in the AI on SageMaker HyperPod Workshop.

If you wish to create a new file system using Terraform, add the parameter create_new_fsx_filesystem = true to your custom.tfvars file, and review the fsx_storage_capacity (default 1200 GiB) and fsx_throughput (default 250 MBps/TiB) parameters to ensure they match your requirements. When create_new_fsx_filesystem = true, the FSx for Lustre module statically creates a new file system along with a StorageClass, PersistentVolume, and PersistentVolumeClaim (PVC).

By default, the PVC is mapped to the default namespace. To use another namespace, specify it with the fsx_pvc_namespace parameter; by default, specifying a non-default namespace triggers the creation of that namespace. If you are using an existing EKS cluster where the target namespace already exists, set create_fsx_pvc_namespace = false to skip creation.
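Putting those parameters together, a custom.tfvars fragment that provisions a new file system and maps the PVC into a dedicated namespace might look like this (capacity, throughput, and namespace values are illustrative):

```hcl
create_new_fsx_filesystem = true
fsx_storage_capacity      = 2400       # GiB
fsx_throughput            = 500        # MBps per TiB
fsx_pvc_namespace         = "training" # created automatically unless create_fsx_pvc_namespace = false
```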

Amazon GuardDuty EKS Runtime Monitoring

If your target account has Amazon GuardDuty EKS Runtime Monitoring enabled, an interface VPC endpoint is automatically created so the security agent can deliver events to GuardDuty while event data remains within the AWS network. Because this VPC endpoint is not managed by Terraform, the Elastic Network Interfaces (ENIs) and Security Group that GuardDuty deploys alongside it can block destruction when you are ready to clean up. To mitigate this, we've included an optional GuardDuty cleanup script, guardduty-cleanup.sh, which is invoked through a Terraform null_resource at destruction time only. The script finds the GuardDuty VPC endpoint associated with your HyperPod VPC and deletes it, waits for the associated ENIs to be cleaned up, then deletes the associated Security Group. To register the script, add the parameter enable_guardduty_cleanup = true to your custom.tfvars file; the null_resource is wired in at plan and apply time, but the script itself runs only when you issue a terraform destroy command, never during terraform apply.

Slurm Orchestration

Quick Start - Slurm

  1. Clone and Navigate

    git clone https://github.com/aws-samples/awsome-distributed-training.git
    cd awsome-distributed-training/1.architectures/5.sagemaker-hyperpod/terraform-modules/hyperpod-slurm-tf
  2. Customize Configuration

    cp terraform.tfvars.example terraform.tfvars
    # Edit terraform.tfvars with your specific requirements

    Example configuration:

    # terraform.tfvars
    resource_name_prefix = "hyperpod"
    aws_region = "us-west-2"
    availability_zone_id = "usw2-az2"

    hyperpod_cluster_name = "ml-cluster"

    instance_groups = {
      controller-machine = {
        instance_type = "ml.c5.2xlarge"
        instance_count = 1
        ebs_volume_size = 100
        threads_per_core = 1
        lifecycle_script = "on_create.sh"
      }
      login-nodes = {
        instance_type = "ml.m5.4xlarge"
        instance_count = 1
        ebs_volume_size = 100
        threads_per_core = 1
        lifecycle_script = "on_create.sh"
      }
      compute-nodes = {
        instance_type = "ml.g5.4xlarge"
        instance_count = 2
        ebs_volume_size = 500
        threads_per_core = 1
        lifecycle_script = "on_create.sh"
      }
    }
  3. Deploy Infrastructure

    terraform init
    terraform plan
    terraform apply
  4. Extract Outputs

    ./terraform_outputs.sh
    source env_vars.sh

Slurm Modules

The Slurm Terraform deployment includes these modules:

  • vpc: Creates VPC with public/private subnets, IGW, NAT Gateway
  • security_group: EFA-enabled security group for HyperPod
  • fsx_lustre: High-performance Lustre file system
  • s3_bucket: Storage for lifecycle scripts
  • sagemaker_iam_role: IAM role with required permissions
  • lifecycle_script: Uploads and configures Slurm lifecycle scripts
  • hyperpod_cluster: SageMaker HyperPod cluster with Slurm

Reusing Existing Resources

Both EKS and Slurm modules support reusing existing infrastructure. Set the corresponding create_*_module to false and provide the existing resource ID:

create_vpc_module = false
existing_vpc_id = "vpc-1234567890abcdef0"
existing_private_subnet_id = "subnet-1234567890abcdef0"
existing_security_group_id = "sg-1234567890abcdef0"

Lifecycle Scripts

The Terraform modules automatically handle lifecycle scripts:

For Slurm

  • Uploads base Slurm configuration from ../../LifecycleScripts/base-config/
  • Configures Slurm scheduler
  • Mounts FSx Lustre file system
  • Installs Docker, Enroot, and Pyxis
  • Sets up user accounts and permissions

For EKS

  • Deploys HyperPod dependency Helm charts
  • Configures EKS cluster for HyperPod integration
  • Sets up necessary Kubernetes resources

Accessing Your Cluster

Slurm Cluster Access

After deployment, use the provided helper script:

./easy-ssh.sh <cluster-name> <region>

Or manually:

aws ssm start-session --target sagemaker-cluster:${CLUSTER_ID}_${CONTROLLER_GROUP}-${INSTANCE_ID}
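The SSM target string is assembled from the cluster ID, the controller instance group name, and the controller's instance ID. As a sketch with placeholder values (in practice these come from your deployment outputs, e.g. via env_vars.sh):

```shell
# Placeholder values for illustration; substitute your real deployment outputs
CLUSTER_ID="abc123"
CONTROLLER_GROUP="controller-machine"
INSTANCE_ID="i-0123456789abcdef0"

# Target format expected by `aws ssm start-session` for HyperPod nodes
TARGET="sagemaker-cluster:${CLUSTER_ID}_${CONTROLLER_GROUP}-${INSTANCE_ID}"
echo "$TARGET"
```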

EKS Cluster Access

Configure kubectl to access your EKS cluster:

aws eks update-kubeconfig --region $AWS_REGION --name $EKS_CLUSTER_NAME
kubectl get nodes

Configuration Examples

High-Performance Computing Setup

For large-scale training workloads:

instance_groups = {
  controller-machine = {
    instance_type = "ml.c5.4xlarge"
    instance_count = 1
    ebs_volume_size = 200
    threads_per_core = 1
    lifecycle_script = "on_create.sh"
  }
  compute-nodes = {
    instance_type = "ml.p5.48xlarge"
    instance_count = 8
    ebs_volume_size = 1000
    threads_per_core = 2
    lifecycle_script = "on_create.sh"
  }
}

Development Environment

For smaller development clusters:

instance_groups = {
  controller-machine = {
    instance_type = "ml.c5.xlarge"
    instance_count = 1
    ebs_volume_size = 100
    threads_per_core = 1
    lifecycle_script = "on_create.sh"
  }
  compute-nodes = {
    instance_type = "ml.g5.xlarge"
    instance_count = 2
    ebs_volume_size = 200
    threads_per_core = 1
    lifecycle_script = "on_create.sh"
  }
}

Monitoring and Validation

After deployment, validate your cluster:

# For Slurm clusters
sinfo
squeue

# For EKS clusters
kubectl get nodes
kubectl get pods -A

Clean Up

To destroy the infrastructure:

# Before destroying resources, list state to exclude any resources you wish to retain from deletion:
terraform state list
terraform state rm <resource_to_preserve>

# Validate the destroy plan first
terraform plan -destroy

# If using custom.tfvars
terraform plan -destroy -var-file=custom.tfvars

# Destroy resources
terraform destroy

# If using custom.tfvars
terraform destroy -var-file=custom.tfvars

Best Practices

  1. Version Control: Store your terraform.tfvars or custom.tfvars files in version control
  2. State Management: Use remote state storage (S3 + DynamoDB) for production deployments
  3. Resource Tagging: Use consistent tagging strategies via the resource_name_prefix
  4. Security: Review IAM policies and security group rules before deployment
  5. Cost Optimization: Choose appropriate instance types and counts for your workload
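As a sketch of remote state management (Best Practice 2), assuming you have already created the S3 bucket and DynamoDB lock table named below, a backend block added to your root module might look like:

```hcl
terraform {
  backend "s3" {
    bucket         = "my-tf-state-bucket"      # assumption: pre-created, versioned S3 bucket
    key            = "hyperpod/terraform.tfstate"
    region         = "us-west-2"
    dynamodb_table = "terraform-locks"         # assumption: table with a LockID partition key
    encrypt        = true
  }
}
```

Run terraform init again after adding or changing a backend block so Terraform can migrate the existing state.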

Troubleshooting

Common Issues

Terraform Init Fails: Ensure you have proper AWS credentials configured

aws configure list

Resource Creation Fails: Check availability zone capacity for your chosen instance types

aws ec2 describe-availability-zones --region us-west-2

EKS Access Issues: Verify your IAM permissions include EKS cluster access

Slurm Issues: Check lifecycle script logs in CloudWatch or on the instances

Getting Help

The Terraform modules provide a robust, repeatable way to deploy SageMaker HyperPod infrastructure with best practices built-in.