SageMaker Studio + Hyperpod Integration Guide

This guide provides step-by-step instructions for setting up Amazon SageMaker Studio with Hyperpod on Amazon EKS, including Amazon FSx for Lustre storage configuration.

SageMaker Studio with Hyperpod integration

Table of Contents

  1. Prerequisites
  2. EKS Cluster Setup
  3. FSx for Lustre Configuration
  4. SageMaker Studio Domain Setup
  5. SageMaker Studio IDE Configuration
  6. Setup EKS Access Entry
  7. Setup EKS Cluster connection
  8. Optional: Setup Hyperpod Task Governance and CloudWatch Observability
  9. Optional: MLflow Setup

Prerequisites

Before starting, ensure you have:

  • AWS CLI configured with appropriate permissions
  • Access to AWS Management Console
  • Familiarity with Amazon EKS, SageMaker, and FSx for Lustre
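
A quick way to confirm the AWS CLI is configured (a minimal sanity check, nothing Hyperpod-specific):

aws sts get-caller-identity

If this returns your account ID and caller ARN, the credentials are in place.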

EKS Cluster Setup

To create an Amazon EKS cluster and set up a SageMaker Hyperpod cluster on top of it, follow one of these options:

  1. Option 1: Initial Cluster Setup
  2. Option 2: Using CloudFormation (see Infrastructure as Code section)
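
If you follow the manual setup, creating the EKS control plane itself is typically a single eksctl call; a minimal sketch (cluster name, region, and Kubernetes version are illustrative, and the linked steps cover the Hyperpod-specific networking and lifecycle configuration):

eksctl create cluster \
--name hyperpod-eks-cluster \
--region us-east-1 \
--version 1.30 \
--without-nodegroup

No node group is created here because Hyperpod provisions and manages the worker nodes itself.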

FSx for Lustre Configuration

To create an Amazon FSx for Lustre volume that can be used by both EKS pods and SageMaker Studio, follow the steps in Set up your shared file system. You can choose either dynamic or static provisioning; a static-provisioning sketch follows below.
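
If you choose static provisioning, a minimal sketch of the PersistentVolume and PersistentVolumeClaim looks like the following (this assumes the FSx for Lustre CSI driver is already installed on the cluster; the file system ID, DNS name, mount name, and the fsx-pv / fsx-claim names are placeholders for your own values):

kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: fsx-pv
spec:
  capacity:
    storage: 1200Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: fsx.csi.aws.com
    volumeHandle: fs-0123456789abcdef0          # placeholder: your file system ID
    volumeAttributes:
      dnsname: fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com   # placeholder DNS name
      mountname: abcdefgh                       # placeholder mount name
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fsx-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  volumeName: fsx-pv
  resources:
    requests:
      storage: 1200Gi
EOF

Whatever claim name you use (fsx-claim here) is the value you will later pass as the FSxClaimName parameter of the CloudFormation template.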


SageMaker Studio Domain Setup

You can deploy the CloudFormation template from the awsome-distributed-training repository, which creates the following resources:

  1. SageMaker Studio domain
  2. Lifecycle configurations that install the packages needed by the Studio IDEs, such as kubectl and jq. Lifecycle configurations are created for both JupyterLab and Code Editor
  3. A Lambda function that:
    1. Associates the created security-group-for-inbound-nfs security group with the Studio domain
    2. Associates the security-group-for-inbound-nfs security group with the FSx for Lustre ENIs
    3. Optional: If SharedFSx is set to True, creates the shared partition in the FSx for Lustre volume and associates it with the Studio domain

Shared FSx Partition

  1. If SharedFSx is set to False, a Lambda function that:
    1. Creates the partition /{user_profile_name} and associates it with the Studio user profile
    2. Creates an EventBridge rule that invokes the Lambda function above each time a new user profile is created.

Partitioned FSx

The CloudFormation template requires the following parameters:

  1. EKSClusterName: Name of the EKS Cluster
  2. ExistingFSxLustreId: ID of the created FSx for Lustre volume
  3. FSxClaimName: Name of the claim created for the FSx for Lustre volume
  4. ExistingVpcId: Dropdown menu for selecting the EKS cluster VPC
  5. ExistingSubnetIds: Dropdown menu for selecting the EKS cluster Private Subnet IDs.
Important

While specifying the ExistingSubnetIds, pass in only the subnet IDs associated with your HyperPod cluster, not those of your EKS cluster. You can find this value in the environment variables file (env_vars) you used to create the cluster, as PRIVATE_SUBNET_ID. Alternatively, check the console for private subnets named <PREFIX> Private Subnet 1 (correct) vs. <PREFIX> EKS Cluster Private Subnet 1.

Do not attempt to delete the CloudFormation stack before all EFA network interfaces have been removed: they are dependencies of your private subnet and will prevent it from being deleted as a stack resource. This may take several minutes while the HyperPod nodes shut down.
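
With these values in hand, the stack can also be deployed from the CLI; a hedged sketch, assuming the template has been downloaded locally as sagemaker-studio-hyperpod.yaml (a placeholder filename) and using illustrative resource IDs:

aws cloudformation deploy \
--template-file sagemaker-studio-hyperpod.yaml \
--stack-name sagemaker-studio-hyperpod \
--capabilities CAPABILITY_NAMED_IAM \
--parameter-overrides \
EKSClusterName=<YOUR_CLUSTER_NAME> \
ExistingFSxLustreId=fs-0123456789abcdef0 \
FSxClaimName=fsx-claim \
ExistingVpcId=vpc-0123456789abcdef0 \
ExistingSubnetIds=subnet-0123456789abcdef0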


SageMaker Studio IDE Configuration

As an admin user, once your SageMaker Studio Domain is provisioned, you may go in and create users as you see fit.

note

This step assumes that you already have a Studio Domain. If you don't, first complete the section titled "SageMaker Studio Domain Setup" above.

You can now select your preferred IDE from SageMaker Studio.

SageMaker Studio Home

For the purpose of this workshop, we are going to create a Code Editor environment.

From the top-left menu:

  1. Click on Code Editor
  2. Click on Create Code Editor Space
  3. Enter a name
  4. Click on Create Space
  5. From the Attach custom filesystem - optional dropdown menu, select the FSx for Lustre volume
  6. From the Lifecycle configuration dropdown menu, select the available lifecycle configuration

Code Editor setup

Click on Run Space. Wait until the space is created, then click Open Code Editor.

To verify that your file system was mounted, check that the path /mnt/custom-file-systems/fsx_lustre/<FSX_ID> exists in the Code Editor space:

Code Editor setup

You can also run:

df -h

If you set SharedFSx to False, you can verify separate partitions for two users. Example output from user1:

Filesystem                      Size  Used Avail Use% Mounted on
overlay                          37G  494M   37G   2% /
tmpfs                            64M     0   64M   0% /dev
tmpfs                           1.9G     0  1.9G   0% /sys/fs/cgroup
shm                             392M     0  392M   0% /dev/shm
/dev/nvme1n1                    5.0G  529M  4.5G  11% /home/sagemaker-user
/dev/nvme0n1p1                  180G   31G  150G  18% /opt/.sagemakerinternal
10.1.53.46@tcp:/ylacfb4v/aman1  1.2T  7.5M  1.2T   1% /mnt/custom-file-systems/fsx_lustre/fs-0104f3de83efe0f33
127.0.0.1:/                     8.0E     0  8.0E   0% /mnt/custom-file-systems/efs/fs-052756a07c3a5ba97_fsap-0b5e6e7c68f22fee3
tmpfs                           1.9G     0  1.9G   0% /proc/acpi
tmpfs                           1.9G     0  1.9G   0% /sys/firmware

Example output from user2:

Filesystem                      Size  Used Avail Use% Mounted on
overlay                          37G  478M   37G   2% /
tmpfs                            64M     0   64M   0% /dev
tmpfs                           1.9G     0  1.9G   0% /sys/fs/cgroup
shm                             392M     0  392M   0% /dev/shm
/dev/nvme0n1p1                  180G   31G  150G  18% /opt/.sagemakerinternal
/dev/nvme1n1                    5.0G  529M  4.5G  11% /home/sagemaker-user
127.0.0.1:/                     8.0E     0  8.0E   0% /mnt/custom-file-systems/efs/fs-052756a07c3a5ba97_fsap-0a323a3e5a27e1bdc
10.1.53.46@tcp:/ylacfb4v/aman2  1.2T  7.5M  1.2T   1% /mnt/custom-file-systems/fsx_lustre/fs-0104f3de83efe0f33
tmpfs                           1.9G     0  1.9G   0% /proc/acpi
tmpfs                           1.9G     0  1.9G   0% /sys/firmware

The difference here is that the FSx for Lustre mount (ylacfb4v) has a separate partition for each user (aman1 vs. aman2). You can cd into /mnt/custom-file-systems/fsx_lustre/fs-0104f3de83efe0f33 from each user, write a file, and verify that the other user cannot see it, as sketched below.
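
A quick cross-user check (paths taken from the example output above; the file name is arbitrary):

# As user1:
cd /mnt/custom-file-systems/fsx_lustre/fs-0104f3de83efe0f33
echo "hello from user1" > user1-test.txt

# As user2, the same mount path is backed by a different partition:
ls /mnt/custom-file-systems/fsx_lustre/fs-0104f3de83efe0f33   # user1-test.txt is not listed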

Alternatively, if you set SharedFSx to True, you can check the mount using df -h, and it will show something like:

Filesystem                       Size  Used Avail Use% Mounted on
overlay                           37G  478M   37G   2% /
tmpfs                             64M     0   64M   0% /dev
tmpfs                            1.9G     0  1.9G   0% /sys/fs/cgroup
shm                              392M     0  392M   0% /dev/shm
/dev/nvme0n1p1                   180G   31G  150G  18% /opt/.sagemakerinternal
/dev/nvme1n1                     5.0G  529M  4.5G  11% /home/sagemaker-user
10.1.53.46@tcp:/ylacfb4v/shared  1.2T  7.5M  1.2T   1% /mnt/custom-file-systems/fsx_lustre/fs-0104f3de83efe0f33
127.0.0.1:/                      8.0E     0  8.0E   0% /mnt/custom-file-systems/efs/fs-0e16e272aba907ad3_fsap-08ae9b9f68be028d7
tmpfs                            1.9G     0  1.9G   0% /proc/acpi
tmpfs                            1.9G     0  1.9G   0% /sys/firmware

with the /shared partition.


Setup EKS Access Entry

To allow users to deploy training workloads on Hyperpod, set up an EKS access entry for the SageMaker Studio IAM role.

Run the following commands:

export EKS_CLUSTER_NAME=<YOUR_CLUSTER_NAME>

Replace YOUR_CLUSTER_NAME with the name of the EKS cluster.

CALLER_IDENTITY=$(aws sts get-caller-identity --output json)
ACCOUNT_ID=$(echo "$CALLER_IDENTITY" | jq -r .Account)
USER_ARN=$(echo "$CALLER_IDENTITY" | jq -r .Arn)
# From SageMaker Studio, the caller identity is an assumed role
# (arn:aws:sts::<ACCOUNT_ID>:assumed-role/<ROLE_NAME>/<SESSION>),
# so extract the role name and rebuild the underlying IAM role ARN.
ROLE_NAME=$(echo "$USER_ARN" | cut -d'/' -f2)
USER_ARN="arn:aws:iam::${ACCOUNT_ID}:role/${ROLE_NAME}"

Create an EKS Access Entry:

aws eks create-access-entry \
--cluster-name "$EKS_CLUSTER_NAME" \
--principal-arn "$USER_ARN" \
--type "STANDARD"

Associate an IAM policy to the access entry:

aws eks associate-access-policy \
--cluster-name "$EKS_CLUSTER_NAME" \
--principal-arn "$USER_ARN" \
--policy-arn "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy" \
--access-scope '{"type": "cluster"}'

You are now ready to setup the cluster connection.

If you run

kubectl get svc

you should see an output similar to:

NAME             TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
svc/kubernetes   ClusterIP   10.100.0.1   <none>        443/TCP   1m

Setup EKS Cluster connection

Important: Follow the steps in Verifying cluster connection to EKS until "Verify helm Chart Installation".
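
Under the hood, the connection step mainly points your local kubeconfig at the cluster; a minimal sketch (assumes your default region matches the cluster's region):

aws eks update-kubeconfig --name $EKS_CLUSTER_NAME
kubectl get nodes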

You should now be able to operate with the EKS cluster.

SageMaker Studio with Hyperpod integration


Optional: Setup Hyperpod Task Governance and CloudWatch Observability

  1. For Hyperpod task governance, if not previously done, follow the steps in Task Governance
  2. For CloudWatch Container Insights, if not previously done, follow the steps in Container Insights

Run the following command:

export EKS_CLUSTER_NAME=<EKS_CLUSTER_NAME>
export EKS_CLUSTER_EXECUTION_ROLE_NAME=<EKS_CLUSTER_EXECUTION_ROLE_NAME>

Replace:

  • EKS_CLUSTER_NAME: EKS Cluster name
  • EKS_CLUSTER_EXECUTION_ROLE_NAME: IAM Role name used by the EKS Cluster

Attach the CloudWatch agent policy to the role:

aws iam attach-role-policy \
--role-name $EKS_CLUSTER_EXECUTION_ROLE_NAME \
--policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy

Optional: You can install the CloudWatch Observability add-on through the AWS CLI:

aws eks create-addon \
--addon-name amazon-cloudwatch-observability \
--cluster-name $EKS_CLUSTER_NAME

For additional information, please refer to Upgrading to Container Insights with enhanced observability for Amazon EKS in CloudWatch.
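
To confirm that the add-on reached the ACTIVE state, you can describe it:

aws eks describe-addon \
--cluster-name $EKS_CLUSTER_NAME \
--addon-name amazon-cloudwatch-observability \
--query 'addon.status'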


Optional: MLflow Setup

Follow the AWS documentation: Assign IAM roles to Kubernetes service accounts

Note: You can perform these steps from SageMaker Studio. The Studio Domain execution role is configured with the necessary IAM Policy. If you need to edit the role, ensure these steps are performed by an Admin user.

Run the following commands:

export EKS_CLUSTER_NAME=<EKS_CLUSTER_NAME>
export EKS_MLFLOW_POLICY_ARN=<EKS_MLFLOW_POLICY_ARN>

Replace:

  • EKS_CLUSTER_NAME: EKS Cluster name
  • EKS_MLFLOW_POLICY_ARN: MLflow policy ARN from the output of the CloudFormation template

SageMaker Studio CFN Output

eksctl create iamserviceaccount \
--name eks-hyperpod-sa \
--namespace default \
--cluster $EKS_CLUSTER_NAME \
--role-name eks-hyperpod-mlflow-role \
--attach-policy-arn $EKS_MLFLOW_POLICY_ARN \
--approve

kubectl apply -f - <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: eks-hyperpod-sa
  namespace: default
EOF
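
To verify the wiring, check that the service account carries the IAM role annotation created by eksctl:

kubectl get serviceaccount eks-hyperpod-sa -n default -o yaml

The metadata should include an eks.amazonaws.com/role-arn annotation referencing the eks-hyperpod-mlflow-role role.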