
SageMaker Studio + Hyperpod Integration Guide

This guide provides step-by-step instructions for setting up Amazon SageMaker Studio with an Amazon SageMaker HyperPod SLURM cluster, including FSx for Lustre storage configuration.

We will help set up your Studio environment so that:

  1. You can use familiar environments such as JupyterLab and CodeEditor to interact with your SageMaker HyperPod (SMHP) SLURM cluster
  2. You can access your cluster's FSx for Lustre (FSxL) file system from your JupyterLab/CodeEditor instance
  3. Your CodeEditor/JupyterLab instance will essentially function as a login node to the SMHP SLURM cluster!

Why login nodes? Login nodes allow users to log in to the cluster, submit jobs, and view and manipulate data without running on the critical slurmctld scheduler node. They also let you run monitoring servers such as aim, TensorBoard, or Grafana/Prometheus.
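
For example, once the setup in this guide is complete, you'll be able to run typical login-node commands like these from your Studio terminal (the batch script name below is just a placeholder):

sinfo                    # view cluster partitions and node state
squeue                   # view queued and running jobs
sbatch my_job.sbatch     # submit a batch job (placeholder script name)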

SageMaker Studio with Hyperpod integration

Table of Contents

  1. Prerequisites
  2. Cluster Setup
  3. FSx for Lustre Configuration
  4. SageMaker Studio Domain Setup
  5. SageMaker Studio IDE Configuration
  6. Monitor SLURM Installation
  7. Pitfalls

Prerequisites

Before starting, ensure you have:


Cluster Setup

To create an Amazon SageMaker HyperPod SLURM cluster, you can follow one of these options:

  1. Option 1: Initial Cluster Setup
  2. Option 2: Using CloudFormation (see Infrastructure as Code section)

FSx for Lustre Configuration

An FSx for Lustre (FSxL) file system is created for you as part of the cluster setup! We will use this file system for both your SMHP cluster nodes and your Studio Domain.
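
If you need to look up the file system ID later (for example, for the ExistingFSxLustreId parameter in the next section), one way to find it, assuming the AWS CLI is configured for your cluster's region, is:

aws fsx describe-file-systems \
  --query 'FileSystems[?FileSystemType==`LUSTRE`].[FileSystemId,DNSName]' \
  --output table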

You can move on to the next section.


SageMaker Studio Domain Setup

You can deploy the CloudFormation template from studio-slurm.yaml in the awsome-distributed-training repository, which creates the following resources:

  1. SageMaker Studio domain
  2. Lifecycle configurations (for both JupyterLab and Code Editor) that install the necessary packages for the Studio IDE, including SLURM. These configure your CodeEditor/JupyterLab instance so that it essentially functions as a login node for your SageMaker HyperPod cluster!
  3. A Lambda function that:
    1. Associates the created security-group-for-inbound-nfs security group to the Studio domain
    2. Associates the security-group-for-inbound-nfs security group to the FSx for Lustre ENIs
    3. Optional: If SharedFSx is set to True, creates the partition shared in the FSx for Lustre volume, and associates it to the Studio domain

Shared FSx Partition

  1. If SharedFSx is set to False, a Lambda function that:
    1. Creates the partition /{user_profile_name}, and associates it to the Studio user profile
    2. Creates an EventBridge rule that invokes the previously defined Lambda function each time a new user is created.

Partitioned FSx

The CloudFormation template requires the following parameters (a sample CLI deployment is shown after the list):

  1. AdditionalUsers: Your configured SLURM users (POSIX) that you want to give access to write to your Studio's file system space (comma separated). ubuntu is added by default, so you don't need to add it in.
  2. ExistingFSxLustreId: Id of the created FSx for Lustre file system
  3. ExistingSubnetIds: Dropdown menu for selecting the SMHP cluster Private Subnet IDs.
  4. ExistingVpcId: Dropdown menu for selecting the SMHP cluster VPC
  5. HeadNodeName: The name of your SMHP SLURM cluster's head node (default controller-machine)
  6. HyperPodClusterName: The name of your SMHP SLURM cluster (default: ml-cluster)
  7. SecurityGroupId: Id of the security group that allows communication with the HyperPod Slurm controller node (for MUNGE authentication)
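
If you prefer the AWS CLI over the console, a minimal deployment sketch looks like the following. All parameter values below are illustrative placeholders; adjust them to your environment, and if your version of the template exposes the SharedFSx parameter referenced above, pass it the same way.

aws cloudformation create-stack \
  --stack-name studio-slurm-integration \
  --template-body file://studio-slurm.yaml \
  --capabilities CAPABILITY_NAMED_IAM \
  --parameters \
    ParameterKey=HyperPodClusterName,ParameterValue=ml-cluster \
    ParameterKey=HeadNodeName,ParameterValue=controller-machine \
    ParameterKey=ExistingFSxLustreId,ParameterValue=fs-xxxxxxxxxxxx \
    ParameterKey=ExistingVpcId,ParameterValue=vpc-xxxxxxxxxxxx \
    ParameterKey=ExistingSubnetIds,ParameterValue=subnet-xxxxxxxxxxxx \
    ParameterKey=SecurityGroupId,ParameterValue=sg-xxxxxxxxxxxx \
    ParameterKey=AdditionalUsers,ParameterValue=user1

The --capabilities flag is included because the template's Lambda functions typically need IAM roles; the console deployment handles this acknowledgement with a checkbox.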

SageMaker Studio IDE Configuration

As an admin user, once your SageMaker Studio Domain is provisioned, you may go in and create users as you see fit.

note

This step DOES NOT assume that you already have a Studio Domain. To create one, check out the previous section titled "SageMaker Studio Domain Setup".

You can now select your preferred IDE from SageMaker Studio.

SageMaker Studio Home

For the purpose of this workshop, we are going to create a Code Editor environment.

From the top-left menu:

  1. Click on Code Editor
  2. Click on Create Code Editor Space
  3. Enter a name
  4. Click on Create Space
  5. From the Attach custom filesystem - optional dropdown menu, select the FSx for Lustre volume
  6. From the Lifecycle configuration dropdown menu, select the available lifecycle configuration

Code Editor setup

Click on Run Space. Wait until the space is created, then click Open Code Editor.

To verify that your file system was mounted, you can check whether the path /mnt/custom-file-systems/fsx_lustre/$FSX_ID is mounted in the Code Editor space:

Code Editor setup
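
For example, from a terminal in the space you can list the mounted FSxL file systems (path taken from the screenshot above):

ls /mnt/custom-file-systems/fsx_lustre/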

You can also run:

df -h

If you set SharedFSx to False, you can verify separate partitions for two users. Example output from user1:

Filesystem                      Size  Used Avail Use% Mounted on
overlay                          37G  494M   37G   2% /
tmpfs                            64M     0   64M   0% /dev
tmpfs                           1.9G     0  1.9G   0% /sys/fs/cgroup
shm                             392M     0  392M   0% /dev/shm
/dev/nvme1n1                    5.0G  529M  4.5G  11% /home/sagemaker-user
/dev/nvme0n1p1                  180G   31G  150G  18% /opt/.sagemakerinternal
10.1.53.46@tcp:/ylacfb4v/aman1  1.2T  7.5M  1.2T   1% /mnt/custom-file-systems/fsx_lustre/fs-0104f3de83efe0f33
127.0.0.1:/                     8.0E     0  8.0E   0% /mnt/custom-file-systems/efs/fs-052756a07c3a5ba97_fsap-0b5e6e7c68f22fee3
tmpfs                           1.9G     0  1.9G   0% /proc/acpi
tmpfs                           1.9G     0  1.9G   0% /sys/firmware

Example output from user2:

Filesystem                      Size  Used Avail Use% Mounted on
overlay                          37G  478M   37G   2% /
tmpfs                            64M     0   64M   0% /dev
tmpfs                           1.9G     0  1.9G   0% /sys/fs/cgroup
shm                             392M     0  392M   0% /dev/shm
/dev/nvme0n1p1                  180G   31G  150G  18% /opt/.sagemakerinternal
/dev/nvme1n1                    5.0G  529M  4.5G  11% /home/sagemaker-user
127.0.0.1:/                     8.0E     0  8.0E   0% /mnt/custom-file-systems/efs/fs-052756a07c3a5ba97_fsap-0a323a3e5a27e1bdc
10.1.53.46@tcp:/ylacfb4v/aman2  1.2T  7.5M  1.2T   1% /mnt/custom-file-systems/fsx_lustre/fs-0104f3de83efe0f33
tmpfs                           1.9G     0  1.9G   0% /proc/acpi
tmpfs                           1.9G     0  1.9G   0% /sys/firmware

The difference here is that the FSxL mount point (ylacfb4v) has a separate partition for each user (aman1 vs. aman2). You can cd /mnt/custom-file-systems/fsx_lustre/fs-0104f3de83efe0f33, write files as each user, and verify that the other user isn't able to see those files!
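
For example, a quick isolation check (paths taken from the df output above):

# From user1's Code Editor terminal: write a test file
cd /mnt/custom-file-systems/fsx_lustre/fs-0104f3de83efe0f33
touch hello-from-user1.txt

# From user2's Code Editor terminal: the file should NOT be visible,
# since the same path maps to user2's own partition (aman2)
ls /mnt/custom-file-systems/fsx_lustre/fs-0104f3de83efe0f33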

Alternatively, if you set SharedFSx to True, you can check the mount using df -h, and it will show something like:

Filesystem                       Size  Used Avail Use% Mounted on
overlay                           37G  478M   37G   2% /
tmpfs                             64M     0   64M   0% /dev
tmpfs                            1.9G     0  1.9G   0% /sys/fs/cgroup
shm                              392M     0  392M   0% /dev/shm
/dev/nvme0n1p1                   180G   31G  150G  18% /opt/.sagemakerinternal
/dev/nvme1n1                     5.0G  529M  4.5G  11% /home/sagemaker-user
10.1.53.46@tcp:/ylacfb4v/shared  1.2T  7.5M  1.2T   1% /mnt/custom-file-systems/fsx_lustre/fs-0104f3de83efe0f33
127.0.0.1:/                      8.0E     0  8.0E   0% /mnt/custom-file-systems/efs/fs-0e16e272aba907ad3_fsap-08ae9b9f68be028d7
tmpfs                            1.9G     0  1.9G   0% /proc/acpi
tmpfs                            1.9G     0  1.9G   0% /sys/firmware

with the /shared partition.


Monitor SLURM Installation

Once you create your JupyterLab/CodeEditor instance, it will kick off the Lifecycle Configuration (LCC). We've configured the LCC so that it:

  1. Installs necessary packages and dependencies
  2. Downloads a script to install SLURM and set up MUNGE authentication
  3. Logs progress to a file on your CodeEditor/JupyterLab instance

Before being able to run SLURM commands, please wait until the LCC fully installs SLURM and configures your instance as a login node. You can monitor the progress in the logs. To find the log file, head over to CloudWatch --> Logs --> Log Groups.

In the search box, search for /aws/sagemaker/studio and select it. You will be redirected to all the Log Streams under the /aws/sagemaker/studio log group.
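
If you prefer the CLI, a rough equivalent for listing the relevant log streams (assuming your AWS CLI is configured for the same region) is:

aws logs describe-log-streams \
  --log-group-name /aws/sagemaker/studio \
  --log-stream-name-prefix "<your-domain-id>/" \
  --query 'logStreams[*].logStreamName' \
  --output text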

Under Log Streams, search for <your-domain-id>/j/CodeEditor/default/LifecycleConfigOnStart (you can find the domain ID in your CloudFormation stack outputs). In the logs, you will see:

Starting background installation. Check /tmp/slurm_lifecycle_20250326_053740.log for progress...
Installation started in the background. Monitor the progress with:
tail -f /tmp/slurm_lifecycle_20250326_053740.log

Grab the tail command and paste it into your CodeEditor/JupyterLab terminal. You will see SLURM being installed and configured. This process takes ~5-7 minutes, so go grab a cup of coffee!

You'll know that SLURM is installed when you see:

Testing Slurm configuration...
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
dev* up infinite 4 idle ip-10-1-4-244,ip-10-1-31-165,ip-10-1-52-212,ip-10-1-90-199
ml.c5.4xlarge up infinite 4 idle ip-10-1-4-244,ip-10-1-31-165,ip-10-1-52-212,ip-10-1-90-199
=======================================
=======================================
SLURM is now configured! You can now interact with your cluster from your Studio environment!!
=======================================
=======================================

Pitfalls and known issues

  1. You can't run srun.

You can run all other SLURM commands, including sbatch, squeue, and sinfo. However, srun requires specific ports to be open for I/O, which isn't possible on Studio IDE containers today. As a workaround, if you MUST run srun, try:

# Source environment variables, written by your LCC
source env_vars

# Run your srun command via SSM
aws ssm start-session \
  --target "sagemaker-cluster:${CLUSTER_ID}_${HEAD_NODE_NAME}-${CONTROLLER_ID}" \
  --document-name AWS-StartInteractiveCommand \
  --parameters '{
    "command":["srun -N 4 hostname"]
  }'

By using SSM, you are still using the controller machine to submit srun jobs to your cluster nodes.

We recommend running sbatch commands directly instead.

Example sbatch script:

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=/fsx/<partition_name>/slurm_%j.out
#SBATCH --error=/fsx/<partition_name>/slurm_%j.err
#SBATCH --nodes=1

echo "Testing write access to /fsx/<partition_name>"
date
hostname
whoami
nvidia-smi

Choosing a directory for files:

Note: When creating sbatch files, make sure you point your --output and --error paths to an FSx path that both your Studio user and your SLURM user (specified in the LCC) have permission to write to. The safest bet is /fsx/<partition_name>/, where <partition_name> will be either shared or your Studio user name, depending on what you set for SharedFSx. The permissions are handled automatically by the LCC scripts via an ACL.
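
For example, once the script is saved (the file name test.sbatch below is just a placeholder), you can submit and monitor it from your Studio terminal:

sbatch test.sbatch                               # submit the job
squeue                                           # check that it is queued/running
cat /fsx/<partition_name>/slurm_<job_id>.out     # view output once the job finishes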

  2. SLURM failed to set up

This is a rare occurrence, but it may happen because the MUNGE authentication key was incorrectly copied over from the controller machine. To remediate, you can follow the steps in the logs:

Here are the manual steps you can try:

################################################################################
# Manual MUNGE Key Installation #
################################################################################

1. Source environment variables & create a temporary file for the MUNGE key:
source env_vars
TEMP_FILE=$(mktemp)

2. Get MUNGE key hexdump:
aws ssm start-session \
--target "sagemaker-cluster:${CLUSTER_ID}_${HEAD_NODE_NAME}-${CONTROLLER_ID}" \
--document-name AWS-StartInteractiveCommand \
--parameters '{"command":["\n\n sudo hexdump -C /etc/munge/munge.key"]}' \
> "${TEMP_FILE}"

3. Convert hexdump to binary and install:
cat "${TEMP_FILE}" | grep "^[0-9a-f].* |" | \
sed 's/^[0-9a-f]\{8\} //' | \
cut -d'|' -f2 | \
tr -d '|\n' | \
sudo tee /etc/munge/munge.key > /dev/null

4. Restart MUNGE service:
sudo service munge restart

5. Verify cluster status:
sinfo

6. Cleanup:
rm ${TEMP_FILE}

sinfo should work now!
################################################################################
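
If sinfo still fails after this, note that the reconstruction above rebuilds the key from the ASCII column of the hexdump, which may not round-trip a binary key exactly. A rough alternative sketch, not taken from the logs and assuming xxd is available in the Studio image, rebuilds the key from the hex byte columns instead:

# Rebuild the MUNGE key from the hex byte columns of the hexdump captured in step 2
grep "^[0-9a-f].* |" "${TEMP_FILE}" | \
  sed 's/^[0-9a-f]\{8\} *//' | \
  cut -d'|' -f1 | \
  tr -d ' \n' | \
  xxd -r -p | \
  sudo tee /etc/munge/munge.key > /dev/null

# Restart MUNGE and re-check the cluster
sudo service munge restart
sinfo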