
SSH into Cluster

Reference Documentation

For detailed SSH access methods and troubleshooting, see SSH into HyperPod.

Log in to Your Cluster

Once the cluster status changes to InService, you can connect to the cluster through AWS Systems Manager (SSM). You'll need the cluster id, the node group name, and the instance id:

Key              Example Value          Where to get
Cluster id       q2vei6nzqldz           ARN in describe-cluster
Instance Id      i-08982ccd4b6b34eb1    list-cluster-nodes
Node Group Name  controller-machine     list-cluster-nodes
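
As a minimal sketch of how these three values fit together, the helpers below pull the cluster id out of a cluster ARN and assemble the SSM target string in the same shape that appears in the easy-ssh.sh output. The function names are hypothetical, and the ARN layout (cluster id after the final "/") is an assumption based on the table above; the two lookup functions need the AWS CLI and valid credentials.

```shell
# Hypothetical helpers (not part of easy-ssh.sh).

# Print the cluster id, assumed to be the text after the last "/" in the ARN.
cluster_id_from_arn() {
  printf '%s\n' "${1##*/}"
}

# Print an SSM target: sagemaker-cluster:<cluster id>_<node group>-<instance id>
ssm_target() {
  printf 'sagemaker-cluster:%s_%s-%s\n' "$1" "$2" "$3"
}

# Look up the cluster ARN by cluster name (requires AWS CLI and credentials).
lookup_cluster_arn() {
  aws sagemaker describe-cluster --cluster-name "$1" \
    --query 'ClusterArn' --output text
}

# List node group name and instance id for each node in the cluster.
lookup_nodes() {
  aws sagemaker list-cluster-nodes --cluster-name "$1" \
    --query 'ClusterNodeSummaries[].[InstanceGroupName,InstanceId]' --output text
}
```

With these defined, something like aws ssm start-session --target "$(ssm_target "$(cluster_id_from_arn "$(lookup_cluster_arn ml-cluster)")" controller-machine i-08982ccd4b6b34eb1)" would open a session without the helper script, assuming the instance id and node group match your cluster.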

1. Install the SSM Session Manager Plugin

sudo curl "https://s3.amazonaws.com/session-manager-downloads/plugin/latest/ubuntu_64bit/session-manager-plugin.deb" -o "/tmp/session-manager-plugin.deb"
sudo dpkg -i /tmp/session-manager-plugin.deb
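
Before moving on, it can help to confirm that the plugin binary is actually on your PATH. The check below is a small sketch; plugin_installed is a hypothetical helper name, and it only tests that the command exists, not that it works with your AWS CLI version.

```shell
# Hypothetical helper: succeed if the named command is on PATH.
# Defaults to session-manager-plugin when no argument is given.
plugin_installed() {
  command -v "${1:-session-manager-plugin}" >/dev/null 2>&1
}

# Example use (commented out so that sourcing this file has no side effects):
# if plugin_installed; then
#   echo "Session Manager plugin found"
# else
#   echo "Session Manager plugin missing; repeat the install step" >&2
# fi
```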

2. Connect Using easy-ssh.sh

Run the easy-ssh.sh script:

awsome-distributed-training/1.architectures/5.sagemaker-hyperpod/easy-ssh.sh -c controller-machine ml-cluster

If asked "Would you like to add ml-cluster to ~/.ssh/config", type "yes".

3. Switch to the ubuntu user

sudo su - ubuntu
pwd # Should print out /fsx/ubuntu
Warning: If you see an error like TargetNotConnected, check the cluster status with aws sagemaker describe-cluster --cluster-name ml-cluster. The cluster must be InService before you can access it.
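
The status check can also be wrapped in a polling loop so you do not have to rerun it by hand. This is a sketch, not part of the official tooling: wait_for_in_service and get_cluster_status are hypothetical names, and the attempt count and delay are arbitrary defaults.

```shell
# Print the cluster status (requires AWS CLI and credentials).
get_cluster_status() {
  aws sagemaker describe-cluster --cluster-name "$1" \
    --query 'ClusterStatus' --output text
}

# Poll until the cluster is InService.
# Usage: wait_for_in_service <cluster name> [max attempts] [delay in seconds]
# Returns 0 on InService, 1 on Failed or when attempts run out.
wait_for_in_service() {
  cluster_name="$1"
  max_attempts="${2:-30}"
  delay="${3:-20}"
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    status="$(get_cluster_status "$cluster_name")"
    echo "attempt $attempt: status=$status"
    case "$status" in
      InService) return 0 ;;
      Failed) return 1 ;;
    esac
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  return 1
}
```

For example, wait_for_in_service ml-cluster && followed by the easy-ssh.sh command would only attempt the connection once the cluster reports InService.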

When you run easy-ssh.sh, you should see output similar to:

=================================================
==== 🚀 HyperPod Cluster Easy SSH Script! 🚀 ====
=================================================
Cluster id: dfzt1l941fbe
Instance id: i-01577219f576e5835
Node Group: controller-machine
aws ssm start-session --target sagemaker-cluster:dfzt1l941fbe_controller-machine-i-01577219f576e5835
Would you like to add ml-cluster to ~/.ssh/config (yes/no)?
> yes
✅ adding ml-cluster to ~/.ssh/config:
Starting session with SessionId: ...
#