Login Node
Login nodes let users log in to the cluster, submit jobs, and view and manipulate data without working on the critical scheduler node that runs slurmctld. They also give you a place to run monitoring servers such as Aim, TensorBoard, or Grafana/Prometheus.
This guide assumes you already have a cluster set up with an FSx file system.
Setup
- First, modify your cluster-config.json file and add a login instance group to the InstanceGroups list:
{
  "InstanceGroupName": "login-group",
  "InstanceType": "ml.m5.4xlarge",
  "InstanceCount": 1,
  "LifeCycleConfig": {
    "SourceS3Uri": "s3://${BUCKET}/src",
    "OnCreate": "on_create.sh"
  },
  "ExecutionRole": "${ROLE}",
  "ThreadsPerCore": 2
},
You'll also need to remove the VpcConfig section from the cluster-config.json file.
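Both changes to the cluster configuration can be checked before the update is submitted. The following is a minimal sketch, not part of the original tooling: the function name check_cluster_config is hypothetical, and it does a plain-text match rather than a real JSON parse, so it assumes the key and value sit on the same line as in the snippet above.

```shell
# Hypothetical helper: pre-flight check for the cluster config file given as $1.
# Succeeds only if the file names a "login-group" instance group and no longer
# contains a "VpcConfig" key. This is a text match, not a JSON validation.
check_cluster_config() {
  if ! grep -q '"InstanceGroupName"[[:space:]]*:[[:space:]]*"login-group"' "$1"; then
    echo "missing login-group instance group in $1" >&2
    return 1
  fi
  if grep -q '"VpcConfig"' "$1"; then
    echo "VpcConfig is still present in $1" >&2
    return 1
  fi
  echo "$1 contains login-group and no VpcConfig"
}
```

Calling it as check_cluster_config cluster-config.json returns a non-zero status and prints the reason if either condition is not met.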
- Next, update your provisioning_parameters.json file to include the line:
"login_group": "login-group",
- Upload the updated file to S3:
# copy to the S3 Bucket
aws s3 cp provisioning_parameters.json s3://${BUCKET}/src/
- Verify that provisioning_parameters.json was correctly updated. You should see the new parameter login_group:
aws s3 cp s3://${BUCKET}/src/provisioning_parameters.json -
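Instead of reading the output by eye, the check can be scripted. This is a sketch under assumptions: has_login_group is a hypothetical helper name, and it matches the text of the file on standard input rather than parsing the JSON.

```shell
# Hypothetical helper: reads provisioning_parameters.json on stdin and succeeds
# only if "login_group" is set to the expected name (default: login-group).
has_login_group() {
  grep -q "\"login_group\"[[:space:]]*:[[:space:]]*\"${1:-login-group}\""
}
```

It can be fed directly from the command above, for example by piping aws s3 cp s3://${BUCKET}/src/provisioning_parameters.json - into has_login_group and testing the exit status.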
- Finally update your cluster:
aws sagemaker update-cluster --cli-input-json file://cluster-config.json --region $AWS_REGION
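The update runs asynchronously, so it can help to wait until the cluster is back in service before trying to log in. The sketch below polls aws sagemaker describe-cluster for the ClusterStatus field. The function names, the retry count, and the delay are assumptions chosen for illustration, and the cluster name is passed in by the caller.

```shell
# Print the current status of the cluster named in $1.
# Kept separate so the polling loop can be exercised without AWS access.
get_cluster_status() {
  aws sagemaker describe-cluster --cluster-name "$1" \
    --query ClusterStatus --output text --region "$AWS_REGION"
}

# Hypothetical helper: poll until the cluster reports InService.
# $1 = cluster name, $2 = max attempts (default 60), $3 = delay in seconds (default 30).
wait_for_cluster() {
  wfc_name="$1"
  wfc_tries="${2:-60}"
  wfc_delay="${3:-30}"
  wfc_i=0
  while [ "$wfc_i" -lt "$wfc_tries" ]; do
    wfc_status="$(get_cluster_status "$wfc_name")" || return 1
    echo "cluster $wfc_name: $wfc_status"
    case "$wfc_status" in
      InService) return 0 ;;
      Failed) return 1 ;;
    esac
    wfc_i=$((wfc_i + 1))
    sleep "$wfc_delay"
  done
  echo "timed out waiting for $wfc_name" >&2
  return 1
}
```

With the cluster name used in the login step below, this would be invoked as wait_for_cluster ml-cluster. An InService status only says the update finished; it does not confirm that the lifecycle script configured the login node as intended.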
Login
- Use the easy-ssh.sh script to log in to the login node:
./easy-ssh.sh -c login-group ml-cluster
- Change the ubuntu user's home directory to /fsx/ubuntu:
usermod -m -d /fsx/ubuntu ubuntu
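To confirm the change, the home directory recorded for the user can be read back from the passwd database. This is a small sketch; passwd_home is a hypothetical helper that extracts the sixth colon-separated field from a passwd entry on standard input.

```shell
# Hypothetical helper: print the home-directory field (6th field)
# of a passwd-format entry read from stdin.
passwd_home() {
  cut -d: -f6
}
```

Running getent passwd ubuntu and piping the result into passwd_home should print /fsx/ubuntu if the usermod command took effect.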