Login Node

Login nodes allow users to log in to the cluster, submit jobs, and view and manipulate data without running that work on the node hosting the critical slurmctld scheduler. They also give you a place to run monitoring servers such as Aim, TensorBoard, or Grafana/Prometheus.
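
As a small, hedged sketch of that monitoring use case, the commands below start TensorBoard on the login node and tunnel to it from your workstation. The log directory, port, and login-node host alias are assumptions for illustration, not values defined in this guide.

# On the login node: serve TensorBoard from an assumed log directory on FSx
tensorboard --logdir /fsx/experiments/logs --host localhost --port 6006
# On your workstation: forward the port over SSH, then open http://localhost:6006
ssh -N -L 6006:localhost:6006 <login-node-host>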

In this guide we'll assume you already have a cluster set up with an FSx filesystem.

Setup

  1. First, modify your cluster_config.json file and add an instance group section for the login node:
{
  "InstanceGroupName": "login-group",
  "InstanceType": "ml.m5.4xlarge",
  "InstanceCount": 1,
  "LifeCycleConfig": {
    "SourceS3Uri": "s3://${BUCKET}/src",
    "OnCreate": "on_create.sh"
  },
  "ExecutionRole": "${ROLE}",
  "ThreadsPerCore": 2
},

You'll also need to remove the VpcConfig section from the cluster_config.json file.
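
For context, here is a rough sketch of where the new group sits inside cluster_config.json. The ClusterName and the controller/worker group names are assumptions about a typical existing cluster, and each entry keeps its full fields (the last one being the block shown above).

{
  "ClusterName": "ml-cluster",
  "InstanceGroups": [
    { "InstanceGroupName": "controller-machine" },
    { "InstanceGroupName": "worker-group-1" },
    { "InstanceGroupName": "login-group" }
  ]
}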

  2. Next, update your provisioning_parameters.json file to include the line below (a sketch of the full file follows this step):
  "login_group": "login-group",
  3. Upload that file to S3:
# copy to the S3 Bucket
aws s3 cp provisioning_parameters.json s3://${BUCKET}/src/
  4. Verify that provisioning_parameters.json was updated correctly. You should see the new login_group parameter:
aws s3 cp s3://${BUCKET}/src/provisioning_parameters.json -
  5. Finally, update your cluster:
aws sagemaker update-cluster --cli-input-json file://cluster_config.json --region $AWS_REGION
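
Once the update is submitted, you can check that the new group was picked up. This sketch assumes the cluster is named ml-cluster, the same name used in the SSH step below.

# Cluster status should return to InService once the update completes
aws sagemaker describe-cluster --cluster-name ml-cluster --region $AWS_REGION
# Confirm a node from login-group appears in the node list
aws sagemaker list-cluster-nodes --cluster-name ml-cluster --region $AWS_REGION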

Login

  1. Using the easy-ssh.sh script, log in to the login node:
./easy-ssh.sh -c login-group ml-cluster
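
If you would rather not use the helper script, a session can be opened directly over SSM. This is a sketch: the cluster-ID lookup and the sagemaker-cluster target format are assumptions about how easy-ssh.sh connects, so check them against the script before relying on it.

# Derive the cluster ID from the cluster ARN
CLUSTER_ID=$(aws sagemaker describe-cluster --cluster-name ml-cluster \
  --query 'ClusterArn' --output text | awk -F/ '{print $NF}')
# Find the instance ID of a node in login-group
INSTANCE_ID=$(aws sagemaker list-cluster-nodes --cluster-name ml-cluster \
  --query "ClusterNodeSummaries[?InstanceGroupName=='login-group'].InstanceId | [0]" \
  --output text)
# Open an interactive session on the login node
aws ssm start-session --target "sagemaker-cluster:${CLUSTER_ID}_login-group-${INSTANCE_ID}"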
  2. Change the ubuntu user's home directory to /fsx/ubuntu so it lives on the shared FSx filesystem:
sudo usermod -m -d /fsx/ubuntu ubuntu
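
To confirm the change took effect, a quick check is to read back the passwd entry for the ubuntu user:

# The home directory field should now show /fsx/ubuntu
getent passwd ubuntu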