📄️ Containers
Containers
📄️ Troubleshooting
So your cluster failed to create. What do you do now?
📄️ Bastion Host
Ok, so what if we want to access our cluster with plain SSH instead of SSM?
📄️ Login Node
Login nodes allow users to log in to the cluster, submit jobs, and view and manipulate data without running on the critical slurmctld scheduler node. They also let you run monitoring servers such as Aim, TensorBoard, or Grafana/Prometheus.
📄️ Gres (--gpus)
This section describes how to set up Slurm GRES, which allows scheduling jobs based on the number of GPUs needed, e.g. --gpus=4. Please see the note below before proceeding with the setup:
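As a quick preview, enabling GPU scheduling typically involves a few lines in slurm.conf and gres.conf like the following. The node name, GPU count, and device paths here are illustrative; the full section covers the HyperPod-specific steps:

```
# slurm.conf — declare the GRES type and the per-node GPU count
GresTypes=gpu
NodeName=ip-10-1-69-242 Gres=gpu:8 ...

# gres.conf — map the gpu GRES to device files on each node
NodeName=ip-10-1-69-242 Name=gpu File=/dev/nvidia[0-7]
```

Once configured, jobs can request GPUs directly, e.g. `srun --gpus=4 ...`.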
📄️ Diagnose GPU Failures
To diagnose a node with a bad GPU (e.g. ip-10-1-69-242) on SageMaker HyperPod, do the following:
📄️ Heterogeneous Cluster
Adding Worker Groups to an existing cluster
📄️ Configure Cgroups for Slurm
Cgroups (control groups) is a Linux kernel feature that limits, accounts for, and isolates the resource usage (CPU, memory, disk I/O, etc.) of a collection of processes. In traditional environments, cgroups let system administrators allocate resources such as CPU time, system memory, and disk bandwidth among user-defined groups of tasks (processes) running on a system. We can configure Slurm to use cgroups to constrain resources at the Slurm job and task level. A popular use case is process tracking with proctrack/cgroup, which confines all processes created by a job to a cgroup. This helps in monitoring and controlling the job's resource usage, and in cleaning up processes after the job ends so that no "zombie" processes are left running on the system.
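The cgroup integration described above is driven by a handful of settings in slurm.conf and cgroup.conf. A minimal sketch (the constraint choices are assumptions; tune them for your workload):

```
# slurm.conf — track and constrain job processes via cgroups
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

# cgroup.conf — which resources the cgroup plugins constrain
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
```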
📄️ Enable Slurm epilog Script
Slurm epilog scripts can be used to perform tasks automatically after a job completes on a cluster. Implementing them allows users and administrators to automate essential post-job tasks such as resource cleanup, logging and monitoring, notifications, and data management:
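As a small illustration of the cleanup use case, here is a hypothetical epilog script that removes a per-job scratch directory. The scratch path is an assumption; Slurm exports SLURM_JOB_ID (and SLURM_JOB_USER) in the epilog environment:

```shell
#!/bin/bash
# Hypothetical Slurm epilog: remove the job's scratch directory after the job ends.
# The /tmp/scratch-<jobid> layout is an assumption for this sketch.

epilog_cleanup() {
    local job_id="$1"
    local scratch_dir="/tmp/scratch-${job_id}"
    # Only act when a job ID is set and the directory actually exists.
    if [ -n "${job_id}" ] && [ -d "${scratch_dir}" ]; then
        rm -rf "${scratch_dir}"
    fi
}

# Slurm sets SLURM_JOB_ID in the epilog environment.
epilog_cleanup "${SLURM_JOB_ID}"
```

To enable it, point slurm.conf at the script with a line such as `Epilog=/opt/slurm/etc/epilog.sh` (path is illustrative) and make the file executable; it runs on each node after a job completes.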
📄️ Delete Cluster Nodes
The SageMaker BatchDeleteClusterNodes API allows you to delete specific nodes within a SageMaker HyperPod cluster. It accepts a cluster name and a list of node IDs.
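A minimal sketch of calling the API through boto3. The cluster name, node IDs, and batch size below are placeholders, and boto3 is imported lazily so the sketch reads without the AWS SDK installed:

```python
def chunked(items, size):
    """Yield fixed-size batches from a list of node IDs."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def batch_delete_nodes(cluster_name, node_ids, batch_size=25):
    """Delete HyperPod nodes in batches via BatchDeleteClusterNodes.

    batch_size is an assumption; check the API's per-request limit.
    """
    import boto3  # lazy import: only needed when actually calling AWS

    sm = boto3.client("sagemaker")
    responses = []
    for batch in chunked(node_ids, batch_size):
        responses.append(
            sm.batch_delete_cluster_nodes(ClusterName=cluster_name, NodeIds=batch)
        )
    return responses


# Example call (placeholder values):
# batch_delete_nodes("my-hyperpod-cluster", ["i-0abc...", "i-0def..."])
```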
📄️ Troubleshoot IAM Permissions
Resolving AWS Configure Permissions Issues on HyperPod Nodes