Skip to main content

Helpful Advice

General tips and best practices applicable to all HyperPod deployments

📄️ Configure Cgroups for Slurm

Cgroups is a Linux kernel feature that limits, accounts for, and isolates the resource usage (CPU, memory, disk I/O, etc.) of a collection of processes. In traditional environments, Cgroups allow system administrators to allocate resources such as CPU time, system memory, disk bandwidth, etc., among user-defined groups of tasks (processes) running on a system. We can configure Slurm to use Cgroups to constrain resources at the Slurm job and task level. A popular usecase for implementing Cgroups with Slurm is to use Process tracking proctrac/cgroup to isolate processes to a slurm job, thus ensuring all processes created by the job are contained within a cgroup, which helps in monitoring and controlling the resource usage by the job. It also helps in cleaning up processes after the job ends, ensuring that there are no "zombie" processes left running on the system.