Parallelcluster

Multi-Tenant Slurm on AWS ParallelCluster, Part 2: QoS Deep Dive

Part 1 of this series stood up an AWS ParallelCluster wired to an Aurora-backed Slurm accounting database, plus three users (alice, bob, charlie) sharing consistent UIDs across the head node and every compute node. Every job now lands in slurmdbd with full attribution, but nothing enforces how the cluster gets shared. Whoever submits first wins: GPUs at any cost, 72-hour wall times, no per-team caps. That’s the gap Slurm’s Quality of Service (QoS) layer fills. ...

Multi-Tenant Slurm on AWS ParallelCluster, Part 1: Accounting Database + Multi-User Setup

A shared GPU cluster without enforcement is a queue with a tragedy of the commons baked in. One researcher launches a 64-GPU sweep at 9 pm; the next morning, the on-call engineer can’t get a 1-GPU interactive session for debugging; a third user’s batch job, submitted a week ago and patiently waiting, keeps getting shuffled to the back because nothing prevents the latest submissions from front-running it. The scheduler is doing exactly what it was configured to do about all this: nothing. ...