Multi-Tenant Slurm on AWS ParallelCluster, Part 2: QoS Deep Dive

Part 1 of this series stood up an AWS ParallelCluster wired to an Aurora-backed Slurm accounting database, plus three users (alice, bob, charlie) sharing consistent UIDs across the head node and every compute node. Every job now lands in slurmdbd with full attribution, but nothing enforces how the cluster gets shared. Whoever submits first wins: GPUs at any cost, 72-hour wall times, no per-team caps. That’s the gap Slurm’s Quality of Service (QoS) layer fills. ...

May 17, 2026 · 15 min · Keita Watanabe

Multi-Tenant Slurm on AWS ParallelCluster, Part 1: Accounting Database + Multi-User Setup

A shared GPU cluster without enforcement is a queue with a tragedy of the commons baked in. One researcher launches a 64-GPU sweep at 9pm; the on-call engineer can’t get a 1-GPU interactive session for debugging the next morning; a third user’s batch job — submitted a week ago and patiently waiting — keeps getting shuffled to the back because nothing prevents the latest submissions from front-running it. The scheduler is doing exactly what it was configured to do: nothing about it. ...

May 16, 2026 · 23 min · Keita Watanabe

Building Blocks for Foundation Model Training and Inference on AWS

For a long time, “scaling” in foundation models mostly meant one thing: spend more compute on pre-training and capabilities rise. That intuition was supported by empirical work such as Kaplan et al. (2020), which reported predictable power-law trends in loss as you scale model parameters, dataset size, and training compute. In practice, these trends justified sustained investment in large-scale accelerator capacity and the surrounding distributed infrastructure needed to keep it efficiently utilized. But the frontier has evolved—and scaling is no longer a single curve. NVIDIA’s “from one to three scaling laws” framing usefully emphasizes that, beyond pre-training, performance increasingly scales through post-training (e.g., supervised fine-tuning (SFT) and reinforcement learning (RL)-based methods) and through test-time compute (“long thinking,” search/verification, multi-sample strategies). ...

May 7, 2026 · 21 min · Aman Shanbhag, Pavel Belevich, Keita Watanabe