
Get Started Training Llama 2 with PyTorch FSDP in 5 Minutes

These scripts provide an easy way to get started with multi-node FSDP training on Slurm. They are designed to be as simple as possible, require no data preparation, and use a simple Conda environment.

Prerequisites

Before running this training, you'll need to create a HyperPod cluster with an FSx for Lustre file system. Instructions can be found in 1. Cluster Setup. Please follow them if you haven't done so already.

Setup

Create Environment

On your cluster head node:


  1. Navigate to your home directory (assuming it was set up as a shared directory) and clone the repo:

    cd ~
    git clone https://github.com/aws-samples/awsome-distributed-training/
    cd awsome-distributed-training/3.test_cases/pytorch/FSDP/slurm

  2. Run the create_venv.sh script:

    . ./create_venv.sh

    • This script will first download and install Miniconda, then create a Conda env called pt_fsdp (a rough sketch follows this list).
    • By creating this environment on the shared FSx for Lustre volume, all compute nodes in our cluster will have access to it.
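
If you're curious what the script sets up, the outline below is a minimal sketch of the general approach, assuming a Miniconda install onto the shared volume; the installer URL, Python version, and packages shown here are illustrative assumptions, and create_venv.sh itself remains the authoritative reference.

    # Minimal sketch only -- see create_venv.sh for the real steps and versions.
    # Install Miniconda into the shared home directory so every node can see it
    wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O /tmp/miniconda.sh
    bash /tmp/miniconda.sh -b -p ~/miniconda3
    source ~/miniconda3/bin/activate

    # Create and activate the pt_fsdp environment, then install PyTorch into it
    conda create -y -n pt_fsdp python=3.10
    conda activate pt_fsdp
    pip install torch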

Data

For this example, we'll be using the allenai/c4 dataset. Instead of downloading the entire dataset, the create_streaming_dataloaders function will stream it from Hugging Face, so there's no data prep required for running this training.

If you'd like to use your own dataset instead, you can do so by formatting it as a Hugging Face dataset and passing its location to the --dataset_path argument.
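
As a quick, optional illustration of the streaming behavior, the one-liner below pulls a single allenai/c4 record straight from the Hugging Face Hub without downloading the dataset. It is a standalone sketch of the idea, not the repo's create_streaming_dataloaders implementation, and it assumes the datasets library is available (for example in the pt_fsdp environment).

    # Stream one c4 record from the Hub; nothing is downloaded up front
    python -c "from datasets import load_dataset; print(next(iter(load_dataset('allenai/c4', 'en', split='train', streaming=True)))['text'][:100])"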

Training

Create HuggingFace Token

For this dataset, we will need a Hugging Face access token. First, create a Hugging Face account. Then generate an access token with read permissions. Set the token as an environment variable by running:


export HF_TOKEN=<YOUR HF ACCESS TOKEN>
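
The export above only lasts for your current shell session. If you'd like the token to be available in future sessions as well, one option (an assumption about your shell setup, not a step required by this repo) is to append it to your ~/.bashrc:

    # Optional: persist the token across logins (assumes bash reads ~/.bashrc)
    echo 'export HF_TOKEN=<YOUR HF ACCESS TOKEN>' >> ~/.bashrc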

Launch Training

The script to launch a Slurm batch training job can be found in llama2_7b-training.sbatch. You can adjust the number of training nodes by modifying #SBATCH --nodes=4, and you can adjust the training parameters in TRAINING_ARGS (additional parameters can be found in model_utils/arguments.py). Note that we use the same directory for both --checkpoint_dir and --resume_from_checkpoint. If there are multiple checkpoints, --resume_from_checkpoint automatically selects the most recent one, so if training is interrupted for any reason, it will pick up from the latest checkpoint when restarted.
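
For reference, the lines you would typically edit look roughly like this. This is an abridged sketch based on the job output shown further below, not the full contents of llama2_7b-training.sbatch.

    # Abridged sketch of llama2_7b-training.sbatch; the real file has more settings.
    #SBATCH --nodes=4        # number of training nodes; change this to scale the job

    declare -a TRAINING_ARGS=(
        # ... model, dataset, and batch-size arguments ...
        --checkpoint_freq=5000                  # write a checkpoint every 5000 steps
        --checkpoint_dir=./checkpoints          # checkpoints are written here...
        --resume_from_checkpoint=./checkpoints  # ...and the newest one is loaded on restart
    )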

To launch your training, run:

sbatch llama2_7b-training.sbatch

You'll find a new file in the logs directory of the form logs/llama2_7b-FSDP_[JOB ID].out. This will be continuously updated with your training logs. Don't worry if you see a long stream of NCCL logs (we prefer to use NCCL_DEBUG=INFO for verbose logging). After about a minute, you should see your model training, with output similar to the following for Llama 2:

+ TORCHRUN_ARGS=('--nproc_per_node=8' '--nnodes=4' '--rdzv_id=2513' '--rdzv_backend=c10d' '--rdzv_endpoint=p5-dy-gpu-1')
+ TORCHRUN=torchrun
+ export TRAIN_SCRIPT=./train.py
+ TRAIN_SCRIPT=./train.py
+ TRAINING_ARGS=('--max_context_width=4096' '--num_key_value_heads=32' '--intermediate_size=11008' '--hidden_width=4096' '--num_layers=32' '--num_heads=32' '--model_type=llama_v2' '--tokenizer=hf-internal-testing/llama-tokenizer' '--checkpoint_freq=5000' '--validation_freq=500' '--max_steps=5000' '--checkpoint_dir=./checkpoints' '--dataset=c4' '--dataset_config_name=en' '--resume_from_checkpoint=./checkpoints' '--train_batch_size=1' '--val_batch_size=1' '--sharding_strategy=full' '--offload_activations=1')
...
0: 2025-04-04 19:56:52 I [train.py:156] Creating Model
0: 2025-04-04 19:57:57 I [train.py:172] Created model with total parameters: 6889410560 (6.89 B)
...
1: p5-dy-gpu-2:62571:62571 [1] NCCL INFO NCCL version 2.26.2+cuda12.2
1: p5-dy-gpu-2:62574:62574 [4] NCCL INFO cudaDriverVersion 12040
2: p5-dy-gpu-3:60823:61204 [2] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.14.0
2: p5-dy-gpu-3:60823:61204 [2] NCCL INFO NET/OFI Using Libfabric version 1.22
...
0: 2025-04-04 19:58:26 I [train.py:103] Batch 0 Loss: 11.63327, Speed: 2.80 samples/sec, lr: 0.000006
0: 2025-04-04 19:58:28 I [train.py:103] Batch 1 Loss: 11.64674, Speed: 17.06 samples/sec, lr: 0.000013
0: 2025-04-04 19:58:30 I [train.py:103] Batch 2 Loss: 11.56934, Speed: 17.61 samples/sec, lr: 0.000019
0: 2025-04-04 19:58:32 I [train.py:103] Batch 3 Loss: 11.30075, Speed: 17.66 samples/sec, lr: 0.000025
0: 2025-04-04 19:58:33 I [train.py:103] Batch 4 Loss: 11.00539, Speed: 17.66 samples/sec, lr: 0.000031
0: 2025-04-04 19:58:35 I [train.py:103] Batch 5 Loss: 10.39471, Speed: 17.28 samples/sec, lr: 0.000038
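
While the job is running, you can follow the log and check the job's status from the head node. These are standard commands rather than anything specific to this repo; replace [JOB ID] with the ID printed by sbatch.

    # Follow the training log as it is written
    tail -f logs/llama2_7b-FSDP_[JOB ID].out

    # List your queued and running jobs and the nodes they were allocated
    squeue -u $USER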

To modify training for different model sizes, change the corresponding parameters based on the values in the Llama 2 and Llama 3 papers:

| Parameter            | Llama 2 7B | Llama 2 13B | Llama 2 70B | Llama 3.1 8B | Llama 3.1 70B | Llama 3.2 1B | Llama 3.2 3B |
|----------------------|------------|-------------|-------------|--------------|---------------|--------------|--------------|
| intermediate_size    | 11008      | 13824       | 28672       | 14336        | 28672         | 8192         | 11008        |
| num_key_value_heads  | 32         | 40          | 8           | 8            | 8             | 8            | 8            |
| hidden_width         | 4096       | 5120        | 8192        | 4096         | 8192          | 2048         | 3072         |
| num_layers           | 32         | 40          | 80          | 32           | 80            | 16           | 28           |
| num_heads            | 32         | 40          | 64          | 32           | 64            | 32           | 24           |
| max_context_length   | 4096       | 4096        | 4096        | 8192         | 8192          | 8192         | 8192         |
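
For example, to train Llama 2 13B, you would update the size-related flags in TRAINING_ARGS with the values from the 13B column (only these flags change; keep the rest of the arguments as they are). Note that the table's max_context_length corresponds to the --max_context_width flag used by the training script.

    # Size-related flags for Llama 2 13B, taken from the table above
    --max_context_width=4096
    --num_key_value_heads=40
    --intermediate_size=13824
    --hidden_width=5120
    --num_layers=40
    --num_heads=40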

If you need to cancel or modify your job, see the Slurm commands available in the Slurm documentation.
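
A few commonly used ones (standard Slurm commands, not specific to this repo; replace [JOB ID] with the ID printed by sbatch):

    scancel [JOB ID]               # cancel the job
    scontrol show job [JOB ID]     # inspect the job's full configuration
    scontrol hold [JOB ID]         # hold a pending job (release it with 'scontrol release')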