Llama-3 70B (trn1.32xlarge) using NxD

This tutorial shows how to launch a Llama 3 70B training job on a SageMaker HyperPod cluster of 16 x ml.trn1.32xlarge instances.

Prerequisites

This guide assumes that you have a SageMaker HyperPod (SMHP) Slurm cluster of 16 x ml.trn1.32xlarge instances with a shared parallel filesystem such as Amazon FSx for Lustre. If you don't have one yet, please follow the instructions in 1. Cluster Setup.
Without loss of generality, we assume that you are working from the home directory of the controller machine as user ubuntu (the default provisioned power user).

Parallelism

For this sample, we will use the NeuronX Distributed (NxD) package alongside the PyTorch Neuron package. NxD supports different distributed training frameworks and provides a mechanism for those frameworks to run on XLA-based Neuron cores. It supports several data and model parallelism strategies, which we look at below. PyTorch Neuron is a software package that enables PyTorch training, evaluation, and inference on Neuron devices.

This sample uses 3D parallelism, which combines data parallelism with model (tensor + pipeline) parallelism in a single framework, arranging the devices in a 3-D mesh. Each axis of this "mesh" corresponds to one of:

Data Parallelism Axis: Distributes training data across devices
Tensor Parallelism Axis: Splits the computation of individual tensor layers across devices
Pipeline Parallelism Axis: Distributes the model's layers across devices

This combination of data + model parallelism allows for efficient scaling and utilization of hardware resources. For instance, tensor parallelism requires the highest communication bandwidth and is best suited to Trainium chips within the same Trn1 node, which are connected by the high-bandwidth NeuronLink interconnect. Pipeline parallelism, which has lower communication requirements, can be used across nodes. Data parallelism, which requires the least communication, can also span multiple nodes. To learn more about the Trainium architecture, see the Neuron documentation.
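
To make the mesh concrete, here is a minimal Python sketch of how worker ranks could map onto the three axes for a cluster of 16 x ml.trn1.32xlarge instances (32 NeuronCores per instance, 512 workers in total). The degrees used (tensor parallel 32, pipeline parallel 8) and the rank layout (tensor axis varying fastest, then data, then pipeline) are illustrative assumptions, not necessarily the values or the ordering that NxD uses; the helper names are hypothetical.

```python
from typing import NamedTuple

NODES = 16
NEURON_CORES_PER_NODE = 32  # trn1.32xlarge: 16 Trainium chips x 2 NeuronCores each
WORLD_SIZE = NODES * NEURON_CORES_PER_NODE  # 512 workers


class MeshCoord(NamedTuple):
    dp: int  # index along the data parallelism axis
    pp: int  # index along the pipeline parallelism axis
    tp: int  # index along the tensor parallelism axis


def data_parallel_degree(world_size: int, tp: int, pp: int) -> int:
    """Whatever is left after tensor and pipeline parallelism becomes data parallelism."""
    if world_size % (tp * pp) != 0:
        raise ValueError("world_size must be divisible by tp * pp")
    return world_size // (tp * pp)


def rank_to_coord(rank: int, tp: int, pp: int, dp: int) -> MeshCoord:
    """Map a global rank to mesh coordinates (assumed layout: tp fastest, then dp, then pp)."""
    if not 0 <= rank < tp * pp * dp:
        raise ValueError("rank out of range")
    return MeshCoord(dp=(rank // tp) % dp, pp=rank // (tp * dp), tp=rank % tp)


# Illustrative degrees (assumptions): one full node per tensor-parallel group.
TP, PP = 32, 8
DP = data_parallel_degree(WORLD_SIZE, TP, PP)

for rank in (0, 31, 32, 64, 511):
    node = rank // NEURON_CORES_PER_NODE
    print(f"rank {rank:3d} on node {node:2d} -> {rank_to_coord(rank, TP, PP, DP)}")
```

With these assumed degrees, each tensor-parallel group of 32 ranks fits exactly on one node, so the most bandwidth-hungry communication stays on NeuronLink, while the pipeline and data axes cross node boundaries.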

Compilation

To get the best performance on Trainium, compile the model before training. To do this, run neuron_parallel_compile torchrun your_model.py with the model set to train for only a few steps (5-10). This builds the model's graphs and stores them in a cache, so that the next time you run torchrun your_model.py, training starts much faster.
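
The compile-then-train flow can be sketched as a small Python helper that assembles the two command lines. This is only an illustration: your_model.py is the placeholder script name from the text above, and the --max_steps flag is a hypothetical argument that your own training script would have to define in order to limit the compilation run to a few steps.

```python
import shlex


def build_commands(script="your_model.py", warmup_steps=10, extra_args=()):
    """Return (compile_cmd, train_cmd) as argv lists; nothing is executed here."""
    # Regular training launch.
    train_cmd = ["torchrun", script, *extra_args]
    # Same launch wrapped in neuron_parallel_compile, limited to a few steps.
    # NOTE: --max_steps is a hypothetical flag of your own script (assumption).
    compile_cmd = [
        "neuron_parallel_compile",
        "torchrun",
        script,
        *extra_args,
        "--max_steps",
        str(warmup_steps),
    ]
    return compile_cmd, train_cmd


compile_cmd, train_cmd = build_commands()
print(shlex.join(compile_cmd))  # run this first to populate the graph cache
print(shlex.join(train_cmd))    # then launch the actual training run
```

The argv lists can be passed to subprocess.run, or printed and pasted into a Slurm batch script.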

Steps

We will be doing the following:

Download the Llama 3 model and tokenizer: Download the model weights (checkpoints) and the tokenizer. We will also convert the checkpoints to match the distributed training configuration.
Download and preprocess the wiki-corpus dataset
Train with NxD!
