Train Llama 3.1 8B model using SageMaker HyperPod

In this section, we show how to pre-train a Llama 3.1 8B (or Llama 3 8B) model on Trn1.32xlarge/Trn1n.32xlarge instances using the Neuron Distributed library. To train the Llama model in this example, we will apply the following optimizations from the Neuron Distributed library:

  1. Tensor parallelism

  2. Sequence parallelism

  3. Selective checkpointing

  4. ZeRO-1

Set up your environment

Log in to ECR and pull the pytorch-training-neuronx image

region=us-east-2
dlc_account_id=763104351884
aws ecr get-login-password --region $region | docker login --username AWS --password-stdin $dlc_account_id.dkr.ecr.$region.amazonaws.com

docker pull 763104351884.dkr.ecr.us-east-2.amazonaws.com/pytorch-training-neuronx:2.1.2-neuronx-py310-sdk2.19.1-ubuntu20.04
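
You can verify the pull succeeded by listing the image locally:

docker images | grep pytorch-training-neuronx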

On your x86-64 based development environment:

Navigate to your home directory (or your preferred project directory) and clone the repo.

cd ~
git clone https://github.com/aws-samples/awsome-distributed-training/
cd awsome-distributed-training/3.test_cases/pytorch/neuronx-distributed/llama3/kubernetes
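
The directory should contain, among other files, the Dockerfile and the generate-jobspec.sh script used in the steps below; a quick listing confirms the clone:

ls
# Expect to see (among others) a Dockerfile and generate-jobspec.sh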

We will build a Docker image using the Dockerfile in this directory.

export AWS_REGION=$(aws ec2 describe-availability-zones --output text --query 'AvailabilityZones[0].[RegionName]')
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/
export IMAGE=llama3_trn
export TAG=:latest
docker build $DOCKER_NETWORK -t ${REGISTRY}${IMAGE}${TAG} .
Why $DOCKER_NETWORK?

The environment variable $DOCKER_NETWORK is set to --network=sagemaker only if you deployed the SageMaker Studio Code Editor CloudFormation stack in the Set Up Your Development Environment section. This is necessary because SageMaker Studio uses a specific network configuration for its containers. Otherwise, it remains unset.
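
If you are building inside SageMaker Studio Code Editor and the variable is not already set, you can export it yourself before running the build; leave it unset everywhere else:

export DOCKER_NETWORK="--network=sagemaker"   # only when building inside SageMaker Studio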

Then push the image to your private registry

# Create registry if needed
export REGISTRY_COUNT=$(aws ecr describe-repositories | grep \"${IMAGE}\" | wc -l)
if [ "${REGISTRY_COUNT//[!0-9]/}" == "0" ]; then
    echo "Creating repository ${REGISTRY}${IMAGE} ..."
    aws ecr create-repository --repository-name ${IMAGE}
else
    echo "Repository ${REGISTRY}${IMAGE} already exists"
fi

# Login to registry
echo "Logging in to $REGISTRY ..."
aws ecr get-login-password | docker login --username AWS --password-stdin $REGISTRY

# Push image to registry
docker image push ${REGISTRY}${IMAGE}${TAG}
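
You can confirm the push by listing the image tags now stored in the repository:

aws ecr describe-images --repository-name ${IMAGE} --query 'imageDetails[].imageTags'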

Create your training job and start it

Generate Job Spec Files for tokenization and training

The default config in the script launches an 8B Llama 3.1 model. Running the generate-jobspec.sh script creates two YAML files: tokenize_data.yaml and llama3_train.yaml.

You will have to update HF_ACCESS_TOKEN for the tokenization step to work.

Please edit the ./generate-jobspec.sh script with your desired environment settings.
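
For example, assuming the token is stored as a plain shell variable assignment inside the script (check generate-jobspec.sh for the actual variable), you could set it with a one-liner like:

# Hypothetical one-liner; adjust the pattern if the assignment in the script differs
sed -i 's|^HF_ACCESS_TOKEN=.*|HF_ACCESS_TOKEN=<your-hf-token>|' ./generate-jobspec.sh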

./generate-jobspec.sh
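
After the script finishes, the two job specs should exist in the current directory; a client-side dry run also validates the manifests without submitting anything to the cluster:

ls tokenize_data.yaml llama3_train.yaml
kubectl apply --dry-run=client -f ./tokenize_data.yaml -f ./llama3_train.yaml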

Tokenize Data

The example uses the wikicorpus dataset from the Hugging Face Hub. The tokenize_data.yaml job downloads the dataset, tokenizes it, and stores the tokenized dataset on FSx for Lustre, where it can be used for training the model.

kubectl apply -f ./tokenize_data.yaml
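
You can watch the tokenization job with kubectl. The pod name below is a placeholder; get the actual name from kubectl get pods (it is derived from metadata.name in tokenize_data.yaml):

kubectl get pods
kubectl logs -f <tokenize-data-pod-name>   # substitute the pod name from the previous command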

Train Model

The llama3_train.yaml job spec file trains the Llama 3.1 8B model on the tokenized data from the previous step. By default the job uses one trn1.32xlarge instance, but it can be changed to any number of nodes.

kubectl apply -f ./llama3_train.yaml
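
As with tokenization, you can follow the training pods once they are scheduled. The pod name is again a placeholder; once training is running, the logs should show loss and throughput per step:

kubectl get pods -w                       # wait for the worker pods to reach Running
kubectl logs -f <llama3-train-pod-name>   # follow the training logs

When you are done, kubectl delete -f ./llama3_train.yaml tears the job down.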