LoRA - Trainium
This example showcases how to train Llama 3.1 models using AWS Trainium instances and Hugging Face Optimum Neuron. 🤗 Optimum Neuron is the interface between the 🤗 Transformers library and AWS Accelerators, including AWS Trainium and AWS Inferentia. It provides a set of tools enabling easy model loading, training, and inference on single- and multi-accelerator settings for different downstream tasks.
Prerequisites
Before running this training, you'll need to create a SageMaker HyperPod cluster with at least one trn1.32xlarge or trn1n.32xlarge instance group.
Please make sure that you deploy the Neuron device plugin, the EFA device plugin, and the Kubeflow training operator to your cluster.
See What Dependencies are Installed on Your EKS Cluster for details.
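As a quick sanity check, you can list the relevant DaemonSets and the training operator pods; the exact resource names below are assumptions and may differ depending on how the add-ons were installed.
# Look for the Neuron and EFA device plugin DaemonSets (names vary by installation)
kubectl get daemonsets -A | grep -Ei 'neuron|efa'
# Confirm the Kubeflow training operator is running
kubectl get pods -A | grep -i training-operator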
To build a container image, you need an x86-64 based development environment with Docker installed. Note that recent Macs with Apple Silicon are ARM based, not x86-64 based. You can use SageMaker Code Editor for this purpose.
Since Llama 3.1 is a gated model, you have to register on Hugging Face and obtain an access token (HF_ACCESS_TOKEN) before running this example.
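For convenience, you can keep the token in an environment variable on your development machine so it is at hand when you edit the job spec script later; the variable name mirrors the HF_ACCESS_TOKEN placeholder referenced below, and the value shown is a placeholder.
# Placeholder value: paste your own Hugging Face access token
export HF_ACCESS_TOKEN=hf_xxxxxxxxxxxxxxxx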
We need to set up a PVC for FSx to store the tokenized data and training checkpoints. Please follow the link here to set up the FSx CSI driver and PVC.
Verified instance types, instance counts
- ml.trn1.32xlarge x (1,2)
- ml.trn1n.32xlarge x (1,2)
Validate the cluster configuration
- View the AWS Console following this instruction.
- Set environment variables. This is done in Verifying cluster connection to EKS.
Create and mount the FSx Lustre File System to the SageMaker HyperPod
- First, install the FSx for Lustre CSI driver following this instruction; we will use dynamic provisioning.
- Create a persistent volume claim named fsx-claim in the kubeflow namespace, using the fsx-sc storage class:
cat <<EOF > pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fsx-claim
  namespace: kubeflow
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: fsx-sc
  resources:
    requests:
      storage: 1200Gi
EOF
kubectl apply -f pvc.yaml
This persistent volume claim will kick off the dynamic provisioning of an FSx for Lustre file system based on the specifications provided in the storage class.
View the status of the persistent volume claim:
kubectl describe pvc fsx-claim -n kubeflow
Mount the volume to a container
Wait until the PVC status is Bound before deploying any pods that reference the persistent volume claim. The status may remain Pending (~10 minutes) while the file system is being provisioned.
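If you prefer to wait non-interactively, a command like the following blocks until the claim is bound (this assumes kubectl v1.23 or later, which supports --for=jsonpath):
kubectl wait --for=jsonpath='{.status.phase}'=Bound pvc/fsx-claim -n kubeflow --timeout=15m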
Use the following command to mount the volume.
cat <<EOF > pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: fsx-app
  namespace: kubeflow
spec:
  containers:
    - name: app
      image: ubuntu
      command: ["/bin/sh"]
      args: ["-c", "while true; do echo \$(date -u) >> /data/out.txt; sleep 5; done"]
      volumeMounts:
        - name: persistent-storage
          mountPath: /data
  volumes:
    - name: persistent-storage
      persistentVolumeClaim:
        claimName: fsx-claim
EOF
kubectl apply -f pod.yaml
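Once the pod is running, you can confirm that the FSx volume is mounted and writable, for example:
# Check the mount and the file the test pod keeps appending to
kubectl exec -n kubeflow fsx-app -- df -h /data
kubectl exec -n kubeflow fsx-app -- tail /data/out.txt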
Apply Low-Rank Adaptation (LoRA) to fine-tune the Llama 3.1 8B model with Optimum Neuron on SageMaker HyperPod
In this section, we showcase how to fine-tune the Llama 3.1 8B (or Llama 3 8B) model on trn1.32xlarge/trn1n.32xlarge instances using the Optimum Neuron library, applying LoRA together with the tensor parallelism, BFloat16, and gradient accumulation optimizations described in the training step below.
Log in to ECR and pull the huggingface-pytorch-training-neuronx image
region=us-east-1
dlc_account_id=763104351884
aws ecr get-login-password --region $region | docker login --username AWS --password-stdin $dlc_account_id.dkr.ecr.$region.amazonaws.com
docker pull ${dlc_account_id}.dkr.ecr.${region}.amazonaws.com/huggingface-pytorch-training-neuronx:2.1.2-transformers4.43.2-neuronx-py310-sdk2.20.0-ubuntu20.04-v1.0
On your x86-64 based development environment, navigate to your home directory (or your preferred project directory) and clone the repo.
cd ~
git clone https://github.com/Captainia/awsome-distributed-training.git
cd awsome-distributed-training
git checkout optimum-neuron-eks
cd 3.test_cases/pytorch/optimum-neuron/llama3/kubernetes/fine-tuning
Build Docker Image and push to ECR
export AWS_REGION=$(aws ec2 describe-availability-zones --output text --query 'AvailabilityZones[0].[RegionName]')
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/
export IMAGE=peft-optimum-neuron
export TAG=:latest
docker build ${DOCKER_NETWORK} -t ${REGISTRY}${IMAGE}${TAG} .
We build the Docker image using the Dockerfile in this directory. Set the environment variable DOCKER_NETWORK to --network=sagemaker only if you deployed the SageMaker Studio Code Editor; this is necessary because SageMaker Studio uses a specific network configuration for its containers. Otherwise, leave it unset.
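For example, when building from the SageMaker Studio Code Editor you could set the variable before running docker build (skip this everywhere else):
# Only needed in SageMaker Studio Code Editor; leave unset otherwise
export DOCKER_NETWORK="--network=sagemaker"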
Then push the image to your private registry
# Create registry if needed
export REGISTRY_COUNT=$(aws ecr describe-repositories | grep \"${IMAGE}\" | wc -l)
if [ "${REGISTRY_COUNT//[!0-9]/}" == "0" ]; then
echo "Creating repository ${REGISTRY}${IMAGE} ..."
aws ecr create-repository --repository-name ${IMAGE}
else
echo "Repository ${REGISTRY}${IMAGE} already exists"
fi
# Login to registry
echo "Logging in to $REGISTRY ..."
aws ecr get-login-password | docker login --username AWS --password-stdin $REGISTRY
# Push image to registry
docker image push ${REGISTRY}${IMAGE}${TAG}
Generate Job Spec Files for tokenization and training
The default config in the script launches an 8B Llama 3.1 model. When you run the generate-jobspec.sh script, it creates two YAML files: tokenize_data.yaml and llama3_train.yaml.
You will have to update the HF_ACCESS_TOKEN in order for the tokenization to work.
Please edit the ./generate-jobspec.sh script with your desired environment settings.
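For example, to locate where the token is set before editing (assuming the script references it under the HF_ACCESS_TOKEN name mentioned above):
# Find the token setting inside the script
grep -n "HF_ACCESS_TOKEN" ./generate-jobspec.sh
Then run the script: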
./generate-jobspec.sh
Tokenize Data
The example uses the wikicorpus dataset from the Hugging Face Hub. The tokenize_data.yaml job downloads the dataset, tokenizes it, and finally stores the tokenized dataset on FSx for Lustre, where it can be used for training the model.
kubectl apply -f ./tokenize_data.yaml
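You can follow the tokenization job's progress with standard kubectl commands; the pod name below is a placeholder, and the kubeflow namespace is assumed to match the PVC created earlier.
# List the pods launched by the job and stream the logs from one of them
kubectl get pods -n kubeflow
kubectl logs -f <tokenize-pod-name> -n kubeflow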
Compile the model
Training on Trainium requires model compilation using the neuron_parallel_compile utility.
kubectl apply -f ./compile_peft.yaml
This step does the following:
- Extracts computation graphs from a trial run (~10 training steps)
- Performs parallel pre-compilation of these graphs
- Uses identical scripts to actual training but with reduced max_steps
- Prepares the model for efficient execution on Trainium hardware
The compilation process is essential for optimizing model performance on the specialized Trainium architecture.
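Conceptually, the compile job wraps the same training entrypoint with the neuron_parallel_compile utility, roughly as sketched below; the script name and arguments are illustrative placeholders, not the literal contents of compile_peft.yaml.
# Illustrative sketch: precompile XLA graphs by running a short trial
# (~10 steps) of the same command used for the real training run
neuron_parallel_compile torchrun --nproc_per_node=32 \
  train.py --max_steps 10   # remaining arguments identical to the training job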
Train Model
The launch_peft_train.yaml job spec file fine-tunes the Llama 3.1 8B model with the tokenized data from the previous step. By default the job uses one trn1.32xlarge instance, but it can be changed to any number of nodes.
kubectl apply -f ./launch_peft_train.yaml
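While the job runs, you can optionally inspect the Neuron devices from inside a worker pod; the pod name is a placeholder, and this assumes the Neuron tools (neuron-ls, neuron-top) are available in the training image.
# List Neuron devices and watch live NeuronCore utilization
kubectl exec -n kubeflow -it <training-pod-name> -- neuron-ls
kubectl exec -n kubeflow -it <training-pod-name> -- neuron-top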
The training process uses tensor parallelism with degree 8 and leverages all 32 NeuronCores in the ml.trn1.32xlarge instance (tensor parallel degree 8 × data parallel degree 4 = 32 NeuronCores). Key features include:
- Data parallel degree of 4
- BFloat16 precision (XLA_USE_BF16=1) for reduced memory footprint
- Gradient accumulation steps of 3 for larger effective batch size
- LoRA configuration with:
  - r=16 (rank)
  - lora_alpha=16
  - lora_dropout=0.05
  - Target modules: q_proj and v_proj
Consolidate the trained weights
During distributed training, model checkpoints are split across multiple devices. The consolidation process:
- Combines distributed checkpoints into a unified model
- Processes tensors in memory-efficient chunks
- Creates sharded outputs with an index file
- Saves the consolidated weights in safetensor format
This step is crucial for bringing together the distributed training results into a usable format.
kubectl apply -f ./consolidation.yaml
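After the job completes, you can check that the consolidated safetensors shards appeared on the shared FSx volume, reusing the fsx-app pod from earlier; the exact output directory depends on the paths configured in consolidation.yaml.
# Look for consolidated shards on the shared volume
kubectl exec -n kubeflow fsx-app -- sh -c 'find /data -name "*.safetensors" | head'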
Merge LoRA weights
The final step merges the LoRA adapters with the base model.
kubectl apply -f ./merge_lora.yaml
This process does the following:
- Loads the base model and LoRA configuration
- Transforms LoRA weight names to match base model structure
- Merges the adapters with the original model weights
- Saves the final model in a sharded format
The resulting merged model combines the base model's knowledge with the task-specific adaptations learned during fine-tuning, while maintaining the efficiency benefits of LoRA training.
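As a final check, you can list the merged model's shards and index file on the shared volume (again reusing the fsx-app pod; the exact output directory depends on merge_lora.yaml's configuration):
# List merged model shards and the accompanying index file
kubectl exec -n kubeflow fsx-app -- sh -c 'find /data -name "*.safetensors" -o -name "*.index.json" | sort'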