Install and use the HyperPod CLI
Prerequisites
- Cluster - You'll need a cluster created with Amazon EKS on SageMaker HyperPod. Instructions can be found in 1. Cluster Setup; please follow them if you haven't done so already.
- Dependencies - Make sure that you have deployed the GPU device plugin, the EFA device plugin, and the Kubeflow training operator to your cluster. See the Installing the required tools section and the Add-ons section.
- FSx for Lustre file system - Please follow the steps detailed in "Set up your shared file system" under the Orchestrated by EKS section here to create the FSx file system.
- HyperPod CLI (recommended, not mandatory) - Please follow the "Installing the HyperPod CLI" installation instructions under the Add-ons section.
Environment
Set up the virtual environment. Make sure you're using Python 3.9 or greater.
python3 -m venv ${PWD}/venv
source venv/bin/activate
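Since the recipes require Python 3.9 or greater, it can be worth guarding the install with a quick version check. The helper below is a hypothetical sketch (not part of the recipes repo) that validates the interpreter the venv put on your PATH:

```shell
#!/bin/sh
# Hypothetical helper (not part of the recipes repo): succeed only when the
# given major/minor pair is at least Python 3.9.
check_py_version() {
    major="$1"; minor="$2"
    [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 9 ]; }
}

# Apply it to the interpreter inside the activated venv.
if check_py_version "$(python3 -c 'import sys; print(sys.version_info.major)')" \
                    "$(python3 -c 'import sys; print(sys.version_info.minor)')"; then
    echo "Python version OK"
else
    echo "Python 3.9 or greater is required" >&2
fi
```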
Download and set up the HyperPod recipes repo.
git clone --recursive https://github.com/aws/sagemaker-hyperpod-recipes.git
cd sagemaker-hyperpod-recipes
pip3 install -r requirements.txt
Data
HyperPod recipes support tokenized data in any of the following formats:
- JSON
- JSONGZ (Compressed JSON)
- ARROW
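As a concrete sketch of the JSON form, a tokenized file typically holds one record of token IDs per line. The field name `input_ids` below is an assumption; check your recipe's data loader for the actual schema it expects:

```shell
# Write a tiny tokenized dataset in JSON-lines form.
# "input_ids" is an assumed field name, used here only for illustration.
cat > sample_train.json <<'EOF'
{"input_ids": [101, 2009, 2003, 1037, 2742, 102]}
{"input_ids": [101, 2178, 2742, 5537, 102]}
EOF

# JSONGZ is simply the gzip-compressed form of the same file.
gzip -kf sample_train.json
ls sample_train.json sample_train.json.gz
```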
Launch training
Using HyperPod Cli (Recommended)
We recommend using the SageMaker HyperPod command-line interface (CLI) tool to submit your training job with your configurations. The following example submits a training job for pretraining the Llama 3 8B model. You can check the recipe config for more details.
"recipes.model.data.use_synthetic_data":"true" ."recipes.model.hf_access_token": "<your_hf_token>"hyperpod start-job --recipe training/llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain \
--persistent-volume-claims fsx-claim:data \
--override-parameters \
'{
"recipes.run.name": "hf-llama3-8b",
"recipes.exp_manager.exp_dir": "/data/<your_exp_dir>",
"container": "658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121",
"recipes.model.data.train_dir": "<your_train_data_dir>",
"recipes.model.data.val_dir": "<your_val_data_dir>",
"cluster": "k8s",
"cluster_type": "k8s"
}'
After you’ve submitted a training job, you can use the following command to verify that it was submitted successfully.
kubectl get pods
NAME READY STATUS RESTARTS AGE
hf-llama3-<your-alias>-worker-0 0/1 Running 0 36s
After the job STATUS changes to Running, you can examine the log by using the following command.
kubectl logs <name of pod>
Once the job is completed, the STATUS of the pods changes to Completed when you run kubectl get pods.
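Instead of re-running kubectl get pods by hand, you can poll for the phase you expect. The helper below is a sketch (not part of the HyperPod CLI) that uses kubectl's JSONPath output and a configurable poll interval:

```shell
#!/bin/sh
# Sketch of a polling helper (not part of the HyperPod CLI): block until the
# named pod reports the wanted phase, or give up after max_tries attempts.
wait_for_phase() {
    pod="$1"; want="$2"; max_tries="${3:-30}"
    tries=0
    while [ "$tries" -lt "$max_tries" ]; do
        phase=$(kubectl get pod "$pod" -o jsonpath='{.status.phase}' 2>/dev/null)
        [ "$phase" = "$want" ] && return 0
        tries=$((tries + 1))
        sleep "${POLL_INTERVAL:-10}"
    done
    return 1
}

# Example: wait_for_phase hf-llama3-<your-alias>-worker-0 Running
```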
Using recipes launcher
Alternatively, you can use the SageMaker HyperPod recipes to submit your training job. Using the recipes involves updating k8s.yaml and config.yaml, and then running the launch script.
- In recipes_collection/cluster/k8s.yaml, update persistent_volume_claims. It mounts the FSx claim to the /data directory of each compute pod:
  - claimName: fsx-claim
    mountPath: data
- In recipes_collection/config.yaml, update repo_url_or_path under git to use the HyperPod recipes Git URL:
git:
repo_url_or_path: <training_adapter_repo>
branch: null
commit: null
entry_script: null
token: null
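Before launching, it's worth confirming that the FSx claim referenced in k8s.yaml is actually bound. A hedged check, assuming the claim name fsx-claim from above:

```shell
# Check that the FSx PVC referenced in k8s.yaml exists and is Bound.
# The claim name "fsx-claim" is taken from the persistent_volume_claims example above.
phase=$(kubectl get pvc fsx-claim -o jsonpath='{.status.phase}' 2>/dev/null)
if [ "$phase" = "Bound" ]; then
    echo "fsx-claim is Bound"
else
    echo "fsx-claim is not Bound (status: ${phase:-not found})" >&2
fi
```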
HyperPod recipes provide a launch script for each recipe under the launcher_scripts directory. To pretrain the Llama 3 8B model, update the launch script at launcher_scripts/llama/run_hf_llama3_8b_seq16k_gpu_p5x16_pretrain.sh
The launch script should look like the following
#!/bin/bash
# Users should set up their cluster type in /recipes_collection/config.yaml
REGION="<region>"
IMAGE="658645717510.dkr.ecr.${REGION}.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121"
SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}
EXP_DIR="<your_exp_dir>" # Location to save experiment info including logging, checkpoints, etc.
TRAIN_DIR="<your_training_data_dir>" # Location of training dataset
VAL_DIR="<your_val_data_dir>" # Location of validation dataset
HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
recipes=training/llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain \
base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
recipes.run.name="hf-llama3" \
recipes.exp_manager.exp_dir="$EXP_DIR" \
cluster=k8s \
cluster_type=k8s \
container="${IMAGE}" \
recipes.model.data.train_dir="$TRAIN_DIR" \
recipes.model.data.val_dir="$VAL_DIR"
To train on synthetic data instead, add recipes.model.data.use_synthetic_data=true to the overrides; for gated Hugging Face models, also pass recipes.model.hf_access_token=<your_hf_token>. Once the script is ready, you can launch the training job using the following command
bash launcher_scripts/llama/run_hf_llama3_8b_seq16k_gpu_p5x16_pretrain.sh
After you’ve submitted a training job, you can use the following command to verify that it was submitted successfully.
kubectl get pods
NAME READY STATUS RESTARTS AGE
hf-llama3-<your-alias>-worker-0 0/1 Running 0 36s
After the job STATUS changes to Running, you can examine the log by using the following command.
kubectl logs <name of pod>
Once the job is completed, the STATUS of the pods changes to Completed when you run kubectl get pods.