Get Started Training Llama 2 with PyTorch FSDP in 5 Minutes
This example showcases an easy way to get started with multi-node FSDP training on Amazon EKS with SageMaker HyperPod. It is designed to be as simple as possible, requires no data preparation, and uses a Docker image.
Prerequisites
Before starting, ensure you have completed the following setup:
Infrastructure Requirements
- ✅ SageMaker HyperPod EKS cluster deployed and running
- ✅ GPU node groups with appropriate instance types (e.g., ml.g5.8xlarge, ml.p5en.48xlarge)
- ✅ GPU device plugin installed on the cluster
- ✅ EFA device plugin installed for high-performance networking
- ✅ Kubeflow Training Operator installed on the cluster
Development Environment
- ✅ AWS CLI v2 installed and configured with appropriate permissions
- ✅ kubectl installed and configured to access your EKS cluster
- ✅ Docker installed on your development machine (x86-64 based)
- ✅ envsubst utility for template processing
- ✅ Git for cloning repositories
- ✅ HuggingFace account and token for dataset access
AWS Permissions
Your AWS credentials should have permissions for:
- ✅ Amazon ECR - push/pull container images
- ✅ Amazon EKS - access cluster resources
- ✅ Amazon EC2 - describe instances and availability zones
- ✅ AWS STS - get caller identity
Cluster Validation
Verify your cluster is ready:
# Check cluster status and GPU availability
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,INSTANCETYPE:.metadata.labels.node\.kubernetes\.io/instance-type,GPU:.status.allocatable.nvidia\.com/gpu,EFA:.status.allocatable.vpc\.amazonaws\.com/efa"
# Verify Kubeflow Training Operator is running
kubectl get pods -n kubeflow
# Check GPU device plugin
kubectl get daemonset -n kube-system | grep nvidia
# Verify EFA device plugin
kubectl get daemonset -n kube-system | grep aws-efa
# Verify you can create resources
kubectl auth can-i create pytorchjobs
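If you script these checks, a minimal sketch like the following can fail fast when no GPUs are allocatable (the jsonpath expression and the zero threshold are illustrative assumptions, not part of the original walkthrough):
# Sum allocatable NVIDIA GPUs across all nodes; abort if none are visible
GPU_TOTAL=$(kubectl get nodes -o jsonpath='{range .items[*]}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}' | awk '{s+=$1} END {print s+0}')
if [ "${GPU_TOTAL}" -eq 0 ]; then
  echo "No allocatable GPUs found on the cluster" >&2
  exit 1
fi
echo "Allocatable GPUs: ${GPU_TOTAL}"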
Verified Instance Types and Counts
This example has been verified with:
- ml.p5en.48xlarge x 2 - High-performance training setup
Note that you can change the model size to accommodate other instance types.
Model Size Configurations
The following table shows the parameters for different Llama model sizes based on the Llama 2 and Llama 3 papers:
| Parameter | Llama 2 7B | Llama 2 13B | Llama 2 70B | Llama 3.1 8B | Llama 3.1 70B | Llama 3.2 1B | Llama 3.2 3B |
|---|---|---|---|---|---|---|---|
| intermediate_size | 11008 | 13824 | 28672 | 14336 | 28672 | 8192 | 11008 |
| num_key_value_heads | 32 | 40 | 8 | 8 | 8 | 8 | 8 |
| hidden_width | 4096 | 5120 | 8192 | 4096 | 8192 | 2048 | 3072 |
| num_layers | 32 | 40 | 80 | 32 | 80 | 16 | 28 |
| num_heads | 32 | 40 | 64 | 32 | 64 | 32 | 24 |
| max_context_length | 4096 | 4096 | 4096 | 8192 | 8192 | 8192 | 8192 |
These configurations can be used to adjust the model parameters in your training scripts based on your compute requirements and available instance types.
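For example, the Llama 2 13B column maps to the following training-script flags. The flag names are the same ones that appear in the `script_args` of the HyperPod CLI configuration later in this guide; this is an illustrative sketch of the values to adjust wherever your manifest or config passes arguments to `train.py`:
# Llama 2 13B column from the table above, expressed as train.py flags
--max_context_width 4096
--num_key_value_heads 40
--intermediate_size 13824
--hidden_width 5120
--num_layers 40
--num_heads 40
--model_type llama_v2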
Step 1: Setup the Docker Image
1.1 Clone the Repository
The first step is to get the FSDP training code and Docker configuration. We'll clone the AWS distributed training examples repository, which contains pre-built PyTorch FSDP examples optimized for Kubernetes.
cd ~
git clone https://github.com/aws-samples/awsome-distributed-training/
cd awsome-distributed-training/3.test_cases/pytorch/FSDP
1.2 Build a Docker Image
Now we'll build a container image that includes PyTorch, FSDP training code, and all necessary dependencies. First, we need to authenticate with the public ECR registry to access base images.
aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/hpc-cloud
export REGION=$(aws ec2 describe-availability-zones --output text --query 'AvailabilityZones[0].[RegionName]')
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export REGISTRY=${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/
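As an optional sanity check (not part of the original steps), confirm all three variables resolved before building:
# All three values should be non-empty
echo "REGION=${REGION} ACCOUNT=${ACCOUNT} REGISTRY=${REGISTRY}"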
Build the container image:
If you are on a Mac, use `buildx` to target the `linux/amd64` architecture:
docker buildx build --platform linux/amd64 -t ${REGISTRY}fsdp:pytorch2.5.1 .
Alternatively, if you are running in a SageMaker Studio environment:
docker build $DOCKER_NETWORK -t ${REGISTRY}fsdp:pytorch2.5.1 .
Why `$DOCKER_NETWORK`?
The environment variable `$DOCKER_NETWORK` is set to `--network=sagemaker` only if you deployed the SageMaker Studio Code Editor CloudFormation stack in the Set Up Your Development Environment section. This is necessary because SageMaker Studio uses a specific network configuration for its containers. Otherwise, the variable remains unset.
Building the image can take 5~7 minutes. If successful, you should see the following success message at the end:
Successfully built 123ab12345cd
Successfully tagged 123456789012.dkr.ecr.us-west-2.amazonaws.com/fsdp:pytorch2.5.1
1.3 Push the Image to Amazon ECR
In this step we create a container registry if one does not exist, and push the container image to it. This makes the image available to your EKS cluster nodes.
# Create registry if needed
REGISTRY_COUNT=$(aws ecr describe-repositories | grep "fsdp" | wc -l)
if [ "$REGISTRY_COUNT" -eq 0 ]; then
aws ecr create-repository --repository-name fsdp
fi
# Login to registry
echo "Logging in to $REGISTRY ..."
aws ecr get-login-password | docker login --username AWS --password-stdin $REGISTRY
# Push image to registry
docker image push ${REGISTRY}fsdp:pytorch2.5.1
Pushing the image may take some time depending on your network bandwidth. If you use EC2 / CloudShell as your development machine, it will take 6~8 minutes.
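To confirm the push succeeded, you can list the tags stored in the repository (an optional check using the standard AWS CLI):
# List image tags in the fsdp repository
aws ecr describe-images --repository-name fsdp --query 'imageDetails[].imageTags[]' --output text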
Step 2: Data and HuggingFace Setup
2.1 Understanding the Dataset
For this example, we'll be using the allenai/c4 dataset. Instead of downloading the whole thing, the `create_streaming_dataloaders` function will stream the dataset from HuggingFace, so there's no data prep required for running this training.
If you'd like to use your own dataset instead, you can do so by formatting it as a HuggingFace dataset and passing its location to the `--dataset_path` argument.
2.2 Create HuggingFace Token
For this dataset, we will need a Hugging Face access token. First, create a Hugging Face account. Then generate your access token with read permissions.
We will reference this token in the next step by setting it as an environment variable.
Step 3: Start Your Training Run
3.1 Install envsubst
This example uses envsubst
to generate a Kubernetes manifest file from a template file and parameters. If you don't have envsubst
on your development environment, install it by following the Installation instruction.
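`envsubst` ships with GNU gettext, so on most Linux distributions installing the gettext package is enough (the package names below are common defaults; adjust for your OS):
# Debian / Ubuntu
sudo apt-get install -y gettext
# Amazon Linux / RHEL
sudo yum install -y gettext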
3.2 Generate Manifest from Template
With the `envsubst` command, generate `fsdp.yaml` from `fsdp.yaml-template`. Configure the instance type, number of nodes, number of GPUs, and number of EFA devices based on your cluster's specification.
You can check your cluster's specification by running the following command:
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,INSTANCETYPE:.metadata.labels.node\.kubernetes\.io/instance-type,GPU:.status.allocatable.nvidia\.com/gpu,EFA:.status.allocatable.vpc\.amazonaws\.com/efa"
NAME INSTANCETYPE GPU EFA
hyperpod-i-055aeff9546187dee ml.g5.8xlarge 1 1
hyperpod-i-09662f64f615c96f5 ml.g5.8xlarge 1 1
hyperpod-i-099e2a84aba621d52 ml.g5.8xlarge 1 1
hyperpod-i-0a6fea3329235be91 ml.g5.8xlarge 1 1
hyperpod-i-0ac3feb733dc0f00e ml.g5.8xlarge 1 1
hyperpod-i-0bf7dce836e063fa6 ml.g5.8xlarge 1 1
hyperpod-i-0ddf28f3ff2870f1b ml.g5.8xlarge 1 1
hyperpod-i-0fe48912b03d2c22e ml.g5.8xlarge 1 1
Change to the `kubernetes` directory:
cd kubernetes/
Set the environment variables and run `envsubst` to generate `fsdp.yaml`.
For ml.g5.8xlarge x 8:
export IMAGE_URI=${REGISTRY}fsdp:pytorch2.5.1
export INSTANCE_TYPE=ml.g5.8xlarge
export NUM_NODES=8
export GPU_PER_NODE=1
export EFA_PER_NODE=1
export FI_PROVIDER=efa
export HF_TOKEN=<Your HuggingFace Token>
cat fsdp.yaml-template | envsubst > fsdp.yaml
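Optionally, you can validate the generated manifest before deploying it (a quick check, not required by the walkthrough):
# Confirm the variables were substituted
grep -E 'image:|nvidia.com/gpu|vpc.amazonaws.com/efa' fsdp.yaml
# Validate the manifest client-side without creating resources
kubectl apply --dry-run=client -f fsdp.yaml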
3.3 Deploy the Training Job
Now that the manifest file `fsdp.yaml` has been generated, you are ready to deploy the training workload:
kubectl apply -f ./fsdp.yaml
You should see the following message:
pytorchjob.kubeflow.org/fsdp created
3.4 Monitor Your Training Job
To see the status of your job, use the commands below:
kubectl get pytorchjob
kubectl get pods
NAME STATE AGE
fsdp Running 5m
NAME READY STATUS RESTARTS AGE
etcd-7787559c74-pw4jp 1/1 Running 0 74s
fsdp-worker-0 0/1 ContainerCreating 0 74s
fsdp-worker-1 0/1 ContainerCreating 0 74s
fsdp-worker-2 0/1 ContainerCreating 0 74s
fsdp-worker-3 0/1 ContainerCreating 0 74s
fsdp-worker-4 0/1 ContainerCreating 0 74s
fsdp-worker-5 0/1 ContainerCreating 0 74s
fsdp-worker-6 0/1 ContainerCreating 0 74s
fsdp-worker-7 0/1 ContainerCreating 0 74s
When you run for the first time, it takes 3~4 minutes until the Pod statuses change from `ContainerCreating` to `Running`.
NAME READY STATUS RESTARTS AGE
etcd-7787559c74-pw4jp 1/1 Running 0 3m43s
fsdp-worker-0 1/1 Running 0 3m43s
fsdp-worker-1 1/1 Running 0 3m43s
fsdp-worker-2 1/1 Running 0 3m43s
fsdp-worker-3 1/1 Running 0 3m43s
fsdp-worker-4 1/1 Running 0 3m43s
fsdp-worker-5 1/1 Running 0 3m43s
fsdp-worker-6 1/1 Running 0 3m43s
fsdp-worker-7 1/1 Running 0 3m43s
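You can watch this transition in real time with the standard watch flag (press Ctrl-C to stop):
kubectl get pods --watch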
Each of the pods produces job logs. One of the pods is elected as the master during job initialization, and only this pod shows the progress of the training job in its log. To find out which pod is currently the master, run the command below:
kubectl logs fsdp-worker-0 | grep master_addr=
[2024-06-25 22:20:17,556] torch.distributed.elastic.agent.server.api: [INFO] master_addr=fsdp-worker-1
This shows that the pod `fsdp-worker-1` is currently the master. To look at the current job logs, use the command below:
kubectl logs -f fsdp-worker-1
:
2024-06-25 22:22:36 I [train.py:102] Batch 0 Loss: 11.63946, Speed: 0.27 samples/sec, lr: 0.000006
2024-06-25 22:22:57 I [train.py:102] Batch 1 Loss: 11.66096, Speed: 0.39 samples/sec, lr: 0.000013
2024-06-25 22:23:17 I [train.py:102] Batch 2 Loss: 11.56659, Speed: 0.40 samples/sec, lr: 0.000019
2024-06-25 22:23:37 I [train.py:102] Batch 3 Loss: 11.14039, Speed: 0.40 samples/sec, lr: 0.000025
:
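If you want to script the master lookup, a minimal sketch (assuming GNU grep with PCRE support) is:
# Extract the elected master pod name from the worker-0 log, then follow its log
MASTER=$(kubectl logs fsdp-worker-0 | grep -oP 'master_addr=\K\S+' | head -n 1)
kubectl logs -f "${MASTER}"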
You can run the `nvtop` command inside a running container within a Pod to see GPU utilization:
kubectl exec -it fsdp-worker-4 -- nvtop
3.5 Stop the Training
To stop the current training job, use the following command:
kubectl delete -f ./fsdp.yaml
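Afterwards, you can confirm the job and its pods are gone:
kubectl get pytorchjobs
kubectl get pods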
Alternative: Start Training with the HyperPod CLI
Note: This section shows how to run the sample application with the HyperPod CLI instead of `kubectl`. If you haven't installed the HyperPod CLI, see the Install HyperPod CLI page.
Set environment variables
Check your cluster's specification by running the following command:
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,INSTANCETYPE:.metadata.labels.node\.kubernetes\.io/instance-type,GPU:.status.allocatable.nvidia\.com/gpu,EFA:.status.allocatable.vpc\.amazonaws\.com/efa"
NAME INSTANCETYPE GPU EFA
hyperpod-i-055aeff9546187dee ml.g5.8xlarge 1 1
hyperpod-i-09662f64f615c96f5 ml.g5.8xlarge 1 1
hyperpod-i-099e2a84aba621d52 ml.g5.8xlarge 1 1
hyperpod-i-0a6fea3329235be91 ml.g5.8xlarge 1 1
hyperpod-i-0ac3feb733dc0f00e ml.g5.8xlarge 1 1
hyperpod-i-0bf7dce836e063fa6 ml.g5.8xlarge 1 1
hyperpod-i-0ddf28f3ff2870f1b ml.g5.8xlarge 1 1
hyperpod-i-0fe48912b03d2c22e ml.g5.8xlarge 1 1
Set the following environment variables based on your cluster configuration.
export IMAGE_URI=${REGISTRY}fsdp:pytorch2.5.1
export INSTANCE_TYPE=ml.g5.8xlarge
export NUM_NODES=8
export GPU_PER_NODE=1
Generate a job configuration file
Run the following command to generate a job configuration file (`hpcli-fsdp.yaml`) for the HyperPod CLI.
cat > hpcli-fsdp.yaml << EOL
defaults:
  - override hydra/job_logging: stdout
hydra:
  run:
    dir: .
  output_subdir: null
training_cfg:
  entry_script: /fsdp/train.py
  script_args:
    - --max_context_width: 4096
    - --num_key_value_heads: 32
    - --intermediate_size: 11008
    - --hidden_width: 4096
    - --num_layers: 32
    - --num_heads: 32
    - --model_type: llama_v2
    - --tokenizer: hf-internal-testing/llama-tokenizer
    - --checkpoint_freq: 5000
    - --validation_freq: 500
    - --max_steps: 5000
    - --checkpoint_dir: /checkpoints
    - --dataset: allenai/c4
    - --dataset_config_name: en
    - --resume_from_checkpoint: /checkpoints
    - --train_batch_size: 1
    - --val_batch_size: 1
    - --sharding_strategy: full
    - --offload_activation: 1
  run:
    name: fsdp
    nodes: ${NUM_NODES}
    ntasks_per_node: ${GPU_PER_NODE}
cluster:
  cluster_type: k8s
  instance_type: ${INSTANCE_TYPE}
  cluster_config:
    service_account_name: null
    volumes:
      - volumeName: local
        hostPath: "/mnt/k8s-disks/0"
        mountPath: "/local"
    namespace: kubeflow
    label_selector:
      required:
        sagemaker.amazonaws.com/node-health-status:
          - Schedulable
      preferred:
        sagemaker.amazonaws.com/deep-health-check-status:
          - Passed
      weights:
        - 100
    pullPolicy: Always
    restartPolicy: OnFailure
    annotations:
      sagemaker.amazonaws.com/enable-job-auto-resume: True
      sagemaker.amazonaws.com/job-max-retry-count: 10
base_results_dir: ./result
container: ${IMAGE_URI}
env_vars:
  LOGLEVEL: DEBUG
  TORCH_DISTRIBUTED_DEBUG: DETAIL
  TORCH_NCCL_ENABLE_MONITORING: 1
  TORCH_NCCL_TRACE_BUFFER_SIZE: 20000
  TORCH_NCCL_DUMP_ON_TIMEOUT: 1
  TORCH_NCCL_DEBUG_INFO_TEMP_FILE: /local/nccl_trace_rank_
  PYTORCH_CUDA_ALLOC_CONF: "expandable_segments:True"
  NCCL_DEBUG: INFO
  NCCL_SOCKET_IFNAME: ^lo
  TORCH_NCCL_ASYNC_ERROR_HANDLING: 1
EOL
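Since the heredoc above expands the environment variables as it is written, you can confirm the values landed in the file (an optional check):
# These lines should show concrete values, not ${...} placeholders
grep -E 'nodes:|instance_type:|container:' hpcli-fsdp.yaml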
Start training job
Now that the job configuration file `hpcli-fsdp.yaml` has been generated, you are ready to start the training job.
Before starting the job, you need to select the cluster with the `hyperpod connect-cluster` command.
hyperpod connect-cluster --cluster-name ml-cluster
Then run the `hyperpod start-job` command to start the job.
hyperpod start-job --config-file ./hpcli-fsdp.yaml
{
  "Console URL": "https://us-west-2.console.aws.amazon.com/sagemaker/home?region=us-west-2#/cluster-management/ml-cluster"
}
Monitor
To see the status of your job, use the commands below:
hyperpod get-job --job-name fsdp -n kubeflow
{
  "Name": "fsdp",
  "Namespace": "kubeflow",
  "Label": {
    "app": "fsdp",
    "app.kubernetes.io/managed-by": "Helm"
  },
  "CreationTimestamp": "2024-09-26T01:06:51Z",
  "Status": {
    "conditions": [
      {
        "lastTransitionTime": "2024-09-26T01:06:51Z",
        "lastUpdateTime": "2024-09-26T01:06:51Z",
        "message": "PyTorchJob fsdp is created.",
        "reason": "PyTorchJobCreated",
        "status": "True",
        "type": "Created"
      },
      {
        "lastTransitionTime": "2024-09-26T01:07:02Z",
        "lastUpdateTime": "2024-09-26T01:07:02Z",
        "message": "PyTorchJob kubeflow/fsdp is running.",
        "reason": "PyTorchJobRunning",
        "status": "True",
        "type": "Running"
      }
    ],
    "replicaStatuses": {
      "Worker": {
        "active": 8,
        "selector": "training.kubeflow.org/job-name=fsdp,training.kubeflow.org/operator-name=pytorchjob-controller,training.kubeflow.org/replica-type=worker"
      }
    },
    "startTime": "2024-09-26T01:07:00Z"
  },
  "ConsoleURL": "https://us-west-2.console.aws.amazon.com/sagemaker/home?region=us-west-2#/cluster-management/k8-g5-8x-4"
}
If you need more detailed information about the job, you can use the `--verbose` option.
hyperpod get-job --job-name fsdp -n kubeflow --verbose
{
  "Name": "fsdp",
  "Namespace": "kubeflow",
  "Label": {
    "app": "fsdp",
    "app.kubernetes.io/managed-by": "Helm"
  },
  "Annotations": {
    "meta.helm.sh/release-name": "fsdp",
    "meta.helm.sh/release-namespace": "kubeflow",
    "sagemaker.amazonaws.com/enable-job-auto-resume": "true",
    "sagemaker.amazonaws.com/job-max-retry-count": "10"
  },
  "Metadata": {
    "CreationTimestamp": "2024-09-26T01:06:51Z",
    "Generation": 1,
    "ResourceVersion": "4240104",
    "UID": "39364a40-70c7-4d03-abab-160c124e7367"
  },
  "Kind": "PyTorchJob",
  "ApiVersion": "kubeflow.org/v1",
  "Spec": {
    "pytorchReplicaSpecs": {
      "Worker": {
        "replicas": 8,
        "template": {
          "spec": {
            "affinity": {
              "nodeAffinity": {
                "preferredDuringSchedulingIgnoredDuringExecution": [
                  {
                    "preference": {
                      "matchExpressions": [
                        {
                          "key": "sagemaker.amazonaws.com/deep-health-check-status",
                          "operator": "In",
                          "values": [
                            "Passed"
                          ]
                        }
                      ]
                    },
                    "weight": 100
                  }
                ],
                "requiredDuringSchedulingIgnoredDuringExecution": {
                  "nodeSelectorTerms": [
                    {
                      "matchExpressions": [
                        {
                          "key": "sagemaker.amazonaws.com/node-health-status",
                          "operator": "In",
                          "values": [
                            "Schedulable"
                          ]
                        }
                      ]
                    }
                  ]
                }
              }
            },
            "containers": [
              {
                "command": [
                  "/etc/config/train-script.sh"
                ],
                "env": [
                  {
                    "name": "CUDA_DEVICE_MAX_CONNECTIONS",
                    "value": "1"
                  },
                  {
                    "name": "CUDA_VISIBLE_DEVICES",
                    "value": "0"
                  },
                  {
                    "name": "FI_EFA_FORK_SAFE",
                    "value": "1"
                  },
                  {
                    "name": "FI_PROVIDER",
                    "value": "efa"
                  },
                  {
                    "name": "LOGLEVEL",
                    "value": "DEBUG"
                  },
                  {
                    "name": "NCCL_DEBUG",
                    "value": "INFO"
                  },
                  {
                    "name": "NCCL_IGNORE_DISABLED_P2P",
                    "value": "1"
                  },
                  {
                    "name": "NCCL_PROTO",
                    "value": "simple"
                  },
                  {
                    "name": "NCCL_SOCKET_IFNAME",
                    "value": "^lo,docker0"
                  },
                  {
                    "name": "PYTORCH_CUDA_ALLOC_CONF",
                    "value": "expandable_segments:True"
                  },
                  {
                    "name": "TORCH_DISTRIBUTED_DEBUG",
                    "value": "DETAIL"
                  },
                  {
                    "name": "TORCH_DIST_INIT_BARRIER",
                    "value": "1"
                  },
                  {
                    "name": "TORCH_NCCL_ASYNC_ERROR_HANDLING",
                    "value": "1"
                  },
                  {
                    "name": "TORCH_NCCL_DEBUG_INFO_TEMP_FILE",
                    "value": "/local/nccl_trace_rank_"
                  },
                  {
                    "name": "TORCH_NCCL_DUMP_ON_TIMEOUT",
                    "value": "1"
                  },
                  {
                    "name": "TORCH_NCCL_ENABLE_MONITORING",
                    "value": "1"
                  },
                  {
                    "name": "TORCH_NCCL_TRACE_BUFFER_SIZE",
                    "value": "20000"
                  }
                ],
                "image": "842413447717.dkr.ecr.us-west-2.amazonaws.com/fsdp:pytorch2.2",
                "imagePullPolicy": "Always",
                "name": "pytorch",
                "resources": {
                  "limits": {
                    "nvidia.com/gpu": 1,
                    "vpc.amazonaws.com/efa": 1
                  },
                  "requests": {
                    "nvidia.com/gpu": 1,
                    "vpc.amazonaws.com/efa": 1
                  }
                },
                "securityContext": {
                  "capabilities": {
                    "add": [
                      "IPC_LOCK"
                    ]
                  }
                },
                "volumeMounts": [
                  {
                    "mountPath": "/local",
                    "name": "local"
                  },
                  {
                    "mountPath": "/etc/config",
                    "name": "train-script"
                  },
                  {
                    "mountPath": "/dev/shm",
                    "name": "shm"
                  }
                ]
              }
            ],
            "restartPolicy": "OnFailure",
            "volumes": [
              {
                "hostPath": {
                  "path": "/mnt/k8s-disks/0"
                },
                "name": "local"
              },
              {
                "hostPath": {
                  "path": "/dev/shm",
                  "type": "Directory"
                },
                "name": "shm"
              },
              {
                "configMap": {
                  "defaultMode": 420,
                  "items": [
                    {
                      "key": "train-script.sh",
                      "mode": 365,
                      "path": "train-script.sh"
                    }
                  ],
                  "name": "train-script-fsdp"
                },
                "name": "train-script"
              }
            ]
          }
        }
      }
    }
  },
  "Status": {
    "conditions": [
      {
        "lastTransitionTime": "2024-09-26T01:06:51Z",
        "lastUpdateTime": "2024-09-26T01:06:51Z",
        "message": "PyTorchJob fsdp is created.",
        "reason": "PyTorchJobCreated",
        "status": "True",
        "type": "Created"
      },
      {
        "lastTransitionTime": "2024-09-26T01:07:02Z",
        "lastUpdateTime": "2024-09-26T01:07:02Z",
        "message": "PyTorchJob kubeflow/fsdp is running.",
        "reason": "PyTorchJobRunning",
        "status": "True",
        "type": "Running"
      }
    ],
    "replicaStatuses": {
      "Worker": {
        "active": 8,
        "selector": "training.kubeflow.org/job-name=fsdp,training.kubeflow.org/operator-name=pytorchjob-controller,training.kubeflow.org/replica-type=worker"
      }
    },
    "startTime": "2024-09-26T01:07:00Z"
  },
  "ConsoleURL": "https://us-west-2.console.aws.amazon.com/sagemaker/home?region=us-west-2#/cluster-management/k8-g5-8x-4"
}
You can use the `hyperpod list-pods` command to list pods.
hyperpod list-pods --job-name fsdp -n kubeflow
{
  "pods": [
    {
      "PodName": "fsdp-worker-0",
      "Namespace": "kubeflow",
      "Status": "Running",
      "CreationTime": "2024-09-26 01:07:01+00:00"
    },
    {
      "PodName": "fsdp-worker-1",
      "Namespace": "kubeflow",
      "Status": "Running",
      "CreationTime": "2024-09-26 01:07:01+00:00"
    },
    {
      "PodName": "fsdp-worker-2",
      "Namespace": "kubeflow",
      "Status": "Running",
      "CreationTime": "2024-09-26 01:07:01+00:00"
    },
    {
      "PodName": "fsdp-worker-3",
      "Namespace": "kubeflow",
      "Status": "Running",
      "CreationTime": "2024-09-26 01:07:01+00:00"
    },
    {
      "PodName": "fsdp-worker-4",
      "Namespace": "kubeflow",
      "Status": "Running",
      "CreationTime": "2024-09-26 01:07:01+00:00"
    },
    {
      "PodName": "fsdp-worker-5",
      "Namespace": "kubeflow",
      "Status": "Running",
      "CreationTime": "2024-09-26 01:07:01+00:00"
    },
    {
      "PodName": "fsdp-worker-6",
      "Namespace": "kubeflow",
      "Status": "Running",
      "CreationTime": "2024-09-26 01:07:01+00:00"
    },
    {
      "PodName": "fsdp-worker-7",
      "Namespace": "kubeflow",
      "Status": "Running",
      "CreationTime": "2024-09-26 01:07:01+00:00"
    }
  ]
}
You can use the `hyperpod get-log` command to print logs from a pod.
hyperpod get-log --job-name fsdp --pod fsdp-worker-0 -n kubeflow
:
2024-09-26 01:09:17 I [train.py:102] Batch 0 Loss: 11.67824, Speed: 0.40 samples/sec, lr: 0.000006
2024-09-26 01:09:34 I [train.py:102] Batch 1 Loss: 11.71413, Speed: 0.47 samples/sec, lr: 0.000013
2024-09-26 01:09:52 I [train.py:102] Batch 2 Loss: 11.55315, Speed: 0.46 samples/sec, lr: 0.000019
2024-09-26 01:10:09 I [train.py:102] Batch 3 Loss: 11.21573, Speed: 0.47 samples/sec, lr: 0.000025
2024-09-26 01:10:26 I [train.py:102] Batch 4 Loss: 10.91101, Speed: 0.46 samples/sec, lr: 0.000031
:
Troubleshoot
When you don't see logs from pods, use `kubectl` to check the status of the underlying Kubernetes resources.
# List PyTorchJobs
kubectl get pytorchjobs -n kubeflow
# Get details of a PyTorchJob
kubectl describe pytorchjob fsdp -n kubeflow
# List Pods
kubectl get pods -n kubeflow
# Get details of a Pod
kubectl describe pod fsdp-worker-0 -n kubeflow
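If a pod is stuck in `Pending` or `ContainerCreating`, recent cluster events usually explain why (a suggested extra check, using standard `kubectl`):
# Show the most recent events in the namespace
kubectl get events -n kubeflow --sort-by=.lastTimestamp | tail -n 20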
Stop
To stop the current training job, use the following command:
hyperpod cancel-job --job-name fsdp -n kubeflow
Then verify that the list of jobs is empty:
hyperpod list-jobs -n kubeflow
{
  "jobs": []
}