
NVIDIA Isaac Lab Training Pipeline

The NVIDIA Isaac Lab pipeline enables reinforcement learning (RL) policy training and evaluation using NVIDIA Isaac Lab on GPU-accelerated Amazon EC2 instances managed by AWS Batch. It supports two operational modes -- training new RL policies from scratch and evaluating pre-trained policies -- both orchestrated by AWS Step Functions with asynchronous task token callbacks.

Learn more

Read the AWS blog post GPU-Accelerated Robotic Simulation Training with NVIDIA Isaac Lab in VAMS for a detailed walkthrough of the pipeline architecture, setup, and usage with example training scenarios.

Overview

| Property | Value |
| --- | --- |
| Pipeline IDs | `isaaclab-training`, `isaaclab-evaluation` |
| Configuration flag | `app.pipelines.useIsaacLabTraining.enabled` |
| Execution type | Lambda (asynchronous with callback) |
| Compute | AWS Batch with GPU instances (G6, G6E, G5 families) |
| Storage | Amazon Elastic File System (Amazon EFS) for checkpoints |
| Training timeout | 8 hours |
| Evaluation timeout | 2 hours |

Architecture

Isaac Lab Pipeline Architecture

The pipeline uses a two-level AWS Step Functions pattern. The VAMS workflow invokes the vamsExecute Lambda function, which starts an internal Step Functions state machine. The internal state machine manages the AWS Batch GPU job lifecycle and reports completion back to the VAMS workflow via task tokens.
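
As a rough illustration of the callback handoff, the following sketch shows the two payloads involved: the vamsExecute Lambda embeds the VAMS workflow's task token in the internal execution's input, and the internal state machine's final step returns it via `send_task_success`. The helper names and the `jobParameters` shape are illustrative, not part of the VAMS API.

```python
import json


def build_internal_execution_input(task_token: str, job_params: dict) -> str:
    """Input for the internal state machine: carries the VAMS
    workflow's task token alongside the Batch job parameters so the
    final state can call back to the waiting workflow."""
    return json.dumps({"taskToken": task_token, "jobParameters": job_params})


def build_callback_params(task_token: str, result: dict) -> dict:
    """Keyword arguments for the Step Functions send_task_success
    call that reports the finished Batch job back to VAMS."""
    return {"taskToken": task_token, "output": json.dumps(result)}
```

In the real pipeline these payloads are passed to the boto3 Step Functions client (`start_execution`, then `send_task_success` on completion); the failure path uses the equivalent `send_task_failure` call to propagate the error.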

AWS infrastructure components

| Component | Service | Purpose |
| --- | --- | --- |
| Container image | Amazon Elastic Container Registry (Amazon ECR) | Isaac Lab Docker image built from NVIDIA NGC base |
| Compute environment | AWS Batch (Amazon EC2) | GPU instance management with G6, G6E, G5 instance types |
| Job queue | AWS Batch | Job scheduling and priority management |
| Checkpoint storage | Amazon EFS | Persistent storage for training checkpoints across jobs |
| Orchestration | AWS Step Functions | Workflow management with error handling |
| Monitoring | Amazon CloudWatch Container Insights | ECS cluster metrics and logging |

Configuration

Add the following to your config.json under app.pipelines:

{
  "app": {
    "pipelines": {
      "useIsaacLabTraining": {
        "enabled": true,
        "acceptNvidiaEula": true,
        "autoRegisterWithVAMS": true,
        "keepWarmInstance": false
      }
    }
  }
}

| Option | Default | Description |
| --- | --- | --- |
| `enabled` | `false` | Enable or disable the pipeline deployment. |
| `acceptNvidiaEula` | `false` | Required when enabled. You must accept the NVIDIA Software License Agreement by setting this to `true`; deployment fails if it is `false` while the pipeline is enabled. |
| `autoRegisterWithVAMS` | `true` | Automatically register both the `isaaclab-training` and `isaaclab-evaluation` pipelines and workflows with VAMS at deploy time. |
| `keepWarmInstance` | `false` | When `true`, maintains one warm GPU instance (a `g6.2xlarge` with 8 vCPUs) in the AWS Batch compute environment. Reduces cold-start latency at the cost of continuous GPU instance charges. |

NVIDIA EULA acceptance required

The Isaac Lab container is built from the NVIDIA NGC base image. You must review and accept the NVIDIA Software License Agreement before enabling this pipeline. The CDK deployment will fail with a validation error if acceptNvidiaEula is not set to true.

Prerequisites

  • GPU instance availability -- Request quota increases for G6, G6E, or G5 instance families in your deployment region if needed. The compute environment uses BEST_FIT_PROGRESSIVE allocation across multiple instance types for optimal availability.
  • VPC with NAT Gateway -- The pipeline requires private subnets with internet access (via NAT Gateway) because the Isaac Lab container needs to download NVIDIA Omniverse assets at runtime.
  • Amazon EFS -- An Amazon EFS file system is automatically created in isolated subnets for training checkpoint persistence.
  • Large EBS volume -- A 100 GB GP3 EBS volume is configured via launch template to accommodate the Isaac Lab container image (10+ GB).

Training mode

The training mode trains new RL policies from scratch using the RSL-RL, RL Games, or SKRL reinforcement learning libraries.

Training input parameters

Pass training configuration as inputParameters when triggering the pipeline:

{
  "trainingConfig": {
    "mode": "train",
    "task": "Isaac-Cartpole-Direct-v0",
    "numEnvs": 4096,
    "maxIterations": 1500,
    "rlLibrary": "rsl_rl",
    "seed": 42
  },
  "computeConfig": {
    "numNodes": 1
  }
}

| Parameter | Default | Description |
| --- | --- | --- |
| `trainingConfig.mode` | `"train"` | Must be `"train"` for training mode. |
| `trainingConfig.task` | `"Isaac-Cartpole-v0"` | The Isaac Lab task environment name. |
| `trainingConfig.numEnvs` | `4096` | Number of parallel simulation environments. |
| `trainingConfig.maxIterations` | `1500` | Maximum training iterations. |
| `trainingConfig.rlLibrary` | `"rsl_rl"` | RL library to use. Options: `"rsl_rl"`, `"rl_games"`, `"skrl"`. |
| `trainingConfig.seed` | `null` | Optional random seed for reproducibility. |
| `computeConfig.numNodes` | `1` | Number of compute nodes. Values greater than 1 enable multi-node distributed training via `torchrun`. |
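
The defaults above can be applied mechanically before submitting a job; a minimal sketch (the helper name `resolve_training_config` is illustrative, not part of the pipeline):

```python
# Defaults mirroring the trainingConfig parameter table.
TRAINING_DEFAULTS = {
    "mode": "train",
    "task": "Isaac-Cartpole-v0",
    "numEnvs": 4096,
    "maxIterations": 1500,
    "rlLibrary": "rsl_rl",
    "seed": None,
}

SUPPORTED_RL_LIBRARIES = ("rsl_rl", "rl_games", "skrl")


def resolve_training_config(user_config: dict) -> dict:
    """Merge user-supplied trainingConfig over the defaults,
    rejecting values the pipeline would not accept."""
    if user_config.get("mode", "train") != "train":
        raise ValueError("training mode requires mode == 'train'")
    if user_config.get("rlLibrary", "rsl_rl") not in SUPPORTED_RL_LIBRARIES:
        raise ValueError("unsupported rlLibrary")
    return {**TRAINING_DEFAULTS, **user_config}
```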

Training output

Training results are uploaded to the VAMS asset bucket under the job UUID prefix:

| Output | Format | Description |
| --- | --- | --- |
| `checkpoints/model_*.pt` | PyTorch | Model checkpoint files saved during training |
| `metrics.csv` | CSV | Training metrics exported from TensorBoard event files |
| `*_git_diff.txt` | Text | Configuration diff files (converted from `.diff` for VAMS compatibility) |
| `training-config.json` | JSON | Input configuration saved for reference |

Multi-node training

When computeConfig.numNodes is greater than 1, the pipeline uses AWS Batch multi-node parallel jobs with torchrun for distributed training. The main node (index 0) coordinates training and uploads results. All nodes send heartbeats to AWS Step Functions to prevent timeout.
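
AWS Batch exposes the node topology to multi-node parallel jobs through environment variables, and the `torchrun` launch can be assembled from them; a simplified sketch (the training script name and GPU count are placeholders, and the exact flags used by the container may differ):

```python
import os


def torchrun_args(script: str, gpus_per_node: int = 1, port: int = 29500) -> list[str]:
    """Build a torchrun command line from the environment variables
    AWS Batch sets on multi-node parallel job containers."""
    num_nodes = int(os.environ.get("AWS_BATCH_JOB_NUM_NODES", "1"))
    node_rank = int(os.environ.get("AWS_BATCH_JOB_NODE_INDEX", "0"))
    # The main node (index 0) hosts the rendezvous endpoint.
    main_addr = os.environ.get(
        "AWS_BATCH_JOB_MAIN_NODE_PRIVATE_IPV4_ADDRESS", "127.0.0.1"
    )
    return [
        "torchrun",
        f"--nnodes={num_nodes}",
        f"--nproc_per_node={gpus_per_node}",
        f"--node_rank={node_rank}",
        f"--master_addr={main_addr}",
        f"--master_port={port}",
        script,
    ]
```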

Evaluation mode

The evaluation mode runs a pre-trained policy against the simulation environment and captures metrics and video recordings.

Evaluation input parameters

{
  "trainingConfig": {
    "mode": "evaluate",
    "task": "Isaac-Cartpole-Direct-v0",
    "numEnvs": 100,
    "numEpisodes": 50,
    "stepsPerEpisode": 1000,
    "recordVideo": false,
    "rlLibrary": "rsl_rl"
  }
}

| Parameter | Default | Description |
| --- | --- | --- |
| `trainingConfig.mode` | (required) | Must be `"evaluate"` for evaluation mode. |
| `trainingConfig.numEnvs` | `100` | Number of parallel environments for evaluation. |
| `trainingConfig.numEpisodes` | `50` | Number of evaluation episodes to run. |
| `trainingConfig.stepsPerEpisode` | `1000` | Steps per evaluation episode. |
| `trainingConfig.recordVideo` | `false` | Whether to record evaluation videos (videos are always generated, as they are required for the Isaac Lab play script to terminate). |
| `trainingConfig.policyS3Uri` | `null` | Amazon S3 URI to a `.pt` policy file. The openPipeline Lambda discovers this automatically from the VAMS asset. |

Evaluation output

| Output | Format | Description |
| --- | --- | --- |
| `metrics.csv` | CSV | Evaluation metrics from TensorBoard |
| `videos/*.mp4` | MP4 | Evaluation episode recordings |
| `evaluation-config.json` | JSON | Input configuration saved for reference |

Custom environments

The pipeline supports custom Isaac Lab environments packaged as Python packages. Upload your custom environment package (.tar.gz, .zip, or .whl) to an Amazon S3 location and reference it in the pipeline configuration:

{
  "trainingConfig": {
    "mode": "train",
    "task": "MyCustomTask-v0"
  },
  "customEnvironmentS3Uri": "s3://my-bucket/envs/my-custom-env.tar.gz"
}

The container downloads the package at runtime and installs it with `pip install -e` before starting training or evaluation.
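
That runtime step can be sketched as follows, assuming boto3 for the download; the function names are illustrative, not the container's actual entry points:

```python
import subprocess
from urllib.parse import urlparse


def parse_s3_uri(uri: str) -> tuple[str, str]:
    """Split s3://bucket/key into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


def install_custom_env(uri: str, dest: str = "/tmp/custom-env") -> None:
    """Fetch the custom environment package and install it in
    editable mode so Isaac Lab can import the custom task."""
    bucket, key = parse_s3_uri(uri)
    # In the container: boto3.client("s3").download_file(bucket, key, ...)
    # and unpack the .tar.gz / .zip / .whl archive into dest, then:
    subprocess.run(["pip", "install", "-e", dest], check=True)
```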

Heartbeat mechanism

Long-running training jobs send periodic heartbeats (every 5 minutes) to both the internal and external AWS Step Functions state machines to prevent timeout. The heartbeat thread runs in the background during the entire training or evaluation process. The internal state machine has a 30-minute heartbeat timeout, so any interruption lasting longer than 30 minutes will cause the job to be marked as failed.
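
The mechanism can be sketched with a daemon thread; this is a simplified illustration in which `send_fn` stands in for the container's wrapper around `send_task_heartbeat` calls for both the internal and external task tokens:

```python
import threading


def start_heartbeat(send_fn, interval_s: float = 300.0) -> threading.Event:
    """Invoke send_fn every interval_s seconds on a background daemon
    thread until the returned stop event is set."""
    stop = threading.Event()

    def loop():
        # wait() returns False on timeout, so each timeout fires a beat.
        while not stop.wait(interval_s):
            try:
                send_fn()
            except Exception:
                # A missed beat is tolerated: the internal state machine
                # only fails the job after 30 minutes without a heartbeat.
                pass

    threading.Thread(target=loop, daemon=True).start()
    return stop
```

The training process would call `start_heartbeat(...)` before launching the workload and set the returned event once results are uploaded.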

Monitoring training progress

Training progress can be monitored through Amazon CloudWatch Logs for the AWS Batch job. Container Insights is enabled on the ECS cluster for detailed resource utilization metrics.