Shared File System Setup for SageMaker HyperPod (Slurm)

Why Shared File Systems Matter

A high-performance shared file system is critical for achieving optimal performance in distributed machine learning workloads on SageMaker HyperPod. Without proper shared storage, your training jobs will be severely bottlenecked by data I/O operations.

Performance Impact

  • Data Loading Bottlenecks: Without shared storage, each compute node must independently load training data, creating massive I/O overhead
  • Checkpoint Synchronization: Model checkpoints and intermediate results need fast, consistent access across all nodes
  • Memory Efficiency: Shared file systems enable efficient data caching and reduce memory pressure on individual nodes
  • Scaling Limitations: Local storage approaches fail to scale beyond single-node training

FSx for Lustre Benefits

Amazon FSx for Lustre is specifically designed for high-performance computing workloads and provides:

  • High Throughput: Up to hundreds of GB/s of aggregate throughput
  • Low Latency: Sub-millisecond latencies for small file operations
  • POSIX Compliance: Standard file system semantics that work with existing ML frameworks
  • S3 Integration: Seamless data repository associations with Amazon S3
  • Elastic Scaling: Storage capacity that can grow with your workload demands

Setup Options

Option 1: Auto-Provisioned (Console Quick Setup)

When you create a HyperPod cluster through the AWS Console using the Quick Setup path, FSx for Lustre is automatically provisioned and configured for you.

What Gets Created Automatically

The console automatically provisions:

  • FSx for Lustre file system with optimal performance settings
  • Proper networking configuration in the same VPC and subnet as your cluster
  • Security group rules allowing Lustre traffic between cluster nodes and FSx
  • Mount configuration that automatically mounts FSx at /fsx on all cluster nodes
  • IAM permissions for the cluster to access the file system

Verification Steps

After your cluster reaches InService status, verify FSx is properly mounted:

  1. SSH into your cluster (see SSH setup guide)

  2. Check mounted file systems:

    df -h | grep fsx

    You should see output similar to:

    10.1.71.197@tcp:/oyuutbev  1.2T  5.5G  1.2T   1% /fsx
  3. Test write access:

    echo "Hello FSx" > /fsx/test.txt
    cat /fsx/test.txt
    rm /fsx/test.txt
  4. Verify performance (optional):

    # Test write performance (conv=fsync flushes data to FSx rather than just the page cache)
    dd if=/dev/zero of=/fsx/testfile bs=1M count=1000 conv=fsync

    # Test read performance (note: a freshly written file may be served from the local page cache)
    dd if=/fsx/testfile of=/dev/null bs=1M

    # Clean up
    rm /fsx/testfile

Option 2: Manual Setup (CLI/SDK Users)

If you're creating your HyperPod cluster via CLI or SDK, or want more control over your FSx configuration, you can manually create and attach an FSx file system.

When to Use Manual Setup

  • Custom performance requirements: Need specific throughput or storage configurations
  • Existing infrastructure: Want to integrate with existing FSx file systems
  • Advanced networking: Require custom VPC or subnet configurations
  • Cost optimization: Need precise control over storage capacity and performance tiers

Step-by-Step FSx Creation

  1. Create the FSx file system:

    # Set your configuration variables
    SUBNET_ID="subnet-xxxxxxxxx" # Same subnet as your HyperPod cluster
    SECURITY_GROUP_ID="sg-xxxxxxxxx" # Security group allowing Lustre traffic (port 988)

    # Create FSx file system
    aws fsx create-file-system \
    --file-system-type LUSTRE \
    --storage-capacity 1200 \
    --subnet-ids $SUBNET_ID \
    --security-group-ids $SECURITY_GROUP_ID \
    --lustre-configuration DeploymentType=PERSISTENT_2,PerUnitStorageThroughput=250,DataCompressionType=LZ4
  2. Wait for file system creation:

    # Get the file system ID from the previous command output
    FSX_ID="fs-xxxxxxxxx"

    # Check the lifecycle status (re-run until it reports AVAILABLE)
    aws fsx describe-file-systems --file-system-ids $FSX_ID --query 'FileSystems[0].Lifecycle'
  3. Get mount information:

    # Get DNS name and mount name
    aws fsx describe-file-systems --file-system-ids $FSX_ID \
    --query 'FileSystems[0].[DNSName,LustreConfiguration.MountName]' --output table
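
Step 2 above checks the lifecycle state once. If you prefer the shell to block until the file system is ready, a small polling loop like the following sketch works (the 30-second interval is arbitrary):

# Poll until the file system reports AVAILABLE
until [ "$(aws fsx describe-file-systems --file-system-ids $FSX_ID \
    --query 'FileSystems[0].Lifecycle' --output text)" = "AVAILABLE" ]; do
  echo "Waiting for $FSX_ID to become AVAILABLE..."
  sleep 30
done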

Integration with HyperPod Cluster

To use your manually created FSx with HyperPod, you'll need to configure the cluster lifecycle scripts:

  1. Update lifecycle scripts to mount your FSx file system
  2. Specify FSx parameters in your cluster configuration
  3. Ensure proper IAM permissions for FSx access

For detailed lifecycle script configuration, see the HyperPod cluster setup documentation.
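
As a rough sketch only (not the official lifecycle scripts), the mount step of a lifecycle script typically mounts the file system using the DNS name and mount name from describe-file-systems; all values below are placeholders, and the snippet assumes the Lustre client is already installed on the node image:

# Sketch of a lifecycle-script mount step (placeholder values)
FSX_DNS_NAME="fs-xxxxxxxxx.fsx.<region>.amazonaws.com"   # DNSName from describe-file-systems
FSX_MOUNTNAME="xxxxxxxx"                                 # LustreConfiguration.MountName
MOUNT_POINT="/fsx"

sudo mkdir -p $MOUNT_POINT
sudo mount -t lustre -o relatime,flock "${FSX_DNS_NAME}@tcp:/${FSX_MOUNTNAME}" $MOUNT_POINT

# Persist the mount across reboots
echo "${FSX_DNS_NAME}@tcp:/${FSX_MOUNTNAME} ${MOUNT_POINT} lustre defaults,relatime,flock,_netdev 0 0" | sudo tee -a /etc/fstab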

Option 3: Bring Your Own FSx

If you have an existing FSx for Lustre file system, you can integrate it with your HyperPod cluster.

Requirements and Considerations

Network Requirements:

  • FSx file system must be in the same VPC as your HyperPod cluster
  • FSx file system should be in the same Availability Zone as your cluster nodes (strongly recommended for performance)
  • Security groups must allow Lustre traffic (TCP port 988) between the cluster and FSx

Performance Considerations:

  • Ensure FSx performance tier matches your workload requirements
  • Consider data locality: accessing FSx from a different AZ reduces performance and incurs cross-AZ data transfer costs
  • Verify sufficient throughput capacity for your cluster size
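
As a rough sizing rule for persistent file systems, baseline throughput scales with capacity: throughput (MB/s) ≈ storage capacity (TiB) × PerUnitStorageThroughput (MB/s per TiB). For example, the 1,200 GiB file system with 250 MB/s per TiB created in Option 2 provides roughly 1.2 × 250 = 300 MB/s of baseline aggregate throughput.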

Integration Steps

  1. Verify network compatibility:

    # Check your cluster's subnet
    aws sagemaker describe-cluster --cluster-name <your-cluster-name> \
    --query 'VpcConfig.Subnets'

    # Check your FSx file system's subnet
    aws fsx describe-file-systems --file-system-ids <your-fsx-id> \
    --query 'FileSystems[0].SubnetIds'

    # Compare the Availability Zones of the two subnets
    aws ec2 describe-subnets --subnet-ids <cluster-subnet-id> <fsx-subnet-id> \
    --query 'Subnets[].[SubnetId,AvailabilityZone]' --output table
  2. Update security groups if needed:

    # Allow Lustre traffic (TCP 988) from the cluster security group to FSx (FSx for Lustre also uses TCP 1018-1023)
    aws ec2 authorize-security-group-ingress \
    --group-id <fsx-security-group-id> \
    --protocol tcp \
    --port 988 \
    --source-group <cluster-security-group-id>
  3. Configure the cluster lifecycle scripts to mount the existing FSx file system, following the same approach described in Option 2 above.

Using FSx in Slurm Jobs

Mount Points and Paths

By default, FSx for Lustre is mounted at /fsx on all cluster nodes. This creates a shared namespace accessible from:

  • Head node: /fsx
  • Compute nodes: /fsx
  • All Slurm jobs: /fsx
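
For example, a minimal sbatch script that reads data from and writes checkpoints to the shared file system might look like the following sketch (the dataset path, checkpoint path, and training script are placeholders):

#!/bin/bash
#SBATCH --job-name=train-example
#SBATCH --nodes=2
#SBATCH --output=/fsx/outputs/%x_%j.out    # job logs land on shared storage

# Every node in the job sees the same paths under /fsx
DATA_DIR=/fsx/datasets/my-dataset          # placeholder dataset location
CKPT_DIR=/fsx/checkpoints/my-run           # placeholder checkpoint location
mkdir -p "$CKPT_DIR"

srun python /fsx/code/train.py --data-dir "$DATA_DIR" --checkpoint-dir "$CKPT_DIR"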

Best Practices for Data Access

Data Organization

# Recommended FSx directory structure
/fsx/
├── datasets/ # Training datasets
├── checkpoints/ # Model checkpoints
├── outputs/ # Training outputs and logs
├── code/ # Shared training scripts
└── scratch/ # Temporary files
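
You can create this layout once from the head node:

# Create the recommended directory layout on the shared file system
mkdir -p /fsx/{datasets,checkpoints,outputs,code,scratch}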

Performance Optimization

  1. Use appropriate I/O patterns:

    # Good: Sequential reads of large files
    # Bad: Random access to many small files
  2. Leverage data caching:

    # Pre-load datasets to FSx before training
    aws s3 sync s3://your-bucket/dataset /fsx/datasets/
  3. Optimize checkpoint frequency:

    # Balance checkpoint frequency with I/O overhead
    # Save checkpoints to FSx, not local storage
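
On the Lustre side, file striping is the main lever for large-file throughput. As a sketch, the lfs utility (installed with the Lustre client) can inspect and change striping; the stripe count below is only an example and should be tuned to your file sizes:

# Show the current striping layout of a directory or file
lfs getstripe /fsx/datasets

# Stripe new files created under checkpoints/ across 4 OSTs (example value)
lfs setstripe --stripe-count 4 /fsx/checkpoints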

Data Management

  1. Link FSx to S3 for data persistence:

    # Create data repository association
    aws fsx create-data-repository-association \
    --file-system-id $FSX_ID \
    --file-system-path /datasets \
    --data-repository-path s3://your-bucket/datasets
  2. Use automatic import/export policies:

    # Automatically sync data between FSx and S3
    # Configure import/export policies based on access patterns
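
For example, instead of the plain association in step 1, the association can be created with automatic import and export policies so that changes propagate in both directions (bucket name and paths are placeholders):

# DRA that auto-imports changes from S3 and auto-exports changes from FSx
aws fsx create-data-repository-association \
--file-system-id $FSX_ID \
--file-system-path /datasets \
--data-repository-path s3://your-bucket/datasets \
--s3 'AutoImportPolicy={Events=[NEW,CHANGED,DELETED]},AutoExportPolicy={Events=[NEW,CHANGED,DELETED]}'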

Troubleshooting

Common Issues

FSx not mounted:

# Check that the Lustre client kernel modules are loaded
lsmod | grep lustre

# Manually mount if needed (replace with your FSx DNS name and mount name)
sudo mount -t lustre -o relatime,flock fs-xxxxx.fsx.region.amazonaws.com@tcp:/mountname /fsx

Permission denied errors:

# Check FSx permissions
ls -la /fsx/

# Fix ownership if needed
sudo chown -R ubuntu:ubuntu /fsx/

Poor performance:

# Check FSx read throughput over the last hour
aws cloudwatch get-metric-statistics \
--namespace AWS/FSx \
--metric-name DataReadBytes \
--dimensions Name=FileSystemId,Value=$FSX_ID \
--start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 60 \
--statistics Sum

Performance Monitoring

Monitor FSx performance through CloudWatch metrics:

  • DataReadOperations/DataWriteOperations: I/O operations per period
  • DataReadBytes/DataWriteBytes: Throughput metrics
  • MetadataOperations: File system metadata operations

Next Steps

Once your shared file system is set up:

  1. Verify performance with your specific workloads
  2. Configure data repository associations with S3 if needed
  3. Set up monitoring and alerting for FSx metrics
  4. Review the training blueprints that leverage FSx for distributed training

For advanced FSx configuration and management, see the Amazon FSx for Lustre User Guide.