📄️ PyTorch Environment Validation
This validation script checks the PyTorch environment on your HyperPod cluster, covering NCCL, MPI, OpenMP, CUDA, and other critical components. The script runs once per instance and helps verify that your environment is properly configured for distributed training.
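As a rough illustration of what such a check involves, the sketch below probes a local PyTorch build for the components named above. It is not the validation script itself: the function name `check_pytorch_environment` and the report keys are hypothetical, and the probes rely only on standard PyTorch introspection calls.

```python
import importlib.util


def check_pytorch_environment():
    """Collect a few facts about the local PyTorch build (illustrative only)."""
    # Hypothetical report layout; the real validation script may differ.
    report = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if not report["torch_installed"]:
        return report

    import torch
    import torch.distributed as dist

    report["torch_version"] = torch.__version__
    report["cuda_build"] = torch.version.cuda  # None on CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()
    report["gpu_count"] = torch.cuda.device_count()
    report["openmp"] = torch.backends.openmp.is_available()
    report["distributed"] = dist.is_available()
    # The backend probes only exist when distributed support is compiled in.
    report["nccl"] = dist.is_available() and dist.is_nccl_available()
    report["mpi"] = dist.is_available() and dist.is_mpi_available()
    return report


if __name__ == "__main__":
    for key, value in check_pytorch_environment().items():
        print(f"{key}: {value}")
```

Running this on each instance and comparing the printed reports is one simple way to spot a node whose PyTorch build lacks NCCL or CUDA support.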
📄️ EFA and Network Stack Validation
This validation script checks the versions and configuration of the Elastic Fabric Adapter (EFA) network stack, including the EFA installer, libfabric, AWS OFI NCCL, NCCL, and CUDA components. This check is essential for good network performance in distributed training workloads.
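A minimal sketch of this kind of check, under stated assumptions, is shown below. It assumes the EFA installer recorded its package list at `/opt/amazon/efa_installed_packages` and that the `fi_info` and `nvidia-smi` tools are on the `PATH`. The helper names `run_quiet` and `check_efa_stack` are hypothetical, and the sketch does not probe NCCL or AWS OFI NCCL versions, which the actual script covers.

```python
import shutil
import subprocess
from pathlib import Path

# Assumed default location written by the EFA installer; adjust if your image differs.
EFA_PACKAGES_FILE = Path("/opt/amazon/efa_installed_packages")


def run_quiet(command):
    """Return the stripped stdout of a command, or None if it is missing or fails."""
    if shutil.which(command[0]) is None:
        return None
    try:
        result = subprocess.run(command, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout.strip() if result.returncode == 0 else None


def read_efa_packages():
    """Return the installer's package list, or None if it cannot be read."""
    try:
        return EFA_PACKAGES_FILE.read_text().strip()
    except OSError:
        return None


def check_efa_stack():
    """Gather version details for part of the EFA stack (illustrative only)."""
    return {
        "efa_packages": read_efa_packages(),
        "libfabric_version": run_quiet(["fi_info", "--version"]),
        "efa_provider": run_quiet(["fi_info", "-p", "efa"]),
        "nvidia_driver": run_quiet(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]
        ),
    }


if __name__ == "__main__":
    for key, value in check_efa_stack().items():
        print(f"{key}: {value if value is not None else 'not found'}")
```

A value of `not found` for `efa_provider` on an EFA-enabled instance type would suggest that libfabric cannot see the EFA device, which is worth investigating before launching a multi-node job.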