📄️ NCCL Performance Tests
NCCL Tests is a benchmark suite that measures collective communication performance between GPU instances using the NVIDIA Collective Communications Library (NCCL). Running it is essential for validating cluster network performance and troubleshooting issues before starting distributed training workloads.
📄️ GPU Stress Testing
GPU stress testing validates hardware stability, thermal management, and performance consistency by putting GPUs under sustained computational load. This guide focuses on "burning" GPUs to test their limits and detect potential hardware issues.
📄️ NCCOM Tests (Trainium)
nccom-test is a benchmarking tool for evaluating the performance of collective communication operations on AWS Neuron instances: Trainium (trn1) and Inferentia2 (inf2). It provides a fast way to validate your Neuron environment before running complex distributed training workloads.