Troubleshooting¶
Before troubleshooting, it’s helpful to gather information about your system’s hardware and software configuration. Common issues relate to CUDA version mismatches, because PyTorch includes its own CUDA binaries. You want to ensure that PyTorch is using its own CUDA binaries and not conflicting with other CUDA installations on your system.
Check Torch, CUDA, and MLSimKit Versions¶
Use the following Python code to print the installed versions of Torch, CUDA used by Torch, and MLSimKit:
import torch, mlsimkit
print(torch.__version__)
print(torch.version.cuda)
print(mlsimkit.__version__)
Or from the command line:
python -c "import torch; import mlsimkit; print(f'Torch: {torch.__version__}\nCUDA: {torch.version.cuda}\nMLSimKit: {mlsimkit.__version__}')"
You will see output like:
Torch: 2.2.2+cu121
CUDA: 12.1
MLSimKit: 0.1.0b0
Check NVIDIA Driver Version¶
Use the nvidia-smi command to check the installed NVIDIA driver version:
$ nvidia-smi
Look for the “Driver Version” field in the output.
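If you only need the driver version, nvidia-smi can also print it directly:
$ nvidia-smi --query-gpu=driver_version --format=csv,noheader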
Common Issues¶
GPU out-of-memory (OOM) issues¶
You may run into an out-of-memory (OOM) issue while training a model, but there are steps that can be taken to address it. The following are some suggestions of things to try (a combined example follows the list):
- Change training hyperparameters to reduce the size of the model or in-memory data, such as batch-size, message-passing-steps, hidden-size, ae.start-out-channel, and ae.div-rate.
- Try using mixed precision (--mixed-precision fp16 or --mixed-precision bf16).
- Make sure you are not using the deterministic flag (--deterministic), as it increases the memory requirements of model training.
- Reduce the size of your training data in preprocessing. For instance, in the slices use case, the resolution can be reduced via the --resolution argument, and the number of input channels can be reduced via the --grayscale argument. For input meshes in the KPI and Surface use cases, --downsample-remaining-perc can be used to decrease the size of the meshes.
- Reduce the size of your training data externally. For instance, the number of slices can be reduced, or external software can be used to decimate meshes.
- Try training on the CPU (--device cpu) instead. Note that this can lead to much longer training times.
- Make sure other processes aren’t consuming your GPU’s memory by checking with the nvidia-smi command.
- Train on a bigger GPU with more memory.
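For example, several of these options can be combined in a single command. The flag values below are illustrative only; the available options vary by use case, so check mlsimkit-learn kpi train --help for what your version supports:
# Illustrative: smaller batch and hidden size, plus bf16 mixed precision
$ mlsimkit-learn kpi train --batch-size 4 --hidden-size 16 --mixed-precision bf16
# Verify no other processes are holding GPU memory
$ nvidia-smi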
"Could not load library libcudnn_cnn_train.so"
on Deep Learning AMI¶
mlsimkit
depends on Pytorch, which installs its own Cuda binaries. You may see this error if your environment has a version mismatch:
Could not load library libcudnn_cnn_train.so.8. Error: /usr/local/cuda-12.1/lib/libcudnn_cnn_train.so.8: undefined symbol: _ZN5cudnn3cnn34layerNormFwd_execute_internal_implERKNS_7backend11VariantPackEP11CUstream_stRNS0_18LayerNormFwdParamsERKNS1_20NormForwardOperationEmb, version libcudnn_cnn_infer.so.8
One solution is to remove CUDA from the system entirely, or install the identical version. For an example, see the Deep Learning tutorial.
Alternatively, you may remove the CUDA binaries from your library path when running torch applications. For example, in the Deep Learning AMI for Ubuntu 22.04, the default LD_LIBRARY_PATH includes CUDA 12.1:
$ echo $LD_LIBRARY_PATH
/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/opt/aws-ofi-nccl/lib:/usr/local/cuda-12.1/lib:/usr/local/cuda-12.1/lib64:/usr/local/cuda-12.1:/usr/local/cuda-12.1/targets/x86_64-linux/lib/:/usr/local/cuda-12.1/extras/CUPTI/lib64:/usr/local/lib:/usr/lib
Remove the cuda-12.1 directories and then PyTorch will use its own CUDA binaries:
export LD_LIBRARY_PATH=/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/opt/aws-ofi-nccl/lib:/usr/local/lib:/usr/lib
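If your LD_LIBRARY_PATH differs, the following sketch removes any /usr/local/cuda entries generically (assumes GNU coreutils; echo the result first to verify it before exporting):
export LD_LIBRARY_PATH=$(echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep -v '/usr/local/cuda' | paste -sd: -)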
mlsimkit-accelerate train command hangs or times out on multi-GPU instances¶
The train processes (across use cases) can hang and/or time out on NVIDIA multi-GPU instances when executing via mlsimkit-accelerate or accelerate launch. This can be caused by certain NVIDIA drivers, such as NVIDIA-SMI 555.42.06. To fix the issue, remove the driver and install a different, compatible version:
$ sudo apt-get --purge remove nvidia-kernel-source-555
$ sudo apt-get install --verbose-versions cuda-drivers-535
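After installing the new driver, a reboot may be required before the new kernel module loads. You can then confirm the active driver version:
$ nvidia-smi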
"Cannot convert a MPS Tensor to float64 dtype..."
or "fp16 mixed precision requires a GPU (not 'mps')"
on MacOS¶
On older MacOS hardware, you may need to force CPU-only for training if you see one of the following errors:
TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64. Please use float32 instead.
Error: fp16 mixed precision requires a GPU (not 'mps')
Use --device cpu for all training commands. For example:
mlsimkit-learn kpi train --device cpu
mlsimkit-learn slices train-image-encoder --device cpu
mlsimkit-learn slices train-prediction --device cpu
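To confirm whether PyTorch detects the MPS backend on your machine, you can check it directly (torch.backends.mps is part of recent PyTorch releases):
python -c "import torch; print(torch.backends.mps.is_available())"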
"qt.qpa.plugin: Could not load the Qt platform plugin..."
¶
The Slice prediction code utilizes the opencv-python package, which relies on a specific version of the Qt library. In some cases, this version may conflict with other Qt installations already present on your system, resulting in the following error:
QObject::moveToThread: Current thread (0x55fff86fe4f0) is not the object's thread (0x55fff91b1b70).
Cannot move to target thread (0x55fff86fe4f0)
qt.qpa.plugin: Could not load the Qt platform plugin "xcb" in "/home/ubuntu/miniconda3/lib/python3.12/site-packages/cv2/qt/plugins" even though it was found.
This application failed to start because no Qt platform plugin could be initialized. Reinstalling the application may fix this problem.
Available platform plugins are: xcb, eglfs, minimal, minimalegl, offscreen, vnc, webgl.
Aborted (core dumped)
To prevent such conflicts, we recommend setting up and using a virtual environment for running the Slice prediction code. This approach isolates the required dependencies, including the necessary Qt version, from your system’s global environment. Please refer to the Install section for instructions on creating and activating a virtual environment.
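For example, a minimal sketch on Linux (the exact package name and install steps are described in the Install section):
$ python3 -m venv .venv
$ source .venv/bin/activate
$ pip install mlsimkit   # assumption: see the Install section for the exact install command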
Multi-process warning “may exceed file descriptor limits” ("RuntimeError: received 0 items of ancdata")¶
During the kpi preprocess or surface preprocess steps, MLSimKit uses Python’s multiprocessing module to parallelize the processing of mesh files across multiple CPU cores. However, this parallelization can sometimes exceed the system’s file descriptor limit, leading to the following error:
RuntimeError: received 0 items of ancdata
This error occurs when the number of open file descriptors (used for inter-process communication) exceeds the system’s limit. The likelihood of encountering this issue increases when processing a higher number of simulation runs.
You will see a warning in the logs when running the mlsimkit-learn kpi preprocess or mlsimkit-learn surface preprocess commands with multiple processes:
[WARNING] Using multi-processes (2) for preprocessing data. May exceed file descriptor limits, use 'ulimit -n'. See Troubleshooting in the user guide.
Workaround: Increase your File Descriptor Limit
If you prefer to continue preprocessing with multiple CPUs, increase the file descriptor limit on your system by running ulimit -n <higher_value> before running mlsimkit-learn. However, this may require administrative privileges and may not be a viable option in some environments.
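For example, to check the current limit and raise it for the current shell session (4096 is an illustrative value):
$ ulimit -n        # show the current soft limit
$ ulimit -n 4096   # raise it for this shell session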
Workaround: Use a Single Process
To avoid this issue entirely, use a single process for preprocessing by setting --num-processes 1:
KPI command
mlsimkit-learn --config training.yaml --log.prefix-dir logs/preprocess kpi preprocess --num-processes 1
Surface command
mlsimkit-learn --config training.yaml --log.prefix-dir logs/preprocess surface preprocess --num-processes 1
This workaround eliminates the need for inter-process communication and shared memory segments, preventing the file descriptor limit from being exceeded.
Surface view screenshots error “PyVista will likely segfault when rendering” ("bad X server connection")¶
Outputting screenshots of the surface prediction data requires 3D rendering. On systems without a display, you will see an error like this:
$ mlsimkit-learn --config training.yaml surface view --no-gui
/usr/local/lib/python3.10/dist-packages/pyvista/plotting/plotter.py:151: UserWarning:
This system does not appear to be running an xserver.
PyVista will likely segfault when rendering.
Try starting a virtual frame buffer with xvfb, or using
``pyvista.start_xvfb()``
warnings.warn(
2024-06-03 00:15:35.535 ( 1.463s) [ 7FE7D7B70480]vtkXOpenGLRenderWindow.:456 ERR| vtkXOpenGLRenderWindow (0x561067609970): bad X server connection. DISPLAY=
[ERROR] bad X server connection. DISPLAY=
Aborted (core dumped)
Linux/Ubuntu:
We support the package Xvfb (X virtual framebuffer) on Linux/Ubuntu. Install this package:
sudo apt install xvfb
Now start Xvfb when starting the viewer to enable rendering on remote machines:
mlsimkit-learn surface view --start-xvfb ...
See the PyVista documentation ‘Running on Remote Servers’ for more details.
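Alternatively, the xvfb package provides the xvfb-run wrapper, which starts a virtual framebuffer around a single command. As a sketch:
$ xvfb-run -a mlsimkit-learn --config training.yaml surface view --no-gui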