# Parallelism and GPU Support
Palace employs multiple types of parallelism in an attempt to maximize performance across a wide range of deployment possibilities. The first is MPI-based distributed-memory parallelism, controlled using the `-np` command line flag as outlined in Running Palace.
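For example, a distributed-memory run on four processes might look like the following, where the configuration file name `config.json` is a placeholder:

```bash
# Run Palace with 4 MPI processes (file name is illustrative)
palace -np 4 config.json
```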
Shared-memory parallelism using OpenMP is also available. To enable this, the `-DPALACE_WITH_OPENMP=ON` option should be specified at configure time. At runtime, the number of threads is configured with the `-nt` argument to the `palace` executable, or by setting the `OMP_NUM_THREADS` environment variable.
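As a sketch, a hybrid MPI-OpenMP launch combining both flags might look like the following; the process and thread counts are illustrative:

```bash
# 2 MPI processes, each with 8 OpenMP threads (counts are illustrative)
palace -np 2 -nt 8 config.json

# Equivalently, set the thread count via the environment variable
OMP_NUM_THREADS=8 palace -np 2 config.json
```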
Lastly, Palace supports GPU acceleration using NVIDIA and AMD GPUs, activated with the build options `-DPALACE_WITH_CUDA=ON` and `-DPALACE_WITH_HIP=ON`, respectively. At runtime, the `config["Solver"]["Device"]` parameter in the configuration file can be set to `"CPU"` (the default) or `"GPU"` in order to configure Palace and MFEM to use the available GPU(s). The `config["Solver"]["Backend"]` parameter, on the other hand, controls the libCEED backend. Users typically do not need to provide a value for this option and can instead rely on Palace's default, which selects the most appropriate backend for the given value of `config["Solver"]["Device"]`.
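A minimal configuration file snippet selecting GPU execution might look like the following, with the rest of the configuration elided and `"Backend"` left to its default:

```json
{
  "Solver":
  {
    "Device": "GPU"
  }
}
```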
In order to take full advantage of the performance benefits of GPU acceleration, it is recommended to make use of operator partial assembly, activated when the value of `config["Solver"]["PartialAssemblyOrder"]` is less than `config["Solver"]["Order"]`. This feature avoids assembling a global sparse matrix and instead makes use of data structures for operators which lend themselves to more efficient asymptotic storage and application costs. See also https://libceed.org/en/latest/intro/ for more details. Partial assembly in Palace supports mixed meshes including both tensor product elements (hexahedra and quadrilaterals) as well as non-tensor product elements (tetrahedra, prisms, pyramids, and triangles).