MPI Parallelization
MPI (Message Passing Interface) is a standard in parallel computing for communication between distributed processes.
Building MPI Applications
The recommended way to compile MPI code is to use the compiler wrapper scripts installed with the MPI implementation. Once you have loaded your preferred MPI implementation (Open MPI, MPICH, Intel MPI, etc.) and compiler suite (GNU, Intel, etc.) with environment modules, you can compile your code.
The compiler command depends on the programming language:
# C
mpicc test.c -O3 -o a.out
# Fortran
mpif90 test.F -O3 -o a.out
The MPI compiler wrappers automatically include the appropriate compiler flags and MPI libraries.
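To see which flags and libraries a wrapper adds, most implementations can print the underlying compiler command without running it. The sketch below is an illustration, not a site-specific recipe: it assumes Open MPI wrappers accept --showme and MPICH-derived wrappers (including Intel MPI) accept -show, and the function name show_wrapper_flags is made up for this example.

```shell
#!/bin/bash
# Print the underlying compiler command that an MPI wrapper would run.
# Assumption: Open MPI wrappers understand --showme, while MPICH-derived
# wrappers understand -show. The function name is hypothetical.
show_wrapper_flags() {
    local wrapper="$1"
    if ! command -v "$wrapper" >/dev/null 2>&1; then
        echo "$wrapper: not found in PATH (is an MPI module loaded?)"
        return 1
    fi
    # Try the Open MPI spelling first, then fall back to the MPICH one.
    "$wrapper" --showme 2>/dev/null || "$wrapper" -show
}

# Inspect the C and Fortran wrappers; keep going if one is missing.
show_wrapper_flags mpicc  || true
show_wrapper_flags mpif90 || true
```

The printed command shows the include paths and MPI libraries the wrapper passes to the real compiler, which helps when a build picks up the wrong MPI installation.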
Running MPI Applications with SLURM
A simple SLURM batch script for running an MPI application:
#!/bin/bash
#SBATCH -J mpi_job
#SBATCH -A my_allocation
#SBATCH -N 2
#SBATCH --ntasks-per-node=16
#SBATCH -t 20:00:00
#SBATCH -o mpi_job_%j.out
#SBATCH -e mpi_job_%j.err
# Stage the executable from the home directory into the work directory
export HOME_DIR=/home/$USER
export WORK_DIR=/work/$USER/test
cd "$WORK_DIR" || exit 1
cp "$HOME_DIR/a.out" .
# Launch MPI application
srun ./a.out
SLURM allocates the requested resources, and srun launches one copy of the application for each MPI task in the allocation (here 2 nodes × 16 tasks per node = 32 tasks).
Running Hybrid MPI + OpenMP Jobs
Combining MPI and OpenMP can improve performance by using MPI between nodes and OpenMP threads within a node. Most hybrid applications run multiple OpenMP threads within each MPI task.
Example SLURM script:
#!/bin/bash
#SBATCH -J hybrid_job
#SBATCH -A my_allocation
#SBATCH -N 2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=8
#SBATCH -t 00:10:00
#SBATCH -o hybrid_job_%j.out
#SBATCH -e hybrid_job_%j.err
# Number of OpenMP threads per MPI task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
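cd "$SLURM_SUBMIT_DIR"
# Launch the hybrid job. Pass --cpus-per-task explicitly, because srun in
# Slurm 22.05 and later does not inherit it from the sbatch request.
srun --cpus-per-task=$SLURM_CPUS_PER_TASK ./my_exe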
Resource Layout
In the example above:
- -N 2 requests 2 compute nodes.
- --ntasks-per-node=2 launches 2 MPI tasks per node.
- --cpus-per-task=8 allocates 8 CPU cores to each MPI task.
- OMP_NUM_THREADS=8 creates 8 OpenMP threads within each MPI task.
This results in:
- 4 MPI tasks total (2 nodes × 2 tasks per node)
- 8 OpenMP threads per MPI task
- 32 CPU cores used overall
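The totals above follow from simple multiplication, and it can be useful to print them at the top of a job's output as a sanity check. The following is a minimal sketch: the function name hybrid_layout is hypothetical, and the defaults mirror the example so the snippet also runs outside a job, where the SLURM variables are unset.

```shell
#!/bin/bash
# Compute the totals implied by a hybrid job request.
# Arguments: nodes, MPI tasks per node, CPUs (threads) per task.
# The function name is hypothetical and used only for illustration.
hybrid_layout() {
    local nodes="$1" tasks_per_node="$2" cpus_per_task="$3"
    local total_tasks=$(( nodes * tasks_per_node ))
    local total_cores=$(( total_tasks * cpus_per_task ))
    echo "MPI tasks: $total_tasks, threads per task: $cpus_per_task, cores: $total_cores"
}

# Inside a job, SLURM sets these variables when the matching options are
# given; the fallback values reproduce the example request.
hybrid_layout "${SLURM_JOB_NUM_NODES:-2}" \
              "${SLURM_NTASKS_PER_NODE:-2}" \
              "${SLURM_CPUS_PER_TASK:-8}"
```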
NUMA Considerations
Non-Uniform Memory Access (NUMA) can affect hybrid application performance. Many compute nodes contain multiple CPU sockets, each with its own local memory, and accessing memory attached to another socket is slower. Performance is often best when all OpenMP threads of an MPI task are confined to cores within a single socket. The optimal MPI/OpenMP balance depends on the application and hardware configuration.
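One common way to keep threads from migrating between cores is to set the standard OpenMP placement variables in the batch script before the launch line. This is a sketch under assumptions: it presumes an OpenMP runtime that honors OMP_PLACES and OMP_PROC_BIND (OpenMP 4.0 or later), and whether each task's cores fall within one socket still depends on how SLURM distributes them on the node.

```shell
#!/bin/bash
# Standard OpenMP environment variables for thread placement.
# Assumption: the OpenMP runtime in use supports OpenMP 4.0 affinity.
export OMP_PLACES=cores      # one place per physical core
export OMP_PROC_BIND=close   # keep a task's threads on neighbouring places

# Record the settings in the job output for later comparison.
echo "OMP_PLACES=$OMP_PLACES OMP_PROC_BIND=$OMP_PROC_BIND"
```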
Users are encouraged to benchmark different combinations of:
- MPI tasks per node
- OpenMP threads per task
to determine the best performance for their application.
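As a starting point for such a benchmark, the candidate combinations can be listed mechanically. The sketch below assumes 16 cores per node, taken from the first batch script in this section. The function name list_combinations is hypothetical, and it prints only the pairs that fill a node exactly.

```shell
#!/bin/bash
# List every (tasks per node, threads per task) pair that exactly fills a node.
# Argument: number of cores per node. The function name is hypothetical.
list_combinations() {
    local cores_per_node="$1"
    local tasks=1
    while [ "$tasks" -le "$cores_per_node" ]; do
        # Keep only task counts that divide the core count evenly.
        if [ $(( cores_per_node % tasks )) -eq 0 ]; then
            echo "--ntasks-per-node=$tasks --cpus-per-task=$(( cores_per_node / tasks ))"
        fi
        tasks=$(( tasks + 1 ))
    done
}

# Assumption: 16 cores per node; adjust to the actual hardware.
list_combinations 16
```

Each printed line can be substituted into the hybrid script's #SBATCH options for one benchmark run, and the wall-clock times compared afterwards.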
Known Issues
system(), fork(), and popen()
Some MPI implementations and high-performance networking environments may not fully support calls to:
- system()
- fork()
- popen()
within the MPI execution scope (between MPI_Init() and MPI_Finalize()). Applications that invoke these functions while MPI is active may experience unexpected behavior or failures. When possible, avoid these calls inside MPI-parallel regions.