MPI Parallelization

Compile

MPI (Message Passing Interface) is a standardized interface for communication between the processes of a distributed-memory parallel program.

Building MPI Applications

The proper way to compile with the MPI library is to use the compiler wrapper scripts installed with the MPI implementation. Once your preferred MPI implementation (OpenMPI, MPICH, Intel MPI, etc.) and compiler suite (GNU, Intel, etc.) have been loaded using environment modules, you can compile your code.

The wrapper to use depends on the programming language, for example mpicc for C and mpif90 for Fortran:

mpicc test.c -O3 -o a.out
mpif90 test.F -O3 -o a.out

The MPI compiler wrappers automatically include the appropriate compiler flags and MPI libraries.
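To see what a wrapper actually adds, most implementations can print the underlying compiler command instead of running it. The sketch below assumes the flag spellings -show (MPICH, Intel MPI) and --showme (Open MPI). The helper name show_wrapper is illustrative and not part of any MPI distribution.

```shell
#!/bin/bash
# show_wrapper is a hypothetical helper, not part of any MPI distribution.
# It prints the real compiler command line behind an MPI wrapper.
show_wrapper() {
    local wrapper="$1"
    if ! command -v "$wrapper" >/dev/null 2>&1; then
        echo "$wrapper: not found (load an MPI module first)"
        return 1
    fi
    # Assumed spellings: -show for MPICH and Intel MPI, --showme for Open MPI.
    "$wrapper" -show 2>/dev/null || "$wrapper" --showme 2>/dev/null
}

show_wrapper mpicc || true
show_wrapper mpif90 || true
```

The printed line shows the include paths and libraries the wrapper passes to the underlying compiler, which can help when a build system needs them spelled out.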

Running MPI Applications with SLURM

A simple SLURM batch script for running an MPI application:

#!/bin/bash
#SBATCH -J mpi_job
#SBATCH -A my_allocation
#SBATCH -N 2
#SBATCH --ntasks-per-node=16
#SBATCH -t 20:00:00
#SBATCH -o mpi_job_%j.out
#SBATCH -e mpi_job_%j.err

export HOME_DIR=/home/$USER
export WORK_DIR=/work/$USER/test

# Stop if the work directory is missing rather than running in the wrong place
cd "$WORK_DIR" || exit 1

cp "$HOME_DIR/a.out" .

# Launch MPI application
srun ./a.out

SLURM allocates the requested resources when the job starts, and srun launches one copy of a.out per MPI task. In this example that is 2 nodes × 16 tasks per node = 32 MPI tasks.
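Inside a running job, SLURM exports the allocation size as environment variables, so a quick sanity check can be logged before the launch line. The following sketch assumes SLURM_JOB_NUM_NODES and SLURM_NTASKS are set by the scheduler and falls back to 1 outside a job. The name print_job_size is illustrative.

```shell
#!/bin/bash
# print_job_size is a hypothetical helper for logging the allocation size.
# Outside a SLURM job the variables are unset, so both default to 1.
print_job_size() {
    local nodes="${SLURM_JOB_NUM_NODES:-1}"
    local tasks="${SLURM_NTASKS:-1}"
    echo "nodes=$nodes mpi_tasks=$tasks"
}

print_job_size
```

For a job requested with -N 2 and --ntasks-per-node=16, this should report 2 nodes and 32 MPI tasks in the job's output file.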

Running Hybrid MPI + OpenMP Jobs

Combining MPI and OpenMP can improve performance by using MPI between nodes and OpenMP threads within a node. Most hybrid applications run multiple OpenMP threads within each MPI task.

Example SLURM script:

#!/bin/bash
#SBATCH -J hybrid_job
#SBATCH -A my_allocation
#SBATCH -N 2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=8
#SBATCH -t 00:10:00
#SBATCH -o hybrid_job_%j.out
#SBATCH -e hybrid_job_%j.err

# Number of OpenMP threads per MPI task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

cd $SLURM_SUBMIT_DIR

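# Launch the hybrid job. Pass --cpus-per-task explicitly: since Slurm 22.05,
# srun no longer inherits this value from sbatch.
srun --cpus-per-task=$SLURM_CPUS_PER_TASK ./my_exe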

Resource Layout

In the example above:

-N 2 requests 2 compute nodes.
--ntasks-per-node=2 launches 2 MPI tasks per node.
--cpus-per-task=8 allocates 8 CPU cores to each MPI task.
OMP_NUM_THREADS=8 creates 8 OpenMP threads within each MPI task.

This results in:

4 MPI tasks total (2 nodes × 2 tasks per node)
8 OpenMP threads per MPI task
32 CPU cores used overall
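The arithmetic above can be reproduced for other layouts with a small helper. This is a sketch only, and hybrid_layout is a hypothetical name. It assumes one OpenMP thread per allocated CPU core.

```shell
#!/bin/bash
# hybrid_layout is a hypothetical helper that reproduces the arithmetic above.
# Arguments: nodes, MPI tasks per node, CPUs (OpenMP threads) per task.
hybrid_layout() {
    local nodes="$1"
    local tasks_per_node="$2"
    local cpus_per_task="$3"
    local total_tasks=$(( nodes * tasks_per_node ))
    local total_cores=$(( total_tasks * cpus_per_task ))
    echo "tasks=$total_tasks threads_per_task=$cpus_per_task cores=$total_cores"
}

hybrid_layout 2 2 8   # prints: tasks=4 threads_per_task=8 cores=32
```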

NUMA Considerations

Non-Uniform Memory Access (NUMA) can affect hybrid application performance. Many compute nodes contain multiple CPU sockets, each with its own local memory, and a core reads memory attached to another socket more slowly than its own. Performance is therefore often best when all OpenMP threads of an MPI task are confined to cores within a single socket. The optimal MPI/OpenMP balance depends on the application and hardware configuration.
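One common way to keep a task's threads together is to pin them with the standard OpenMP environment variables OMP_PLACES and OMP_PROC_BIND, combined with srun's --cpu-bind option. The values below are an assumed starting point, not settings taken from this system's configuration, and they only help if each task's cores sit on one socket.

```shell
#!/bin/bash
# Assumed starting values; tune per application and node type.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"  # one thread per allocated core
export OMP_PLACES=cores      # define one place per physical core
export OMP_PROC_BIND=close   # pack a task's threads onto adjacent places

# Launch only when running inside a SLURM job; ./my_exe is the executable
# from the hybrid example.
if command -v srun >/dev/null 2>&1 && [ -n "${SLURM_JOB_ID:-}" ]; then
    srun --cpus-per-task="$OMP_NUM_THREADS" --cpu-bind=cores ./my_exe
fi
```

Whether tasks actually land on separate sockets depends on the node's core numbering and the site's SLURM configuration, so the resulting binding is worth verifying on the target system.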

Users are encouraged to benchmark different combinations of MPI tasks per node and OpenMP threads per task to determine the best performance for their application.
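As a starting point for such a benchmark, the candidate combinations for a node can be enumerated mechanically. The sketch below lists every split that uses all cores of a node exactly. The core count of 16 is an assumption for illustration, and list_layouts is a hypothetical helper name.

```shell
#!/bin/bash
# list_layouts is a hypothetical helper: for a node with the given core count,
# print every "tasks_per_node cpus_per_task" pair that uses all cores exactly.
list_layouts() {
    local cores="$1"
    local tasks=1
    while [ "$tasks" -le "$cores" ]; do
        if [ $(( cores % tasks )) -eq 0 ]; then
            echo "$tasks $(( cores / tasks ))"
        fi
        tasks=$(( tasks + 1 ))
    done
}

# Assumed node size of 16 cores; substitute the real core count.
list_layouts 16
```

Each printed pair can then be submitted as a separate job by passing --ntasks-per-node and --cpus-per-task on the sbatch command line, where they take precedence over the #SBATCH directives in the script.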

Known Issues

system(), fork(), and popen()

Some MPI implementations and high-performance networking environments may not fully support calls to system(), fork(), and popen() within the MPI execution scope (between MPI_Init() and MPI_Finalize()). Applications that invoke these functions while MPI is active may experience unexpected behavior or failures. When possible, avoid these calls inside MPI-parallel regions.