Job Templates for Serial and Parallel (Multi-Threaded and MPI) Jobs

Serial Job

#!/bin/bash
#SBATCH --job-name=serial_job_test    # Job name
#SBATCH --ntasks=1                    # Using a single core
#SBATCH --time=00:10:00               # Time limit hh:mm:ss
#SBATCH --output=serial_test_%j.log   # Standard output and error log

module load python

echo "Running job on a single CPU core"

python /home/user/single_core_job.py

date

Shared Memory Parallelism (SMP) Jobs

Shared-memory parallelism (SMP) divides a workload among CPU cores using multiple threads or processes that run within a single compute node and share access to common memory. Applications that use OpenMP (Open Multi-Processing), pthreads, Python's multiprocessing module, or R's mclapply all fall into this category. They can use multiple cores but not multiple nodes: all the cores must be physically located on the same node. When running an SMP job, you must tell the application how many cores to use. How that is done depends on the application:

OpenMP applications check the OMP_NUM_THREADS environment variable to determine how many threads to create (that is, how many cores to use). Set --ntasks=1, and set OMP_NUM_THREADS to a value less than or equal to --cpus-per-task. Typically, you set --cpus-per-task to the number of OpenMP threads you wish to use.
Other types of applications may specify the number of cores in different ways (e.g., through command-line arguments). Refer to the software documentation for details.

Below is an example for running SMP jobs:

#!/bin/bash
#SBATCH --job-name=parallel_job      # Job name
#SBATCH --nodes=1                    # Run all processes on a single node   
#SBATCH --ntasks=1                   # Run a single task        
#SBATCH --cpus-per-task=4            # Number of CPU cores per task
#SBATCH --time=00:10:00              # Time limit hh:mm:ss
#SBATCH --output=parallel_%j.log     # Standard output and error log

date
# use this line if your job uses OpenMP 
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK 

/home/user/smp_job.out
date
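
Slurm sets SLURM_CPUS_PER_TASK only when --cpus-per-task is specified, so a script reused without that option would export an empty OMP_NUM_THREADS. The following is a minimal sketch of a more defensive variant, assuming that falling back to one thread is acceptable for your application. The function name resolve_omp_threads is hypothetical and is not part of Slurm or OpenMP.

```shell
#!/bin/bash
# Hypothetical helper: print the thread count an OpenMP program should use.
# Falls back to 1 when SLURM_CPUS_PER_TASK is unset or empty (an assumption;
# choose whatever default suits your application).
resolve_omp_threads() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

export OMP_NUM_THREADS="$(resolve_omp_threads)"
echo "OpenMP threads = $OMP_NUM_THREADS"
```

With --cpus-per-task=4 in the job script, this sets OMP_NUM_THREADS to 4, the same result as the direct export in the template above.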

MPI (Message Passing Interface) Job

According to the Slurm documentation, "there are three fundamentally different modes of operation used by various MPI implementations with Slurm:

Slurm directly launches the tasks and performs initialization of communications through the PMI2 or PMIx APIs. (Supported by most modern MPI implementations.)
Slurm creates a resource allocation for the job and then mpirun launches tasks using Slurm's infrastructure (not using PMIx).
Slurm creates a resource allocation for the job and then mpirun launches tasks using some mechanism other than Slurm." (We do not recommend that HPC/LONI users launch their MPI jobs with this last method.)

PMIx Versions

If you compiled your MPI application with our default mvapich2 library (which is built with PMIx enabled), start the application directly with the srun command. Below is an example job script in which the executable a.out was compiled with mvapich2 and is launched with srun:

#!/bin/bash
#SBATCH --job-name=mpi_job_test      # Job name
#SBATCH --partition=workq            # For jobs using more than 1 node, submit to workq
#SBATCH --nodes=2                    # Number of nodes to be allocated
#SBATCH --ntasks=96                  # Number of MPI tasks (i.e. processes/cores)
#SBATCH --time=00:05:00              # Wall time limit (hh:mm:ss)
#SBATCH --output=mpi_test_%j.log     # Standard output and error

echo "Date              = $(date)"
echo "Hostname          = $(hostname -s)"
echo "Working Directory = $(pwd)"
echo ""
echo "Slurm Nodes Allocated          = $SLURM_JOB_NODELIST"
echo "Number of Nodes Allocated      = $SLURM_JOB_NUM_NODES"
echo "Number of Tasks Allocated      = $SLURM_NTASKS"

module load mvapich2/2.3.3/intel-19.0.5
srun -n $SLURM_NTASKS ./a.out

Non-PMIx Versions

If your MPI application was not compiled with our default module key mvapich2/2.3.3/intel-19.0.5, start the application with the mpirun command. Below is an example job script in which the executable a.out was compiled with mvapich2/2.3.3/intel-19.0.5-hydra and is launched with mpirun:

#!/bin/bash
#SBATCH --job-name=mpi_job_test      # Job name
#SBATCH --partition=workq            # For jobs using more than 1 node, submit to workq
#SBATCH --nodes=2                    # Number of nodes to be allocated
#SBATCH --ntasks=96                  # Number of MPI tasks (i.e. processes/cores)
#SBATCH --time=00:05:00              # Wall time limit (hh:mm:ss)
#SBATCH --output=mpi_test_%j.log     # Standard output and error

echo "Date              = $(date)"
echo "Hostname          = $(hostname -s)"
echo "Working Directory = $(pwd)"
echo ""
echo "Slurm Nodes Allocated          = $SLURM_JOB_NODELIST"
echo "Number of Nodes Allocated      = $SLURM_JOB_NUM_NODES"
echo "Number of Tasks Allocated      = $SLURM_NTASKS"

module load mvapich2/2.3.3/intel-19.0.5-hydra
mpirun -n $SLURM_NTASKS ./a.out
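
The rule in the two subsections above (default module key: srun; any other build, such as the -hydra one: mpirun) can be written down as a small helper. This is only a sketch under that stated rule: the function name pick_launcher and the MPI_MODULE variable are hypothetical, and the mapping should be adjusted if other PMIx-enabled modules are available on your cluster.

```shell
#!/bin/bash
# Hypothetical helper: choose the launcher from the module key used to build
# the application, following the rule described in this section.
pick_launcher() {
    local module_key="$1"
    case "$module_key" in
        mvapich2/2.3.3/intel-19.0.5) echo "srun" ;;    # default build, PMIx enabled
        *)                           echo "mpirun" ;;  # e.g. the -hydra build
    esac
}

MPI_MODULE="mvapich2/2.3.3/intel-19.0.5-hydra"   # module the executable was built with
LAUNCHER="$(pick_launcher "$MPI_MODULE")"
echo "Launcher for $MPI_MODULE: $LAUNCHER"
```

In a job script, the chosen launcher would then be invoked with the same arguments as in the templates above, for example -n $SLURM_NTASKS ./a.out.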

Hybrid (MPI + SMP) Job

Hybrid jobs are MPI applications in which each MPI process is multi-threaded (usually via OpenMP or POSIX threads), so the job can use multiple cores on each of multiple nodes. If the MPI implementation is compiled with PMIx enabled, use the srun command to start the hybrid job; otherwise, use the mpirun command.

PMIx Versions

In this example, each compute node has 48 CPU cores. The example below requests 4 MPI processes (tasks), each of which spawns 24 threads on 24 cores, so a total of 96 cores is used across 2 nodes in workq, with one thread per core. It uses the module key mvapich2/2.3.3/intel-19.0.5, which is compiled with PMIx enabled.

#!/bin/bash
#SBATCH --job-name=hybrid_job_test   # Job name
#SBATCH --partition=workq            # Submit to workq for multi-node jobs
#SBATCH --nodes=2                    # Number of nodes to be allocated
#SBATCH --ntasks=4                   # Number of MPI tasks (i.e. processes)
#SBATCH --cpus-per-task=24           # Number of cores per MPI task
#SBATCH --time=00:05:00              # Wall time limit (hh:mm:ss)
#SBATCH --output=hybrid_test_%j.log  # Standard output and error file

echo "Date              = $(date)"
echo "Hostname          = $(hostname -s)"
echo "Working Directory = $(pwd)"
echo ""
echo "Number of Nodes Allocated      = $SLURM_JOB_NUM_NODES"
echo "Number of Tasks Allocated      = $SLURM_NTASKS"
echo "Number of Cores/Task Allocated = $SLURM_CPUS_PER_TASK"

module load mvapich2/2.3.3/intel-19.0.5
srun -n $SLURM_NTASKS -c $SLURM_CPUS_PER_TASK ./a.out
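
The core-count arithmetic behind this example can be checked with a few lines of shell before submitting. This is a sketch only: the variable names are illustrative, the values are copied from the example above, and it assumes that a task's threads must all sit on one node.

```shell
#!/bin/bash
# Values taken from the example above; adjust for your own job.
NTASKS=4            # MPI tasks (--ntasks)
CPUS_PER_TASK=24    # threads per task (--cpus-per-task)
CORES_PER_NODE=48   # cores on each compute node

# Total cores used by the job.
TOTAL_CORES=$(( NTASKS * CPUS_PER_TASK ))
# Whole tasks that fit on one node (a task cannot span nodes).
TASKS_PER_NODE=$(( CORES_PER_NODE / CPUS_PER_TASK ))
# Ceiling division: nodes needed to place all tasks.
NODES_NEEDED=$(( (NTASKS + TASKS_PER_NODE - 1) / TASKS_PER_NODE ))

echo "Total cores    = $TOTAL_CORES"
echo "Tasks per node = $TASKS_PER_NODE"
echo "Nodes needed   = $NODES_NEEDED"
```

For the values shown, this reports 96 total cores, 2 tasks per node, and 2 nodes, matching the --nodes=2 request in the script.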

Non-PMIx Versions

As in the PMIx version, the example below requests 4 tasks with 24 cores each, for a total of 96 cores on 2 nodes in workq. It uses the module key mvapich2/2.3.3/intel-19.0.5-hydra, which is compiled without PMIx, so ./a.out is launched with the mpirun command, and OMP_NUM_THREADS is set in the job script to control the number of threads per process.

#!/bin/bash
#SBATCH --job-name=hybrid_job_test   # Job name
#SBATCH --partition=workq            # Submit to workq for multi-node jobs
#SBATCH --nodes=2                    # Number of nodes to be allocated
#SBATCH --ntasks=4                   # Number of MPI tasks (i.e. processes)
#SBATCH --cpus-per-task=24           # Number of cores per MPI task
#SBATCH --time=00:05:00              # Wall time limit (hh:mm:ss)
#SBATCH --output=hybrid_test_%j.log  # Standard output and error file

echo "Date              = $(date)"
echo "Hostname          = $(hostname -s)"
echo "Working Directory = $(pwd)"
echo ""
echo "Number of Nodes Allocated      = $SLURM_JOB_NUM_NODES"
echo "Number of Tasks Allocated      = $SLURM_NTASKS"
echo "Number of Cores/Task Allocated = $SLURM_CPUS_PER_TASK"

module load mvapich2/2.3.3/intel-19.0.5-hydra
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
mpirun -n $SLURM_NTASKS ./a.out