# GNU Parallel examples running the Laplacian solver and BLAST:
# 
# blast:
# 
# To submit a distributed workload:
# Prerequisite: I added +ncbiblast-2.2.28-gcc-4.4.6 to my ~/.soft
# pure serial run on 2 nodes, using a maximum of 16 cores per node:
cd /project/fchen14/gpar/blast
qsub par_ser_blast.pbs
# multi-threaded run on 2 nodes, 4 jobs per node, 2 threads per job,
# to demonstrate that the load is evenly distributed across the 2 nodes:
# [fchen14@mike5 blast]$ qsub par_mt_blast.pbs 
# 559695.mike3
# [fchen14@mike5 laplace]$ qshow 559695.mike3 | awk '{print $1, $2, $3}'
# PBS job: 559695.mike3,
# Hostname Days Load
# mike429 217 5.24
# mike430 217 4.69
# PBS_job=559695.mike3 user=fchen14 allocation=hpc_hpcadmin3
# to maximize the load on each node, use e.g. 8 jobs per node, 2 threads per job:
qsub par_mt_blast.pbs

# laplace
# to compile:
cd /project/fchen14/gpar/laplace
# This builds the serial, OpenMP and MPI executables:
make
# [fchen14@mike5 laplace]$ make
# icc laplace_ser.c -o lap_ser.out
# icc -openmp laplace_omp.c -o lap_omp.out
# mpicc laplace_mpi.c -o lap_mpi.out

# Notes on the Laplacian solvers:
# lap_ser.out: solves a 4096x4096 Laplacian grid in serial; by default it runs 2000 steps.
# use "./lap_ser.out steps" to change the total number of iterations
# using the default values:
# ./lap_ser.out takes about 200 seconds; distributed runs using GNU Parallel could take longer

# lap_omp.out: solves a 4096x4096 Laplacian grid using OpenMP; by default it runs 2000 steps using 16 threads.
# use "./lap_omp.out iters num_threads" to change the total number of iterations and the number of OpenMP threads
# using the default number of steps:
# ./lap_omp.out takes about 75 seconds using 8 threads and about 122 seconds using 4 threads; distributed runs using GNU Parallel could take longer

# lap_mpi.out: solves a 4096x4096 Laplacian grid using MPI; by default it runs 2000 steps on 2x2=4 processes.
# use the arguments "nr nc ndiv_rows ndiv_cols relerr niter iprint debug_global_matrix" to change the settings:
# nr:        number of rows, default 4096
# nc:        number of cols, default 4096
# ndiv_rows: number of divisions in row (Y) direction, default 2
# ndiv_cols: number of divisions in col (X) direction, default 2
# relerr:    relative error, default 0.001
# niter:     total iterations, default 2000
# iprint:    print information every iprint iterations, default 100
# debug_global_matrix: print the final global matrix (debugging only); set it to 0
# using the default values:
# mpirun -np 4 lap_mpi.out
# takes about 56-60 seconds; distributed runs using GNU Parallel could take longer
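#
# A minimal sketch that spells out the defaults listed above as an explicit command line.
# Assumptions (not confirmed by these notes): the arguments are positional in the order listed,
# and the MPI process count equals ndiv_rows * ndiv_cols. The variable names are illustrative only.
# The command is only printed here, not executed.

```shell
# Default settings, copied from the argument list above.
NR=4096            # number of rows
NC=4096            # number of cols
NDIV_ROWS=2        # divisions in row (Y) direction
NDIV_COLS=2        # divisions in col (X) direction
RELERR=0.001       # relative error
NITER=2000         # total iterations
IPRINT=100         # print information every IPRINT iterations
DEBUG_MATRIX=0     # do not print the final global matrix

# Assumption: one MPI process per sub-domain, i.e. 2x2=4 processes by default.
NP=$((NDIV_ROWS * NDIV_COLS))

# Build the command line and print it instead of running it.
CMD="mpirun -np $NP lap_mpi.out $NR $NC $NDIV_ROWS $NDIV_COLS $RELERR $NITER $IPRINT $DEBUG_MATRIX"
echo "$CMD"
```

# Changing NDIV_ROWS or NDIV_COLS changes NP accordingly under the assumption above.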

# to submit a distributed workload:
# serial, using a maximum of 16 cores per node on 2 nodes:
qsub par_ser.pbs 
# OpenMP, using 4 threads per job, 3 jobs per node on 2 nodes,
# to show that the load is evenly distributed across the 2 nodes:
qsub par_omp.pbs
# MPI, using 4 processes per job, 3 jobs per node on 2 nodes,
# to show that the load is evenly distributed across the 2 nodes:
qsub par_mpi.pbs
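#
# A small sketch of the per-node arithmetic behind the OpenMP and MPI runs above:
# 3 jobs per node with 4 threads (or processes) each occupy 12 of the 16 cores on a node.
# The variable names are illustrative and are not taken from the .pbs scripts.

```shell
CORES_PER_NODE=16    # maximum cores per node, as in the serial run
THREADS_PER_JOB=4    # OpenMP threads or MPI processes per job
JOBS_PER_NODE=3      # jobs per node used in the runs above

# Cores occupied on each node by the chosen layout.
CORES_USED=$((JOBS_PER_NODE * THREADS_PER_JOB))

# Largest number of such jobs that fits on one node without oversubscribing it.
MAX_JOBS_PER_NODE=$((CORES_PER_NODE / THREADS_PER_JOB))

echo "cores used per node: $CORES_USED of $CORES_PER_NODE"
echo "max jobs per node at $THREADS_PER_JOB threads each: $MAX_JOBS_PER_NODE"
```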

# Regarding the known bug in GNU Parallel that limits the maximum number of jobs spawned per node, the bug occurs when:
# 1. The number of jobs specified on each node, "JOBS_PER_NODE", is greater than 9 (>=10), AND
# 2. The total number of jobs is less than 16
# e.g. if we have 15 jobs in total, only 14 jobs will be spawned
# However, in many of our use cases the total number of jobs that need to be processed is typically (much) more than 16,
# so as a workaround I simply add "-j $(($JOBS_PER_NODE+1))" or "-j+1" to the parallel command in the serial distributed jobs,
# which should make it possible to use all the available cores on SuperMike2
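#
# A minimal sketch of the workaround above: request one more job slot than JOBS_PER_NODE.
# JOBS_PER_NODE=16 is an assumed value matching the 16 cores per node used in the serial runs;
# the parallel command in the last comment uses hypothetical file names and is not taken from the .pbs scripts.

```shell
JOBS_PER_NODE=16                         # assumed: one serial job per core

# Workaround: ask GNU Parallel for one extra job slot per node.
PARALLEL_SLOTS=$((JOBS_PER_NODE + 1))
PARALLEL_OPTS="-j $PARALLEL_SLOTS"

echo "GNU Parallel options: $PARALLEL_OPTS"

# Hypothetical usage inside a PBS script (run_one.sh and args.txt are placeholders):
#   parallel $PARALLEL_OPTS --sshloginfile "$PBS_NODEFILE" ./run_one.sh :::: args.txt
```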
