SuperMIC

System Overview

SuperMIC (pronounced "Super Mick") is LSU's newest supercomputer, funded by a National Science Foundation (NSF) Major Research Instrumentation (MRI) award to the Center for Computation & Technology. When in production, 40 percent of its computational resources will be reserved for participants in the Extreme Science and Engineering Discovery Environment (XSEDE), a virtual system that scientists can use to interactively share computing resources, data, and expertise.

SuperMIC is currently in the acquisition phase with an expected production date of June 1, 2014. It is expected to be a 1 PetaFlop cluster with 360 compute nodes, each with two 10-core 2.8GHz Intel Ivy Bridge-EP processors, 64GB of memory and two Intel Xeon Phi 7120P coprocessors. In addition, 20 compute nodes will have two 10-core 2.8GHz Intel Ivy Bridge-EP processors, 64GB of memory, one Intel Xeon Phi 7120P coprocessor and one NVIDIA Tesla K20X. When in production mode, SuperMIC will be open for general use to LSU and XSEDE users. Allocation and account requests for XSEDE users will be processed through the XSEDE User Portal. LSU users will need to use their LSU HPC credentials to gain access to SuperMIC. More details will follow in the coming months as we work toward getting the cluster into production.

Configuration

  • 1 Interactive Node
    • Two 2.8GHz 10-Core Ivy Bridge-EP E5-2680 Xeon 64-bit Processors
    • One Intel Xeon Phi 7120P Coprocessor
    • 128GB DDR3 1866MHz RAM
    • 1TB HD
    • 56 Gigabit/sec Infiniband network interface
    • 10 Gigabit Ethernet network interface
    • Red Hat Enterprise Linux 6
  • 1 Interactive Node
    • Two 2.8GHz 10-Core Ivy Bridge-EP E5-2680 Xeon 64-bit Processors
    • One NVIDIA Tesla K20X 6GB GPU
    • 128GB DDR3 1866MHz RAM
    • 1TB HD
    • 56 Gigabit/sec Infiniband network interface
    • 10 Gigabit Ethernet network interface
    • Red Hat Enterprise Linux 6
  • 360 Compute Nodes
    • Two 2.8GHz 10-Core Ivy Bridge-EP E5-2680 Xeon 64-bit Processors
    • Two Intel Xeon Phi 7120P Coprocessors
    • 64GB DDR3 1866MHz RAM
    • 500GB HD
    • 56 Gigabit/sec Infiniband network interface
    • 1 Gigabit Ethernet network interface
    • Red Hat Enterprise Linux 6
  • 20 Hybrid Compute Nodes
    • Two 2.8GHz 10-Core Ivy Bridge-EP E5-2680 Xeon 64-bit Processors
    • One Intel Xeon Phi 7120P Coprocessor
    • One NVIDIA Tesla K20X 6GB GPU with GPUDirect Support
    • 64GB DDR3 1866MHz RAM
    • 500GB HD
    • 56 Gigabit/sec Infiniband network interface
    • 1 Gigabit Ethernet network interface
    • Red Hat Enterprise Linux 6
  • Cluster Storage
    • 840TB Lustre High-Performance disk
    • 5TB NFS-mounted /home disk storage

1. System Access to SuperMIC

1.1. SSH

To access SuperMIC, users must connect using a Secure Shell (SSH) client.

*nix and Mac Users - An SSH client is already installed and can be accessed from the command prompt using the ssh command. One would issue a command similar to the following:

$ ssh -X username@mic.hpc.lsu.edu

The user will then be prompted for their password. The -X flag sets up X11 forwarding automatically.

Windows Users - You will need to download and install an SSH client such as the PuTTY utility. If you need to log in with X11 forwarding, an X server must be installed and running on your local Windows machine. Xming X Server is recommended; advanced users may instead install Cygwin, which also provides a command-line ssh client similar to that available to *nix and Mac users.

If you have forgotten your password, or you wish to reset it, use the "Forgot your password?" link on your HPC Profile page at https://accounts.hpc.lsu.edu.

1.2. GSI-OpenSSH (gsissh)

The following commands authenticate using the XSEDE myproxy server, then connect to SuperMIC via gsissh:

localhost$ myproxy-logon -s myproxy.teragrid.org
localhost$ gsissh userid@mic.lsu.xsede.org

Please consult NCSA's detailed documentation on installing and using myproxy and gsissh, as well as the GSI-OpenSSH User's Guide for more info.

XSEDE also provides a Single Sign On (SSO) login hub; upon logging in, a proxy certificate is automatically generated for users, who can then connect to XSEDE resources via gsissh. Detailed information can be found in the XSEDE SSO documentation.

1.3. Help

To report a problem please run the ssh or gsissh command with the "-vvv" option and include the verbose information in the ticket.

2. File Transfer to SuperMIC

SuperMIC supports multiple file transfer programs, including common command-line utilities such as scp, sftp, and rsync, as well as services such as globus-url-copy and Globus Online.

2.1. SCP

scp is the easiest method for transferring single files.

Local File to Remote Host
% scp localfile user@remotehost:/destination/dir/or/filename
Remote Host to Local File
% scp user@remotehost:/remote/filename localfile

2.2. SFTP

Interactive Mode

This mode is very similar to the interactive interface offered by classic ftp clients. A login session may look similar to the following:

% sftp user@remotehost
(enter in password)
 ...
sftp>

The commands are similar to those offered by the outmoded ftp client programs: get, put, cd, pwd, lcd, etc. For more information on the available set of commands, consult the sftp man page.

% man sftp

Batch Mode

One may use sftp non-interactively in two ways.

Case 1: Pull a remote file to the local host.

% sftp user@remotehost:/remote/filename localfilename

Case 2: Create a special sftp batch file containing the set of commands one wishes to execute without any interaction.

% sftp -b batchfile user@remotehost

Additional information on constructing a batch file is available in the sftp man page.
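A batch file is simply a list of sftp commands, one per line, in the same syntax used in an interactive session. A minimal sketch (the paths and file names are illustrative):

```
# example batchfile (paths are illustrative); lines starting with # are skipped
cd /work/username
put results.tar.gz
get remote_log.txt
bye
```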

2.3. rsync Over SSH (preferred)

rsync is an extremely powerful program; it can synchronize entire directory trees, only sending data about files that have changed. That said, it is rather picky about the way it is used. The rsync man page has a great deal of useful information, but the basics are explained below.

Single File Synchronization

To synchronize a single file via rsync, use the following:

To send a file:

% rsync --rsh=ssh --archive --stats --progress localfile \
        username@remotehost:/destination/dir/or/filename

To receive a file:

% rsync --rsh=ssh --archive --stats --progress \
        username@remotehost:/remote/filename localfilename

Note that --rsh=ssh is not necessary with newer versions of rsync, but older installs will default to using rsh (which is not generally enabled on modern OSes).

Directory Synchronization

To synchronize an entire directory, use the following:

To send a directory:

% rsync --rsh=ssh --archive --stats --progress localdir/ \
        username@remotehost:/destination/dir/ 

or

% rsync --rsh=ssh --archive --stats --progress localdir \
        username@remotehost:/destination 

To receive a directory:

% rsync --rsh=ssh --archive --stats --progress \
        username@remotehost:/remote/directory/ /some/localdirectory/

or

% rsync --rsh=ssh --archive --stats --progress \
        username@remotehost:/remote/directory /some/

Note the difference with the slashes. The second command will place the files in the directory /destination/localdir; the fourth will place them in the directory /some/directory. rsync is very particular about the placement of slashes. Before running any significant rsync command, add --dry-run to the parameters. This will let rsync show you what it plans on doing without actually transferring the files.

Synchronization with Deletion

This is very dangerous; a single mistyped character may blow away all of your data. Do not synchronize with deletion if you aren't absolutely certain you know what you're doing.

To have directory synchronization delete files on the destination system that don't exist on the source system:

% rsync --rsh=ssh --archive --stats --dry-run --progress \
        --delete localdir/ username@remotehost:/destination/dir/

Note that the above command will not actually delete (or transfer) anything; the --dry-run must be removed from the list of parameters to actually have it work.

2.4. GridFTP, globus-url-copy and Globus Online

To transfer data between XSEDE sites, use globus-url-copy. This command requires the use of an XSEDE certificate to create a proxy for passwordless transfers. It has a complex syntax, but provides high-speed access to other XSEDE machines that support GridFTP services (the protocol used by globus-url-copy). High-speed transfers of a file or directory occur between the GridFTP servers at the XSEDE sites. The GridFTP servers mount the file systems of the target machine, thereby providing access to your files or directories. Third-party transfers, i.e., transfers between two machines initiated from a third machine, are also possible.

Use the myproxy-logon command with your XUP username to obtain a proxy certificate.

myproxy-logon -T -l <XUP_username>

This command will prompt for your XSEDE password. The proxy is valid for 12 hours for all logins on the local machine. With globus-url-copy, you must include the name of the server and a full path to the file. The general syntax looks like:

globus-url-copy <options> \
gsiftp://<gridftp_server1>/<filename> gsiftp://<gridftp_server2>/<filename>

If one of the GridFTP servers is the local machine where the above command is entered, you can replace gsiftp://<gridftp_server1>/<filename> with file:///<filename>.

globus-url-copy example

The following example transfers a file, 100mbfile, from the LONI cluster Eric to a directory, createdirectory, which is created on the LONI cluster QueenBee.

[apacheco@qb1 ~]$ globus-url-copy -vb -cd \ 
 gsiftp://eric1.loni.org/home/apacheco/100mbfile \ 
 gsiftp://qb1.loni.org/home/apacheco/createdirectory/100mbfile

gsiftp example

The following example transfers a file, 1gbfile between the QueenBee and Eric LONI clusters.

globus-url-copy -vb \
 gsiftp://qb1.loni.org/work/sirish/1gbfile \
 gsiftp://eric1.loni.org/work/sirish/1gbfile

Please refer to the GridFTP User Guide for detailed description for various options available for globus-url-copy.

Globus Online

Globus Online endpoints exist for XSEDE and LONI clusters. To use GO on a LONI cluster, please refer to the Globus Tutorial. Additional information regarding GO endpoints for SuperMIC will be available at a later time.

3. Computing Environment

3.1. Shell

SuperMIC's default shell is bash. Other shells are available: sh, csh, tcsh, and ksh. Users may change their default shell by logging into their HPC Profile page at https://accounts.hpc.lsu.edu.

3.2. Modules

SuperMIC makes use of modules to allow for adding software to the user's environment.

LSU HPC users: With SuperMIC's participation in the XSEDE project, modules will be the default mechanism for modifying your user environment. Users who are familiar with the softenv environment on our other clusters should note that softenv will not be installed. The following is a guide to managing your software environment with modules.

The Environment Modules package provides for dynamic modification of your shell environment. Module commands set, change, or delete environment variables, typically in support of a particular application. They also let the user choose between different versions of the same software or different combinations of related codes.

3.2.1. Useful Module Commands

Command                                  Description
module list                              List the modules that are currently loaded
module avail                             List the modules that are available
module display <module name>             Show the environment variables used by <module name> and how they are affected
module unload <module name>              Remove <module name> from the environment
module load <module name>                Load <module name> into the environment
module swap <module one> <module two>    Replace <module one> with <module two> in the environment

3.2.2. Loading and unloading modules

You must remove some modules before loading others. Some modules depend on others, so they may be loaded or unloaded as a consequence of another module command. For example, if intel and mvapich are both loaded, running the command module unload intel will automatically unload mvapich. Subsequently issuing the module load intel command does not automatically reload mvapich.

If you find yourself regularly using a set of module commands, you may want to add these to your configuration files (.bashrc for bash users, .cshrc for C shell users). Complete documentation is available in the module(1) and modulefile(4) manpages.
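A typical session combining these commands might look as follows (the module names intel and gcc are illustrative; run module avail to see what is actually installed):

```
% module list                # see what is currently loaded
% module avail               # see what can be loaded
% module load intel          # load the Intel compiler suite
% module swap intel gcc      # replace it with the GNU toolchain
% module unload gcc          # remove it again
```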

4. File Systems

User-owned storage on the SuperMIC system is available in two directories: home and work. These directories reside on separate global file systems and are accessible from any node in the system. The home and work directories are created automatically within an hour of first login. If these directories do not exist when you log in, please wait at least an hour before contacting the HPC helpdesk.

4.1. Home Directory

The /home file system quota on SuperMIC is 5 GB. Files can be stored on /home permanently, which makes it an ideal place for your source code and executables. The /home file system is meant for interactive use such as editing and active code development. Do not use /home for batch job I/O.

4.2. Work (Scratch) Directory

The /work volume is meant for the input and output of executing batch jobs, not for long-term storage. We expect files to be copied to other locations or deleted in a timely manner, usually within 30-120 days. For performance reasons on all volumes, our policy is to limit the number of files per directory to around 10,000 and total files to about 500,000.

The /work file system quota on SuperMIC is unlimited. If it becomes over-utilized we will enforce a 30-day purge policy, meaning that any files that have not been accessed in the last 30 days will be permanently deleted. An email message will be sent weekly to users targeted for a purge, informing them of their /work utilization.

Please do not try to circumvent the removal process by date-changing methods. We expect most files over 30 days old to disappear. Attempting to circumvent the purge process may lead to access restrictions on the /work volume or the cluster.

Please note that the /work volume is not unlimited, so limit your usage to a reasonable amount. When the utilization of /work is over 80%, a 14-day purge may be performed on users storing more than 2 TB or more than 500,000 files. Should disk space become critically low, all files not accessed in 14 days will be purged, and even more drastic measures may be taken if needed. Users occupying the largest portions of the /work volume will be contacted when problems arise and will be expected to take action to help resolve them.

5. Application Development

The Intel, GNU and Portland Group (PGI) C, C++ and Fortran compilers are installed on SuperMIC and they can be used to create OpenMP, MPI, hybrid and serial programs. The commands you should use to create each of these types of programs are shown in the table below.

Intel compilers are loaded by default; codes can be compiled according to the following chart:

Intel Compiler Commands
          Serial Codes   MPI Codes   OpenMP Codes        Hybrid Codes
Fortran   ifort          mpif90      ifort -openmp       mpif90 -openmp
C         icc            mpicc       icc -openmp         mpicc -openmp
C++       icpc           mpiCC       icpc -openmp        mpiCC -openmp

GNU Compiler Commands
          Serial Codes   MPI Codes   OpenMP Codes        Hybrid Codes
Fortran   gfortran       mpif90      gfortran -fopenmp   mpif90 -fopenmp
C         gcc            mpicc       gcc -fopenmp        mpicc -fopenmp
C++       g++            mpiCC       g++ -fopenmp        mpiCC -fopenmp

PGI Compiler Commands
          Serial Codes   MPI Codes   OpenMP Codes        Hybrid Codes
Fortran   pgf90          mpif90      pgf90 -mp           mpif90 -mp
C         pgcc           mpicc       pgcc -mp            mpicc -mp
C++       pgCC           mpiCC       pgCC -mp            mpiCC -mp

Default MPI: openmpi 1.6.2 compiled with Intel compiler version 13.0.0

To compile a serial program, the syntax is: <your choice of compiler> <compiler flags> <source file name>. For example, the command below compiles the source file mysource.f90 and generates the executable myexec.

$ ifort -o myexec mysource.f90

To compile an MPI program, the syntax is the same, except that one replaces the serial compiler with an MPI one listed in the table above:

$ mpif90 -o myexec_par my_parallel_source.f90

5.1. Coprocessor (MIC) Programming

Disclaimer: The following section is borrowed from TACC's Stampede user guide. We expect MIC Programming on SuperMIC to be identical to that on Stampede. As we work to bring SuperMIC into production, if we find the following section to be different we will update this section at a later time.

Many Fortran or C/C++ applications designed to run on the E5 processor (host) can be modified to automatically execute blocks of code or routines on the Phi coprocessor through directives. The Intel compiler, without requiring any additional options, will interpret the directives and include Phi executable code within the normal executable binary. A binary with Phi executable offload code can be launched on the host in the usual manner (with ibrun for MPI codes, and as a process execution for serial and OpenMP), and the offloaded sections of code will automatically execute on the Phi coprocessor.

There are two points to remember when discussing computations on the host (E5 CPUs) and coprocessor (Phi):

  • The instruction sets and architectures of the host E5 and Phi coprocessor are quite similar, but are not identical. (Expect differences in performance.)
  • Host processors and MIC coprocessors have their own memory subsystems. They are effectively separate SMP systems with their own OS and environment.

Programming details for offloading can be found in the respective user guides listed below.

Automatic Offloading

Some of the MKL routines that perform large amounts of floating point operations relative to data accesses (having computational complexity O(n^3) versus O(n^2) data accesses; e.g., level 3 BLAS) have been configured with automatic offload (AO) capabilities. This capability allows the user to offload work in the library routines automatically, without any coding changes. No special compiler options are required. Just compile with the usual flags and the MKL library load options (-mkl is the new shortened way to load MKL libraries). Then set the $MKL_MIC_ENABLE environment variable to request the automatic offload to occur at run time:

mic1% ifort -mkl -xhost -O3 app_has_MKLdgemm.f90
mic1% export MKL_MIC_ENABLE=1
mic1% ./a.out

Depending upon the problem size (e.g., n>2048 for dgemm) the library runtime may choose to run all, part or none of the routine on the coprocessor. Offloading and the work division between the CPU and MIC are transparent to the user; but these may be controlled with environment variables and Fortran/C/C++ APIs (application program interfaces), particularly when compiler-assisted offloading is also employed. Also, MPI applications that use multiple tasks per node will need to adjust the workload division for sharing the coprocessor among all of the tasks. For example, setting the $MKL_MIC_WORKDIVISION environment variable or using the support function mkl_mic_set_workdivision() with a fraction value, advises the runtime to give the MIC that fraction of work. Set the $OFFLOAD_REPORT variable value, or mkl_mic_set_offload_report function argument, to 0-2 to disclose a range of information, as shown below:

mic1% export MKL_MIC_ENABLE=1 OFFLOAD_REPORT=2
mic1% ./a.out

Details and a list of all the automatic offload controls are available in the MKL User Guide document.

Compiler Assisted Offloading

Developers can explicitly direct a block of code or a routine to be executed on the MIC in base Fortran or C/C++ using directives. The code to be executed on the MIC is called an offload region. No special coding is required in an offloaded region, and Intel-specific and OpenMP threading methods may be used. Code Example 1 illustrates an offload directive for a code block containing an OpenMP loop. The target(mic:0) clause specifies that the MIC coprocessor with id=0 should execute the code region.

When the host execution encounters the offload region the runtime performs several offload operations: detection of a target Phi coprocessor, allocation of memory space on the coprocessor, data transfer from the host to the coprocessor, execution of the coprocessor binary on the Phi, transfer of data from the Phi back to the host after the completion of the coprocessor binary, and memory deallocation. The offload model is suitable when the data exchanged between the host and the MIC consists of scalars, arrays, and Fortran derived types and C/C++ structures that can be copied using a simple memcpy. This data characteristic is often described as being flat or bit-wise copyable. The data to be transferred at the offload point need not be declared or allocated in any special way if the data is within scope (as in Code Example 1); although pointer data (arrays pointed to by a pointer) need their size specified (see Advanced Offloading).

Example 1: Offloaded OpenMP code block with automatic data transfer
    int main(){
        ...
        float a[N], b[N], c[N];
        ...
    #pragma offload target(mic:0)
        {
    #pragma omp parallel for
          for(i=0; i<N; i++){
            a[i]=sin(b[i])+cos(c[i]);
          }
        }
        ...
    }

    program main
      ...
      real :: a(N), b(N), c(N)
      ...
      !dir$ offload begin target(mic:0)
      !$omp parallel do
      do i=1,N
        a(i)=sin(b(i))+cos(c(i))
      end do
      !dir$ end offload
      ...
    end program
    

By default the compiler will recognize any offload directive. During development it is useful to observe the names and sizes of variables tagged for transfer by including the "-opt-report-phase=offload" option as shown here:

  mic1% ifort/icc/icpc -openmp -O3 -xhost -opt-report-phase=offload myprog.f90/c/cpp
  mic1% export OMP_NUM_THREADS=16
  mic1% export MIC_ENV_PREFIX=MIC  MIC_OMP_NUM_THREADS=240  KMP_AFFINITY=scatter
  mic1% ./a.out

The "-openmp" and "-O3" options apply to both the host (E5 CPU) and offload (MIC) code regions, while -xhost is specific to the host code. Environment variables such as $OMP_NUM_THREADS will normally have different values on the host and the MIC. In these cases variables intended for the MIC should be prefixed with "MIC_" and set on the host as shown above; the "$MIC_ENV_PREFIX" variable must also be set to "MIC". Any prefix may actually be used, but we strongly recommend MIC.

Advanced Offloading

A few of the important concepts you will need to develop and optimize offload paradigms are summarized below. The corresponding directives, clauses and qualifiers are explained as well. More details and examples, as well as references to Intel documentation are provided in TACC's Advanced Offloading document.

Data Transfers: in/out/inout:

In Code Example 1 the compiler will make sure that the a, b and c arrays are copied over to the MIC before the offloaded region is executed, and are copied back at the end of the execution. Because the a array is only written on the MIC, there is no reason to copy it into the coprocessor; likewise, there is no reason to copy b and c out of the coprocessor. To eliminate unnecessary transfers, data intent clauses (in, out, inout) on the offload directive can be used to optimize transfers.

Persistent Data: alloc_if() and free_if():

The automatic data transfers in Code Example 1 allocate storage on the MIC, transfer the data, and deallocate storage for each call. If the same data is to be used in different offloads the data can be made to persist across the offloads by modifying the memory allocation defaults with alloc_if(arg) and free_if(arg) qualifiers within the intent data clauses (in, out, inout). If the argument is false (.false. for Fortran, 0 for C, false for C++) the allocation or deallocation is not performed, respectively.

Data Transfer Directive: offload_transfer:

The programmer can transfer data without offloading executable code. The offload_transfer directive fulfills this function. It is a stand-alone directive (requiring no code block), and uses all the same data clauses and modifiers of a normal offload statement. One common use case is to initially load persistent data (asynchronously) onto the MIC at the beginning of a program.

Asynchronous Offloading: signal and wait:

Often a developer may want to transfer data or do offload work while continuing to work on the CPU. An offload region can be executed asynchronously when a signal clause is included on the directive. The host process encountering the offload will initiate it (offload or offload_transfer) and then immediately continue executing the program code following the offload region. The offload event is identified by a variable argument within the signal clause; the same variable is then used in a wait clause on a subsequent offload directive or a stand-alone wait directive.

5.2. GPU Programming

CUDA Programming

NVIDIA's CUDA compiler and libraries are accessed by loading the CUDA module:

mic1% module load cuda

Use the nvcc compiler on the head node to compile code, and run executables on nodes with GPUs (one interactive node has a GPU). SuperMIC's K20X GPUs are compute capability 3.5 devices. When compiling your code, make sure to specify this level of capability with:

nvcc -arch=compute_35 -code=sm_35 ...

GPU nodes are accessible through the gpu queue for production work.

GPU nodes are not available for XSEDE users. We may make these nodes available to XSEDE in the future.

OpenACC Programming

OpenACC is an application program interface (API) that uses a collection of compiler directives to accelerate applications running on multicore and GPU systems. The OpenACC compiler directives specify regions of code that can be offloaded from a CPU to an attached accelerator. A quick reference guide is available here.

Currently, only the Portland Group compilers installed on SuperMIC can be used to compile C and Fortran code annotated with OpenACC directives.

To load the PGI compilers:

module load portland

To compile a C code annotated with OpenACC directives:

pgcc -acc -ta=nvidia -Minfo=accel code.c -o code.exe

The Pittsburgh Supercomputing Center (PSC), in cooperation with the National Institute for Computational Sciences (NICS), the Georgia Institute of Technology (Georgia Tech), and the Internet2 community, periodically presents a workshop on OpenACC GPU programming. Please visit the XSEDE Training Course Calendar for upcoming workshop on OpenACC.

6. Running Applications

SuperMIC uses TORQUE, an open source version of the Portable Batch System (PBS) together with the MOAB Scheduler, to manage user jobs. Whether you run in batch mode or interactively, you will access the compute nodes using the qsub command as described below. Remember that computationally intensive jobs should be run only on the compute nodes and not the login nodes. More details on submitting jobs and PBS commands can be found here.

6.1. Available Queues on SuperMIC

Below are the possible job queues to choose from:

  • single - Used for jobs that will only execute on a single node, i.e. nodes=1:ppn<=20.
  • workq - Used for jobs that will use at least one node, i.e. nodes>=1:ppn=20. Currently, this queue has a limit of 72 hours (3 days) of wallclock time.
  • checkpt - Used for jobs that will use at least one node.
  • gpu - Used for jobs that run applications compiled with CUDA compiler or OpenACC directives.

Queue Name   Max Walltime (hours)   Max Nodes (per job)
workq        72                     128
checkpt      72                     200
single       72                     1
gpu          24                     10

6.2. Job Submission

The command qsub is used to send a batch job to PBS. The basic usage is

qsub pbs.script 

where pbs.script is the script users write to specify their needs. qsub also accepts command-line arguments, which will override those specified in the script. For example, the following command

qsub myscript -A my_LONI_allocation2

will direct the system to charge SUs (service units) to the allocation my_LONI_allocation2 instead of the allocation specified in myscript.

To submit an interactive job, use the -I flag to the qsub command along with the options for the resources required, for example:

qsub -I -l walltime=hh:mm:ss,nodes=n:ppn=20 -A allocation_name

Note that you need to take the whole node when requesting an interactive job, using anything other than ppn=20 will cause job submission failure. If you need to enable X-Forwarding, add the -X flag.

Your PBS submission script should be written in one of the Linux scripting languages such as bash, tcsh, csh or sh, i.e., the first line of your submission script should be something like #!/bin/bash. The next section of the submission script should be PBS directives, followed by the actual commands to run your job. The following is a list of useful PBS directives (which can also be used as command-line options to qsub) and environment variables that can be used in the submit script:

  • #PBS -q queuename: Submit job to the queuename queue.
    • Allowed values for queuename: single, workq, checkpt.
    • Depending on the cluster, additional allowed values are gpu, lasigma, mwfa, bigmem.
  • #PBS -A allocationname: Charge jobs to your allocation named allocationname.
  • #PBS -l walltime=hh:mm:ss: Request resources to run job for hh hours, mm minutes and ss seconds.
  • #PBS -l nodes=m:ppn=n: Request resources to run job on n processors each on m nodes.
  • #PBS -N jobname: Provide a name, jobname, to your job to identify it when monitoring with the qstat command.
  • #PBS -o filename.out: Write PBS standard output to file filename.out.
  • #PBS -e filename.err: Write PBS standard error to file filename.err.
  • #PBS -j oe: Combine PBS standard output and error into the same file. Note that you will need either the #PBS -o or the #PBS -e directive, not both.
  • #PBS -m status: Send an email after job status status is reached. Allowed values for status are
    • a: when job aborts
    • b: when job begins
    • e: when job ends
    • The arguments can be combined, e.g., abe will send email when the job begins and either aborts or ends
  • #PBS -M your email address: Address to send email to when the status directive above is triggered.
  • PBS_O_WORKDIR: Directory where the qsub command was executed
  • PBS_NODEFILE: Name of the file that contains a list of the HOSTS provided for the job
  • PBS_JOBID: Job ID number given to this job
  • PBS_QUEUE: Queue job is running in
  • PBS_WALLTIME: Walltime in secs requested
  • PBS_JOBNAME: Name of the job. This can be set using the -N option in the PBS script
  • PBS_ENVIRONMENT: Indicates job type, PBS_BATCH or PBS_INTERACTIVE
  • PBS_O_SHELL: value of the SHELL variable in the environment in which qsub was executed
  • PBS_O_HOME: Home directory of the user running qsub

Following are templates for submitting jobs to the various queues available on SuperMIC.

Single Queue Job Script Template
$ cat ~/script

#!/bin/bash
#PBS -q single
#PBS -l nodes=1:ppn=1
#PBS -l walltime=HH:MM:SS
#PBS -o desired_output_file_name
#PBS -N NAME_OF_JOB

/path/to/your/executable

Workq Queue Job Script Template
$ cat ~/script

#!/bin/bash
#PBS -q workq
#PBS -l nodes=1:ppn=20
#PBS -l walltime=HH:MM:SS
#PBS -o desired_output_file_name
#PBS -j oe
#PBS -N NAME_OF_JOB

# mpi jobs would execute:
#   mpirun -np 20 -machinefile $PBS_NODEFILE /path/to/your/executable
# OpenMP jobs would execute:
#   export OMP_NUM_THREADS=20; /path/to/your/executable

Checkpt Queue Job Script Template
$ cat ~/script

#!/bin/bash
#PBS -q checkpt 
#PBS -l nodes=1:ppn=20
#PBS -l walltime=HH:MM:SS
#PBS -o desired_output_file_name
#PBS -j oe
#PBS -N NAME_OF_JOB

# mpi jobs would execute:
#   mpirun -np 20 -machinefile $PBS_NODEFILE /path/to/your/executable
# OpenMP jobs would execute:
#   export OMP_NUM_THREADS=20; /path/to/your/executable

Back to Top

6.3. Monitoring Jobs

qstat for checking job status

The command qstat is used to check the status of PBS jobs. The simplest usage is

qstat

which gives information similar to the following:

[apacheco@qb4 ~]$ qstat
Job id              Name             User            Time Use S Queue
------------------- ---------------- --------------- -------- - -----
729444.qb2          job1.pbs         ebeigi3                0 Q workq          
729516.qb2          MAY2009_d        skayres         533:14:2 R workq          
729538.qb2          wallret_test222  liyuxiu         67:43:38 R workq          
729539.qb2          wallret_test223  liyuxiu         67:43:39 R workq          
729540.qb2          wallret_test228  liyuxiu         66:49:50 R workq          
729541.qb2          wallret_test231  liyuxiu         64:40:21 R workq          
729542.qb2          wallret_test232  liyuxiu         64:40:15 R workq          
729543.qb2          wallret_test233  liyuxiu         63:18:24 R workq          
729567.qb2          CaPtFeAs         cekuma          00:22:01 R workq     

The columns show, from left to right: the ID of each job, its name, its owner, the CPU time consumed, its status (R for running, Q for queued), and the queue it is in. qstat also accepts command-line arguments; for instance, the following usage gives more detailed information about jobs.

[apacheco@qb4 ~]$ qstat -a

qb2: 
                                                                   Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
729444.qb2           ebeigi3  workq    job1.pbs      --      2   1    --  06:30 Q   -- 
729516.qb2           skayres  workq    MAY2009_d    2969     8   1    --  72:00 R 66:45
729538.qb2           liyuxiu  workq    wallret_te  26259     1   1    --  70:00 R 67:44
729539.qb2           liyuxiu  workq    wallret_te   5144     1   1    --  70:00 R 67:44
729540.qb2           liyuxiu  workq    wallret_te  12445     1   1    --  70:00 R 66:50
729541.qb2           liyuxiu  workq    wallret_te   2300     1   1    --  70:00 R 64:41
729542.qb2           liyuxiu  workq    wallret_te   1809     1   1    --  70:00 R 64:41
729543.qb2           liyuxiu  workq    wallret_te   9377     1   1    --  70:00 R 63:19
729567.qb2           cekuma   workq    CaPtFeAs    10562     7   1    --  69:50 R 48:18

Other useful options to qstat:

  • -u username: To display only jobs owned by user username.
  • -n: To display list of nodes that jobs are running on.
  • -q: To summarize resources available to all queues.
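These options can be combined. For example, to list only your own jobs together with the nodes they are running on (replace username with your actual login name):

```shell
# Show only jobs owned by username, including the nodes assigned to each
qstat -n -u username
```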
qdel for cancelling a job

To cancel a PBS job, enter the following command.

qdel job_id [job_id] ...
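Multiple job IDs may be given in a single command. For example, using job IDs from the qstat output above:

```shell
# Cancel a single job
qdel 729444
# Cancel several jobs at once
qdel 729516 729538 729539
```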
qfree to query free nodes in PBS

One useful command for scheduling jobs in an optimal way is "qfree", which shows the free nodes in each queue. For example,

[apacheco@qb4 ~]$ qfree
PBS total nodes: 668,  free: 6,  busy: 629,  down: 33,  use: 94%
PBS workq nodes: 529,  free: 3,  busy: 317,  queued: 2
PBS checkpt nodes: 656,  free: 1,  busy: 312,  queued: 64
(Highest priority job 729767 on queue checkpt will start in 2:34:14)

shows that there are 6 free nodes in total, available in both queues: checkpt and workq.

showstart for estimating the starting time for a job

The command showstart can be used to get an approximate estimate of the starting time of your job. The basic usage is

showstart job_id

The following shows a simple example:

[apacheco@qb4 ~]$ showstart 729767
job 729767 requires 32 procs for 2:00:00:00

Estimated Rsv based start in                 2:33:25 on Tue Dec 17 11:52:32
Estimated Rsv based completion in         2:02:33:25 on Thu Dec 19 11:52:32

Best Partition: base

Please note that the start time listed above is only an estimate; there is no guarantee that the job will start at that time.

showq to display jobs info within the batch system

The command showq can be used to display job information within the batch system.

[apacheco@qb4 ~]$ showq

active jobs------------------------
JOBID              USERNAME      STATE PROCS   REMAINING            STARTTIME                     

729538              liyuxiu    Running     8     2:11:44  Sat Dec 14 13:31:32
729539              liyuxiu    Running     8     2:11:44  Sat Dec 14 13:31:32
729607               amani1    Running   256     2:32:44  Mon Dec 16 15:52:32
729609               amani1    Running   256     2:51:13  Mon Dec 16 16:11:01
729610               amani1    Running   256     2:51:13  Mon Dec 16 16:11:01
729611               amani1    Running   256     2:51:13  Mon Dec 16 16:11:01
729613               amani1    Running   256     3:05:19  Mon Dec 16 16:25:07
... truncated ...
92 active jobs        5032 of 5064 processors in use by local jobs (99.37%)
                        629 of 633 nodes active      (99.37%)

eligible jobs----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME                     

729767             lsurampu       Idle    32  2:00:00:00  Mon Dec 16 22:54:38                     
729768             lsurampu       Idle    32  2:00:00:00  Mon Dec 16 22:54:38                     
729769             lsurampu       Idle    32  2:00:00:00  Mon Dec 16 22:54:38                     
... truncated ...
16 eligible jobs   

blocked jobs-----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME                     


0 blocked jobs   

Total jobs:  108

To display job information for a particular queue, use the command

showq -w class=<queue name>
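For instance, to list only the jobs in the checkpt queue:

```shell
# Display job information for the checkpt queue only
showq -w class=checkpt
```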
checkjob to display detailed job state information

The command checkjob is used to display detailed information about the job state. This is very useful if your job is remaining in the queued state, and you'd like to see why PBS hasn't executed it:

[apacheco@qb4 ~]$ checkjob 729787.qb2
job 729787

AName: null
State: Idle 
Creds:  user:apacheco  group:loniadmin  account:loni_loniadmin1  class:workq  qos:userres
WallTime:   00:00:00 of 2:00:00
SubmitTime: Tue Dec 17 09:22:14
  (Time Queued  Total: 00:00:14  Eligible: 00:00:06)

NodeMatchPolicy: EXACTNODE
Total Requested Tasks: 32

Req[0]  TaskCount: 32  Partition: ALL  



Flags:          INTERACTIVE
Attr:           INTERACTIVE,checkpoint
StartPriority:  141944
available for 8 tasks     - qb[002,007,376]
rejected for Class        - (null)
rejected for State        - (null)
NOTE:  job req cannot run in partition base (available procs do not meet requirements : 24 of 32 procs found)
idle procs:  32  feasible procs:  24

Node Rejection Summary: [Class: 1][State: 667]

This job cannot be started since it requires 4 nodes (32 procs) but only 3 nodes are available.

qshow to display memory and cpu usage on the node that a job is running on

The command qshow is useful for finding out how the resources on the nodes allocated to your job are being consumed. For example, if a user's job is running slowly due to swapping, this command shows how much memory (physical and virtual) is in use on each processor allocated to the job.

[apacheco@qb4 ~]$ qshow 729731
PBS job: 729731, nodes: 4
Hostname  Days Load CPU U# (User:Process:VirtualMemory:Memory:Hours)
qb373       39 8.93 798 21 lsurampu:mdrun_mpi:88M:31M:10.9 lsurampu:mdrun_mpi:90M:32M:10.9 lsurampu:mdrun_mpi:117M:65M:10.6 lsurampu:mdrun_mpi:89M:31M:10.9 lsurampu:mdrun_mpi:88M:30M:10.9 lsurampu:mdrun_mpi:88M:30M:10.9 lsurampu:mdrun_mpi:89M:31M:10.9 lsurampu:mdrun_mpi:91M:33M:10.9 lsurampu:pbs_demux:3M:0M lsurampu:729731:52M:1M lsurampu:mpirun:52M:1M lsurampu:mpirun_rsh:6M:1M lsurampu:mpispawn:6M:1M
qb368       39 8.99 798 12 lsurampu:mdrun_mpi:89M:40M:10.9 lsurampu:mdrun_mpi:89M:31M:10.9 lsurampu:mdrun_mpi:88M:31M:10.9 lsurampu:mdrun_mpi:89M:32M:10.9 lsurampu:mdrun_mpi:91M:33M:10.9 lsurampu:mdrun_mpi:95M:37M:10.9 lsurampu:mdrun_mpi:91M:33M:10.9 lsurampu:mdrun_mpi:112M:50M:10.9 lsurampu:mpispawn:6M:1M
qb364       39 8.85 800 12 lsurampu:mdrun_mpi:91M:42M:10.9 lsurampu:mdrun_mpi:89M:31M:10.9 lsurampu:mdrun_mpi:93M:35M:10.9 lsurampu:mdrun_mpi:90M:32M:10.9 lsurampu:mdrun_mpi:90M:32M:10.9 lsurampu:mdrun_mpi:89M:31M:10.9 lsurampu:mdrun_mpi:90M:32M:10.9 lsurampu:mdrun_mpi:90M:32M:10.9 lsurampu:mpispawn:6M:1M
qb362       39 8.89 802 12 lsurampu:mdrun_mpi:90M:41M:10.9 lsurampu:mdrun_mpi:89M:31M:10.9 lsurampu:mdrun_mpi:89M:31M:10.9 lsurampu:mdrun_mpi:89M:31M:10.9 lsurampu:mdrun_mpi:112M:51M:10.9 lsurampu:mdrun_mpi:89M:32M:10.9 lsurampu:mdrun_mpi:89M:31M:10.9 lsurampu:mdrun_mpi:89M:31M:10.9 lsurampu:mpispawn:6M:1M
PBS_job=729731 user=lsurampu allocation=loni_poly_mic_1 queue=checkpt total_load=32 cpu_hours=320 wall_hours=10 unused_nodes=0 total_nodes=4 avg_load=8

More detailed information on the Torque PBS commands and on using Moab to schedule and monitor jobs can be found in the Adaptive Computing online documentation.

Back to Top