
Frequently Asked Questions

▶ How to get started quickly

Getting started on LSU systems requires several steps that may vary depending on how experienced you are with HPC systems. The items listed here are not meant to be exhaustive; treat them as solid starting points.

Everyone

Every LONI and HPC system user requires a user account. Some systems also require an allocation account to charge production runs against.

  1. Visit the applications index to see what software is currently installed. Listings by name and by field (e.g. Computational Biology) are provided. Assistance with installing other software is available.
  2. Request a LSU and/or LONI user account. Individuals on the LSU campus have access to both sets of resources.
  3. Be aware that assistance in many forms is always available.

Production efforts on LONI and some LSU resources (e.g. SuperMike-II) require an allocation of system time in units of core-hours (SUs) against which work is charged. This is a no-cost service, but a proposal of one form or another is required. The holder of an allocation then adds users who may charge against it. Note that only faculty and research staff may request allocations (see the allocation policy pages for details).

To request or join an allocation, you must have the appropriate system user account and then visit:

  1. LSU allocation applications (e.g. SuperMike-II).
  2. LONI allocation applications (e.g. QB2, Eric, etc.).

Beginner

See the Training link for the various forms of training that are available, such as: Moodle courses, weekly tutorials (live and recorded past sessions), and workshops.

  1. Learn how to connect to an HPC system (SSH, PuTTY, WinSCP).
  2. Learn basic Linux commands.
  3. Learn how to edit files (vi/vim editor).
  4. Learn about the user shell environment (bash shell).
  5. Learn how to submit jobs (PBS).

Advanced

  1. Learn how to manage data files.
  2. Learn how to control your applications of choice.
  3. Learn how to write shell scripts (i.e. job scripts).
  4. Learn how to install custom software.
  5. Learn to program in one or more languages.

Expert

  1. Learn how to debug software (ddt, totalview).
  2. Learn how to use parallel programming techniques.
  3. Learn how to profile and optimize code (tau).
  4. Learn how to manage source code (svn).
  5. Learn how to automate the build process (make).

▶ LSU HPC System FAQ

What computing resources are available to LSU HPC users?

The following three clusters are in production and open to LSU HPC users:

  • Philip (philip.hpc.lsu.edu)
  • SuperMike-II (mike.hpc.lsu.edu)
  • Pandora (pandora.hpc.lsu.edu)

In addition, users from LSU also have access to computing resources provided by LONI.

Where can I find information on using the LSU HPC systems?

See the LSU HPC Cluster User's Guide for information on connecting to and using the LSU HPC clusters.

Who is eligible for a LSU HPC account?

All faculty and research staff at Louisiana State University Baton Rouge Campus, as well as students pursuing sponsored research activities at LSU, are eligible for a LSU HPC account. Prospective LSU HPC users from outside LSU are required to have a faculty or research staff member at LSU as their Collaborator to sponsor their LSU HPC account.

How can one apply for an account?

Individuals interested in applying for a LSU HPC account should visit this page to begin the account request process. A valid, active institutional email address is required.

Who can sponsor a LSU HPC account or request allocations?

The LSU HPC Resource Allocations Committee requires that the Principal Investigators using LSU HPC resources be restricted to full-time faculty or research staff members located at LSU (Baton Rouge campus).

HPC@LSU welcomes members of other institutions that are collaborating with LSU researchers to use LSU HPC resources, but they cannot be the Principal Investigator requesting allocations or granting access. Adjunct and Visiting professors do not qualify for the Principal Investigator role. They must ask a collaborating full-time professor or research staff member located at LSU to sponsor them and any additional researchers.

How do LSU HPC users change their passwords?

If a LSU HPC password needs to be changed or reset, one must submit a request here to initiate the process. Please note that the email address must be the one used to apply for the LSU HPC account. As with the account request process, you will receive an email which includes a link to the reset form where the new password can be entered. After you confirm your new password, it will not take effect until one of the LSU HPC administrators approves it, which may take up to a few hours. Please do not request another password reset; you will receive an email notifying you once your password reset has been approved.

How do LSU HPC users change their login shell?

LSU HPC users can change their login shell on all LSU HPC clusters by visiting their profile on the LSU HPC website.

Login to your LSU HPC Profile

How can users communicate with the LSU HPC systems staff?

Questions for the LSU HPC systems' administrators should be directed to sys-help@loni.org.

▶ LONI System FAQ

What computing resources are available to LONI users?

LONI has acquired six Dell Intel Linux clusters, each capable of roughly 5 trillion calculations per second (5 teraflops). The following five clusters are in production and open to LONI users, while the remaining cluster is being installed at Southern (pending resolution of environmental issues) and will hopefully be available for use soon.

  • Eric at LSU (eric.loni.org)
  • Oliver at ULL (oliver.loni.org)
  • Louie at Tulane (louie.loni.org)
  • Painter at LaTech (painter.loni.org)
  • Poseidon at UNO (poseidon.loni.org)

Finally, LONI has a large Intel Linux cluster rated at 50 teraflops of theoretical capacity. This central system, named Queen Bee, is housed at the State of Louisiana's Information Services Building (ISB) in downtown Baton Rouge. Queen Bee is open for general use to LONI users.

  • Queen Bee at ISB (queenbee.loni.org)

Where can I find information on using the LONI systems?

See the LONI User's Guide for information on connecting to and using the LONI clusters.

Who is eligible for a LONI account?

All faculty and research staff at a LONI Member Institution, as well as students pursuing sponsored research activities at these facilities, are eligible for a LONI account. Requests for accounts by research associates not affiliated with a LONI Member Institution will be handled on a case by case basis. Prospective LONI users from a non-LONI Member Institution are required to have a faculty or research staff member at one of the LONI Member Institutions as their Collaborator to sponsor their LONI account.

How can one apply for an account?

Individuals interested in applying for a LONI account should visit this page to begin the account request process. A valid, active institutional email address is required.

Who can sponsor a LONI account or request allocations?

LONI provides Louisiana researchers with an advanced optical network and powerful distributed supercomputer resources. The LONI Allocations committee and LONI management require that the Principal Investigators using LONI resources be restricted to full-time faculty or research staff members located at LONI member institutions.

LONI welcomes members of other institutions that are collaborating with Louisiana researchers to use LONI resources, but they cannot be the Principal Investigator requesting allocations or granting access. Adjunct and Visiting professors do not qualify for the Principal Investigator role. They must ask a collaborating full-time professor or research staff member located at a LONI institution to sponsor them and any additional researchers.

How do LONI users change their passwords?

If a LONI password needs to be changed or reset, one must submit a request here to initiate the process. Please note that the email address must be the one used to apply for the LONI account. As with the account request process, you will receive an email which includes a link to the reset form where the new password can be entered. After you confirm your new password, it will not take effect until one of the LONI administrators approves it, which may take up to a few hours. Please do not request another password reset; you will receive an email notifying you once your password reset has been approved.

How do LONI users change their login shell?

LONI users can change their login shell on all LONI clusters by visiting their profile on the LONI website.

Login to your LONI Profile

How can users communicate with the LONI systems staff?

Questions for the LONI systems' administrators should be directed to sys-help@loni.org.

▶ LSU HPC Accounts FAQ

Who is eligible for a LSU HPC account?

All faculty and research staff at Louisiana State University Baton Rouge Campus, as well as students pursuing sponsored research activities at LSU, are eligible for a LSU HPC account. Prospective LSU HPC users from outside LSU are required to have a faculty or research staff member at LSU as their Collaborator to sponsor their LSU HPC account.

How can one apply for an account?

Individuals interested in applying for a LSU HPC account should visit this page to begin the account request process. A valid, active institutional email address is required.

Who can sponsor a LSU HPC account or request allocations?

The LSU HPC Resource Allocations Committee requires that the Principal Investigators using LSU HPC resources be restricted to full-time faculty or research staff members located at LSU (Baton Rouge campus).

HPC@LSU welcomes members of other institutions that are collaborating with LSU researchers to use LSU HPC resources, but they cannot be the Principal Investigator requesting allocations or granting access. Adjunct and Visiting professors do not qualify for the Principal Investigator role. They must ask a collaborating full-time professor or research staff member located at LSU to sponsor them and any additional researchers.

▶ LSU HPC Allocations FAQ

How do I access the LSU HPC Allocations website?

To access LSU HPC Allocations, please log in to your LSU HPC Profile. Once logged in, you should see links on the right sidebar for Balances (to find out the current status and usage of your allocations) and Request Allocations (to join an existing allocation or request a new one).

How do I request a new allocation?

To request a new allocation, you first need to log in to your LSU HPC profile and then click on the "Request Allocation" link in the right sidebar. You will see two links there: "New Allocation" and "Join Allocation". Click on the first link. You will then be presented with the Allocation Request form that you need to fill out. Click the "Submit Request" button after you have completed filling out the form. A LSU HPC Resource Allocation Committee (HPCRAC) member will review your allocation request and let you know of their decision via email. HPC@LSU support staff do not make decisions on allocations. If your allocation request has not been responded to in a timely manner, you can either email the HPCRAC committee member directly or email the help desk, which will forward your message to the appropriate committee member(s).

How do I join an allocation of another PI?

To join an existing allocation of another PI (either your professor, collaborator or our training allocation), you first need to login to your LSU HPC profile and then click on the "Request Allocation" link in the right sidebar. You will see two links there: "New Allocation" and "Join Allocation".

Click on the second link and enter the name, HPC username or email address of the PI whose allocation you wish to join.

If your search is successful, you will be presented with information about the PI you searched for. Click the "Join Projects" button.

You will then be presented with a list of allocations for the PI. Click on the "Join" button for the allocation which you wish to join.

The PI will receive an email asking him/her to confirm adding you to their allocation. HPC@LSU staff do not add users to an existing allocation of a PI; please do not email the support staff to add you to an allocation. You need to do this yourself. If you are planning on attending the HPC Training, please read the instructions that are sent in the training confirmation email.

Who can sponsor a LSU HPC account or request allocations?

The LSU HPC Resource Allocations Committee requires that the Principal Investigators using LSU HPC resources be restricted to full-time faculty or research staff members located at LSU (Baton Rouge campus).

HPC@LSU welcomes members of other institutions that are collaborating with LSU researchers to use LSU HPC resources, but they cannot be the Principal Investigator requesting allocations or granting access. Adjunct and Visiting professors do not qualify for the Principal Investigator role. They must ask a collaborating full-time professor or research staff member located at LSU to sponsor them and any additional researchers.

Where can I get more information?

Please look through the LSU HPC Policy page for more information on allocation types and HPCRAC contact information.

▶ LONI Accounts FAQ

Who is eligible for a LONI account?

All faculty and research staff at a LONI Member Institution, as well as students pursuing sponsored research activities at these facilities, are eligible for a LONI account. Requests for accounts by research associates not affiliated with a LONI Member Institution will be handled on a case by case basis. Prospective LONI users from a non-LONI Member Institution are required to have a faculty or research staff member at one of the LONI Member Institutions as their Collaborator to sponsor their LONI account.

How can one apply for an account?

Individuals interested in applying for a LONI account should visit this page to begin the account request process. A valid, active institutional email address is required.

Who can sponsor a LONI account or request allocations?

LONI provides Louisiana researchers with an advanced optical network and powerful distributed supercomputer resources. The LONI Allocations committee and LONI management require that the Principal Investigators using LONI resources be restricted to full-time faculty or research staff members located at LONI member institutions.

LONI welcomes members of other institutions that are collaborating with Louisiana researchers to use LONI resources, but they cannot be the Principal Investigator requesting allocations or granting access. Adjunct and Visiting professors do not qualify for the Principal Investigator role. They must ask a collaborating full-time professor or research staff member located at a LONI institution to sponsor them and any additional researchers.

How long can I keep my LONI account?

LONI accounts are valid as long as you maintain eligibility for a LONI account. However, the Allocation Management clause from the LONI Allocation policy, which is stated below, takes precedence for maintaining an active LONI account.

User accounts must be associated with a valid allocation, and if not, will be retained for a maximum of 1 year pending authorization against a renewed or different allocation.

▶ LONI Allocations FAQ

How do I access the LONI Allocations website?

To access LONI Allocations, please log in to your LONI Profile. Once logged in, you should see links on the right sidebar for Balances (to find out the current status and usage of your allocations) and Request Allocations (to join an existing allocation or request a new one).

How do I request a new allocation?

To request a new allocation, you first need to log in to your LONI profile and then click on the "Request Allocation" link in the right sidebar. You will see two links there: "New Allocation" and "Join Allocation". Click on the first link. You will then be presented with the Allocation Request form that you need to fill out. Click the "Submit Request" button after you have completed filling out the form. A LONI Resource Allocation Committee (LRAC) member will review your allocation request and let you know of their decision via email. HPC@LSU support staff do not make decisions on allocations. If your allocation request has not been responded to in a timely manner, you can either email the LRAC committee member directly or email the help desk, which will forward your message to the appropriate committee member(s).

How do I join an allocation of another PI?

To join an existing allocation of another PI (either your professor, collaborator or our training allocation), you first need to log in to your LONI profile and then click on the "Request Allocation" link in the right sidebar. You will see two links there: "New Allocation" and "Join Allocation".

Click on the second link and enter the name, LONI username or email address of the PI whose allocation you wish to join.

If your search is successful, you will be presented with information about the PI you searched for. Click the "Join Projects" button.

You will then be presented with a list of allocations for the PI. Click on the "Join" button for the allocation which you wish to join.

The PI will receive an email asking him/her to confirm adding you to their allocation. HPC@LSU staff do not add users to an existing allocation of a PI; please do not email the support staff to add you to an allocation. You need to do this yourself. If you are planning on attending the HPC Training, please read the instructions that are sent in the training confirmation email.

Who can sponsor a LONI account or request allocations?

LONI provides Louisiana researchers with an advanced optical network and powerful distributed supercomputer resources. The LONI Allocations committee and LONI management require that the Principal Investigators using LONI resources be restricted to full-time faculty or research staff members located at LONI member institutions.

LONI welcomes members of other institutions that are collaborating with Louisiana researchers to use LONI resources, but they cannot be the Principal Investigator requesting allocations or granting access. Adjunct and Visiting professors do not qualify for the Principal Investigator role. They must ask a collaborating full-time professor or research staff member located at a LONI institution to sponsor them and any additional researchers.

Where can I get more information?

Please look through the LONI Policy page for more information on allocation types and LRAC contact information.

▶ Are LONI or LSU HPC Systems Backed Up?

No. Users are responsible for backing up their own files.

For small sets of important files (not large data files!) subversion may be a good way to manage versioned copies. The Subversion client is available on all clusters. To learn more about using Subversion, please see the Subversion tutorial in the LONI Moodle course on Software Development Tools.
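
As a minimal sketch (the repository URL and file names below are placeholders, not LONI-provided values), a basic Subversion workflow for a small set of source files might look like:

$ svn checkout https://svn.example.org/repos/myproject/trunk myproject
$ cd myproject
# ... edit files ...
$ svn status                          # review local changes
$ svn add newfile.c                   # only needed for files not yet under version control
$ svn commit -m "Describe the change" # record the change in the repository
$ svn update                          # bring the working copy up to date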

▶ How to unsubscribe from LONI and/or LSU HPC mailing lists?

All LONI users are required to be subscribed to the LONI users mailing list, while LSU HPC users are required to be subscribed to the HPC users mailing list. The mailing lists provide a medium for HPC@LSU admins to disseminate important information regarding cluster status, upcoming downtime, training, workshops, etc. To be unsubscribed from the LONI or LSU HPC mailing list, users have to disable their LONI or LSU HPC accounts.

▶ How to Login to the Clusters?

Utilities

Interactive Utilities

Only ssh access is allowed for interactive access. One would issue a command similar to the following:

 
LSU HPC: ssh -X -Y username@philip.hpc.lsu.edu
LONI: ssh -X -Y username@eric.loni.org

The user will then be prompted for their password. The -X -Y flags allow X11 forwarding to be set up automatically.

For a Windows client, please look at the PuTTY utility.

Accessibility

On Campus

All host institution networks should be able to connect directly to any LONI machine, since a connection to the Internet2 network is available.

At Home

All LONI machines except QB2 are accessible only via Internet2 networks. This means that one will most likely not be able to connect directly to a LONI machine from home. To access LONI machines from home, a user should use one of the following methods:

  1. Preferred Method: Log in to a machine on the Internet2 network; all machines at your host institution are on the Internet2 network.
  2. Not Tested: Connect using your host institution's VPN.
    • LSU Faculty, Staff and Students can download VPN Client software from Tigerware
    • Others, please contact your IT department for VPN Client software.

▶ How to setup your environment with softenv?

The information here is applicable to LSU HPC and LONI systems.

Shells

A user may choose between using /bin/bash and /bin/tcsh. Details about each shell follow.

/bin/bash

System resource file: /etc/profile

When one accesses the shell, the following user files are read in if they exist (in order):

  1. ~/.bash_profile (anything sent to STDOUT or STDERR will cause things like rsync to break)
  2. ~/.bashrc (interactive login only)
  3. ~/.profile

When a user logs out of an interactive session, the file ~/.bash_logout is executed if it exists.

The default value of the environmental variable, PATH, is set automatically using SoftEnv. See below for more information.
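
For example, a minimal ~/.bash_profile (illustrative only, not a site-provided file) that sources ~/.bashrc and avoids writing to STDOUT in non-interactive sessions might look like:

# ~/.bash_profile -- read by bash login shells
# Pull in ~/.bashrc so interactive settings also apply to login shells
if [ -f ~/.bashrc ]; then
    . ~/.bashrc
fi

# Print messages only in interactive shells; unconditional output here
# can break non-interactive tools such as rsync and scp
if [[ $- == *i* ]]; then
    echo "Welcome to $(hostname)"
fi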

/bin/tcsh

The file ~/.cshrc is used to customize the user's environment if his login shell is /bin/tcsh.

Softenv

SoftEnv is a utility that helps users manage complex user environments with potentially conflicting application versions and libraries.

System Default Path

When a user logs in, the system /etc/profile or /etc/csh.cshrc (depending on login shell, and mirrored from csm:/cfmroot/etc/profile) calls /usr/local/packages/softenv-1.6.2/bin/use.softenv.sh to set up the default path via the SoftEnv database.

SoftEnv looks for a user's ~/.soft file and updates the variables and paths accordingly.

Viewing Available Packages

The command softenv will provide a list of available packages. The listing will look something like:

$ softenv
These are the macros available:
*   @default
These are the keywords explicitly available:
+amber-8                       Applications: 'Amber', version: 8 Amber is a
+apache-ant-1.6.5              Ant, Java based XML make system version: 1.6.
+charm-5.9                     Applications: 'Charm++', version: 5.9 Charm++
+default                       this is the default environment...nukes /etc/
+essl-4.2                      Libraries: 'ESSL', version: 4.2 ESSL is a sta
+gaussian-03                   Applications: 'Gaussian', version: 03 Gaussia
... some stuff deleted ...
Managing SoftEnv

The file ~/.soft in the user's home directory is where the different packages are managed. Add the +keyword into your .soft file. For instance, if one wants to add the Amber Molecular Dynamics package to their environment, the end of the .soft file should look like this:

+amber-8

@default

To update the environment after modifying this file, one simply uses the resoft command:

% resoft

The command soft can be used to manipulate the environment from the command line. It takes the form:

$ soft add/delete +keyword

Using this method of adding or removing keywords requires the user to pay attention to possible order dependencies. That is, best results require the user to remove keywords in the reverse order in which they were added. It is handy to test out individual keys, but can lead to trouble if changing multiple keys. Changing the .soft file and issuing the resoft is the recommended way of dealing with multiple changes.
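
For instance (the keyword is taken from the listing above; "sander" is used here only as a hypothetical Amber executable name), adding and later removing a package from the command line would look like:

$ soft add +amber-8      # add Amber's paths to the current environment
$ which sander           # the Amber executable should now be found on PATH
$ soft delete +amber-8   # remove it again, in reverse order of addition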

▶ How to setup your environment with module?

The information here is applicable to LSU HPC and LONI systems.

Shells

A user may choose between using /bin/bash and /bin/tcsh. Details about each shell follow.

/bin/bash

System resource file: /etc/profile

When one accesses the shell, the following user files are read in if they exist (in order):

  1. ~/.bash_profile (anything sent to STDOUT or STDERR will cause things like rsync to break)
  2. ~/.bashrc (interactive login only)
  3. ~/.profile

When a user logs out of an interactive session, the file ~/.bash_logout is executed if it exists.

The default value of the environmental variable, PATH, is set automatically using SoftEnv. See below for more information.

/bin/tcsh

The file ~/.cshrc is used to customize the user's environment if his login shell is /bin/tcsh.

Modules

Modules is a utility which helps users manage the complex business of setting up their shell environment in the face of potentially conflicting application versions and libraries.

Default Setup

When a user logs in, the system looks for a file named .modules in their home directory. This file contains module commands to set up the initial shell environment.
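
As a sketch (module names are taken from the listing below; the defaults appropriate for your work may differ), a ~/.modules file might contain something like:

# ~/.modules -- module commands executed at login
module load INTEL/14.0.2
module load INTEL-140-MVAPICH2/2.0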

Viewing Available Modules

The command

$ module avail

displays a list of all the modules available. The list will look something like:

--- some stuff deleted ---
velvet/1.2.10/INTEL-14.0.2
vmatch/2.2.2

---------------- /usr/local/packages/Modules/modulefiles/admin -----------------
EasyBuild/1.11.1       GCC/4.9.0              INTEL-140-MPICH/3.1.1
EasyBuild/1.13.0       INTEL/14.0.2           INTEL-140-MVAPICH2/2.0
--- some stuff deleted ---

The module names take the form appname/version/compiler, providing the application name, the version, and information about how it was compiled (if needed).

Managing Modules

Besides avail, there are other basic module commands to use for manipulating the environment. These include:

add/load mod1 mod2 ... modn . . . Add modules
rm/unload mod1 mod2 ... modn  . . Remove modules
switch/swap mod1 mod2 . . . . . . Replace one loaded module with another
display/show mod1 ... modn  . . . Show the changes a module makes to the environment
list  . . . . . . . . . . . . . . List modules currently loaded in the environment
avail . . . . . . . . . . . . . . List available module names
whatis mod1 mod2 ... modn . . . . Describe listed modules

The -h option to module will list all available commands.
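
For example (module names taken from the listing above), a typical session might look like:

$ module load GCC/4.9.0                  # add the GCC 4.9.0 module
$ module swap GCC/4.9.0 INTEL/14.0.2     # replace it with the Intel compiler module
$ module whatis INTEL/14.0.2             # brief description of the module
$ module unload INTEL/14.0.2             # remove it again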

Module is currently available only on SuperMIC.

▶ Cluster email?

Electronic mail reception is not supported on the clusters. In general, email sent to user accounts on the clusters will not be received or delivered.

Email may be sent from the clusters to facilitate remote notification of changes in batch job status. Simply include your email address in the appropriate script field.
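
For PBS jobs, for example, the relevant script fields are the -M and -m directives (the address below is a placeholder):

#PBS -M someuser@example.edu   # address to notify
#PBS -m abe                    # send mail on abort (a), begin (b), and end (e)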

▶ What is my Disk Space Quota on LONI Clusters?

Dell Linux clusters

Home Directory

For all LONI Linux clusters, the /home quota is 5 GB. Files can be stored on /home permanently, which makes it an ideal place for your source code and executables. However, it is not a good idea to use /home for batch job I/O.

Work (Scratch) Directory

For all LONI 5TF Linux clusters, the quota on /work (/scratch) is 100GB. Please note that the scratch space should only be used to store output files during job execution, and by no means for long term storage. An emergency purge may be carried out without advance notice when disk usage approaches full capacity.

For Queen Bee, no quota is enforced on /work but we do enforce a 30 days purging policy, which means that any files that have not been accessed for the last 30 days will be permanently deleted.

Checking One's Quota

Issuing the following command,

$ showquota

results in output similar to:

Disk quotas for user lyan1 (uid 24106): 
    Filesystem    MB used       quota
    /home             105        5000
    /work             606      100000
Notes on Number of Files in a Directory

All users should limit the number of files that they store in a single directory to < 1000. Large numbers of files stored within a single directory can severely degrade performance, negatively impacting the experience of all individuals using that filesystem.
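
A quick way to check how many entries a directory holds (the path is a placeholder):

$ ls -1 /path/to/directory | wc -l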

▶ File Storage on LONI Clusters

Home Directory

For all LONI Linux clusters, the /home file system quota is 5 GB. Files can be stored on /home permanently, which makes it an ideal place for your source code and executables. The /home file system is meant for interactive use such as editing and active code development. Do not use /home for batch job I/O.

Work (Scratch) Directory

The /work volume on all LONI clusters is meant for the input and output of executing batch jobs and not for long term storage. We expect files to be copied to other locations or deleted in a timely manner, usually within 30-120 days. For performance reasons on all volumes, our policy is to limit the number of files per directory to around 10,000 and total files to about 500,000.

For all LONI 5TF Linux clusters, the quota on /work is 100GB. One can apply for a larger quota if needed. These requests will be evaluated on a case by case basis. The work volumes on the 5TF clusters are not purged automatically.

For Queen Bee, no quota is enforced on /work but we do enforce a 30 days purging policy, which means that any files that have not been accessed for the last 30 days will be permanently deleted. An email message will be sent out weekly to users targeted for a purge informing them of their /work utilization.

Please do not try to circumvent the removal process by date changing methods. We expect most files over 30 days old to disappear. If you try to circumvent the purge process, this may lead to access restrictions to the /work volume or the cluster.

Please note that the /work volume is not unlimited. Please limit your usage rate to a reasonable amount. When the utilization of /work is over 80%, a 14 day purge may be performed on users using more than 2 TB or having more than 500,000 files. Should disk space become critically low, all files not accessed in 14 days will be purged, or even more drastic measures may be taken if needed. Users using the largest portions of the /work volume will be contacted when problems arise and will be expected to take action to help resolve the issues.

The current version of Lustre has problems deleting a large number of files quickly with a rm -rf. Use the scripts purge or rmpurge to delete files and directory trees to prevent locking up the file system.

Project Directory

For Queen Bee, there is also a project volume. To obtain a directory on this volume, an allocation request needs to be made. Allocations on this volume are for a limited time, usually 6 months. This volume uses quotas and is not automatically purged. An allocation of 100 GB can be obtained easily by any user, but greater allocations require justification and a higher level of approval. Any allocation over 1 TB requires approval from the LONI allocations committee, which meets every 3 months. Since this is a limited resource, approval is also based on availability.

Checking One's Quota

Issuing the following command,

$ showquota

results in output similar to:

 Disk quotas for user lyan1 (uid 24106): 
     Filesystem    MB used       quota
     /home             105        5000
     /work             606      100000

▶ Disk/Storage Problems?

Oftentimes, full disk partitions are the cause of many strange problems. Sometimes the error messages do not indicate an unwritable or unreadable disk, but the possibility should be investigated anyway.

Checking the Filesystems

Using the df command, a user may get a global view of the file system. df provides information such as the raw device name, mount point, available disk space, current usage, etc. For example,

 % df
 Filesystem    512-blocks      Free %Used    Iused %Iused Mounted on
 /dev/hd4          524288    456928   13%     3019     6% /
 /dev/hd2         6291456    936136   86%    38532    26% /usr
 /dev/hd9var      2097152   1288280   39%      892     1% /var
 /dev/hd3          524288    447144   15%     1431     3% /tmp
 /proc                  -         -    -         -     -  /proc
 /dev/hd10opt     1048576    297136   72%    11532    26% /opt
 /dev/sni_lv       524288    507232    4%       40     1% /var/adm/sni
 /dev/sysloglv     524288    505912    4%      635     2% /syslog
 /dev/scratchlv  286261248 131630880   55%   112583     1% /mnt/scratch
 /dev/globuslv    4194304   1337904   69%    23395    14% /usr/local/globus
 l1f1c01:/export/home   52428800  34440080   35%   109684     2% /mnt/home

The du program will calculate the amount of disk space being used in the specified directory. This is useful if a user needs to find an offending directory or file. It is often the case that a user exceeds their quota because of a small number of large files or directories.
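
For example, to see which subdirectories of your home directory consume the most space (GNU du and sort assumed, as found on the Linux clusters):

$ du -h --max-depth=1 ~ | sort -h      # per-subdirectory usage, smallest to largest
$ du -sh /path/to/some/directory       # total usage of a single directory (path is a placeholder)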

Potential Issues

The Partition in Question is Full

Ensure that the directory that you are trying to write in is not full.

/tmp is Full

A lot of programs use this partition to dump files, but very few clean up after themselves. Ensure that this directory is not full, because oftentimes a user has no idea that the application they are using touches /tmp. A user should contact SYS-Help if this is found to be the case and the offending data is not their own.

/var is Full

A lot of system programs use this partition to store data, but very few use the space efficiently or clean up after themselves. Ensure that this directory is not full. A user should contact SYS-Help if this is found to be the case.

The User Quota Has Been Exceeded

A user should ensure that they have not exceeded their quota. To check one's quota status, one must use the quota -v command.

  $ quota -v
  Disk quotas for user estrabd (uid 1238):
  Filesystem      blocks     quota    limit grace files   quota    limit grace
  /mnt/home        77804    500000   550000        1105    5000     5500
  /mnt/work.nfs101 77084  20000000 20500000         365  900000  1000000

This will show if they are over quota. If so, delete some files and try submitting again.

▶ How to Share Directories for Collaborative Work?

Introduction

Often, users collaborating on a project wish to share their files with others in a common set of directories. Unfortunately, many decide to use a single login username as a group user. For security reasons, the sharing of accounts is explicitly forbidden by LONI and HPC@LSU policy, and such practices may result in the suspension of access for all involved.

There are two approaches to sharing files. The first method restricts access to read-only, but may be set up by anyone. The caveat is that any user on the system will be able to read the files, not just the intended collaborators. To achieve this, simply make sure that all directories in the path have read/execute permission for the group, and that the files have read permission for the group.

Set group read/execute permission with a command like:

  $ chmod 750 dirname

  or

  $ chmod g+rx dirname

Likewise, group read permission on a file can be set with:

  $ chmod 640 filename

  or

  $ chmod g+r filename

Note that the numerical method is preferred.

The second method allows a group of collaborators to have full access to files: read, write, and/or execute. This requires that a PI-qualified individual apply for a /project storage allocation. As part of the application, request that a user group be created to allow shared access and provide a list of user names to be included in the group. Members of the group can then place files in the /project directory and change group ownership to the storage group. Other members of the group will then be able to access the files, including changing permissions.
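
As a sketch (the group name my_proj_group and the directory path are placeholders; the actual names are assigned when the /project allocation is created), sharing files through a storage group might look like:

$ cp -r results /project/my_proj_group/
$ chgrp -R my_proj_group /project/my_proj_group/results   # hand the files to the shared group
$ chmod -R g+rwX /project/my_proj_group/results           # group read/write, execute on directories
$ chmod g+s /project/my_proj_group/results                # new files inherit the group ownership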

Not every machine has /project space, but those that do allow allocations to be applied for via the appropriate allocation request pages.

For advice or additional information, contact sys-help@loni.org


▶ How to transfer Files?

This page discusses methods by which you may get files on to and off of the various LONI resources.

What is NOT Available

  • FTP (file transfer protocol, and not to be confused with SFTP)
  • rcp (remote copy)

Selecting the Right Method

Knowing what one wants to transfer will help determine the best tool for the job.

For example, transferring single, unique files should be done using scp or sftp. When transferring hierarchical data, such as a directory and its member files, one should use rsync. This is especially true when an older copy of the hierarchical data already exists on the destination side. rsync is able to detect what has changed between the source and destination data, and saves time by transmitting only those changes.

Be Careful!

Any tool may clobber (destroy) a source file if it is told to copy a file to the same location. For example, if file A on host 1 is transferred to file A on host 1, file A will be overwritten. In most circumstances, this results in a source file of 0 bytes, effectively deleting it.

For example, the following will clobber myfile.txt on host1.

user@host1> scp ~/myfile.txt host1:~

Before transferring any file, one should note which host they are on so that a file is not accidentally copied onto itself.

Useful Tools and Methods

This section describes what tools and methods are available for transferring files, and it outlines the trade offs associated with each method. It will be noted when a particular method is not available for use on a particular LONI resource.

rsync Over SSH (preferred)

rsync is an extremely powerful program; it can synchronize entire directory trees, only sending data about files that have changed. That said, it is rather picky about the way it is used. The rsync man page has a great deal of useful information, but the basics are explained below.

Single File Synchronization

To synchronize a single file via rsync, use the following:

To send a file:

% rsync --rsh=ssh --archive --stats --progress localfile \
        username@remotehost:/destination/dir/or/filename

To receive a file:

% rsync --rsh=ssh --archive --stats --progress \
        username@remotehost:/remote/filename localfilename

Note that --rsh=ssh is not necessary with newer versions of rsync, but older installs will default to using rsh (which is not generally enabled on modern OSes).

Directory Synchronization

To synchronize an entire directory, use the following:

To send a directory:

% rsync --rsh=ssh --archive --stats --progress localdir/ \
        username@remotehost:/destination/dir/ 

or

% rsync --rsh=ssh --archive --stats --progress localdir \
        username@remotehost:/destination 

To receive a directory:

% rsync --rsh=ssh --archive --stats --progress \
        username@remotehost:/remote/directory/ /some/localdirectory/

or

% rsync --rsh=ssh --archive --stats --progress \
        username@remotehost:/remote/directory /some/

Note the difference with the slashes. The second command will place the files in the directory /destination/localdir; the fourth will place them in the directory /some/directory. rsync is very particular about the placement of slashes. Before running any significant rsync command, add --dry-run to the parameters. This will let rsync show you what it plans on doing without actually transferring the files.

Synchronization with Deletion

This is very dangerous; a single mistyped character may blow away all of your data. Do not synchronize with deletion if you aren't absolutely certain you know what you're doing.

To have directory synchronization delete files on the destination system that don't exist on the source system:

% rsync --rsh=ssh --archive --stats --dry-run --progress \
        --delete localdir/ username@remotehost:/destination/dir/

Note that the above command will not actually delete (or transfer) anything; the --dry-run must be removed from the list of parameters to actually have it work.

SCP

Using scp is the easiest method to use when transferring single files.

Local File to Remote Host
% scp localfile user@remotehost:/destination/dir/or/filename
Remote Host to Local File
% scp user@remotehost:/remote/filename localfile
No-Payload-Encryption SCP

This is a modified scp utility that saves time by encrypting only the authentication exchange with the remote host. The actual payload (the files being transferred) is not encrypted, unlike with the traditional scp utility. This will only work when both the sending and receiving systems have the HPN-enabled SSH installed.

At this time, this tool is only available on the IBM 575 clusters, and can be used by replacing scp with npescp. Do not use this if the security of the data is critical; it will go over the wire unencrypted.

% npescp localfile user@remotehost:/destination/dir/or/filename
SFTP
Interactive Mode

One may find this mode very similar to the interactive interface offered by traditional ftp client programs. A login session may look similar to the following:

% sftp user@remotehost
(enter in password)
 ...
sftp>

The commands are similar to those offered by the outmoded ftp client programs: get, put, cd, pwd, lcd, etc. For more information on the available set of commands, one should consult the sftp man page.

% man sftp
Batch Mode

One may use sftp non-interactively in two ways.

Case 1: Pull a remote file to the local host.

% sftp user@remotehost:/remote/filename localfilename

Case 2: Creating a special sftp batch file containing the set of commands one wishes to execute without any interaction.

% sftp -b batchfile user@remotehost

Additional information on constructing a batch file is available in the sftp man page.
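
As a small sketch (file names are placeholders), a batch file and the corresponding invocation might look like:

$ cat batchfile
cd /remote/data
get results.tar.gz
put input.dat
bye

$ sftp -b batchfile user@remotehost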

Client Software

scp and sftp
Standard Clients

The command-line scp and sftp tools come with any modern distribution of OpenSSH; this is generally installed by default on modern Linux, UNIX, and Mac OS X installs.

Windows Clients

Windows clients include:

  • pscp & psftp (PuTTY-related command line utilities), and

  • scp, sftp, & rsync as provided by Cygwin.

For additional clients, please see http://www.openssh.com/windows.html.

Mac OS Clients

See http://www.openssh.com/macos.html.

rsync

The command-line rsync application is the only widely-available client for the protocol. It comes with most modern Linux, Unix, and Mac OS X distributions, and other versions (plus source) can be downloaded from the official website.

Advanced Methods for Raw Speed

While the methods above will suffice for all but the most time-sensitive file transfers, and are considerably simpler than the alternatives, users who have further needs can contact the systems administrators to work out potential alternatives.

▶ How to Install Utilities and Libraries?

Typically, a user has the permissions to compile and "install" their own libraries and applications. Obviously, a non-root user cannot write to the protected system library directories, but there is enough space in their home and work directories to store such tools and libraries.
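
As a sketch (the package name and paths are placeholders), installing an autotools-based package into one's home directory typically looks like:

$ tar xzf mytool-1.0.tar.gz
$ cd mytool-1.0
$ ./configure --prefix=$HOME/packages/mytool-1.0
$ make
$ make install

# then put the install location on your PATH, e.g. in your shell startup file
$ export PATH=$HOME/packages/mytool-1.0/bin:$PATH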

▶ How To Background and Distribute Unrelated Processes?

Introduction

All compute nodes have more than one core/processor, so even if a job is not explicitly parallel (using OpenMP or MPI), it is still beneficial to be able to launch multiple processes in a single submit script. This document will briefly explain how to launch multiple processes on a single compute node and across 2 or more compute nodes.

Note this method does NOT facilitate communication among processes, and is therefore not appropriate for use with parallelized executables using MPI; it is still okay to invoke multithreaded executables because they are invoked initially as a single process. The caveat is that one should not use more threads than there are cores on a single compute node.

Basic Use Case

A user has a serial executable that they wish to run multiple times on LONI/LSU HPC resources, and wants to run many instances per submission script. Instead of submitting one queue script per serial execution, and to avoid leaving idle processors on one or more of the many-core compute nodes, the user launches one serial process per available core.

Required Tools

  1. the shell (bash)
  2. ssh (for distributing processes to remote compute nodes)

Launching Multiple Processes - Basic Example

On a single compute node

This example assumes one knows the number of available processors on a single node. The following example is a bash shell script that launches each process and backgrounds it using the & symbol.

#!/bin/bash
 
# -- the following 8 lines issue a command, using a subshell, 
# into the background this creates 8 child processes, belonging 
# to the current shell script when executed

/path/to/exe1 & # executable/script arguments may be passed as expected 
/path/to/exe2 & 
/path/to/exe3 & 
/path/to/exe4 & 
/path/to/exe5 & 
/path/to/exe6 & 
/path/to/exe7 & 
/path/to/exe8 & 
 
# -- now WAIT for all 8 child processes to finish
# this will make sure that the parent process does not
# terminate, which is especially important in batch mode

wait
On multiple compute nodes

Distributing processes onto remote compute nodes builds upon the single node example. In the case of wanting to use multiple compute nodes, one can use the 8 commands from above for the mother superior node (i.e., the "home" node, which will be the compute node that the batch scheduler uses to execute the shell commands contained inside of the queue script). For the remote nodes, one must use ssh to launch the command on the remote host.

#!/bin/bash
 
# Define where the input files are
export WORKDIR=/path/to/where/i/want/to/run/my/job

# -- the following 8 lines issue a command, using a subshell, 
# into the background this creates 8 child processes, belonging 
# to the current shell script when executed
 
# -- for mother superior, or "home", compute node
/path/to/exe1 & 
/path/to/exe2 & 
/path/to/exe3 & 
/path/to/exe4 & 
/path/to/exe5 & 
/path/to/exe6 & 
/path/to/exe7 & 
/path/to/exe8 & 

# -- for an additional, remote compute node
# Assuming executable is to be run on WORKDIR
ssh -n remotehost 'cd '$WORKDIR'; /path/to/exe1 ' & 
ssh -n remotehost 'cd '$WORKDIR'; /path/to/exe2 ' & 
ssh -n remotehost 'cd '$WORKDIR'; /path/to/exe3 ' & 
ssh -n remotehost 'cd '$WORKDIR'; /path/to/exe4 ' & 
ssh -n remotehost 'cd '$WORKDIR'; /path/to/exe5 ' & 
ssh -n remotehost 'cd '$WORKDIR'; /path/to/exe6 ' & 
ssh -n remotehost 'cd '$WORKDIR'; /path/to/exe7 ' & 
ssh -n remotehost 'cd '$WORKDIR'; /path/to/exe8 ' & 
 
# -- now WAIT for all 8 child processes to finish
# this will make sure that the parent process does not
# terminate, which is especially important in batch mode

wait

The example above will spawn 16 commands, 8 of which run on the local compute node (i.e., mother superior) and 8 of which run on remotehost. The background token (&) backgrounds the command on the local node (for all 16 commands); the commands sent to remotehost are not backgrounded remotely because this would not allow the local machine to know when the local command (i.e., ssh) completed.

Note that, since one often does not know the identity of the mother superior node or the set of remote compute nodes (i.e., remotehost) when submitting such a script to the batch scheduler, some more programming must be done to determine the identity of these nodes at runtime. The following section considers this and concludes with a basic, adaptive example.

Advanced Example for PBS Queuing Systems

The steps involved in the following example are:

  1. determine identity of mother superior
  2. determine list of all remote compute nodes

Assumptions:

  1. shared file system
  2. a list of all compute nodes assigned by PBS are contained in a file referenced with the environmental variable, ${PBS_NODEFILE}
  3. each compute node has 8 processing cores

Note this example still requires that one knows the number of cores available per compute node; in this example, 8 is assumed.

#!/bin/bash

# Define where the input files are
export WORKDIR=/path/to/where/i/want/to/run/my/job

# -- Get List of Unique Nodes and put into an array ---
NODES=($(uniq $PBS_NODEFILE ))
# -- Assuming all input files are in the WORKDIR directory (change accordingly)
cd $WORKDIR
# -- The first array element contains the mother superior
#  while the second onwards contains the worker or remote nodes --
# -- for mother superior, or "home", compute node
/path/to/exe1 &
/path/to/exe2 &
/path/to/exe3 &
/path/to/exe4 &
/path/to/exe5 &
/path/to/exe6 &
/path/to/exe7 &
/path/to/exe8 &

# -- Launch 8 processes on the first remote node --
ssh ${NODES[1]} ' \
    cd '$WORKDIR'; \
    /path/to/exe1 & \
    /path/to/exe2 & \
    /path/to/exe3 & \
    /path/to/exe4 & \
    /path/to/exe5 & \
    /path/to/exe6 & \
    /path/to/exe7 & \
    /path/to/exe8 & \
    wait' &

# Repeat above script for other remote nodes
# Remember that bash arrays start from 0 not 1 similar to C Language
# You can also use a loop over the remote nodes
#NUMNODES=$(uniq $PBS_NODEFILE | wc -l | awk '{print $1-1}')
# for i in $(seq 1 $NUMNODES ); do
#    ssh -n ${NODES[$i]} ' launch 8 processors ' &
# done
 
# -- now WAIT for all child processes to finish
# this will make sure that the parent process does not
# terminate, which is especially important in batch mode

wait
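
For instance (paths are placeholders and 8 cores per node are assumed, as above), the commented loop near the end of the script could be written out as follows, replacing the per-node ssh blocks:

# -- Loop over the remote nodes; array index 0 is the mother superior --
NUMNODES=$(uniq $PBS_NODEFILE | wc -l | awk '{print $1-1}')
for i in $(seq 1 $NUMNODES); do
    ssh -n ${NODES[$i]} 'cd '$WORKDIR'; \
        /path/to/exe1 & /path/to/exe2 & /path/to/exe3 & /path/to/exe4 & \
        /path/to/exe5 & /path/to/exe6 & /path/to/exe7 & /path/to/exe8 & \
        wait' &
done

# -- wait for the local processes and all of the ssh commands to finish
wait
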
Submitting to PBS

Now let's assume you have 128 tasks to run. You can do this by running on 16 8-core nodes using PBS. If one task takes 7 hours, and allowing a 30 minute safety margin, the following qsub command line will take care of running all 128 tasks:

% qsub -I -A allocation_account -V -l walltime=07:30:00,nodes=16:ppn=8 

When the script executes, it will find 16 node names in its PBS_NODEFILE, run 1 set of 8 tasks on the mother superior, and 15 sets of 8 on the remote nodes.

More information on submitting to the PBS queue can be accessed at the frequently asked questions page.

Advanced Usage Possibilities

1. executables in the above script can take normal arguments and flags; e.g.:

/path/to/exe1 arg1 arg2 -flag arg3 & 

2. one can, technically, initiate multithreaded executables; the catch is to make sure that only one thread per processing core is allowed. The following example launches 4 two-threaded executables (presumably using OpenMP) locally, thus consuming all 8 assumed cores (4 processes * 2 threads/process).

# "_r" simply denotes that executable is multithreaded 
OMP_NUM_THREADS=2 /path/to/exe1_r & 
OMP_NUM_THREADS=2 /path/to/exe2_r & 
OMP_NUM_THREADS=2 /path/to/exe3_r & 
OMP_NUM_THREADS=2 /path/to/exe4_r &

# -- in total, 8 threads are being executed on the 8 (assumed) cores
# contained by this compute node; i.e., there are 4 processes, each 
# with 2 threads running.

A creative user can get even more complicated, but in general there is no need.

Conclusion

In situations where one wants to take advantage of all processors available on a single compute node or a set of compute nodes to run a number of unrelated processes, backgrounding processes locally (and, via ssh, remotely) allows one to do this.

Converting the above scripts to csh and tcsh is straightforward.

Using the examples above, one may create a customized solution using the proper techniques. For questions and comments regarding this topic and the examples above, please email sys-help@loni.org.

▶ How to run hybrid MPI and OpenMP jobs?

Combining MPI and OpenMP in a program can provide high parallel efficiency. For most hybrid codes, OpenMP threads are spread within one MPI task. This kind of hybrid code is widely used in many fields of science and technology. A sample bash shell script for running hybrid jobs on LSU HPC clusters is provided below.

#!/bin/bash
#PBS -A my_allocation_code
#PBS -q workq
#PBS -l walltime=00:10:00
#PBS -l nodes=2:ppn=16        # ppn=16 for SuperMike and ppn=20 for SuperMIC
#PBS -V                       # make sure environments are the same for assigned nodes
#PBS -N my_job_name           # will be shown in the queue system
#PBS -o my_name.out           # normal output
#PBS -e my_name.err           # error output

export TASK_PER_NODE=2        # number of MPI tasks per node
export THREADS_PER_TASK=8     # number of OpenMP threads per MPI task

cd $PBS_O_WORKDIR             # go to the path where you run qsub

# Get the node names from the PBS node file and gather them to a new file
cat $PBS_NODEFILE|uniq>nodefile   

# Run the hybrid job.
# Use "-x OMP_NUM_THREADS=..." to make sure that 
# the number of OpenMP threads is passed to each MPI task
mpirun -npernode $TASK_PER_NODE -x OMP_NUM_THREADS=$THREADS_PER_TASK -machinefile nodefile ./my_exe

Non-uniform memory access (NUMA) is a common issue when running hybrid jobs; it arises because each node has two CPU sockets. In theory, parallel efficiency is highest when the number of OpenMP threads equals the number of cores per socket, which is 8 for SuperMike and 10 for SuperMIC. In practice it varies from case to case depending on the user's code.

▶ How to achieve ppn=N?

Achieving ppn=N

When a job is submitted, the setting of processes per node must match certain fixed values depending on the cluster: ppn=8 on Philip, ppn=16 on SuperMike, ppn=20 on SuperMIC and QB2, and ppn=4 on all other LSU and LONI Dell clusters. A few may have a single queue, which allows ppn=1. However, there is a way for a user to achieve ppn=N, where N is an integer value between 1 and the usual required setting on any cluster. This involves generating a new machine/host file, with N entries per node, to use with mpirun.

The machine/host file created for a job has its path stored in the shell variable PBS_NODEFILE. For each node requested, the node host name is repeated ppn times in the machine file. Asking for 16 nodes on QB2, for instance, results in 320 names appearing in the machine file (i.e. 20 copies of each of the 16 node names, since ppn=20). With a little script magic, one can pull out all the unique node names from this file and put them in a new file with only N entries for each node, then present this new file to mpirun. The following is a simple example script which does just that. But it is important to keep in mind that you will be charged SUs based on ppn=20, since only entire nodes are allocated to jobs.

 #!/bin/bash
 #PBS -q checkpt
 #PBS -A my_alloc_code
 #PBS -l nodes=16:ppn=8
 #PBS -l cput=2:00:00
 #PBS -l walltime=2:00:00
 #PBS -o /work/myname/mpi_openmp/run/output/myoutput2
 #PBS -j oe

 # Start the process of creating a new machinefile by making sure any
 # temporary files from a previous run are not present.

 rm -f tempfile tempfile2 newmachinefile

 # Generating a file with only unique node entries from the one provided
 # by PBS. Sort to make sure all the names are grouped together, then
 # let uniq pull out one copy of each:

 cat $PBS_NODEFILE | sort | uniq >> tempfile

 # Now duplicate the content of tempfile N times - here just twice.

 cat tempfile >> tempfile2
 cat tempfile >> tempfile2

 # Sort the content of tempfile2 and store in a new file.
 # This lets MPI assign processes sequentially across the nodes.

 cat tempfile2 | sort > newmachinefile 

 # Set the number of processes from the number of entries.

 export NPROCS=`wc -l newmachinefile |gawk '//{print $1}'`

 # Launch mpirun with the new machine/host file and process count.
 # (some versions of mpirun use "-hostfile" instead of "-machinefile")

 mpirun -np $NPROCS -machinefile newmachinefile your_program

▶ Monitor PBS stdout and stderr?

Currently, PBS is configured so that the standard output/error of a job is redirected to temporary files and copied back to the final destination after the job finishes. Hence, users can only access the standard output/error after their jobs finish.

However, PBS does provide another method for users to check standard output/error in real time, i.e.

qsub -k oe pbs_script

The -k oe option on the qsub command line specifies that the standard output and standard error streams will be retained on the execution host. The streams will be placed in the home directory of the user under whose user id the job executed. The file names take the form <job_name>.o<sequence> and <job_name>.e<sequence>, where <job_name> is the name specified for the job and <sequence> is the sequence number component of the job identifier. For example, if a user submits a job to Queenbee with job name "test" and job id "1223", then the standard output/error will be test.o1223/test.e1223 in the user's home directory. This allows users to check their stdout/stderr while their jobs are running.
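
Using the example above, the retained streams can then be watched in real time while the job runs:

$ tail -f ~/test.o1223    # follow standard output
$ tail -f ~/test.e1223    # follow standard error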

▶ How to Establish a Remote X11 Session?

From *nix

Since ssh and X11 are already on most client machines running some sort of unix (Linux, FreeBSD, etc), one would simply use the following command:

% ssh -X -Y username@remote.host.tdl

Once successfully logged in, the following command should open a new terminal window on the local host:

% xterm&

An xterm window should appear. If this is not the case, email us.

From Mac OS X

An X11 service is not installed by default, but one is available for installation on the OS distribution disks as an add-on. An alternative would be to install the XQuartz version. Make sure the X11 application is running and connect to the cluster using:

% ssh -X -Y username@remote.host.tdl

From Windows

Microsoft Windows does not provide an X11 server, but there are both open source and commercial versions available. You also need to install an SSH client. Recommended applications are:

  • Xming - a Windows X11 server
  • PuTTY - a Windows ssh client

When a PuTTY session is created, make sure the "X11 Forwarding Enabled" option is set, and that the X11 server is running before starting the session.

Testing

Once Xming and PuTTY have been installed and set up, the following will provide a simple test for success:

  1. start Xming
  2. start PuTTY
  3. connect to the remote host (make sure X11 forwarding is enabled in the PuTTY session for this host)

Once successfully logged in, the following command should open a new terminal window on the local host:

% xterm&

An xterm window should appear. If this is not the case, refer to "Trouble with Xming?" or email us.

Note About Cygwin

Cygwin is still a useful environment, but it is too complicated and includes too many unnecessary components when all one wants is to connect to remote X11 sessions. For these reasons, we recommend Xming and PuTTY as listed above.

Advanced Usage

The most important connection that is made is from the user's client machine to the first remote host. One may "nest" X11 forwarding by using the ssh -XY command to jump to other remote hosts.

For example:

1. on client PC (*nix or Windows), ssh to remotehost1

2. on remotehost1 (presumably a *nix machine), ssh -XY to remotehost2

3. on remotehost2 (presumably a *nix machine), ssh -XY to remotehost3

...

8. on remotehost8 (presumably a *nix machine), ssh -XY to remotehost9

9. on remotehost9, running an X11 application like xterm should propagate the remote window back to the initial client PC through all of the intermediate remote connections.
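
As a concrete sketch of the first two hops and the final test (user and host names are placeholders):

 [user@localpc]$ ssh -XY user@remotehost1
 [user@remotehost1]$ ssh -XY user@remotehost2
 [user@remotehost2]$ xterm &

The xterm started on remotehost2 should appear on the local PC, tunneled back through remotehost1.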

▶ Run Interactive Jobs?

Interactive Jobs

Interactive jobs give you dedicated access to a number of nodes. This is handy for debugging parallel programs (and is the preferred way to do so) or for running parallel jobs that require user interaction. Interactive jobs can be started by executing qsub without an input file:

 [user@m1]$ qsub -I -l nodes=2:ppn=4 -l walltime=00:10:00 -q workq

As with any other job request, the time it takes to actually begin executing depends on how busy the system is and the resources requested. You have the option of using any available queue on the system, so matching the resources with the queue will tend to decrease the wait time.

Once the job starts running, you will be automatically logged into a node and can begin issuing commands. This example starts up an 8-process MPI program named program05:

 [user@m1]$ qsub -I -l nodes=2:ppn=4 -l walltime=00:10:00 -q workq
 [user@m026]$ cd /work/user
 [user@m026]$ mpirun -hostfile $PBS_NODEFILE -np 8 ./program05

The set of nodes is yours to use for the length of time requested. If desired, one can also create an X-windows connection to display graphical information on your local machine. The syntax is:

 [user@machine]$ qsub -I -l nodes=1:ppn=4 -l walltime=00:10:00 -X

Make sure that the ppn=X argument matches that required by the machine you are on.
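
Once the interactive session starts, you can confirm which nodes were assigned before launching anything (node names below are only illustrative):

 [user@m026]$ sort -u $PBS_NODEFILE
 m026
 m027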

▶ Which Text Editors are installed on the clusters?

The following text editors are available:

  1. vi (See Google for the vi cheat sheet)
  2. pico
  3. emacs
  4. vim
  5. xemacs

▶ How to setup SSH Keys?

Setting Up SSH Keys Among Nodes

  1. Login to cluster head node
  2. Make sure ~/.ssh exists:
    $ mkdir -p ~/.ssh
  3. Change to the .ssh directory:
    $ cd ~/.ssh
  4. Generate keys:
    $ ssh-keygen -b 1024 -t dsa -f ~/.ssh/id_dsa.mynode_key
  5. Authorize the Public Key:
    $ cat ~/.ssh/id_dsa.mynode_key.pub >> ~/.ssh/authorized_keys

Test Set Up

From the head node, attempt to ssh into a compute node:

$ ssh -i ~/.ssh/id_dsa.mynode_key _compute_node_

If access is gained without being prompted for a password, then this has been set up properly.
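
If you are still prompted for a password, incorrect file permissions are a common cause; OpenSSH ignores keys when the directory or files are group- or world-writable. The standard settings are:

$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys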

▶ PBS Job Chains and Dependencies

PBS Job Chains

Quite often, a single simulation requires multiple long runs which must be processed in sequence. One method for creating such a sequence of batch jobs is to have each job script execute qsub at its end to submit its successor. We strongly discourage these recursive, or "self-submitting," scripts: if a job hits its time limit, the batch system kills it and the qsub command to submit the next job is never executed, which breaks the chain.

PBS allows users to move the logic for chaining from the script and into the scheduler. This is done with a command line option:

$ qsub -W depend=afterok:<jobid> <job_script>

This tells the job scheduler that the script being submitted should not start until jobid completes successfully. The following conditions are supported:

afterok:<jobid>
Job is scheduled if the job <jobid> exits without errors or is successfully completed.
afternotok:<jobid>
Job is scheduled if job <jobid> exited with errors.
afterany:<jobid>
Job is scheduled if the job <jobid> exits with or without errors.

One method to simplify this process is to write multiple batch scripts, job1.pbs, job2.pbs, job3.pbs etc and submit them using the following script:

#!/bin/bash
 
FIRST=$(qsub job1.pbs)
echo $FIRST
SECOND=$(qsub -W depend=afterany:$FIRST job2.pbs)
echo $SECOND
THIRD=$(qsub -W depend=afterany:$SECOND job3.pbs)
echo $THIRD

Modify the script according to the number of chained jobs required. The job <$FIRST> will be placed in the queue, while the jobs <$SECOND> and <$THIRD> will be placed in the queue with the "Not Queued" (NQ) flag in batch hold. When <$FIRST> completes, the NQ flag is replaced with the "Queued" (Q) flag and the job is moved to the active queue.

A few words of caution: if you list the dependency as "afterok" and the job exits with errors (or "afternotok" and it exits without errors), the subsequent jobs will be killed because the dependency was not met.
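
To verify that a held job is still waiting on its dependency, one simple check is to look for the depend attribute in its full qstat listing:

$ qstat -f $SECOND | grep depend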

▶ Check LoadLeveler Status?

To determine the status of the LoadLeveler environment, whether it is accepting jobs, or how many nodes are currently free, issue the following command:

$ llstatus

One might see output similar to the following:

 $ llstatus
 Name                      Schedd  InQ Act Startd Run LdAvg Idle Arch      OpSys
 l1f1n01                   Avail     0   0 Idle     0 1.00   216 Power5    AIX52    
 l1f1n02                   Down      0   0 Idle     0 0.00  9999 Power5    AIX52    
 l1f1n03                   Down      0   0 Idle     0 0.00  9999 Power5    AIX52    
 l1f1n04                   Down      0   0 Idle     0 0.01  9999 Power5    AIX52    
 l1f1n05                   Down      0   0 Idle     0 0.00  9999 Power5    AIX52    
 l1f1n06                   Down      0   0 Idle     0 0.00  9999 Power5    AIX52    
 l1f1n07                   Down      0   0 Idle     0 0.00  9999 Power5    AIX52    
 l1f1n08                   Down      0   0 Idle     0 0.00  9999 Power5    AIX52    
 l1f1n09                   Down      0   0 Idle     0 0.02  9999 Power5    AIX52    
 l1f1n10                   Down      0   0 Idle     0 0.00  9999 Power5    AIX52    
 l1f1n11                   Down      0   0 Idle     0 0.00  9999 Power5    AIX52    
 l1f1n12                   Down      0   0 Idle     0 0.00  9999 Power5    AIX52    
 l1f1n13                   Down      0   0 Idle     0 0.00  9999 Power5    AIX52    
 l1f1n14                   Down      0   0 Idle     0 0.00  9999 Power5    AIX52    
 Power5/AIX52               14 machines      0  jobs      0  running 
 Total Machines             14 machines      0  jobs      0  running
 The Central Manager is defined on l1f1n01
 The BACKFILL scheduler is in use
 All machines on the machine_list are present.

Field descriptions for the Startd column:

Busy
The maximum number of jobs is running.
Down
The Startd daemon is not running or cannot contact the central manager.
Drained
The machine will accept no more jobs, and the jobs that were running have completed.
Draining
The machine will accept no more jobs, but jobs are still running.
Flush
All jobs have been flushed from the machine and no new ones are being accepted.
Idle
The machine is not running any jobs.
None
LoadLeveler is running but no jobs can run.
Reserved
The resource manager has this machine reserved for interactive jobs.
Running
The machine is running one or more jobs and is capable of running more.
Suspend
All jobs on this machine have been suspended.

▶ Advanced reservations on LoadLeveler?

Under special circumstances, the LONI admins can allow a user to reserve computing time using LoadLeveler's reservation capabilities. The general procedure, once an admin has granted a specific user permission to do this, is:

  1. Create a reservation using llmkres
  2. Set the environment variable LL_RES_ID to the reservation id returned
  3. Submit the job script using llsubmit

Making the reservation

The llmkres utility is used to make the reservation. It allows for the specification of a start date and time-of-day, and the amount of time desired.

$ llmkres -t 01/17/2005 02:00 -d 420
-t mm/dd/yyyy hh:mm
Specifies the date and time at which to start the reservation.
-d mmm
Desired runtime for the reservation, in minutes.

Many other options are available, and are described in the man pages (i.e. man llmkres).

After your reservation starts, you need to bind your jobs to your reservation using llbind:

$ llbind -R reservation_id job_id

Listing currently made reservations

Use the command llqres.

Removing an existing reservation

Use the command llrmres:

$ llrmres [-R all | res_id]

Use res_id to remove a specific reservation, or -R all to remove all of them.

▶ LoadLeveler Job Chains and Dependencies

Not yet updated for Power7 systems

LoadLeveler Job Chains

LoadLeveler allows one to specify multiple, independent job steps per LL queue script. They are run concurrently as long as the resources are available. Likewise, LoadLeveler provides for the specification of dependencies among job steps, so that job chains may be set up based on the return status of a previously run step.

Independent Jobs

Independent job steps are specified using the step_name directive. In this example, the environment directive applies to all stanzas.

#!/bin/sh
#
#
#@ environment = COPY_ALL
#@ step_name = adcirc_e1
#@ job_type = parallel
#@ output = /work/default/estrabd/adcirc-systest/$(step_name).$(jobid).out
#@ error = /work/default/estrabd/adcirc-systest/$(step_name).$(jobid).err
#@ notify_user = estrabd@cct.lsu.edu
#@ notification = error
#@ class = checkpt
#@ checkpoint = no
#@ restart = yes
#@ wall_clock_limit = 00:10:00
#@ node_usage = not_shared
#@ node = 2,2
#@ total_tasks = 16
#@ requirements = (Arch == "Power5")
#@ initialdir = /work/default/estrabd/adcirc-systest
#@ executable = /work/default/estrabd/adcirc-systest/padcirc.sh
#@ network.MPI =sn_single,not_shared,US,HIGH
#@ resources = ConsumableMemory(1 gb)
#@ queue
#
#
# independent job step
#@ step_name = adcirc_e2
#@ job_type = parallel
#@ output = /work/default/estrabd/adcirc-systest/$(step_name).$(jobid).out
#@ error = /work/default/estrabd/adcirc-systest/$(step_name).$(jobid).err
#@ notify_user = estrabd@cct.lsu.edu
#@ notification = error
#@ class = checkpt
#@ checkpoint = no
#@ restart = yes
#@ wall_clock_limit = 00:10:00
#@ node_usage = not_shared
#@ node = 2,2
#@ total_tasks = 16
#@ requirements = (Arch == "Power5")
#@ initialdir = /work/default/estrabd/adcirc-systest
#@ executable = /work/default/estrabd/adcirc-systest/padcirc.sh
#@ network.MPI =sn_single,not_shared,US,HIGH
#@ resources = ConsumableMemory(1 gb)
#@ queue
#
#
# independent job step
#@ step_name = adcirc_e3
#@ job_type = parallel
#@ output = /work/default/estrabd/adcirc-systest/$(step_name).$(jobid).out
#@ error = /work/default/estrabd/adcirc-systest/$(step_name).$(jobid).err
#@ notify_user = estrabd@cct.lsu.edu
#@ notification = error
#@ class = checkpt
#@ checkpoint = no
#@ restart = yes
#@ wall_clock_limit = 00:10:00
#@ node_usage = not_shared
#@ node = 2,2
#@ total_tasks = 16
#@ requirements = (Arch == "Power5")
#@ initialdir = /work/default/estrabd/adcirc-systest
#@ executable = /work/default/estrabd/adcirc-systest/padcirc.sh
#@ network.MPI =sn_single,not_shared,US,HIGH
#@ resources = ConsumableMemory(1 gb)
#@ queue

Dependent Jobs

The following is an example of multiple job steps in a single LoadLeveler queue script that depend on one another. Note the addition of the dependency directive.

#!/bin/sh
#
#
#@ job_name =  adcircSysTest
#
# PREP (serial)
#@ step_name = prep_e1
#@ environment = COPY_ALL
#@ job_type = serial
#@ output = /work/default/estrabd/adcirc-systest/$(step_name).$(jobid).out
#@ error = /work/default/estrabd/adcirc-systest/$(step_name).$(jobid).err
#@ wall_clock_limit = 00:10:00
#@ class = checkpt
#@ resources = ConsumableMemory(1 gb)
#@ initialdir = /work/default/estrabd/adcirc-systest
#@ executable = /work/default/estrabd/adcirc-systest/prep.sh
#@ queue
#
#
# RUN (parallel)
#@ dependency = (prep_e1 >= 0)
#@ step_name = adcirc_e1
#@ job_type = parallel
#@ output = /work/default/estrabd/adcirc-systest/$(step_name).$(jobid).out
#@ error = /work/default/estrabd/adcirc-systest/$(step_name).$(jobid).err
#@ notify_user = estrabd@cct.lsu.edu
#@ notification = error
#@ class = checkpt
#@ checkpoint = no
#@ restart = yes
#@ wall_clock_limit = 00:10:00
#@ node_usage = not_shared
#@ node = 2,2
#@ total_tasks = 16
#@ requirements = (Arch == "Power5")
#@ initialdir = /work/default/estrabd/adcirc-systest
#@ executable = /work/default/estrabd/adcirc-systest/padcirc.sh
#@ network.MPI =sn_single,not_shared,US,HIGH
#@ resources = ConsumableMemory(1 gb)
#@ queue
#
#
# POST (serial)
#@ dependency = (adcirc_e1 >= 0)
#@ step_name = post_e1
#@ environment = COPY_ALL
#@ job_type = serial
#@ output = /work/default/estrabd/adcirc-systest/$(step_name).$(jobid).out
#@ error = /work/default/estrabd/adcirc-systest/$(step_name).$(jobid).err
#@ wall_clock_limit = 00:05:00
#@ class = checkpt
#@ resources = ConsumableMemory(1 gb)
#@ initialdir = /work/default/estrabd/adcirc-systest
#@ executable = /work/default/estrabd/adcirc-systest/post.sh
#@ queue

▶ LoadLeveler Shell Variables?

LoadLeveler Shell Variables

LoadLeveler sets the following environment variables, which you may access from your job if you need this information. They may also be used inside your submit scripts.

LOADLBATCH
Set to YES to indicate that the job is running under LoadLeveler
LOADL_ACTIVE
Contains the LoadLeveler version number.
LOADL_JOB_NAME
The three-part identifier for your job, of the form (name of submitting machine).(Job ID number).(Step ID number)
LOADL_PROCESSOR_LIST
A blank delimited list of hostnames allocated for running a job.
LOADL_STEP_ARGS
Any arguments passed to the executable for a job step.
LOADL_STEP_CLASS
The Job Class.
LOADL_STEP_COMMAND
The name of the executable or the name of the Job Command file if the executable is not specified.
LOADL_STEP_ERR
The file used for Standard Error messages.
LOADL_STEP_GROUP
The Unix group name of the job's owner
LOADL_STEP_IN
The file used for standard input.
LOADL_STEP_INITDIR
The initial working directory.
LOADL_STEP_NAME
The name of the job step.
LOADL_STEP_NICE
The Unix nice value that your job step is executing with. Set by Systems Administrators.
LOADL_STEP_OUT
The file for standard output.
LOADL_STEP_OWNER
The user id for the job's owner.
LOADL_STEP_TYPE
The job type, may be SERIAL or PARALLEL.
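
For example, a submit script (or the program it launches) can use these variables to report where it is running. A minimal sketch:

#!/bin/sh
# Report the hosts allocated to this job step.
# LOADL_PROCESSOR_LIST is a blank-delimited list of host names,
# so wc -w counts its entries.
NHOSTS=`echo $LOADL_PROCESSOR_LIST | wc -w`
echo "Step $LOADL_STEP_NAME of job $LOADL_JOB_NAME ($NHOSTS entries):"
echo $LOADL_PROCESSOR_LIST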

▶ LoadLeveler Job Execution Errors?

The problem could be related to memory exhaustion. Try running with fewer than 32 processors per node so that more memory is available per processor. This increases the number of nodes required for the same number of processors, but more memory is available on a per process basis.

▶ Compile MPI programs?

MPI (Message Passing Interface) is a standard in parallel computing for data communication across distributed processes.

Building MPI applications on LONI 64 bit Intel cluster

The proper way to compile with an MPI library is to use the compiler wrapper scripts installed with the library. Once your favored MPI (e.g. OpenMPI, MPICH, MVAPICH2) and compiler suite (e.g. Intel, GNU) have been set up using softenv, you're good to go.

The compiler command you use depends on the language you program in. For instance, if you program in C, regardless of whether the underlying compiler is the Intel C compiler or the GNU C compiler, the command is mpicc. The command is then used exactly as one would use the native compiler. For instance:

$ mpicc test.c -O3 -o a.out
$ mpif90 test.F -O3 -o a.out

There are slight differences in how each version of MPI launches a program for parallel execution. For that refer to the specific version information. But, by way of example, here is what a PBS job script might look like:

 #!/bin/bash
 #PBS -q workq
 #PBS -A your_allocation
 #PBS -l nodes=4:ppn=4
 #PBS -l walltime=20:00:00
 #PBS -o /scratch/ou/s_type/output/myoutput2
 #PBS -j oe
 #PBS -N s_type
 export HOME_DIR=/home/$USER/
 export WORK_DIR=/work/$USER/test
 export NPROCS=`wc -l $PBS_NODEFILE |gawk '//{print $1}'`
 cd $WORK_DIR
 cp $HOME_DIR/a.out .
 mpirun -machinefile $PBS_NODEFILE -np $NPROCS $WORK_DIR/a.out

▶ Random Number Seeds

The random number functions in most language libraries are actually pseudo-random number generators. Given the same seed, or starting value, they will reproduce the same sequence of numbers. If a relatively unique sequence is required on multiple nodes, the seeds should be generated from a source of high entropy random bits. On Linux systems, this can be accomplished by reading random bits from the system file /dev/random.

C/C++

The read can readily be done with fread() in C/C++ by treating /dev/random as a stream.

#include <stdio.h>

int main ( void )
{
  double x;
  int i[2];
  FILE *fp;

  /* Read 8 random bytes into a double and 8 more into two 4-byte ints. */
  fp = fopen ( "/dev/random", "r" );
  fread ( &x, 1, 8, fp );
  fread ( i, 1, 8, fp );
  fclose ( fp );
  printf ( "%i %i %g\n", i[0], i[1], x );
  return 0;
}

Fortran

In Fortran, the size of the variable determines how many bytes will be read. The /dev/random file must be treated as a binary, direct access file, with a record length corresponding to the byte size of the variable being read into.

      program main
      real*8 x
      integer*4 i(2)
      open ( unit=1, file='/dev/random', action='read',
     $       form='unformatted', access='direct', recl=8, status='old' )
      read ( 1, rec=1 ) x
      read ( 1, rec=1 ) i
      close ( 1 )
      print *, i, x
      stop
      end
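
If a seed is needed at the shell level, for example to pass to a program from a job script, the same idea works with standard tools. Here od reads 4 bytes from /dev/random and prints them as an unsigned integer (the program name and --seed option are only illustrative):

 # Read 4 random bytes and format them as an unsigned 32-bit integer.
 SEED=`od -An -N4 -tu4 /dev/random | tr -d ' '`
 ./my_program --seed $SEED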

▶ Common IBM XL Compiler Options?

The following shows the basic and most frequently used compiler options available in the IBM XL Fortran, C and C++ compilers. Most of these options work for both Fortran and C/C++ compilers. Each option presented below comes with a short description. Detailed information about them can be found in the IBM documentation.

==Control Options==

-I
  Specify an additional search path for include files whose absolute
  path is not specified in the code. This option needs to be specified
  once for each additional path if there is more than one.

  Example:
   -I/usr/local/packages/petsc-2.3.2/include/
  
-o 
  Specify a name for the output object or executable. The default is 'a.out'.

-c 
  Compile only. A .o object file will be generated for each source file
  but no executable will be generated.

-q32
  Compile the code in 32-bit mode.

-q64
  Compile the code in 64-bit mode.

'''For Fortran only:'''

-qfixed
  Indicate the source program is in fixed form (f77).

-qfree=f90
  Indicate the source program is in free form (f90).

==Debugging Options==

Almost all the options in this section can slow execution, so you
might want to restrict their use to debugging purposes only.

-p
  Prepare the program for profiling.

-pg
  Prepare the program for profiling. This option produces more
  extensive statistics than ''-p''.

-g
  Generate debugging information for debugging tools.

-qcheck
  Check each reference to an array element, array section and
  character substring for correctness.

-qflttrap
  Detect and determine run-time floating-point exceptions such as
  overflow and division by zero.
            
-qextchk
  Verifies the consistency of procedure definitions and references at
  compile time and verifies that actual arguments agree in type, shape
  and class with the corresponding dummy arguments at link time.

==Optimization Options==

-qhot
  'HOT' stands for 'High Order Transformation', which aims to improve
  the performance of loops and array language. What this option may do
  includes 1) scalar replacement; 2) interchange, fusion and unrolling
  of loops; and 3) reducing the generation of temporary arrays.

-qarch
  Instruct the compiler to generate code that takes advantage of the
  features of the specified architecture.

-qcache
  Specify the cache configuration for a specific execution machine.

-qtune
  Instruct the compiler to tune instruction selection, scheduling and
  other implementation-dependent performance enhancements for a
  specific implementation of a hardware architecture.

-O[level]
  Specify the optimization level (0,2,3,4 or 5).

-qstrict
  Ensure that the compiler does not change the semantics of the code
  when trying to optimize, which is only an issue when -O3 or higher
  is specified.

-qipa
  Inter-procedural analysis. You might want to consider this option
  when your code consists of a large number of separate files.

-qinline
  Inline subroutines that are frequently called to reduce the calling
  overhead.

==Other Options==

-qsmp
  Parallelize the code automatically. See the compiler documentation
  for details on how this works.

-qsmp=omp
  Compile OpenMP codes. The thread-safe version of the compiler
  (e.g. xlf90_r) must be used with this option.
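
As an illustration of how these options are typically combined (the architecture values are only examples; choose the ones matching your target machine):

 # Optimized 64-bit build tuned for POWER5, with the HOT loop optimizer
 xlf90_r -O3 -q64 -qarch=pwr5 -qtune=pwr5 -qhot foo.f90 -o foo

 # Debug build with bounds checking and floating-point traps
 xlf90_r -g -qcheck -qflttrap foo.f90 -o foo_debug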

Manuals and detailed information on XL compilers can be found at the IBM compiler information center.

▶ IBM Math Libraries?

Mathematical Acceleration SubSystem (MASS)

MASS stands for "Mathematical Acceleration SubSystem". It consists of tuned intrinsic mathematical functions such as sin(), cos(), log() and exp(). It supports the C/C++ and Fortran compilers. There are scalar and vector versions, and both are considered thread-safe. It usually provides better performance than the original intrinsic functions, but at the expense of slightly reduced precision (1 to 2 bits at most). Usage involves adding -lmass to the compile line:

xlf90_r foo.f90 -lmass

You might find the detailed introduction from the Naval Oceanographic Office helpful.

Engineering and Scientific Subroutines (ESSL)

ESSL is a state-of-the-art collection of subroutines providing a wide range of mathematical functions for many different scientific and engineering applications. Its primary characteristics are performance, functional capability, and usability. Subroutines touch on the following areas: linear algebra, matrix operations, eigenvalue problems, Fourier Transforms, sorting and searching, interpolation, numerical quadrature, and random number generation.

Most real and complex subroutines in ESSL have two versions: short- and long-precision. For example, SGEMM and DGEMM are the short- and long-precision versions, respectively, of the subroutine that performs combined matrix multiplication and addition for general matrices.

To use it, simply add -lessl to the compile line.

xlf90_r foo.f90 -lessl

A user guide can be found online.

Parallel ESSL (PESSL)

A version of ESSL adapted for parallel use. See above.

▶ IBM PE Shell Variables?

Parallel Environment (PE) on AIX is an environment for the development of parallel applications written in MPI. It provides a great number of environment variables that let users control how parallel programs are executed. Below are a few of the most frequently used (and probably most useful) ones. These may be set either as shell variables or as command line flags. For example, exporting MP_RMPOOL=1 and then executing a program is equivalent to executing the program with -rmpool 1 on the command line.

MP_PROCS (-procs)
The number of program tasks.
MP_RMPOOL (-rmpool)
The name or number of the pool that should be used for nonspecific node allocation.
MP_NODES (-nodes)
To specify the number of processor nodes on which to run the parallel tasks.
MP_STDOUTMODE (-stdoutmode)
Sometimes it is nice to have stdout output in processor order. To do this, you need to set MP_STDOUTMODE=ordered (the default is MP_STDOUTMODE=unordered). Unordered output might look like:
 $ ./hello -rmpool 1 -nodes 1 -procs 4
 
 Hello, world!
 This is the master process.
  
 Hello, world!
 This is process 2
 
 Hello, world!
 This is process 1
 
 Hello, world!
 This is process 3
With MP_STDOUTMODE set to ordered, the output changes to:
 
 $ export MP_STDOUTMODE=ordered
 $ ./hello -rmpool 1 -nodes 1 -procs 4
 Hello, world!
 This is the master process.

 Hello, world!
 This is process 1
 
 Hello, world!
 This is process 2
 
 Hello, world!
 This is process 3
Also, if for some reason you want the output from process 2 only, you can set MP_STDOUTMODE to a single process number:
 $ export MP_STDOUTMODE=2      
 $ ./hello -rmpool 1 -nodes 1 -procs 4
 
 Hello, world!
 This is process 2
MP_LABELIO (-labelio)
When this variable is set to yes, the output from the parallel tasks is labeled by task id. The default is no.
 $ export MP_LABELIO=yes
 lyan1@l1f1n01$ ./hello -rmpool 1 -nodes 1 -procs 4
   0: 
   0: Hello, world!
   0: This is the master process.
   1:
   1: Hello, world!
   1: This is process 1
   2: 
   2: Hello, world!
   2: This is process 2
   3: 
   3: Hello, world!
   3: This is process 3
MP_INFOLEVEL (-infolevel)
This variable controls the level of message reporting.
  • 0: error only
  • 1: error and warning
  • 2: error, warning and information
  • 3-6: all for level 2 plus diagnostic messages for use by the IBM Support Center.
 $ ./hello -rmpool 1 -nodes 1 -procs 4 -infolevel 2 \
                -stdoutmode ordered -labelio yes
 INFO: 0031-364  Contacting LoadLeveler to set and query information for ...
 INFO: 0031-380  LoadLeveler step ID is l1f1n01.6343.0
 INFO: 0031-119  Host l1f1n01 allocated for task 0
 INFO: 0031-120  Host address 208.100.64.13 allocated for task 0
 INFO: 0031-119  Host l1f1n01 allocated for task 1
 INFO: 0031-120  Host address 208.100.64.13 allocated for task 1
 INFO: 0031-119  Host l1f1n01 allocated for task 2
 INFO: 0031-120  Host address 208.100.64.13 allocated for task 2
 INFO: 0031-119  Host l1f1n01 allocated for task 3
 INFO: 0031-120  Host address 208.100.64.13 allocated for task 3
   0:INFO: 0031-724  Executing program: <./hello>
   1:INFO: 0031-724  Executing program: <./hello>
   2:INFO: 0031-724  Executing program: <./hello>
   3:INFO: 0031-724  Executing program: <./hello>
   0:INFO: 0031-619  64bit(ip)  ppe_rsan, rsan0537a MPCI shared object was ...
   0: 
   1:LAPI version #7.17 2005/08/15 1.157 src/rsct/lapi/lapi.c, lapi, ...
   3:LAPI version #7.17 2005/08/15 1.157 src/rsct/lapi/lapi.c, lapi, ...
   1:LAPI is using lightweight lock.
   3:LAPI is using lightweight lock.
   2:LAPI version #7.17 2005/08/15 1.157 src/rsct/lapi/lapi.c, lapi, ...
   2:LAPI is using lightweight lock.
   0:LAPI version #7.17 2005/08/15 1.157 src/rsct/lapi/lapi.c, lapi, ...
   0:LAPI is using lightweight lock.
   0:The MPI shared memory protocol is used for the job
   1:The MPI shared memory protocol is used for the job
   2:The MPI shared memory protocol is used for the job
   3:The MPI shared memory protocol is used for the job
   0:INFO: 0031-306  pm_atexit: pm_exit_value is 0.
   3:INFO: 0031-306  pm_atexit: pm_exit_value is 0.
   2:INFO: 0031-306  pm_atexit: pm_exit_value is 0.
   1:INFO: 0031-306  pm_atexit: pm_exit_value is 0.
 INFO: 0031-656  I/O file STDOUT closed by task 1
 INFO: 0031-656  I/O file STDERR closed by task 1
 INFO: 0031-656  I/O file STDOUT closed by task 0
 INFO: 0031-656  I/O file STDERR closed by task 0
 INFO: 0031-656  I/O file STDOUT closed by task 2
 INFO: 0031-656  I/O file STDERR closed by task 2
 INFO: 0031-656  I/O file STDOUT closed by task 3
 INFO: 0031-656  I/O file STDERR closed by task 3
 INFO: 0031-251  task 0 exited: rc=0
 INFO: 0031-251  task 1 exited: rc=0
 INFO: 0031-251  task 2 exited: rc=0
 INFO: 0031-251  task 3 exited: rc=0
   0: 
   0: Hello, world!
   0: This is the master process.
   0: The number of processes is 4
   0: 
   0: Normal end of execution.
   1: 
   1: Hello, world!
   1: This is process 1
   2: 
   2: Hello, world!
   2: This is process 2
   3: 
   3: Hello, world!
   3: This is process 3
 INFO: 0031-639  Exit status from pm_respond = 0

▶ Known Issues?

system(), fork() and popen()

Calls to system library functions system(), fork() and popen() are not supported by the Infiniband driver under the current Linux kernel. Any code that makes these calls inside the MPI scope (between MPI initialization and finalization) will likely fail.

▶ DOS/Linux/MAC Text File Problems?

Convert DOS Text Files to UNIX Text Files

Text File Formats

Text files contain human readable information, such as script files and programming language source files. However, not all text files are created equal - there are operating system dependencies to be aware of. Linux/Unix text files end a line with an ASCII line-feed control character, which has a decimal value of 10 (Ctrl-J). Microsoft Windows (or MS-DOS) text files use two control characters: an ASCII carriage return (decimal 13 or Ctrl-M) followed by a line-feed. Just to mix things up, classic Mac OS text files use just a carriage return (Mac OS X follows the Unix line-feed convention).

Problems can, and do, arise if there is a mismatch between a file's format and what the operating system expects. Compilers may be happy with source files in mixed formats, but other apps, like PBS, throw strange errors or just plain give up. So, it is important to make sure text file formats match what the operating system expects, especially if you move files back and forth between systems.

How To Tell

On most systems, the file command should be sufficient:

   $ file <filename>

Assuming a file named foo is a DOS file on Linux, you may see something like this:

   $ file foo
   foo: ASCII text, with CRLF line terminators

This indicates foo is a DOS text file, since it uses CRLF (carriage-return/line-feed). Some editors, such as vim and Emacs, will report the file type in their status lines. Other text based tools, such as od can also be used to examine the ASCII character values present, exposing the end-of-line control characters by their value.
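
For example, od -c prints each byte as a character, so carriage returns show up explicitly as \r (the file name is just an example):

   $ od -c foo | head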

How To Convert

If you're lucky, the system may have utilities installed to convert file formats. Some common names include: dos2unix, unix2dos, cvt, and flip. If they are not present, then you can draw on one of the basic system utilities to do the job for you.

The simplest approach involves using an editor, such as vim or Emacs. For instance, to make vim control the format of the files it writes out, set the file format option with :set ff=unix or :set ff=dos. Regardless of the format the file was read in as, it will be written out in the format last set.

Another option would be to use a command line tool, such as tr (translate), awk, sed, or even a Perl or Python script. Here's a tr command that will remove carriage returns, and any Ctrl-Z end-of-file markers, from a DOS file (note that the character values are octal, not decimal):

   $ tr -d '\15\32' < dosfile.txt > linuxfile.txt
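
An equivalent conversion with sed (GNU sed's -i option edits the file in place; with other sed implementations, redirect the output to a new file instead):

   $ sed -i 's/\r$//' dosfile.txt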

How To Avoid the Problem

There aren't many ways to reliably avoid this problem if you find yourself moving back and forth between operating systems.

  1. Use vim on all your systems, and modify the startup config file to always set ff=unix, regardless of the OS you are on.
  2. Use Windows tools that produce proper Linux files (vim, for instance).
  3. Install a conversion tool on your system, and apply it before moving the file. Most tools are smart enough to convert a file only if required (e.g. if converting to DOS and a file is already in DOS format, leave it alone). flip is available in source form, and works on DOS, MAC OS/X, and Linux.
  4. Move text files in a Zip archive, using the appropriate command line option to translate end-of-line characters. However, you'll have trouble if you accidentally translate a binary file!
  5. Just say NO to cross-platform development!

Users may direct questions to sys-help@loni.org.