
PBS at CSE - User's Guide

User's Guide for the Portable Batch System (PBS) in Penn State University's Department of Computer Science and Engineering


Overview

CSE HPC resources are available for use via Torque 2.5.13 (a version of PBS - Portable Batch System).

Jobs are submitted to the PBS Server and from there are distributed to one of the various clusters in the department.  Most clusters in the department are restricted to specific research groups, and many are used for kernel development and testing.

 

Cluster access is controlled by a user's UNIX groups. To determine which groups you are in, issue the command “groups”.

 

PBS Quick Start

Make sure you are in the correct groups

Check your groups by using the command  groups

  You may see something like this:

   % groups
   gcse inti titan

 

If you need to access the inti cluster, make sure you are in the "inti" group. Likewise, for the titan cluster you need to be in the "titan" group, for the ganga cluster the "ganga" group, and for the scotia cluster the "scotia" group.
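The group check above can be scripted so a submission script fails early with a useful message. A minimal sketch; the function name and messages are my own, not a site tool:

```shell
#!/bin/sh
# need_group GROUP: succeed if the current user is in GROUP, otherwise
# print a hint and fail. Illustrative helper, not part of PBS.
need_group() {
    if groups | grep -qw "$1"; then
        echo "OK: you are in group $1"
    else
        echo "Missing group $1 - request access via the helpdesk" >&2
        return 1
    fi
}

# Example: need_group inti || exit 1
```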

 

Setting up your account

PBS requires you to be able to SSH between nodes.  Use the following steps.

  1. SSH to a department Linux machine (lab218) or inti.cse.psu.edu and make sure your directory permissions are correct:

% chmod 700 ~/.ssh 
% chmod go-w ~/

   2. Edit or create the file ~/.ssh/config to contain the following (other Host blocks may be required, depending on which clusters you use and have access to):

Host titan*
  StrictHostKeyChecking no
  BatchMode yes
Host inti*
  StrictHostKeyChecking no
  BatchMode yes
Host scotia*
  StrictHostKeyChecking no
  BatchMode yes
Host p218inst*
  StrictHostKeyChecking no
  BatchMode yes
Host ganga*
  StrictHostKeyChecking no
  BatchMode yes
Host wattson*
  StrictHostKeyChecking no
  BatchMode yes

         Or, if you only need it for inti:

Host inti*
  StrictHostKeyChecking no
  BatchMode yes

        Alternatively, you can import the host keys from all machines into your ~/.ssh/known_hosts at once with a single command.  The drawback to this method is that, should a host key change (for example, when we reimage a machine), you may see a PBS failure with the error message "Bad UID for job execution".

ssh-keyscan -f /usr/local/user_conf/clusterips  -t rsa >>~/.ssh/known_hosts

       I would suggest using both methods: the import as well as adding the "Host" lines above to your ~/.ssh/config.

  3.  Generate an RSA2 key. DO NOT put in a password at the prompt; leave it blank.

ssh-keygen -t rsa -f ~/.ssh/id_rsa

  4.   Create your authorized_keys file:

cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys

  5.      Create or edit your ~/.rhosts to list the hosts you will typically submit jobs from.   If you use others, such as ganga or wattson, add them as well.

      inti.cse.psu.edu <username>
      inti <username>
      shai.cse.psu.edu <username> 
      shai  <username>
      ladon.cse.psu.edu <username>
      ladon <username>
      titan <username>
      titan.cse.psu.edu <username>
 

  6. Test by connecting between hosts: “ssh p218inst20”, then “ssh p218inst21”.   You should be able to SSH without a password or any errors.
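Steps 1, 3, and 4 above can be consolidated into one idempotent script. A sketch, assuming a standard OpenSSH client; it deliberately skips the ~/.ssh/config and ~/.rhosts edits, which depend on which clusters you use:

```shell
#!/bin/sh
# One-shot sketch of the permission and key steps above.
# Safe to re-run: an existing key is kept.
mkdir -p ~/.ssh
chmod 700 ~/.ssh
chmod go-w ~/
# An empty passphrase (-N "") is required so batch SSH never prompts
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
echo "SSH setup complete"
```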

  

PBS Commands

Here are the most commonly used commands.  If you don’t see what you need here, download the PBS User Guide.

 

qstat -Q                    Display the list of queues (refer to the queue
                            descriptions for availability)
qstat                       List all jobs in all queues
qstat -q <queuename>        List jobs in a specific queue
qstat -u <username>         List jobs for a particular user
qsub <script>               Submit a job to the default queue (submit)
qsub -q <queue> <script>    Submit a job to a particular queue
qsub -I <script>            Submit a job for an interactive run (returns a
                            prompt when the job runs)
qsub -h <script>            Submit a job with a hold on it (useful for
                            starting jobs with dependencies)
qrls <jobid>                Release a hold on a job
qdel <jobid>                Stop and remove a job from PBS
xpbs &                      Run the X Windows batch monitoring application

 

 

Examples:

This command submits a job to the queue inti:

% qsub -q inti myjob.pbs

This command deletes job number 13452.shai.cse.psu.edu:

% qdel 13452.shai.cse.psu.edu


PBS Scripts

PBS jobs are described by a flat text script containing PBS directives and shell commands. A simple script to submit a parallel job:

Submit.pbs:

#!/bin/sh
#PBS -l nodes=2:ppn=2
#PBS -l walltime=00:30:00
#PBS -q titan@shai
#PBS -N test_program
#### Remove one "#" from each of the next two lines and edit the email address to get notifications when the job begins/ends/aborts (b/e/a)
##PBS -m abe
##PBS -M <youremail>@cse.psu.edu
#
PROG=<change to path to your hello executable>
ARGS=""
#
echo ""
echo "Change to submission directory"
cd $PBS_O_WORKDIR
pwd
echo ""
# Get number of nodes allocated (using the magic shell script)
#
NO_OF_CORES=`cat $PBS_NODEFILE | egrep -v '^#'\|'^$' | wc -l | awk '{print $1}'`
# run the program
#

echo ""
echo "run the command:"
echo /usr/bin/mpiexec.hydra -f $PBS_NODEFILE -n $NO_OF_CORES $PROG $ARGS
/usr/bin/mpiexec.hydra -f $PBS_NODEFILE -n $NO_OF_CORES $PROG $ARGS
exit

Explanation:

First we name the shell (any shell can be used, but its path must exist on the execution nodes and the script's syntax must match the shell named).

The next 6 lines are PBS options (They may also be specified on the command line).   They are not comments despite the # at the beginning of the lines.

   -l nodes=2:ppn=2         specifies 2 nodes and 2 processors per node

   -l walltime=00:30:00     says how long a job will run (when time runs out your job is killed.) ALWAYS SPECIFY A WALLTIME

   -q <queuename>          directs job to a specific queue. By default jobs will go to the route queue submit@surya-0 and pbs will try to determine what queue your job should go to.

   -N <name of job>        sets the name of the job for output files.

   -m <abe>                     Mail options for when PBS will send you mail (a = aborted, b = begun, e = ended)

    -M <addresses>           Where mail should be sent (defaults to the user who submits the job)

Next we set some variables to keep the script clean:

            PROG= full path to the binary you want to run

            ARGS= any arguments you want to pass to $PROG

 

$PBS_O_WORKDIR is the directory from which the job was submitted. We cd there because otherwise jobs start in the user's CSE home directory (i.e. /home/grads/<username>, /home/other/<username>, etc.).
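A defensive variant of that cd can fail loudly instead of silently running from the home directory. A sketch; the guard and function name are my own suggestion, not required by PBS:

```shell
#!/bin/sh
# goto_workdir: cd to the PBS submission directory, or fail with a
# message rather than continuing from the wrong directory.
goto_workdir() {
    if [ -n "$PBS_O_WORKDIR" ] && cd "$PBS_O_WORKDIR"; then
        echo "running in $PWD"
    else
        echo "PBS_O_WORKDIR unset or unreachable; refusing to run in $PWD" >&2
        return 1
    fi
}

# In a job script: goto_workdir || exit 1
```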

Serial job scripts are a bit less complicated.   Most can use something as simple as this:

 Submit.pbs:

#!/bin/sh
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:30:00
#PBS -q inti@shai.cse.psu.edu
#PBS -N test_program
#PBS -m abe
#PBS -M user@cse.psu.edu
#
PROG=/home/users/prescott/hello-world
ARGS=""
#
cd $PBS_O_WORKDIR
# run the program
#
$PROG $ARGS
exit 0

 

Once you have created your submit.pbs script it can be submitted with qsub and PBS will return a job number:

% qsub submit.pbs 
2344.pbs
 

To view the status of your job, use the command qstat. While you will receive mail from your job, you may not want to receive mail from each one if you are submitting 2000 jobs.

 

% qstat -a
Job id            Name              User             Time Use S Queue
----------------  ----------------  ---------------- -------- - -----
399407.pbs    254.gap.ref       jhu              13:11:34 R titan          
399423.pbs    175.vpr.ref       jhu              10:07:14 R titan          
399441.pbs    submit.pbs        prescott         00:00:00 R inti

To kill a job in the queue, use the command qdel <jobid>:

% qdel 399441.pbs

Output from all finished jobs is placed in $PBS_O_WORKDIR (the directory the job was submitted from), named using either the script name or the job name given with the -N option in the PBS script. A pair of files will be moved into the directory, with names such as “test_program.o399411” and “test_program.e399411” as per the script above: .o<jobid> files contain the standard output and .e<jobid> files contain the standard error.

 

Examples

Here are some examples and how the output should look.

Serial example

Parallel example

Serial matlab

Hello World Parallel code example

Specific resources:

 

PBS looks for tags on the nodes line that request specific resources, such as connection type, operating system, or cluster.

Valid options are:

Clusters:  titan, inti, ladon, ganga, scotia
 


An example of this would be:

#PBS -l nodes=1:ppn=1:ib

A job with this request sent to the submit queue would be routed to the Inti Linux cluster queue, as inti is currently the only cluster with an Infiniband interconnect.
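By the same tag mechanism, a job can be pinned to a particular cluster by appending its name from the cluster list above to the nodes line. A hedged sketch; the exact tag names follow that list, and availability depends on your group access:

```shell
#PBS -l nodes=1:ppn=1:inti
```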

 


To submit to a specific queue, specify it with the -q option:

 #PBS -q <queuename>@shai

 

Inti Cluster and Infiniband

 

To take advantage of the TopSpin Infiniband interconnect on the inti cluster, users will need to compile their parallel programs on the head node of the cluster, inti.cse.psu.edu, using the TopSpin libraries.  Running the following (in tcsh syntax) will set up the environment:

set MPICH = /usr/local/topspin/mpi/mpich
set MPICH_LIB = $MPICH/lib64
set MPICH_PATH = $MPICH/bin
set path = ($path $MPICH_PATH )

 

Compile your programs using mpicc (/usr/local/topspin/mpi/mpich/bin/mpicc) and the TopSpin mpi.h (/usr/local/topspin/mpi/mpich/include/mpi.h).

 

You should then tell PBS that you want your job to run on the inti cluster and use mpirun_ssh (/usr/local/topspin/mpi/mpich/bin/mpirun_ssh). To require Infiniband, add “ib” to the nodes line in your PBS script:

 #PBS -l nodes=2:ppn=2:ib

 

Matlab and PBS

 

To run MATLAB under PBS, you first need to create a .m file that holds all of your MATLAB instructions. (It is important that the .m file have quit as its last line!)

 

Matlab polar-plot.m:

angle = 0:.1*pi:3*pi;
radius = exp(angle/20);
polar(angle,radius),...
title('An Example Polar Plot'),...
grid
print -deps polar-plot.ps;
quit;

 

The “print -deps polar-plot.ps” is necessary to see the graphical output.  ALWAYS have “quit;” as your last line or MATLAB will not shut down.
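Since a forgotten quit leaves MATLAB running until the walltime expires, you may want to check for it automatically before submitting. A small sketch; the function name is my own, not a site tool:

```shell
#!/bin/sh
# ensure_quit FILE: append "quit;" to a MATLAB .m file unless it
# already contains a quit line. Illustrative helper.
ensure_quit() {
    grep -q '^[[:space:]]*quit' "$1" || printf 'quit;\n' >> "$1"
}

# Example: ensure_quit polar-plot.m   before running qsub matlab.pbs
```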

 

Submit file for PBS matlab.pbs:

 

#!/bin/sh
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:10:00
cd $PBS_O_WORKDIR
#Be sure to include “-nodisplay” to stop X display.
/usr/local/bin/matlab -nodisplay < polar-plot.m
exit 0

 

FAQ

 

Q: Who manages PBS?

A: The manager of the CSE PBS system and cluster machines is Eric Prescott. All requests for help should be submitted via helpdesk http://www.cse.psu.edu/support/external.php

 

Q: How do I get an account?

A: First you must have a CSE account. All CSE users have access to the queue lab218. Other queues are restricted to research groups (see table). If you need access, contact the researcher in charge of the group and have them contact IT support.

 

Q: How do I connect to the clusters?

A: Access is via SSH protocol (version 2). Connect to any machine in lab218 or to the head node of a specific cluster such as inti.cse.psu.edu.  DO NOT connect to individual nodes in clusters. Access to cluster nodes is granted only via PBS.

 

Q: Why can I not SSH between nodes?

A: PBS requires you to be able to SSH between nodes. Follow these steps:

  • SSH to a department linux machine (lab218)
  • chmod 700 ~/.ssh
  • chmod go-w ~/
  • edit or create the file ~/.ssh/config with a Host block for each cluster (see “Setting up your account” above) containing:
StrictHostKeyChecking no
BatchMode yes
  • Generate an RSA2 key with “ssh-keygen -t rsa -f ~/.ssh/id_rsa”. DO NOT put in a password at the prompt; leave it blank
  • Create your authorized_keys file: “cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys”
  • Test by connecting between hosts: “ssh p218inst20”, then “ssh p218inst21”. You should be able to SSH without a password or error.

 

Q: How do I submit batch jobs?

A: Read the PBS tutorial (this full document) and follow the examples.

 

Q: I keep getting the following error when I submit jobs: “qsub: Bad UID for job execution”

A: Check that the host you are submitting from is in your ~/.rhosts file.  The host you plan to submit jobs from must be in the .rhosts file, in the following format:

p218inst20.cse.psu.edu <username>
inti.cse.psu.edu <username>
shai.cse.psu.edu <username>

 

Q: My host is in .rhosts but I still get the error “qsub: Bad UID for job execution”

A: Remove your .rhosts and rebuild it with just the host you're submitting from, then try again. A control character in the wrong place can cause it to fail.
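The rebuild can be done mechanically. A sketch: it assumes hostname -f/-s behave as on our Linux machines, and it overwrites any existing ~/.rhosts:

```shell
#!/bin/sh
# Rebuild ~/.rhosts with only the current host, in both fully
# qualified and short form, as described above. Overwrites the old file.
me=${USER:-$(id -un)}
printf '%s %s\n' "$(hostname -f)" "$me"  > ~/.rhosts
printf '%s %s\n' "$(hostname -s)" "$me" >> ~/.rhosts
chmod 600 ~/.rhosts
```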

 

Q: I still get the error "qsub: Bad GID for job execution"

A: Make sure your "#PBS -W group_list=" line names the proper group for the queue you're submitting to.

           Queue                 Group

            inti@shai            inti

            titan@shai           titan
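For example, a job bound for the inti queue would pair the two directives like this (a sketch based on the table above):

```shell
#PBS -q inti@shai
#PBS -W group_list=inti
```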

Q: Can I just run MPI jobs without PBS?

A: NO. All MPI jobs on the clusters must use PBS. For testing purposes, several nodes can be made available on request; in such cases, use of a -hostfile is necessary.

 

Q: My jobs are not being terminated, or they started on the wrong nodes by mistake.

A: Sometimes MPI jobs don’t terminate properly or can be assigned to the wrong nodes. If you notice this happening, contact the system manager as soon as possible.

 

Q: What is with all the serial jobs? This person is hogging the cluster!?

A: This is a shared resource, and sometimes users need to submit upwards of 40 thousand jobs! Think of others when submitting: if you are submitting a large number of jobs, script your submission so that others have time to get their jobs in as well.  Currently our scheduling is first come, first served, but if we need to regulate submission we will.
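One hedged way to script such a throttled submission; the batch size, pause, and function name are illustrative, not site policy:

```shell
#!/bin/sh
# submit_throttled CMD BATCH PAUSE FILE...: run CMD on each job script,
# sleeping PAUSE seconds after every BATCH submissions so that other
# users' jobs can enter the queue between batches.
submit_throttled() {
    cmd=$1; batch=$2; pause=$3; shift 3
    n=0
    for f in "$@"; do
        [ -e "$f" ] || continue   # skip unmatched globs
        "$cmd" "$f"
        n=$((n + 1))
        [ $((n % batch)) -eq 0 ] && sleep "$pause"
    done
    return 0
}

# Example: submit 20 jobs at a time, pausing a minute between batches
# submit_throttled qsub 20 60 job_*.pbs
```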

 
