Usage examples
Interactive commands for job submission and monitoring
Submitting your job to a job queue
Use the qsub command to submit your job to the PPPL cluster for processing. Your job is described by a job script, and it is this script that is submitted.
The job scheduler will return a job id containing the job number. For example:
[sunfire05.pppl.gov|82] qsub batch_test
82029.isis.pppl.gov
[sunfire05.pppl.gov|83] _
(note: the job number 82029 is returned from the job scheduler server isis.pppl.gov, a powerful and highly
available system whose sole purpose is job scheduling.)
When the job is done, the standard output and error files (stdout,
stderr) will be left in the current working directory, i.e. the
directory from which you submitted the job. By default, the stdout file
is named <jobname>.o<jobid>
and the stderr file is named <jobname>.e<jobid>.
These names may be overridden.
Monitoring your job
Use the qstat -u <your user name> command to
see your jobs queued or running.
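For example, assuming a user name of jdoe (an illustrative name only):
> qstat -u jdoe
You can also check on a single job by passing qstat the job number returned by qsub, e.g. qstat 82029.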
Monitoring all jobs currently in a queue
Use the qstat -a <queue name> command to
see all jobs queued or running in a specific queue.
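For example, to see all jobs in a specific queue such as kruskal (described under Job queue specification below):
> qstat -a kruskal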
Commands to use in your job script
Using Portable Batch System (PBS) directives in your job script
The job scheduler recognizes commands written in the widely used Portable
Batch System (PBS) syntax. This allows you to specify these commands (directives)
in your job script to control your job's
execution. The format of a PBS directive
is:
#PBS <flag> [arguments]
where the string "#PBS" is NOT a comment, but rather a special string
which denotes a PBS directive. An example is the specification
of the job name:
#PBS -N myjob
Specifying the number of nodes on which to run
By default, the job will be run on a single node on a single processor.
However, you can specify the use of multiple nodes (especially if you have a
parallelized program) by specifying a PBS directive; for example:
#PBS -l nodes=8:ppn=8
where nodes is the number of nodes to run on, and ppn is the number of processors per node. For example, the directive
#PBS -l nodes=1:ppn=16
will execute the job on a single node, using 16 processors on that node.
Considerations for node specification
The node specification has a dramatic effect on the queuing time for your job. For example, a job requiring 16 parallel processes
will usually wait in the queue much longer if it asks for a single node with 16 processors (#PBS -l nodes=1:ppn=16)
than if it asks for multiple nodes with fewer processors per node (#PBS -l nodes=4:ppn=4), since
it is much more likely that the scheduler can find 4 free processors on each of
4 nodes than all 16 processors free on a single node.
Job queue specification
When submitted, a job will be put into the sque (standard routing queue) to await
processing by the job scheduler. The scheduler will then
decide, based on the number of nodes and processors
requested, in which queue the job will run.
However, for special queues (like the Infiniband queue), the scheduler's selection can be overridden by specifying the PBS directive:
#PBS -q <queue name>
for example:
#PBS -q kruskal
Specifying the standard output and standard error files
By default, the standard output (stdout) and error files (stderr) are
named <jobname>.o<jobid>
and <jobname>.e<jobid>
respectively. These names may be overridden using the PBS directive:
#PBS -o joboutput.out
#PBS -e joberror.err
To join standard output and standard error in one file, whose default name is <jobname>.o<jobid>,
use the directive:
#PBS -j oe
Wall time
The amount of wall clock time needed to run the job may be specified by
a PBS directive
#PBS -l walltime=hh:mm:ss
for example:
#PBS -l walltime=60:30:00
This wall time estimate (in this case, 60 hours and 30 minutes)
informs the scheduler when the nodes will become available again. Your job will be terminated (via a kill -15 command)
when the estimated wall time is exceeded, so be generous; but to encourage
accurate scheduling and load-balancing estimates, not too generous.
Using large memory nodes
Many nodes have large memory sizes, which is especially useful for large
simulations or models.
Specify the amount of memory using the mem attribute, for example:
#PBS -l mem=64000mb
where mem=64000mb selects a node with at least 64 GB of memory. The request is written as 64000mb,
slightly less than the full 64 GB, so that it avoids rounding problems and the job
still fits on a 64 GB node while leaving enough memory for the operating system to run.
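If it is convenient to keep related requests together, standard PBS syntax also allows several resources to be combined on a single -l line, separated by commas; the values here are illustrative only:
#PBS -l nodes=1:ppn=16,mem=64000mb,walltime=24:00:00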
A generic job script
Here is a generic job script that includes the most common directives and options used by PPPL jobs.
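The sketch below is one way to combine the directives described above into a single script; the job name, resource values, queue, and file names are placeholders to be replaced with your own:
#!/bin/bash
# generic.job --- all values below are placeholders; adjust them for your job
# --- name the job
#PBS -N myjob
# --- request nodes, processors per node, wall time, and memory
#PBS -l nodes=4:ppn=4
#PBS -l walltime=24:00:00
#PBS -l mem=64000mb
# --- join stderr to stdout and name the combined output file
#PBS -j oe
#PBS -o myjob.out
# --- remove one "#" to force a specific queue instead of the routing queue
##PBS -q <queue name>
# --- run from the directory in which the job was submitted
cd $PBS_O_WORKDIR
# --- record the assigned nodes, the hostname, and the date
/bin/cat $PBS_NODEFILE
/bin/hostname
/bin/date
# --- replace the line below with the command that runs your program
# ./my_program
exit 0
Submit it with qsub in the same way as the examples that follow.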
Some simple example job scripts
A very simple job
#!/bin/bash
# test.job
# --- send the output to the test.out file
# the default is .o<jobid>
#PBS -o test.out
# --- send the error output to the test.err file
# the default is .e<jobid>
#PBS -e test.err
echo "Print out the hostname and date"
/bin/hostname
/bin/date
exit 0
Save the file as test.job, then submit it:
> qsub test.job
To see your results:
> cat test.out
A multiple host job:
#!/bin/bash
# --- run the job on 4 nodes, with 2 processors per node
#PBS -l nodes=4:ppn=2
# --- send the output to the test.out file
# the default is .o<jobid>
#PBS -o test.out
# --- send the error output to the test.err file
# the default is .e<jobid>
#PBS -e test.err
# --- print out the list of nodes upon which this job is running
/bin/cat $PBS_NODEFILE
echo "Print out the hostname and date"
/bin/hostname
/bin/date
exit 0
Save the file as test.job, then submit it:
> qsub test.job
To see your results:
> cat test.out