Usage examples
Interactive commands for job submission and monitoring
Submitting your job to a job queue
Use the qsub command to submit your job to the PPPL cluster for processing. Your job is described by a job script, and it is this script that is submitted.
The job scheduler will return a job id containing the job number. For example:
[sunfire05.pppl.gov|82] qsub batch_test
82029.isis.pppl.gov
[sunfire05.pppl.gov|83] _
(note: the job number 82029 is returned from the job scheduler server isis.pppl.gov, a powerful and highly
available system whose sole purpose is job scheduling.)
When the job is done, the standard output and error files (stdout,
stderr) will be left in the current working directory, i.e. the
directory from which you submitted the job. By default, the stdout file
is named <jobname>.o<jobid>
and the stderr file is named <jobname>.e<jobid>.
These names may be overridden.
Monitoring your job
Use the qstat -u <your user name> command to
see your jobs queued or running.
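For example, assuming a user name of jdoe (an illustrative name only):
> qstat -u jdoe
You can also check on a single job by passing qstat the job number returned by qsub, e.g. qstat 82029.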
Monitoring all jobs currently in a queue
Use the qstat -a <queue name> command to
see all jobs queued or running in a specific queue.
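For example, to see all jobs in a specific queue such as kruskal (described under Job queue specification below):
> qstat -a kruskal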
Commands to use in your job script
Using Portable Batch System (PBS) directives in your job script
The job scheduler recognizes commands written in the widely used Portable
Batch System (PBS) syntax. This allows you to specify these commands (directives)
in your job script to control your job's
execution. The format of a PBS directive
is:
#PBS <flag> [arguments]
where the string "#PBS" is NOT a comment, but rather a special string
which denotes a PBS directive. An example is the specification
of the job name:
#PBS -N myjob
Specifying the number of nodes on which to run
By default, the job will be run on a single node on a single processor.
However, you can specify the use of multiple nodes (especially if you have a
parallelized program) by specifying a PBS directive; for example:
#PBS -l nodes=8:ppn=8
where nodes is the number of nodes to run on, and ppn is the number of processors per node. For example, the directive
#PBS -l nodes=1:ppn=16
will execute the job on a single node, using 16 processors on that node.
Considerations for node specification
The node specification has a dramatic effect on the queuing time for your job. For example, a job requiring 16 parallel processes
will usually wait in the queue much longer if it asks for a single node with 16 processors (#PBS -l nodes=1:ppn=16)
than if it asks for multiple nodes with fewer processors per node (#PBS -l nodes=4:ppn=4), since
it is much more likely that the scheduler can find 4 free processors on each of
4 nodes than all 16 processors free on a single node.
Job queue specification
When submitted, a job will be put into the sque (standard routing queue) to await
processing by the job scheduler. The scheduler will then
decide, based on the number of nodes and processors
requested, in which queue the job will run.
However, for special queues (like the Infiniband queue), the scheduler's selection can be overridden by specifying the PBS directive:
#PBS -q <queue name>
for example:
#PBS -q kruskal
Specifying the standard output and standard error files
By default, the standard output (stdout) and error files (stderr) are
named <jobname>.o<jobid>
and <jobname>.e<jobid>
respectively. These names may be overridden using the PBS directive:
#PBS -o joboutput.out
#PBS -e joberror.err
To join standard output and standard error in one file, whose default name is <jobname>.o<jobid>,
use the directive:
#PBS -j oe
Wall time
The amount of wall clock time needed to run the job may be specified by
a PBS directive
#PBS -l walltime=hh:mm:ss
for example:
#PBS -l walltime=60:30:00
This wall time estimate (in this case, 60 hours and 30 minutes)
informs the scheduler when the nodes will become available again. Your job will be terminated (via a kill -15 command)
when the estimated wall time is exceeded, so be generous; but to encourage
accurate scheduling and load-balancing estimates, not too generous.
Using large memory nodes
Many nodes have large memory sizes, which is especially useful for large
simulations or models.
Specify the amount of memory using the mem attribute, for example:
#PBS -l mem=64000mb
where mem=64000mb selects a node with at least 64 GB of memory. The request is written as 64000mb,
slightly less than the full 64 GB, so that it avoids rounding problems and the job
still fits on a 64 GB node while leaving enough memory for the operating system to run.
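If it is convenient to keep related requests together, standard PBS syntax also allows several resources to be combined on a single -l line, separated by commas; the values here are illustrative only:
#PBS -l nodes=1:ppn=16,mem=64000mb,walltime=24:00:00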
A generic job script
Here is a generic job script that includes the most common directives and options used by PPPL jobs.
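The sketch below is one way to combine the directives described above into a single script; the job name, resource values, queue, and file names are placeholders to be replaced with your own:
#!/bin/bash
# generic.job --- all values below are placeholders; adjust them for your job
# --- name the job
#PBS -N myjob
# --- request nodes, processors per node, wall time, and memory
#PBS -l nodes=4:ppn=4
#PBS -l walltime=24:00:00
#PBS -l mem=64000mb
# --- join stderr to stdout and name the combined output file
#PBS -j oe
#PBS -o myjob.out
# --- remove one "#" to force a specific queue instead of the routing queue
##PBS -q <queue name>
# --- run from the directory in which the job was submitted
cd $PBS_O_WORKDIR
# --- record the assigned nodes, the hostname, and the date
/bin/cat $PBS_NODEFILE
/bin/hostname
/bin/date
# --- replace the line below with the command that runs your program
# ./my_program
exit 0
Submit it with qsub in the same way as the examples that follow.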
Some simple example job scripts
A very simple job
#!/bin/bash
# test.job
# --- send the output to the test.out file
# the default is .o<jobid>
#PBS -o test.out
# --- send the error output to the test.err file
# the default is .e<jobid>
#PBS -e test.err
echo "Print out the hostname and date"
/bin/hostname
/bin/date
exit 0
Save the file as test.job, then submit it:
> qsub test.job
To see your results:
> cat test.out
A multiple host job:
#!/bin/bash
# --- run the job on 4 nodes, with 2 processors per node
#PBS -l nodes=4:ppn=2
# --- send the output to the test.out file
# the default is .o<jobid>
#PBS -o test.out
# --- send the error output to the test.err file
# the default is .e<jobid>
#PBS -e test.err
# --- print out the list of nodes upon which this job is running
/bin/cat $PBS_NODEFILE
echo "Print out the hostname and date"
/bin/hostname
/bin/date
exit 0
Save the file as test.job, then submit it:
> qsub test.job
To see your results:
> cat test.out