Thursday, September 4, 2014

Clustering : PBS: Portable Batch System [HPC = JOB SCHEDULER]

PBS

The Portable Batch System (PBS) is a workload management system for Linux clusters. It supplies commands to submit, monitor, and manage jobs.

PBS Job Script

  1. Create a job script containing the PBS options that request the resources the job needs (e.g. number of processors, wall-clock time) and the commands that prepare for and run the executable (e.g. cd to the working directory).
  2. Submit the job script file to PBS.
  3. Monitor the job.

Common PBS Options

Below are some of the commonly used PBS options in a job script file. Each option line starts with "#PBS".

Option                        Description
#PBS -N myJob                 Assigns a job name. The default is the name of the PBS job script.
#PBS -l nodes=4:ppn=2         The number of nodes and processors per node.
#PBS -q queuename             Assigns the queue your job will use.
#PBS -l walltime=01:00:00     The maximum wall-clock time during which this job can run.
#PBS -o mypath/my.out         The path and file name for standard output.
#PBS -e mypath/my.err         The path and file name for standard error.
#PBS -j oe                    Join option that merges the standard error stream with the standard output stream of the job.
#PBS -W stagein=file_list     Copies the file onto the execution host before the job starts. (*)
#PBS -W stageout=file_list    Copies the file from the execution host after the job completes. (*)
#PBS -m b                     Sends mail to the user when the job begins.
#PBS -m e                     Sends mail to the user when the job ends.
#PBS -m a                     Sends mail to the user when the job aborts (with an error).
#PBS -m ba                    Combines mail options on one line (here: mail on begin and on abort). Group them in a single -m option; if -m is given several times, only the last one takes effect.
#PBS -r n                     Indicates that a job should not rerun if it fails.
#PBS -V                       Exports all environment variables to the job.
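
As a rough sketch, several of the options above could be combined into one job script like this. The queue name workq, the output file name, and the program ./my_program are placeholders chosen for illustration, and the nodes/ppn resource syntax is the one from the table (some PBS versions use select instead). The snippet only writes the job script to a file and counts its directives; it does not submit anything.

```shell
#!/bin/bash
# Write an example job script to a file. All names below are placeholders.
job_script="${TMPDIR:-/tmp}/example_job.pbs"

cat > "$job_script" <<'EOF'
#!/bin/bash
#PBS -N myJob
#PBS -q workq
#PBS -l nodes=4:ppn=2
#PBS -l walltime=01:00:00
#PBS -j oe
#PBS -o my.out
#PBS -m ba
#PBS -r n
#PBS -V

# Run from the directory where qsub was invoked.
cd "$PBS_O_WORKDIR"
./my_program
EOF

# Show how many #PBS directive lines the script contains.
grep -c '^#PBS' "$job_script"
```

Because -j oe merges standard error into standard output, only -o is given here and -e is left out.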

PBS ENVIRONMENT VARIABLES

There are a number of predefined environment variables. These include the following:
  • Variables defined on the execution host;
  • Variables exported from the submission host to the execution host; and
  • Variables defined by PBS.
The following environment variables relate to the submission machine:
Variable          Description
PBS_O_HOST        The host machine on which the qsub command was run.
PBS_O_LOGNAME     The login name on the machine on which qsub was run.
PBS_O_HOME        The home directory (HOME) of the user who ran qsub.
PBS_O_WORKDIR     The working directory from which qsub was run.

The following variables relate to the environment where the job is executing:
Variable          Description
PBS_ENVIRONMENT   This is set to PBS_BATCH for batch jobs and to PBS_INTERACTIVE for interactive jobs.
PBS_O_QUEUE       The original queue to which the job was submitted.
PBS_JOBID         The identifier that PBS assigns to the job.
PBS_JOBNAME       The name of the job.
PBS_NODEFILE      The file containing the list of nodes assigned to a parallel job.
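
To see what these variables hold, a job script could print them at start-up. The sketch below is only an illustration: the helper name count_nodes is made up here, and the "unset" fallback is there so the snippet also runs outside a PBS job.

```shell
#!/bin/bash
# Print the PBS variables listed above; show "unset" when run outside a PBS job.
for var in PBS_O_HOST PBS_O_LOGNAME PBS_O_HOME PBS_O_WORKDIR \
           PBS_ENVIRONMENT PBS_O_QUEUE PBS_JOBID PBS_JOBNAME PBS_NODEFILE; do
    printf '%s=%s\n' "$var" "$(printenv "$var" || echo unset)"
done

# Count the distinct host names in a node file (hypothetical helper).
# With no argument it uses $PBS_NODEFILE; it prints 0 if no file is readable.
count_nodes() {
    nodefile="${1:-$PBS_NODEFILE}"
    if [ -n "$nodefile" ] && [ -r "$nodefile" ]; then
        sort -u "$nodefile" | wc -l
    else
        echo 0
    fi
}

echo "unique nodes: $(count_nodes)"
```

The node file usually lists a host once per assigned processor, which is why the helper de-duplicates with sort -u before counting.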


Submitting a Job

Jobs are submitted with the 'qsub' command. Job attributes can be set in two different ways.

Method 1: on the qsub command line

qsub <other options> -N <job_name> <job_script>

ex: qsub -l select=1:ncpus=1:mem=100MB -l walltime=01:00:00 -N my_job myscript
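
qsub prints the identifier of the new job on standard output, so a wrapper can capture it for later monitoring. Here is a minimal sketch using the same options as the example above; the function name submit_job is invented for this post, and the check for qsub is only there so the snippet fails cleanly on a machine without PBS.

```shell
#!/bin/bash
# Submit a script with attributes given on the command line and print the job ID.
# submit_job is a hypothetical helper; "my_job" and "myscript" are placeholders.
submit_job() {
    script="$1"
    if command -v qsub >/dev/null 2>&1; then
        qsub -l select=1:ncpus=1:mem=100MB -l walltime=01:00:00 -N my_job "$script"
    else
        echo "qsub not found; is PBS installed on this host?" >&2
        return 1
    fi
}

# Capture the job ID only if the submission succeeded.
if job_id=$(submit_job myscript); then
    echo "submitted as $job_id"
else
    echo "submission failed"
fi
```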

Method 2: within a job script as a PBS directive

#! /bin/bash
#PBS -l walltime=10:00:00
#PBS -N my_job_mpi
#PBS -q workq
#PBS -l select=2:ncpus=12:mpiprocs=12
#PBS -l place=scatter:excl
#PBS -V

# Go to the directory from which you submitted the job
cd $PBS_O_WORKDIR

mpiexec_mpt ./a.out

Notes:
  • PBS expects the directives to begin on the second line (right after the interpreter line) and to come before any executable line.
  • PBS stops processing directives at the first executable line. Comment lines are ignored and do not end the scan.
  • Command-line arguments to qsub override PBS directives in the script.
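
The scanning rule from the note can be approximated with a few lines of awk. This is only a sketch of the rule as described here, not the parser PBS itself uses, and the function name list_directives is made up for illustration.

```shell
#!/bin/bash
# Print the #PBS directives that appear before the first executable line.
# Approximation of the rule described above (hypothetical helper).
list_directives() {
    awk '
        /^#PBS/      { print; next }   # directive: report it
        /^[ \t]*#/   { next }          # other comment line: skip
        /^[ \t]*$/   { next }          # blank line: skip
                     { exit }          # first executable line: stop scanning
    ' "$1"
}
```

Running list_directives on a job script is a quick way to spot a #PBS line that was placed after a command by mistake and would therefore be ignored.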

Monitoring a Job

Below are commands for monitoring a job:
Command             Function
qstat -a            list all jobs with their status (qstat -Q reports on queues, qstat -B on the PBS server)
qstat -f job.ID     get all the information about a job, e.g. resources requested, resource limits, owner, source, destination, queue
qdel job.ID         delete a job from the queue
qhold job.ID        hold a job if it is in the queue
qrls job.ID         release a job from hold
tracejob job.ID     comprehensive information about a job, collected from the PBS log files
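
These commands are easy to script. As a small sketch, the helper below pulls the job_state field out of "qstat -f" output, and a second function polls until qstat no longer reports the job. Both function names and the 30-second interval are arbitrary choices made for this example.

```shell
#!/bin/bash
# Extract the job_state field (e.g. Q for queued, R for running)
# from "qstat -f" output read on standard input.
job_state() {
    awk -F' = ' '$1 ~ /^[ \t]*job_state$/ { print $2; exit }'
}

# Poll a job until qstat stops reporting a state for it.
wait_for_job() {
    job_id="$1"
    while state=$(qstat -f "$job_id" 2>/dev/null | job_state) && [ -n "$state" ]; do
        echo "job $job_id state: $state"
        sleep 30
    done
    echo "job $job_id is no longer reported by qstat"
}
```

A call would look like wait_for_job job.ID, with job.ID being the identifier printed by qsub.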
 ===========================================================

Components of PBS


Batch Server (PBS_Server): provides the basic batch services such as receiving/creating a batch job, modifying the job, protecting the job against system crashes, and running the job.
  • central focus for a PBS complex
  • routes job to compute host
  • processes all PBS related commands
  • provides the basic batch services
  • server maintains its own server and queue settings
  • daemon executes as pbs_server

Scheduler (PBS_Scheduler): a daemon that contains the site's policy controlling which job is run and where and when it is run. PBS allows each site to create its own Scheduler.
  • queries list of running and queued jobs from the PBS Server
  • queries queue, server, and node properties
  • queries resource consumption and availability from the PBS MOM
  • sorts available jobs according to local scheduling policies
  • determines which job is eligible to run next
  • daemon executing as pbs_sched

MOM (PBS_MOM): places the job into execution when it receives a copy of the job from the Batch Server. MOM creates a new session that is as close to a user login session as possible and returns the job's output to the user.
  • executes jobs sent by the PBS Server
  • monitors resource usage of running jobs
  • enforces resource limits on jobs
  • reports system resource limits, configuration
  • daemon executing as pbs_mom   
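
A quick way to see which of these components run on a given host is to look for the daemon processes. The sketch below assumes that the process names match the daemon names listed above, which is not true on every installation, and the function name daemon_status is invented for this example.

```shell
#!/bin/bash
# Report whether each PBS daemon named above is running on this host.
# Assumes the process name equals the daemon name; some installations differ.
daemon_status() {
    for daemon in pbs_server pbs_sched pbs_mom; do
        if pgrep -x "$daemon" >/dev/null 2>&1; then
            echo "$daemon: running"
        else
            echo "$daemon: not running"
        fi
    done
}

daemon_status
```

On a typical cluster the server and scheduler would be expected on the head node and pbs_mom on the compute nodes, though layouts vary by site.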
