SLURM

About SLURM

SLURM is the scheduler used by the Frontenac cluster. Like Sun Grid Engine (the scheduler used for the M9000 and SW clusters), SLURM is used for submitting, monitoring, and controlling jobs on a cluster. Any jobs or computations done on the Frontenac cluster must be started via SLURM. Reading this tutorial will supply all the information necessary to run jobs on Frontenac.

Although existing users are likely very familiar with Sun Grid Engine (SGE), switching to SLURM offers a number of advantages over the old system. The biggest advantage is that the scheduling algorithm is significantly better than that offered by SGE, allowing more jobs to be run on the same amount of hardware. SLURM also supports new types of jobs: users will now be able to schedule interactive sessions or run individual commands via the scheduler. In terms of administration and accounting, SLURM is also considerably more flexible. Although easier cluster administration does not directly impact users in the short term, CAC will be able to more easily reconfigure our systems over time to meet the changing needs of users and perform critical system maintenance. All in all, we believe switching to SLURM will offer our users an all-around better experience when using our systems.

How SLURM works

SLURM is the piece of software that allows many users to share a compute cluster. A cluster is a set of networked computers; each computer represents one "node" of the cluster. When a user submits a job, SLURM will schedule this job on a node (or nodes) that meets the resource requirements indicated by the user. If no resources are currently available, the user's job will wait in a queue until the resources they have requested become available for use.

Nodes in SLURM are divided into distinct "partitions" (similar to queues in SGE) and a node may be part of multiple partitions. Different partitions may have different uses, such as directing users' jobs to nodes with a particular piece of software installed (some software licenses only allow us to install software on a given number of nodes). Generally, the default partition (named "standard") will suffice for most uses and encompasses the largest amount of hardware.

All users will have one or more SLURM usage accounts. Accounts are used to record accounting information and may be used to control access to certain partitions (such as those for RAC allocations). For everyday, default use, most users will not need to bother with accounts or accounting details (just be aware that they exist). For a detailed overview of SLURM accounts and accounting, please see our guide to SLURM accounting.

Basic SLURM commands

These commands cover most basic operations with SLURM.


sinfo - Check the status of the cluster/partitions

sinfo 
sinfo -lNe  # same as above, but shows per-node status

Example output of sinfo on a small demonstration cluster. Nodes cac002-cac006 are part of the "standard" partition (jobs are submitted to this partition by default, indicated by the '*' character), and nodes cac007-cac009 are part of the "large" partition. One node in the "large" partition is currently allocated and being used (cac007).

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
standard*    up 2-00:00:00      5   idle cac[002-006]
large        up 14-00:00:0      1  alloc cac007
large        up 14-00:00:0      2   idle cac[008-009]

squeue - Show status of jobs

squeue                  # your jobs
squeue -u <username>    # show jobs for user <username>
squeue --start          # show expected start times of jobs in queue

Example output of squeue on a demonstration cluster. User jeffs has 5 jobs running on nodes cac002-cac006 (in partition "standard"), and 4 jobs in queue. Job 1164 has not started because no resources are available for that job, and jobs 1165-1167 have not started because job 1164 has priority.

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              1166  standard long-job    jeffs PD       0:00      1 (Priority)
              1167  standard long-job    jeffs PD       0:00      1 (Priority)
              1165  standard long-job    jeffs PD       0:00      1 (Priority)
              1164  standard long-job    jeffs PD       0:00      1 (Resources)
              1161  standard long-job    jeffs  R       0:08      1 cac004
              1162  standard long-job    jeffs  R       0:08      1 cac005
              1163  standard long-job    jeffs  R       0:08      1 cac006
              1160  standard long-job    jeffs  R       0:12      1 cac003
              1159  standard long-job    jeffs  R       0:16      1 cac002
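
As a rough sketch of how this output could be summarized in a script, the snippet below counts jobs per state with awk. The sample lines are copied from the demonstration output above and stand in for a live squeue call; the helper name count_by_state is an illustrative assumption, not a SLURM command.

```shell
#!/bin/bash
# Count jobs per state (column 5, "ST") in squeue-style output read from stdin.
# count_by_state is a hypothetical helper name, not part of SLURM.
count_by_state() {
    awk 'NR > 1 { count[$5]++ } END { for (s in count) print s, count[s] }' | sort
}

# Sample lines copied from the demonstration output above. On a real cluster
# you would pipe the output of "squeue" into the function instead.
sample_output='JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
1166  standard long-job    jeffs PD       0:00      1 (Priority)
1164  standard long-job    jeffs PD       0:00      1 (Resources)
1161  standard long-job    jeffs  R       0:08      1 cac004
1159  standard long-job    jeffs  R       0:16      1 cac002'

printf '%s\n' "$sample_output" | count_by_state
```

The first line of input is skipped because it is the column header; PD marks pending jobs and R marks running ones.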


scancel - Kill a job

You can get job IDs with squeue. Note that you can only kill your own jobs.

scancel <jobID>         # kill job <jobID>. (you can get the job IDs with "squeue")
scancel -u <username>   # kill all jobs for user <username>. 
scancel -t <state>      # kill all jobs in state <state>. <state> can be one of: PENDING, RUNNING, SUSPENDED

Long wait times?

You can try using the older SSE3 nodes; see the "Using SSE3 Nodes" section below.

Running jobs

There are actually 3 methods of submitting jobs under SLURM: sbatch, srun, and salloc. Although this may initially seem unnecessarily complicated, these commands share the same options and allow users to submit new types of jobs.

sbatch - Submit a job script to be run

sbatch will submit a job script to be run by the cluster. Job scripts under SLURM are simply shell scripts (*.sh) with a set of resource requests at the top of the script. Users of Sun Grid Engine should note that SLURM's sbatch is functionally identical to SGE's qsub.

To submit a job script to SLURM:

sbatch nameOfScript.sh

Example output:

$ sbatch long-job.sh
Submitted batch job 1169

Job scripts specify the resources requested and other special considerations with special "#SBATCH" comments at the top of a job script. Although many of these options are optional, directives dealing with resource requests (CPUs, memory, and walltime) are mandatory. All directives should be added to your scripts in the following manner:

#SBATCH <directive>

To specify a job name, for instance, you would add the following to your script:

#SBATCH --job-name=myJobName

For users looking to get started with SLURM as fast as possible, a minimalist template job script is shown below:

#!/bin/bash
#SBATCH -c <cpus>                          # Number of CPUs requested. If omitted, the default is 1 CPU.
#SBATCH --mem=megabytes                    # Memory requested in megabytes. If omitted, the default is 1024 MB.
#SBATCH --time=days-hours:minutes:seconds      # How long will your job run for? If omitted, the default is 3 hours.

# commands for your job go here

Mandatory directives

Directives in this section are mandatory, and are used by SLURM to determine where and when your jobs will run. If you do not assign a value for one of these, the scheduler will assign your job the default value. Unlike with Sun Grid Engine, jobs that exceed their resource requests will be automatically killed by SLURM. Though this seems harsh, it means that users exceeding the resources that the scheduler has given them will not degrade the experiences of other users on the system. Jobs requesting more resources may be harder to schedule (because they have to wait for a larger slot).

-c <cpus> -- This is the number of CPUs your job needs. Note that SLURM is relatively generous with CPUs, and the value specified here is the minimum number of CPUs that your job will be assigned. If additional CPUs are available on a node beyond what was requested, your job will be given those CPUs until they are needed by other jobs. Default value is 1 CPU. Attempting to use more CPUs than you have been allocated will result in your extra processes taking turns on the same CPU (slowing your job down).

--mem=<megabytes> -- This is the amount of memory your job needs to run. Chances are, you may not know how much memory your job will use. If this is the case, a good rule of thumb is 2048 megabytes (2 gigabytes) per processor that your job uses. Note that jobs will be killed if they exceed their memory allocations, so it's best to err on the safe side and request extra memory if you are unsure of things (there is no penalty for requesting too much memory). Default value is 1024 MB.

-t <days-hours:minutes:seconds> -- Walltime for your job. The walltime is the length of time you expect your job to run. Again, your job will be killed if it runs for longer than the requested walltime. If you do not know how long your job will run for, err on the side of requesting too much walltime rather than too little. A typical rule of thumb is asking for two to three times the amount of time you think you will need. May also follow the format "hours:minutes:seconds". Default value is 3 hours, and the maximum walltime is 2 weeks (please contact us if you need to run longer jobs; this is quite easy to accommodate).
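
The rules of thumb above (2048 MB per CPU, and at least twice the expected run time) can be turned into concrete directives. The following is a minimal sketch under those assumptions; suggest_directives is a hypothetical helper name, and the doubling factor is simply the lower end of the suggested range.

```shell
#!/bin/bash
# Print suggested resource directives for a job.
#   $1 = number of CPUs, $2 = estimated run time in whole hours
# suggest_directives is a hypothetical helper, not part of SLURM.
suggest_directives() {
    local cpus="$1" est_hours="$2"
    local mem_mb=$(( cpus * 2048 ))        # rule of thumb: 2048 MB per CPU
    local wall_hours=$(( est_hours * 2 ))  # rule of thumb: twice the estimate
    local days=$(( wall_hours / 24 ))
    local hours=$(( wall_hours % 24 ))

    # The maximum walltime quoted above is 2 weeks (336 hours).
    if [ "$wall_hours" -gt 336 ]; then
        echo "warning: ${wall_hours} hours exceeds the 2-week maximum" >&2
    fi

    printf '#SBATCH -c %d\n' "$cpus"
    printf '#SBATCH --mem=%d\n' "$mem_mb"
    printf '#SBATCH --time=%d-%02d:00:00\n' "$days" "$hours"
}

# Example: 4 CPUs and an estimated 18-hour run.
suggest_directives 4 18
```

The printed lines can be pasted at the top of a job script, directly below the "#!/bin/bash" line.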

Optional directives

For a list of all directives available, see the SLURM documentation at http://slurm.schedmd.com/sbatch.html. The directives in this article were covered because they were the most relevant for typical use cases.

--mail-type=BEGIN,END,FAIL,ALL and --mail-user=<emailAddress> -- Be emailed when your job starts/finishes/fails. You can specify multiple values for this (separated by commas) if need be.

-p <partition>, --partition=<partition> -- Submit a job to a specific partition. Your submission may be rejected if you do not have permission to run in the requested partition.

-A <account>, --account=<account> -- Associate a job with a particular SLURM usage account. Unnecessary unless you wish to submit jobs to a partition that requires the use of a particular account.

-D <directory>, --chdir=<directory> -- The working directory you want your job script to execute in. By default, job working directory is the location where sbatch <script> was run.

-J <name>, --job-name=<name> -- Specify a name for your job.

-o <STDOUT_log>, --output=<STDOUT_log> -- Redirect output to the log file you specify. By default, both STDOUT and STDERR are sent to this file. You can specify %j as part of the log filename to indicate job ID (as an example, "#SBATCH -o output_%j.o" would redirect output to "output_123456.o").

-e <STDERR_log>, --error=<STDERR_log> -- Redirect STDERR to a separate file. Works exactly the same as "-o".

Array jobs

When running hundreds or thousands of jobs, it may be advantageous to run these jobs as an "array job". Array jobs allow you to submit thousands of such jobs (called "job steps") with a single job script. Each job will be assigned a unique value for the environment variable SLURM_ARRAY_TASK_ID. You can use this variable to read parameters for individual steps from a given line of a file, for instance.

A sample array job that creates 7 job steps with SLURM_ARRAY_TASK_ID incremented by 3 (0, 3, 6, 9, 12, 15, 18). STDOUT and STDERR output streams have been redirected to the same file: arrayJob_%A_%a.out (%A is the job number of the array job itself, %a is the job step).

#!/bin/bash
#SBATCH --array=0-20:3
#SBATCH --output=arrayJob_%A_%a.out

echo 'This is job step '${SLURM_ARRAY_TASK_ID}
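
To illustrate reading per-step parameters from a given line of a file, here is a minimal sketch. The parameter file, its contents, and the helper line_for_task are assumptions made for illustration; the file is written inside the script only to keep the example self-contained, and in practice it would exist before the job is submitted.

```shell
#!/bin/bash
#SBATCH --array=1-3
#SBATCH --output=arrayJob_%A_%a.out

# Print line number $2 of file $1 (hypothetical helper, not part of SLURM).
line_for_task() {
    sed -n "${2}p" "$1"
}

# Hypothetical parameter file with one parameter set per line.
params_file="${TMPDIR:-/tmp}/params_$$.txt"
printf 'alpha 0.1\nbeta 0.2\ngamma 0.3\n' > "$params_file"

# SLURM sets SLURM_ARRAY_TASK_ID for each array task. Default to 1 so the
# script can also be tried outside the scheduler.
task_id="${SLURM_ARRAY_TASK_ID:-1}"

params="$(line_for_task "$params_file" "$task_id")"
echo "Task ${task_id} uses parameters: ${params}"

rm -f "$params_file"
```

Because sed counts lines from 1, the array range here starts at 1 rather than 0, so that each task ID maps directly to a line number.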

srun - Run a single command on the cluster

Sometimes it may be advantageous to run a single command on the cluster as a test or to quickly perform an operation with additional resources. srun enables users to do this, and shares all of the same directives as sbatch. STDOUT and STDERR for an srun job will be redirected to the user's screen. Ctrl-C will cancel an srun job.

Basic usage:

srun <someCommand>     

Example output (running the command "hostname" to return which computer you are running on):

$ srun hostname
cac003

Submit a command with additional directives (in this case run the program "test" with 12 cpus/20 gigabytes of memory in partition "bigjob"):

srun -c 12 --mem=20000 --partition=bigjob test

salloc - Schedule an interactive job

SLURM has the unique capability of being able to schedule interactive sessions for a user. An "interactive session" is identical to normal command-line usage of a cluster node with the resources requested. Need to run a program that requires using a GUI or test out a program? No problem, this just requires a slight modification to srun's syntax.

To start an interactive shell, use salloc in the following manner. Note that use of X11 forwarding requires that you have connected to the cluster using an SSH client that supports X-forwarding (done using "ssh -X" on logon, requires XQuartz on OSX or MobaXTerm on Windows).

salloc [other slurm options here]

Example usage (use 4 processors and 6 gigabytes of RAM interactively):

[jeffs@caclogin02 ~]$ salloc -c 4 --mem=6g      # start an interactive session with x forwarding for graphics
[jeffs@cac002 ~]$ xeyes                                     # test graphics forwarding
[jeffs@cac002 ~]$ exit                                      # quit interactive session
exit
[jeffs@caclogin02 ~]$                                           # we are now back on the node where we started

Parallel Jobs

Many of the jobs running on a production cluster are going to involve more than one processor (CPU, core). Such parallel jobs need to request the number of required resources through additional options. The most common ones are:

-N, --nodes= Number of cluster nodes requested
-n, --ntasks= Total number of tasks (processes)
-c, --cpus-per-task= Number of CPUs (cores) per task

For different types of parallel jobs different options will be specified. The most common parallel jobs are MPI (distributed memory) jobs, multi-threaded (shared memory) jobs, and so-called hybrids that are a combination of the two. Let's discuss them separately with an example for each.

MPI Jobs

MPI (Message Passing Interface) is the standard communication API for parallel distributed-memory jobs capable of being deployed on a cluster. To schedule such a job, it is necessary to specify the number of cluster nodes that will be used, and the number of processes (tasks) that are going to run on each node.

Currently, each MPI job on our cluster is restricted to run on a single node, i.e. all processes are scheduled on different CPUs (cores) and use a so-called shared-memory layer to communicate with each other. The upside of this type of scheduling is that communication is fast and efficient compared with inter-node communication. The downside is that the total number of tasks (processes) used by a program is limited by the size of the node on which it runs. A typical MPI script for such a program looks like this:

#!/bin/bash
#SBATCH --job-name=MPI_test
#SBATCH --mail-type=ALL
#SBATCH --mail-user=joe.user@email.ca
#SBATCH --output=STD.out
#SBATCH --error=STD.err
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=1
#SBATCH --time=30:00
#SBATCH --mem=1G
mpirun -np $SLURM_NTASKS ./mpi_program

The key option here is "--ntasks=8" which requests enough cores for 8 MPI tasks.

The "--nodes" and "--cpus-per-task" options need to be kept at 1 to indicate that all processes are to be run on a single node, and that each process is single-threaded (i.e. we are not doing any multi-threading on the MPI processes).

A specification of the number of processes in the mpirun line may be omitted as mpirun interfaces with SLURM and selects the proper number automatically from the "--ntasks" option.

Multi-threaded Jobs

Parallel jobs designed to run on a multi-core (shared-memory) system are usually "multi-threaded". Scheduling such a job requires specifying the number of cores needed to accommodate the threads.

OpenMP is a commonly used set of compiler directives that facilitates the development of multi-threaded programs. A typical SLURM script for such a program looks like this:

#!/bin/bash
#SBATCH --job-name=OMPtest
#SBATCH --mail-type=ALL
#SBATCH --mail-user=my.email@whatever.ca
#SBATCH --output=STD.out
#SBATCH --error=STD.err
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=30:00
#SBATCH --mem=1G
OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK time ./omp-program 

When using an OpenMP program, the number of threads (and therefore the required number of cores) is specified via the environment variable OMP_NUM_THREADS, which therefore appears in the script in front of the call to the program. We are setting it to the internal variable SLURM_CPUS_PER_TASK, which is set through the "--cpus-per-task" option (to 4 in our case).

The "--nodes" and "--ntasks" options are kept at 1 to indicate a single main program running on a single node.

Multi-threaded programs that use different multi-threaded techniques (for instance, the Posix thread libraries) use a slightly different approach, but the principle is the same:

Specify the number of required cores through the "--cpus-per-task" option and pass that number to the program through the variable SLURM_CPUS_PER_TASK.

Hybrid Jobs

MPI distributed-memory and OpenMP shared-memory parallelism may be combined to obtain a "hybrid" program. This has to be done with great care to avoid race-conditions on the process-to-process communication. However, such programs are particularly useful when it is important to exploit the multi-core nature of the nodes in a cluster.

The following script works for a simple run of a hybrid program on a single node, assuming each MPI process uses the same number of sub-threads:

#!/bin/bash
#SBATCH --job-name=OMP_test
#SBATCH --mail-type=ALL
#SBATCH --mail-user=my.email@whatever.ca
#SBATCH --output=STD.out
#SBATCH --error=STD.err
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=4
#SBATCH --time=30:00
#SBATCH --mem=1000
OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK mpirun -np $SLURM_NTASKS ./hybrid-program 

This example would run the program "hybrid-program" with 8 MPI processes, each utilizing 4 threads for a total of 32. Note that the number of nodes (i.e. the --nodes option) is set to one to indicate that all cores need to be allocated on a single node. This setting should not be changed in the current cluster configuration.
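
Since all cores must come from one node, it can help to check the total core count of a request before submitting. The sketch below multiplies tasks by CPUs per task and compares the result with a node size; the helper names are hypothetical, and the node sizes used (24 and 40 cores) are taken from the node descriptions in the Accounts and Partitions section.

```shell
#!/bin/bash
# Total cores occupied by a job: number of tasks x CPUs per task.
# total_cores and fits_on_node are hypothetical helpers, not SLURM commands.
total_cores() {
    echo $(( $1 * $2 ))
}

# Succeed if $1 tasks with $2 CPUs each fit on a node with $3 cores.
fits_on_node() {
    [ "$(total_cores "$1" "$2")" -le "$3" ]
}

ntasks=8
cpus_per_task=4
echo "Total cores requested: $(total_cores "$ntasks" "$cpus_per_task")"

for node_cores in 24 40; do
    if fits_on_node "$ntasks" "$cpus_per_task" "$node_cores"; then
        echo "Fits on a ${node_cores}-core node"
    else
        echo "Does not fit on a ${node_cores}-core node"
    fi
done
```

A request that exceeds the core count of every node available to your account cannot be placed on a single node, so reducing --ntasks or --cpus-per-task may be necessary.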

GPU jobs

CAC has a small number of NVIDIA GP100 GPUs available for general use. To access these, add the following to your job script:

#SBATCH --partition=gpu
#SBATCH --gres gpu:1

The --partition=gpu flag sends your job to a partition with GPUs available. --gres gpu:1 requests a single GPU. To request additional GPUs (up to 3 can be requested per job), you can change this instruction to --gres gpu:3 (this would request 3 GPUs). GPUs are billed as 10 CPUs when performing fairshare/job priority calculations: a GPU job counts as either #GPUs x 10 or the #CPUs requested, whichever is higher. Note that most GPU-accelerated software will not be displayed by default when running module avail. Make sure to use module spider softwarename to find and learn how to load GPU-specific software modules.
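
The billing rule can be written out as a small calculation. This is only a sketch of the arithmetic as stated above (the larger of #GPUs x 10 and #CPUs); billed_cpus is a hypothetical helper and does not query the scheduler.

```shell
#!/bin/bash
# CPU-equivalents charged to a GPU job: the larger of (GPUs x 10) and CPUs.
#   $1 = number of GPUs, $2 = number of CPUs
# billed_cpus is a hypothetical helper, not part of SLURM.
billed_cpus() {
    local gpus="$1" cpus="$2"
    local gpu_equiv=$(( gpus * 10 ))
    if [ "$gpu_equiv" -gt "$cpus" ]; then
        echo "$gpu_equiv"
    else
        echo "$cpus"
    fi
}

billed_cpus 1 4    # 1 GPU, 4 CPUs   -> 10
billed_cpus 1 16   # 1 GPU, 16 CPUs  -> 16
billed_cpus 3 10   # 3 GPUs, 10 CPUs -> 30
```

Under this rule, requesting up to 10 CPUs per GPU does not raise the billed amount beyond what the GPU itself already costs.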

For an interactive job using GPUs, run the following (this example uses 1 GPU, 10 CPUs, and 40GB of memory for 8 hours):

salloc -p gpu --gres gpu:1 -c 10 --mem 40g -t 8:0:0

Here is an example job that would run the "deviceQuery" program from the NVIDIA CUDA developer samples:

#!/bin/bash
#SBATCH --cpus-per-task=1
#SBATCH --mem=10g
#SBATCH --time=1:00
#SBATCH --partition=gpu
#SBATCH --gres gpu:1

#SBATCH --output=STD.out
#SBATCH --error=STD.err

module load cuda
module load cuda-samples
# display all allocated GPUs and their stats
deviceQuery
# show GPU bandwidth for 1 GPU only
bandwidthTest

The line "module load cuda-samples" adds to the PATH a directory containing pre-compiled CUDA sample programs. The output of these simple sample programs is directed to STD.out (standard output) and STD.err (standard error).

Accounts and Partitions

Accounts

A substantial part of our resources are allocated beforehand for large projects. This is handled through the scheduler using priorities and allocation limits. A more detailed account of allocations on the Frontenac cluster can be found on our Allocations Page.

Every user of the Frontenac cluster is issued a Default Account for the scheduler. This is done automatically at first login. It entitles the user to access the "Standard" partition of nodes. This partition contains a (somewhat variable) number of nodes. Most of these have 24 cores and 256 GB of memory. For details see this entry. If no partition and no account are specified, this default will be used. This account is also associated with a default priority, which is used to determine when a job gets scheduled if there is competition for resources.

Note: The scheduler is trying to maximize the utilization of scarce resources. Due to the relatively low priority of the default accounts, you have to expect long waiting times if many users are on the cluster. Some of the resources (for instance, nodes with large amounts of memory) may not be available to a default account at all.

If a user has been given an allocation (for instance, from an application to Compute Canada), an additional non-default account is issued. This is done "manually" and the account is only used if specified explicitly in the job script. Account specification is done through the SLURM -A or --account= option, for instance

#SBATCH --account=rac-2017-hpcg1234

An account specification consists of three parts:

  • The type of account and its associated allocation. Presently this may be "def" (for default), "rac" (for RAC allocation from Compute Canada), or "con" (for contributed systems). In the above example we are specifying a "RAC" type account, thus the "rac-"
  • The year of the associated allocation (2017 in the above example)
  • The name of the group. Typically this is "hpcg" followed by 4 digits. Since allocation limits usually apply on a group level, this needs to be specified, in the above example it's hpcg1234.
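
The three-part structure can be illustrated by splitting an account name at its hyphens. This is a minimal sketch assuming every account name follows the type-year-group pattern described above; parse_account is a hypothetical helper, not a SLURM command.

```shell
#!/bin/bash
# Split an account name of the form <type>-<year>-<group> into its parts.
# parse_account is a hypothetical helper, not part of SLURM.
parse_account() {
    local account="$1"
    local type="${account%%-*}"   # text before the first hyphen
    local rest="${account#*-}"    # text after the first hyphen
    local year="${rest%%-*}"      # text before the next hyphen
    local group="${rest#*-}"      # remaining text
    printf 'type=%s year=%s group=%s\n' "$type" "$year" "$group"
}

parse_account rac-2017-hpcg1234
```

For the example account this prints "type=rac year=2017 group=hpcg1234".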

Note that if you are entitled to use a special allocation, you must specify the proper account or you will not be able to access the extended resources that go with it. Non-default accounts also receive a higher priority on shared resources, i.e. their jobs will be scheduled preferably if resources are sparse (as they usually are).

Partitions

The Frontenac cluster is split up into partly overlapping "partitions", i.e. groups of nodes. There are currently two of these:

  • The standard partition is accessible by default accounts. It cannot be used from a non-default account. It consists largely of smaller nodes with 24 cores and 256 GB of memory.
  • The reserved partition can only be accessed by rac- and con- accounts (i.e. non-default ones). It contains large-memory and other nodes with an extended number of cores (40-144).

The two partitions are partially overlapping, i.e. some nodes may be accessed from either. However, for those nodes the non-default accounts take precedence because of their higher priority, so that if a default account and a non-default account compete, the latter will be scheduled first.

Note: The partition must be specified and it must be compatible with the account type. To specify it, you use the -p or --partition option:

#SBATCH --account=rac-2017-hpcg1234
#SBATCH --partition=reserved

If no partition is specified, standard is assumed. Important: If you are using a "rac-" or "con-" account, you must specify the "reserved" partition, as the "standard" one is incompatible and the job will not be scheduled. This means that you always need to specify both the "--account" and the "--partition" option. Specifying only one of these will cause the job to remain on the queue indefinitely.
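
Because a mismatched account and partition leaves a job waiting indefinitely, a quick check before submitting can save time. The sketch below encodes only the pairing rule described above ("rac-" and "con-" accounts go with "reserved", everything else with "standard"); check_account_partition is a hypothetical helper, and the "def-" account name used in the test of the rule is an assumed example of a default account.

```shell
#!/bin/bash
# Return 0 if the account type and partition are compatible under the rules
# described above, non-zero otherwise.
# check_account_partition is a hypothetical helper, not a SLURM command.
check_account_partition() {
    local account="$1" partition="$2" expected
    case "$account" in
        rac-*|con-*) expected="reserved" ;;
        *)           expected="standard" ;;
    esac
    if [ "$partition" = "$expected" ]; then
        echo "OK: $account may use partition $partition"
    else
        echo "Mismatch: $account should use partition $expected, not $partition"
        return 1
    fi
}

check_account_partition rac-2017-hpcg1234 reserved
check_account_partition rac-2017-hpcg1234 standard || true
```

The check does not confirm that you actually hold the account; it only catches the account/partition combinations that the scheduler would never run.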

Migrating from other Schedulers

Sun Grid Engine

Most SGE commands (qsub, qstat, etc.) will work on SLURM, although you will need to rewrite your scripts to use #SBATCH directives instead of the #$ directives used by SGE. The command sge2slurm will convert an SGE job script to a SLURM job script.

PBS/TORQUE

SLURM can actually run PBS job scripts in many cases. Most PBS commands (qstat, qsub, etc.) will work on SLURM. The "pbs2slurm" script can be used to convert a PBS script to a SLURM one.


Using SSE3 Nodes

CAC has several older nodes that use SSE3 architecture (as opposed to AVX). These nodes may or may not work with your code but users are welcome to try. They are currently underutilized and may offer a solution during times of high usage of the regular nodes on the Frontenac Cluster. A different stack needs to be loaded prior to running jobs on these nodes. Simply run 'load-sse3' prior to submitting your SLURM script.