Computations on the cluster are performed in so-called jobs. A job is a self-contained, user-defined task (or set of tasks). As user, you specify how many resources (nodes, cores, memory, etc.) your job needs and for how long you need them. You then queue the job for a specific partition and the scheduler software is going to start it as soon as enough resources are available.
The scheduler software on our systems is called SLURM. SLURM is operated with a number of console commands for queuing jobs, monitoring them and, if necessary, canceling them. If you want to know more about the possible arguments for any of the commands presented on this page, view their man pages, e.g. `man sbatch` for the `sbatch` command.
Waiting queues (called partitions in SLURM) differ from system to system. Usually, SLURM partitions differ in terms of hardware (e.g. GPU queues and non-GPU queues) and maximum allowed runtime per job.
SLURM queuing is designed to minimize waiting times for everyone. Thus, a priority value is assigned to each job. For example, while a job is waiting, its priority increases continuously; conversely, if a user has recently used a lot of resources, the priority of their next jobs will be lower.
You can find out the partitioning of our clusters with the `sinfo` command and get an overview of currently running jobs with `squeue`. The output of both commands can be highly customized to filter and sort the results and display additional information; see `man sinfo` and `man squeue` for the full list of options. A basic overview of the available hardware can also be found on the corresponding System Information pages (Bender, Bonna, Marvin).
Tip: While it always depends on the system, partitions with shorter job time limits typically have shorter waiting times. It therefore makes sense to put each job into the shortest queue it fits into.
Remember that you can also use `sinfo` to get the queue setup and job limits on each system.
| Name | Purpose | Time Limit | Max nodes per job |
|---|---|---|---|
| `A40devel` | Brief debugging test runs on the A40 nodes | 1 hour | 1 |
| `A40short` | Short jobs on the A40 nodes | 8 hours | 3 |
| `A40medium` | Longer jobs on the A40 nodes | 1 day | 1 |
| `A100devel` | Brief debugging test runs on the A100 nodes | 1 hour | 2 |
| `A100short` | Short jobs on the A100 nodes | 8 hours | 2 |
| `A100medium` | Longer jobs on the A100 nodes | 1 day | 1 |
The `medium` queues are also limited to 1 concurrent job per user (additional jobs will simply be held in the queue until the current one is finished).
| Name | Purpose | Time Limit | Max nodes per job |
|---|---|---|---|
| `short` | Short jobs | 8 hours | 32 |
| `medium` | Medium-length jobs | 1 day | 16 |
| `long` | Long jobs | 7 days | 8 |
| `reservations` | Used only for special purposes | N/A | N/A |
Since the Marvin hardware is very heterogeneous, there are a lot of queues. They differ by node type and queue type; the names all take the form `<node type>_<queue type>`. For example, to run a job in the `short` queue on the default CPU compute nodes (node type `intelsr`), the command would be:

```
sbatch --partition=intelsr_short ...
```
Here are all node types on Marvin (see the technical details):
| Name | Purpose | Max nodes per job (devel/short/medium/long) |
|---|---|---|
| `intelsr` | MPP Nodes (default CPU nodes) | 2 / 32 / 16 / 16 |
| `lm` | Large Memory Nodes | 1 / 1 / 1 / 1 |
| `vlm` | Very Large Memory Nodes | 1 / 1 / 1 / 1 |
| `sgpu` | Scalable (A100) GPU Nodes | 2 / 8 / 4 / 4 |
| `mlgpu` | Machine learning (A40) GPU Nodes | 2 / 4 / 2 / 2 |
And here are the queue types. These differ mostly by runtime; use `sinfo` to show the exact time limits.
| Name | Purpose | Time Limit |
|---|---|---|
| `devel` | Use this for very short debugging jobs | 1 hour |
| `short` | Short jobs | 8 hours |
| `medium` | Medium-length jobs | 1 day |
| `long` | Long jobs | 7 days |
The main way to create a job is by writing a job script. In principle, a job script is just a regular Linux shell script with additional instructions for SLURM about what resources to allocate for the job. These instructions take the form `#SBATCH --parameter=value`.
You as the person running the job should review which resources your job needs before you run it (remember your responsibilities). The three resources you basically always need to think about are:

- `--partition` (or `-p`)
- `--time` (or `-t`)
- `--ntasks` and `--cpus-per-task` (and possibly `--nodes` and `--ntasks-per-node` for larger jobs)

Depending on the system and job, you may also need:

- `--gpus` if your job uses GPUs
- `--account`. This is not necessary on Bender. See also these instructions for Marvin and the Bonna wiki for Bonna. The `sshare -U` command shows to which accounts you belong.

Add other parameters as needed (commonly RAM with the `--mem` option). You can see a full list of available job parameters in the official sbatch documentation. Most parameters have default values, meaning you do not necessarily need to set them.
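As an illustration of how these parameters fit together, here is a hedged sketch of a CPU-only job script on Marvin. The partition follows the `<node type>_<queue type>` scheme described above; the account name, modules and program are placeholders you would replace with your own:

```bash
#!/bin/bash
#SBATCH --partition=intelsr_short      # default CPU nodes, short queue
#SBATCH --account=<your account>       # required on Marvin and Bonna, not on Bender
#SBATCH --time=2:00:00                 # 2 hours wall time
#SBATCH --ntasks=1                     # one task ...
#SBATCH --cpus-per-task=8              # ... with 8 CPU cores
#SBATCH --mem=16G                      # 16 GB of RAM

module load <your modules>             # load whatever software your program needs
./my_program                           # placeholder for your actual computation
```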
Here is a very simple job script which will request a single GPU on Bender:
```bash
#!/bin/bash
#SBATCH --partition=A40devel
#SBATCH --time=0:05:00
#SBATCH --gpus=1
#SBATCH --ntasks=1

module load Python
python myscript.py
```
The sequence in the first line of this script is called a shebang. When the script gets executed, it tells Linux what program to interpret the script with, the bash shell in this case.
The part below the `#SBATCH` instructions is a regular Linux shell script and can therefore contain any Linux command. In addition to actually launching your computation, you might for example need to load modules or create output directories.
To queue a job, simply call the `sbatch` command on your job script, like this:

```
$ sbatch <script>
```
You can also add options directly to this `sbatch` command, identical to the ones you would use in the `#SBATCH` instructions. If you do, these options will override their counterparts defined in the script. The hierarchy of options in decreasing precedence is as follows:

1. Options given on the `sbatch` call
2. `#SBATCH` instructions in the job script

This hierarchy allows you to leave out parameters that you do not care about. It also simplifies testing: you can put the parameters that a job script needs into the script, but override them for individual runs to test something.
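For example, assuming a script named `jobscript.sh` that sets its own time limit, the value given on the command line takes precedence:

```bash
# jobscript.sh contains:  #SBATCH --time=2:00:00
# This submission runs the same script with a 30-minute limit instead:
sbatch --time=0:30:00 jobscript.sh
```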
Once you have queued a job with SLURM, it will be assigned an ID number. This job ID is the identifier with which you can interact with your jobs, for example in order to monitor them. You can find out which jobs you are currently running, their IDs and other useful information with:
```
$ squeue --me
```
If you want to do CPU-intensive work interactively on a compute node, e.g. for visualization or debugging, a job script might not be an option. Please do not use the login nodes for CPU-intensive tasks over longer periods, as you will slow down the login node for everybody else; we reserve the right to kill any process that causes high CPU load over a longer period.
Instead, we recommend using an interactive job. These are like any other SLURM job, except that only a console gets started, which you can then use interactively. You can start an interactive job with a command similar to the following:

```
$ srun --pty <other SLURM options> /bin/bash
```
Note two important details: first, `srun` is used instead of `sbatch`; second, the `--pty` option is used. Other SLURM options can be used as needed, just like with any `sbatch` call.
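As a concrete sketch, the partition name below follows Marvin's naming scheme and is only an example; pick whatever fits your system and task:

```bash
# Request an interactive shell with 4 cores for one hour on a devel queue
srun --partition=intelsr_devel --ntasks=1 --cpus-per-task=4 --time=1:00:00 --pty /bin/bash
```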
If a job has not finished running yet, you can view highly detailed information about it with:

```
$ scontrol show job <id>
```
While `scontrol` is primarily an admin command, the `show job` option can be extremely handy for regular users as well. Additionally, there is `sacct` (see documentation), which is used to query the SLURM database containing information about past jobs. The `sacct` command is much more complex, but also a very powerful tool if you need to access information about old jobs after the fact.
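For example, a query along the following lines shows a short accounting summary of a finished job; the list of fields is just one possible selection, see `man sacct` for all of them:

```bash
# Accounting summary for a job; replace <job ID> with the actual ID
sacct -j <job ID> --format=JobID,JobName,Partition,State,Elapsed,MaxRSS,ExitCode
```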
You can cancel a specific job prematurely, i.e. before it would have ended by itself, with:

```
$ scancel <job ID>
```

This command can also take the `--me` option in order to cancel all your currently running jobs.
By default, the output files of jobs are named `slurm-<id>.out` and are placed in the directory that was your current working directory when you submitted the job. The name of the output file can be changed with the `--output=<pattern>` option; see also the sbatch documentation.
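For instance, `sbatch` supports replacement symbols such as `%j` (job ID) and `%x` (job name) in the pattern, so a line like the following collects output in a `logs` directory (which has to exist before the job starts):

```bash
#SBATCH --output=logs/%x-%j.out
```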
Graphics processing units (GPUs) are specialized processors whose massively parallel architecture lets them perform certain operations much faster than CPUs. They belong to the broader class of so-called accelerators, of which GPUs are only one subgroup.
The GPU nodes on Marvin come in two flavors:

- The scalable GPU partition (node type `sgpu`) consists of 32 nodes with 4 A100 GPUs each and is intended for large multi-GPU jobs.
- The machine learning GPU partition (node type `mlgpu`) consists of 24 nodes with 8 A40 GPUs each. It is intended for machine learning workloads.

If you are on Marvin, you can use the third login node to simplify things, as its hardware configuration is closer to Marvin's GPU nodes. This is not required, however. See our page about the third login node.
In order for a program to use the GPUs, the following conditions have to be met:

- A GPU partition has to be selected with `--partition`.
- The GPUs have to be requested with `--gpus`.

In our experience it is very easy, especially as a new user, to accidentally not use the GPUs despite thinking you did. This is quite wasteful, as GPUs are expensive and resource-intensive to operate. Please make sure that your software can see the GPUs and actually makes use of them. Remember that you can use small test jobs and the `devel` queues for debugging.

Tip: There is a simple test you can do: if you disable GPU support in your software and it does not run any slower than before, there is a GPU misconfiguration.
Here is an example for Bender:

```bash
#!/bin/bash
#SBATCH --partition=A40devel
#SBATCH --gpus=2

<Your code ...>
```
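To verify that the allocated GPUs are actually visible inside a job, a quick sanity check like the following can help. This is only a sketch; it assumes the NVIDIA driver tools (`nvidia-smi`) are available on the GPU nodes:

```bash
#!/bin/bash
#SBATCH --partition=A40devel
#SBATCH --gpus=1
#SBATCH --time=0:05:00

# List the GPUs the job can see; if nothing is listed, the GPU request did not work
nvidia-smi
# On typical SLURM setups this variable names the GPUs assigned to the job
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"
```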
There are also a number of other parameters in the SLURM documentation that you can use to control GPU usage.
In order to solve a problem faster or to process larger datasets, calculations can often be split into pieces and executed in parallel. However, parallelization is quite a big topic of its own and depends on the programming language you want to use, which is why we only mention it briefly here.
An introductory VOD tutorial on shared-memory programming with the OpenMP API can be found on the HPC Wiki.
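As a small illustration of shared-memory parallelism in a job script, the following sketch requests several cores for one task and passes that count on to OpenMP; the program name is a placeholder:

```bash
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=1:00:00

# Tell OpenMP to use as many threads as cores were allocated to the task
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
./my_openmp_program   # placeholder for your OpenMP-parallelized program
```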
- `squeue` - shows all running jobs on the cluster
- `sinfo` - shows all parts of the cluster; check for the idle ones
- `sview` - a graphical view of the cluster usage (don't forget to use `ssh -X`)
- `scontrol show job <jobid>` - shows all details of the job with the given job ID number
- `scancel <jobid>` - cancel a running job
- `scancel -t PENDING` - cancel all pending jobs
- `scontrol hold <jobid>` - put a job on hold
- `scontrol release <jobid>` - release the job
- `scontrol requeue <jobid>` - cancel and rerun a job

The following environment variables can be used inside of SLURM jobs to retrieve information about the job. See here for a full list.
- `SLURM_NNODES`
- `SLURM_JOBID`
- `SLURM_NTASKS`
- `SLURM_TASKS_PER_NODE`
- `SLURM_NPROCS`
- `SLURM_CPUS_ON_NODE`
- `SLURM_JOB_NODELIST`
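For example, a job script can log some of these values at the start of a run, which makes it much easier to check afterwards what was actually allocated:

```bash
# Inside a job script: record what SLURM actually allocated
echo "Job ${SLURM_JOBID} running on ${SLURM_NNODES} node(s): ${SLURM_JOB_NODELIST}"
echo "Tasks: ${SLURM_NTASKS}, CPUs on this node: ${SLURM_CPUS_ON_NODE}"
```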
On systems where you need to use SLURM accounts (Bonna and Marvin), you can view your associated SLURM accounts and your cluster usage with:

```
$ sshare
```
Checkpointing refers to the technique of writing out intermediate results. Implementing checkpoints in your jobs allows you to restart a job from an intermediate state if it fails, rather than having to re-run the entire computation. This saves you from wasting cluster resources, cuts down on the time you have to wait for your results, and simplifies troubleshooting.

We highly recommend that you enable checkpointing wherever possible, or implement it if you write your own software. Since we never guarantee that the cluster will work and there are many reasons a job might fail, this is highly advantageous. Examples of such reasons include exceeding the time limit or the allocated memory, node failures, or cluster downtimes.
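How checkpointing looks depends entirely on your software. As a purely hypothetical sketch, assuming a program that can write a checkpoint file and resume from it via command-line flags (both the program name and its flags are made up here), the job script could branch on whether a checkpoint already exists:

```bash
#!/bin/bash
#SBATCH --time=8:00:00
#SBATCH --ntasks=1

CHECKPOINT=checkpoint.dat

if [ -f "${CHECKPOINT}" ]; then
    # A previous run left a checkpoint behind: resume from it
    ./my_simulation --resume-from "${CHECKPOINT}"
else
    # Fresh start: write checkpoints regularly so a failed run can be resumed
    ./my_simulation --write-checkpoint "${CHECKPOINT}"
fi
```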
You can request SLURM to send you an e-mail notification if certain events occur with:

```
$ sbatch --mail-type=<type> <script>
```

See `man sbatch` for all possible arguments.
E-mail notifications are not currently functional on Bender.
By default, the specified notifications will be sent to the e-mail address of the submitting user. You can change the receiving address with the `--mail-user=<user>` option.
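For example, the following `#SBATCH` lines (with a placeholder address) would send a mail when the job finishes or fails:

```bash
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=<your e-mail address>
```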
SLURM jobs inherit environment variables from the shell they were submitted from. This is often undesirable, as it hampers reproducibility and can get in the way if the hardware that runs the job requires a different software stack than the login node (e.g. modules built for AMD vs. Intel CPUs).
The sbatch option `--export=NONE` prevents this and ensures a clean environment. However, this setting also prevents the inheritance of the environment from the job's script to the job's steps. SLURM jobs are executed in (multiple) steps if `srun` or `mpirun` are used in the job script. To ensure the propagation of the job's environment to its steps, the option `--export=NONE` should always be accompanied by `unset SLURM_EXPORT_ENV` (or the equivalent `export SLURM_EXPORT_ENV=ALL`).
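Putting this together, a hedged sketch of an MPI-style job script with a clean environment might look like this; module and program names are placeholders:

```bash
#!/bin/bash
#SBATCH --export=NONE          # do not inherit the submitting shell's environment
#SBATCH --ntasks=4
#SBATCH --time=1:00:00

# Allow the environment set up in this script (e.g. loaded modules)
# to propagate to the job steps started with srun
unset SLURM_EXPORT_ENV

module load <your MPI module>
srun ./my_mpi_program          # placeholder for your MPI program
```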