This page contains the most basic information you should know as a new cluster user: a basic overview of what a cluster is, how to connect to one, how to run computations on it, and where to get help. Please read this entire page at least once before using our systems.
A cluster is essentially a large computer that is composed of many smaller inter-connected computers. These individual computers are called nodes. Much like regular PCs, each node has one or more CPUs which in turn have multiple cores. Each node also has its own RAM (random-access memory). Nodes may also have additional hardware like graphical processing units (GPUs).
Many things about a cluster are identical to any other (Linux) computer. There are, however, some differences.
Storage devices (hard drives etc.) on HPC clusters are usually centralized, meaning users have access to the same files from any node. On some clusters, individual nodes may have additional local hard drives (or sometimes SSDs); these usually serve special purposes, and how to access them is cluster-specific.
The nodes are connected via a fast network (usually called an interconnect), most commonly employing InfiniBand or occasionally Omni-Path technology. This allows multiple nodes to function together more effectively than regular networked computers.
As you can see in the image below, you always connect to one of the login nodes. There you work interactively, usually via a Linux command-line interface (CLI). However, you do not run your resource-intensive calculations on the login nodes themselves. Instead, you (usually) queue them for computation on the compute nodes via a job scheduler, which on all our systems is the software SLURM.
Never run computations on the login nodes! You will slow down the login nodes for everyone. Only ever run your actual computations via the job scheduling system.
Our systems are available to university members (both employees and students) free of charge. All you need to do is to register for HPC access by filling out the corresponding registration form.
Once you have access, connecting to our systems is done via the Secure Shell (SSH) protocol, which can be used under Windows, Linux, and macOS systems.
Our systems can only be reached from inside the university network or the university's VPN by default.
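A typical connection from a terminal looks like the sketch below; the user name and host name are placeholders, and the actual values are in your welcome e-mail and the system-specific documentation.

```bash
# Replace <username> and <login-node-address> with the values for your system.
ssh <username>@<login-node-address>

# If your SSH key is stored in a non-default location, point ssh to it explicitly:
ssh -i ~/.ssh/id_ed25519 <username>@<login-node-address>
```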
Please remember that you are sharing a limited resource with the entire university, with comparatively few technical restrictions. Keep in mind, too, that HPC systems have a very high power consumption and are expensive to operate.
That means, among other things, that the login nodes are not the place for heavy computations. Instead, please stick to the following rules of thumb regarding the use of the login nodes:
Things that are generally OK | Things that are not OK |
---|---|
Compiling software. | Running computations on the login nodes for hours and using a large number of cores. |
Doing brief debug runs with a few parallel processes to see if they start at all. | Running scripts on the login nodes because the waiting time for jobs is too long. |
Using `screen` sessions to save the state of your console between logins. | Leaving phantom processes (those that persist after logout) lying around forever (some software is known to do that). To check for phantom processes, use `top -u <username>` to see if there are any background processes in your name (see the example below the table). |
Copying large files after taking steps to minimize the information that actually needs to be copied. | Blindly making multiple copies of your data for convenience. |
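As a quick sketch of the phantom-process check mentioned in the table (these are standard Linux tools, not cluster-specific commands):

```bash
# Show your own processes on this login node interactively (press 'q' to quit):
top -u $USER

# Or list them non-interactively, with their age and command line:
ps -u $USER -o pid,etime,cmd

# End a leftover process by its PID (replace <pid> accordingly):
kill <pid>
```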
Delete data once you no longer need them. On Marvin, release workspaces when you are done with them.
Think ahead about how many resources your job actually needs and reserve only as much for each job. However, a time reserve is in order: if you do not know how long your job will run, give it a generous margin (e.g. between 1.5 and 2 times the expected runtime). It is also fine to run a few longer jobs to experiment with runtimes.
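For example, applying that rule of thumb to a job you expect to run for about two hours might look like this in a batch script (a sketch; the exact value is up to you):

```bash
# Expected runtime ~2 hours; requesting roughly 1.5-2x as a reserve:
#SBATCH --time=03:30:00
```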
The HPC team will never interfere with the job prioritization algorithm, no matter how urgent your paper deadline is.
Also, note that another user's job running before yours does not mean that job queuing went wrong. A number of factors go into the priority calculation and the algorithm may sometimes fill gaps with small jobs that fit into them.
We will not write or debug your software for you; we have neither the time nor the expertise to do that.
That said, we offer consulting services for both basic use and for advanced topics such as parallelization and performance optimization. You can get in contact with us at contact@hpc.uni-bonn.de.
Our clusters, like almost all clusters, run on the Linux operating system.
If you do not know how to operate Linux, check out our Linux Introduction Course which is held at least once each semester, or our Linux video tutorial. This tutorial is part of the larger HPC Wiki which offers more tutorials and a lot of additional information.
Logging in is done via secure shell (SSH). The login data will be in your welcome e-mail. Depending on the system, they may contain an initial password which you will be asked to change upon first login.
We describe the login process and SSH here.
Note: As of September 25, 2024, logging in to Marvin requires uploading an SSH public key first. See our guide on how to do that. This applies only to Marvin, not to all of our systems.
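If you do not have an SSH key pair yet, you can generate one on your local machine as sketched below; the comment string is only an example, and the upload procedure itself is described in the linked guide.

```bash
# Generate an Ed25519 key pair; accept the default location and choose a passphrase.
ssh-keygen -t ed25519 -C "your.name@uni-bonn.de"

# The file to upload is the *public* key:
cat ~/.ssh/id_ed25519.pub
```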
As mentioned, you do not run your computations on the login (front-end) nodes directly. Rather, you write a job script and put it into a waiting queue, where a job scheduling system will decide when and on which node it runs. On all our systems, the job scheduling system is the software SLURM.
When you create a job script, you will also have to decide what resources to request and for how long. The main resources are CPU cores, memory and number of GPUs. You also decide which of the many queues your job should go into.
See here how to create job scripts and here how to queue them and interact with the job scheduling system. The official SLURM documentation is here.
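For orientation, a minimal job script might look like the sketch below. The partition name, module, and program are placeholders; the actual partition names and recommended settings are in the system-specific documentation linked above.

```bash
#!/bin/bash
#SBATCH --job-name=my_first_job   # name shown in the queue
#SBATCH --partition=<partition>   # queue/partition; see the system-specific docs
#SBATCH --ntasks=1                # number of tasks (processes)
#SBATCH --cpus-per-task=4         # CPU cores per task
#SBATCH --mem=8G                  # memory for the whole job
#SBATCH --time=02:00:00           # wall-clock time limit (hh:mm:ss)

# Load the software the job needs (module name is an example):
module load GCC

# Run the actual computation:
srun ./my_program
```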
Like on any Linux system, you have a home directory on the cluster. Your home directory is limited in size.
Individual systems may have additional file systems for special purposes. They are described here:
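A generic way to check how much space your home directory currently occupies (some systems additionally provide dedicated quota commands; see the system-specific documentation):

```bash
# Total size of your home directory:
du -sh $HOME

# Largest items at the top level, sorted by size:
du -sh $HOME/* | sort -h | tail
```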
All our systems have a lot of centrally installed software that you can use right away. This software is usually provided in the form of environment modules.
You can type `module avail` to see an overview of all available environment modules, and hence all centrally installed software. How to use environment modules is described here and in the HPC Wiki.
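The most common module commands look like this; the package name is only an example, and the real names are whatever `module avail` shows on your system.

```bash
# List all available modules:
module avail

# Narrow the list down to a specific package (name is an example):
module avail Python

# Load a module and check what is currently loaded:
module load Python
module list

# Unload everything to start from a clean state:
module purge
```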
There are various ways you can install and build your own software. Some common ways of doing that are described here.
Additionally, there are various package managers. You can use Conda with `module load Miniforge3` or EasyBuild with `module load EasyBuild`. Both let you install software into your own home directory.
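A minimal Conda workflow based on the Miniforge3 module mentioned above might look like this; the environment name and packages are examples, and depending on the setup you may need to initialize your shell for conda first.

```bash
# Make conda available:
module load Miniforge3

# Create a personal environment in your home directory (name and packages are examples):
conda create -n myproject python=3.11 numpy

# Activate it (if this fails, your shell may first need 'conda init bash' or 'source activate'):
conda activate myproject
```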
You do not run HPC computations directly from the command line, because anything you type in the console runs on the login node.
Computations on the cluster are run in so-called jobs. In such a job, you define how many resources (CPU cores, RAM, GPUs) you need and for how long you need them. You also typically provide a job script which details what program(s) are to be run in it. You then put the job into a waiting queue and our scheduler SLURM decides, based on the size of the job and other factors, when to run it.
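The basic submit-and-monitor workflow uses the standard SLURM commands; the script name below is an example.

```bash
# Submit the job script; SLURM prints the job ID:
sbatch my_job.sh

# Check the state of your own jobs in the queue:
squeue -u $USER

# Cancel a job you no longer need (replace <jobid> with the ID printed by sbatch):
scancel <jobid>
```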
This wiki only provides the most important information on how to interact with the HPC systems at the University of Bonn.
An overview of many important terms, concepts and technologies can be found at the HPC Wiki. It is built up continuously within the HPC.NRW project. The HPC Wiki also contains a number of interactive tutorials.
If you are looking for help on a specific Linux command, try the built-in help functions and especially the Linux man pages, which you can access with `man <command>`. If the developer has written a man page for their software, which is usually the case, it is then displayed.
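For example, to read the manual for the SLURM submission command (any installed command works the same way):

```bash
# Read the manual page of the SLURM submission command:
man sbatch

# Many programs also print a short usage summary:
sbatch --help
```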
If none of these resources answers your question, you can send an e-mail to support@hpc.uni-bonn.de. The same address, or one of our system-specific support addresses (bender-support@hpc.uni-bonn.de, bonna-support@uni-bonn.de), can also be used for reporting problems and for software installation requests.
The HRZ and the HPC/A Lab offer a variety of training courses concerning HPC and cluster usage. These courses can usually be found on eCampus by going to the HRZ category and then to the corresponding semester.
Additionally, we offer consulting for development and optimization of your software. Experts from multiple scientific disciplines are available to review your software or otherwise advise you in person. See an overview of our support services here.