Both Marvin and Bender have heterogeneous hardware. In particular, both systems have some nodes with Intel CPUs and some with AMD CPUs.
For compatibility reasons, all software in our Easybuild-based central installation locations is therefore installed twice. We call this the INTEL stack and the AMD stack, respectively.
In most cases, the software stack is switched out automatically for you. However, there are occasionally situations where this does not work; we suggest workarounds for these cases further below.
Here is a list of which stack is relevant for which nodes:
Bender:

Node type | Relevant Stack |
---|---|
login01 | INTEL |
A40 | INTEL |
A100 | AMD |

Marvin:

Node type | Relevant Stack |
---|---|
login[01-02] | INTEL |
login03 | AMD |
intelsr | INTEL |
lm | INTEL |
vlm | INTEL |
mlgpu | AMD |
sgpu | AMD |
The different node types in each system have implications for your software and your job settings: you have to exercise more care when compiling and installing your software, you might have to check which module stack to load, and you might have to tell SLURM which environment variables to pass on (and which not to) when queuing a job.
On Marvin, you can make use of the third (GPU) login node to make working on the AMD-based compute nodes easier; see our page on the third login node.
Nevertheless, you are not required to use it, and everything below on this page applies to Marvin as well.
Cross-compilation is possible and open to you. However, it is often easier to ensure that you compile software on the same architecture that it will later run on.
There are two ways of ensuring this: if you are on Marvin, you can log in to the third login node and compile there. On both Bender and Marvin, you also have the option to start an interactive job on the same node type you intend to run your later jobs on, and compile from within that job, as sketched below.
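For example, to compile for the AMD-based mlgpu nodes, you could request an interactive shell on such a node and build there. The partition name, resource values, and time limit below are only placeholders and need to be adapted to your system and job:

# Request an interactive shell on an mlgpu node (partition name assumed),
# then compile from within this session so the build targets that node's CPU architecture.
srun --partition=mlgpu --ntasks=1 --cpus-per-task=4 --mem=8G --time=01:00:00 --pty bash -i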
Note that the same considerations also apply in some cases where you do not compile your own software. For example, if you use a package manager such as conda, that package manager will often attempt to detect what hardware it should install its software for. This can lead to issues with the Intel/AMD architectures, but also to problems like installing a non-GPU variant of a software package because Conda detects that there is no GPU present (when run on the login node).
Since Conda is available in both stacks, we recommend using the corresponding package manager from each stack on the corresponding nodes. You might have to run conda init --reverse, followed by conda init, to reset your shell setup, as sketched below.
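A minimal sketch of such a reset, run on the node type you actually want to install software for. The module name Anaconda3/2022.10 is taken from the example listing further below and is only a placeholder; pick the Anaconda module offered by the stack on your node:

conda init --reverse           # remove the previously initialised conda from your shell startup files
module load Anaconda3/2022.10  # placeholder module name; use the Anaconda module from the current stack
conda init                     # re-initialise conda from the stack that matches this node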
You can tell which stack you are currently using by looking at the output of the module avail command. Example output from Bender:
$ module avail
------------------------- /software/easybuild-INTEL_A40/modules/all --------------------------
Anaconda3/2022.05
Anaconda3/2022.10 (D)
ant/1.10.12-Java-11
(...)
Note that the topmost line shows you the base directory of the module path in question.
The software stacks (including module files and the software itself) are located in subdirectories under the following paths:
On Bender:
/software/easybuild-INTEL_A40/...
/software/easybuild-AMD_A100/...

On Marvin:
/opt/software/easybuild-INTEL/...
/opt/software/easybuild-AMD/...
As mentioned before, the stack will typically be switched out depending on which node you are on. However, you can do the switch yourself if necessary by using the module use and module unuse commands. For example, to switch from the INTEL to the AMD stack on Marvin, you would use the following commands:
module unuse /opt/software/easybuild-INTEL/modules/all
module use /opt/software/easybuild-AMD/modules/all
You can also see which directories are currently being used by looking at the MODULEPATH environment variable (echo $MODULEPATH).
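For example, after the switch to the AMD stack shown above, MODULEPATH should contain the AMD path (illustrative output; the exact contents may differ on your node):

$ echo $MODULEPATH
/opt/software/easybuild-AMD/modules/all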
Tip: if you get error messages mentioning "illegal instruction", that is often a symptom that you are encountering the problem described here.
We have noticed that switching the stacks occasionally does not work properly, on both Bender and Marvin. The most common cause is interference between the environment variables and SLURM: LMod stores the location of the module files in an environment variable called MODULEPATH, and SLURM by default passes your current environment, including MODULEPATH, on to the job. See also the page in the LMod documentation on these topics.
You can use the #SBATCH option --export=NONE in your job script. This will make sure that no environment variables get passed from your current shell to the job. See the corresponding section in the SLURM documentation, and of course our wiki on job scripts. A minimal sketch follows below.
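A minimal job script sketch using this option; the partition, resources, module name, and program call are placeholders and need to be adapted to your actual job:

#!/bin/bash
#SBATCH --partition=mlgpu        # placeholder partition; choose the node type you need
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --export=NONE            # do not pass the submitting shell's environment to the job

module load Anaconda3/2022.10    # load modules inside the job, from the stack of the compute node
python my_script.py              # placeholder for your actual program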
Note that you can also use the commands module use and module unuse to manually control what gets put on the MODULEPATH; see the LMod documentation here and here.