The beluga-cloud resources are provided by Alliance Canada / Calcul Quebec and powered by the Magic Castle HPC management software project.
Windows users should install an SSH client, such as MobaXterm or WSL, to access the summer school compute cluster.
(Linux and macOS users can skip this step: your laptop already has a terminal installed.)
The summer school cloud cluster is accessible via SSH. Open a terminal and enter (replace XX with your user number):
ssh userXX@quantum2024.ccs.usherbrooke.ca
Then type your password. Note that the characters won't appear on screen, but they are still being entered.
It is important not to launch jobs directly on the login node; please follow the "Submitting an interactive job" section.
You can also reach the cluster with a web browser at the following link: https://quantum2024.ccs.usherbrooke.ca
Interacting with the cluster requires knowledge of the Unix command line, in particular the basic commands to navigate through files and directories.
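As a quick, non-exhaustive reminder, here are a few of the most common navigation commands (an illustrative sketch; names such as my_directory and my_file are placeholders):

pwd                       # print the current working directory
ls -l                     # list the files in the current directory
cd my_directory           # move into a directory
cd ..                     # move up one directory
mkdir my_directory        # create a directory
cp my_file my_copy        # copy a file
mv my_file my_directory/  # move (or rename) a file
rm my_file                # delete a file (there is no trash, be careful)
less my_file              # read a file page by page (press q to quit)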
If you are not used to Unix commands, you can check the following link: https://www.makeuseof.com/tag/a-quick-guide-to-get-started-with-the-linux-command-line/
There is a command reference if you just need a reminder: https://files.fosswire.com/2007/08/fwunixref.pdf
By default, each new session opened on the cluster has no software loaded; software must be loaded manually as modules. To search the software stack and check whether a particular software is available in the right version:
module spider <software_name>
Then, more details on how to load the specific version can be found with:
module spider <software_name>/<version>
Sometimes, multiple modules need to be loaded before the software itself (for example, the software's dependencies).
As a (more complex) example, to check whether the software TRIQS is installed, see which versions are available, and finally load it:
module spider triqs #reports that TRIQS is available in version 3.1.0
module spider triqs/3.1.0 #reports that StdEnv/2020 gcc/10.3.0 openmpi/4.1.1 are required
module load StdEnv/2020 gcc/10.3.0 openmpi/4.1.1 #load the dependencies
module load triqs #finally load TRIQS itself
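Two related commands that may also be useful (standard Lmod commands, shown here only as a reminder):

module list   # show the modules currently loaded in your session
module purge  # unload all modules, to start again from a clean environment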
Each user has a personal "home" directory (the default one), where most calculations should run:
/home/user99
If a software reads and writes many temporary files, you may benefit from the "scratch" directory located on an SSD drive:
/scratch/user99
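For example, a typical (illustrative) pattern is to copy the input to scratch, run there, and copy the results back home; the file and directory names below are placeholders:

mkdir -p /scratch/userXX/my_run           # create a working directory on scratch
cp ~/my_input.abi /scratch/userXX/my_run  # copy the input file from home to scratch
cd /scratch/userXX/my_run                 # run the calculation from here
# ... run the job ...
cp -r /scratch/userXX/my_run ~/results    # copy the results back to your home directory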
Lecturers may put files containing summer school material in shared directories under the /project/ directory. Example:
/project/soft/abinit
In /project there is one directory per software used in the summer school and one per lesson (for example /project/nessi and /project/DMFT). These directories can also be accessed from the internet at https://doc.quantum.2024.ccs.usherbrooke.ca
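For instance, you can browse these directories and copy the material you need into your home directory (the lesson name below reuses the /project/DMFT example above):

ls /project                 # list the available software and lesson directories
cp -r /project/DMFT ~/DMFT  # copy the DMFT lesson material into your home directory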
Once logged into the compute cluster, you have access to the login node, which is designed for manipulating files and preparing your compute tasks (called jobs). NO jobs should be run on this login node (or they will be aborted and you will be warned). Instead, jobs should be submitted to a scheduler, which will redirect them to more powerful dedicated compute nodes. Jobs are submitted to the scheduler using a batch script that contains the resources requested (CPUs, RAM, time), followed by the sequence of commands to run the compute tasks. See the examples below, or the Digital Research Alliance of Canada website for further examples.
Please remember that the resources you have access to are shared. The hardware configuration is:
* 16 nodes, each with 240 GB RAM and 32 cores (Xeon Cascade Lake 2.2 GHz)
* 10 TB of NFS-shared disk
Therefore, each user has access to approximately 8 CPU cores and 60 GB of RAM at most. Please consult the lecturer if you want to request more.
During the hands-on sessions, most of the time you will use an "interactive job" with the salloc command. For example, the following command reserves 4 CPUs and 30 GB of RAM for 3 hours on a compute node:
salloc --time=3:0:0 --cpus-per-task=4 --mem=30G
Note that, once the job starts, the hostname in your prompt will change from login1 to nodeXX, for example:
[user001@node18 ~]$
When you exit the shell, the job will terminate and you will go back to the "login" node.
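As an illustration, a typical interactive session might look like the following sketch (the module and input file are the ABINIT examples used elsewhere in this document; adapt them to your hands-on):

salloc --time=3:0:0 --cpus-per-task=4 --mem=30G  # request an interactive job on a compute node
module load abinit/9.6.2 wannier90/3.0.1         # load the software once the shell opens on the node
abinit ./abinit_simulation.abi                   # run the calculation interactively
exit                                             # leave the compute node and release the resources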
To submit a simple (non-interactive) CPU job:

1. Create the job script, for example with the text editor nano:

touch ./my_job.sh
nano ./my_job.sh

2. Write the SLURM script. For an ABINIT simulation with the input file abinit_simulation.abi, it reads:

#!/bin/bash
#SBATCH --time=01:00:00 #time: hours:minutes:seconds
#SBATCH --mem=4GB #memory requested
module load abinit/9.6.2 wannier90/3.0.1 #load software
abinit ./abinit_simulation.abi #path to the ABINIT input file, relative to the directory where the job is submitted
#here: the same directory
Note that the scheduler gives the compute node access to the files in your home directory, so users don't have to worry about data transfer.
3. Submit the job to the scheduler:

sbatch my_job.sh
Right after submission, the scheduler gives you a job identifier that lets you track the results of the simulation.
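For example, the submission typically prints something like the lines below (the job number is of course illustrative), and the identifier can then be used with the monitoring commands described at the end of this section:

sbatch my_job.sh
# Submitted batch job 123456
squeue -j 123456   # check the state of this particular job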
The terminal output of the simulation is redirected to a file named slurm-<job_id>.out. You can read it with, for example, less slurm-<job_id>.out.
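If you want to follow the output while the job is still running, a common (optional) trick is:

tail -f slurm-<job_id>.out   # follow the output file as it is written (Ctrl-C to stop)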
A threaded (OpenMP) job is a job where each processor (or thread) shares the same memory. For example, during a matrix-vector product, each thread can be assigned a part of the whole matrix to process. Threaded jobs are, however, limited to one compute node. The user is responsible for requesting the desired number of threads from the scheduler. Submission is similar to a simple CPU job, but the SLURM script (step 2) now reads:
#!/bin/bash
#SBATCH --time=01:00:00 #time: hours:minutes:seconds
#SBATCH --cpus-per-task=8 #number of CPU cores requested
#SBATCH --mem=4GB #memory requested, shared by all CPU cores
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK #MANDATORY
module load abinit/9.6.2 wannier90/3.0.1 #load software
abinit ./abinit_simulation.abi #path to the ABINIT input file, relative to the directory where the job is submitted
A distributed (MPI) job differs from a threaded job in that each processor (or process) owns its own part of the memory (it is no longer shared). For example, during a matrix-vector product, the matrix is first decomposed into several parts, then each process handles its own part, and finally the result is reassembled. Therefore, the job (the matrix, for example) can be distributed across several compute nodes. Submission is similar to the two previous jobs, but the SLURM script (step 2) now reads:
#!/bin/bash
#SBATCH --time=01:00:00 #time: hours:minutes:seconds
#SBATCH --ntasks=8 #number of MPI tasks (processes) requested
#SBATCH --mem-per-cpu=1024M #memory requested per CPU core
module load abinit/9.6.2 wannier90/3.0.1 #load software
srun abinit ./abinit_simulation.abi #path to the ABINIT input file, relative to the directory where the job is submitted
Useful commands to monitor your jobs:
* squeue -u userXX : gives a list of all the jobs submitted to the scheduler and their states
* scancel <jobid> : cancels a job
* seff <jobid> : returns statistics about the job (percentage of memory and CPU used, efficiency, total runtime)
For each job submitted to the scheduler, a file slurm-<job_id>.out is created, in which the output of the terminal is redirected. If new files are created during the job, they will be located according to the program's default behavior (i.e. if a program creates and writes results in a new directory called results, this behavior is not changed by the scheduler).
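A minimal (illustrative) monitoring sequence, assuming the job identifier 123456 returned at submission time, might look like:

squeue -u userXX   # list all of your jobs and their states (PD = pending, R = running)
seff 123456        # once the job has finished: summary of CPU and memory efficiency
scancel 123456     # if needed: cancel the job before it finishes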