The beluga-cloud resources are provided by Alliance Canada / Calcul Quebec and powered by the Magic Castle HPC management software project.
Windows users should install an SSH client, such as MobaXterm or WSL, to access the summer school compute cluster.
(Linux and macOS users can skip this step: your laptop already has a terminal installed.)
The summer school cloud cluster is accessible via SSH. Open a terminal and enter (replace XX with your user number):
ssh userXX@quantum2024.ccs.usherbrooke.ca
Then type your password. Note that the characters won't appear on screen, but they are still being entered.
It is important not to launch jobs directly on the login node; please follow the "Submitting an interactive job" section.
You can also reach the cluster with a web browser at the following link: https://quantum2024.ccs.usherbrooke.ca
Interacting with the cluster requires knowledge of the Unix command line, in particular the basic commands to navigate through files and directories.
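As a quick, non-exhaustive reminder, here are a few of the most common navigation commands (an illustrative sketch; names such as my_directory and my_file are placeholders):

pwd                       # print the current working directory
ls -l                     # list the files in the current directory
cd my_directory           # move into a directory
cd ..                     # move up one directory
mkdir my_directory        # create a directory
cp my_file my_copy        # copy a file
mv my_file my_directory/  # move (or rename) a file
rm my_file                # delete a file (there is no trash, be careful)
less my_file              # read a file page by page (press q to quit)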
If you are not used to Unix commands, you can check the following link: https://www.makeuseof.com/tag/a-quick-guide-to-get-started-with-the-linux-command-line/
There is a command reference if you just need a reminder: https://files.fosswire.com/2007/08/fwunixref.pdf
By default, each new session opened on the cluster has no software loaded; software must be loaded manually as modules. To search the software stack and check whether a particular software is available in the right version:
module spider <software_name>
Then, more details on how to load the specific version can be found with:
module spider <software_name>/<version>
Sometimes, multiple modules need to be loaded before the software itself (for example, the software's dependencies).
As a (more complex) example, to check whether the software TRIQS is installed, see which versions are available, and finally load it:
module spider triqs #reports that TRIQS is available in version 3.1.0
module spider triqs/3.1.0 #reports that StdEnv/2020 gcc/10.3.0 openmpi/4.1.1 are required
module load StdEnv/2020 gcc/10.3.0 openmpi/4.1.1 #load the dependencies
module load triqs #finally load TRIQS itself
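Two related commands that may also be useful (standard Lmod commands, shown here only as a reminder):

module list   # show the modules currently loaded in your session
module purge  # unload all modules, to start again from a clean environment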
Each user has a personal "home" directory (the default one), where most calculations should run:
/home/user99
If a software reads and writes many temporary files, you may benefit from the "scratch" directory located on an SSD drive:
/scratch/user99
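For example, a typical (illustrative) pattern is to copy the input to scratch, run there, and copy the results back home; the file and directory names below are placeholders:

mkdir -p /scratch/userXX/my_run           # create a working directory on scratch
cp ~/my_input.abi /scratch/userXX/my_run  # copy the input file from home to scratch
cd /scratch/userXX/my_run                 # run the calculation from here
# ... run the job ...
cp -r /scratch/userXX/my_run ~/results    # copy the results back to your home directory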
Lecturers may put files containing summer school material in shared directories under the /project/ directory. Example:
/project/soft/abinit
In /project there is one directory per software used in the summer school and one per lesson (for example /project/nessi and /project/DMFT). These directories can also be accessed from the internet at https://doc.quantum.2024.ccs.usherbrooke.ca
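For instance, you can browse these directories and copy the material you need into your home directory (the lesson name below reuses the /project/DMFT example above):

ls /project                 # list the available software and lesson directories
cp -r /project/DMFT ~/DMFT  # copy the DMFT lesson material into your home directory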
Once logged into the compute cluster, you have access to the login node, which is designed for manipulating files and preparing your compute tasks (called jobs). NO jobs should be run on this login node (or they will be aborted and you will be warned). Instead, jobs should be submitted to a scheduler, which will redirect them to more powerful dedicated compute nodes. Jobs are submitted to the scheduler using a batch script that contains the resources requested (CPUs, RAM, time), followed by the sequence of commands to run the compute tasks. See the examples below, or the Digital Research Alliance of Canada website for further examples.
Please remember that the resources you have access to are shared. The hardware configuration is:
* 16 nodes, each with 240 GB RAM and 32 cores (Xeon Cascade Lake 2.2 GHz)
* 10 TB of NFS-shared disk
Therefore, each user has access to approximately 8 CPU cores and 60 GB of RAM at most. Please consult the lecturer if you want to request more.
During the hands-on sessions, most of the time you will use an "interactive job" with the salloc command. For example, the following command reserves 4 CPUs and 30 GB of RAM for 3 hours on a compute node:
salloc --time=3:0:0 --cpus-per-task=4 --mem=30G
Note that, once the job starts, the hostname in your prompt will change from login1 to nodeXX, for example:
[user001@node18 ~]$
When you exit the shell, the job will terminate and you will go back to the "login" node.
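As an illustration, a typical interactive session might look like the following sketch (the module and input file are the ABINIT examples used elsewhere in this document; adapt them to your hands-on):

salloc --time=3:0:0 --cpus-per-task=4 --mem=30G  # request an interactive job on a compute node
module load abinit/9.6.2 wannier90/3.0.1         # load the software once the shell opens on the node
abinit ./abinit_simulation.abi                   # run the calculation interactively
exit                                             # leave the compute node and release the resources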
To submit a simple (non-interactive) CPU job:

1. Create the job script, for example with the text editor nano:

touch ./my_job.sh
nano ./my_job.sh

2. Write the SLURM script. For an ABINIT simulation with the input file abinit_simulation.abi, it reads:

#!/bin/bash
#SBATCH --time=01:00:00 #time: hours:minutes:seconds
#SBATCH --mem=4GB #memory requested
module load abinit/9.6.2 wannier90/3.0.1 #load software
abinit ./abinit_simulation.abi #path to the ABINIT input file, relative to the directory where the job is submitted
#here: the same directory
Note that the scheduler gives the compute node access to the files in your home directory, so users don't have to worry about data transfer.
3. Submit the job to the scheduler:

sbatch my_job.sh
Right after submission, the scheduler gives you a job identifier that lets you track the results of the simulation.
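For example, the submission typically prints something like the lines below (the job number is of course illustrative), and the identifier can then be used with the monitoring commands described at the end of this section:

sbatch my_job.sh
# Submitted batch job 123456
squeue -j 123456   # check the state of this particular job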
The terminal output of the simulation is redirected to a file named slurm-<job_id>.out. You can read it with, for example, less slurm-<job_id>.out.
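If you want to follow the output while the job is still running, a common (optional) trick is:

tail -f slurm-<job_id>.out   # follow the output file as it is written (Ctrl-C to stop)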
A threaded (OpenMP) job is a job where each processor (or thread) shares the same memory. For example, during a matrix-vector product, each thread can be assigned a part of the whole matrix to process. Threaded jobs are, however, limited to one compute node. The user is responsible for requesting the desired number of threads from the scheduler. Submission is similar to a simple CPU job, but the SLURM script (step 2) now reads:
#!/bin/bash
#SBATCH --time=01:00:00 #time: hours:minutes:seconds
#SBATCH --cpus-per-task=8 #number of CPU cores requested
#SBATCH --mem=4GB #memory requested, shared by all CPU cores
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK #MANDATORY
module load abinit/9.6.2 wannier90/3.0.1 #load software
abinit ./abinit_simulation.abi #path to the ABINIT input file, relative to the directory where the job is submitted
A distributed (MPI) job differs from a threaded job in that each processor (or process) owns its own part of the memory (it is no longer shared). For example, during a matrix-vector product, the matrix is first decomposed into several parts, then each process handles its own part, and finally the result is reassembled. Therefore, the job (the matrix, for example) can be distributed across several compute nodes. Submission is similar to the two previous jobs, but the SLURM script (step 2) now reads:
#!/bin/bash
#SBATCH --time=01:00:00 #time: hours:minutes:seconds
#SBATCH --ntasks=8 #number of MPI tasks (processes) requested
#SBATCH --mem-per-cpu=1024M #memory requested per CPU core
module load abinit/9.6.2 wannier90/3.0.1 #load software
srun abinit ./abinit_simulation.abi #path to the ABINIT input file, relative to the directory where the job is submitted
Useful commands to monitor your jobs:
* squeue -u userXX : gives a list of all the jobs submitted to the scheduler and their states
* scancel <jobid> : cancels a job
* seff <jobid> : returns statistics about the job (percentage of memory and CPU used, efficiency, total runtime)
For each job submitted to the scheduler, a file slurm-<job_id>.out is created, in which the output of the terminal is redirected. If new files are created during the job, they will be located according to the program's default behavior (i.e. if a program creates and writes results in a new directory called results, this behavior is not changed by the scheduler).
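A minimal (illustrative) monitoring sequence, assuming the job identifier 123456 returned at submission time, might look like:

squeue -u userXX   # list all of your jobs and their states (PD = pending, R = running)
seff 123456        # once the job has finished: summary of CPU and memory efficiency
scancel 123456     # if needed: cancel the job before it finishes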