Clusters

Horizon

Compiling and using ARES/BORG on Horizon

Modules

module purge
module load gcc/7.4.0
module load openmpi/3.0.3-ifort-18.0
module load fftw/3.3.8-gnu
module load hdf5/1.10.5-gcc5
module load cmake
module load boost/1.68.0-gcc6
module load gsl/2.5
module load julia/1.1.0

Building

bash build.sh --use-predownload --use-system-hdf5 --use-system-gsl  --build-dir /data34/lavaux/BUILD_ARES --c-compiler gcc --cxx-compiler g++

Running

Jupyter on Horizon

Jupyter is not yet installed by default on the Horizon cluster, but it offers a nice remote interface for people:

  • with slow and/or unreliable connections,

  • who want to manage a notebook that can be annotated directly inline with Markdown, and later converted to HTML or uploaded to the wiki with the figures included,

  • who want to use ipyparallel more efficiently.

It is not for:

  • people who do not like notebooks for one reason or another.

Installation

We use Python 3.5 here. Load the following modules:

module load intel/16.0-python-3.5.2 gcc/5.3.0

Then we are going to install jupyter locally:

pip3.5 install --user jupyter-client==5.0.1 jupyter-contrib-core==0.3.1 jupyter-contrib-nbextensions==0.2.8 jupyter-core==4.3.0 jupyter-highlight-selected-word==0.0.11 jupyter-latex-envs==1.3.8.4 jupyter-nbextensions-configurator==0.2.5

At the time of writing (22 June 2017), I am using the above versions, but later versions may well work without problems.
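
To check the installation, you can start a notebook server by hand. This is only a quick test sketch: --no-browser and --port are standard Jupyter options, the port number is an example, and it assumes the notebook package was pulled in by the install above; the scripts described below automate this step for you.

$HOME/.local/bin/jupyter notebook --no-browser --port=8888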

Automatic port forwarding and launch of Jupyter instance

Jupyter can be cumbersome to start reliably, automatically and in a consistent fashion. Guilhem Lavaux has written two scripts (here) that can help in that regard. The first script (jupyter.sh) has to be left in the home directory on Horizon; it takes care of starting a new jupyter job and reporting where it is located and how to contact it. The second script (.horizon-env.sh) has to be kept on the local station (i.e. the laptop or workstation of the user). It triggers the opening of ssh tunnels, starts jobs and forwards ports, and it should be loaded from .bashrc with a command like source ${HOME}/.horizon-env.sh. Once these steps are taken, several things become possible. First, to start a jupyter instance on Horizon you may run juphorizon. It will give the following output:

~ $ juphoriz
Forwarding 10000 to b20:8888

Now point your web browser to localhost:10000. You also know that your jupyter instance is running on beyond20 (port 8888).
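
For reference, the forwarding set up by the scripts is roughly equivalent to a manual ssh tunnel of the following form (the user name and host alias are illustrative; the scripts discover the actual compute node and port for you):

# forward local port 10000 to port 8888 on compute node b20, via the Horizon login node
ssh -N -L 10000:b20:8888 user@horizon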

To stop the session do the following:

~ $ stopjup
Do you confirm that you want to stop the session ? [y/N]
y
Jupyter stopped

If you run it a second time you will get:

[guilhem@gondor] ~ $ stopjup
Do you confirm that you want to stop the session ? [y/N]
y
No port forwarding indication. Must be down.

which means that the port forwarding information has been cleared out and the script does not know how to proceed, so it does nothing. If you still have a job queued on the system, it is your responsibility to close it to avoid occupying a Horizon node for nothing.

Two other commands are available:

  • shuthorizon triggers the shutdown of the tunnel to Horizon. Be careful, as no checks are done at the moment: any active port forwardings will be cancelled and you will have to set them up manually again.

  • hssh opens a new multiplexed ssh connection to Horizon. It will not ask for your password, as it uses the connection multiplexer available in ssh (a typical configuration is sketched below). Note that it is not possible to start X11 forwarding through this connection.
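
For reference, the ssh connection multiplexing that hssh relies on is typically enabled by a block like the following in ~/.ssh/config on the local machine (the host alias, hostname and login are assumptions to adapt to your setup):

# reuse a single master connection for all sessions to Horizon
Host horizon
    HostName horizon.iap.fr
    User your_login
    ControlMaster auto
    ControlPath ~/.ssh/control-%r@%h:%p
    ControlPersist 4h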

IPyParallel

Now we need to install ipyparallel:

pip3.5 install --user ipyparallel
$HOME/.local/bin/ipcluster nbextension enable

Use this PBS template (a sketch of what it may look like is given below).
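
If the linked template is not at hand, a minimal sketch of a PBS engine template of the kind ipyparallel expects is shown below; the {n} and {profile_dir} placeholders are substituted by ipyparallel itself, while the walltime, resource line and paths are assumptions to adapt:

#PBS -N ipengine
#PBS -j oe
#PBS -l walltime=04:00:00
#PBS -l nodes=1:ppn={n}
# start one engine per requested core
cd $PBS_O_WORKDIR
mpiexec -n {n} $HOME/.local/bin/ipengine --profile-dir={profile_dir}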

You have to put several files in your $HOME/.ipython/profile_default:

  • IPCluster configuration as ipcluster_config.py. This file indicates how to interact with the cluster's batch system. Notably, it includes a link to the aforementioned PBS template. I have removed all the extra, untouched configuration options; the original file installed by ipyparallel lists all the other possible knobs.

  • IPCluster configuration as ipcontroller_config.py. This file is used to start up the controller, which talks to all engines. It is fairly minimal, as I have kept the controller on the login node to talk to engines on the compute nodes.

  • IPCluster configuration as ipengine_config.py. This file is used to start up the engines on the compute nodes. The notable option is the one telling the engines to listen to any incoming traffic.

The documentation for ipyparallel is available from readthedocs here.

Once you have put all the files in place, you can start a new PBS-backed cluster of engines:

$ ipcluster start -n 16

With the above files, this will start one job of 16 cores. If you had chosen 32, it would have started 2 MPI tasks of 16 cores each, and so on.

To start using ipyparallel, open a new Python kernel (either from ipython, or more conveniently from a jupyter notebook):

import ipyparallel as ipp
c = ipp.Client()

Doing this will connect your kernel with a running ipyparallel batch instance. c will hold a dispatcher object from which you can instruct engines what to do.

IPyParallel comes with magic commands for IPython 3. They are great for dispatching all your commands; however, you must be aware that the context is different from your main ipython kernel. Any object has to be transmitted to the remote engine first. Check that page carefully to learn how to do that.

MPIRUN allocation

These tips were provided by Stephane Rouberol for finely specifying the core/socket binding of a given MPI/OpenMP computation.

# default is bind to *socket*
mpirun -np 40 --report-bindings /bin/true  2>&1 | sed -e 's/.*rank \([[:digit:]]*\) /rank \1 /' -e 's/bound.*://' | sort -n -k2 | sed -e 's/ \([[:digit:]]\) /  \1 /'

rank  0 [B/B/B/B/B/B/B/B/B/B][./././././././././.][./././././././././.][./././././././././.]
rank  1 [./././././././././.][B/B/B/B/B/B/B/B/B/B][./././././././././.][./././././././././.]
(...)
# we can bind to core
mpirun -np 40 --bind-to core --report-bindings /bin/true 2>&1 | sed -e 's/.*rank \([[:digit:]]*\) /rank \1 /' -e 's/bound.*://' | sort -n -k2 | sed -e 's/ \([[:digit:]]\) /  \1 /'

rank  0 [B/././././././././.][./././././././././.][./././././././././.][./././././././././.]
rank  1 [./././././././././.][B/././././././././.][./././././././././.][./././././././././.]
(...)
# we can bind to core + add optimization for nearest-neighbour comms (put neighbouring ranks on the same socket)
mpirun -np 40  --bind-to core -map-by slot:PE=1 --report-bindings /bin/true 2>&1 | sed -e 's/.*rank \([[:digit:]]*\) /rank \1 /' -e 's/bound.*://' | sort -n -k2 | sed -e 's/ \([[:digit:]]\) /  \1 /'

rank  0 [B/././././././././.][./././././././././.][./././././././././.][./././././././././.]
rank  1 [./B/./././././././.][./././././././././.][./././././././././.][./././././././././.]
# -----------------------------------------------------------
# case 2: 1 node, nb of ranks < number of cores (hybrid code)
# -----------------------------------------------------------

beyond08: ~ > mpirun -np 12 -map-by slot:PE=2   --report-bindings /bin/true  2>&1 | sort -n -k 4
[beyond08.iap.fr:34077] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././././././.][./././././././././.][./././././././././.][./././././././././.]
[beyond08.iap.fr:34077] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./././././.][./././././././././.][./././././././././.][./././././././././.]
[beyond08.iap.fr:34077] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B/./././.][./././././././././.][./././././././././.][./././././././././.]
beyond08: ~ > mpirun -np 12 -map-by socket:PE=2   --report-bindings /bin/true  2>&1 | sort -n -k 4
[beyond08.iap.fr:34093] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././././././.][./././././././././.][./././././././././.][./././././././././.]
[beyond08.iap.fr:34093] MCW rank 1 bound to socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././././.][B/B/./././././././.][./././././././././.][./././././././././.]
[beyond08.iap.fr:34093] MCW rank 2 bound to socket 2[core 20[hwt 0]], socket 2[core 21[hwt 0]]: [./././././././././.][./././././././././.][B/B/./././././././.][./././././././././.]
beyond08: ~ > mpirun -np 12 -map-by socket:PE=2 --rank-by core --report-bindings /bin/true  2>&1 | sort -n -k 4
[beyond08.iap.fr:34108] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 1[hwt 0]]: [B/B/./././././././.][./././././././././.][./././././././././.][./././././././././.]
[beyond08.iap.fr:34108] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]]: [././B/B/./././././.][./././././././././.][./././././././././.][./././././././././.]
[beyond08.iap.fr:34108] MCW rank 2 bound to socket 0[core 4[hwt 0]], socket 0[core 5[hwt 0]]: [././././B/B/./././.][./././././././././.][./././././././././.][./././././././././.]
[beyond08.iap.fr:34108] MCW rank 3 bound to socket 1[core 10[hwt 0]], socket 1[core 11[hwt 0]]: [./././././././././.][B/B/./././././././.][./././././././././.][./././././././././.]

Fighting the shared node curse

Horizon compute nodes are each made of a motherboard with 4 CPUs mounted on it. The physical resources are transparently visible from any of the CPUs. Unfortunately, each memory bank is physically attached to a preferred CPU. For a typical node with 512 GB of RAM, each CPU gets 128 GB. If one CPU needs access to physical RAM hosted by another CPU, the latency is significantly higher. The Linux kernel tries to minimize this kind of problem, so it will try hard to relocate processes so that memory accesses are not delocalised, kicking out at the same time any computation already in progress on that CPU. As a result, computations residing on one CPU can affect computations on another CPU.

The situation can be even worse if two computations share the same CPU (each of which holds N cores, with 8 < N < 14). In that case the computations fight for CPU and memory resources. For pure computation this is generally less of a problem, but that case is not so frequent on a computer designed to handle the analysis of large N-body simulations.

To summarise, without checking and ensuring that your computation sits wholly on a single CPU socket, you may suffer catastrophic performance degradation (I have experienced at least a factor of 10 a few times).

There are ways of avoiding this problem:

  • check the number of cores available on the compute nodes and try your best to allocate a single CPU socket. For example, the beyond40cores queue is composed of nodes with 4 CPUs of 10 cores each. You should then ask PBS for “-l nodes=1:beyond40cores:ppn=10”, which will give you 10 cores, i.e. a whole CPU socket.

  • keep in mind that if you need 256 GB of RAM, then in practice you should use 2 CPU sockets. So allocate 2N cores (in the previous example, that means 20 cores, even if in the end only one CPU does the computation).

  • use numactl to get informed about and enforce the resource allocation (an example of enforcing a binding is given after the output below). For example, typing “numactl -H” on beyond08 gives the following:

available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9
node 0 size: 131039 MB
node 0 free: 605 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19
node 1 size: 131072 MB
node 1 free: 99 MB
node 2 cpus: 20 21 22 23 24 25 26 27 28 29
node 2 size: 131072 MB
node 2 free: 103 MB
node 3 cpus: 30 31 32 33 34 35 36 37 38 39
node 3 size: 131072 MB
node 3 free: 108 MB
node distances:
node   0   1   2   3
  0:  10  21  30  21
  1:  21  10  21  30
  2:  30  21  10  21
  3:  21  30  21  10

It states that the compute node is composed of 4 “nodes” (i.e. CPU sockets here). The logical CPUs assigned to each physical CPU are given by “node X cpus”: the first such line indicates that the Linux kernel logical cpus “0 1 2 … 9” are assigned to the physical CPU 0. At the same time, node 0 has the amount of RAM given by “node 0 size” physically attached, and the amount of free RAM on this node is shown by “node 0 free”. Finally, there is a node distance matrix. It tells the user how far each node is from the others in terms of communication speed. It can be seen that there may be up to a factor 3 penalty for communication between node 0 and node 2.
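
To enforce such an allocation, numactl can pin a run to a single socket, both for its cores and for its memory bank. A minimal example (./my_program is a placeholder for your executable):

# run on the cores of NUMA node 0 and allocate memory only from its local bank
numactl --cpunodebind=0 --membind=0 ./my_program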

Scratch space

Occigen

Occigen is a CINES-managed supercomputer in France. You need a time allocation to use it. Check https://www.edari.fr

Module setup

Compile with Intel

module purge
module load gcc/8.3.0
module load intel/19.4
# WARNING: openmpi 2.0.4 has a bug with multithreading that causes hangs
module load openmpi-intel-mt/2.0.2
module load intelpython3/2019.3
export OMPI_CC=$(which icc)
export OMPI_CXX=$(which icpc)

Then run:

bash build.sh --use-predownload --no-debug-log --perf --native  --c-compiler icc --cxx-compiler icpc --f-compiler ifort --with-mpi  --build-dir $SCRATCHDIR/ares-build-icc --cmake $HOME/.local/bin/cmake

Compile with gcc

module purge
module load gcc/8.3.0
# WARNING: openmpi 2.0.4 has a bug with multithreading that causes hangs
module load openmpi/gnu-mt/2.0.2
module load intelpython3/2019.3
export OMPI_CC=$(which gcc)
export OMPI_CXX=$(which g++)

Prerequisite

Download cmake >= 3.10.

wget https://github.com/Kitware/CMake/releases/download/v3.15.5/cmake-3.15.5.tar.gz

Be sure the above modules are loaded and then compile:

cd cmake-3.15.5
./configure  --prefix=$HOME/.local
nice make
make install

On your laptop run:

bash build.sh --download-deps
scp -r downloads occigen:${ARES_ROOT_ON_OCCIGEN}

Build

With intel

bash build.sh --use-predownload --no-debug-log --perf --native  --c-compiler icc --cxx-compiler icpc --f-compiler ifort --with-mpi  --build-dir $SCRATCHDIR/ares-build-icc --cmake $HOME/.local/bin/cmake

With gcc

bash build.sh --use-predownload --no-debug-log --perf --native  --c-compiler gcc --cxx-compiler g++ --f-compiler gfortran --with-mpi  --build-dir $SCRATCHDIR/ares-build-gcc --cmake $HOME/.local/bin/cmake

Imperial RCS

This page contains notes on how to compile and run ARES (and extensions) on Imperial Research Computing Services.

Gain access to Imperial RCS

See this page.

Copy configuration files

Copy the pre-prepared configuration files into your home directory by cloning:

cd ~/
git clone git@bitbucket.org:florent-leclercq/imperialrcs_config.git .bashrc_repo

and typing:

cd .bashrc_repo/
bash create_symlinks.bash
source ~/.bashrc

Load compiler and dependencies

Load the following modules (in this order, and only these to avoid conflicts):

module purge
module load gcc/8.2.0 git/2.14.3 cmake/3.14.0 intel-suite/2019.4 mpi anaconda3/personal

You can check that no other module is loaded using:

module list

Prepare conda environment

If it’s your first time loading anaconda you will need to run (see this page):

anaconda-setup

In any case, start from a clean conda environment (with only numpy) to avoid conflicts between compilers. To do so:

conda create -n pyborg numpy
conda activate pyborg

Clone ARES and additional packages

Clone the repository and additional packages as usual (see ARES Building):

mkdir ~/codes
cd ~/codes
git clone --recursive git@bitbucket.org:bayesian_lss_team/ares.git
cd ares
bash get-aquila-modules.sh --clone

If a particular release or development branch is desired, these additional lines (for example) must be run:

git checkout develop/2.1
bash get-aquila-modules.sh --branch-set develop/2.1

Note that ‘git branch’ should not be used. Once this is done, check that the repository has been properly cloned and that the submodules are all on the correct branch. To do so, run:

bash get-aquila-modules.sh --status

The output will describe whether the cloned modules are able to link to the original repository.

If the status is not all well (for example, the error could be in cosmotool), try:

git submodule update

and check the module status again.

Compile ARES

Run the ARES build script using:

bash build.sh --with-mpi --c-compiler icc --cxx-compiler icpc --python

(for other possible flags, such as the flag to compile BORG python, type bash build.sh -h). Note: for releases <= 2.0, a Fortran compiler was necessary; add --f-compiler ifort to the line above. You may have to predownload the dependencies for ares: for this, add the

--download-deps

flag on the first use of build.sh, and add

--use-predownload

on the second (which will then build ares).
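
Concretely, following the text above, the two invocations would look like this (same flags as before):

# first use: also fetch the dependencies
bash build.sh --download-deps --with-mpi --c-compiler icc --cxx-compiler icpc --python
# second use: build ares with the predownloaded dependencies
bash build.sh --use-predownload --with-mpi --c-compiler icc --cxx-compiler icpc --python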

Then compile:

cd build
make

The ‘make’ command can be sped up by specifying the number of parallel jobs, N, to use:

cd build
make -j N

Run ARES example with batch script

The following batch script (job_example.bash) runs the example using mixed MPI/OpenMP parallelization (2 nodes, 32 cores/node = 16 MPI processes per node x 2 threads per process). Check this page for job sizing on Imperial RCS.

#!/bin/bash

# request bash as shell for job
#PBS -S /bin/bash

# queue, parallel environment and number of processors
#PBS -l select=2:ncpus=32:mem=64gb:mpiprocs=16:ompthreads=2
#PBS -l walltime=24:00:00

# joins error and standard outputs
#PBS -j oe

# keep error and standard outputs on the execution host
#PBS -k oe

# forward environment variables
#PBS -V

# define job name
#PBS -N ARES_EXAMPLE

# main commands here
module load gcc/8.2.0 intel-suite/2019.4 mpi
cd ~/codes/ares/examples/

mpiexec ~/codes/ares/build/src/ares3 INIT 2mpp_ares.ini

exit

As per Imperial guidance, do not provide any arguments to mpiexec other than the name of the program to run.

Submit the job via qsub job_example.bash. The outputs will appear in ~/codes/ares/examples.

Select resources for more advanced runs

The key line in the submission script is

#PBS -l select=N:ncpus=Y:mem=Z:mpiprocs=P:ompthreads=W

to select N nodes of Y cores each (i.e. NxY cores will be allocated to your job). On each node there will be P MPI ranks and each will be configured to run W threads. You must have PxW<=Y (PxW=Y in all practical situations). Using W=2 usually makes sense since most nodes have hyperthreading (2 logical cores per physical core).
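
For example, assuming nodes with 24 physical cores, the following line requests 4 such nodes (96 cores in total) with 12 MPI ranks and 2 threads per rank on each node, so that P x W = 12 x 2 = 24 = Y; the memory request is a placeholder to adapt to the node type:

#PBS -l select=4:ncpus=24:mem=48gb:mpiprocs=12:ompthreads=2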

SNIC

These instructions are for building on Tetralith; variations may be needed for other systems.

Building at SNIC

Overview

  1. Ask for time

  2. Load modules

  3. Git clone the repo and get submodules

  4. Use build.sh to build

  5. Compile the code

  6. Cancel remaining time

Detailed Instructions

  1. interactive -N1 --exclusive -t 2:00:00
    
  2. module load git
    module load buildenv-gcc/2018a-eb
    module load CMake/3.15.2
    
  3. See instructions above

  4. bash build.sh --with-mpi --cmake /software/sse/manual/CMake/3.15.2/bin/cmake --c-compiler /software/sse/manual/gcc/8.3.0/nsc1/bin/gcc --cxx-compiler /software/sse/manual/gcc/8.3.0/nsc1/bin/g++ --debug
    

Note that these paths are NOT the ones from the buildenv (as loaded before). They are “hidden” in the system and not accessible via “module avail”. If you try to compile with the buildenv versions, the compilation will fail (because those compiler versions are too old).

  5. cd build
    make -j
    
  6. Find the job ID with squeue -u YOUR_USERNAME, then cancel the remaining allocation:

scancel JOBID

Running on Tetralith

Use the following template:

#!/bin/bash
####################################
#     ARIS slurm script template   #
#                                  #
# Submit script: sbatch filename   #
#                                  #
####################################
#SBATCH -J NAME_OF_JOB
#SBATCH -t HH:MM:SS
#SBATCH -n NUMBER_OF_MPI_TASKS (with -c 32 below, this equals the number of nodes)
#SBATCH -c CORES_PER_TASK (max is 32, i.e. one full node)
#SBATCH --output=log.%j.out # Stdout (%j expands to jobId) (KEEP AS IS)
#SBATCH --error=error.%j.err # Stderr (%j expands to jobId) (KEEP AS IS)
#SBATCH --account=PROJECT-ID
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   ## you have to explicitly set this
mpprun ./PATH/TO/HADES3 INIT_OR_RESUME /PATH/TO/CONFIG/FILE.INI
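
As an illustration, a hypothetical job on 2 full nodes (2 MPI tasks of 32 cores each) could be filled in as follows; the job name, account, binary path and configuration file are placeholders to adapt:

#!/bin/bash
#SBATCH -J borg_run
#SBATCH -t 12:00:00
#SBATCH -n 2
#SBATCH -c 32
#SBATCH --output=log.%j.out
#SBATCH --error=error.%j.err
#SBATCH --account=snic2021-x-yyy
# one OpenMP thread per allocated core of each task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
mpprun $HOME/codes/ares/build/src/hades3 INIT hades_run.ini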