The OASIS Coupler Forum


Oasis3mct with Slurm heterogeneous jobs


Posted by Anonymous at March 28 2020

I am working on an EC-Earth-4/AWICM-3 prototype consisting of three components:

two parallel (OIFS & FESOM) and one serial (runoff-mapper). FESOM is MPI-parallelized; OIFS can be run either as a pure MPI or as a hybrid MPI/OpenMP parallel application.

Due to global communications the MPI scaling of OIFS is relatively limited, and optimal performance at higher core numbers (1000+) is reached with OMP_NUM_THREADS=~12. The model is also memory intensive and prone to OOM crashes in MPI-only mode. Since FESOM does not contain any OpenMP pragmas, using hybrid parallelization globally would be quite wasteful. Therefore I would like to employ the Slurm packjob option, which affords control over the parallelization scheme at the application level:

#SBATCH --ntasks=1

#SBATCH -p mpp

#SBATCH packjob

#SBATCH --ntasks=72

#SBATCH -p mpp

#SBATCH packjob

#SBATCH --ntasks=144

#SBATCH -p mpp

srun -l --propagate=STACK,CORE --cpu-bind=verbose,cores -n1 --cpus-per-task=1 ./rnfmap.exe : --cpu-bind=verbose,cores -n6 --cpus-per-task=12 --export=ALL,OMP_NUM_THREADS=12 ./master.exe -v ecmwf -e pack : --cpu-bind=verbose,cores -n144 --cpus-per-task=1 ./fesom.x

The coupled model makes it through MPI_Init but crashes shortly thereafter as follows:

1: Fatal error in PMPI_Cart_create: Invalid argument, error stack:

1: PMPI_Cart_create(340).....: MPI_Cart_create(comm=0x84000004, ndims=2, dims=0xa8d9100, periods=0xe50a540, reorder=0, comm_cart=0xb1b6a18) failed

1: MPIR_Cart_create_impl(194): fail failed

1: MPIR_Cart_create(58)......: Size of the communicator (6) is smaller than the size of the Cartesian topology (71)

Is there some option that I can turn on, or am I in uncharted territory? I'm currently using OASIS3-MCT3, but I could upgrade to OASIS3-MCT4 if you think that would help.

Best regards, Jan Streffing (AWI)

Posted by Anonymous at March 31 2020

Hi Jan,

At IPSL we run heterogeneous jobs with SLURM: our atmospheric model LMDZ is hybrid MPI/OpenMP parallelized and our oceanic model NEMO is MPI-only. I never managed to get the Slurm packjob option to work for such heterogeneous runs (in my case it works only when both executables are pure MPI). What we use instead is the --multi-prog option of srun, binding every process explicitly to its cores.

Example: we want to run the coupled configuration LMDZ-NEMO with the following specifications: LMDZ: 32 MPI x 4 OpenMP; NEMO: 40 MPI.

Header of the job:

#SBATCH --nodes=5

#SBATCH --exclusive

#SBATCH --ntasks=72

# Number of MPI tasks

Then we export:

export SLURM_HOSTFILE=./hostlist

with the hostlist file giving the name of the node used by each MPI process (one line per MPI process):

r3i3n33

r3i3n33

...

srun --cpu-bind=none --distribution=arbitrary --multi-prog ./run_file

with run_file:

0 taskset -c 0-4 ./script_lmdz.x.ksh

1 taskset -c 5-9 ./script_lmdz.x.ksh

...

32 taskset -c 0 ./script_opa.xx.ksh

33 taskset -c 1 ./script_opa.xx.ksh

...

with script_lmdz.x.ksh:

#!/bin/ksh

export OMP_PLACES=cores

export OMP_NUM_THREADS=4

./lmdz.x > out_lmdz.x.out.${SLURM_PROCID} 2>out_lmdz.x.err.${SLURM_PROCID}

and script_opa.xx.ksh:

#!/bin/ksh

./opa.xx > out_opa.xx.out.${SLURM_PROCID} 2>out_opa.xx.err.${SLURM_PROCID}
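As a side note, the hostlist and run_file do not have to be written by hand; they can be generated at the start of the job from the allocated node list. The ksh sketch below only illustrates the idea and is not part of Arnaud's setup: the variable names (NPROC_LMDZ, CORES_PER_LMDZ, NPROC_NEMO, CORES_PER_NODE) are mine, and the assumption of 40 cores per node must be adapted to the actual machine and to the core ranges chosen for the binding (0-4, 5-9, ... in the example above).

#!/bin/ksh
# Illustrative, untested sketch: build hostlist and run_file for the layout above.
NPROC_LMDZ=32        # LMDZ MPI ranks, 4 OpenMP threads each
CORES_PER_LMDZ=5     # cores reserved per LMDZ rank (matches 0-4, 5-9, ... above)
NPROC_NEMO=40        # NEMO MPI ranks, 1 core each
CORES_PER_NODE=40    # assumed number of cores per node

# Expand the allocated node list into one hostname per array element.
set -A nodes $(scontrol show hostnames $SLURM_JOB_NODELIST)

rm -f hostlist run_file
rank=0 ; node=0 ; core=0

# LMDZ ranks: each gets CORES_PER_LMDZ consecutive cores.
while [ $rank -lt $NPROC_LMDZ ]; do
  if [ $((core + CORES_PER_LMDZ)) -gt $CORES_PER_NODE ]; then
    node=$((node + 1)) ; core=0
  fi
  echo "${nodes[$node]}" >> hostlist
  echo "$rank taskset -c $core-$((core + CORES_PER_LMDZ - 1)) ./script_lmdz.x.ksh" >> run_file
  core=$((core + CORES_PER_LMDZ)) ; rank=$((rank + 1))
done

# NEMO ranks: one core each, starting on the next node.
node=$((node + 1)) ; core=0
while [ $rank -lt $((NPROC_LMDZ + NPROC_NEMO)) ]; do
  if [ $core -ge $CORES_PER_NODE ]; then
    node=$((node + 1)) ; core=0
  fi
  echo "${nodes[$node]}" >> hostlist
  echo "$rank taskset -c $core ./script_opa.xx.ksh" >> run_file
  core=$((core + 1)) ; rank=$((rank + 1))
done

export SLURM_HOSTFILE=./hostlist

With these assumed values the sketch reproduces the entries shown above (rank 0 on cores 0-4, rank 32 on core 0 of a later node): eight LMDZ ranks fit on each of the first four nodes and the 40 NEMO ranks fill the fifth, consistent with the --nodes=5 / --ntasks=72 request in the header.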

I hope this helps! Arnaud

Posted by Anonymous at April 1 2020

Thank you very much Arnaud.

I built a version of my runscript that follows the procedure you developed for LMDZ, and it worked like a charm.

I also got in touch with Hendryk Bockelmann (DKRZ), who told me that he managed to run MPI-ESM with packjob. So it does work, apparently just not quite the way I did it. Something to keep in mind for the future.

Cheers, Jan

Posted by Anonymous at April 2 2020

Dear Jan,

Could you send an email to oasishelp@cerfacs.fr with the email address of Hendryk Bockelmann (DKRZ), so that I can contact him to get the packjob command he is using and post it on the forum?

Thanks, Best regards, Laure

Posted by Anonymous at April 3 2020

Hello Jan,

Thanks also to Arnaud for this solution. I would like to emphasise that a generic solution to this issue (a mixed MPI/OpenMP parallelism that can differ between coupled components) is currently under testing with OASIS. The solution is based on the hwloc library and would also allow the MPI/OpenMP ratio to be changed dynamically, at any moment during the simulation.

Eric

Posted by Anonymous at May 12 2020

Hi Laure,

I tried the heterogeneous job support (packjob) with Slurm 19.05.5 and mpiesm-1.2.01p6-rc3. It was possible to launch some very interesting jobs combining different task layouts, numbers of OpenMP threads and resources. For ECHAM the OpenMP support does not work correctly, but in principle SLURM lets you specify the CPUs needed. My sbatch header is very simple, e.g.:

#SBATCH --job-name=testrun

#SBATCH --time=00:40:00

#SBATCH --account=foobar

# #SBATCH --partition=compute

#SBATCH --nodes=4

#SBATCH packjob

#SBATCH --partition=compute

#SBATCH --nodes=12

This allocates 4 nodes for the ocean and 12 nodes for the atmosphere.

Then I use srun like:

srun -l --cpu_bind=cores --distribution=block:block -n96 --cpus-per-task=2 $BIN_DIR/$OCEAN_EXE : --cpu_bind=cores --distribution=cyclic:cyclic -n144 --cpus-per-task=4 --export=ALL,OMP_NUM_THREADS=2 $BIN_DIR/$ATMO_EXE

There is no need to specify the MPMD (--multi-prog) style anymore.
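As a purely illustrative, untested sketch, the three-component request from the first post (one runoff-mapper task, 6 OIFS tasks with 12 OpenMP threads each, 144 FESOM tasks) might be written in this same style as follows; the node counts are placeholders that depend on the cores available per node, and this is not a confirmed fix for the MPI_Cart_create error reported above:

#SBATCH --partition=mpp
#SBATCH --nodes=1
#SBATCH packjob
#SBATCH --partition=mpp
#SBATCH --nodes=<enough nodes for 6 tasks x 12 threads>
#SBATCH packjob
#SBATCH --partition=mpp
#SBATCH --nodes=<enough nodes for 144 tasks>

srun -l --cpu_bind=cores -n1 --cpus-per-task=1 ./rnfmap.exe : --cpu_bind=cores -n6 --cpus-per-task=12 --export=ALL,OMP_NUM_THREADS=12 ./master.exe -v ecmwf -e pack : --cpu_bind=cores -n144 --cpus-per-task=1 ./fesom.x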

Best, Hendryk