The OASIS Coupler Forum


Problem with MPI tasks of an atmos-ocean coupled model (using OASIS3-MCT)


Posted by Anonymous at December 21 2022

Hello, I am using OASIS3-MCT to couple an atmosphere model and an ocean model (NEMO). After setting up the basics and running the model, I have a problem with the MPI tasks.
If we give the same number of MPI tasks to both models, or more MPI tasks to NEMO, the run performs fine. However, if NEMO has fewer MPI tasks than the atmosphere model (atmos cores=4, ocean cores=1~3), the run hangs in MPI_WAITALL.

The run stops at the following code in:

~/oasis3-mct/lib/mct/mct/m_Transfer.F90

Subroutine waitrecv_(aV, Rout, Sum)
.....
if(Rout%numratt.ge.1) then
       Call MPI_WAITALL(Rout%nprocs,Rout%rreqs,Rout%rstatus,ier)
       if(ier /= 0) call MP_perr_die(myname_,'MPI_WAITALL(reals)',ier)
endif
....

I look forward to help from the OASIS3-MCT developers.

Thanks.

Posted by Anonymous at December 27 2022

Hi,
It is hard to believe that this is a problem with MCT or with the OASIS API, as they have been stable for quite some time (but one never knows!).
Ideally, could you set up a toy coupled model (i.e. no real models but realistic coupling exchanges) that reproduces the problem, which we could run on our side to try to understand it?
In the meantime, some questions to consider:
- We assume the models run as 2 executables running concurrently on separate PEs. If not, what is your setup?
- Does the behaviour change if you switch the order of model launching (i.e. swap "model1" and "model2" in the job launch command)?
- Does this happen on the first communication?
- When you change the task count, is something overlooked in the individual model setup or in the calls to the OASIS API? Is the partition correctly expressed with oasis_def_partition? Are the coupling initialization calls correct (see the sketch after this list)? Are you sure you are launching each model on the correct number of tasks?
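
For reference, a minimal sketch of the definition sequence each coupled task is expected to go through (component and field names like 'toyatm' and 'FSENDATM' are illustrative, and the oasis_def_var call shown assumes the OASIS3-MCT 4+/5 interface without the var_shape argument):

  USE mod_oasis
  INTEGER :: comp_id, local_comm, part_id, var_id, ierr
  INTEGER :: ig_paral(5), var_nodims(2)

  CALL oasis_init_comp(comp_id, 'toyatm', ierr)   ! component name must match the namcouple
  CALL oasis_get_localcomm(local_comm, ierr)      ! communicator spanning this component only
  ! ... fill ig_paral with this task's subdomain ...
  CALL oasis_def_partition(part_id, ig_paral, ierr)
  var_nodims(1) = 2                               ! rank of the coupling field
  var_nodims(2) = 1                               ! number of bundled fields
  CALL oasis_def_var(var_id, 'FSENDATM', part_id, var_nodims, OASIS_Out, OASIS_Real, ierr)
  CALL oasis_enddef(ierr)                         ! collective: every task of every component must reach it

A hang that appears only for some task counts is often caused by a task that never reaches oasis_enddef, or by partition vectors that do not cover the global grid exactly once across all tasks.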

Let us know ...
 Sophie, for the OASIS3-MCT developers

Posted by Anonymous at December 29 2022

Hi Sophie,

Thanks for your kind reply.

- We are running the models as below:
mpirun --hostfile ~/openmpi.hosts -np 15 nemo.exe : --hostfile ~/openmpi.hosts -np 15 ./atmos.exe

- The hang occurs even when the order of atmos.exe and nemo.exe is swapped, as shown below.
mpirun --hostfile ~/openmpi.hosts -np 15 nemo.exe : --hostfile ~/openmpi.hosts -np 15 atmos.exe
mpirun --hostfile ~/openmpi.hosts -np 15 atmos.exe : --hostfile ~/openmpi.hosts -np 15 nemo.exe
In both cases, the run only works if the number of PEs used by nemo.exe is at least the number used by atmos.exe.

- Initialization of the MPI communicator, the partitions, and the variable definitions all complete successfully; the hang occurs during the get (NEMO->ATMOS) that follows the first put (ATMOS->NEMO).
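
In call-sequence terms, the pattern on the atmosphere side looks like this (a sketch; the variable ids and date argument are illustrative):

  CALL oasis_put(var_id_a2o, idate, field_out, info)   ! ATMOS -> NEMO: completes
  CALL oasis_get(var_id_o2a, idate, field_in,  info)   ! NEMO -> ATMOS: hangs in MPI_WAITALL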

- When the MPI task count is changed, nothing is altered in the model-side settings.
The atmos model uses an orange partition and is configured as below (see the layout note after the code):
 !! 1. PARTITION DEFINITION
        id_grid_size(1)=3        ! orange partition
        id_grid_size(2)=latlen*2 ! total number of segments (2 per latitude row)

        idx=3
        DO i = 1, latlen
          def_i=latstr+(i-1)
          latdef_=latdef(def_i)

          id_grid_size(idx)=(jdim-latdef_)*idim
          id_grid_size(idx+1)=idim
          id_grid_size(idx+2)=(latdef_-1)*idim
          id_grid_size(idx+3)=idim

          idx = idx + 4
        END DO

        CALL oasis_def_partition(part_id(cnt),id_grid_size(:),ierr)
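
For reference, the orange partition vector layout given in the OASIS3-MCT documentation is (for segments k = 1..nsegs):

  ig_paral(1)     = 3       ! orange partition flag
  ig_paral(2)     = nsegs   ! total number of segments
  ig_paral(2*k+1) = global offset of segment k
  ig_paral(2*k+2) = local extent of segment k

so the loop above declares two full-row segments per latitude index, latlen*2 segments in total.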

The version of NEMO is 3.6, and the box partition definition in cpl_oasis3.F90 is used unchanged:
      ! -----------------------------------------------------------------
      ! ... Define the partition
      ! -----------------------------------------------------------------

      paral(1) = 2                                              ! box partitioning
      paral(2) = jpiglo * (nldj-1+njmpp-1) + (nldi-1+nimpp-1)   ! NEMO lower left corner global offset
      paral(3) = nlei-nldi+1                                    ! local extent in i
      paral(4) = nlej-nldj+1                                    ! local extent in j
      paral(5) = jpiglo                                         ! global extent in x

      IF( ln_ctl ) THEN
         WRITE(numout,*) ' multiexchg: paral (1:5)', paral
         WRITE(numout,*) ' multiexchg: jpi, jpj =', jpi, jpj
         WRITE(numout,*) ' multiexchg: nldi, nlei, nimpp =', nldi, nlei, nimpp
         WRITE(numout,*) ' multiexchg: nldj, nlej, njmpp =', nldj, nlej, njmpp
      ENDIF

      CALL oasis_def_partition ( id_part, paral, nerror )

There seems to be no problem in the initialization process, and each model runs on its configured number of MPI tasks.
There is no problem if both models use the same number of MPI tasks, or if NEMO has more PEs.

thanks...

joon,

Posted by Anonymous at December 29 2022

Hello Sophie.

I have an additional finding.

The atm model requires the orange partition, but I tested it with a box partition.

After that, it runs successfully regardless of the MPI task count.

It seems that there is a problem in the interaction between the orange partition and the box partition in OASIS3-MCT v5.


thank you

joon,

Posted by Anonymous at December 29 2022

Hi Joon,

This is pretty weird. I guess the only thing to do at this point is to set up a toy model reproducing your problem so that we can run it and try to understand what happens.
Can you set up this toy model, i.e. two "empty" codes (from the science point of view) that define the same grids and partitions as your models, perform the same coupling exchanges, and so reproduce the problem? If you do, we can run it and try to fix the bug.
Let me know ...
  Regards,
  Sophie

Posted by Anonymous at December 30 2022

Hi, Sophie

I set up atm and ocn dummy models with the same grids and partitions as the coupled models.

After testing, I found that the same problem occurs.

I fixed the atm dummy model at 4 MPI tasks because the orange partition configuration was complicated.

The ocn dummy model is set up with a box partition, so its MPI task count can be set to any number (a sketch follows below).
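
For anyone following along, the ocean side of such a dummy component can be as small as the sketch below (the grid size, field names, coupling period, and the simple row-block decomposition are illustrative placeholders, not the actual toy sent to the developers; the oasis_def_var interface is the OASIS3-MCT 4+/5 one):

  PROGRAM toy_ocn
    USE mpi
    USE mod_oasis
    IMPLICIT NONE
    INTEGER, PARAMETER :: nx_glob = 360, ny_glob = 180  ! illustrative global grid
    INTEGER, PARAMETER :: nsteps = 4, dt = 3600         ! illustrative coupling period (s)
    INTEGER :: comp_id, local_comm, part_id, var_in, var_out
    INTEGER :: ig_paral(5), var_nodims(2), istep, rank, nprocs, ny_loc, ierr, info
    REAL(kind=8), ALLOCATABLE :: field(:,:)

    CALL oasis_init_comp(comp_id, 'toyocn', ierr)
    CALL oasis_get_localcomm(local_comm, ierr)
    CALL MPI_Comm_rank(local_comm, rank, ierr)
    CALL MPI_Comm_size(local_comm, nprocs, ierr)

    ! simple row-block box decomposition (assumes nprocs divides ny_glob)
    ny_loc = ny_glob / nprocs
    ig_paral(1) = 2                        ! box partition
    ig_paral(2) = rank * nx_glob * ny_loc  ! global offset of the lower-left corner
    ig_paral(3) = nx_glob                  ! local extent in x
    ig_paral(4) = ny_loc                   ! local extent in y
    ig_paral(5) = nx_glob                  ! global extent in x
    CALL oasis_def_partition(part_id, ig_paral, ierr)

    var_nodims = (/ 2, 1 /)                ! field rank, number of bundles
    CALL oasis_def_var(var_in,  'OCN_RECV', part_id, var_nodims, OASIS_In,  OASIS_Real, ierr)
    CALL oasis_def_var(var_out, 'OCN_SEND', part_id, var_nodims, OASIS_Out, OASIS_Real, ierr)
    CALL oasis_enddef(ierr)

    ALLOCATE(field(nx_glob, ny_loc))
    field = 0.0d0
    DO istep = 0, nsteps - 1
       CALL oasis_get(var_in,  istep*dt, field, info)  ! receive the atm field
       CALL oasis_put(var_out, istep*dt, field, info)  ! send the ocean field back
    END DO

    CALL oasis_terminate(ierr)
  END PROGRAM toy_ocn

The atmosphere twin mirrors this with its own grid and partition, issuing the put before the get as in the failing runs.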

Please let me know the email address where I can send you the code.

Regards,

joon.

Posted by Anonymous at December 30 2022

Thanks. If it is not too big, you can send the toy model to oasishelp@cerfacs.fr.
 Regards,
 Sophie

Posted by Anonymous at December 30 2022

I just sent you a mail. Thanks!

I hope it can be solved.

Regards,

Joon.