The OASIS Coupler Forum

NEMO+OASIS+WRF coupled model freezes

Posted by Anonymous at April 11 2022

I use XIOS trunk (revision 2320) with NEMO trunk (revision 15770) in coupled mode.
I also use the latest official OASIS release, oasis3-mct_4.0.

The coupled model freezes after printing:
 (oasis_unitget)        9999
 (oasis_unitget)        9999
 (oasis_unitget)        9999
 starting wrf task            0  of            1
 ---> prism_initxios.x           0

I tried setting call_oasis_enddef to false in iodef.xml, but that resulted in a segmentation fault in xios_server.exe:
 (oasis_unitget)        9999
 (oasis_unitget)        9999
 (oasis_unitget)        9999
 starting wrf task            0  of            1
---> prism_initxios.x           0
-> info : CServer : Register new Context : nemo_server
-> info : Register new Context : nemo
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
xios_server.exe    0000000000C9A82A  for__signal_handl     Unknown  Unknown
libpthread-2.17.s  00002B36B2DFA630  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00002B36B19D1071  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00002B36B19D2C47  PMPI_Iprobe           Unknown  Unknown
xios_server.exe    000000000064FD3C  Unknown               Unknown  Unknown
xios_server.exe    0000000000650690  Unknown               Unknown  Unknown
xios_server.exe    0000000000904D26  Unknown               Unknown  Unknown
xios_server.exe    0000000000653185  Unknown               Unknown  Unknown
xios_server.exe    000000000044EBB9  Unknown               Unknown  Unknown
xios_server.exe    0000000000CF7D96  Unknown               Unknown  Unknown
libc-2.17.so       00002B36B352F555  __libc_start_main     Unknown  Unknown
xios_server.exe    000000000044EACF  Unknown               Unknown  Unknown

What could be the possible reason for the model hanging?
I would appreciate any suggestions.
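
When comparing runs like the two above, it can help to print the coupling-related switches that XIOS actually reads from iodef.xml, so that each log can be tied to a known configuration. The following is only a minimal sketch: the function name show_xios_coupling_flags is hypothetical, the list of switches (using_server, using_oasis, call_oasis_enddef) is an assumption about which ones matter here, and it assumes each switch sits on a single line of the form <variable id="NAME" type="bool">VALUE</variable>.

```shell
#!/bin/sh
# Print the values of the OASIS-related switches found in an XIOS iodef.xml.
# Assumption: each switch is written on one line as
#   <variable id="NAME" type="bool">VALUE</variable>
# The function name and the list of switches are illustrative only.
show_xios_coupling_flags() {
    file=$1
    for name in using_server using_oasis call_oasis_enddef; do
        # Extract the text between the opening and closing variable tags.
        value=$(sed -n "s/.*<variable[^>]*id=\"$name\"[^>]*>[[:space:]]*\([^<[:space:]]*\)[[:space:]]*<\/variable>.*/\1/p" "$file" | head -n 1)
        # Report "unset" when the switch does not appear in the file.
        printf '%s=%s\n' "$name" "${value:-unset}"
    done
}

# Usage, from the run directory:
#   show_xios_coupling_flags iodef.xml
```

A switch reported as unset is simply absent from the file, in which case XIOS falls back to its own default for it.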

Posted by Anonymous at April 14 2022

Hi everyone, 

Sorry for not providing help, but we are in fact facing a very similar problem of freezing models here.
We use the new NEMO 4.2 version (revision 15557) with XIOS revision 2297 and oasis3-mct_4.0 on the Météo-France HPC.

The results of our NEMO-XIOS/AROME and NEMO-XIOS/toymodel tests are as follows:

- When XIOS is used as a server (using_server = true): the two models freeze somewhere after entering oasis_enddef, while XIOS is stuck somewhere after returning from oasis_get_intercomm.

- When XIOS is attached (using_server = false): the two coupled models start running, but NEMO/XIOS exits with the following error at nitend:
.... ....
In file "iccontext.cpp", function "void cxios_context_handle_create(xios::CContext **, const char *, int)",  line 54 -> Context ^@^@^@^@ [long run of ^@ (NUL) characters truncated]  unknown

terminate called after throwing an instance of 'xios::CException'
forrtl: error (76): Abort trap signal
Image              PC                Routine            Line        Source
oceanx             0000000001CCA29E  for__signal_handl     Unknown  Unknown
libpthread-2.17.s  00002B6B811F9630  Unknown               Unknown  Unknown
libc-2.17.so       00002B6B8173E387  gsignal               Unknown  Unknown
libc-2.17.so       00002B6B8173FA78  abort                 Unknown  Unknown
libstdc++.so.6     00002B6B7E8C2A58  Unknown               Unknown  Unknown
libstdc++.so.6     00002B6B7E8CF646  Unknown               Unknown  Unknown
libstdc++.so.6     00002B6B7E8CF691  Unknown               Unknown  Unknown
libstdc++.so.6     00002B6B7E8CF8C4  Unknown               Unknown  Unknown
oceanx             00000000014956D0  Unknown               Unknown  Unknown
oceanx             0000000000A3049C  Unknown               Unknown  Unknown
oceanx             00000000004C9E6C  Unknown               Unknown  Unknown
oceanx             000000000044EB47  Unknown               Unknown  Unknown
oceanx             000000000044EA40  Unknown               Unknown  Unknown
oceanx             000000000044EA0E  Unknown               Unknown  Unknown
libc-2.17.so       00002B6B8172A555  __libc_start_main     Unknown  Unknown
oceanx             000000000044E929  Unknown               Unknown  Unknown

.... .... 
The toymodel ends fine. 

When using NEMO compiled without key_xios, the coupling works fine... 

So we would also appreciate any input.

Sincerely, 
Cindy Lebeaupin Brossier

Posted by Anonymous at April 14 2022

Hi,
It is hard for us to tell why your model is hanging. However, we recently wrote a note on the joint usage of OASIS3-MCT and XIOS in climate models:
https://oasis.cerfacs.fr/wp-content/uploads/sites/114/2022/02/Joint_usage_OASIS3-MCT_XIOS_2022.pdf

Maybe you can start by reading this note and making sure you do everything right?

 With best regards,
  Sophie

Posted by Anonymous at April 26 2022

Hi again,
I have asked Yann Meurdesoif, who is an XIOS developer (as this indeed looks more like an XIOS problem than an OASIS problem).
Which MPI version are you using?
Yann says that your problem looks like the Intel MPI bug in MPI_Iprobe.
The workaround is to use the "release_mt" library:
         source $I_MPI_ROOT/intel64/bin/mpivars.sh release_mt

In recent versions, the standard MPI library has been merged with the release_mt library, but the bug fix from release_mt was not included. Tickets have been opened with Intel.
  Let me know if this helps ...
  Regards,
 Sophie

Posted by Anonymous at May 2 2022

Thank you for the suggestion.
My MPI version is:
    Intel(R) MPI Library for Linux* OS, Version 2021.4 
    Build 20210831 (id: 758087adf)
    Copyright 2003-2021, Intel Corporation.

I switched to the release_mt library, but it didn't help.

However, if NEMO is compiled without XIOS, the coupled model runs successfully.
So it seems that in my case XIOS causes the NOW (NEMO+OASIS+WRF) model to freeze.
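
Before ruling out the release_mt workaround, it may be worth confirming that the executables really resolve to that library at run time. The sketch below classifies the libmpi path reported by ldd. The function name mpi_library_kind is hypothetical, and it relies on the assumption that the configuration name (release, release_mt, debug, debug_mt) appears as a directory component of the libmpi path, which may not hold for every Intel MPI installation.

```shell
#!/bin/sh
# Read `ldd` output on stdin and report which Intel MPI library
# configuration the resolved libmpi path points to.
# Assumption: the configuration name is a directory component of the
# path, e.g. .../lib/release_mt/libmpi.so.12 (layout may differ by version).
mpi_library_kind() {
    # Take the resolved path of the first libmpi.so* entry, if any.
    path=$(awk '$1 ~ /^libmpi\.so/ && $2 == "=>" { print $3; exit }')
    case $path in
        "")             echo "not-linked" ;;
        */release_mt/*) echo "release_mt" ;;
        */debug_mt/*)   echo "debug_mt" ;;
        */release/*)    echo "release" ;;
        */debug/*)      echo "debug" ;;
        *)              echo "unknown" ;;
    esac
}

# Usage (in the job environment, after sourcing the MPI settings):
#   ldd ./xios_server.exe | mpi_library_kind
#   ldd ./oceanx          | mpi_library_kind
```

If the result is not release_mt for every executable in the coupled run, the environment seen by the job probably differs from the one in which the library was switched.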