Wrong maxloops in lib/psmile/src/mod_oasis_mpi.F90
Up to Bugs and debugs
Posted by Anonymous at August 20 2020
Hello .
=> in lib/psmile/src/mod_oasis_mpi.F90: the maxloops computation is wrong !!!
!>-----------------------------------------------
Log2 reduction of linp over tasks to root :
maxloops = int(sqrt(float(commsize+1)))+1
do m = 1,maxloops
kfac = 2**m
recvid = commrank + kfac/2 ! task to recv from if (mod(commrank,kfac) /= 0
!>........................................................
It must be calculated as a Log base 2 value, like this :
maxloops = int(log(float(commsize+1))/log(2.))+1
=> The bug is apparent for high number of core. For example for commsize=988 -> maxloops=32 -> kfac = 2**32 = 0 as is a 32bit integer . ... and mod(commrank,kfac) , give a divide by zero error
Juan ESCOBAR MésoNH , Support Team
Posted by Anonymous at September 16 2020
Thanks for pointing this out. This will be fixed soon. The original SQRT implementation is not ideal, but as long as the integer math does not overflow, the results will be identical. The original implementation should have leveraged the log base 2 calculation.
Posted by Anonymous at September 21 2020
Hi,
Sorry for jumping in, but the results are not identical.
Try calculating by hand for commsize=(4**n)-1
n commsize current escobar
1 3 3 3
2 15 5 5
3 63 9 7
4 255 17 9
5 1023 33 11
This assumes that int(sqrt(float(4**n))) is 2**n, and int(log(float(4**n))/log(2.)) is 2*n. Rounding may affect the result.
All the best, Paul Leopardi
Posted by Anonymous at September 22 2020
Sorry if I was unclear. The value of maxloops will not be identical, but the way the model behaves will be identical. Within the maxloops loop, there are checks to skip values of m that are larger than needed. So maxloops could also be set to the machine largest int (huge(i)) and the model would get the same result. So maybe what I should have written is "as long as the integer math does not overflow, the model results will be identical". I think the point is that this is a bug fix, it does correct that computation of maxloops, but it does not change answers. Thanks.
Posted by Anonymous at October 12 2020
This fix was merged onto the Oasis3-MCT master in September, 2020.
Thanks for bringing it to our attention.