ROMS not bit-reproducible: northern boundary?

General scientific issues regarding ROMS

Moderators: arango, robertson

k.alexander
Posts: 54
Joined: Tue Jun 28, 2016 2:08 pm
Location: CCRC (UNSW), ARCCSS, ACE CRC

ROMS not bit-reproducible: northern boundary?

#1 Unread post by k.alexander »

I am running MetROMS (ROMS coupled with CICE) and since updating from ROMS 3.6 to the 3.7 trunk, the model has stopped being bit-reproducible: if I run exactly the same simulation twice, it gives me different answers. Not a good sign...

I'm almost positive that I've ruled out CICE or the coupling being the issue, by writing to ocean_his every timestep (there are 6 ocean timesteps to every CICE coupling timestep). The first ocean_his output (timestep 0) is always the same. But the second one (i.e. after the first full timestep) always diverges, even though CICE coupling doesn't happen until after the sixth ROMS timestep. There is an initial coupling call but I did a test run where ROMS ignores these fields from CICE, and it still wasn't bit-reproducible. So unless something really funny is going on with memory, I think the problem is with ROMS.

This is interesting - only some fields are affected in that first "real" output (i.e. time index 2) to ocean_his. Obviously if I leave it for long enough all variables diverge because they interact. But at that first timestep, this is what happens:

Affected (i.e. different between the two runs): zeta, ubar, vbar, u, w, temp, rho, Hsbl, AKv, AKt, AKs, shflux, lwrad, bustr, bvstr
Unaffected: m (ice shelf melt rate), salt, ssflux, swrad, sustr, svstr

I looked at this further and discovered the anomalies were only happening along the northern boundary: the northernmost 3 rows for temp, 8 cells for u and v, 9 for zeta. Again, this is just for the first timestep; the anomalies propagate south later on. This is why m was unaffected even though it depends on temperature: the ice shelf cavities are so far south in the domain. For temperature it was just 2 cells affected, off the east coast of Australia; for velocity it was most of the northern boundary (it makes sense for velocity to propagate more, because there are 30 barotropic timesteps for every baroclinic one).

I thought the issue might be my northern boundary sponge because I coded it myself. So I turned it off, and it was still not bit-reproducible, but temperature was affected in the northernmost 2 rows instead of 3, u 7 instead of 8, v 9 instead of 8, zeta 7 instead of 9.

Okay, so not the sponge. I kept it turned off anyway, for now. Next I set all the northern boundary conditions to closed in my .in file. Previously they were:

Code: Select all

   
   LBC(isFsur) ==   Per     Clo     Per     Cha         ! free-surface
   LBC(isUbar) ==   Per     Clo     Per     Cla         ! 2D U-momentum
   LBC(isVbar) ==   Per     Clo     Per     Fla         ! 2D V-momentum
   LBC(isUvel) ==   Per     Clo     Per     Cla         ! 3D U-momentum
   LBC(isVvel) ==   Per     Clo     Per     RadNud      ! 3D V-momentum
   LBC(isMtke) ==   Per     Clo     Per     Rad         ! mixing TKE

   LBC(isTvar) ==   Per     Clo     Per     RadNud \    ! temperature
                    Per     Clo     Per     RadNud      ! salinity
With all of the northern boundary options (rightmost of the 4 columns) set to "Clo", suddenly temperature was affected in the northernmost 26 rows! This included those cells off Australia, as well as the 3 westernmost and 3 easternmost columns (i.e. along the periodic boundary). Also, salinity was now affected (northernmost 20 rows), whereas it wasn't before. 33 rows for u, 34 for v, 35 for zeta.

Then I tried setting all the northern boundary options to "Gra". Now it wasn't the northern boundary for temperature, but the periodic boundary: 7 columns on either side. For salinity, 2 columns on the west side and 1 on the east.

What is going on??!!

k.alexander
Posts: 54
Joined: Tue Jun 28, 2016 2:08 pm
Location: CCRC (UNSW), ARCCSS, ACE CRC

Re: ROMS not bit-reproducible: northern boundary?

#2 Unread post by k.alexander »

Possibly related issue: since I updated to 3.7 I have also occasionally had segfaults that don't reproduce themselves. That is, it will segfault, I resubmit the job without even recompiling, and then it doesn't segfault. I understand segfaults that show up inconsistently are often related to variables not being initialised. Might this also be causing the lack of bit reproducibility?

k.alexander
Posts: 54
Joined: Tue Jun 28, 2016 2:08 pm
Location: CCRC (UNSW), ARCCSS, ACE CRC

Re: ROMS not bit-reproducible: northern boundary?

#3 Unread post by k.alexander »

Something else interesting: if I turn off UV_ADV, temperature and some other variables are now unaffected (i.e. bit-reproducible at least in time index 2 of ocean_his). Here is the full list:

Affected: ubar, u, v, w, AKt, AKs
Unaffected: m, zeta, vbar, temp, salt, rho, Hsbl, AKv, shflux, ssflux, lwrad, swrad, sustr, svstr, bustr, bvstr

ubar and u now only diverge at the very northernmost row (all depths of u are affected). v is affected at exactly one cell, one away from the northern boundary, 6 layers below the surface. w is affected at 4 cells in the northernmost 2 rows. AKt and AKs are affected in lots of cells in the northernmost 4 rows.

So at least this narrows down the part(s) of the code where this originates.

k.alexander
Posts: 54
Joined: Tue Jun 28, 2016 2:08 pm
Location: CCRC (UNSW), ARCCSS, ACE CRC

Re: ROMS not bit-reproducible: northern boundary?

#4 Unread post by k.alexander »

I just switched to a single processor for ROMS (NtileI=1, NtileJ=1 instead of 32 and 16 as I usually have) and now all variables are bit reproducible.

So it is something about the parallel decomposition. Note that USE_MPI, DISTRIBUTE etc are still defined, so all the same calls are made to mp_exchange2d etc.

kate
Posts: 4091
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: ROMS not bit-reproducible: northern boundary?

#5 Unread post by kate »

On only one tile, that call to mp_exchange2d isn't going to be doing a whole lot.

It sounds like it's time to add strategic print statements (e.g. the sketch below) and run parallel vs. serial, looking for output differences. I have been known to run the debugger with 1x4 tiling vs. 4x1 tiling in two windows. The last time I went down that path I beat back everything until the answers did match, but that wasn't metroms. I guess I could try that with metroms for a few steps.

I know of a model where setting VERBOSITY = 9 makes it spill its guts, creating lots of output for comparing to later.
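
By a strategic print I mean something as simple as a per-tile checksum, for example (a sketch only; the exact array and time-level index depend on which routine you put it in):

Code: Select all

write(*,*) 'tile', tile, ' zeta checksum', SUM(zeta(Istr:Iend,Jstr:Jend,knew))

If two runs print identical checksums up to some routine and diverge after it, you've bracketed the problem.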

k.alexander
Posts: 54
Joined: Tue Jun 28, 2016 2:08 pm
Location: CCRC (UNSW), ARCCSS, ACE CRC

Re: ROMS not bit-reproducible: northern boundary?

#6 Unread post by k.alexander »

My understanding was that parallel vs. serial will usually give output differences, though? But parallel vs. parallel with the same tiling should be the same.

kate
Posts: 4091
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: ROMS not bit-reproducible: northern boundary?

#7 Unread post by kate »

No, they should *all* be the same.

kate
Posts: 4091
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: ROMS not bit-reproducible: northern boundary?

#8 Unread post by kate »

I just ran a few steps of my branch of metroms with 1x10 and 10x1 tilings. After four steps, there are no diffs for T,S or u_eastward. This is for a pan-Arctic domain. What code is this, you ask? It's a metroms branch of my roms github repo on the ROMS side and a metroms branch of my CICE repo on the other side. I was going to fork the official github CICE once they put it out there because mine should be a fork of the official code. Well, maybe I'll just have to put mine on github because Someone is taking too long.

m.hadfield
Posts: 521
Joined: Tue Jul 01, 2003 4:12 am
Location: NIWA

Re: ROMS not bit-reproducible: northern boundary?

#9 Unread post by m.hadfield »

k.alexander wrote:Possibly related issue: since I updated to 3.7 I have also occasionally had segfaults that don't reproduce themselves. That is, it will segfault, I resubmit the job without even recompiling, and then it doesn't segfault. I understand segfaults that show up inconsistently are often related to variables not being initialised. Might this also be causing the lack of bit reproducibility?
I trust you have tried running with bounds checking and the other debug checks turned on?

ROMS segfaults are often due to silly little issues: badly formed input files and/or incomplete checking by the routines that read them.

k.alexander
Posts: 54
Joined: Tue Jun 28, 2016 2:08 pm
Location: CCRC (UNSW), ARCCSS, ACE CRC

Re: ROMS not bit-reproducible: northern boundary?

#10 Unread post by k.alexander »

So this is interesting: some parallel tilings are okay and others are not.

The following have exactly the same output as 1x1:
2x1
1x2
2x2
16x1
1x16
8x2
24x1
30x1
31x1
8x4
12x4
15x4
16x4
16x8

The following don't:
16x16
16x2
32x1
32x16 (this is what I usually run with)

I was confused as to why 16x4 and 16x8 were fine and 16x2 wasn't. So I ran 16x2 again and it was fine this time. And then a third time and it was not. And then 16x4 and 16x8 again and they were still fine. If it matters, I have Lm=1440 and Mm=428.

I think it must be something that is more likely to be triggered as the number of processors increases. If it has to do with an uninitialised variable, it makes sense that it's inconsistent.

When I run with -check all (and 32x16 processors), first of all I get tons of these warnings during initialisation:

Code: Select all

forrtl: warning (402): fort: (1): In call to LOAD_L, an array temporary was created for argument #4
but no indication of what file they come from. Then during the first timestep, the code dies with this error:

Code: Select all

forrtl: severe (408): fort: (3): Subscript #2 of the array T has value -1 which is less than the lower bound of 0
The traceback points to line 77 in pre_step3d.f90, which is the start of this call:

Code: Select all

      CALL pre_step3d_tile (ng, tile,                                   &
     &                      LBi, UBi, LBj, UBj,                         &
     &                      IminS, ImaxS, JminS, JmaxS,                 &
     &                      nrhs(ng), nstp(ng), nnew(ng),               &
     &                      GRID(ng) % rmask,                           &
     &                      GRID(ng) % umask,                           &
     &                      GRID(ng) % vmask,                           &
     &                      GRID(ng) % pm,                              &
     &                      GRID(ng) % pn,                              &
     &                      GRID(ng) % Hz,                              &
     &                      GRID(ng) % Huon,                            &
     &                      GRID(ng) % Hvom,                            &
     &                      GRID(ng) % z_r,                             &
     &                      GRID(ng) % z_w,                             &
     &                      FORCES(ng) % btflx,                         &
     &                      FORCES(ng) % bustr,                         &
     &                      FORCES(ng) % bvstr,                         &
     &                      FORCES(ng) % stflx,                         &
     &                      FORCES(ng) % sustr,                         &
     &                      FORCES(ng) % svstr,                         &
     &                      MIXING(ng) % Akt,                           &
     &                      MIXING(ng) % Akv,                           &
     &                      MIXING(ng) % ghats,                         &
     &                      OCEAN(ng) % W,                              &
     &                      OCEAN(ng) % ru,                             &
     &                      OCEAN(ng) % rv,                             &
     &                      OCEAN(ng) % t,                              &
     &                      OCEAN(ng) % u,                              &
     &                      OCEAN(ng) % v)
I think it's complaining about OCEAN(ng) % t, which has dimensions (LBi:UBi,LBj:UBj,N(ng),3,NT(ng)) as stated in mod_ocean. So subscript 2 runs over LBj:UBj, and it's saying the subscript has value -1, below the lower bound of 0. But this doesn't make sense, because in get_bounds LBj should be Jmin, which is either 0 or 1 depending on the grid type, and my grid is not north-south periodic. Also, shouldn't OCEAN(ng)%t have been accessed before the first call to pre_step3d?
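
For reference, this is just the generic -check bounds runtime error; a trivial standalone analogue (nothing to do with ROMS itself) would be:

Code: Select all

! Toy example: compiled with ifort -O0 -check bounds, this aborts with
! "Subscript #2 of the array T has value -1 which is less than the
! lower bound of 0" -- the same class of error as above.
program bounds_demo
  implicit none
  real(8) :: t(0:4,0:4)
  integer :: j
  t = 0.0d0
  j = -1            ! out-of-range second subscript
  t(1,j) = 1.0d0    ! the bounds check fires here at run time
  write(*,*) t(1,0)
end program bounds_demo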

k.alexander
Posts: 54
Joined: Tue Jun 28, 2016 2:08 pm
Location: CCRC (UNSW), ARCCSS, ACE CRC

Re: ROMS not bit-reproducible: northern boundary?

#11 Unread post by k.alexander »

Update: this bounds error also happens when running with a single processor. I'm wondering if it's a red herring.

kate
Posts: 4091
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: ROMS not bit-reproducible: northern boundary?

#12 Unread post by kate »

Maybe we need to ask which compiler? Is there another compiler you can try?

k.alexander
Posts: 54
Joined: Tue Jun 28, 2016 2:08 pm
Location: CCRC (UNSW), ARCCSS, ACE CRC

Re: ROMS not bit-reproducible: northern boundary?

#13 Unread post by k.alexander »

I'm using the mpif90 compiler from openmpi 1.8.4, which is a wrapper for ifort, using intel-fc version 12.1.9.293. However I used the same compiler for ROMS 3.6 which was bit-reproducible even for 32x16 tiling. It's possible that the intel-fc version has changed on the HPC cluster since then, but the sysadmins are usually really good about testing those things.

My compiler flags are:

Code: Select all

-r8 -i4 -align all -w -ftz -convert big_endian -assume byterecl -no-vec -xHost -g -traceback -O3
again, the same as for ROMS 3.6. And yes I have tried -O0, and it doesn't fix the problem.
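
For the -check all runs above I swap the optimisation flags for a debug set along these lines (just what I use; I'm not claiming this exact set is canonical):

Code: Select all

-r8 -i4 -align all -ftz -convert big_endian -assume byterecl -g -traceback -O0 -check all -fpe0 -ftrapuv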

k.alexander
Posts: 54
Joined: Tue Jun 28, 2016 2:08 pm
Location: CCRC (UNSW), ARCCSS, ACE CRC

Re: ROMS not bit-reproducible: northern boundary?

#14 Unread post by k.alexander »

Just FYI, I will have no internet access for the next week (off to see the Great Barrier Reef while it's still there!) so there won't be any updates from me here.

kate
Posts: 4091
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: ROMS not bit-reproducible: northern boundary?

#15 Unread post by kate »

I'm also using ifort as my default compiler for ROMS, but I have the option of trying gfortran as well. Do you?

k.alexander
Posts: 54
Joined: Tue Jun 28, 2016 2:08 pm
Location: CCRC (UNSW), ARCCSS, ACE CRC

Re: ROMS not bit-reproducible: northern boundary?

#16 Unread post by k.alexander »

I've never tried it; I expect it would involve a lot of messing with compiler flags, and gfortran runs more slowly than ifort. I don't think the compiler is the issue anyway.

Serega.Osipov
Posts: 5
Joined: Thu Oct 01, 2015 4:41 am
Location: KAUST

Re: ROMS not bit-reproducible: northern boundary?

#17 Unread post by Serega.Osipov »

I've seen the same behavior with ROMS svn 797, i.e. sensitivity to the number of tiles. We also know that identical setups are not exactly reproducible on old and new hardware.
As you've mentioned, serial runs are bit-reproducible, while DM (distributed-memory parallelism) runs are not. The way I think about the DM approach's sensitivity to the decomposition is the following. Consider a 1D array being averaged across multiple cores (say a velocity profile). A different summation order for the elements will produce a slightly different answer due to truncation error. This error is small and present at each x and y. Take the inverse of a matrix, which is often approximated with iterative methods, and now you have a much larger error. Keep integrating forward in time and this error will propagate. So the end result is equivalent to an epsilon perturbation of the initial conditions. I'm not saying this is exactly what happens in ROMS; it's just the general idea.
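
A toy single-core illustration of the summation-order point (nothing ROMS-specific, just plain Fortran):

Code: Select all

! The same million numbers summed forward and backward give slightly
! different double-precision results, because floating-point addition
! is not associative.
program sum_order
  implicit none
  integer :: i
  real(8), allocatable :: x(:)
  real(8) :: fwd, bwd
  allocate(x(1000000))
  do i = 1, size(x)
    x(i) = 1.0d0/real(i,8)   ! values spanning many orders of magnitude
  end do
  fwd = 0.0d0
  do i = 1, size(x)          ! sum largest-to-smallest
    fwd = fwd + x(i)
  end do
  bwd = 0.0d0
  do i = size(x), 1, -1      ! sum smallest-to-largest
    bwd = bwd + x(i)
  end do
  write(*,*) 'forward  sum = ', fwd
  write(*,*) 'backward sum = ', bwd
  write(*,*) 'difference   = ', fwd - bwd
end program sum_order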

ezaron
Posts: 16
Joined: Mon Oct 26, 2009 3:06 am
Location: Oregon State University

Re: ROMS not bit-reproducible: northern boundary?

#18 Unread post by ezaron »

There is some literature on maintaining bit-reproducible simulations under MPI, e.g.,
http://www.sciencedirect.com/science/ar ... 9114000507

While the lack of bit-level reproducibility is annoying, hopefully the divergence of solutions is not physically significant.

k.alexander
Posts: 54
Joined: Tue Jun 28, 2016 2:08 pm
Location: CCRC (UNSW), ARCCSS, ACE CRC

Re: ROMS not bit-reproducible: northern boundary?

#19 Unread post by k.alexander »

Thanks for the replies - interesting to see that others are having the same issues. The explanation of how truncation errors propagate makes sense; however, I don't think it applies here, because ROMS 3.6 was bit-reproducible for me (and the current version is bit-reproducible for certain parallel tilings but not others). Also, the initial conditions are always read the same way, because the first time index in ocean_his (before a full timestep has passed) is always the same; it diverges at index 2.

k.alexander
Posts: 54
Joined: Tue Jun 28, 2016 2:08 pm
Location: CCRC (UNSW), ARCCSS, ACE CRC

Re: ROMS not bit-reproducible: northern boundary?

#20 Unread post by k.alexander »

Just to confirm, I have svn 820. I took a look at the updates since then (the trunk is now at svn 839) and I don't see anything relevant, mainly just 4DVar stuff that I have turned off anyway. I posted a bug report now that we know it's not just me (i.e. Serega.Osipov has had the same problems).

k.alexander
Posts: 54
Joined: Tue Jun 28, 2016 2:08 pm
Location: CCRC (UNSW), ARCCSS, ACE CRC

Re: ROMS not bit-reproducible: northern boundary?

#21 Unread post by k.alexander »

I asked the advice of one of my colleagues who is a software engineer working with different ocean models, and he had this to say:
Those results are somewhat reassuring, if you can get any parallel run
to match the 1x1 results.

It is possible that your collectives are not reproducible, like
MPI_Reduce (used for sums, max, min, etc). They use binary trees over
the MPI ranks, like this:

sum = s1 + s2

s1 = s11 + s12
s2 = s21 + s22

s11 = rank 1 + rank 2
s12 = rank 3 + rank 4
s21 = rank 5 + rank 6
s22 = rank 7 + rank 8

and it may be that most of your layouts are "normal" and fit evenly over
the tree, but that the others have a slightly different layout.

It's possible that ROMS changed some of its collectives, and didn't
check this out. There are tricks to making these reproducible, e.g.
doing it manually with point-to-point. You also might be able to
control this within the MPI library.
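
As I understand the point-to-point trick he means, it is something like this (a sketch only, not actual ROMS code):

Code: Select all

! Two ways to compute a global sum. MPI_Allreduce may add the ranks'
! contributions in an implementation-defined (tree) order, so the result
! can change with the layout; gathering the partial sums and adding them
! in fixed rank order is reproducible for any decomposition.
program fixed_order_sum
  use mpi
  implicit none
  integer :: ierr, rank, nprocs, i
  real(8) :: partial, tree_sum, ordered_sum
  real(8), allocatable :: partials(:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  partial = 1.0d0/(3.0d0 + rank)   ! this rank's contribution

  ! Library reduction: summation order is up to the MPI implementation.
  call MPI_Allreduce(partial, tree_sum, 1, MPI_DOUBLE_PRECISION, &
                     MPI_SUM, MPI_COMM_WORLD, ierr)

  ! Reproducible alternative: gather all partial sums, then add them
  ! in fixed rank order on every rank.
  allocate(partials(nprocs))
  call MPI_Allgather(partial, 1, MPI_DOUBLE_PRECISION, &
                     partials, 1, MPI_DOUBLE_PRECISION, &
                     MPI_COMM_WORLD, ierr)
  ordered_sum = 0.0d0
  do i = 1, nprocs
    ordered_sum = ordered_sum + partials(i)
  end do

  if (rank == 0) write(*,*) tree_sum, ordered_sum
  call MPI_Finalize(ierr)
end program fixed_order_sum
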
Does this sound likely to any of the ROMS developers here?

kate
Posts: 4091
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: ROMS not bit-reproducible: northern boundary?

#22 Unread post by kate »

Doing gitk on distribute.F, the last real change was ticket 653 in 2014.

Ticket 584 changed calls to mpi_alltoall to mpi_allgather (2013).

There was ticket 554 in 2012 adding aggregate functions.

k.alexander
Posts: 54
Joined: Tue Jun 28, 2016 2:08 pm
Location: CCRC (UNSW), ARCCSS, ACE CRC

Re: ROMS not bit-reproducible: northern boundary?

#23 Unread post by k.alexander »

For the ROMS 3.6 distribution that was bit-reproducible for me, distribute.F is stamped with

Code: Select all

!svn $Id: distribute.F 645 2013-01-22 23:21:54Z arango $
and yes, it uses mpi_alltoall rather than mpi_allgather.

kate
Posts: 4091
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: ROMS not bit-reproducible: northern boundary?

#24 Unread post by kate »

All three of those tickets have something to do with moving to the nested code: 584, for instance.

k.alexander
Posts: 54
Joined: Tue Jun 28, 2016 2:08 pm
Location: CCRC (UNSW), ARCCSS, ACE CRC

Re: ROMS not bit-reproducible: northern boundary?

#25 Unread post by k.alexander »

Hmmm, very interesting. That ticket says "We get identical nesting solutions in serial with partitions, shared-memory (OpenMP), and distributed-memory (MPI) for tile partitions of 1x1, 2x2, and 3x3." So some tile partitions have been tested, but not an exhaustive list. It looks like even when nesting is off (as it is for my configuration), the kernel does not reduce to the way it was in 3.6.

kate
Posts: 4091
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: ROMS not bit-reproducible: northern boundary?

#26 Unread post by kate »

Well, can you simply go back to the old distribute.F?

k.alexander
Posts: 54
Joined: Tue Jun 28, 2016 2:08 pm
Location: CCRC (UNSW), ARCCSS, ACE CRC

Re: ROMS not bit-reproducible: northern boundary?

#27 Unread post by k.alexander »

I will try that tomorrow and let you know what happens. Hopefully I can just revert that file without having everything else break.

k.alexander
Posts: 54
Joined: Tue Jun 28, 2016 2:08 pm
Location: CCRC (UNSW), ARCCSS, ACE CRC

Re: ROMS not bit-reproducible: northern boundary?

#28 Unread post by k.alexander »

Turns out you can just switch to the old version of distribute.F and nothing breaks! Unfortunately it doesn't fix the problem: it's still not bit-reproducible.

k.alexander
Posts: 54
Joined: Tue Jun 28, 2016 2:08 pm
Location: CCRC (UNSW), ARCCSS, ACE CRC

Re: ROMS not bit-reproducible: northern boundary?

#29 Unread post by k.alexander »

I finally got around to setting up standalone ROMS with no modifications. Same grid, forcing, boundary conditions as before, but no sea ice, no ice shelves, none of the other miscellaneous things I've messed with. It is the most recent version of the code as of today.

As before, ROMS is bit-reproducible for serial simulations (1x1) but not for certain parallel simulations (32x16, I didn't bother trying anything else).

So basically I have just confirmed that the problem is in the ROMS source code itself, and not in any of the MetROMS modifications or my own modifications.

kate
Posts: 4091
Joined: Wed Jul 02, 2003 5:29 pm
Location: CFOS/UAF, USA

Re: ROMS not bit-reproducible: northern boundary?

#30 Unread post by kate »

With that, you should be able to do "git bisect" to find out when things changed. You can get a git version of the history from "git svn clone" (which will take a while to download the whole history), or do the same operations manually from an svn repo (which will be slow at every step).
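
The mechanics are roughly as follows (the revision markers here are placeholders for whatever you know to be good and bad):

Code: Select all

git svn clone <roms-svn-url> roms-git   # one-time conversion; slow
cd roms-git
git bisect start
git bisect bad                        # HEAD: known not reproducible
git bisect good <good-revision>       # e.g. the last revision that matched 1x1
# build, run the twin-run comparison, then mark each step with
#   git bisect good    or    git bisect bad
# until git reports the first bad commit
git bisect reset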

k.alexander
Posts: 54
Joined: Tue Jun 28, 2016 2:08 pm
Location: CCRC (UNSW), ARCCSS, ACE CRC

Re: ROMS not bit-reproducible: northern boundary?

#31 Unread post by k.alexander »

I seem to have resolved the problem by switching from openmpi 1.8.4 to openmpi 1.10.2. I haven't tested all possible tilings, of course, but it is now bit-reproducible for 32x16. I have confirmed this twice.

Comparing 32x16 to 1x1, all the variables are the same except for swrad on one patch of the periodic boundary. This is true right from the first timestep, and strangely enough doesn't seem to impact any of the other variables. So there is still something funny going on. But as long as I keep a consistent tiling it should be okay.

Given that ROMS 3.6 was bit-reproducible using openmpi 1.8.4 (at least for me), I expect some part of the kernel update to 3.7 uses features that didn't work properly in older versions of openmpi. Or something like that.

arango
Site Admin
Posts: 1368
Joined: Wed Feb 26, 2003 4:41 pm
Location: DMCS, Rutgers University

Re: ROMS not bit-reproducible: northern boundary?

#32 Unread post by arango »

Well, it seems that there was a problem with that version of OpenMPI. I wonder what they changed. This kind of problem is difficult to track down and frustrating. Compilers and libraries are known to have bugs. One thing to check is whether the OpenMPI library was compiled with the same compiler that you are using with ROMS.

Now, exact reproduction of an application independent of the parallel partition is not possible if we have global reductions. A couple of years ago I posted the following :arrow: warning. Please check it out.

k.alexander
Posts: 54
Joined: Tue Jun 28, 2016 2:08 pm
Location: CCRC (UNSW), ARCCSS, ACE CRC

Re: ROMS not bit-reproducible: northern boundary?

#33 Unread post by k.alexander »

Thanks, Hernan, for the link to your warning; I hadn't seen it before. I'm not running with volume conservation enabled, but the same principles hold for global reductions.

Rafael_Santana
Posts: 12
Joined: Wed Jun 04, 2014 1:45 pm
Location: UFBA and REMO

Re: ROMS not bit-reproducible: northern boundary?

#34 Unread post by Rafael_Santana »

Hi all,

A free run (Exp1 - red line) was generated in Sep 2021. In March 2022, I ran Exp2 (blue line) turning on passive tracers (all needed cppdefs), and adding dye_01 to ini, bnd and river files. The experiments generated different results (fig attached - time series of total KE in cm^2/s^2).

I thought the modifications to the ini, bnd and river files caused the difference. I turned off tracers, started/forced the model with the same ini, bnd and river files from Exp2 (dye_01 included), and ran Exp3 (black line - under the green line), which generated different results.

I ran Exp4 (green line) with tracers off (cppdefs) and the ini, bnd and river files from Exp1 (without dye_01), and it reproduced Exp3 (cppdefs tracers off but river file with dye_01). So binarily different input files (log below) generated the same results (they are similar at the end).

santanarc@maui02:~/small_grid/free_j_inert_20150501$ diff frc_rivers_northeast_shelf_fixed2.nc frc_rivers_northeast_shelf_fixed2_dye.nc
Binary files frc_rivers_northeast_shelf_fixed2.nc and frc_rivers_northeast_shelf_fixed2_dye.nc differ

Exp5 (cyan line) had the same configuration and the same result as Exp2 (blue line), even though the binary files differ (log below).

santanarc@maui02:~/small_grid$ diff free_j_inert_20150501/roms_avg_0100.nc free_j_inert3_20150501/roms_avg_0100.nc
Binary files free_j_inert_20150501/roms_avg_0100.nc and free_j_inert3_20150501/roms_avg_0100.nc differ
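
(Note that plain diff reports any byte difference, including metadata such as history timestamps, so it can't tell you whether the data themselves differ. If nccmp is available on your system, something like

Code: Select all

nccmp -d free_j_inert_20150501/roms_avg_0100.nc free_j_inert3_20150501/roms_avg_0100.nc

compares only the data values.)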

Am I doing something wrong, or could the different results be attributed to the libraries used or to the different flags used to compile ROMS?

We're using Intel(R) MPI Library for Linux* OS, Version 2019 Update 5 Build 20190806 (id: 7e5a4f84c) and ROMS/TOMS version 3.7.

Thanks,
Rafa
total_ke_time_series.png

jcwarner
Posts: 1204
Joined: Wed Dec 31, 2003 6:16 pm
Location: USGS, USA

Re: ROMS not bit-reproducible: northern boundary?

#35 Unread post by jcwarner »

I had a little trouble following your logic, and it is not really clear what is causing the differences in the total KE.
Can you look at fields of temp, u, v, AKv, etc. to see if there is something noticeably different between the runs, like something happening along the open boundary? Are the rivers (with flow) the same in all runs? The passive tracer should not affect the KE, but the Qriver would.
Also, it could be the compiler flags: if you use ffast-math, for example, it will reduce the accuracy of the computations to gain computational speed.
-j

Rafael_Santana
Posts: 12
Joined: Wed Jun 04, 2014 1:45 pm
Location: UFBA and REMO

Re: ROMS not bit-reproducible: northern boundary?

#36 Unread post by Rafael_Santana »

Thanks for your reply.

I looked at the fields of temp, u, v, ssh etc. (not AKv yet) and the pattern is similar to what was seen in the KE: five simulations giving three slightly different results, with two pairs being mutually reproducible.

I didn't see strong differences at the boundaries or near the rivers. River_transport is the same at all 103 points between the files. The river and bnd files are copies apart from the addition of "dye_01" (matlab code below).

Exp2 (blue) and Exp3 (black) had the same ini and forcing files but had different results. I looked for differences in the .h flags and .in files in those experiments and the flags added were:

.h
#define DIAGNOSTICS_UV
#define T_PASSIVE
#define ANA_SPFLUX
#define ANA_BPFLUX

.in
NPT = 1
LtracerSrc == T T T ! temperature, salinity, inert

(screenshots of vimdiff attached)

% matlab code to add dye_01 to river file

path='/scale_wlg_nobackup/filesets/nobackup/niwa00020/santanarc/work/hikurangi/sim26/small_grid/free_j_inert_20150501/';
file_ini=[path,'frc_rivers_northeast_shelf_fixed2_dye.nc'];
r_temp=ncread(file_ini,'river_temp');
% creating dye with dimensions (time_fix, s_rho, river) ;
varname='river_dye_01';
nccreate(file_ini,varname,'Dimensions',{'river',size(r_temp,1),'s_rho',size(r_temp,2),'time_fix',1});
var_fin=r_temp.*0+1;
ncwrite(file_ini,varname,var_fin)
ncwriteatt(file_ini,varname,'time','time_fix');


I will keep looking for differences.

Thanks
Rafa
Attachments
vimdiff_dot_in.png
vimdiff_dot_h.png

aleeson
Posts: 6
Joined: Thu May 12, 2022 7:40 pm
Location: University of Washington

Re: ROMS not bit-reproducible: northern boundary?

#37 Unread post by aleeson »

Hi everyone,

I'm reviving this thread because I recently encountered a bit reproducibility issue, and I want to make sure this issue stays tracked.
I am currently conducting an experiment in which I vary nutrient concentrations from point sources. In theory, these changes should not alter the hydrodynamics, and I expected the velocity fields to be identical between the different runs. However, I observed some hydrodynamic deviations after just one day of run time. After some testing, I traced the issue to a difference in the number of cores and the tile partitioning used during the two runs. A summary of my key results is listed below:
  • Original run
    • Higher nutrient concentrations.
    • Run on 400 cores with 20x20 tile partitioning
  • Test case 1
    • Lower nutrient concentrations.
    • Run on 200 cores with 10x20 tile partitioning (different than original run)
    • Observed hydrodynamic deviations from original run.
  • Test case 2
    • Lower nutrient concentrations.
    • Run on 400 cores with 20x20 tile partitioning (same as original run)
    • Observed no hydrodynamic deviations from original run.
I set up all of these runs on the same supercomputer.
The figure below shows the difference in surface u of the test cases compared to the original run after one day of run time.


I am curious if anyone else has been having issues with bit reproducibility lately, and I'm eager to stay informed on any additional relevant updates.

Thanks,
Aurora
